Synonyms

Human-robot communication; Language; Symbol grounding

Definition

Voice and speech interfaces concern the design and use of algorithms and tools, based on natural language processing and machine-learning methods, for human-robot communication.

Overview

A fundamental behavioral and cognitive capability of a robot interacting with a human user is speech, since spoken language is the primary means people use to communicate with each other. However, communication between people, and between humans and robots, is not based on speech alone. Rather, communication is a rich multimodal process that combines spoken language with a variety of nonverbal behaviors such as eye gaze, hand gestures, tactile interaction, and emotional cues (Mavridis 2015; Cangelosi and Schlesinger 2015). Speech-based interfaces, complemented by multimodal communication, can contribute to a consistent and robust recognition process for the robot (and for humans) by reducing ambiguity about the sensory environment. This can improve performance over unimodal, speech-only recognition by exploiting complementary information sources (Cangelosi and Ogata 2017).

Key Research Findings

In robotics research, we can distinguish two main approaches to the design of speech interfaces and language communication capabilities: the first is based on natural language processing (NLP) techniques, and the second uses language learning methods. The NLP-based approach typically combines off-the-shelf techniques and language processing tools (e.g., ready-made lexicons and knowledge bases, parsers, automatic speech recognition, and speech synthesis software) to give the robot the capability to respond to linguistic instructions and to utter sentences expressing a request. The language learning approach, on the other hand, uses machine-learning methods (e.g., neural networks, Bayesian methods) to train the robot to acquire language skills. NLP approaches require an a priori definition of the language system and a top-down solution to link words with meanings/tasks. Learning-based approaches, in contrast, allow open-ended adaptation of the lexicon and generalization to novel sentences and meanings. This distinction should not be taken as a rigid dichotomy: hybrid NLP-learning approaches are common in practice, since some NLP-based robotic systems do use machine-learning methods (e.g., most current speech recognition systems are based on statistical learning and deep neural networks), and certain robot learning studies partially rely on off-the-shelf NLP tools.
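
The contrast can be illustrated with a minimal sketch in Python (hypothetical command names and toy training data, not taken from any of the systems cited below): in the NLP-based style the designer enumerates the word-to-behavior mapping by hand, whereas in the learning-based style the same mapping is induced from example pairs of utterances and behaviors and can, to some extent, generalize to unseen phrasings.

    # Minimal sketch contrasting the two design approaches (hypothetical data).

    # 1) NLP-based style: the designer predefines the link between words and behaviors.
    COMMAND_TABLE = {
        "pick up": "grasp_object",
        "give me": "handover_object",
        "go to": "navigate_to",
    }

    def nlp_interpret(utterance):
        """Return a predefined behavior if a known command phrase is present."""
        for phrase, behavior in COMMAND_TABLE.items():
            if phrase in utterance.lower():
                return behavior
        return None  # unknown phrasings cannot be handled without redesign

    # 2) Learning-based style: the same mapping is induced from labeled examples.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    examples = ["pick up the red box", "please grab that cup",
                "give me the hammer", "hand me the screwdriver",
                "go to the kitchen", "take me to the exit"]
    labels = ["grasp_object", "grasp_object",
              "handover_object", "handover_object",
              "navigate_to", "navigate_to"]

    model = make_pipeline(CountVectorizer(), LogisticRegression())
    model.fit(examples, labels)
    print(nlp_interpret("could you grab the blue ball"))    # None: phrase not predefined
    print(model.predict(["could you grab the blue ball"]))  # may generalize to new phrasing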

Examples of Application

NLP-Based Robot Speech Interfaces

The design of language communication skills in humanoid robots, whether for purely conversational purposes or to ask the robot to perform specific tasks, has benefited from the combination of ready-made NLP software solutions for automatic speech recognition, parsing, and dialog systems.

Conversational robots have their origins in conversational agents and virtual chatterbots (e.g., the A.L.I.C.E. and Eugene bots; Wallace 2009). This virtual chatterbot tradition has naturally led to the design of conversational robots. For example, the android robot family developed by Ishiguro and colleagues (Ishiguro 2007), including the latest project ERICA (ERato Intelligent Conversational Android), can mostly be considered conversational agents, as these robots have been used either for entertainment purposes or for robot teleoperation applications. Other humanoid robot platforms used for conversational purposes; for user guidance in buildings, museums, and transport stations; and as companions for children include Robovie (Shiomi et al. 2008), Nao (e.g., Kennedy et al. 2014), and other purpose-built robots such as the Robot Manzai (Hayashi et al. 2008). These conversational robots can typically function in a semi-autonomous or teleoperated fashion, as well as in a fully autonomous setup.

Many NLP-based speech interfaces for robots are designed not only for conversational or social companionship purposes, but with the primary aim of understanding the user’s instruction, so that the robot can respond to the request by selecting the appropriate motor behavior. Linguistic requests typically concern object manipulation tasks (e.g., “pick up red box,” “give me a hammer,” “clean the table”) and navigation scenarios (e.g., “go to the kitchen,” “take me to the exit”). Using speech interfaces for instruction understanding requires a tight coupling of the robot’s visual and motor representations with its language processing and knowledge representation methods. In NLP-based approaches, this link is commonly predefined by the designer (i.e., no autonomous or learned grounding of the robot’s words is required, and the robot can only use designer-defined “meanings”). Different methods have been proposed in the robotics literature. An example of an NLP-based approach linking language and sensorimotor representation for linguistic instruction execution is that of Hara et al. (2004), who used the HRP-2 humanoid robot to validate a Bayesian method for the detection, extraction, and association of speech events through the fusion of audio (sound localization using a microphone array) and video (tracking of the human body) information. Aloimonos, Pastra, and colleagues developed a language and action representation formalism and a semantic network tool for action knowledge representation and speech interfaces for object manipulation tasks (Pastra and Aloimonos 2012). This system, called PRAXICON, uses a goal-based representation of actions within a multimodal semantic network. It has been used in robot experiments on understanding language instructions for tool use with the iCub robot (Antunes et al. 2017) and as an action/language representation system for learning cooking actions from “watching” videos (i.e., computer vision analysis of cooking videos via a combination of object segmentation and recognition and deep-learning classification of goal-based actions); the latter system was tested on the Baxter robot using videos available on the Web (Yang et al. 2015).
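
The designer-defined grounding step that such NLP-based systems rely on can be sketched as follows (a simplified illustration with hypothetical data structures, not the interface of PRAXICON or of any of the systems cited above): a parsed instruction is matched against the robot’s current visual detections to select a target object and a predefined motor behavior.

    # Hedged sketch of designer-defined grounding of a parsed instruction
    # (hypothetical structures, not the interface of any cited system).

    PREDEFINED_BEHAVIORS = {"pick_up": "grasp", "give": "handover", "clean": "wipe"}

    def ground_instruction(parsed, detections):
        """parsed: dict such as {"verb": "pick_up", "color": "red", "object": "box"}
        detections: list of dicts from a vision module, e.g.
                    [{"label": "box", "color": "red", "position": (0.5, 0.0, 0.1)}, ...]
        Returns (behavior, target_position), or None if the request cannot be grounded."""
        behavior = PREDEFINED_BEHAVIORS.get(parsed["verb"])
        if behavior is None:
            return None
        for det in detections:
            if det["label"] == parsed.get("object") and \
               (parsed.get("color") is None or det["color"] == parsed.get("color")):
                return behavior, det["position"]
        return None  # the object mentioned in the instruction was not seen

    # Example: "pick up red box" after parsing and object detection
    parsed = {"verb": "pick_up", "color": "red", "object": "box"}
    detections = [{"label": "cup", "color": "blue", "position": (0.2, 0.3, 0.1)},
                  {"label": "box", "color": "red", "position": (0.5, 0.0, 0.1)}]
    print(ground_instruction(parsed, detections))  # ('grasp', (0.5, 0.0, 0.1))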

Language Learning Approaches

Language learning models for robot speech systems tend to follow two main approaches. The first focuses on developmental learning, i.e., modeling the gradual acquisition of language skills, coupled with other behavioral and cognitive skills, along an infant-like developmental pathway (Cangelosi and Schlesinger 2015; Cangelosi et al. 2010). The second focuses on the use of machine-learning techniques to train the robot to learn communication lexicons and their relationships to other modalities.

Developmental language learning models are based on the Developmental Robotics approach, the interdisciplinary approach to the autonomous design of behavioral and cognitive capabilities in artificial agents and robots that takes direct inspiration from the developmental principles and mechanisms observed in natural cognitive systems such as children (Cangelosi and Schlesinger 2015). As such, this approach puts a strong emphasis on constraining the robot’s cognitive and linguistic architecture and bases behavioral and learning performance on known child psychology theories, data, and developmental principles. This permits the modeling of a developmental sequence of qualitative and quantitative stages, leading to the acquisition of adult-like sensorimotor, cognitive, and linguistic skills after training.

A seminal example of a developmental robotics model of language learning concerns the acquisition and grounding of first words, as in the robotic model of the child psychology experiments by Samuelson et al. (2011) on the role of embodiment (posture) in early word learning. Morse et al. (2015) proposed a robot model replicating the original child experiments. The developmental robot model and experiments use the Epigenetic Robotics Architecture (ERA: Morse et al. 2010). The core of this architecture consists of three self-organizing maps with modified Hebbian learning between their units. The first (visual) map is driven by pre-processed visual information, implemented as an HSV representation of the color of each object in view. The second (body) map is driven by postural information (the current motor encoder values of the eyes, head, and torso of the robot). The third (word) map responds uniquely to each word encountered (pre-processed by a commercial automatic speech recognition system). The visual color map and the word map are both fully connected to the body posture map, with connection weights adjusted by a normalized positive and negative Hebbian learning rule. Units within each map are also fully connected with each other by constant inhibitory connections, mimicking the structure of connectionist Interactive Activation and Competition models. The iCub robot’s initial behavior is driven by sensitivity to movement, i.e., a motor saliency map that detects which objects or body parts move. Various experiments demonstrate that the robot can learn and ground the meaning of words in its perceptual experience, mediated by its body posture. Further extensions of this model have investigated additional developmental learning mechanisms such as mutual exclusivity (Morse and Cangelosi 2017; Twomey et al. 2016).
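
A highly simplified sketch of the map-and-association idea behind ERA is given below (toy dimensions, random prototypes, and nearest-prototype maps standing in for trained self-organizing maps; this is not the published implementation): a color map and a word map are each linked to a shared body-posture map by Hebbian weights, so that a posture repeatedly co-activated with a color and a word comes to associate them.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-ins for self-organizing maps: each "map" is a set of unit
    # prototypes, and the winning unit is the prototype closest to the input.
    def winner(prototypes, x):
        return int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))

    n_units = 25
    color_map = rng.random((n_units, 3))   # HSV-like color inputs
    body_map = rng.random((n_units, 4))    # toy eye/head/torso encoder values
    # Words are coded simply as unit indices (one unit per known word).

    # Hebbian association weights between each map and the body-posture map
    w_color_body = np.zeros((n_units, n_units))
    w_word_body = np.zeros((n_units, n_units))

    def experience(color, posture, word_id, lr=0.1):
        """Co-activate color, posture, and word units and strengthen their links."""
        c, b = winner(color_map, color), winner(body_map, posture)
        w_color_body[c, b] += lr
        w_word_body[word_id, b] += lr

    def name_from_color(color):
        """Retrieve the word most strongly linked, via the body map, to a color."""
        c = winner(color_map, color)
        b = int(np.argmax(w_color_body[c]))       # posture linked to this color
        return int(np.argmax(w_word_body[:, b]))  # word linked to that posture

    # Toy episode: a red object always appears with the same posture and word 3
    for _ in range(20):
        experience(color=np.array([0.0, 1.0, 1.0]) + 0.02 * rng.standard_normal(3),
                   posture=np.array([0.9, 0.1, 0.1, 0.1]) + 0.02 * rng.standard_normal(4),
                   word_id=3)
    print(name_from_color(np.array([0.0, 1.0, 1.0])))  # should recover word 3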

Other developmental language models have addressed the learning of both object and action labels, moving toward first examples of syntax learning. Tikhanoff et al. (2011) proposed a simulation model of the iCub robot on the development of a lexicon to understand simple commands such as “pick_up blue_ball.” They used a modular neuro-cognitive architecture based on neural networks and vision/speech recognition systems (e.g., SPHINX for speech recognition), which control the various cognitive and sensorimotor capabilities of the iCub and integrate them to learn the names of objects and actions. Other developmental robotics models have specifically looked at grammar development, e.g., modeling the emergence of semantic and syntactic compositionality for multiple-word combinations and generalization (Sugita and Tani 2005; Tuci et al. 2011). Others have directly addressed the modeling of complex syntactic mechanisms, as with the robotics experiments on Fluid Construction Grammar (Steels 2012), which permit the study of syntactic properties such as verb tense and morphology. For example, the seminal work by Sugita and Tani (2005) investigated the emergence of compositional meanings and lexica with no a priori knowledge of any lexical or formal syntactic representations. This work has led to more complex models and algorithms for robot speech systems using recurrent neural networks such as Multiple Timescale Recurrent Neural Networks (e.g., Yamashita and Tani 2008; Zhong et al. 2017).
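
A minimal sketch in the spirit of these sequence-learning models is shown below (a small gated recurrent network with a toy vocabulary; it is not the MTRNN architecture of the cited studies): the network is trained on two-word commands and then probed with a held-out verb-noun combination to test compositional generalization.

    import torch
    import torch.nn as nn

    # Toy vocabulary of verbs and nouns; each command is a (verb, noun) token pair.
    verbs, nouns = ["point", "push", "lift"], ["ball", "box", "cup"]
    vocab = {w: i for i, w in enumerate(verbs + nouns)}

    def encode(cmd):                       # command string -> tensor of token ids
        return torch.tensor([[vocab[w] for w in cmd.split()]])

    class Seq2Action(nn.Module):
        """Embeds the word sequence, reads it with a GRU, and predicts
        an action id and an object id from the final hidden state."""
        def __init__(self, n_tokens, n_actions, n_objects, hidden=32):
            super().__init__()
            self.emb = nn.Embedding(n_tokens, 16)
            self.rnn = nn.GRU(16, hidden, batch_first=True)
            self.action_head = nn.Linear(hidden, n_actions)
            self.object_head = nn.Linear(hidden, n_objects)
        def forward(self, tokens):
            _, h = self.rnn(self.emb(tokens))
            return self.action_head(h[-1]), self.object_head(h[-1])

    # Train on all verb-noun pairs except the held-out "lift cup"
    data = [(f"{v} {n}", verbs.index(v), nouns.index(n))
            for v in verbs for n in nouns if (v, n) != ("lift", "cup")]
    model = Seq2Action(len(vocab), len(verbs), len(nouns))
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(300):
        for cmd, a, o in data:
            pa, po = model(encode(cmd))
            loss = loss_fn(pa, torch.tensor([a])) + loss_fn(po, torch.tensor([o]))
            opt.zero_grad(); loss.backward(); opt.step()

    pa, po = model(encode("lift cup"))     # unseen combination of familiar words
    print(verbs[pa.argmax()], nouns[po.argmax()])  # often recovers "lift cup"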

The machine-learning approach to robot speech interfaces focuses on the multimodal integration of speech, vision, and motor knowledge, which contributes to the design of robust communication systems by reducing ambiguity about the sensory environment. Humans successfully recognize the environment in many tasks by combining sensory inputs from multiple modalities such as vision, audition, and somatic sensation. Although multimodal integration remains a difficult challenge for robots, a few models have addressed it using machine-learning methods based on Bayesian algorithms and on deep learning.

Different types of graphical models for multimodal classification have been proposed for learning-based speech interfaces. Celikkanat et al. (2014) modeled many-to-many relationships between objects and contexts in robotic tasks using latent Dirichlet allocation (LDA). Nakamura et al. proposed studies on multimodal classification using multimodal latent Dirichlet allocation (MLDA) and its extensions (Nakamura et al. 2015). They developed robotic systems that acquire visual, sound, and tactile information by handling objects: the robot grasps an object several times and shakes it to obtain sound information. By applying MLDA, the authors showed that robots were able to classify many objects into categories similar to human classification results (Nakamura et al. 2009). Araki et al. (2011) developed an online version of MLDA and conducted fully autonomous experiments on multimodal category acquisition in a home environment. Lallee and Ford Dominey (2013) proposed a multimodal convergence map based on self-organizing maps, integrating visuomotor and language modalities.
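
The underlying idea can be roughly sketched with standard LDA on synthetic counts (this uses scikit-learn’s LatentDirichletAllocation rather than the MLDA models of the cited papers): feature counts from several modalities are concatenated into one “document” per object, and the inferred topic mixtures serve as object categories.

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(1)

    # Synthetic "bag-of-features" counts per object: 10 visual, 6 auditory, and
    # 4 haptic feature bins, concatenated into a single count vector per object.
    def make_object(visual_bin, sound_bin, touch_bin, n_events=60):
        counts = np.zeros(20, dtype=int)
        for _ in range(n_events):
            m = rng.integers(3)                        # which modality fires
            if m == 0: counts[visual_bin + rng.integers(2)] += 1
            elif m == 1: counts[10 + sound_bin + rng.integers(2)] += 1
            else: counts[16 + touch_bin] += 1
        return counts

    # Two underlying object kinds (e.g., "rattling plastic" vs. "soft fabric")
    X = np.array([make_object(0, 0, 0) for _ in range(15)] +
                 [make_object(5, 3, 2) for _ in range(15)])

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    topics = lda.transform(X).argmax(axis=1)   # most probable category per object
    print(topics[:15], topics[15:])            # the two kinds should separate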

One of the most representative applications of multimodal integrated learning is Audio-Visual Speech Recognition (AVSR; e.g., Nefian et al. 2002). The basic idea of AVSR is to compensate for degraded audio input by using visual information obtained from the movement of the speaker’s lips. Most existing lip-reading experiments are still restricted to relatively simple tasks, such as isolated or connected random words, digits, or letters. Furthermore, universally accepted standards have not been established in lip-reading research.
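
The core fusion idea can be written compactly as a reliability-weighted combination of per-word scores from the two streams (a schematic illustration with made-up numbers, not the implementation of any cited AVSR system): when the audio stream is degraded, more weight is placed on the visual (lip) stream.

    import numpy as np

    def fuse_streams(audio_loglik, visual_loglik, audio_weight):
        """Weighted combination of per-word log-likelihoods from the two streams.
        audio_weight in [0, 1] reflects the estimated reliability of the audio
        (e.g., lowered when the signal-to-noise ratio drops)."""
        return audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik

    words = ["stop", "left", "right"]
    audio_loglik = np.array([-2.0, -3.5, -3.6])    # acoustic scores slightly favor "stop"
    visual_loglik = np.array([-6.0, -1.5, -5.0])   # lip reading clearly favors "left"

    for w_a in (0.9, 0.3):                         # clean vs. noisy acoustic conditions
        fused = fuse_streams(audio_loglik, visual_loglik, w_a)
        print(w_a, words[int(np.argmax(fused))])
    # With high audio weight the acoustic evidence wins ("stop"); when the audio
    # is down-weighted, lip reading flips the decision to "left".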

More recently, a few multimodal robot speech systems have been developed using deep-learning algorithms. Noda et al. (2015) proposed a speech recognition model in which two deep neural networks play complementary roles, one for noise reduction of the speech features and one for processing the visual information. Specifically, features acquired from the audio signal and from the corresponding mouth-region image were integrated. Two kinds of deep-learning methods, a deep denoising autoencoder (DDA) and a convolutional neural network (CNN), were used to extract features from the audio and visual information, respectively, and a multi-stream hidden Markov model (MSHMM) was applied to integrate the two sets of perceptual features. The authors found that the CNN features yielded higher recognition rates than visual features extracted by PCA, and that the effect of different image resolutions was not prominent. In a related study, Noda et al. (2014) proposed a multimodal temporal sequence learning framework for robot behavior, using deep neural networks both for integrated learning of multimodal time series and for feature extraction by dimensional compression. They designed a framework in which multiple deep networks act as cross-modal memory retrievers and temporal sequence predictors. Specifically, they integrated image, sound, and motor modalities with multiple deep autoencoders for object manipulation with the humanoid robot NAO, in learning experiments on six types of object manipulation generated by direct teaching.
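
A structural sketch of such a pipeline is given below (toy dimensions and untrained weights; the layer sizes are illustrative assumptions, not the architecture of Noda et al.): a denoising autoencoder provides audio features, a small CNN provides lip-image features, and the two feature vectors are concatenated for a downstream decoder, here replaced by a placeholder linear layer standing in for the MSHMM stage.

    import torch
    import torch.nn as nn

    class DenoisingAutoencoder(nn.Module):
        """Audio branch: maps noisy spectral features to a compact bottleneck code."""
        def __init__(self, n_in=40, n_code=16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(), nn.Linear(64, n_code))
            self.decoder = nn.Sequential(nn.Linear(n_code, 64), nn.ReLU(), nn.Linear(64, n_in))
        def forward(self, x_noisy):
            code = self.encoder(x_noisy)
            return self.decoder(code), code  # reconstruction for training, code as feature

    class LipCNN(nn.Module):
        """Visual branch: extracts features from a grayscale mouth-region image."""
        def __init__(self, n_feat=16):
            super().__init__()
            self.conv = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                                      nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.fc = nn.Linear(16 * 8 * 8, n_feat)  # assumes 32x32 input images
        def forward(self, img):
            return self.fc(self.conv(img).flatten(1))

    audio = torch.randn(5, 40)            # 5 frames of noisy audio features (toy data)
    lips = torch.randn(5, 1, 32, 32)      # matching mouth-region frames (toy data)

    dda, cnn = DenoisingAutoencoder(), LipCNN()
    _, audio_feat = dda(audio)
    fused = torch.cat([audio_feat, cnn(lips)], dim=1)  # frame-level multimodal features
    logits = nn.Linear(32, 10)(fused)     # placeholder for the MSHMM decoding stage
    print(logits.shape)                   # torch.Size([5, 10])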

Future Directions for Research

The latest developments in artificial intelligence and machine learning, such as deep learning and large-scale neural networks, offer significant opportunities for extending and scaling up robot speech interfaces. For example, deep-learning methods, such as convolutional neural networks, are becoming key techniques for future multimodal integration and dialogue communication systems. However, on its own, deep learning cannot address the whole problem of robot speech interfaces. Deep learning typically takes a batch, supervised learning approach, thus limiting its ability to work online. It acquires representations approximating the given input data, but it cannot define novel symbols (and meanings) about the world, as humans do through language generativity. Moreover, although deep networks can exceed human performance in some particular data-processing tasks, they have significant and critical limitations. For example, it can be extremely difficult to understand the internal mechanisms of deep networks, making it hard to identify the causes of errors. Thus, understanding robot language and deep-learning systems as mathematically complex, multidimensional dynamical systems is an important area for future work.

Another important direction for future research is a cognitive systems approach in which symbol acquisition is grounded in the robot’s own sensorimotor system (cf. the symbol grounding problem: Cangelosi 2010) and emerges from the interaction between the robot, the human user, and their environment. This requires long-term and open-ended development of the human-robot interaction and communication system. The issue of open-ended learning has been recognized as crucial in developmental and learning-based approaches to robotics. In the specific case of language capabilities, it is important for future research to build open-ended learning systems in which the robot can accumulate and bootstrap linguistic knowledge at the various levels of analysis of language (from phonetics to the lexicon, semantics, syntax, and pragmatics) to achieve human-like natural language communication capabilities.