Abstract Concept Learning in Cognitive Robots

Purpose of Review Understanding and manipulating abstract concepts is a fundamental characteristic of human intelligence that is currently missing in artificial agents. Without it, the ability of robots to interact socially with humans while performing their tasks is hindered. What, then, is needed to give our robots such a capability? In this article, we discuss some recent attempts at cognitive robot modeling of these concepts, underpinned by neurophysiological principles.
Recent Findings For advanced learning of abstract concepts, an artificial agent needs a (robotic) body, because abstract and concrete concepts are considered a continuum, and abstract concepts can be learned by linking them to concrete embodied perceptions. Pioneering studies provided valuable information about the simulation of artificial learning and demonstrated the value of the cognitive robotics approach for studying aspects of abstract cognition.
Summary There are a few successful examples of cognitive models of abstract knowledge based on connectionist and probabilistic modeling techniques. However, the modeling of abstract concept learning in robots is currently limited to narrow tasks. To make further progress, we argue that closer collaboration among multiple disciplines is required to share expertise and co-design future studies. It is particularly important to create and share benchmark datasets of human learning behavior.


Introduction
Human intelligence is characterized by the ability to create and manipulate abstract concepts like "wisdom" and "love." This ability is at the core of human creativity, and, indeed, it is required for advanced cognitive capabilities like the retrieval of past thoughts and memories, relational reasoning and problem-solving in current situations, and the processing of thoughts linked to the future (e.g., planning and design). For these reasons, abstract concepts constitute an essential part of human language, where abstract words are often used in daily conversations to represent emotions, events, and situations that occur in physical environments and social interactions among people.
Human language includes concrete concepts, such as "water" or "stone," which are linked to objects that can be objectively defined and understood. These are usually studied through a bottom-up approach that involves five major levels of analysis: phonetic, lexical, semantic, syntactic, and pragmatic. In contrast, abstract concepts like "beauty" or "freedom" do not have specific physical referents; hence, they are more ambiguous, and their meaning can vary significantly across individuals [1••]. In this article, abstract concepts are broadly defined as higher-order, or complex, thoughts that are not bound to a single piece of perceptually derived information and that do not exist at any particular time or place [2].
Even if the most intuitive definition of abstraction is opposite to that of concreteness, abstract and concrete concepts are not a dichotomy, but they are considered part of a continuum [3], in which entities can have both abstract and concrete features in different proportions ranging from highly abstract (e.g., "justice") to highly concrete (e.g., "glass"). The continuum view has gained strength in recent years, after the growing evidence in support of embodied and grounded theories of cognition. In fact, a number of proposals have argued that abstract concepts can be grounded in the sensorimotor system like concrete concepts (see review by Pexman [4•]), characterized by a continuum from unembodied (fully symbolic) to strongly embodied [5]. A fundamental assumption of this view is that abstract concepts can be linked to embodied perceptions and learned through a process of progressive abstraction [6].
The embodied theories for the development of abstract thinking and reasoning constitute the theoretical resource for the design of artificial agents capable of abstract and symbolic processing, which is required for higher cognitive functions such as natural language understanding. This is one of the current challenges for the fast-growing field of cognitive robotics where future robots are expected to take on tasks once thought too complex or delicate to automate, especially in the fields of social care, companionship, therapy, domestic assistance, entertainment, and education [7][8][9].
This short review aims to stimulate new investigation in cognitive robotics and artificial intelligence toward the creation of smarter robots that are capable of understanding and manipulating abstract concepts and words, thus overcoming the current limitations in human-robot communication using natural language, which is the most intuitive form of user interface [10]. To this end, we present pioneering work on cognitive robotics models of abstract words implemented using a grounding transfer mechanism. This short review is based on [11], which provides an extensive survey and analysis of the multidisciplinary contributions to the field.
Abstract concepts can be categorized into different domains, and each can be acquired using a combination of different strategies. As an example, we will present a "direct grounding" strategy for the embodied learning of numerical concepts that combines gestures and actions with words, such as the use of finger counting representations to support teaching a child (or a robot) about numbers. Numbers are a special domain of abstract concepts that constitute the building blocks of mathematics, a language of the human mind that can express the fundamental workings of the physical world and make the universe intelligible. Finally, we will discuss the current limitations and give our conclusions and future directions for abstract cognition and robotics research.

Cognitive Robotics Models of Abstract Concepts
Cognitive models that enable robots to learn new words and concepts typically adopt an embodied and grounded approach. Cangelosi and Riga [12] have introduced the "direct grounding" approaches for developing language models in robots and presented applications of this strategy in learning more concrete words, i.e., where the robot learns the names of objects it can perceive or words for actions it is performing or observing. For instance, robots can simulate the early stages of language development via the interaction of infants with caregivers, which is reviewed in [13].
The abstract/concrete continuum view of concepts suggests that the learning of higher-order, more abstract words may be obtained by extending the strategies and models for the grounding of concrete words. In the "grounding transfer" strategy, new concepts and words are learned by the robot in successive stages, by combining words whose meanings have been previously acquired through direct grounding. For example, a robot can discover the meaning of "centaur" if instructed to merge the previously acquired grounded meanings of "man" and "horse" and then to transfer the result to the new word, without ever seeing such a fantastic animal. In the "direct grounding" strategy, the robot learns abstract concepts by associating words with gestures and actions, for example, the use of finger counting to teach a child (or a robot) to count. In the following section, we first review some examples of the "grounding transfer" strategy; then the next section presents cognitive robotics models of number cognition that are examples of the "direct grounding" strategy.
The last section presents a relatively recent idea, which suggests that abstract meaning is grounded through emotions [14]. The argument is that emotional experience should be considered a primary source of the embodied information that supports the development of abstract thinking and reasoning. Indeed, word representations form a continuum that goes from the sensorimotor experience that strongly characterizes concrete word representations to the emotional experience that dominates representations of abstract words [15].
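The "centaur" example can be illustrated with a minimal sketch. It assumes, purely for illustration, that each directly grounded word is stored as a small dictionary of perceptual features; the feature names, the `grounded` lexicon, and the `transfer_grounding` helper are hypothetical and are not taken from any of the models reviewed here.

```python
# Hypothetical grounded lexicon: each word maps to perceptual features that
# the robot is assumed to have acquired through direct grounding.
grounded = {
    "man": {"human_torso": 1.0, "arms": 1.0, "speaks": 1.0},
    "horse": {"four_legs": 1.0, "tail": 1.0, "gallops": 1.0},
}


def transfer_grounding(new_word, parts, lexicon):
    """Ground new_word by merging the features of already grounded words."""
    merged = {}
    for word in parts:
        for feature, value in lexicon[word].items():
            # Keep the strongest evidence seen for each feature.
            merged[feature] = max(merged.get(feature, 0.0), value)
    lexicon[new_word] = merged
    return merged


# "centaur" is never perceived; its meaning is transferred from known words.
centaur = transfer_grounding("centaur", ["man", "horse"], grounded)
print(sorted(centaur))
```

A feature union is only one possible merging rule; the neural models discussed below learn the combination rather than applying a fixed rule.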

Cognitive Robotics Models that Learn Higher Abstract Concepts via Grounding Transfer
The abstract/concrete continuum view of concepts suggests that the learning of higher-order, more abstract words may be obtained by extending the strategies and models for the grounding of concrete words. However, in the scientific literature, there are only very few examples that explore such an extension. Recently, Cangelosi and Stramandinoli [16••] offered a review of two main strategies for grounding concepts without the sensorimotor experience of direct physical referents.
Paul et al. [17] studied models that allow a robot to "understand" spatial language instructions by efficiently grounding them in the context of its world representation. The central contribution of this work is an improvement in computational efficiency rather than a model of human learning behavior. The efficiency is achieved via adaptive probabilistic models that form a Markov boundary between abstract variables and concrete groundings, effectively decorrelating them from the remaining variables in the graph. The architecture includes two stages; in the second stage, a second model utilizes the abstractions of the first stage but infers a coarse symbolic structure from the word and the environment model and then performs fine-grained inference over the reduced graphical model, further improving the efficiency of inference. Empirical evaluation of the proposed system with a fixed and a mobile robot demonstrated accurate grounding of abstract concepts embedded in complex natural language instructions.
In the connectionist domain, recurrent neural networks (RNNs) are particularly suitable structures for modeling abstract concept learning behavior because the recurrent links allow the network to manage the sequence of progressive abstraction. Historically, two main types of RNN were proposed: the Jordan type, with a recursion from the output to the input [18], and the Elman type, with recursion on the hidden layer [19].
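The difference between the two recurrence schemes can be sketched in a few lines. This is a simplified illustration under stated assumptions: the layer sizes, the weight values, and the function names are arbitrary choices made here, and the Jordan variant omits any self-connection on the context units.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def tanh_vec(v):
    return [math.tanh(x) for x in v]


def elman_step(x, context, W_in, W_ctx, W_out):
    """Elman type: the context is the previous hidden state."""
    h = tanh_vec(add(matvec(W_in, x), matvec(W_ctx, context)))
    y = tanh_vec(matvec(W_out, h))
    return y, h  # the hidden state becomes the next context


def jordan_step(x, context, W_in, W_ctx, W_out):
    """Jordan type: the context is the previous output."""
    h = tanh_vec(add(matvec(W_in, x), matvec(W_ctx, context)))
    y = tanh_vec(matvec(W_out, h))
    return y, y  # the output becomes the next context


def run(step, sequence, context, W_in, W_ctx, W_out):
    outputs = []
    for x in sequence:
        y, context = step(x, context, W_in, W_ctx, W_out)
        outputs.append(y)
    return outputs


# Arbitrary illustrative weights: 2 inputs, 3 hidden units, 1 output.
W_in = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_hh = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]  # hidden -> hidden
W_yh = [[0.2], [-0.1], [0.3]]                                # output -> hidden
W_out = [[0.7, -0.4, 0.5]]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
elman_outputs = run(elman_step, sequence, [0.0] * 3, W_in, W_hh, W_out)
jordan_outputs = run(jordan_step, sequence, [0.0], W_in, W_yh, W_out)
```

With a zero initial context the two networks agree on the first input and diverge afterwards, because each carries a different signal forward in time.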
Following the "grounding transfer" view, Stramandinoli, Marocco, and Cangelosi [20,21] investigated the problem of grounding intermediate abstract concepts, i.e., higher-order actions that can be obtained by combining concrete motor concepts. Stramandinoli et al. [20] performed experiments on a cognitive model for the humanoid robot iCub based on an RNN of the Elman type, which permits the learning of higher-order concepts based on temporal sequences of action primitives and word sentences. The training of the model is incremental. The mechanism includes two stages: (i) the basic grounding (BG) and (ii) the higher grounding (HG) transfer mechanisms. During the BG, the robot learns a set of action primitives (e.g., "PULL," "PUSH," or "GRASP") using embodied and situated strategies. Two different stages were implemented for the HG training to enable different levels of combination between basic and complex actions. In the first HG stage (i.e., HG-1), a sequence of previously learned words (e.g., "RECEIVE [is] PUSH [and] GRASP [and] PULL") is provided to guide the hierarchical organization of the basic concepts to learn novel concepts (e.g., "GIVE"). During the second HG stage (i.e., HG-2), the robot learns three new higher-order words ("accept," "reject," "keep") consisting of the combination of basic action primitives and higher-order words acquired during the previous HG-1 stage (e.g., "KEEP [is] PICK [and] NEUTRAL"). HG-2 adds a further hierarchical combination of words from both concrete concepts (BG) and the first level of abstraction words (HG-1). This training methodology is extremely flexible: novel words can be freely added to the known vocabulary of the robot, and the word-meaning associations can be completely rearranged.
In a follow-up work, Stramandinoli et al. [21] proposed a partially recurrent neural network (Jordan type) for learning the relationships between motor primitives and objects and performed experiments on the iCub robot to investigate the grounding of more abstract action words, such as "utilize" or "create." In this case, the grounding of abstract action words is achieved through the integration of the linguistic, perceptual, and motor input modalities, recorded from the iCub sensors, into a three-layer RNN model (Fig. 1). The iCub robot first develops some basic perceptual and motor skills, such as "PUSH," "PULL," and "LIFT," necessary for initiating the physical interaction with the environment, and then it can use such knowledge to ground language. The training of the model is incremental and consists of three stages. (i) Pre-linguistic training: the robot is trained to recognize a set of objects (e.g., BRUSH, KNIFE, HAMMER) and learn object-related action primitives (e.g., PAINT, HIT, CUT) by combining low-level motor primitives. (ii) Linguistic-perceptual training: this is the first stage of language acquisition. The model is trained to associate labels with the corresponding objects and actions (two-word sentences consisting of a verb followed by a noun, e.g., SCRUB [with] BRUSH); these words are directly grounded in perception and motor experience. (iii) Linguistic abstract training: abstract action words (i.e., UTILIZE, MAKE) are grounded by combining and recalling the perceptual and motor knowledge previously linked to basic words (i.e., during the previous linguistic-perceptual training). To derive the meaning of abstract action words, the robot, guided by linguistic instructions (e.g., "UTILIZE a KNIFE"), organizes the knowledge directly grounded in perception and motor experience. This phase of the training represents the abstract stage of language acquisition, when new concepts are formed by combining the meaning of terms acquired during the previous stages of the training.
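The incremental BG/HG organization can be pictured with a purely symbolic stand-in. The sketch below is not the Elman network of [20]; it only shows how higher-order words defined from earlier words unfold into sequences of action primitives. The `teach` and `expand` helpers are hypothetical, and the definition given for "ACCEPT" is an invented example, not the one used in the study.

```python
# Basic grounding (BG): action primitives assumed to be already learned.
PRIMITIVES = {"PUSH", "PULL", "GRASP"}

# Higher grounding (HG): each new word is defined from earlier words.
definitions = {}


def teach(word, parts):
    """Add a higher-order word, refusing parts that are not yet known."""
    unknown = [p for p in parts if p not in PRIMITIVES and p not in definitions]
    if unknown:
        raise ValueError(f"cannot ground {word}: unknown words {unknown}")
    definitions[word] = list(parts)


def expand(word):
    """Recursively rewrite a word as a sequence of action primitives."""
    if word in PRIMITIVES:
        return [word]
    sequence = []
    for part in definitions[word]:
        sequence.extend(expand(part))
    return sequence


# HG-1 style definition quoted in the text above.
teach("RECEIVE", ["PUSH", "GRASP", "PULL"])
# HG-2 style definition mixing an HG-1 word and a primitive (hypothetical).
teach("ACCEPT", ["RECEIVE", "GRASP"])

print(expand("ACCEPT"))
```

Because `teach` rejects words whose parts are unknown, the vocabulary can only grow in the staged order described above, which mirrors the incremental nature of the training.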

Cognitive Robotics Models of Numerical Concepts: Development and Representation
This section concisely reviews some of the major computational models that were created to simulate the development of numerical cognition in artificial cognitive systems and robots. A more detailed review of the topic can be found in [22••].
Recently, Di Nuovo et al. conducted several experiments [23][24][25][26][27][28][29] with the iCub humanoid robot to explore whether the association of finger counting with number words and/or visual digits could serve to bootstrap numerical cognition in a cognitive robot. The models are summarized in Fig. 2. They were created from three RNNs of the Elman type, which were trained separately and then merged to learn the classification of the three inputs: finger counting (motor), digit recognition (visual), and number words (auditory), i.e., the triple-code model [30]. The model also mimics the two-hemisphere organization of the brain. Results of the various robotic experiments show that learning finger sequencing together with the number word sequences speeds up the building of the neural network's internal links, resulting in a qualitatively better understanding (higher likelihood of the correct classification) of the real number representations. Cluster analysis with an optimal strategy confirmed that internal representations of the finger configurations can be an ideal basis for building an embodied representation of digits in the robot.
Further investigation focused on increasing biological adherence of the models with deep learning approaches, which are inspired by the complex layered organization and the functioning of the cerebral cortex [31]. Indeed, follow-up studies [26,27] presented an extended simulation that incorporated the neural link between visual and motor areas observed in several neuroscientific studies. Particularly, Di Nuovo [27] investigated the long short-term memory architecture [32] for learning to perform addition with the support of the robot's finger counting. Interestingly, the model showed similarities with studies with humans (children and adults) by performing an unusual number of split-five errors, which can be linked to the five finger representations [33].
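To make the notion of a split-five error concrete, the sketch below labels the answers of an addition model. It assumes that a split-five error is an incorrect answer that deviates from the correct sum by exactly five; this working definition and the `classify_addition_error` helper are assumptions made for illustration, and the sample answers are invented rather than taken from the robot experiments.

```python
from collections import Counter


def classify_addition_error(a, b, answer):
    """Label an answer to a + b as correct, split-five, or other error."""
    correct = a + b
    if answer == correct:
        return "correct"
    if abs(answer - correct) == 5:
        # Off by one full hand, e.g. answering 12 or 2 for 3 + 4.
        return "split-five"
    return "other"


# Invented (a, b, answer) triples, standing in for model outputs.
responses = [(3, 4, 7), (3, 4, 12), (2, 6, 3), (5, 1, 7), (4, 4, 8)]
counts = Counter(classify_addition_error(a, b, ans) for a, b, ans in responses)
print(dict(counts))
```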
Di Nuovo and McClelland [28••] investigated the perceptual process of recognizing spoken digits in deep, convolutional neural networks embodied in the iCub robot. Simulation results showed that the robot's fingers boost the performance by setting up the network and augmenting the training examples when these were numerically limited. This is a common scenario in robotics, where robots will likely learn from a small amount of data. The embodied representation (fingers encoder values) was compared to other representations, showing that fingers can represent the real counterpart of that artificial representation and they can maximize the learning performance. Results are associated with some behavior also observed in several human studies in developmental psychology and neuroimaging. Overall, the hand-based representation provided our artificial system information about magnitude representations that improved the creation of a more uniform number line, as seen in children [34].
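One way to see why a hand-based code can carry magnitude information is to compare it with a one-hot code. The sketch below is an idealization: it assumes a thermometer-style code in which the first n of ten fingers are raised, whereas the robot studies used finger encoder values from the iCub. The function names are hypothetical.

```python
def one_hot(n, size=10):
    """One-hot code for the numbers 1..size: a single active unit."""
    return [1 if i == n - 1 else 0 for i in range(size)]


def finger_code(n, size=10):
    """Idealized finger code: the first n of ten fingers are raised."""
    return [1 if i < n else 0 for i in range(size)]


def hamming(u, v):
    """Number of positions in which two binary codes differ."""
    return sum(a != b for a, b in zip(u, v))


# Distances between the codes for 2 and 7.
print(hamming(finger_code(2), finger_code(7)))  # grows with |2 - 7|
print(hamming(one_hot(2), one_hot(7)))          # same for any two distinct numbers
```

In the idealized finger code the distance between two numbers equals their numerical difference, while every pair of distinct one-hot codes is equally far apart, so only the former gives a learner an ordered, evenly spaced scale.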

Cognitive Robotics Models of Emotions
The idea that robots may have emotions has captured the imagination of many researchers in the field of artificial intelligence who have identified the crucial importance of emotions in the design of more intelligent and sociable robots [35].
Although there is general agreement that the next generation of cognitive architectures must integrate emotion and cognition to build realistic models of human-machine interaction, in practice, the computational modeling of emotion has often been underrated in cognitive architecture research. Instead, computational modeling of emotion is frequently considered later, with the addition of an emotion module that can influence some of the components of the general cognitive architecture (for a review, see [36]). Models account for emotion as well as some other aspects of cognition, but usually they do not aim to be comprehensive architectures (for a review, see [37]).
Pessoa [38] identified two main categories of applications for emotion models in robotics: (i) as an add-on to general architectures to provide robots with urgency to action and decisions and (ii) to aid the understanding of emotion in humans, or to generate human-like expressions.
Fig. 1 A partially recurrent neural network model for language abstraction
An example of the first category can be found in eMODUL, a perceptual system of the emotion-cognition interaction specifically designed for robotics in [39]. The eMODUL system is situated in its physical and social environment, and its components constantly appraise events from the body and the world with a particular interest in emotionally relevant stimuli that affect other computational/cognitive processes (e.g., allocation of resources, organization of behavior). The system continuously receives and integrates emotionally modulated signals into the information processing flow for higher-order processing. Thereby, the system's sensations and actions are emotionally biased. In terms of the system's autonomy, emotional modulations have an impact on the allocation of cognitive/computational resources and on the organization of behavior with regard to the system's interaction and task/goal demands. The authors provide two experimental examples of the application of the eMODUL system with artificial neural networks, in which emotional modulation consists of increasing or decreasing the synaptic efficacy of targeted populations of neurons involved in these processes. The first experiment is in the context of a survival problem, where the hunger modulation makes the robots more determined to access the resource and feed. The second is a visual search task, designed similarly to the common experimental paradigm in psychology, where the emotional (frustration or boredom) modulation of attention increases the robot's performance and fosters exploratory behavior to avoid deadlocks.
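The idea of emotional modulation as a change in synaptic efficacy can be sketched with a single artificial neuron. This is a minimal illustration, not the eMODUL implementation: the multiplicative gain `1 + modulation`, the sigmoid unit, the weights, and the "hunger" values are all assumptions chosen here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def modulated_activation(inputs, weights, modulation):
    """Neuron output with synaptic efficacy scaled by an emotional signal.

    modulation > 0 increases efficacy, modulation < 0 decreases it,
    and modulation == 0 leaves the neuron unchanged.
    """
    gain = 1.0 + modulation
    drive = sum(gain * w * x for w, x in zip(weights, inputs))
    return sigmoid(drive)


# Hypothetical food-related stimulus feeding an approach-behavior neuron.
food_cue = [1.0, 0.5]
weights = [0.8, 0.4]

sated = modulated_activation(food_cue, weights, modulation=-0.5)
neutral = modulated_activation(food_cue, weights, modulation=0.0)
hungry = modulated_activation(food_cue, weights, modulation=1.0)
print(sated, neutral, hungry)
```

The same stimulus produces a stronger approach response when the hunger signal is high, which is the qualitative effect described for the survival experiment.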
As an example of the second category, Prescott et al. [40] included emotional signals in a neuroscience-inspired multimodal computational architecture for the autobiographical memory system, named the Mental Time Travel Model (MTTM), implemented to control the iCub robot. The MTTM allows for retrieving past events, including the related emotional associations, and their projection into an imagined future by using the same system. This architecture proved useful for the social capabilities of robots by enabling recognition of face, voice (including emotion), action, and touch gesture when interacting with humans. The authors propose using this system for abstract reasoning like imagining future events, simulating and visualizing actions, as well as planning actions before actual execution. This work is still at an early stage; however, experiments show that deploying emotionally mediated memory models into a brain-inspired control architecture for the iCub robot has enhanced the robot's capability for recognizing social actors and actions.

Conclusions
All the studies presented provided valuable information about the simulation of artificial learning and demonstrated the value of the embodied approach to cognitive robotics when studying crucial aspects of cognition like learning abstract concepts. However, most neuro-psychological contributions recognize that an extension beyond a purely grounded approach is needed to fully account for the representation of abstract concepts. Another open issue is that, so far, proposals for grounding abstract concepts are yet to be tested in studies with children. It will be important to investigate whether children's early acquisition of abstract concepts is grounded through metaphors, language co-occurrence, and/or emotion [4]. To this end, developmental robotics modeling can provide a powerful tool to collect preliminary information to evaluate or compare existing theories and to make novel experimental predictions that can be tested on humans [41•]. Such models could provide computational evidence in the debate on language development between "nativists" and "empiricists" by modeling the alternative theories and analyzing the resulting robot behavior in comparison to children's behavior.
Still, the cognitive robotics models proposed so far are relatively naïve because they focus on simulating only a single aspect, are verified with dummy tasks in simplified scenarios, and provide limited evidence of their ability to generalize to alternative, realistic settings. They have considered only concepts (e.g., metaphorical concepts such as "to grasp an idea") that have been empirically investigated in humans and have already been found to be grounded in action and perception systems. Thus, we are yet to see whether these conclusions extend to other kinds of abstract concepts such as "politics" or "metaphysics." This is also the case for emotion modeling, which has predominantly been studied in terms of replicating human social behavior, while very little has been done to improve robots' abstract thinking. Significant improvements in the complexity of the models and, moreover, of the test scenarios are needed before cognitive robotics modeling can be considered a reliable tool in education, neuroscience, and psychology research.
This lack of realism can be attributed to the limitations of the current robotic platforms, but it is also partially due to the unavailability of raw data from children's experiments. Indeed, there are no open "benchmark" databases for cognitive robotics, unlike the typical open data practice in machine learning. Robotic modelers can use only the post-processed data and statistical analyses for designing and validating models.
Further multidisciplinary research is required to gather data from children and to gain a better understanding of the underlying processes and strategies for abstract thinking and reasoning. It seems likely that there are developmental differences in the acquisition of the different types of concepts; therefore, hybrid models that combine sensory-motor experience and language appear to be a viable option that should be investigated. In this respect, cognitive robotics can both contribute to the theoretical development of abstract concept acquisition and use in humans, i.e., by providing a simulated environment for testing hypotheses, and benefit from the discoveries to create innovative models of human-like learning and social interaction.
To advance the knowledge in this interdisciplinary field, we remark that closer collaboration among researchers of the multiple disciplines involved is required to share expertise and co-design joint studies. Importantly, artificial simulations and real experiments need to be well matched, so that the data from robots' and children's tasks are comparable. Crucially, open databases should be made available to help the machine learning community engage with this field and replicate the success obtained in other fields of application, like speech recognition, computer vision, and autonomous driving.
Funding The work of Alessandro Di Nuovo was supported by the EPSRC grant EP/P030033/1 (NUMBERS). The work of Angelo Cangelosi was in part supported by the H2020 projects eLADDA (grant agreement 857897) and STRoNA (grant agreement 794425) and US AFOSR-EOARD project THRIVE++ (Award no. FA9550-19-1-7002).

Compliance with Ethical Standards
Conflict of Interest The authors declare that they have no conflict of interest.
Human and Animal Rights and Informed Consent This article does not contain any studies with human or animal subjects performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.