1 Introduction

Robots operating in human environments must adaptively and sequentially acquire new place categories and unknown words related to various places, as well as a map of the environment (Kostavelis and Gasteratos 2015). It is desirable for robots to acquire place categories and vocabulary autonomously from their own experience, because manually designing spatial knowledge in advance is difficult. Related research in semantic mapping and place categorization (Pronobis and Jensfelt 2012; Kostavelis and Gasteratos 2015; Sünderhauf et al. 2016; Landsiedel et al. 2017; Rangel et al. 2018) has attracted considerable interest in recent years. However, most conventional approaches are limited in that the robot cannot learn unknown words and unknown place categories without a pre-set vocabulary and categories. In addition, Simultaneous Localization And Mapping (SLAM) (Thrun et al. 2005) and the estimation of place semantics have typically been addressed as separate modules. In our proposed approach, by contrast, the robot automatically and simultaneously performs place categorization and environment mapping, and it can learn unknown words without prior knowledge. Our previously proposed unsupervised Bayesian probabilistic model integrates multimodal place categorization, lexical acquisition, and SLAM. In particular, this paper focuses on the problems of estimation accuracy and computational scalability in online learning.

We define a spatial concept as a place category that the robot autonomously learns from multimodal perceptual information, which includes the names of places, features of scene images, and position distributions. We define a position distribution as the spatial extent representing a place in the environment. Our study of spatial concept formation and lexical acquisition also constitutes a constructive approach to the human developmental process and to symbol emergence in cognitive developmental systems (Cangelosi and Schlesinger 2015; Taniguchi et al. 2018b). Thus, we assume that the robot has not acquired any vocabulary in advance and can recognize only phonemes or syllables. In addition, the robot has no prior knowledge of the current environment. We study a scenario in which the user teaches the robot the name of a place via spoken utterances while they move together through the environment. An overview of the online learning task is shown in Fig. 1. The robot and the user move around the environment. When they arrive at a place the user wishes to teach, the user speaks a sentence about that place to the robot. The robot recognizes the speech, including unknown words, and segments it into words. The robot then obtains its current estimated position, the scene image, and the speech signal, and from these acquires spatial knowledge about the environment, such as the relationships between words and places.

Fig. 1

Overview of the scenario for online learning in this study. We assume a scenario in which the user teaches the robot the name of a place using spoken utterances while they move together through the environment. The robot learns spatial concepts, language models, and maps while sequentially correcting mistakes from previous learning through its interaction with the user and the environment, as shown from the bottom left to the bottom right

In online learning, also called sequential or incremental learning, increasing scalability without reducing accuracy is especially important, but difficult to achieve on mobile robots. Online learning has the advantage of running in real time: parameters are estimated sequentially as each new datum arrives, so the robot can adapt to new data immediately. Batch learning, by contrast, requires collecting a large amount of data and iterating over it during training. With online learning, previously acquired knowledge can be used immediately for reasoning and for tasks such as language communication. Taniguchi et al. (2017) focused on mathematically deriving and constructing an appropriate online learning algorithm based on machine learning theory. In that previous work, we proposed SpCoSLAM, which integrates nonparametric Bayesian multimodal categorization, Bayesian filter-based SLAM, speech recognition, and word segmentation from the standpoint of unsupervised machine learning. However, this algorithm (Taniguchi et al. 2017) was less accurate than batch learning in categorization and word segmentation, because sufficient statistical information was not available in the early stages of learning. In addition, speech recognition and unsupervised word segmentation were not fully online; batch learning was used as an approximation, so the computational complexity of these processes grew with the amount of training data. To enable online learning over long-term human–robot interactions with limited computational resources, two core problems need to be solved: (i) the increase in computational cost as data accumulates, and (ii) the decrease in estimation accuracy relative to batch learning. The framework of online learning is regarded as important in intelligent robotics, and online learning that solves these problems is essential for robots that gain knowledge while moving through the real world.

Here, we describe improved and scalable algorithms that solve the above-mentioned problems. The improved algorithm mainly addresses misrecognition (misclassification) and word segmentation errors in online learning; the scalable algorithm mainly addresses the growth in computation time. We introduce fixed-lag rejuvenation, an approach we consider particularly effective for these problems. For online lexical acquisition, the two algorithms take complementary approaches. The improved algorithm addresses under-segmentation, in which the phoneme sequence is insufficiently segmented, by changing how the language model is updated so that the word sequence is re-segmented. The scalable algorithm operates in a pseudo-online manner by applying fixed-lag rejuvenation to speech recognition and word segmentation.

One advantage of the proposed online learning algorithm is that spatial concepts the robot has learned incorrectly can be corrected sequentially, which was not previously possible. Moreover, with the proposed algorithm, the robot can flexibly adapt to changes in the environment and in the names of places. The lower part of Fig. 1 shows the progress of online learning. In the lower left of Fig. 1, clustered places and words are estimated incorrectly, as shown by the elongated purple and blue ellipses. In the lower right of Fig. 1, by contrast, a more accurate estimate is achieved by correcting errors as learning progresses. This is realized by revisiting previous estimation results whenever new data is obtained.

The main contributions of this paper are as follows:

  • We propose an improved and scalable online learning algorithm with several novel techniques such as fixed-lag rejuvenation.

  • The improved online algorithm achieves place categorization and lexical acquisition accuracy comparable to that of batch learning.

  • The scalable online algorithm achieves faster learning than the original algorithm by reducing the order of computational complexity.

The remainder of this paper is organized as follows. In Sect. 2, we discuss related work on the formation of spatial concepts and online learning that is relevant to our study. In Sect. 3, we present an overview of the model, along with the formulation and the original online learning algorithm, SpCoSLAM. In Sect. 4, we present our proposed algorithms for improved and scalable online learning. In Sect. 5, we discuss the effectiveness of the proposed algorithms in a real environment. In Sect. 6, we evaluate the performance of place categorization and lexical acquisition in various virtual home environments. Section 7 concludes the paper.

2 Related work

2.1 Spatial concept formation

Taguchi et al. (2011) proposed an unsupervised method for simultaneously categorizing self-positions and phoneme sequences from user speech without any prior language model. Taniguchi et al. (2016, 2018a) proposed the nonparametric Bayesian Spatial Concept Acquisition method (SpCoA), which uses the unsupervised word segmentation method latticelm (Neubig et al. 2012), and SpCoA++, which achieves highly accurate lexical acquisition by updating the language model. Gu et al. (2016) proposed a method to learn relative spatial concepts, i.e., words related to distance and direction, from the positional relationship between an utterer and objects. Isobe et al. (2017) proposed a learning method to derive the relationship between objects and places using image features obtained by a Convolutional Neural Network (CNN) (Krizhevsky et al. 2012). Hagiwara et al. (2018) implemented a hierarchical clustering method for forming hierarchical place concepts. However, none of the above methods can sequentially learn spatial concepts in unknown environments without a map, because they rely on batch-learning algorithms. Therefore, in previous work we developed an online algorithm, SpCoSLAM (Taniguchi et al. 2017), that sequentially learns a map, a lexicon, and spatial concepts by integrating positions, speech signals, and scene images. In Taniguchi et al. (2017), however, the accuracy was inferior to that of SpCoA. In this paper, we also compare our proposal with the latest batch learning method, SpCoA++. Because SpCoA++ achieves nearly correct lexical acquisition, comparable accuracy should be attainable even in online lexical acquisition if the above problems are overcome by appropriately devising the learning algorithm.

Our approach is relevant to research integrating semantic mapping with natural language processing (Walter et al. 2013; Hemachandra et al. 2014). Walter et al. (2013) developed an algorithm that can learn semantic graphs to integrate semantic representation into metric maps from natural language descriptions of aspects such as labels and spatial relationships. Hemachandra et al. (2014) proposed a mechanism to more effectively ground natural language descriptions by integrating scene appearance observations using camera images and laser data. In these studies, a word list, place labels, and the number of category types were known in advance. However, it is challenging to sequentially acquire new words and categories efficiently from a situation in which the lists of words and categories are not provided in advance. Our study includes lexical acquisition for unknown words and formation of new categories from speech signals using spatial information.

Ball et al. (2013) implemented a biologically inspired mapping system, RatSLAM, modeled on pose cells in the rodent hippocampus. In addition, robots called Lingodroids, built on RatSLAM, could acquire a lexicon related to places through robot-to-robot communication (Heath et al. 2016). These studies reported that robots created their own vocabulary. Ueda et al. (2016) proposed a brain-inspired method, a Particle Filter on Episode (PFoE), for agent decision making. PFoE can estimate the agent's internal state based on previous events recalled at that time; all previous data is accumulated to construct the state space. We believe that PFoE is unsuitable for long-term trials because the state space becomes enormous. By contrast, our approach forms concepts from episodes using computational resources more economically, because clustering reduces the state space. Although our proposed method was not originally inspired by biology or brain science, such research is highly suggestive. SpCoSLAM is an integrated model of self-localization, mapping, concept formation, and lexical acquisition; from the viewpoint of the brain, it may be regarded as a model imitating some functions of the hippocampus and cerebral cortex. If we regard the training data, i.e., the robot's experiences based on a user's utterances, as episodic memory, and spatial concepts as semantic memory, the proposed algorithm can be interpreted as representing the process of sequentially forming concepts by extracting meaning from short-term episodic memory. These matters are not discussed further in this paper, although they remain important for future research.

Fig. 2

Left: graphical model representation of SpCoSLAM (Taniguchi et al. 2017). Gray nodes indicate observed variables, and white nodes indicate unobserved latent variables. Right: description of the random variables in SpCoSLAM

2.2 Improvement of online learning based on particle filters in unsupervised Bayesian models

As an approach involving Bayesian models similar to ours, there are related studies on object concepts. In particular, Araki et al. (2012b) proposed online Multimodal Latent Dirichlet Allocation (oMLDA) to acquire object concepts in an online manner, and combined it with the Nested Pitman–Yor Language Model (NPYLM), making it possible to perform lexical acquisition of unknown words sequentially. The NPYLM is an unsupervised morphological analysis method based on a statistical model that enables word segmentation purely from phoneme sequences (Mochihashi et al. 2009). Aoki et al. (2016) constructed an algorithm that can infer an approximately globally optimal solution by representing everything as a single integrated model. In addition, Nishihara et al. (2017) reduced phoneme recognition errors by applying PFoMDLA, which performs inference with a particle filter, in place of oMLDA. In these studies, online learning was realized as an unsupervised machine learning algorithm. Spatial concepts require more real-time processing than object concepts, because the robot learns them while moving through the environment; the mobile robot should not halt its movement for computation. Therefore, a more efficient and scalable algorithm is required.

Canini et al. (2009) improved the accuracy of an online algorithm based on a particle filter using the rejuvenation technique, which resamples randomly selected samples of previous observation data from a conditional probability distribution, similarly to Gibbs sampling. For a completely random choice, the robot needs to memorize all previous data. Rejuvenation mitigates the problem of particle degeneracy in particle filters. In this study, we introduce rejuvenation into our SpCoSLAM online learning algorithm and resample only from recent data; we therefore expect it to improve estimation accuracy efficiently.

As another particle filter approach, Börschinger and Johnson (2011) proposed an online algorithm based on a Bayesian model for word segmentation. In addition, Börschinger and Johnson (2012) presented an incremental learning algorithm that introduces rejuvenation into a particle filter, improving word segmentation accuracy. These studies, however, assumed input sequences free of phoneme recognition errors. In this study, by contrast, the online word segmentation task is particularly challenging because the speech recognition results contain phoneme recognition errors.

3 SpCoSLAM: Online learning for spatial concepts and lexical acquisition with mapping

3.1 Overview

SpCoSLAM has the advantage that spatial concept formation, lexical acquisition, and SLAM can be performed simultaneously by an integrated model. Figure 2 shows the graphical model of SpCoSLAM and lists its variables. The details of the generative process represented by the graphical model are described in Taniguchi et al. (2017). The method sequentially learns spatial concepts in unknown environments without pre-existing maps. It also learns many-to-many correspondences between places and words via spatial concepts, and uses multimodal information to mutually compensate for the uncertainty in each modality. Furthermore, the method estimates an appropriate number of spatial concept and position distribution clusters from the data by using the online Chinese Restaurant Process (CRP) (Aldous 1985), a constructive method for the Dirichlet Process (DP). In addition, lexical acquisition including unknown words is possible by sequentially updating the language model.
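As a concrete illustration of how the CRP lets the number of clusters grow with the data, the table-assignment rule can be sketched in a few lines of Python. This is a generic CRP sketch, not part of SpCoSLAM itself; `crp_assign` and the toy loop are our own illustrative names.

```python
import random

def crp_assign(counts, alpha):
    """Sample a cluster index under the Chinese Restaurant Process prior.

    counts: customer counts per existing table (cluster).
    alpha:  concentration parameter; larger values favor new clusters.
    Returns an index into counts, or len(counts) for a new cluster.
    """
    n = sum(counts)
    # Join table k with probability counts[k] / (n + alpha);
    # open a new table with probability alpha / (n + alpha).
    weights = counts + [alpha]
    r = random.uniform(0.0, n + alpha)
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r <= acc:
            return k
    return len(counts)  # numerical edge case: new table

# Example: sequentially assign 100 data points to clusters.
random.seed(0)
counts = []
for _ in range(100):
    k = crp_assign(counts, alpha=1.0)
    if k == len(counts):
        counts.append(1)   # open a new cluster
    else:
        counts[k] += 1
```

Because the number of tables is unbounded a priori, the number of estimated clusters adapts to the amount and diversity of the observed data, which is the property the online CRP provides in SpCoSLAM.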

The procedure of SpCoSLAM for each step is described as follows. (a) The robot obtains Weighted Finite-State Transducer (WFST) speech recognition results from the user’s speech signals using a language model. (b) The robot obtains the likelihood of self-localization by performing FastSLAM. (c) The robot segments the WFST speech recognition results using an unsupervised word segmentation approach called latticelm (Neubig et al. 2012). (d) The robot obtains the latent variables of spatial concepts by sampling. (e) The robot obtains the marginal likelihood of the observed data as the importance weight. (f) The robot updates the environmental map. (g) The robot estimates the set of model parameters of the spatial concepts from the observed data and the sampled variables. (h) The robot updates the language model of the maximum weight for the next step. (i) The particles are resampled according to their weights. Steps (b)–(g) are performed for each particle.
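Steps (a)–(i) above can be summarized in Python-style pseudocode. Every function name below is a placeholder for the corresponding component (WFST speech recognition, the FastSLAM proposal, latticelm, and so on), not a real API.

```python
# Pseudocode for one SpCoSLAM update step; all callees are placeholders.
def spcoslam_step(particles, u_t, z_t, y_t, f_t, lm_prev):
    lattice = recognize_speech_wfst(y_t, lm_prev)            # (a)
    for p in particles:                                      # (b)-(g) per particle
        p.x_t = sample_pose(p, u_t, z_t)                     # (b) FastSLAM proposal
        S = segment_words_latticelm(lattice)                 # (c)
        p.i_t, p.C_t = sample_latents(p, S, f_t)             # (d)
        p.weight *= marginal_likelihood(p, z_t, f_t, S)      # (e)
        p.map = update_map(p.map, p.x_t, z_t)                # (f)
        p.theta = estimate_spatial_concepts(p)               # (g)
    best = max(particles, key=lambda p: p.weight)
    lm_next = update_language_model(best)                    # (h)
    return resample(particles), lm_next                      # (i)
```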

3.2 Formulation of the online learning algorithm

Our previously proposed online learning algorithm, SpCoSLAM, introduces sequential update equations for estimating the parameters of spatial concepts into the formulation of a Rao-Blackwellized Particle Filter (RBPF) (Doucet et al. 2000), as used in landmark-based FastSLAM 2.0 (Montemerlo et al. 2003) and in the grid-based SLAM of Grisetti et al. (2007), which follows FastSLAM 2.0 in a similar manner. A particle filter has the advantage that parallel processing can easily be applied, because each particle can be computed independently.

In the formulation of SpCoSLAM, the joint posterior distribution can be factorized to the probability distributions of a language model LM, a map m, the set of model parameters of spatial concepts \(\varTheta = \{ {\mathbf {W}}, {\varvec{\mu }}, \varvec{\varSigma }, {\theta }, {\phi }, \pi \}\), the joint distribution of the self-position trajectory \(x_{0:t}\), and the set of latent variables \(\mathbf {C}_{1:t} = \{i_{1:t},C_{1:t},S_{1:t} \}\). We describe the joint posterior distribution of SpCoSLAM as follows:

$$\begin{aligned}&p(x_{0:t},\mathbf {C}_{1:t}, LM, \varTheta , m \mid u_{1:t}, z_{1:t}, y_{1:t}, f_{1:t}, AM, \mathbf {h}) \nonumber \\&\quad =p(LM \mid S_{1:t}, \lambda ) p(\varTheta \mid x_{0:t}, \mathbf {C}_{1:t}, f_{1:t}, \mathbf {h})p(m \mid x_{0:t}, z_{1:t}) \nonumber \\&\qquad \cdot ~\underbrace{p(x_{0:t},\mathbf {C}_{1:t} \mid u_{1:t}, z_{1:t}, y_{1:t}, f_{1:t}, AM, \mathbf {h})}_\mathrm{Particle~filter} \end{aligned}$$
(1)

where the set of hyperparameters is denoted by \(\mathbf {h}= \{ \alpha ,\beta ,\gamma ,\chi ,\lambda , m_{0},\kappa _{0}, V_{0},\nu _{0} \}\). It is noteworthy that the speech signal \(y_{t}\) is not observed at every time step; the speech signal acts as a trigger for place categorization. When \(y_{t}\) is not observed, the proposed method is equivalent to FastSLAM 2.0.

3.2.1 Particle filter algorithm

The particle filter algorithm uses Sampling Importance Resampling (SIR). The importance weight is denoted by \(\omega _{t}^{[r]}={P_{t}^{[r]}}/{Q_{t}^{[r]}}\) for each particle, where r is the particle index. The target distribution is \(P_{t}^{[r]}\), and the proposal distribution is \(Q_{t}^{[r]}\). The number of particles is R. The following equations are also calculated for each particle r; however, the subscripts representing the particle index are omitted.
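As a sketch of the SIR step, the following Python snippet resamples particles in proportion to their importance weights \(\omega_{t}^{[r]}\) using low-variance systematic resampling, one common SIR variant. The function and variable names are ours, and the paper does not prescribe this specific resampling scheme.

```python
import random

def resample_sir(particles, weights):
    """Sampling Importance Resampling (SIR): draw R new particles in
    proportion to their importance weights, then reset weights to uniform.
    Uses systematic resampling, which is low-variance and O(R)."""
    R = len(particles)
    total = sum(weights)
    probs = [w / total for w in weights]      # normalize the weights
    step = 1.0 / R
    u = random.uniform(0.0, step)             # single random offset
    new_particles, acc, k = [], probs[0], 0
    for _ in range(R):
        while u > acc and k < R - 1:          # advance to the particle
            k += 1                            # whose cumulative weight
            acc += probs[k]                   # covers the pointer u
        new_particles.append(particles[k])
        u += step
    return new_particles, [1.0 / R] * R

random.seed(0)
parts, ws = ["p0", "p1", "p2", "p3"], [0.1, 0.2, 0.3, 0.4]
new_parts, new_ws = resample_sir(parts, ws)
```

Particles with large weights are duplicated and particles with negligible weights are dropped, which is the mechanism by which the filter concentrates on high-likelihood hypotheses.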

We apply two modifications related to the weighting of the original SpCoSLAM algorithm (Taniguchi et al. 2017): (i) additional weight for \(i_{t}\), \(C_{t}\), and \(x_{t}\) (AW), and (ii) weight for selecting a language model LM (WS). These modifications are more theoretically reasonable than the original SpCoSLAM model, and our proposed SpCoSLAM 2.0 online learning algorithm is extended on their basis.

We describe the target distribution \(P_{t}\) that modified the derivation of Taniguchi et al. (2017) as follows:

$$\begin{aligned} P_{t}= & {} p(x_{0:t},\mathbf {C}_{1:t} \mid u_{1:t}, z_{1:t}, y_{1:t}, f_{1:t}, AM, \mathbf {h}) \nonumber \\\approx & {} p(i_{t},C_{t} \mid x_{0:t}, i_{1:t-1},C_{1:t-1},S_{1:t},f_{1:t},\mathbf {h}) \nonumber \\&\cdot ~p(z_{t} \mid x_{t}, m_{t-1})p(f_{t} \mid C_{1:t-1}, f_{1:t-1}, \mathbf {h}) \nonumber \\&\cdot ~p(x_{t} \mid x_{t-1}, u_{t})p(S_{t} \mid S_{1:t-1},y_{1:t},AM,\lambda )\nonumber \\&\cdot ~\underbrace{p( x_{t} \mid x_{0:t-1}, i_{1:t-1},{C}_{1:t-1}, \mathbf {h})}_{\mathrm{Additional~part}} \nonumber \\&\cdot ~\frac{p(S_{t} \mid S_{1:t-1}, C_{1:t-1}, \alpha ,\beta )}{p(S_{t} \mid S_{1:t-1}, \beta )} \cdot P_{t-1}, \end{aligned}$$
(2)

where the term \(p( x_{t} \mid x_{0:t-1}, i_{1:t-1},{C}_{1:t-1}, \mathbf {h})\) is the additional part compared to the original equation.

Here, the target distribution of the particle filter is the marginal joint posterior distribution of the self-positions \(x_{0:t}\) and the set of latent variables \(\mathbf {C}_{1:t}\), because the algorithm is based on the same RBPF technique as FastSLAM. The latent variables, which are local parameters, are estimated by the particle filter, whereas the probability distributions of the global parameters LM, \(\varTheta \), and m are calculated and held independently for each particle.

We describe the proposal distribution \(Q_{t}\) as follows:

$$\begin{aligned} Q_{t}= & {} q(x_{0:t},\mathbf {C}_{1:t} \mid u_{1:t}, z_{1:t}, y_{1:t}, f_{1:t}, AM, \mathbf {h}) \nonumber \\= & {} p(x_{t} \mid x_{t-1},z_{t},m_{t-1},u_{t}) \nonumber \\&\cdot ~p(i_{t},C_{t} \mid x_{0:t}, i_{1:t-1},C_{1:t-1},S_{1:t},f_{1:t},\mathbf {h}) \nonumber \\&\cdot ~p(S_{t} \mid S_{1:t-1},y_{1:t},AM,\lambda ) \cdot Q_{t-1}. \end{aligned}$$
(3)

Then, \(p(x_{t} \mid x_{t-1},z_{t},m_{t-1},u_{t})\) is equivalent to the proposal distribution of FastSLAM 2.0. The probability distribution of \(i_{t}\) and \(C_{t}\) is the marginal distribution pertaining to the set of model parameters \(\varTheta \). This distribution can be calculated using a formula equivalent to collapsed Gibbs sampling. The details are described in Taniguchi et al. (2017).

3.2.2 Sampling of words using speech recognition and word segmentation

We approximate the probability distribution of \(S_{t}\) in (3) as speech recognition with the language model \(LM_{t-1}\) and unsupervised word segmentation using the WFST speech recognition results with latticelm (Neubig et al. 2012) as follows:

$$\begin{aligned}&p(S_{t} \mid S_{1:t-1},y_{1:t},AM,\lambda ) \nonumber \\&\quad \approx \mathrm{latticelm}(S_{1:t} \mid \mathcal{L}_{1:t},\lambda )\mathrm{SR}(\mathcal{L}_{1:t} \mid y_{1:t},AM,LM_{t-1})\nonumber \\ \end{aligned}$$
(4)

where \(\mathrm{SR}()\) denotes the speech recognition function, and \(\mathcal{L}_{1:t}\) denotes the speech recognition results in WFST format, i.e., word graphs representing the recognition hypotheses. In the original formulation, only \(S_{t}\) should be obtained by sampling. However, latticelm is a tool originally designed for batch learning, and unsupervised word segmentation requires statistical information extracted from the observed data. Therefore, resampling must use all data from steps 1 to t, rather than only the distribution at time-step t.

3.2.3 Additional weight for \(i_{t}\), \(C_{t}\), and \(x_{t}\) (AW)

Finally, the importance weight \(\omega _{t}\) modified from Taniguchi et al. (2017) is represented as follows:

$$\begin{aligned} \omega _{t}\approx & {} \sum _{i_{t}=k} \Bigl [ p(x_{t} \mid x_{0:t-1}, i_{1:t-1}, i_{t}=k, \mathbf {h}) \nonumber \\&~\underbrace{\qquad \cdot ~\sum _{C_{t}=l} p( i_{t}=k, C_{t}=l \mid C_{1:t-1}, i_{1:t-1}, \mathbf {h}) \Bigr ]}_{\mathrm{Additional~part}} \nonumber \\&\cdot ~p(z_{t} \mid m_{t-1}, x_{t-1},u_{t})p(f_{t} \mid C_{1:t-1}, f_{1:t-1}, \mathbf {h}) \nonumber \\&\cdot ~\frac{p(S_{t} \mid S_{1:t-1}, C_{1:t-1}, \alpha ,\beta )}{p(S_{t} \mid S_{1:t-1}, \beta )} \cdot \omega _{t-1}. \end{aligned}$$
(5)

Unlike in the original SpCoSLAM algorithm, the marginal likelihood for \(i_{t}\) and \(C_{t}\), weighted by the marginal likelihood for the position distribution, is added as the first term on the right side of (5) (the additional part). The amount of computation does not increase, because most of the terms in the weight \(\omega _{t}\) are already computed when \(i_{t}\) and \(C_{t}\) are sampled. Equation (5) realizes a weight calculation that considers the likelihood of the entire model. This is described in Algorithm 1 (Line 16) and Algorithm 2 (Line 17).

3.2.4 Weight for selecting a language model LM (WS)

In the formulation of (1), it is desirable to estimate the language model \(LM_{t}\) for each particle. In other words, for each teaching utterance, speech recognition would have to be performed a number of times equal to the number of particles. In this paper, to reduce this computational cost, we instead use the language model \(LM_{t}\) of the particle with the maximum weight for speech recognition.

We also modify the weight for selecting the language model from the entire weight \(\omega _{t}\) of the model to the weight \(\omega _{S}\) related to word information:

$$\begin{aligned} \omega _{S} = \frac{p(S_{1:t} \mid C_{1:t-1}, \alpha ,\beta )}{p(S_{1:t} \mid \beta )}. \end{aligned}$$
(6)

The segmentation result for all uttered sentences changes for each particle at every step, because the word segmentation process uses all previous data. Better word segmentation results can therefore be selected by a weight that considers not only the current data but also previous data. In addition, this modified weight corresponds to the mutual information used for selecting word segmentation results in SpCoA++ (Taniguchi et al. 2018a). This is described in Algorithm 1 (Line 23) and Algorithm 2 (Line 24).

4 SpCoSLAM 2.0: improved and scalable online learning algorithm

In this section, we describe an improved and scalable online learning algorithm, SpCoSLAM 2.0, that overcomes the problems of the original algorithm. Although the generative process and graphical model are the same as in SpCoSLAM, the learning algorithm is different. SpCoSLAM 2.0 is a novel learning algorithm with a modified mathematical formulation that retains the model structure, analogous to the extension from FastSLAM to FastSLAM 2.0. First, the algorithm is improved by introducing techniques such as rejuvenation, as explained in Sect. 4.1. Next, a scalable algorithm is developed to reduce the calculation time while maintaining higher accuracy than the original algorithm, as described in Sect. 4.2.

(Algorithm 1 and Algorithm 2: pseudo-code of the improved and scalable online learning algorithms)

4.1 Improving the estimation accuracy

We now turn to the details of the improved algorithm. Here, we introduce two elements: fixed-lag rejuvenation of latent variables, and re-segmentation of word sequences. Pseudo-code for the improved algorithm is given in Algorithm 1.

4.1.1 Fixed-lag rejuvenation of \(i_{t}\) and \(C_{t}\) (FLR–\(i_{t}\), \(C_{t}\))

Canini et al. (2009) demonstrated improved accuracy with rejuvenation, which resamples randomly selected previous samples. This relies on the latent variables in the Latent Dirichlet Allocation (LDA) model being independent and identically distributed (i.i.d.). However, when selecting from previous data at all time points, all previous samples must be held in memory. In the proposed algorithm, we instead introduce Fixed-Lag Rejuvenation (FLR), inspired by the Monte Carlo fixed-lag smoother (Kitagawa 2014). This approach is similar to the fixed-lag roughening sampling strategy for particle filter-based SLAM in Beevers and Huang (2007), who showed that the statistical estimation error can be reduced by applying Markov Chain Monte Carlo (MCMC)-based sampling to the trajectory samples over a fixed lag at each time step.

The fixed-lag smoother is a particle smoothing method that estimates particles approximating the smoothing distribution \(p(\mathbf {C}_{\tau } \mid D_{1:t})\)\((\tau <t)\), where D is the observed data. It is obtained by a simple modification of the particle filter: particles from time-step \(t - T_{L} + 1\) to t are saved and resampled at each step according to weights based on the newly observed data, where \(T_{L}\) denotes the fixed-lag value. As a result, the particles at step \(\tau \) are estimated not from the observed data \(D_{1:\tau }\) alone, but from \(D_{1:\tau + T_{L}}\), i.e., they approximate the smoothing distribution \(p(\mathbf {C}_{\tau } \mid D_{1:\tau + T_{L}})\). In general, a smoothing method such as a fixed-lag particle smoother estimates the joint posterior distribution of latent variables more accurately than a naive online method such as a particle filter.

Fig. 3

Overview of the Fixed-Lag Rejuvenation of \(i_{t}\) and \(C_{t}\). Left: naive online learning in the original algorithm. Right: online learning using FLR in the improved algorithm. The thick orange frame is estimated by sampling. In this case, the fixed-lag value \(T_{L}\) is three. The gray boxes mean that the estimated value will never again be updated, i.e., distributions of already immobilized (fixed) latent variables by online learning (Color figure online)

Figure 3 shows an overview of the FLR of \(i_{t}\) and \(C_{t}\). The notation \(\tau \mid t\) in the box in Fig. 3 is shorthand notation for the subscript representing the time-step in the conditional marginal posterior distribution, e.g., \(p(\mathbf {C}_{\tau } \mid D_{1:t})\). The FLR is the process of sampling the latent variables \(i_{\tau }\) and \(C_{\tau }\) by iterating \(T_{L}\) times from the previous step \(t-T_{L}+1\) to the current step t for each particle as follows:

$$\begin{aligned} i_{\tau },C_{\tau } \sim p(i_{\tau },C_{\tau } \mid x_{0:t},S_{1:t},f_{1:t}, i_{ \{1:t\mid \lnot \tau \} },C_{ \{1:t\mid \lnot \tau \} },\mathbf {h})\nonumber \\ \end{aligned}$$
(7)

where \(i_{ \{1:t\mid \lnot \tau \} }\) and \(C_{ \{1:t\mid \lnot \tau \} }\) denote the sets of elements from 1 to t excluding the element at step \(\tau \). In this case, the latent variables of step \(t-T_{L}\) can be sampled using data up to step t, as described in Algorithm 1 (Lines 12–14). Equation (7) is the same conditional posterior distribution as that used for marginalized (collapsed) Gibbs sampling in batch learning. Therefore, FLR corresponds to running a few Gibbs sampling iterations over the most recent latent variables during online learning.
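The FLR window can be sketched as follows. Here `sample_conditional` stands in for the collapsed-Gibbs conditional of Eq. (7); the toy conditional and all names are our own illustrative assumptions, not the actual SpCoSLAM conditionals.

```python
import random

T_L = 3  # fixed-lag value

def fixed_lag_rejuvenate(latents, data, sample_conditional):
    """Re-sample the latent variables inside the lag window [t - T_L + 1, t].

    latents: per-step latent assignments (modified in place).
    sample_conditional(tau, latents, data): draws a new value for step tau
        conditioned on all other latents and all data up to t, in the
        spirit of the collapsed Gibbs conditional of Eq. (7).
    Latents older than the window are never revisited, so only a bounded
    amount of recent state needs to remain mutable.
    """
    t = len(latents)
    for tau in range(max(0, t - T_L), t):
        latents[tau] = sample_conditional(tau, latents, data)
    return latents

# Toy stand-in for the Gibbs conditional: a coin biased by data[tau].
def toy_conditional(tau, latents, data):
    p_one = 0.9 if data[tau] > 0 else 0.1
    return 1 if random.random() < p_one else 0

random.seed(0)
latents, data = [], []
for x in [1, -1, 1, 1, -1, 1]:   # simulated observations arriving online
    data.append(x)
    latents.append(0)            # naive initial assignment at step t
    fixed_lag_rejuvenate(latents, data, toy_conditional)
```

Each assignment thus gets re-sampled \(T_{L}\) times in total before it leaves the window and is frozen, which is how later observations can correct earlier estimates without unbounded memory.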

4.1.2 Re-segmentation of word sequences (RS)

We introduce re-segmentation of word sequences to improve the accuracy of word segmentation. In the original algorithm, we approximated the left side of (4) by registering the word sequences segmented by latticelm in the word dictionary. However, this can be considered a process of sampling a language model LM from the word sequences \(S_{1:t}^{*}\) and a hyperparameter \(\lambda \) of the language model. Therefore, we adopt NPYLM (Mochihashi et al. 2009), an unsupervised word segmentation method, to estimate a language model from the word sequences as follows:

$$\begin{aligned} LM \sim \mathrm{NPYLM}(LM \mid S_{1:t}^{*}, \lambda ). \end{aligned}$$
(8)

The procedure for introducing the RS is as follows: (i) word sequences \(S_{1:t}\) are obtained by WFST speech recognition and latticelm; (ii) the word sequences \(S_{1:t}^{*}\) of the maximum-likelihood particle are converted into syllable sequences and re-segmented into word sequences using NPYLM; (iii) the word dictionary LM is updated using the segmented words, as described in Algorithm 1 (Line 24). In this manner, we can recover words that tend to be under-segmented, while still accounting for the uncertainty of speech recognition errors through latticelm. Note that this introduces a discrepancy between the words used for spatial concept acquisition and the word set registered in the word dictionary.
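Steps (ii) and (iii) above can be sketched as a small pipeline. The helpers `syllabify` and `npylm_segment` are hypothetical stand-ins for flattening a word sequence back into syllables and for the NPYLM segmenter, respectively; neither name comes from the original codebase.

```python
def resegment_word_sequences(sentences, syllabify, npylm_segment):
    """Sketch of the RS step.

    `sentences` are the maximum-likelihood particle's word sequences;
    `syllabify(sentence)` converts a word sequence into a syllable string,
    and `npylm_segment(syllable_seqs)` performs unsupervised word
    segmentation (both are assumed interfaces).
    """
    # (ii) convert word sequences back into syllable sequences...
    syllable_seqs = [syllabify(s) for s in sentences]
    # ...and re-segment them into word sequences with NPYLM
    segmented = npylm_segment(syllable_seqs)
    # (iii) rebuild the word dictionary from the re-segmented words
    dictionary = sorted({w for sent in segmented for w in sent})
    return segmented, dictionary
```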

Fig. 4
figure 4

Overview of the fixed-lag rejuvenation of \(S_{t}\). Left: batch learning with the original algorithm. Right: pseudo-online learning using FLR in the scalable algorithm. The thick orange frame is estimated by sampling from the joint distribution. In this case, the fixed-lag value \(T_{L}\) is three. The gray boxes denote that the estimated value will never be updated again, i.e., distributions of already immobilized (fixed) latent variables by online learning (Color figure online)

4.2 Scalability for reduced computational cost

In this section, we describe the details of the scalable algorithm. Here, we introduce two elements: the sequential Bayesian update of the parameters in the posterior distribution, and unsupervised word segmentation from WFST speech recognition results using FLR. The scalable algorithm can be combined with the FLR of \(i_{t}\) and \(C_{t}\) from the improved algorithm. The pseudo-code for the scalable algorithm is given in Algorithm 2.

4.2.1 Sequential Bayesian update of parameters in the posterior distribution (SBU)

We introduce a Sequential Bayesian Update (SBU) for the posterior hyperparameters \(H_{t}\) in the posterior distribution. In the original algorithm, the model parameters \(\varTheta \) are estimated from all the data \(D_{1:t}=\{ f_{1:t}, y_{1:t} \}\) and the set of latent variables \(\mathbf {C}_{1:t}\) during each step. However, FastSLAM avoids holding all the previous data by updating a map \(m_{t}\) from \(x_{t}\), \(z_{t}\), and \(m_{t-1}\) sequentially. That is, it assumes the measurement model \(p(z_{t} \mid x_{0:t}, z_{1:t-1})=p(z_{t} \mid x_{t}, m_{t-1})\) and the updated occupancy grid map \(p(m_{t} \mid x_{0:t},z_{1:t})=p(m_{t} \mid x_{t},z_{t},m_{t-1})\). Similarly, the posterior hyperparameters \(H_{t}\) can be calculated from the new data \({D}_{t}\), latent variables \(\mathbf {C}_{t}\), and posterior hyperparameters \(H_{t-1}\) from previous steps. Thus, both the computational and memory efficiency, crucial for long-term learning with real robots, can be significantly improved. The SBU for the posterior hyperparameters is calculated as follows:

$$\begin{aligned} p(\varTheta \mid H_{t}) &= p(\varTheta \mid D_{1:t},\mathbf {C}_{1:t}, \mathbf {h}) \nonumber \\ &= p(\varTheta \mid D_{t},\mathbf {C}_{t}, \{ D_{1:t-1},\mathbf {C}_{1:t-1}, \mathbf {h} \} ) \nonumber \\ &= p(\varTheta \mid D_{t},\mathbf {C}_{t}, H_{t-1}) \nonumber \\ &\propto p(D_{t} \mid \mathbf {C}_{t}, \varTheta )\, p(\varTheta \mid H_{t-1}). \end{aligned}$$
(9)

These posterior hyperparameters \(H_{t}\) can also be used to sample \(\mathbf {C}_{t}\). In the implementation, it suffices to hold the statistics obtained during the calculation of the posterior distribution. Note that the left-hand and right-hand sides of (9) yield strictly identical results. In the particle filter literature, the SBU approach corresponds to keeping track of sufficient statistics (Kantas et al. 2015).
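For a conjugate component, the SBU of (9) reduces to incrementing pseudo-counts. The sketch below shows this for a single Dirichlet-categorical component, where the posterior hyperparameters are count statistics; this is a simplified illustration, not the full SpCoSLAM 2.0 parameter set.

```python
from collections import Counter

def sbu_dirichlet(H_prev, data_t, labels_t):
    """Sequential Bayesian update for a Dirichlet-categorical component.

    H_prev maps (cluster, feature) -> pseudo-count. Only the new step's
    data and latent labels are needed, mirroring Eq. (9): the past is
    summarized entirely by H_{t-1}.
    """
    H_t = Counter(H_prev)
    for label, feature in zip(labels_t, data_t):
        H_t[(label, feature)] += 1  # conjugate count update
    return H_t
```

Because the update is exact for conjugate models, applying it step by step yields the same posterior hyperparameters as a batch computation over all the data.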

The SBU equation is used together with FLR as follows:

$$\begin{aligned} p(\varTheta \mid H_{t}) &= p(\varTheta \mid D_{1:t},\mathbf {C}_{1:t}, \mathbf {h}) \nonumber \\ &= p(\varTheta \mid D_{t'+1:t},\mathbf {C}_{t'+1:t}, \{ D_{1:t'}, \mathbf {C}_{1:t'}, \mathbf {h} \} ) \nonumber \\ &= p(\varTheta \mid D_{t'+1:t},\mathbf {C}_{t'+1:t}, H_{t'}) \nonumber \\ &\propto p(D_{t'+1:t} \mid \mathbf {C}_{t'+1:t}, \varTheta )\, p(\varTheta \mid H_{t'}), \end{aligned}$$
(10)

where \(t' = t - T_{L}\) denotes the time-step preceding the lag window. In this case, it is only necessary to hold the observed data and posterior hyperparameters for the \(T_{L}\) steps in the lag window. Equation (10) is applied in Algorithm 2.
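The memory bound implied by (10) can be illustrated with a rolling window that keeps only the last \(T_{L}\) observations and absorbs anything older into frozen statistics. The class below is a simplified sketch under that assumption; the real algorithm stores richer sufficient statistics than plain counts.

```python
from collections import Counter, deque

class LagWindow:
    """Keeps the most recent T_L (label, feature) pairs for rejuvenation
    and absorbs anything older into frozen pseudo-counts H, playing the
    role of H_{t'} in Eq. (10). Illustrative only.
    """
    def __init__(self, T_L):
        self.T_L = T_L
        self.window = deque()
        self.H = Counter()  # frozen statistics up to step t' = t - T_L

    def push(self, label, feature):
        self.window.append((label, feature))
        if len(self.window) > self.T_L:
            # Evicted items are immobilized: they are folded into H
            # and will never be resampled again.
            old_label, old_feature = self.window.popleft()
            self.H[(old_label, old_feature)] += 1
```

Memory usage is thus O(\(T_{L}\)) per particle, independent of the total number of teaching utterances.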

4.2.2 WFST speech recognition and unsupervised word segmentation using FLR (FLR–\(S_{t}\))

We describe the proposed algorithm, which combines FLR and SBU to address the problems of unsupervised online word segmentation and to reduce the computation time simultaneously. FLR can also be extended to the sampling of \(S_{t}\) in a pseudo-online manner. Figure 4 shows an overview of the FLR of \(S_{t}\). The notation \(\tau \mid t\) has the same meaning as in Fig. 3. The data used for speech recognition and word segmentation are changed from those in (4) to the data within the fixed-lag interval. In addition, speech recognition uses the initial syllable dictionary in the steps up to step \(T_{L}\), and the word dictionary from step \(t'\) in the steps from \(T_{L}+1\) onward. In this case, we can perform word segmentation based on the statistical information collected from the WFSTs recognized over the \(T_{L}\) data in the lag window. FLR samples the word sequences \(S_{t'+1:t}\) of time-steps \(t'+1\) through the current step t simultaneously as follows:

$$\begin{aligned} S_{t'+1:t} &\sim p(S_{t'+1:t} \mid y_{t'+1:t},AM,S_{1:t'},\lambda ) \nonumber \\ &\approx \mathrm{latticelm}(S_{t'+1:t} \mid \mathcal{L}_{t'+1:t},\lambda ) \cdot \mathrm{SR}(\mathcal{L}_{t'+1:t} \mid y_{t'+1:t},AM,LM_{t'}). \end{aligned}$$
(11)

Therefore, this approach can address the problem in the original algorithm by which incorrect word segmentation in early learning stages was propagated to the following learning stages.

Here, the amount of calculation per step is constant, irrespective of the total amount of data. This property of the FLR of \(S_{t}\) is an important advantage for scalability. However, there is a concern that word segmentation using FLR becomes less accurate than batch learning because of the limited availability of statistical information. Essentially, the scalable algorithm trades word segmentation accuracy for calculation time. In the language model update, the word dictionary \(LM_{t}\) holds information regarding the words \(S_{t'+1:t}\) segmented from steps \(t'+1\) to t and the previous word dictionary \(LM_{t'}\). This is described in Algorithm 2 (Lines 4, 12, and 25).

Table 1 shows the order of computational complexity for each learning algorithm. The number of data is denoted by N, the number of particles by R, the fixed-lag value by \(T_{L}\), the number of Gibbs sampling iterations in batch learning by G, the number of candidate word segmentation results for updating the language model in SpCoA++ by M, and the number of iterations for parameter estimation in SpCoA++ by I. All variables other than N are constants that can be preset by the user. Therefore, among these algorithms, only the scalable algorithm does not depend on the number of data N. In addition, the scalable algorithm is computationally more efficient than the original SpCoSLAM algorithm whenever \(T_{L}<N\).

Table 1 Computational complexity of the learning algorithms
Fig. 5
figure 5

Top: learning results of position distributions in a generated map. Ellipses denote the position distributions drawn on the map at steps 15, 30, and 50. The colors of the ellipses were randomly determined for each index number \(i_{t}=k\). Bottom: examples of scene images captured by the robot. The correct word (in English) and estimated words are shown for each position distribution at steps 15, 30, and 50 (Color figure online)

5 Experiment I

We performed experiments to demonstrate online learning of spatial concepts in a novel environment. In addition, we performed evaluations of place categorization and lexical acquisition related to places. We compared the performance of the following methods:

  1. (A)

    SpCoSLAM (Taniguchi et al. 2017)

  2. (B)

    SpCoSLAM with AW + WS (Sect. 3.2)

  3. (C)

    SpCoSLAM 2.0 (FLR–\(i_{t},C_{t}\))

  4. (D)

    SpCoSLAM 2.0 (FLR–\(i_{t},C_{t}\) + RS)

  5. (E)

    SpCoSLAM 2.0 (FLR–\(i_{t},C_{t},S_{t}\) + SBU)

  6. (F)

    SpCoA++ (Batch learning) (Taniguchi et al. 2018a)

Methods (A) and (B) used the original and modified SpCoSLAM algorithms, respectively. Methods (C) and (D) used the proposed improved algorithm under different conditions; in both, the lag value for FLR was set to \(T_{L}=10\). Method (E) used the proposed scalable algorithm under three different conditions: the lag values for the FLR were set to \(T_{L}=\) 1, 10, and 20 for (E1), (E2), and (E3), respectively. The batch-learning method (F) was estimated by Gibbs sampling based on a weak-limit approximation (Fox et al. 2011) of the Stick-Breaking Process (SBP) (Sethuraman 1994), a constructive representation of the Dirichlet Process (DP). The upper limits of the spatial concepts and position distributions were set to \(L=50\) and \(K=50\), respectively. We set the number of iterations for Gibbs sampling to \(G=100\). In method (F), we set the number of candidate word segmentation results for updating the language model to \(M=6\), and the number of iterative estimation procedures to \(I=10\). In addition, (F) did not use image features, in keeping with the original model setting. Note that SpCoA++ (F) was not evaluated in Taniguchi et al. (2017) because it is the latest batch-learning method.

5.1 Online learning

We conducted experiments on online spatial concept acquisition in a real environment. We implemented SpCoSLAM 2.0 based on the open-source SpCoSLAM implementation, extending the gmapping package, which implements grid-based FastSLAM 2.0 (Grisetti et al. 2007), in the Robot Operating System (ROS). We used an open dataset, albert-b-laser-vision, i.e., a rosbag file containing odometry, laser range data, and image data. This dataset was obtained from the Robotics Data Set Repository (Radish) (Howard and Roy 2003). Because speech data was not included in the dataset, we prepared Japanese speech data corresponding to the movement of the robot. The total number of taught utterances was \(N=50\), including 10 types of phrases. The robot learned 10 places and 9 place names. The microphone was a SHURE PG27-USB. Julius dictation-kit-v4.4 (DNN-HMM decoding) (Lee and Kawahara 2009) was used as the speech recognizer. The initial word dictionary contained 115 Japanese syllables. The unsupervised word segmentation system used latticelm (Neubig et al. 2012). The image feature extractor was implemented with Caffe, a deep-learning framework (Jia et al. 2014). We used a pre-trained CNN model, Places365-ResNet, trained with 365 scene categories from the Places2 Database with 1.8 million images (Zhou et al. 2018). The number of particles was \(R=30\). The hyperparameters for online learning were set as follows: \(\alpha =20\), \(\gamma =0.1\), \(\beta =0.1\), \(\chi =0.1\), \(m_{0}=[ 0 , 0 ]^\mathrm{T}\), \(\kappa _{0}=0.001\), \(V_{0}=\mathrm{diag}(2,2)\), and \(\nu _{0}=3\). These parameters were set such that all online methods were tested under the same conditions. The hyperparameters for batch learning were set as follows: \(\alpha =10\), \(\gamma =10\), \(\beta =0.1\), \(m_{0}=[ 0 , 0 ]^\mathrm{T}\), \(\kappa _{0}=0.001\), \(V_{0}=\mathrm{diag}(2,2)\), and \(\nu _{0}=3\).
The hyperparameters were determined manually and empirically according to each method. Note that the speech recognition decoder, the image feature extractor, and the hyperparameters were changed from Taniguchi et al. (2017).

Figure 5 (top) shows the position distributions in the environmental maps at steps 15, 30, and 50 for method (D). This figure visualizes how spatial concepts are acquired during sequential mapping of the environment. The position distributions were appropriately formed for the places uttered by the user each time. At step 15, the map covers only two rooms (upper right) and a corridor, with five position distributions. The map obtained at step 50 covers the entire environment, with 11 estimated position distributions. Figure 5 (bottom) shows an example of the correct phoneme sequence of each place name, together with the three best words estimated by the probability distribution \(p(S_{t} \mid i_{t}, \varTheta _{t}, LM_{t})\) at step t. The left side shows an example of the scene images observed in the \(i_{t}\)-th position distribution corresponding to the name of each place. As the steps proceed, the words corresponding to the places were learned stably as phoneme sequences closer to the correct answers. For example, for /kyouyuuseki/ (shared desk), at step 15 the correspondence between the place and phoneme sequence was insufficiently learned, e.g., /bawakyoo/ and /yuseki/. However, by step 50, the word was learned correctly: /kyouyuseki/. The index of the position distribution of /roboqtookiba/ (robot storage space) changed from 4 to 6. This change means that the label number switched because previous estimated values were modified as learning progressed. Details of the online learning experiment can be found in a video online.

5.2 Evaluation metrics

We evaluated the different algorithms according to the following metrics: the Adjusted Rand Index (ARI) (Hubert and Arabie 1985) of the classification results of spatial concepts \(C_{1:N}\) and position distribution \(i_{1:N}\); the Estimation Accuracy Rate (EAR) of the estimated total numbers of spatial concepts L and position distributions K; and the Phoneme Accuracy Rate (PAR) of uttered sentences and words related to places. We conducted six learning trials under each algorithm condition. The details of the evaluation metrics are described in the following sections.

5.2.1 Estimation accuracy of spatial concepts

We compared the matching rate for the estimated indices \(C_{1:N}\) of the spatial concept and the classification results of the correct answers given by a person. In this experiment, the evaluation metric adopts the ARI, which is a measure of the similarity between two clustering results. The matching rate for the estimated indices \(i_{1:N}\) of the position distributions was evaluated in the same manner.

In addition, we evaluated the estimated number of spatial concepts L and position distributions K using the EAR. The EAR was calculated as follows:

$$\begin{aligned} \mathrm{EAR}= \mathrm{max} \left( 1 - \frac{\mid n^\mathrm{C}_{t} - n^\mathrm{E}_{t} \mid }{n^\mathrm{C}_{t}}, 0 \right) \end{aligned}$$
(12)

where \(n^\mathrm{C}_{t}\) is the correct number and \(n^\mathrm{E}_{t}\) is the estimated number at time-step t.
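The EAR of (12) is a direct computation. The sketch below implements it as stated, clipping at zero so that overestimates larger than twice the correct number do not produce negative scores.

```python
def ear(n_correct, n_estimated):
    """Estimation Accuracy Rate (Eq. 12): one minus the relative error
    in the estimated number of clusters, clipped at zero."""
    return max(1.0 - abs(n_correct - n_estimated) / n_correct, 0.0)
```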

Table 2 Evaluation results in a real environment

5.2.2 PAR of uttered sentences

We next compared the accuracy rate of phoneme recognition and word segmentation for all the recognized sentences. However, it was difficult to evaluate ambiguous phoneme recognition and unsupervised word segmentation separately. Therefore, the experiment treated each word delimiter as a single character. The correct phoneme sequence was suitably segmented into Japanese morphemes using MeCab (Kudo 2006), an off-the-shelf Japanese morphological analyzer widely used for natural language processing. However, each place name was treated as a single word.

We calculated the PAR of the uttered sentences with the correct phoneme sequence \(s^\mathrm{P}_{t}\), and a phoneme sequence \(s^\mathrm{R}_{t}\) of the recognition result of each uttered sentence. The PAR was calculated as follows:

$$\begin{aligned} \mathrm{PAR}= \mathrm{max} \left( 1 - \frac{\mathrm{LD}(s^\mathrm{P}_{t},s^\mathrm{R}_{t})}{n^\mathrm{P}}, 0 \right) \end{aligned}$$
(13)

where \(\mathrm{LD}(\cdot )\) denotes the Levenshtein distance between \(s^\mathrm{P}_{t}\) and \(s^\mathrm{R}_{t}\), and \(n^\mathrm{P}\) denotes the number of phonemes in the correct phoneme sequence.
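The PAR of (13) can be sketched with a standard dynamic-programming edit distance; the sequences are assumed to be strings of phoneme symbols with delimiters counted as single characters, as described above.

```python
def levenshtein(a, b):
    """Edit distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def par(correct, recognized):
    """Phoneme Accuracy Rate (Eq. 13), clipped at zero."""
    return max(1.0 - levenshtein(correct, recognized) / len(correct), 0.0)
```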

5.2.3 PAR of words related to places

We also evaluated whether the properly segmented place names were learned as phoneme sequences. This experiment assumed a request for the best phoneme sequence, \(s_{t}^{*}\), representing the self-position \(x_{t}\) of the robot. We compared the PAR of words between the correct place name and the selected word for each teaching place. The PAR was calculated using (13).

The word \(s_{t}^{*}\) was selected as follows:

$$\begin{aligned} s_{t}^{*} = {{\,\mathrm{argmax}\,}}_{S_{t,b}} p(S_{t,b} \mid x_{t}, \varTheta _{t}, LM_{t}). \end{aligned}$$
(14)

In this experiment, we used self-positions \(x_{t}\) that were not included in the training data to evaluate the PAR of words. The robot can perform sufficiently accurate self-localization using a laser range finder. Therefore, in this experiment, we assume that \(x_{t}\) is given as an accurate coordinate value without error.
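The selection in (14) scores each candidate word at the given position by marginalizing over the spatial concepts. The sketch below illustrates this with three log-probability callables standing in for the learned multinomial, Gaussian, and mixing parameters in \(\varTheta _{t}\); all three names are illustrative assumptions, not the original API.

```python
import math

def select_word(words, concepts, log_w_given_c, log_x_given_c, log_c):
    """Pick the word maximizing p(S | x_t, Theta_t, LM_t) as in Eq. (14),
    scoring each word by sum_c p(S | c) p(x_t | c) p(c).

    `log_w_given_c(word, c)` is the word multinomial, `log_x_given_c(c)`
    the position likelihood at the fixed query position x_t, and
    `log_c(c)` the concept prior (all assumed interfaces).
    """
    def score(word):
        return math.log(sum(
            math.exp(log_w_given_c(word, c) + log_x_given_c(c) + log_c(c))
            for c in concepts))
    return max(words, key=score)
```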

The more accurately a method recognized words and acquired spatial concepts, the higher its PAR. We therefore consider this evaluation metric an overall measure of the proposed method's performance.

5.3 Evaluation results and discussion

In this section, we discuss the improvement and scalability of the proposed learning algorithms. Table 2 lists the averages of the evaluation values calculated using the metrics ARI, EAR, and PAR at step 50.

Fig. 6
figure 6

Examples of corrected place clustering results. Left: the original algorithm (A). Right: the improved SpCoSLAM 2.0 algorithm (D)

5.3.1 ARI and EAR results

In terms of categorization accuracy, the proposed algorithms that introduced FLR tended to show higher ARI values than the original SpCoSLAM algorithms (A) and (B). Figure 6 shows examples of the progress of place clustering for the position distributions in (A) and (D). The step numbers in the left (A) and right (D) figures are not the same. In both cases, large position distributions covering distant areas were learned, i.e., the purple ellipses in the top figures. In (A), incorrect clustering results remained at the final step (i.e., step 50) because the original SpCoSLAM algorithm cannot correct past erroneous estimations. By contrast, in (D), an incorrect cluster occurred at step 25 (top right figure); however, by introducing FLR, the algorithm corrected the erroneous estimates by step 30 (bottom right figure). Therefore, in the original algorithm (A), estimation errors adversely affect subsequent estimations, whereas SpCoSLAM 2.0 (D) quickly recovered accurate estimations despite previous incorrect ones. Situations similar to (D) were also confirmed in the other proposed algorithms that introduced FLR. These experimental results demonstrate that FLR, which resamples the latent variables of previous steps using observations up to the current step, contributes to improving the accuracy of online place clustering.

Fig. 7
figure 7

Change in the EAR regarding the estimated total number of spatial concepts L (top) and position distributions K (bottom) for each step

Figure 7 shows the EAR values for spatial concepts and position distributions, i.e., the accuracy of the estimated number of clusters, at each step. The EAR values were unstable during the first half of the steps, although they converged stably to high values in the latter half. At step 50, (D) showed the highest EAR for L and (E1) showed the highest EAR for K. However, for both L and K, averaged over all steps, (A) and (B) yielded relatively low values overall, and (C) and (D) yielded relatively high values. (E1)–(E3) tended to show values between those of the original algorithms (A) and (B) and the improved algorithms with FLR (C) and (D). From the results of (C) and (D), the EAR improved considerably when the FLR of \(i_{t}\) and \(C_{t}\) was introduced.

5.3.2 PAR sentence and word results

From the results of the improved algorithm (D), the PAR values (sentence and word) improved markedly by adding the re-segmentation of the word sequences. These results show that the robot can accurately segment the names of places and learn the relationship between places and words more precisely. In particular, method (D), which combines FLR and RS, achieved an overall improvement compared with the other online algorithms. Some trial results showed PAR values comparable to those of SpCoA++ (F). Figure 8 shows the PAR of words at each step. The PAR tended to increase overall. Therefore, the PAR values can be expected to increase further as the number of steps advances. Table 3 presents examples of word segmentation results for four methods. The correct phoneme sequence, i.e., the ground truth, was segmented into Japanese morphemes using MeCab (Kudo 2006), where " \(\mid \) " denotes a delimiter, i.e., a word segment position. The parts in bold correspond to the name of each place. SpCoSLAM (A) showed under-segmented results in many cases. On the other hand, SpCoSLAM 2.0 (D) and (E3) properly segmented the phoneme sequences representing the name of the place. Comparing (D) and (E3), (D) obtained segmentation results close to those of the batch-learning method (F), whereas (E3) sometimes slightly over-segmented words. Therefore, SpCoSLAM 2.0 can mitigate under-segmentation by applying the word segmentation of the batch-learning method in a pseudo-online manner.

Fig. 8
figure 8

Change in the PAR of words for each step

Table 3 Examples of word segmentation results of uttered sentences

5.3.3 Original and modified SpCoSLAM algorithms

Although the modified SpCoSLAM (B) is theoretically more appropriate than the original algorithm (A), few differences were found between them. In the proposed algorithms, the time-driven process, i.e., the SLAM part, and the event-driven process, i.e., spatial concept formation and lexical acquisition, were estimated by the same particle filter. Although self-localization and mapping were performed each time the robot moved in the environment, the latent variables for the spatial concepts and lexicon were updated only upon the user's utterance. Thus, particles can fluctuate as a result of resampling caused by movement in the absence of the user's utterance. Consequently, the weight for self-localization might dominate the weight for the spatial concepts and lexicon. This will be investigated in future work.

Fig. 9
figure 9

Calculation times per step for evaluating scalability

5.3.4 Calculation time and scalable algorithm

Figure 9 compares the calculation times of the online learning algorithms. With batch learning, SpCoA++'s overall calculation time, including the runtime of rosbag for SLAM, was 13,850.873 s, and the calculation times per iteration for the iterative estimation procedure and Gibbs sampling were 1,318.954 s and 1.833 s, respectively. In the original SpCoSLAM algorithms (A) and (B) and the improved SpCoSLAM 2.0 algorithms (C) and (D), the calculation time increased with the number of steps, i.e., as the amount of data increased. However, the scalable SpCoSLAM 2.0 algorithms (E1)–(E3) retained a constant calculation time regardless of the increase in the amount of data. This property is particularly beneficial for long-term learning.

In the scalable algorithms (E1)–(E3), the ARI and PAR values tended to improve overall as the lag value increased. In particular, when the lag value was 20, the evaluation values approached those of the improved algorithm.

Owing to the trade-off between fixed-lag size and accuracy, the lag value needs to be set appropriately according to both the computational power embedded in the robot and the latency requirements of actual operation. In this experiment, we did not evaluate the scalability of the algorithm with parallel processing. However, we consider that the proposed algorithm could be executed even faster by parallelizing the per-particle processing and by using Graphics Processing Units (GPUs). In that case, we expect that the robot would be able to learn in real time while moving within the environment.

6 Experiment II

This experiment investigates whether trends similar to the evaluation results on the real-environment dataset in Sect. 5 can be obtained stably across different environments. We evaluated place categorization and lexical acquisition related to places in virtual home environments, and compared the evaluation metrics ARI, EAR, and PAR for methods (A)–(F) in the same manner as in Sect. 5.

Fig. 10
figure 10

Examples of home environments in SIGVerse

Table 4 Evaluation results in simulator environments

6.1 Condition

Online spatial concept acquisition experiments were conducted in various virtual home environments. The simulator environment was SIGVerse version 3.0 (Inamura et al. 2010), a client-server architecture that connects ROS and Unity. The virtual robot in SIGVerse was Toyota's Human Support Robot (HSR), and we used 10 different home environments created using Sweet Home 3D, free interior-design software. For each place, 10 training data were provided on average. The total number of taught utterances was \(N = 60\), including 10 types of phrases. The robot learned six places and their respective names. The microphone and speech recognizer were the same as those in Sect. 5.1. The image feature extractor was a pre-trained BVLC CaffeNet model (Jia et al. 2014). The number of particles was \(R = 10\). The hyperparameters for learning were set as follows: \(\alpha =10.0\), \(\gamma =1.0\), \(\beta =0.1\), \(\chi =0.1\), \(m_{0}=[ 0 , 0 ]^\mathrm{T}\), \(\kappa _{0}=0.001\), \(V_{0}=\mathrm{diag}(2,2)\), and \(\nu _{0}=3\). The hyperparameters were determined manually and empirically, and were set such that all methods were tested under the same conditions. In method (F), the upper limits of the spatial concepts and position distributions were set to \(L = 20\) and \(K = 20\), respectively. The other settings were identical to those in Sect. 5.

The main target of the evaluation in this study is the accuracy of place clustering and lexical acquisition, i.e., the points extended in SpCoSLAM 2.0. Therefore, in this experiment, we assumed that sufficiently accurate mapping and self-localization were possible with a high-precision distance sensor, and executed an online learning algorithm that separates out and omits the SLAM process. The ground-truth positions obtained from the simulator were used as the self-position data.

Fig. 11
figure 11

Change in PAR of words for each step in simulator environments

6.2 Result

In this section, the improvement and scalability of the proposed learning algorithms in home environments are discussed. Table 4 lists the averages of the evaluation values calculated using the metrics ARI, EAR, and PAR at step 60.

The ARI showed a trend similar to that of the real-environment data. However, the algorithms that introduced FLR showed almost no difference in values compared with the original algorithms (A) and (B). In addition, the EAR showed a slightly different trend from the real-environment data. In the improved algorithms (C) and (D), the estimated number L of spatial concept categories was smaller than the true value compared with the other algorithms. We attribute this to clusters being re-merged into the same category by the FLR. Because the dataset was obtained in a simulator environment, the image features may have been insufficient for place categorization, i.e., similar features might be observed at different places. This problem did not occur with the real-environment data.

Fig. 12
figure 12

Calculation times per step in simulator environments

The PAR showed the same tendency as the real-environment results. Similar to Sect. 5.3, the improved algorithm with RS (D) showed lexical acquisition accuracy comparable to batch learning (F). In addition, the scalable algorithms with FLR of \(S_{t}\), (E2) and (E3), showed higher values than the original algorithms. Figure 11 shows the average PAR of words at each step across the different environments. Similar to Fig. 8, the PAR tended to increase overall. Thus, RS and the FLR of \(S_{t}\) work effectively in virtual home environments.

In the comparison of the original and modified SpCoSLAM algorithms (A) and (B), the modified algorithm (B) showed higher overall ARI and PAR values. We consider that the weight for the spatial concepts and lexicon acted more directly in this experiment than in Sect. 5, because it was not affected by the weight for self-localization.

In the scalable algorithms (E1)–(E3), the tendency for the overall evaluation values to increase with the lag value appeared more prominently than in the real-environment results.

Figure 12 shows the average calculation times of the online learning algorithms in the simulator environments. We confirmed that the result was similar to Fig. 9, which was obtained using the real-environment data. With batch learning, SpCoA++'s overall average calculation time was 8,076.288 s, and the calculation times per iteration for the iterative estimation procedure and Gibbs sampling were 807.623 s and 1.346 s, respectively.

The following are the common inferences from the results of both the simulation and real-world environments. For the online learning, if the user requires the performance of lexical acquisition even at an increased time cost, they can execute the improved algorithm (D) or scalable algorithm with a larger lag value, e.g., (E2) and (E3). If the user requires high-speed calculation, they can obtain better results faster than the conventional algorithm (A) by executing a scalable algorithm such as (E1) and (E2).

7 Conclusion

This paper proposed an improved and scalable online learning algorithm to address the problems encountered by our previously proposed SpCoSLAM algorithm. Specifically, we proposed an online learning algorithm, called SpCoSLAM 2.0, for spatial concept and lexical acquisition with higher accuracy and scalability. In experiments, we conducted online learning with a robot in a novel environment without any pre-existing lexicon or map. In addition, we compared the proposed algorithm to the original online algorithm and to batch learning in terms of estimation accuracy and calculation time. The results demonstrate that the proposed algorithm is more accurate than the original algorithm and of comparable accuracy to batch learning. Moreover, the calculation time of the proposed scalable algorithm is constant for each step, regardless of the amount of training data. We expect this work to contribute to the realization of long-term spatial language interactions between humans and robots.

In the future, we shall experiment with long-term online learning of spatial concepts in large-scale environments based on the scalable algorithm proposed in this paper. Furthermore, with additional development, it will be possible to introduce a forgetting mechanism into the proposed algorithm, as in Araki et al. (2012a). When a robot continues to operate over a long period of time, it will encounter changes in the environment, such as changes to the names of places and areas. Consequently, the robot will benefit from weighting the latest observation data over older observation data. We believe that such a mechanism will be especially effective for long-term learning.

The proposed method constructs spatial concepts on a metric map; however, it can also be extended to learning the topological structure of places, as in Karaoğuz and Bozma (2016) and Luperto and Amigoni (2018). We will explore whether this extension facilitates navigation tasks with human-robot linguistic interactions. In addition, loop-closure detection has been studied actively in recent years, as is evident from long-term visual SLAM (Han et al. 2018). The generative model of SpCoSLAM connects SLAM and lexical acquisition via latent variables related to the spatial concepts. Therefore, we shall also explore loop-closure detection based on speech signals and investigate whether spatial concepts can positively affect mapping.

We will explore whether the SpCoSLAM model proposed herein can be integrated with other probabilistic models to form a large-scale cognitive model for general-purpose autonomous intelligent robots using the SERKET architecture (Nakamura et al. 2018). However, applications of the SERKET architecture are limited by the computational cost of learning the enormous number of parameters in the whole model. Even in such cases, we consider that our proposed approach to online learning will be broadly useful because it can be applied to various other Bayesian models.