1 Introduction

The public transportation system is the backbone of a city’s infrastructure, and the intelligent transportation system (ITS) is an essential chapter of the smart city blueprint. Most studies of ITS focus on traffic flow prediction (Ren and Xie 2017; Geng et al. 2019; Guo et al. 2019; Shi et al. 2020; Wang et al. 2019; Li et al. 2020). Tensor-based methods, such as tensor decomposition (Ren and Xie 2017) and tensor completion (Li et al. 2020), as well as deep learning methods, such as convolutional neural networks (Geng et al. 2019), graph convolutional networks (Yu et al. 2018), and spatiotemporal attention (Guo et al. 2019), have been developed to predict city-wide traffic flow (Geng et al. 2019), metro station-level passenger flow (Li et al. 2020), origin-destination (OD) flow matrices (Ren and Xie 2017; Wang et al. 2019; Shi et al. 2020), etc.

However, the methods mentioned above target traffic flow prediction at a macro level and use passengers’ traffic data indiscriminately, consequently neglecting the personalized travel characteristics of individual passengers. For example, to calculate the passenger flow value, only the number of passengers is counted, which discards individual-level information (Yi et al. 2019). Thus, those methods cannot directly handle individual travel data.

To overcome this issue, we propose to fully utilize the “individual” travel data for “individualized” travel pattern discovery. Individual travel data preserve abundant trajectory information, i.e., passenger u departs from origin o at time t and arrives at destination d at time \(t'\). This encourages us to focus on the following individualized analysis tasks, which have high research value (Zhao et al. 2017). We aim at the following two goals:

  • Clustering of origin, destination, and time (ODT): The latent clusters for origin, destination, and time can be better learned from individual travel data, since the abundant information is well preserved. The intuition is that origins may belong to the same cluster if they co-occur in the trips of the same type of passengers, as shown in Fig. 1(a). The learned ODT clusters could guide better urban planning, suggest more suitable station-surrounding facilities, and uncover peak hours for crowd control.

  • Clustering of passengers: Passengers will also be clustered into different groups based on their trajectories. For public transport providers, with a better understanding of individual passengers’ travel patterns, customized promotions and more suitable operational policies can be designed. For example, the fare surcharge-reward scheme could be tailored for different passenger groups (Tang et al. 2020).

Fig. 1: (a) Different passengers travel from O cluster to D cluster at T cluster; (b) Analogy from document to passenger

The two clustering tasks above for Origin, Destination, Time, and Passenger are referred to as “ODT-P Multi-clustering” for short. However, these two tasks are rather challenging due to the multi-mode spatiotemporal big data and the influence of the external environment.

  • Challenge 1: Multi-mode spatiotemporal big data. Take Hong Kong as an example: there are 2 million passengers daily, each passenger has multiple trips, and each trip has multiple modes such as origin, destination, and time.

  • Challenge 2: External environment. Moreover, passenger behaviors are also affected by the external environment, such as the locations and surroundings of stations. If two stations are geographically adjacent to each other or located in similar functional areas (such as business areas, residential areas, or schools), they tend to attract the same type of passengers.

To tackle the aforementioned challenges, we propose a novel Graph-Regularized Tensor Latent Dirichlet Allocation model (GR-TensorLDA) for ODT-P multi-clustering based on individual passenger travel patterns. First, to preserve the multi-mode structure of the high-dimensional spatiotemporal data, we adopt a tensor-based methodology (Kolda and Bader 2009), which represents the original data with three-mode tensors whose modes represent O, D, and T, respectively. Second, a tensor LDA model is proposed to achieve ODT-P multi-clustering.

The main novelty of our proposed method is that we extend the traditional LDA (Blei et al. 2003) to a tensor version and apply it to individual traffic data. An important analogy is made, as shown in Fig. 1(b):

(1) “Word”-level: A trip is viewed as a three-dimensional word \({\varvec{w}} = (w^O, w^D, w^T)\); (2) “Document”-level: A passenger with several trips, i.e., “a bag of words”, is treated as a three-dimensional document \({\mathbf {d}}_u \in {\mathbb {R}}^{O \times D \times T}\). Generative processes at the passenger level and trip level will be defined along each mode of ODT; (3) “Topic”-level: Therefore, the latent topic is also formulated as a tensor, with each element as \({\varvec{z}} = (z^O, z^D, z^T)\).

The clusters of ODT-P will be eventually obtained in the following way: (1) ODT clustering: Along each dimension of ODT, the topic is a latent distribution of words, which can be viewed as a cluster containing different words; (2) P clustering: each passenger is represented by the latent distribution of the tensor topics, which will be utilized to cluster passengers.

Our most significant technical contributions are twofold:

  • Semantic graph structure: To tackle Challenge 2, we incorporate the external environment as graph structures into the model. Precisely, we first formulate the station-related information as two graphs: (1) A geographical graph measures the spatial distance between stations; (2) A contextual graph quantifies whether two stations are located in similar functional areas (Geng et al. 2019; Li et al. 2020). Then the graph structures are incorporated into the tensor LDA generative process for OD, such that if two stations are close on these two graphs, they are more likely to be in the same topic. We show that by adding such graph regularizations, the interpretability of the learned ODT-P clusters can be significantly improved.

  • Efficient online algorithm: Since the graph regularization breaks the conjugacy, standard techniques such as Gibbs sampling (Griffiths and Steyvers 2004) are no longer applicable; we therefore propose a tensorized variational Expectation-Maximization (EM) algorithm to estimate the parameters. Moreover, to tackle Challenge 1, we need an efficient and scalable algorithm to deal with massive passenger data. Therefore, we further propose to conduct the algorithm in an online stochastic learning manner (Hoffman et al. 2010). We show that, to reach the same level of performance, the online learning algorithm converges twice as fast as the batch learning algorithm.

The remainder of the paper is structured as follows. Section 2 briefly reviews existing tensor methods, individual travel analysis, and topic models. Section 3 formulates the proposed model, and Section 4 develops an efficient optimization algorithm. Section 5 provides detailed experiments to demonstrate the improved meaningfulness of the learned clusters and the model’s scalability. Section 6 concludes and discusses future work and model generalization.

2 Related works

2.1 Tensor and Tucker decomposition

We would like to first introduce tensors and tensor decomposition, since high-dimensional data are usually formulated as tensors and tensor decomposition is widely used for clustering (Kolda and Bader 2009; Sun and Axhausen 2016). A tensor is mathematically defined as a multi-dimensional array, which is believed to have sufficient capacity to preserve the innate complex correlations across multiple dimensions (Kolda and Bader 2009). One of the most popular techniques is tensor decomposition. Tucker decomposition is a high-order principal component analysis: it decomposes a tensor \(\varvec{{\mathcal {X}}} \in {\mathbb {R}}^{O \times D \times T}\) into a core tensor \(\varvec{{\mathcal {C}}} \in {\mathbb {R}}^{J \times K \times L}\) multiplied by a mode matrix along each dimension, \({\mathbf {U}}^O, {\mathbf {U}}^D, {\mathbf {U}}^T\): \(\varvec{{\mathcal {X}}} = \varvec{{\mathcal {C}}} \times _{1} {{\mathbf {U}}}^O \times _{2} {{\mathbf {U}}}^D \times _{3} {{\mathbf {U}}}^T\). Tensor decomposition has been applied to smart transportation for prediction (Li et al. 2020) and clustering (Sun and Axhausen 2016). However, tensor decomposition is rather impractical for our problem. The main reason is that the data formulated for each individual passenger are extremely sparse due to the curse of dimensionality. According to our preliminary study (Li et al. 2021), the sparsity can reach 99.97%, which paralyzes traditional tensor decomposition methods (Tang et al. 2018). Therefore, a technique that specializes in individual analysis is needed.
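To make the Tucker factorization concrete, the following minimal sketch computes a truncated higher-order SVD (HOSVD), one standard way to obtain a Tucker decomposition in NumPy; the tensor sizes and ranks are illustrative and are not taken from the paper's dataset.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the remaining axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd_tucker(X, ranks):
    """Truncated HOSVD: a simple, non-iterative Tucker decomposition of X with the given ranks."""
    factors = []
    for mode, r in enumerate(ranks):
        # Leading left singular vectors of each unfolding give the mode matrix for that dimension.
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    # Core tensor: project X onto the factor subspaces (mode products with U^T).
    core = X
    for mode, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, mode)), 0, mode)
    return core, factors

# Toy ODT flow tensor (origins x destinations x hours) and a rank-(3, 3, 2) fit.
X = np.random.rand(98, 98, 24)
core, (U_O, U_D, U_T) = hosvd_tucker(X, (3, 3, 2))
print(core.shape, U_O.shape, U_D.shape, U_T.shape)  # (3, 3, 2) (98, 3) (98, 3) (24, 2)
```

Iterative refinements such as higher-order orthogonal iteration (HOOI) improve on this one-shot projection, but the structure of the output is the same.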

2.2 Individual travel pattern analysis

The new generation of ITS aims to be more personalized. Recently individual travel data have been utilized for passenger clustering (Briand et al. 2017; Mohamed et al. 2016), station clustering (Mohamed et al. 2016), and personalized services such as travel time estimation (Tang et al. 2018), route recommendation (Liu et al. 2019), destination inference (Cheng et al. 2020), driving state recognition (Yi et al. 2019), and activity discovery (Zhao et al. 2020). There are mainly two kinds of approaches as follows.

2.2.1 Spatio-temporal feature engineering

The first kind of approach relies on intensive feature engineering to extract useful features such as spatial and temporal attributes, OD pairs, and transportation modes, and then combines these features with traditional statistical learning models for clustering and prediction, such as K-means clustering (Zhao et al. 2017), boosted trees (Liu et al. 2019), and random forests (Yi et al. 2019). However, the feature extraction is rather complicated and differs from one system to another, so it does not offer a universal solution. Furthermore, it typically assumes that the feature vector has the same dimension for each passenger, whereas the number of trips per passenger can differ dramatically. In contrast, our model learns the latent dimensions in a data-driven way and accommodates different numbers of trips, which offers a general solution with explainable results.

2.2.2 Generative models

As the second option, generative models (Briand et al. 2017; Mohamed et al. 2016; Tang et al. 2018; Cheng et al. 2020; Zhao et al. 2020) have been adopted for individual traffic analysis. To cluster passengers’ temporal behaviors, Briand et al. (2017) and Mohamed et al. (2016) proposed two-layer generative models based on a mixture of Gaussians or a mixture of unigrams: the first layer partitions passengers into clusters, and the second layer formulates each cluster’s temporal activity. However, the limitation is that passengers are clustered based only on their active or boarding time; therefore, the passengers’ abundant spatiotemporal information is not fully utilized. As a result, no insights about the latent nature of origins and destinations can be obtained.

To capture all dimensions of the spatiotemporal information for individual passengers, researchers have adopted topic models for individual traffic data (Cheng et al. 2020; Zhao et al. 2020), where an individual passenger’s travel history is regarded as a document and each trip is recorded as a word. Specifically, Cheng et al. (2020) and Zhao et al. (2020) proposed high-dimensional LDA models with a generative process on each dimension, such as location, day, hour, and trip duration. However, the existing methods ignore the underlying spatial correlations in the passengers’ travel data, which may lead to a clustering model that does not reflect reality. Compared with them, our most significant advantages are twofold. First, we incorporate semantic graphs into the LDA generative process, inspired by state-of-the-art topic models (Yao et al. 2017; Li et al. 2019b) that incorporate knowledge graphs to improve the interpretability of the model output (reviewed in detail in the following section); such coupling of the “pure-trip” data with external contexts significantly improves the topics’ interpretability. Second, we propose an efficient online stochastic learning algorithm based on a variational EM algorithm.

2.3 Graph-based topic models

Incorporating knowledge graphs into topic models can enhance the interpretability of the learned topics (Yao et al. 2017; Li et al. 2019a; Mei et al. 2008; Chen et al. 2016; Li et al. 2019b). In particular, two categories of incorporation are considered in state-of-the-art models. The first category embeds words into a continuous space with word relations defined by an external knowledge graph such as DBpedia or WordNet (Yao et al. 2017; Li et al. 2019a). However, such knowledge graphs only apply to one-dimensional word representations and cannot be used for the high-dimensional words in traffic data. The second category introduces graph-based regularization (Mei et al. 2008; Chen et al. 2016; Li et al. 2019b), such as a graph harmonic function, to encourage entities that are close on the graph to be more likely to share the same topic. These regularization-based techniques are compatible with our generative model. However, the existing graph-based topic models are formulated only for one-dimensional words, not for high-dimensional data like our passenger travel data. Moreover, the challenge lies in the parameter learning for the corresponding tensor topic model. To this end, we rigorously develop online stochastic learning based on a tensorized variational EM algorithm to estimate parameters with higher efficiency and scalability.

2.4 Multi-view subspace clustering

Last but not least, subspace clustering is also a popular method for high-dimensional clustering (Parsons et al. 2004), which learns data representations in certain low-dimensional subspaces and then clusters the data points. Multi-view subspace clustering (Gao et al. 2015; Zhang et al. 2017, 2018) specifically deals with data represented by multiple distinct feature sets.

We would like to emphasize the difference between our model and subspace clustering from the perspectives of data and model: (1) Typical multi-view data are formulated as \({\mathbf{X}}_v \in {\mathbb {R}}^{d_v \times n}\), where \(d_v\) and n are the numbers of features and samples in the v-th view. Our data instead present a hierarchical structure: a passenger has a sequence of a few trip records, and each trip is an instance in a three-dimensional ODT space. Moreover, our data suffer from high sparsity, which also prevents subspace clustering methods, e.g., factorization-based methods, from functioning normally. (2) A typical formulation of multi-view subspace clustering is based on the data’s self-expression property (Gao et al. 2015), which uses the data set to represent itself: \({\mathbf {X}}_v = {\mathbf {X}}_v {\mathbf {Z}}_v + {\mathbf {E}}_v\), where \({\mathbf {Z}}_v\) is the subspace representation matrix of the v-th view, and the nonzero elements in \({\mathbf {Z}}_v\) correspond to the data points from the same subspace. Various methods add different regularizations on \({\mathbf {Z}}_{v}\), such as sparsity (Elhamifar and Vidal 2013), low rank (Liu et al. 2012), and smoothness (Hu et al. 2014). However, the self-expression property cannot be applied to our model since our data are already high-dimensional and extremely sparse, which may lead to an even higher-dimensional and sparser \({\mathbf {Z}}\).

3 Proposed methodology

We now introduce the proposed methodology. Section 3.1 gives the data formulation and the notations; Section 3.2 introduces the tensor topic and Tucker decomposition; Section 3.3 formulates the generative process along each dimension; Section 3.4 formulates the graph structure on the O and D dimensions; Section 3.5 gives the final loss function. The concepts of “passenger” and “document”, “trip” and “word”, and “topic” and “cluster” are used interchangeably here.

Table 1 Notation

3.1 Data representation and notation

Firstly, the notations throughout this paper are as follows: We denote scalars in italics, e.g., n, vectors by lowercase letters in boldface, e.g., \(\varvec{\beta }\), matrices by uppercase boldface letters, e.g., \({\mathbf {B}}\), and tensors by boldface script capital \( \varvec{{\mathcal {W}}}\).

Then, we would like to define the data representation. A trip is defined as a three-dimensional tuple (i.e., word) \({\varvec{w}} = (w^O, w^D, w^T)\), indicating a trip that starts from origin \(w^O\) at time \(w^T\) and heads to destination \(w^D\). \(V^O, V^D, V^T\) are the vocabulary sizes for O, D, and T, respectively. A passenger who has traveled several trips is regarded as “a bag of words” (i.e., a document): \( {\mathbf {d}}_u = \{{\varvec{w}}_1, \dots , {\varvec{w}}_{\underline{i}}, \dots , {\varvec{w}}_{N_u}\}\), where \(\underline{i}\) indexes the i-th trip of passenger u, \(N_u\) is the number of trips of passenger u (\(u = 1, \dots , M\)), and M is the total number of passengers. All the notations in the paper are summarized in Table 1.
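For concreteness, a minimal sketch of this data representation is given below; the record format and identifiers are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Hypothetical smart-card records: (passenger_id, origin_station, destination_station, entry_hour).
records = [
    ("u1", 12, 47, 8), ("u1", 47, 12, 18), ("u1", 12, 47, 8),
    ("u2", 3, 90, 9), ("u2", 90, 3, 20),
]

# Each passenger u becomes a "document" d_u: a bag of three-dimensional words w = (w^O, w^D, w^T).
documents = defaultdict(list)
for passenger, o, d, t in records:
    documents[passenger].append((o, d, t))

N_u = {u: len(trips) for u, trips in documents.items()}  # number of trips per passenger
print(documents["u1"], N_u["u1"])  # [(12, 47, 8), (47, 12, 18), (12, 47, 8)] 3
```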

3.2 Tensor topic definition and decomposition

The topic for the i-th word \({\varvec{w}}_{\underline{i}}\) of passenger u is also formulated as a three-element tuple \({\mathbf {z}}_{j,k,l} = (z^O_{\underline{i}j},z^D_{\underline{i}k},z^T_{\underline{i}l})\) (\(\underline{i} = 1, \dots , N_u\), \(j = 1, \dots , J\), \(k = 1, \dots , K\), \(l = 1, \dots , L\)), where J, K, L are the numbers of latent topics for O, D, and T, respectively, and \((z^O_{\underline{i}j},z^D_{\underline{i}k},z^T_{\underline{i}l})\) indicates that the i-th word belongs to the j-th ‘O’ topic, the k-th ‘D’ topic, and the l-th ‘T’ topic, respectively (Cheng et al. 2020).

By the law of total probability, the probability of the i-th trip \({\varvec{w}}_{\underline{i}} = (w^O_{\underline{i}}, w^D_{\underline{i}}, w^T_{\underline{i}})\) from passenger \({\mathbf {d}}_u\) can be written as:

$$\begin{aligned} \begin{aligned} P( {\varvec{w}}_{\underline{i}}&= (o,d,t) \mid {\mathbf {d}}_u) = \sum ^J_{j=1}\sum ^K_{k=1}\sum ^L_{l=1}P(z^O_{\underline{i}j},z^D_{\underline{i}k},z^T_{\underline{i}l}\mid {\mathbf {d}}_u) \times \\ P(w^O_{\underline{i}}&=o\mid z^O_{\underline{i}j}) P(w^D_{\underline{i}}=d\mid z^D_{\underline{i}k}) P(w^T_{\underline{i}}=t\mid z^T_{\underline{i}l}). \end{aligned} \end{aligned}$$
(1)

We denote \(P(z^O_{\underline{i}j},z^D_{\underline{i}k},z^T_{\underline{i}l} \mid {\mathbf {d}}_u) = c_{u,j,k,l}\) as the probability that the topic \({\mathbf {z}}\) of the i-th trip of passenger \({\mathbf {d}}_u\) is \((j, k, l)\); we further denote \(P(w^O_{\underline{i}}=o\mid z^O_{\underline{i}j}) = \beta ^O_{jo}\) as the probability that \(w^O\) of the i-th trip is o given the j-th origin topic; similarly, for dimensions D and T we have \(P(w^D_{\underline{i}}=d\mid z^D_{\underline{i}k}) = \beta ^D_{kd} , P(w^T_{\underline{i}}=t\mid z^T_{\underline{i}l}) = \beta ^T_{lt}\).

Fig. 2: Tensor topic and Tucker decomposition

Tucker Decomposition: Eq. (1) can be written as a Tucker decomposition as follows:

$$\begin{aligned} \varvec{{\mathcal {W}}}_u = \varvec{{\mathcal {C}}}_u \times _{1} {{\mathbf {B}}}^O \times _{2} {{\mathbf {B}}}^D \times _{3} {{\mathbf {B}}}^T. \end{aligned}$$
(2)

The essence of the model is thus revealed as a probabilistic Tucker decomposition (Kolda and Bader 2009): the ODT tensor data of each passenger u, \(\varvec{{\mathcal {W}}}_u \in {\mathbb {R}}^{V^O \times V^D \times V^T}\), are decomposed into a core tensor \(\varvec{{\mathcal {C}}}_u \in {\mathbb {R}}^{J \times K \times L}\) and a mode matrix along each dimension, \({{\mathbf{B}}}^O \in {\mathbb {R}}^{J \times V^O}, {{\mathbf {B}}}^D \in {\mathbb {R}}^{K \times V^D}, {{\mathbf{B}}}^T \in {\mathbb {R}}^{L \times V^T}\), as shown in Fig. 2.

It is worth mentioning that although the essence of the model is a decomposition, \(\varvec{{\mathcal {W}}}_u\) is intractable to observe directly, so \(\varvec{{\mathcal {C}}}_u, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T\) cannot be learned by decomposing \(\varvec{{\mathcal {W}}}_u\). Instead, we learn the latent parameters first, and then \(\varvec{{\mathcal {W}}}_u\) can be calculated.
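As a sanity check on Eqs. (1) and (2), the sketch below reconstructs the probability tensor \(\varvec{{\mathcal {W}}}_u\) from a randomly drawn core tensor and mode matrices; all sizes are illustrative and the parameters are placeholders rather than learned values.

```python
import numpy as np

# Illustrative sizes (not the paper's): J, K, L latent topics; V_O, V_D, V_T vocabulary sizes.
J, K, L, V_O, V_D, V_T = 3, 3, 2, 10, 10, 24
rng = np.random.default_rng(0)

# Row-stochastic topic-word matrices B^O (J x V_O), B^D (K x V_D), B^T (L x V_T).
B_O = rng.dirichlet(np.ones(V_O), size=J)
B_D = rng.dirichlet(np.ones(V_D), size=K)
B_T = rng.dirichlet(np.ones(V_T), size=L)

# Core tensor C_u: passenger u's distribution over (j, k, l) topic triples.
C_u = rng.dirichlet(np.ones(J * K * L)).reshape(J, K, L)

# Eq. (2): W_u = C_u x_1 B^O x_2 B^D x_3 B^T, i.e., P(w = (o, d, t) | d_u) of Eq. (1).
W_u = np.einsum("jkl,jo,kd,lt->odt", C_u, B_O, B_D, B_T)
print(W_u.shape, W_u.sum())  # (10, 10, 24), sums to 1.0: a valid probability tensor
```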

3.3 Generative process

The generative process for a trip \({\varvec{w}}\) will be defined along ODT.

Prior: The Dirichlet distribution is the conjugate prior of the multinomial distribution (Zhou 2018). The tensor topic distribution \(\varvec{{\mathcal {C}}}_u \in {\mathbb {R}}^{J \times K \times L}\) for the u-th passenger is generated from the 3D-Dirichlet distribution with parameter \(\varvec{{\mathcal {A}}}\in {\mathbb {R}}^{J \times K \times L}\), and each element \(c_{u,j,k,l} (\sum _j \sum _k \sum _l c_{u,j,k,l} = 1)\) defines the probability for the passenger to have trips from topic \({\mathbf {z}}_{j,k,l}\):

$$\begin{aligned} \varvec{{\mathcal {C}}}_u \sim \text {Dir}_{J \times K \times L}(\varvec{{\mathcal {A}}}). \end{aligned}$$
(3)

Passenger to Tensor Topic: The topic for the i-th trip of the u-th passenger is drawn from the multinomial distribution:

$$\begin{aligned} {\mathbf {z}}_{\underline{i}} \sim \text {Multi}( \varvec{{\mathcal {C}}}_u), \quad \text {i.e.,} \quad P({\mathbf {z}}_{\underline{i}} = (j,k,l)) = c_{u,j,k,l}. \end{aligned}$$
(4)

Topic to Trip: We define \({{\mathbf{B}}}^O \in {\mathbb {R}}^{J \times V^O}\) as the topic-trip matrix, in which the element \(\beta _{jo}^O\) is the multinomial probability that \(w^O=o\) is drawn from the j-th origin topic, i.e., \(P(w^O_{\underline{i}}=o\mid z^O_{\underline{i}j})\). \({{\mathbf {B}}}^D\) and \({{\mathbf {B}}}^T\) are defined in the same way. Therefore, the word \((w^O, w^D, w^T)\) is drawn from the multinomial distributions with parameters \(\varvec{\beta }^O_{j}, \varvec{\beta }^D_{k}\), and \(\varvec{\beta }^T_{l}\), respectively:

$$\begin{aligned} \begin{aligned} w^O \sim \text {Multi} (\varvec{\beta }^O_{j}), w^D \sim \text {Multi} (\varvec{\beta }^D_{k}), w^T \sim \text {Multi} (\varvec{\beta }^T_{l}). \end{aligned} \end{aligned}$$
(5)
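The generative process of Eqs. (3)–(5) can be simulated directly, as in the minimal sketch below; the sizes are illustrative, and the symmetric prior tensor \(\varvec{{\mathcal {A}}}\) is an assumption made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
J, K, L, V_O, V_D, V_T = 3, 3, 2, 10, 10, 24   # illustrative sizes
A = np.full((J, K, L), 0.1)                    # Dirichlet prior tensor (assumed symmetric)

# Topic-word matrices B^O, B^D, B^T: each row is a multinomial over the vocabulary.
B_O = rng.dirichlet(np.ones(V_O), size=J)
B_D = rng.dirichlet(np.ones(V_D), size=K)
B_T = rng.dirichlet(np.ones(V_T), size=L)

def sample_passenger(n_trips):
    # Eq. (3): draw the passenger's tensor topic distribution C_u ~ Dir(A).
    C_u = rng.dirichlet(A.ravel()).reshape(J, K, L)
    trips = []
    for _ in range(n_trips):
        # Eq. (4): draw a topic triple (j, k, l) ~ Multi(C_u).
        j, k, l = np.unravel_index(rng.choice(J * K * L, p=C_u.ravel()), (J, K, L))
        # Eq. (5): draw the word (w^O, w^D, w^T) from the corresponding topic rows.
        o = rng.choice(V_O, p=B_O[j])
        d = rng.choice(V_D, p=B_D[k])
        t = rng.choice(V_T, p=B_T[l])
        trips.append((o, d, t))
    return trips

print(sample_passenger(5))  # five synthetic trips for one passenger
```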

3.4 Graph structure on origin and destination

If two stations are geographically close to each other or located in similar functional areas, intuitively these two stations are more likely to be in the same topic. This external information is formulated as graphs and then introduced into the model as a Laplacian regularization.

Precisely, two graphs are defined to capture the inter-relationships of different stations: (1) the geographical graph \({\mathbf {G}}_{net}\) describes how geographically close two stations are on the network; (2) the functional similarity graph \({\mathbf {G}}_{poi}\) quantifies how similar the functions of two stations’ locations are. Both graphs act on the O and D dimensions; their detailed definitions are given in Section 5.2.

The graph regularization term is defined as follows. Taking one graph on the origin dimension as an example,

$$\begin{aligned} R({\mathbf {G}}^{O}) = \frac{1}{2} \sum _{o_1, o_2 \in {\mathbf {G}}^{O}} \kappa (o_1, o_2) \sum _{j=1}^{J}(\beta ^{O}_{j o_{1}} - \beta ^{O}_{j o_{2}})^2 = \frac{1}{2} \sum _{j=1}^{J} (\varvec{\beta }^{O}_j)^T {\mathbf {L}} \varvec{\beta }^{O}_j, \end{aligned}$$
(6)

where \({\mathbf{G}}^O \in {\mathbb {R}}^{V^O \times V^O}\) is the graph on origin stations, \(\kappa (o_1, o_2) = \{{\mathbf {G}}^O\}_{o_1, o_2}\) defines the weight between entity \(o_1\) and \(o_2\), and \({\mathbf {L}}\) is the Laplacian matrix for graph \({\mathbf {G}}^O\) (Li et al. 2020; Wang et al. 2015; Yu et al. 2019). The intuition is that two stations that are closer on the graph will be more likely to have the same topic (Mei et al. 2008).

With the corresponding Laplacian matrices as \({\mathbf {L}}_{net}\) and \({\mathbf {L}}_{poi}\), the final graph Laplacian penalty on OD could be formulated as:

$$\begin{aligned} \begin{aligned} R({\mathbf {G}}^{O, D})&= \frac{1}{2} \sum _j (\varvec{\beta }^{O}_j)^T (\mu {\mathbf {L}}_{net} + (1-\mu ) {\mathbf {L}}_{poi}) \varvec{\beta }^{O}_j \\&\quad + \frac{1}{2} \sum _k (\varvec{\beta }^{D}_k)^T (\nu {\mathbf {L}}_{net} + (1-\nu ) {\mathbf {L}}_{poi}) \varvec{\beta }^{D}_k. \end{aligned} \end{aligned}$$
(7)

where the tuning parameters \(\mu \) and \(\nu \) adjust the relative effects of the two graphs on the O and D dimensions, respectively. The whole generative process is shown in Fig. 3.
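The penalty of Eqs. (6)–(7) can be evaluated as in the minimal sketch below; it assumes unnormalized graph Laplacians, and the toy graphs and topic matrices are placeholders.

```python
import numpy as np

def laplacian(G):
    """Unnormalized graph Laplacian L = D - G for a symmetric weight matrix G."""
    return np.diag(G.sum(axis=1)) - G

def graph_penalty(B_O, B_D, G_net, G_poi, mu, nu):
    """Eq. (7): Laplacian penalty on the origin/destination topic-word matrices.
    B_O is J x V^O and B_D is K x V^D; their rows are beta_j^O and beta_k^D."""
    L_O = mu * laplacian(G_net) + (1.0 - mu) * laplacian(G_poi)
    L_D = nu * laplacian(G_net) + (1.0 - nu) * laplacian(G_poi)
    r_O = 0.5 * sum(b @ L_O @ b for b in B_O)   # (1/2) sum_j beta_j^T L beta_j
    r_D = 0.5 * sum(b @ L_D @ b for b in B_D)
    return r_O + r_D

# Toy check: 4 stations, 2 origin topics, 2 destination topics.
rng = np.random.default_rng(0)
G_net = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
G_poi = rng.random((4, 4)); G_poi = (G_poi + G_poi.T) / 2; np.fill_diagonal(G_poi, 0)
B_O = rng.dirichlet(np.ones(4), size=2)
B_D = rng.dirichlet(np.ones(4), size=2)
print(graph_penalty(B_O, B_D, G_net, G_poi, mu=0.4, nu=0.2))
```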

Fig. 3: Generative process for trips from each passenger via latent topics \({\mathbf {z}}\)

3.5 Loss function

In order to learn the model parameters \(\varvec{{\mathcal {A}}}, {\mathbf {B}}^O, {\mathbf {B}}^D\) and \({\mathbf {B}}^T\) for the GR-TensorLDA model, the log-likelihood function could be formulated as follows:

$$\begin{aligned} L(\varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T) = \sum _{u=1}^{M} \text {log} P({\mathbf {d}}_u \mid \varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T) \end{aligned}$$
(8)

where \(P({\mathbf {d}}_u \mid \varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T)\) is the marginal distribution of a passenger, defined as \(P({\mathbf {d}}_u \mid \varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T)= \int P(\varvec{{\mathcal {C}}}_u \mid \varvec{{\mathcal {A}}}) \left( \prod _{\underline{i}=1}^{N_u} \sum _{{\mathbf {z}}_{\underline{i}}} P({\mathbf {z}}_{\underline{i}} \mid \varvec{{\mathcal {C}}}_u) P(w_{\underline{i}}^O \mid z_{\underline{i}}^O, {{\mathbf {B}}}^O) P(w_{\underline{i}}^D \mid z_{\underline{i}}^D, {{\mathbf {B}}}^D) P(w_{\underline{i}}^T \mid z_{\underline{i}}^T, {{\mathbf {B}}}^T) \right) d \varvec{{\mathcal {C}}}_u\). However, the quantity \(P({\mathbf {d}}_u \mid \varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T)\) cannot be computed tractably due to the coupling between \(\varvec{{\mathcal {A}}}\) and \({{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T\) in the summation over latent topics (Blei et al. 2003). Fortunately, variational inference provides a tractable lower bound on the log-likelihood, which is elaborated in detail in Section 4.

After combining the external knowledge, the model parameters are learned by maximizing the regularized likelihood function, with \(\lambda \) tuning the penalty strength:

$$\begin{aligned} \max _{\varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T} \lambda L(\varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T) - (1-\lambda ) R({\mathbf {G}}^{O, D}). \end{aligned}$$
(9)

4 Parameter estimation

In this section, we describe a tensorized variational EM algorithm to efficiently optimize the model parameters \(\varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T\) in Eq. (9). Relative to the existing variational EM algorithm (Blei et al. 2003), we emphasize the following contributions: (1) for the E-step in Section 4.1, the variational updates are extended from one-dimensional words to high-dimensional words with Tucker decomposition; (2) for the M-step in Section 4.2, gradient ascent is adopted to address the graph regularizations; (3) most importantly, in Section 4.3, an online learning algorithm is proposed to handle the big-data problem in smart transportation systems. The algorithm is summarized as follows:

  • Tensorized variational E-step: to approximate the posterior, four variational distributions \(q(\cdot )\) are introduced for \(\varvec{{\mathcal {C}}}, z^O, z^D, z^T\), with free variational parameters \(\varvec{{\mathcal {E}}}, {\varvec{\Phi }}^O, {\varvec{\Phi }}^D, {\varvec{\Phi }}^T\), respectively, as shown in Fig. 4; the lower bound (LB) on the original log-likelihood is then obtained by Jensen’s inequality, and the optimal variational parameters are learned to maximize the LB;

  • M-step: \(\varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T\) are estimated to maximize the tightest LB learned from E-step.

Fig. 4: Variational distribution to approximate the posterior

4.1 Tensorized variational E-Step

The variational distribution is formulated as Eq. (10) to approximate the posterior distribution of each passenger:

$$\begin{aligned} \begin{aligned} Q&= q \Bigl (\varvec{{\mathcal {C}}}, (z^O, z^D, z^T) \mid \varvec{{\mathcal {E}}}, (\varvec{\Phi }^O, \varvec{\Phi }^D, \varvec{\Phi }^T) \Bigr ) \\&= q(\varvec{{\mathcal {C}}} \mid \varvec{{\mathcal {E}}}) \prod _{i=1}^{N_u} q(z^O_{\underline{i}}\mid \varvec{\phi }^O_{\underline{i}}) q(z^D_{\underline{i}}\mid \varvec{\phi }^D_{\underline{i}}) q(z^T_{\underline{i}}\mid \varvec{\phi }^T_{\underline{i}}), \end{aligned} \end{aligned}$$
(10)

where \(\varvec{\phi }^O_{\underline{i}} \in {\mathbb {R}}^{J}\), with \(\phi ^O_{\underline{i} j}\) interpreted as the probability that the word at the i-th position in the current document is generated from origin topic j.

A tight lower bound is found by minimizing the Kullback-Leibler (KL) divergence between the variational distribution Q and the posterior P:

$$\begin{aligned} \begin{aligned} \min _{\varvec{{\mathcal {E}}}, \varvec{\Phi }^O, \varvec{\Phi }^D, \varvec{\Phi }^T} KL \Bigl [ Q \parallel P\Bigl (\varvec{{\mathcal {C}}}, z^O, z^D, z^T\mid {\mathbf {d}}_u, \varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T \Bigr ) \Bigr ]. \end{aligned} \end{aligned}$$
(11)

As shown in Appendix A.2, the optimal variational parameters \(\varvec{{\mathcal {E}}}^*, (\varvec{\Phi }^{O*}, \varvec{\Phi }^{D*}, \varvec{\Phi }^{T*})\) are learned by computing the derivatives of the KL divergence and setting them to zero, with results shown in Eqs. (12) and (13).

To estimate \(\phi _{\underline{i}j}^O\) for the u-th passenger by Appendix A.2.1:

$$\begin{aligned} \begin{aligned}&\phi _{u,\underline{i}j}^O \propto \beta _{jo}^O \exp \Bigl [\sum _{k=1}^{K} \sum _{l=1}^{L} \phi _{u,\underline{i}k}^D \phi _{u,\underline{i}l}^T \Bigl ( \Psi (\epsilon _{u,j,k,l})-\Psi (\sum _{jkl}\epsilon _{u,j,k,l}) \Bigr ) \Bigr ]. \end{aligned} \end{aligned}$$
(12)

The parameter of one dimension, for example \(\phi _{\underline{i}j}^O\), is related not only to its own dimension but also to the other dimensions through \(\phi _{\underline{i}k}^D\) and \(\phi _{\underline{i}l}^T\).

Therefore, we perform a block coordinate descent algorithm, which iteratively updates the parameters of the ODT dimensions until convergence.

To estimate \(\epsilon _{u,j,k,l}\) for the u-th passenger via Appendix A.2.2:

$$\begin{aligned} \epsilon _{u,j,k,l} = \alpha _{j,k,l} + \sum _{i = 1}^{N_u}\phi ^O_{u,\underline{i} j} \phi ^D_{u,\underline{i} k} \phi ^T_{u,\underline{i} l}. \end{aligned}$$
(13)
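For illustration, a minimal sketch of the resulting E-step for a single passenger is given below; it assumes the topic-word matrices are fixed, runs a fixed number of coordinate-update sweeps instead of checking convergence, and is written for clarity rather than speed.

```python
import numpy as np
from scipy.special import digamma

def e_step(doc, A, B_O, B_D, B_T, n_iter=50):
    """Variational E-step for one passenger (Eqs. 12-13).
    doc: list of trips (o, d, t); A: Dirichlet prior tensor of shape (J, K, L)."""
    J, K, L = A.shape
    N = len(doc)
    phi_O = np.full((N, J), 1.0 / J)   # q(z^O) per trip
    phi_D = np.full((N, K), 1.0 / K)   # q(z^D) per trip
    phi_T = np.full((N, L), 1.0 / L)   # q(z^T) per trip
    eps = A + N / (J * K * L)          # Dirichlet parameter of q(C_u), uniform start

    for _ in range(n_iter):
        # E[log c_{u,jkl}] under q(C_u); shape (J, K, L).
        elog = digamma(eps) - digamma(eps.sum())
        for i, (o, d, t) in enumerate(doc):
            # Eq. (12) and its D/T analogues: update one dimension while holding the others fixed.
            phi_O[i] = B_O[:, o] * np.exp(np.einsum("k,l,jkl->j", phi_D[i], phi_T[i], elog))
            phi_O[i] /= phi_O[i].sum()
            phi_D[i] = B_D[:, d] * np.exp(np.einsum("j,l,jkl->k", phi_O[i], phi_T[i], elog))
            phi_D[i] /= phi_D[i].sum()
            phi_T[i] = B_T[:, t] * np.exp(np.einsum("j,k,jkl->l", phi_O[i], phi_D[i], elog))
            phi_T[i] /= phi_T[i].sum()
        # Eq. (13): update the Dirichlet parameter of q(C_u).
        eps = A + np.einsum("ij,ik,il->jkl", phi_O, phi_D, phi_T)
    return eps, phi_O, phi_D, phi_T
```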
Algorithm 1 (batch variational EM; pseudo-code figure)

4.2 M-Step

In the M-step, we aim to maximize the lower bound learned from E-step with respect to \({{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T\) and \(\varvec{{\mathcal {A}}}\).

As shown in Appendix A.3.1, \({{\mathbf {B}}}^O\) and \({{\mathbf {B}}}^D\) do not admit closed-form solutions due to the graph regularization. Therefore, we use gradient ascent to update \({{\mathbf {B}}}^O\) and \({{\mathbf {B}}}^D\):

$$\begin{aligned} \varvec{\beta }_j^{O, \tau +1} = \varvec{\beta }_j^{O, \tau } + r \nabla L(\varvec{\beta }_j^{O}). \end{aligned}$$
(14)

The gradient with respect to \(\beta _{j o_1}^O\) is \(\nabla L(\varvec{\beta }_j^{O}) = \lambda \frac{1}{\beta ^O_{j o_1}} \sum _{u=1}^{M} \sum _{\underline{i}=1}^{N_u} \phi ^O_{u, \underline{i} j} {\mathbf {1}}(w_{u \underline{i}}^O = o_1) - (1-\lambda ) \sum _{o_2}(\mu \kappa ^{G_{net}}_{o_1 o_2} + (1-\mu ) \kappa ^{G_{poi}}_{o_1 o_2})(\beta ^O_{j o_1}-\beta ^O_{j o_2}) + a^O_j\).

\({{\mathbf {B}}}^T\) has a closed-form solution, as shown in Appendix A.3.2:

$$\begin{aligned} \beta ^{T}_{l,t} = \sum _{u=1}^{M} \sum _{i=1}^{N_u} \phi ^T_{u,\underline{i},l} {\mathbf {1}}( w^T_{u \underline{i}}=t ). \end{aligned}$$
(15)
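A minimal sketch of this closed-form update is shown below; the final row normalization is an added assumption so that each \(\varvec{\beta }^T_l\) remains a probability distribution over time words.

```python
import numpy as np

def m_step_BT(docs, phis_T, L_topics, V_T):
    """Closed-form M-step for B^T (Eq. 15).
    docs[u]: list of trips (o, d, t); phis_T[u]: (N_u x L) matrix of phi^T from the E-step."""
    B_T = np.zeros((L_topics, V_T))
    for doc, phi_T in zip(docs, phis_T):
        for i, (_, _, t) in enumerate(doc):
            B_T[:, t] += phi_T[i]      # sum_u sum_i phi^T_{u,i,l} * 1(w^T_{u,i} = t)
    return B_T / B_T.sum(axis=1, keepdims=True)   # normalize each row beta_l^T (assumption)
```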

Finally, similar to the original LDA model (Blei et al. 2003), \(\varvec{{\mathcal {A}}}\) can be estimated using the Newton-Raphson method:

$$\begin{aligned} \alpha _{j,k,l}^{s+1} = \alpha _{j,k,l}^{s} -\Bigl [ H^{-1}(\varvec{{\mathcal {A}}})g(\varvec{{\mathcal {A}}}) \Bigr ]_{j,k,l}. \end{aligned}$$
(16)

Its detailed derivation is given in Appendix A.3.3.

It is worth mentioning that Eqs. (14), (15), and (16) do not show a direct relation between J, K, L and the model parameters: J, K, L affect the variational variables, and the variational variables in turn affect the model parameters. Besides, since we do not have prior knowledge about the distributions, the parameters are initialized to be equal and uniform.

The whole procedure is summarized in Algorithm 1; it operates in a batch learning manner, which needs to read through the whole document set in each iteration.

4.3 Online learning algorithm

In practice, the proposed Algorithm 1 is computationally intensive since it updates parameters in a batch learning manner, iterating between analyzing each observation and updating dataset-wide variational parameters. As shown in Line 5 of Algorithm 1, the E-step needs a full pass through the entire corpus in each iteration, which is impractical when dealing with a large dataset containing tens of thousands of passengers. For example, Hong Kong is an international transport hub whose monthly visitor arrivals reached 6 million in December 2018, and the batch learning algorithm is not suitable when new visitors are continually arriving (Hoffman et al. 2010).

To this end, to efficiently implement the proposed model in real traffic systems, we further develop an online stochastic algorithm, which produces good estimates substantially faster than the batch algorithm. To avoid repetition, only the critical differences of the algorithm are explained.

Algorithm 2 (online variational EM; pseudo-code figure)

In the E-step, the updating equations (12)–(13) remain the same, except that the variational variables are updated each time a passenger s is read, as shown in Lines 6–7 of Algorithm 2.

In the M-step, to update the model parameters stochastically, once the current passenger s is observed, we first assume the optimal model parameters would be learned if the entire corpus contained this passenger s repeated M times: under this setting (denoted with the accent symbol \({\tilde{\beta }}\)), \(\tilde{{\mathbf {B}}}^O\) and \(\tilde{{\mathbf {B}}}^D\) are updated by stochastic gradient ascent, with the gradient calculated based on the single observation \({\mathbf {d}}_s\) repeated M times:

$$\begin{aligned} \tilde{\varvec{\beta }}_{j}^{O} = \tilde{\varvec{\beta }}_{j}^{O} + r \nabla L_s (\tilde{\varvec{\beta }}_{j}^{O}). \end{aligned}$$
(17)

where gradient \(\nabla L_s (\tilde{\varvec{\beta }}_j^{O})\) with respect to \({\tilde{\beta }}_{jo}^O\) is \(\nabla L_s (\tilde{\varvec{\beta }}_j^{O}) = \lambda \frac{M}{{\tilde{\beta }}^O_{j o_1}} \sum _{\underline{i}=1}^{N_s} \phi ^O_{s, \underline{i} j} {\mathbf {1}}(w_{s \underline{i}}^O = o_1) - (1-\lambda ) \sum _{o_2}(\mu \kappa ^{G_{net}}_{o_1 o_2} + (1-\mu ) \kappa ^{G_{poi}}_{o_1 o_2})({\tilde{\beta }}^O_{j o_1}-{\tilde{\beta }}^O_{j o_2}) + a^O_j\), with details in Appendix A.3.1.

Similar to Eq. (15), \(\tilde{{\mathbf {B}}}^T\) has a closed-form solution with passenger s repeated M times:

$$\begin{aligned} {\tilde{\beta }}^{T}_{l,t} = M \sum _{i=1}^{N_s} \phi ^T_{s,\underline{i}l} {\mathbf {1}}( w^T_{s \underline{i}}=t ). \end{aligned}$$
(18)

Then the final model parameters \({\mathbf {B}}^{O, D, T}\) are updated by using a weighted average of its previous value and \(\tilde{{\mathbf {B}}}^{O, D, T}\):

$$\begin{aligned} {\mathbf {B}}^{O,D,T}=(1-\rho _s){\mathbf {B}}^{O,D,T} + \rho _s\tilde{{\mathbf {B}}}^{O,D,T}. \end{aligned}$$
(19)

where \(\rho _s =(\tau _0 + s)^{-\kappa }\). Here \(\kappa \in (0.5,1]\) controls the rate at which old values of \(\tilde{{\mathbf {B}}}\) are forgotten and guarantees convergence, and \(\tau _0 \ge 0\) slows down the early iterations.

The proposed online learning algorithm of the GR-TensorLDA method is summarized in Algorithm 2.

The computational complexity of each iteration in the batch learning algorithm is \({\mathcal {O}} \left( MN (J+K+L) \right) \) (\(M \ggg N, J, K, L\)), whereas the complexity of each iteration in the online learning algorithm reduces to \( {\mathcal {O}}(N(J+K+L))\) (Teh et al. 2007).

Mini-Batches: To reduce noise, the parameters can be updated with a mini-batch containing multiple observations, with mini-batch size S. \(\tilde{{\mathbf {B}}}^T\) is updated as \({\tilde{\beta }}^{T}_{l,t} = \frac{M}{S} \sum _{s=1}^{S} \sum _{i=1}^{N_s} \phi ^T_{s,\underline{i}l} {\mathbf {1}}( w^T_{s \underline{i}}=t )\), and \(\tilde{{\mathbf {B}}}^{O, D}\) are updated by stochastic gradient ascent with mini-batches.
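The learning-rate schedule and the blending step of Eq. (19) are summarized in the sketch below; the mini-batch loop and the helper forming \(\tilde{{\mathbf {B}}}^T\) are hypothetical placeholders shown only to indicate where the update would sit.

```python
import numpy as np

def learning_rate(s, tau0=256, kappa=0.5):
    """Step size rho_s = (tau0 + s)^(-kappa) used in Eq. (19)."""
    return (tau0 + s) ** (-kappa)

def online_blend(B, B_tilde, s, tau0=256, kappa=0.5):
    """Eq. (19): weighted average of the previous estimate and the mini-batch estimate."""
    rho = learning_rate(s, tau0, kappa)
    return (1.0 - rho) * B + rho * B_tilde

# Skeleton of the online loop (the per-batch E-step and sufficient statistics are omitted):
# for s, batch in enumerate(minibatches, start=1):
#     B_tilde_T = (M / len(batch)) * sufficient_stats_T(batch)   # hypothetical helper
#     B_T = online_blend(B_T, B_tilde_T, s)
```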

5 Experiments

5.1 Dataset

The individual travel data from 1 January 2017 to 31 March 2017 are chosen for analysis. Each trip records the anonymized passenger ID, entry station, exit station, entry time, and exit time. In this implementation, the entry station, exit station, and hour stamp of the entry time are collected for each trip and aggregated over the whole three months to ensure each passenger has enough trips for analysis, with an average of around 134 trips per passenger. The Hong Kong MTR system has 98 stations in total, and the entry hour ranges over 24 hours; thus the vocabulary sizes for origin, destination, and time are 98, 98, and 24, respectively. We randomly pick 50,000 passengers as the training set, 1000 passengers as the validation set for tuning-parameter selection, and another 1000 passengers as the test set. The data information is summarized in Table 2.

Table 2 Data information

5.2 Graph definition

As discussed in Section 3.4, the geographical graph and the functional similarity graph affect passengers’ travel patterns. Here we would like to define the two graphs in detail.

Geographical graph: Two spatially close stations are more likely to be in the same topic. The distance from station i to station j in \(\{G_{net}\}_{i, j}\) is usually simplified as an “H-hop” binary graph: if station j can be reached from station i within H hops, the two stations are connected and the edge between them is ‘1’; otherwise, the edge is ‘0’. We set \(H=3\) since a survey indicated that passengers are willing to travel freely if two stations are at most three hops apart (Geng et al. 2019; Li et al. 2021).

$$\begin{aligned} \{{\mathbf {G}}_{net}\}_{i,j}={\left\{ \begin{array}{ll} 1 &{} \text {hop distance}_{i,j} \le H\\ 0 &{} \text {hop distance}_{i,j} > H\\ \end{array}\right. } \end{aligned}$$

Functional similarity graph: Two stations located in highly similar functional areas are also prone to be in the same topic. A functional similarity graph is commonly formulated with points of interest (POIs) (Li et al. 2020; Zhong et al. 2017). We collect each station’s surrounding POIs for the following seven services: hotel, leisure shopping, major building, public facilities, residential, school, and public transport. Each element of the POI vector indicates the number of facilities of the corresponding service near the station. The element \(\{G_{poi}\}_{i, j}\) is defined as the cosine similarity between the POI vectors of station i and station j, and a higher value in \({\mathbf {G}}_{poi}\) means a higher functional similarity between the two stations:

$$\begin{aligned} \{{\mathbf {G}}_{poi}\}_{i,j}= \frac{\mathbf{POI }_{i} \cdot {\mathbf{POI }_{j}}}{\Vert \mathbf{POI }_{i}\Vert \cdot \Vert \mathbf{POI }_{j}\Vert } \end{aligned}$$
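A minimal sketch of how the two graphs could be constructed is given below; the input formats (a station adjacency matrix and a station-by-POI count matrix) are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def build_graphs(adjacency, poi, H=3):
    """Build the geographical graph G_net (H-hop binary) and the functional
    similarity graph G_poi (cosine similarity of POI vectors).
    adjacency: (V x V) 0/1 matrix of directly connected stations;
    poi: (V x 7) matrix of POI counts per station."""
    hops = shortest_path(adjacency, unweighted=True)      # hop distance between stations
    G_net = (hops <= H).astype(float)
    np.fill_diagonal(G_net, 0)

    norms = np.linalg.norm(poi, axis=1, keepdims=True)
    unit = poi / np.clip(norms, 1e-12, None)
    G_poi = unit @ unit.T                                  # pairwise cosine similarity
    np.fill_diagonal(G_poi, 0)
    return G_net, G_poi
```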

5.3 Benchmark methods

A relatively limited amount of research targets ODT-P multi-clustering based on individual passenger travel data. We apply the following benchmark methods to the passenger travel data and compare their results with the proposed model.

Table 3 Comparison of benchmark methods
  • One-dimensional LDA (1d-LDA): It defines the generative process from document to topic and from topic to word in a single dimension (Blei et al. 2003). To apply it, the three-dimensional data \((w^O, w^D, w^T) = (o, d, t)\) are flattened into a one-dimensional \(w = odt\): each distinct combination creates a new word; thus, the total vocabulary size of the new data format expands to \(98 \times 98 \times 24 \sim 10^5\), and the computational complexity of each iteration is \({\mathcal {O}} \left( MNK \right) \).

  • Tucker Decomposition (Tucker): It decomposes an ODT flow tensor into a core tensor and mode matrices along each dimension. Each rank vector from a mode matrix can be regarded as a cluster (Sun and Axhausen 2016). However, this method is not applicable to individual travel data due to the extreme sparsity. Solely to examine its ODT clustering performance, we feed it macro-level passenger flow data with the dimensions of origin, destination, and time (Sun and Axhausen 2016).

  • CP Decomposition with Graphs (CP-G): Similar to Tucker decomposition, its input is passenger flow data, used to check its ODT clustering performance. It decomposes an ODT flow tensor into a weight vector and mode matrices along each dimension, with graphs on OD (Li et al. 2020).

  • Three-dimensional LDA with Gibbs sampling (3d-LDA(Gibbs)): It also defines a generative process for each dimension, however, without any semantic graph structure. Parameters are estimated by Gibbs sampling (Cheng et al. 2020), with the computational complexity of each iteration as \({\mathcal {O}} \left( MN (J+K+L) \right) \) (Porteous et al. 2008; Xiao and Stibor 2010).

All the methods are compared by whether they perform individualized analysis (Indiv. for short), are tensor-based (Tensor) and graph-structured (Graph), and come with an efficient online algorithm (Eff.) and low computational complexity (Complexity), as shown in Table 3. Only our model ticks all the boxes.

5.4 Evaluation metrics

Traditional topic models only measure the quality of the model via perplexity, ignoring how “interpretable” the learned topics are. For example, a topic containing all words related to covid (e.g., omicron, vaccine, quarantine, etc.; these words are all connected in a knowledge graph) is more “interpretable” than a topic containing words from various themes (e.g., covid, solar energy, iPhone; these words are far away from each other in a knowledge graph). Such “interpretability” is usually measured by pointwise mutual information (PMI), known as topic coherence. Besides, given our graph structure on the origin and destination dimensions, we also design a distance-on-graph measure of “interpretability”: a topic whose words are close to each other on a graph is more interpretable than one whose words are not.

Topic Coherence PMI: PMI evaluates how meaningful the learned topics along each dimension are (Yao et al. 2017; Newman et al. 2010). For example, for topic j in the origin-station dimension it is calculated as \(PMI(\varvec{\beta }^O_j) = \sum _{o_1, o_2 \in N^o_j, o_1 \ne o_2} \frac{P(w^O_{o_1}, w^O_{o_2})}{P(w^O_{o_1})P(w^O_{o_2})}\), where \(N^o_j\) is the set of top N words in origin topic j, and we choose the top 10 words. \(P(w^O_{o})\) is the probability that word \(w^O = o\) is observed in a passenger, and \(P(w^O_{o_1}, w^O_{o_2})\) captures the probability that \(w^O = o_1\) and \(w^O = o_2\) co-occur in the same passenger. A higher PMI value means a more coherent topic.

Distance on Graph (\(d_{G}\)): Based on the Laplacian matrix of a graph, \(d_{G}\) measures the distance of the word components from a topic. A smaller value means a more concentrated topic. For example, the distance on graph for origin topic j is defined as \(d_{G_{net}} = (\varvec{\beta }^{O}_j)^T {\mathbf {L}}_{net} \varvec{\beta }^{O}_j, \; d_{G_{poi}} = (\varvec{\beta }^{O}_j)^T {\mathbf {L}}_{poi} \varvec{\beta }^{O}_j\).
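The two interpretability metrics can be computed as in the sketch below; the PMI here uses the usual log-ratio form of pointwise mutual information (Newman et al. 2010), and the passenger-by-word count matrix is an assumed input format.

```python
import numpy as np

def topic_pmi(beta_j, doc_word_counts, top_n=10, eps=1e-12):
    """Topic coherence for one topic: pairwise PMI over its top-N words,
    with (co-)occurrence probabilities estimated over passenger documents."""
    top = np.argsort(beta_j)[::-1][:top_n]
    present = doc_word_counts > 0                       # word observed in a passenger
    p_w = present.mean(axis=0)                          # P(w)
    score = 0.0
    for a in range(top_n):
        for b in range(a + 1, top_n):
            o1, o2 = top[a], top[b]
            p_joint = (present[:, o1] & present[:, o2]).mean()
            score += np.log((p_joint + eps) / (p_w[o1] * p_w[o2] + eps))
    return score

def distance_on_graph(beta_j, G):
    """d_G = beta_j^T L beta_j with the unnormalized Laplacian of G."""
    L = np.diag(G.sum(axis=1)) - G
    return float(beta_j @ L @ beta_j)
```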

Perplexity: Perplexity (Blei et al. 2003) examines the likelihood of the proposed model in the test set. A lower perplexity means a higher likelihood.

5.5 Parameter tuning and station topics

5.5.1 Parameter tuning

The number of topics in each ODT dimension is set as \(J = 10, K = 10, L = 4\) for our dataset. This is chosen by expert knowledge: \(J, K=10\) since the POI vector of each station has seven elements, and \(L=4\) since there are usually at least three time components capturing the morning peak, evening peak, and midday trend. Generally, if there is no prior information about these parameters, J, K, L can be determined by a grid search to minimize perplexity (Blei et al. 2003) or maximize topic coherence (Yao et al. 2017). Theoretically, with larger J, K, L, the dimensions of the model parameters \(\varvec{{\mathcal {A}}}, {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T\) also increase, which means more latent clusters are introduced to describe the data pattern; this naturally increases the model’s likelihood and decreases perplexity. However, too large J, K, L may cause overfitting. The tuning parameters for graph regularization \(\lambda , \mu , \nu \) are searched over the grids \(\lambda , \mu , \nu \in \{0.1, 0.2, \dots , 0.9\}\), and the configuration parameters for the online algorithm over \(\kappa \in \{0.5, 0.6, \dots , 1.0\}\), \(\tau _0 \in \{1, 4, 16, 64, 256, 1024\}\), and \(S \in \{ 1, 4, 16, 64, 256, 1024 \}\). The optimal values are chosen to maximize the likelihood on the validation set, giving \(\lambda = 0.5, \mu =0.4, \nu = 0.2\) and \( S=128, \kappa =0.5, \tau _0 = 256\).

5.5.2 Topic matching

It is worth mentioning that before the comparison, the topics learned from different methods need to be matched. We match two topics from two methods if they have the highest cosine similarity: topic i from method \(m_1\) refers to topic \({\hat{j}}\) from method \(m_2\) when \( {\hat{j}} = \arg \max _{j} \{ \text {cos-similarity}(\varvec{\beta }^{m_1}_{i}, \varvec{\beta }^{m_2}_{j}) \}\).
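A minimal sketch of this matching rule is given below; it performs the greedy per-topic argmax described above rather than a one-to-one assignment.

```python
import numpy as np

def match_topics(B1, B2):
    """Match each topic (row) of B1 to its most similar topic in B2 by cosine similarity.
    B1: (J1 x V) and B2: (J2 x V) topic-word matrices from two methods."""
    U1 = B1 / np.linalg.norm(B1, axis=1, keepdims=True)
    U2 = B2 / np.linalg.norm(B2, axis=1, keepdims=True)
    sim = U1 @ U2.T                    # (J1 x J2) cosine similarities
    return sim.argmax(axis=1)          # matched topic index in B2 for each topic in B1
```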

5.5.3 PMI and distance on graph

The PMI and \(d_{G}\) for the learned origin topics are shown in Table 4, with the best performance highlighted in boldface and the second-best performance underlined.

Table 4 Origin topic

From Table 4, we find that: (1) Tucker and CP decomposition have the worst PMI since they target macro-level traffic analysis and thus ignore individual passenger information; however, the tensor decomposition that considers graphs (i.e., CP-G) still has higher PMI and lower \(d_G\) than the one without graphs. (2) Our proposed method achieves a PMI roughly twice as high as 3d-LDA(Gibbs) for most topics, which means the proposed model can discover more meaningful topics. This is due to the external information introduced as graphs, with more than 50% lower \(d_G\) observed on both graphs.

5.5.4 Perplexity vs interpretability

In Table 5, our model has a 10% higher perplexity on the test set, which means a lower likelihood score for these passengers. Introducing the regularization term into the loss function naturally decreases the likelihood score, because regularization terms generally reduce the model fitting accuracy (i.e., the likelihood score) in exchange for better generalizability on the testing samples.

Table 5 Trade-off between perplexity and interpretability

However, in the literature, it has been shown that perplexity is not a good measure compared with the topic coherence PMI score. This is because perplexity itself does not reflect the meaningfulness of topics, and topics with lower perplexity might even conflict with real-world knowledge (Yao et al. 2017; Chang et al. 2009). Our experiments show that, by adding the graph regularization, although the model likelihood score is lower (i.e., a 10% higher perplexity measure), the model interpretability and generalization power are significantly improved (i.e., roughly twice higher PMI and 50% lower \(d_{G}\) measures). We therefore observe such a trade-off between perplexity and interpretability.

1d-LDA has the highest perplexity since it is not a tensor-based model and thus cannot preserve the innate spatiotemporal correlations.

Fig. 5: (a) POI features for origin topics; (b) The locations of the top 10 stations from origin topics 4, 5, 6, and 8

5.6 Improved interpretability in station topics

To better demonstrate the enhanced interpretability of the proposed model’s topics and check how they reflect the real world, we also visualize (1) the topic POI features, (2) the topic locations, and (3) the topic station components on the real metro map, in comparison with the second-best model 3d-LDA(Gibbs), for the origin dimension only. The discovered topics can be used as station clusters.

(1) Topic POI Feature to check the POI feature of each topic: Usually, more distinguishable topics are better (Zhu et al. 2012). The POI features of topics from 3d-LDA(Gibbs) and our model are calculated as \({\mathbf {B}}^O_{(poi)} = {\mathbf {B}}^O {\varvec{G}}_{poi}\), where \({\mathbf{B}}^O_{(poi)} \in {\mathbb {R}}^{J \times N_{poi}}\) and the j-th row \(\varvec{\beta }^O_{j(poi)} \in {\mathbb {R}}^{N_{poi}}\) indicates the POI feature of this topic. As shown in Fig. 5. (a2), topics from the proposed model capture more distinct POI groups: origin topic 4, 5, 6, and 8 capture the POI leisure shopping, major building, residential and school respectively; Topics from 3d-LDA(Gibbs) instead, as shown in Fig. 5. (a1), capture the topics with similar and non-distinguishable POI distribution. The distinct POI patterns are observed in our destination topics too.

(2) Topic Location on Map to check each topic’s location on the map: The top 10 stations with the highest weights from origin topics 4, 5, 6, and 8 are located on the metro map. In Fig. 5(b), the stations from our topics are concentrated in the same line/region, whereas the topics learned without graphs are dispersed among different regions. Therefore, the topics learned from our proposed method show a significant improvement in interpretability and reflect the external knowledge.

Fig. 6: The top 10 stations from topic 6, with station weights presented as line charts on top, real metro lines plotted below, and station POI features presented as pie charts

(3) Station Components Analysis to check each topic’s station components in terms of station location and POI: A further detailed study of the top 10 station components with the highest weights inside each topic (here topic 6 is chosen) is conducted to check those stations’ exact locations and surrounding POIs. (1) In Fig. 6(b), the top 10 stations of our topic 6 are all located on the same metro line, and the POI feature of each station is also mainly residential; (2) on the contrary, in Fig. 6(a), the top 10 stations of topic 6 without graphs are scattered over three different lines, and those stations also have quite different POI features, such as station 80 in leisure shopping and station 99 in major buildings.

To conclude, the identified topics from the proposed model have better physical meanings and interpretability.

Applications of OD Clustering: Metro companies usually categorize stations by expert knowledge, such that different station categories have different operational, marketing, and urban planning strategies. However, this categorization is usually out of date since a station’s features evolve over time. The learned station clustering is purely data-driven and updates with the data, which provides better insights.

5.7 Time topics

Fig. 7: (a) Time topics; (b) Passenger clusters

The time topics from our method are shown in Fig. 7(a): Topic 0 captures the first morning peak (7–8 AM); Topic 1 captures the second morning peak (9–10 AM); Topic 2 captures the mid-day trend; and Topic 3 captures the evening peak (8 PM).

Applications of Time Clustering: The learned time topics could offer clear insights about the peak hours and enable crowd management.

5.8 Passenger clustering

The passenger clusters can be learned from each passenger’s tensor topic distribution parameter \(\varvec{{\mathcal {C}}}_u\). The distance between two passengers’ topic distributions can be measured by the Euclidean distance, the Jensen-Shannon (JS) divergence (Cheng et al. 2020), and so on. We choose the JS divergence since it is a symmetric measure for distributions. Clustering methods such as K-means can then be applied to cluster the passengers.
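A minimal sketch of this clustering step is given below; since K-means operates on feature vectors rather than pairwise distances, the sketch substitutes average-linkage hierarchical clustering on the pairwise JS distances, i.e., one possible choice of clustering method rather than the only one.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon, squareform
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_passengers(C, n_clusters=2):
    """Cluster passengers from their tensor topic distributions.
    C: (M x J x K x L) array of per-passenger topic tensors C_u."""
    P = C.reshape(C.shape[0], -1)
    P = P / P.sum(axis=1, keepdims=True)                    # flatten and renormalize each C_u
    M = P.shape[0]
    D = np.zeros((M, M))
    for a in range(M):
        for b in range(a + 1, M):
            D[a, b] = D[b, a] = jensenshannon(P[a], P[b])   # pairwise JS distance
    Z = linkage(squareform(D), method="average")            # average-linkage hierarchical clustering
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```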

Two passenger clusters are shown: in Fig. 7(b1), the ‘student’ cluster travels from O6 (residential) to D0 (school) at T0 (7–8 AM) and travels back from O8 (school) to D4 (residential) at T2 (mid-day); in Fig. 7(b2), the ‘white-collar’ cluster usually travels from O5 (major building) to D9 (major building) at T1 (9–10 AM) and, after work, travels from O5 (major building) to D8 (leisure shopping) at T3 (8 PM).

Applications of passenger clustering: (1) Customized services: Passenger clustering could help public transport companies better understand passenger demographics, enabling customized travel reward plans or tailored advertisements for different passenger clusters. (2) Destination inference: Moreover, the conditional probability based on the latent parameters \( {{\mathbf {B}}}^O, {{\mathbf {B}}}^D, {{\mathbf {B}}}^T\), and \(\varvec{{\mathcal {C}}}_u\) enables more potential applications. For example, given the passenger \(u'\), origin \(o'\), and entry time \(t'\), the destination could be predicted by \(P( w^D=d \mid w^O=o', w^T=t', u') \propto P( w^O=o', w^D=d, w^T=t', u') = \sum ^J_{j=1}\sum ^K_{k=1}\sum ^L_{l=1} c_{u',j,k,l} \times \beta ^O_{jo'} \times \beta ^D_{kd} \times \beta ^T_{lt'}.\) As a result, a destination crowd warning message in the mobile application could then be directed to each passenger. With \( {{\mathbf {B}}}^O, {{\mathbf {B}}}^D\) better estimated with graphs, for passengers who travel between two stations (accounting for 83.8% of the population in our dataset), the destination inference accuracy is improved by 18% compared with 3d-LDA(Gibbs), as shown in Table 6. In Fig. 8, the most popular OD pairs (i.e., \(o' \rightarrow d'\)) are selected to visualize destination inference. With the input of a passenger \(u'\) who has these OD pairs and the time \(t'\) when he/she enters \(o'\), the destination is inferred by our method with higher accuracy (correct \(d'\) in green), whereas the method without graphs makes more mistakes (wrong d in red).
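The conditional probability above can be evaluated directly from the learned parameters, as in the following sketch (station and hour indices are illustrative):

```python
import numpy as np

def infer_destination(C_u, B_O, B_D, B_T, o, t):
    """P(w^D = d | w^O = o, w^T = t, u), computed up to its normalizing constant
    from the passenger's topic tensor C_u and the topic-word matrices."""
    scores = np.einsum("jkl,j,l,kd->d", C_u, B_O[:, o], B_T[:, t], B_D)
    return scores / scores.sum()       # posterior over destinations

# Usage: d_hat = infer_destination(C_u, B_O, B_D, B_T, o=12, t=8).argmax()
```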

Table 6 Destination inference accuracy
Fig. 8: Destination inference for selected OD pairs: \(o'\) in blue, ground-truth \(d'\) in grey, correct d in green, wrong d in red

Fig. 9: Convergence comparison of (a) log-likelihood evolution and (b) origin perplexity evolution between online Algorithm 2 (green) and batch Algorithm 1 (red). Each point marker ‘\(\cdot \)’ in the online version denotes 10 iterations, and each cross marker ‘\(\times \)’ in the batch version denotes 1 iteration

5.9 Faster convergence from online algorithm

Last but not least, as shown in Fig. 9, we compare the convergence speed of the batch and online versions of the algorithm in the same computation environment. The proposed online stochastic algorithm needs more iterations to converge but 60% less time: the online version (shown in green) converges at \(t\approx 300\), more than twice as fast as the batch algorithm (shown in red), which converges at \(t\approx 700\).

Moreover, the online algorithm also converges to better parameter estimates: (1) Higher log-likelihood: as shown in Fig. 9(a), the online version converges at a log-likelihood of \(\approx -113 \times 10^3\), higher than the batch version’s \(\approx -118 \times 10^3\); (2) Lower perplexity: as shown in Fig. 9(b), the online version converges at an origin perplexity of \(\approx 60\), lower than the batch version’s \(\approx 89\).

6 Conclusion

In this paper, we studied the ODT-P multi-clustering problem for individual passenger travel patterns, aiming at meaningful clusters on each dimension (origin, destination, and time) and on individual passengers by incorporating external information about the origin and destination stations. To solve this challenge, we proposed a novel graph-regularized tensor Latent Dirichlet Allocation model, which applies to each passenger’s travel data and incorporates the external information as graph regularization. We proposed a tensorized variational EM algorithm to estimate the parameters, and an online learning algorithm to further improve scalability. In the case study based on the Hong Kong metro system, we demonstrated superiority over state-of-the-art methods in terms of roughly two times higher topic coherence, 50% lower distance on graph, and better interpretability. The improvement is also reflected in the 20% more accurate individual destination inference. The proposed online learning method also converges twice as fast while reaching the same good performance as the batch learning method.

7 Future work

Our model will be extended to cover trip duration, which is a continuous variable; the generative process will be extended correspondingly to handle continuous distributions. Besides, due to the independence assumption of the Dirichlet distribution, correlations between passengers will also be further examined.

7.1 Generalization

This work can also be applied to non-metro data, such as bus and ride-sharing trips, as long as the ODT information is recorded. In road traffic, the nodes in \({\mathbf {G}}_{net}\) and \({\mathbf {G}}_{poi}\) can be defined as grids, road segments, or zip-code zones, and the edges can be defined similarly if distance and POI information are available.