1 Introduction

Understanding human mobility and behavior patterns is a prerequisite for effective decision-making and interventions in urban mobility systems under normal and abnormal situations (e.g., congestion, pandemics, and natural disasters). Nowadays, with the rapid development of sensing techniques, large-volume and multi-source mobility data (e.g., smart card data in public transport, GPS trajectory data, and geotagged social media data) can be accessed and applied in human mobility pattern analysis. Mobility data can provide human travel behavior records at a disaggregated level, which significantly benefits mobility-related studies, such as individual travel pattern and behavior analysis [1,2,3].

To understand user behavior and optimize public transport systems, the level of interest or dependence of individual users or groups of users on specific stations can be an important measure. Numerous factors influence this measure, including station location, user preferences, travel patterns, and the availability of alternative transportation options. For instance, users may show a higher level of attention towards stations that are close to their home or workplace, or that offer convenient transfer options. Through the analysis of user-station attention, transport providers can gain valuable insights into user preferences and travel patterns, which can assist them in making data-driven decisions to improve services. Moreover, studying user-station attention leads to a better comprehension of human mobility, such as travel purposes and underlying motivations, and facilitates downstream applications in transport, such as mobility prediction and location recommendation.

Although many studies have explored the visiting patterns of passengers to stations, we have found no studies on user-station attention in the literature. We introduce the user-station attention problem as the problem of inferring the ‘real’ hidden (unobserved) user-station attention from observed station visit counts data (e.g., from smart card data). The problem is challenging given its intrinsic unsupervised learning nature. In addition, it is elusive to associate visit counts with attention for cases with zero or relatively small (e.g., 1-2 visits per week) visit counts. For example, a user having no historical visiting record at a station does not necessarily mean that the user has no interest in that station. It could be because the user is unaware of, or has not yet visited, that station, but could potentially be interested in going (e.g., based on nearby points of interest or restaurants). Furthermore, missing station visit count data for new users leads to the ‘cold start’ problem when analyzing their behaviors for proactive interventions.

Mathematically, the studied user-station attention problem is similar to the user-item rating problem in recommendation systems (e.g., for restaurants, movies, and music). Generally, the models proposed in user-item rating studies aim to capture user similarity (e.g., sociodemographic characteristics) and item-item relationships (e.g., laptop and mouse) to make effective recommendations [4,5,6]. Analogously, an effective user-station attention inference method should model user similarity and station-station relationships. In our context, user similarity is measured by mobility patterns extracted from users’ travel histories (approximating users’ sociodemographics). In the studied problem, we consider the user-station attention as the attention of a user to a destination station, assuming the user’s origin station is known or inferred from the residential location, and we also assume the time-dependent activities are known a priori. Therefore, station-station relationships should capture the travel purposes/activities in order to facilitate the user-(destination) station attention inference. In other words, we assume that users with similar travel patterns traveling from the same station at the same time are more likely to visit the same destination station (detailed discussions in Section 3.1).

The knowledge graph (KG), a graph-based knowledge representation and organization method widely studied in recent years, is a potential approach to modeling relations between stations. A KG can be regarded as a large knowledge base with a directed network structure (nodes representing entities and edges the relationships between entities) [7]. The main advantage of the KG is that it can capture and present not only directly observed information but also intricate relations among knowledge, connecting fragmented pieces of knowledge across various information systems. KGs have been successfully applied in a wide range of domains such as search, journalism, healthcare, entertainment, network security, and pharmaceuticals [8,9,10,11]. In the urban mobility area, the temporal aspects of mobility events (e.g., departures, stations) and their dependencies are essential elements in modeling and analysis. Although several studies have attempted to develop KGs for transportation that consider time [12,13,14], they are in essence static KGs, since they use time as an entity attribute but the relationships do not change over time.

Another challenge for constructing KG in the mobility area is that the knowledge is usually hidden (e.g., trip purpose) in the observed travel trajectories. Developing automatic methods to extract the hidden knowledge is needed but challenging in practice without ground-truth observations.

In this paper, we introduce the user-station attention inference problem using observed station visit counts data in public transport and develop a matrix decomposition method capturing both user similarity and station-station relationships modeled using a KG. Specifically, we capture the user similarity information from the user-station visit counts matrix, and extract the stations’ latent representations and the hidden relations (activities) between stations to construct the mobility knowledge graph (MKG) from smart card data. To extract the hidden relations between stations, a neural network (NN)-based nonlinear decomposition approach is introduced. The main contributions are summarized as follows:

  • Develop a methodology to infer the ‘real’ user-station attention from partially observed user-station visit counts data. It captures both user similarity and station-station activity relations using the MKG;

  • Propose an MKG construction framework from smart card data. It captures the spatiotemporal travel pattern correlations between stations using a neural network-based nonlinear decomposition method;

  • Conduct systematic case studies to validate the proposed method using both synthetic and real-world datasets, and compare it with benchmark models.

The remainder of the paper is organized as follows. Section 2 reviews related work on the user-station attention problem and knowledge graphs. Section 3 describes the research problem and framework, and proposes the user-station attention inference method. Section 4 presents the case study to validate the proposed method and explore its performance in user-station attention matrix inference. Section 5 summarizes the main findings and outlines future research directions.

2 Related work

User-station attention has not been previously studied. However, the concept is related to the problem of suggesting relevant items to users in recommender systems (or recommendation systems). In recommender systems, algorithms aim to predict a user’s ‘rating’ of or ‘preference’ for an item (e.g., movies, books, restaurants). In recent years, the KG has been widely used in recommender systems and achieved fruitful results [15,16,17,18,19]. Concretely, a KG-based recommender system exploits the connections between entities (nodes) representing users, items to be recommended, and their interactions (relations) [4]. The relations, explicit or implicit, are used to identify items that may be interesting to the target user [17]. Thus, compared to other approaches (e.g., collaborative filtering and content-based filtering), the relations in a KG can provide additional valuable information to facilitate inference between nodes and discover new connections. This leads to better prediction performance, especially when the dataset is sparse [19]. In general, the KG exhibits rich semantic correlations that can be used to explore hidden connections, help expand users’ interests to increase the diversity of recommended items, and enhance the interpretability of the system [20]. Therefore, the KG has great advantages in mining relationships and establishing connections between entities, which lays an important foundation for user-station attention inference.

KGs, as a form of structured knowledge representation and extraction, have drawn great research attention from both academia and industry. KGs can be classified into general knowledge graphs and domain knowledge graphs according to the application field [16]. Domain KGs usually require more comprehensive background knowledge and more specific datasets than general ones. Given the studied problem, this paper focuses on reviewing domain KGs.

KGs have been extensively studied in many fields, especially the medical field. Many studies constructed medicine/health knowledge graphs by extracting medical entities, relations, and properties from different sources of medical knowledge [21,22,23]. Yang et al. [24] developed a corporate risk knowledge graph using basic corporate information from encyclopedia data through information extraction, knowledge fusion, ontology construction, and dynamic knowledge reasoning. It introduced a time dimension to describe the evolutionary characteristics of enterprise risk events and focused on characters and entities in the corporate field. Du et al. [25] built a GIS knowledge graph (GIS-KG) by combining various GIS knowledge sources to form a hierarchical ontology and using deep-learning techniques to map GIS publications onto the ontology. By capturing and integrating GIS knowledge from various primary sources, it improved domain-specific information retrieval for the wide-ranging field of geographic information science and its technologies. Zhang et al. [26] systematically analyzed knowledge related to maritime dangerous goods and defined concepts and relations to build an ontology framework, then filled entities into this framework to construct a knowledge graph for maritime dangerous goods. It was used for the efficient retrieval of dangerous goods knowledge, as well as the automatic judgment of segregation requirements. Mao et al. [27] proposed a process safety knowledge graph and developed a domain ontology on the delayed coking process to handle process safety knowledge and support the analysis of risk causes and consequences. It enhanced knowledge-based analysis abilities in discovering hidden relationships between possible risk causes and consequences in emergency situations.

The knowledge sources of the above studies are all based on domain textual knowledge, and different methods are used to extract knowledge. In addition, most studies used the triple form structure (entity-relation-entity) with a few using the quadruple form (entity-relation-entity-property) to extract information from static text data.

Compared with other domain knowledge graphs, research in transportation is still emerging. For example, Tan et al. [14] constructed a knowledge graph of an urban traffic system and extracted entities, relations, and attributes from Automatic Fare Collection (AFC) data, static basic data of metro lines and stations, urban road traffic data, and other data to construct a traffic field ontology. It was used for traffic knowledge discovery and intelligent question answering for urban traffic services (traveler discovery). Shan et al. [28] proposed the notion of an urban knowledge graph, which comprises multiple traffic topics, entities, attributes, and features extracted from traffic text knowledge. It performs better than traditional statistical methods for reasoning about pedestrian volume based on a hybrid reasoning algorithm. Chen et al. [12] proposed a knowledge graph-based framework for digital contact tracing in public transportation systems during the COVID-19 pandemic. The framework fuses multi-source data and uses the trip-chaining model to construct a knowledge graph, from which a contact network is extracted; a breadth-first search algorithm is developed to efficiently trace infected passengers in the contact network. Wang et al. [29] studied the integration of reinforcement learning and a spatial knowledge graph for incremental mobile user profiling. The spatial knowledge graph extracted entities from the environment, established the semantic connectivity among spatial entities, and characterized the semantics of user visits over connected locations, activities, and zones. Li et al. [30] presented a knowledge graph-based framework for discovering potential destinations of low-predictability travelers. The framework considers trip association relationships and constructs a trip knowledge graph to model the trip scenario. With a knowledge graph embedding model, it can obtain a ranking of possible unobserved destinations by calculating triples’ distance.

Despite efforts in numerous studies to develop KGs for transportation that incorporate time as an entity attribute, these KGs remain static in nature, as the relationships do not change dynamically over time. Furthermore, constructing a KG in the mobility area is particularly challenging compared to other fields because the knowledge, such as trip purpose, is typically hidden within the observed travel trajectories. Therefore, it is essential but challenging to develop methods for extracting this hidden knowledge in practical applications without ground-truth observations.

Table 1 Notation

3 Methodology

This section introduces the user-station attention inference problem. It proposes the inference methodology assisted by the MKG, together with the corresponding MKG construction models from smart card data. The notation used throughout the paper is shown in Table 1.

3.1 Problem description and research framework

In practice, the attention of a user to stations and the station-station relations (activities) in MKG could vary for different time periods. Therefore, we infer the user-station attention matrix and develop a set of sub-graphs corresponding to different time periods, such as morning peak and off-peak. For ease of presentation, we define key terminologies used in this paper.

Definition 1

(Individual mobility trajectories) The individual mobility trajectories are a sequence of trips of users, i.e., \(T_r=\{tr_1, tr_2, tr_3,\dots , tr_n\}\), with a trip \(tr_i=\{user\ id, origin, destination, entry\ time, exit\ time\}\).

Definition 2

(User-station visit counts matrix) The user-station visit counts matrix \(\textbf{X}=[x_{ij}] \in \mathbb R^{m\times n}\) describes the visit counts of users to stations, where m is the number of users and n the number of stations. Each entry \(x_{ij}\) in matrix \(\textbf{X}\) is the visit count of the \(i^{th}\) user to the \(j^{th}\) station.
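As an illustration, the visit counts matrix of Definition 2 can be built from trip records of the form in Definition 1. This is a minimal sketch: the dictionary field names and the choice to count both the boarding (origin) and alighting (destination) stations as visits are assumptions, not prescribed by the paper.

```python
import numpy as np

# Toy trip records following Definition 1 (field names are illustrative).
trips = [
    {"user_id": 0, "origin": 1, "destination": 2, "entry_time": "08:01", "exit_time": "08:25"},
    {"user_id": 0, "origin": 2, "destination": 1, "entry_time": "17:30", "exit_time": "17:55"},
    {"user_id": 1, "origin": 0, "destination": 2, "entry_time": "09:10", "exit_time": "09:40"},
]

m, n = 2, 3                      # number of users, number of stations
X = np.zeros((m, n), dtype=int)  # user-station visit counts matrix (Definition 2)
for tr in trips:
    # Assumption for this sketch: both origin and destination count as visits.
    X[tr["user_id"], tr["origin"]] += 1
    X[tr["user_id"], tr["destination"]] += 1
```

Each row of `X` then summarizes one user's visit pattern over the stations, which is the object decomposed later in Section 3.2.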

Definition 3

(User-station attention matrix) The user-station attention matrix \(\textbf{Y}=[y_{ij}] \in \mathbb R^{m\times n}\) describes the attention of users to stations over the whole transport network, where m is the number of users and n the number of stations. Each entry \(y_{ij}\in (0,1)\) in matrix \(\textbf{Y}\) is the attention of the \(i^{th}\) user to the \(j^{th}\) station.

Definition 4

(Mobility knowledge graph) The mobility knowledge graph is defined as a time-dependent knowledge graph, which is composed of sub-graphs for different time intervals. Each sub-graph is a quadruple form knowledge graph composed of nodes, edges, and time period, in which nodes represent stations and edges represent the relations (travel activities) between stations during a certain time period. It can be represented as \(\textbf{G} = \{(\textbf{l}_o,\textbf{r},\textbf{l}_d,t) \mid \textbf{l}_o\in \textbf{L}_o,\textbf{r}\in \textbf{R},\textbf{l}_d\in \textbf{L}_d \}\), where \(\textbf{l}_o\) is the origin stations’ latent vector; r is the relationship’s latent vector; \(\textbf{l}_d\) is the destination stations’ latent vector; and t is the time period. At each time period t, the time element in \(\textbf{G}\) becomes irrelevant in the sub-graph.

Note that the user-station visit counts matrix provides a simple count of the number of times a user has visited a particular station, without indicating the importance or dependence of the user on that station. In contrast, the user-station attention matrix can quantify the importance of a station to a user or group, providing insights into the level of interest or dependency of users on different stations. Also, the user-station visit counts may be biased towards frequently visited stations, potentially overlooking less frequently visited stations and resulting in an incomplete understanding of the user-station relationship. Therefore, while the user-station visit counts matrix provides basic information about the users and stations, the user-station attention matrix offers a more comprehensive, intuitive, and nuanced understanding of their relationships.

Figure 1 shows the studied problem in this paper. The problem is to infer the full and ‘true’ user-station attention matrix \(\textbf{Y}\) given a partially observed user-station visit counts matrix \(\textbf{X}_p\). Due to mechanical faults, new users, or ticket users in the network, matrix \(\textbf{X}\) may not be fully observed. Moreover, the zero and small values (e.g., 1-2 visits per week) in matrix \(\textbf{X}\) are regarded as ‘untrusted’ observations. Therefore, only certain observations in matrix \(\textbf{X}\) (i.e., partially observed user-station visit counts matrix \(\textbf{X}_p\)) are used to infer the user-station attention matrix. Similar to the user-item rating prediction problem in the recommender system, we represent users and stations in a latent low-dimensional space of dimension k in Eq. (1). The \(i^{th}\) user is represented by a latent vector \( \textbf{u}_i \in \mathbb R^{1\times k}\) and \(j^{th}\) station by a latent vector \(\textbf{l}_j \in \mathbb R^{1\times k}\). The attention of \(i^{th}\) user to \(j^{th}\) station is expressed as the inner product of their latent representations [31] as follows:

$$\begin{aligned} \textbf{y}_{ij} = {\textbf{u}_i}{\textbf{l}^T_j}. \end{aligned}$$
(1)
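Eq. (1) can be sketched directly with latent factors: the attention matrix is the product of the user and station factor matrices, and a single entry is an inner product of two latent vectors. All dimensions and the random values below are illustrative.

```python
import numpy as np

k = 4                              # latent dimension (illustrative)
rng = np.random.default_rng(0)
U = rng.random((5, k))             # users' latent representations (m = 5)
L = rng.random((8, k))             # stations' latent representations (n = 8)

Y = U @ L.T                        # full user-station attention matrix, Eq. (1)
y_ij = U[2] @ L[6]                 # attention of user i = 2 to station j = 6

# The matrix product and the per-entry inner product agree.
assert np.isclose(Y[2, 6], y_ij)
```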
Fig. 1 Problem description. (a) Observed user-station visit counts matrix (Green grid: 0; Orange grid: observations with extremely small values; Blue grid: observations with normal values; ×: missing observation). (b) The user-station attention matrix

However, there are two challenges in solving this problem. First, as discussed previously, the zero and extremely small values in matrix \(\textbf{X}\) may not reflect the user’s real attention to a station and should be treated as noisy data. Second, the lack of historical trajectory data (missing values) can lead to cold start problems. To tackle these issues, user similarity and station-station relations (activities) are exploited to assist in inferring the full user-station attention matrix.

Fig. 2 Research framework

A common implementation in recommender systems is to recommend new items by identifying other users with similar tastes. Similarly, a user’s attention to a destination that has not been visited, or not frequently visited, can be inferred based on similar users. In addition, a person’s travel demand is derived from their daily activity patterns. Therefore, we assume that people with the same travel activity, traveling from the same origin station at the same time period, are more likely to visit the same destination. People with the same travel activity refer to those who have the same travel purpose, such as work, shopping, or school. Such an assumption is important for accurate and robust user-station attention estimation. By analyzing the travel patterns of users with similar travel activities and factoring in the origin station and time period, it may be possible to infer the attention of a particular user towards a destination. For instance, if a destination has not been visited by a particular user but has been frequently visited by users with similar travel activities during the same time period, it may be inferred that the user has a certain level of interest in that destination.

For missing values, the activities (relations) between OD pairs contribute to the user-station (destination) attention inference. For example, suppose the relation between stations A and B is shopping during off-peak hours and a new user travels from station A at off-peak; then we can infer that the new user may have attention to station B. To support the studied problem, we thus define the station-station relations in the MKG as travel purposes/activities. Given these, we formally define the studied problems as below.

Problem 1

(User-station attention matrix inference) Given the partially observed user-station visit counts matrix \(\textbf{X}_p\), infer the full user-station attention matrix \(\textbf{Y}\), capturing both user similarity and station-station activity relations using the MKG.

Problem 2

(MKG construction) Given the individual mobility trajectories \(T_r\), extract the stations’ latent representation \(\textbf{L}\) and the hidden relationships (activities) \(\textbf{R}\) capturing the spatiotemporal correlations of travel patterns during time interval t to construct the MKG \(\textbf{G} ( \textbf{L}_o,\textbf{R}, \textbf{L}_d,t)\) in the form of < origin, relationship, destination, time period >.

Figure 2 shows the framework of the proposed method. The input data are individual mobility trajectories \(T_r\). It contains three modules: user similarity, MKG construction, and user-station attention matrix inference.

  1. User Similarity This step constructs the observed user-station visit counts matrix \(\textbf{X}\) from mobility trajectory data and obtains the users’ latent representation \(\textbf{U}\) based on a matrix decomposition approach. In matrix \(\textbf{X}\), each row records the visit counts of a user at different stations, which reflects the user’s travel pattern. The user similarity is reflected by different rows of the matrix. By matrix factorization, the users’ latent representation can be obtained, which captures the user similarity (implicit) features.

  2. MKG Construction This step constructs the temporal station demand matrix \(\textbf{T}\) and the OD flow matrix \(\textbf{S}\) from mobility trajectory data. An NN-based nonlinear decomposition approach is proposed to extract the stations’ latent representation and the station-station relations (activities) \(\textbf{R}\) to construct the MKG. The extracted relations capture the spatiotemporal travel pattern correlations between stations.

  3. User-Station Attention Matrix Inference This step infers the full and ‘real’ user-station attention matrix \(\textbf{Y}\), capturing both user similarity and station-station relations using the constructed MKG.

3.2 User similarity

By collecting the user’s check-in and check-out data from the mobility trajectories \(T_r\) during time period t, we can construct the observed user-station visit counts matrix \( \textbf{X} \in \mathbb R^{m\times n}\), where m is the number of users and n the number of stations. Each entry \(x_{ij}\) in matrix \(\textbf{X}\) represents the visit count of the \(i^{th}\) user to the \(j^{th}\) station. Each row aggregates the user’s number of visits to each station, which reflects the user’s travel pattern. Therefore, we can capture the users’ hidden characteristics (user similarity) by decomposing matrix \(\textbf{X}\). Similar to Eq. (1), based on matrix factorization, the matrix \(\textbf{X}\) can be decomposed into:

$$\begin{aligned} \textbf{X} \approx {\textbf{U}}{\textbf{L}^T}. \end{aligned}$$
(2)

\(\textbf{U} \in \mathbb R^{m \times k}\) is the users’ latent representation capturing the user similarity (implicit); \( \textbf{L} \in \mathbb R^{n \times k}\) is the stations’ latent representation. Therefore, by optimizing the following objective function, we can obtain \(\textbf{U}\) and \(\textbf{L}\).

$$\begin{aligned} \min _{\textbf{U},\textbf{L}\ge 0} \left\| \textbf{X} - \textbf{U} \textbf{L}^{T} \right\| ^{2}+\lambda \left[ \left\| \textbf{U}\right\| ^{2}+\left\| \textbf{L}\right\| ^{2}\right] . \end{aligned}$$
(3)

\(\left\| \cdot \right\| \) denotes the Frobenius norm and \(\lambda \) the regularization rate. The last term is the regularization for penalizing the norm of \(\textbf{U}\) and \(\textbf{L}\). The Frobenius norm is the Euclidean norm applied to the vectorized version of a matrix. It is commonly used to regularize the objective function in optimization problems [32].
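The regularized factorization of Eq. (3) can be solved in several ways; below is a minimal projected-gradient sketch on synthetic data. The step size, iteration count, latent dimension, and the synthetic low-rank matrix are all illustrative assumptions, not the paper's prescribed solver.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 20, 10, 3
X = rng.random((m, k)) @ rng.random((k, n))       # synthetic low-rank visit counts

U = rng.random((m, k))                            # users' latent representation
L = rng.random((n, k))                            # stations' latent representation
lam, eta = 0.01, 0.01                             # regularization rate, step size
loss0 = np.linalg.norm(X - U @ L.T) ** 2
for _ in range(500):
    E = X - U @ L.T                               # reconstruction residual
    # Gradient steps on Eq. (3), projected onto the non-negative orthant.
    U = np.maximum(U + eta * (E @ L - lam * U), 0.0)
    L = np.maximum(L + eta * (E.T @ U - lam * L), 0.0)
loss = np.linalg.norm(X - U @ L.T) ** 2           # should be well below loss0
```

After convergence, the rows of `U` can be compared (e.g., by cosine similarity) to measure user similarity, as described above.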

After this step, we can obtain \(\textbf{U}\) and \(\textbf{L}\) to infer the user-station attention matrix \(\textbf{Y}\). However, this approach has two major drawbacks. First, the learned stations’ latent representation \(\textbf{L}\) is not easy to interpret; second, it does not capture the station-station relations and other constraint information (e.g., temporal, spatial, and spatiotemporal features). To overcome these limitations, the MKG is constructed to enrich the stations’ latent representation.

3.3 Mobility knowledge graph construction

The construction of the MKG consists of two parts: entity extraction and relation extraction. Entity extraction refers to extracting different station entities (including subway stations, train stations, or bus stops) from mobility data. Unlike GPS trajectory data, smart card data usually record the passenger’s boarding and alighting stations, which makes it straightforward to identify the stations a passenger travels through. Since all variables in this study are calculated in the latent space, we need to obtain the stations’ latent representation. Relation extraction is to extract the implicit relations (activities) between stations. In the context of the MKG in public transport, relations represent the spatiotemporal travel pattern correlations between two stations within a certain time period, which are in an implicit form (a ‘hidden’ mechanism). Therefore, relation extraction is a core part of MKG construction.

3.3.1 Stations’ latent representation

For each station, we construct the station’s latent vector \( \textbf{l} = [\textbf{l}_t;\textbf{l}_s]\) capturing both temporal and spatial features of station demand patterns, where \(\textbf{l}_t\) and \(\textbf{l}_s\) are the station’s temporal and spatial latent vectors, respectively. The travel characteristics of each station are time-varying. For example, the number of visitors at a station may be quite different between peak and off-peak hours. Therefore, it is important to extract the time-based visit features of stations. We first select the time period of interest t (e.g., morning peak or off-peak), then construct the temporal station demand matrix \( \textbf{T} \in \mathbb R^{n \times h}\), where h is the number of hours within time period t. An entry \(T_{ij}\) in matrix \(\textbf{T}\) represents the visit counts of the \(i^{th}\) station in the \(j^{th}\) hour. According to Non-negative Matrix Factorization (NMF) theory [33], the non-negative matrix \(\textbf{T}\) can be decomposed into two non-negative matrices, which is utilized to extract the visit features of each station. The time series matrix \(\textbf{T}\) can be decomposed into:

$$\begin{aligned} \textbf{T} \approx \textbf{L}_t \textbf{Q}. \end{aligned}$$
(4)

Matrix \( \textbf{L}_t \in \mathbb R ^{n \times k_t}\) is the stations’ temporal latent representation, \( \textbf{Q} \in \mathbb R^{k_t \times h}\) is the coefficient matrix and \(k_t\) is the number of stations’ temporal latent features.
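The factorization in Eq. (4) can be sketched with Lee-Seung multiplicative updates, a standard NMF algorithm; the paper does not prescribe a particular solver, and the matrix sizes, iteration count, and synthetic demand data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, h, k_t = 12, 6, 3
T = rng.random((n, h)) * 100                 # synthetic hourly station visit counts

Lt = rng.random((n, k_t))                    # stations' temporal latent representation
Q = rng.random((k_t, h))                     # coefficient matrix
eps = 1e-9                                   # guards against division by zero
err0 = np.linalg.norm(T - Lt @ Q)
for _ in range(300):
    # Multiplicative updates keep both factors non-negative by construction.
    Lt *= (T @ Q.T) / (Lt @ Q @ Q.T + eps)
    Q *= (Lt.T @ T) / (Lt.T @ Lt @ Q + eps)
err = np.linalg.norm(T - Lt @ Q)             # reconstruction error shrinks
```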

To obtain the spatial latent features between OD pairs, the OD flow matrix \(\textbf{S} \in \mathbb R^{n \times n}\) is constructed, in which an entry \(S_{ij}\) represents the visit counts from the \(i^{th}\) station to the \(j^{th}\) station. Given the matrix \(\textbf{S}\), it can be decomposed into:

$$\begin{aligned} \textbf{S} \approx \textbf{L}_s \textbf{D} \textbf{L}_s^T. \end{aligned}$$
(5)

Matrix \(\textbf{L}_s \in \mathbb R^{n \times k_s}\) is the stations’ spatial latent representation; \(\textbf{D} \in \mathbb R^{k_s \times k_s}\) is the feature diagonal matrix and \(k_s\) is the number of spatial latent features. The OD flow matrix \(\textbf{S}\) records the relation from one station to another, which can also be seen as spatial dependencies. Therefore, the OD flow matrix \(\textbf{S}\) can be regarded as an embedding of the system’s topology. Similarly, the matrices \(\textbf{L}_s\) and \(\textbf{D}\) are the stations’ spatial embedding and the interactions in the latent component space, respectively.
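One possible numerical sketch of the symmetric tri-factorization in Eq. (5) uses projected gradient descent with a diagonal D. The solver choice, the synthetic symmetric flow matrix, and all hyperparameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k_s = 10, 3
G = rng.random((n, k_s))
S = G @ np.diag([3.0, 2.0, 1.0]) @ G.T        # synthetic symmetric OD flow matrix

Ls = rng.random((n, k_s))                     # stations' spatial latent representation
d = rng.random(k_s)                           # diagonal entries of D
eta = 0.001                                   # small step size for stability
loss0 = np.linalg.norm(S - Ls @ np.diag(d) @ Ls.T) ** 2
for _ in range(2000):
    E = S - Ls @ np.diag(d) @ Ls.T            # residual (symmetric here)
    # Gradient steps on ||S - Ls D Ls^T||^2, projected to stay non-negative.
    Ls = np.maximum(Ls + eta * 4 * E @ Ls @ np.diag(d), 0.0)
    d = np.maximum(d + eta * 2 * np.einsum('ik,ij,jk->k', Ls, E, Ls), 0.0)
loss = np.linalg.norm(S - Ls @ np.diag(d) @ Ls.T) ** 2
```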

Fig. 3 Framework of the feedforward neural network

Combining Eqs. (4) and (5), the stations’ latent representation \(\textbf{L}\) is expressed as \(\textbf{L}= [\textbf{L}_t;\textbf{L}_s]\), which is formed by concatenating stations’ temporal latent representation \(\textbf{L}_t\) and spatial latent representation \(\textbf{L}_s\). Therefore, we can get the stations’ latent representation \(\textbf{L}\) by optimizing the following objective function

$$\begin{aligned} \begin{aligned} \min _{\textbf{L}_t, \textbf{L}_s, \textbf{Q}, \textbf{D} \ge 0}&\left\| \textbf{S} - \textbf{L}_s \textbf{D} \textbf{L}_s^T \right\| ^{2}+\left\| \textbf{T}-\textbf{L}_t \textbf{Q}\right\| ^{2} \\&+\lambda \left[ \left\| \textbf{L}_t\right\| ^{2}+\left\| \textbf{L}_s\right\| ^{2}+\Vert \textbf{Q}\Vert ^{2}+\Vert \textbf{D}\Vert ^{2}\right] . \end{aligned} \end{aligned}$$
(6)

The first and second terms capture the spatial and temporal latent patterns, respectively; the last term is the regularization penalizing the norms of \(\textbf{L}_t\), \(\textbf{L}_s\), \(\textbf{Q}\), and \(\textbf{D}\); \(\lambda \) is the regularization rate. Eqs. (4) and (5) can be seen as temporal and spatial information extraction for stations, respectively. After this step, the stations’ temporal and spatial latent representation \(\textbf{L}= [\textbf{L}_t;\textbf{L}_s]\) is extracted and can be used to enrich \(\textbf{L}\) in Eq. (3).
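For concreteness, the joint objective of Eq. (6) can be evaluated for a given set of factors as below; the helper name `joint_loss` and the toy matrices are illustrative assumptions. When both factorizations are exact, only the regularization term remains.

```python
import numpy as np

def joint_loss(S, T, Ls, D, Lt, Q, lam):
    # Data-fit terms for the spatial (Eq. 5) and temporal (Eq. 4) factorizations.
    fit = (np.linalg.norm(S - Ls @ D @ Ls.T) ** 2
           + np.linalg.norm(T - Lt @ Q) ** 2)
    # Frobenius-norm regularization of all four factors, weighted by lambda.
    reg = lam * (np.linalg.norm(Lt) ** 2 + np.linalg.norm(Ls) ** 2
                 + np.linalg.norm(Q) ** 2 + np.linalg.norm(D) ** 2)
    return fit + reg

# Toy factors with exact reconstructions, so fit = 0 and only reg survives:
# reg = 0.1 * (5 + 2 + 2 + 4) = 1.3.
Ls = np.array([[1.0], [1.0]]); D = np.array([[2.0]])
Lt = np.array([[1.0], [2.0]]); Q = np.array([[1.0, 1.0]])
S = Ls @ D @ Ls.T
T = Lt @ Q
print(joint_loss(S, T, Ls, D, Lt, Q, lam=0.1))   # regularization term only
```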

3.3.2 Relation extraction

Relation extraction for KGs is an extensive and profound research topic, and most studies focus on supervised learning methods [34,35,36,37]. In contrast to the conventional supervised relation extraction task, the relation extraction problem in this paper is an unsupervised learning task, since the ‘real’ relations between stations in the MKG are unknown in the training data. Moreover, compared to linear matrix decomposition, a nonlinear matrix decomposition method can extract more information and simultaneously capture the spatiotemporal travel patterns of OD pairs (rather than arbitrarily concatenating spatial and temporal patterns). In the OD flow matrix \(\textbf{S}\), each entry records a specific flow from the origin to the destination, which is determined by the characteristics of the origin and destination stations, as well as their relations. Therefore, the relations can be extracted from the OD flow matrix \(\textbf{S}\). We propose a neural network (NN)-based model, based on the model used in [38], that extracts the relations’ and stations’ latent representations with the objective of recovering the observed OD flow matrix (i.e., a supervised learning formulation).

To illustrate this, Fig. 3 shows the feedforward NN framework for the MKG relation extraction learning task. The input data \([\textbf{l}_o; \textbf{r}_{od}; \textbf{l}_d]\) is the triple form of the MKG in a certain time period, horizontally concatenating the origin’s latent vector \(\textbf{l}_o\), the relation’s latent vector \(\textbf{r}_{od}\), and the destination’s latent vector \(\textbf{l}_d\). The estimated OD flow is generated through the hidden layer with the weights \(\textbf{W}_1\) and \(\textbf{W}_2\) of the NN. The error between the estimated and observed OD flows is then calculated and used to update the weights and the input triple via the backpropagation algorithm. The objective is to minimize this error in order to obtain the relations’ and stations’ latent representations to construct the MKG. Adopting a supervised learning approach for this unsupervised task yields several advantages, including the network’s ability to acquire more intricate information (i.e., comprehending spatiotemporal OD travel patterns) and enhanced inference performance.

In this method, the stations’ spatiotemporal latent representation and the relations’ latent representation are defined as \(\textbf{L}_{ts}\in \mathbb R^{n\times k_{ts}}\) and \(\textbf{R}\in \mathbb {R}^{p\times k_{r}}\), respectively, where p is the number of relations, and \(k_{ts}\) and \(k_r\) are the numbers of latent features of stations and relations. To specify a relationship between two stations, a latent indexing vector \(\textbf{H} \in \mathbb R^{1 \times p}\) is introduced, and the latent vector of a specific relation is \(\textbf{r}_{od}=\textbf{H}\textbf{R}\). Therefore, the objective of the relation extraction is to map each input vector \([\textbf{l}_o; \textbf{r}_{od}; \textbf{l}_d]\) to an output entry \(S_{od}\). The relation extraction problem is defined as: Given a training set \(\{([\textbf{l}_1;\textbf{r}_{12};\textbf{l}_{2}],S_{12} ),([\textbf{l}_{1};\textbf{r}_{13};\textbf{l}_{3}], S_{13}),\dots , ([\textbf{l}_n;\textbf{r}_{n(n-2)};~\textbf{l}_{n-2}], S_{n(n-2)}),([\textbf{l}_{n};\textbf{r}_{n(n-1)};\textbf{l}_{(n-1)}], S_{n(n-1)}) \}\), for each input \(\textbf{z}=[\textbf{l}_o;\textbf{r}_{od};\textbf{l}_d] \in \textbf{Z}\) and corresponding output entry \(S_{od}\in \textbf{S}\), the goal is to learn a function \(f: \textbf{Z} \rightarrow \textbf{S}\) mapping inputs to outputs. For each input vector \(\textbf{z}\), the feedforward NN-based \(f_{NN}(\textbf{z})\) is defined as:

$$\begin{aligned} f_{NN}(\textbf{z}) = \textbf{W}_2\, Sigmoid({(\textbf{z} \textbf{W}_1)}^T). \end{aligned}$$
(7)

Matrices \(\textbf{W}_1 \in \mathbb R^{(2k_{ts}+k_r) \times w}\) and \(\textbf{W}_2 \in \mathbb R^{1 \times w}\) represent the weights of the first and second layers, respectively; w is the number of hidden neurons and Sigmoid() is the activation function.
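As a concrete illustration, the forward pass of Eq. (7) can be sketched in NumPy as below. The value \(k_r=10\) is hypothetical (the paper reports \(k_{ts}=20\), \(p=30\), \(w=20\) but not \(k_r\)), and random values stand in for the learned latent vectors and weights; this is a sketch of the mapping, not the trained model.

```python
import numpy as np

def f_nn(z, W1, W2):
    """Feedforward estimate of one OD flow from a triple [l_o; r_od; l_d].

    z  : (2*k_ts + k_r,) concatenated triple vector
    W1 : (2*k_ts + k_r, w) first-layer weights
    W2 : (w,)              second-layer weights
    """
    hidden = 1.0 / (1.0 + np.exp(-(z @ W1)))  # sigmoid activation, shape (w,)
    return W2 @ hidden                        # scalar estimated flow S_od

# Illustrative dimensions; k_r is an assumed value.
k_ts, k_r, p, w = 20, 10, 30, 20
rng = np.random.default_rng(0)
l_o, l_d = rng.random(k_ts), rng.random(k_ts)
H, R = rng.random((1, p)), rng.random((p, k_r))
r_od = (H @ R).ravel()                        # relation latent vector r_od = H R
z = np.concatenate([l_o, r_od, l_d])          # triple [l_o; r_od; l_d]
W1, W2 = rng.random((2 * k_ts + k_r, w)), rng.random(w)
print(f_nn(z, W1, W2))
```

Note that with \(\textbf{z}\) stored as a flat vector, the transpose in Eq. (7) reduces to an ordinary dot product.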

Similar to Eq. (6), by considering both temporal and spatial features, i.e., Eqs. (4) and (7), the stations’ latent representation \(\textbf{L}_{ts}\), and relation \(\textbf{R}\) can be learned by minimizing the following objective function:

$$\begin{aligned} \begin{aligned} \min _{\textbf{L}_{ts},\textbf{R},\textbf{H},\textbf{Q}\ge 0}&\sum \limits _{ \forall (\textbf{l}_o,\textbf{r}_{od},\textbf{l}_d) \in \textbf{S}}{(S_{od} -f_{NN}(\textbf{z}))^{2}}+\left\| \textbf{T} - \textbf{L}_{ts} \textbf{Q}\right\| ^{2} \\&+\lambda \left[ \left\| \textbf{L}_{ts}\right\| ^{2}+\left\| \textbf{H}\right\| ^{2}+\Vert \textbf{R}\Vert ^{2}+\Vert \textbf{Q}\Vert ^{2}\right] . \end{aligned} \end{aligned}$$
(8)

The stations’ latent spatiotemporal representation \(\textbf{L}_{ts}\) is shared in the first and second terms and constrained by the two terms simultaneously. The parameter \(\lambda \) is the regularization rate. The last term is the regularization for penalizing the norm of \(\textbf{L}_{ts}\), \(\textbf{H}\), \(\textbf{Q}\), and \(\textbf{R}\). After this step, the MKG can be represented as \(\textbf{G}(\textbf{L}_{ts}, \textbf{R}, \textbf{L}_{ts}, t)\).

3.4 User-station attention matrix inference

As mentioned previously, the user-station attention inference problem should comprehensively consider both user similarity and station-station relationships, i.e., Eqs. (3) and (8). Therefore, the user-station attention matrix can be inferred by optimizing the following objective function:

$$\begin{aligned} \begin{aligned} \min _{\textbf{U},\textbf{L}_{ts},\textbf{R},\textbf{H},\textbf{Q}\ge 0}&\left\| \textbf{X} - \textbf{U} \textbf{L}_{ts}^{T} \right\| ^{2}+ \sum \limits _{ \forall (\textbf{l}_o,\textbf{r}_{od},\textbf{l}_d) \in \textbf{S}}{(S_{od} \!-\!f_{NN}(\textbf{z}))^{2}} \\&+\!\left\| \textbf{T} - \textbf{L}_{ts} \textbf{Q}\right\| ^{2} \\&+\!\lambda \left[ \left\| \textbf{U}\right\| ^{2}\!+\!\left\| \textbf{L}_{ts}\right\| ^{2}\!+\!\left\| \textbf{H}\right\| ^{2}\!+\!\Vert \textbf{R}\Vert ^{2}\!+\!\Vert \textbf{Q}\Vert ^{2}\right] . \end{aligned} \end{aligned}$$
(9)

The first term obtains the users’ latent representation capturing the (implicit) user-similarity features. The second and third terms capture the stations’ spatiotemporal latent patterns, while the station-station relations capture the spatiotemporal travel-pattern correlations between the two stations. The parameter \(\lambda \) is the regularization rate, and the last term penalizes the norms of \(\textbf{U}\), \(\textbf{L}_{ts}\), \(\textbf{H}\), \(\textbf{Q}\), and \(\textbf{R}\). Therefore, the full user-station attention matrix \(\textbf{Y}\) is inferred by capturing both user similarity and station-station relations using the MKG.

3.5 Solution algorithm

To solve the optimization problem in Eq. (9), we develop an iterative user-station attention matrix inference algorithm based on the gradient descent algorithm. Algorithm 1 presents the pseudo-code of the solution algorithm.

[Algorithm 1 pseudo-code]
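A minimal sketch of such an iterative scheme is given below, assuming projected gradient descent and keeping only the first and third terms of Eq. (9) plus the regularizers; the NN-based OD-flow term (and hence \(\textbf{R}\) and \(\textbf{H}\)) is omitted for brevity. It illustrates the update structure, not the paper's exact algorithm.

```python
import numpy as np

def infer_attention(X, T, k_ts=20, lam=0.01, alpha=1e-5, iters=100, seed=0):
    """Projected gradient descent on a simplified version of Eq. (9).

    X : (m, n) observed user-station visit counts (normalized)
    T : (n, q) stations' temporal feature matrix
    Returns the inferred full attention matrix Y_hat = U @ L.T.
    """
    m, n = X.shape
    q = T.shape[1]
    rng = np.random.default_rng(seed)
    U = rng.random((m, k_ts))      # users' latent representation
    L = rng.random((n, k_ts))      # stations' latent representation L_ts
    Q = rng.random((k_ts, q))
    for _ in range(iters):
        E1 = X - U @ L.T           # residual of the attention term
        E2 = T - L @ Q             # residual of the temporal term
        gU = -2 * E1 @ L + 2 * lam * U
        gL = -2 * E1.T @ U - 2 * E2 @ Q.T + 2 * lam * L
        gQ = -2 * L.T @ E2 + 2 * lam * Q
        # projection onto the nonnegative orthant enforces U, L, Q >= 0
        U = np.maximum(U - alpha * gU, 0)
        L = np.maximum(L - alpha * gL, 0)
        Q = np.maximum(Q - alpha * gQ, 0)
    return U @ L.T
```

The step size `alpha` and regularization `lam` follow the hyper-parameter values reported in Sect. 4.2.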

4 Case study

4.1 Data preparation

To validate the proposed method, both synthetic data and real-world smart card data from the Hong Kong Mass Transit Railway (MTR) are used. MTR is the major public transport network of Hong Kong, serving the urbanized areas of Hong Kong Island, Kowloon, and the New Territories. The smart card data record trip transaction information, including tap-in/out stations and times. The data used cover January \(1^{st}\) to March \(31^{st}\), 2018 (86,476 users and 19 million trips). Based on the characteristics of the actual data, the study considers the morning peak (MP, 07:00-10:00) and off-peak (OP, 10:00-15:00) time intervals to construct the MKG, and randomly selects 1,000 individuals to infer the user-station attention.

The information utilized in our approach comprises time-dependent demand, OD flows, and user-station visit counts, all of which can be obtained from various mobility data sources such as smart card data, GPS data, and geotagged data. None of these data sources involves sensitive or private information. The user-station visit counts data are identified solely by trips (with anonymized card IDs or none at all) and do not contain any personally identifiable information, such as name, address, or email. Table 2 shows the structure of the AFC database used in the study.

Table 2 Example of AFC data structure in database

Since the observed user-station visit counts data do not reflect the real user-station attention, the real user-station attention cannot be obtained from historical observations. To validate the proposed method, we use synthetic user-station visit counts data generated from a pre-set ‘real’ user-station attention matrix. Algorithm 2 illustrates the process of generating the synthetic user-station visit counts matrix. It takes as inputs the pre-set ‘real’ user-station attention matrix \(\textbf{Y}\), the number of stations n, the number of users m, the number of iterations (days) d, and the deviation \(\sigma \), and outputs the simulated ‘observed’ user-station visit counts matrix \(\textbf{X}^s\). To keep the same travel pattern as the real-world dataset, the pre-set ‘real’ user-station attention matrix \(\textbf{Y}\) is set as the normalized observed user-station visit counts matrix \(\textbf{X}\). Each user is simulated for 1000 days (i.e., \(d =1000\)) and the deviation \(\sigma \) (e.g., \( \sigma =0.01\)) is used to simulate the random variation of users’ visits on different days.

[Algorithm 2 pseudo-code]
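The generation step can be sketched as follows. Since Algorithm 2's pseudo-code is not reproduced here, the per-day sampling rule (attention perturbed by Gaussian noise with deviation \(\sigma\), then used as a daily visit probability) is an assumption consistent with the description above.

```python
import numpy as np

def simulate_visit_counts(Y, d=1000, sigma=0.01, seed=0):
    """Generate a synthetic 'observed' visit-count matrix X_s from a
    pre-set 'real' attention matrix Y (sketch of Algorithm 2).

    Y     : (m, n) pre-set attention, normalized to [0, 1]
    d     : number of simulated days
    sigma : deviation of the daily random variation
    """
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    Xs = np.zeros((m, n))
    for _ in range(d):
        # each day, attention is perturbed by Gaussian noise with deviation sigma
        daily = np.maximum(Y + rng.normal(0.0, sigma, size=(m, n)), 0.0)
        # a visit is recorded with probability given by the perturbed attention
        Xs += rng.random((m, n)) < daily
    return Xs
```

Normalizing `Xs` row-wise then yields the simulated attention matrix compared in Fig. 4.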

To visually check the quality of the synthetic data, we normalize the simulated visit counts matrix to obtain a simulated attention matrix. We expect the simulated attention matrix to retain the structure of the pre-set real attention matrix while also exhibiting certain deviations. Fig. 4 compares the pre-set (Fig. 4(a)) and simulated (Fig. 4(b)) user-station attention matrices, and their difference (Fig. 4(c)), for 200 users. The difference is calculated by subtracting the pre-set from the simulated matrix. The result shows that the simulated user-station visit counts matrix retains the travel pattern of the real-world dataset while containing certain variations, which reasonably represents the studied user-station attention inference problem in practice.

Fig. 4

Heat map comparison of pre-set ‘real’ and simulated user-station attention matrix

4.2 Validation using synthetic data

The pre-set user-station attention matrix \(\textbf{Y}\) is considered the ground-truth attention. The problem is to infer the user-station attention matrix \(\hat{\textbf{Y}}\) given the partial simulated user-station visit counts matrix based on the MKG. The model performance is validated by comparing the inferred user-station attention matrix with the pre-set one. Two performance metrics are used for synthetic-data validation: 1) mean absolute error (MAE) and 2) root mean squared error (RMSE), defined as:

$$\begin{aligned} MAE=\frac{\sum \limits _{i,j}{\mid y_{ij} -\hat{y}_{ij} \mid }}{N}. \end{aligned}$$
(10)
$$\begin{aligned} RMSE=\sqrt{\frac{\sum \limits _{i,j}{ (y_{ij} -\hat{y}_{ij})^2}}{N}}. \end{aligned}$$
(11)

\(y_{ij}\) is the ground-truth user-station attention, \(\hat{y}_{ij}\) is the inferred user-station attention, and N denotes the number of user-station pairs in the network.
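In code, Eqs. (10) and (11) amount to elementwise averages over all user-station pairs:

```python
import numpy as np

def mae(Y, Y_hat):
    """Mean absolute error over all user-station pairs (Eq. (10))."""
    return np.abs(Y - Y_hat).mean()

def rmse(Y, Y_hat):
    """Root mean squared error over all user-station pairs (Eq. (11))."""
    return np.sqrt(((Y - Y_hat) ** 2).mean())
```

Both take the ground-truth and inferred attention matrices as 2-D arrays of the same shape.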

The proposed method is validated under both sparse and dense training-data settings, i.e., 10% and 30% of the positive-value cells (trustworthy data) are selected as the training dataset. Moreover, the study compares the model performance with that of its variants (MKG capturing different information). The baseline models are as follows:

  • US_None. The model captures the user similarity information (i.e., Eq. (3)) without MKG (not capturing station relations).

  • US_Lt. The model captures user similarity (i.e., Eq. (2)) and stations’ temporal latent features (i.e., Eq. (4)) in representing MKG relations.

  • US_[Lt;Ls]. The model captures the user similarity (i.e., Eq. (2)) and the stations’ temporal and spatial information in representing MKG relations using a rule-based approach (i.e., Eq. (6)). That is, the stations’ latent representation is formed by concatenating the temporal latent matrix \(\textbf{L}_t\) and the spatial latent matrix \(\textbf{L}_s\).

  • US_[MKG]. The model considers both user similarity and station-station relations modeled using MKG (i.e., Eq. (9)). The stations’ latent representation simultaneously captures the spatiotemporal latent features using the NN-based nonlinear matrix decomposition.

We use a grid-search method to find the optimal parameter settings of Algorithm 1 for the experimental analysis. The optimal hyper-parameter values are: \(k= 20\), \(k_t=10\), \(k_s=10\), \(k_{ts}=20\), \(p=30\), \(w=20\), \(\lambda = 0.01\), \(\alpha =10^{-5}\), \(N=100\).
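Such a grid search can be sketched as follows; `run_inference` is a hypothetical stand-in for one training run of Algorithm 1 returning a validation MAE, and the candidate grids shown are illustrative, not the ones actually searched.

```python
import itertools

# Illustrative candidate values for a subset of Algorithm 1's hyper-parameters.
GRID = {
    "k_ts": [10, 20, 40],
    "p": [10, 30, 50],
    "w": [10, 20, 40],
    "lam": [0.001, 0.01, 0.1],
}

def grid_search(run_inference, grid=GRID):
    """Exhaustively evaluate all configurations and keep the lowest MAE."""
    best_cfg, best_mae = None, float("inf")
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        val_mae = run_inference(**cfg)   # one full training run per configuration
        if val_mae < best_mae:
            best_cfg, best_mae = cfg, val_mae
    return best_cfg, best_mae
```

Each configuration requires a full run of Algorithm 1, so in practice the grids are kept coarse.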

Figure 5 compares the baseline models based on MAE and RMSE. In general, incorporating the MKG significantly improves the inference performance over the US_None model for both the morning peak and off-peak periods. The US_MKG model performs best, followed by the US_[Lt;Ls] and US_Lt models. Compared with US_None, US_MKG reduces the MAE and RMSE by about 35% and 16%, respectively, in both the morning peak and off-peak. Compared to the morning peak, the performance in the off-peak is worse for all models, because morning-peak trips (e.g., commuting) tend to be more regular. Furthermore, the inference performance is very close under the two training-data settings with different density levels, which indicates the model’s potential for dealing with sparse user-station visit count matrices when inferring the true attention.

Fig. 5

Performance comparison of baselines. (a) MAE in the morning peak; (b) RMSE in the morning peak; (c) MAE in the off-peak; (d) RMSE in the off-peak

Fig. 6

Performance comparison of different training sets in the morning peak and off-peak. (a) MAE in the morning peak; (b) RMSE in the morning peak; (c) MAE in the off-peak; (d) RMSE in the off-peak

To further illustrate the impact of different data densities and training datasets on the model, a sensitivity analysis is performed on the optimal model US_MKG. We compare the performance of models trained using only the positive entries (PG, without zero entries) and the full entries (FG, including zero entries) under different training-density settings (i.e., 10%, 30%, and 100%). Figure 6 shows the performance comparison. The models using only positive entries outperform those using the full entries in both the morning-peak and off-peak scenarios, indicating that reliable user-station visit count information benefits the inference of true user-station attention. In addition, model performance is positively correlated with the number of training samples. Off-peak performance is slightly worse than morning-peak performance, as users’ travel characteristics are more regular in the morning peak.

Fig. 7

Performance comparison of baselines based on recall@K (K=1, 2, …, 10). (a) Morning peak and dense setting = 30%; (b) Morning peak and dense setting = 10%; (c) Off-peak and dense setting = 30%; (d) Off-peak and dense setting = 10%

4.3 Validation using actual data

We use actual data to further validate the proposed model by comparing it with the baseline models. The ‘observed’ user-station attention \(\textbf{Y}\) (the normalized observed user-station visit counts matrix \(\textbf{X}\)) is considered the ground-truth attention. For the actual data, recall is used to compare performance. Since zeros in the observed user-station visit counts matrix \(\textbf{X}\) represent unreliable data, the recall only considers the positive values. For each user, recall@K is defined as:

$$\begin{aligned} recall@K=\frac{TP@K}{TP@K+FN@K}. \end{aligned}$$
(12)

where TP@K is the number of stations that the user pays attention to (positive values) among the top K stations, and \(TP@K+FN@K\) is the total number of stations that the user pays attention to. For example, for a user i, we can get the inferred user-station attention list \({\{ \hat{y}_{i1},\hat{y}_{i2}, \hat{y}_{i3},\dots , \hat{y}_{in}\} }\) and the corresponding ground-truth user-station attention list \({\{ y_{i1},y_{i2},y_{i3},\dots ,y_{in}\} }\). We can then form a candidate set \({\{(\hat{y}_{i1},y_{i1}), (\hat{y}_{i2},y_{i2} ),\dots ,(\hat{y}_{in},y_{in})\}}\), where \(\hat{y}_{ij}> 0\) and \(y_{ij}> 0\). When the length of the candidate set is greater than K, TP@K is K; otherwise, TP@K is the length of the candidate set. \(TP@K+FN@K\) is the number of ground-truth user-station attention values greater than zero. For example, suppose \(K=2\), the length of the candidate set is 5, and there are 20 ground-truth user-station attention values greater than zero; then recall@K is 0.1. The recall for the entire matrix is the average recall over all users.
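A minimal sketch of this metric is given below. It follows the standard recall@K convention of counting ground-truth positives among the K highest inferred scores, which slightly simplifies the candidate-set bookkeeping described above.

```python
import numpy as np

def recall_at_k(Y, Y_hat, K):
    """Average recall@K over users (cf. Eq. (12)).

    Only positive ground-truth entries are treated as trustworthy,
    consistent with Sect. 4.3.
    """
    recalls = []
    for y, y_hat in zip(Y, Y_hat):
        pos = y > 0                      # stations the user actually attends to
        if not pos.any():
            continue                     # skip users with no positive entries
        top_k = np.argsort(-y_hat)[:K]   # indices of the K highest inferred scores
        tp = np.sum(pos[top_k])          # positives recovered in the top K
        recalls.append(tp / pos.sum())
    return float(np.mean(recalls))
```

`Y` and `Y_hat` are the ground-truth and inferred attention matrices with one row per user.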

Fig. 8

Example of a user’s attention to stations (Orange square: visited stations; green square: inferred stations with attention)

The experiment compares the proposed model with the baseline models introduced in the previous section. Similarly, it compares the performance of the framework under sparse (insufficient) and dense (sufficient) training settings, i.e., 10% and 30% of the positive entries are selected as the training dataset. Figure 7 shows the experimental results. Figure 7(a) and (c) show the performance for the morning peak and off-peak under the dense setting (i.e., 30%), whereas Fig. 7(b) and (d) show the results under the sparse setting (i.e., 10%). In general, the results indicate that the MKG greatly improves the model’s performance in inferring true attention. The US_MKG model consistently outperforms the others under different settings. Since US_None does not introduce any MKG information, it performs worst in both the morning peak and off-peak, especially under the sparse setting. Compared with US_Lt and US_[Lt; Ls], which separately consider the temporal and spatial information, US_MKG simultaneously considers the stations’ latent representations and their relations, which more effectively captures hidden information. Similarly, we find that performance in the morning peak is higher than in the off-peak, as travel is more concentrated and regular in the morning peak.

4.4 Implications and discussions

Figure 8 visualizes a user’s attention to different stations inferred using the proposed method. The orange squares represent stations visited in the user’s historical trajectories. The green squares denote the (previously unobserved) stations that the user may also be interested in visiting. The user’s origin is ‘Lo Wu’, which connects to the railway station ‘Luohu’ in Shenzhen. The user visited the stations ‘Wong Tai Sin’, ‘Mong Kok’, and ‘Central’ in the central business areas with places of interest and shopping malls, so the user may be a shopper whose travel activity (station-station relationship) is likely to be ‘shopping’ or ‘purchasing’. The attention stations inferred by the model are ‘Hung Hom’, ‘Tsim Sha Tsui’, and ‘Causeway Bay’, which are close to the observed visited stations and are also places of interest and shopping areas. Therefore, the inferred attention stations are reasonable, and the user may be interested in visiting these areas for shopping activities.

Note that the studied user-station attention problem is mathematically similar to the user-item rating problem in recommendation systems. The two problems share a commonality in that they both utilize user similarity and relationships to infer items or stations that have no prior interaction history. However, there are also differences between the two problems. For instance, the context of travel behavior in transportation is different from that of item recommendations. For user-station attention, the user’s travel pattern remains relatively stable within a certain period and geographic area. In contrast, the user’s choice of items is highly changeable and unpredictable, as it is influenced by various factors, including personal preferences, current needs, and external stimuli. For example, while users may frequently travel to the same stations for work or school, their shopping habits may vary significantly from day to day, with some days focused on groceries, while others may involve purchases of clothing, electronics, or other goods.

Analyzing user-station attention can provide insights into the underlying factors driving user behavior within public transport systems, which can be leveraged to develop predictive models of user behavior and demand. For instance, historical data on user-station attention can be used to forecast the most heavily used stations during different times of the day or week, allowing for optimized scheduling and routing of public transport services. Additionally, user-station attention analysis can facilitate the development of personalized location recommendation systems based on a user’s travel history and preferences, such as recommending the best station to start a journey based on the destination, travel time, and other relevant factors. Moreover, the station-station relationship in MKG is also of great significance for other application scenarios. For example, in precision marketing, it can assist to locate more target groups and place targeted advertisements based on the main travel activities of stations/locations. The knowledge obtained from the investigation of user-station attention can be utilized to enhance the efficiency, efficacy, and overall user experience of public transportation systems, rendering them more appealing and user-friendly while contributing to promoting sustainable urban mobility.

For the practical implementation of our approach, the absence of privacy or sensitive data previously discussed is a prerequisite. Furthermore, the accuracy and reliability of the data are critical to ensure the validity and practicality of our model in real-world scenarios. Therefore, data should be collected and processed in a way that mitigates the risk of bias. Moreover, for greater generalizability, our proposed approach should be applied to datasets and contexts beyond those used in the initial analysis. Additionally, the involvement of stakeholders such as transportation authorities, policymakers, and users is essential in the design and implementation to ensure that our approach addresses real-world needs and is implemented effectively.

5 Conclusion

This paper introduces the user-station attention inference problem from partially observed user-station visit counts data in public transport. We develop a matrix decomposition method that simultaneously captures user similarity and station-station relationships using a knowledge graph. We also propose a neural network-based nonlinear decomposition approach to effectively extract the MKG relations from smart card data, capturing the latent spatiotemporal travel dependencies.

The experiments using both synthetic data and smart card data verify the validity of the proposed model by comparing it with benchmark approaches. The results illustrate the value of the knowledge graph in contributing to the inference of user-station attention. Compared with the model without the MKG, the model with the MKG improves performance by about 35% in MAE and 16% in RMSE. The results also show that it is important to select the positive observations when training the model (rather than using all observations, including zero values). The model is less sensitive to the density settings (10%, 30%, and 100% of samples used for training) provided only positive observations are used. Future work will focus on explaining the inferred user-station attention by exploring more context information (e.g., points of interest, station features), and on examining its value for prediction applications such as individual mobility prediction.