1 Introduction

Over the past decade, advances in technology have led to the emergence of personal electronic devices, such as wearable devices, smartphones, and other sensors. The widespread adoption of these devices has revolutionized our ability to capture and document moments from our daily lives, providing valuable data that offer insights into activities, experiences, and behaviors. These devices allow the acquisition of images, audio, locations, physical activity, and physiological signals, among other data, and these personal data can be applied in a wide range of applications, such as information retrieval, lifelogging, health and disease monitoring, and behavior analysis, to name just a few [1, 2].

Among the several types of data collected, global positioning system (GPS) data plays a crucial role in understanding a person’s life [3]. However, the huge amount of spatiotemporal data generated by these sensors, such as geographic positions (latitude and longitude) and timestamps, poses an important challenge for data mining and location-based applications, and has created numerous opportunities for researchers across diverse fields of study [4]. Furthermore, GPS data can be harnessed to analyze travel patterns, understand mobility behaviors, and even detect the mode of transportation used by individuals.

In several self-monitoring applications explored in the literature [2, 5], such as lifelogging applications, raw GPS data does not directly provide specific location names or information about the transport mode being used by an individual. This highlights the necessity of applying data processing and mining algorithms to extract valuable insights from the collected data [6]. In lifelogging systems, location names are obtained by clustering GPS data [7] and querying geolocation APIs. However, accurate and reliable transport mode detection in GPS trajectory data remains a challenge, as differentiating between walking, cycling, or driving exclusively from GPS data is complex, which motivates the development of more sophisticated algorithms. Even the most recently proposed approaches require predefined temporal intervals or trajectory sizes [8, 9], limiting their adaptability to real-world scenarios characterized by varying trajectory lengths and inconsistent data intervals. This paper introduces a new approach for transportation mode identification that eliminates the need for such predefined constraints, further enhancing the usefulness of location and GPS data, facilitating a comprehensive understanding of an individual’s life, and enabling the development of innovative applications.

The key challenge in GPS trajectory classification lies in accurately recognizing the underlying transportation mode for each GPS point, segment, or trajectory. A GPS trajectory represents the movement of an object, consisting of sampled positions with timestamps and related movement information such as speed and direction [3]. Different transportation modes exhibit distinct spatial characteristics, making it necessary to employ classification techniques that can efficiently capture these differences [10].

In this paper, a novel transportation mode identification approach is proposed, based on a methodology that transforms non-image data into images [11] and takes advantage of well-known computer vision neural network architectures [12] to predict transport modes such as walk, car, bike, bus, and train. The approach involves extracting hand-crafted features from raw GPS trajectories, which contain multiple GPS points describing the fundamental motion characteristics of a moving object. Notably, this process does not require defining fixed-size lengths for trajectories, removing outlier points, or cleaning trajectories. The approach proves to be an effective method, achieving state-of-the-art results on the GPS dataset collected by the Microsoft GeoLife project [13,14,15].

The rest of the paper is organized as follows. Section 2 provides a comprehensive review of related work in transportation mode identification based on GPS data. Section 3 presents the proposed approach. Section 4 describes the experimental setup and presents the results and analysis. Finally, Sect. 5 concludes the paper and outlines potential avenues for future research.

2 Related work

Several efforts have been made to solve problems in the transportation domain using trajectory data collected from sensors including GPS, GSM, or accelerometers, among others [4]. However, although these solutions share a common goal, determining the transport mode of a trajectory, such as walking or traveling by bus or train, they rely on different techniques and target different applications [4, 8, 9, 16,17,18,19,20,21].

Throughout the literature on transport mode identification, researchers have explored various approaches and techniques from the Artificial Intelligence (AI) research field, ranging from traditional machine learning algorithms [19, 22], such as Support Vector Machines and Decision Trees, to more advanced Deep Learning algorithms [6, 8, 20], such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks.

Previous research [13, 16, 23] has focused on feature construction and mode identification as two essential steps in transportation mode identification. Feature construction involves extracting dynamic state attributes from GPS trajectories, such as velocity, acceleration, and other relevant factors, and organizing them into a suitable feature structure. Learning-based models, particularly those leveraging deep learning techniques, have emerged as powerful tools for transportation mode identification due to their ability to automatically learn complex patterns from raw data [10].

Recent years have witnessed remarkable progress in the field of artificial intelligence, driven by innovations in deep learning and neural networks, the emergence of new and optimized hardware, and the exponential growth of generated data. These factors have collectively propelled the field forward, enabling significant leaps in our understanding and capabilities. Several approaches have been proposed to address the challenges of transportation mode identification using deep learning. Some studies have explored the use of hand-crafted features, such as maximum velocity and mean acceleration, among others [21]. Alternative approaches have leveraged deep learning models to automatically learn deep features from GPS trajectories. To feed a deep learning model with the features extracted from the GPS trajectory data, they need to be transformed into images [6, 9, 24]. Other approaches have used the GPS data directly, representing the trajectories as 2D image data structures and employing these trajectory images as input for deep learning [10, 17, 25].

To perform classification using traditional machine learning algorithms, features have to be extracted from GPS trajectory data. Zheng et al. [13, 22, 23] segmented the GPS trajectories and extracted features such as length, velocity, acceleration, and covariance, and explored more sophisticated features, like heading change rate, stop rate, and velocity change rate. The authors used decision trees, Bayesian networks, support vector machines, and conditional random fields as inference models to classify these features into four distinct transportation modes, namely walk, driving, bus, and bike. Moreover, a graph-based post-processing algorithm was used to improve model performance, with the decision tree achieving higher accuracy and better precision than the other methods. Wang et al. [26] also extracted several features related to distance, velocity, acceleration, and other advanced characteristics. In addition to decision trees, the authors explored the gradient boosting ensemble methods LightGBM (Light Gradient Boosting Machine) and XGBoost (eXtreme Gradient Boosting), improving the results and concluding that LightGBM performs better than the others on transport modes such as car, subway, and train, when outlier trajectories are filtered.

As GPS sensors are susceptible to capturing erroneous data, the quality of trajectory segments can be affected, leading to extracted features that negatively influence the performance of trained models. Etemad et al. [16] explored the effects of noise removal on transportation mode prediction. The authors also introduced new features, such as bearing rate. Throughout the literature, researchers have conducted experiments using traditional machine learning techniques, introducing new and more advanced features to increase model performance. Namdarpour et al. [21] proposed the selection and construction of features based on genetic programming, while Li et al. [19] used Geographic Information System (GIS) information to extract features capable of effectively differentiating transportation modes that are generally considered difficult to distinguish.

Building on these recent advances in artificial intelligence, deep learning architectures have also driven large accuracy gains in computer vision and natural language processing. In the domain of transport mode identification, the key challenge is to structure raw GPS data or extracted features into a format that is both acceptable to deep learning architectures and expressive enough to represent the fundamental motion characteristics of a moving object [6].

Dabiri et al. [6, 27] proposed an approach that compiles several features extracted from GPS data, including speed, acceleration, jerk, and bearing rate, into a multidimensional image and developed a Convolutional Neural Network (CNN) for high-level feature extraction and classification. Other authors have also explored hand-crafted features extracted from GPS data. Nawaz et al. [4] converted these features into multidimensional images and proposed a deep learning model combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to extract high-level features and capture sequential patterns in GPS and weather data. In [20], the authors also explored LSTM neural networks and introduced a mechanism based on the discrete wavelet transform to extract time-frequency domain features of the trajectories and improve classification accuracy. Meanwhile, Zhu et al. [24] investigated various time series augmentation methods and found that discrete wavelet transform and flip augmentations yielded the best results for CNN and LSTM models.

Other deep learning networks, such as convolutional autoencoders, have also emerged as effective approaches for transport mode identification based on hand-crafted features. Markos et al. [18] proposed an unsupervised deep learning approach using a convolutional autoencoder (CAE) and a clustering layer for transportation mode identification, and also incorporated mobility-related statistics as additional features. More recently, Zeng et al. [9] presented a sequence-based framework called trajectory-as-a-sequence (TaaS), utilizing a sequence-to-sequence (seq2seq) model consisting of a convolutional encoder (CE) and a recurrent conditional random field (RCRF). The model leveraged high-level trajectory features and context information to generate accurate travel mode label sequences. Additionally, bus-related features were designed to differentiate between high-speed travel modes, such as bus, car, and railway. Jiang et al. [28] proposed a Recurrent Neural Network (RNN) architecture that embeds features into another space and finally employs maxout gated recurrent units (GRUs) for further processing. However, due to possible inaccuracies caused by sensor-related issues, the authors employ a Hampel filter to identify and remove outliers in the feature space.

Several studies have explored the use of spatial characteristics by mapping GPS trajectories into 2D image data structures, called trajectory images, instead of extracting hand-crafted features from the GPS data [10, 17, 25]. These images are influenced by two factors: the spatial extent of the cropped trajectory and the size of the projected image. By employing different combinations of spatial range and image size, it becomes possible to capture diverse levels of trajectory detail, commonly referred to as spatial scale [10].

These research works highlight the advancements in artificial intelligence techniques for transportation mode identification and provide valuable insights into feature extraction from GPS trajectories. Despite the progress made in this field, there are still limitations that need to be addressed. One such limitation is the practical application to real-world cases, where trajectories can vary in the number of points or in duration, depending on the type of transport. Throughout the literature, a fixed-size strategy has been commonly adopted, based either on temporal intervals or on the number of GPS points, resulting in trajectory segments with fixed dimensions for transport mode classification. This kind of approach may not be the most suitable for real-world scenarios, such as lifelogging, due to inconsistent sensor data and significant variations in the number and timing of points along the trajectories. As a result, there is a need for more flexible and adaptive methods that can handle the dynamic nature of GPS data in diverse real-world applications.

Based on the research conducted for this paper and the insights gained in this field, a novel method is proposed for transportation mode identification that does not require a fixed time or number of points for a GPS trajectory. While existing works in the literature have represented GPS trajectories as images, these images are typically generated either by directly projecting the trajectory into a 2D image or by extracting a few features and generating a temporal signal for each sequential feature, resulting in a multidimensional image comprising all these signals as input for deep learning models. In contrast, our approach introduces a new methodology that extracts a multitude of features from GPS trajectories and constructs a meaningful image representation for each trajectory by mapping these features onto image pixels.

3 Proposed approach

In this paper, a deep learning approach is proposed to tackle the problem of transport mode identification using a vision transformer (ViT) [29]-based architecture, the DeepViT [12]. This approach combines a powerful deep learning algorithm with hand-crafted features extracted from GPS data. Inspired by several works in the literature [4, 6, 8, 16], basic features are extracted from the raw GPS data, such as velocity, distance, acceleration, jerk, and bearing. To increase the number of features and improve the model, statistical and more advanced features are also derived from the basic features, such as averages and maximums, among others. To further enhance the representation of these features, the DeepInsight methodology [11] was explored, which transforms non-image samples into a well-organized image form.

By taking advantage of ViT models, originally designed for image classification tasks, this approach benefits from their ability to capture complex spatial dependencies and learn representations from the extracted features. The transformation of trajectory data features into images enables the ViT model to effectively use its attention mechanisms and learn representations from these images.

Overall, the proposed approach combines the strengths of deep learning and hand-crafted feature extraction to achieve improved accuracy in transportation mode identification. Figure 1 illustrates the overall process of the proposed approach. The following sections describe the proposed approach, including the implementation of the DeepViT model [12], the extraction of hand-crafted features, and the transformation of these features into image representations using DeepInsight [11].

Fig. 1 Overall representation of the proposed approach, from the raw GPS trajectory data to the classification of transportation modes

3.1 Feature extraction

Most approaches in the transport mode identification literature segment a single-mode trajectory into multiple fixed-size time intervals before extracting features. However, such approaches may not generalize well to settings such as lifelogging systems, given the inherent variability of GPS data. Segmentation into fixed-size time intervals or point counts assumes a constant and uniform sampling rate, which cannot be guaranteed in real-world scenarios: trajectories may be shorter than the segment size the model was trained on, or may not contain enough GPS data within a fixed time interval.

In this work, given a GPS trajectory \(T_i\), where i indexes the trajectories in the dataset, several point-level motion features are extracted between each GPS point \(p_j\) and its successor \(p_{j+1}\), where j indexes the points of the trajectory. These features are the haversine distance \(h_{p_j}\), duration \(d_{p_j}\), velocity \(V_{p_j}\), acceleration \(A_{p_j}\), jerk \(J_{p_j}\), bearing \(B_{p_j}\), and bearing rate \(Br_{p_j}\), computed as follows:

$$\begin{aligned} a&= \sin ^2{\left( \frac{\Phi _{p_{j+1}} - \Phi _{p_{j}}}{2}\right) } \\ b&= \cos {\left( \Phi _{p_{j}}\right) } \cdot \cos {\left( \Phi _{p_{j+1}}\right) } \cdot \sin ^2{\left( \frac{\lambda _{p_{j+1}} - \lambda _{p_{j}}}{2}\right) } \\ h_{p_j}&= 2 R \sin ^{-1}{\left( \sqrt{a + b}\right) } \\ d_{p_j}&= p_{j+1}(t) - p_{j}(t) \\ V_{p_j}&= \frac{h_{p_j}}{d_{p_j}} \\ A_{p_j}&= \frac{V_{p_{j+1}} - V_{p_j}}{d_{p_j}} \\ J_{p_j}&= \frac{A_{p_{j+1}} - A_{p_j}}{d_{p_j}} \\ x&= \sin {\left( \lambda _{p_{j+1}} - \lambda _{p_{j}}\right) } \cdot \cos {\left( \Phi _{p_{j+1}}\right) } \\ y&= \cos {\left( \Phi _{p_{j}}\right) } \cdot \sin {\left( \Phi _{p_{j+1}}\right) } - \sin {\left( \Phi _{p_{j}}\right) } \cdot \cos {\left( \Phi _{p_{j+1}}\right) } \cdot \cos {\left( \lambda _{p_{j+1}} - \lambda _{p_{j}}\right) } \\ B_{p_j}&= \tan ^{-1}\left( x, y\right) \\ Br_{p_j}&= \frac{B_{p_{j+1}} - B_{p_j}}{d_{p_j}} \end{aligned}$$

where R is the radius of the Earth, and \(\Phi _p\) and \(\lambda _p\) are the latitude and longitude of a GPS point, respectively.
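To make these definitions concrete, the following Python sketch computes the point-level features for one trajectory given as latitude, longitude, and timestamp arrays. The function name and array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

R_EARTH = 6371000.0  # mean Earth radius in meters

def point_level_features(lat, lon, t):
    """Compute haversine distance, duration, velocity, acceleration,
    jerk, bearing, and bearing rate between consecutive GPS points.
    lat and lon are in degrees, t in seconds; all arrays share length m."""
    phi, lam = np.radians(lat), np.radians(lon)
    dphi, dlam = np.diff(phi), np.diff(lam)

    # haversine distance h between p_j and p_{j+1}
    a = np.sin(dphi / 2.0) ** 2
    b = np.cos(phi[:-1]) * np.cos(phi[1:]) * np.sin(dlam / 2.0) ** 2
    h = 2.0 * R_EARTH * np.arcsin(np.sqrt(a + b))

    d = np.diff(t).astype(float)        # duration between points
    v = h / d                           # velocity
    acc = np.diff(v) / d[:-1]           # acceleration
    jerk = np.diff(acc) / d[:-2]        # jerk

    # bearing via atan2(x, y), as in the equations above
    x = np.sin(dlam) * np.cos(phi[1:])
    y = np.cos(phi[:-1]) * np.sin(phi[1:]) \
        - np.sin(phi[:-1]) * np.cos(phi[1:]) * np.cos(dlam)
    bearing = np.arctan2(x, y)
    bearing_rate = np.diff(bearing) / d[:-1]

    return h, d, v, acc, jerk, bearing, bearing_rate
```

Note that each differencing step shortens the array by one element, so a trajectory needs at least a few points for the higher-order features to be defined, consistent with the minimum-length filtering discussed in Sect. 4.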

In this work, transport mode identification is performed on GPS trajectories without a fixed duration or number of points, by extracting several segment-level features derived from the basic features. In the first stage, several statistical features are computed from the basic features of each trajectory, such as the minimum, maximum, mean, median, standard deviation, and percentiles. Inspired by [22], the top-3 values of each basic feature are selected as segment-level features, since points with positional errors can produce abnormal maximum values for the trajectories. Moreover, following [19], the stop ratio of each trajectory is computed, considering any velocity between two points below 0.6 m/s as a stop. Additionally, the straight-line distance from the first point of the trajectory to the last one is calculated, along with the straight ratio [19] and the total area of the polygon generated by the trajectory. Furthermore, the Time Series Feature Extraction Library (TSFEL) [30] was used to extract more advanced features from the temporal, statistical, and spectral domains based on each trajectory’s basic features. In total, 399 features were extracted from the GPS trajectories, which are used to generate images for each trajectory.
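As an illustration, a minimal sketch of this segment-level aggregation is given below for the velocity signal; the 0.6 m/s stop threshold follows the text, while the function structure and percentile choices are assumptions.

```python
import numpy as np

def segment_level_features(v):
    """Aggregate a trajectory's point-level velocities into segment-level
    statistics; the same pattern applies to acceleration, jerk, etc."""
    feats = {
        "min": v.min(), "max": v.max(), "mean": v.mean(),
        "median": np.median(v), "std": v.std(),
        "p25": np.percentile(v, 25), "p75": np.percentile(v, 75),
    }
    # top-3 values instead of a single maximum, to dampen GPS outliers [22]
    top3 = np.sort(v)[-3:][::-1]
    feats.update({f"top{i + 1}": val for i, val in enumerate(top3)})
    # stop ratio: fraction of point pairs slower than 0.6 m/s [19]
    feats["stop_ratio"] = float((v < 0.6).mean())
    return feats

# Additional temporal/statistical/spectral features via TSFEL [30]:
#   import tsfel
#   cfg = tsfel.get_features_by_domain()
#   extra = tsfel.time_series_features_extractor(cfg, v)
```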

3.2 DeepInsight

Sharma et al. [11] proposed a methodology to transform non-image data into images and presented several experimental results showing the usefulness of DeepInsight for non-image data, such as gene-expression, speech, text, and artificial datasets, enabling such data to be fed into CNNs for classification purposes.

In this work, DeepInsight is used to transform the features extracted from GPS trajectory data into images. As a first step, these features are used to determine their locations in the images. Following the DeepInsight methodology [11], the training set, defined as \(X = \{x_1, x_2, \dots , x_n\}\) where n is the number of samples, is used to find the locations of the features, \(F = \{f_1, f_2, \dots , f_k\}\), where k is the number of features extracted from each GPS trajectory. Essentially, F can be obtained by transposing X.

Furthermore, a nonlinear dimensionality reduction technique, t-distributed stochastic neighbor embedding (t-SNE) [31], is applied to this feature set F, resulting in a 2D plane where each point represents a feature’s location; i.e., these points only define the locations of the features, not the feature values themselves. To ensure compatibility with deep learning architectures, the convex hull algorithm is used to find the smallest rectangle containing all points, followed by a rotation. Thereafter, the Cartesian coordinates are converted to pixels. Due to the pixel limitations of an image, the conversion involves averaging certain features so they fit within the image size. Consequently, the pixel frame represents the feature positions for each sample \(x_n\). In the next step, the feature values are mapped onto these pixel locations. To ensure optimal results, an appropriate image resolution should be selected to prevent multiple features from being assigned to the same pixel location.
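This location step can be sketched as follows, assuming scikit-learn's t-SNE and SciPy's convex hull. The minimum-area rectangle rotation of DeepInsight is replaced here by a simple axis-aligned rescale, so this is an approximation of the published pipeline, not a reimplementation.

```python
import numpy as np
from sklearn.manifold import TSNE
from scipy.spatial import ConvexHull

def feature_pixel_locations(X, image_size=64):
    """X is the (n_samples, n_features) training matrix; t-SNE runs on its
    transpose so that each *feature* receives a 2D location."""
    F = X.T                                    # one row per feature
    coords = TSNE(n_components=2, random_state=0).fit_transform(F)
    # convex hull vertices delimit the region where DeepInsight fits the
    # smallest rotated rectangle; an axis-aligned rescale stands in for it
    corners = coords[ConvexHull(coords).vertices]
    mins, maxs = corners.min(axis=0), corners.max(axis=0)
    pixels = ((coords - mins) / (maxs - mins) * (image_size - 1)).astype(int)
    return pixels                              # (n_features, 2) pixel coords

def to_image(sample, pixels, image_size=64):
    """Map one sample's feature values onto the shared pixel locations."""
    img = np.zeros((image_size, image_size))
    counts = np.zeros_like(img)
    for value, (px, py) in zip(sample, pixels):
        img[py, px] += value                   # features sharing a pixel
        counts[py, px] += 1                    # are averaged, as described
    return img / np.maximum(counts, 1)
```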

As the features have high variance, normalization is performed based on a log normalization and scaling procedure. The pixel values of an image are usually represented by values from 0 to 255 and can be normalized to the range 0 to 1. As mentioned in [11], the feature values must be normalized before applying the image transformation. In this work, norm-2 normalization was used to normalize the features extracted from the trajectories. In this normalization, the minimum value is adjusted for each feature or attribute, and then a global maximum is used on the logarithmic scale to place the feature values between 0 and 1 [11].
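One plausible reading of this procedure, with the +1 offset added as an assumption to keep the logarithm defined, is:

```python
import numpy as np

def norm2_log_normalize(X):
    """Per-feature minimum shift, then a single global maximum on the
    log scale maps all values into [0, 1], following the description in [11]."""
    shifted = X - X.min(axis=0) + 1.0   # assumption: +1 keeps log defined
    logged = np.log(shifted)
    return logged / logged.max()        # one global maximum across features
```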

3.3 Deep vision transformer

In this work, the deep vision transformer (DeepViT) architecture [12] was used to train a model on the images generated by the feature transformation. This architecture adopts the transformer, as in ViT [29], originally designed for natural language processing (NLP) tasks, to efficiently perform computer vision tasks by applying a standard transformer directly to images.

The main idea of this model is to split an image into patches and provide the sequence of linear embeddings of these patches as input to a transformer. Image patches are treated as a sequence of tokens (words), as in NLP. Instead of using convolutional layers like traditional CNNs, ViT models employ a series of self-attention mechanisms to capture the relationships between different tokens (image patches) in the image [29]. However, as identified in [12], the self-attention mechanism in the deeper layers of ViT models fails to learn effective concepts for representation learning and prevents the model from achieving the expected performance gains. Zhou et al. [12] introduced a simple and efficient attention mechanism, named re-attention, to improve the performance of ViT models. This mechanism facilitates information exchange among different attention heads and re-generates attention maps, increasing their diversity across different layers with minimal computation and memory cost.
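To make the mechanism concrete, the following PyTorch sketch shows one way re-attention can be implemented from the description in [12]; the module name, layer sizes, and the use of batch normalization over the head dimension are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Self-attention with a learnable head-mixing matrix that re-generates
    the attention maps before they weight the values, as described in [12]."""
    def __init__(self, dim, heads=8, dim_head=32):
        super().__init__()
        inner = heads * dim_head
        self.heads, self.scale = heads, dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, inner * 3, bias=False)
        # theta exchanges information across the attention heads
        self.reattn_weights = nn.Parameter(torch.randn(heads, heads))
        self.reattn_norm = nn.BatchNorm2d(heads)
        self.to_out = nn.Linear(inner, dim)

    def forward(self, x):
        b, n, _ = x.shape
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in qkv)

        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        # re-attention: mix the maps across heads, then normalize them
        attn = torch.einsum("hg,bgij->bhij", self.reattn_weights, attn)
        attn = self.reattn_norm(attn)

        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```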

4 Experiments and discussion

In this work, the GeoLife dataset [13,14,15] was used: a GPS trajectory dataset collected by Microsoft Research Asia in the GeoLife project from 182 users over a period of more than five years. Each GPS trajectory in this dataset is a sequence of time-stamped points containing latitude, longitude, and altitude information.

Most of the users in this dataset labeled their trajectories with the corresponding transportation modes, such as walk, bike, bus, car, and train, among others. These annotations provide the opportunity to train new deep learning models to predict the transportation modes of new GPS trajectory data. The GeoLife dataset contains a total of 14,718 annotated trajectories; however, only 9520 of them contain more than three GPS points and do not contain duplicates.

The selection of transport modes for experimental results varies among the mentioned literature. In this work, the five most common modes of transport identified throughout the literature were chosen, namely, walk, bike, car, bus, and train. Table 1 shows the number of GPS trajectories and points for these transportation modes of the GeoLife dataset. These selected trajectories form the basis of our experiments and analysis for transportation mode identification.

Table 1 Distribution of GPS trajectories and points among the transportation modes considered for experiments

4.1 Image construction

Once the features are extracted from the trajectories, they are used to find the feature locations (points in the Cartesian plane) in a 2D plane using the dimensionality reduction technique t-SNE. The convex hull algorithm is then applied to identify the smallest rectangle containing these points, as shown in Fig. 2. After rotating this rectangle, the feature values are mapped into the images using the pixel coordinates computed in the previous step. Figure 3 presents one image per class generated following the DeepInsight pipeline. To the human eye, the differences between these images may seem minimal; however, for deep learning algorithms, these small pixel differences between class images are significant. The generated images are then used for training and evaluating the deep learning model.

Fig. 2 Cartesian plane of the feature locations obtained using the t-SNE method, with the smallest enclosing rectangle found

Fig. 3 Examples of images for each class representing the features extracted from the GPS data

4.2 Training

In this section, the training configuration that produced the best model with the DeepViT network architecture is presented. After converting the features into images, cross-validation with 20 repetitions of training and testing was conducted, with the dataset divided into 80% for training and 20% for testing. The dataset division was performed with a shuffle method, producing stratified split sets at each training repetition.
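This repeated stratified 80/20 protocol can be expressed, for instance, with scikit-learn; the `images` and `labels` arrays are assumed to hold the DeepInsight images and mode labels.

```python
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
for train_idx, test_idx in splitter.split(images, labels):
    X_train, X_test = images[train_idx], images[test_idx]
    y_train, y_test = labels[train_idx], labels[test_idx]
    # ... train and evaluate one repetition ...
```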

As the DeepViT settings and parameters were originally tuned for image classification on large datasets such as ImageNet, the DeepViT input parameters were adjusted to achieve the best possible results on an unbalanced dataset with small training sets. The proposed model differs in several ways from the original works [12, 29]: only 6 hidden layers and 8 attention heads per attention layer were used in the transformer encoder, the dimensionality of the encoder layers and the pooler layer was reduced to 64, and the classification head size was set to 512. The model takes images of size \(64\times 64\), which are divided into patches of \(32\times 32\) to feed and train the proposed DeepViT model.

The model was trained using the AdamW optimizer [32] and the cross-entropy loss criterion, with the learning rate set to \(1\times 10^{-4}\). Each training repetition consisted of 300 epochs with a batch size of 128, and the best model performance was achieved at epoch 133 with a test accuracy of \(92.96\%\). The accuracy of the model was obtained by comparing the model’s predictions with the ground truth labels on the testing set. Figure 4 compares training and testing accuracy over the 300 epochs. This comparison shows that the model can be trained within only 200 epochs, achieving state-of-the-art results for transportation mode identification without over-fitting.
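For orientation, the configuration above could be instantiated, for example, with the open-source vit-pytorch package; the mapping of the paper's settings onto that package's parameters, the single-channel images, and the `train_loader` are assumptions of this sketch, not the authors' exact setup.

```python
import torch
from torch import nn
from vit_pytorch.deepvit import DeepViT

model = DeepViT(
    image_size=64, patch_size=32, num_classes=5,
    dim=64,          # dimensionality of encoder and pooler layers
    depth=6,         # 6 hidden transformer layers
    heads=8,         # 8 attention heads per layer
    mlp_dim=512,     # assumed mapping of the 512-unit head described above
    channels=1,      # assumption: single-channel DeepInsight images
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(300):                     # batch size 128 per the text
    for imgs, targets in train_loader:       # assumed DataLoader of images
        optimizer.zero_grad()
        loss = criterion(model(imgs), targets)
        loss.backward()
        optimizer.step()
```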

Fig. 4 Train and test accuracy for 300 epochs on the DeepViT model with proposed configurations

Since the training and testing sets are not balanced, the accuracy alone may not reflect the model’s effectiveness on each of the different classes. To address this, the confusion matrix, precision, and recall of the best model were calculated on the testing set, and the results are presented in Table 2. These results demonstrate the effectiveness of the DeepViT architecture in inferring the transportation modes, with an F1-score of \(90.21\%\).
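These per-class metrics correspond to standard scikit-learn routines; `y_true`, `y_pred`, and the class ordering below are illustrative assumptions.

```python
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_true, y_pred))
print(classification_report(
    y_true, y_pred,
    target_names=["walk", "bike", "bus", "car", "train"]))  # assumed order
```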

Table 2 Results obtained on the testing set for the best proposed model, considering the confusion matrix, recall, and precision values

All the experiments were performed on a computer equipped with an AMD Ryzen 5 5600X CPU, 32 GB of RAM, and an NVIDIA GeForce RTX 3080 Ti GPU. During training, the proposed model used only approximately 1.6 GB of the available 12 GB of GPU memory. The training process for 300 epochs took approximately 5 min, demonstrating the low resource consumption of model training.

4.3 Performance overview

In this section, the results of the proposed model are compared with those of several related studies in the transportation mode identification literature. The accuracy results of the compared approaches were taken from their respective publications. Table 3 compares the proposed model with the most recent studies that used similar approaches, extracting hand-crafted features from GPS data and training deep learning models to classify the transportation modes of the trajectories.

Table 3 Performance comparison of transportation mode identification approaches using deep learning models

These comparisons show that the proposed model achieves state-of-the-art results with an accuracy of \(92.96\%\). However, it is worth noting that the experimental setup and the training dataset vary in several respects across the different studies. For instance, the number of trajectories per class differs, as trajectories are segmented and divided into fixed-size pieces in other studies. Additionally, some studies merge distinct transportation modes into a single class, for example classifying taxi as car and/or grouping rail-based modes as train.

Comparing the studies mentioned in Table 3, all authors follow a fixed-size segmentation strategy to divide the GPS trajectories into several segments [4, 6, 8, 9, 20]. They use time interval thresholds and a fixed number of GPS points per segment, resulting in different dataset sizes for training and testing the deep learning models. For example, trajectories are divided whenever the time interval between two consecutive GPS points exceeds 10 min, and each of the resulting segments must have a fixed size of 200 GPS points. Additionally, the sizes of the training and testing sets vary across these studies: Dabiri et al. [6], Kim et al. [8], and Zeng et al. [9] divided their datasets into 80% for training and 20% for testing, similar to this work, while James et al. [20] used a 75%/25% split, and Nawaz et al. [4] used 70% for training and 30% for testing.

As mentioned before, in this work the trajectories were not segmented to produce more samples; the full trajectories were used to extract features and train the deep learning model. Another relevant difference between the proposed approach and the mentioned studies lies in data pre-processing and cleaning. Some studies discarded points and trajectories based on maximum speed and acceleration thresholds for each transport mode [6, 8, 9]. In contrast, as in this work, James et al. [20] did not employ a data cleaning step, testing the robustness of the proposed transportation mode identification model.

Table 4 Comparison of random forest and deep learning architectures on transportation mode identification using the same procedure as the proposed approach

Similar to the presented studies, this work also extracted several basic features from the sequence of GPS points of each trajectory, including distance, velocity, and acceleration, among others. From these motion features, more advanced features were derived to train and test the deep learning model of the proposed approach. However, it is worth noting that some of these basic features vary across the studies. For instance, Nawaz et al. [4] incorporated weather features using external weather datasets, while Zeng et al. [9] included additional bus features by computing distances to the nearest bus stops.

These substantial differences among the performed experiments lead to the conclusion that the present work obtained a state-of-the-art result, with an accuracy of \(92.96\%\), without segmenting, filtering, or cleaning the original dataset trajectories. The only filtering conducted in this work is the removal of trajectories with fewer than 3 GPS points, since motion characteristics such as velocity and acceleration cannot be inferred from them.

4.4 Model performance comparison

An experiment was conducted to compare different deep learning architectures and machine learning methods: the random forest classifier, a well-known machine learning technique, and two CNN networks, ResNet50 and EfficientNetV2-S. Moreover, two vision transformer-based networks, ViT and CaiT, were also tested. For the random forest method, the features extracted from the GPS trajectories were normalized using the DeepInsight normalization based on the log normalization and scaling procedure. A model was then trained and tested with the same repetitions and dataset splits as described in Sect. 4.2, resulting in a test accuracy of \(92.29\%\).
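As a hedged sketch, this baseline can be reproduced as follows, reusing the normalization sketch from Sect. 3.2 and the split indices from Sect. 4.2; the estimator settings are assumptions, since the text does not specify them.

```python
from sklearn.ensemble import RandomForestClassifier

Xn = norm2_log_normalize(features)           # sketch from Sect. 3.2; the
                                             # raw 399-feature matrix, not images
rf = RandomForestClassifier(n_estimators=100, random_state=0)  # assumed settings
rf.fit(Xn[train_idx], labels[train_idx])
print("test accuracy:", rf.score(Xn[test_idx], labels[test_idx]))
```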

For the deep learning models, the same procedure was adopted, using the images generated by the DeepInsight pipeline. Table 4 presents the results for each model tested in this study. As these results show, generating images from the extracted features is a reliable approach for transportation mode identification and performed well across machine learning and deep learning architectures. However, comparing the train and test accuracy of the random forest and ResNet50 with the other models, it is evident that these two models overfit during training, as they obtained higher training accuracy without corresponding improvements on the testing set, which hinders generalization. In contrast, the other models showed similar train and test accuracy over the epochs of the training process.

The work developed for this research is publicly available on GitHub, providing open access to its implementation and facilitating reproducibility of the presented results. The pre-trained models are also available, and interested readers can explore the details of the transport mode identification in the publicly available project (Footnote 1).

5 Conclusions

In this paper, a novel transportation mode identification technique was proposed that transforms features of GPS trajectory data into image form, following the DeepInsight methodology, and then feeds these images into the DeepViT architecture. The goal is to train a model capable of classifying transport modes such as walk, car, bus, bike, and train.

Unlike previous studies, the proposed method extracts a large number of features directly from raw GPS trajectories without altering the temporal sequence of GPS points, avoiding the creation of new fixed-size trajectories for classification. After feature extraction, normalization is performed and the location of each feature in the image representation is determined using t-SNE, a dimensionality reduction algorithm. Through the generated pixel location of each feature, a mapping process transforms the non-image data into an image representation.

Following this methodology, the transformed images can be fed into well-known vision transformers, neural networks originally used in NLP and adapted for computer vision tasks, where they outperform traditional CNNs. As demonstrated in this work, the proposed approach achieves state-of-the-art results, with an accuracy of \(92.96\%\) in predicting five transport modes of the GeoLife dataset.

Despite the success of our approach, some transport modes are incorrectly classified, indicating the need for further analysis of the extracted features to address these classification failures. The exploration of new features to enhance classification accuracy remains a direction for future work.

Transportation mode identification still presents challenges, especially when dealing with unbalanced and limited training data, which can impact the performance of neural networks. Nevertheless, this approach provides accurate and reliable transport mode detection, applicable in real-world scenarios, facilitating comprehensive understanding in research areas such as information retrieval and lifelogging.

In real-world cases such as lifelogging, individuals use GPS devices to record daily trajectories that mix multiple modes of transportation. The development of methods for segmenting these trajectories is essential, so that each segment contains only one mode of transport and can be classified using the proposed method.

To further improve the results, future work could involve obtaining and annotating new GPS trajectories to create pre-trained models, as has been done in the latest deep learning models for classification tasks. Using these pre-trained models for fine-tuning could lead to more robust and accurate models for training datasets like GeoLife.