Introduction

Travel survey methods can be broadly classified into two categories. In the first category, respondents are asked to provide details of their trips based on memory. The second category is the more recent approach, where travel data is automatically recorded by devices either placed at fixed locations or carried by the respondents themselves (Hato 2010). Recently, researchers have been focussing on the second approach to determine travel patterns, primarily due to its significant benefits compared to conventional travel diaries and questionnaires. Such traditional methods are usually expensive and time-consuming, and they require considerable effort on the part of the respondents. Moreover, the reported start and end times of trips are usually approximate, while short trips are often left unreported. An additional factor is that people’s perceptions of in-vehicle time vary according to the mode of travel. For example, a person travelling by car will underestimate the travel time, whereas the same person travelling via public transport will overestimate it (Ettema et al. 1996; Stopher 1992). This decreases the accuracy of the data collected and, in turn, affects subsequent transportation planning and design.

For the purpose of automatic travel data collection, researchers now employ sensors such as global positioning systems (GPS) and accelerometers, among others. A GPS can locate the position of a device anywhere in the world with varying accuracy, depending on factors such as the number of satellites in view. An accelerometer, on the other hand, measures the acceleration of a device in three directions with respect to gravitational force. This means that when the device is placed on a flat surface, an acceleration of 1 g is detected in a downward direction, whereas zero acceleration is recorded in the other two directions. Modern smartphones are now equipped with both of the above sensors, so any methodology developed using either or both sensors can be very easily applied via smartphones. The rapid global increase in smartphone usage, especially in developed countries, offers a perfect opportunity to utilise them to collect travel data.

A major concern for researchers is to accurately detect the mode of transportation used by a person carrying a device (either a smartphone or a purpose-built device with the necessary embedded sensors). Mode determination will not just prove beneficial for the transportation sector, but will also pave the way for a new and effective means of advertising. For example, if a user’s location and mode of transportation are known in real time, a message can be sent to his or her mobile phone advertising the nearest facilities available in connection with the mode detected. In addition, products relating to a particular mode used can be advertised directly to the user. In this way, the data accumulated can be used to implement a targeted customer-oriented advertising programme.

For the purpose of mode detection, analysts can currently choose from several types of classification algorithms. Some of these are listed in Table 1, along with their advantages, disadvantages and associated methodologies. A number of researchers have compared these algorithms, and some of those comparative studies are summarised in Table 2. The four classifiers used in this study, namely AdaBoost, SVM, decision tree and random forests, were selected based on the results of existing comparative studies. These algorithms have performed well in numerous studies, and their respective advantages and disadvantages can be seen in Table 1. The current study compares the four algorithms with a view to ascertaining the one most appropriate for transportation mode detection.

Table 1 A comparison among major classification algorithms
Table 2 Some previous algorithm comparison studies

Related work

The related work can be divided into three sections depending on the sensors used to determine the travel mode, as follows: GPS only, accelerometer only and GPS with accelerometer. Each section is detailed below. GSM communications (Anderson and Muller 2006; Sohn et al. 2006) and local area wireless technology (Wi-Fi) (Mun et al. 2008) have also been employed for the purpose of mode detection, but owing to their relatively low accuracy, they are not discussed here.

GPS only

Various studies confirm that the use of GPS data loggers has resulted in greater data accuracy compared to conventional paper diaries and telephone surveys (Forrest and Pearson 2005; Ohmori et al. 2005; Wolf et al. 2003). Accuracy is further enhanced when GPS data is used on a geographic information system (GIS) application (Chung and Shalaby 2005; Schönfelder et al. 2002; Tsui and Shalaby 2006; Wolf et al. 2001).

In their study, Tsui and Shalaby (2006) recorded the GPS logs of participants based in Toronto. The average and maximum speed, in addition to the acceleration, were deduced from the GPS data. This data, along with information relating to public transport routes, was used to determine the transport modes. The prediction accuracy achieved was more than 90 %, slightly higher than that of the method that did not use GIS information. Chung and Shalaby (2005) asked one participant carrying a GPS device to repeat 60 trips for the Toronto ‘Transportation Tomorrow Survey’. The recorded GPS data was used in combination with GIS data to achieve a mode prediction accuracy of 92 %.

A study by Stopher et al. (2008) used a probability matrix to differentiate between travel modes. Trip characteristics such as bicycle ownership, maximum speed, average speed and most frequent speed defined whether a person was walking, cycling or using motorised transport. Further GIS data was utilised to distinguish between motorised transportation modes. In another study by Bohte and Maat (2009), a similar methodology was used for mode detection. Firstly, the average and maximum speeds were used to determine whether the respondent was walking, cycling or driving a car. Secondly, in line with the rules of interpretation, GPS data was plotted on GIS maps to determine whether the motorised trip was by car or by train. A prediction accuracy of 70 % was achieved.

Stenneth et al. (2011) took a different approach to solving the problem. They used GPS data, along with ground conditions, to extract the features to be used in learning algorithms. The features included the average accuracy of the GPS coordinates, average speed, average heading change, average acceleration, bus location proximity, rail line trajectory proximity, bus stop proximity rate and zip code. The classification algorithms used were as follows: (1) naïve Bayes, (2) Bayesian network, (3) decision tree, (4) random forests and (5) Multilayer Perceptron (MLP). The results suggested that random forests is the best classifier, with an average prediction accuracy of 93.7 %.

GPS and GIS information was also used by Chen et al. (2010) to distinguish between six different transportation modes in the city of New York. Prediction accuracy ranged from 60 to 95 %.

Our study is different in the sense that GPS data is not used at all. Although GPS data has been shown to work well for mode detection, certain disadvantages are associated with it. The main difficulty is the drop in accuracy due to signal loss or degradation during warm or cold starts, and in ‘urban canyons’ (Gong et al. 2012; Schuessler and Axhausen 2009; Stopher et al. 2008). Warm and cold starts happen when a GPS logger requires between 5 and 30 s more to find enough satellites for accurate location detection after being off (or underground) for a long period of time. In densely built central business districts (CBDs), satellite signals do not generally reach the GPS device directly but are bounced off tall buildings. This is known as the urban canyon effect. The above drawbacks associated with GPS use tend to decrease the accuracy of the results extracted from GPS data. Furthermore, respondents’ privacy concerns are also a problem in this area. If a smartphone is used as the data collection instrument, developing a methodology using acceleration data alone will not only address the above problems but will also extend the battery time of the smartphone during data collection, as the GPS sensor will not be activated.

Accelerometer only

Much of the available research focusses on using accelerometer data for the classification of physical activity (Bao and Intille 2004; Lester et al. 2006; Tapia et al. 2007), including research conducted using iPhone accelerometer data (Nham et al. 2008). In that case, data from only three participants was used to predict the mode of travel. For classification purposes, the LIBSVM framework (Chang and Lin 2011) was used, where the first 70 % of the data set for each mode was selected as the training set and the remaining 30 % was used as the test set. The prediction results were reasonably accurate but varied considerably among the participants, ranging from 88 to 97 %, and the overall validity of the study is questionable in light of the small amount of data used. Nick et al. (2010) collected acceleration data for three modes: walking, car and train. In total, 90 % of the entire data set was used to train two classification algorithms, namely naïve Bayes and SVM, whereas the remaining 10 % was used as test data. According to the results, SVM outperformed naïve Bayes and achieved a classification accuracy of 97.32 %.

In a recent study by Hemminki et al. (2013), 16 participants from four countries collected accelerometer data spanning more than 150 h and covering six modes of transportation. A mean recall accuracy of 82.4 % was achieved.

Our work provides a comparison between the main classifiers used in research on this area. Although only four modes were classified (due to data constraints), the accuracy achieved was outstanding (mean 99.8 %). Therefore, this methodology is also expected to work quite satisfactorily for additional modes.

GPS with accelerometer

The use of GPS data accompanied by accelerometer data is a relatively novel approach, and few studies have reported methodologies utilising both types of sensor data. For instance, Reddy et al. (2010) used a decision tree followed by a discrete hidden Markov model (DHMM) to identify transportation modes, including stationary, walking, running, biking and motorised transport. The classification system was tested on a data set obtained from sixteen participants and an accuracy of 93.6 % was achieved.

A comparison between the various pre-processing techniques used in several studies was carried out by Figo et al. (2010). Data for prediction and comparison purposes was obtained for three activities: walking, running and jumping. Almost 50 % of the data was used to train the algorithm. The results suggest that for the three-activity scenario, the best frequency-domain techniques yielded results comparable to those of the best time-domain techniques. For the two-activity scenario, however, the best time-domain techniques prevailed.

Moreover, Nitsche et al. (2012) gathered 266 h of travel data with the help of 14 test participants and extracted 72 features for use in probabilistic classifiers. The results ranged from 50 to 98 % over different modes of transportation.

Feng and Timmermans (2013) carried out a study comparing the following three approaches: GPS data only, accelerometer data only and GPS combined with accelerometer data. The study used the Bayesian belief network model for classification purposes. The results showed that the acceleration only approach, with a mean validation accuracy of 88.87 %, works better than GPS only (mean 78.4 %), but the combined data approach outperforms both of them, with a mean validation accuracy of 91.7 %. The use of bicycle ownership, motorcycle ownership and car ownership variables presents a small constraint to the goal of collecting data automatically without putting any burden on the respondents.

Data collection

The data was collected from three cities in Japan, namely, Niigata, Gifu and Matsuyama. In Niigata, the surveys were conducted during January and February 2011 and involved 12 participants; in Gifu, they were conducted in December 2010 and January 2011 and involved 8 participants; and in Matsuyama, they were conducted in November 2010 and January 2011 and involved 26 participants. The data collected can be classified into location data and trip data.

Collection method

The location data was recorded using Behavioural Context Addressable Loggers in the Shell (BCAL) (Hato 2010). BCALs, shown in Fig. 1, are purpose-built wearable devices equipped with different sensors, in addition to a GPS and an accelerometer. They can record location as well as acceleration in three directions, a task that is now possible using modern smartphones. The BCALs observed the various sensors’ readings at a frequency of 16 Hz or 16 readings per second, but the readings transmitted to the server were spaced out at an average of 5 s. Hence, the maximum, minimum and average readings were calculated by the device for each 5 s interval and then recorded by the server. The wearable devices were kept in the same position throughout the trip so that accelerations in different directions could be judged easily.
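The per-interval aggregation described above can be sketched as follows. This is a minimal illustration rather than the BCAL firmware: the function name `summarise_intervals` and the fixed block of 80 samples (16 Hz × 5 s) are assumptions, since the text states only that the transmitted readings were spaced at an average of 5 s.

```python
def summarise_intervals(samples, rate_hz=16, interval_s=5):
    """Reduce a raw sample stream to (max, min, mean) per interval.

    Assumes a fixed block of rate_hz * interval_s samples per interval;
    a trailing partial block is summarised from the samples it contains.
    """
    block = rate_hz * interval_s  # 80 samples for 16 Hz and 5 s
    summaries = []
    for start in range(0, len(samples), block):
        chunk = samples[start:start + block]
        summaries.append((max(chunk), min(chunk), sum(chunk) / len(chunk)))
    return summaries


# Hypothetical single-axis readings in g: 160 samples form two intervals.
readings = [0.0] * 80 + [1.0] * 40 + [3.0] * 40
print(summarise_intervals(readings))
```

Each tuple corresponds to one record of the kind stored by the server, so a 16 Hz stream is reduced by a factor of 80 before any further processing.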

Fig. 1 BCALs equipped with various sensors

The trip data was collected using paper-based travel diaries in which the respondents were asked to record the details of their everyday trips. Feedback calls were made to the respondents to correct any mistakes made during reporting. Again, this is a task that can be fulfilled using smartphones, a method used by many researchers. A simple application developed for the smartphone can be utilised to record the start and end of a trip, as well as the mode of transportation used.

Data description

The location data comprised GPS data and accelerometer data. The accelerometer data recorded was the minimum, maximum and average acceleration in movement, crosswise and vertical directions. Moreover, resultant acceleration and average resultant acceleration were also noted. The trip data covered the information regarding each trip, i.e., the date, start time, end time and mode used.

Amount of data

Table 3 presents the amount of raw location data and mode-assigned data for each city (mode assignment is discussed under “Mode assignment” in the “Data collection” section). The table also shows how the data was distributed across the various modes. Table 4 displays the trip share for each mode.

Table 3 Amount of data collected through BCALs
Table 4 Number of trips recorded

Due to data limitations, the analysis was carried out for four modes only. Acceleration data relating to the bus as a fifth mode was either non-existent or so small that it was not treated separately but simply merged with the car travel data. Similarly, only one trip was recorded for Shinkansen (the high-speed train), and instead of adding a new mode, it was included with the train data.

Mode assignment

The location data was filtered in terms of the trip data. For example, if accelerometer data was recorded with respect to a user for a specific day, but the user had not registered any trips for that particular day in the trip data, then the accelerometer data recorded was of no use. Moreover, data sets with zero acceleration (‘rest’ position) were also discarded.

Using the departure and arrival times listed in the trip data, the corresponding data sets in the location data were assigned the respective mode of transportation, as shown in Fig. 2. After the mode of transportation was assigned to the location data, the remaining data sets were disposed of. The reason some data remained unassigned is that the accelerometer data may have contained data sets recorded before the start of the trip or after the end of the trip. The remaining data was used in subsequent pre-processing and analysis.
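The assignment step can be sketched as a simple interval lookup. The sketch below is illustrative only: the function name `assign_modes`, the tuple layouts and the inclusive treatment of the departure and arrival times are assumptions, and the diary entry and sensor records are invented for the example.

```python
from datetime import datetime


def assign_modes(records, trips):
    """Label each sensor record with the mode of the trip that contains it.

    records: list of (timestamp, reading) tuples from the location data.
    trips:   list of (start, end, mode) tuples from the travel diary.
    Records that fall outside every reported trip are dropped.
    """
    labelled = []
    for timestamp, reading in records:
        for start, end, mode in trips:
            if start <= timestamp <= end:
                labelled.append((timestamp, reading, mode))
                break  # assumes reported trips do not overlap
    return labelled


# Hypothetical diary entry and sensor records, for illustration only.
trips = [(datetime(2011, 1, 10, 8, 0), datetime(2011, 1, 10, 8, 20), "walk")]
records = [
    (datetime(2011, 1, 10, 7, 59), 0.98),  # before the trip: discarded
    (datetime(2011, 1, 10, 8, 5), 1.12),   # inside the trip: labelled
    (datetime(2011, 1, 10, 8, 25), 1.01),  # after the trip: discarded
]
print(assign_modes(records, trips))
```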

Fig. 2 Example of mode assignment

Methodology

Elementary analysis

A distinction between the modes was detected upon careful examination of the acceleration data. For instance, Figs. 3, 4, 5 and 6 show part of the acceleration data for each mode. It can be observed that walking has maximum variability, followed by cycling. This could be due to excessive movement by the traveler carrying the device. On the other hand, the car and train modes showed relatively small acceleration variability, probably due to the smooth travelling environment. Therefore a clear distinction can be perceived between the different modes by just inspecting the acceleration data.

Fig. 3 Average resultant acceleration for walking
Fig. 4 Average resultant acceleration for bicycling
Fig. 5 Average resultant acceleration for automobile travel
Fig. 6 Average resultant acceleration for train travel

Pre-processing

Pre-processing was applied in two stages. First, moving averages of the acceleration values were calculated; second, differences between the smoothed acceleration values were computed.

The moving average was calculated with window sizes of 25, 50, 75, 100 and 125 points in order to identify the window most likely to maximise classification accuracy.

In this case, \(x\) denotes the data entries for acceleration in any direction, \(n\) is the total number of data entries and \(k\) is the window size (25, 50, 75, 100 or 125) for calculating the moving average. At any position \(i\) within the data, the window covers the entries \(x_{j}\) used to calculate the moving average. The window keeps the reference entry \(x_{i}\) at its centre, except at the start and end of the data set, where a full centred window would extend beyond the available data. To handle this, the window was halved at the start and end of the data set, with the reference entry kept at one end of the window rather than placed in the centre. Equations 1 and 2 were formulated for the calculation of the k point average. Equation 2 was used only for average resultant acceleration.

$$(k \text{ point Avg})_{i} = \left\{ \begin{array}{ll} \frac{2}{k}\sum\limits_{j = i}^{i + k/2} x_{j} & \text{if } i \le k/2 \\ \frac{1}{k}\sum\limits_{j = i - k/2}^{i + k/2} x_{j} & \text{if } k/2 < i < n - k/2 \\ \frac{2}{k}\sum\limits_{j = i - k/2}^{i} x_{j} & \text{if } i \ge n - k/2 \end{array} \right.$$
(1)
$$(k \text{ point Avg})_{i} = \left\{ \begin{array}{ll} \frac{2}{k}\sum\limits_{j = i}^{i + k/2} \left| x_{i} - x_{i - 1} \right|_{j} & \text{if } i \le k/2 \\ \frac{1}{k}\sum\limits_{j = i - k/2}^{i + k/2} \left| x_{i} - x_{i - 1} \right|_{j} & \text{if } k/2 < i < n - k/2 \\ \frac{2}{k}\sum\limits_{j = i - k/2}^{i} \left| x_{i} - x_{i - 1} \right|_{j} & \text{if } i \ge n - k/2 \end{array} \right.$$
(2)

Equation 1 shows that at the start of the data set, that is, when the reference position i had not yet exceeded the k/2 mark, a window of size k/2 was used to calculate the average value, with the reference value at the start of the window. Similarly, at the end, the k/2-sized window was used, keeping the reference value at the end of the window. Between these two extremes, the window size was increased to k, with k/2 before i and k/2 after i.
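The windowing scheme of Eq. 1 can be sketched in a few lines. This is an illustrative reading rather than the original implementation: the function name `k_point_average` is hypothetical, indexing is 0-based (the branch conditions are the 0-based equivalents of those in Eq. 1), \(k\) is assumed to be even, and each window is normalised by the number of entries it actually contains, whereas Eq. 1 uses the fixed factors 2/k and 1/k even though its sums include both endpoints.

```python
def k_point_average(x, k):
    """Moving average with a centred window and half windows at both ends.

    Assumptions: 0-based indexing, an even window size k, and
    normalisation by the number of entries actually in the window.
    """
    n = len(x)
    half = k // 2
    smoothed = []
    for i in range(n):
        if i < half:
            # Start of the data: reference entry at the left edge.
            window = x[i:i + half + 1]
        elif i >= n - half - 1:
            # End of the data: reference entry at the right edge.
            window = x[i - half:i + 1]
        else:
            # Elsewhere: half entries before and half entries after i.
            window = x[i - half:i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical series; a 4-point window is used only to keep it readable.
print(k_point_average([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], 4))
```

A constant series is returned unchanged, which is a quick check that the three branches are normalised consistently.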

In this way, moving averages were calculated for maximum, minimum and average accelerations in the movement, crosswise and vertical directions. Furthermore, moving averages were also calculated for resultant and average resultant acceleration (\(acc_{\text{res}}\), \(acc_{{{\text{avg}}.{\text{res}}}}\)). After the original values were replaced with the moving averages, the differences between maximum and minimum accelerations (\(acc_{\max}, acc_{\min}\)) were calculated for all three directions (\(cross, vert, mov\)), and the differences between those values were subsequently calculated. Moreover, the differences between average accelerations (\(acc_{\text{avg}}\)) along the three directions were also calculated. Equations 3–9 show the complete procedure used for the difference calculations.

$$D_{d} = acc_{\max.d} - acc_{\min.d} \quad \text{for } d = cross, vert, mov$$
(3)
$$D_{1} = D_{cross} - D_{vert} - D_{mov}$$
(4)
$$D_{2} = D_{vert} - D_{mov} - D_{cross}$$
(5)
$$D_{3} = D_{mov} - D_{cross} - D_{vert}$$
(6)
$$D_{a1} = acc_{avg.cross} - acc_{avg.vert} - acc_{avg.mov}$$
(7)
$$D_{a2} = acc_{avg.vert} - acc_{avg.mov} - acc_{avg.cross}$$
(8)
$$D_{a3} = acc_{avg.mov} - acc_{avg.cross} - acc_{avg.vert}$$
(9)
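Equations 3 to 9 translate directly into code. The sketch below applies them to a single record; the function name `difference_features`, the dictionary layout and the numerical values are assumptions made for illustration.

```python
def difference_features(acc_max, acc_min, acc_avg):
    """Compute the difference features of Eqs. 3-9 for one record.

    Each argument is a dict keyed by direction: 'cross', 'vert', 'mov'.
    """
    # Eq. 3: range of acceleration in each direction.
    d = {k: acc_max[k] - acc_min[k] for k in ("cross", "vert", "mov")}
    return {
        "D_cross": d["cross"],
        "D_vert": d["vert"],
        "D_mov": d["mov"],
        # Eqs. 4-6: each range minus the other two.
        "D1": d["cross"] - d["vert"] - d["mov"],
        "D2": d["vert"] - d["mov"] - d["cross"],
        "D3": d["mov"] - d["cross"] - d["vert"],
        # Eqs. 7-9: the same pattern applied to the average accelerations.
        "Da1": acc_avg["cross"] - acc_avg["vert"] - acc_avg["mov"],
        "Da2": acc_avg["vert"] - acc_avg["mov"] - acc_avg["cross"],
        "Da3": acc_avg["mov"] - acc_avg["cross"] - acc_avg["vert"],
    }


# Hypothetical smoothed values for one record.
features = difference_features(
    acc_max={"cross": 2.0, "vert": 4.0, "mov": 3.0},
    acc_min={"cross": 1.0, "vert": 1.0, "mov": 1.0},
    acc_avg={"cross": 1.5, "vert": 2.0, "mov": 2.5},
)
print(features)
```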

Figure 7 shows the entire pre-processing method. After pre-processing, the final features were as follows: maximum, minimum and average acceleration along the three directions; differences between maximum and minimum \(\left( {D_{cross} , D_{vert} , D_{mov} } \right)\); their differences \(\left( {D_{1} , D_{2} , D_{3} } \right)\); differences between average accelerations \(\left( {D_{a1} , D_{a2} , D_{a3} } \right)\); resultant acceleration and average resultant acceleration. Moving averages were calculated for all values.

Fig. 7 Pre-processing and feature extraction

Training and test data selection

As the amount of data differed between modes, the training data was randomly selected in the following two ways:

  1. Equal number selection
  2. Equal proportion selection

While equal number selection ensures that all the modes are equally represented in the training data set, the algorithm lacks sufficient training for the most frequently occurring mode in the test data set. Conversely, equal proportion selection ensures that training is proportional to the mode shares in the test data set, but the modes are not represented equally in the training data set. This variation may affect the prediction results.

Equal number selection

For each city, the mode with the least data was identified and the number of records corresponding to 70 % of that data was calculated. That number of records was then randomly selected from each mode to form the training data set, leaving the rest as the test data set.

In this way, no matter how much difference was present between the modes, the training data always comprised equal numbers from each. Table 5 shows the amount of training data selected for each city.
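Equal number selection can be sketched as follows. The function name `equal_number_split`, the dictionary layout and the fixed random seed are assumptions made for illustration; the 70 % share is taken from the description above.

```python
import random


def equal_number_split(data_by_mode, train_percent=70, seed=0):
    """Draw the same number of training records from every mode.

    data_by_mode maps a mode name to a list of records. The training size
    per mode is train_percent of the smallest mode; the rest is test data.
    """
    rng = random.Random(seed)
    smallest = min(len(records) for records in data_by_mode.values())
    n_train = smallest * train_percent // 100
    train, test = {}, {}
    for mode, records in data_by_mode.items():
        shuffled = records[:]  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        train[mode] = shuffled[:n_train]
        test[mode] = shuffled[n_train:]
    return train, test


# Hypothetical record counts: the smallest mode fixes the training size.
data = {"walk": list(range(100)), "train": list(range(10))}
train, test = equal_number_split(data)
print({m: (len(train[m]), len(test[m])) for m in data})
```

With these counts every mode contributes 7 training records, so the test set is dominated by the larger mode.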

Table 5 Amount of training data used for travel mode classification

Equal proportion selection

A total of 70 % of data for each mode was randomly selected to form the training data and the remaining 30 % was used to test the algorithms. This method yielded a much larger quantity of training data, which can be seen in Table 5.
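Equal proportion selection differs only in that the training size is computed per mode. As before, the function name `equal_proportion_split`, the dictionary layout and the fixed seed are illustrative assumptions.

```python
import random


def equal_proportion_split(data_by_mode, train_percent=70, seed=0):
    """Draw train_percent of the records of each mode for training."""
    rng = random.Random(seed)
    train, test = {}, {}
    for mode, records in data_by_mode.items():
        shuffled = records[:]  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        n_train = len(shuffled) * train_percent // 100
        train[mode] = shuffled[:n_train]
        test[mode] = shuffled[n_train:]
    return train, test


# Hypothetical record counts: each mode is split 70/30 on its own.
data = {"walk": list(range(100)), "train": list(range(10))}
train, test = equal_proportion_split(data)
print({m: (len(train[m]), len(test[m])) for m in data})
```

Here the larger mode supplies ten times as many training records as the smaller one, which is the imbalance that the equal number method avoids.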

Classifiers

In order to determine the classifier that most accurately predicts transportation mode, a comparison was made between (a) Support Vector Machines (SVM); (b) Adaptive Boosting (AdaBoost); (c) decision tree using rpart, and (d) random forests. These classifiers were selected due to their frequent and established use in existing literature. The aim was to identify the best performing algorithm by carrying out a comparison between them.

Support vector machine

SVM is a state-of-the-art classification method that was introduced by Boser et al. (1992). SVM has a vast range of applications in bioinformatics, text recognition, image recognition, robotics and many other fields.

SVM fits into the kernel methods category (Shawe-Taylor and Cristianini 2004). Kernels allow the use of linear methods to solve non-linear problems. However, the efficient use of SVMs depends largely on knowledge of how this classifier works, and the user first needs to decide what pre-processing method to use.

A suitable kernel must then be selected, after which the user faces the difficulty of setting parameters for SVM and the selected kernel. A comprehensive guide for this purpose is provided by Ben-Hur and Weston (2010).

SVM is a linear two-class classifier. For simplification purposes, it is assumed that the two classes are labelled as +1 (positive examples) and −1 (negative examples). In the sample case below, \(x_{i}\) is the \(i\)th example in a data set \(\left( {x_{i} , y_{i} } \right)_{i = 1}^{n}\), where \(y_{i}\) is the class label associated with that example; boldface \(\varvec{x}\) is a vector with components \(x_{i}\). The linear classifier is specified by a dot product and is defined by the function below:

$$\varvec{w}^{T} \varvec{x} = \mathop \sum \limits_{i} w_{i} x_{i}$$
(10)
$$f\left( x \right) = \varvec{w}^{T} \varvec{x} + b$$
(11)

The vector \(\varvec{w}\) is known as the weight vector and b is called the bias. The set of points x, for which f(x) = 0, constitutes a hyperplane. This hyperplane, shown in Fig. 8, divides the space into two regions so as to separate the data into two classes.

Fig. 8 Linear classifier with support vectors shown as circled data points

The circled data points are the points closest to the hyperplane and are called the ‘support vectors’. The margin is the distance from the support vectors to the hyperplane. The aim is to maximise the geometric margin \(1/\parallel \varvec{w}\parallel\), which is equivalent to minimising \(\parallel \varvec{w}\parallel^{2}\). This leads to the following optimisation problem:

$$\min \parallel \varvec{w} \parallel^{2} \quad \text{subject to} \quad y_{i} \left( {\varvec{w}^{T} \varvec{x}_{i} + b} \right) \ge 1, \quad i = 1, \ldots ,n$$
(12)

With the help of kernels, the concept of linear classifiers can be extended to non-linear problems. Kernels are used because direct computation of non-linear features is very expensive when the quantity of data is large. Some commonly used kernels are shown in Eqs. 13–15 below:

$${\text{Linear kernel}}\;k\left( {x,x^{\prime}} \right) = x \cdot x^{\prime}$$
(13)
$${\text{Gaussian kernel}}\;k\left( {x,x^{\prime}} \right) = \exp \left( { - \gamma \parallel x - x^{\prime} \parallel^{2}} \right),\quad \gamma > 0$$
(14)
$${\text{Polynomial kernel}}\;k\left( {x,x^{\prime}} \right) = \left( {x \cdot x^{\prime} + 1} \right)^{d} ,\quad d \in \mathbb{N}$$
(15)
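The three kernels of Eqs. 13 to 15 can be written out for plain Python sequences as a small sketch. The function names and the default values of `gamma` and `d` are illustrative assumptions and are not the parameter settings used in this study.

```python
import math


def dot(x, y):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """Eq. 13: k(x, x') = x . x'"""
    return dot(x, y)


def gaussian_kernel(x, y, gamma=1.0):
    """Eq. 14: k(x, x') = exp(-gamma * ||x - x'||^2), with gamma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(x, y, d=2):
    """Eq. 15: k(x, x') = (x . x' + 1)^d, with d a natural number."""
    return (dot(x, y) + 1) ** d


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), gaussian_kernel(u, u), polynomial_kernel(u, v))
```

The Gaussian kernel equals 1 for identical inputs and decays towards 0 as the inputs move apart, with `gamma` controlling how quickly.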

As SVM is a binary class classifier, the one-against-one technique was employed and the correct class was determined using a voting mechanism.
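The one-against-one voting mechanism can be sketched as follows. For \(M\) classes, \(M(M-1)/2\) binary classifiers are consulted, which gives six classifiers for the four modes considered here. The pairwise classifiers below are hypothetical nearest-centre rules on a single invented feature; they stand in for trained binary SVMs, and the names `one_against_one_predict`, `centres` and `nearest_centre` are assumptions.

```python
from collections import Counter
from itertools import combinations


def one_against_one_predict(x, classes, pairwise):
    """Predict a class by majority vote over all pairwise classifiers.

    pairwise maps each (class_a, class_b) pair to a callable that returns
    the winning class for x. Ties are broken arbitrarily.
    """
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise[pair](x)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical feature centres, standing in for trained binary SVMs.
centres = {"walk": 1.0, "bicycle": 0.6, "car": 0.3, "train": 0.1}


def nearest_centre(a, b):
    """Binary rule that picks whichever of the two classes is closer."""
    return lambda x: a if abs(x - centres[a]) <= abs(x - centres[b]) else b


classes = list(centres)
pairwise = {(a, b): nearest_centre(a, b) for a, b in combinations(classes, 2)}
print(len(pairwise), one_against_one_predict(0.6, classes, pairwise))
```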

Adaptive boosting (AdaBoost)

As a solution to many of the difficulties of earlier boosting algorithms, AdaBoost was first introduced by Freund and Schapire (1997). Using the same example as that mentioned above for SVM, AdaBoost takes the training data and calls a weak classifier repeatedly. Starting with equal weights for all the examples, the weights for incorrectly classified examples are increased after each round so that the algorithm can focus more on the difficult examples. Consequently, a strong classifier is constructed as shown in Fig. 9.

Fig. 9 Concept of AdaBoost

In this example, the initial weights are \(w_{i}^{(1)} = 1\) for all data points \(x_{i}\). To generate a set of \(M\) classifiers, \(M\) iterations are performed. At each iteration, \(W\) is the sum of the weights of all data points, whereas \(W_{e}\) is the sum of the weights of the misclassified data points.

For \(m = 1\) to \(M\):

  • Select the classifier \(k_{m}\) that minimises \(W_{e}\):

$$W_{e} = \sum\limits_{k_{m} (x_{i} ) \ne y_{i}} w_{i}^{(m)}$$
(16)

  • Set the weight \(\alpha_{m}\) of the classifier:

$$\alpha_{m} = \frac{1}{2}\ln \left( {\frac{{1 - e_{m} }}{{e_{m} }}} \right)$$
(17)

where \(e_{m} = W_{e} /W\).

  • Update the weights of the data points for the next iteration:

$$w_{i}^{(m + 1)} = \left\{ \begin{array}{ll} w_{i}^{(m)} e^{\alpha_{m}} & \text{if } k_{m} (x_{i} ) \ne y_{i} \\ w_{i}^{(m)} e^{ - \alpha_{m}} & \text{if } k_{m} (x_{i} ) = y_{i} \end{array} \right.$$
(18)
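The iteration defined by Eqs. 16 to 18 can be sketched for the binary case as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the weak classifiers are hypothetical one-feature threshold rules, the toy data is invented, and the early exits for a zero or chance-level error are additions needed to keep the logarithm in Eq. 17 well defined.

```python
import math


def weighted_error(classifier, xs, ys, weights):
    """Eq. 16: sum of the weights of the misclassified data points."""
    return sum(w for x, y, w in zip(xs, ys, weights) if classifier(x) != y)


def train_adaboost(xs, ys, candidates, rounds):
    """Binary AdaBoost over a fixed pool of weak classifiers.

    xs, ys:     training examples and their labels (+1 or -1).
    candidates: hypothetical pool of weak classifiers, each a callable.
    """
    weights = [1.0] * len(xs)                      # initial weights of 1
    ensemble = []                                  # list of (alpha_m, k_m)
    for _ in range(rounds):
        best = min(candidates,
                   key=lambda k: weighted_error(k, xs, ys, weights))
        e_m = weighted_error(best, xs, ys, weights) / sum(weights)
        if e_m == 0:                               # perfect weak classifier
            return [(1.0, best)]
        if e_m >= 0.5:                             # no better than chance
            break
        alpha = 0.5 * math.log((1 - e_m) / e_m)    # Eq. 17
        ensemble.append((alpha, best))
        # Eq. 18: raise the weights of misclassified points, lower the rest.
        weights = [w * math.exp(alpha if best(x) != y else -alpha)
                   for x, y, w in zip(xs, ys, weights)]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of the selected classifiers."""
    score = sum(alpha * k(x) for alpha, k in ensemble)
    return 1 if score >= 0 else -1


def make_stump(threshold, polarity):
    """Hypothetical weak classifier: a one-feature threshold rule."""
    return lambda x: polarity if x < threshold else -polarity


# Toy one-dimensional data that no single stump separates.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
stumps = [make_stump(t + 0.5, p) for t in range(7) for p in (1, -1)]
ensemble = train_adaboost(xs, ys, stumps, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single threshold rule classifies this toy data correctly, but the weighted vote of three rules does, which illustrates how the reweighting in Eq. 18 steers later rounds towards the examples that earlier rounds misclassified.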

Similar to SVM, AdaBoost can only solve binary class problems, so the one-against-all technique was used and the correct class was assigned only when there was one unique answer. Consequently, some of the data remained unclassified, and this was again investigated using AdaBoost in a similar way. In the end, SVM was used to classify any remaining unclassified data.

Decision trees using rpart

A decision tree is a classifier that employs recursive partitioning to arrive at a decision. The data set is split into branch-like segments, and those segments form an inverted decision tree that originates from a starting node called the root. The root has no incoming edge, whereas all the other nodes in the tree have exactly one incoming edge. Nodes with outgoing edges are known as internal or test nodes, while the rest are known as leaves, terminals or decision nodes. Each internal node splits the data into two or more segments according to certain rules, which depend on the attribute values.

Each terminal node corresponds to a target class. The data is classified while navigating from the root down to the leaves. Along the way, internal nodes decide the path of the decision in light of certain rules, which are also defined by the algorithm. Figure 10 presents a simple decision tree for a sample trip with only two features.

Fig. 10 Decision tree presenting mode classification using acceleration data

Rpart, short for ‘recursive partitioning’, is a package for the R programming language that implements Classification and Regression Trees (CART), as discussed by Breiman et al. (1984). This partitioning method can be applied to many different kinds of data. In this case, the problem was one of classification, and pruning was carried out to reduce the effects of overfitting.

Random forests

Random forest, developed by Breiman (2001), is an ensemble classification and regression method that constructs a number of decision trees at the training level, predicts the class using each tree and outputs the final class as the mode of the individually predicted classes. Because the classification method involves tree-like structures, and because randomness is inherently built into the procedure, the method is named ‘random forests’. One of the major advantages of random forests is that pruning is not required.

Each unpruned tree is grown using CART by means of binary partitioning. A large number of trees can be grown, and each tree uses a random selection amounting to nearly 63 % of the given training data. The remaining 37 %, known as ‘Out of Bag’ or OOB data, is used to test each tree and differs from tree to tree. At each node, a subset of the predictors or features is randomly selected. Typically the subset size is \(\sqrt k\), where \(k\) is the total number of features. The best feature within the subset is used for the split, and new feature subsets are selected randomly for the resulting nodes. New data is predicted using all the trees, and the result is finalised by taking the mode of the individual results (classification problem) or their average (regression problem). Figure 11 presents the general procedure involved in random forests.
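The figure of roughly 63 % can be reproduced with a short calculation, assuming that each tree is trained on a bootstrap sample of \(n\) records drawn with replacement from the \(n\) training records, as in Breiman's formulation. Under that assumption, the probability that a given record is never drawn is \((1 - 1/n)^{n}\), which tends to \(1/e \approx 0.37\) as \(n\) grows. The function name below is hypothetical.

```python
import math


def expected_in_bag_fraction(n):
    """Expected share of distinct records in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


for n in (10, 100, 10000):
    print(n, round(expected_in_bag_fraction(n), 3))
print("limit", round(1.0 - math.exp(-1.0), 3))
```

The fraction settles close to its limit for even moderately sized data sets, which is why the 63 % and 37 % shares hold regardless of the amount of training data.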

Fig. 11

General procedure of random forest
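The three sources of randomness and aggregation in this procedure (sampling of training records, sampling of features at each node, and voting across trees) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: the helper names are hypothetical, records are assumed to be drawn with replacement, and the trees themselves are omitted.

```python
import math
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def feature_subset(k, rng):
    """Pick about sqrt(k) of the k feature indices at random for one node."""
    m = max(1, round(math.sqrt(k)))
    return sorted(rng.sample(range(k), m))


def majority_vote(predictions):
    """Final class is the mode of the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
n_rows = 10_000
in_bag, oob = bootstrap_indices(n_rows, rng)
unique_fraction = len(set(in_bag)) / n_rows  # close to 1 - 1/e, about 0.632
oob_fraction = len(oob) / n_rows             # close to 1/e, about 0.368
subset = feature_subset(16, rng)             # 4 of the 16 feature indices
final = majority_vote(["car", "train", "car", "bus", "car"])
```

The 63 % and 37 % figures follow from the sampling scheme: when n records are drawn with replacement, the chance that a given record is never drawn is \((1 - 1/n)^n\), which approaches \(1/e \approx 0.368\) for large n.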

Results and discussion

The overall classification results of the classifiers for the different moving averages, as well as the two types of training data selection methods, are summarised in Figs. 12, 13 and 14. From the figures, it is evident that maximum prediction accuracy can be achieved by employing a 125-point moving average at the pre-processing stage. For the 125-point moving average, Table 6 shows the overall classification accuracies, while Table 7 gives the detailed results. The accuracy reported here is the producer accuracy. For example, a prediction accuracy of 85 % means that 85 % of the test data known to carry a certain class label (ground truth) is returned with the same label by the algorithm. The accuracies were calculated by creating confusion matrices and dividing the number of correct predictions for each mode by the total number of records in the test data set that belong to that mode.
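The calculation described above can be written as a short Python sketch. The function names are hypothetical, the confusion matrix is assumed to hold ground truth in rows and predictions in columns, and the counts are invented for illustration; they are not taken from Tables 6 or 7.

```python
def producer_accuracy(confusion, labels):
    """Per-class accuracy: correct predictions / test records with that true label.

    confusion[i][j] is the number of test records whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        result[label] = confusion[i][i] / row_total if row_total else float("nan")
    return result


def overall_accuracy(confusion):
    """Share of all test records that lie on the diagonal of the matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical counts for three modes.
labels = ["walk", "car", "train"]
confusion = [
    [90, 10, 0],  # true walk
    [5, 95, 0],   # true car
    [4, 6, 0],    # true train, never predicted as train
]
per_mode = producer_accuracy(confusion, labels)
overall = overall_accuracy(confusion)
```

With these invented counts the overall accuracy is about 88 %, while the producer accuracy for the smallest class is zero, so the two measures are best read together.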

Fig. 12

Prediction accuracy for Niigata city using (a) equal number method and (b) equal proportion method

Fig. 13

Prediction accuracy for Gifu city using (a) equal number method and (b) equal proportion method

Fig. 14

Prediction accuracy for Matsuyama city using (a) equal number method and (b) equal proportion method

Table 6 Overall classification results at 125-point moving average
Table 7 Classification results at 125-point moving average

It can also be observed from the figures, as well as from the results listed in the tables, that the equal proportion method generally performs better than the equal number method, although some of the detailed results tell a different story. For instance, in the equal proportion method, SVM and AdaBoost seem to perform well, with overall accuracy exceeding 85 % in all cases. However, a breakdown of the accuracies at mode level reveals that the prediction accuracy for the train mode is very poor, and is in fact zero in the cases of Niigata and Matsuyama. This is because the amount of train data in the training data set is very small relative to the other modes, which results in zero prediction accuracy, even for the training data itself.
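How the two selection methods affect the representation of a small class can be illustrated with a hedged Python sketch. It assumes that the equal number method draws the same number of records from every mode and that the equal proportion method draws the same percentage of each mode's records; the definitions given earlier in the paper take precedence if they differ. The record counts, the 70 % figure and all function names are hypothetical.

```python
import random
from collections import Counter


def group_by_mode(records):
    """Group (mode, value) records by their mode label."""
    groups = {}
    for record in records:
        groups.setdefault(record[0], []).append(record)
    return groups


def equal_number(records, per_mode, rng):
    """Assumed reading: draw the same number of records from every mode."""
    train = []
    for rows in group_by_mode(records).values():
        train.extend(rng.sample(rows, min(per_mode, len(rows))))
    return train


def equal_proportion(records, percent, rng):
    """Assumed reading: draw the same percentage of each mode's records."""
    train = []
    for rows in group_by_mode(records).values():
        train.extend(rng.sample(rows, len(rows) * percent // 100))
    return train


def mode_share(train, mode):
    """Fraction of the training set that belongs to the given mode."""
    return sum(1 for m, _ in train if m == mode) / len(train)


rng = random.Random(1)
sizes = {"walk": 5000, "car": 4000, "train": 100}  # invented record counts
records = [(mode, i) for mode, n in sizes.items() for i in range(n)]
by_number = equal_number(records, 70, rng)
by_proportion = equal_proportion(records, 70, rng)
train_share_number = mode_share(by_number, "train")
train_share_proportion = mode_share(by_proportion, "train")
```

Under these assumptions the train mode makes up one third of the equal number training set but only about 1 % of the equal proportion training set, which is the kind of imbalance that can leave a classifier with too few train examples to learn from.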

The random forest classifier performs best in all cases. In particular, its accuracy is very high, at 99.8 %, for the 125-point moving average using the equal proportion method. Even in the equal number method, its overall accuracy is greater than 91 %. The next best performer is the decision tree, followed by AdaBoost and then SVM.

The developed methodology was tested for three cities in order to establish the stability as well as the broader applicability of the approach. The results show that similar classification accuracy was achieved for the three cities. This is an indication that the approach is stable and might yield a good level of accuracy for other cities in Japan. However, more data is required to confirm this.

A careful examination of the results reveals that when using random forests with the equal number method, the prediction accuracy of the train mode is the highest of all the modes, in fact 100 %. Under the equal proportion method, however, the same mode is predicted with the lowest accuracy of all the modes. This suggests that the prediction accuracy of the train mode can be improved by collecting more data so as to increase its representation in the training data set. Therefore, the optimum solution is to collect a comparable amount of data for each mode so that both selection methods will yield a training data set of a similar size.

Conclusion

This study shows that, using acceleration data alone, the transportation mode being used by the device carrier can be detected with a high level of accuracy. The developed methodology has the potential to complement or partially replace conventional travel survey methods. Furthermore, the data required for the developed approach can be collected using smartphones, which will increase its applicability. Automatic mode detection will assist transportation planners in studying and modelling people’s travel behaviour more easily and with higher accuracy. This, in turn, will improve subsequent planning and design work.

Apart from a good classification algorithm, the training sample size and appropriate pre-processing are also vital for achieving better results, and both were a primary focus of this study. The data was collected from respondents in three Japanese cities, namely Niigata, Gifu and Matsuyama. The training sample size was set based on two data selection methods. Of the two data selection methods tested, the equal proportion method performed better. Moreover, regarding the pre-processing phase, varying window sizes were used to calculate moving averages. A 125-point moving average improves the prediction accuracy relative to the other window sizes, although the variation is minimal. Finally, of the four algorithms used in this study, random forests outperformed all the others. A combination of all of the optimum conditions described above yielded an overall prediction accuracy of 99.8 %. The surveyed cities exhibited similar classification accuracies, indicating that this approach might also be applied to other areas with the expectation of good results.

This study highlights a limitation with respect to the SVM and AdaBoost algorithms. Minimal representation of the train mode in the training data set following the equal proportion selection method resulted in the total misclassification of train data during prediction. This suggests that the training of SVM requires equal or comparable representation from all classes, and the same is true for AdaBoost. On the other hand, no such constraint was observed in the case of decision tree and random forests. A further observation was made regarding the computational time required by the algorithms. SVM and AdaBoost are very time-consuming when applied to large data sets like those used in this study, whereas decision tree and random forests are considerably faster.

The ideal scenario, however, is to have a nearly equal amount of data for each contributing mode and then use the equal proportion method. In this manner, the strengths of both selection methods would be combined, which should yield even better prediction results. One of the limitations of this study relates to the fixed positioning of the data collection device while its carrier was travelling. The positioning should be flexible, especially in cases where purpose-built devices are replaced by smartphones. The developed methodology needs to be modified and extended to accommodate varying placement of the device. Furthermore, the approach should also be tested on additional modes. In addition, behaviour models could be incorporated into the analysis in order to enhance accuracy, and may be especially beneficial when insufficient data has been collected.