Introduction

For decades, urban travel forecasting models have played a crucial role in supporting infrastructure investment decisions. The majority of these models are based on the classical four-step modeling approach (McNally 2007). As mobility patterns have become more complex, planners have been motivated to build more sophisticated models. This has resulted in the development of disaggregate modeling approaches known as activity-based models. The rationale for this approach is that people travel to fulfil the demand to conduct activities at different locations. While the trip-based modeling approach is simpler to implement than activity-based models, it handles activity sequence patterns poorly because it treats non-home-based trips as disconnected trips (Adler and Ben-Akiva 1979; Axhausen and Gärling 1992).

One of the components of activity-based modeling is building the activity sequences of people. This paper presents a new method to generate activity sequence patterns for activity-based modeling frameworks. Traditional activity sequencing procedures often require separate models for mandatory activity generation, discretionary activity generation and the scheduling of these activities; in comparison, the proposed approach is straightforward to implement and runs much faster. In the literature, the definition of activity sequence patterns varies. For example, Golob (2000) classifies activity sequence patterns as simple or complex, such as complex home-based trip chains. The current study defines an activity sequence pattern as a sequence of activities performed during a day (e.g. home to work to shopping to home), regardless of the duration and location of activities.

State of the art of activity sequencing

The crucial component of activity-based modeling is the generation of the complete activity diary of an individual or a household unit. This mainly involves three sub-components. The first is the type of activity and its sequential arrangement; the second is the destination/location of activities; and the third is the allocation of start time, end time and duration of the activity (Charypar and Nagel 2005). The current research focuses on the first component, i.e. the type and sequence of activities, which is known as the activity sequencing or activity pattern generation model. The problem of activity sequencing has been approached with different modeling methodologies. In some studies, activity sequencing is modeled jointly with activity destination, activity time allocation or both. Other studies model activity sequences independently. The methods used can be broadly classified as discrete choice-based methods, machine learning-based methods and other methods.

One of the early works was done by Pas (1984), who studied the effect of socio-demographic characteristics on activity-travel patterns. He applied cluster analysis to identify five travel-activity patterns. He then used a linear logit model to analyse a contingency table to examine the relationship between activity sequences and socio-demographic variables. Recker et al. (1986a, b) provided a theoretical framework to model complex travel behavior. Individuals select the activity pattern from the choice set that maximizes their utility. The concept was later operationalized in the STARCHILD modeling system by Recker et al. (1986b). Bowman and Ben-Akiva (2001) estimated a nested logit model for daily activity pattern choice as part of the integrated activity-based model system. An activity pattern was defined by the primary activity, primary tour type and number and purpose of secondary tours. A choice set of 54 activity patterns was defined. The Boston household survey was used to estimate the model. Another study by Bhat (2001) modeled activity patterns of workers during evening hours using a multinomial logit model. The stops, activity duration and travel time deviations were modeled explicitly. Gangrade et al. (2002) used a nested logit model to model activity sequences only for commuters. Poisson and negative binomial regression models were used to predict activity frequencies for each individual, while a simple heuristic rule was applied to determine the sequence of activities. Ordóñez Medina (2015) focused on sequencing of activities, their start and end times, duration of activities, location of activities and mode used to perform activities. A spatio-temporal network method was used to form activity sequences, while utility functions were used to select an activity sequence. Habib et al. (2017) studied activity-travel patterns of non-workers using a random utility maximizing approach in sequential nested structures. Activity type, destination location and time expenditure choice were modeled. Activities were sequentially added to form activity-travel behaviour for non-workers. They explicitly considered the travel time budget in the formation of the activity-travel pattern. Paleti et al. (2017) developed a rank-ordered logit model to model activity sequence, location and tour structure in an integrated manner. The alternatives in the choice set were defined as the combination of activity type and location. Recently, Xu et al. (2018) used a combination of mixed integer linear programming (MILP) and random utility maximization (RUM) architecture to model activity patterns. The household activity pattern problem (HAPP) structure was used as an optimization function and parameters were estimated using path-size logit models. The choice set of each traveller was generated and individualized using a clustering algorithm and goal programming based on the individual’s spatial and temporal constraints.

Several studies used machine learning algorithms to analyse travel behaviour. These include Charypar and Nagel (2005), who applied a genetic algorithm to model activity schedules of individuals. Activity type, time allocations and location of activity were part of the activity schedule. A utility function was developed and used as a fitness function to compare and choose the best activity schedule. Hanson and Huff (1986) used principal component analysis (PCA) and k-means clustering algorithms to classify individuals based on their multi-day travel-activity patterns. The analysis yielded five clusters that were distinguished by discriminant analysis based on socio-demographic variables. PCA was also used by Pitombo et al. (2009) to reduce the dimension of the data from 39 to 10 to suggest a taxonomy of travellers regarding their travel behaviour. Multiple linear regression analysis was used to model different travel patterns using the reduced ten dimensions as independent variables. Diana and Pronello (2010) used correspondence analysis to segment travellers’ profiles regarding their stated mode choices. Age, person type and attitude towards transit choice were among the variables used for analysis. Jiang et al. (2012) used k-means clustering via PCA to cluster daily activity patterns on weekdays and weekends. PCA was used to obtain eigen-activities. In this way, a thorough understanding of trip behaviour could be deduced. Hafezi et al. (2018) used a random forest model to model daily activity patterns. Socio-demographic attributes and prior activities were used as predictor variables. The model has an activity sequence replication accuracy of up to 77%. Hafezi et al. (2019) developed a pattern recognition modeling framework to model activity schedules. A fuzzy C-means clustering algorithm was used to cluster data into twelve clusters of homogeneous activity schedules. The schedule is defined by activity type, start time, end time, duration and sequential arrangement of activities. Individuals were assigned to particular clusters based on their socio-demographic attributes. Recently, Drchal et al. (2019) introduced the concept of a data-driven activity scheduler using machine learning methods. The schedule consists of activity type, duration, location and mode choice. Four core modules were developed to model each component of the schedule in a sequential manner.

Other methods include the nonlinear canonical correlation analysis by Golob (1986) to study the relationship between home-based trip chains and socioeconomic characteristics of the person without considering the spatial and temporal aspects of activities. Travellers were categorized into three segments: workers, students and others. He found various explanatory variables for each segment that affect trip-chaining behaviour. Kitamura et al. (2000) used a microsimulation approach to develop travel patterns. Monte Carlo simulation was used to generate daily travel patterns. The study showed that it is possible to generate synthetic travel patterns through microsimulation. Han and Sohn (2016) used continuous hidden Markov models, while sequence alignment methods were used by Kim (2014) and Sammour et al. (2012). Li and Lee (2017) used probabilistic grammars to model daily activity patterns.

The most common problem with behavioural choice modeling is the determination of the choice set of alternatives. The possible sequences of activities could result in a huge number of alternatives in the choice set. On the one hand, such a high number of alternatives may make discrete choice models unsuitable, especially if activity sequences are combined with activity locations. On the other hand, some of the resulting combinations may not be representative of travel behaviour and may not be observed in household travel surveys. Moreover, the problem with machine learning algorithms is that they are designed as a black box and are computationally expensive. Therefore, the current study develops a methodology that considers complete activity sequence patterns of individuals based on their socio-demographic attributes. Further, among the most favourable alternatives for an individual, preference is given to those alternatives that are more frequently selected in the household travel survey.

Data and modeling framework

This section explains the data analysis and modeling framework used in this study. A two-step method is used to build the model. In the first step, activity sequence patterns are analyzed and a relationship between activity sequence patterns and socio-demographic variables is established. In the second step, a prediction framework is developed that uses the relationships defined in the first step to calculate the probabilities of all activity sequence patterns for a given person. Overall, the modeling framework is a non-home activity focused model. The home is not simulated as a location of activities but just as a place to return to. The activity sequences defined do not necessarily start and end at home but may be incomplete, as represented in the survey data for a 24-h period.

The activity sequence patterns are modeled at the person level instead of the household level. There is a debate in the literature as to which unit of analysis is better, the person level or the household level. Some studies (Downes et al. 1978; Supernak et al. 1983) support a person-level analysis, because most activities are conducted by individual persons (joint activities form an exception). Others (Badoe and Chen 2004) support a household-level analysis, because activities of one household member (such as picking up a child or grocery shopping) affect the need of other household members to perform these activities. Pokhrel (2017) compared both approaches econometrically and found no conclusive evidence as to which approach performs better. The current study modeled activity sequence patterns at the person level because the methodology used for model building can provide better results at the person level than at the household level as the unit of analysis. The person level can provide greater heterogeneity in activity sequencing behaviour. As intra-household interactions are not modeled, accompanying trips are modeled as an activity. If a member of the household is accompanying another member, this type of activity sequence pattern is modeled, and the activity is termed ‘accompanying’. For example, the activity sequence pattern ‘home-accompanying-home’ is used in the model. However, the link to the other household member whom the person is accompanying is not modeled. From our household travel survey, it was unfortunately not possible to identify whether the person is accompanying a household member or some other person.

Generation of activity sequence patterns

The German household travel survey for the year 2008 called ‘Mobilität in Deutschland (MiD 2008)’, or ‘Mobility in Germany’, was used as the primary data source for this model. The MiD 2008 contains travel data reported in the form of individual trips performed by the household members. For every household, travel data of each member of the household is collected on the same day (Cyganski et al. 2013). This study considers seven types of activities for the generation of activity sequence patterns. They are “Home (H)”, “Work (W)”, “Education (E)”, “Shop (S)”, “Leisure (L)”, “Private errands (PE)” and “Accompanying (AC)”. A total of 5683 unique activity sequence patterns were generated from 45,746 individuals. Each individual has exactly one activity sequence pattern, and many individuals share the same pattern. For instance, 4467 individuals performed the activity sequence pattern ‘home-work-home’ on the survey day. The top 5 activity sequence patterns are shown in Table 1.

Table 1 Activity sequence patterns observed

A detailed explanation of the activity sequence generation process is given by Ahmed (2018). The activity participations are reported for a single day; therefore, each activity sequence pattern also represents a single day, which is assumed to be an average weekday.
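The actual generation procedure is documented by Ahmed (2018). As an illustration only, the following minimal Python sketch shows one way a trip table could be turned into activity sequence patterns and counted. The record layout `(person_id, trip_order, origin_activity, destination_activity)` and the toy trips are assumptions made for this example and are not taken from the MiD 2008 data.

```python
from collections import Counter

# Hypothetical trip records (not MiD 2008 data):
# (person_id, trip_order, origin_activity, destination_activity).
# Activity codes follow the seven types used in the paper: H, W, E, S, L, PE, AC.
trips = [
    (1, 1, "H", "W"), (1, 2, "W", "H"),
    (2, 1, "H", "W"), (2, 2, "W", "H"),
    (3, 1, "H", "E"), (3, 2, "E", "H"), (3, 3, "H", "L"), (3, 4, "L", "H"),
    (4, 1, "W", "S"), (4, 2, "S", "H"),  # day does not start at home
]

def build_patterns(trip_records):
    """Return {person_id: pattern string}, e.g. {1: 'H-W-H'}.

    The sequence starts with the origin of the first reported trip and then
    appends the destination of every trip, so patterns that do not start or
    end at home are kept as reported.
    """
    sequences = {}
    for person_id, _order, origin, destination in sorted(trip_records):
        if person_id not in sequences:
            sequences[person_id] = [origin]
        sequences[person_id].append(destination)
    return {pid: "-".join(seq) for pid, seq in sequences.items()}

patterns = build_patterns(trips)
pattern_counts = Counter(patterns.values())

for pattern, count in pattern_counts.most_common():
    print(pattern, count)
```

Sorting by person and trip order before concatenating keeps the sketch independent of the order in which trips are stored.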

Activity sequence pattern analysis

To study the relationship between activity sequence patterns and socio-demographic attributes, the generated activity sequence patterns were checked for missing values of socio-demographic attributes. Person records with a missing value for any attribute were omitted. This reduced the number of individuals to 39,080 with 5136 activity sequence patterns. Most of the activity sequence patterns were rare, with a very low number of survey records. For multiple correspondence analysis, it is important to have a sufficiently large sample size for each category to obtain meaningful relationships. One approach to overcome this problem is to group categories (Husson et al. 2017, p. 150). However, it is impossible to decide whether, for example, H–W–S–H–L–H should be grouped with H–W–S–H–S–H or H–W–S–H–PE–H. To avoid a manual (and arbitrary) grouping process, this problem is approached by selecting only activity sequence patterns with 30 or more person records.

For household type definition in trip generation, it is state of practice to use 30 survey records as a threshold for each household type to be considered (Moeckel et al. 2017). This criterion was used in the current study, and 112 activity sequence patterns were found with 30 or more survey records. It is important to note that these 112 activity sequence patterns represent 72% of the dataset, as shown in Table 2. It can also be seen from Table 2 that a threshold of 20 or more records would add around 60 more activity sequence patterns, while the dataset representation would only increase by 4%. A threshold of 10 or more records would add 193 activity sequence patterns and increase the dataset representation by 8%.
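The trade-off between the record threshold and dataset coverage can be reproduced with a few lines of code. The sketch below is a hedged illustration: the function name `coverage_by_threshold` and the toy counts are hypothetical and do not reproduce the figures in Table 2.

```python
from collections import Counter

def coverage_by_threshold(pattern_counts, thresholds):
    """For each threshold, return (threshold, patterns kept, share of persons covered)."""
    total_persons = sum(pattern_counts.values())
    rows = []
    for threshold in thresholds:
        kept = [count for count in pattern_counts.values() if count >= threshold]
        rows.append((threshold, len(kept), sum(kept) / total_persons))
    return rows

# Toy counts for illustration only (persons per activity sequence pattern).
toy_counts = Counter({"H-W-H": 50, "H-E-H": 35, "H-S-H": 22, "H-L-H": 12, "H-W-S-H": 3})

table = coverage_by_threshold(toy_counts, thresholds=(30, 20, 10))
for threshold, n_patterns, share in table:
    print(f">= {threshold:2d} records: {n_patterns} patterns, {share:.1%} of persons")
```

Lowering the threshold always adds patterns, but the additional coverage shrinks because the added patterns are rare by construction.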

Table 2 Activity sequence pattern statistics

The inclusion of more than 70% of the dataset with a manageable number of activity sequence patterns was considered a reasonable representation of the most common activity sequencing behaviour. These 112 activity sequence patterns were generated by 28,172 individuals. For simplicity, each activity sequence pattern is coded as TC (TC1-TC112), where TC1 is the most frequent and TC112 the least frequent of these activity sequence patterns. All 112 activity sequence patterns with their respective codes are described by Ahmed (2018). The remaining 5024 activity sequence patterns were grouped into a single group. The group can be assigned a probability value, and the selection of an activity sequence pattern from that group may be done randomly based on the frequency of each activity sequence pattern. Given the infrequent occurrence of those activity sequence patterns and the lack of data to describe who would commonly conduct such rare activity sequence patterns, a random selection is considered plausible. For the 112 much more frequent activity sequence patterns, probabilities based on socio-demographic attributes are calculated for every individual.
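A frequency-based random selection within the group of rare patterns could look like the following sketch. It assumes that the observed record count of each rare pattern is used directly as its sampling weight; the function name and the toy counts are hypothetical.

```python
import random

def draw_rare_pattern(rare_counts, rng):
    """Draw one rare activity sequence pattern with probability proportional
    to its observed number of survey records."""
    patterns = list(rare_counts)
    weights = [rare_counts[p] for p in patterns]
    return rng.choices(patterns, weights=weights, k=1)[0]

# Toy counts for illustration only; every pattern here has fewer than 30 records.
rare_counts = {"H-W-S-H-L-H": 12, "H-W-S-H-PE-H": 6, "H-L-H-AC-H": 2}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
draws = [draw_rare_pattern(rare_counts, rng) for _ in range(1000)]
shares = {p: draws.count(p) / len(draws) for p in rare_counts}
print(shares)
```

With enough draws, the simulated shares approach the observed shares of the rare patterns, which is all that a frequency-based random selection can guarantee.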

Socio-demographic attributes

As socio-demographic attributes, household characteristics and person characteristics were considered. Table 3 shows the attributes included in this study and their frequency after omitting incomplete records. The column “Number of person records” shows the number of surveyed households or persons in the dataset. The column “Expanded number of records” describes how many households/persons are represented after the respective expansion factors of the survey were applied. The attributes in Table 3 were selected based on the state of practice in transportation modeling and are assumed to explain activity sequencing behavior. All attributes used were transformed into categorical variables. To the best of the authors’ knowledge, there is no formal method in multiple correspondence analysis to detect which variables are statistically most significant. However, the contribution of each variable to the multiple correspondence analysis dimensions can give an idea of the most important categories, i.e. those that explain most of the variability in the dataset.

Table 3 Attributes considered in the model

Multiple correspondence analysis (MCA)

Methodology

Correspondence analysis (CA) and its extension, multiple correspondence analysis (MCA), identify relationships among variables in survey data. While the more common principal component analysis (PCA) is suited for metric data, the similar MCA method is used with categorical data. Multiple correspondence analysis is typically used to analyse surveys where individuals respond to questions with specific answers (Husson et al. 2017, p. 127). Commonly, rows represent respondents and columns represent variables. For multiple correspondence analysis, each question is considered a variable and each answer is considered a category of that variable (Husson et al. 2017).

This method reduces the dimensions of the data by creating new principal dimensions. Each dimension represents a certain amount of variance in the dataset. Each dimension can also be interpreted as a synthetic variable (Husson et al. 2017, p. 15), which is formed by the contribution of each variable category. Some dimensions may explain the variability of a certain category more than that of other categories. The method finds the principal dimensions, which explain the variability between individuals and the associations among variable categories. The dimensions are orthogonal to each other. Thus, instead of \(K\) dimensions for individuals and \(I\) dimensions for categories, a reduced number of dimensions, i.e. the principal dimensions, is formed. The data can then be plotted in two dimensions, which helps to study the relationships in the data.

For model estimation, the dataset was prepared in a way that each row represented an individual and each column represented socio-demographic attributes and an activity sequence pattern, as reported in the MiD 2008. These variables naturally fall into two groups, one defining the person and the other defining travel behaviour. The model uses socio-demographic attributes as active variables and activity sequence patterns as supplementary variables. The dimensions were formed using active variables, i.e. socio-demographic attributes, to establish the principal dimensions of the multiple correspondence analysis. Then, supplementary variables, i.e. activity sequence patterns, were added to show the relationship between the principal dimensions and supplementary variables.
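The set-up with active and supplementary variables can be sketched in a few lines of NumPy. This is not the authors' implementation: it is a textbook formulation of MCA as a correspondence analysis of the indicator matrix, and the toy attributes, category names and sample size are invented for illustration. Supplementary categories are positioned with the standard transition formula (mean coordinate of the individuals in the category, divided by the singular value of each dimension).

```python
import numpy as np

def indicator_matrix(columns):
    """One-hot encode a dict of categorical columns; returns (Z, labels)."""
    blocks, labels = [], []
    for name, values in columns.items():
        cats = sorted(set(values))
        blocks.append(np.array([[float(v == c) for c in cats] for v in values]))
        labels += [f"{name}={c}" for c in cats]
    return np.hstack(blocks), labels

def mca(active):
    """MCA of the active variables via SVD of the standardized residuals."""
    Z, labels = indicator_matrix(active)
    n, q = Z.shape[0], len(active)
    P = Z / (n * q)                       # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sigma > 1e-8                   # drop numerically empty dimensions
    U, sigma, Vt = U[:, keep], sigma[keep], Vt[keep]
    row_coords = (U * sigma) / np.sqrt(r)[:, None]          # individuals
    category_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]  # active categories
    explained = sigma**2 / np.sum(sigma**2)                 # share of variance
    return row_coords, category_coords, sigma, explained, labels

def supplementary_coords(row_coords, sigma, values):
    """Place supplementary categories without letting them shape the dimensions."""
    values = np.array(values)
    return {cat: row_coords[values == cat].mean(axis=0) / sigma
            for cat in sorted(set(values.tolist()))}

# Toy data for illustration only (random, so no real relationship is expected).
rng = np.random.default_rng(0)
n = 60
active = {
    "occupation": rng.choice(["worker", "student", "other"], size=n).tolist(),
    "licence": rng.choice(["yes", "no"], size=n).tolist(),
    "cars": rng.choice(["0", "1", "2+"], size=n).tolist(),
}
pattern = rng.choice(["H-W-H", "H-E-H", "H-S-H"], size=n).tolist()

row_coords, category_coords, sigma, explained, labels = mca(active)
pattern_coords = supplementary_coords(row_coords, sigma, pattern)
print(np.round(explained, 3))
```

In this formulation the number of non-empty dimensions equals the number of active categories minus the number of active variables, and the shares in `explained` sum to one.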

MCA results

The model was estimated using the attributes defined in the section on socio-demographic attributes. The results of the model estimation are described below.

Dimensions

The model estimation resulted in the formation of 15 principal dimensions as shown in Fig. 1 below:

Fig. 1 Variable decomposition per dimension

The number of dimensions to be formed is determined by the multiple correspondence analysis method itself. Each dimension explains a certain amount of variance in the dataset. For example, dimension 1 explains 18.5% of the variance in the dataset, while dimension 2 explains 15.4%. Every consecutive dimension explains less variance than the previous one, and all 15 dimensions together explain 100% of the variance.

Socio-demographic attributes

Each variable category may be represented in all 15 dimensions. Table 4 shows the estimation results for the first two dimensions. The columns ‘Dimension 1’ and ‘Dimension 2’ show the coordinates of each category in the first and second dimension. For each dimension, the coordinates describe the position of a given category relative to the other categories. These coordinates can be used to visualize results (compare Fig. 2) and to calculate the most likely activity sequence pattern. The column ‘contribution’ represents the contribution of each category to the formation of the respective dimension. For instance, A1 (age between 0 and 15) contributes the most to the definition of dimension 1 with a share of more than 15%, while O1 (employed) contributes the least to the formation of dimension 1 with a share of only 0.032%. The column ‘v-test’ is a test value that follows a Gaussian distribution. Values below −1.96 or above 1.96 indicate that the coordinate is statistically significantly different from zero.

Table 4 Results of categories in first two dimensions

Fig. 2 Factor map of categories

The squared cosine (cos²) value describes the degree to which a dimension explains a category. For example, in dimension 1 the cos² for category H1 is 0.30 (or 30%) and in dimension 2 it is 0.05 (or 5%). The sum of the cos² values across all 15 dimensions is equal to 1 (or 100%) for each category. The detailed calculation procedure for these parameters can be found in Husson et al. (2017). Not every category is displayed well in every dimension. The squared correlation of categories for all 15 dimensions shows that H3 (household size category 3) is well represented in dimension 1, while I3 (income group 3, i.e. from 3000 to 5000 euros per month) is best represented in dimension 6.

The relationship among attribute categories in the first two dimensions is shown in the factor map in Fig. 2. The closer the categories, the more related they are to each other. The categories are coloured, and the lines (drawn manually) show the distribution of each category along the map. The map shows that the socio-demographic attributes can be broadly grouped into three clusters. The top right part of the map shows that students (O2) are more likely to fall between 0 and 22 years (A1) and to not have a driver’s license (NL). NL is a little further away from this group of three points, as only people aged 17 or older can have a driver’s license in Germany. Overall, the upper right quadrant of the map mostly represents a cluster of students. The middle portion of the map shows that workers (O1) are more likely to be between 23 and 64 years old (A2 and A3) and to fall into the higher income categories of 2000 euros and above (I2, I3 and I4), with a higher likelihood of owning a car and holding a driver’s license (L). This portion represents the workers’ cluster.

The top left portion of the map mainly shows retired people (O3) with lower income and an age (A4) of 65 years and above. They could also be homemakers or currently unemployed people with lower income who are more likely to not own a car. This portion of the map could be called the ‘others’ cluster.

The relationship of household income and auto ownership is well depicted. It shows that people with low income (I1) tend to either not own a car (C0) or own one car (C1). People with medium income (I2) are more likely to own one or two cars, while people with high income (I3 and I4) tend to own two and more cars (C2 and C3). The location of household sizes is also interesting. Students (O2) are more likely to live with their parents (represented by household size of 3 or more people, H3).

The gender categories are close to each other and near the origin, which shows that gender has little influence on other characteristics in dimensions 1 and 2. The interpretation of the above map seems intuitive, but it should be kept in mind that Fig. 2 only shows the relationship among attribute categories in two dimensions.

Activity sequence pattern

The distribution of activity sequence patterns in the first two dimensions is shown in Fig. 3. Only the top and bottom 10 activity sequence patterns are labelled for visual reasons. Similar parameters as shown in Table 4 were generated by multiple correspondence analysis for activity sequence patterns except ‘contribution’, as activity sequence patterns did not contribute towards the formation of dimensions.

Fig. 3 Factor map of activity sequence patterns

The map above shows that the activity sequence patterns can be broadly grouped into three clusters, which correspond to the clusters defined for socio-demographic attributes (Fig. 2). The cluster at the top right contains activity sequence patterns with education activities (e.g. TC4, TC6 and TC190) and can be called the education cluster. The cluster at the centre bottom contains activity sequence patterns with work activities (e.g. TC1, TC9, TC103, TC107, TC108, TC110 and TC112) and can be called the work cluster. The cluster in the middle contains activity sequence patterns with every other type of activity modeled (e.g. TC2, TC3, TC111) and can be called the other cluster. To better understand the relationship between clusters of socio-demographic attributes and activity sequence patterns, a combined map of attribute categories and activity sequence patterns is shown in Fig. 4. The activity sequence patterns are colour coded in three clusters according to the type of activity they contain. It is also interesting to note that activity sequence pattern TC1 (H–W–H), which is the most frequently occurring activity sequence pattern in the dataset, is very close to activity sequence pattern TC112 (H–W–H–S–H–PE–H), which is the least common of the modeled activity sequence patterns (shown in bold and italic in Fig. 3). Both activity sequence patterns contain work activities. Their similar positions reflect the similarity between these activity sequence patterns.

Fig. 4 Factor map showing the relationship between attribute categories and activity sequence patterns

Figures 2, 3 and 4 were all plotted using the same two dimensions, even though the method generated 15 dimensions. The same scatter plots could be created for any combination of these 15 dimensions. However, dimension 1 and dimension 2 have the highest explanatory power for the variation in the data with 18.46% and 15.42%, respectively. As shown in Fig. 1, dimensions 3 through 15 have substantially less explanatory power. The patterns identified in Figs. 2, 3 and 4 become most apparent when plotting dimension 1 and dimension 2 in one scatter plot.

Prediction methodology

Multiple correspondence analysis is a descriptive technique used to analyse relationships in the data. To the best of the authors’ knowledge, there is no standard procedure to perform predictions directly from multiple correspondence analysis results. Therefore, a procedure was developed to estimate the probability of activity sequence patterns from MCA results based on the socio-demographic attributes.

The prediction model is based on the concept of utility maximization theory. The utility for an individual to select an activity sequence pattern depends on the distance between the individual and the activity sequence pattern. MCA is used to calculate the coordinates of a given individual (based on their socio-demographic attributes) and the coordinates of every activity sequence pattern. The closer the coordinates of an activity sequence pattern are to an individual, the greater the likelihood that the individual will choose this activity sequence pattern. MCA calculates several dimensions, and each dimension contributes to explaining which activity sequence pattern might be chosen. The selection method iterates across all dimensions to calculate a combined distance between the individual and all activity sequence patterns. Every dimension is weighted by its corresponding contribution to explaining the selection of activity sequence patterns. The MCA distance that determines the likelihood of individual \(i\) selecting activity sequence pattern \(j\) is calculated by Eq. 1:

$$d_{ij} = \sqrt{\sum_{k=1}^{n} \left[ \left( I_{i} - TC_{j} \right)_{k}^{2} \times w_{k} \right]}$$
(1)

where \(d_{ij}\) = distance between individual \(i\) and activity sequence pattern \(j\); \(I_{i}\) = coordinate value of individual \(i\) in \(k{\text{th}}\) dimension; \(TC_{j}\) = coordinate value of activity sequence pattern \(j\) in \(k{\text{th}}\) dimension; \(w_{k}\) = weight of \(k{\text{th}}\) dimension (equal to percentage of variance explained by dimension \(k\)); \(k\) = dimension number; \(n\) = total number of dimensions.

Not all dimensions explain the same amount of variability; some dimensions are more important than others. Therefore, a weighting factor (\(w_{k}\)) is included in Eq. 1 for a weighted calculation of distance. These weights are equal to the percentage of variance explained by each dimension formed by multiple correspondence analysis. The distance should be computed considering all dimensions in order to capture the whole variability of the data. By using Eq. 1, a matrix can be formed that provides the distances between individuals and activity sequence patterns.
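As a worked illustration of Eq. 1, the following sketch computes the weighted distance for one individual and one activity sequence pattern. The coordinates and weights are invented, only two dimensions are used to keep the arithmetic readable, and the weights are expressed as fractions rather than percentages, which is an assumption of this example (a different scaling of the weights would only rescale all distances by a constant factor).

```python
import math

def mca_distance(individual, pattern, weights):
    """Weighted Euclidean distance of Eq. 1 across all MCA dimensions."""
    if not (len(individual) == len(pattern) == len(weights)):
        raise ValueError("coordinates and weights must cover the same dimensions")
    return math.sqrt(sum(w * (a - b) ** 2
                         for a, b, w in zip(individual, pattern, weights)))

# Hypothetical coordinates in two dimensions and hypothetical weights.
individual = (0.5, -0.2)
pattern = (0.1, 0.1)
weights = (0.6, 0.4)

# (0.4**2 * 0.6) + ((-0.3)**2 * 0.4) = 0.096 + 0.036 = 0.132
d = mca_distance(individual, pattern, weights)
print(round(d, 4))  # square root of 0.132
```

Applying the function to every pair of individual and activity sequence pattern yields the distance matrix described above.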

Calculation of utilities

The utility of an activity sequence pattern for a given individual depends on the impedance or distance between the individual and the activity sequence pattern. The utility of an activity sequence pattern closer to the individual is larger than that of an activity sequence pattern farther away. To achieve this, a negative exponential function is used. Given an individual \(i\), the utility of each activity sequence pattern \(j\) can be calculated by Eq. 2 below:

$${U}_{ij}= {e}^{-\alpha \times {d}_{ij}}$$
(2)

where \({U}_{ij}\) = utility of individual \(i\) for activity sequence pattern \(j\); \({d}_{ij}\) = distance between individual \(i\) to activity sequence pattern \(j\); \(\alpha\) = calibrated parameter

By using Eq. 2, a utility matrix can be formed between all individuals and all observed activity sequence patterns.
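The effect of the \(\alpha\) parameter in Eq. 2 can be seen in a small numerical sketch. The two distances are hypothetical; the \(\alpha\) values 0.6 and 0.45 are the calibrated values reported in this paper for students and workers, and they are used here only to compare how quickly utility decays with distance.

```python
import math

def utility(distance, alpha):
    """Utility of Eq. 2: decays exponentially with the MCA distance."""
    return math.exp(-alpha * distance)

near, far = 0.2, 1.5  # hypothetical MCA distances to two patterns

for group, alpha in (("students", 0.6), ("workers", 0.45)):
    ratio = utility(near, alpha) / utility(far, alpha)
    print(f"{group}: U(near)={utility(near, alpha):.3f}, "
          f"U(far)={utility(far, alpha):.3f}, ratio={ratio:.2f}")
```

A larger \(\alpha\) increases the utility ratio between a near and a far pattern, so distant patterns become less attractive relative to close ones.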

Calculation of probabilities

Based on the calculated utilities, the probability of individual \(i\) choosing activity sequence pattern \(j\) can be calculated using Eq. 3 below:

$$P_{ij} = \frac{e^{\beta \times U_{ij}} \times \frac{f_{j}}{d_{ij}}}{\sum_{m=1}^{n} \left( e^{\beta \times U_{im}} \times \frac{f_{m}}{d_{im}} \right)}$$
(3)

where \(P_{ij}\) = probability of individual \(i\) choosing activity sequence pattern \(j\); \(U_{ij}\) = utility of individual \(i\) for activity sequence pattern \(j\) (Eq. 2); \(f_{j}\) = relative frequency of activity sequence pattern \(j\) as observed in the survey data; \(d_{ij}\) = distance between individual \(i\) and activity sequence pattern \(j\) (Eq. 1); \(n\) = number of activity sequence patterns, over which the index \(m\) is summed; \(\beta\) = calibrated parameter.

Equation 3 multiplies the exponentiated utility by a scaling factor consisting of the relative frequency of activity sequence pattern \(j\) divided by the distance between activity sequence pattern \(j\) and individual \(i\). This scaling factor ensures that very frequent activity sequence patterns (such as home–work–home) are chosen more often than less common ones, even if a less common activity sequence pattern is relatively close to a given individual. Two activity sequence patterns can be at the same distance from an individual based on the individual's socio-demographic attributes, yet one can be more frequent than the other. The distance calculated by the MCA method merely determines that a given activity sequence pattern is closer to one person type than to another; it does not indicate the frequency of an activity sequence pattern. For example, TC1 (H–W–H) and TC112 (H–W–H–S–H–PE–H) are both activity sequence patterns that are selected mostly by workers. Therefore, both activity sequence patterns are close to workers in terms of MCA distance. However, TC1 is the most frequently selected activity sequence pattern, while TC112 is a less frequently selected one. If the selection were made based on distance only, TC1 and TC112 would be selected with similar frequency. While the frequency weights help to reproduce the right frequency of activity sequence patterns, the MCA method helps to assign the most likely activity sequence patterns to the right person based on their socio-demographic attributes.
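The effect of the frequency scaling can be illustrated with a short Python sketch of Eq. 2 and Eq. 3 for a single individual. The two patterns, their distances and frequencies, and the value of \(\beta\) are hypothetical; they mimic the case of a frequent and a rare pattern at the same MCA distance.

```python
import math

def choice_probabilities(distances, frequencies, alpha, beta):
    """Eq. 2 and Eq. 3 for one individual; all distances must be strictly positive."""
    scores = []
    for d, f in zip(distances, frequencies):
        utility = math.exp(-alpha * d)                   # Eq. 2
        scores.append(math.exp(beta * utility) * f / d)  # numerator of Eq. 3
    total = sum(scores)                                  # denominator of Eq. 3
    return [s / total for s in scores]

# Hypothetical: a frequent and a rare pattern, both at distance 1.0.
distances = [1.0, 1.0]
frequencies = [0.20, 0.01]
probs = choice_probabilities(distances, frequencies, alpha=0.45, beta=1.0)
```

With equal distances, the exponential terms cancel and the probabilities reduce to the ratio of the relative frequencies, so the frequent pattern is chosen about 95% of the time. Note that the scaling factor divides by \({d}_{ij}\), so a distance of exactly zero would need special handling in any implementation.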

Predictive model results

For model calibration, unique person types were determined. A total of 899 unique types were found. All individuals were assigned to one of three categories: workers, students and others. Initially, the model was calibrated for these three groups. The coordinates of the individuals that were used to build the multiple correspondence analysis model were already estimated by the same analysis. Given the coordinates of activity sequence patterns from the multiple correspondence analysis model, the distance between individuals and activity sequence patterns was calculated using Eq. 1. The parameter \(\alpha\) of the utility equation (Eq. 2) and the parameter \(\beta\) of the probability equation (Eq. 3) were calibrated for each group. The \(\alpha\) parameter defines the degree to which individuals concentrate on activity sequence patterns that are very close in terms of MCA distance, as opposed to also considering activity sequence patterns that are somewhat farther away. The \(\alpha\) value should be calibrated separately for different user groups, because they are likely to have a different “tolerance” for less common activity sequence patterns. The calibration suggests that students stick more closely to the most likely activity sequence patterns (expressed by the larger \(\alpha\) value of 0.6) than workers (\(\alpha\) = 0.45). Most students go to school and back, while workers tend to have more variety in activity sequence patterns and, therefore, will choose activity sequence patterns with larger MCA distances as well. The \(\beta\) parameter is used to translate utilities into probabilities. It defines how strongly individuals focus on the highest utilities, as opposed to (occasionally) selecting options with lower utilities.
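To illustrate the role of \(\alpha\), consider two hypothetical activity sequence patterns at distances \({d}_{i1}=1\) and \({d}_{i2}=3\) from an individual (values chosen for illustration only). By Eq. 2, the utility of the farther pattern relative to the nearer one is

$$\frac{{U}_{i2}}{{U}_{i1}}={e}^{-\alpha \times \left({d}_{i2}-{d}_{i1}\right)}={e}^{-2\alpha }$$

For \(\alpha\) = 0.6 this ratio is \({e}^{-1.2}\approx 0.30\), while for \(\alpha\) = 0.45 it is \({e}^{-0.9}\approx 0.41\). Under these assumed distances, the farther pattern retains a larger share of the utility for workers than for students, which is consistent with workers choosing activity sequence patterns at larger MCA distances more often.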

The coefficient of determination (\({R}^{2}\)) and the root-mean-square error (RMSE) were used as measures for calibrating the parameters, by maximizing \({R}^{2}\) and minimizing RMSE. These goodness-of-fit measures describe how closely the model predicts activity sequence patterns aggregated to educational, work and leisure clusters. The results are shown in Table 5 and Fig. 5. For this comparison, all activity sequence patterns that contain the activity work were grouped into the work cluster. All activity sequence patterns that contain the activity education were grouped into the education cluster. The remaining activity sequence patterns were aggregated to other. The observed probability of 2.04% of the education cluster for a worker is calculated by adding the observed probabilities of all activity sequence patterns present in that cluster. The predicted probabilities are calculated in the same way, by adding the predicted probabilities within each cluster. The results show that for workers, 59.34% of the activity sequence patterns were observed in cluster 2 (work-related activity sequence patterns), while the model predicts 59.84% for the same cluster. Similarly, education and other activity sequence patterns were predicted reasonably well for workers. For students, the model correctly predicts education activity sequence patterns but overpredicts activity sequence patterns in the work cluster.
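The aggregation and the two goodness-of-fit measures can be sketched in Python as follows. The pattern labels and shares are hypothetical, and \({R}^{2}\) is computed here as one minus the ratio of residual to total sum of squares, which is one common definition and an assumption about the calculation used in this study.

```python
import math

def aggregate_by_cluster(pattern_shares, pattern_cluster):
    """Sum pattern-level shares (in %) into cluster-level shares."""
    totals = {}
    for pattern, share in pattern_shares.items():
        cluster = pattern_cluster[pattern]
        totals[cluster] = totals.get(cluster, 0.0) + share
    return totals

def rmse(observed, predicted):
    """Root-mean-square error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """1 - SS_res / SS_tot (ASSUMED definition of R^2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical pattern labels and shares in percent, not survey results.
pattern_cluster = {"work_a": "work", "work_b": "work",
                   "edu_a": "education", "other_a": "other"}
observed = {"work_a": 50.0, "work_b": 10.0, "edu_a": 5.0, "other_a": 35.0}
predicted = {"work_a": 48.0, "work_b": 13.0, "edu_a": 4.0, "other_a": 35.0}

obs_c = aggregate_by_cluster(observed, pattern_cluster)
pred_c = aggregate_by_cluster(predicted, pattern_cluster)
clusters = sorted(obs_c)
obs = [obs_c[c] for c in clusters]
pred = [pred_c[c] for c in clusters]
fit = (r_squared(obs, pred), rmse(obs, pred))
```

A calibration routine would then search for the \(\alpha\) and \(\beta\) values that maximize the first element of `fit` and minimize the second.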

Fig. 5 a–c Model calibration results for the three categories workers, students and others

Table 5 Model Calibration Results for three categories

A reason for this reduced fit could be the larger number of work activity sequence patterns (31 activity sequence patterns) compared to education activity sequence patterns (15 activity sequence patterns). For the others category, the model predicts the activity sequence patterns from cluster 3 (others cluster) reasonably well, while underpredicting the education and overpredicting the work activity sequence patterns.

Secondly, the model was calibrated for the top 10 most frequently occurring person types to compare the aggregate calibration with a calibration for individual person types (Table 6).

Table 6 Person type definitions

For instance, person type 1 represents a group of individuals who are workers, with household size 1, auto-ownership category 1, age group category 1, male, income category 1 and own a driver’s license. This person type represents 2.0% of the population. The parameters \(\alpha\) and \(\beta\) were calibrated individually for these ten person types as shown in Table 7 and Fig. 6. The figure also points out the outliers for every person type. Applying the parameters calibrated for the three broad groups (Table 5) to these person types shows that the person types are rather sensitive to calibration factors.

Table 7 Model calibration results for the top 10 person types based on individual calibration factors

Fig. 6 a–j Model calibration results for the top 10 person types

\({R}^{2}\) drops significantly and RMSE increases significantly for person types 5, 6, 7 and 10 when the parameters calibrated for the three groups are used instead of individually calibrated parameters. For instance, \({R}^{2}\) drops from 0.92 to 0.79 and RMSE increases from 0.79 to 2.26 for person type 5. The cluster probabilities also do not match as well for these person types when parameters aggregated to three groups are used. Therefore, it seems advisable to calibrate each person type separately to obtain the best results. However, infrequently occurring person types will not have enough records in the survey for calibration. It is therefore suggested to use individually calibrated parameters for all person types with at least 30 records and aggregated parameters for the three person groups for all other person types.
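The suggested fallback rule can be expressed as a small lookup, sketched here in Python. The function and dictionary names are hypothetical. The \(\alpha\) values for workers and students are those reported for the group calibration, while the \(\beta\) values, the person-type parameters and the record counts are placeholders.

```python
MIN_RECORDS = 30  # minimum number of survey records for individual calibration

def select_parameters(person_type, group, record_counts, type_params, group_params):
    """Return (alpha, beta): person-type values if enough records exist, else group values."""
    enough_records = record_counts.get(person_type, 0) >= MIN_RECORDS
    if enough_records and person_type in type_params:
        return type_params[person_type]
    return group_params[group]

# Alpha values per group from the text; beta values are placeholders.
group_params = {"worker": (0.45, 1.0), "student": (0.6, 1.0)}
# Hypothetical individually calibrated parameters and record counts.
type_params = {1: (0.5, 1.2)}
record_counts = {1: 250, 899: 4}

frequent = select_parameters(1, "worker", record_counts, type_params, group_params)
rare = select_parameters(899, "worker", record_counts, type_params, group_params)
```

Here the frequently observed person type receives its own parameters, while the person type with only four records falls back to the worker group parameters.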

The described methodology can simulate the selection among the 112 activity sequence patterns for which at least 30 observations were found in the survey. With this method, 72% of all persons and their activity sequence patterns are covered. The other, less frequently observed activity sequence patterns, which are often the more complex ones, also need to be included in model application to represent the full range of activity sequences. As those remaining 28% cannot be related to socio-demographic attributes with statistical confidence, they are assigned randomly based on the observed frequency, under the constraint that education-related patterns are assigned to students only and work-related patterns are assigned to workers only.
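A constrained random draw of this kind could look like the following Python sketch. The pattern names, frequencies and the `restricted_to` field are hypothetical, and the sketch does not address how it is decided which persons receive a rare pattern in the first place.

```python
import random

def assign_rare_pattern(person_group, rare_patterns, rng):
    """Draw one rare pattern, weighted by observed frequency, among eligible patterns."""
    eligible = [p for p in rare_patterns
                if p["restricted_to"] is None or p["restricted_to"] == person_group]
    weights = [p["frequency"] for p in eligible]
    return rng.choices(eligible, weights=weights, k=1)[0]["name"]

# Hypothetical rare patterns: education-related ones are restricted to
# students, work-related ones to workers, all others are unrestricted.
rare_patterns = [
    {"name": "rare_edu", "frequency": 0.004, "restricted_to": "student"},
    {"name": "rare_work", "frequency": 0.006, "restricted_to": "worker"},
    {"name": "rare_other", "frequency": 0.003, "restricted_to": None},
]

rng = random.Random(42)  # fixed seed so that the draws are reproducible
worker_draws = [assign_rare_pattern("worker", rare_patterns, rng) for _ in range(200)]
other_draws = [assign_rare_pattern("other", rare_patterns, rng) for _ in range(200)]
```

Filtering before drawing guarantees that the constraint holds for every draw, while the relative frequencies of the remaining eligible patterns are preserved.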

Conclusions

The proposed methodology works reasonably well to explain frequently observed activity sequences. In the given dataset, 72% of all observed activity sequence patterns were conducted by at least 30 respondents and, therefore, were deemed to be represented sufficiently well by this survey. The remaining 28% of all observed activity sequence patterns are more complex and rather rare and, therefore, cannot be represented by this method on an equally well-founded basis. While the assignment of rare activity sequence patterns has a random component, frequently observed activity sequence patterns are assigned realistically based on socio-demographic data.

One of the most challenging problems for activity-based models is the determination of activity sequences. Commonly, a two-step choice is implemented, where first a main activity is selected, and subsequently, stops on this tour may be added. The hypothesis of this common approach for activity-based models is that every activity sequence is driven by one dominant activity. Furthermore, activity sequences are commonly modeled from home to home, i.e., the sequence H–W–H–S–H would be modeled as two independent sequences (H–W–H and H–S–H). With the exception of time constraints, most activity-based models do not consider the attributes of the first tour when modeling the second tour. Furthermore, most activity-based models limit the number of intermediate stops to one or two stops, thereby limiting the diversity of modeled activity sequences. In practice, activity-based models solve this problem by a very large number of discrete choice problem sets that are ordered based on heuristic assumptions (such as, a main tour purpose is chosen first).

The study described here developed a methodology that considers activity sequence patterns of individuals based on socio-demographic attributes. No individual discrete choice models needed to be estimated, but a probability for 112 different activity sequence patterns was calculated based on the agents’ socio-demographic attributes. Among the most favourable alternatives for an individual, preference is given to those alternatives that are more frequently selected according to the household travel survey. The whole day is simulated as one consistent sequence, and the number of intermediate stops is only limited by the sample size of complex activity sequence patterns found in the survey data. Moreover, the methodology is flexible and could include land-use and transportation related attributes to further explain activity sequence patterns. With this new method, a very complex module of every activity-based model was simplified into a single step that creates a large variety of activity sequence patterns based on socio-demographic data.

For a full travel demand model, the presented methodology needs to be expanded by a location choice model for each activity, a mode choice model and an activity timing model. The location choice model needs to work similarly to a traditional destination choice model, with the exception that the locations of the activities on a tour tend to be in spatial proximity (i.e., for an H–W–S–H activity sequence pattern, someone would most likely select a shopping location near work or near home). The mode choice model works similarly to a traditional mode choice model of a four-step model, with the exception that commonly a main mode is chosen for an entire tour, and sub-modes are chosen for individual trips. The activity timing model needs to model a preferred arrival time and an activity duration. While there is still some model development needed, the method presented here is considered to be a major step towards significantly simpler activity-based models.