Data
The UK has a comparatively developed online market for groceries (Kantar Worldpanel 2015) and hence offers an opportunity to study the impacts of ICT use on personal travel patterns. National and regional travel surveys often used in transport research (e.g., the British National Travel Survey (DfT 2016) and the London Travel Demand Survey (TfL 2016)) offer detailed data on physical trips for food shopping, yet capture very limited information on online activity. Specific online episodes are not recorded, so information on when each online shopping event took place, which is crucial for our purposes here, is not available.
Data used for the present study were obtained from a consumer panel run by Kantar Worldpanel (KWP). Participating households use hand-held optical scanners to record daily purchases of fast-moving consumer goods that are brought home. Fast-moving consumer goods are products found in supermarkets that are typically bought frequently and at relatively low prices. Examples are groceries, toiletries, and health and beauty items. Panellists are also asked to scan and send till receipts whenever they make a shopping trip or receive a delivery. Shopping occasions are recorded regardless of the retail chain, and hence include visits to smaller local shops, independent corner shops, and larger supermarket chains. KWP is a continuous panel in which households can participate for as long as they wish, which provides longitudinal data; participants receive incentives in the form of redeemable points. Household characteristics, including socio-demographics, are collected at the initial interview and updated every six months or annually. All household members record their purchases separately; however, the dataset employed had an indicator for the main shopper only and did not otherwise distinguish between household members. KWP defines main shoppers as the household members who are responsible for the bulk of grocery shopping in their household. For model estimation, we use shopping occasions recorded by all shoppers in the household without distinguishing between different members. Purchases from all retailers, including online retailers, are recorded together with basket sizes in terms of monetary value. Unlike in travel surveys, online and in-store shopping observations contain data at the same level of detail. Because purchases are recorded at product level, shopping records include only visits where a transaction occurs, and hence exclude search-only or returns-only visits.
Since the focus is on grocery shopping, however, shopping occasions that do not involve any purchase are likely to be rare. Information on basket sizes is also crucial, as the amount bought will potentially influence the duration until the next shopping occasion. For assessing the influence of online shopping on inter-shopping duration, it is important to account separately for basket size effects, especially since online shopping is often associated with large baskets due to minimum order requirements for delivery. Consumer panel data are also attractive because market research companies operate consumer panels in many countries and some make the data available to researchers (Footnote 2). Availability is important for the generalisability of our proposed method. Consumer panels, however, also have some shortcomings. Information on the retail chain is available, but the specific store is not known. They also do not gather travel-related information; for instance, travel mode choice for shopping, which is potentially important when analysing travel implications, is not collected. Market research companies collect relatively limited demographic information on respondents, and transitions (e.g., changes in employment, income, marital status, children) are not always well recorded.
The data obtained from KWP cover a one-year period between September 2013 and August 2014 for a sample of 168 households in London. Of these, 124 households are panellists who reside in the two selected boroughs, Barnet and Enfield. Thirty-four of the 124 were online shopping households with reported online observations during the one-year period. An additional forty-four households were drawn randomly from online shopping households across London, since we are primarily interested in their behaviour. This, however, causes sampling bias, as online shopping households are over-represented in our estimation sample. To correct for this, we apply weights to all observations, assuming that the share of online shopping households in London matches the shares observed in Enfield and Barnet (Footnote 3). Comparing the demographics of our sample with 2011 Census data for Enfield and Barnet (Office for National Statistics 2011), we find that households with older main shoppers and those in the highest social classes are over-represented in our study sample (Suel 2016). We note that the Census provides no information on main shoppers, hence the age of the household reference person is used as a proxy for comparison.
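The reweighting idea can be illustrated with the household counts quoted above. The sketch below is a simple post-stratification calculation and is an assumption on our part: the exact weighting scheme is described in the footnote, which may differ, and the function and constant names are hypothetical.

```python
# Household counts quoted in the text.
BOROUGH_HOUSEHOLDS = 124   # panellists in Barnet and Enfield
BOROUGH_ONLINE = 34        # of which online shopping households
EXTRA_ONLINE = 44          # additional online households drawn from London


def household_weights():
    """Return (weight_online, weight_offline).

    Assumption: the target share of online households is the share
    observed in the two boroughs; the sample share is taken from the
    full estimation sample.
    """
    total = BOROUGH_HOUSEHOLDS + EXTRA_ONLINE        # 168 households
    sample_online = BOROUGH_ONLINE + EXTRA_ONLINE    # 78 households
    sample_offline = total - sample_online           # 90 households

    target_online = BOROUGH_ONLINE / BOROUGH_HOUSEHOLDS
    target_offline = 1.0 - target_online

    w_online = target_online / (sample_online / total)
    w_offline = target_offline / (sample_offline / total)
    return w_online, w_offline


w_online, w_offline = household_weights()
print(f"online weight:  {w_online:.3f}")
print(f"offline weight: {w_offline:.3f}")
```

Under this assumption, online shopping households receive a weight below one and other households a weight above one, while the weighted number of households still sums to 168.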
While non-traditional panel data offer significant advantages for research, potential problems and sampling biases should not be overlooked. We note that households in our sample share important characteristics relating to the geographical proximity of their residential addresses. Also, while the number of recurring observations from each responding household was large, the main limitation is that the number of households in the sample was relatively small due to budget constraints. If population-scale prediction is a priority, then a larger and more representative sample would be desirable. Here, we demonstrate the insights that can be drawn and the prediction capabilities that can be developed from existing data sources using the modelling tools presented below.
Analysis methods
Hazard-based methods were initially developed for modelling duration data such as time to failure or time to some form of state change. These methods are often used in survival analysis in the context of biological problems (e.g., expected time to death or organ failure) and in reliability analysis for mechanical systems. Hensher and Mannering (1994) and Bhat and Pinjari (2000) presented extensive reviews of hazard-based duration models and their applications in transport research. The basic idea is to model the probability of an end-of-duration occurrence given that the duration has lasted for some time. This probability depends on the length of elapsed time and also on relevant covariates. These models are suitable for analysing inter-shopping duration, i.e. the time between two consecutive shopping occasions. The probability of ending a duration depends on its length (the time elapsed since the last shopping activity) due to depletion effects (i.e. an individual is more likely to go grocery shopping on any given day if s/he has not done so for a longer period). This duration will also likely depend on other observable characteristics such as household size (larger households consume more and might need to go shopping more frequently) and basket size on the last activity (if more products are stocked, the need to go shopping again might decrease).
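To make the conditional nature of the hazard concrete, the following minimal sketch computes a discrete-time empirical hazard from a list of whole-day durations. The durations are invented for illustration and the function name is hypothetical; this is not the estimator used for the models in this paper.

```python
from collections import Counter


def empirical_hazard(gaps_in_days):
    """Discrete-time hazard: for each day d on which a duration ends,
    the share of durations ending on day d among those that lasted
    at least d days. Days on which no duration ends are omitted
    (their empirical hazard is zero)."""
    ended = Counter(gaps_in_days)
    at_risk = len(gaps_in_days)
    hazard = {}
    for day in sorted(ended):
        hazard[day] = ended[day] / at_risk
        at_risk -= ended[day]   # these durations are no longer at risk
    return hazard


# Made-up inter-shopping durations (whole days), for illustration only.
toy_gaps = [1, 1, 2, 2, 2, 3, 3, 5, 7, 7]
toy_hazard = empirical_hazard(toy_gaps)
for day, h in toy_hazard.items():
    print(day, round(h, 3))
```

The denominator shrinks as durations end, which is what distinguishes the hazard from the unconditional share of durations of a given length.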
In this paper, we aim to model gap times between recurring grocery shopping occasions using the Cox proportional hazards model. The probability that an end-of-duration occurrence (i.e., a shopping observation) takes place at time t, given that the duration has lasted until t, is described by the hazard function h(t). Estimating the effect sizes associated with selected determinants of duration, such as demographic variables and situational factors, is of interest for studying inter-shopping duration behaviour. In the Cox proportional hazards model, it is assumed that covariates act multiplicatively on an underlying baseline hazard function, as represented in Eq. (1). Note that with the Cox approach, the baseline hazard itself need not be estimated in order to estimate the effects of different factors on an end-of-duration occurrence (Cox 1972).
$$\begin{aligned} h(t|Z)= h_{o}(t) \exp (\beta Z) \end{aligned} \tag{1}$$
where \(h_{o}(t)\) is the baseline hazard, \(Z\) is the vector of covariates, and \(\beta\) is the vector of parameters to be estimated.
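The reason the baseline hazard does not need to be estimated is that it cancels out of the partial likelihood, which depends only on the ordering of durations. The sketch below illustrates this for a single covariate with a Newton-Raphson fit. It is a simplified illustration under stated assumptions (no censoring, no tied durations, no observation weights, invented data, hypothetical function names) and not the estimation code used for the results reported here.

```python
import math


def cox_score_and_info(beta, times, z):
    """Gradient and negative Hessian of the Cox partial log-likelihood
    for one covariate, assuming no censoring and no tied times."""
    score, info = 0.0, 0.0
    for t_i, z_i in zip(times, z):
        # Risk set: all durations that have lasted at least t_i.
        risk = [z_j for t_j, z_j in zip(times, z) if t_j >= t_i]
        w = [math.exp(beta * z_j) for z_j in risk]
        s0 = sum(w)
        s1 = sum(w_j * z_j for w_j, z_j in zip(w, risk))
        s2 = sum(w_j * z_j * z_j for w_j, z_j in zip(w, risk))
        mean = s1 / s0
        score += z_i - mean
        info += s2 / s0 - mean * mean
    return score, info


def fit_cox(times, z, iterations=25):
    """Newton-Raphson iterations starting from beta = 0."""
    beta = 0.0
    for _ in range(iterations):
        score, info = cox_score_and_info(beta, times, z)
        beta += score / info
    return beta


# Invented durations (days) and a binary covariate, for illustration only.
times = [1, 2, 3, 4, 5, 6, 7, 8]
z = [1, 1, 0, 1, 0, 0, 1, 0]
beta_hat = fit_cox(times, z)
print("beta:", round(beta_hat, 3), "hazard ratio:", round(math.exp(beta_hat), 3))
```

Applying any increasing transformation to the durations (for example squaring them) leaves the estimate unchanged, since only the composition of the risk sets enters the calculation.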
The proportional hazard form presented in Eq. (1) is based on the assumption that gap times are independent and identically distributed. This can be restrictive if, as is the case here, the estimation data contain multiple observations of recurring shopping occasions from the same shopper. Correlation between observations belonging to the same shopper due to unobserved factors can be accounted for using so-called frailty models. For this case, we use the form in Eq. (2) for the proportional hazard function (Cook and Lawless 2007).
$$\begin{aligned} h(t|Z)= h_{o}(t) \exp (\beta Z + b_i) \end{aligned} \tag{2}$$
where \(b_i\) is the unobserved random effect associated with individual i and is assumed to follow a Gaussian distribution. The random effect terms also capture the influence of unobserved risk factors, i.e. omitted variables affecting the hazard. While only a few risk factors can be observed and meaningfully measured, other unknown or unmeasured variables may influence the time between shopping occasions. Modelling individual unobserved random effects does not identify the specific omitted variables (how many, which, and how important), but it highlights that some significant factors have been excluded (Hougaard 1995); it will, however, reduce omitted-variable bias by capturing the effects of omitted variables that are independent of the observed covariates. We report estimation results using both forms, Eq. (1) and Eq. (2), when discussing our findings.
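The role of the random effect in Eq. (2) can be illustrated by simulation: two households with identical observed covariates but different values of \(b_i\) have systematically different gap times. The sketch below assumes a constant baseline hazard (an exponential model) purely to keep the example short; the parameter values and function names are invented and are not estimates from our data.

```python
import math
import random


def simulate_gaps(beta, z, b_i, n, baseline_rate=0.5, seed=0):
    """Draw n gap times for one household.

    Assumption: constant baseline hazard, so gap times are exponential
    with rate baseline_rate * exp(beta * z + b_i).
    """
    rng = random.Random(seed)
    rate = baseline_rate * math.exp(beta * z + b_i)
    return [rng.expovariate(rate) for _ in range(n)]


def mean(values):
    return sum(values) / len(values)


# Two hypothetical households with the same covariate value (z = 3)
# but different unobserved effects b_i. The same seed is used for both
# so that the comparison is not blurred by sampling noise.
frequent = simulate_gaps(beta=0.2, z=3, b_i=0.5, n=5000)
infrequent = simulate_gaps(beta=0.2, z=3, b_i=-0.5, n=5000)
print(round(mean(frequent), 2), round(mean(infrequent), 2))
```

Ignoring \(b_i\) would treat the gap times of both households as draws from the same distribution, which is the independence assumption that the frailty specification relaxes.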
For our empirical analyses, we differentiate between shopping events and shopping trips. For the former, gap times are defined as the time between two shopping events regardless of whether they were online or in-store; online and in-store observations are both modelled as end-of-duration occurrences. For the latter, gap times are defined as the time elapsed between two in-store shopping trips; only in-store observations, which generate personal trips to physical stores, are modelled as end-of-duration occurrences. From the available data, we calculated times between shopping events and between shopping trips separately over one year. These are used as durations in the hazard-based analyses. Figure 1 shows the distribution of inter-shopping-event and inter-shopping-trip times over our sample. As expected, times between shopping trips are longer on average, and there are cases where households shop only online for long periods without physically visiting stores. Note that there are also more outlier cases for inter-shopping-trip durations, grouped in the 14+ days bin, where the time between consecutive shopping trips exceeds two weeks. When multiple shopping events or trips are observed in a single day, durations are computed as being evenly distributed across one day. The first bins in the histogram plots include shopping activity observed within one day, which makes up almost 35% of all observations for each category.
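The two duration definitions can be sketched as follows. The records are hypothetical, and the handling of same-day occasions reflects one possible reading of "evenly distributed across one day" (k occasions on a day are placed at offsets 0, 1/k, ..., (k-1)/k), which is an assumption rather than a documented rule.

```python
from collections import Counter
from datetime import date


def spread_within_day(records):
    """Turn (date, channel) records into (time_in_days, channel) pairs.

    Assumption: k records on the same day are placed at offsets
    0, 1/k, ..., (k-1)/k within that day.
    """
    records = sorted(records, key=lambda r: r[0])   # stable sort by date
    per_day = Counter(d for d, _ in records)
    seen = Counter()
    timed = []
    for d, channel in records:
        offset = seen[d] / per_day[d]
        seen[d] += 1
        timed.append((d.toordinal() + offset, channel))
    return timed


def gap_times(records):
    """Differences between consecutive occasions, in days."""
    times = [t for t, _ in spread_within_day(records)]
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Hypothetical records for one household.
records = [
    (date(2013, 9, 2), "store"),
    (date(2013, 9, 2), "store"),
    (date(2013, 9, 5), "online"),
    (date(2013, 9, 9), "store"),
]
event_gaps = gap_times(records)                                  # all occasions
trip_gaps = gap_times([r for r in records if r[1] == "store"])   # in-store only
print(event_gaps, trip_gaps)
```

Because online occasions are dropped before the trip durations are computed, an online purchase lengthens the surrounding inter-shopping-trip duration while shortening the inter-shopping-event durations.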
Our main interest in this paper is to test the effect of online shopping activity on the instantaneous rate of occurrence of shopping (inter-shopping-event durations) and of travel for shopping (inter-shopping-trip durations). We do so by incorporating, along with other shopper characteristics and situational variables, two key indicators: (i) whether the household has adopted online shopping as a channel; and (ii) whether its last shopping occasion was online. The former is a dummy variable that is equal to one only after a respondent is observed shopping online and remains equal to one for the remainder of our observation period (Footnote 4). It therefore captures a general long-term effect of adoption, whereas the latter reveals short-term (tactical) effects of buying online.
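A minimal sketch of how the two indicators could be derived from a time-ordered sequence of shopping channels is given below. The channel sequence and function name are hypothetical, and the coding follows our reading of the definitions above for the shopping-event model; details given in the footnote may refine it.

```python
def online_indicators(channels):
    """For each gap between occasion i-1 and occasion i, return a pair
    (adopted, last_online) of 0/1 indicators.

    adopted     : 1 once any earlier occasion was online, and stays 1
    last_online : 1 only if the occasion that started the gap was online
    """
    rows = []
    adopted = 0
    for previous in channels[:-1]:      # the last occasion starts no gap
        if previous == "online":
            adopted = 1
        rows.append((adopted, int(previous == "online")))
    return rows


# Hypothetical channel sequence for one household, in time order.
channels = ["store", "store", "online", "store", "store", "online", "store"]
indicators = online_indicators(channels)
for row in indicators:
    print(row)
```

The adoption indicator switches on once and stays on, while the last-occasion indicator switches on and off with each purchase, which is what allows the long-term and short-term effects to be separated.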
A specification search was conducted using household characteristics and basket size in terms of monetary value. The panel collects information on income, but panellists are not required to report it. Income information was missing for a significant share of the households in the dataset we were provided with. Therefore, social class was used as a proxy for income, where higher social classes are assumed to be associated with higher income bands (Footnote 5). We did not have information on gender. The socio-demographic variables used in the specification search were: social class, household size, number of children, age, education level, and life stage. Covariates for which no significant effects were identified (social class, number of children, education level, and life stage) were left out. The final model specification, in addition to the two key online channel indicators, included household size, age of main shopper, and basket size in the last shopping occasion, all as scale variables. Basket characteristics are often neglected in the transport literature because this type of information is not usually available in the traditional datasets used in travel demand analysis. However, as briefly outlined above and presented in the results section, the basket size variable has a strong effect. A summary of descriptive statistics for the covariates used in the estimation of the hazard models is presented in Table 1, along with the numbers of shopping events and shopping trips.
Table 1 Descriptive statistics for the sample