Introduction

In general, disaggregate travel models use data from household travel surveys. However, problems with that survey collection process are well documented, from consumer resistance and language impediments to response ambiguity and increasing percentages of cell phone-only households. Many large regional surveys do not sample from particular rural and suburban areas, and do not offer a sufficient number of observations to conduct detailed analysis at the sub-regional level. Along with the rising cost of surveying households, these problems have motivated researchers to consider alternative data gathering methods, such as simulating travel surveys (Pointer et al. 2004) or accessing public microdata surveys (Purvis 1994).

One way to obtain data for disaggregate choice modeling while avoiding surveys is to enhance state department of motor vehicles (DMV) micro-level ownership records with socioeconomic and demographic information from the United States census. In that way, characteristics of the local community serve as proxies for information missing about the households, such as income or ethnicity. Although the use of zonal data is widespread in transportation research, using zonal proxies is not common. However, the technique is regularly used in the study of epidemiology, because public health data are often missing important elements due to privacy concerns.

To assess its validity, we simulate the proxy method by supplementing limited information from a recent household auto ownership survey with zonal census aggregates, resembling data that could be constructed using information from the DMV and the Census. Census information is reported at a range of spatial levels, including census block group, tract, and zip code. A census block group, representing about 1,000 people, is the smallest unit for which detailed socioeconomic and demographic information is aggregated. Census tracts, composed of a set of block groups, contain 4,000 people on average. Census tracts are specifically designed to group individuals who are relatively homogeneous in terms of demographics and economic status (U.S. Census Bureau 1994). According to the U.S. Census Bureau, census tracts are intended to be permanent statistical subdivisions, increasing their usefulness in empirical applications. Zip codes, which were introduced by the U.S. Postal Service for operational reasons rather than to represent similar individuals, contain on average 30,000 individuals. Accordingly, we estimate three ownership choice models using these three separate zonal proxy groups and compare our results to a parallel model using only data drawn from a household survey. We evaluate the proxy models on the basis of their predictive ability and their approximation of household model coefficients.

Overview and related work

Traditionally, research into vehicle choice has taken one of two general approaches, differing by the degree of aggregation. In the aggregate approach, the existence of a representative consumer or more flexibly a known distribution of preferences across the population is assumed by the researcher, allowing the use of broad aggregates of demand and supply in estimating the significance of a set of automobile attributes. Such models adequately forecast future car ownership at an aggregate level (Salon 2006). This aggregate approach is considered cost-efficient in terms of computation and data requirements, but it can hamper the ability to distinguish among the presumed explanatory variables in terms of their effect on vehicle choice (Potoglou and Kanaroglou 2008). In the disaggregate approach, the researcher considers the household as the decision-making entity, gathers micro-level data on its characteristics and recent vehicle choices, estimates the contribution of various components to automobile choice, and then sums over households to project the composition of future vehicle fleets. The disaggregate approach has the advantage of better capturing consumer behavior, and as a result, the relationship between vehicle attributes, household characteristics and ownership choice. Although the data requirements are more exhaustive, disaggregate models are the preferred method of modeling vehicle choice (Bhat and Pulugurta 1998), particularly if policy efficacy is considered.

Annually, each U.S. state’s motor vehicles division collects a wide range of information from its registered drivers. For example, the local vehicle stock is implied by registration fees paid annually to the DMV. The typical DMV records intra- and inter-state vehicle transactions. The DMV knows the population of individuals licensed to drive a car, including certain demographic information about them, their type of license and their home addresses. Yet neither the typical DMV nor localities regularly collect detailed information on the makeup of the automobile fleet as well as their operators for the express purpose of studying transportation choices. As a result, those researchers interested in the disaggregate approach to estimating the factors associated with the question of how people determine which automobile to purchase must resort to surveying individual consumers, a process that can be expensive in terms of time and money. Potential difficulties encountered include selection of the appropriate sample, unresponsiveness of targeted subjects, and mistakes or ambiguity in responses. Researchers have also noticed troubling trends in survey response: growing consumer resistance, proportion of homes unavailable to be reached by telephone, and per-home survey costs (Stopher and Greaves 2007). The supposed choice set is often narrowed out of concerns over the cost of customized surveys (Train and Winston 2005).

Another possible source of socioeconomic and demographic data is the U.S. decennial census, which is freely available. Information such as age, ethnicity, education, poverty status, and income level is collected, aggregated and reported at multiple spatial scales, from statewide averages to groupings by residential block. Additionally, the census collects other information directly relevant to the study of vehicle choice, such as average travel time to work and the percentage of residents who use various modes of transit.

By using a proxy approach, merging zonal census-level information with data on household vehicle ownership or purchases could result in important, novel insights into the vehicle choice decision, expanding not only the viable topics available for research but also the size of the population considered. Neighborhood characteristics, which have been found significant in automobile choice (Zegras 2007), could be obtained easily. Furthermore, a vehicle choice model combining aggregated characteristics with disaggregate choice data may help improve on the traditional role of inexpensively generated aggregate models to forecast car ownership, while still allowing the researcher freedom to consider policy analysis as allowed by disaggregate models.

Although the use of zonal aggregates to proxy for missing micro-level data is not common in vehicle choice research, it has been applied for nearly a century in the study of health outcomes (Krieger et al. 2003). Whereas residential address and individual health status, such as disease cases or discrete self-reported health scale measures, are often available in public health records, important possible predictors of disease susceptibility such as educational attainment, occupational status, and income often go uncollected (Berjon et al. 2005). To investigate the role of socioeconomic status in predicting health outcomes, researchers frequently match individual addresses to one or more spatial levels for which census information is collected, whether at the level of zip codes, tracts, or block groups. These area-wide aggregates are then used to substitute for missing data in order to control for important explanatory variables that would otherwise not be included, with final analysis conducted on the resulting unified data set (Gowrisankaran and Town 1999). The U.S. does not report census data at the household level.

Use of proxy aggregates in place of micro-level variables does not suffer from the ecological fallacy by the classic definition (Krieger 1992), but it does bias the results of statistical analysis. In the study of the effect of socioeconomic status on health outcomes, this point has led some researchers to advise caution in the interpretation of the results it yields (Geronimus et al. 1996). Others have found that the proxy method has unique advantages, and that the degree of bias diminishes when researchers derive census information from smaller geographic units of aggregation (Soobader et al. 2001). According to Subramanian et al. (2006), area-based proxies provide conservative approximations of the micro-level variables they are intended to represent.

Statistical considerations

Before we can estimate and compare the results of our analyses, there are a few important points to consider regarding the statistical validity of using aggregate-level variables to proxy for household information. These concerns apply to the value and significance attached to parameter values estimated using the proxy method. If the researcher is solely interested in predicting the level of household automobile ownership, the following concerns will not affect the validity of the results.

To control for household information unavailable in DMV data, values of missing explanatory variables are imputed from area-wide aggregates. According to Wickens (1972), the use of even a poor proxy is preferred to simply omitting the information. Nevertheless, the use of aggregate proxies likely biases household-level parameters. Geronimus et al. (1996) show that two sources of bias can affect the proxy estimates. First, since an aggregate variable is only partially correlated with its individual counterpart, its use introduces a measurement error related to the degree of correlation. Second, aggregate variables may themselves represent information useful in the micro-level analysis. Perhaps households consider both their own characteristics as well as those of their neighborhood in making the vehicle choice decision: beyond individual income, local income level may be an important factor. In that case, the magnitude of estimated coefficients will increase. Of course, if neighborhood contextual effects themselves factor into household vehicle choice, the absence of local aggregates from traditional models leads to a similar set of problems. Additionally, Geronimus et al. demonstrate that estimates of the remaining explanatory variables are also affected by the introduction of proxy aggregates, with bias resulting from their correlation with the proxy aggregates and the missing original variables, and the relationship between the missing variables and the outcome.

Because aggregate explanatory variables tend to be relatively smooth with respect to their household counterparts, they may not exhibit enough variance to provide meaningful estimates. Indeed, if the spatial scale of aggregation were large enough so as to envelop all the observations, the census proxies would be identical for every individual. This obstacle should diminish as the geographic unit of aggregation narrows relative to the size of the studied region. Also, aggregate variables may exhibit a high degree of multicollinearity, making it difficult to estimate parameter values. Moreover, according to Moulton (1990), failing to account for potential error correlations at the aggregate level could bias standard errors downwards. As a result, if spatial aggregates are geographically correlated, the statistical significance attached to results found via the proxy method may be spurious.

Data

The first source of data for this analysis is the 2000 San Francisco Bay Area Travel Survey (BATS), commissioned by the Bay Area Metropolitan Transportation Commission (MTC). Data on household vehicle fleet ownership and socioeconomic/demographic information were collected from 15,064 households in the nine-county region, during the period February 2000 to March 2001. Although the residential addresses of survey participants are not reported, BATS includes the pertinent matched census block group, tract, and zip code. The survey achieved a 99.9% success rate in geocoding the home addresses of surveyed households. Inclusion of geocoded locations in the BATS dataset allows comparison between individual and area-wide aggregate choice models.

Census information is imported from the year 2000 United States Census Summary File 3 for the census tracts, block groups, and zip codes listed in the BATS dataset. For each spatial level of aggregation, Census data included population size, racial composition, average age, employment figures, and median income. Census data for each aggregation level were appended to the individual information provided by BATS according to the geocoded block group, tract, and zip code indicated in the dataset. Because this study compares the results of parallel models, the 2,911 BATS households that did not report income, age, employment status, or ethnicity are excluded from the analysis.
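The appending step described above amounts to a keyed join between household records and zonal aggregates. The following is a minimal sketch of that idea, not the processing code used for this study: the field names (`hh_id`, `block_group`, `licensed_drivers`, `vehicles`, `median_income`, `avg_age`) and all values are hypothetical.

```python
import math

# Hypothetical household records: only fields a DMV-style source could supply,
# plus the geocoded block group identifier.
households = [
    {"hh_id": 1, "block_group": "A", "licensed_drivers": 2, "vehicles": 2},
    {"hh_id": 2, "block_group": "A", "licensed_drivers": 1, "vehicles": 1},
    {"hh_id": 3, "block_group": "B", "licensed_drivers": 1, "vehicles": 0},
]

# Hypothetical census aggregates keyed by block group (values are made up).
census = {
    "A": {"median_income": 85000.0, "avg_age": 41.2},
    "B": {"median_income": 42000.0, "avg_age": 35.7},
}

def attach_proxies(households, census):
    """Append area-wide aggregates to each household by its block group."""
    merged = []
    for hh in households:
        area = census.get(hh["block_group"])
        if area is None:
            continue  # skip households whose zone has no census record
        row = dict(hh)
        row["log_median_income"] = math.log(area["median_income"])
        row["avg_age"] = area["avg_age"]
        merged.append(row)
    return merged

proxy_data = attach_proxies(households, census)
```

Households in the same block group receive identical proxy values, which is why the variance of each proxy shrinks as the zone grows.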

Methodology

Automobile ownership affects a household’s transportation mode, its destination of interest in leisure activities, and the number of trips it makes (Nobile et al. 1997). Disaggregate models focused on predicting the number of cars chosen by a household are used to provide inputs into transport projection models (De Jong et al. 2004). As in the study of scaled health outcomes, since a given household chooses its ownership level from a known set of available options, discrete choice models are generally used to estimate relevant vehicle choice parameters. Among these, Bhat and Pulugurta (1998) found that unordered-response mechanisms, such as the multinomial logit model (MNL), fit ownership data better than do ordered-response models. In order to verify this, we estimated an ordered logit (ORL) version of each ownership model and rejected it on the basis of Akaike’s Information Criterion (AIC).
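For readers unfamiliar with the criterion, the AIC comparison can be sketched as follows, using the standard definition AIC = 2k − 2 ln L, where k is the number of estimated parameters and the model with the lower value is preferred. The log likelihoods and parameter counts below are hypothetical placeholders, not estimates from this study.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike's Information Criterion: 2k - 2*lnL (lower is preferred)."""
    return 2.0 * n_params - 2.0 * log_likelihood

# Hypothetical log likelihoods and parameter counts, for illustration only.
candidates = {
    "MNL": aic(log_likelihood=-1000.0, n_params=24),
    "ORL": aic(log_likelihood=-1030.0, n_params=10),
}

# The model with the smallest AIC is selected.
preferred = min(candidates, key=candidates.get)
```

The penalty term matters here because an MNL specification estimates a separate coefficient vector for each non-base alternative, whereas an ordered logit estimates one vector plus threshold parameters.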

Model of household ownership choice

We model only the demand side of the auto market, and assume that the supply of cars is perfectly elastic. The explanatory variables are chosen both on the basis of their likelihood to play a role in ownership choice according to the literature, and their being comparable across approaches. As the most commonly used model to explain systems with several discrete outcomes, the MNL model applies random utility theory to the auto ownership decision (Ben-Akiva and Lerman 1985). A household chooses the number of cars in order to maximize its own utility; for each level of car ownership (j), let the utility for an individual household be given by

$$ U_{j} = V_{j} + \varepsilon_{j} ,\,j = 0,\, 1,\, 2,\, 3,\, 4 $$
(1)

where \( V_{j} \) represents the deterministic portion of utility, and \( \varepsilon_{j} \) denotes a random component. The household subscript i is suppressed for simplicity. Deterministic utility is then defined as being composed of a vector of attributes multiplied by parameters. For each household,

$$ V_{j} = \beta_{j}^{\prime } x, $$
(2)

and

$$ V_{j}^{I} = \beta_{j}^{{I^{\prime } }} x_{*}^{I} ;\,V_{j}^{A} = \beta_{j}^{{A^{\prime } }} x_{*}^{A} , $$

for

$$ x_{*}^{I} = \left[ x^{C} \; x^{I} \right],\,x_{*}^{A} = \left[ x^{C} \; x^{A} \right] $$

Define \( V_{j}^{I} \) as the deterministic component in the case of alternative j where the vector of household characteristics (\( x^{I} \)), such as the logarithm of income and householder age, is employed in estimation. Likewise, \( V_{j}^{A} \) represents the deterministic component of utility when area-wide aggregates (\( x^{A} \)) are used as proxies in the definition of household utility; for instance, the logarithm of median income, average age, and neighborhood racial composition represented in percentages for the census block group. In both attribute vectors following Eq. 2, \( x^{C} \) represents household data on the number of licensed individuals in the home, since this information was provided by BATS and is available in the state DMV records one could make use of in the census approach.

Given that BATS data on households that own five or more cars make up only about 1% of the sample, we model Bay Area residents as having five possible vehicle ownership choices: zero, one, two, three, or “four or more” cars. The household chooses an ownership alternative, the dependent variable (y), from the choice set in order to maximize its utility. Under the maintained assumption of independent random error, ownership utilities across individuals are uncorrelated. We assume that \( \varepsilon \) follows an identical type I extreme value distribution for all households, and can thus represent the model with an MNL framework. Given the multinomial outcome variable and the specified choice probabilities, the likelihood function is implicitly defined. Coefficient values are estimated so as to maximize the likelihood of observing the original sample.
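Under the type I extreme value assumption, the choice probabilities take the familiar logit form, P_j = exp(V_j) / Σ_k exp(V_k), and the log likelihood is the sum over households of the log probability of the observed choice. The sketch below illustrates both quantities. The coefficient values and the two households are hypothetical; the only structure taken from the text is the five-alternative choice set with zero cars as the base category.

```python
import math

def mnl_probabilities(betas, x):
    """Choice probabilities P_j = exp(V_j) / sum_k exp(V_k), with V_j = beta_j' x."""
    v = [sum(b * xi for b, xi in zip(beta_j, x)) for beta_j in betas]
    v_max = max(v)  # subtract the maximum for numerical stability
    expv = [math.exp(vj - v_max) for vj in v]
    total = sum(expv)
    return [e / total for e in expv]

def log_likelihood(betas, X, y):
    """Sum over households of the log probability of the observed choice."""
    return sum(math.log(mnl_probabilities(betas, x)[yi]) for x, yi in zip(X, y))

# Hypothetical coefficients: one row per ownership level (0, 1, 2, 3, 4+ cars).
# The zero-car row is fixed at zero because it is the base category.
# Columns: alternative specific constant, number of licensed drivers.
betas = [
    [0.0, 0.0],
    [-1.0, 2.0],
    [-3.0, 3.0],
    [-5.0, 3.5],
    [-7.0, 3.8],
]
X = [[1.0, 1.0], [1.0, 2.0]]  # two hypothetical households
y = [1, 2]                    # their observed ownership levels

probs = mnl_probabilities(betas, X[0])
ll = log_likelihood(betas, X, y)
```

Maximum likelihood estimation searches over `betas` for the values that make `ll` as large as possible; with all coefficients set to zero, every alternative receives probability 1/5.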

One potential drawback of the multinomial logit framework is that it assumes the Independence of Irrelevant Alternatives (IIA). That is, the ratio of the probabilities of any two alternatives must remain independent of changes in other alternatives. In effect, cross-alternative correlation is not allowed. For example, if the option of owning four or more cars were suddenly removed from the choice set, households in that group would not necessarily be predisposed to own three cars, but instead would distribute to the other choice alternatives in proportion to the original probabilities. Modeling alternatives that avoid the IIA problem include ordered GEV and mixed logit (Train 2003). Mixed logit models allow flexibility in substitution patterns, and also account for coefficients that vary in the population. We did not use a mixed logit model in this paper; the effect of using aggregate proxies in mixed logit models has not been thoroughly studied, and is an area we intend to research in the future.
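The IIA property can be verified numerically. In the sketch below, the utilities are hypothetical; removing the "four or more" alternative leaves every remaining probability ratio unchanged, so the removed share is reallocated in proportion to the original probabilities rather than flowing mainly to the three-car alternative.

```python
import math

def logit_shares(v):
    """Multinomial logit probabilities for a list of deterministic utilities."""
    m = max(v)
    e = [math.exp(vj - m) for vj in v]
    s = sum(e)
    return [ej / s for ej in e]

# Hypothetical utilities for 0, 1, 2, 3, and 4+ cars.
v_full = [0.0, 1.0, 1.2, 0.3, -0.5]
p_full = logit_shares(v_full)

# Remove the "four or more" alternative and recompute.
p_reduced = logit_shares(v_full[:4])

# Under IIA the ratio P(3 cars) / P(1 car) is the same in both choice sets.
ratio_before = p_full[3] / p_full[1]
ratio_after = p_reduced[3] / p_reduced[1]

# Equivalently, the reduced probabilities are the originals rescaled to sum to one.
rescaled = [p / (1.0 - p_full[4]) for p in p_full[:4]]
```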

Empirical results and discussion

Descriptive statistics

Table 1 displays means and standard deviations for the variables we used in multinomial logit estimation of the factors associated with vehicle ownership choice. We removed BATS records that neglected to provide information for any of the variables in the table. After listwise deletion, the sample of surveyed households amounted to 12,153 observations, representing 3,626, 1,344, and 261 different census block groups, tracts and zip codes, respectively. The mean auto ownership for the households under study was about two (with the truncation at four).

Table 1 Descriptive statistics

The first two items in Table 1 are elements available to both survey and area-based analyses. That is, only having access to DMV driver and ownership registration records would not preclude the researcher from discovering how many licensed drivers reside in a household as well as its vehicle portfolio. After that, sections are constructed so that the individual household characteristic is listed above the relevant census measure. For example, resident age precedes the average age of the block group population. Finally, the last three variables listed in the table are used only in the full model that combines both household and census-level information.

As expected, the standard deviation and range of each area-based measure decrease as the geographical region of interest grows in size. The aggregate averages are smoothed when the size of the relevant census area increases, from block group, to tract, to zip code. Although the standard deviations and ranges of the aggregate values are lower than those of their individual counterparts, they contain enough variation to make estimation meaningful. For example, while the dichotomous indicator for an unemployed head of household takes on values of zero or one by definition, the proportion of unemployed individuals in a block group ranges from 2 to 96%.

Household survey data for income, age of the householder and household size all display similar means to those derived from the census data. For instance, the mean of log income earned in 1999 for the sampled individuals in BATS matched the median log census zipcode income. This indicates that the BATS sample is representative of the Bay Area geographic region with respect to these factors. Beyond that, they are reported in like units to their household counterparts.

On the other hand, all race and unemployment aggregates are reported as percentages, while their household counterparts are dichotomous. Besides the indicator for a black head of household, the household BATS race variables do not match the area aggregates very well. Only 3% of the BATS sample reported a Hispanic heritage, while Hispanics represented 15% of census tracts. For Asians, this disparity was 7% and 16%, respectively. These inconsistencies were not due to such households being less likely to report information critical to this study and being subject to listwise deletion. Instead, Asian and Hispanic households were undersampled by BATS relative to the population as reported by US Census Summary File 3. Census unemployment aggregates are calculated with respect to the entire adult population of a geographic region, not necessarily those of working age, and thus report higher means than their individual counterparts. Additionally, the census aggregates for these variables are reported in different units, percentage of residents versus a 0–1 dummy, and represent perhaps entirely different characteristics than do the household level data.

Full model

We estimated a MNL model using both household information and area aggregates to determine whether neighborhood characteristics themselves factor into the vehicle ownership decision. Following Geronimus et al. (1996), this model yields some intuition about the direction of the bias that we can expect from employing aggregate proxies. Recall that if a given household considers average neighborhood income when selecting its ownership level even after controlling for household income, then the parameter for the income proxy may be biased upwards. Because they represent the lowest aggregation level for which census information is available, we incorporated block group averages in the model.

Because the MNL is an unordered discrete choice model, the parameters in Table 2 convey the sign of the effect on the probability of selecting a given ownership alternative relative to not owning a car, given an increase in the explanatory variable. For example, the significant coefficient of 0.86 for logged income indicates that an increase in income is associated with a household being more likely to own a single car than none at all. The table shows that block group median income, average household size and some aggregate race variables are significant at the 1% level in the combined model. Average resident age and the remaining aggregate race controls are significant at the 5% level. Broadly, these results indicate that households weigh both their own characteristics and those of their immediate social network in their ownership choice. For example, the coefficients for household and neighborhood median income are significant and positive, suggesting that individuals may seek to “keep up with the Joneses” with respect to auto ownership. The age and race parameters suggest that cultural factors may also affect choice level.

Table 2 Ownership choice parameters for the full model (Base category: Zero car households)

Additionally, we found that some neighborhood characteristics unavailable in traditional household surveys yet easily constructed from census data were significant in the model, concurring with Zegras’ result (2007). Households in block groups with newer homes were more likely to select any level of car ownership over zero. The reverse was true for homes in densely populated and highly educated areas. Consequently, vehicle choice modelers should consider accounting for such neighborhood characteristics even if they have access to household survey data.

We confirmed Bhat and Pulugurta’s (1998) result for the BATS data: MNL was preferred to ORL on the basis of AIC. Presumably, this result signifies that the difference between ownership alternatives is not limited to simply the number of cars. One possible explanation is that the utility placed on different ownership levels may be related to vehicle heterogeneity in multi-car homes. Additionally, we found that the full model predicted at least a 0.5 probability of selecting the actual level of household ownership for over 65% of the BATS sample. When we relaxed the criteria and instead labeled the alternative with the highest probability of selection (although not necessarily at least 0.5) as the predicted alternative, the model achieved a successful prediction rate of 70%.
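The two prediction criteria described above differ only in how a predicted alternative is declared. A minimal sketch of both calculations follows; the probability rows and observed choices are hypothetical and serve only to show the mechanics.

```python
def hit_rates(prob_rows, actual):
    """Return (share with P(actual) >= 0.5, share where actual has the highest P)."""
    n = len(actual)
    strict = 0
    plurality = 0
    for p, a in zip(prob_rows, actual):
        if p[a] >= 0.5:
            strict += 1
        # index of the alternative with the largest predicted probability
        best = max(range(len(p)), key=p.__getitem__)
        if best == a:
            plurality += 1
    return strict / n, plurality / n

# Hypothetical predicted probabilities over the five ownership levels.
prob_rows = [
    [0.10, 0.60, 0.20, 0.07, 0.03],  # actual = 1: counted under both criteria
    [0.05, 0.30, 0.40, 0.20, 0.05],  # actual = 2: counted under plurality only
    [0.50, 0.30, 0.10, 0.05, 0.05],  # actual = 1: counted under neither
    [0.02, 0.18, 0.55, 0.20, 0.05],  # actual = 2: counted under both
]
actual = [1, 2, 1, 2]

strict_rate, plurality_rate = hit_rates(prob_rows, actual)
```

Because an alternative with probability above 0.5 is necessarily the most likely one, the plurality rate can never fall below the strict rate.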

Proxy models

According to Geronimus et al. (1996), the bias resulting from the use of census proxies depends on two factors: first, the extent to which census aggregates represent their micro-level counterparts; and second, whether the aggregates themselves belong in the household choice model. Keeping these aspects in mind, we evaluate the usefulness of the census approach by comparing its parameter values, marginal effects, and predictive ability with those produced using only household survey data.

Table 3 displays estimates for the household model alongside those for the simulated proxy models. For the latter, census aggregates from block groups, tracts and zipcodes substitute median income, average household size, average resident age, proportionate race makeup and unemployment for their household level counterparts. The household variable for the number of licensed individuals in the home is shared by the census approach, since this information is also found in DMV records. The base category is defined as a zero car home. All estimates are robust to errors clustered at the tract level.

Table 3 Parameter estimates for parallel ownership choice models (Base category: Zero car households)

The MNL results in Table 3 show that the more representative proxy variables well approximate their household level counterparts: coefficients for aggregate income, household size, and age each display similar results to the household model in terms of sign and statistical significance. This is also true for the “number of licensed individuals” variable, the only one shared by both approaches. For example, the household coefficient of 2 for a single car home indicates that the addition of a licensed individual increases the likelihood of owning a car, relative to none at all. The proxy models estimate this coefficient as 2.18, 2.25, and 2.34 using block group, tract, and zipcode data, respectively. For each census level, estimated coefficients for the alternative specific constant—indicating whether a household will choose a given ownership level relative to not owning a car, everything else being equal—match the survey approach very well.

MNL parameters do not specify the magnitude of the effect on probability, since one of the consequences of nonlinear regression is that marginal effects differ by point of evaluation. Instead, the marginal effects calculated at the mean of the covariate vector are shown in Table 4, and represent the partial derivative of the outcome probability with respect to the regressors. In other words, the table shows the approximate effect on the probability of selecting an alternative given a one unit increase in the explanatory variable. For instance, if an additional individual joins a household, the marginal effect of −0.09 in Table 4 indicates that the probability that the household chooses to own one car decreases by 9 percentage points.
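For the MNL, the marginal effect of regressor m on the probability of alternative j has the standard closed form ∂P_j/∂x_m = P_j (β_jm − Σ_k P_k β_km). The sketch below evaluates it at a single covariate vector. The coefficients and the evaluation point are hypothetical, and the code is an illustration of the formula rather than the routine used to produce Table 4.

```python
import math

def mnl_probabilities(betas, x):
    """P_j = exp(beta_j' x) / sum_k exp(beta_k' x)."""
    v = [sum(b * xi for b, xi in zip(beta_j, x)) for beta_j in betas]
    v_max = max(v)
    expv = [math.exp(vj - v_max) for vj in v]
    total = sum(expv)
    return [e / total for e in expv]

def marginal_effects(betas, x):
    """Return effects[j][m] = P_j * (beta_jm - sum_k P_k * beta_km) at x."""
    p = mnl_probabilities(betas, x)
    n_alt, n_var = len(betas), len(x)
    # probability-weighted average coefficient for each regressor
    weighted = [sum(p[k] * betas[k][m] for k in range(n_alt)) for m in range(n_var)]
    return [[p[j] * (betas[j][m] - weighted[m]) for m in range(n_var)]
            for j in range(n_alt)]

# Hypothetical coefficients (rows: 0, 1, 2, 3, 4+ cars; base row is zero).
# Columns: constant, number of licensed drivers.
betas = [
    [0.0, 0.0],
    [-1.0, 2.0],
    [-3.0, 3.0],
    [-5.0, 3.5],
    [-7.0, 3.8],
]
x_mean = [1.0, 1.9]  # hypothetical sample mean of the covariate vector

effects = marginal_effects(betas, x_mean)
# Effect of one more licensed driver on each ownership probability.
drivers_effect = [row[1] for row in effects]
```

The effects for a given regressor sum to zero across alternatives, since the probabilities must continue to sum to one. A positive coefficient can therefore coexist with a negative marginal effect for the same alternative.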

Table 4 Marginal effects of the modeled explanatory variables on ownership choice (Evaluated at the mean)

As in Table 3, the marginal effects for the representative variables displayed in Table 4 are comparable across approaches on the basis of sign and significance. Once again, the magnitudes of the proxy effects are similar to those of the corresponding household measures. For instance, using survey data alone, we find that a one unit increase in the logarithm of income is associated with a 13 percentage point increase in the probability that a household owns two cars. The complementary estimates generated by the median income proxy are 9, 8, and 9 percentage points for the three models.

On the other hand, when the census aggregates do not represent their household level counterparts very well, the estimates they generate often do not match those using solely micro-level data. The BATS survey was not representative of the San Francisco Bay area with respect to race or employment status. Additionally, the census aggregates for these variables are reported in different units, percentage of residents versus a 0–1 dummy, and represent perhaps entirely different characteristics than do the household level data. From that standpoint, it is not surprising that the proxy models do not approximate the household estimates with respect to these variables.

On the basis of the above results, the census approach approximates the parameter values and marginal effects of the BATS survey model reasonably well when its data are representative and presented in like units to the household-level variables. Krieger (1992) noted that proxies drawn from census block groups matched individual results better than tract data. It is intuitive that block group aggregates should better resemble the characteristics of their residents, since they are about 1/3 the size of census tracts. In our study of ownership choice, this proves to be the case. As shown in Table 3, block group proxies minimized the sum of squared deviations (SSD) from the representative household parameter values when compared with tract and zipcode census data.

In order to calculate the SSD, divide the variables so that

$$ x = \left[ x_{R} \; x_{N} \right] $$
(3)

where \( x_{R} \) is composed of these representative variables, while \( x_{N} \) contains the remaining characteristics. Then, the SSD for every proxy model p is calculated by summing over all representative variables l

$$ SSD_{p} = \sum\limits_{l} {\left( {\beta_{R}^{A} - \beta_{R}^{I} } \right)^{2} } $$
(4)

where \( \beta_{R}^{A} \) is the coefficient on a representative proxy variable, and \( \beta_{R}^{I} \) its household level complement. SSD is minimized for the block group model, meaning that its representative parameters best approximate those produced from the survey data, followed by tract and zip code parameters. This result verifies that the proxy variables taken from lower levels of census aggregation better match their BATS household counterparts.
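The SSD calculation in Eq. 4 can be sketched in a few lines. All coefficient values below are made up for illustration and are not taken from Table 3; only the structure (one household coefficient vector compared with three proxy vectors) follows the text.

```python
def ssd(proxy_coefs, household_coefs):
    """Sum of squared deviations between paired representative coefficients."""
    return sum((a - i) ** 2 for a, i in zip(proxy_coefs, household_coefs))

# Hypothetical household-model coefficients on the representative variables
# (for example income, household size, and age); not values from the paper.
household = [0.80, 0.50, 0.02]

# Hypothetical proxy-model coefficients for the same variables.
proxies = {
    "block group": [0.90, 0.40, 0.03],
    "tract": [1.00, 0.30, 0.04],
    "zip code": [1.10, 0.20, 0.05],
}

ssd_by_model = {name: ssd(coefs, household) for name, coefs in proxies.items()}
closest = min(ssd_by_model, key=ssd_by_model.get)
```

In practice the sum would run over every representative coefficient in every non-base ownership alternative, so each vector would be considerably longer than in this toy example.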

Finally, the proxy models fit the observed ownership outcomes very well, as related by the likelihood and predictive attributes displayed in Table 3. In fact, each of the aggregate models would be preferred to the survey model on the basis of traditional criteria for model selection. The block group model exhibited the highest log likelihood at convergence, and the correspondingly lowest AIC value. Consequently, because all models began with the same log likelihood at zero, the block group model displays the highest pseudo R-squared value, 31.7%. On the basis of average predictive ability, the block group proxies are again preferred to every other estimated model. The block group model assigned at least a probability of 0.5 to the correct ownership choice 64.8% of the time, while assigning a plurality of probability to the correct choice for 69.7% of all observations. Using individual data alone, the comparable correct predictions were 64.4% and 69.7%, respectively. These results lend some confidence to the proxy method if a researcher does not have access to household survey data, particularly if the research objective is to predict household automobile ownership.
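The link between log likelihood and pseudo R-squared can be made explicit. The sketch below assumes the pseudo R-squared is McFadden's likelihood ratio index, 1 − LL(β)/LL(0), and that the log likelihood at zero assigns equal probability to each of the five alternatives. The sample size and choice set size come from the text; the converged log likelihood is a hypothetical placeholder.

```python
import math

def log_likelihood_at_zero(n_obs: int, n_alternatives: int) -> float:
    """Log likelihood when all coefficients are zero (equal choice probabilities)."""
    return n_obs * math.log(1.0 / n_alternatives)

def pseudo_r2(ll_model: float, ll_zero: float) -> float:
    """McFadden's likelihood ratio index: 1 - LL(beta) / LL(0)."""
    return 1.0 - ll_model / ll_zero

# 12,153 households and five ownership alternatives, as reported in the text.
ll_zero = log_likelihood_at_zero(12153, 5)

# Hypothetical converged log likelihood; not a value from Table 3.
ll_model = -13000.0
rho2 = pseudo_r2(ll_model, ll_zero)
```

Since every model shares the same `ll_zero`, ranking the models by pseudo R-squared is equivalent to ranking them by log likelihood at convergence.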

Conclusions

In this paper, we have evaluated the performance of census proxies for household variables in the estimation of a micro-level discrete choice model of auto ownership. The models using census proxies fit the data well. We find that the proxy method better approximates its household survey counterparts when the aggregates are representative. Among the models using proxies, we determined that the block group data are preferred to tract and zip code aggregates. At least in the case of the BATS 2000 survey, our results show that vehicle choice estimates derived from the census proxy approach can serve as reasonable approximations for micro-level parameters, particularly in terms of exhibiting the proper sign and significance.

Not only are aggregate proxies less expensive to acquire and analyze, they may in fact offer an improvement over the traditional survey-driven approach. The block group ownership model actually performs better than the model using survey data alone in terms of likelihood and predictive ability. Evidently, an individual considers not only his status, but also the condition of the neighborhood around him in deciding how many vehicles to own. In the health studies literature, these are referred to as contextual effects (Diez-Roux 2003). Even researchers that have access to household survey data should contemplate controlling for local aggregates.

If the researcher’s objective is to predict household vehicle choice, our results demonstrate the promise of using proxy data. Although we have shown that models using some aggregate proxies can predict original ownership at least as well as the survey model, an important caveat is that prediction of future vehicle ownership is dependent upon the availability of aggregate data. If tract-level demographic or education estimates are nonexistent for a future period, the researcher must constrain the predictive model to existing data for practical applications.