Demography, Volume 53, Issue 5, pp 1535–1554

Spatial Variation in the Quality of American Community Survey Estimates

  • David C. Folch
  • Daniel Arribas-Bel
  • Julia Koschinsky
  • Seth E. Spielman

Abstract

Social science research, public and private sector decisions, and allocations of federal resources often rely on data from the American Community Survey (ACS). However, this critical data source has high uncertainty in some of its most frequently used estimates. Using 2006–2010 ACS median household income estimates at the census tract scale as a test case, we explore spatial and nonspatial patterns in ACS estimate quality. We find that spatial patterns of uncertainty in the northern United States differ from those in the southern United States, and they are also different in suburbs than in urban cores. In both cases, uncertainty is lower in the former than the latter. In addition, uncertainty is higher in areas with lower incomes. We use a series of multivariate spatial regression models to describe the patterns of association between uncertainty in estimates and economic, demographic, and geographic factors, controlling for the number of responses. We find that these demographic and geographic patterns in estimate quality persist even after we account for the number of responses. Our results indicate that data quality varies across places, making cross-sectional analysis both within and across regions less reliable. Finally, we present advice for data users and potential solutions to the challenges identified.

Keywords

American Community Survey · Data uncertainty · Income estimates · Margin of error · Spatial analysis

Introduction

Data produced by the U.S. Census Bureau in the American Community Survey (ACS) are crucial inputs to social science research as well as for public and private sector decisions. However, these uses of the data are complicated by the high margins of error (MOE) found in the ACS estimates. High MOEs are especially common when the estimate reflects a small subset of the population or covers a small geographic area, such as a census tract. For example, more than 44 % of the ACS census tract estimates of children in poverty have an MOE at least as large as the estimate itself.1 Errors of this magnitude can blur our understanding of places.

There is growing recognition of the challenges associated with using ACS estimates due to their high levels of uncertainty (Bazuin and Fraser 2013; Citro and Kalton 2007; MacDonald 2006; Salvo and Lobo 2006; Spielman et al. 2014). The Census Bureau itself recognizes and makes public this uncertainty (e.g., U.S. Census Bureau 2009a). Despite this recognition, the nature and causes of the uncertainty in the ACS are not widely understood. This research demonstrates that the uncertainty generally does not follow a random pattern across attributes and space. The implication is that the quality of ACS data varies systematically, with some census tracts (or counties, states, and so on) having higher levels of uncertainty than others.

This article examines the patterns of uncertainty in median household income estimates from the 2006–2010 ACS at the census tract scale. This variable is of significant interest and has broad impacts. Median household income estimates are used, for example, to determine market demand for services in the public, private, and nonprofit sectors; make decisions about retail site location; and study income inequality. Initially, we illustrate patterns in the quality of ACS median household income estimates. We then explain these patterns through a series of spatial regression models. These models identify key spatial and demographic determinants of the variation in the quality of ACS estimates. We conclude with a discussion of strategies for handling uncertainty in ACS estimates.

Constructing ACS Estimates

The ACS is a survey and thus susceptible to the same challenges as other estimates based on a sample of the population. The ACS approach is to continuously collect surveys and then tabulate those surveys into an enormous array of estimates. This “rolling” sample strategy forces ACS designers and users to confront the three-way tension among attribute, temporal, and spatial resolutions. Given the fixed number of surveys collected each year, it is not possible to create an estimate that has detailed resolution along all three dimensions and also has low uncertainty.

Before 1940, the Census Bureau collected decennial census data from every (or nearly every) person and household in the country, using a single list of questions. Between 1940 and 2000, the bureau used a two-tiered data collection approach with a short list of questions presented country-wide as well as a supplemental list asked of a sample of the respondents. This split became known as the “short form” and “long form.” Starting in 2005, the bureau shifted the long form to an annual data collection model—namely, the American Community Survey (ACS)—and confined future decennial censuses to the short form only. Census data have always contained nonsampling errors based on myriad known and unknown data collection challenges. The census data generated between 1940 and 2000 from the long form have sampling errors as well.

Although the long form remained largely consistent between the 2000 decennial census and the ACS, the data collection strategy changed dramatically. In 2000, the decennial census sampled approximately 19 million households to collect long-form data for a single point in time (April 1, 2000). In contrast, the ACS contacts a sample of households every month, targeting approximately 3.54 million households each year (3 million prior to 2011).

The ACS sample size is large enough to make estimates for large cities and counties (more than 65,000 residents) every year, and locations with 20,000 or more residents receive estimates based on data aggregated over three years.2 The estimates for all other census geographies, no matter their population size, are built from five years of data. Using fewer completed surveys to build an estimate makes it more uncertain. Thus, for smaller areas (such as census tracts), temporal resolution is sacrificed to bolster estimate quality. However, five years of data are not always enough to generate reliable estimates.

In principle, this interlacing of geographic and temporal scales allows the Census Bureau to manage estimate quality while releasing data annually.3 However, these structural changes in data collection were accompanied by a large reduction in the sample size. The median number of completed long-form surveys per census tract in 2000 was 249; comparatively, for the 2006–2010 ACS, it was only 123 (see Fig. 1).
Fig. 1 Response (count) by census tract: Census 2000 long form and ACS 2006–2010

Attribute resolution also impacts data quality. The ACS measures total population for most geographic areas with little uncertainty, but uncertainty for subpopulations can be high and can vary widely from place to place. For example, in census tract 190602 in the Belmont Cragin neighborhood of Chicago, Illinois, the number of children in poverty is somewhere between 9 and 965 (2006–2010 ACS estimates).

An established and commonly used measure of uncertainty is the coefficient of variation (CV), which is the standard error of an estimate divided by the estimate itself. It can be computed from published ACS data and provides a relative measure of uncertainty: essentially, the error measured as a percentage of the estimate. The first five box plots in Fig. 2 highlight the distribution of census tract scale CV for count estimates of various population groups; the latter three show the same for median household income. Superimposed on each box plot are points marking the median CV for counties (diamonds) and census block groups (squares), which are, respectively, larger and smaller than census tracts.
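The CV calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the published MOE is at the 90 percent confidence level (the level at which ACS margins of error are released), so the standard error is the MOE divided by 1.645. The function name and the tract values are hypothetical.

```python
# ACS margins of error are published at the 90 percent confidence level,
# so the standard error is recovered by dividing the MOE by 1.645.
Z_90 = 1.645


def coefficient_of_variation(estimate, moe):
    """Return the CV (standard error / estimate) for one ACS estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for a zero estimate")
    standard_error = moe / Z_90
    return standard_error / abs(estimate)


# Hypothetical tract: median household income of $50,000 with an MOE of $8,225.
cv = coefficient_of_variation(50_000, 8_225)
print(round(cv, 3))
```

In this hypothetical case the standard error is about $5,000, so the CV is about 0.10.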
Fig. 2 Coefficient of variation (CV) distribution for selected variables and spatial scales: ACS 2006–2010. Age and ethnic differences in median household (HH) income are based on the head of household

The plots show striking differences in both the magnitude and range of uncertainty for different subpopulations. Both sets of box plots generally show the expected outcome that smaller subpopulations have larger CVs. However, the overall size of the Hispanic population and the population aged 65 and older is similar—at 16 % and 13 % of the national total, respectively—but the distribution of CV at the tract scale is quite different.

Seven of the eight plots show that larger places (counties) tend to have lower uncertainty and smaller places (block groups) have higher uncertainty, which is expected. Although the order of the county, tract, and block group medians is the same in nearly all the plots, the positions of the points relative to the tract distributions vary, especially in the case of median household income. The subpopulation of female Hispanic residents aged 65 and older represents a group that is difficult to capture at small geographic scales. The result is that no data are reported in the ACS for any block group, and many counties and tracts are also suppressed. For this subpopulation, the county and tract median values are nearly identical.

The following section uses a fixed spatial resolution (census tracts) and fixed attribute (median household income) to show how ACS estimate quality varies from place to place.

Patterns in Estimate Quality

All ACS estimates are associated with some level of uncertainty. However, places with higher uncertainty are not distributed randomly. In other words, high uncertainty is not equally likely in all locations and for all demographic groups. Higher levels of uncertainty in median household income estimates exist in the southern and southwestern United States, near city centers, and in places with lower median incomes (shown in upcoming figures). Notably, inner cities and low-income households, the areas and groups typically of most interest to social science researchers, tend to have the greatest uncertainty in median household income estimates.

Median household income is a key indicator of socioeconomic status, and its uncertainty, as measured by the CV, is the focus of our analysis. Although the CV is a commonly used metric, its magnitude can be interpreted in many ways. A comprehensive report on the ACS (Citro and Kalton 2007) produced for the National Research Council (NRC) states that the maximum acceptable CV should be in the 0.10–0.12 range, while noting that “what constitutes an acceptable level of precision for a survey estimate depends on the uses to be made of the estimate” (p. 67). A white paper produced by the software company ESRI (2011) characterizes a CV less than 0.12 as having high reliability, 0.12–0.40 as medium reliability, and anything more than 0.40 as low reliability.
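The two reliability conventions cited above can be expressed as simple rules. The sketch below is illustrative only; the function names are hypothetical, and the NRC check uses the upper end of the 0.10–0.12 range as its default cutoff.

```python
def esri_reliability(cv):
    """Classify a CV using the ESRI (2011) bands quoted in the text."""
    if cv < 0.12:
        return "high"
    if cv <= 0.40:
        return "medium"
    return "low"


def meets_nrc_guideline(cv, cutoff=0.12):
    """True if the CV does not exceed the chosen maximum acceptable CV."""
    return cv <= cutoff


# Hypothetical CV values for three tracts.
for cv in (0.095, 0.25, 1.41):
    print(cv, esri_reliability(cv), meets_nrc_guideline(cv))
```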

Figure 3 presents the distribution of the CV of median household income for more than 70,000 census tracts in the contiguous United States for the 2006–2010 ACS data.4 The median CV is 0.095, with a slightly higher average of 0.110 due to the long tail in the distribution. Using the high end of the NRC range (0.12), approximately one-third (32.1 %) of U.S. census tracts have too much uncertainty, whereas two-thirds (67.9 %) would be considered acceptable.
Fig. 3 Distribution of median household (HH) income coefficient of variation, census tracts: ACS 2006–2010

The magnitude and distribution of median income CVs are concerning, but they do not necessarily imply systematic patterns in the uncertainty, which is the topic of this study. We explore whether the CVs are spatially patterned using the local Moran’s I measure of spatial autocorrelation (Anselin 1995). This measure identifies locations on a map where abnormally high or low values are clustered and further provides a statistical test to determine whether the cluster is significant.5
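To make the local statistic concrete, the following sketch computes a local Moran’s I for a toy chain of six tracts. It assumes row-standardized weights built from a hypothetical neighbor list and omits the permutation-based significance test, so it shows only how high-high and low-low groupings arise; it is not the authors' implementation.

```python
from statistics import mean


def cluster_label(z_i, lag_i):
    """Label a tract by the sign of its own deviation and its neighbors' average."""
    if z_i > 0 and lag_i > 0:
        return "high-high"
    if z_i < 0 and lag_i < 0:
        return "low-low"
    return "outlier or mixed"


def local_morans_i(values, neighbors):
    """Return (I_i, label) per location using row-standardized weights."""
    n = len(values)
    mu = mean(values)
    z = [v - mu for v in values]
    m2 = sum(d * d for d in z) / n  # second moment of the deviations
    results = []
    for i in range(n):
        # Row-standardized spatial lag: average deviation among i's neighbors.
        lag = mean(z[j] for j in neighbors[i])
        results.append((z[i] * lag / m2, cluster_label(z[i], lag)))
    return results


# Hypothetical CVs along a chain of six tracts: three high, then three low.
cvs = [0.30, 0.28, 0.25, 0.06, 0.05, 0.07]
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
for tract, (i_local, label) in enumerate(local_morans_i(cvs, chain)):
    print(tract, round(i_local, 3), label)
```

A positive local value indicates that a tract and its neighbors deviate from the mean in the same direction, which is the pattern the cluster maps highlight.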

Figure 4 highlights (1) census tracts in red that are part of statistically significant clusters of high CV, (2) census tracts in blue that are in low CV clusters, and (3) census tracts in white that are not statistically significant. If there were no spatial pattern to the magnitude of CVs (i.e., if they followed a random spatial distribution), then the map would be mostly white, indicating that tracts with high (or low) CVs do not cluster together. However, the map shows that the northern part of the country—particularly the Midwest—has concentrations of high-quality (low-uncertainty) estimates, denoted in blue; red clusters, indicating low-quality estimates (high uncertainty) are located more in the South and Southwest. Because census tracts average approximately 4,000 people, at this scale, the map visually emphasizes low-density parts of the country.
Fig. 4 Concentrations of median household income coefficient of variation (CV), census tracts: ACS 2006–2010. Results are based on the local Moran’s I statistic. Red tracts are in statistically significant high-CV clusters, blue tracts are in clusters of low CVs, and white tracts are not part of a statistically significant cluster

Figure 5 zooms in further to present the CV distribution for nine metropolitan areas across the country. In Madison, Wisconsin, for example, approximately 75 % of census tracts are below the national median uncertainty level. In contrast, nearly 75 % of the New Orleans, Louisiana, census tracts are above this level. Other metropolitan areas, such as Chicago and San Diego, California, have median levels nearly identical to the national median. This diversity in the magnitude and range of uncertainty across metropolitan areas can affect the reliability of cross-sectional analyses.
Fig. 5 Distribution of uncertainty on median household income, selected metropolitan area census tracts: ACS 2006–2010. The horizontal line represents the U.S. median value of .095. Two outlier census tracts are not shown: New York City, with a CV of 2.73, and Chicago, with a CV of 1.41

The spatial variation in attribute uncertainty is not simply a macro-scale phenomenon that varies from region to region, but also manifests itself within regions. Figure 6 presents clusters of high and low CVs within the Chicago metropolitan area. The map highlights multiple high-CV concentrations around central Chicago and one in (nearby) central Gary, Indiana. The low-CV areas are generally located in the exurban periphery of the region, with notable exceptions, such as the village of Oak Lawn, located southwest of downtown Chicago.
Fig. 6 Concentrations of median household income coefficient of variation (CV), Chicago metropolitan area census tracts: ACS 2006–2010. Results are based on the local Moran’s I statistic. Red tracts are in statistically significant high-CV clusters, blue tracts are in clusters of low CVs, and white tracts are not part of a statistically significant cluster

The spatial pattern seen in Chicago is indicative of a broader pattern of higher CVs near city centers and lower CVs at city edges. We explore this pattern by grouping census tracts from the largest 150 metropolitan statistical areas (MSAs)6 based on relative distance from their city center.7 This approach allows us to standardize distances from regions of all sizes and shapes—for example, from places with a waterfront downtown, like San Diego, to places with a centered downtown, like Denver, Colorado.8

Figure 7 summarizes the data from the 150 MSAs and shows a steep decline in the CV as distance from the urban core initially increases. This decline eventually moderates, and the CV then begins to rise again toward the peripheries of the regions.
Fig. 7 Summary of median household income coefficient of variation based on distance from urban center, census tracts for the largest 150 metropolitan areas: ACS 2006–2010

In addition to a clear spatial structure, CVs of median household income in large U.S. metropolitan areas also display a pattern across income levels. Figure 8 is similar in design to Fig. 7 except that it shows tracts ordered by increasing income within their MSA before they are split into percentile bins. This approach allows us to control for intermetropolitan variations in income. The results show that uncertainty in median household income declines as median household income increases.
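The within-MSA ordering described above can be sketched as follows. The field names (`msa`, `income`, `cv`), the number of bins, and the toy data are all hypothetical; the point is only that each tract is ranked against other tracts in its own MSA before CVs are pooled by bin.

```python
from collections import defaultdict


def mean_cv_by_within_msa_bin(tracts, n_bins=10):
    """Rank tracts by income inside each MSA, bin the ranks, and average CV per bin."""
    by_msa = defaultdict(list)
    for t in tracts:
        by_msa[t["msa"]].append(t)

    binned = defaultdict(list)
    for members in by_msa.values():
        ordered = sorted(members, key=lambda t: t["income"])
        for rank, t in enumerate(ordered):
            # Map the within-MSA rank onto bins 0 .. n_bins - 1.
            bin_index = rank * n_bins // len(ordered)
            binned[bin_index].append(t["cv"])

    return {b: sum(v) / len(v) for b, v in sorted(binned.items())}


# Two hypothetical MSAs with very different income levels.
toy = [
    {"msa": "A", "income": 20_000, "cv": 0.20},
    {"msa": "A", "income": 30_000, "cv": 0.15},
    {"msa": "A", "income": 40_000, "cv": 0.10},
    {"msa": "A", "income": 50_000, "cv": 0.05},
    {"msa": "B", "income": 60_000, "cv": 0.18},
    {"msa": "B", "income": 80_000, "cv": 0.12},
    {"msa": "B", "income": 100_000, "cv": 0.08},
    {"msa": "B", "income": 120_000, "cv": 0.06},
]
bins = mean_cv_by_within_msa_bin(toy, n_bins=2)
print(bins)
```

Because ranks are computed inside each MSA, the lowest-income tracts of a wealthy region fall in the same bin as the lowest-income tracts of a poorer one.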
Fig. 8 Variation in uncertainty on median household income coefficient of variation based on increasing income, largest 150 metropolitan areas census tracts: ACS 2006–2010

The similar patterns in Figs. 7 and 8 are likely related given that lower-income residents in some MSAs live closer to the urban core. The increasing level of uncertainty at the urban periphery is likely caused by the diversity of exurban locations, which can range from wealthy suburban enclaves to lower-income agricultural communities.

Multivariate Methodology and Data

The previous section shows clear geographic and socioeconomic patterns in the quality of ACS median household income estimates. In this section, we introduce a multivariate model to examine the robustness of those findings while simultaneously controlling for other potential factors that might affect the quality of ACS estimates. Examples of these factors are measures of demographic diversity, the number of ACS survey responses, and population change.

We introduce a national scale spatial regression framework that accounts for spatial patterns in the data. Next, we extend this model to assess the stability of these results between the North and South, motivated by the North-South variation in uncertainty found in Fig. 4. Finally, we subset the data to assess the relationship between distance to urban center and income uncertainty for tracts within metropolitan areas.

Data

The dependent variable in the model introduced in the next section is the coefficient of variation (CV) of median household income. Due to the complexity of the sampling and weighting processes used to compute ACS estimates, the Census Bureau estimates standard errors (the numerator of the CV) using a method called “successive differences replication.” This approach computes all ACS estimates 80 times using different base weights on the completed surveys, and then uses all this information to compute the variability in the actual estimate. (For more details, see U.S. Census Bureau 2009b: chapter 12.)
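As a rough illustration of the replication approach, the sketch below applies the successive differences replication variance formula as given in Census Bureau documentation, where the variance is 4/80 times the sum of squared differences between each of the 80 replicate estimates and the full-sample estimate. The replicate values here are invented for illustration, and the function name is hypothetical.

```python
import math


def sdr_standard_error(full_estimate, replicate_estimates):
    """Standard error from 80 replicate estimates via the SDR variance formula."""
    if len(replicate_estimates) != 80:
        raise ValueError("the ACS uses 80 replicate estimates")
    squared_diffs = sum((r - full_estimate) ** 2 for r in replicate_estimates)
    return math.sqrt(4 / 80 * squared_diffs)


# Hypothetical tract: full-sample median income of $50,000, with replicate
# estimates that alternate $1,000 above and below it.
full = 50_000
replicates = [full + 1_000 if k % 2 == 0 else full - 1_000 for k in range(80)]
se = sdr_standard_error(full, replicates)
print(se, se / full)
```

Dividing the resulting standard error by the estimate gives the CV used as the dependent variable.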

When considering potential determinants of uncertainty in ACS estimates, we first consider the total number of surveys collected in the particular tract (i.e., the number of housing units responding to the ACS, hu_respond). More responses are expected to reduce the CV. This is the only variable in the model taken from the ACS. Unlike most ACS data, this variable is a count of completed surveys, not an estimate. We specifically do not include other ACS variables in the model because they are expected to be affected by the same problems that we aim to understand.

It is important that the variables used to explain CV in median household income are, as much as possible, unaffected by measurement error. As a result, the pool of potential explanatory variables is heavily constrained, so we turn to the 2010 decennial census and a 2010 restricted-use database from the U.S. Department of Housing and Urban Development (HUD). Given that the ACS data are collected over five years, there is some degree of temporal mismatch. However, the extent to which this affects our results is limited because the variables used are rather persistent over time. The variables used and their sources are summarized in Table 1.
Table 1 Description of dependent and independent variables

| Group | Variable | Description | Expected Sign | Source |
|---|---|---|---|---|
| Dependent Variable | Income CV | Median household income uncertainty (CV) | | 2006–2010 ACS |
| | hu_respond | Responses to the ACS (housing units) | | 2006–2010 ACS |
| Sociodemographics (X) | hud_total | Federally subsidized (housing units) | + | HUD |
| Sociodemographics (X) | black_rt | African American residents (rate) | + | 2010 Census |
| Sociodemographics (X) | hisp_rt | Hispanic residents (rate) | + | 2010 Census |
| Sociodemographics (X) | race_simp | Racial/ethnic diversity (Simpson) | + | 2010 Census |
| Sociodemographics (X) | age_simp | Resident age diversity (Simpson) | + | 2010 Census |
| Housing and Residential Structure (H) | group_pop | Group quarters (population) | + | 2010 Census |
| Housing and Residential Structure (H) | vacant | Vacant (housing units) | + | 2010 Census |
| Tract Structure (T) | area | Size of tract (land area) | + | 2010 Census |
| Tract Structure (T) | tr_nochange | 2000–2010 tract boundary stability (dummy variable) | | 2000–2010 Census (a) |
| Tract Structure (T) | dist2center (b) | Distance to urban core (meters) | | 2010 Census (c) |
| Tract Structure (T) | population | Total residents (population) | | 2010 Census |

Notes: All variables are measured in levels unless noted in the table. Natural logarithms are taken of each variable before use in the econometric model.

(a) Computed by authors based on the physical census tract boundary changes between 2000 and 2010 reported in the 2010 Census Tract Relationship Files.

(b) Included only in regressions for metropolitan area.

(c) Urban centers are identified using the U.S. Geological Survey’s Geographic Names Information System. The latitude-longitude marker for the first city listed in the MSA name is extracted from the database, and then distances are computed from each census tract centroid to the urban center.
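The distance from a tract centroid to an urban center, as described in the table notes, could be approximated as in the sketch below. The haversine formula on a spherical Earth is an assumption made here for illustration; the source does not state which distance method or projection was used, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude-longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical tract centroid and urban center marker.
centroid = (41.93, -87.77)
center = (41.88, -87.63)
dist2center = haversine_m(*centroid, *center)
print(round(dist2center))
```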

Our concern in the multivariate models that follow relates to potential correlation of socioeconomic and/or geographic attributes with the quality of income estimates. Ideally, there would be no correlation, a result indicating that uncertainty at the level of the census tract is not a systematic function of the place. We therefore model characteristics of places along three dimensions: sociodemographics and diversity, housing and residential environment, and tract structure.

The first dimension considers residents’ demographic characteristics (race, ethnicity, and diversity), along with a proxy for income in the form of the number of federally subsidized housing units (hud_total).9 Diversity is measured using the Simpson index (also known as the Herfindahl index), which takes higher values if the population is spread more evenly across groups and lower values if the population is more concentrated in a single group within the tract. We measure diversity in resident age and race/ethnicity of residents, with the expectation that more diverse places will have higher income variability and thus higher CV since the population is not of uniform type.10
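A diversity measure with the properties described above can be sketched as one minus the sum of squared group shares. This complement form is an assumption chosen because it rises as the population spreads more evenly across groups; the authors' exact formula is given in a note that is not reproduced here. The group counts are hypothetical.

```python
def simpson_diversity(group_counts):
    """Return 1 - sum(p_k^2), where p_k is each group's share of the total."""
    total = sum(group_counts)
    if total == 0:
        raise ValueError("diversity is undefined for an empty population")
    return 1 - sum((count / total) ** 2 for count in group_counts)


# Hypothetical tracts: fully concentrated, two equal groups, four equal groups.
print(simpson_diversity([100, 0, 0]))
print(simpson_diversity([50, 50]))
print(simpson_diversity([25, 25, 25, 25]))
```

Under this form a tract with a single group scores 0, and the score approaches 1 as residents are spread evenly over more groups.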

The second dimension captures variables on the types of housing units in the place (group quarters population and vacant units). The final dimension investigates the census tract as a unit of analysis by looking at (1) variation in land area; (2) whether the tract population was stable over the previous decade, proxied by tracts that are split because population growth pushed them over the maximum size threshold or merged because population decline placed them below the minimum size threshold; and (3) population (with more populous tracts expected to have less uncertainty).

The unit of analysis is always the census tract. Although there are high-level parameters that constrain tract delineation (U.S. Census Bureau 1994), exceptions do occur. For this reason, we include only those census tracts with a population greater than 500 and with more than 200 housing units. In general, our inclusion definition excludes anomalous tracts, such as national parks, lakes, large prisons, and large college dormitories. We also exclude tracts with a household income estimate but no associated MOE. These exclusions generally occur when the median household income estimate is at the reporting bounds of $2,499 or $250,001. Combined, these steps remove 1,186 tracts. The geography is further constrained to the contiguous United States, leaving 71,353 census tracts for the national analysis and 58,386 for the MSA analysis. The northern region includes 33,093 tracts compared with 38,260 tracts in the South.
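The inclusion rules in this paragraph amount to a simple filter. The sketch below uses hypothetical field names and records, and it leaves out the restriction to the contiguous United States.

```python
def keep_tract(tract):
    """Apply the inclusion rules: enough people, enough housing units, and an MOE."""
    return (
        tract["population"] > 500
        and tract["housing_units"] > 200
        and tract["income_moe"] is not None
    )


# Hypothetical tract records.
tracts = [
    {"id": "t1", "population": 4_000, "housing_units": 1_600, "income_moe": 6_500},
    {"id": "t2", "population": 350, "housing_units": 220, "income_moe": 9_000},
    {"id": "t3", "population": 2_500, "housing_units": 150, "income_moe": 7_200},
    {"id": "t4", "population": 3_100, "housing_units": 1_200, "income_moe": None},
]
kept = [t["id"] for t in tracts if keep_tract(t)]
print(kept)
```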

Model

The national model to assess the relationships between income uncertainty and response, diversity, population change, and other factors is specified by the following multivariate equation:
$$ \log y_i = \alpha + \delta \log \mathit{hu\_respond}_i + \phi \log \mathbf{X}_i + \beta \log \mathbf{H}_i + \gamma \log \mathbf{T}_i + u_i, \tag{1} $$
where $y_i$ represents the median household income coefficient of variation in tract $i$.

The median household income CV in tract $i$ is explained by the number of housing units responding to the ACS ($\mathit{hu\_respond}_i$); a set of five sociodemographic variables related to race and diversity in the tract ($\mathbf{X}_i$); two characteristics of the housing environment indicating potential residential instability ($\mathbf{H}_i$); and four indicators of population change and distance to urban core ($\mathbf{T}_i$) (see Table 1 for details). The equation takes a log-log specification that allows the associated regression coefficients δ, ϕ, β, and γ to be interpreted as elasticities, expressing the percentage change expected on $y_i$ given a 1 % increase in the explanatory variable.

Given the fine-grained scale of census tracts, it is likely that some of the unobserved characteristics captured by the error term ($u_i$) in Eq. (1) are spatially autocorrelated, in which case the estimates lose precision (Anselin 1988). To account for this form of spatial dependence, we assume a spatial autoregressive error term of the following structure:
$$ u_i = \lambda \sum_j w_{ij} u_j + \varepsilon_i, \tag{2} $$
where $w_{ij}$ is the $ij$th element of a spatial weights matrix $\mathbf{W}$ that formally represents the spatial connectivity of tracts, and $\varepsilon_i$ is an independent and identically distributed (i.i.d.) and well-behaved disturbance.11 The spatial autoregression coefficient λ captures the strength and direction of the autocorrelation in the error terms from Eq. (1). These models are estimated with a generalized method of moments, following the approach proposed by Arraiz et al. (2010), which suggests an estimator robust to spatial autocorrelation and heteroskedasticity.
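For readers who prefer matrix notation, the error structure in Eq. (2) can be stacked over all tracts and solved for the error vector. The step below is a standard rearrangement rather than something stated in the source; it writes $\mathbf{u}$ and $\boldsymbol{\varepsilon}$ for the stacked vectors, $\mathbf{I}$ for the identity matrix, and assumes that $\mathbf{I} - \lambda \mathbf{W}$ is invertible.

$$ \mathbf{u} = \lambda \mathbf{W} \mathbf{u} + \boldsymbol{\varepsilon} \quad \Longrightarrow \quad (\mathbf{I} - \lambda \mathbf{W})\, \mathbf{u} = \boldsymbol{\varepsilon} \quad \Longrightarrow \quad \mathbf{u} = (\mathbf{I} - \lambda \mathbf{W})^{-1} \boldsymbol{\varepsilon} $$

Written this way, a shock to one tract's disturbance carries over to its neighbors, and through them to more distant tracts, with a strength governed by λ.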

We next explore regional variations in the correlates of income CV. To do so, we apply a spatial regimes approach that determines whether factors associated with CV play a different role in the northern and southern parts of the country.12 In essence, the national model is estimated separately for the North and South (spatial regimes). The census tract remains the unit of analysis.

The baseline model of Eq. (1) is expanded with spatial regimes, yielding the following:
$$ \log y_{ir} = \alpha_r + \delta_r \log \mathit{hu\_respond}_{ir} + \phi_r \log \mathbf{X}_{ir} + \beta_r \log \mathbf{H}_{ir} + \gamma_r \log \mathbf{T}_{ir} + u_{ir}, \tag{3} $$
where the subscript r indicates membership in the North or South; otherwise, the specification remains the same. Thus, we obtain a set of parameters (α, δ, ϕ, β, γ) for both the northern and southern United States. Also, we subset the spatial weights matrix W so that only neighbors within the same region remain connected. This formulation is equivalent to running separate regressions for each group of observations under the same r subscript. The advantage of this framework is that it allows for a significance test of the equality of estimated parameters between regions by means of the spatial Chow test (Anselin 1990).
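The subsetting of the weights matrix can be illustrated with a neighbor list. In this sketch, the dictionary representation, the tract identifiers, and the regime labels are hypothetical; links between tracts in different regimes are simply dropped.

```python
def subset_weights_by_regime(neighbors, regime):
    """Keep only neighbor links whose two endpoints share the same regime."""
    return {
        i: [j for j in nbrs if regime[j] == regime[i]]
        for i, nbrs in neighbors.items()
    }


# Hypothetical chain of four tracts that straddles the North-South boundary.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
regime = {0: "North", 1: "North", 2: "South", 3: "South"}
within_regime = subset_weights_by_regime(neighbors, regime)
print(within_regime)
```

After subsetting, tracts 1 and 2 are no longer connected, so each regime's error structure depends only on tracts in the same region.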

Finally, we examine the relationship between income uncertainty and distance to the urban center. Because this relationship can be assessed only within metropolitan areas, we restrict this analysis to all census tracts in the 150 largest MSAs in the country. MSAs are defined by the U.S. Office of Management and Budget and approximate a functional urban region. Although MSAs do not cover the entire extent of the country, these 150 MSAs were home to approximately 72 % of the U.S. population in 2010.

Findings

Generally, the models indicate statistically significant differences between the northern and southern United States (see Chow tests,13 Table 2). Investigating the individual variables within the model shows that most have the same directional effect on CV but at different magnitudes. For example, the regression coefficients on the number of housing units that responded to the ACS are negative for both regions (–0.4732 in the North and –0.4279 in the South), but the difference between them is statistically significant. Likewise, the regression coefficients on the percentage of the population that is African American are positive for both regions (.1676 in the North and .0817 in the South), but the difference between them is statistically significant as well. An exception is that increased land area leads to higher CV in the South and lower in the North.
Table 2 National and regional spatial regression results with spatial regimes tests

| | National Model | MSA Model | Regimes Model: North | Regimes Model: South | Regimes Model: Chow Test |
|---|---|---|---|---|---|
| Constant | 1.4105** | 1.5197** | 1.4180** | 1.5811** | * |
| hu_respond | –0.4950** | –0.4649** | –0.4732** | –0.4279** | ** |
| hud_total | 0.0338** | 0.0334** | 0.0356** | 0.0347** | |
| black_rt | 0.1384** | 0.1010** | 0.1676** | 0.0817** | ** |
| hisp_rt | 0.2046** | 0.2293** | 0.2696** | 0.2104** | |
| race_simp | –0.0189** | –0.0367** | –0.0111 | –0.0615** | ** |
| age_simp | –0.1417** | –0.1529** | –0.2515** | –0.0929** | ** |
| group_pop | 0.0115** | 0.0131** | 0.0125** | 0.0132** | |
| vacant | 0.1167** | 0.1130** | 0.1141** | 0.1013** | * |
| area | 0.0068** | –0.0009 | –0.0110** | 0.0216** | ** |
| tr_nochange | 0.0272** | 0.0233** | 0.0042 | 0.0290** | ** |
| population | –0.1801** | –0.1968** | –0.1755** | –0.2348** | ** |
| dist2center | | –0.0076** | | | |
| λ | 0.1597** | 0.1426** | 0.1102** | 0.1449** | ** |
| Pseudo-R² | .3209 | .3154 | .3839 | .2652 | ** |
| N | 71,353 | 58,386 | 33,093 | 38,260 | |

Notes: The dependent variable is the median household income coefficient of variation. All variables are in logarithms, except the dummy variable tr_nochange. Pseudo-R² is the squared correlation between the actual and predicted dependent variable.

p < .10; *p < .05; **p < .01

Both racial diversity and changes to tract boundaries are insignificant in the North but significant in the South. We also find that increases in the number of responses and age diversity reduce CV more in the North than the South, and increases in percentage African American increase CV more in the North than South. Increases in total tract population have a greater impact on CV reduction in the South than the North.

Mirroring the results from the section Patterns in Estimate Quality, we find the proxy for low-income neighborhoods (hud_total) is positively associated with income CV after controlling for other relevant factors. A 1 % increase in the number of HUD-assisted low-income rental units results in a 0.03 % to 0.04 % increase in CV. Moreover, this result remains stable across regions, as indicated by the insignificant Chow test (Table 2).

Next, Fig. 7 demonstrates that uncertainty decreases with distance from the city center, and this finding also holds in a multivariate context. After we control for 11 other determinants of income CV, a 1 % increase in distance from the city center is associated with a small yet significant percentage reduction in CV (–0.0076). As mentioned, this result was computed only for the subset of tracts in metropolitan areas.

As expected, a key component of income CV relates to survey response. Of all determinants, higher number of responses (hu_respond) is most strongly, significantly, and negatively related to CV across the country, in urban areas and in both regions (Table 2). In other words, more survey responses are associated with less uncertainty. For the country as a whole, a 1 % increase in the number of responding housing units results in almost a 0.5 % reduction (–0.495) in CV. The removal of the covariates in X, H, and T from Eq. (1) has little influence on the estimated impact of the number of responses.
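Because the model is log-log, the effect of a larger change in responses follows from raising the proportional change to the power of the elasticity. The sketch below works through this arithmetic with the national coefficient on hu_respond from Table 2; the function name is hypothetical, and the calculation holds all other variables fixed.

```python
def pct_change_in_cv(elasticity, pct_increase):
    """Percentage change in CV implied by a log-log elasticity."""
    ratio = 1 + pct_increase / 100
    return (ratio ** elasticity - 1) * 100


national_elasticity = -0.4950  # hu_respond coefficient, national model

one_percent = pct_change_in_cv(national_elasticity, 1)
doubling = pct_change_in_cv(national_elasticity, 100)
print(round(one_percent, 2), round(doubling, 1))
```

Under this coefficient, a 1 % increase in responses lowers the CV by roughly 0.49 %, and a doubling of responses lowers it by roughly 29 %.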

The relationship between the number of responses and income CV is weakest in urban areas (–0.4649) and the South (–0.4279). The difference in this relation between northern and southern areas is significant (p < .01), as indicated by the Chow test. These results reinforce the premise that more raw data from which to build the estimates results in lower uncertainty in those estimates, as standard statistical sampling theory predicts.14
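A simplified benchmark helps place these magnitudes. Suppose, as an illustration only, that the standard error of an estimate $\hat{\theta}$ based on $n$ responses behaves like $c/\sqrt{n}$ for some constant $c$ (a hypothetical symbol introduced here), which ignores the ACS's complex sampling design and weighting. Then:

$$ \mathrm{CV} = \frac{\mathrm{SE}(\hat{\theta})}{\hat{\theta}} \approx \frac{c}{\hat{\theta}\sqrt{n}} \quad \Longrightarrow \quad \log \mathrm{CV} \approx \log \frac{c}{\hat{\theta}} - \frac{1}{2} \log n $$

Under this assumption the elasticity of CV with respect to the number of responses would be –0.5, so the estimated coefficients on hu_respond in Table 2 (between –0.43 and –0.50) sit close to, but slightly below in magnitude, that benchmark.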

Although most variables have the expected impact on uncertainty as outlined in Table 1, we find an inverse relationship between racial/ethnic diversity and income CV after holding a tract’s share of African American and Hispanic residents constant. In other words, higher levels of racial and ethnic segregation (i.e., lower diversity) are associated with larger income CVs. This relationship is strongest in the southern region, where a 1 % increase in the Simpson index (indicating more diversity) results in a 0.06 % decrease in income CV. We find the same negative relationship with age diversity. Given that the number of responses is controlled, we had expected more homogeneous places to be easier to measure and thus have lower uncertainty.

The results for population stability (tr_nochange), measured by lack of change in tract boundaries from 2000 to 2010, are also counterintuitive. If tract boundary changes indicate large population growth or decline, then we would expect this instability to lead to measurement challenges. However, we find that more stability leads to higher CV; the exception is for the North, where this finding is insignificant.

The physical land area of a census tract turned out to be the most unstable variable across the four models. Larger land areas tend to reflect lower population densities. The coefficient is insignificant in the MSA model, positive for the national and South models, and negative for the North model. These differences likely reflect the nature of tract size distribution in the different models. The national model includes large rural tracts, which are mostly excluded in the MSA model; the higher-density North has, on average, smaller tracts than the South.

The model for the northern region has the largest explanatory power (pseudo-R² of .3839), followed in order by that for the country overall, urban areas, and the southern region. The parameter λ, alluding to the spatial dependence in the error terms, is highly significant in all models, with the smallest magnitude in the northern region (0.1102). Although tests for multicollinearity (not reported) suggest its presence, a simulation experiment indicates negligible differences in the final estimates.15

Implications

As shown in the previous sections, the quality of median household income estimates from the ACS is correlated with both geographic location and the attributes of places. These findings remain intact even after controlling for the number of respondents in the tract, which is the strongest determinant of uncertainty. The implications of this variation can be profound.

It is not uncommon today for researchers across the social sciences to present a map of demographic data and then draw inferences from that map. These maps can show clear patterns and lead to statements such as “housing affordability is lower in the eastern part of the state than the western,” or “inner-city rates are higher than suburban.” However, the unequal distribution of uncertainty across these places implies that our confidence in these conclusions should be tempered. Some areas of the map likely have highly reliable estimates, while others vary to the point of making the observed differences meaningless. In this context, we discuss approaches to help ACS data users navigate this dilemma: first, in the arena of descriptive statistics; and second, when using these data in models.

The goal of integrating uncertainty into descriptive reporting of ACS estimates is to add value to tables, charts, and maps through clear communication of the error associated with the estimates. The most straightforward example of this is adding a column of MOEs to a table of estimates or including a second map that shows the MOEs. However, these approaches may not be the most effective means of communicating uncertainty information to all audiences, especially those unfamiliar with the interpretation of survey data. It is possible that a coarser representation of the uncertainty could be used, such as red/yellow/green icons (mimicking a traffic light) attached to each estimate in a table, indicating how much caution should be used when interpreting the value.
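As a minimal sketch of the traffic-light idea, the following Python snippet converts a published MOE to a CV and assigns a color. It relies on the fact that ACS MOEs are published at the 90 % confidence level, so that the standard error is MOE / 1.645. The cutoffs of 12 % and 40 % and the tract values are illustrative assumptions, not recommendations from this article.

```python
import math

Z_90 = 1.645  # ACS margins of error are published at the 90% confidence level


def cv_from_moe(estimate, moe):
    """Coefficient of variation: standard error divided by the estimate."""
    if estimate <= 0:
        return math.inf
    return (moe / Z_90) / estimate


def traffic_light(cv, green_max=0.12, yellow_max=0.40):
    """Map a CV to a coarse reliability flag (cutoffs are assumed, not official)."""
    if cv <= green_max:
        return "green"
    if cv <= yellow_max:
        return "yellow"
    return "red"


# Hypothetical tract rows: (label, median income estimate, MOE)
rows = [
    ("Tract A", 52000, 4100),
    ("Tract B", 31000, 9800),
    ("Tract C", 24000, 18500),
]

flags = {}
for label, est, moe in rows:
    cv = cv_from_moe(est, moe)
    flags[label] = traffic_light(cv)
    print(f"{label}: estimate={est}, MOE={moe}, CV={cv:.2f}, flag={flags[label]}")
```

Exposing the cutoffs as parameters keeps the choice of "how much caution" with the analyst, since appropriate thresholds depend on the application.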

In a cartographic context, interactions of color intensity or hatching overlays could be used to create a single map that integrates the estimates and MOEs (see, e.g., Sun and Wong 2010). For additional discussions and approaches, readers are referred to extensive research on visualizing geospatial uncertainty (e.g., MacEachren 1992; MacEachren et al. 2005; Wong and Sun 2013).

Correlations, regression coefficients, and other forms of model output derived from ACS estimates are fraught with hidden statistical issues because input data are measured with error. The primary issue is attenuation bias, which causes the magnitude of a statistic (e.g., correlation or regression coefficient) to be reduced when one or more variables are measured with error. Corrections for the correlation coefficient have existed for more than a century (Spearman 1904) but not without controversy (Muchinsky 1996).
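Attenuation bias is easy to reproduce in a simulation. The sketch below is a hedged illustration using synthetic data, not ACS data: it assumes classical measurement error (independent, mean-zero noise added to the explanatory variable) with a noise variance equal to the variance of the true variable. In that case, the textbook reliability ratio is 0.5, so a true slope of 2 is expected to be estimated at roughly 1.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(7)
n = 20000
true_slope = 2.0

# Synthetic "true" explanatory variable and outcome.
true_x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * x + rng.gauss(0, 1) for x in true_x]

# Observed version of x, contaminated with classical measurement error.
noisy_x = [x + rng.gauss(0, 1) for x in true_x]

slope_true = ols_slope(true_x, y)
slope_noisy = ols_slope(noisy_x, y)

# Reliability ratio: var(true x) / (var(true x) + var(noise)) = 1 / (1 + 1).
reliability = 1.0 / (1.0 + 1.0)

print(f"slope using error-free x: {slope_true:.3f}")
print(f"slope using noisy x:      {slope_noisy:.3f}")
print(f"expected attenuated slope: {true_slope * reliability:.3f}")
```

With ACS data, the noise variance differs from tract to tract, so the degree of attenuation would not be constant across observations as it is in this sketch.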

With regression, the issue is more complex. Standard econometric theory assumes that explanatory variables are deterministic and measured without error. The presence of a variable with error in the design matrix has the potential to taint all regression coefficients in unpredictable ways (Greene 2003: chapter 5). The main problem for these models lies in the interpretation of regression coefficients that are likely to be biased if the error is large. Various errors-in-variables and instrumental variables–type approaches have been considered as potential fixes for this problem (Bound et al. 2001). The spatial structure embedded in the ACS is another option to consider in order to ameliorate the data challenges (for a discussion on the topic, see Anselin and Lozano 2008).

For the dependent variable, consequences are less severe since the measurement mismatch is transferred to the error component of the regression specification. The error component can then be appropriately modeled to account for its structure. One strength of ACS data, compared with other cases of error in measurement, is that the magnitude of the error on each estimate is known and can be used in the model specification. Future research along these lines could improve not only how we interpret model results based on survey data with systematically varying estimate quality but also how the actual results are computed.

In recent years, the Census Bureau has made a concerted effort to address the challenge of variation in the uncertainty through more sophisticated sampling approaches (Bruce and Robinson 2009). When the ACS began, an area (a census block) was assigned to one of seven sample rate categories. Subsequent research into this initial approach has resulted in an increase to 16 categories (Sommers and Hefter 2010). However, budget constraints limit the Census Bureau to 3.54 million surveys per year, making this a zero-sum game: if one area is oversampled, another must be undersampled. Thus, the variation in uncertainty is reduced by allowing some places to have worse estimates and others to have better ones.

The problem of too few completed surveys on which to construct reliable estimates can be addressed directly by the user. By combining estimates, either by attribute or geographic area, the user boosts the amount of raw data supporting newly derived estimates. For example, a set of three age ranges built by collapsing a set of 10 age ranges will nearly always be more reliable (U.S. Census Bureau 2009a). In the income example, tracts could be classified by income group (e.g., low or high income) instead of by the continuous variable median income; misclassification will certainly happen for tracts near the threshold, but the problem of income uncertainty is reduced.
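The gain from collapsing categories can be illustrated with the standard approximation for the MOE of a sum of count estimates, which is the square root of the sum of squared MOEs. This approximation ignores covariance between the components. The sketch below applies it to three hypothetical age-range counts for one tract; all values are invented for illustration.

```python
import math

Z_90 = 1.645  # ACS margins of error are published at the 90% confidence level


def cv_from_moe(estimate, moe):
    """Coefficient of variation implied by an estimate and its 90% MOE."""
    return (moe / Z_90) / estimate


def combine_counts(estimates, moes):
    """Sum count estimates; approximate the MOE as the root sum of squares."""
    total = sum(estimates)
    total_moe = math.sqrt(sum(m ** 2 for m in moes))
    return total, total_moe


# Hypothetical counts for three adjacent age ranges in one tract.
estimates = [120, 150, 90]
moes = [60, 70, 55]

component_cvs = [cv_from_moe(e, m) for e, m in zip(estimates, moes)]
total, total_moe = combine_counts(estimates, moes)
combined_cv = cv_from_moe(total, total_moe)

print("component CVs:", [round(c, 3) for c in component_cvs])
print(f"combined estimate={total}, MOE={total_moe:.1f}, CV={combined_cv:.3f}")
```

The combined MOE is larger than any single component MOE in absolute terms, but it grows more slowly than the estimate itself, which is why the combined CV is lower.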

Alternatively, geographic areas can be combined into custom regions. Spielman and Folch (2015) presented an automated approach for geographic aggregation that joins census tracts into “regions” in such a way that estimates for the regions have CVs below a user-defined threshold. These user-based approaches take the uncertainty tradeoff out of the hands of the Census Bureau and put it directly in the hands of the researcher, who achieves the desired reliability by sacrificing granularity.
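The general idea of aggregating until a reliability target is met can be sketched with a deliberately simple greedy procedure. This is not the algorithm of Spielman and Folch (2015). It is a toy example that assumes the tracts are listed so that consecutive entries are neighbors, that the variable is a count (so estimates can be summed), and that the root-sum-of-squares approximation for the MOE of a sum applies. The function name `greedy_regions` and the tract values are hypothetical.

```python
import math

Z_90 = 1.645  # ACS margins of error are published at the 90% confidence level


def cv_from_moe(estimate, moe):
    """Coefficient of variation implied by an estimate and its 90% MOE."""
    return (moe / Z_90) / estimate if estimate > 0 else math.inf


def greedy_regions(tracts, max_cv):
    """Merge consecutive tracts until each region's CV is at or below max_cv.

    tracts: list of (tract_id, count_estimate, moe), ordered so that
    consecutive entries are adjacent (an assumption of this toy example).
    Returns a list of (tract_ids, estimate, moe) tuples.
    """
    regions = []
    ids, est, moe_sq = [], 0.0, 0.0
    for tract_id, e, m in tracts:
        ids.append(tract_id)
        est += e
        moe_sq += m ** 2
        if cv_from_moe(est, math.sqrt(moe_sq)) <= max_cv:
            regions.append((ids, est, math.sqrt(moe_sq)))
            ids, est, moe_sq = [], 0.0, 0.0
    if ids:
        # Leftover tracts never reached the target; attach them to the last
        # region if one exists. The merged region may still exceed max_cv.
        if regions:
            last_ids, last_est, last_moe = regions.pop()
            ids = last_ids + ids
            est += last_est
            moe_sq += last_moe ** 2
        regions.append((ids, est, math.sqrt(moe_sq)))
    return regions


# Hypothetical tracts along a corridor: (id, count estimate, MOE)
tracts = [
    ("A", 200, 150),
    ("B", 180, 140),
    ("C", 400, 120),
    ("D", 150, 130),
    ("E", 300, 120),
]

regions = greedy_regions(tracts, max_cv=0.30)
for ids, est, moe in regions:
    print(ids, f"estimate={est:.0f}", f"CV={cv_from_moe(est, moe):.3f}")
```

A realistic implementation would work with true contiguity relationships and would try to keep regions internally homogeneous, which this sketch does not attempt.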

Conclusion

When researchers think of measurement error, they generally hope that it is low and of the well-behaved type that follows a uniform random distribution across all observations. What this research has shown is that neither of these assumptions should be made when working with ACS data. The issue of higher levels of uncertainty for ACS estimates compared with the decennial census is widely recognized among researchers; however, the variation in this uncertainty is not as well understood. As this work has shown, not only is uncertainty clustered over space but also the characteristics of places can inform the magnitude of uncertainty.

Using median household income at the census tract scale as an exemplar, we find higher uncertainty in urban cores relative to suburban areas and in lower-income areas, as well as a differential pattern in uncertainty in the northern United States relative to the southern United States. These patterns remain visible even when controlling for the number of survey respondents by location and other correlates of uncertainty. The implication for users is that seemingly straightforward interpretations of ACS estimates require examination of the underlying uncertainty in the estimates.

Although this is clearly a critical issue for researchers, it does not mean that ACS estimates have too much error at the tract level to be useful in research and decision making. Some places have reliable data, and to the extent that a research project is confined to these locations, there is little to worry about. For those conducting research with mixed uncertainty levels, we highlighted approaches that can reveal the uncertainty information in mapping and tabular formats.

Users can also directly control the problem by grouping estimates by census tracts into regions or by attribute—that is, by reducing the granularity of the data.

Creative strategies are needed to both lower overall uncertainty in raw ACS estimates and reduce variation in this uncertainty. Published work by the U.S. Census Bureau (e.g., Castro and Hefter 2008; Sommers and Hefter 2010) and actual changes in the ACS show that efforts continue to be made along both these fronts. A strength of the ACS data collection model is that methodological improvements can be made at any time and do not need to wait for a decadal cycle to repeat, which opens the door for innovative ideas from the research community to be integrated into the ACS in a reasonable time frame.

Footnotes

  1.

    Children in poverty are those under 18 living in a family whose income is below the poverty level and who are related to the householder by birth, marriage, or adoption. In 32,007 of the 72,539 census tracts (44.1 %) in the contiguous United States, the MOE is greater than or equal to the estimate.

  2.

    The 2011–2013 ACS estimates represent the last three-year data release.

  3.

    By 2009, the Census Bureau had collected sufficient data to start distributing one-year, three-year, and five-year estimates each year.

  4.

    These estimates are based on census tracts in the contiguous United States, with outlier tracts removed (see the Data section).

  5.

    The statistic is computed for each census tract on the map. Statistical significance is based on 999 random permutations of the data, and a significance level of .05.

  6.

    These areas are determined by the December 2009 definition of Core Based Statistical Areas (http://www.census.gov/population/metro/files/lists/2009/List1.txt).

  7.

    We order the census tracts within an MSA by increasing distance from their city center; we then split this ordered list into 100 bins (percentiles). We repeat this process for the 150 largest MSAs and then pool the tracts by percentile for all MSAs into a single set of 100 bins. When completed, the first bin has the urban core of all the MSAs, and the 100th bin has the urban fringe of each MSA. The points in Fig. 7 represent the median CV value from each bin.

  8.

    Urban centers are identified using the U.S. Geological Survey Geographic Names Information System. The latitude-longitude marker for the first city listed in the MSA name is extracted from the database, and then distances are computed from each census tract centroid to the urban center.

  9.

    Federally subsidized housing programs include public housing (traditional and HOPE VI), multifamily housing (including housing for the elderly and disabled, Sections 202 and 811), and vouchers (predominantly, housing choice vouchers for tenants). Address-level records are aggregated to the tract level.

  10.

    Age diversity is computed using four groups: younger than 18 years, 18–34 years, 35–64 years, and 65 and older. Racial/ethnic diversity is also computed using four groups: white (non-Hispanic), African American (non-Hispanic), Asian (non-Hispanic), and Hispanic.

  11.

    All the results shown relate to a spatial weights matrix built using the queen contiguity criterion, under which two observations are neighbors and are thus assigned a weight of 1 if they share a border of any length, including a single point. This matrix is then standardized so that every row sums to 1, effectively converting \( \sum_j w_{ij} u_j \) into the average value of u in the surroundings of i.

  12.

    The southern part of the country is defined as Alabama, Arizona, Arkansas, California, Colorado, Delaware, District of Columbia, Florida, Georgia, Kentucky, Louisiana, Maryland, Mississippi, Nevada, New Mexico, North Carolina, Oklahoma, South Carolina, Tennessee, Texas, Utah, Virginia, and West Virginia. The remaining states, excluding Alaska and Hawaii, are assigned to the northern part.

  13.

    The Chow test identifies whether there is a statistical difference between the magnitude of regression coefficients in the North and South models.

  14.

    Because it is directly related to the standard error of the estimate, in a simple random sample and with variance held constant, the CV is proportional to \( 1/\sqrt{n} \). In a log-log setting, such as the one used in our regressions, this means that sample size (hu_respond) should theoretically have an effect of \( -0.5 \) on the CV (because \( 1/\sqrt{n} = 1/n^{0.5} = n^{-0.5} \)), which is almost exactly what we find in the national regression. The smaller coefficients associated with the other specifications and geographic variation in the coefficient likely reflect differences in the response propensities of those particular subsets of households. These varying response propensities create “design effects.” In other words, the need to adjust for varying response propensities, in a sense, makes the sample less random, thus loosening the theoretical relationship between sample size and CV.

  15.

    The simulations repeatedly dropped 10 % of the observations and then reestimated the model. The distribution of parameter estimates across the simulations displayed only minor variations in the estimates.

Notes

Acknowledgments

David C. Folch and Seth E. Spielman acknowledge financial support from the National Science Foundation (Grant No. 1132008). Julia Koschinsky acknowledges funding from the National Institutes of Health (Grant 2-R01CA126858). The authors are solely responsible for the accuracy of the statements and interpretations contained in this publication. Such interpretations do not necessarily reflect the views of any government. This work used the Python Spatial Analysis Library (Rey and Anselin 2007; www.pysal.org).

References

  1. Anselin, L. (1988). Spatial econometrics. Dordrecht, The Netherlands: Kluwer Academic Publishers.
  2. Anselin, L. (1990). Spatial dependence and spatial structural instability in applied regression analysis. Journal of Regional Science, 30, 185–207.
  3. Anselin, L. (1995). Local indicators of spatial association—LISA. Geographical Analysis, 27, 93–115.
  4. Anselin, L., & Lozano, N. (2008). Errors in variables and spatial effects in hedonic house price models of ambient air quality. Empirical Economics, 34(5), 5–34.
  5. Arraiz, I., Drukker, D., Kelejian, H., & Prucha, I. (2010). A spatial Cliff-Ord-type model with heteroskedastic innovations: Small and large sample results. Journal of Regional Science, 50, 592–614.
  6. Bazuin, J. T., & Fraser, J. C. (2013). How the ACS gets it wrong: The story of the American Community Survey and a small, inner city neighborhood. Applied Geography, 45, 292–302.
  7. Bound, J., Brown, C., & Mathiowetz, N. (2001). Measurement error in survey data. In J. J. Heckman & E. Leamer (Eds.), Handbook of econometrics (Vol. 5, pp. 3705–3843). Amsterdam, The Netherlands: Elsevier Science.
  8. Bruce, A., & Robinson, J. G. (2009). Tract level planning database with census 2000 data (Technical report). Washington, DC: U.S. Census Bureau.
  9. Castro, E. C., Jr., & Hefter, S. P. (2008). Redesigning the American Community Survey computer assisted personal interview sample. In Proceedings of the Survey Research Methods Section, American Statistical Association. Retrieved from http://www.amstat.org/sections/srms/Proceedings/
  10. Citro, C. F., & Kalton, G. (2007). Using the American Community Survey: Benefits and challenges. Washington, DC: National Academies Press.
  11. ESRI. (2011). The American Community Survey (Technical report). Redlands, CA: ESRI.
  12. Greene, W. (2003). Econometric analysis. Upper Saddle River, NJ: Prentice Hall.
  13. MacDonald, H. (2006). The American Community Survey: Warmer (more current), but fuzzier (less precise) than the decennial census. Journal of the American Planning Association, 72, 491–503.
  14. MacEachren, A. M. (1992). Visualizing uncertain information. Cartographic Perspectives, 1992(13), 10–19.
  15. MacEachren, A. M., Robinson, A., Hopper, S., Gardner, S., Murray, R., Gahegan, M., & Hetzler, E. (2005). Visualizing geospatial information uncertainty: What we know and what we need to know. Cartography and Geographic Information Science, 32, 139–160.
  16. Muchinsky, P. M. (1996). The correction for attenuation. Educational and Psychological Measurement, 56, 63–75.
  17. Rey, S. J., & Anselin, L. (2007). PySAL: A python library of spatial analytical methods. Review of Regional Studies, 37, 5–27.
  18. Salvo, J. J., & Lobo, A. P. (2006). Moving from a decennial census to a continuous measurement survey: Factors affecting nonresponse at the neighborhood level. Population Research and Policy Review, 25, 225–241.
  19. Sommers, D., & Hefter, S. P. (2010). American Community Survey sample stratification—Current and new methodology (Technical report). Washington, DC: U.S. Census Bureau.
  20. Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
  21. Spielman, S. E., & Folch, D. C. (2015). Reducing uncertainty in the American Community Survey through data-driven regionalization. PLoS ONE, 10(2). doi:10.1371/journal.pone.0115626
  22. Spielman, S. E., Folch, D. C., & Nagle, N. N. (2014). Patterns and causes of uncertainty in the American Community Survey. Applied Geography, 46, 147–157.
  23. Sun, M., & Wong, D. W. S. (2010). Incorporating data quality information in mapping American Community Survey data. Cartography and Geographic Information Science, 37, 285–299.
  24. U.S. Census Bureau. (1994). Geographic areas reference manual (Technical report). Washington, DC: U.S. Census Bureau.
  25. U.S. Census Bureau. (2009a). A compass for understanding and using American Community Survey Data: What researchers need to know. Washington, DC: U.S. Government Printing Office.
  26. U.S. Census Bureau. (2009b). Design and methodology: American Community Survey. Washington, DC: U.S. Government Printing Office.
  27. Wong, D. W., & Sun, M. (2013). Handling data quality information of survey data in GIS: A case of using the American Community Survey data. Spatial Demography, 1, 3–16.

Copyright information

© Population Association of America 2016

Authors and Affiliations

  • David C. Folch
    • 1
  • Daniel Arribas-Bel
    • 2
  • Julia Koschinsky
    • 3
  • Seth E. Spielman
    • 4
  1. Department of Geography, Florida State University, Tallahassee, USA
  2. Department of Geography and Planning, University of Liverpool, Liverpool, UK
  3. Center for Spatial Data Science, University of Chicago, Chicago, USA
  4. Department of Geography, University of Colorado at Boulder, Boulder, USA
