Introduction

The 2020 census has received more attention, and been the subject of more controversy, than any other United States census in recent memory. Census-related issues, from the proposed addition of a citizenship question to the challenges of collecting data in the midst of the global COVID-19 pandemic, have been ever-present in the headlines. As of December 2020, the controversy around the census is still ongoing: the Supreme Court is now hearing a case proposing to remove unauthorized immigrants from the apportionment counts, the population measurements that re-allocate seats in the House of Representatives (and presidential electoral votes) across states. The potential implications of this ruling cannot be overstated and will likely be felt far beyond the next 10 years.

Behind all of this controversy and attention is concern about the failure of the census to count all residents of the United States, commonly referred to as undercount. The circumstances around Census 2020 have aroused great concern among demographic experts and policymakers that this census may suffer from unprecedented levels of inaccuracy. Particularly troubling is a possible increase in the undercount of communities of color and the foreign born. These populations are known to have experienced disproportionately high levels of undercount in previous censuses and are likely among the groups most affected by the challenges presented by the pandemic and the current political climate.

Despite the long-standing knowledge that particular groups are missed more frequently in the census at the national level, very little is known about how undercount rates vary locally by race and Hispanic status. For example, Hispanic children in Ohio and Black children in Pennsylvania may be undercounted at a different rate than Hispanic children in New Mexico and Black children in Mississippi. As decennial census counts determine the allocation of over $1.5 trillion annually in federal funding across states and counties (Sullivan, 2020), differences in undercount across geographic areas may leave some communities under-resourced to serve their neediest members, many of whom are children. Indeed, state and local funding allocations for several programs that directly serve low-income children, such as the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), Head Start, and the State Children’s Health Insurance Program (SCHIP), are determined by local area census counts of young children (Reamer, 2017). The population aged 0–4 has also historically suffered some of the highest census undercount rates of any age group.

In this paper I estimate how the census undercount of young children varies across states by race and Hispanic status. Research on geographic variation in undercount is scarce as the data necessary to form highly accurate subnational population measures independent of the census is difficult to obtain. Besides being of interest due to their outsize role in determining both undercount levels and funding allocations, young children are among the easiest groups to estimate from administrative sources, mostly due to the high accuracy of U.S. birth records. To further increase accuracy, I narrow my focus to native-born children and estimate undercount by state of birth (rather than state of residence) to remove any reliance on measures of international and internal migration. I am the first to my knowledge to document the substantial variation in undercount rates by state in the 2000 census, and the first to show state-level undercount patterns by race and Hispanic status for any census. As the identification of place of birth relies on information collected on the long form, discontinued after Census 2000, I cannot repeat my analysis for more recent censuses. However, the patterns shown in my 2000 results are very similar to those shown in the single extant similar study of young children in 2010—evidence that such patterns are likely consistent across time and still informative today. I also am the first to correlate state-level undercount patterns with administrative data from birth certificates, showing that for native-born young children, foreign-born status of the mother is strongly associated with higher undercount at the state level, particularly for Black children. Results for Hispanic children suggest caution is needed when using the Hispanic status of the parents to define that of the child, particularly in light of recent work on the fluidity of race and ethnic status (Liebler et al., 2017).

This paper proceeds as follows. The next section provides a short summary of the existing research on census undercount of young children, framing my work within the literature. The following section describes the calculation of the undercount of young children by state of birth, presents results, and tests the robustness of the results to assumptions about net emigration. The penultimate section correlates birth certificate information on mothers with state-level undercounts by race and Hispanic status. The final section concludes.

Background

Young children are among the most undercounted age groups in the United States census and in the censuses of other countries (O’Hare, 2017), a pattern dating back to at least 1850 (Hacker, 2013) and continuing through the present day (O’Hare, 2014b, 2015; Robinson et al., 1993, 2002). Despite the prevalence of the phenomenon, the underlying reasons why young children are more often missed than older individuals remain mysterious. To be undercounted, young children are either left off a household’s census form while other members are enumerated, or the entire household is missed but only the undercount of young children is evident due to overcount of older age groups. The relative contribution of these two mechanisms to the undercount of children remains unknown. However, children in households with so-called “hard-to-count” characteristics, such as belonging to a minority racial and ethnic group or residing in rental housing, are more likely to be undercounted than children who do not share these characteristics (Bruce & Robinson, 2003).Footnote 1

The two existing studies most similar to mine are O’Hare (2014c) and O’Hare et al. (2016). They both use the demographic analysis (DA) method to estimate state-level undercounts of young children aged 0–4 in the 2010 census. The DA methodology involves creating a measure of the population independent of census counts, usually using births, deaths, and measures of net migration, to estimate undercount.Footnote 2 O’Hare (2014c) and O’Hare et al. (2016) use the Census Bureau’s Vintage 2010 state population estimates as a benchmark for comparison to the census counts. For children aged 0–4, these population estimates consist of birth records adjusted for deaths and net internal and international migration. O’Hare (2014c) calculates undercount by state for all young children in 2010, and shows this undercount is strongly correlated with the size of the Black and Hispanic populations in each state, as well as with state-level measures of linguistic isolation, low education, and unemployment. O’Hare et al. (2016) focus on Hispanic children only, and show that the undercount of Hispanic children in the 2010 census is concentrated among a small number of states.Footnote 3

This paper expands on this existing work in several important ways. First, despite using slightly different samples of young children, the pattern of undercount across states is remarkably similar across my results and those of the two studies above. As we use two completely different censuses—2000 and 2010—this is evidence that the factors driving these patterns are not unique to a single census and are likely persistent through time. Second, I estimate state-level undercounts for four different racial and ethnic groups: non-Black, Black, non-Black Hispanic, and non-Black non-Hispanic, while previous work only did so for all children and Hispanic children. State-level undercount patterns are quite different across these four groups. Third, unlike previous work, I am able to form uncertainty measures on my undercount estimates and test how assumptions about net emigration affect my results. Finally, I employ multivariate regression analysis to correlate the characteristics of mothers and fathers of young children from birth certificates with state-of-birth level undercounts, providing a more detailed picture of the relationship between the undercount of young children and parents’ characteristics than the methods used in previous work.

Undercount of Native-Born Children by State of Birth

Data and Methods

Demographic analysis has been used by the Census Bureau since 1950 to measure the population coverage of the decennial census (Robinson, 2010). DA uses measures of births, deaths, and migration to form a measure of the national population independent of any census count, called the demographic estimate (DE). This value is then compared to the census estimate (CE) from the decennial census to calculate the undercount rate of the census:

$$\begin{aligned} \%U = \frac{{\rm DE} - {\rm CE}}{{\rm DE}} \times 100. \end{aligned}$$
(1)

The undercount rate (\(\%U\)) is expressed as a percentage. I follow the convention used by the Census Bureau: positive values imply an undercount (the demographic estimate is larger than the census estimate) and negative an overcount (the census estimate exceeds the demographic estimate).
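As a minimal illustration of Eq. (1) and its sign convention, the following Python sketch computes the undercount rate from a demographic estimate and a census estimate. The function name and the counts in the usage lines are hypothetical and are not drawn from the data used in this paper.

```python
def undercount_rate(demographic_estimate: float, census_estimate: float) -> float:
    """Percent undercount as in Eq. (1): (DE - CE) / DE * 100."""
    if demographic_estimate <= 0:
        raise ValueError("demographic estimate must be positive")
    return (demographic_estimate - census_estimate) / demographic_estimate * 100.0


# Hypothetical counts, for illustration only.
rate_under = undercount_rate(1000.0, 950.0)   # census below DE: positive value (undercount)
rate_over = undercount_rate(1000.0, 1020.0)   # census above DE: negative value (overcount)
```

A positive return value corresponds to an undercount and a negative value to an overcount, matching the Census Bureau convention described above.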

In DA, the demographic estimate is assumed to be the correct measure of the population. However, some components of the demographic estimate are more accurate than others. In particular, three components of net migration have no administrative records: foreign-born emigration, native-born emigration, and “residual foreign-born migration” (including unauthorized and quasi-legal migrants).Footnote 4 Measures of these components are based on residual methods, combining estimates from previous censuses with administrative records like death records and legal immigration records.Footnote 5 As the estimates of these components are based largely on assumptions rather than administrative records (such as the accuracy of the enumeration of the previous census), it is hard to evaluate their accuracy. By focusing exclusively on native-born young children, I do not need to incorporate potentially inaccurate estimates of the foreign-born. However, children born in the United States may emigrate prior to the census, and these children should therefore be excluded from the demographic estimate. As the United States does not keep records of citizens who leave the country, measuring emigrants is difficult. Given the lack of reliable estimates of emigration, I make the necessary assumption that the emigration of native-born young children is zero. If this assumption is violated, the demographic estimate is biased upward, increasing the likelihood my results show an undercount of native-born children. Later in the paper, I test the implications of this assumption for my results.

Data

Under the assumption of zero out-migration, I only require data on births and deaths to calculate the demographic estimate of native-born young children aged 0–4. Data on births come from the National Center for Health Statistics’ (NCHS) Vital Statistics Natality Birth Data. As I calculate undercount for native-born children aged 0–4 on April 1, 2000, I use births that occurred in the United States between April 1, 1995 and March 31, 2000. Each birth record contains information on the child (such as birthweight, sex, month of birth, location of birth, etc.) and the parents (such as age, race, education, place of residence, etc.). In addition to using these records for birth counts by cohort, race/Hispanic status, and state of birth, I use them to compute the state-of-birth-level information on mothers used later in the paper. Registration of births in the United States is assumed to be complete for my cohorts of interest, as the Census Bureau assumes complete registration beginning with the 1985 birth cohort.

I use death data from the NCHS Vital Statistics Multiple Cause-of-Death Mortality Data. The data contain individual records for all deaths occurring in the United States in each calendar year, and include date of death, place of death, state of birth, age, residence, sex, race, and cause of death. I use all deaths occurring between April 1, 1995 and March 31, 2000 of children in my birth cohorts of interest. However, as death records do not contain month of birth, and only month of death and age, a challenge arises in assigning decedents to specific birth cohorts. I describe the procedure I employ to assign deaths to birth cohorts in Appendix 2.

For my census estimate, I use a different source than the Census Bureau. As the Bureau only reports undercount by race and age, they use the 100% census counts for their census estimate. As I measure undercount for the native born by state of birth, I use the 5% PUMS as my census estimate (Ruggles et al., 2020), as place of birth is only recorded on the census long form in 2000. The long form was administered to approximately 16% of the population. However, one can still form full population counts from the 5% PUMS using the provided weights, which are constructed to match the full population counts.
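Forming population counts from a weighted sample amounts to summing the person weights of the sampled records in each cell. The sketch below illustrates the idea in plain Python; the record layout and the field names `state_of_birth` and `weight` are hypothetical stand-ins rather than actual PUMS variable names, and the weights are made up for illustration.

```python
from collections import defaultdict


def weighted_counts(records):
    """Sum person weights by state of birth to approximate population counts."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["state_of_birth"]] += rec["weight"]
    return dict(totals)


# Hypothetical sample records: each sampled child stands in for `weight` children.
sample = [
    {"state_of_birth": "OH", "weight": 20.0},
    {"state_of_birth": "OH", "weight": 18.0},
    {"state_of_birth": "NM", "weight": 22.0},
]
counts = weighted_counts(sample)
```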

Race and Hispanic Status

One of the largest challenges in demographic analysis is the assignment of race and Hispanic status. In order for DA to be accurate, the values of these characteristics must be consistently defined across multiple sources: birth certificates, death certificates, and census responses. In the case of young children, census responses are not self-responses, but those of a proxy (usually a parent). Race and Hispanic status are similarly recorded by proxy on death certificates. Birth certificates contain this information for both parents (if it’s not missing, as discussed below), which must be used to define the race and Hispanic status of the child. Due to the difficulty of ensuring race is defined consistently across these sources, the Census Bureau historically reported DA estimates for two race categories: Black and non-Black, as they did in 2000.Footnote 6 Starting in 2010, an additional category was added: Black alone or in combination (Devine et al., 2010).

As the Bureau did in 2000, I also use two race categories in my analysis. I use the same race assignment rule for Black and non-Black as the Census Bureau did in 2000: the father rule. Research done by the Bureau and others showed that this rule produced an estimate of the Black population that most closely matched census estimates (Passel, 1990; Robinson, 2010). This involves assigning the race of the father to the child, and if the father’s race is missing, assigning the mother’s race.Footnote 7 Census 2000 was the last year this rule was used in DA. With the increase in multiracial births, the Census Bureau developed a file in the early 2000s which linked children aged 0 to 17 with their parents in the 2000 census. This “Kid Link File” was used to establish patterns of race reporting in multiracial households, and the distribution of responses was applied to aggregated birth records to assign race in 2010 (Sink & Jensen, 2019). As this file was not available in 2000, the father rule was used to assign births to the two race categories.
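The father rule described above can be sketched in a few lines of Python. This is only an illustration of the assignment logic; the function name, the use of `None` for a missing value, and the category labels are assumptions made for the example and do not reflect the coding of the actual birth records.

```python
from typing import Optional


def assign_child_race(father_race: Optional[str], mother_race: Optional[str]) -> Optional[str]:
    """Father rule: use the father's race; fall back to the mother's if it is missing."""
    if father_race is not None:
        return father_race
    return mother_race


# Hypothetical records using the two race categories of the analysis.
child_a = assign_child_race("Black", "non-Black")   # father's race is used
child_b = assign_child_race(None, "non-Black")      # father's race missing: mother's is used
```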

The assignment of Hispanic status to births in 2000 is more complicated than the assignment of race. The Census Bureau did not report DA results by Hispanic status in 2000; they began doing so in 2010 for ages 0–19 only (Devine et al., 2010). Hispanic status was not reported by all states on the birth and death certificates until 1993 and 1997, respectively, preventing the use of DA for the Hispanic population for older birth cohorts. As I focus on young children aged 0–4 in 2000, Hispanic status is available for these cohorts. However, unlike race, Hispanic status is missing on some birth and death records, and the fraction of records missing this characteristic varies by state. In Appendix 3, I show the extent of this variation and describe how I account for missing values when assigning Hispanic status to births and deaths. As I did for race, I use the father rule to assign Hispanic status to births.Footnote 8

The 2000 census was the first census in which a respondent could identify as belonging to multiple races, which could complicate the assignment of race and Hispanic status. To simplify the issue, I use the variable RACESING, generated by IPUMS-USA, to identify a child as Black or non-Black in census data (Ruggles et al., 2020). This variable was created by the IPUMS team to enable the consistent definition of race across earlier censuses that only allowed one race and census 2000 and later. It assigns each multiple-race person to a single race based on age, sex, Hispanic status, and geographic characteristics. Hispanic status is recorded separately from race, and I use the IPUMS variable HISPAN to assign children as Hispanic or non-Hispanic.

The Undercount Rate and Uncertainty

Once I form the demographic and census estimates, I calculate the undercount rate as in Eq. (1). I estimate undercounts for the native-born population aged 0–4 by race (Black and non-Black), Hispanic status,Footnote 9 and state of birth. I do not estimate undercount by sex as undercount rates for young children do not vary by sex (O’Hare, 2014a).

Typically, undercount rates are reported without any measures of uncertainty.Footnote 10 Birth and death records as well as the 100% census counts are assumed to comprise the full population and therefore have no associated standard error. Uncertainty in measures of migration, while suspected to be substantial, is not incorporated in official undercount estimates reported by the Census Bureau. In my case, as my demographic estimate only contains birth and death records of recent birth cohorts, I assume it has no associated error. However, to calculate undercount by state of birth, I use the 5% PUMS for the census estimate. The 2000 5% PUMS is a stratified subsample of the full census sample that received the census long form (about 16% of the population), and therefore has sampling error (U.S. Census Bureau, 2003). The supplied sampling weights can be used to form standard errors and confidence intervals. If the demographic estimate falls outside of the 95% confidence interval of the census estimate, the undercount rate is significantly different from zero at that level of confidence. If not, the undercount rate is statistically indistinguishable from zero.
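The significance check described above can be sketched as follows. The sketch assumes a normal approximation with a critical value of 1.96 and takes the standard error of the census estimate as a given input; how that standard error is derived from the sampling weights is not shown here, and all names and numbers are hypothetical.

```python
def undercount_is_significant(demographic_estimate: float,
                              census_estimate: float,
                              census_se: float,
                              z: float = 1.96) -> bool:
    """Return True if DE lies outside the confidence interval around CE.

    Assumes a normal approximation; z = 1.96 gives an approximate 95% interval.
    """
    lower = census_estimate - z * census_se
    upper = census_estimate + z * census_se
    return demographic_estimate < lower or demographic_estimate > upper


# Hypothetical values: the same gap is significant with a small SE but not a large one.
tight = undercount_is_significant(1000.0, 950.0, 20.0)   # interval is about [910.8, 989.2]
loose = undercount_is_significant(1000.0, 950.0, 30.0)   # interval is about [891.2, 1008.8]
```

The example also shows why groups with small sample sizes in a state can have large estimated undercount rates that are nonetheless statistically indistinguishable from zero.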

Results

Undercount by Race

The undercount rate for all native-born children aged 0–4 in the 2000 U.S. census is 5.0%, significant at the 95% level. Undercount of this group by state of birth and race (Black and non-Black) is shown in Fig. 1, and in tabular form in Appendix Table 5. The maps in Fig. 1 show states with statistically significant undercount rates; rates for all states are shown in Appendix Table 5. For non-Black young children (Fig. 1a), the overall national undercount is 4.72%, significant at the 95% level. Significant undercount rates range from 1.61 (Pennsylvania) to 10.10 (Hawaii).Footnote 11 Twenty-six of the fifty states have statistically significant levels of undercount for native-born non-Black children. States with the highest levels of undercount tend to be those with large Hispanic populations, such as California (9.74%), Arizona (9.52), Texas (8.36), and Florida (6.11). Two exceptions to this are Tennessee (6.30%) and North Dakota (6.12).Footnote 12

Fig. 1 Census undercount by race and state of birth, native-born children aged 0–4, 2000 census

Fig. 2 Census undercount by Hispanic status and state of birth, non-Black native-born children aged 0–4, 2000 census

Fig. 3 Comparison of 2000 and 2010 state-level undercounts, children aged 0–4

The pattern for undercount of Black native-born young children by state of birth in Fig. 1b is quite different. The national overall undercount is higher than for non-Black native-born young children: 6.45%. States with significant undercount range from Louisiana (4.73%) to Massachusetts (24.56). Several states have undercount estimates less than zero, indicating an overcount, with Washington’s (− 14.75) and New Mexico’s (− 38.73) statistically significant at the 95% level. Fewer states (16 out of 50) have significant undercounts than for non-Black children, although this is unsurprising due to the small Black population in many states leading to very large confidence intervals on the census estimate.Footnote 13 States with the highest undercounts are mostly located in the Northeast, with the exception of Florida and Hawaii.Footnote 14

Undercount by Hispanic Status

Figure 2 shows the undercount of native-born non-Black young children in 2000 by state of birth and Hispanic status. Results are shown in tabular form in Appendix Table 6. Undercount for non-Hispanic non-Black young children is shown in Fig. 2a. Total national undercount for this group is 3.51%, and states with significant undercounts range from Pennsylvania (1.72) to Hawaii (12.52). States with statistically insignificant undercounts have values ranging from − 0.89 (Rhode Island) to 5.79 (Wyoming). As with the undercount for all non-Black children, states with the highest undercounts tend to be in the Southwest.

Undercount for Hispanic children is shown in Fig. 2b. The national total for native-born non-Black Hispanic young children is higher than for non-Hispanic and Black young children: 8.97%. States with significant undercount rates range from Colorado (4.44) to Rhode Island (20.78), and those with non-significant rates from Vermont (− 139) to North Dakota (17.45). Six states have significant overcounts of Hispanic young children. The states with the highest Hispanic undercount rates are in the Northeast and Southwest. Several states with high Hispanic undercount rates also have high non-Hispanic undercount rates, such as Arizona and California.

Comparison to 2010 State-Level Undercount

As I am only able to calculate undercount of native-born children for the 2000 census, a natural concern is whether the patterns across states and racial/ethnic groups evident in the 2000 results are unique to that census or persist through time. Additionally, as my results are only for native-born children by state of birth, can they be used to infer the characteristics of undercount for all children and by state of residence? Insight on these questions may be drawn from comparing my results to those reported by O’Hare (2014c). O’Hare reports undercount by state of residence for all young children in the 2010 census, using the Census Bureau’s population estimates as the demographic estimate. His results for the 2010 census and mine for the 2000 census are shown in Fig. 3.Footnote 15 There are three main differences between my sample and O’Hare’s. First, my sample only includes native-born children aged 0–4, while O’Hare’s includes all children of these ages.Footnote 16 Second, I compute undercount by state of birth, while he calculates it by state of residence. Third, and perhaps most significantly, my results are for the 2000 census while his are for 2010.

There appears to be little difference in undercount of native-born children by state of birth in 2000 and all children by state of residence in 2010. Most states have very similar levels of undercount across the two censuses, and the total undercount rate for the entire country is the same in both years (5.0%). Only seven states have statistically significantly different undercount rates across the two censuses.Footnote 17 In terms of numerical difference, 24 states have undercount values less than 1 percentage point apart in 2000 and 2010, and 39 states are less than 2 percentage points apart. The patterns of undercount across states are also quite similar between the two censuses.

Despite the substantial differences between my 2000 sample and O’Hare’s 2010 sample, both the numerical undercount values and the patterns across states are very similar. Therefore, it appears reasonable to conclude that the patterns and values of undercount for native-born children in 2000 by state of birth likely reflect state-level undercount patterns for all young children that persist across recent census years.

Emigration

A key assumption in my undercount calculations is that the emigration of native-born young children is zero. Obviously, this is untrue; some children born in the U.S. leave soon after birth. If this emigration is substantial, it could mean my demographic estimates—and therefore my undercount results—are too high. Additionally, if children born in some states emigrate at higher rates than children born in other states, not accounting for this emigration could affect the pattern of undercount across states of birth shown by my results.

I estimate the potential effect of emigration on my results using estimates of emigration rates reported in Van Hook et al. (2006), who estimate emigration circa the year 2000 using the Current Population Survey (CPS). Van Hook and coauthors report estimates for the foreign-born population by age and country of origin. The process for adjusting undercount for emigration is described in Appendix 4. The adjustment is state-of-birth specific, as the fraction of births to foreign-born mothers varies by state, as well as by racial/ethnic group. I use two sets of annual emigration rates from Van Hook et al. (2006): the age 25–34 rate (2.7%) and a rate based on national origin: Mexican for non-Black and Hispanic (4.3%), Chinese for non-Hispanic (2.4%), and Caribbean for Black (1.8%). The age 25–34 rate is used as I assume that native-born young children emigrate with their mothers, a large portion of whom are likely in this age range. The national origin rates are chosen based on the largest country of origin for foreign-born women aged 15–49 in the 2000 census in the corresponding race/ethnic category. (The birth records only identify Canada, Cuba, and Mexico for foreign-born mothers, with those born in other countries classified as “remainder of world”. This is not enough detail to be useful in identifying the place of birth of the mother by race and Hispanic status.) All of these rates are likely upper bound estimates on the emigration of native-born young children of foreign-born mothers, as Van Hook and coauthors report that overall women emigrate at a much lower rate than men (1.7 vs. 4.4%), and all the rates I use are for both genders combined. I only adjust the demographic estimate of native-born children born to foreign-born mothers; I still assume that emigration of native-born children born to native-born mothers is zero.

Selected results from this emigration adjustment are shown in Table 1. The table reports overall U.S. results for the four racial/ethnic groups, the number of states with significant undercount rates before and after adjusting for emigration, and the unadjusted and emigration-adjusted undercounts for the five states with the highest unadjusted undercount rates in each category, along with the fraction of births in each state to foreign-born mothers. Results for all states are shown in Appendix Tables 8, 9, 10, and 11. Total U.S. undercount of native-born non-Black children is reduced by approximately 1.5 and 2.5 percentage points when emigration is accounted for using the age 25–34 rate and the Mexican rate, respectively. Of the 27 states with significant unadjusted undercount rates, 21 and 16 remain significant after adjustment using the two different rates, respectively. The reduction for Black children is more modest, from 6.45% to 5.67 and 5.93%, respectively, and all the states with significant unadjusted rates also have significant adjusted undercount rates. For non-Hispanic children, adjusting for emigration reduces undercount by around one percentage point using both rates. Hispanic children are the group with the largest reduction in undercount rates after adjusting for emigration. Adjusting for emigration using the age 25–34 rate cuts the overall national undercount rate by nearly half, and by nearly three-quarters using the Mexican rate. Similarly, the number of states with significant undercount rates falls from 13 to 9 and 6, respectively. In all categories, the states that see the largest reductions in undercount rates are those with the highest proportion of native-born births to foreign-born mothers.

Table 1 Adjusting state of birth census undercount for emigration, native-born children aged 0–4, 2000

As the emigration rates from Van Hook et al. (2006) are not specifically for children aged 0–4, it is difficult to know whether they accurately estimate the emigration of native-born children. However, Masferrer et al. (2019) use Mexican census data to estimate that approximately 107,000 U.S.-born children aged 0–4 lived in Mexico in 2000. Using the Van Hook et al. rates, I estimate that 249,024 native-born young children born to Hispanic mothers emigrated prior to the 2000 census. As the birth records show 72% of births to Hispanic foreign-born mothers are to Mexican-born mothers,Footnote 18 over 179,000 of my Hispanic emigrants are Mexican, assuming they emigrate at the same rate. If the estimates of Masferrer et al. (2019) are correct, the method I use over-adjusts for emigration (at least for the Hispanic population), meaning the true undercount value lies between the unadjusted and adjusted values.
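The comparison in this paragraph rests on simple arithmetic, which the following Python sketch reproduces using only the figures quoted in the text. The variable names are hypothetical, and the final ratio is an illustrative quantity that is not itself reported in the paper.

```python
# Figures quoted in the text.
total_hispanic_emigrants = 249_024   # emigrants implied by the Van Hook et al. rates
share_mexican_born = 0.72            # share of births to Hispanic foreign-born mothers
masferrer_estimate = 107_000         # approximate count from Masferrer et al. (2019)

# Emigrants with Mexican-born mothers, assuming they emigrate at the same rate.
implied_mexican_emigrants = total_hispanic_emigrants * share_mexican_born

# Illustrative ratio of the implied figure to the Mexican census-based estimate.
ratio = implied_mexican_emigrants / masferrer_estimate
```

The implied figure exceeds 179,000, roughly 1.7 times the Mexican census-based estimate, which is consistent with the adjustment overstating emigration for this group.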

In sum, emigration of native-born children of foreign-born mothers may account for high undercount levels of Hispanic, non-Black, and non-Hispanic young children in some states, but rates in large population states with large fractions of births to foreign-born mothers remain high and significant. Emigration of the children of foreign-born Black mothers does not explain the high undercount in states with high foreign-born Black populations.

Correlates with Undercount by State of Birth

In this section, I examine the association between state-level characteristics and the undercount of native-born young children aged 0–4 in the 2000 census. The results of this analysis are not intended to be interpreted as causal evidence of the reasons for the variation in undercount across states. Drawing such conclusions from correlations, without a source of exogenous variation and with relatively sparse data, would be misleading at best. Instead, the intent is to provide insights into characteristics associated with the undercount of young children that can potentially provide a foundation for future research into the causes of undercount.

While O’Hare (2014c) examined the correlation between state-level characteristics and the undercount of young children in the 2010 census, my work expands on his in several ways. First, I perform my analysis for each race and Hispanic group separately, while O’Hare only looks at total state undercount. As undercount patterns across states vary substantially by race and Hispanic status, examining the correlates with undercount separately for these groups may provide more insight into potential explanations for the undercount of young children. Second, I correlate state-level undercounts with mothers’ characteristics from birth records, while O’Hare (2014c) uses full population characteristics drawn from 2010 census counts and the 2010 American Community Survey (ACS). Birth certificate information on mothers, while imperfect, is likely more reflective of the characteristics of individuals who will respond to the census on behalf of their children than full population census or ACS data. The mothers’ characteristics also contain potentially important measures in explaining undercount unexamined by O’Hare, particularly the fraction foreign born. Third, I use OLS regressions to measure the correlation between undercount and the mothers’ characteristics, which enables me to consider not only single-variable correlations, as O’Hare does, but also how multiple characteristics combine to explain patterns of undercount across states. These differences enable me to form a more complete, nuanced picture of how the undercount of young children varies by state than previous work.

Data and Methods

To measure the correlation between state-of-birth level undercount and state-level characteristics of mothers, I use the following specification:

$$\begin{aligned} U_{s,g} = X_{s,g}\beta_{g} + \varepsilon_{s,g} \end{aligned} \qquad (2)$$

where \(U_{s,g}\) is undercount for native-born children born in state s in racial/ethnic group g (non-Black, Black, non-Black non-Hispanic, and non-Black Hispanic), \(X_{s,g}\) is a state-level characteristic (or vector of characteristics) of mothers of children born in state s belonging to group g,Footnote 19 and \(\varepsilon _{s,g}\) is an error term. The vector \(X_{s,g}\) can consist of up to five variables: the fraction of mothers who are foreign born, the fraction who have less than a high school degree, the fraction who are unmarried, the fraction of mothers who are Hispanic, and the fraction of fathers who are Hispanic.Footnote 20 These five measures are constructed using the same birth certificate data used to form the demographic estimate of native-born children aged 0–4. As the information reported on birth certificates is limited, these are the only characteristics I can examine from this source. The coefficient \(\beta _g\) measures the relationship between the characteristic(s) and undercount for racial/ethnic group g. All variables are measured as fractions. The main specification includes all 50 states and the District of Columbia in the sample, and is weighted by the total state population of group g in the 2000 census. Equation 2 is estimated using OLS separately for each of the four racial/ethnic groups g.
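A population-weighted regression of this form can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the estimation code behind the tables: it assumes the model includes an intercept, treats the weights as analytic weights that scale each state's squared residual, and uses made-up numbers for five hypothetical states. The function name `weighted_ols` and all variable names are illustrative.

```python
import numpy as np

def weighted_ols(y, X, w):
    """Weighted least squares with an intercept (assumed); returns (beta, R^2).

    y : undercount fractions, one per state
    X : one or more state-level characteristics (fractions)
    w : weights, e.g. state population of the group
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Design matrix with a leading column of ones for the intercept.
    A = np.column_stack([np.ones(len(y)), X])
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary least squares.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    resid = y - A @ beta
    ybar = np.average(y, weights=w)
    r2 = 1.0 - np.sum(w * resid**2) / np.sum(w * (y - ybar) ** 2)
    return beta, r2

# Made-up values for five hypothetical states (not taken from the paper).
frac_foreign_born = np.array([0.05, 0.10, 0.20, 0.30, 0.40])
undercount = np.array([0.02, 0.03, 0.05, 0.06, 0.10])
population = np.array([1.0e5, 4.0e5, 2.5e5, 8.0e5, 6.0e5])

beta, r2 = weighted_ols(undercount, frac_foreign_born, population)
print("intercept, slope:", beta)
print("weighted R^2:", r2)
```

Passing a two-dimensional array with several columns as `X` gives the multivariate version of the same sketch.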

Results

As an initial analysis, I estimate single variable versions of Eq. 2 for each of the four racial/ethnic groups. Results are shown in Table 2. As all variables are entered as fractions, the coefficient for non-Black children in column 1 can be interpreted as follows: a 10 percentage point increase in the fraction of foreign-born mothers is associated with a 1.78 percentage point increase in undercount.Footnote 21 However, more informative than the coefficients is the \(R^2\) of each regression, which measures the fraction of between-state variation in undercount “explained” by (i.e., associated with) that characteristic. For non-Black children, a very large portion of the between-state variation is correlated with the fraction of foreign-born mothers, mothers with less than a high school degree, Hispanic mothers, and Hispanic fathers, with \(R^2\)s of 0.67, 0.71, 0.77, and 0.77, respectively. The fraction of mothers who are unmarried explains a smaller portion, with an \(R^2\) of 0.30. However, all of these variables are strongly correlated with one another, as evidenced by the correlation coefficients also shown in Table 2. The fraction Hispanic mothers and fraction Hispanic fathers are extremely strongly correlated (with a \(\rho\) of almost 1), and both of these are highly correlated with the fraction of mothers who are foreign-born (\(\rho\) = 0.91). These high correlations are unsurprising, as the Hispanic population makes up a large proportion of the foreign-born non-Black population. Additionally, higher rates of low education and non-husband/wife households have been shown in previous work to be associated with hard-to-enumerate populations (Bruce & Robinson, 2003).
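As a worked illustration of this reading, and assuming the non-Black foreign-born coefficient is approximately 0.178 (the value implied by the 1.78 percentage point figure above, not copied from Table 2), the change in undercount associated with a 10 percentage point change in the regressor is:

$$\begin{aligned} \Delta U = \beta \,\Delta X \approx 0.178 \times 0.10 = 0.0178 \end{aligned}$$

that is, 1.78 percentage points, since both the dependent and independent variables are measured as fractions.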

Table 2 Correlates with state of birth undercount by race and Hispanic status, single variable specifications, native-born children aged 0–4, 2000
Table 3 Correlates with state of birth undercount by race and Hispanic status, multivariate specifications, native-born children aged 0–4, 2000

The pattern is quite different for young Black children. The fraction of mothers who are foreign-born explains the highest proportion of the between-state variation in the undercount of young children for this group, although this proportion is much lower than for non-Black children (\(R^2\) = 0.29). The fractions of mothers and fathers who are Hispanic are both positively correlated with undercount, but the fractions of unmarried and low-educated mothers do not significantly explain any of the between-state variation in undercount for Black children.Footnote 22 For both non-Black non-Hispanic and non-Black Hispanic children, the fraction Hispanic mothers is most strongly correlated with state-of-birth-level undercount of young children, but in each case this variable measures a slightly different characteristic of the child. For non-Hispanic children, the fraction Hispanic mothers is a measure of the fraction of non-Black children with a non-Hispanic father and a Hispanic mother, as Hispanic status is determined using the father rule. (This is also why the fraction Hispanic fathers cannot be included in the Hispanic status regressions, as it is perfectly correlated with Hispanic status using the father rule.) For Hispanic children, this variable measures the fraction of children with two Hispanic parents. Therefore, states with higher fractions of children with a Hispanic mother and a non-Hispanic father tend to have higher undercounts of non-Black non-Hispanic children, and states with a higher proportion of children with two Hispanic parents tend to have higher undercounts of Hispanic young children. The fraction of foreign-born mothers is associated with higher undercount of non-Black non-Hispanic children, but not Hispanic children, while the reverse is true for mothers with less than a high school degree. The fraction of unmarried mothers is not correlated with state-level undercounts of either non-Black Hispanic or non-Black non-Hispanic children.

Results of multivariate analyses are shown in Table 3. As several of the independent variables are highly correlated, I report variance inflation factors (VIFs) and condition numbers for all specifications. Many specifications in Table 3 may suffer from large standard errors and unstable coefficient estimates due to multicollinearity issues, so again I focus my discussion on the \(R^2\) of each regression, which is unaffected by multicollinearity. The \(R^2\)s show how much of the total between-state variation in undercount is correlated with the variables included in the model. Adjusted \(R^2\)s, which account for the increasing number of covariates in each specification, are also shown. For non-Black children, including all five variables in the model explains over 80% of the between-state variation in the undercount of native-born young children, but the addition of the final three variables adds very little explanatory power once the fractions of mothers who are foreign-born and who have less than a high school degree are controlled for. Indeed, the VIFs and condition numbers indicate strong multicollinearity when all variables are included.Footnote 23 In contrast, the five variables together only explain around 30% of the between-state variation in undercount for Black native-born children. Nearly all of this explanatory power is from the fraction of mothers who are foreign-born. For non-Black non-Hispanic children, all variables contribute to the nearly 60% of the explained between-state variation in undercount, although the fraction mothers with less than a high school degree, Hispanic mothers (children with Hispanic mothers and non-Hispanic fathers), and fraction foreign-born mothers are more important in terms of increasing \(R^2\) than fraction unmarried mothers. The fraction of mothers of non-Black Hispanic children who are foreign born does not contribute much to the explanatory power of the model of undercount. The total \(R^2\) of 0.64 is mostly due to the fraction of mothers who are Hispanic, have less than a high school degree, and are unmarried.
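For readers unfamiliar with the multicollinearity diagnostics mentioned above, the following Python sketch shows one common way to compute VIFs and a condition number. It is a minimal illustration under stated assumptions: the auxiliary regressions are unweighted, the condition number is taken from the design matrix (with intercept) after scaling each column to unit length, and the data are randomly generated rather than drawn from birth records. The paper does not state which variant of these diagnostics was used, and the function names are illustrative.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_obs x n_vars).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns plus an intercept (unweighted).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / tss
        out[j] = 1.0 / (1.0 - r2)
    return out

def condition_number(X):
    """Ratio of largest to smallest singular value of the column-scaled design matrix."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])
    A = A / np.linalg.norm(A, axis=0)  # scale each column to unit length
    s = np.linalg.svd(A, compute_uv=False)
    return s.max() / s.min()

# Synthetic data for 51 hypothetical units: x2 nearly duplicates x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 0.5, 51)
x2 = x1 + rng.normal(0.0, 0.01, 51)
x3 = rng.uniform(0.0, 1.0, 51)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
cond = condition_number(X)
print("VIFs:", vifs)
print("condition number:", cond)
```

In this synthetic setup the two nearly identical columns receive large VIFs while the unrelated column stays close to 1, which is the pattern that signals unstable coefficient estimates without affecting \(R^2\).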

The appendix contains several robustness analyses for the models in Table 3. To see whether the results are driven by states with very high levels of undercount, the highest undercount states in each of the four groups are excluded from the models in Appendix Table 12. Results are nearly identical, as are those using only states with statistically significant undercount levels (Appendix Table 13). Appendix Table 14 uses the emigration-adjusted undercounts as the dependent variable. As these undercounts are adjusted using the fraction of mothers who are foreign born, the fraction of between-state variation explained by this characteristic falls substantially for the non-Black, non-Black non-Hispanic, and non-Black Hispanic populations. However, the reduction is much smaller for Black children, and the fraction of mothers who are foreign-born still contributes more to the total explanatory power of the model than the other four variables. The \(R^2\)s for the full models containing all variables are very similar to those in Table 3. Results using state-level measures of foreign-born, Hispanic, and low-educated women aged 15–49 constructed from 2000 census data, shown in Appendix Table 15, are nearly identical to those using birth certificate data. Results based on undercounts in which Hispanic status is assigned by the middle rule rather than the father rule are also very similar (Appendix Table 16).

Conclusion

This paper is the first to calculate census undercount by state of birth, race, and Hispanic status for young children aged 0–4—a population known to suffer from high undercount but easy to measure using administrative records—to gain insight into geographic patterns of undercount by these characteristics. States show a wide range of undercount levels. Non-Black undercount is highest in states in the south and southwest with high Hispanic populations. Undercount for Black native-born children is highest in northeastern states and Florida, states that have a high foreign-born Black population. Non-Black non-Hispanic children have the highest undercount rates among those born in Hawaii, Arizona, and Florida, while the highest rates for non-Black Hispanic children are in the northeastern states of Rhode Island, Connecticut, and New York. Several states, such as California and Texas, have high undercount rates of both Hispanic and non-Hispanic non-Black children. Adjusting for the potential emigration of children of foreign-born mothers may account for the high undercount rates of non-Black, non-Black Hispanic, and non-Black non-Hispanic children born in some states, but does not explain the high undercount of native-born Black children born in states with a high fraction of foreign-born Black mothers.

The strong association between foreign-born mothers and the undercount of Black native-born children is confirmed using OLS regressions of state-of-birth level undercount rates on mothers’ characteristics from birth certificates. The fraction of births to foreign-born mothers has the highest correlation with state-of-birth level undercount of the characteristics examined, although it is highly correlated with fraction of mothers of Black children who are Hispanic (likely due to the large Hispanic Black population from the Caribbean). Fraction of mothers who are Hispanic is also highly correlated with state-of-birth undercounts of non-Black, non-Black Hispanic, and non-Black non-Hispanic children, but fraction foreign-born is not for non-Hispanic and Hispanic children. The fraction Hispanic mothers variable for these two groups actually measures the fraction of mothers of non-Hispanic children and Hispanic children who are Hispanic, as Hispanic status is defined using the father rule. However, using an alternative assignment rule that takes the mother’s Hispanic status into account does not affect the results. The results for non-Hispanic children may indicate a disconnect between these assignment rules and the classification of children’s Hispanic status by their parents in the census. Those for Hispanic children may indicate a higher likelihood of undercount if both parents are Hispanic.

These correlations should not be considered causal evidence of the underlying reasons for undercount. The relatively sparse information contained on birth certificates combined with no source of exogenous variation makes it highly likely that other, unobserved factors are behind the observed undercount patterns, and the state-level correlations may or may not reflect those at a more local (i.e., city or county) level. Despite these caveats, my results highlight the importance of focusing on the foreign-born population to mitigate coverage issues in the census or other surveys. While much attention has deservedly been paid to increasing the census response of foreign-born non-Black Hispanic individuals, my findings show this should not be done at the expense of other foreign-born populations—such as the foreign-born Black population—as they are also likely to suffer from high levels of undercount. Furthermore, my findings show that to address the undercount of young children, we need to focus on parents, particularly mothers. Determining which mechanism is behind child undercount—missing the child only, but counting the parents, or not counting the entire family—is a priority, as remedies would likely take different forms depending on the underlying cause. More research is needed, perhaps using administrative records like birth certificates matched to individual census responses, to uncover the true causes of undercount and ensure the census counts all residents of the United States.