1 Introduction

In this article, we attempt to identify the critical forces driving the massive outbreak of COVID-19 in Los Angeles County, which by January 10, 2021 had registered over 967,000 confirmed cases of the disease (Los Angeles County Department of Public Health 2021).

To that end, we bring together four critical strands of the growing research literature on the worldwide COVID-19 epidemic. First, investigators have attempted to reconstruct the transmission dynamics of local outbreaks by applying theoretical models to data on reported cases (Chang et al. 2020; Fang et al. 2020; Hao et al. 2020). Second, numerous studies have used the techniques of geospatial analysis to evaluate the impacts of public health policies (Dickson et al. 2020; Franch-Pardo et al. 2020; Orea and Alvarez 2020; Zheng et al. 2020). Third, cross-sectional studies have related the age structure and household composition of various countries to COVID-19 incidence and mortality (Aparicio Fenoll and Grossbard 2020; Esteve et al. 2020). And fourth, researchers have increasingly relied on data derived from the movements of smartphones with location-tracking software to study patterns of viral propagation (Dave et al. 2020; Harris, 2020a, d).

While a number of studies have assessed the effects of state-of-emergency and stay-at-home orders, as well as restrictions on restaurants, bars and large social gatherings, these efforts have largely relied upon large cross-sections of state and county data (Cronin and Evans 2020; Gupta et al. 2020). Here, by contrast, we rely upon detailed data on the dynamics of SARS-CoV-2 transmission among approximately 300 communities within Los Angeles County from February 24, 2020 through January 10, 2021. Focusing sharply on Los Angeles County—far and away the largest by population in the United States—we follow in the line of other recent studies attempting to relate transmission patterns to the fine microdetails of individual communities (Horn et al. 2020; Vijayan et al. 2020).

We develop a spatial extension of the conventional SIR epidemic model to study the radial spread of infection among these contiguous communities during the early phases of the epidemic. We merge our geospatial data with census-derived, community-specific data on the characteristics of households, in particular a measure of the prevalence of households at risk for inter-generational transmission. We study how wide local variations in the prevalence of this risk factor result in marked heterogeneity in the reproductive number during the course of the Los Angeles County epidemic.

1.1 The Four Phases

Before delving into the fine details of our data, methods and results, we paint a broad-brush picture of the epidemic under study.

Figure 1 plots the weekly incidence of confirmed COVID-19 cases per 100,000 population in Los Angeles County over a 46-week period, from the week starting Monday, February 24, 2020 (which we designate as week 0) to the week starting Monday, January 4, 2021 (designated week 45). The data points for the figure were derived principally from the web-based dashboard of the Los Angeles County Department of Public Health (DPH), supplemented by the dashboards of the cities of Long Beach and Pasadena, which are situated within Los Angeles County but run their own health departments (Long Beach Department of Health and Human Services 2021; Los Angeles County Department of Public Health 2021; Pasadena Health Department 2021). We characterize these as confirmed cases, as they reflect only those infected individuals who were tested and reported to the DPH or to the two other municipal health departments.

Fig. 1
figure 1

Weekly confirmed COVID-19 cases per 100,000 population in Los Angeles County

We have divided the observation period into four successive phases. Phase I (spanning weeks 0–5) saw an initial rapid increase in confirmed case incidence. By the start of Phase II (spanning weeks 6–20) the epidemic curve was already flattening, as the emergency lockdowns declared in early March by the Los Angeles mayor, the county supervisor and the California governor had begun to bite (Barger 2020; Garcetti 2020a; Newsom 2020). Following the reopening of retail stores, indoor dining, hair salons, gyms and bars in May and June (Money 2020; Parvini 2020; Shalby 2020a), confirmed case incidence rose to a temporary peak of 221 per 100,000 by the week of July 13 (week 20) at the end of Phase II.

Reacting to the surge in cases, the state public health officer ordered the reversal of most of the county’s prior reopening orders (Angell 2020; Gutierrez 2020). The resulting decline in confirmed case incidence, seen during the initial part of Phase III (spanning weeks 21–34) was short-lived, as the county reopened hair salons, nail salons, breweries and shopping malls in September and October (Cosgrove 2020; Shalby 2020b; Shalby and Cosgrove 2020). As Phase III came to a close in week 34, the state issued new guidelines permitting gatherings of up to three households (Times Staff 2020).

The extraordinary, ten-fold surge in confirmed case incidence that followed during Phase IV (spanning weeks 35–45) elevated Los Angeles County to the title of the new COVID-19 epicenter in the United States. Emergency stay-at-home orders issued by the Los Angeles mayor and the California regional health officer during week 40 (Garcetti 2020b; Pan 2020) had little or no short-term detectable effect on indicators of social mobility (Harris 2020f). Confirmed cases during the eleven weeks of Phase IV accounted for more than two-thirds of all confirmed cases recorded during the entire 46-week interval covered by the figure.

2 Data and Methods

2.1 Countywide Statistical Areas

We have already noted our data sources for confirmed COVID-19 cases in Los Angeles County and the cities of Long Beach and Pasadena. While Fig. 1 shows incidence trends at the global, countywide level, our analysis was focused principally on a detailed study of confirmed case incidence among the more than 300 communities within the county.Footnote 1 For this purpose, we relied upon the DPH’s geographic breakdown based upon countywide statistical areas (CSAs), a mixed classification of independent cities such as the City of Beverly Hills, neighborhoods within the city of Los Angeles such as Hollywood, and unincorporated places such as Hacienda Heights (City of Los Angeles 2020). In this scheme, both Long Beach and Pasadena have their own CSAs.

2.2 American Community Survey Data

We relied on the 2015–2019 five-year public use microsample from the U.S. Census Bureau’s American Community Survey (ACS) (U.S Census Bureau 2021). The nationwide database covered 788,475 households and group living arrangements with a total of 1,887,461 persons. Out of the entire database, 203,545 households and group living arrangements with 510,501 persons were identified as residing in one of 69 public use microdata areas (PUMAs) within Los Angeles County (U.S Census Bureau 2020b).Footnote 2

We used the person records of the public use microsample to identify households with at least four persons, of whom at least one person was 18–34 years of age and at least one other person was at least 45 years of age. We describe these households here as at risk for multi-generational transmission (abbreviated MULTI in the results below). Among all such at-risk families in the Los Angeles County extract of the ACS, 43% had four persons, 27% had five persons, 14% had six persons, and 15% had seven or more persons. Among the 69 public use microdata areas (PUMAs), the median proportion of at-risk households was 14.6%.

Using the internal household sampling weights provided by the ACS, we then computed the proportion of at-risk households in each PUMA. Applying a Census Bureau crosswalk between PUMAs and census tracts (U.S Census Bureau 2020a) as well as a DPH-provided crosswalk between census tracts and CSAs, we determined the corresponding proportions of at-risk households in each CSA. Among 300 CSAs, the median proportion of at-risk households was 13.8%.

We similarly used the ACS five-year public use microsample to determine three other CSA-based indicators: the proportion of households receiving food stamps under the Supplemental Nutrition Assistance Program (SNAP, median 7.3% among 300 CSAs); the proportion of households with total income below $22,000 annually, the amount that one full-time worker would earn at California’s minimum wage of $11 per hour (INC22, median 15.5%); and the proportion of households with at least one person engaged in a low-wage occupation that cannot be performed remotely (OCCUP, median 14.1%).Footnote 3

2.3 SafeGraph Data

We relied upon the Patterns database issued by SafeGraph (SafeGraph Inc. 2020), which describes the movements of smartphones equipped with location-tracking software to numerous points of interest throughout the United States. We previously relied upon this data source in a comparative study of the COVID-19 epidemics in Milwaukee and Dane Counties in Wisconsin (Harris 2020a) and a geospatial analysis of the September 2020 COVID-19 outbreak on the campus of the University of Wisconsin-Madison (Harris 2020e).

Here, we focused on fast-food restaurants as points of interest. We used the Patterns location_name variable to identify all entities whose names included at least one of these key words: burger, pizza, pizzeria, taco, taqueria, quesadilla, burrito, chipotle, tortilla, sushi, sashimi, ramen, udon, wok, and noodle. We then used the Patterns brands variable to identify other fast-food chains that were prevalent in Los Angeles County.Footnote 4

Many of these restaurants, particularly the chains, had multiple locations. Each distinct location was identified by a point-of-interest census block group (poi_cbg). For each distinct location, we used the variable visitor_home_cbgs to identify the home census block groups of all visitors during each weekly reporting period, where a device’s home is the location where it is regularly located overnight. For each week from the week starting February 10, 2020 through the week starting January 11, 2021, we then accumulated the respective numbers of restaurant visits originating from each home CBG. Once again taking advantage of the census tract crosswalk provided by the Los Angeles County DPH, we converted these counts into a longitudinal time series of restaurant visits originating from each CSA.

2.4 Spatial SIR Model

We devised a spatial adaptation of a discrete-time SIR (susceptible-infective-resistant) model similar to the model we employed in a study of COVID-19 transmission between younger and older persons in Florida’s most populous counties (Harris 2020b).

To facilitate the exposition, we first review a deterministic SIR model without a spatial component. Let \({S}_{it}\) denote the proportion of susceptible individuals in geographic unit \(i\) at discrete time \(t\). In our empirical application, geographic units will refer to CSAs, and each time period \(t\) will refer to one week, where \(t\) = 0, 1, …, 45. Let \({I}_{it}\) denote the corresponding proportion of infective individuals, and \({R}_{it}\) the corresponding proportion of resistant individuals. For each CSA \(i\), the equation of motion of \({S}_{it}\) is given by

$$ S_{it} = S_{i,t - 1} - \alpha S_{i,t - 1} I_{i,t - 1} $$
(1)

where \(\alpha \) reflects the rate at which susceptible and infective individuals interact, as well as the likelihood that an interaction will result in transmission. We take \(\alpha \) to be an unknown parameter to be estimated from our data.

The corresponding equation of motion of \({I}_{it}\) is given by

$$ I_{it} = \alpha S_{i,t - 1} I_{i,t - 1} + \left( {1 - b} \right)I_{i,t - 1} $$
(2)

where \(1>b>0\) is the rate at which infectives become resistant, either through recovery or death, and \(\left(1-b\right)\) is the corresponding depreciation factor. Rather than estimating \(b\) from our data, we rely on external sources. With a mean duration of infectivity of 5.5 days (Griffin et al. 2020), we assume a weekly depreciation factor of \(1-b=exp\left(-7/5.5\right)\)= 0.28, that is, \(b\) = 72 percent of current infectives become resistant each week. In sensitivity analyses, we tested the effect of increasing the mean duration of infectivity to 6.5 days, so that \(1-b=exp\left(-7/6.5\right)\)= 0.34, that is, \(b\) = 66 percent of current infectives become resistant each week. We further assume that the population of each CSA is closed, so that

$$ S_{it} + I_{it} + R_{it} = 1 $$
(3)

Finally, for each CSA \(i\), we assume the initial condition \({S}_{i0}=1-{I}_{i0}\), where \({I}_{i0}>0\) denotes the proportion of infectives during the initial week \(t\) = 0, and where \({R}_{i0}=0\).

We now add a stochastic component to our deterministic model of Eqs. (1) through (3). For notational compactness, we write \({y}_{it}={S}_{i,t-1}-{S}_{it}\) as the incidence of COVID-19 cases in CSA \(i\) during week \(t\). We also write \({X}_{it}={S}_{it}{I}_{it}\). Equation (1) can then be written as

$$ y_{it} = \alpha X_{i,t - 1} + \varepsilon_{it} $$
(4)

where the error terms \({\varepsilon }_{it}\) are assumed to be independently and identically distributed. We designate this specification as Model 0. This model excludes a constant term, which is ordinarily included in linear models, because all new infections are assumed to arise from contact with other infective persons.

So long as we take the parameter \(b\) as known, we can estimate the unknown parameter \(\alpha \) in Eq. (4) from the available data \(\left\{{y}_{it}\right\}\) on COVID-19 incidence in each CSA \(i\) and week \(t\). That information is sufficient to generate the entire series of \({X}_{it}\). To that end, we start with \({I}_{i0}={y}_{i0}\) for all \(i\), so that \({S}_{i0}=1-{I}_{i0}=1-{y}_{i0}\). For all subsequent weeks \(t>1\), we compute \({S}_{it}={S}_{i,t-1}-{y}_{it}\), and then generate the values of \({I}_{it}\) from Eq. (2). Once we have computed \({S}_{it}\) and \({I}_{it}\), we have \({X}_{it}\) as well. We note that the lagged values \({X}_{i,t-1}\) in Eq. (4) are not constructed from the contemporaneous incidence \({y}_{it}\) on the left-hand side.

We now incorporate a spatial component into our non-spatial Model 0. We write

$$ y_{it} = \alpha X_{i,t - 1} + \gamma \mathop \sum \limits_{j \ne i} w_{ij} X_{j,t - 1} + \varepsilon_{it} $$
(5)

where \(\left\{{X}_{j,t-1}:j\ne i\right\}\) refers to all other CSAs, and where both \(\alpha \) and \(\gamma \) are unknown parameters. Here, \(\left\{{w}_{ij}\right\}\) are elements of a known symmetric matrix \(W\) with zero diagonal elements, where each off-diagonal element represents the influence of geographic unit \(j\ne i\) on the rate of new infections in unit \(i\). The parameter \(\gamma \) thus captures the influence of nearby CSAs. In our empirical application, we set the off-diagonal element \({w}_{ij}\) = 1 if the distance \({d}_{ij}\) between the centroids of CSA \(i\) and CSA \(j\) was no greater than the radius \(r\), and \({w}_{ij}=0\) otherwise, where the distances \({d}_{ij}\) were calculated from the Haversine formula (Hedges 2002). In our base case, we specified a radius \(r\) of 1 km, but we also studied the effect of increasing \(r\) to 1.5 km. We designate the specification in Eq. (5) as Model 1.

Once again, this model excludes a constant term because all new infections are assumed to arise from contact with other infective persons located within the same or adjacent geographic units. As above, we can estimate the unknown parameters \(\alpha \) and \(\gamma \) from the available incidence data \(\left\{{y}_{it}\right\}\) so long as we take the depreciation rate \(b\) and and the matrix \(W\) as known. Similarly, the lagged values \({X}_{i,t-1}\) and \(\left\{{X}_{j,t-1}:j\ne i\right\}\) in Eq. (5) are not constructed from the contemporaneous incidence \({y}_{it}\) on the left-hand side.

We consider a further extension of Model 1 that permits covariates. In keeping with the strong assumption that all new infections arise from contact with other infectives, this model takes the form

$$ y_{it} = \alpha X_{i,t - 1} + \gamma \mathop \sum \limits_{j \ne i} w_{ij} X_{j,t - 1} + \delta Z_{i} X_{i,t - 1} + \varepsilon_{it} $$
(6)

where \({Z}_{i}\) represents a time-independent exogenous characteristic of CSA \(i\), and \(\delta \) is an additional unknown parameter capturing the multiplicative effect of this covariate on the within-CSA transmission rate. We designate this specification as Model 2. In what follows, we estimate Model 2 where \({Z}_{i}\) represents the prevalence of at-risk multi-generational households (MULTI) in CSA \(i\).

Finally, we recognize that the parameters of our models are unlikely to remain constant during the 46-week time period under study. Accordingly, we estimate the parameters \(\alpha \), \(\gamma \), and \(\delta \) separately for each of the four phases described in Sect. 1.1 above.

2.5 Calculating the Reproductive Number

In our non-spatial Model 0, our estimate of the contemporaneous reproductive number in CSA \(i\) at week \(t\) would be \({\mathcal{R}}_{it}=\alpha {S}_{it}/b\). When \({\mathcal{R}}_{it}=1\), the equation of motion of the proportion of infective persons gives \({I}_{it}={I}_{i,t-1}\), that is, an endemic state. When \({\mathcal{R}}_{it}>1\), the proportion of infectives is increasing, and when \({\mathcal{R}}_{it}<1\), the proportion is decreasing. In spatial Model 1, our estimate becomes \({\mathcal{R}}_{it}={\left(\alpha +\gamma \right)S}_{it}/b\), where the term \(\left(\alpha +\gamma \right)\) represents the sum of the within-CSA effect and the effect of nearby CSAs. By extension, in spatial Model 2, we have \({\mathcal{R}}_{it}={\left(\alpha +\gamma +\delta {Z}_{i}\right)S}_{it}/b\). In our empirical analysis below, we estimate the contemporaneous reproductive numbers for the entire county at the starting week for each of the four phases, that is, at \(t\) = 0, 6, 21, and 35, respectively. To that end, we replace \({Z}_{i}\) with the population-weighted mean of MULTI for the county and \({S}_{it}\) with the population-weighted proportion of survivors at week \(t\).

2.6 Accounting for Underascertainment of Cases

It is widely recognized that confirmed COVID-19 cases undercount total incident infections. Asymptomatic cases appear to constitute at least 40–45 percent of all infections (Oran and Topol 2020) and appear to play a dominant role in disease transmission (Moghadas et al. 2020). Still other symptomatic individuals may not have sought testing, especially in the early days of the epidemic when testing criteria were restricted (Centers for Disease Control and Prevention 2020; Reese et al. 2020).

A straightforward way to account for such underascertainment is to proportionately inflate confirmed case counts. If \({n}_{it}\) is the observed number of confirmed cases and 1 > \(f>0\) is the proportion of all infections that go undetected, then the actual number of incident infections would be \({y}_{it}={n}_{it}/\left(1-f\right)\). In our tests of the three spatial models, we applied this inflation factor to confirmed cases, based on the alternative values \(f\) = 0.4 and \(f\) = 0.5.

During the earliest weeks of the epidemic, when very few cases have accumulated, nearly everyone remains susceptible, so that \({S}_{it}\approx 1\) and \({X}_{it}\approx {I}_{it}\). In that case, the deterministic version of our model without spatial effects in Eq. (4) collapses to \({y}_{it}\approx \alpha {I}_{i,t-1}\). Inflating confirmed cases by a factor \(1/\left(1-f\right)\) would have a negligible effect on our estimates of the transmission parameter \(\alpha \) and the radial expansion parameter \(\gamma .\) As the epidemic progresses, however, the correction for underascertainment will magnify the decline in \({S}_{it}\) and thus increase our estimates of \(\alpha \) and \(\gamma \).

3 Summary of Critical Parameters

Table 1 summarizes the critical parameters of our spatial epidemic model. In the table, \(b\) is the proportion of infectives who become resistant each week, as shown in Eq. (2). The base-case and alternative values were based on assumed mean durations of infectivity of 5.5 and 6.5 days, respectively. The parameter \(r\) is the radius of influence underlying the definition of the elements \(\left\{{w}_{ij}\right\}\) in Eq. (5). In the base case, we take \({w}_{ij}\) = 1, when the distance \({d}_{ij}\) between the centroids of CSAs \(i\) and \(j\) is no more than 1 km. In the alternate case, we increased the radius of influence to 1.5 km. The parameter \(f\) is the assumed proportion of all infections that go undetected, which we take to be 0.4 or, alternatively, 0.5.

Table 1 Spatial SIR model parameters

The unknown parameter \(\alpha \) in Eq. (1), which reflects the rate at which susceptible and infective individuals interact, appears in all three models. The unknown parameter \(\gamma \) in Eq. (5), which captures the influence of neighboring communities, appears in Models 1 and 2. Finally, the unknown parameter \(\delta \) in Eq. (6), which captures the effect of the exogenous covariate MULTI, appears in Model 2.

4 Results

4.1 COVID-19 Incidence in Phase IV versus Prevalence of Multi-Generational Households

We focus initially here on Phase IV because of its quantitative importance. We offer some descriptive tests of the potential role of multi-generation transmission during this critical phase, based upon cross-sectional multivariate regression. In the following section, we proceed to our spatial epidemic model, starting with Phase I.

Figure 2 below displays two color-coded maps of the CSAs of Los Angeles County. The left-hand map shows the geographic distribution of all confirmed COVID-19 cases combined during the 11 weeks of Phase IV, expressed as a percentage of the population of each CSA. The right-hand map shows the corresponding distribution of households at risk for multi-generational transmission, expressed as a percentage of all households in each CSA. In both maps, the color gradient has 7 increments, each corresponding to one septile (or 14.2%) of all CSAs.

Fig. 2
figure 2

Two maps of countywide statistical areas in Los Angeles County. Left: incidence of confirmed COVID-19 cases diagnosed during phase IV. Right: prevalence of at-risk multigenerational households

Figure 2 shows a striking concordance between the two geographic distributions. Both figures display marked concentrations in four regions: the Antelope Valley–Palmdale–Lancaster region to the north; San Fernando Valley–Pacolma region to the west; the San Gabriel Eastern Valley–El Monte–West Covina–Pomona region to the east; and the Vernon–Boyle Heights–East Los Angeles–Downey–Inglewood region in the center and to the south. On the left, in particular, the two darkest shaded areas (with a cumulative incidence exceeding 8.2%) comprised half of all COVID-19 cases during Phase IV but only one-third of the county population.

For the 204 CSAs with a population \(\ge \) 10,000, Fig. 3 graphs confirmed COVID-19 incidence during Phase IV against the proportion of households at risk for multi-generational transmission (MULTI). Data points are proportional in size to CSA population. The population-weighted least squares fit had a slope of 0.300 (95% confidence interval 0.267–0.334).

Fig. 3
figure 3

Bivariate plot of confirmed COVID-19 incidence during phase IV versus the prevalence of multi-generational households (MULTI)

Table 2 below addresses whether the bivariate relationship between case incidence and the prevalence of at-risk multi-generational households observed in Fig. 3 may be attributable instead to other indicators of poverty. The table shows results of population-weighted cross-sectional regressions in two data sets. The first data set, labeled All CSAs, covers all 296 CSAs for which we were able to construct estimates for each of the independent variables from the Census Bureau’s American Community Survey 5-Year (2015–2019) database. The second data set, labeled CSAs \(\ge \) 10,000, covers only those countywide statistical areas with at least 10,000 population, as shown in Fig. 3 above.

Table 2 Cross-section regression analysis of the percentage of the population diagnosed with COVID-19 during phase IV

Table 2 demonstrates that, while the estimated coefficient of MULTI was reduced in comparison with the bivariate model of Fig. 3, the prevalence of multi-generational households remained the dominant factor in determining confirmed COVID-19 cases during the Phase IV surge. Inclusion of all four regressors (not shown) resulted in an insignificant relationship for SNAP (p = 0.757) and a marginally significant relation for OCCUP (p = 0.049). Having offered evidence that MULTI dominates over other indicators of poverty, we focus on this variable as the critical covariate in the estimates of the spatial epidemic model below.

4.2 Rapid Radial Expansion

The series of six maps in Fig. 4 below show the evolution of the cumulative case incidence at the end of weeks 2 through 7, respectively. The lighter shaded CSAs correspond to a cumulative incidence between 120 and 360 per 100,000, while the darker shaded CSAs correspond to a cumulative incidence of at least 360 per 100,000.

Fig. 4
figure 4

Cumulative COVID-19 incidence in Los Angeles County at the end of weeks 3–8

By the end of week 2 (running from March 9–15), a focus of infection exceeding the 120-per-100,000 threshold can be seen in the Beverly Crest community of Los Angeles and the city of West Hollywood. By the end of week 3 (March 16–22), this focus had expanded to include the Brentwood and Belair communities to the west, the Melrose and Hancock Park neighborhoods to the east, and the Crestview community to the south. By the end of week 4 (March 23–29), the focus had further expanded to comprise a cluster of 18 communities, extending to Pacific Palisades to the west, forming a hotspot with cumulative incidence over 360 per 100,000 in the neighborhood of West LA surrounding the Veterans Affairs Medical Center. By the end of week 5 (the conclusion of Phase I), this enlarging cluster added three more communities of high concentration, including Crestview, Hancock Park, and Little Armenia. By weeks 6 and 7 (now in Phase II), the initial focus is no longer distinguishable, and the initial areas of higher concentration had migrated to the south and east.

Based upon Fig. 4 alone, the dominant mechanism underlying the continued rapid rise in confirmed case incidence during Phase I appears to be the local radial expansion around a single focus. To be sure, several other isolated areas with a cumulative incidence over the 120-per-100,000 threshold can be seen in such relatively affluent communities as Marina Peninsula and the cities of Manhattan Beach and Palos Verdes Estates to the south. While this observation points to multiple importations by individuals with resources to travel, these parallel importations do not appear to have been controlling.

The map-based observations are supported by the results of spatial Models 1 and 2 covering Phase I, shown in Table 3 below. We focus first on the base case on the left, where the radius of influence \(r\) was set equal to 1 km. The significant coefficient for the regressor \(W\), which captures the spatial component, supports the radial expansion interpretation. Without any spatial component, the estimated reproductive number came to \(\mathcal{R}\) = 1.72, but with the inclusion of a spatial component, the estimate increased to \(\mathcal{R}\) = 2.25.

Table 3 Spatial model estimates for phase I (February 24–April 5)

In Spatial Model 2, moreover, the multiplicative effect of MULTI was significant. To appreciate the estimate of \(\delta \) = 0.047, consider the effect of a 12-percentage-point increase in the prevalence of at-risk multi-generational households, equivalent to the difference between the population-weighted mean values of MULTI in the top and bottom half of the distribution. Since the proportion \(S\) of susceptible individuals during this early phase of the epidemic is close to 1, the reproductive number would increase by \(\left(0.047 \times 12\right)/b\) = 0.78. Turning to the alternative case on the right where the radius of influence \(r\) was increased to 1.5 km, we see that the estimated parameter \(\gamma \) was significantly decreased. This finding supports the conclusion that the influence of nearby communities on the radial propagation of the virus was highly local.

4.3 Phase II: Epidemic on the Knife Edge

Tables 4, 5 and 6, respectively, display the corresponding parameter estimates for Phases II, III and IV. As in Table 3, each of the tables has two sections. The section on the left shows the base case, while the section on the right shows the effect of varying one of the parameters \(b\), \(r\), or \(f\).

Table 4 Spatial model estimates for phase II (April 6–July 19)
Table 5 Spatial model estimates for phase III (July 20–October 18)
Table 6 Spatial model estimates for phase IV (October 19–January 10)

Table 4 shows the parameter estimates for Phase II. The estimated value of the transmission parameter \({\alpha }_{II}\) in the base case is significantly lower than the corresponding value \({\alpha }_{I}\) obtained in the base case in Phase I, as shown in Table 3. (We use subscripts here to distinguish between the estimates for different phases.) This finding is consistent with role of publicly imposed and voluntary lockdowns in reducing transmission during this phase of the epidemic.

While the estimate for \({\gamma }_{II}\) is still significantly different from zero, the ratio γII/αII = 0.23 is smaller than the corresponding ratio γI/αI = 0.39 in Phase I. That is, radial expansion continued during Phase II, but had a smaller quantitative contribution than during Phase I. On the other hand, the ratio \({\delta }_{II}/\left({\alpha }_{II}+{\gamma }_{II}\right)\) = 0.055 was larger than the corresponding ratio \({\delta }_{I}/\left({\alpha }_{I}+{\gamma }_{I}\right)\) = 0.037. That is, transmission via multi-generational households had a larger contribution during Phase II. These conclusions are also borne out in the right-hand panel, based upon an assumed increase in the average duration of infectivity from 5.5 to 6.5 days.

During Phase II, the results for spatial Models 1 and 2 in the base case show the average reproductive number \(\mathcal{R}\) for the entire county hovering around the endemic value of 1. With an estimate of \(\delta \) on the order of 0.02 in Model 2, even a 5-percentage-point increment or decrement in the variable MULTI would flip the curve of infectives from increasing (with \(\mathcal{R}\) = 1.1) to decreasing (with \(\mathcal{R}\) = 0.9). The resulting heterogeneity of the reproductive number is borne out in Fig. 5, which shows the distribution of \({\mathcal{R}}_{it}\) at week \(t\) = 6 at the start of Phase II.

Fig. 5
figure 5

Distribution of reproductive number \(\mathcal{R}\) among 296 countywide statistical areas during phase II

While the lockdowns reduced \(\mathcal{R}\) in the aggregate from about 2.7 in Phase I to 1.0 in Phase II, Fig. 5 informs us that the velocity of the epidemic in Phase II remained quite variable. As we noted in connection with our discussion of Fig. 1, county-level officials reopened retail stores, indoor dining, hair salons, gyms and bars in May and June during Phase II. Figure 5 suggests that the impact of such reopening was likely to be quite variable.

4.4 Phase III: The Paradox of the Negative Coefficient

Table 5 displays the spatial model results for Phase III. The reduced estimates of the transmission parameter \(\alpha \) in Models 0 and 1 are consistent with an inhibitory effect of the California health officer’s order reversing most of the county’s prior reopening orders of Phase II (Gutierrez 2020). The results of Model 2, however, tell a different story. The spatial influence parameter \(\gamma \) is now insignificant, while the coefficient \(\delta \) of MULTI is now significantly negative. The right-hand panel of Table 5, displaying the alternative case where \(f\) = 0.5, shows that this finding is not an artifact of our assumption concerning the extent of underascertainment of confirmed cases.

The apparent paradox of the negative coefficient is resolved in Fig. 6, which shows the time paths of estimated proportions of infective \({I}_{it}\) and susceptible \({S}_{it}\) in two groups of CSAs: those in the lower half and those in the upper half of the distribution of MULTI. The estimates are derived from the alternative case where \(f\) = 0.5. Those CSAs with a higher proportion of at-risk multi-generational households, rendered as the darker blue curves, consistently had a higher prevalence of active infection. However, after the state’s reversal order at the end of Phase II, the proportion infected declined more rapidly among the high-MULTI CSAs. The inference is that the renewed lockdown had a greater deterrent impact on social mobility in those communities with a higher proportion of multi-generational households.

Fig. 6
figure 6

Time path percent infective (\(\mathrm{I}\)) and percent susceptible (\(\mathrm{S}\)) in CSAs the upper and lower halves of the distribution of MULTI

Figure 7 below relies upon smartphone mobility data to further explore the basis for this conclusion. Here, we’ve broken down CSAs into four quartiles of the distribution of MULTI, with the highest quartile rendered as the darkest curve. For each of the four groups of CSAs, the curve shows the trend in the number of fast-food-restaurant visits by smartphones originating in CSAs within that group. The trends have been normalized so that the mean number of visits during the weeks of February 10 and 17, before the start of our study period, are equal to 100. Truncated at the left end of the graph are the precipitous declines in restaurant visits in all four groups during Phase I. Thus, the time line starts with the week of March 30 (week 5).

Fig. 7
figure 7

Visits to fast food restaurants from smartphones originating in CSAs in the four quartiles of the distribution of MULTI

Figure 7 shows all four quartiles of MULTI reaching the nadir of fast-food-restaurant visitation rates during the week of April 6 (week 6). By that point, CSAs in the lowest quartile had exhibited the largest declines in response to the lockdowns of Phase I. Visitation rates recovered in all four groups during Phase II, as local restrictions were relaxed. Once the state health officer reversed these local reopening orders during the week of July 13 (week 20), visitation rates in the higher quartiles began to turn around, but the lowest quartile did not. By the week of September 14 (week 29 in Phase III), the spread between the highest and lowest quartiles had narrowed to less than 5 percentage points.

4.5 Phase IV: The Reproductive Number \(\mathcal{R}\) Climbs Back Up To 1.4

Table 5 below shows our spatial model estimates for Phase IV. The most salient feature of Phase IV is the marked rise in the reproductive number \(\mathcal{R}\) back up to 1.4, despite the emergency stay-at-home orders issued by the Los Angeles mayor and the California regional health officer during week 40 (Garcetti 2020b; Pan 2020). As in Table 4, we show the alternative case where the underascertainment proportion \(f\) is assumed to be 50%. The insignificant coefficient of the radial expansion parameter \(\gamma \), also seen in Model 2 for Phase III, indicates that within-CSA transmission had become the dominant mode of viral propagation.

While the estimated coefficient \(\delta \) of MULTI is smaller than the estimates for Phases I and II, it remains significantly different from zero. Thus, a 12-percentage-point increase in the prevalence of at-risk multi-generational households, equivalent to the difference between the population-weighted mean values of MULTI in the top and bottom half of the distribution, would increase the transmission parameter by only about 12 \(\gamma \approx \) 0.11. Still, over the course of the 11 weeks of Phase IV, that increment alone would be sufficient to account for the divergence in the prevalence of active infection seen in Fig. 6.

To bring home this point, Fig. 8 plots the relation between the estimated cumulative proportion of infections (in the notation of our spatial model, 1 − \({S}_{it}\) at \(t\) = 45) against the prevalence of at-risk multi-generational households (that is, \({Z}_{i}\)). As in Fig. 6, the estimates are derived from the alternative case where \(f\) = 0.5. The fitted line has a slope of 0.777 (95% confidence interval, 0.678–0.875). With the exception of some outlier CSAs with relatively small populations, the communities with the highest values of MULTI had a predicted cumulative prevalence approaching one-third of the population. Communities with the lowest values of MULTI, by contrast, had a predicted cumulative prevalence under 10 percent. The predictions of the spatial SIR model in Fig. 8 are thus broadly consistent with the dispersion in the cumulative incidence of confirmed cases seen in Fig. 2.

Fig. 8
figure 8

Predicted cumulative incidence of infection (1 − \(\mathrm{S}\)) through week 45 in relation to the proportion of at-risk multi-generation households (MULTI)

While Fig. 8 helps to explain the striking geographic variation in the burden of new SARS-CoV-2 infections during Phase IV, it does not address the underlying causes for the marked increase in the overall reproductive number to 1.4. If the increase in \(\mathcal{R}\) were attributable to lapses in compliance with social distancing policies, we would have expected to see an increase in restaurant visits during the final weeks in Fig. 7 above. Neither has an increase in the amount of time spent outside the home been observed (Harris, 2020f).

5 Discussion

5.1 Summary of Findings

We identified four phases of the epidemic in Los Angeles County during February 24, 2020 through January 10, 2021 (Fig. 1). Phase IV (running for 11 weeks starting October 19) accounted for more than two-thirds of all confirmed COVID-19 cases during the entire 46-week interval under study.

The map of the cumulative incidence of confirmed infections during Phase IV, we found, bore a striking resemblance with the corresponding map of the prevalence of at-risk multi-generational households (Fig. 2). In a multivariate cross-sectional analysis of approximately 300 communities in the county, the prevalence of at-risk multi-generational households (MULTI) was a more important determinant of the cumulative incidence of confirmed infections in Phase IV than other community-specific indicators. These included the proportion of households with a total income below that of a single minimum-wage worker (INC22), the proportion of households with at least one worker in a low-wage occupation that could not be performed at home (OCCUP), and the proportion of households dependent on food stamps (SNAP) (Fig. 3 and Table 2).

We formulated a spatial modification of an SIR epidemic model that tested two distinct multiplicative effects on the transmission rate in each community: the effect of infection rates in adjacent communities (the parameter \(\gamma \)), and the effect of multi-generational household prevalence within the same community (the parameter \(\delta \)). We estimated this model separately on the data for each of the four phases.

Phase I (weeks 0–5) was characterized by substantial adjacent-community effects operating within a narrow radius of 1 km (Table 3). This finding was in accordance with serial weekly maps of the spread of infection during Phase I, which showed initial, rapid radial extension from a focus originating in an affluent area containing such communities as Beverly Crest (Fig. 4). Phase I was also characterized by the significant influence of within-community multi-generational prevalence on transmission rates. As the epidemic expanded, hotspots began to develop in areas with higher concentrations of multi-generational households. The estimated overall reproductive number \(\mathcal{R}\) during this phase was on the order of 2.7 (Table 3).

The flattening of the epidemic curve in Phase II (weeks 6–20) reflected the effects of voluntary and coerced social distancing, particularly the state-of-emergency orders issued during Phase I. Still, both adjacent-community effects and the within-community impact of high proportions of multi-generational households remained significant (Table 4). While the global, countywide reproductive number \(\mathcal{R}\) hovered around the endemic level of 1.0, local reproductive numbers varied widely from 0.6 to 1.5, depending on the prevalence of multi-generational households in the community (Fig. 5).

As the epidemic curve began to rise toward the end of Phase II, the state ordered the reversal of measures taken at the county level to easy restrictions on retail stores, indoor dining, hair salons, gyms and bars (Fig. 1). As confirmed case incidence declined during Phase III (weeks 21–34) in response to this public policy intervention (Fig. 1), the countywide reproductive number \(\mathcal{R}\) dropped to about 0.6 and adjacent-community effects were no longer consistently detectable (Table 5). More striking, however, was the finding that a high prevalence of multi-generational households was significantly associated with decreased—rather than increased—viral transmission (Table 5). Phase III, it turned out, saw a narrowing in the gap in infection rates between communities with high and low percentages of multi-generational households (Fig. 6). This interpretation was supported by smartphone tracking data showing that the gap in restaurant visitation rates during this period had also narrowed (Fig. 7).

During Phase IV (weeks 35–45), infection rates surged, with the global reproductive number \(\mathcal{R}\) reverting to 1.4 (Table 5). Spillover effects between communities were no longer detectable. The model parameter (\(\delta \)) relating multi-generational household prevalence to the transmission rate, while smaller than at the start of the epidemic, was still sufficiently large to generate a wide dispersion in cumulative infection rates. By the end of our study period (the week of January 4, 2021), estimated cumulative infection rates varied from under 10% in communities with low multi-generational prevalence to greater than 30% in communities with a high percentage of at-risk multi-generational households (Fig. 8).

These findings, taken together, supported a critical role of household structure in the initial dissemination and continued wide propagation of SARS-CoV-2 infection in Los Angeles County.

5.2 Strengths and Limitations of This Study

Our study takes advantage of the cohort structure of our database, in which we follow a group of related geographic units longitudinally over time. This structure allowed us to test a model of radial geographic expansion during Phases I and II of the Los Angeles County COVID-19 epidemic, and to elucidate the substantial heterogeneity of transmission patterns within the county over time. While the global reproductive number \(\mathcal{R}\) hovered around the endemic level of 1 during Phase II, our approach permitted us to discern local reproductive numbers ranging from 0.6 to 1.5. While the overall confirmed case incidence rate rose by about tenfold during Phase IV, we were able to identify wide community-specific dispersion in cumulative disease rates.

On the other hand, our study is exclusively population-based. We do not follow a longitudinal cohort of individual households to see how many young adult members went to a restaurant or a gym, got infected, and then brought their infections home to older household members. A population-based indicator such as the proportion of households at risk for multigenerational transmission (MULTI) could thus be criticized as no more than a proxy for some other correlated characteristic of the community.

To be sure, we confirmed the quantitative importance of multi-generational household prevalence in a cross-sectional regression analysis that included other measures of poverty (INC22, SNAP) as well as the proportion of households with high-risk workers (OCCUP) (Table 2). Still, one might posit that the critical underlying variable is the proportion of Spanish-speaking households with uninsured members (Vijayan et al. 2020; Weng et al. 2020). One might similarly contend that our variable MULTI, which relied on the presence of at least one younger adult (aged 18–34) and another older adult (aged 45 or more) in the household, was no better an indicator of multi-generational transmission risk than, say, the number of persons per bathroom in the household. Data on the prevalence of such risk factors as smoking, elevated body mass index and comorbidities such as diabetes have been considered (Horn et al. 2020), but these cofactors are more relevant to a study of disease morbidity and mortality.

Our principal endpoint was the incidence of confirmed cases of COVID-19. It is now widely acknowledged that confirmed case counts significantly understate the actual numbers of SARS-CoV-2 infections (Havers et al. 2020).While we adjusted our spatial SIR model to account for an estimated 40–50% underascertainment, there is some evidence that underascertainment rates are much higher (Sood et al. 2020; Wu et al. 2020). While we estimated that cumulative infection rates ranged from 10% to upwards of 30% across communities, another unpublished model suggested that one in three residents of Los Angeles had already been infected (County DHS COVID-19 Predictive Modeling Team 2021).

One alternative endpoint would be seroprevalence, but serial population-based studies of seroprevalence are still uncommon (Hallal et al. 2020), and there is evidence that population seroprevalence may decline with time (Buss et al. 2020). Hospital admission rates have been studied as an alternative to confirmed case incidence (Harris, 2020b, 2020g), but such an endpoint would also depend on case severity. The test positivity rate—the proportion of positive tests among all persons tested—has been employed as an endpoint in cross-sectional studies (Cotti et al. 2020; Vijayan et al. 2020). Adaptation of this endpoint to a dynamic epidemic model is problematic, however, as the number of individuals tested is endogenous and must be modeled as well (Bhaduri et al. 2020).

To explain the paradoxically negative value of the parameter \(\delta \) during Phase III, we relied on data on smartphone visits to fast-food restaurants as an indicator of social mobility (Fig. 7). While data on smartphone visits to restaurants and bars have been repeatedly used as a measures of potential coronavirus exposure (Harris 2020b, e), there has been no independent verification of their accuracy.

Our specification of a spatial modification of the conventional SIR model adheres to the modeling philosophy that one should introduce the minimum necessary modifications of the most parsimonious model (Harris 2020b). One might contend that the appropriate base model would instead be the SEIR (susceptible-exposed-infective-resistant) version, which has been widely employed in studies of SARS-CoV-2 transmission (Godio et al. 2020; Radulescu et al. 2020; Li et al. 2021). It is hardly clear, however, that the problem of coming up with the additional between-state transition parameters in the SEIR model is any more tractable than our problem of devising an inflation factor to account for unascertained cases in Sect. 2.5 above.

5.3 Characterizing the Initial Outbreak

Our geospatial mapping study (Fig. 4), in combination with our estimates of a spatial SIR model for Phase I (Table 3), permitted us to characterize the initial outbreak of COVID-19 in Los Angeles County. The earliest days of the outbreak saw multiple parallel importations in several relatively affluent areas of the county where residents had the resources to travel. A phylogenetic analysis of SARS-CoV-2 samples drawn during March 22–April 15 at a major hospital located within one of the initial foci of infection found that the larger proportion belonged to clades derived from Europe (Zhang et al. 2020). The epidemic then spread by radial expansion over a period of weeks from this focus of infection to nearby communities with a higher prevalence of multi-generational households. This pattern of spread by radial extension stands in sharp contrast to the earliest days of the outbreak in New York City, where community-transmitted infections were dispersed throughout all five boroughs in a matter of days (Gonzalez-Reiche et al. 2020; Harris 2020c). Our estimate of a reproductive number \(\mathcal{R}\) equal to 2.7 (Table 2, base case, Model 2) is consistent with the estimate of \({\mathcal{R}}_{0}\) in the range of 2.43–3.10 for the Italy (D'Arienzo and Coniglio 2020), but falls below the estimates of 3.47 (range, 3.16–3.78) for New York City (Harris 2020d) and 3.54 (range, 3.40–3.67) for Wuhan (Hao et al. 2020), both of which had massive subway systems.

5.4 Heterogeneous Responses to Public Policies

Our analysis of a longitudinal panel of diverse communities within Los Angeles County helps us understand how responses to epidemic-control policies can be so heterogeneous. After the state-of-emergency orders issued in March, the county entered into Phase II during the week of April 6 with an overall endemic-level reproductive number \(\mathcal{R}\) of 1.0. Yet Fig. 5 informs us that the local reproductive numbers varied widely from 0.6 to 1.5.

In response to the flattening of the global epidemic curve in Phase II, county policy makers reopened retail stores, indoor dining, hair salons, gyms and bars in May and June (Money 2020; Parvini 2020; Shalby 2020a). When these actions overshot the mark, the Los Angeles County health officer issued an order closing indoor onsite dining (Los Angeles County Department of Public Health 2020). On July 13, the state public health officer closed indoor operations in bars not concurrently serving meals, as well as gyms in counties on its monitoring list, to which Los Angeles County already belonged (Angell 2020).

All of these public policy decisions applied to the entirety of Los Angeles County. Yet the evidence is that the responses to these policies varied widely. Figure 7 shows that fast-food restaurant visits by residents of those communities in the top quartile of multi-generational household prevalence, which had the highest reproductive numbers, had a very different response than visits by residents of communities in the lowest quartile, which had the lowest reproductive numbers. While this finding alone does not establish that all indicators of social mobility responded in the same manner, it highlights the bluntness of policy instruments that were to be applied uniformly to a county of 10 million inhabitants.

5.5 What Caused the Phase IV Surge?

While Figs. 3 and 8 and Table 2 establish the substantial contribution of multi-generational households to the surge in confirmed case incidence observed in Phase IV, they do not tell us why the surge occurred in the first place. In terms of our spatial SIR model, they do not explain the striking rebound in the transmission parameter \(\alpha \) that is evident in the data.

Figure 7 tells us that fast-food-restaurant visits declined overall during Phase IV. Other smartphone-based indicators of social mobility showed little change (Harris 2020f) despite the emergency stay-at-home orders issued by the Los Angeles mayor and the California regional health officer during week 40 (Garcetti 2020b; Pan 2020). Even visits to gyms, compiled in an earlier draft of this article (Harris 2020h), continued to decline.

Two plausible explanations come to mind. First, smartphone-based indices of social mobility have not captured large family gatherings that occurred during the succession of winter holidays that began with Thanksgiving. The publicly available smartphone data show only how frequently individual device holders moved and where they went, but not how many were congregated in the same place. A subsequent decline in the frequency of such high-density gatherings may help to explain the drop in confirmed case incidence that has so far been observed in January 2021, outside the observation interval of the present study. Second, a new strain of SARS-CoV-2 could have emerged in Southern California (Zhang et al. 2021). The difficulty with the latter explanation is that it does not readily explain the subsequent post-Phase IV decline in incidence that now appears to be under way.

5.6 Implications

Despite an array of aggressive public policies aimed at reducing social mobility, our findings suggest that intra-household transmission has been a critical vehicle for the persistence of the COVID-19 epidemic in Los Angeles County. The prevalence of at-risk households in a community, it appears, is not simply a predictor of the persistence of coronavirus transmission, but also a multiplier of the effects of other policies aimed at social distancing. The impact of preventing one case of asymptomatic infection in a socially active young adult, who would otherwise have brought his or her infection into the household, will depend directly on the number of susceptible household members who have been spared.

Our results cast a pessimistic shadow on so-called targeted policies that selectively relax restrictions on lower-risk, younger persons while seeking to protect more vulnerable older persons (Acemoglu et al. 2020; Chikina and Pegden 2020; Gollier 2020; Iverson et al. 2020). Such a policy might be feasible in settings where older persons are sequestered in retirement communities or assisted living facilities, but the data here show that this is not the reality of Los Angeles County.

Most importantly, our findings require us to view the household rather than the individual as the foremost target of healthcare policy. The message “protect yourself” (protégete in Spanish) needs to be reconfigured as “protect your family” (protege a tu familia). When a healthcare provider encounters a new patient with suspected or established COVID-19, the interview needs to turn quickly to questions about other household members, their health status, and their symptoms. The widely recognized model of the patient-centered medical home (Alexander and Bae 2012) needs to be replaced by the family- and household-centered medical home.