## Abstract

In human geography and the urban social sciences, the segregation literature typically engages with five conceptual dimensions along which a given society may be considered segregated: evenness, isolation, clustering, concentration and centralization (all of which can incorporate or omit spatial context). Over the last several decades, dozens of segregation indices have been proposed and studied in the literature, each of which is designed to focus on the nuances of a particular dimension, or correct an oversight in earlier work. Despite their increasing proliferation, however, few of these indices remain used in practice beyond their original conception, due in part to complex formulae and data requirements, particularly for indices that incorporate spatial context. Furthermore, existing segregation software typically fails to provide inferential frameworks for either single-value or comparative hypothesis testing. To fill this gap, we develop an open-source Python package designed as a submodule for the Python Spatial Analysis Library, PySAL. This new module tackles the problem of segregation point estimation for a wide variety of spatial and aspatial segregation indices, while providing a computationally based hypothesis testing framework that relies on simulations under the null hypothesis. We illustrate the use of this new library using tract-level census data in two American cities.

## Notes

More recently, Ref. [12] addressed this problem assuming a nonparametric binomial mixture of the frequencies.

In the original paper, they consider 43 different indices, due to three versions of the Atkinson index. However, these indices only differ in terms of the value of the parameter *b*; therefore, we consider this index only once.

Most notably, shapefiles are limited to ten-character column names, and they are difficult to transport across computing environments because the specification is actually a minimum of four files, not a single file as the name would suggest.

One of the most prominent is the set of index issues presented in Ref. [49] and discussed at the bottom of page 6 of Ref. [47]. During the construction of the present module, the same problems were identified, and the default approach for these indices in this Python package follows the latter study.

In terms of software, so far, we are unaware of any that performs inference for comparison between them.

This last case is unusual, but our framework permits any of these combinations, as presented in Sect. ??.

Available at https://github.com/pysal/segregation.

More recently, some other measures were added to SM, but we conducted the current work with the original 25.

In addition, the module has a function/class named Compute_All_Segregation that performs point estimation of several segregation measures at once.

It is worth mentioning that using a geopandas GeoDataFrame for the non-spatial indices is also valid, since it “behaves” as a usual pandas dataframe.

Assuming that \(n_{ij}\) is the population of unit *i* of group *j*, this approach assumes that the distribution of people from each group *j* is a multinomial distribution with probabilities given by \(\frac{\sum _{j}n_{ij}}{\sum _{i}\sum _{j}n_{ij}}=\frac{n_{i.}}{n_{..}}\).

We are aware that for some measures, some approaches would not be appropriate, but we chose to allow these combinations so that our framework remains as generic as possible. For example, the Modified Dissimilarity (Dct) and Gini (Gct) indices rely precisely on the distance from evenness obtained through sampling; therefore, the "evenness" value for null_approach would not be the most appropriate for these indices.

We thank a reviewer for drawing attention to this point in the manuscript.

There is also a statistic attribute to access the original point estimation of the measure.

Note that in this case, each measure has to be the same SM class as it would not make much sense to compare, for example, a Gini Index with a Delta (DEL) Index.

We use the word composition to refer to the relative frequency of the group of interest in each unit. For example, if a unit has a total population of 50 and 5 people belonging to group *A*, the group *A* composition of this unit is 10%.

The details of the construction of these counterfactual values are presented in Appendix B.

We also noticed that for most of the indices, especially the spatial ones, SM was much faster to estimate than the implementation of Ref. [47].

We used the total population of 100,000 and generated a random composition for each unit given from a Uniform distribution between 0 and 1.

The indices were fitted using the default input values. Although this can be a source of differences in the values, we highlight that these defaults are roughly comparable, since all indices that rely on simulations (Dct, Gct, and Dbc) use the same 500 iterations, and indices that rely on integration (*R* and SPP) use the same 1000 thresholds for the integral approximation. The index Ddc has an optimization tolerance of \(10^{-5}\).

The values marked with * are virtually the same, although OasisR has a misspecification in \(d_{ii}\) that does not follow [24]. This difference can be checked in https://github.com/cran/OasisR/pull/1/commits/cc3681dae96188663230cf140d0cf41fd90e45cd.

Composed of five counties: New York County, Bronx County, Kings County, Queens County and Richmond County.

Both regions are similar in terms of number of spatial units, as Los Angeles County has 2346 census tracts in 2010 and New York City has 2168.

Once again, all simulations were run using the default values of the input parameters and 500 iterations in parallel with 6 cores in a Jupyter Notebook [22] using an Intel (R) Core (TM) i7-8750H CPU with 2.21 GHz and 16 GB of RAM. Running all application results presented here took approximately 34.7 h.

This approach does not apply to measures that do not take spatial context into consideration since each value for the simulations would be the same along the permutations.

\({H_0:}\ \mathrm{Los}\ \mathrm{Angeles}\ \mathrm{segregation}_{2010} - \mathrm{Los}\ \mathrm{Angeles}\ \mathrm{segregation}_{2000} = 0\).

With the caveat that Exposure is inversely proportional to segregation and, thus, is located in the right tail of the distribution under the null hypothesis.

The *p* value of ACO was \(\approx \) 0.74 and of RCO was \(\approx \) 0.816.

\({H_0:} \mathrm{Los}\ \mathrm{Angeles}\ \mathrm{segregation} - \mathrm{New}\ \mathrm{York}\ \mathrm{segregation} = 0\).

For xP*y* and DDxP*y*, the values were lower, but the interpretation is the same.

However, an unexpected result arose from the fact that, for the Ddc Index, Los Angeles was significantly more segregated.

This table does not necessarily reflect the original/pioneer paper of each measure, but rather the related literature of the formulas presented in this Appendix.

We considered including the mixture of betas approach of Ref. [35] for the *D*, *G* and *H* indices, as the author kindly shared the original code. However, due to convergence problems, we chose not to include it in the current version of SM.

## References

1. Allen, J. P., & Turner, E. (2012). Black-White and Hispanic-White segregation in US counties. *The Professional Geographer*, *64*(4), 503–520.
2. Allen, R., Burgess, S., Davidson, R., & Windmeijer, F. (2015). More reliable inference for the dissimilarity index of segregation. *The Econometrics Journal*, *18*(1), 40–66.
3. Apparicio, P., Martori, J. C., Pearson, A. L., Fournier, É., & Apparicio, D. (2014). An open-source software for calculating indices of urban residential segregation. *Social Science Computer Review*, *32*(1), 117–128.
4. Boisso, D., Hayes, K., Hirschberg, J., & Silber, J. (1994). Occupational segregation in the multidimensional case: Decomposition and tests of significance. *Journal of Econometrics*, *61*(1), 161–171.
5. Brown, L. A., & Chung, S. Y. (2006). Spatial segregation, segregation indices and the geographical perspective. *Population, Space and Place*, *12*(2), 125–143.
6. Carrillo, P. E., & Rothbaum, J. L. (2016). Counterfactual spatial distributions. *Journal of Regional Science*, *56*(5), 868–894.
7. Carrington, W. J., & Troske, K. R. (1997). On measuring segregation in samples with small units. *Journal of Business & Economic Statistics*, *15*(4), 402–409.
8. Carrington, W. J., & Troske, K. R. (1998). Interfirm segregation and the Black/White wage gap. *Journal of Labor Economics*, *16*(2), 231–260.
9. Clark, W. A., & Östh, J. (2018). Measuring isolation across space and over time with new tools: Evidence from Californian metropolitan regions. *Environment and Planning B: Urban Analytics and City Science*, *45*(6), 1038–1054.
10. Cowgill, D. O., & Cowgill, M. S. (1951). An index of segregation based on block statistics. *American Sociological Review*, *16*(6), 825–831.
11. Devroye, L. (1986). Sample-based non-uniform random variate generation. In *Proceedings of the 18th conference on winter simulation* (pp. 260–265). ACM.
12. d’Haultfoeuille, X., & Rathelot, R. (2017). Measuring segregation on small units: A partial identification analysis. *Quantitative Economics*, *8*(1), 39–73.
13. Duncan, O. D., & Duncan, B. (1955). A methodological analysis of segregation indexes. *American Sociological Review*, *20*(2), 210–217.
14. Hellerstein, J. K., & Neumark, D. (2008). Workplace segregation in the United States: Race, ethnicity, and skill. *The Review of Economics and Statistics*, *90*(3), 459–477.
15. Hong, S. Y., O’Sullivan, D., & Sadahiro, Y. (2014). Implementing spatial segregation measures in R. *PloS One*, *9*(11), e113767.
16. Hong, S. Y., & Sadahiro, Y. (2014). Measuring geographic segregation: A graph-based approach. *Journal of Geographical Systems*, *16*(2), 211–231.
17. Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. *Computing In Science & Engineering*, *9*(3), 90–95. https://doi.org/10.1109/MCSE.2007.55.
18. James, D. R., & Taeuber, K. E. (1985). Measures of segregation. *Sociological Methodology*, *15*, 1–32.
19. Johnston, R., Poulsen, M., & Forrest, J. (2007). Ethnic and racial segregation in US metropolitan areas, 1980–2000: The dimensions of segregation revisited. *Urban Affairs Review*, *42*(4), 479–504.
20. Jones, K., Johnston, R., Manley, D., Owen, D., & Charlton, C. (2015). Ethnic residential segregation: A multilevel, multigroup, multiscale approach exemplified by London in 2011. *Demography*, *52*(6), 1995–2019.
21. Jordahl, K. (2014). Geopandas: Python tools for geographic data. https://github.com/geopandas/geopandas. Accessed 3 Apr 2019.
22. Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B. E., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J. B., Grout, J., & Corlay, S., et al. (2016). Jupyter notebooks-a publishing format for reproducible computational workflows. In *ELPUB* (pp. 87–90). https://jupyter.org/. Accessed 3 Apr 2019.
23. Lee, D., Minton, J., & Pryce, G. (2015). Bayesian inference for the dissimilarity index in the presence of spatial autocorrelation. *Spatial Statistics*, *11*, 81–95.
24. Massey, D. S., & Denton, N. A. (1988). The dimensions of residential segregation. *Social Forces*, *67*(2), 281–315.
25. Massey, D. S., & Denton, N. A. (1989). Hypersegregation in US metropolitan areas: Black and Hispanic segregation along five dimensions. *Demography*, *26*(3), 373–391.
26. Massey, D. S., & Denton, N. A. (1993). *American apartheid: Segregation and the making of the underclass*. Cambridge: Harvard University Press.
27. Massey, D. S., & Tannen, J. (2015). A research note on trends in black hypersegregation. *Demography*, *52*(3), 1025–1034.
28. McKinney, W. (2010). Data structures for statistical computing in Python. In S. van der Walt & J. Millman (Eds.), *Proceedings of the 9th Python in Science Conference* (pp. 51–56).
29. Morgan, B. S. (1983). A distance-decay based interaction index to measure residential segregation. *Area*, *15*(3), 211–217.
30. Morrill, R. L. (1991). On the measure of geographic segregation. *Geography Research Forum*, *11*, 25–36.
31. Napierala, J., & Denton, N. (2017). Measuring residential segregation with the ACS: How the margin of error affects the dissimilarity index. *Demography*, *54*(1), 285–309.
32. Park, R. E. (1926). The urban community as a spatial pattern and a moral order. In *Urban social segregation* (pp. 21–31).
33. R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-07-0. http://www.R-project.org. Accessed 3 Apr 2019.
34. Ransom, M. R. (2000). Sampling distributions of segregation indexes. *Sociological Methods & Research*, *28*(4), 454–475.
35. Rathelot, R. (2012). Measuring segregation when units are small: A parametric approach. *Journal of Business & Economic Statistics*, *30*(4), 546–553.
36. Reardon, S. F., & Townsend, J. B. (1999). SEG: Stata module to compute multiple-group diversity and segregation indices. Statistical Software Components, Boston College Department of Economics. https://ideas.repec.org/c/boc/bocode/s375001.html. Accessed 3 Apr 2019.
37. Reardon, S. F., & Firebaugh, G. (2002). Measures of multigroup segregation. *Sociological Methodology*, *32*(1), 33–67.
38. Reardon, S. F., & O’Sullivan, D. (2004). Measures of spatial segregation. *Sociological Methodology*, *34*(1), 121–162.
39. Rey, S. J. (2004). Spatial analysis of regional income inequality. *Spatially Integrated Social Science*, *1*, 280–299.
40. Rey, S. J., & Anselin, L. (2010). PySAL: A Python library of spatial analytical methods. In *Handbook of applied spatial analysis* (pp. 175–193). Springer.
41. Rey, S. J., & Sastré-Gutiérrez, M. L. (2010). Interregional inequality dynamics in Mexico. *Spatial Economic Analysis*, *5*(3), 277–298.
42. Roberto, E. (2018). The spatial proximity and connectivity method for measuring and analyzing residential segregation. *Sociological Methodology*, *48*(1), 182–224.
43. Rossum, G. (1995). Python reference manual. Technical report. The Netherlands: Amsterdam.
44. Royuela, V., & Vargas, M., et al. (2010). Residential segregation: A literature review. Technical report.
45. Shapley, L. S. (1953). A value for n-person games. *Contributions to the Theory of Games*, *2*(28), 307–317.
46. Söderström, M., & Uusitalo, R. (2010). School choice and segregation: Evidence from an admission reform. *Scandinavian Journal of Economics*, *112*(1), 55–76.
47. Tivadar, M. (2019). OasisR: An R package to bring some order to the world of segregation measurement. *Journal of Statistical Software*, *89*(1), 1–39.
48. Waskom, M., Botvinnik, O., O’Kane, D., Hobson, P., Lukauskas, S., Gemperline, D. C., Augspurger, T., Halchenko, Y., Cole, J. B., Warmenhoven, J., de Ruiter, J., Pye, C., Hoyer, S., Vanderplas, J., Villalba, S., Kunter, G., Quintero, E., Bachant, P., Martin, M., Meyer, K., Miles, A., Ram, Y., Yarkoni, T., Williams, M. L., Evans, C., Fitzgerald, C., Brian, Fonnesbeck, C., Lee, A., & Qalieh, A. (2017). mwaskom/seaborn: v0.8.1 (September 2017). https://doi.org/10.5281/zenodo.883859.
49. Wong, D. W. (1993). Spatial indices of segregation. *Urban Studies*, *30*(3), 559–572.
50. Wong, D. W. (2003). Implementing spatial segregation measures in GIS. *Computers, Environment and Urban Systems*, *27*(1), 53–70.

## Acknowledgements

We are grateful for the support of National Science Foundation (NSF) (Award 1831615) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) foundation (Process 88881.170553/2018-01).

## Appendices

### A: Point estimation details

Here, we present and explain each formula for the segregation measures presented in Table 1 of Section 2.1. The literature used for each measure can be found in Table 4, together with the respective dimension.

For consistency of notation, we assume that \(n_{ij}\) is the population of unit \(i \in \{1,\ldots , I\}\) of group \(j \in \{x, y\}\), and that \(\sum _{j}n_{ij} = n_{i.}\), \(\sum _{i}n_{ij} = n_{.j}\), \(\sum _{i}\sum _{j}n_{ij} = n_{..}\), \({\tilde{s}}_{ij} = \frac{n_{ij}}{n_{i.}}\), \({\hat{s}}_{ij} = \frac{n_{ij}}{n_{.j}}\). The segregation indices can be built for any group *j* of the data.

The Dissimilarity Index (D) is given by:
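A minimal sketch, assuming the standard two-group form of this index in the notation above, \(D = \sum _{i} \frac{n_{i.}\,|{\tilde{s}}_{ij} - P|}{2 n_{..} P (1-P)}\) with \(P = n_{.j}/n_{..}\). The function name and the toy counts are illustrative and are not part of the SM API.

```python
import numpy as np

def dissimilarity(group, total):
    """Two-group dissimilarity index from per-unit counts.

    group: population of the group of interest in each unit (n_ij)
    total: total population of each unit (n_i.)
    """
    group = np.asarray(group, dtype=float)
    total = np.asarray(total, dtype=float)
    p = group.sum() / total.sum()   # global share P = n_.j / n_..
    s = group / total               # unit composition s~_ij
    return float((total * np.abs(s - p)).sum() / (2 * total.sum() * p * (1 - p)))

# Complete separation of the two groups versus identical compositions.
d_separated = dissimilarity([10, 0], [10, 10])
d_even = dissimilarity([5, 5], [10, 10])
```

Under this form, complete separation yields the maximum value and identical unit compositions yield zero.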

The spatial D (SD) is given by:

where \({\tilde{s}}_{ij}^{i_1}\) and \({\tilde{s}}_{ij}^{i_2}\) are the proportions of the minority population in the units \(i_1\) and \(i_2\), respectively and where \(c_{i_1i_2}\) denotes an element at \((i_1,i_2)\) in a matrix C, which becomes one only if \(i_1\) and \(i_2\) are considered neighbors.
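As a hedged illustration, the sketch below assumes the adjacency-adjusted form usually attributed to Ref. [30], \(SD = D - \frac{\sum _{i_1}\sum _{i_2} |{\tilde{s}}_{ij}^{i_1} - {\tilde{s}}_{ij}^{i_2}|\, c_{i_1i_2}}{\sum _{i_1}\sum _{i_2} c_{i_1i_2}}\). The function name and the three-unit example are hypothetical.

```python
import numpy as np

def spatial_dissimilarity(group, total, c):
    """Dissimilarity index minus an adjacency-weighted composition contrast.

    c is a binary contiguity matrix with c[i1, i2] = 1 for neighbors.
    """
    group = np.asarray(group, dtype=float)
    total = np.asarray(total, dtype=float)
    c = np.asarray(c, dtype=float)
    p = group.sum() / total.sum()
    s = group / total
    d = (total * np.abs(s - p)).sum() / (2 * total.sum() * p * (1 - p))
    # Absolute composition difference for every pair of units.
    diff = np.abs(s[:, None] - s[None, :])
    return float(d - (c * diff).sum() / c.sum())

# Three units in a row: unit 0 touches unit 1, and unit 1 touches unit 2.
c = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
sd = spatial_dissimilarity([9, 5, 1], [10, 10, 10], c)
```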

The boundary spatial D (BSD) is given by:

where

where \({\tilde{s}}_{ij}^{i_1}\) and \({\tilde{s}}_{ij}^{i_2}\) are the proportions of the minority population in the units \(i_1\) and \(i_2\), respectively, and \(cb_{i_1i_2}\) is the length of the common boundary of areal units \(i_1\) and \(i_2\).

The perimeter/area ratio spatial *D* (PARD) is a Spatial Dissimilarity Index that takes into consideration the perimeter and the area of each unit by adding a specific multiplicative term in the second term of BSD (the spatial effect):

where \(P_i\) and \(A_i\) are the perimeter and area of unit *i*, respectively and \(\mathrm{MAX}(P{/}A)\) is the maximum perimeter–area ratio or the minimum compactness of an areal unit found in the study region.

The Gini coefficient (*G*) is given by:
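A minimal sketch, assuming the standard form from Ref. [24] in the notation above, \(G = \frac{\sum _{i_1}\sum _{i_2} n_{i_1.} n_{i_2.} |{\tilde{s}}_{i_1j} - {\tilde{s}}_{i_2j}|}{2 n_{..}^2 P (1-P)}\) with \(P = n_{.j}/n_{..}\). The function name is illustrative.

```python
import numpy as np

def gini_segregation(group, total):
    """Gini segregation index from per-unit group and total counts."""
    group = np.asarray(group, dtype=float)
    total = np.asarray(total, dtype=float)
    p = group.sum() / total.sum()
    s = group / total
    # Pairwise population weights and pairwise composition differences.
    weights = total[:, None] * total[None, :]
    diffs = np.abs(s[:, None] - s[None, :])
    return float((weights * diffs).sum() / (2 * total.sum() ** 2 * p * (1 - p)))

g_separated = gini_segregation([10, 0], [10, 10])
g_even = gini_segregation([5, 5], [10, 10])
```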

The global entropy (*E*) is given by:

while the unit’s entropy is analogously:

Therefore, the Entropy Index (*H*) is given by:
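A minimal sketch, assuming the standard two-group forms \(E = P\ln (1/P) + (1-P)\ln (1/(1-P))\), the same expression with \({\tilde{s}}_{ij}\) in place of \(P\) for the unit entropy \(E_i\), and \(H = \sum _{i} \frac{n_{i.}(E - E_i)}{E\, n_{..}}\). The convention \(0 \ln 0 = 0\) and the function names are assumptions of this example.

```python
import numpy as np

def _entropy(p):
    """Binary entropy with the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    q = 1 - p
    with np.errstate(divide="ignore", invalid="ignore"):
        left = np.where(p > 0, -p * np.log(p), 0.0)
        right = np.where(q > 0, -q * np.log(q), 0.0)
    return left + right

def entropy_index(group, total):
    """Population-weighted gap between global and unit entropy."""
    group = np.asarray(group, dtype=float)
    total = np.asarray(total, dtype=float)
    e_global = _entropy(group.sum() / total.sum())
    e_units = _entropy(group / total)
    return float((total * (e_global - e_units)).sum() / (e_global * total.sum()))

h_separated = entropy_index([10, 0], [10, 10])
h_even = entropy_index([5, 5], [10, 10])
```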

The Atkinson Index (*A*) is given by:

where *b* is a shape parameter that determines how to weight the increments to segregation contributed by different portions of the Lorenz curve.

The Concentration Profile (*R*) measure is discussed in Ref. [16] and tries to inspect the evenness aspect of segregation. The threshold proportion *t* is given by:

In the equation, *g*(*t*, *i*) is a logical function that is defined as:

The Concentration Profile (*R*) is given by:

The SPP is similar to the Concentration Profile, but with the addition of the spatial component in the connecting function:

where *k* refers to the sum of *g*(*t*, *i*) for a given *t* and \(\delta _{i_1i_2}\) is the distance between \(i_1\) and \(i_2\). One way of determining \(\delta _{i_1i_2}\) would be to use a spatial structure matrix, *W*. The matrix *W* contains ones if \(i_1\) and \(i_2\) are contiguous and zeros otherwise. The distance \(\delta _{i_1i_2}\) between \(i_1\) and \(i_2\) is the neighbor order needed to reach \(i_2\) from \(i_1\). For example, two census tracts, \(x_1\) and \(x_2\), that do not have a common boundary but are both adjacent to the same unit, \(x_3\), are second-order neighbors, so \(\delta _{12}\) becomes 2. Like the Concentration Profile, if the number of thresholds used is large enough, a smooth curve, or an SPP, can be constructed by plotting and connecting \(\eta _t\).
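The neighbor-order distance described above can be sketched as a breadth-first search over the contiguity structure. The adjacency dictionary and the function name are illustrative assumptions, and the example reproduces the three-tract case from the text.

```python
from collections import deque

def neighbor_order(adjacency, source):
    """Return the contiguity order from `source` to every reachable unit."""
    order = {source: 0}
    queue = deque([source])
    while queue:
        unit = queue.popleft()
        for neighbor in adjacency[unit]:
            if neighbor not in order:
                # First visit is along a shortest path in an unweighted graph.
                order[neighbor] = order[unit] + 1
                queue.append(neighbor)
    return order

# x1 and x2 do not touch each other, but both touch x3.
adjacency = {"x1": ["x3"], "x2": ["x3"], "x3": ["x1", "x2"]}
delta_from_x1 = neighbor_order(adjacency, "x1")
```

Here `delta_from_x1["x2"]` is 2, matching the second-order neighbor example.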

Isolation (xP*x*) assesses how much a minority group is exposed only to the same group. In other words, how much its members interact only with members of the group to which they belong. Assuming \(j = x\) as the minority group, the isolation of *x* is given by:

The Exposure (xP*y*) of *x* is given by:

The correlation ratio (*V* or \(\mathrm{Eta}^2\)) is given by
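A minimal sketch of these three measures, assuming the standard forms from Ref. [24]: \(xPx = \sum _{i} {\hat{s}}_{ix}\,{\tilde{s}}_{ix}\), \(xPy = \sum _{i} {\hat{s}}_{ix}\,{\tilde{s}}_{iy}\) and \(V = \frac{xPx - P}{1 - P}\) with \(P = n_{.x}/n_{..}\). The function name and the toy counts are illustrative.

```python
import numpy as np

def exposure_measures(x, y):
    """Return (isolation, exposure, correlation ratio) for two groups."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    total = x + y
    share_of_group = x / x.sum()          # s^_ix = n_ix / n_.x
    isolation = float((share_of_group * x / total).sum())
    exposure = float((share_of_group * y / total).sum())
    p = x.sum() / total.sum()
    correlation_ratio = (isolation - p) / (1 - p)
    return isolation, exposure, correlation_ratio

xpx, xpy, v = exposure_measures([8, 2], [2, 8])
```

With only two groups, isolation and exposure sum to one under these forms.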

The SP Index is given by:

where

\(d_{i_1i_2}\) is a pairwise distance measure between areas \(i_1\) and \(i_2\), and \(d_{ii}\) is estimated as \(d_{ii} = (\alpha a_i)^{\beta }\), where \(a_i\) is the area of unit *i*. The defaults are \(\alpha = 0.6\) and \(\beta = 0.5\). For the distance measure, we first extract the centroid of each unit and then calculate the Euclidean distance.
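The distance construction just described can be sketched with plain NumPy arrays. The function name and the two-unit example are hypothetical; only the defaults \(\alpha = 0.6\) and \(\beta = 0.5\) come from the text.

```python
import numpy as np

def distance_matrix(centroids, areas, alpha=0.6, beta=0.5):
    """Euclidean centroid distances with d_ii = (alpha * a_i) ** beta."""
    centroids = np.asarray(centroids, dtype=float)
    areas = np.asarray(areas, dtype=float)
    diff = centroids[:, None, :] - centroids[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    # Replace the zero self-distances with the area-based estimate.
    np.fill_diagonal(d, (alpha * areas) ** beta)
    return d

dist = distance_matrix([[0, 0], [3, 4]], areas=[15.0, 60.0])
```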

The RCL measure is given by:

The Distance Decay Isolation (DDxP*x*) is given by:

where

such that

where \(\zeta _{i_1i_2}\) is defined as before. This also could be seen as the probability of contact of members of group *x* to each other weighted by the inverse of distance.

The Distance Decay Exposure (DDxP*y*) is given by:

where \(P_{i_1i_2}\) is defined as before.

The DEL measure is given by the following equation:

where \(a_i\) is the area of unit *i* and *A* is the total area of the given region \(A = \sum _{i=1}^{I}a_i\).
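A minimal sketch, assuming the standard form from Ref. [24], \(\mathrm{DEL} = \frac{1}{2}\sum _{i} \left| {\hat{s}}_{ix} - \frac{a_i}{A} \right| \). The function name is illustrative.

```python
import numpy as np

def delta_index(x, areas):
    """Half the summed gap between group shares and area shares."""
    x = np.asarray(x, dtype=float)
    areas = np.asarray(areas, dtype=float)
    return float(0.5 * np.abs(x / x.sum() - areas / areas.sum()).sum())

# The whole group sits in the smallest unit.
del_concentrated = delta_index([10, 0], [1, 3])
# The group is spread in proportion to area.
del_proportional = delta_index([1, 3], [1, 3])
```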

The ACO Index is given by:

where the units are ordered from smallest to largest in areal size. In this formula, \(n_1\) is the rank of the unit where the cumulative total population equals the total minority population, and \(n_2\) is the rank of the unit where the cumulative total population equals the total minority population, counting from the largest unit down. In addition,

and

Another measure of concentration is the RCO Index:

where \(n_1\), \(n_2\), \(T_1\) and \(T_2\) are defined as before.

The degree of centralization can be evaluated through the Absolute Centralization Index (ACE) or through the RCE:

where \(A_i\) is the cumulative area proportion through unit *i*, \(X_i\) is the cumulative frequency proportion through unit *i* of group *x* and \(Y_i\) is the analogous quantity for group *y*. In this measure, the areal units are ordered by increasing distance from the central business district, which we assume to be located at the average latitude and average longitude of all unit centroids.
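A minimal sketch of the absolute version, assuming the form from Ref. [24], \(\mathrm{ACE} = \sum _{i=2}^{I} X_{i-1}A_i - \sum _{i=2}^{I} X_iA_{i-1}\), and taking the center as the mean of the centroid coordinates as stated above. The function name and the toy layout are hypothetical.

```python
import numpy as np

def absolute_centralization(x, areas, centroids):
    """ACE with units ordered by distance from the mean centroid."""
    x = np.asarray(x, dtype=float)
    areas = np.asarray(areas, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    center = centroids.mean(axis=0)
    order = np.argsort(np.linalg.norm(centroids - center, axis=1))
    x_cum = np.cumsum(x[order]) / x.sum()          # X_i
    a_cum = np.cumsum(areas[order]) / areas.sum()  # A_i
    return float((x_cum[:-1] * a_cum[1:]).sum() - (x_cum[1:] * a_cum[:-1]).sum())

centroids = [[0, 0], [1, 0], [5, 0]]   # the middle unit is closest to the center
ace_central = absolute_centralization([0, 10, 0], [1, 1, 1], centroids)
ace_peripheral = absolute_centralization([0, 0, 10], [1, 1, 1], centroids)
```

Under this form, positive values indicate that the group tends to live near the assumed center and negative values indicate the opposite.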

The Dct Index based on [7] evaluates the deviation from simulated evenness. This measure is estimated by taking the mean of the classical *D* under several simulations under evenness from the global minority proportion.

Let \(D^*\) be the average of the classical *D* under simulations drawn assuming evenness from the global minority proportion. The value of Dct can be evaluated with the following equation:
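As a hedged sketch, the code below assumes the piecewise form of Ref. [7], \(\mathrm{Dct} = \frac{D - D^*}{1 - D^*}\) if \(D \ge D^*\) and \(\frac{D - D^*}{D^*}\) otherwise. It also assumes that evenness is simulated by drawing each unit's group count from a binomial distribution with the global minority proportion; that sampling scheme, the function names and the seed are assumptions of this example rather than a description of the SM implementation.

```python
import numpy as np

def _d(group, total):
    """Classical two-group dissimilarity index."""
    p = group.sum() / total.sum()
    s = group / total
    return (total * np.abs(s - p)).sum() / (2 * total.sum() * p * (1 - p))

def modified_dissimilarity(group, total, iterations=500, seed=0):
    """Deviation of D from its average under simulated evenness."""
    rng = np.random.default_rng(seed)
    group = np.asarray(group)
    total = np.asarray(total)
    p = group.sum() / total.sum()
    d = _d(group, total)
    # One row per simulation, one column per unit.
    sims = rng.binomial(total, p, size=(iterations, total.size))
    d_star = np.mean([_d(sim, total) for sim in sims])
    if d >= d_star:
        return float((d - d_star) / (1 - d_star))
    return float((d - d_star) / d_star)

total = np.full(10, 50)
dct_separated = modified_dissimilarity(np.array([50] * 5 + [0] * 5), total)
dct_even = modified_dissimilarity(np.full(10, 25), total)
```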

Similarly, the Gct based also on Ref. [7] evaluates the deviation from simulated evenness. This measure is estimated by taking the mean of the classical *G* under several simulations under evenness from the global minority proportion.

Let \(G^*\) be the average of *G* under simulations drawn assuming evenness from the global minority proportion. The value of Gct can be evaluated with the following equation:

Lastly, the Bias-Corrected (Dbc) and Density-Corrected (Ddc) Dissimilarity indices are presented in Ref. [2]. The Dbc is given by:

where \({\bar{D}}_b\) is the average of *D* over *B* resamples, each drawn from a multinomial distribution with the observed conditional probabilities, independently for each group.

The Ddc measure is given by:

where

and \(n\left( {\hat{\theta }}_i \right) \) is the \(\theta _i\) that maximizes the folded normal distribution \(\phi ({\hat{\theta }}_i-\theta _i) + \phi ({\hat{\theta }}_i+\theta _i)\) where

and \(\phi \) is the standard normal density.

### B: Counterfactual composition details

Following the notation of Appendix A and assuming that counterfactual values are built for two different cities, we form the cumulative distribution functions (CDF) for these values taken over all the tracts in City 1: \(F^{(1)}({\tilde{s}}_{i,j}^{1,t})\), and City 2: \(F^{(2)}({\tilde{s}}_{i,j}^{2,t})\). To create a counterfactual distribution that imposes the attribute distribution of City 2 on the spatial structure of City 1, we take \(p_{i,j}^{1,t} = F^{(1)}({\tilde{s}}_{i,j}^{1,t})\) and then generate \(n_{i,j}^{1,t} |_{attr = 2} = {F^{(2)}}^{-1}(p_{i,j}^{1,t}) n_{i,.}^{1,t}\), where \(attr = 2\) means that this population is calculated given the attributes of City 2. This entire process is done for all tracts of a group in City 1, and the majority group population is given by the difference \(n_{i,.}^{1,t} - n_{i,j}^{1,t} |_{attr = 2}\). The populations for City 2 are generated analogously.
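The mapping above can be sketched with empirical distributions. This example assumes an empirical CDF for \(F^{(1)}\) and NumPy's default linearly interpolated quantile for \({F^{(2)}}^{-1}\); both choices, the function name and the toy compositions are assumptions for illustration and may differ from what SM does internally.

```python
import numpy as np

def counterfactual_counts(share_1, total_1, share_2):
    """Impose City 2's composition distribution on City 1's units.

    share_1: group compositions of City 1 units
    total_1: total populations of City 1 units
    share_2: group compositions of City 2 units
    """
    share_1 = np.asarray(share_1, dtype=float)
    total_1 = np.asarray(total_1, dtype=float)
    share_2 = np.asarray(share_2, dtype=float)
    # Empirical CDF of City 1 evaluated at each of its own units.
    p = np.searchsorted(np.sort(share_1), share_1, side="right") / share_1.size
    # Inverse CDF of City 2 at those probabilities.
    cf_share = np.quantile(share_2, p)
    group = cf_share * total_1
    return cf_share, group, total_1 - group

cf_share, cf_group, cf_majority = counterfactual_counts(
    [0.1, 0.5, 0.9], [100, 200, 300], [0.2, 0.3, 0.4, 0.6]
)
```

The rank order of City 1's units is preserved, while the compositions now range over the values observed in City 2.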

## About this article

### Cite this article

Cortes, R.X., Rey, S., Knaap, E. *et al.* An open-source framework for non-spatial and spatial segregation measures: the PySAL segregation module.
*J Comput Soc Sc* **3**, 135–166 (2020). https://doi.org/10.1007/s42001-019-00059-3

DOI: https://doi.org/10.1007/s42001-019-00059-3

### Keywords

- Open-source
- Segregation
- PySAL
- Spatial analysis