Soul of the community: an attempt to assess attachment to a community


In this article, we work with data from the Soul of the Community survey project, which the Knight Foundation conducted from 2008 to 2010. In total, more than 47,800 participants from 26 communities across the United States took part in the study. Each year, around 200 different questions were posed to each participant. One key variable is attachment to one’s community. We use various machine learning algorithms to assess which factors may have an effect on attachment.






We would like to thank Dr. Adele Cutler for her input on the methodology of this manuscript and for providing access to her archetype software. In addition, we would like to thank the reviewers for their helpful comments and suggestions. This article was submitted prior to Jürgen Symanzik becoming Editor-in-Chief of Computational Statistics, and was handled by Yuichi Mori, the previous Editor-in-Chief.

Author information



Corresponding author

Correspondence to Anna Quach.



A Appendix

A number of variables and cases were removed prior to our analyses. Variables were removed for the following reasons: (i) Variables with a large share of missing responses (more than 45% of the cases) were excluded. (ii) When a variable was provided both as a 5-level variable and as an aggregated 3-level variable (variable names ending in r), the aggregated 3-level variable was removed because it provided less nuanced information. (iii) Variables not observed in all 3 years were excluded so that the years could be compared. (iv) All index variables (see Table 3) were removed, on the assumption that the variables aggregated into an index variable would show up together if the index variable is an important predictor variable. In addition, index variables are linearly dependent on the variables they were derived from, and including them creates issues in some of the models we used. We used PCA to determine how the index variables were derived. (v) Finally, all variables that form the basis for “Community Attachment” (one of our main response variables) were removed. PCA was conducted for both (iv) and (v). Ultimately, 55 variables were retained for our analysis from the original 179 (2008), 195 (2009), and 229 (2010) variables, respectively.
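Rules (i) to (iii) above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the article. It assumes that missing responses are stored as None, that the 45% threshold is checked separately in each year, and that a 3-level variable carries the name of its 5-level counterpart plus a trailing "r". The variable names (q3, q7, q7r, q9, q12) and the function select_variables are hypothetical.

```python
from typing import Dict, List, Optional

Column = List[Optional[int]]   # None marks a missing response
Table = Dict[str, Column]      # variable name -> responses for one year

MISSING_THRESHOLD = 0.45       # rule (i): more than 45% missing


def missing_fraction(column: Column) -> float:
    """Share of responses in a column that are missing."""
    return sum(value is None for value in column) / len(column)


def select_variables(tables_by_year: Dict[int, Table]) -> List[str]:
    """Apply rules (i)-(iii) and return the sorted names of retained variables."""
    # Rule (iii): keep only variables observed in every year.
    common = set.intersection(*(set(t) for t in tables_by_year.values()))

    kept = []
    for name in sorted(common):
        # Rule (ii): drop an aggregated 3-level variable (name ends in "r")
        # when its assumed 5-level counterpart (name without the "r") exists.
        if name.endswith("r") and name[:-1] in common:
            continue
        # Rule (i): drop variables with too many missing responses.
        # Assumption: the threshold is applied within each year.
        if any(missing_fraction(t[name]) > MISSING_THRESHOLD
               for t in tables_by_year.values()):
            continue
        kept.append(name)
    return kept


# Hypothetical toy data: three years, four respondents per year.
survey = {
    2008: {"q7": [1, 5, 3, 4], "q7r": [1, 3, 2, 3],
           "q9": [None, None, 2, 1], "q3": [2, 2, 1, 4]},
    2009: {"q7": [2, 4, None, 4], "q7r": [1, 3, None, 3],
           "q9": [1, 2, 2, 1]},
    2010: {"q7": [3, 3, 5, 1], "q7r": [2, 2, 3, 1],
           "q9": [3, 1, 2, 2], "q12": [1, 1, 2, 2]},
}

retained = select_variables(survey)
print(retained)
```

In this toy example, q3 and q12 are dropped because they are not observed in all three years, q7r is dropped in favor of q7, and q9 is dropped because half of its 2008 responses are missing. Rules (iv) and (v) depend on the PCA results and are not covered by this sketch.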

Table 3 Table of the formulation of the 15 index variables (variables calculated by taking the mean of several variables) found using PCA

After the removal of variables, cases were removed for the following reasons: (i) Cases with at least one missing value in the remaining variables were removed. (ii) Answers such as “don’t know”, “refuse to answer”, or “did not answer the question” in the survey were recoded as missing and then handled according to (i). Figure 12 shows the effect of data cleaning on the sample sizes in each community in each year. Although steps (i) and (ii) may sound strict, only a few cases had to be deleted in most communities and years. Notice that the communities with considerable decreases in sample size after data cleaning were mostly urban communities (such as Philadelphia, Pennsylvania; Miami, Florida; and San Jose, California, in 2008, and Charlotte, North Carolina; Akron, Ohio; and Detroit, Michigan, in 2009). An explanation of why these dramatic changes occur in urban communities following data cleaning would be interesting, but has not been investigated here.
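The two case-removal steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure as implemented for the article: the non-answer labels in NON_ANSWERS, the variable names, and the functions recode_non_answers and complete_cases are hypothetical, and the actual survey coding may differ. The sketch also counts cases per community before and after cleaning, which is the comparison shown in Figure 12.

```python
from collections import Counter
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Hypothetical labels for non-substantive answers (straight apostrophe assumed).
NON_ANSWERS = {"don't know", "refused", "no answer"}


def recode_non_answers(row: Row) -> Row:
    """Step (ii): replace non-substantive answers with None (missing)."""
    return {
        key: None if value is None or value.strip().lower() in NON_ANSWERS
        else value
        for key, value in row.items()
    }


def complete_cases(rows: List[Row], variables: List[str]) -> List[Row]:
    """Step (i): keep rows with no missing value in the retained variables."""
    recoded = [recode_non_answers(row) for row in rows]
    return [row for row in recoded
            if all(row.get(name) is not None for name in variables)]


# Hypothetical toy data.
rows = [
    {"community": "Akron, OH", "q7": "4", "q9": "Don't know"},
    {"community": "Akron, OH", "q7": "5", "q9": "2"},
    {"community": "Detroit, MI", "q7": None, "q9": "3"},
    {"community": "Detroit, MI", "q7": "1", "q9": "1"},
    {"community": "Detroit, MI", "q7": "Refused", "q9": "2"},
]

kept = complete_cases(rows, ["q7", "q9"])

# Sample size per community before and after cleaning.
before = Counter(row["community"] for row in rows)
after = Counter(row["community"] for row in kept)
for community in sorted(before):
    print(community, before[community], after[community])
```

Recoding before the completeness check matters: a row whose only problem is a “don’t know” answer would otherwise pass step (i) as if it were complete.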

Figure 13 provides a graphical representation of the variables and cases that were removed from further analysis. Overall, the largest number of cases was removed from the 2010 data set; however, the original sample size that year was approximately 50% larger than the sample sizes in 2008 and 2009.

Fig. 12

Dot chart of the sample size in each year for each community before and after data cleaning

Fig. 13

Heatmaps showing missing data in each year. Also shown are cases and variables removed from further analysis (further explained in the main text)

B Appendix

This appendix summarizes the predictor variables in Table 4 and lists additional variables related to the index variables in Table 5. The variables in bold in Table 4 were found to be the three most important predictor variables for predicting attachment status. Table 4 also includes some of the variables that make up the index variables in Table 3. Table 5 lists the remaining variables that make up the index variables in Table 3, together with their descriptions.

Table 4 Table of predictor variables
Table 5 Table of additional variables that make up the index variables in Table 3


Cite this article

Quach, A., Symanzik, J. & Forsgren, N. Soul of the community: an attempt to assess attachment to a community. Comput Stat 34, 1565–1589 (2019).



  • Random forests
  • Archetypes
  • Knight foundation
  • Community attachment