Learning Objectives

By the end of this chapter, you will be able to:

  1. Present the current urban data landscape, including sources, typology, and limitations.

  2. Describe health use cases of urban data.

  3. Understand the socio-technical considerations and challenges around data integration for urban health.

1 Introduction

Since the 1940s, scientific research in cities has evolved with the inclusion of ideas from regional science (Zipf 1949), cybernetics (Wiener 1948), systems engineering (Von Bertalanffy 1969), and system dynamics (Forrester 1970). These studies have conceptualized cities as complex ‘systems-of-systems’, albeit with limitations due to a lack of real-world data and the hardware and software elements required for implementations. Over the past two decades, rapid technological development has brought smart devices, the quantified-self movement (Wolf 2007), the smart cities movement (Wiig 2015), big data (James et al. 2011), and the open data movement (Barbosa et al. 2014) into the urban studies realm. As a result, we are becoming “human on the net” (Bradley 2007)—well-connected and integrated with urban sociotechnical systems. Meanwhile, rapid urbanization, population growth, and climate change are intensifying at an unprecedented speed. Citizens, especially vulnerable populations, are facing threats to achieving ecological sustainability, economic prosperity, social justice, and overall quality of life. Emerging disciplines, such as urban computing, urban informatics, and civic analytics, have begun to incorporate data science and urban domain knowledge in order to generate a more scientific and consistent understanding of cities. Such efforts have generated two main branches of research output: technological deployment within an urban environment (“smart cities”) and the new insights derived from real-world data that constitute “urban intelligence”. Regarding the first branch, urban infrastructure is increasingly equipped with real-time sensing nodes, cloud computing, automation systems, and optimized networks. And regarding the second, ubiquitous urban data increasingly enable data-supported policy formulation, computational social science, preventive interventions, and data-driven operations.

This chapter maintains that data integration is a critical component for developing next-generation intelligence for urban health, and that population wellness and quality of life are the results of complex urban biophysical and socioeconomic dynamics. After introducing the broader context of the urban data landscape and urban health indications, it presents a specific project narrative as an exemplar, NYC PollenScape, describing its background, data, methods, and findings. A more general discussion of sociotechnical challenges in integrated data analytics for urban health follows. The chapter concludes with a summary of key takeaways, current limitations, and future work.

2 The Broader Context

2.1 Urban Data Landscape

Urban data are large in volume and variety, and derive from heterogeneous sources that require different analytical strategies. Conventionally, government administrative data at the federal, state, and city levels provide baseline information for population and neighborhood studies. The U.S. Census Bureau conducts nationwide decennial population censuses and provides data to support large-scale decision-making and policy analysis involving housing, transportation, healthcare, and education (U.S. Census Bureau 2010). The American Community Survey is an annual sample survey reporting neighborhood population characteristics, including demographics, occupation, income, and education, in order to inform public funding, capital planning, and infrastructure investment (U.S. Census Bureau 2018). New York City Community Health Profiles report neighborhood-level health conditions (life expectancy, healthcare, behavioral and demographic characteristics) for the city’s fifty-nine community districts (NYC Department of Health and Mental Hygiene 2018). The population census and community survey data provide long-term, comprehensive information to support both cross-sectional and longitudinal analyses. Universities, research institutes, and non-profit organizations are additional sources, providing database archives, data repository portals, and online platforms related to global health. These platforms enable researchers and health practitioners to explore comprehensive collections of information. However, since these platforms collect data from various sources and share access as a third party, the timeliness and coverage of available resources are highly dependent on platform maintenance.

With the growth of digital applications and Internet of Things (IoT) products, application programming interfaces (APIs) have become a relatively new data source. AIDSinfo is an API created by the U.S. Department of Health and Human Services for accessing AIDS-related drug databases (U.S. Department of Health and Human Services 2019). AirNow is an API providing real-time air quality data for the U.S., Canada, and Mexico (US EPA 2019). Fitbit is a wearable device for tracking personal physical activity (walking and sleeping) and health (heart rate), and it provides an API for accessing the data (FitBit 2019). Twitter, one of the most popular online news and social media platforms, provides a developer platform with multiple APIs that enable querying historical data (limited to the most recent seven days) by keyword, creating a campaign, or generating social media engagement metrics (Twitter 2019). All of these APIs provide data resources for prototyping new applications or integrated products, although they also require substantial technical skills to access, read, and process the data.

2.2 Urban Health Indications

Increasing data sources and multidisciplinary research methods have enabled new analytic approaches to public health research and practice. These sources include historical clinical data, hospitalization records, and surveys that report certain aspects of health outcomes (e.g., mortality rate, life expectancy, hospitalization rate, asthma, diabetes). The United States Small-Area Life Expectancy Estimates Project (USALEEP) is the most granular (at the census tract level) and comprehensive (all neighborhoods in the U.S.) study to date (NAPHSIS 2018). This program reveals that, in addition to healthcare per se, location of residence, with its concomitant implications for housing, education, safety, environment, and food access, may also have an impact on life expectancy (Robert Wood Johnson Foundation 2019). Variations in neighborhood outcome prevalence and the resulting spatial patterns also reveal population health disparities that have been shaped by broader historical, social, political, and environmental issues, which vary by location. Increasing digitization of health facility information, such as that from hospitals, clinics, and pharmacies, enables more in-depth analytics to improve public access, resource allocation, and service operation of healthcare. For example, NYC, Boston, and Chicago publish the geolocation of hospitals and clinics (NYC Department of Health and Mental Hygiene 2019a; Boston Department of Innovation and Technology 2019; Chicago Department of Public Health 2019). In addition, NYC shares the geolocation of health facilities for hepatitis prevention and testing (NYC Department of Health and Mental Hygiene 2019b) and seasonal flu vaccination (NYC Department of Health and Mental Hygiene 2019c). Singapore maintains data on retail pharmacy locations (Singapore Health Sciences Authority 2019) as well as daily polyclinic attendance categorized by selected diseases (Singapore Ministry of Health 2019). To fully understand current resource allocation, operations, and potential improvements, extensive analytical work is required, involving data mining and spatial analysis integrated with other population and environmental data.

New survey approaches and data sources currently provide more granular information on behavioral factors related to urban health. City-level health surveys enable multifaceted research on health-related behaviors (e.g., smoking, drinking, physical activity, commuting patterns), as well as related household characteristics (e.g., age, gender, household size, foreign-born population) and socioeconomic status (e.g., income, education, occupation). Novel analytics using geotagged social media data make it possible to quantify, visualize, and promote public awareness of issues such as obesity, diabetes, or a physically active (or inactive) lifestyle (Ghosh and Guha 2013; Maitland et al. 2006; Hawn 2009). Analytics of citizen complaints provide new insights into the relationship between heavy drinking and alcohol store location (Ransome et al. 2018), neighborhood risk from hazardous chemical exposure (Gunn et al. 2017), and noise pollution varying by location and time (Zheng et al. 2014). Sensing technology and the IoT make it feasible to monitor environmental conditions at various spatial-temporal resolutions. At macro and meso scales, satellite imagery and remote sensing data enable large-scale spatial analysis of the health impact of land cover, ecological patterns, and natural disasters. At micro and hyperlocal scales, in situ sensing and GPS-enabled spatial tagging devices make it possible to monitor issues such as dengue cases (Seidahmed et al. 2018), real-time ambient air quality (Schneider et al. 2017; Zheng et al. 2013), and drinking water quality (Hou et al. 2013).

3 Project Narrative: NYC PollenScape

3.1 Background

In 2015, more than 2,240 New Yorkers participated in TreesCount! 2015, a project hosted by the City Department of Parks & Recreation (NYC Department of Parks & Recreation 2015a). Each volunteer participant was issued a GPS device, a tape measure, and a training book to guide the digitization of each street tree’s information, such as its geolocation (in latitude and longitude), species, size, health condition (e.g., damaged, overgrown, or dead), and sidewalk conditions (NYC Department of Parks and Recreation 2015b). To date, this is the largest crowd-sourced urban forestry data collection in U.S. history (NYC Department of Parks & Recreation 2015c). The final output was reported in the NYC 2015 Street Tree Census, which included data on 666,134 trees on streets in all five boroughs of NYC.

The research project NYC PollenScape [Footnote 1] derived conceptually from conflicting findings on the health impact of urban trees. Generally, street trees function to clean the air (McPhearson et al. 2013), ease the urban heat island effect (Loughner et al. 2012), mitigate stormwater (Nowak et al. 2007), and create a more sustainable and aesthetic neighborhood that promotes physical activity (Ulmer et al. 2016). On the other hand, the potential adverse health impact of urban forestry has drawn researchers’ attention. Previous studies in the U.S. and Canada reveal an increasing health risk caused by tree pollen allergens. Certain tree species can be a source of allergens that exacerbate respiratory health issues, including asthma (Lovasi et al. 2013). Surveys and clinical visit records reveal an underlying spatial-temporal correlation between allergenic pollen exposure and neighborhood asthma prevalence (Dales et al. 2008). Researchers in spatial epidemiology have further concluded that the local risk of pollen exposure should be incorporated in allergy diagnosis (Asam et al. 2015). Unfortunately, due to limited data sources and non-robust analytical methods, previous studies were constrained to specific case studies or small survey samples. More importantly, a lack of cross-domain knowledge integration and multidisciplinary research has yielded inconsistent findings and implementations that remain segmented across public health, environmental science, landscape architecture, and urban planning.

By 2016, the NYC Open Data platform already had over 1,600 related data sets publicly available (NYC Department of Information Technology 2016). These resources inspired us to consider street trees beyond the scope of the Department of Parks and Recreation’s regular duties, with a more integrated view of quality of life involving sustainability, safety, health, municipal services, and beyond. Utilizing the tree census data and other ancillary data sets, project NYC PollenScape measures the localized environmental health impact of more than 600,000 street trees of more than 120 species in NYC. Ultimately, this study aims to integrate the segmented data sets from various sectors to represent the holistic urban physical-technical-ecological-socioeconomic dynamics that shape neighborhood respiratory health.

3.2 Methodology

Data mining and integration processes were employed to collect information from federal, state, and city sources, including the American Community Survey of the U.S. Census Bureau; neighborhood asthma prevalence as captured by the New York State Department of Health; the citywide air quality monitoring program of the NYC Department of Health & Mental Hygiene; building and tax lot information from the NYC Department of City Planning; citizen complaints related to indoor air quality (e.g., chemical vapors, dry cleaning, construction dust) collected by NYC 311 [Footnote 2]; and public housing locations as published by the NYC Housing Authority. Based on tree species, a web crawler extracts pollen information from Pollen.com, including pollen allergenicity, severity, and active seasons. This integration creates an extensive database reporting health outcomes (asthma hospitalization rate), environmental conditions (e.g., street trees, pollen exposure, ambient air quality, indoor air quality), neighborhood demographics (population, age, income), and ‘built’ environment characteristics (building density, land use types, street network, public housing). These multiscale variables can be summarized at the geolocation (latitude and longitude), zip code, Census Block, or Neighborhood Tabulation Area scale through aggregation, disaggregation, spatial interpolation, or spatial extrapolation.
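
To make the rescaling step concrete, the following is a minimal Python sketch of two of the operations named above: aggregation of point records to an areal unit, and disaggregation from one set of areal units to another by areal weighting. It is not the project's actual pipeline. The record fields, the unit labels, and the overlap shares are all hypothetical, and areal weighting is only one of several possible disaggregation rules.

```python
from collections import defaultdict


def aggregate_by_unit(records, unit_key, value_key):
    """Sum a point-level value within each spatial unit (e.g., trees per zip code)."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[unit_key]] += rec[value_key]
    return dict(totals)


def disaggregate_by_area(source_totals, overlaps):
    """Areal weighting: split each source-unit total across target units in
    proportion to the share of the source unit's area that each target covers."""
    target_totals = defaultdict(float)
    for (source, target), share in overlaps.items():
        target_totals[target] += source_totals.get(source, 0.0) * share
    return dict(target_totals)


# Hypothetical point records: one row per street tree, flagged as allergenic or not.
trees = [
    {"zip": "ZIP_1", "allergenic": 1},
    {"zip": "ZIP_1", "allergenic": 0},
    {"zip": "ZIP_2", "allergenic": 1},
    {"zip": "ZIP_2", "allergenic": 1},
]
allergenic_by_zip = aggregate_by_unit(trees, "zip", "allergenic")

# Hypothetical overlap table: fraction of each zip code's area that falls
# inside each Neighborhood Tabulation Area (shares per zip code sum to 1).
overlaps = {
    ("ZIP_1", "NTA_A"): 1.0,
    ("ZIP_2", "NTA_A"): 0.25,
    ("ZIP_2", "NTA_B"): 0.75,
}
allergenic_by_nta = disaggregate_by_area(allergenic_by_zip, overlaps)
```

Areal weighting assumes that the quantity is spread evenly within each source unit, which is rarely true for street trees or population; in practice the overlap shares would come from a GIS overlay, and a population-weighted or street-length-weighted share may be more defensible.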

Asthma prevalence patterns are the result of complex biophysical-socioeconomic processes. Our modeling efforts aim to (1) specify the variables that capture underlying interactions and multicollinearity among various factors, and (2) estimate how the related interactions vary across space. The final model is a multivariate geographically weighted regression (GWR) model, which captures both a global coefficient β and a localized effect β(u_i, v_i) that may vary by location (u_i, v_i). A project website publishes the final results through interactive maps and plots built in Tableau [Footnote 3]. The general public can navigate the maps to check pollen exposure in their neighborhood during different seasons. A location-based spatial search enables users in NYC to zoom in to the neighborhood scale based on their current location.
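
As an illustration of how a localized coefficient β(u_i, v_i) can be estimated, the sketch below fits a separate weighted least-squares line at each location, down-weighting observations that are far away. It is a deliberately reduced version of GWR and not the model used in the project: it uses a single predictor, and the Gaussian kernel, the fixed bandwidth, and the toy data are assumptions made only for this example. It also omits the global coefficient term mentioned above.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Gaussian distance-decay kernel (an assumed kernel choice)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def local_fit(target, coords, x, y, bandwidth):
    """Weighted least squares centered on one location.

    Returns (intercept, slope) for y = intercept + slope * x, where each
    observation is weighted by its distance to `target`.
    """
    u0, v0 = target
    sw = swx = swy = swxx = swxy = 0.0
    for (u, v), xi, yi in zip(coords, x, y):
        w = gaussian_weight(math.hypot(u - u0, v - v0), bandwidth)
        sw += w
        swx += w * xi
        swy += w * yi
        swxx += w * xi * xi
        swxy += w * xi * yi
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:
        raise ValueError("predictor has no weighted variation near this location")
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept, slope


def gwr(coords, x, y, bandwidth):
    """Fit one local regression per observation location."""
    return [local_fit(c, coords, x, y, bandwidth) for c in coords]


# Toy data: two distant clusters with different exposure-outcome slopes.
coords = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]   # hypothetical exposure measure
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]   # hypothetical health outcome
local_fits = gwr(coords, x, y, bandwidth=1.0)
```

With a bandwidth that is small relative to the distance between the two clusters, each local fit is dominated by its own cluster, so the estimated slope differs between the two groups of locations. In applied work the bandwidth is usually selected by cross-validation or an information criterion, and an established GWR library would normally be preferred over hand-written code.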

3.3 Findings

The study revealed that the top 20 species represent more than 80% of street trees in NYC. Among these, many produce moderately or severely allergenic pollen. The peak pollen season is spring, with 76% of street trees having active pollen during this period. The concentration of allergenic pollen exposure shifts by season; for example, the South Bronx has a higher exposure risk during the fall. The regression models show that although street trees contribute to better air quality overall (measured by PM2.5 concentrations), certain species (Red Maple, Northern Red Oak, and American Linden) are positively related to the local asthma hospitalization rate (while other correlated factors are held fixed).
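
The two descriptive statistics reported above (the share of trees accounted for by the most common species, and the share of trees with active pollen in a given season) are simple to compute once species counts and pollen seasons are joined. The sketch below shows one way to do so in Python; the species labels, counts, and seasons are invented toy values and do not reproduce the census figures.

```python
from collections import Counter


def top_k_share(species_counts, k):
    """Fraction of all trees that belong to the k most common species."""
    total = sum(species_counts.values())
    return sum(n for _, n in species_counts.most_common(k)) / total


def active_share(species_counts, active_seasons, season):
    """Fraction of all trees whose species has active pollen in `season`."""
    total = sum(species_counts.values())
    active = sum(
        n for species, n in species_counts.items()
        if season in active_seasons.get(species, set())
    )
    return active / total


# Toy inputs (hypothetical): tree counts per species and pollen seasons per species.
counts = Counter({"species_a": 50, "species_b": 30, "species_c": 15, "species_d": 5})
seasons = {
    "species_a": {"spring"},
    "species_b": {"spring", "fall"},
    "species_c": {"fall"},
    "species_d": {"summer"},
}

share_top2 = top_k_share(counts, 2)
share_spring = active_share(counts, seasons, "spring")
```

Computing the same shares within each neighborhood, rather than citywide, is what allows the seasonal shift in exposure to be mapped.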

The GWR model results further explain the spatial disparities of environmental health in NYC, which are collectively driven by ambient exposure, indoor environment quality, demographics, and socioeconomic status. For example, Midtown Manhattan has relatively high PM2.5 concentrations but a lower-than-average asthma hospitalization risk; this might be explained by Midtown’s higher-income population having better awareness of, and access to, preventive asthma care. The spatial model also reveals a significant health risk associated with indoor air quality and public housing. Mott Haven is a low-income neighborhood in the South Bronx that has the city’s highest youth asthma hospitalization rate. This elevated rate is collectively driven by poor ambient air quality, pollen exposure, a vulnerable population, and poor housing quality (e.g., housing maintenance defects reported at a rate of 79%) (NYC Department of Health & Mental Hygiene 2015). These findings strongly suggest the need for cross-domain, intersectoral, multiscale data integration and collaborative research efforts to understand urban health issues.

3.4 Limitations and Future Work

As already noted, long-term investigations and collaborative efforts are required to achieve integrated data intelligence for urban health. Admittedly, our current project has several limitations that necessitate future work. Although new data sources enable comprehensive quantification of the urban context, neighborhood characteristics, and population baselines, a lack of granular health outcome data remains an important hurdle to developing data intelligence at high spatial-temporal resolution. To date, no publicly available data set reports specifically on asthma cases in NYC, at least in part because of privacy concerns and data ownership issues.

While it is promising to see more open data reporting on long-term, personal-scale population health, some novel analytics may serve as alternative approaches to address the limitations noted above. Since direct data on asthma cases are often not available for NYC PollenScape, secondary analysis of other records may provide a proxy ‘digital trace’ of asthma patients. For example, Sheffield et al. analyzed 5-year asthma medication sales records in NYC and found a significant correlation between pollen season and medication sales volume (Sheffield et al. 2003). With increasing deployment of the IoT in the urban environment, in situ sensing can potentially be employed to collect real-time or near real-time measurements for ground-truth validation. Currently, there are commercially available sensors for monitoring particle concentrations (e.g., pollen, dust), air quality (e.g., PM2.5, ozone), and weather conditions (e.g., temperature, wind, precipitation). Our research may inform future in situ sensing at specific locations to better understand the complex interactions among air quality, micro-climate, pollen exposure, and asthma risk.

Considering that the NYC Tree Census data were collected through crowd-sourcing, new civic analytics products may serve as an interface for (1) providing individuals with useful insights related to urban life in return for their volunteer data collection efforts; and (2) collecting new information, such as anonymous geotagging of users’ neighborhoods, related urban environmental exposures, and additional population health data. Long-term research on how to sustain a robust information feedback mechanism among city agencies, researchers, and citizens will also be necessary to achieve these goals.

4 Sociotechnical Considerations

Integrated urban data intelligence involves the environment, technology, and people. Cities, as complex biophysical-technical-socioeconomic systems, often raise challenges for developing technically feasible and socially viable solutions. Overall, successful research and deployment require careful consideration and understanding in order to address current and anticipated technical, social, and managerial challenges. Previous ad hoc “smart cities” deployments, legacy infrastructure, and enterprise-specific software applications have created a segmented data landscape in the urban domain (Harrison and Donnelly 2011). Data-driven decision-making and operations should respect specific urban contexts, which may vary in their historical, political, cultural, and regulatory environments. However, a lack of precise, transparent, and validated methodology across cities constrains more open and collaborative analytics. Data formats, naming conventions, and spatial unit definitions vary across administrative scales, including city, borough, community district, census tract, census block, and neighborhood tabulation area.
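
A small example of the naming-convention problem is that the same borough may appear under a full name, a two-letter abbreviation, or a numeric code, depending on the agency that published the data set. The Python sketch below normalizes such labels to one canonical name before data sets are joined. The alias table is an illustrative assumption and not an official crosswalk; a real project would build it from each data set's data dictionary.

```python
# Illustrative alias table (an assumption, not an official crosswalk):
# canonical borough name -> labels under which it might appear.
BOROUGH_ALIASES = {
    "Manhattan": {"manhattan", "mn", "1"},
    "Bronx": {"bronx", "the bronx", "bx", "2"},
    "Brooklyn": {"brooklyn", "bk", "3"},
    "Queens": {"queens", "qn", "4"},
    "Staten Island": {"staten island", "si", "5"},
}

# Invert the table once so that each alias points to its canonical name.
_LOOKUP = {
    alias: name
    for name, aliases in BOROUGH_ALIASES.items()
    for alias in aliases
}


def normalize_borough(raw):
    """Map a raw borough label (name, abbreviation, or code) to a canonical name."""
    key = str(raw).strip().lower()
    if key not in _LOOKUP:
        # Fail loudly rather than guess: silent mismatches corrupt joins.
        raise ValueError(f"unrecognized borough label: {raw!r}")
    return _LOOKUP[key]


def harmonize(records, field="borough"):
    """Return copies of the records with the borough field normalized."""
    return [{**rec, field: normalize_borough(rec[field])} for rec in records]


# Hypothetical rows from two data sets that label the same borough differently.
rows = [{"borough": "BX", "trees": 10}, {"borough": 2, "asthma_rate": 7.5}]
harmonized = harmonize(rows)
```

Raising an error on an unknown label is a deliberate choice here: a record that fails to match is easier to investigate than one that was silently dropped or assigned to the wrong unit during integration.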

Methodological clarity becomes vital for developing reproducible, generalizable, and scalable computational solutions and analytical pipelines. In April 2017, the University of Chicago hosted the first workshop addressing these issues, entitled ‘Convening on Urban Data Science’, which included 112 experts from governmental agencies, universities, and the private sector. Presentations, discussions, and debates concluded with a common concern that the consistent analytical framework needed to address the rapid growth of data was not yet available [Footnote 4]. In addition, the advent of artificial intelligence in the field was addressed: algorithmic decision-making raises questions about how ‘black-box’ machine learning approaches can reliably handle underlying relationships (including the knotty issue of correlation versus causality) and confounding effects, while also taking into consideration existing problematic urban patterns involving segregation and environmental justice.

Both ethics and social awareness must be addressed in carrying out data computing, analytics, and deployments that impact people’s lives. The ethical practice of data mining and analytics, especially regarding security and privacy issues, is critical for developing accountable methods, fair algorithms, and healthy partnerships with city agencies, stakeholders, and local communities (Bloomberg Data for Good Exchange 2017). Urban data are not always the ground truth, due to limited representativeness (e.g., survey data), reporting biases (complaint data), or the performative nature of specific behavior (social media data). Hence, data scientists need to be fully aware of the limitations of specific data sources or types. Decisions that involve physical infrastructure, capital investment, and policy intervention are often irreversible in the short term, while A/B testing is neither feasible nor ethical in reality. Data scientists should work with policy-makers and planners to carefully evaluate potential risks.

Managerial and domain barriers constrain multidisciplinary research and practice in cities. Urban health issues involve various biophysical and socioeconomic factors that require cross-domain efforts and intersectoral actions (World Health Organization 2008). In reality, cities are complex systems-of-systems in which agencies often operate in silos, creating managerial and organizational barriers. Although city agencies are becoming data-rich, differing departmental demands and operations often come to define data collection, analytics, and management. Integrated analytics face real-life constraints shaped by administrative hierarchies, organizational priorities, and competing interests. Beyond the managerial silos, collaborative urban analytics also needs to break down domain barriers.

Multifaceted urban issues require transdisciplinary approaches that integrate science, engineering, and design expertise to address both social and technical urban problems. In 2016, the NYC Department of Parks and Recreation organized the TreesCount! Data Jam [Footnote 5], a one-day hackathon that invited the general public to explore potential insights from the street tree census data (NYC Department of Parks and Recreation 2016). During this event, urban planners, tree enthusiasts, data scientists, and university students formed teams to analyze data, develop research questions, design prototypes, and visualize potential use cases. However, the effective facilitation and maintenance of robust multidisciplinary collaborative research remain ongoing challenges.

Effective and sustained partnership is a crucial enabler of the information feedback loops needed to support successful long-term implementations. Since cities are complex systems-of-systems, their overall success relies on substantial efforts in integration, communication, and engagement (Maier 1998). A regional cross-city network enables information exchange and experience sharing. For example, MetroLab Network is a U.S. city-university league created to promote civic technology [Footnote 6]. Within cities, collaboration opportunities also lie at the policy-academic-industry nexus. In NYC, the Mayor’s Office of the Chief Technology Officer launched ‘NYCx Challenges’ to support local business partners and researchers on pilot projects promoting sustainability, health, and economic development [Footnote 7]. City-university-community partnerships enable the development of “test-beds” to explore how information technology and data science may improve the quality of life at a neighborhood scale. Research and community innovation projects, such as the ‘Array of Things’ project by the University of Chicago [Footnote 8] and the ‘Quantified Community’ project by New York University [Footnote 9], provide first-hand experience in innovative technology deployment, data-driven operations, and citizen science.

5 Conclusion

The rapid digitization of urban life brings opportunities for developing new methods and applications to promote urban health. As we live in an increasingly smart and connected society, cross-domain integration and participatory research play important roles in addressing previous ad hoc technology development and top-down urban policies and operations. Citizen-involved and community-based projects will continue (1) utilizing various technologies to solve neighborhood problems and meet local demands for better quality of life, (2) educating the public for better data literacy and promoting citizen science through active engagement, and (3) validating the progress and impact of urban digital transformation at a granular human scale. All of these technical, methodological, and social transformations, working progressively in tandem, will result in the creation of a new data-driven version of urban science.