1 Introduction and motivation

This project studies AI-based tools for characterizing human perception of urban spaces at the level of street photographs, using state-of-the-art machine learning and computer vision techniques to model the perception of inequality in terms of safety, maintenance, and architectural features. Between the late 2000s and mid-2010s, the Great Recession ravaged most countries' economies and societies. It was a global crisis gestated and realized in cities, but its effects on the physical urban landscape are still poorly understood. We hypothesize that the effects of the crisis were substantially reflected in the built environment of cities and visible at the street level. To test this hypothesis, we analyze the streetscapes of neighborhoods in Raleigh, NC (USA) in the 2008–2020 period. A study of this kind requires urban history researchers to use computational interfaces and methods that can analyze streetscapes at scale. For about a decade, the crisis's effects unfolded visually in the urban environment, and that evolution is preserved in the data available from Google Street View (GSV). In addition to traditional quantitative metrics (e.g., spatialized measures of income, housing, and employment over time), this visual assessment of streetscape change could give social scientists and policymakers an additional tool to corroborate or dispute the expected spatial distribution of the crisis across a metro area. The differences between these distributions can illuminate how changes in the visible streetscape diverge from, lag behind, precede, or parallel other changes in the human condition during economic recessions.

In this paper, we demonstrate that predictors trained on generic image features, annotated with crowdsourced human perception scores of street-level indicators of neighborhood economic health, can be used to scale the analysis up to larger regions. We evaluate the predictive power of several image features commonly used for scene understanding and illustrate how these models can serve social scientists, urban planners, and policymakers in their analyses. We discuss how expert interaction with AI predictions through human-in-the-loop interpretive interfaces (exploratory geovisualizations) often allows an expert to compensate for a model's weaknesses with domain knowledge. We validate our approach by comparing changes observed in household income data with changes in image ratings from matching periods: we annotate images from 2011–2013 and 2016–2017 and add household income data for these periods, chosen because each contains enough images for statistical comparison.

From a social science perspective, we seek to improve the understanding and characterization of the effects of economic recessions observed at multiple scales. Several significant advances in this research area have been made with the use of generalized socioeconomic data as well as detailed (although highly localized) social science surveys (Hwang and Sampson 2014; Hwang 2015; Moye 2014; Pfeiffer and Lucio 2015). Our project expands research in this field through the integration of AI technology in the analysis process by:

  1. Grounding the analysis of urban renewal and decline at the street level by providing human analysts with tools to access features of the visual environment processed by computer vision techniques.

  2. Presenting novel methods of interactive exploratory geovisualization that integrate geographic and multidimensional views of street attribute data produced by artificial intelligence algorithms.

  3. Introducing a process for augmenting AI models with crowdsourced subjective ratings for interpretation by urban studies experts.

1.1 The perception of safety and reality

Several metrics could be used to characterize the safety of a specific street, neighborhood, or city. Crime data are systematically collected and analyzed across the world at different regional scales. Sources such as City-Data organize and display detailed crime statistics by city and by crime category (theft, vandalism, murder, etc.), as well as changes in safety measures over time. Figure 1 shows the interface for crime data in Raleigh at the city level, in terms of crimes per 100,000 residents, with comparisons to nearby towns.

Fig. 1 Crime map and safety score based on crimes per 100,000 residents, from city-data.com

Such metrics are useful at a high level, much like income, zoning, and real-estate data, for characterizing safety at the city scale. A comprehensive study of how crime data match street-level perceptions of safety promises rich insight and is left as future work. That work will benefit from ongoing efforts by entities like the Smart Cities Council, which is examining crime rates and response times across many regions consistent with international standard ISO 37120:2014 and its clearly defined performance metrics and collection and reporting procedures. Further data from these and similar standardized sources would deepen the analysis when coupled with the human perception approach outlined in this paper.

1.2 Threats to validity and the risk of Code Hubris

The choices of algorithm, survey instrument, and statistical test all impact the results and their interpretation. In terms of data collection and human annotation, image quality matters: annotators viewing higher-quality images on larger screens can see more detail, which may affect their perception ratings. Annotators' social, economic, and cultural backgrounds also affect their interpretation of subjective metrics. From a modeling perspective, this work uses a Bayesian ranking algorithm to model the relative perception of images, which assumes that the images are sampled from a normal distribution over their perceived ratings. In the absence of the true underlying distribution, this is a reasonable initial assumption, and it allows us to compare this work directly with the earlier StreetScore and Place Pulse projects, which make the same assumption. The statistical tests are chosen appropriately for determining the significance of correlations between features. As with all scientific research, progressive improvements are expected as understanding of the underlying phenomena and feature relationships improves. Finally, the amount of data needed for pairwise rankings depends on the number of domain features and the consistency of annotators.

It is essential to state explicitly that the work presented in this paper neither claims to offer a completely automated approach to prediction nor claims to accurately characterize, or provide commentary on, the underlying subjective social and cultural factors that are a critical part of visual imagery and its perception. This work introduces a set of tools, and a methodology for using them, that give human researchers observations, a processing methodology, and visual insight into a problem that has not yet been examined from this perspective. Much work remains to be done with these tools as expert quantitative geographers adopt them in their tradecraft. As Birhane (2021) states in a recent paper, “… the more socially complex the problem is, the less accurate machine learning systems are of accurately classifying, predicting …” We understand that researchers, policymakers, and stakeholders are equally challenged by the complexity of such social phenomena. Still, we recognize that the current availability of untapped data (e.g., GSV images) opens new opportunities for critical inquiry into the use of algorithmic tools. That said, the paper puts social science questions forward as the primary drivers of this project, with the tools engineered to support those questions, because doing so allows us to work with real data and real questions. Unlike the data scientists in Rey's (2019) characterization, who claim expertise in application domains, our project brings together an interdisciplinary group of social scientists, humanities scholars, visual designers, and engineers.

2 Background

2.1 Historical and urban studies background

The Great Recession started in cities with the bursting of the US subprime mortgage bubble in December 2007, and from there it spread to the rest of the globe. Several studies have characterized the different scales of the crisis as interconnected and mutually constitutive, pointing out the linkages between the global economic downturn and the fate of local urban geographies such as neighborhoods. Given the importance of homeownership in the US economy, US cities constituted ground zero for the crisis (Newman and Schafran 2013; Aalbers 2012). The recession and the austerity measures taken to address its effects were not equally felt across the North Atlantic economies, affecting some regions and nations more than others (Kitson et al. 2011; Harvey 2012). The crash also brought fiscal retrenchment and the down-streaming of austerity policies to the level of local city governments. The housing slump and mortgage foreclosures cut into local city revenue, especially in the United States, where public services disproportionately rely on property taxes (Peck 2012; Donald et al. 2014). At the local scale, the Great Recession had an uneven effect, with mass foreclosures leading to the clustering of low-income groups in particular neighborhoods, and areas harboring many ethnic or racial minorities undergoing a rapid process of decline (Zwiers et al. 2016). These regressive trends also, at times, opened space for unorthodox and countervailing forms of urbanism in newly vacant spaces (Tonkiss 2013).

Since the height of the crisis, studies have used various quantitative and qualitative methods to assess the effects of the Great Recession on the urban landscape. In their study of Phoenix, Pfeiffer and Lucio (2015) used property-level data on real estate transactions and subsidized housing vouchers to demonstrate that foreclosed homes purchased by investors expanded the geography of opportunity for low-income renters. Using city data for Philadelphia, Moye (2014) showed how the pre-crisis housing boom allowed a modest process of racial integration to occur, with some African-Americans moving into predominantly white neighborhoods. Minn et al. (2015) used Landsat imagery to assess the upkeep of lawns and gardens in urban and suburban lots in Maricopa County (AZ) before and after the housing crisis. These case studies reveal the diversity of processes at work in different US cities during and after the Great Recession. However, they fail to capture a crucial dimension of the production of urban spaces: the visible changes occurring at the street level.

In her classic study of urban life, The Death and Life of Great American Cities, Jane Jacobs used street-level observation to understand the social processes that make a neighborhood safe (Jacobs 1992; Perrone 2019). Her observation of city life informed her theory of how the built environment shapes how people act on the street. This type of visual assessment of urban life, rooted in the experience of city dwellers, is also present in other influential studies. In The Image of the City, Lynch (1960) analyzed how people perceive, and find their way through, urban space by means of visual elements such as paths and edges. These studies and others in this vein conclude that the visible, physical form of public space is crucial in reflecting and producing the human condition. People experience urban spaces through their senses, primarily sight, and images of a city can provide essential information about the development of a cityscape. This project draws on the insights of Jacobs and Lynch to propose a visual assessment of the effects (e.g., gentrification, decay) of the Great Recession on the landscape of a city (Criekingen and Decroly 2003). It does so at scale, employing machine learning (Reades et al. 2019) and computer vision techniques to expand spatially and temporally what a surveyor, or a group of surveyors, could capture solely by walking the streets of the selected cities (Yoshimura et al. 2018).

As one of the fastest-growing urban areas in the United States in recent decades, Raleigh offers an ideal case study for testing an image-based tool for assessing urban change. Between 2010 and 2019, the population of Raleigh grew 17 percent, from about 404,000 inhabitants to 474,000. The city also grew spatially, particularly in the suburbs, reflecting the urban sprawl typical of cities in the North American Sun Belt. Like many other North American cities growing spatially, Raleigh also experienced a period of urban decline in its core areas before the 1980s. This decay began to be reversed in recent decades, starting with the listing of the Oakwood neighborhood as one of the first National Register of Historic Places districts in North Carolina. Designated in 1975, the nineteenth-century neighborhood of Victorian homes became an example of the use of historical preservation tools to prevent highway development and disruption (Schulman 1991; Raleigh Historic Development Commission 2021). From the 1990s on, Raleigh has also experienced accelerated growth driven by the boom in the technology industry stemming from the consolidation of the Research Triangle Park as a major technological hub in the United States (O'Mara 2005; Cummings 2020). Such growth has induced changes in Raleigh's urban core, with the gentrification of traditional working-class neighborhoods in the city's south and eastern areas, a process that intensified despite the effects of the Great Recession (Badger et al. 2019). As a mid-sized city that has experienced different types of suburbanization, urban decay, and, more recently, gentrification, Raleigh offers an excellent case study for our method of assessing urban change through the visual analysis of a changing streetscape.

2.2 Computer vision and the built environment

In recent years, computer science and urban studies researchers have increasingly applied computer vision and machine learning techniques to the visual study of cities and other types of locations. They have employed pre-annotated datasets of geo-tagged images to train convolutional neural networks (CNNs) to identify visual features and then used these models to analyze larger datasets of unannotated images. For example, researchers used trained CNNs to define the visual identities of 21 cities across the globe (Zhou et al. 2014); to determine the location of a photo with no geographical metadata (Weyand et al. 2016); and to measure visual similarities between architectural designs by different architects (Yoshimura et al. 2018). The rapid growth of nearly comprehensive street-level geo-tagged image panoramas for most cities on the planet has offered a new source of consistent urban images for studies that use computer vision. An early study by Doersch et al. (2012) used GSV images to discover the architectural elements that define specific urban spaces. They found, for example, that what makes a city like Paris look like Paris is not so much the presence of famous landmarks such as the Eiffel Tower but rather a set of stylistic architectural elements reproduced throughout Paris and absent in other cities. Another project extracted 22 million distinct vehicles from GSV images of 200 US cities to correlate the year, make, model, and body type of cars and trucks with socioeconomic and voter preference data; it found that pickup trucks are spatially related to areas with stronger Republican voter turnout across the United States (Gebru et al. 2017).

Another avenue of inquiry opened by machine learning techniques is the study of the perception of urban spaces. The Place Pulse 1.0 project (Salesses et al. 2013) was one of the first computational studies of the visual perception of cities. Using GSV images from New York and Boston and captured images from two Austrian cities as data, this crowdsourcing project invited participants to rank pairs of streetscape images in response to the question “Which place looks safer?” The aim was to capture how people perceive some places as safer than others, a dimension that only partially correlates with quantitative measures of income and crime. The StreetScore project (Naik et al. 2014) took this a step further. The researchers, led by Nikhil Naik, used the original Place Pulse data, produced by humans on a limited dataset of GSV images from two cities (New York and Boston), to train a computational model of visual classification, intending to reproduce a similar ranking of perceived visual safety for a larger corpus of GSV images of 21 cities in the US Northeast and Midwest. After classifying the images, they confirmed the predictive power of the StreetScore algorithm by using income data as a proxy metric of accuracy for cities absent from the original crowdsourced study. They also produced an interactive online visualization featuring five of the cities in their study. They concluded that scores of perceived safety from a crowdsourced study (e.g., Place Pulse) could be used to train an algorithm to accurately predict the perceived safety scores of streetscapes not in the original dataset. Quercia et al. (2014) conducted a similar study, using crowdsourced annotation of image pairs but correlating it with visual features of the streetscape images such as color and texture. Others have expanded the work of StreetScore, using different CNNs to rank the crowdsourced image annotations (Porzi et al. 2015); combining data on visual perception of the urban landscape with other datasets such as mobile phone use or real estate value (De Nadai et al. 2016; Fu et al. 2019); or adding crime event records to test the difference between the perception and the reality of safety (Liu et al. 2017).

However, StreetScore had two critical limitations that follow-up studies addressed. First, the algorithm did not scale: it was trained with crowdsourced data from Place Pulse 1.0 produced with images of only New York and Boston, so while StreetScore performed well in predicting the perception of safety for cities in the Northeast and Midwest United States, it lacked accuracy for other urban areas. Place Pulse 2.0 (Dubey et al. 2016) attempted to solve this problem by expanding the training dataset to 56 cities in 28 countries, produced with a combination of crowdsourcing and machine learning techniques, and by introducing a new set of questions to measure perceived beauty, boredom, and wealth. Second, StreetScore did not incorporate temporal change. This latter limitation was first addressed by Naik et al. (2017), who took advantage of the fact that GSV captures images of the same streets in different years. Using GSV images from 2007 and 2014, they calculated StreetScore indexes for street blocks in Baltimore, Boston, Detroit, New York, and Washington DC and derived the StreetChange index, the difference between the StreetScores of the two years: a positive StreetChange value indicates an upgrade in physical appearance, a negative value a decline. Recently, Ilic et al. (2019) followed the Place Pulse protocol to create their own crowdsourced dataset of street-view annotations for Ottawa, asking participants about property improvement instead of safety. Their interest was in mapping gentrification, a process characterized by changes in space over time; they therefore incorporated time in their analysis, using CNNs to detect change in GSV images over five different periods between 2007 and 2015.

3 Data collection and annotation

We created a dataset of GSV images from a 64 km² square encompassing the urban core of Raleigh, NC. For each GSV panorama, we algorithmically corrected the perspective and split the 360° image into four two-dimensional images (i.e., front, back, and two sides).
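The projection step can be sketched as follows. This is a minimal illustration of one standard way to cut rectilinear views out of an equirectangular panorama, not necessarily our exact production pipeline; the 90° field of view and 640-pixel output size are assumptions.

```python
import numpy as np
import cv2  # opencv-python

def perspective_view(pano, yaw_deg, fov_deg=90.0, size=640):
    """Render one rectilinear view from an equirectangular panorama.

    pano: H x W x 3 equirectangular image; yaw_deg: viewing direction
    (0 = front, 90 = right, 180 = back, 270 = left).
    """
    h, w = pano.shape[:2]
    f = (size / 2) / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels
    # Pixel grid of the output view, centered on the optical axis.
    x, y = np.meshgrid(np.arange(size) - size / 2, np.arange(size) - size / 2)
    # Ray direction per pixel: camera looks along +z, rotated by yaw about vertical.
    yaw = np.radians(yaw_deg)
    dx = np.cos(yaw) * x + np.sin(yaw) * f
    dz = -np.sin(yaw) * x + np.cos(yaw) * f
    dy = y
    # Spherical coordinates of each ray map to a source pixel in the panorama.
    lon = np.arctan2(dx, dz)                # [-pi, pi], 0 at the panorama center
    lat = np.arctan2(dy, np.hypot(dx, dz))  # [-pi/2, pi/2], 0 at the horizon
    map_x = ((lon / (2 * np.pi) + 0.5) * w).astype(np.float32)
    map_y = ((lat / np.pi + 0.5) * h).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_WRAP)

# Four two-dimensional views per panorama: front, right, back, left.
# views = [perspective_view(pano, yaw) for yaw in (0, 90, 180, 270)]
```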

Algorithmic annotation—Our project leverages two popular machine-learning techniques and open-source software libraries. The first, used to build the base image feature datasets, is a CNN-based computer vision object detection system (currently YOLOv3) (Redmon et al. 2018). We train detectors for the image features of interest (e.g., what a building facade, commercial sign, window, or mailbox looks like), which are then used to autonomously mine photographic archives, extract these features together with their geolocation and temporal metadata, and deposit the results in a database. The second, used to process the image feature datasets, is an ML technique for dimensionality reduction, either t-SNE (van der Maaten et al. 2008) or UMAP (McInnes et al. 2018), useful for visualizing high-dimensional datasets without human training.
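As a minimal sketch of the dimensionality-reduction step, the snippet below projects a matrix of per-image feature vectors into 2D with the umap-learn library. The input file name and the UMAP parameters are illustrative assumptions, and the detection step that produces the feature vectors is omitted.

```python
import numpy as np
import umap  # umap-learn package

# Assumed input: an (n_images, d) array of feature vectors, one per detected
# street-level feature (e.g., backbone activations for each facade crop).
features = np.load("facade_features.npy")

# Project to 2D for the cluster views; parameter values are illustrative.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                    metric="cosine", random_state=42)
embedding = reducer.fit_transform(features)  # shape: (n_images, 2)
np.save("facade_embedding.npy", embedding)
```

t-SNE can be swapped in with the same input and output shapes (e.g., scikit-learn's sklearn.manifold.TSNE).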

Human annotation—It has been shown that statistical models predict answers to subjective questions better when trained on humans' relative comparisons of items rather than on ordinal ratings of individual items (Yannakakis and Martinez 2015). A well-established and interpretable model of preference ranking is the TrueSkill algorithm, which learns to rank based on a trained Bayesian factor graph (Herbrich et al. 2006). For this study, we collected 6000 pairs of street-level images and compiled over 7200 pairwise ratings from US- and Canada-based annotators on Amazon's Mechanical Turk platform. The 6000 pairs repeated individual images in both positions (left and right), pairing each with images of differing feature profiles, and some image pairs were assigned to multiple annotators to measure agreement. Each annotator was shown a pair of images, A and B, and asked to answer five questions, on a four-point forced-choice scale (A, B, Both, Neither), about which location looked safer, wealthier, newer, better maintained, and more occupied.

Household income—To compare algorithm predictions to external markers, we took household income by region from census data. These household income data cover the periods 2011–2013 and 2016–2017, chosen based on the total number of available street-level images and their distribution across a variety of Raleigh neighborhoods in our image dataset. The income dataset is aggregated by region (i.e., census blocks), so we assigned an average income to each image based on the geolocation of the image within an income region. We will make the income-annotated data, along with the associated scripts for geolocation and income assignment, available to the research community for reproducibility and further research.
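The assignment itself is a point-in-polygon spatial join. Below is a minimal sketch using GeoPandas; the file and column names are hypothetical stand-ins for the census-block polygons and per-image geolocations described above.

```python
import geopandas as gpd

blocks = gpd.read_file("census_blocks_income.geojson")  # polygons + median_income
images = gpd.read_file("image_locations.geojson")       # points + image_id, year

# Reproject the image points to the blocks' CRS, then tag each image with
# the income of the census block that contains it.
images = images.to_crs(blocks.crs)
annotated = gpd.sjoin(images, blocks[["median_income", "geometry"]],
                      how="left", predicate="within")
annotated[["image_id", "year", "median_income"]].to_csv(
    "image_income.csv", index=False)
```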

4 Bayesian model for subjective perception measures

To train a prediction model based on pairwise comparison ratings from human annotators, we adapt the TrueSkill algorithm (Minka et al. 2018) developed at Microsoft for generating player rankings on its multiplayer game platform. The algorithm assumes that all image ratings fit a normal distribution with respect to the feature in question (safety, wealth, occupancy, maintenance, and age). Each image has a rating μ and a confidence σ². The Bayesian factor graph is then trained on human ratings of image pairs by updating the rating and confidence of each image: we treat each image as a player playing against the other image in the pair, with the human rating as the result of their matchup. The resulting trained model provides predictions for unseen images and can rank all the images in our dataset.
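A minimal sketch of this adaptation, using the open-source trueskill Python package, is shown below. Treating "Both" and "Neither" responses as draws is our illustrative reading of the four-point scale, and the draw probability is an assumption.

```python
import trueskill

env = trueskill.TrueSkill(draw_probability=0.15)  # draw rate is illustrative
ratings = {}  # image_id -> Rating(mu, sigma)

def rating(img):
    # Every image starts at the environment's prior N(mu0, sigma0^2).
    return ratings.setdefault(img, env.create_rating())

def update(img_a, img_b, choice):
    """Update two image ratings from one pairwise judgment.

    choice: 'A' or 'B' names the preferred image; 'Both' and 'Neither'
    are treated as draws in this sketch.
    """
    ra, rb = rating(img_a), rating(img_b)
    if choice == "A":
        ratings[img_a], ratings[img_b] = trueskill.rate_1vs1(ra, rb, env=env)
    elif choice == "B":
        ratings[img_b], ratings[img_a] = trueskill.rate_1vs1(rb, ra, env=env)
    else:
        ratings[img_a], ratings[img_b] = trueskill.rate_1vs1(ra, rb,
                                                             drawn=True, env=env)

# After all annotations, a conservative score such as mu - 3*sigma yields a
# full ranking of the images along one perception dimension.
```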

To illustrate the approach's usefulness, we analyzed the rated images in our dataset from the 64 km² area in the Raleigh urban core. Figure 2 shows the distribution of the model's predicted safety values overlaid on the geolocations of the respective images, along with sample images from different score levels. Consistent with real-estate, economic, and crime data, the region closer to downtown and on the southeast side is perceived as less safe. There are also spots on the northwest side perceived as less safe; these images correspond to industrial developments. Our tool affords filtering and region selection; for illustrative purposes we divided the region into four quadrants, though the tool also supports more specific groupings such as neighborhoods or subdivisions.

Fig. 2 (Above) Distribution of predicted safety values on a map of Raleigh. (Right) Images and corresponding predicted safety scores; the four images shown illustrate examples from four levels of safety

5 Interactive visualization and analysis

The data collection scripts and the ranking algorithm are not readily accessible to urban studies experts, and the subjective nature of the task makes it challenging to model error in terms of specific features. To better integrate the computational tools and algorithms with scholars' analysis methods, we designed a novel interactive interface called the Street-Feature Location Mapping Tool (SFLMT). The tool visualizes images clustered in the visual feature space and on the geospatial map, with overlays of predicted scores.

5.1 The street-feature location mapping tool

This project developed an interactive visualization tool featuring a two-panel mapping interface: the right panel holds a geospatial map with selection and annotation tools, while the left panel holds an assortment of 2D/3D scatter plots and cluster-grid thumbnails of the street-level photographic corpora, with similar selection tools. Expert users work back and forth between the two panels to select and prune a collection of images, visualized as a spatial heatmap in the right panel and a cluster visualization in the left. These selections can be exported and saved for later use (Fig. 3). Combined with maps based on our enhanced GSV analysis, the tool allows researchers to compare and contrast effects between urban areas, reveal narratives and dynamics, and extract data derived from GSV coupled with other spatial information. Ultimately, this visualization tool offers a system that applies artificial intelligence/machine learning modalities and computer vision techniques to massive multi-year databases of street-level photography, providing a platform with several novel interactive interfaces for navigating and exploring data and extracting visual features for the case study.

Fig. 3 Street-Feature Location Mapping Tool: grid-based clustering of commercial facades in a 64 km² region of Raleigh, NC

Fig. 4 (Top left) Images clustered in feature space and visualized on a grid; the highlighted group represents images of churches, clustered by the algorithm based on architectural features. (Top right) Geolocations of the same images. The SFLMT allows interactive analysis in both feature and location spaces

For such image-corpus analysis, the use of unsupervised ML dimensionality-reduction techniques (e.g., t-SNE, UMAP, or PCA; van der Maaten et al. 2008; McInnes et al. 2018) and supervised ML techniques such as CNNs is well established, as are the functionalities of GIS for spatial mapping and analysis. However, spatially tagged photographic corpora are a unique use case for image analysis, and the geospatial dimension can provide critical context for finding meaning in ML-derived query results. SFLMT's hybrid interface therefore offers users the capacity to work fluidly between the ML-produced computational feature space (CFS) and a more traditional cartographic space (CS), allowing complex queries and explorations of spatially tagged image corpora (each image data point has a representation in both spaces). Users can explore relationalities and potential narratives within the corpus through a bi-directional process of navigation, selection, and pruning, tweaking the parameters of each map to curate a collection of images (and its subsequent geospatial mapping) out of the corpus.

With SFLMT, users can move back and forth between computational and cartographic spaces to find correlations between the configurations of the data points in each spatial context (their relationalities, densities, and absolute positioning). They can selectively plot these features—along with other potentially relevant data points (socioeconomic data, etc.)—to locate feature clusters (singly or in compound queries) across urban landscapes. Conversely, users are able to delineate geographic regions in the CS and see mappings within the CFS. Researchers can pursue open-ended questions as they explore and modify the properties and parameters of each space and visualize patterns between representations.

In our initial design, the CFS is contained in the left panel; it is a graph space where the images are located either as points in a 2D point cloud or scatter plot (graphing two desired image features or location-related variables) or compressed into a grid raster representation (for ease of navigation and manipulation, by necessity distorting the original plot). The CS is contained in the right panel; it is a 2D map representing the locations of the photographs, with positional pins and added visualization features such as overlayed density heatmaps (emphasizing the densities and gradients of the image distributions). The panels have identical tools for selecting and deselecting items or regions, undoing actions, providing research annotations, and saving selections; selection and deselection actions in any panel affect all panels (Fig. 4).

Our implementation of the SFLMT (HTML/JavaScript/WebGL) allows interactive visualization of clusters of tens of thousands of street-level image features (e.g., building facades, street signs, people, and vehicles) or more granular street-level visual features (e.g., architectural details), scraped autonomously from geolocated street-level photographic corpora such as GSV. As described in Sect. 3, the base CFS image feature datasets are built with a CNN-based computer vision object detection system (currently YOLOv3; Redmon et al. 2018), and t-SNE is used to analyze all images in a given feature corpus (e.g., all building facades) and autonomously cluster visually similar images in a 2D spatial plot. Without training, it can roughly cluster building facades into groupings suggesting architectural style or usage. Initial experiments on one urban corpus (all GSV images from a 64 km² region of Raleigh, NC, within a three-year period) found feature clusters of Victorian, Neoclassical, and Modern Ranch houses; functional categories like skyscrapers, strip malls, and churches; oddities such as cottages with white picket fences; and even the beginnings of a cluster of abandoned/dilapidated buildings. The adjacencies are not perfect: these techniques are as much an art as a science, and they require interfaces that let the user organize, cull, or reconfigure the tools and results, teasing out valuable patterns. Phenomena such as gentrification, urban decay, the spread of architectural styles, the use of different building materials, textual analysis of urban signs, social uses of public space, and urban flora are explorable through these techniques and types of interfaces.
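One common way to build the grid-raster view described above is to solve an assignment problem between embedded points and grid cells. The sketch below uses SciPy's optimal-assignment solver; it is an illustrative approach rather than our exact implementation, and its dense cost matrix limits it to corpora of modest size.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def snap_to_grid(embedding):
    """Assign each embedded image to a unique cell of a near-square grid,
    minimizing the total squared displacement from its 2D position."""
    n = len(embedding)
    side = int(np.ceil(np.sqrt(n)))
    # Normalize the embedding into the unit square.
    emb = embedding - embedding.min(axis=0)
    emb = emb / (emb.max(axis=0) + 1e-12)
    # Candidate cell centers on a side x side lattice (side**2 >= n).
    ticks = (np.arange(side) + 0.5) / side
    gx, gy = np.meshgrid(ticks, ticks)
    cells = np.column_stack([gx.ravel(), gy.ravel()])
    # Dense (n x side**2) cost matrix: fine for thousands of images,
    # too large for very big corpora.
    cost = ((emb[:, None, :] - cells[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return cells[cols[np.argsort(rows)]]      # grid position per input image
```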

While this prototype deals specifically with streetscape image features, the SFLMT could be used to work with any geospatially tagged digital media objects (not just images). The critical innovation involves linking the CS with one or multiple CFSs for interactive engagement and curation by the user. For example, an ornithologist might interactively explore a collection of geospatially tagged bird call audio samples, working between a CFS (for example, a t-SNE clustering plot of audio data points by Tan and MacDonald (2018)) and the CS showing sample locations, to curate a collection of bird calls, and potentially ascertain territorial regions of a particular species. This curation task might be improved by adding other CFS panels, which analyze other aspects of the audio samples, the combined linked spaces allowing a more refined curation.

6 Analysis of street-level markers of recession in Raleigh’s urban core

We analyzed the distribution of predicted ratings along the dimensions of safety, wealth, occupancy, maintenance, and age in and around Raleigh. The upper-right chart in Fig. 5 shows the distribution of predicted safety scores of images from the four axis-aligned quadrants on the map. We rank all images by their absolute safety scores and group them into four quartiles, from highly safe (Blue/Level 4) to unsafe (Red/Level 1); a minimal sketch of this grouping appears below. The northwest quadrant of Raleigh has the most images labeled safe by the algorithm, and the southeast quadrant has the fewest. There are fewer images labeled unsafe in the northern half of Raleigh than in the southern half.
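The quartile-by-quadrant counts charted in Fig. 5 can be computed with a few lines of pandas; the file and column names here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# One row per image: a predicted score and a geolocation (names assumed).
df = pd.read_csv("predicted_ratings.csv")  # columns: safety, lat, lon

# Axis-aligned quadrants around the center of the study area.
lat0, lon0 = df["lat"].median(), df["lon"].median()
ns = pd.Series(np.where(df["lat"] >= lat0, "N", "S"), index=df.index)
ew = pd.Series(np.where(df["lon"] >= lon0, "E", "W"), index=df.index)
df["quadrant"] = ns + ew

# Global quartiles of the score: Level 1 (unsafe) .. Level 4 (highly safe).
df["level"] = pd.qcut(df["safety"], 4, labels=[1, 2, 3, 4])

# Image counts per quadrant and level, as charted in Fig. 5.
counts = df.groupby(["quadrant", "level"], observed=True).size().unstack(fill_value=0)
print(counts)
```

The same grouping applies to the wealth, occupancy, maintenance, and age scores discussed below.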

Fig. 5 Distribution of ratings per quadrant in the Raleigh urban core

The mid-right chart in Fig. 5 shows the distribution of predicted wealth scores of images from the four axis-aligned quadrants on the map. We rank all images by their absolute wealth scores and group them into four quartiles, from wealthy (Blue/Level 4) to poor (Red/Level 1). Results indicate that the northeast region has the most images associated with higher wealth, while the southeast region has the fewest images labeled as indicating wealth.

The mid-left chart in Fig. 5 shows the distribution of predicted occupancy scores of images from the four axis-aligned quadrants on the map. We rank all images by their absolute occupancy scores and group them into four quartiles, from occupied (Blue/Level 4) to unoccupied (Red/Level 1). Results indicate predominantly moderate occupancy ratings across all four quadrants, unlike the two previous indices (safety and wealth). Furthermore, the southern half of the urban core presents a higher rate of unoccupied and moderately unoccupied buildings than the northern quadrants.

The bottom-left chart in Fig. 5 shows the distribution of predicted maintenance scores of images from the four axis-aligned quadrants on the map. We rank all images by their absolute maintenance scores and group them into four quartiles, from well maintained (Blue/Level 4) to not maintained (Red/Level 1). Results indicate higher maintenance ratings for the northern half than for the southern half, consistent with the safety, wealth, and occupancy ratings.

The bottom-right chart in Fig. 5 shows the distribution of predicted new-development scores of images from the four axis-aligned quadrants on the map. We rank all images by their absolute newness (age) scores and group them into four quartiles, from newer (Blue/Level 4) to older (Red/Level 1). Results indicate newer development in the northeast region and older areas in the southeast.

6.1 Comparison with income data

To directly compare algorithm-predicted ratings with other economic markers, we annotated each image with household income data at the census-block level. We chose the periods 2011–2013, with 2393 images in our dataset, and 2016–2017, with 1883 images, and labeled each image with the median household income of its region based on geolocation. To analyze the relationship between the perception scores predicted by the algorithm and income, we first take the 5930 images for which neighborhood income data are available and run a Pearson correlation test on safety perception scores without considering the time feature. This yields a correlation coefficient of 0.41 with a p value < 0.001, indicating a moderate positive correlation: higher perceived safety scores accompany higher household income values.
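The pooled and per-period tests are one-liners with SciPy. The sketch below assumes the per-image income table from Sect. 3, with illustrative file and column names.

```python
import pandas as pd
from scipy.stats import pearsonr

# One row per image: predicted safety score, block income, and time period.
df = pd.read_csv("image_income.csv")

r, p = pearsonr(df["safety_score"], df["median_income"])
print(f"pooled: r = {r:.2f}, p = {p:.3g}")  # paper reports r = 0.41, p < 0.001

# Per-period tests, as in Table 1.
for period, grp in df.groupby("period"):
    r, p = pearsonr(grp["safety_score"], grp["median_income"])
    print(f"{period}: r = {r:.2f}, p = {p:.3g}")
```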

We next split the dataset into the two time periods and run the correlation test on each separately to see whether the relationship between income data and ratings differs significantly. Table 1 shows the results of the correlation tests for the separate time periods for the five perception ratings. All results are statistically significant. Safety, Wealth, and Maintenance show moderate correlations, while Occupancy has a low correlation coefficient. The correlation of the newness (Age) rating with income differs markedly across the two time periods, being higher in 2016–2017. This may be due to the construction of several new condominium buildings downtown in 2016–2017 that were not there in 2011, particularly in the east and southeast parts of the city.

Table 1 Pearson correlation tests between predicted ratings and household income data for the two time periods, 2011–2013 and 2016–2017

7 Discussion and relevance to future work in AI and society

By studying the effects of the last global economic crisis (2008–2016) on the visible built environment of urban neighborhoods in Raleigh, this project offers a case study in the use of automated visual data analysis at scale and of predictive models for how urban spaces develop in response to global economic crises, including the ongoing Covid-19 disruption. We provide a systematic process for data collection, annotation, development of predictive models, and validation of the constructed AI models. Specifically, we outlined the challenges of collecting panoramic street-view imagery across a period of time, the design of crowdsourced human annotation tasks, a Bayesian model for automatically predicting high-level features such as safety, occupancy, and maintenance from street-level imagery, and a validation step comparing predicted ratings to median income as an economic indicator. Our results indicate that such AI models are promising and add a tool to the human analyst's toolbox, but interpreting the model's results and using them to frame an analysis requires additional skill in understanding and characterizing the noise in the data and the accuracy and confidence of the computational model. Safety scores in Raleigh neighborhoods correlate with income levels, and the rapidly changing downtown neighborhoods with new condominium buildings drive the changes in the correlation between perceived age and income. The statistically significant relationships in both 2011–2013 and 2016–2017 show that income markers are reflected at the street level within each time frame. A deeper comparison across time frames is challenging because the quantity and quality of data are not comparable across periods; we aim to follow up with deliberate data collection and annotation efforts that will support better comparative models across future periods.

Crime data would also enhance analysis and prediction, but current crime-reporting datasets for our recent analysis period are sparse and unstructured. Departments of public safety are increasingly collecting more structured and nuanced data with better reporting, a promising avenue both for the work presented in this paper and for guiding the improvement of our public data infrastructure. Early prediction of phenomena that signal gentrification could be helpful as a tool for city leaders in policymaking and interventions. It is worth noting that reported income data lag behind the visible signs of change apparent in photographs.

The project also introduces a methodology for collaboration among humanities and social scientists, designers, and computer scientists. This experience has provided interesting insights into communication, collaboration, and trust in systems. Computational models, even as black boxes, can be useful if users can test hypotheses and validate them against known data. For humanities scholars in particular, it is crucial to articulate research questions and frame analytical arguments based on an understanding of the limitations of the features expressed in the computational models.

Careful design of interactive interfaces is essential so that they are both intuitive and expressive. The current project opens up several possibilities. We plan, for example, to compare the visible effects of the crisis at the street and neighborhood scales within 15 metropolitan areas in the three largest countries in the Americas: the United States, Mexico, and Brazil. Studying these 15 cities will help us understand how global economic recessions more broadly affect the built environment of cities at the street and neighborhood levels. Our selected group of 15 metropolitan areas offers a diverse array of population sizes, densities, and economic activities, allowing us to examine the impact of a global crisis on different urban neighborhoods and how these effects are tied to factors such as development vs. underdevelopment, industrialization vs. deindustrialization, and centrality vs. peripherality in the international system.

The project offers insights into the use of AI technology for three sets of stakeholders. First, it provides tools to help urban planners, policymakers, and government agencies understand the dynamics of phenomena such as the 2007–2009 economic downturn, allowing them to incorporate the crisis's lessons into mitigating the urban effects of future recessions, including the current Covid-19 disruption. Second, it introduces a methodology that uses computation and new data sources to expand the toolkits of geographers, historians, and other social scientists. Finally, it presents innovative geovisualization methods that integrate traditional GIS and multidimensional data to facilitate exploration and interpretation by scholars in multiple disciplines.