Geospatial visual appearance depends on many factors, such as built structures (roads, buildings, and sidewalks), greenery, and openness, as well as the presence of different visual objects and their ratio in an environment [1, 2]. The visual appearance of built environments is inherently related to various socio-economic outcomes, such as population concentration, economic disparity, prevalence of crime, and pedestrian safety [1, 3,4,5,6,7].

In this study, we define geospatial visual diversity as a criterion for understanding the visual appearance of a geographic area. This definition is guided by previous studies [8, 9]. According to Stamps III [8], geospatial visual diversity is an important component of environmental design. It contributes to understanding scenic beauty and esthetically pleasing landscapes [9,10,11], comparing neighborhoods [1, 5, 12], and assessing subjective visual preference [13]. A few existing approaches [7, 9, 14, 15] quantified visual diversity with different metrics, such as entropy [8].

Recently, AI-based image segmentation tools have been used to extract semantic object information from street-view images, which has eased the burden of data access and computing [1, 10, 12, 16,17,18]. In the existing work, however, the semantic information extracted from street-view images has not been used to compute geospatial visual diversity.

In this study, we compute and investigate a variety of visual diversity indices based on the AI tools and a large set of street-view images. Further, we compare multiple indices to see which index is more suitable and how the indices relate to multiple social phenomena.

This study aims to advance urban technology in four ways:

  1. Presenting a computational framework of geospatial visual diversity based on semantic segmentation, which extracts semantic information from a large set of street-view images of a neighborhood;

  2. Computing and comparing multiple types of visual diversity indices, including both single-category and multi-category diversity indices;

  3. Validating the reliability of the computed diversity indices through a study with human evaluators; and

  4. Extracting socio-demographic information, including economy, population, and crime metrics, and studying the correlations between the visual diversity indices and the socio-demographic variables.

The contribution of this research is twofold: (1) measuring visual diversity from street-view images for urban studies; and (2) recognizing implications for urban neighborhood planning based on visual diversity. Of specific value to the current research, this study demonstrates the process and value of examining relationships between street-level urban design qualities and property values.

Related work

The term diversity helps to quantify and compare social phenomena [19]. Measuring diversity is crucial in several disciplines [20], such as economics [21], ecology [22], urban planning [9], and social studies (e.g., culture). Diversity can also help to understand and assess the distribution of resources of an area, such as greenery, water, wetlands, and land use [23,24,25]. It also relates to human movement and urbanization [26, 27] as well as home prices [9].

De Jonge [28] interviewed residents and found that most people prefer to live in areas that are more visually diverse. Stamps III [8, 29,30,31] reported similar findings, indicating that when an urban area has more visual diversity, people find the area more appealing. Visual diversity can be related to, and can improve, livability in urban planning [9, 32].

Many past studies have used street-view images to explore information about the environment. Examples include the presence of openness [33], healthy areas [5], green areas [33, 34], crime-prone areas [1, 35], local businesses [36], land use and vacant areas [37], urban effect over time [38, 39], voting pattern analysis [40], COVID-19 affected areas [41], and landscape analysis [18, 39]. Most studies have used street-view images because they are readily available for almost all cities and spare researchers the time needed to walk around and collect images from different geographic locations.

The aforementioned studies focused on particular problems and used street-view images as their data source, with research goals that differ from those of this study. One notable study, however, is similar to ours: Wen, Liu, and Wu [18] used an entropy weighted method to quantify ecological matrices. While the study by Wen et al. [18] contributes to urban planning by focusing on ecological aspects, the current study concentrates on other social aspects and uses both multi-category and single-category diversity indices. The literature review indicates a gap in understanding how the visual appearance of a neighborhood relates to socio-demographic variables. This study therefore also presents its results in terms of social aspects, such as population, economics, and crime.

In the past, the visual content of a map, such as landscape and land cover type, served as primary sources of information that contributed to measuring visual diversity [42]. In a recent study, Zhang and Dong [9] used the horizontal green view index (HGVI) to measure the visual diversity of greenery from street-view images. In this study, we instead study a set of diversity indices from multiple visual categories computed from street-view images, and we further examine their relationships to social variables.

Urban data and AI-based street-view image segmentation

We collected two types of data: (1) socio-demographic information; and (2) street-view images from a U.S. metropolitan area. The details of our data collection procedure are explained in the following sections. Please see sections "Open socio-demographic information" for socio-demographic data collection, "Open street-view image data" for street-view image data collection, and "Reliability study with human evaluator" for human evaluator data collection.

Open socio-demographic information

For the socio-demographic information of neighborhoods, the researchers used open data from Zillow. Zillow is an online real estate marketplace [43]. For crime-related information, open data from the FBI Uniform Crime Report were obtained [44]. In addition, other socio-demographic information, such as population size and population density per square mile, was gathered from open data at Area Vibes. Area Vibes is a website that measures various neighborhood population parameters [45, 46].

Open street-view image data

Using a neighborhood’s boundary information downloaded from Zillow, the street network of the neighborhood was retrieved from OpenStreetMap [47]. Next, a large set of locations was sampled on the street network. In particular, each street was sampled with points spaced 20 meters apart (see Fig. 1) so as to capture each block of the neighborhood [10]. A similar approach was applied by previous studies as well [1, 39]. The blue dots in Fig. 1 represent the sampled geolocations, which are 20 m apart from each other.
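The sampling step can be illustrated with a minimal sketch. It assumes that a street is available as a polyline whose vertices are already expressed in a projected, meter-based coordinate system; the function name `sample_along_polyline` and the input layout are hypothetical and are not taken from the study's implementation.

```python
import math

def sample_along_polyline(points, spacing=20.0):
    """Return sample points every `spacing` meters along a polyline.

    `points` is a list of (x, y) vertices in a projected, meter-based
    coordinate system (an assumption; raw lat/lng must be projected first).
    The first vertex is always included as the first sample.
    """
    if not points:
        return []
    samples = [points[0]]
    dist_to_next = spacing  # distance left to travel before the next sample
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already consumed on this segment
        while seg_len - pos >= dist_to_next:
            pos += dist_to_next
            t = pos / seg_len
            samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            dist_to_next = spacing
        # carry the unused remainder of this segment over to the next one
        dist_to_next -= seg_len - pos
    return samples
```

For a straight 50 m street, this yields samples at 0 m, 20 m, and 40 m; spacing is measured along the street, so it is preserved across bends.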

Fig. 1

A large set of sampling locations in a region

Then, we used the sampled geolocations to download street-view images with the help of the Google Street-View (GSV) Application Programming Interface (API). We were only interested in the side views (see Fig. 2). We ignored the front and back views because those would mostly include roads, cars, and the sky, which could dominate the diversity computation (see Fig. 2); instead, we were mostly interested in scenic views, which primarily consist of the side views of the street-view images [1].

Fig. 2

Street-view images from four different sides: left view; front view; right view; and back view

To obtain side-views of the streets we computed the heading of the street and then added 90 degrees and 270 degrees, respectively (see Eq. 1). This idea was adopted from [1]:

$$\begin{aligned} \theta&= {{\,\textrm{atan2}\,}}(x,y), \\ \text{where}\quad x&= \cos ({lat}_0)\times \sin (\Vert {lng}_0-{lng}_{2} \Vert ), \\ y&= \cos ({lat}_0)\times \sin ({lat}_2) -\sin ({lat}_0)\cos ({lat}_2)\times \cos (\Vert {lng}_0-{lng}_2\Vert ), \end{aligned}$$

where \(\theta\) is the heading at geolocation (\({lat}_1, {lng}_1\)), computed from the geolocation immediately before it, (\({lat}_0, {lng}_0\)), and the geolocation immediately after it, (\({lat}_2, {lng}_2\)).
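A minimal sketch of the heading computation follows. It uses the standard forward-azimuth (initial bearing) formula with a signed longitude difference, which may differ in detail from Eq. 1 as printed; the function names are hypothetical.

```python
import math

def heading_degrees(lat0, lng0, lat2, lng2):
    """Forward azimuth from point 0 to point 2, in degrees clockwise from north.

    Standard initial-bearing formula (an assumption about the intended
    computation); inputs are in decimal degrees.
    """
    phi0, phi2 = math.radians(lat0), math.radians(lat2)
    dlng = math.radians(lng2 - lng0)
    x = math.cos(phi2) * math.sin(dlng)
    y = (math.cos(phi0) * math.sin(phi2)
         - math.sin(phi0) * math.cos(phi2) * math.cos(dlng))
    return math.degrees(math.atan2(x, y)) % 360.0

def side_view_headings(theta):
    """Return the two side-view headings, 90 and 270 degrees from the street heading."""
    return ((theta + 90.0) % 360.0, (theta + 270.0) % 360.0)
```

For a street running due north (heading 0), the two side views face east (90) and west (270).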

Semantic segmentation

From each street-view image, a deep learning-based semantic segmentation tool, PSPNet [48], extracted the visual information of each of 19 categories, namely road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle, and bicycle. The segmented proportion for category i is computed as

$$\begin{aligned} c_i = \frac{\text{ The } \text{ number } \text{ of } \text{ pixels } \text{ in } \text{ category}_i}{\text{ Total } \text{ number } \text{ of } \text{ pixels }}. \end{aligned}$$
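The proportion \(c_i\) can be sketched as follows, assuming the segmentation output is available as a two-dimensional grid of integer category labels (one label per pixel). The function name and the label encoding are hypothetical.

```python
from collections import Counter

def category_proportions(label_mask, num_categories=19):
    """Return [c_0, ..., c_{num_categories-1}] for one segmented image.

    `label_mask` is a list of rows, each a list of integer category labels
    (an assumed encoding of the segmentation output).
    """
    counts = Counter(label for row in label_mask for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("label_mask contains no pixels")
    # c_i = pixels in category i / total pixels
    return [counts.get(i, 0) / total for i in range(num_categories)]
```

Because every pixel carries exactly one label, the proportions of all categories sum to 1 for each image.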

The PSPNet model, a state-of-the-art AI model at the time of this study, can reach 82.2% accuracy [48]. Figure 3 shows several examples of semantic segmentation results.

Fig. 3

Segmented images used in the content validation: (top) with low geospatial visual diversity and (bottom) with high geospatial visual diversity. Green: greenery such as trees and bushes. Dark gray: buildings. Light blue: sky or openness. Purple: roads and driveways. Light green: grass or vegetation. Beige: fences. Dark blue: cars. Red: person. Pink: sidewalk

Visual diversity indices

Visual diversity was computed using a selection of indices in two groups: (1) multi-category indices; and (2) single-category indices. We computed multi-category indices to understand how different categories holistically affect the environment, while single-category indices explored the effects of individual categories.

The PSPNet model provided the same 19 above-noted categories from each image: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle, and bicycle [10]. An initial qualitative review of these categories found that some were more “static” while others were more “transient.” Static categories consist of immobile objects, such as buildings, trees, sky, and fences. Transient categories include mobile objects, such as persons, cars, and trains.

We studied the 19 categories extracted by PSPNet in all images. Interestingly, we noticed that only five categories (i.e., road, building, vegetation, terrain, and sky) are normally distributed. The remaining categories are highly skewed and have high kurtosis values. Further analysis indicated that more than 97% of the images contained no “transient” categories. Moreover, the presence of “transient” categories is time dependent. For example, we might see more cars, bicycles, and motorcycles during working hours than on weekends or after work hours. As such, we only considered the five normally distributed categories in the diversity computation. For a given spatial unit (e.g., block, neighborhood, city), k street-view images in this unit were selected to compute the geospatial visual diversities shown below.

Multicategory diversity indices

The following section describes the multi-category diversity indices used in this study.

Simpson index

The Simpson index [49] considers a sum of individual categories with respect to the sum of all categories:

$$\begin{aligned} D = 1 - \frac{\sum n_i \times (n_i-1)}{N \times (N-1)}, \end{aligned}$$

where \(n_i = \sum c_i^k\) for category \(i \in [1,5]\) in all k images and \(N = \sum n_i\) .
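The Simpson index above can be sketched directly from its definition. The input layout (one list of category proportions per image) and the function name are assumptions made for illustration.

```python
def simpson_index(image_proportions):
    """Simpson index D = 1 - sum(n_i * (n_i - 1)) / (N * (N - 1)).

    `image_proportions` holds one list per image, each with the values
    c_i for the categories under study (assumed layout). n_i is the sum
    of c_i over all k images and N is the sum of all n_i.
    """
    n = [sum(column) for column in zip(*image_proportions)]
    N = sum(n)
    denominator = N * (N - 1.0)
    if denominator == 0:
        raise ValueError("index is undefined when N is 0 or 1")
    return 1.0 - sum(n_i * (n_i - 1.0) for n_i in n) / denominator
```

As a sanity check, the index is 0 when a single category accounts for everything, and it grows as the totals spread more evenly across categories.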

McIntosh index

The value of the McIntosh index [50] varies from 0 (no diversity) to 1 (extreme diversity):

$$\begin{aligned} M = 1 - \frac{N - \sqrt{\sum n^2_i}}{N - \sqrt{N}}, \end{aligned}$$

where \(n_i\) and \(N\) are defined as in the Simpson index.

Multiple-category entropy (MCE)

Considering the popularity of entropy as a diversity index (i.e., Shannon’s entropy \(H (p) = - \sum p_i \times \log _2 (p_i)\), where \(p_i\) is the probability of each category), we extended it to multiple categories as presented in Eq. 5:

$$\begin{aligned} H = -\sum _{i=1}^{n}\sum _{j=1}^{k}(p_1,p_2, \ldots , p_k)\log \left( \prod _{j=1}^{k} (p_1,p_2, \ldots , p_k)\right) , \end{aligned}$$

where p is the probability of a single category computed by \(p_i=\frac{c_i}{n_i}\) .

Single-category indices


One of the most-used diversity indices is Shannon’s entropy, hereafter simply entropy [51, 52]. Several recent studies have also used entropy to compute diversity [18, 52, 53].

Entropy is computed as

$$\begin{aligned} H (p) = - \sum p_i \times \log _2 (p_i), \end{aligned}$$

where p is the same as in Eq. 5. A comparable diversity index value can then be obtained as

$$\begin{aligned} D = e^{H}, \end{aligned}$$

where e is Euler's number (i.e., \(e = \lim _{n \rightarrow \infty }(1+1/n)^n\)) [54].
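The two expressions above can be sketched as follows, assuming the input is a probability vector that sums to 1. The code follows the text as written, with a base-2 logarithm for H and base e for D; the function names are hypothetical.

```python
import math

def shannon_entropy(p):
    """H(p) = -sum(p_i * log2(p_i)); zero-probability terms contribute 0."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

def entropy_diversity(p):
    """D = e^H, using H in bits as the text defines it."""
    return math.exp(shannon_entropy(p))
```

A uniform distribution over four outcomes gives H = 2 bits. Note that raising 2 (rather than e) to the power H would return exactly 4, the effective number of equally likely outcomes; with the mixed bases used here, D is a monotone transform of H rather than that count.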

Horizontal view index

Li, Zhang, Li, Ricard, Meng, and Zhang [55] and Zhang and Dong [9] used the Horizontal Green View Index (HGVI) to measure the greenery of an area. To generalize this index to any visual category, we call it the Horizontal View Index (HVI), which is computed as

$$\begin{aligned} HVI = \frac{\sum n_i}{\sum N_i} \times 100. \end{aligned}$$
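A minimal sketch of the HVI follows. The text does not define \(n_i\) and \(N_i\) explicitly, so the sketch assumes that \(n_i\) is the number of pixels of the chosen category in image i and \(N_i\) is the total number of pixels in image i; both the interpretation and the function name are assumptions.

```python
def horizontal_view_index(category_pixels, total_pixels):
    """HVI = sum(n_i) / sum(N_i) * 100 over a set of images.

    `category_pixels[i]` is the assumed n_i (pixels of the category in
    image i) and `total_pixels[i]` the assumed N_i (all pixels in image i).
    """
    if len(category_pixels) != len(total_pixels):
        raise ValueError("both inputs must cover the same images")
    denominator = sum(total_pixels)
    if denominator == 0:
        raise ValueError("total pixel count is zero")
    return sum(category_pixels) / denominator * 100.0
```

Under this reading, the HVI is the percentage of all pixels in the selected images that belong to the category, for example the share of vegetation pixels across the side views of a neighborhood.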

Gini index

One of the most popular indices for understanding economic diversity or income inequality is the Gini index [56]. The Gini index is also frequently used to measure diversity in other fields, such as social and health studies [57]. Considering its widespread use, we used this index to understand individual category diversity. In this study, the Gini index was computed for each individual category in four steps: (1) sum all the proportion values of the category in all images; (2) sort the values in ascending order; (3) divide each value by the sum to get the probability of each value; and (4) compute the cumulative sum of all the probabilities.
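The four steps can be sketched as follows. Steps (1) to (4) produce the cumulative shares, which form a Lorenz curve; turning that curve into a single Gini value with the trapezoidal rule is an assumption added here, since the text stops at the cumulative sum. Function names are hypothetical.

```python
from itertools import accumulate

def lorenz_curve(values):
    """Steps (1)-(4): sum, sort ascending, normalize, cumulative sum."""
    total = sum(values)                               # step 1
    if total <= 0:
        raise ValueError("values must have a positive sum")
    shares = [v / total for v in sorted(values)]      # steps 2 and 3
    return list(accumulate(shares))                   # step 4

def gini_index(values):
    """Gini = 1 - 2 * (area under the Lorenz curve), via trapezoids.

    This final step is an assumed completion of the procedure in the text.
    """
    curve = [0.0] + lorenz_curve(values)
    n = len(values)
    area = sum((a + b) / 2.0 for a, b in zip(curve, curve[1:])) / n
    return 1.0 - 2.0 * area
```

With this completion, identical proportions in every image give a Gini index of 0, and a category that appears in only one of four images gives 0.75, so higher values indicate a more uneven spread of that category across images.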

Reliability study with human evaluator

It is important to validate the reliability of the proposed method (i.e., the consistency between the interpretation of the definition of geospatial visual diversity and the computational indices). Reliability analyses, such as Inter-Evaluator Reliability or Inter-Rater Reliability (IRR), are helpful for measuring the consistency among human ratings and computation results [58]. We conducted an IRR study with a group of five human evaluators. One of the biggest advantages of IRR is that it does not require a large pool of raters; in some cases, even a small number of raters can be sufficient to measure IRR [58, 59]. The definition of geospatial visual diversity, rating scales, and sample images were provided to the evaluators, who evaluated each image for its visual diversity.

We analyzed the semantic segmentation information of an urban neighborhood and found that the data were not normally distributed, which suggested the presence of outliers. To remove the outliers, we computed the z-scores of the static categories and removed values beyond \(\pm 3\) (\(\alpha\) = .01). Stamps III [8] showed that the visual diversity of individual images can be calculated using entropy values. Thus, we computed an entropy value for each image vector.
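The outlier filter can be sketched as below. The sketch assumes z-scores are computed per category with the population standard deviation; whether the study used the population or the sample form is not stated, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def remove_outliers(values, threshold=3.0):
    """Keep values whose z-score lies within +/- threshold.

    Uses the population standard deviation (an assumption).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs((v - mu) / sigma) <= threshold]
```

For example, a single extreme proportion among twenty identical ones has a z-score of about 4.47 and is dropped, while the remaining values are kept.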

To find images with low and high visual diversity, we first computed the median of the entropy values and divided the dataset into quartiles. For the low-diversity images, we selected five random images from the first quartile, and for the high-diversity images, we selected five random images from the third quartile.

Second, ten representative images with a mix of low and high visual diversity were used, as shown in Fig. 3. The evaluators, who were not aware of the computed diversity values, rated each image for its level of visual diversity using a 5-point Likert scale (i.e., 1 = Not Diverse and 5 = Extremely Diverse). The IRR of the evaluators was calculated with the Intra-Class Correlation (ICC) [58] and Krippendorff’s alpha [60], both of which are computed from the rating variance. The resulting ICC was .836 for single measures and .953 for average measures. These ICC values indicated a high degree of agreement among the evaluators and suggested that they scored diversity in the images similarly. The high ICCs also meant that a minimal amount of measurement error was attributable to the evaluators, and thus power was not reduced. In addition, the obtained Krippendorff’s alpha was .775, which likewise indicated a moderate to high degree of agreement among evaluators [60]. SPSS version 24.0 was used to compute the ICC and Krippendorff’s alpha values. Finally, we performed a correlation analysis between the average ratings and the geospatial visual diversity indices. Before conducting the analysis, we checked the assumptions of correlation and found that the data were normally distributed. We therefore used the Pearson correlation to analyze the relationship between the average ratings and the geospatial visual diversity indices. The correlations appear in Table 1.
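For readers who wish to reproduce the last step outside SPSS, the Pearson correlation between average ratings and an index can be sketched as follows. The function name is hypothetical, and the sketch makes no claim about how SPSS computes the statistic internally.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Here `x` would hold the average rating of each image and `y` the corresponding index value; with only five images per group, the magnitude and direction of r are more informative than its significance.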

The correlation between the evaluators’ average ratings and a multi-category visual diversity index (e.g., the Simpson index) was positive for both low- and high-diversity images (see Table 1). This indicates that the Simpson index could be helpful in assessing both low- and high-diversity images of a geospatial urban area. That said, the McIntosh index could be more appropriate for low-diversity areas, whereas for high-diversity areas, multi-category entropy might be the superior index to use.

For the single-category indices, entropy could be helpful for assessing the diversity of buildings and greenery in a high-diversity area. Similarly, the HVI could be useful for assessing a high-diversity area in terms of building, greenery, and sky (see Table 1). Overall, the results indicated that multi-category diversity indices show a stronger relationship with human ratings than single-category indices. This finding makes conceptual sense, as different aspects of a geospatial area and their proportions are typically considered together when appraising diversity, while single-category indices are better able to show which individual category affects overall diversity.

Table 1 Correlation between average ratings and the computed geospatial visual diversity indices (upper diagonal correlation values are from high diversity images, and lower diagonal from low diversity images)

Evidence from this study suggests that the validity of geospatial visual diversity computed from street-view images was high. It should be noted that the sample size was five, meaning that several correlation values might be high but not significant. In this study, the magnitudes and directions of the correlation values were the focus.

Correlation study between visual diversity and social variables

A total of 351,246 street-view images were analyzed from 86 neighborhoods in a Midwest metropolitan area consisting of two major cities (City 1 and City 2). The sample size for the correlation analysis was computed using G*Power [61]. An a priori power analysis indicated that a total sample of 85 would be needed to detect medium effects with 80% power using an alpha of .05. The sample size for this study was 86 neighborhoods, enough to achieve 80% statistical power. Table 2 reports the results. Next, we present a few examples.
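The required sample size can be approximated with the Fisher z transformation. This sketch assumes that a "medium effect" means r = 0.3 (Cohen's convention) and that the test is two-tailed; neither is stated in the text, and G*Power uses its own routine rather than this approximation. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect correlation r (two-tailed test).

    Fisher z approximation: n = ((z_{1-alpha/2} + z_{power}) / atanh(r))^2 + 3.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)
```

Under these assumptions, `required_sample_size(0.3)` returns 85, which agrees with the sample size reported above.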

Relationship between geospatial visual diversity indices and population metrics

The results here indicated that there were mainly negative correlations between multi-category diversity indices and population metrics. For instance, the Simpson index (\(r_s\) = – .328, \(p<\) .05 for City 1 and \(r_s\) = – .442, \(p<\) .01 for City 2, respectively) and the McIntosh index (\(r_s\) = – .390, \(p<\) .01 and \(r_s\) = – .423, \(p<\) .01, respectively) had medium negative correlations with total population. There were strong negative correlations between the MCE index (\(r_s\) = – .502, \(p<\) .01 and \(r_s\) = – .513, \(p<\) .01, respectively) and total population.

For diversity indices of individual visual category, the relationships depend on specific categories. For instance, high diversity of building and terrain often links to a large population. There were strong, negative correlations between total population and Gini index of building (\(r_s\) = – 0.727, \(p<\) .001 and \(r_s\) = – 0.531, \(p<\) .01, respectively), and the same occurred for the Gini index of terrain (\(r_s\) = – 0.706, \(p<\) .001 and \(r_s\) = – 0.657, \(p<\) .001, respectively). In addition, negative correlation was reflected between population density per square mile and Gini index of building (\(r_s\) = – 0.461, \(p<\) .01 for City 1 and \(r_s\) = – 0.432, \(p<\) .01 for City 2 respectively). In a similar manner, a negative correlation occurred between MCE index and population density per square mile (\(r_s\) = – .309, \(p<\) .05 and \(r_s\) = – .420, \(p<\) .01, respectively).

Relationship between geospatial visual diversity indices and economic indicators

For multi-category indices, negative correlations appeared between diversity and household income in a neighborhood. For example, this occurred between the Simpson index and median household income (\(r_s\) = – 0.785, \(p<\) .001 for City 1 and \(r_s\) = – 0.387, \(p<\) .05 for City 2, respectively), and between the McIntosh index and median household income (\(r_s\) = – 0.756, \(p<\) .001 and \(r_s\) = – 0.349, \(p<\) .05, respectively). For individual categories, positive correlations were detected between HVI of green and median household income (\(r_s\) = 0.780, \(p<\) .001 for City 1 and \(r_s\) = 0.368, \(p<\) .05 for City 2, respectively). Similarly, there were positive correlations between HVI of green and median home value (\(r_s\) = 0.663, \(p<\) .001 and \(r_s\) = 0.309, \(p<\) .05, respectively). Greenery is often an indicator of a high-income neighborhood. Conversely, income has a negative correlation with the building category, as seen between HVI of building and median household income (\(r_s\) = – 0.551, \(p<\) .01 and \(r_s\) = – 0.370, \(p<\) .05, respectively).

Relationship between geospatial visual diversity indices and crime metrics

For multi-category indices, high diversity often indicates high crime activities. In the table, there were positive correlations between Simpson index and violent crime (\(r_s\) = 0.721, \(p<\) .001 for City 1 and \(r_s\) = 0.323, \(p<\) .05 for City 2, respectively), McIntosh index and violent crime (\(r_s\) = 0.691, \(p<\) .001 and \(r_s\) = 0.337, \(p<\) .05, respectively), Simpson index and property crime (\(r_s\) = 0.684, \(p<\) .001 and \(r_s\) = 0.324, \(p<\) .05, respectively), and McIntosh index and property crime (\(r_s\) = 0.641, \(p<\) .001 and \(r_s\) = 0.339, \(p<\) .05, respectively).

For single-category indices, it became evident that positive correlations exist between violent crime and Gini index of green (\(r_s\) = 0.751, \(p<\) .001 for City 1 and \(r_s\) = 0.328, \(p<\) .05 for City 2, respectively), and between property crime and Gini index of green (\(r_s\) = 0.707, \(p<\) .001 and \(r_s\) = 0.321, \(p<\) .05, respectively).

Table 2 Correlation between geospatial visual diversity and economic, population, and crime metrics (upper diagonal correlation values are from City 1, and lower diagonal from City 2)


One difference between our approach and those of other researchers is that we sought to understand a built environment through micro-level analysis, capturing every single block of a neighborhood using street-view images. This process allowed us to gain information from every corner of a neighborhood and then compute the geospatial diversity. Moreover, we used different indices to understand the same information, in an attempt to overcome the underlying limitations and biases that each index could have. We also used single-category indices to discern the effect of individual categories and their relationships with different social aspects. The results of this study suggested that multi-category geospatial indices are more effective at explaining social phenomena than single-category indices.

In this section, the results from the previous section are discussed in the order of the research questions. We ran two separate correlation analyses to see whether the correlations between visual diversity indices and social phenomena vary between the two cities. Evidence obtained from the analysis suggests that the correlations between visual indices and social phenomena are broadly similar across the two cities, though not identical.

Relationship between geospatial visual diversity indices and population metrics

In both cities, there were negative correlations between the Simpson, McIntosh, and MCE indices and total population (see Table 3, rows 1-3). Similarly, as total population and population density are related to where people live, Day [62] found that most residents preferred to have their homes in a less visually diverse area. Collis, Felton, and Graham [63] reported comparable findings.

There was a negative correlation between the MCE index and population density per square mile (see Table 3, row 16). This finding concurs with previous research indicating that areas with low visual diversity are typically less crowded and contain fewer buildings, greenery, and sky [64]. In other words, the suburbs are considered less visually diverse and downtown urban areas are more visually diverse [37, 65]. In all, the three multi-category diversity indices (Simpson, McIntosh, and MCE) correlated with the total population, but only one of them, MCE, correlated with population density per square mile.

The results indicated that there were positive correlations between Entropy indices of road, building, green, terrain, and sky and the total population in the neighborhoods of both cities (see Table 3, rows 8-12). As the research by Zhang and Dong [9] noted, people prefer to live in areas with more greenery and visible terrain. In addition, this finding was consistent with Noland [66], who noted that an increase in population demands an increase in vehicles and roads.

Negative correlations exist between the Gini index of building, green, and terrain and the total population (see Table 3, rows 4-6). This indicates, for instance, that high building diversity is related to a higher population density. This finding is supported by Gillis [67] as well as Ellis and Ramankutty [68], with each study asserting that a greater variety of buildings in a given area is related to higher population density.

In addition, there was a negative correlation between the HVI of sky and the total population (see Table 3, row 15). A higher HVI of sky indicates more openness. This suggests that fewer people tend to live in open-sky areas in outer-city neighborhoods. In general, inner-city areas with a high population are covered with high-rise buildings, or concrete jungles, which prevent a full view of the sky from street level [69].

Table 3 Summary of the correlation table for diversity indices and population metrics

Relationship between geospatial visual diversity indices and economic indicators

In both cities, there were negative correlations between the Simpson index and the McIntosh index and household income and median home value (see Table 4, rows 1-2, 13-14). That is, families/individuals with more household income as well as a higher home value tend to reside in less visually diverse areas (e.g., living in the outer-city or suburb areas), instead of more visually diverse areas (e.g., inner-city or downtown areas). These findings are supported by Howe, Bier, Allor, Finnerty, and Green [70], who found that most people want to live outside inner-city areas due to lower taxes and less crime. Despite these findings, the HVI of green and sky showed positive correlations with median household income (see Table 4, rows 10, 12) and median home value (Table 4, rows 20, 22). These findings are consistent with Kim and Kim [71], who showed that families/individuals with higher income and home values were more likely to live in green and open areas compared to those with lower incomes and home values.

In both cities, there was a negative correlation between HVI of building and median household income (see Table 4, row 9). A study by Ghose [72] suggesting that those with higher income want to reside in outer-city (suburb) areas with fewer high-rise buildings concurs with this study’s results.

Table 4 Summary of the correlation table for diversity indices and economic metrics

Relationship between geospatial visual diversity indices and crime metrics

There were positive correlations between the Simpson and McIntosh indices and both violent and property crime (see Table 5, rows 1-2, 11-12). A previous study reported that high visual inequality and crime were correlated [73]. Likewise, Lentz [74] asserted that the type of environment and crime are related.

In terms of single-category geospatial diversity indices and crime metrics, the results here suggested that there were positive correlations between Gini index of green and both violent and property crime (see Table 5, rows 4, 14). Notably, a higher value of Gini green indicates heterogeneity of greenery. In other studies, however, Kuo [75, 76] and Kuo and Sullivan [77] did not support this assertion. Rather, Kuo [76] explained that green areas decrease crime since paved areas with no vegetation are often seen as “no man’s lands.” In general, empty or “no man’s land” areas have less presence of residents or witnesses and increase crime activities, making criminals feel that they are less likely to be caught. Often, studies containing crime statistics characterized empty areas as crime hot spots [78].

Table 5 Summary of the correlation table for diversity indices and crime metrics

Limitations and future directions

The computation of diversity and the relationship between diversity and social variables are based on the primary results from the selected metropolitan area in the Midwest. We recognize that specific relationships between visual diversity indices and social factors might differ across geographical regions or countries. Despite this, we contend that this study presents a useful methodology and substantive results in computing diversity and linking it with social outcomes, which can be leveraged by urban researchers, residents, the workforce (i.e., businesses), and administrators.

For the validity and reliability analyses, only five evaluators were asked to participate. This small sample size was bolstered by the number of pictures rated (i.e., ten) and the rating scale that was used (i.e., a 5-point Likert scale). Overall, the Inter-Rater Reliability (IRR) component in this study was exploratory in nature, and a total of five evaluators was considered adequate. Future studies should include more evaluators in order to approximate the magnitude of the reliability coefficient with more precision. In addition, the GSV images used in these analyses (n = 10) were randomly chosen from approximately 53,000 images of a single urban neighborhood, and only images with lower and higher visual diversity were selected. Future studies should consider selecting more images from a range of neighborhoods, covering more than two levels of visual diversity.

Although a total of 351,246 street-view images were collected in this study, the sample of neighborhoods remained limited. More neighborhoods should be selected in future research to verify whether the results obtained in this study have any biases related to the sample size. Future studies should also explore how relationships might change or remain the same across more states or countries. A limited sample size can also dictate the kinds of analyses conducted. For instance, this study relied primarily on correlational analyses. A larger sample size and variety of geolocations (i.e., cities and neighborhoods from different states or countries) could be used in future research to develop more complex statistical models.

Another limitation of this study is that the accuracy of the geospatial visual diversity indices depends on the precision of the deep learning model. While PSPNet achieves about 82.2% accuracy with cityscape images [48], future AI studies could improve the deep learning model, which would in turn improve the accuracy and reliability of the diversity indices and the correlation coefficients computed with this method.


The work here presents a computing framework for geospatial visual diversity that applies AI-based tools to street-view images. It shows that diversity indices can be helpful in understanding the built and natural environment as well as the social dynamics of an urban neighborhood. Still, correlation analysis does not imply causality. Nevertheless, the results presented in this study can be used to understand the influence of visual diversity on cities or neighborhoods. This study indicated that multi-category geospatial indices could be more effective than single-category indices in explaining social phenomena in urban neighborhoods. This approach can potentially be used by city administrators, policymakers, and urban planners in their work on urban and community study and improvement.

We considered three aspects of social phenomena: the economy, population, and crime. We considered these three aspects together because earlier studies found that they are related. This study used street-view images to capture neighborhood scenes and a semantic segmentation method to extract information about visual objects, enabling the computation of geospatial visual diversity. This was an attempt to use computer programs to understand geospatial visual diversity and automate the computational process. The approach can be employed by city administrators, policymakers, and environmental designers to understand geospatial visual diversity without leaving their offices, and it could potentially save time and cost in better understanding a built environment.