Assessing the impact of demographic characteristics on spatial error in volunteered geographic information features
The proliferation of volunteered geographic information (VGI) such as OpenStreetMap (OSM), enabled by technological advancements, has led to large volumes of user-generated geographical content. While this data is becoming widely used, the quality characteristics of such data remain largely unexplored. An open research question is the relationship between demographic indicators and VGI quality. While earlier studies have suggested a potential relationship between VGI quality and the population density or socio-economic characteristics of an area, such relationships have not been rigorously explored and have mainly remained qualitative in nature. This paper addresses this gap by quantifying the relationship between the demographic properties of a given area and the quality of VGI contributions. Specifically, we study the demographic characteristics of the mapped area in relation to two dimensions of spatial data quality, namely the positional accuracy and completeness of the corresponding VGI contributions with respect to OSM, using the Denver (Colorado, US) area as a case study. We use non-spatial and spatial analysis techniques to identify potential associations between demographic data and the distribution of positional and completeness errors found within VGI data. Generally, the results of our study show a lack of statistically significant support for the assumption that demographic properties affect the positional accuracy or completeness of VGI. While this research is focused on a specific area, our results showcase the complex nature of the relationship between VGI quality and demographics and highlight the need for a better understanding of it. By doing so, we add to the debate on how demographics impact the quality of VGI data and lay the foundation for further work.
Keywords: Volunteered geographic information · OpenStreetMap · Spatial analysis · Spatial data quality · Demographics
Recent years have been characterized by a significant shift in the way geographic information is produced, specifically through the rise of volunteered geographic information (VGI; Goodchild 2007), which has been fueled by Web 2.0 technologies (Hudson-Smith et al. 2009). While this shift has resulted in an increase in the volume and richness of geographic data, it is also posing some significant challenges with respect to evaluating the quality of such data. The fact that volunteers with minimal (if any) geographic training are contributing such information (Mooney et al. 2010), as well as the mechanisms that govern the contribution process, calls into question the quality of such data (Sui 2008). An exemplar of this trend is OpenStreetMap (OSM), an open source collaborative mapping project that aims to generate an editable global map database (Mooney and Corcoran 2012). Earlier studies addressing the quality, in particular the positional accuracy, of VGI with respect to OSM have pointed out its lack of homogeneity (e.g. Hochmair and Zielstra 2013; Koukoletsos et al. 2012). In order to understand these issues, researchers have explored the relationship of VGI data and its quality with various geographical and demographic indicators. Studies addressing the relation between population density and VGI have indicated that more contributions can be expected in densely populated urban areas, which may ultimately lead to higher quality (Haklay 2010). At the same time, Girres and Touya (2010) noted that population density is not the only factor controlling the positional accuracy of VGI, indicating that areas with higher income and a younger population are characterized by a higher number of contributions. Further efforts to identify discernible spatial patterns in VGI quality (e.g. by comparing the topology of volunteered road network data and commercially available data) did not find statistically significant correlations (Neis et al. 2011).
A common assertion that has emerged from such studies is that the demographic characteristics of the contributing volunteers may impact the distribution of the positional and shape errors in VGI (Fairbairn and Al-Bakri 2013). Demographic indicators that have been explicitly suggested as potentially contributing to data quality patterns include race and economic status (e.g. Tulloch 2008; Elwood 2008; Graham 2005; Zook and Graham 2007a, b; Crutcher and Zook 2009). However, these earlier studies were primarily qualitative in nature and were not accompanied by rigorous quantitative analyses. Motivated by this gap, this paper aims to examine the quantitative relationships between VGI quality—in particular positional accuracy and completeness—and demographic properties. We do so through a case study in Denver, Colorado (CO), where we use VGI data and contrast it with local demographics from the United States (US) Census Bureau. The remainder of this paper is organized as follows: in section “Background and motivation” we provide background information with respect to the demographic characteristics of VGI. In section “Data and methods” we present the background and rationale for our approach to assess the impact of demographic variation on the quality of VGI. In section “Results and discussion” we present the results of our analysis, and conclude with our summary and outlook in section “Discussion”.
Background and motivation
When attempting to assess the quality of VGI content we need to be cognizant of the particular nature of the volunteering process that differentiates VGI contributions from the traditional and established processes through which geographical data have been collected. While VGI was enabled by technological advancements, it gained popularity primarily because it addressed the general public’s growing need to access geographical data for a constantly increasing array of activities. This is the reason why OSM, the prototypical example of VGI, emerged and grew in the United Kingdom, where geographical data were not as freely distributed by government agencies (Haklay and Weber 2008). In contrast, US government policy has leaned more towards openly sharing much of its geographical data. Up to this point, efforts to assess the quality of OSM content have focused primarily on road networks, as OSM was intended, after all, to be a ‘street map’. Such studies have, for example, compared OSM road data to a reference data set in order to assess OSM relative to an authoritative standard such as the U.K. Ordnance Survey road network data sets or comparable products (Brown and Pullar 2012; Haklay 2010; Hochmair and Zielstra 2013; Koukoletsos et al. 2012; Neis and Zipf 2012; Neis et al. 2011). However, as OSM has evolved well beyond streets, we now need to assess the accuracy of other types of features as well, in order to gain a more thorough understanding of quality issues with respect to VGI. This need is further emphasized by the fact that large volumes of road data (e.g. TIGER/line files), which were made freely available by the U.S. Census, have been bulk-uploaded into OSM since 2009 (OSM 2013b). Accordingly, road content in OSM (especially in the US) does not comprise solely VGI contributions, but a hybrid aggregate of VGI and authoritative data.
Therefore, a study of non-road features (especially ones that do not include bulk uploads of authoritative content) will complement the current body of work and enhance our understanding of the accuracy of OSM contributions, and this is one of the contributions of this paper.
Furthermore, there is a lack of understanding with respect to the spatially-driven motivations for VGI contributions: why do people contribute for some areas and not others? For example, Zielstra and Zipf (2010) contrasted the differences between VGI and commercial data sources in Germany against population density, noting that the completeness of the VGI degraded considerably as the distance from the urban core increased. Similarly, rural areas with low population densities are less likely to attract VGI contributions. However, our understanding of how this participation varies among locations with similar population density remains largely unexplored. For example, Goodchild and Li (2012) found that Linus’ Law is not as effective for geographic facts as it is for other information such as Wikipedia, and that population density alone is not sufficient to explain the trends in the data error. A study of the quality of contributions as it relates to the demographics of the place will therefore advance our understanding of the relation between population characteristics and VGI quality, and this is the second contribution of this paper. In the following two subsections we provide a review of accuracy and completeness in the context of OSM (section “Positional accuracy and completeness in OSM”), followed by a discussion of the characteristics of the contributors to VGI—both in terms of their motivation and their demographics (section “The motivation to contribute to open source initiatives”).
Positional accuracy and completeness in OSM
In the context of this paper the term accuracy is used to refer to positional accuracy, namely the closeness of the coordinate values of a VGI feature (e.g. a point) to those of its corresponding authoritative equivalent feature, based on Euclidean distance. The term completeness refers to the extent to which features are included or omitted from a dataset, again in comparison to the authoritative equivalent. In that sense, if one were to consider schools, accuracy would refer to how close a VGI record of a school is to its corresponding authoritative record, whereas completeness would refer to the percentage of schools that have been mapped in a VGI dataset. Both terms are indicators of the overall quality of a dataset, a term which in this paper is used to refer to the overall fitness for use of a dataset.
Past efforts to assess the quality of VGI contributions have primarily focused on positional accuracy. Haklay (2010) compared OSM data and the United Kingdom’s (UK) Ordnance Survey road centerline data, finding that OSM road centerlines are displaced on average by 5.83 m for selected areas within London. Girres and Touya (2010) compared French road data and reported an average displacement of 6.65 m, which is consistent with Haklay (2010). These studies, however, did not examine any spatial variations of accuracy. In an effort to address this issue, Al-Bakri and Fairbairn (2010) performed a localized study in the UK, assessing the accuracies of nodes and linear segments of polygonal VGI entries. They found that node accuracy degraded between urban and the more rural ‘peri-village’ areas, noting errors of 9.6 and 11.0 m respectively. For linear segments, the authors measured an average displacement of 1.5 m from the surveyed results, using buffers established uniformly around both the reference and OSM data. The spatial heterogeneity of VGI errors was also pointed out by other researchers who noted that inaccuracies were often localized to specific areas (e.g. Girres and Touya 2010; Haklay 2010; Zielstra and Zipf 2010). In an effort to understand these accuracy variations, Haklay et al. (2010) showed that there was no observable correlation between the number of contributors and the quality of VGI data once that number reaches a certain level. Considering the lack of discernible spatial patterns of accuracy variations, the research community turned its attention to the motivation behind VGI contributions as a potential explanation for these variations.
The motivation to contribute to open source initiatives
Open source initiatives exceed the purview of the geographical community, with similar efforts seen in a very broad range of activities, ranging, for example, from Wikipedia and open source software development projects to citizen science. Nevertheless, regardless of the topic, there exist certain general motivating factors that drive participation in such efforts. For example, Oreg and Nov (2008) identified the three primary motivational elements for contribution, ranked by importance, as: self-development, altruism, and reputation building. Kuznetzov (2006) argued that for Wikipedia, perhaps the ‘poster child’ of crowdsourcing efforts, the key motivation is a sense of community and accomplishment. Vickery and Wunsch-Vincent (2007) noted that the motivations driving these contributions of user-generated content are primarily technological, social, economic, and institutional or legal considerations. In essence, if it is not difficult, illegal, or inconvenient to contribute the information, and there is some motivation in the form of intrinsic social or economic return achieved by the contributor, then information will be contributed (Nov et al. 2011).
While the above drives public participation, the very nature of public participation introduces biases in volunteered content. These biases result in variations in the patterns of contribution and the accuracy of the contributed content itself. Such biases stem from four key areas: Internet access, knowledge of language, available time, and adequate technical capability to support the editing functions required of the contributors (Holloway et al. 2007). As these areas are closely associated with demographic properties, it is only natural to move towards studying VGI content relative to the corresponding demographic information of an area, which is a direction that we pursue in this paper. This is also consistent with earlier work by Porter and Donthu (2006) and Longley and Singleton (2009), who also identified the correlation between Internet usage and several demographic indicators (e.g. age, education, income and race).
When we consider the above under the lens of VGI contributions in particular, an argument emerges about the relation between VGI contributions and demographics. Elwood et al. (2013) note that differences in social processes associated with VGI can impact the content and quality of the contributed data. Studies of OSM contributions have also noted that as population density decreases so do OSM contributions (e.g. Zielstra and Zipf 2010; Girres and Touya 2010; Haklay 2010). However, population density itself is not the only factor affecting content and quality (Girres and Touya 2010; Schmidt and Klettner 2013), as it is not uncommon to have information gaps in highly populated areas (Cipeluch et al. 2010). Girres and Touya (2010) and Haklay (2010) noted a significant decrease in contributions for areas that were economically deprived or disadvantaged. This suggests that socioeconomic factors, such as income and educational achievement, may affect OSM contributions, leading to complex spatial patterns of participation (Elwood 2008, 2009; Ghose and Elwood 2003; Sieber 2006; Tulloch 2008). Graham (2005), along with Zook and Graham (2007a, b), noted similar impacts, as spatial queries and ‘software-sorting’ techniques influenced by cultural differences can create or enhance a bias in the digital presentation or ‘perception’ of a place. Furthermore, Crutcher and Zook (2009), in a study of geographical data in the context of reporting the impacts of the 2005 Hurricane Katrina in New Orleans, Louisiana, stated that racial inequities were a key factor affecting access to and use of digital technologies. However, while there have been a number of suggestions regarding the impact of the demographic element on the contribution patterns and quality of VGI data, this direction remains understudied, and in this paper we make a contribution by addressing this issue.
Data and methods
VGI quality and demographics data
Our analysis draws on three datasets of school locations:
Oak Ridge National Laboratory (ORNL) data, which geographically locates a Department of Education list of schools using their street address information; the ORNL data places a point for each school at its street address.
School locations from the Points of Interest (POI) layer of OSM. The POI layer represents each specific feature as a node. In OSM it is common practice to represent area features as points (Over et al. 2010), and the guidelines provided within OSM indicate that the node should be placed in the middle of the site (OSM 2013a).
Data from the United States Geological Survey (USGS) OSM Collaborative Project (OSMCP—2nd Phase; Poore et al. 2012).
We consider the ORNL data as the authoritative dataset in our analysis, while the remaining two datasets represent different types of VGI data. The OSM POI school data represents ‘classic’ VGI, while the OSMCP dataset represents a variant of VGI data in the sense that it introduces limited authoritative oversight to the VGI process, in the form of peer-developed quality control feedback to the volunteers (e.g. university students) as well as USGS feedback (Poore et al. 2012). This is part of a larger effort by the USGS to augment the National Map (2014) with VGI data pertaining to manmade structures (e.g. schools, hospitals, post offices, police stations etc.). As one might expect, the OSMCP effort would likely fall in the Expert Professionals range of the scale proposed by Coleman et al. (2009), since the final review of the data is conducted by USGS personnel. This has also been shown in Jackson et al. (2013), who indicated that OSMCP data was of higher quality both in terms of completeness and positional accuracy compared to OSM data.
Census tract level demographic indicators considered in our study across four categories: population, economic status, education, and race/ethnicity. The indicators include:
Economic status: median household income; median home value; percent below poverty line; percent of homes receiving food stamps
Education: percent without high school (HS) diploma; percent with HS diploma (and over 25 years old); percent with BA degree (or better) (and over 25 years old)
Race/ethnicity: percent white—not Hispanic; percent African American; percent American Indian
In conjunction with these indicators, the completeness and positional accuracy of the OSM and OSMCP datasets were evaluated. For this purpose, the method described in Jackson et al. (2013) was used to derive the positional accuracy and completeness of each of the OSM and OSMCP datasets with respect to the ORNL dataset (which was considered the authoritative source). This analysis resulted in completeness rates for each dataset (89 % for OSMCP vs. ORNL and 72 % for OSM vs. ORNL) and positional error distributions (47 m ± 50 m for OSMCP vs. ORNL and 190 m ± 314 m for OSM vs. ORNL). Furthermore, 59 % of the time OSMCP schools were closer to their ORNL reference entries than their OSM counterparts, suggesting that OSMCP outperforms OSM. These results, together with the demographic indicators, were then used to analyze possible relations between VGI quality and demographics.
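This matching-based evaluation can be sketched as follows; the nearest-neighbour matching rule and the 500 m threshold are illustrative assumptions, not the exact procedure of Jackson et al. (2013):

```python
import numpy as np

def match_and_score(vgi_pts, ref_pts, max_dist=500.0):
    """Match each reference (authoritative) point to its nearest VGI
    point within max_dist metres, then report completeness (percent of
    reference features matched) and positional-error statistics of the
    matched features. Coordinates are assumed projected, in metres;
    max_dist is an illustrative threshold."""
    # Pairwise Euclidean distances: rows = reference, columns = VGI
    d = np.linalg.norm(ref_pts[:, None, :] - vgi_pts[None, :, :], axis=2)
    nearest = d.min(axis=1)        # distance to the closest VGI point
    matched = nearest <= max_dist  # reference features found in VGI
    completeness = 100.0 * matched.mean()
    errors = nearest[matched]      # positional errors of matched features
    return completeness, errors.mean(), errors.std()
```

A usage note: completeness here is computed against the reference (as in the paper), so VGI points with no reference counterpart do not lower the rate.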
Our analysis is carried out in two modes: non-spatial and spatial. While in the non-spatial analysis mode we focus on the descriptive statistical characteristics of the relationship between demographics and quality, in the spatial analysis mode we focus on the geographic properties of this relationship. Below, we provide a concise description of each analysis mode (non-spatial and spatial) with a particular emphasis on the spatial analysis methods used.
In the non-spatial analysis mode we explore the relations between VGI quality and demographic indicators in four steps. First, we compare the distribution (i.e. histogram) of the quality measures with the distribution of the demographic indicators in order to determine whether accuracy errors or completeness errors are generated in tracts that fit specific demographic profiles. As some demographic indicators are continuous variables, a method for binning the data is required. In this study we approximate the distribution of the demographic indicators using histograms, which requires such a binning process. A number of statistical approaches have been developed to bin data, which generally fall into four categories: natural (Jenks); quantile; equal-interval; and standard deviation (Longley et al. 2010). While each of these approaches could be considered, no substantial differences in the results were observed between them in our case study. Consequently, natural (Jenks) breaks are used to establish five ranges within each of the selected demographic indicators, enabling the exploration of emergent patterns in the distribution of a VGI quality measure across a demographic parameter (e.g. OSM completeness vs. median age).
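As an illustration of two of these binning categories, the sketch below derives equal-interval and quantile class breaks with NumPy (natural/Jenks breaks require a dedicated optimizer and are omitted here); the income values are hypothetical:

```python
import numpy as np

def equal_interval_bins(values, k=5):
    """k equal-width class breaks spanning the data range."""
    return np.linspace(values.min(), values.max(), k + 1)

def quantile_bins(values, k=5):
    """k classes holding (roughly) equal numbers of observations."""
    return np.quantile(values, np.linspace(0.0, 1.0, k + 1))

# Hypothetical 'median household income' indicator (thousands of USD)
income = np.array([21, 25, 30, 32, 35, 48, 52, 70, 88, 120], float)
ei = equal_interval_bins(income)          # class widths are equal
qt = quantile_bins(income)                # class counts are equal
counts, _ = np.histogram(income, bins=qt)
```

The choice matters for skewed indicators such as income: equal-interval breaks leave sparse upper classes, while quantile breaks keep class counts balanced.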
Following this initial analysis, Principal Component Analysis (PCA; Press and Wilson 1978) and discriminant analysis (Davis 1973) are applied in order to identify the demographic indicators that would best explain quality variations in the two VGI sets. Through this analysis, PCA can potentially lead to a reduction in the dimensionality of the demographic indicators, while retaining as much as possible of the variation present within the original dataset. This is achieved by transforming the original dataset to a new set of variables, the principal components (PCs), which are uncorrelated but are ordered so that the first few retain most of the variation present in all of the original variables (Andrews et al. 1996). As PCA can be sensitive to the scale of the variables, it may be necessary to rescale the demographic indicators. However, most of these indicators are already captured as percentages, so such scaling is not necessary for them. Specific non-normalized indicators, such as total population, population density, median age, and median household income, must be scaled to fit within the range of 0–1 prior to the application of the PCA. The transformation to compute these new scaled values is straightforward: the minimum and maximum of the attribute are determined, the minimum is subtracted from the maximum to establish the range, and finally the minimum is subtracted from the actual value and the result is divided by the range, yielding the scaled value for that attribute. PCA is particularly useful as a starting point for additional detailed analysis, such as linear regression, in that the PCA results may identify multicollinearities within the data that can then be used to reduce the data dimensionality prior to additional analysis. By eliminating such collinearities, linear regression can be simplified.
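The min-max rescaling and the PCA step can be sketched as follows (PCA via SVD of the mean-centred matrix; function names are illustrative):

```python
import numpy as np

def minmax_scale(x):
    """Rescale an indicator to [0, 1]: (x - min) / (max - min)."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

def pca(data, n_components):
    """PCA via SVD of the mean-centred data matrix. Rows are
    observations (census tracts), columns are indicators. Returns the
    component scores and the fraction of variance each PC explains."""
    centred = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    var_ratio = (s ** 2) / (s ** 2).sum()
    return centred @ vt[:n_components].T, var_ratio[:n_components]
```

The explained-variance ratios are what the study uses to decide how many factors (here, five) describe roughly 75 % of the variance.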
In order to explore and quantify possible correlations between demographic indicators and VGI quality measures, both linear regression and discriminant analysis are used. Specifically, linear regression is used to assess the correlation between VGI positional accuracy and demographic properties (Burt et al. 2009), and discriminant analysis (Davis 1973) is used to assess the correlation between VGI completeness and demographic properties. In the linear regression analysis positional accuracy is used as the dependent variable and the demographic properties as the explanatory variables, and R2 values are calculated for both the OSM and OSMCP datasets. In the discriminant analysis completeness is used as the discriminant function, demographic properties are used as the explanatory variables, and Wilks’ lambda (Huberty 1984) is used to estimate the significance of the discriminant function. Through the combination of linear regression and discriminant analysis we are therefore able to assess how both positional accuracy and completeness relate to the various demographic indicators in the study area.
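A minimal sketch of the linear-regression step: ordinary least squares with an intercept, reporting the R2 used to judge how much of the positional-error variance the demographic indicators explain (the sample values in the test are hypothetical):

```python
import numpy as np

def ols_r_squared(X, y):
    """Fit y = b0 + X b by least squares and return R^2
    (1 - residual sum of squares / total sum of squares)."""
    A = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
```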
In the spatial analysis mode we analyze the spatial properties of the completeness and positional accuracy that were derived earlier. This analysis is carried out in two steps. In the first step we explore whether spatial patterns emerge in the VGI quality measures, and if so, whether such patterns can also be found in the spatial distribution of the studied demographic indicators. If found, the existence of spatial patterns in both data sets would provide supporting evidence to the possible association between VGI quality and demographic indicators. However, if patterns exist in one dataset and not in the other then such association would be unlikely. In this analysis, the term pattern refers to a non-random spatial distribution of the studied variables that is statistically significant (based on a relevant significance test). To accomplish this we utilize Nearest Neighbor analysis (NN; Clark and Evans 1954), Moran’s Index of Spatial Autocorrelation (Moran’s I; Moran 1950), and Local Indicators of Spatial Associations (LISA; Anselin 1995) to test for statistically significant spatial patterns. In the second step, we utilize regression to model any spatial associations between the VGI quality measures and demographic indicators. Specifically, we use exploratory regression as well as Geographically Weighted Regression (GWR: Fotheringham et al. 2002) to determine whether a statistically significant model (global or local) can be derived. We briefly describe each of these analysis methods below.
NN analysis utilizes the average distance to the nearest neighboring data point to determine whether a point distribution is spatially random, clustered, or dispersed compared to a random point pattern. Based on this, the determination of whether a given dataset is randomly distributed is made through a z-score statistic: absolute score values above 2.58 indicate a 99 % chance that the data is not randomly distributed, while values above 1.96 indicate a 95 % chance. Using the NN analysis, both completeness and positional accuracy can be analyzed for OSM and OSMCP to determine whether any spatial patterns can be detected in either quality measure. While the NN analysis addresses the question of whether global spatial patterns exist in the quality of the two VGI datasets studied here, it does not address the question of whether such patterns exist at the local scale. To address this, we utilize LISA in order to identify areas where local clusters of higher or lower than expected values exist (Anselin 1995). Once a dataset is shown to exhibit spatial autocorrelation, LISA can identify significant areas within the overall dataset that generate such spatial autocorrelation. Accordingly, within our analysis LISA is used to determine which locations, if any, drive the spatial autocorrelation values that are observed. As in the NN analysis, the positional accuracy and completeness measures from both OSM and OSMCP are used.
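Assuming projected coordinates and a known study-area size, the Clark-Evans nearest-neighbour statistic and its z-score can be sketched as:

```python
import numpy as np

def nn_z_score(points, area):
    """Clark-Evans nearest-neighbour z-score.

    points: (n, 2) projected coordinates; area: study-area size in the
    same squared units. z < -2.58 suggests clustering and z > 2.58
    dispersion at the 99 % confidence level."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    d_obs = d.min(axis=1).mean()         # mean observed NN distance
    d_exp = 0.5 / np.sqrt(n / area)      # expected distance under randomness
    se = 0.26136 / np.sqrt(n * n / area) # standard error under randomness
    return (d_obs - d_exp) / se
```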
In conjunction with the NN and LISA analyses for the VGI quality measures, a similar analysis is carried out for the demographic indicators in order to explore whether any spatial patterns emerge. Specifically, we are interested in determining whether demographic indicators exhibit any significant spatial autocorrelation (Burt et al. 2009) in the study area. As the demographic indicators are associated with areal features (i.e. tracts), we utilize Moran’s I to evaluate the global spatial autocorrelation between tracts. Similarly to the NN analysis, Moran’s I analysis determines whether a set of features and their attribute values are spatially randomly distributed, clustered, or dispersed. In addition to considering the attribute values of features to calculate autocorrelation, Moran’s I takes into account the spatial relationships between features in the form of a spatial weights matrix, which can express both topological and metric spatial relations (Wong and Lee 2005). The statistical significance of Moran’s I is evaluated using a z-score test.
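Given a spatial weights matrix, global Moran's I reduces to a few lines; the sketch below assumes a simple binary contiguity matrix:

```python
import numpy as np

def morans_i(values, w):
    """Global Moran's I for attribute 'values' under weights matrix 'w'.

    w[i, j] expresses the spatial relation between tracts i and j
    (diagonal zero). Values near +1 indicate clustering of similar
    values, values near -1 indicate dispersion, and values near
    -1/(n-1) indicate spatial randomness."""
    x = np.asarray(values, float)
    z = x - x.mean()                     # deviations from the mean
    n, W = len(x), w.sum()
    return (n / W) * (z @ w @ z) / (z @ z)
```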
Summary of the exploratory regression criteria values: minimum adjusted R-squared; maximum p value; maximum VIF value; minimum Jarque–Bera p value; minimum spatial autocorrelation p value
At the local scale, GWR is utilized to study spatially localized regression models of the relation between VGI quality measures and demographic indicators. While based on principles similar to the global exploratory regression, GWR explores local regression models by allowing the regression coefficients to vary based on location in the study area. This is accomplished by a location-dependent weight matrix that expresses the spatial relations between data elements. As a result, we are able to explore whether there are any significant local associations between VGI quality and demographic indicators.
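The core of GWR (a separate distance-weighted least-squares fit at each location) can be sketched as follows; the fixed Gaussian bandwidth is an assumption for illustration, whereas production GWR software selects it by cross-validation or AICc:

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Fit a weighted least-squares model at every location, weighting
    observations by a Gaussian kernel of distance so that the
    coefficients can vary over the study area. Returns local
    coefficients, intercept first, one row per location."""
    A = np.column_stack([np.ones(len(y)), X])
    betas = np.empty((len(y), A.shape[1]))
    for i, c in enumerate(coords):
        d = np.linalg.norm(coords - c, axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)  # Gaussian distance decay
        Aw = A * w[:, None]                      # weight the design matrix
        # Solve the weighted normal equations (A'WA) b = A'Wy
        betas[i] = np.linalg.solve(A.T @ Aw, Aw.T @ y)
    return betas
```

When the underlying relationship is truly global, every local fit recovers the same coefficients; spatial variation in the returned rows is what signals a local association.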
Results and discussion
Following the workflow and analysis methods described in section “Data and methods”, in this section we outline and discuss the results that were obtained through each analysis method. In particular, the results of the non-spatial analysis mode are described in section “Non-spatial analysis results”, and the results of the spatial analysis mode are described in section “Spatial analysis results”. The different analyses were carried out using the SPSS statistical analysis package (for non-spatial analysis) and the ArcGIS 10.1 Spatial Statistics Toolbox.
Non-spatial analysis results
Summary of the histogram comparisons of completeness in the OSM and OSMCP data sets (NV indicates no value). Indicators compared against the OSMCP and OSM completeness rates: median household income; median home value; percent below poverty line; percent of homes receiving food stamps; percent without HS diploma; percent with HS diploma; percent with BA degree; percent white—not Hispanic; percent African American; percent American Indian
Principal component analysis (PCA)
As described earlier, PCA is utilized in this study as a tool for reducing the dimensionality of the data, potentially simplifying the linear regression analysis. Accordingly, PCA was used to analyze the relationship between the eighteen demographic indicators presented in Table 1 to determine whether all of them provide unique information or whether they were mostly redundant. The 401 features (schools) in our data set, i.e. both features for which a match was found and those for which a match was not found, were assessed. The results of the PCA identified five factors from the eighteen input properties that together describe approximately 75 % of the variance. However, the analysis did not find any of the 18 demographic indicators to be redundant, and therefore none could be excluded.
Following this, the rotated component matrix was computed using Varimax and the members of each component were evaluated in an effort to understand the particular relationships between the demographic properties. In this analysis the first factor included eleven of the eighteen demographic properties from across the four demographic categories identified in Table 1. The other four factors identified included fewer demographic properties; however, the groupings of the components for each of the factors were composed of a mixture of demographic indicator categories presented in Table 1. Consequently, the PCA did not indicate specific demographic categories (i.e. Population, Economic Status, Education, Race/Ethnicity) that drive the variability in VGI completeness.
Given these results, further investigation of the demographic indicators was carried out in order to determine whether any multicollinearities exist between the indicators in Table 1, as such multicollinearities may result in the standardized residual behavior that was observed in Fig. 4. This analysis identified the following demographic indicators as exhibiting multicollinearity: percent females, percent males, percent with and without HS diploma, percent white, and percent African American. These indicators were then removed and the linear regression correlation matrix results for the OSM and OSMCP datasets were recalculated. However, no significant correlation was observed between positional accuracy and any of the remaining demographic indicators, suggesting that these indicators do not explain the positional accuracy error behavior in our datasets. It is worth noting that a low correlation was observed between positional accuracy and the Asian population percentage across the study area.
Discriminant analysis was used to assess the relationship between completeness and demographics in the OSM and OSMCP data sets. This was carried out by analyzing the differences in the demographic indicators between the features that were matched and those that were unmatched in each of the two datasets. This analysis yields an interesting result: within each dataset, there is a statistically significant difference between the demographic properties of the features that were successfully matched within the study area and those of the unmatched features. The results were consistent for both the OSM and the OSMCP datasets.
Spatial analysis results
Nearest neighbor (NN) analysis
Nearest neighbor analysis was performed in order to explore whether a spatial pattern (clustering or dispersion) in positional accuracy and completeness could be identified for both the OSM and OSMCP data sets. As described in section “Spatial analysis”, the premise behind this approach is that if such patterns are detected, then they can be evaluated against spatial patterns of different demographic indicators to identify any similar spatial patterns. If similar patterns are found with respect to a specific demographic indicator, then that indicator can serve as a potential driver of VGI quality. Such analysis can also lead to possible insights on how demographics may affect different types of quality measures. For example, in the case of completeness, if there is a statistically significant difference in the spatial pattern of demographic indicators associated with matched features compared to the spatial pattern of the same demographic indicators associated with unmatched features, then those demographic properties could potentially drive VGI completeness rates. Testing whether the VGI quality measures exhibit any patterns is therefore the first step in developing an understanding of possible relationships between VGI quality and demographics.
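The premise of the NN test can be sketched with the classic Clark and Evans statistic, which compares the observed mean nearest-neighbour distance with its expectation under complete spatial randomness (CSR). This is a minimal sketch, assuming distinct point coordinates and no edge correction; GIS implementations typically add boundary adjustments.

```python
from math import hypot, sqrt

def nn_z_score(points, area):
    """Clark-Evans nearest-neighbour z-score for distinct 2-D points.

    Compares the observed mean nearest-neighbour distance with the value
    expected under CSR over a region of the given area.
    z << 0 indicates clustering, z >> 0 dispersion."""
    n = len(points)
    d_obs = sum(
        min(hypot(px - qx, py - qy)
            for (qx, qy) in points if (qx, qy) != (px, py))
        for (px, py) in points
    ) / n
    density = n / area
    d_exp = 1.0 / (2.0 * sqrt(density))   # expected mean NN distance under CSR
    se = 0.26136 / sqrt(n * density)      # standard error of the mean NN distance
    return (d_obs - d_exp) / se

# A tight cluster of points in a large area yields a strongly negative z,
# while a regular grid filling its area yields a strongly positive z.
z_cluster = nn_z_score([(0, 0), (0, 1), (1, 0), (1, 1)], area=10000.0)
z_grid = nn_z_score([(i, j) for i in range(5) for j in range(5)], area=25.0)
```

At a 95 % confidence level, |z| > 1.96 rejects the CSR hypothesis, which is the criterion behind the z-scores reported below.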
(Table: Z-scores of matched features calculated from a NN analysis; all school features.)
Moran’s I and local indicator of spatial association (LISA)
(Table: Scores of spatial autocorrelation in the demographic indicators: median home value; percent below poverty line; percent of homes receiving food stamps; percent without HS diploma; percent with HS diploma; percent with BA degree; percent white, not Hispanic; percent African American; percent American Indian. Non-significant results at a 99 % confidence level are marked in bold.)
(Table: Z-scores of positional accuracy calculated from Moran’s I, for OSM and OSMCP positional accuracy.)
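For reference, global Moran's I, the statistic behind these scores, can be computed directly from a value vector and a spatial weights matrix. The values and contiguity matrix below are a toy example, not data from this study.

```python
def morans_i(values, weights):
    # Global Moran's I for n observations and an n x n spatial weights
    # matrix (weights[i][j] > 0 when units i and j are neighbours).
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(weights[i][j] for i in range(n) for j in range(n))
    return (n / w_sum) * (num / den)

# Toy example: four units along a line with rook contiguity; similar
# values sit next to each other, so I is positive (clustered).
vals = [1.0, 1.0, 5.0, 5.0]
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
i_stat = morans_i(vals, W)
```

Significance of I is usually assessed against a permutation or normal-approximation null, which yields the z-scores reported in the table above.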
Global exploratory regression
Exploratory regression was used to assess, at the global study-area scale, both completeness and positional accuracy with respect to the demographic indicators presented in Table 1 above. The analysis of the OSM and OSMCP datasets indicated that several of these indicators exhibited multicollinearity (see section “Linear regression”): percent female, percent HS diploma, percent white, and percent African American. The exploratory regression results indicated that three additional indicators (percent white not Hispanic, percent black, and percent Hispanic) also exhibited multicollinearity. These indicators were removed and the exploratory regression was run again. The results of the second exploratory regression of positional accuracy for both the OSM and OSMCP datasets were consistent with the spatial autocorrelation results reported above and yielded no discernible model with respect to demographic indicators that would explain the spatial distribution of positional accuracy. An exploratory regression for completeness, using presence (1) and absence (0) as the dependent variable, was then conducted for both the OSM and OSMCP datasets. As with positional accuracy, the results did not yield a discernible model of completeness with respect to demographic indicators.
Geographically weighted regression
Following the results of the exploratory linear regression, GWR was applied in order to explore possible relationships between demographic indicators and positional accuracy and completeness at the local scale. However, this analysis did not yield any statistically significant results (at a 95 % confidence level) or discernible regression models. These results are consistent with the regression results presented earlier.
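To make the GWR idea concrete, the sketch below fits a locally weighted least-squares line at a single focus location using a Gaussian kernel with a fixed bandwidth; a full GWR repeats this at every observation location and maps the local coefficients. It handles only one predictor and is a simplified illustration, not the multi-indicator GWR configuration used in this study; all names and data are hypothetical.

```python
from math import exp

def gwr_local_slope(coords, x, y, focus, bandwidth):
    """Weighted least-squares fit of y = a + b*x at one focus location.

    Each observation is weighted by a Gaussian kernel of its distance
    from the focus, so nearby observations dominate the local fit."""
    fx, fy = focus
    w = [exp(-0.5 * ((cx - fx) ** 2 + (cy - fy) ** 2) / bandwidth ** 2)
         for (cx, cy) in coords]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    slope = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
             / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return slope, my - slope * mx

# Hypothetical observations: y depends on x identically everywhere, so
# the local fit recovers the same line regardless of the focus location.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * xi + 1.0 for xi in x]
slope, intercept = gwr_local_slope(coords, x, y, focus=(0.5, 0.5), bandwidth=1.0)
```

In a real GWR, spatial variation in the local slopes (and their significance) is what would reveal a locally varying demographic effect; here no such significant variation was found.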
The analyses carried out in this study addressed the commonly suggested notion that VGI quality is associated with the demographic indicators of the area covered by the VGI (e.g. Haklay 2010; Girres and Touya 2010; Zielstra and Zipf 2010; Cipeluch et al. 2010; Elwood 2009; Zook and Graham 2007a, b; Crutcher and Zook 2009). Using 18 demographic indicators from four separate categories (general population, economic status, educational attainment, and race/ethnicity), and focusing on point features (schools), we examined whether such associations could be detected and, if so, whether they are statistically significant. To accomplish this, we used two VGI sources (OSM and OSMCP) and compared them to a reference dataset (ORNL) using the feature matching methodology developed by Jackson et al. (2013). This enabled us to evaluate two quality measures for each VGI source, namely positional accuracy and completeness, and to compare and contrast them with the various demographic indicators. Our analysis included both non-spatial and spatial methods, ranging from comparisons of the distributions of the quality measures and simple regression to exploratory and geographically weighted regression. These analyses, however, failed to identify a clear and consistent association (or statistically significant correlation) between either positional accuracy or completeness and any of the demographic properties. While some associations did emerge (e.g. in the NN or LISA analysis), they were localized and sporadic. As a result, we were not able to identify any pattern or combination of demographic indicators that is associated with improved VGI quality.
These results suggest that, at least in some cases, the underlying mechanisms that control VGI quality are more involved, and that modeling VGI quality through a direct relation with demographic indicators may not account for its intricate nature. One potential explanation for this complexity is that VGI contributions are typically not restricted to contributors from the mapped area, enabling virtually any user across the globe to contribute information. Consequently, the quality characteristics resulting from such a process are driven by a potentially heterogeneous mixture of demographic indicators, making it difficult to identify specific demographic drivers of VGI quality. To some extent, this issue has been addressed in this study through the use of the OSMCP dataset, for which the contributors were indeed local to the area covered by the dataset. However, clear associations with demographic indicators were not found in this dataset either.
Recent research related to our study by Li et al. (2013), which focused on social media usage (Twitter and Flickr), did find a relationship between Twitter usage and the percentage of well-educated people with advanced degrees and high income. In addition, high Flickr activity was found to be correlated with a high percentage of highly educated white and Asian people. Similarly, Kent and Capello (2013), who studied the use of social media during a crisis situation (a wildfire), indicated that the demographic characteristics of the area impacted by the emergency could be used to reveal the propensity of its population to contribute information to social media during such a crisis. While our findings appear to contrast with these results, it is important to note that their focus is on social media rather than VGI. Arguably, social media offers its users an environment that is substantially different from that of VGI, which may lead to differences in the motivation of users to contribute information as well as differences in usage patterns. Consequently, the question of whether the findings of these recent studies can be generalized to a VGI setting remains open.
It is important to recognize that there are several noteworthy limitations to our study. First, our study focused on a single feature type, namely schools. While this was beneficial for the construction of our analysis workflow, further work is required to explore whether our findings are consistent across different feature types (e.g. stores or hospitals). In addition, our study did not explore other types of VGI quality measures, such as attribute accuracy or logical consistency. Lastly, additional analysis methods, both non-spatial and spatial, and other demographic indicators should be explored in an attempt to model the relationship between demographic indicators and VGI quality. For instance, while our analysis focused on a linear model, non-linear models should also be explored.
Summary and outlook
Web-based tools and technologies now enable citizens without formal training to produce geographical products for mass consumption, introducing both opportunities and challenges for our field. Opportunities arise from the availability of additional data that may extend the coverage of authoritative datasets, or even better represent particular types of events (e.g. capturing rapidly evolving events). The challenges are associated with the integration of such contributions with authoritative content. This integration is impeded by a lack of understanding of the accuracy of VGI datasets, and of potential patterns behind the variations of such accuracy. The study presented in this paper has explored the relationship between demographics and VGI quality, in order to assess the often-repeated argument that demographics may relate to corresponding accuracy variations.
We conducted a quantitative study to address this issue, using data from a major metropolitan area, and focused on point-represented areal features. The Denver metro area selected as our study area offers a unique advantage, as for it we have available not only authoritative (ORNL) and traditional crowdsourced content (OSM) but also content derived through a hybrid, partially supervised crowdsourcing process (OSMCP). For this area we selected schools as a representative feature type, for a variety of reasons. First, schools tend to be easily identifiable, as they are large facilities with well-defined components. Second, in a demographically segregated city such as Denver (James 1986; Aske et al. 2011), school locations can be viewed as representative samples of demographically diverse neighborhoods. The results of our analysis do not support the argument that a correlation may exist between VGI error and local demographic properties, as no statistically significant association was found.
VGI on a massive scale is a very complex process, and we need to gain a better understanding of the mechanics that drive participation and content quality. This study addressed one potential avenue, assessing the role of demographics. While the findings of this first quantitative study do not support earlier arguments for a correlation between demographics and accuracy, a natural extension of this work would be to extend the analysis to additional areas, and to consider additional areal feature types (e.g. hospitals, fire and police stations). Given the findings of this research, one could argue that another logical next step would be to study the characteristics of the contribution process (rather than its spatial and demographic indicators) and their relation to data quality, because the process may be more complex than the work done so far can describe. Understanding who contributes VGI data, and why, remains an open research question. As Steinmann et al. (2013) noted, a small percentage of contributors are responsible for the majority of mapping. Determining which ‘small percent’ of the local population is actually contributing poses a significant challenge to future research relating the demographics of place with VGI contribution quality. It is through such studies that we will be able to gain an understanding of the patterns behind the volunteering of geographical information, and the corresponding accuracy variations. This will improve our capabilities to integrate such datasets with authoritative collections, thus allowing us to harvest the full potential of VGI.
- Al-Bakri, M., & Fairbairn, D. (2010). Assessing the accuracy of ‘Crowdsourced’ data and its integration with official spatial data sets. In Proceedings of the 9th international symposium on spatial accuracy assessment in natural resources and environmental sciences, Leicester, UK, pp. 317–320.
- Aske, D., Corman, R. R., & Marston, C. (2011). Education policy and school segregation: A study of the Denver metropolitan region. Journal of Legal, Ethical & Regulatory Issues, 14(2), 27–35.
- Burt, J., Barber, G., & Rigby, R. (2009). Elementary statistics for geographers (3rd ed.). New York, NY: Guilford Press.
- Cipeluch, B., Jacob, R., Winstanly, A., & Mooney, P. (2010). Comparison of the accuracy of OpenStreetMap for Ireland with Google Maps and Bing Maps. In Proceedings of the 9th international symposium on spatial accuracy assessment in natural resources and environmental sciences, Leicester, UK, pp. 337–340.
- Coleman, D. J., Georgiadou, Y., & Labonte, J. (2009). Volunteered geographic information: The nature and motivation of produsers. International Journal of Spatial Data Infrastructures Research, 4(1), 332–358.
- Davis, J. (1973). Statistics and data analysis in geology. New York, NY: Wiley.
- de Smith, M. J., Goodchild, M. F., & Longley, P. A. (2007). Geospatial analysis: A comprehensive guide to principles, techniques and software tools (2nd ed.). Winchelsea, UK: The Winchelsea Press.
- Elwood, S., Goodchild, M. F., & Sui, D. (2013). Prospects for VGI research and the emerging fourth paradigm. In D. Sui, S. Elwood, & M. F. Goodchild (Eds.), Crowdsourcing geographic knowledge: Volunteered geographic information (VGI) in theory and practice (pp. 361–375). New York, NY: Springer.
- Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression & associated techniques. Chichester, UK: Wiley.
- Ghose, R., & Elwood, S. (2003). Public participation GIS and local political context: Propositions and research directions. URISA Journal, 15(2), 17–22.
- Hochmair, H. H., & Zielstra, D. (2013). Development and completeness of points of interest in free and proprietary data sets: A Florida case study. In Creating the GISociety: Conference proceedings (pp. 39–48). Salzburg, Austria.
- Hudson-Smith, A., Crooks, A. T., Gibin, M., Milton, R., & Batty, M. (2009). Neogeography and Web 2.0: Concepts, tools and applications. Journal of Location Based Services, 3(2), 118–145.
- Longley, P. A., Goodchild, M. F., Maguire, D. J., & Rhind, D. W. (2010). Geographical information systems and science (3rd ed.). New York, NY: Wiley.
- McKechnie, J. (1983). Webster’s new twentieth century dictionary (2nd ed.). New York, NY: Simon and Schuster.
- Mooney, P., Corcoran, P., & Winstanley, A. (2010). Towards quality metrics for OpenStreetMap. In Proceedings of the 18th SIGSPATIAL international conference on advances in geographic information systems, San Jose, CA, pp. 514–517.
- Nov, O., Arazy, O., & Anderson, D. (2011). Technology-mediated citizen science participation: A motivational model. In Proceedings of the 5th international AAAI conference on weblogs and social media, Barcelona, Spain.
- OpenStreetMap. (2013a). Tag: Amenity = school. http://wiki.openstreetmap.org/wiki/Tag:amenity%3Dschool. Accessed on 17 May 2013.
- OpenStreetMap. (2013b). USGS geographic names information system. http://wiki.openstreetmap.org/wiki/GNIS. Accessed on 17 May 2013.
- Poore, B. S., Wolf, E. B., Korris, E. M., Walter, J. L., & Matthews, G. D. (2012). Structures data collection for the national map using volunteered geographic information. U.S. Geological Survey open-file report 2012–1209, Reston, VA. http://pubs.usgs.gov/of/2012/1209.
- Schmidt, M., & Klettner, S. (2013). Gender and experience-related motivators for contributing to OpenStreetMap. In Online proceedings of the international workshop on action and interaction in volunteered geographic information (ACTIVITY) at the 16th AGILE conference on geographic information science, Leuven, Belgium.
- Steinmann, R., Grochenig, S., Rehrl, K., & Brunauer, R. (2013). Contribution profiles of voluntary mappers in OpenStreetMap. In Online proceedings of the international workshop on action and interaction in volunteered geographic information (ACTIVITY) at the 16th AGILE conference on geographic information science, Leuven, Belgium.
- Sui, D. (2008). The wikification of GIS and its consequences: Or Angelina Jolie’s new tattoo and the future of GIS. Computers, Environment and Urban Systems, 32(1), 1–5.
- The National Map. (2014). http://nationalmap.gov/TheNationalMapCorps/index.html. Accessed on 23 May 2014.
- U.S. Census Bureau. (2012). Geographic definitions. http://www.census.gov/geo/www/geo_defn.html#CensusTract. Accessed on 17 May 2013.
- Vickery, G., & Wunsch-Vincent, S. (2007). Participative web and user-created content: Web 2.0 wikis and social networking. Paris, France: Organization for Economic Cooperation and Development (OECD).
- Wong, D. W. S., & Lee, J. (2005). Statistical analysis of geographic information with ArcView GIS and ArcGIS. Hoboken, NJ: Wiley.
- Zielstra, D., & Zipf, A. (2010). A comparative study of proprietary geodata and volunteered geographic information for Germany. In Proceedings of the 13th AGILE international conference on geographic information science, Guimarães, Portugal, pp. 1–15.