1 Introduction

The urbanization process is accelerating in world cities and attracting large-scale job opportunities, human flows, business, and social activities. With the rapid development of information and communication technologies (ICT), location-aware devices, and sensor networks, the emergence of multi-source geospatial big data brings new opportunities to understand the rich semantics of space and place and associated human activities in urban areas using large-scale user-generated content (UGC) and crowdsourcing data streams, such as geotagged social media posts, travel blogs, mobile phone data, smart card data from transportation, GPS-enabled ridesharing services, and so forth. In this chapter, we review state-of-the-art research in UGC-based urban informatics using crowdsourced geographic information.

1.1 Background and Definition

Following the development of Web technologies and mobile devices, people can easily produce large volumes of data and rich information irrespective of their expertise. This is known as user-generated content (UGC), which is a form of content created by users of a system or a service and made available publicly on that system. UGC ranges from social media data and crowdsourced GPS trajectory data to smart card data and mobile location data from a variety of apps. UGC maximizes the opportunity to understand multiple facets of the cities that we inhabit. The uniqueness and potential of UGC are mainly demonstrated in two ways. On the one hand, UGC can be viewed as the complement of professional-generated content (PGC), as it is decentralized and can be collected from the bottom up and through citizen science (Goodchild 2007; See et al. 2016). Therefore, it can be utilized to capture public opinions and further be leveraged to understand place-based contexts and sociocultural perceptions. On the other hand, UGC can be produced in an economical yet effective manner, and individuals as sensors largely expand the data coverage within cities.

Generally speaking, UGC in geographic information applications can be categorized into two types. The first comprises collaborative mapping platforms, such as Wikimapia and OpenStreetMap (OSM), in which volunteers create and contribute geographic features and detailed descriptions to the Web, where the entries are synthesized into databases and made available to both public and private sectors. This type of UGC is also known as volunteered geographic information (VGI; Goodchild 2007) and has lowered the barriers for the general public not only to consume geographic information but also to contribute to the platform. Different organizations can also produce, customize, and render the data sources based on their own preferences of map styles and application requirements, such as in natural disaster management and emergency routing (Longueville et al. 2010; De Albuquerque et al. 2015; Han et al. 2019). VGI demonstrates how geographic data, information, and knowledge are produced and circulated in practice among different communities and in society at large (Sui et al. 2012). In the past decade, a number of studies have compared the data quality of VGI with that of authoritative mapping sources and proprietary geodata in different regions and countries (Haklay 2010; Girres and Touya 2010; Zielstra and Zipf 2010; Neis et al. 2012; Forghani and Delavar 2014; Yamashita et al. 2019; Tian et al. 2019), finding that developed countries generally had better coverage and data quality than developing countries. In some regions, OSM data also had geographically imbalanced coverage and were missing various types of information such as roads, points of interest (POI), and land uses (Dorn et al. 2015; Kashian et al. 2019).
The second type of UGC is socially constructed data streams from users, that is, data entries constructed from mobile phone apps including diverse social media sources, crowdsourcing, and location-based services (Facebook, Twitter, Weibo, Foursquare, Yelp, Flickr, Instagram, Waze, Uber, Lyft, Didi, etc.), where the general public use locations, place names, and geographic contexts to search for information, consume the service, describe their sense of place, and share diverse opinions and comments according to their experiences (Li et al. 2013; Liu et al. 2015; Gao et al. 2017; Janowicz et al. 2019). Harvey (2013) argues that this would be more precisely labeled as user contributed data, since people may not consciously volunteer their data, but generate it in the process of using the platforms for their particular purposes.

Cities are the most populated areas on Earth, and increasing amounts of UGC data streams are generated there every day from social media platforms, location-based services, crowdsourcing, and sensor networks. These streams help in sensing and addressing urban problems and challenges in the regional economy and in globalization (Martinez-Fernandez et al. 2012; Cheshire and Hay 2017), and also drive the new paradigm in urban analytics (Batty 2019) that combines big data, urban planning and design, and spatial information theory for the future development of sustainable cities.

2 Characteristics of UGC

User-generated data have their own pros and cons (Martí et al. 2019). In urban studies, researchers have successfully utilized this emerging source for assessing urban spatial structure and functional regions (Gao et al. 2017; Tu et al. 2017; Xu et al. 2019), analyzing human mobility patterns and transportation infrastructure (Cho et al. 2011; Noulas et al. 2012; Hawelka et al. 2014; Liu et al. 2014; Yue et al. 2014), and supporting the design of new urban development rules. Nevertheless, a good understanding of the key characteristics of UGC data is a prerequisite for preventing the misuse of such data. Compared to traditional data sources (e.g. surveys) used in urban studies, UGC data have the following advantages.

First, UGC has the five Vs (volume, velocity, variety, veracity, and value) characteristic of big data (Marr 2015; Yang et al. 2017). Millions of users from different countries and regions in the world are posting all kinds of information every second (Hu et al. 2015; Liu et al. 2015; Martí et al. 2019). For instance, on Twitter, one of the most widely used social media platforms, more than 500 million tweets are sent daily by 100 million active users from 160 countries (Aslam 2019). UGC covers all kinds of topics including news, sports, entertainment, education, economics, technology, travel, and lifestyle and provides various perspectives in sensing urban environments and human dynamics (Sagl et al. 2012). People share comments about their lives, surrounding environments, and nearby events. As social media records include the timestamps of users’ contents and activities automatically, they provide valuable information for time-series data analytics and time-geography applications (Chen et al. 2016; Tirunillai and Tellis 2012; Kang et al. 2017; Li et al. 2016). Moreover, the UGC data-collection process for a large geographic area is faster, and the cost is reduced compared to traditional surveys (Li et al. 2013; Gao et al. 2014; Jiang, Li, and Ye 2019). In addition, the resolution of UGC can be zoomed into the detailed individual level (Yue et al. 2014; Liu et al. 2015) rather than the aggregation level such as census data; and the data update period of UGC (i.e. seconds, minutes, hours, or days) is usually shorter than that of official surveys (i.e. months or years).

Second, UGC data are contributed by the users voluntarily or are collected from the users who use a service and agree to share their data. It is worth noting that some references may only use a strict definition of actively generated data or crowdsourcing. Citizens monitoring their surrounding urban environment can be considered as sensors (Goodchild 2007) in terms of expressions, perceptions, and behaviors, while producing streams of data on social media Web sites, which can help reveal different aspects of their own lives and their environment (Arribas-Bel 2014). Conventional data collection methods for urban studies usually require large community surveys, long-period observations, and high labor costs using questionnaires and fieldwork (Nawrath, Kowarik, and Fischer 2019; Oliveira and Campolargo 2015). In contrast, UGC is produced through the motivation of both the organizations and the individuals, for various purposes such as providing and using location-based services (Yap et al. 2012), and the desire to share with others to promote friendships and social connections (Ames and Naaman 2007; Hollenstein and Purves 2010). Through this procedure, massive data can be collected unobtrusively in which the response bias in traditional methods may be eliminated (Quercia et al. 2015).

While UGC offers promising opportunities, several inherent challenges and limitations should be addressed when it is used for urban studies, as follows.

First, although large volumes of content are contributed by millions of users every second, we may get a very sparse data matrix (e.g. Lee et al. 2015) after slicing the UGC data into a fine spatiotemporal resolution (e.g. a city-block spatial unit with hourly temporal window), which is crucial in solving some urban problems such as transportation planning and traffic congestion control. The spatiotemporal data sparsity issue becomes more prominent in the regions with limited numbers of active users. Due to the reduced data volume, the uncertainty in each slice may increase when analyzing the data (Bao et al. 2012).
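
The sparsity effect described above can be illustrated with a minimal sketch, assuming posts are available as (latitude, longitude, ISO timestamp) tuples. The function names, the regular latitude/longitude grid, and the hour-of-day binning are illustrative assumptions rather than a method taken from the cited studies. The sketch counts posts per spatiotemporal slice and reports the fraction of slices that remain empty.

```python
import math
from collections import Counter
from datetime import datetime


def slice_counts(posts, bbox, cell_deg):
    """Count posts per (row, col, hour-of-day) slice inside a bounding box.

    posts: iterable of (lat, lon, iso_timestamp) tuples (assumed layout).
    bbox: (min_lat, min_lon, max_lat, max_lon) of the study area.
    cell_deg: grid cell size in decimal degrees (hypothetical parameter).
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    counts = Counter()
    for lat, lon, timestamp in posts:
        if not (min_lat <= lat < max_lat and min_lon <= lon < max_lon):
            continue  # ignore posts outside the study area
        row = int((lat - min_lat) / cell_deg)
        col = int((lon - min_lon) / cell_deg)
        hour = datetime.fromisoformat(timestamp).hour
        counts[(row, col, hour)] += 1
    return counts


def sparsity(counts, bbox, cell_deg, n_hours=24):
    """Fraction of (cell, hour) slices that contain no posts at all."""
    min_lat, min_lon, max_lat, max_lon = bbox
    n_rows = math.ceil((max_lat - min_lat) / cell_deg)
    n_cols = math.ceil((max_lon - min_lon) / cell_deg)
    total_slices = n_rows * n_cols * n_hours
    return 1.0 - len(counts) / total_slices


posts = [
    (0.1, 0.1, "2019-01-01T08:15:00"),
    (0.2, 0.3, "2019-01-01T08:45:00"),
    (0.7, 0.6, "2019-01-01T17:05:00"),
    (2.0, 2.0, "2019-01-01T09:00:00"),  # outside the study area
]
bbox = (0.0, 0.0, 1.0, 1.0)
coarse = sparsity(slice_counts(posts, bbox, 0.5), bbox, 0.5)
fine = sparsity(slice_counts(posts, bbox, 0.25), bbox, 0.25)
```

With the same posts, halving the cell size multiplies the number of slices by four while the number of occupied slices grows far more slowly, so the empty fraction rises.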

Second, a common concern about UGC is the lack of standardization in the data generation process, which causes poor data quality and low trustworthiness, as well as high uncertainty (Senaratne et al. 2017). Users produce geographic data based on their local knowledge and their perception of the place, which may vary across different users (Stephens 2013). For instance, users may have different perceptions and cognitions of the same place, which can cause incorrect tagging behaviors for social media photos (Hollenstein and Purves 2010). In addition, due to the vagueness and uncertainty in human conceptualization of location, space, and place, it is hard for users to express some geographic regions and spatial relations precisely (Montello et al. 2003; Goodchild and Li 2012). To address these concerns, several approaches have been proposed: an approach driven by data synthesis (Gao et al. 2017b), the combination of UGC with an approach informed by fuzzy-set theory (Wu et al. 2019), and the combination of UGC with survey-based behavior approaches (Twaroch et al. 2019).

The third issue concerns the representativeness of UGC, which refers to the degree to which UGC observation samples can represent the actual population (Zhang and Zhu 2018). The results may be biased by data sampling. Existing studies have found that the information shared on social media platforms usually follows a power-law distribution, indicating that only a small proportion of users contribute most of the content online (Kwak et al. 2010; Longley and Adnan 2016; Gao et al. 2017a). Therefore, the content collected might be dominated by some specific features, which can be another source of bias. Besides, the demographic bias in contributors also impedes representativeness (Hecht and Stephens 2014). Not all people in the real world use social media frequently. People who have limited access to social media, such as the elderly and users in developing countries, may be undersampled by UGC. For example, the average age of users on Twitter is 28 (Longley and Adnan 2016), and most photos in the Yahoo Flickr Creative Commons (YFCC) dataset released by Yahoo Labs were uploaded by users in the USA (Thomee et al. 2015; Kang et al. 2018) and several other developed countries. It is worth noting that the users who send geotagged tweets are also not randomly distributed over the population but create bias in subtle ways (Malik et al. 2015).
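
One simple check for the contribution imbalance described above is to measure what share of all posts comes from the most active users. The following is a minimal sketch, assuming the data reduce to one user identifier per post. The function name `top_share` and the 10% threshold are hypothetical choices, and the measure is a descriptive diagnostic rather than a formal power-law fit.

```python
from collections import Counter


def top_share(user_ids, top_fraction=0.1):
    """Share of all posts contributed by the most active users.

    user_ids: iterable with one user identifier per post (assumed layout).
    top_fraction: fraction of users treated as "most active" (hypothetical).
    """
    posts_per_user = sorted(Counter(user_ids).values(), reverse=True)
    if not posts_per_user:
        return 0.0
    # Always include at least one user so the share is well defined.
    k = max(1, int(len(posts_per_user) * top_fraction))
    return sum(posts_per_user[:k]) / sum(posts_per_user)


# Toy sample: 10 users and 100 posts, with one user dominating.
sample = (["a"] * 80 + ["b"] * 8 + ["c"] * 4 + ["d"] * 2
          + ["e", "f", "g", "h", "i", "j"])
share_top_10_percent = top_share(sample, 0.1)
```

A value close to 1 signals that results derived from the sample mostly reflect a handful of contributors, which suggests down-weighting or capping posts per user before aggregation.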

Despite the existence of data bias, research driven by UGC data has proved successful when validated against, or compared with, studies using traditional data sources (Al-ghamdi and Al-Harigi 2015; Blaschke et al. 2018; Gao et al. 2017b; Liu et al. 2016). Opportunities have arisen for urban studies using UGC data because of the abovementioned advantages: (1) big data with low collection cost; (2) fast data generation and update velocity; (3) high penetration rate among users. The next part of this chapter summarizes various examples of UGC-driven urban informatics research and applications, with a focus on the topics of urban spatial structure, urban functional regions, place semantics, and user sentiment analysis. We first introduce an analytical and computational framework to process large-scale crowdsourced data and then follow this with various applications and case studies from the literature.

3 Analytical and Computational Framework to Process UGC Data

A general analytical and computational framework to process and analyze UGC data is shown in Fig. 28.1. It consists of three parts from the bottom up. First, researchers collect various sources of UGC datasets including Twitter, Weibo, Instagram, Facebook, Foursquare, Yelp, and Dianping and store the data (including structured table records and unstructured texts, images, and videos) on a computer server or in a cloud data center with a master server and data nodes. Second, the raw data must be cleaned, filtered, processed, and enriched to further extract the information about users, locations, and content (more details in Sect. 28.3). Lastly, spatiotemporal analyses, statistical methods, and machine learning models are employed to support urban analytics, diagnostics, knowledge discovery, modeling, prediction, and decision-making applications. During this process, multi-source UGC and crowdsourced data can be integrated and fused. High-performance computing infrastructure (Cao et al. 2015; Gao et al. 2017; Yang et al. 2017) and open-source analysis toolkits as well as machine learning frameworks such as scikit-learn, r-spatial, PySAL, and TensorFlow can be utilized to facilitate the data processing and advanced analysis.

Fig. 28.1 A general analytical and computational framework to process and analyze UGC data
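
The second stage of the framework (cleaning and filtering raw records) might look like the following minimal sketch. The record layout with `id`, `user`, `lat`, `lon`, and `text` keys is an assumption made for illustration. Real platform exports differ, and production pipelines usually add steps such as bot filtering and language detection.

```python
def clean_records(raw_records):
    """Drop unusable records and normalize the rest.

    raw_records: iterable of dicts with hypothetical keys
    "id", "user", "lat", "lon", and "text".
    """
    seen_ids = set()
    cleaned = []
    for record in raw_records:
        lat, lon = record.get("lat"), record.get("lon")
        if lat is None or lon is None:
            continue  # not geotagged
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # coordinates outside the valid range
        if record["id"] in seen_ids:
            continue  # duplicate delivery of the same post
        seen_ids.add(record["id"])
        cleaned.append({
            "id": record["id"],
            "user": record["user"],
            "lat": float(lat),
            "lon": float(lon),
            # Collapse whitespace and lowercase the text for later analysis.
            "text": " ".join(record.get("text", "").split()).lower(),
        })
    return cleaned


raw = [
    {"id": 1, "user": "u1", "lat": 43.07, "lon": -89.40, "text": "Nice  Lake VIEW"},
    {"id": 1, "user": "u1", "lat": 43.07, "lon": -89.40, "text": "Nice  Lake VIEW"},
    {"id": 2, "user": "u2", "lat": None, "lon": None, "text": "no location"},
    {"id": 3, "user": "u3", "lat": 123.0, "lon": 10.0, "text": "bad latitude"},
]
records = clean_records(raw)
```

Keeping each filter as an explicit step makes it easy to log how many records each rule removes, which matters when the representativeness of the remaining sample is later assessed.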

4 Single-Source UGC-Based Urban Studies

4.1 User Information and Citizen Demographics

User information in UGC refers to the metadata or the profile of a user, including the place of residence, name, gender, age, ethnicity, hobbies, friends, and social connections. Users are the main entities who generate content. There are two ways to collect user information from UGC. On the one hand, some basic user information can be directly obtained from the public profile which users provide on social media Web sites. When registering and creating a new account, users are required to enter such information by filling out online forms. For example, some basic demographic information such as nationality, gender, and age can be directly extracted from the user profiles (Longley et al. 2015; Kang et al. 2018). Researchers can further utilize such demographic information about citizens to better understand the flow of people from different geo-demographic groups in cities (Longley and Adnan 2016; Huang and Wong 2016). In addition, the follower and friendship connections in social media platforms can also be obtained and have been used to examine theories in the social sciences (Sloan and Morgan 2015; Ugander et al. 2011; Hodas et al. 2013).

On the other hand, some missing user information may not be retrieved directly from the user profile but can be inferred by combining other data sources and further analyses. For instance, the gender, age, and ethnicity information can be inferred from the user identifiers with the forename–surname pairs (Chang et al. 2010; Mateos et al. 2011; Mislove et al. 2011; Longley et al. 2015; Luo et al. 2016). By tracking the location and time of user postings, residents and visitors can be identified and distinguished (García-Palomares et al. 2015; Liu et al. 2018; Su et al. 2016).

4.2 Human Mobility, Urban Spatial Structure, and Transportation

Understanding human mobility patterns is important for the planning and management of urban land use and transportation. The work location, the home location, and even social activity locations of UGC users can be identified through their geotagged posts and their activity patterns detected in social media platforms (Gao et al. 2014; Li et al. 2014; Yang et al. 2015; Wu et al. 2015; Liu, Huang, and Gao 2019). The home-to-job commuting trips and non-commuting trips can be extracted and aggregated for traffic analysis zones (TAZs) to support urban transportation analysis. For example, as shown in Fig. 28.2, researchers detected over 24,000 daily commuting trips with an estimated average commuting time of about 32 min and average commuting distance of about 56 km in the Greater Los Angeles Area using millions of geotagged tweets (Gao et al. 2014). Moreover, when survey data and geotagged Twitter data were compared, the Pearson correlation coefficient of trips on weekdays was 0.91, and the correlation between detected trips using geotagged tweets and using a traditional travel demand model was 0.839 (Lee et al. 2015). While these correlations are not perfect, the conclusions are nevertheless beneficial for urban transportation research.

Fig. 28.2 Spatial and distance distributions of the detected commuting trips using geotagged Twitter data
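
A common heuristic for locating home and work places from geotagged posts can be sketched as follows. This is an illustrative assumption and not the procedure of the studies cited above: home is taken as the most frequent location of night-time posts, work as the most frequent location of weekday daytime posts, and the commuting distance as the great-circle distance between the two. The hour thresholds and the rounding precision are hypothetical parameters.

```python
import math
from collections import Counter
from datetime import datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def most_common_cell(points, precision=2):
    """Most frequent location after rounding coordinates to a coarse cell."""
    cells = Counter((round(lat, precision), round(lon, precision))
                    for lat, lon in points)
    return cells.most_common(1)[0][0] if cells else None


def infer_home_work(posts):
    """posts: (lat, lon, iso_timestamp) tuples of ONE user (assumed layout)."""
    night, workday = [], []
    for lat, lon, timestamp in posts:
        t = datetime.fromisoformat(timestamp)
        if t.hour >= 20 or t.hour < 6:
            night.append((lat, lon))        # assumed "at home" hours
        elif t.weekday() < 5 and 9 <= t.hour < 17:
            workday.append((lat, lon))      # assumed "at work" hours
    return most_common_cell(night), most_common_cell(workday)


user_posts = [
    (34.05, -118.25, "2019-03-04T22:10:00"),
    (34.05, -118.25, "2019-03-05T23:30:00"),
    (34.10, -118.30, "2019-03-06T21:00:00"),
    (34.02, -118.49, "2019-03-04T10:00:00"),
    (34.02, -118.49, "2019-03-05T14:00:00"),
]
home, work = infer_home_work(user_posts)
commute_km = haversine_km(*home, *work)
```

The straight-line distance underestimates network travel distance, so estimates of this kind are usually aggregated to zones and compared against survey or model data before use.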

Another benefit of using location-based check-in data from social networks is having access to information on place types (e.g. shops, offices, restaurants) for user activities, which is important to understand the spatial, temporal, and thematic distributions of human activities and activity-type transitions in cities (Noulas et al. 2011; Wu et al. 2014; McKenzie et al. 2015). For example, Wu et al. (2014) analyzed large-scale user check-in statistics in a location-based social-network platform in China and found different spatiotemporal activity transition probabilities among different types of places, including transportation facilities. Such activity-based transition patterns can also be extracted with pattern mining methods from call-detail-record data from mobile phones, allowing at-home, in-work, and social activity types to be annotated at each stay location (Cao et al. 2019). In addition, by combining information on user demographics, researchers found different movement patterns when comparing tourists and local residents (Chua et al. 2016; Liu et al. 2018), which could help transportation planning and management such as traffic congestion control and transportation regulations during events in cities. Moreover, the linkage between land use and urban dynamics can be identified through UGC and crowdsourcing data. For example, researchers found that human activities tended to decrease throughout the day for most land uses (e.g. offices, education, health) but remained constant in parks and increased in retail and residential zones (García-Palomares et al. 2018). Ren et al. (2019) examined the effect of land-use function complementarity on intra-urban spatial interactions using metro smart card records for different time periods and directions in the city of Shenzhen, China, which also demonstrates the trending use of individual-level big data in travel behavior studies in cities (Yue et al. 2014; Liu et al. 2015).
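
The activity-type transitions mentioned above can be estimated from ordered check-in sequences with a first-order count. The sketch below assumes each user's check-ins have already been reduced to a chronologically ordered list of place-type labels. The labels and the function name are illustrative, and the cited studies may condition on time of day or use more elaborate models.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next place type | current place type) from sequences.

    sequences: iterable of lists of place-type labels, one list per user,
    ordered by check-in time (assumed layout).
    """
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair every check-in with the one that immediately follows it.
        for origin, destination in zip(sequence, sequence[1:]):
            counts[origin][destination] += 1
    return {
        origin: {dest: n / sum(dest_counts.values())
                 for dest, n in dest_counts.items()}
        for origin, dest_counts in counts.items()
    }


checkins = [
    ["home", "office", "restaurant", "home"],
    ["home", "office", "home"],
    ["home", "station", "office"],
]
probabilities = transition_probabilities(checkins)
```

Each row of the resulting nested dictionary sums to one, so it can be read directly as a row of a Markov transition matrix over place types.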

4.3 Place Semantics and Sentiments

Semantic signatures, including spatial, temporal, and thematic signatures, were proposed by McKenzie et al. (2015) and Janowicz et al. (2019) to extract and share high-dimensional data about types of places and neighborhoods. In contrast to spatial statistics, place-based analyses focus more on describing the topological and hierarchical relations between places and understanding various human perceptions and cognition at places (Li and Goodchild 2012; Gao et al. 2013; Zhu et al. 2016; Wu et al. 2019). An understanding of the semantics of urban space and place can be derived from the spatial, temporal, and thematic perspectives using geotagged texts, photos, and videos. These crowdsourced geographic data can also help in the identification of vibrant neighborhoods (Cranshaw et al. 2012; Zhang et al. 2013) and urban areas of interest (AOI), which refer to the regions within an urban environment that attract people’s attention (Hu et al. 2015). Urban AOIs often have high exposure to the general public and receive a large number of visits. UGC such as geotagged photos can reveal the visit popularity and scenery information for city planners, transportation analysts, and location-based service providers to plan new businesses. Besides, existing studies have utilized POI information and user check-ins in location-based social networking platforms (such as Foursquare, Yelp, Jiepang, and Weibo) to investigate various urban informatics issues. For example, a location-distortion model was proposed to improve reverse geocoding (i.e. converting a latitude/longitude pair to a POI address) using behavior-driven temporal signatures (McKenzie and Janowicz 2015). Another model, Place2Vec, supports reasoning about place type similarity and relatedness by learning embeddings from augmented spatial contexts and user check-in information (Yan et al. 2017).
By combining the user check-in information in Foursquare with topic modeling approaches, researchers derived urban functional regions in the ten most populated US cities (Gao et al. 2017), which demonstrates a bottom-up data-driven perspective. In contrast, researchers also developed a top-down theory-informed approach to extracting urban functional regions. For example, a composition-pattern-based knowledge model was proposed to extract urban functional regions (Papadakis et al. 2019a). In this model, places are formalized as “patterns” which are defined as sets of components, composition rules, and functional implications. For example, a shopping plaza should consist of not only shopping stores but also restaurants, parking lots, and other facilities. Recently, an improved model was proposed using theoretical, empirical, and probabilistic patterns (Papadakis et al. 2019b) to enrich the knowledge-based model.
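
The idea of a composition pattern can be made concrete with a deliberately simplified sketch. It reduces a pattern to minimum counts of component types and ignores the composition rules and functional implications of the cited knowledge model. The pattern definition, type labels, and thresholds are hypothetical and are not taken from Papadakis et al.

```python
from collections import Counter

# Hypothetical pattern: minimum number of components of each type that
# must be present for an area to qualify as a shopping plaza.
SHOPPING_PLAZA = {"shop": 3, "restaurant": 1, "parking": 1}


def matches_pattern(poi_types, pattern):
    """Check whether the POIs found in an area satisfy a component pattern.

    poi_types: list of POI type labels observed in a candidate area.
    pattern: dict mapping component type to its minimum required count.
    """
    counts = Counter(poi_types)
    return all(counts[poi_type] >= minimum
               for poi_type, minimum in pattern.items())


area_a = ["shop", "shop", "shop", "restaurant", "parking", "atm"]
area_b = ["shop", "shop", "shop", "shop"]  # stores only, no other facilities
is_plaza_a = matches_pattern(area_a, SHOPPING_PLAZA)
is_plaza_b = matches_pattern(area_b, SHOPPING_PLAZA)
```

The second area fails the check even though it contains more stores, which mirrors the point that a functional region is defined by the combination of components rather than by the count of any single type.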

In addition, with advances in artificial intelligence (AI) technologies and open-source processing platforms as well as deep learning methods in the domains of natural language processing (NLP) and computer vision (CV), the extraction of human emotions (e.g. happiness, fear, anger, sadness, and surprise) and sentiments (i.e. positive, neutral, or negative) at different places and environments has become more accessible. For example, researchers applied advanced text mining techniques with spatial analysis to detect depressed Twitter users and their spatial clusters in US metropolitan areas. Socioeconomic variables from the Bureau of the Census and climate risk factors were found to have an impact on the prevalence of depression but may vary seasonally in different regions (Yang and Mu 2015; Yang et al. 2015). Human sentiment scores and their spatial distribution were extracted and explored in the city of Nanjing, China, using Weibo data (Zhen et al. 2018). High levels of air pollution were found to contribute to the urban population’s reported low level of happiness in social media based on the analysis of over 210 million geotagged Weibo posts in China (Zheng et al. 2019). A semantic-specific sentiment analysis was conducted on Web-based neighborhood textual reviews in the city of New York for understanding the perceptions of citizens toward their living environments (Hu et al. 2019). As for image-based urban studies, researchers have used facial expression extraction techniques to explore human–environment interactions (as shown in Fig. 28.3) especially for the relationship between emotions and environments. A positive correlation was found between the happiness score and the presence of natural environments such as water bodies and green vegetation in different types of place (Svoray et al. 2018; Kang et al. 2019). As another source of ambient sensing data, street view images can also be utilized to analyze human perceptions of places. 
For example, a data-driven machine learning approach with scene elements was proposed to measure how people perceive a place (including safe, lively, beautiful, wealthy, depressing, and boring) using street view images (Zhang et al. 2018a; Zhang et al. 2018b).

Fig. 28.3 Spatial distribution of smiling and no-smiling faces extracted from geotagged Flickr photos in Paris, France, and the associated word cloud of most frequent textual tags in these photos (Facial Expression subfigure was modified from the demo image of Face++ at https://www.faceplusplus.com/face-detection/)
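
To show how place-level sentiment can be aggregated, the following sketch substitutes a toy word-list classifier for the deep learning methods referred to above. The two word lists, the area labels, and the function names are illustrative assumptions. Only the aggregation step (mean polarity per spatial unit) is meant to carry over to real analyses.

```python
from collections import defaultdict

# Toy lexicons (hypothetical); real studies use trained NLP models.
POSITIVE = {"happy", "love", "beautiful", "great"}
NEGATIVE = {"sad", "angry", "dirty", "noisy"}


def polarity(text):
    """Return 1 (positive), 0 (neutral), or -1 (negative) for a text."""
    words = [word.strip(".,!?;:") for word in text.lower().split()]
    score = (sum(word in POSITIVE for word in words)
             - sum(word in NEGATIVE for word in words))
    return (score > 0) - (score < 0)  # sign of the score


def mean_polarity_by_area(posts):
    """posts: iterable of (area_label, text) pairs (assumed layout)."""
    by_area = defaultdict(list)
    for area, text in posts:
        by_area[area].append(polarity(text))
    return {area: sum(values) / len(values)
            for area, values in by_area.items()}


geotagged_posts = [
    ("park", "Beautiful day, love this place!"),
    ("park", "So noisy today"),
    ("station", "dirty and noisy platform"),
]
area_sentiment = mean_polarity_by_area(geotagged_posts)
```

Because areas with few posts yield unstable means, a minimum post count per area is usually imposed before such scores are mapped.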

5 Multi-source Data-Driven Urban Studies

5.1 Fusion of Multiple UGC Sources

In traditional urban strategic planning or in the classification results of remote sensing, many places in urban areas may be labeled as a single land-use type; however, these areas may in reality contain multiple functions and land uses. In order to capture citywide dynamics of both human activities and urban functions at finer resolutions, multi-source UGC and crowdsourced information are combined to overcome the limitations of each source and to enrich the understanding of urban spatial structure and neighborhood demographics. Both mobile phone data and taxi trajectories usually cover large numbers of users and contain rich location information (and social network connections for mobile phone data) but lack place semantics (Liu et al. 2015). Social media data are sparsely distributed in space and time but contain rich content (Huang and Wong 2016; Martí et al. 2019). By combining mobile phone data and social media, it is possible to extract citizens’ home–job locations and social activity dynamics more effectively in space and time in cities (Tu et al. 2017). Also, by integrating mobile-phone data with crowdsourced taxi trajectories, or by fusing POI data with crowdsourced taxi trajectories, researchers have uncovered substantial differences between taxi trips and mobile-phone-based human movements in terms of spatial distribution and distance-decay effects (Kang et al. 2013) and explored the intensity of spatial interactions among different functional regions based on taxi origin–destination flows (Wang et al. 2018). In addition, researchers have used an online restaurant review platform with rich crowdsourced user-generated reviews and extracted machine learning features to further infer urban neighborhoods’ population distribution and socioeconomic attributes in nine Chinese cities. They found high predictability, with the distributions of daytime and nighttime populations estimated from mobile phone location data (Dong et al. 2019).
UGC data can also be used to validate the urban spatial structure and place semantics extracted from ambient sensing and to reflect various urban environmental contexts. For example, as shown in Fig. 28.4, given only a certain number of street view images of a street, a deep learning model was trained to accurately estimate the hourly variation of human mobility patterns approximated by taxi trips along the streets (Zhang et al. 2019). In another study, researchers developed a mixed-use decomposition model based on temporal activity signatures extracted from social media check-in data, and taxi origin and destination (OD) trip data over one year were used to validate the land-use mixing results (Wu et al. 2019).

Fig. 28.4 (A) Predicting hourly variation of taxi trips using street view images; (B) Spatiotemporal variation of human mobility patterns approximated by taxi trips along the streets
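
When one data source is used to validate another, agreement is often summarized with the Pearson correlation coefficient, as in the trip comparisons reported in Sect. 4.2. The sketch below is a minimal standard-library implementation. The two hourly series are made-up numbers for illustration only and do not reproduce any cited result.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical hourly activity counts from two sources for the same street.
checkin_counts = [5, 8, 20, 35, 30, 12]
taxi_dropoffs = [40, 70, 150, 260, 240, 100]
agreement = pearson(checkin_counts, taxi_dropoffs)
```

A high coefficient indicates that the two sources share the same temporal shape, but it says nothing about differences in absolute volume, so it is best reported alongside an error measure.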

5.2 Fusion of UGC and PGC

Compared to UGC, professional-generated content (PGC) mainly comes from domain experts and organizations who have the expertise and knowledge of study subjects, or the authority to collect and publish data, and is therefore regarded as more trustworthy on social media platforms and in news media. The fusion of UGC and PGC can take advantage of both to uncover urban spatial structures and dynamics and to provide valuable information in emergency management or disaster response scenarios. For example, crowdsourced geotagged photos and videos from social media users, volunteered geographic data, and authoritative storm surge data created by the U.S. Federal Emergency Management Agency (FEMA) were fused together to create a more accurate estimate of urban flood damage and updated road accessibility mapping in New York City during Hurricane Sandy (Schnebele et al. 2014). In urban planning and development, the integration of public participation from UGC big data sources together with the PGC-based expert design may provide a holistic approach through the process of idea generation, feedback, and evaluation for urban management and problem solving (Thakuriah et al. 2017).

In the future, a number of multi-source data fusion research areas call for attention in urban informatics. First, the data sampling and fusing resolution requirements in space and time need to be investigated among different UGC sources to comprehensively understand the human activities of different gender, age, and socioeconomic groups, as well as place semantics, for intra-urban and inter-city human mobility modeling. Second, combining UGC and PGC, or combining data-driven and knowledge-driven approaches, may help to solve urban problems such as traffic congestion and environmental pollution. Last but not least, there is a need to increase the engagement of citizen science in addressing urban changes in responsive cities through data-smart governance (Goldsmith and Crawford 2014).

6 Conclusion

UGC data contain rich information about human location, society, and human–environment interactions and have become a promising data source for urban informatics studies with unprecedented spatial, temporal, and thematic resolutions. This chapter summarized the key characteristics of UGC data with a focus on geographic information and urban studies. We discussed the analytical and computational framework to process UGC data and urban applications including citizen demographics, human mobility, urban spatial structure, place semantics, and sentiment analysis, to name a few. Considering the limitations of a single data source, various kinds of data fusion cases were discussed and suggested to advance future urban informatics studies. It is worth noting that we did not try to enumerate all possible fusion cases but rather listed several scenarios with a focus on urban challenges. In sum, a combination of multi-source UGC-driven and theory-informed approaches provides a more holistic view for urban analytics, diagnostics, and human-centered sustainable urban planning and future development.