Abstract
Cities around the world have amassed a variety of data. A main challenge lies in transforming these big data into meaningful knowledge that can inform a city’s strategic decisions and enhance urban sustainability. Along with this challenge is the debate about “Wisdom of Crowds” (WOC) in the Petabyte Age. Using the City of Arlington, Texas as a case study, this research explores a hybrid approach for social inquiries with the aid of WOC and spatial learning techniques to leverage data for developing predictive models to support a city’s service planning. The results indicate that there exist temporal and spatial patterns of service demands, spatial correlation between demands for code compliance and water services, as well as association with neighborhood characteristics. The findings point to opportunities for further data integration and data mining, organizational collaboration, and resource management to improve the efficiency of service provision in cities.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
A main goal of cities is to create a sustainable place with a high-quality living environment for their citizens, businesses, and visitors (Yigitcanlar & Lee, 2014). This requires proactive planning and management based on information generated from big data in the Petabyte Age, referred to the new ear of data technologies. Advancement in information and communication technologies (ICTs) has enabled cities in the 21st century around the world to generate a large amount and a wide variety of data. A main challenge is to transform the data into meaningful knowledge to aid a city’s strategic planning, service provision, and resource allocation decisions (Yu et al., 2012; Lim et al., 2018; Pencheva et al., 2020; Ye et al., 2022; WEF, 2023). While data, whether big or small, have been used in the planning of various services, such as transportation, public health, safety, housing (Liu & Brown, 2003; Morckel, 2013; WEF, 2023), its application in code compliance service planning remains infrequent, despite the latter’s significance for the vitality of cities. Little is understood regarding the role of the “Wisdom of Crowds” – a theory emphasizing the acquisition of knowledge from data prior to forming theoretical constructs (Anderson, 2008)—in city service planning.
Using the City of Arlington, Texas as a case study, this research explores a hybrid analytical approach with the aid of Wisdom of Crowds and spatial data-driven techniques to leverage data for developing predictive models to support the planning decisions for the city’s code compliance service. Specifically, it focuses on exploring the temporal and spatial patterns of the demand for code compliance service and its relation to the demand for water service using the spatial modeling techniques. It also investigates the role of Wisdom of Crowds in forming hypothesis and the opportunity of using a machine-learning technique for predicting the demand for code compliance service.
This study fills a gap in the existing knowledge regarding the data-driven approach for code compliance service planning. It joins a limited number of studies in exploring the applicability of a novel analytical tool for planning of code compliance service. The empirical evidence from the case study has implications for enhancing the value of big data, and should be of interest to smart city research, information science, organizational collaboration research and practices, city planners, and managers or decision makers of public work. The data leveraging approach can be adopted by practitioners under their respective circumstance. In the subsequent sections, we begin by providing a context for this study, then detail our research approach. Following this, we present and discuss our research findings. The paper concludes with a summary of the findings and potential future research.
2 Smart city, big data, opportunities, and challenges for service planning
A smart city is a common vision for urban development around the world (Jamei et al., 2017). Numerous studies have defined the concept of a smart city. While studies may define the concept differently by emphasizing different aspects of smart cities, most characterize a smart city as a system of ICTs and applications of big, dynamic, real-time data by and for people and communities. For example, Washburn and Sindhu (2010) defined a smart city as “a city with a great presence of ICT applied to critical infrastructure components and services.” Yin et al. (2015) argued that “ICT applications and intensive use of digital artifacts such as sensors, actuators and mobiles are essential means for realizing smartness in any of smart city domains.” The characterization of smart cities with the presence and applications of ICT can also be seen from other studies (see, e.g., Neirotti et al., 2014; Barns, 2016; Jamei et al., 2017; Malik et al., 2018; and Osman, 2019). Additionally, studies argued that smart cities are not only limited to ICTs, but also people and community needs (Nam & Pardo, 2011; Albino et al., 2015).
“Big data”, on the other hand, generally refer to large, complex, both dynamic and static sets of data that represent digital traces of human activities and objects including various urban facilities, organizations, and individuals (Chen et al., 2012, 2014; Pan et al., 2016). Besides the size, speed, array of data formats, and quality of data, known as the Volume, Velocity, Variety, Veracity of big data (Desouza & Jacob, 2017; Osman, 2019), many scholars have argued that big data have “Values” as they enable technical prediction, creation of public policy; help the public sector to better engage with citizens and to improve public services, administration, and social values (see, e.g., Arribas-Bel et al., 2015; Castelnovo & Simonetta, 2008; Desouza & Jacob, 2017; Hashem et al., 2016; Ingrams, 2018; Lim et al., 2018; Twizeyimana & Andersson, 2019). Pan et al. (2016) further suggested that urban big data share additional characteristics such as hierarchy that “reflects the organizational hierarchy of a city’s physical and social systems”, and correlations that “can be used, not only for mutual corroboration, but also for cooperative reasoning and mining rules of cities operation”.
While researchers have recognized the characteristics and values of big data, they also notice challengers of transforming data into knowledge. For example, Nuaimi et al. (2015) acknowledged that “effective analysis and utilization of big data is a key factor for success in many business and service domains, including the smart city domain”. They also identified many challenges in supporting smart cities, among of which are challenges in data due to the diversity in sources and formats, data processing methods, and the difficulty of creating “a unified understanding of data semantics” and extracting “new knowledge based on specific cycle data and real-time data”. Pencheva et al. (2020) noted that while public organizations have generated numerous data, “they are often unlikely to use it to gather valuable insights or transform services” due to challenges in privacy and security issues related to data and data analysis at the system, organization, and individual levels. Vydra and Klievink (2019) further argued that “limitations and challenges of using big data” have been discussed but mostly in the manner of acknowledgement but rarely assessed systematically, and that benefits of big data may be hindered by political decision-making factors.
A significant body of research has been devoted to the architecture and system developments of data collection. Research in this area is crucial not only for ensuring the volume, velocity, variety, and veracity of data but also as a precondition for maximizing the value of big data in city service planning, provision, and management. For example, Malik et al. (2018) proposed a model for data transformation. Through a case study, they demonstrated the application for weather data system. Similarly, Khan et al. (2017) developed an Internet of Things (IoT) architecture for energy-related data gathering and energy-aware communication. With the availability of anonymized cell-phone data, researchers are able to analyze the patterns and disparities of mobility in cities. The information is useful for planning transportation infrastructure and improving service efficiency, equality, and efficiency (Rathore et al., 2018; WEF, 2023). Other studies have focused on the use of big data for predictions of crime (Liu & Brown, 2003; Gomory & Desmond, 2023), housing value and price (Morckel, 2013; Bartram, 2019), demand for park service (Hamstead et al., 2018), epidemiology and healthcare (Azzaoui et al., 2021; Ullah et al., 2017), to name a few. Nevertheless, fewer research has focused on demand for code compliance service despite its importance. Code violations post a significant threat to the safety and stability of cities as they tend to attract crimes, decrease property values, and cause other consequences (Bartram, 2019; Gomory & Desmond, 2023; Rosenfeld et al., 2010; Robb et al., 2022). In order to maintain the economic prosperity, equality, and livability of cities, planners must fully take advantage of the data enabled by the contemporary technologies and proactively plan for code compliance service. The success of a smart city relies on a system approach for service planning and resource management, as services in different areas are interdependent and yet competing for resources. Decisions in one area may impact the service or wellbeing of cities in other areas (WEF, 2023).
In sum, the rise of big data, along with new data mining software tools, has provided numerous opportunities for cities to gain insights for service planning and resource management. While some have explored the potential of using big data for predictive management of public services, empirical evidence on the role of big data for improving government services is limited (Benbouzid, 2019; Hong et al., 2019). Although core violations are prevalent in cities and pose significant threats to their safety, prosperity, and livability, there is scant research on code compliance service planning. Overall, existing research advocates for the inventive use of data and innovative analytical techniques to convert data into knowledge, thereby supporting code compliance service planning and decision-making.
3 Research approach
Building upon the knowledge of the existing literature, this study aims to explore a data-driven approach to improve the power of demand prediction for code compliance service planning and to discuss the potential collaboration among departments within a city using Arlington as a case study. In the sections below, a general research process is proposed, followed by a description of the study area, data, and analytical tools for the research.
3.1 Research process
Inspired by the work of Kitchin (2014) and many others, this study follows a hybrid analytical approach that combines the “Wisdom of Crowds” for data mining and the traditional model of science in code violation prediction (Fig. 1). On the one hand, the “Wisdom of Crowds” allows gaining knowledge from data. On the other hand, the traditional model of science emphasizes the importance of building models based on theories. In essence, our approach treats the “Wisdom of Crowds” as an inductive research tool. It starts the inquiry with an inductive approach to discovery the correlations among social phenomenon as proposed by proponents of “wisdom of crowds”, followed with a deductive approach to investigate the association discovered in the first stage of the research. In the case of city service planning, the first stage includes mining the raw data of calls for services to inspect their temporal and spatial patterns, sharing the findings with those with “street knowledge” to uncover the nature of the issues, and followed by theoretical reasoning of the observed patterns. In the second stage, hypotheses are formulated based on the discovery from the first stage as well as relevant theories or empirical findings from the literature. Based on these established hypotheses, relevant factors are selected, their relationships are tested, and explanations are either confirmed or rejected.
3.2 Study area and data
The City of Arlington is located in the middle of the Dallas/Fort Worth Metropolitan area in Texas. It is known as the home of the Texas Rangers and the Dallas Cowboys. According to the World Population Review (2020), the population size of the city has reached more than 400 thousand people, and it had been the largest city without public transit in the U.S. (Harrington, 2018) until 2017 when the city teamed up with Via, a transportation network company (TNC), to provide mobility service in place of the traditional public transit.
Over the years, the City of Arlington as a proto “Smart City” has collected a variety of data in water, code compliance, library, building permits, police, recreation, paramedic services, etc. These data are generated by ICTs such as the city’s Action Center, the Ask Arlington App, and other means. Like many cities, the city has the desire to provide effective services to its citizens and manage its services more efficiently and effectively. The data, along with other public data, provide opportunities for data-driven discovery to aid the city’s strategic planning, service provision, and resource allocation decisions. This study is the first attempt to explore such opportunities.
For the purpose of this study, we use the 2016 calls for code compliance and water lockoffs data provided by the City of Arlington as they were the complete data at the time of the research. These data ware shared in Excel files. The code compliance dataset contains the dates when incidents were reported, the locations of the concerned properties, the violation types, the zoning types of the properties, and other data related to code compliance services. The water lockoff dataset includes premise addresses and the lockoff dates associated with individual records. There’s no identifiable information of individuals such as name and other demographic characteristics in both datasets. The total number of cases is 37,536 for code violations and 10,298 for water lockoffs, respectively. In addition, we collect the 2016 American Community Survey (ACS) 5-year estimates data from the U.S. Census Bureau. The ACS data are at the census block group level. Additional data, such as transportation network and city boundary data, are collected from the North Central Texas Council of Governments (NCTCOG). All data are integrated or aggregated by the researchers for analysis where appropriate.
3.3 Analysis tools
We apply Geographic Information System (GIS) mapping and machine learning techniques as the data discovery and analysis tools, because human activities and service demands exist in place. ArcGIS is used to geocode addresses of incidents required for city services, and to integrate data from all sources and to form the database for the study. The hot spot analysis tool in ArcGIS is applied to analyze the spatial patterns of calls for city services. The spatial distribution and clusters resulted from the hot spot analysis provide the base for forming hypotheses regarding attributes to service demand and identifying data for statistical testing.
The Forest-Based Classification and Regression tool offered by ArcGIS is adopted for statistic modeling. It is a machine learning tool based on “Leo Breiman’s random forest algorism, a supervised machine learning method”, to train “a model based on known values provided as part of a training dataset”. In brief, a machine learning tool “creates many decision trees” based on the data. Each decision tree is established “by recursively partitioning the sample into more and more homogeneous groups” until no further groups can be split (Grömping, 2009). Collectively these decision trees form the forest from which the model is established. Such a model “can then be used to predict unknown values in a prediction dataset that has the same associated explanatory variables” (ESRI, 2020).
The machine learning tool has several advantages including the random nature of the tool, the use of big data, good resistance to noise, and the ability to reduce the need to aggregate data that could result in the loss of original data validity. The machine learning tool is particularly suitable for non-linear modeling because of its nonparametric forest approach. Additionally, it’s less susceptible to multicollinearity issues due to the random sampling method employed, which decorrelates the multiple trees based on different training subsets (Amit & Geman, 1997; Grömping, 2009). The tool can save time and steps for data aggregation and is flexible for data mining in either the n (number of observations) > > p (number of variables) or the p > > n setting (Amare et al., 2021; Grömping, 2009; Wei et al., 2018; ESRI, 2020). Moreover, the tool can be used to rank the importance of variables in a classification and regression problem. It also works well on imbalanced dataset and can be easily implemented.
However, like many random forest models, the tool sacrifices the intrinsic interpretability in decision trees compared to Ordinary Least Squares regression models and may still be vulnerable with overfitting on highly noisy dataset. Although both hot spot analysis and the Forest-Based Classification and Regression tools are not new, their applications as a data mining tool in the context of code compliance service planning are rare. Exploring the potential in this line of study may provide insight for debate over the role of Wisdom of Crowds and future applications for service planning.
4 Discovery of knowledge from Wisdom of Crowds
4.1 Types and temporal distribution of code violations
According to Cukier and Mayer-Schoenberger (2013), data can be “datafication”, a term that refers to “the ability to render into data [that can measure] many aspects of the world that have never been quantified before”, including those associated with geographic locations. Similarly, Höchtl et al. (2016) and Caithness (2018) argued that “data don’t exist in a vacuum”, and that they represent “[t]he undeniable truth of facts” based on social network/social learning theory and the Locard’s exchange principle. Following this line of thoughts and the research process outlined in the research approach section, we first analyze the types and temporal distribution of the reported code violations.
According to the City of Arlington, code violations are classified into 94 categories (City of Arlington, 2020). Analysis of the calls for code compliance service reveals that the majority of the reported code violations are in 8 code categories. All of these top violations are property related. Together, they account for about 73% of the total violations (Table 1).
Figure 2 shows the monthly occurrences for the top eight most reported types of code violations. The data indicate a strong seasonal pattern in code violations. Specifically, the reported code violations are high during the summer months when temperature is high and low in winter months when temperature is low. This is not surprising given that both “high weeds and grass” and “unclean premises”, defined as “property blight declared a nuisance” (City of Arlington, 2020), are the top two types of code violations that occur mostly in summertime. The results, when compared with those of water consumption and lockoffs, share similar, though not identical, temporal patterns. The temporal correlations are not surprising as water consumption is also the highest in the summer, which could trigger water lockoff if water bills are unpaid, a lagged in time as shown in Fig. 2.
4.2 Spatial patterns of demands for city services
The preceding results suggest a conceivable correlation between the demands for code compliance and water lockoff services. To further inspect the correlation and the likely causes, we analyze the spatial clusters of code violations and water lockoffs. The results suggest that the spatial pattern of demand for code compliance service is very similar to the spatial pattern of water lockoffs. As shown in the map on the left of Fig. 3, the majority of calls for code compliance services are in the east side of Cooper Street, and north and south of Highway I-20 in Arlington. Another hot spot is in the north side between Highway I-30 and Division Street where Downton Arlington and the University of Texas at Arlington are nearby. These areas are also home to many poor neighborhoods where median household income was $50 K or below according to the 2016 ACS data. The spatial pattern of water lockoffs, as shown in the map on the right of Fig. 3, closely resembles the pattern of calls for code compliance service. These insights generated from the Wisdom of Crowds, along with professional inputs from city staff members, point to the possible causes of the problem and theory base for predicting code violations as outlined below.
5 Converge of observations and theory
5.1 Hypotheses
The observed phenomenon may be explained by the social equilibrium theory. According to Ioannides (2012), individuals make location decisions rationally by weighing numerous factors with utility maximization in mind. As a result, individuals that share similar socioeconomic characteristics and values tend to settle in similar locations thus “prevail at a social equilibrium across communities” (Ioannides, 2012, p.80). Under the social equilibrium state, the socioeconomic dynamic in a neighborhood reflects the characteristics and values of individuals in the neighborhood. “When different individuals tend to act similarly because they have similar characteristics (or face similar institutional environment), we say they are subject to correlated effects” (Ioannides, 2012, p.79). The theory suggests that the “correlated effects” may be the sources for spatial clusters of actions, which may be explained by the socioeconomic characteristics across neighborhoods.
Figure 3 illustrates displays the possible relationships between the dependent variable “code violations” and the explanatory variables based on the social equilibrium theory. We hypothesize that characteristics representing the economic wellbeing of individuals and their households are key factors related to code violations. Collectively, the socioeconomic characteristics of a community, measured by education, age, household income and type, as well as housing occupancy characteristics are associated with the seasonal and spatial patterns of code violations that require city services. Housing cost represents a significant part of household expenses. In general, it is considered unaffordable if housing expense is more than 30% of the income (HUD, 2023). All else being equal, higher income is associated with less code violation problems because more financial resources enable families or households to afford better and larger houses, take care of their properties, and have more space to hold household belongings. Similarly, higher education level is expected to associate with fewer number of code violations, because education level is closely related to income and employment (Vilorio, 2016), thus the economic ability to take care of property. In addition, it is expected that holding other factors consistent, the higher percentage of the aged population is, the more likely the code violations because of the physical limitation for elderly to take care of properties. Moreover, the more owner-occupied housing units in a neighborhood, the higher number of reported violations because the neighborhood environment affects the property value of individual homeowners. To protect their home values, they are more likely to maintain their properties, pay more attention to the surrounding environment, and report code violations that negatively affect their property values. On the contrary, the more the renter-occupied units are in a neighborhood, the fewer reports of violation problem because renters are less attached to the community and the properties they reside (Rose & Harris, 2022). Following the “datafication” and its representation of unobserved facts arguments by proponents of Wisdom of Crowds (e.g., Caithness, 2018; Cukier & Mayer-Schoenberger, 2013; Höchtl et al., 2016), we presume that the demand for water lockoff service is a reflection of certain “correlated effects” under the social equilibrium framework. Its association with code violations is depicted as the dash-dotted line in Fig. 4. We hypothesize that the factor is positively associated with the calls for code compliance service. In this study, the dependent variable (LgCode) is the natural log of total code violations as the raw data is skewed. We adopt the natural log of water lockoffs (LgWLoff) for the same reason. Other independent variables are number of family households (Family HH), median household income (MedHHInc), renter-occupied housing units (HHRentOcc), percent population aged 65 and over (PctPop65+), and percent adults without high school diploma (PctLessHS).
5.2 Analyses results
We conducted a Spearman’s correlation analysis prior to modeling as it is “preferable when outlines are present” (De Winter et al., 2016). The results indicate that the highest correlation coefficient among all pairs of variables in the study is about 0.73. The variance inflation factor (VIF) indicators show that the highest VIF score is 2.41 with an average VIF score of 1.89 for all independent variables. The results suggest there is no serious multicollinearity problem between independent variables as they are all less than the threshold of 5 according to Menard (1995). Table 2 presents the descriptive analysis and multicollinearity analysis results.
Using the Forest-Based Classification and Regression tool in ArcGIS, we test two models – one includes only the variables that their relationships with the dependent variable have been informed by relevant literature. The other one leverages the water service data. To ensure the model stability, we adjusted the number of runs and trees, and found a combination of 10 runs and 150 tree for the training model produces the most reliable results in model outputs. As indicated in Table 3, the R2 of the Regression Diagnostics of the training data are quite similar in both models and the ranking order of the top variable importance is also consistent in both models. However, leveraging the water logoff data improves the model performance. For example, the percentage variation explained as indicated in the “out of bag errors” section is about 53% with a MSE of 0.12 for the model leveraging the water lockoff data, compared to about 28% with a MSE of 0.19 for the model without. The R2 value of the Regression Diagnostics of the validation data indicate that the model with water lockoff variable predicts code violation in the validation dataset with an accuracy of about 64%, compared to about 52% for the one without while the mean squared error is similar in both models, 0.13 and 0.122 respectively.
The model outputs also report the importance score and percentage for the independent variables. The importance score “is calculated using GINI coefficients” (ESRI, 2020). The score for each explanatory variable is the sum of the GINI coefficients from all the trees for that particular variable, and the percentage importance is the proportion of the importance score for that variable over the total score for all the explanatory variables in a model (ESRI, 2020). The results of model 1 indicate that renter-occupied variable is the most important in predicting the demand for code compliance service with a score of 10.7, which accounts for about 25% of the total score in the model. The age variable score is about 9.4, accounting for about 22% of the total score. The scores for the family household, education, and median household income variables are within the 7–8 range, accounting for about 17~18% of the total score. Adding the water lockoff variable does not change the order of these variables. However, the results of model 2 indicate that the variable of demand for water lockoff service is the most important in predicting the demand for code compliance service with a score of 18.2, which accounts for about 41% of the total score in the model. The model results also reveal that the importance of renter-occupied housing, age, and family household variables account for about 18%, 15%, and 11% of the total score respectively. On the other hand, the importance of education and median household income variables are about the same, each elucidates about 8% of the total score.
6 Discussion
6.1 The debate on “Wisdom of Crowds”
There has been a debate on the role of wisdom of crowds from big data and its applications. Contrary to the arguments for Wisdom of Crowds, the traditional model of science argues that theories function in three ways: to prevent the risk of supporting a spurious fluke, help make sense of observed patterns, and shape and direct research efforts (Babbie, 2004). In traditional research, models must be built upon solid theories rather than simple observation of trends, patterns, or correlations. Models based on theories about causation can help predict what would happen in the future or elsewhere.
While others caution the approach of abandoning theories and/or causation for correlation, some recognize the opportunities of ICTs in social research and call for leveraging data, big or small, to advance research. For example, citing works of Cukier (2010), Boyd and Crawford (2012), and many others, Kitchin (2014) recognized that “big data and new data analytics are disruptive innovations” that challenge and reshape the traditional ways of research in many fields. However, he also pointed out that a purely empiricist approach without reasoning, while attractive, would be risky because such approach “is based on fallacious thinking” in terms of issues related to data representativeness, scientific reasoning, the justification of analytics and algorithms being used, and the interpretation of results with knowledge. He therefore called for a data-driven science paradigm, which is a hybrid approach that both holds the principals of the scientific method but is more open to the new data analytical innovation “to advance the understanding of a phenomenon”. Following the similar thinking and arguments, Kitchin and Lauriault (2015) discussed the utilities and futilities of small and big data and data analytics, and saw opportunities of combining the two to leverage data for research advancement. They argued that “small data will increasingly be made more big-datalike through the development of new data infrastructures that pool, scale and link small data in order to create larger datasets, encourage sharing and reuse, and open them up to combination with big data and analysis using big data analytics”, and that “the potential to link data across domains is high” (Kitchin & Lauriault, 2015). These arguments have been echoed and further refined by more recent studies (see, e.g., Athey, 2017; Kettl, 2016, 2018; Lemire & Petersson, 2017; Vydra & Klievink, 2019).
6.2 The role of Wisdom of Crowds in city service planning
Our research intents to explore the role of wisdom of crowds in data analytics and application for prediction of code violations in city service planning. It follows a hybrid paradigm combining the Wisdom of Crowds and traditional model of science. The study demonstrates the feasibility of the data-driven approach to unearth the temporal and spatial correlation between demands for code violation compliance and water lockoff services. The results indicate that leveraging the water lockoff data improves the explanatory power for prediction of code violations. Our study suggests that the Wisdom of Crowds can play a significant role in discovering social phenomena that are unknown or less understood, and that the Wisdom of Crowds can be a good tool for inductive research.
The importance of data driven paradigm for research and practices in supporting sustainable cities has been demonstrated in some studies. For example, Mercader-Moyano et al. (2021) adopted an approach combining technical inspection performed by professional and street knowledge via participatory social survey from residents to diagnose and quantify the vulnerability of existing neighborhoods. Gandini et al. (2021) also involved multi-stakeholders in risk assessment and stressed that “multi-stakeholder context influences the definition of models”, data collection, decision-making, and implementation actions. This study adds an empirical application to the literature. It also contributes to the debate about “Wisdom of Crowds” and discussion on enhancement of data value.
The use of knowledge generated from data is the basis for administrative activities (Cortada, 2018). Information technology can have impacts on the operational efficiencies and organizational restructuring of public administration, and beyond (Cook, 2018). This research not only demonstrates the possibility of leveraging data enabled by contemporary ICTs through the data-driven paradigm, but also points to the potentials for collaboration between city service departments and efficient use of resources. In this particular example, staff members in both the code compliance and water departments have to be in the field to provide services. However, these departments often manage their operations and resources independently (i.e., decentralized management). Due to the limited city service resources (e.g., staff and budget), such decentralized management strategy may lead to reduction in operational efficiency and effectiveness, as well as increase in costs. In contrast, a centralized management strategy holds a great promise for improving coordination between departments and shared decision-making in planning and management of city services. As shown in the results of our spatial analysis, the demand for code compliance service bears a resemblance to the demand for water lockoff service in terms of locations. The temporal analysis also yields some similar pattern with a lag in time for water service demand, which could be complementary to both services in terms of staff workload. The findings suggest that if coordination between the two departments can be made, there is a potential for reducing duplicated trips, increasing service response time, and reducing the cost of service provision. For example, while the staff members issue the notice for code violations in a neighborhood, they can assist with the water lockoff service simultaneously, and given the demand and pattern of both services informed by our analysis, the number of staff members to be assigned to a specific neighborhood can be further investigated. While it is beyond the scope of this study, further analysis can investigate the options of collaboration and evaluate the extent of the benefits accordingly using techniques under the broad framework of “what if” analysis, also known as the “sensitivity analysis”, “decision-support”, “optimization”, or “artificial intelligence” tools. Some examples of such applications for smart city or public policy/public management can be found in studies of Urbieta et al. (2017), Jamei et al. (2017), to name a few.
6.3 Machine learning in urban studies
This research joins a limited number of studies in exploring the machine learning technique for urban studies. While not new in the science and engineering field, the use of spatial machine learning technique in urban studies is still in its infancy. Some studies have used this approach in recent years. For example, Helderop et al. (2019) used the approach to identify the spatial locations of prostitution activities in Phoenix, Arizona. Rummens and Hardyns (2020) also utilized the approach for making spatiotemporal predictions of crime. Rose and Dolega (2021) examined the linkage between weather conditions and retail sales. Together, these studies offer new and efficient ways for social science inquiries. The existing literature and our study provide insightful information for city planning, management, and policy making to better position a city’s economic competitive advantage and to increase the safety and sustainability of cities.
7 Summary and future studies
This research adopts a hybrid paradigm, combining the Wisdom of Crowds with the traditional model of science. By utilizing the spatial analysis and data mining tools, this study reveals the temporal and spatial patterns of service data generated by the City of Arlington. This Wisdom of Crowds, along with “street knowledge” from those who know the context of neighborhoods, provides insights into the formation of theory for prediction of code violations. Under the “Social Equilibrium” framework, the study explores the possibility of linking the service data with the census data and demonstrates that leveraging water data improves the explanatory power of the predictive models for future code compliance service demand. Leveraging service data for predictive model via the data-driven paradigm is a major departure from the traditional model of science. The Forest-Based Classification and Regression tool enables a novel application of machine learning tool for urban planning and city administration.
While this study has demonstrated the potential of leveraging data for proactive planning and management of city services, there is still much to explore. While the knowledge generated from data mining has led to the initial identification of theory for code violation prediction, the “correlated effects” in this specific context need to further clarify. Reasons behind the importance of water lockoffs in this particular application requires further investigation. In addition, the findings of this study are based on a case study of one city. The validity of leveraging water service data for prediction of code violations requires further investigation and demonstration. Moreover, further studies are needed to investigate the feasibility and potential opportunities for organization collaborations. As some scholars have pointed out, big data may not guarantee accuracy and could provide misleading information for decision making. Researchers must recognize the importance of theory in application of big data (Adolf & Stehr, 2018), and be mindful in research design and making the best use of big data (Lavertu, 2016; Zook, 2017). Nevertheless, this study contributes to the current debate on “Wisdom of Crowds” and research on the data driven approach for smart cities with empirical evidence. By integrating scientific reasoning with “street knowledge”, and the use of spatial data analysis techniques and the Forest-Based Classification and Regression tool, the study offers an example of data driven approach for leveraging data from various sources to improve prediction of future service demand. It generates new insights for the on-going discussion on the value and use of big data, and points to challenges and opportunities for further research on scenario planning and resource allocation.
Availability of data and materials
Some of the data used in this study are not publicly available because they were provided by the City of Arlington, Texas. Data requests may be made to the city.
References
Adolf, M. T., & Stehr, N. (2018). Information, knowledge, and the return of social physics. Administration & Society, 50(9), 1238–1258. https://doi.org/10.1177/0095399718760585
Albino, V., Berardi, U., & Dangelico, R. M. (2015). Smart cities: Definitions, dimensions, performance, and initiatives. Journal of Urban Technology, 22(1), 3–21. https://doi.org/10.1080/10630732.2014.942092
Amare, S., Langendoen, E., Keesstra, S., Ploeg, M. V. D., Gelagay, H., Lemma, H., & van der Zee, S. E. (2021). Susceptibility to gully erosion: Applying random forest (RF) and frequency ratio (FR) approaches to a small catchment in Ethiopia. Water, 13(2), 216. https://doi.org/10.3390/w13020216
Amit, Y., & Geman, D. (1997). “Shape quantization and recognition with randomized trees” (PDF). Neural Computation, 9(7), 1545–1588.
Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired. 23 June 2008. Available at: http://www.wired.com/science/discoveries/magazine/16-07/pb_theory. Accessed 17 Aug 2020.
Arribas-Bel, D., Kourtit, K., Nijkamp, P., et al. (2015). Cyber cities: Social media as a tool for understanding cities. Applied Spatial Analysis and Policy, 8, 231–247. https://doi.org/10.1007/s12061-015-9154-2
Athey, S. (2017). Beyond prediction: Using big data for policy problems. Science, 355(6324), 483–5.
Azzaoui, A. E. L., Singh, S. K., & Park, J. H. (2021). SNS big data analysis framework for COVID-19 outbreak prediction in smart healthy city. Sustainable Cities and Society, 71, 102993. https://doi.org/10.1016/j.scs.2021.102993
Babbie, E. R. (2004). The practice of social research (10th ed.). Thomson Wadsworth. Print.
Barns, S. (2016). Mine your data: Open data, digital strategies and entrepreneurial governance by code. Urban Geography, 37(4), 554–571. https://doi.org/10.1080/02723638.2016.1139876
Bartram, R. (2019). Going easy and going after: Building inspections and the selective allocation of code violations. City & Community, 18(2), 594–617. https://doi.org/10.1111/cico.12392
Benbouzid, B. (2019). To predict and to manage. Predictive policing in the United States. Big Data & Society. https://doi.org/10.1177/2053951719861703
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. https://doi.org/10.1080/1369118X.2012
Caithness, A. (2018). Digital evidence does not exist in a vacuum. The Expert Witness Journal, 17 February 2018, at https://www.expertwitnessjournal.co.uk/forensics/937-digital-evidence-does-not-exist-in-a-vacuum. Accessed 15 Aug 2020.
Castelnovo, W., & Simonetta, M. (2008). A public value evaluation of e-government policies. The Electronic Journal Information Systems Evaluation, 11(2), 61–72.
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165–1188.
Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171–209.
City of Arlington. (2020). Article II, Section 2.02, B. of the Nuisance Chapter, Ordinance, at https://arlingtontx.gov/UserFiles/Servers/Server_14481062/File/City%20Hall/Depts/Code%20Compliance/Single-Family%20Residential/Rapid%20Reference%20Guide/Unclean%20Premises/NUISChapter.pdf#page=16. Accessed 23 May 2020.
Cook, B. J. (2018). Old and new: Information technology and administration. Administration & Society, 50(9), 1207–1207. https://doi.org/10.1177/0095399718796379
Cortada, J. W. (2018). Exploring how ICTs and administration are entwined: The promise of information ecosystems. Administration & Society, 50(9), 1213–1237. https://doi.org/10.1177/0095399718760584
Cukier, K., & Mayer-Schoenberger, V. (2013). The rise of big data: How it’s changing the way we think about the world. Foreign Affairs, 92(3), 28–40.
Cukier, K. (2010). Data, data everywhere. The Economist, Special Report, February 27th 2010 edition, https://www.economist.com/special-report/2010/02/27/data-data-everywhere. Accessed 3 Aug 2020.
De Winter, J. C., Gosling, S. D., & Potter, J. (2016). Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data. Psychological Methods, 21(3), 273. https://doi.org/10.1037/met0000079
Desouza, K. C., & Jacob, B. (2017). Big data in the public sector: Lessons for practitioners and scholars. Administration and Society, 49(7), 1043–1064. https://doi.org/10.1177/0095399714555751
Environmental Systems Research Institute (ESRI). (2020). How forest-based classification and regression works. https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/how-forest-works.htm. Accessed 5 Dec 2020.
Gandini, A., Quesada, L., Prieto, I., & Garmendia, L. (2021). Climate change risk assessment: A holistic multi-stakeholder methodology for the sustainable development of cities. Sustainable Cities and Society, 65, 102641. https://doi.org/10.1016/j.scs.2020.102641
Gomory, H., & Desmond, M. (2023). Neighborhoods of last resort: How landlord strategies concentrate violent crime. Criminology, 61, 270–294. https://doi.org/10.1111/1745-9125.12332
Grömping, U. (2009). Variable importance assessment in regression: Linear regression versus random forest. The American Statistician, 63(4), 308–319.
Hamstead, Z. A., Fisher, D., Ilieva, R. T., Wood, S. A., McPhearson, T., & Kremer, P. (2018). Geolocated social media as a rapid indicator of park visitation and equitable park access. Computers, Environment and Urban Systems, 72, 38–50. https://doi.org/10.1016/j.compenvurbsys.2018.01.007
Harrington, J. (2018). Travelers take note: These large cities in America offer no public transit. USA Today. Dec. 4, 2018. https://www.usatoday.com/story/travel/experience/america/fifty-states/2018/12/04/americas-largest-cities-with-no-public-transportation/38628503/. Accessed 10 May 2020.
Hashem, I. A., Targio, V. C., Anuar, N. B., Adewole, K., Yaqoob, I., Gani, A., Ahmed, E., & Chiroma, H. (2016). The role of big data in smart city. International Journal of Information Management, 36(5), 748–758.
Helderop, E., Huff, J., Morstatter, F., et al. (2019). Hidden in plain sight: A machine learning approach for detecting prostitution activity in Phoenix, Arizona. Applied Spatial Analysis and Policy, 12, 941–963. https://doi.org/10.1007/s12061-018-9279-1
Höchtl, J., Parycek, P., & Schöllhammer, R. (2016). Big data in the policy cycle: Policy decision making in the digital era. Journal of Organizational Computing and Electronic Commerce, 26(1–2), 147–169. https://doi.org/10.1080/10919392.2015.1125187
Hong, S., Hyoung Kim, S., Kim, Y., & Park, J. (2019). Big data and government: Evidence of the role of big data for smart cities. Big Data & Society, January–June 2019, 1–11. https://doi.org/10.1177/2053951719842543
Housing and Urban Development (HUD). (2023). Glossary of terms to affordable housing. U.S. Department of Housing and Urban Development. At https://archives.hud.gov/local/nv/goodstories/2006-04-06glos.cfm. Accessed 10 Aug 2023.
Ingrams, A. (2018). Public values in the age of big data: A public information perspective. Policy and Internet, 11(2), 128–148.
Ioannides, Y. M. (2012). From neighborhoods to nations the economics of social interactions (Core Textbook). Princeton University Press. https://doi.org/10.1515/9781400845385
Jamei, E., Mortimer, M., Seyedmahmoudian, M., Horan, B., & Stojcevski, A. (2017). Investigating the role of virtual reality in planning for sustainable smart cities. Sustainability, 9, 2006. https://doi.org/10.3390/su9112006
Kettl, D. F. (2016). Making data speak: Lessons for using numbers for solving public policy puzzles. Governance, 29(4), 573–579. https://doi.org/10.1111/gove.12211
Kettl, D. F. (2018). Little bites of big data for public policy. CQ Press.
Khan, M., Babar, M., Ahmed, S. H., Shah, S. C., & Han, K. (2017). Smart city designing and planning based on big data analytics. Sustainable Cities and Society, 35, 271–279. https://doi.org/10.1016/j.scs.2017.07.012
Kitchin, R. (2014). Big data, new epistemologies and paradigm shifts. Big Data & Society, 1(1), 2053951714528481. https://doi.org/10.1177/2053951714528481
Kitchin, R., & Lauriault, T. P. (2015). Small data in the era of big data. GeoJournal, 80(4), 463–475. https://doi.org/10.1007/s10708-014-9601-7
Lavertu, S. (2016). We all need help: “Big data” and the mis-measure of public administration. Public Administration Review, 76(6), 864–872.
Lemire, S., & Petersson, G. J. (2017). Big bang or big bust? The role and implications of big data in evaluation. In G. J. Petersson & J. D. Breul (Eds.), Cyber society, big data, and evaluation: Comparative policy evaluation. Transaction Publishers.
Lim, C., Kim, K. J., & Maglio, P. P. (2018). Smart cities with big data: Reference models, challenges, and considerations. Cities, 82, 86–99.
Liu, H., & Brown, D. E. (2003). Criminal incident prediction using a point-pattern-based density model. International Journal of Forecasting, 19(4), 603–622. https://doi.org/10.1016/S0169-2070(03)00094-3. ISSN 0169-2070.
Malik, K. R., Sam, Y., Hussain, M., & Abuarqoub, A. (2018). A methodology for real-time data sustainability in smart city: Towards inferencing and analytics for big-data. Sustainable Cities and Society, 39, 548–556. https://doi.org/10.1016/j.scs.2017.11.031
Menard, S. W. (1995). Applied logistic regression analysis. Sage Publications.
Mercader-Moyano, P., Morat, O., & Serrano-Jiménez, A. (2021). Urban and social vulnerability assessment in the built environment: An interdisciplinary index-methodology towards feasible planning and policy-making under a crisis context. Sustainable Cities and Society, 73, 103082. https://doi.org/10.1016/j.scs.2021.103082
Morckel, V. C. (2013). Empty neighborhoods: Using constructs to predict the probability of housing abandonment. Housing Policy Debate, 23(3), 469. https://doi.org/10.1080/10511482.2013.788051
Nam, T., & Pardo, T. A. (2011). Smart city as urban innovation: Focusing on management, policy, and context. In Proc. 5th Int. Conf. Theory Pract. Electron. Gov. (pp. 185–194). https://doi.org/10.1145/2072069.2072100
Neirotti, P., De Marco, A., Cagliano, A. C., Mangano, G., & Scorrano, F. (2014). Current trends in Smart City initiatives: Some stylised facts. Cities, 38, 25–36.
Nuaimi, A. E., Al Neyadi, H., Mohamed, N., et al. (2015). Applications of big data to smart cities. Journal of Internet Services and Applications, 6, 25. https://doi.org/10.1186/s13174-015-0041-5
Osman, A. M. S. (2019). A novel big data analytics framework for smart cities. Future Generation Computer Systems, 91, 620–633.
Pan, Y., Tian, Y., Liu, X., Dedao, Gu., & Hua, G. (2016). Urban big data and the development of city intelligence. Engineering, 2(2), 171–178.
Pencheva, I., Esteve, M., & Mikhaylov, S. J. (2020). Big Data and AI – A transformational shift for government: So, what next for research? Public Policy and Administration, 35(1), 24–44.
Rathore, M. M., Paul, A., Hong, W.-H., Seo, HyunCheol, Awan, I., & Saeed, S. (2018). Exploiting IoT and big data analytics: Defining Smart Digital City using real time urban data. Sustainable Cities and Society, 40, 600–610. https://doi.org/10.1016/j.scs.2017.12.022
Robb, D. A. N., Marcoux, A., McAteer, M., & de Jong, J. (2022). Using integrated city data and machine learning to identify and intervene early on housing-related public health problems. Journal of Public Health Management and Practice, 28(2), E497–E505. https://doi.org/10.1097/PHH.0000000000001343
Rose, N., & Dolega, L. (2021). It’s the weather: Quantifying the impact of weather on retail sales. Applied Spatial Analysis and Policy. https://doi.org/10.1007/s12061-021-09397-0
Rose, G., & Harris, R. (2022). The three tenures: A case of property maintenance. Urban Studies, 59(9), 1926–1943. https://doi.org/10.1177/00420980211029203
Rosenfeld, R. R., Chew, G. L., Emmons, K., & Acevedo-García, D. (2010). Are neighborhood-level characteristics associated with indoor allergens in the household? Journal of Asthma, 47(1), 66–75. https://doi.org/10.3109/02770900903362676
Rummens, A., & Hardyns, W. (2020). Comparison of near-repeat, machine learning and risk terrain modeling for making spatiotemporal predictions of crime. Applied Spatial Analysis and Policy, 13, 1035–1053. https://doi.org/10.1007/s12061-020-09339-2
Twizeyimana, J. D., & Andersson, A. (2019). The public value of E-Government – A literature review. Government Information Quarterly, 36(2), 167–178. https://doi.org/10.1016/j.giq.2019.01.001
Ullah, F., Habib, M. A., Farhan, M., Khalid, S., Durrani, M. Y., & Jabbar, S. (2017). Semantic interoperability for big-data in heterogeneous IoT infrastructure for healthcare. Sustainable Cities and Society, 34, 90–96. https://doi.org/10.1016/j.scs.2017.06.010
Urbieta, A., González-Beltrán, A., Ben Mokhtar, S., Anwar Hossain, M., & Capra, L. (2017). Adaptive and context-aware service composition for IoT-based smart cities. Future Generation Computer Systems, 76, 262–274.
Vilorio, D. (2016) Education matters. Career Outlook. U.S. Bureau of Labor Statistics, at https://www.bls.gov/careeroutlook/2016/data-on-display/education-matters.htm. Accessed 22 Aug 2023.
Vydra, S., & Klievink, B. (2019). Techno-optimism and policy-pessimism in the public sector big data debate. Government Information Quarterly, 36, 101383. https://doi.org/10.1016/j.giq.2019.05.010.
Washburn, D., & Sindhu, U. (2010). Helping CIOs understand “smart city” initiatives: defining the smart city, its drivers, and the role of the CIO. Forrester Research, Inc. https://s3-us-west-2.amazonaws.com/itworldcanada/archive/Themes/Hubs/Brainstorm/forrester_help_cios_smart_city.pdf
Wei, S., Zhou, X., Wei, Wu., Qiang, Pu., Wang, Q., & Yang, X. (2018). Medical image super-resolution by using multi-dictionary and random forest. Sustainable Cities and Society, 37, 358–370. https://doi.org/10.1016/j.scs.2017.11.012
World Economic Forum (WEF). (2023). Data for the city of tomorrow: Developing the capabilities and capacity to guide better urban futures. At https://www3.weforum.org/docs/WEF_Data_for_the_City_of_Tomorrow_2023.pdf. Accessed 1 Aug 2023.
World Population Review. (2020). Arlington, Texas population 2020. https://worldpopulationreview.com/us-cities/arlington-tx-population. Accessed 13 Aug 2020.
Ye, X., Wu, L., Lemke, M., Valera, P., & Sackey, J. (2022). Defining computational urban science. In New thinking in GIScience (pp. 293–300). Springer Nature Singapore.
Yigitcanlar, T., & Lee, S. H. (2014). Korean ubiquitous-eco-city: A smart-sustainable urban form or a branding hoax? Technological Forecasting and Social Change, 89, 100–114. https://doi.org/10.1016/j.techfore.2013.08.034
Yin, C., Xiong, Z., Chen, H., Wang, J., Cooper, D., & David, B. (2015). A literature survey on smart cities. Science China Information Sciences, 58(10), 1–18.
Yu, Z. J., Haghighat, F., Fung, B. C., & Zhou, L. (2012). A novel methodology for knowledge discovery through mining associations between building operational data. Energy and Buildings, 47, 43–440. https://doi.org/10.1016/j.enbuild.2011.12.018
Zook, M. (2017). Crowd-sourcing the smart city: Using big geosocial media metrics in urban governance. Big Data & Society. https://doi.org/10.1177/2053951717694384
Acknowledgements
The authors are grateful for the financial support from the City of Arlington, Texas, and the University of Texas at Arlington. We wish to thank the two anonymous reviewers for their constructive comments and suggestions. We appreciate very much the research support of Mike Bass, Craig Cummings, Director of Water Utilities, David Stapp, Carol Weemes, and Jennifer Wichmann from the city. The views and errors remain the responsibility of the authors.
Funding
The research was supported by the City of Arlington, Texas, and the University of Texas at Arlington.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by J. Li and M. Zhou. The first draft of the manuscript was written by J. Li and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no conflict of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, J., Zhou, Y. & Ye, X. Data-driven service planning in the Petabyte Age: the case of Arlington, Texas. Urban Info 2, 5 (2023). https://doi.org/10.1007/s44212-023-00030-8
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s44212-023-00030-8