1 Introduction

A main goal of cities is to create a sustainable place with a high-quality living environment for their citizens, businesses, and visitors (Yigitcanlar & Lee, 2014). This requires proactive planning and management based on information generated from big data in the Petabyte Age, a term referring to the new era of data technologies. Advances in information and communication technologies (ICTs) have enabled cities around the world to generate a large amount and a wide variety of data in the 21st century. A main challenge is to transform these data into meaningful knowledge to aid a city’s strategic planning, service provision, and resource allocation decisions (Yu et al., 2012; Lim et al., 2018; Pencheva et al., 2020; Ye et al., 2022; WEF, 2023). While data, whether big or small, have been used in the planning of various services, such as transportation, public health, safety, and housing (Liu & Brown, 2003; Morckel, 2013; WEF, 2023), their application in code compliance service planning remains infrequent, despite the latter’s significance for the vitality of cities. Little is understood regarding the role of the “Wisdom of Crowds” – a theory emphasizing the acquisition of knowledge from data prior to forming theoretical constructs (Anderson, 2008) – in city service planning.

Using the City of Arlington, Texas as a case study, this research explores a hybrid analytical approach that combines the Wisdom of Crowds with spatial data-driven techniques to leverage data for developing predictive models to support the planning decisions for the city’s code compliance service. Specifically, it focuses on exploring the temporal and spatial patterns of the demand for code compliance service, and their relation to the demand for water service, using spatial modeling techniques. It also investigates the role of the Wisdom of Crowds in forming hypotheses and the opportunity of using a machine-learning technique for predicting the demand for code compliance service.

This study fills a gap in the existing knowledge regarding data-driven approaches to code compliance service planning. It joins a limited number of studies in exploring the applicability of a novel analytical tool for the planning of code compliance service. The empirical evidence from the case study has implications for enhancing the value of big data, and should be of interest to smart city research, information science, organizational collaboration research and practice, city planners, and managers or decision makers of public works. The data leveraging approach can be adopted by practitioners under their respective circumstances. In the subsequent sections, we begin by providing a context for this study, then detail our research approach. Following this, we present and discuss our research findings. The paper concludes with a summary of the findings and potential future research.

2 Smart city, big data, opportunities, and challenges for service planning

A smart city is a common vision for urban development around the world (Jamei et al., 2017). Numerous studies have defined the concept of a smart city. While studies may define the concept differently by emphasizing different aspects of smart cities, most characterize a smart city as a system of ICTs and applications of big, dynamic, real-time data by and for people and communities. For example, Washburn and Sindhu (2010) defined a smart city as “a city with a great presence of ICT applied to critical infrastructure components and services.” Yin et al. (2015) argued that “ICT applications and intensive use of digital artifacts such as sensors, actuators and mobiles are essential means for realizing smartness in any of smart city domains.” The characterization of smart cities with the presence and applications of ICT can also be seen from other studies (see, e.g., Neirotti et al., 2014; Barns, 2016; Jamei et al., 2017; Malik et al., 2018; and Osman, 2019). Additionally, studies argued that smart cities are not only limited to ICTs, but also people and community needs (Nam & Pardo, 2011; Albino et al., 2015).

“Big data”, on the other hand, generally refer to large, complex, both dynamic and static sets of data that represent digital traces of human activities and objects including various urban facilities, organizations, and individuals (Chen et al., 2012, 2014; Pan et al., 2016). Besides the size, speed, array of data formats, and quality of data, known as the Volume, Velocity, Variety, Veracity of big data (Desouza & Jacob, 2017; Osman, 2019), many scholars have argued that big data have “Values” as they enable technical prediction, creation of public policy; help the public sector to better engage with citizens and to improve public services, administration, and social values (see, e.g., Arribas-Bel et al., 2015; Castelnovo & Simonetta, 2008; Desouza & Jacob, 2017; Hashem et al., 2016; Ingrams, 2018; Lim et al., 2018; Twizeyimana & Andersson, 2019). Pan et al. (2016) further suggested that urban big data share additional characteristics such as hierarchy that “reflects the organizational hierarchy of a city’s physical and social systems”, and correlations that “can be used, not only for mutual corroboration, but also for cooperative reasoning and mining rules of cities operation”.

While researchers have recognized the characteristics and values of big data, they have also noted the challenges of transforming data into knowledge. For example, Nuaimi et al. (2015) acknowledged that “effective analysis and utilization of big data is a key factor for success in many business and service domains, including the smart city domain”. They also identified many challenges in supporting smart cities, among which are challenges in data due to the diversity in sources and formats, data processing methods, and the difficulty of creating “a unified understanding of data semantics” and extracting “new knowledge based on specific cycle data and real-time data”. Pencheva et al. (2020) noted that while public organizations have generated numerous data, “they are often unlikely to use it to gather valuable insights or transform services” due to challenges in privacy and security issues related to data and data analysis at the system, organization, and individual levels. Vydra and Klievink (2019) further argued that the “limitations and challenges of using big data” have been discussed mostly in the manner of acknowledgement but rarely assessed systematically, and that the benefits of big data may be hindered by political decision-making factors.

A significant body of research has been devoted to the architecture and system developments of data collection. Research in this area is crucial not only for ensuring the volume, velocity, variety, and veracity of data but also as a precondition for maximizing the value of big data in city service planning, provision, and management. For example, Malik et al. (2018) proposed a model for data transformation. Through a case study, they demonstrated its application to a weather data system. Similarly, Khan et al. (2017) developed an Internet of Things (IoT) architecture for energy-related data gathering and energy-aware communication. With the availability of anonymized cell-phone data, researchers are able to analyze the patterns and disparities of mobility in cities. The information is useful for planning transportation infrastructure and improving service efficiency and equality (Rathore et al., 2018; WEF, 2023). Other studies have focused on the use of big data for predictions of crime (Liu & Brown, 2003; Gomory & Desmond, 2023), housing value and price (Morckel, 2013; Bartram, 2019), demand for park service (Hamstead et al., 2018), and epidemiology and healthcare (Azzaoui et al., 2021; Ullah et al., 2017), to name a few. Nevertheless, less research has focused on the demand for code compliance service despite its importance. Code violations pose a significant threat to the safety and stability of cities as they tend to attract crime, decrease property values, and cause other consequences (Bartram, 2019; Gomory & Desmond, 2023; Rosenfeld et al., 2010; Robb et al., 2022). In order to maintain the economic prosperity, equality, and livability of cities, planners must take full advantage of the data enabled by contemporary technologies and proactively plan for code compliance service.
The success of a smart city relies on a system approach for service planning and resource management, as services in different areas are interdependent and yet competing for resources. Decisions in one area may impact the service or wellbeing of cities in other areas (WEF, 2023).

In sum, the rise of big data, along with new data mining software tools, has provided numerous opportunities for cities to gain insights for service planning and resource management. While some have explored the potential of using big data for predictive management of public services, empirical evidence on the role of big data for improving government services is limited (Benbouzid, 2019; Hong et al., 2019). Although code violations are prevalent in cities and pose significant threats to their safety, prosperity, and livability, there is scant research on code compliance service planning. Overall, existing research advocates for the inventive use of data and innovative analytical techniques to convert data into knowledge, thereby supporting code compliance service planning and decision-making.

3 Research approach

Building upon the knowledge of the existing literature, this study aims to explore a data-driven approach to improve the power of demand prediction for code compliance service planning and to discuss the potential collaboration among departments within a city using Arlington as a case study. In the sections below, a general research process is proposed, followed by a description of the study area, data, and analytical tools for the research.

3.1 Research process

Inspired by the work of Kitchin (2014) and many others, this study follows a hybrid analytical approach that combines the “Wisdom of Crowds” for data mining and the traditional model of science in code violation prediction (Fig. 1). On the one hand, the “Wisdom of Crowds” allows gaining knowledge from data. On the other hand, the traditional model of science emphasizes the importance of building models based on theories. In essence, our approach treats the “Wisdom of Crowds” as an inductive research tool. It starts the inquiry with an inductive approach to discover correlations among social phenomena as proposed by proponents of the “Wisdom of Crowds”, followed by a deductive approach to investigate the associations discovered in the first stage of the research. In the case of city service planning, the first stage includes mining the raw data of calls for services to inspect their temporal and spatial patterns, sharing the findings with those with “street knowledge” to uncover the nature of the issues, and theoretical reasoning about the observed patterns. In the second stage, hypotheses are formulated based on the discoveries from the first stage as well as relevant theories or empirical findings from the literature. Based on these established hypotheses, relevant factors are selected, their relationships are tested, and explanations are either confirmed or rejected.

Fig. 1
figure 1

General research approach

3.2 Study area and data

The City of Arlington is located in the middle of the Dallas/Fort Worth Metropolitan area in Texas. It is known as the home of the Texas Rangers and the Dallas Cowboys. According to the World Population Review (2020), the city’s population has surpassed 400,000, and it had been the largest city without public transit in the U.S. (Harrington, 2018) until 2017, when the city teamed up with Via, a transportation network company (TNC), to provide mobility service in place of traditional public transit.

Over the years, the City of Arlington as a proto “Smart City” has collected a variety of data in water, code compliance, library, building permits, police, recreation, paramedic services, etc. These data are generated by ICTs such as the city’s Action Center, the Ask Arlington App, and other means. Like many cities, the city has the desire to provide effective services to its citizens and manage its services more efficiently and effectively. The data, along with other public data, provide opportunities for data-driven discovery to aid the city’s strategic planning, service provision, and resource allocation decisions. This study is the first attempt to explore such opportunities.

For the purpose of this study, we use the 2016 calls for code compliance and water lockoff data provided by the City of Arlington, as they were the complete data available at the time of the research. These data were shared in Excel files. The code compliance dataset contains the dates when incidents were reported, the locations of the concerned properties, the violation types, the zoning types of the properties, and other data related to code compliance services. The water lockoff dataset includes premise addresses and the lockoff dates associated with individual records. Neither dataset contains personally identifiable information such as names or demographic characteristics of individuals. The total number of cases is 37,536 for code violations and 10,298 for water lockoffs, respectively. In addition, we collect the 2016 American Community Survey (ACS) 5-year estimates from the U.S. Census Bureau. The ACS data are at the census block group level. Additional data, such as transportation network and city boundary data, are collected from the North Central Texas Council of Governments (NCTCOG). All data are integrated or aggregated by the researchers for analysis where appropriate.

3.3 Analysis tools

We apply Geographic Information System (GIS) mapping and machine learning techniques as the data discovery and analysis tools, because human activities and service demands exist in place. ArcGIS is used to geocode the addresses of incidents requiring city services, to integrate data from all sources, and to form the database for the study. The hot spot analysis tool in ArcGIS is applied to analyze the spatial patterns of calls for city services. The spatial distributions and clusters resulting from the hot spot analysis provide the basis for forming hypotheses regarding factors contributing to service demand and for identifying data for statistical testing.
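The hot spot analysis tool in ArcGIS is based on the local Getis-Ord Gi* statistic. As an illustrative sketch only (not the ESRI implementation), the Gi* z-score for each location can be computed with a simple fixed distance band as below; the toy grid coordinates and counts are hypothetical:

```python
import math

def gi_star(points, values, d):
    """Local Getis-Ord Gi* z-scores with a fixed distance band d.
    Each point's own value is included in its neighborhood (the * variant)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    zscores = []
    for xi, yi in points:
        # Binary weights: 1 if within distance d of (xi, yi), self included
        w = [1.0 if math.hypot(xi - xj, yi - yj) <= d else 0.0
             for (xj, yj) in points]
        sw = sum(w)
        num = sum(wi * v for wi, v in zip(w, values)) - mean * sw
        den = s * math.sqrt((n * sum(wi * wi for wi in w) - sw ** 2) / (n - 1))
        zscores.append(num / den)
    return zscores

# Toy example: a 5x5 grid of cells with a cluster of high counts in one corner
points = [(x, y) for x in range(5) for y in range(5)]
values = [10.0 if (x <= 1 and y <= 1) else 1.0 for (x, y) in points]
z = gi_star(points, values, d=1.0)
```

Cells inside the high-count corner receive large positive z-scores (hot spots), while cells far from it receive near-zero or negative scores, which is the logic behind the cluster maps in Fig. 3.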

The Forest-Based Classification and Regression tool offered by ArcGIS is adopted for statistical modeling. It is a machine learning tool based on “Leo Breiman’s random forest algorithm, a supervised machine learning method”, which trains “a model based on known values provided as part of a training dataset”. In brief, the tool “creates many decision trees” based on the data. Each decision tree is established “by recursively partitioning the sample into more and more homogeneous groups” until no further groups can be split (Grömping, 2009). Collectively these decision trees form the forest from which the model is established. Such a model “can then be used to predict unknown values in a prediction dataset that has the same associated explanatory variables” (ESRI, 2020).
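The recursive-partitioning logic can be illustrated with a deliberately minimal toy in plain Python. This is a sketch of the general random forest idea, not the ESRI implementation, and omits refinements such as random feature subsets: each tree is grown on a bootstrap sample by repeatedly choosing the split that most reduces squared error, and the forest predicts by averaging the trees.

```python
import random

def grow_tree(xs, ys, min_size=2):
    """Recursively partition (x, y) pairs into more homogeneous groups."""
    if len(ys) < min_size or len(set(xs)) == 1:
        return sum(ys) / len(ys)               # leaf: predict the group mean
    best = None
    for t in sorted(set(xs))[1:]:              # candidate split thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t)
    t = best[1]
    lx = [(x, y) for x, y in zip(xs, ys) if x < t]
    rx = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return (t,
            grow_tree([x for x, _ in lx], [y for _, y in lx], min_size),
            grow_tree([x for x, _ in rx], [y for _, y in rx], min_size))

def predict_tree(node, x):
    while isinstance(node, tuple):             # descend until a leaf value
        node = node[1] if x < node[0] else node[2]
    return node

def forest(xs, ys, n_trees=50, seed=0):
    """Grow n_trees trees on bootstrap samples; predict by averaging."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]   # bootstrap sample
        trees.append(grow_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_tree(t, x) for t in trees) / n_trees
```

On a simple step-shaped relationship, the averaged trees recover the two levels closely, which conveys why the forest handles non-linear patterns without a parametric form.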

The machine learning tool has several advantages, including the random nature of the tool, the use of big data, good resistance to noise, and the ability to reduce the need to aggregate data, which could result in the loss of original data validity. The machine learning tool is particularly suitable for non-linear modeling because of its nonparametric forest approach. Additionally, it is less susceptible to multicollinearity issues due to the random sampling method employed, which decorrelates the multiple trees based on different training subsets (Amit & Geman, 1997; Grömping, 2009). The tool can save time and steps for data aggregation and is flexible for data mining in either the n >> p (number of observations far exceeding the number of variables) or the p >> n setting (Amare et al., 2021; Grömping, 2009; Wei et al., 2018; ESRI, 2020). Moreover, the tool can be used to rank the importance of variables in a classification and regression problem. It also works well on imbalanced datasets and can be easily implemented.

However, like many random forest models, the tool sacrifices the intrinsic interpretability of individual decision trees compared to Ordinary Least Squares regression models and may still be vulnerable to overfitting on highly noisy datasets. Although neither hot spot analysis nor the Forest-Based Classification and Regression tool is new, their applications as data mining tools in the context of code compliance service planning are rare. Exploring their potential in this line of study may provide insight for the debate over the role of the Wisdom of Crowds and future applications for service planning.

4 Discovery of knowledge from Wisdom of Crowds

4.1 Types and temporal distribution of code violations

According to Cukier and Mayer-Schoenberger (2013), data can undergo “datafication”, a term that refers to “the ability to render into data [that can measure] many aspects of the world that have never been quantified before”, including those associated with geographic locations. Similarly, Höchtl et al. (2016) and Caithness (2018) argued that “data don’t exist in a vacuum”, and that they represent “[t]he undeniable truth of facts” based on social network/social learning theory and Locard’s exchange principle. Following this line of thought and the research process outlined in the research approach section, we first analyze the types and temporal distribution of the reported code violations.

According to the City of Arlington, code violations are classified into 94 categories (City of Arlington, 2020). Analysis of the calls for code compliance service reveals that the majority of the reported code violations are in 8 code categories. All of these top violations are property related. Together, they account for about 73% of the total violations (Table 1).

Table 1 Number of reported code violations in 2016

Figure 2 shows the monthly occurrences of the top eight most reported types of code violations. The data indicate a strong seasonal pattern in code violations. Specifically, the reported code violations are high during the summer months when temperatures are high and low in the winter months when temperatures are low. This is not surprising given that “high weeds and grass” and “unclean premises”, defined as “property blight declared a nuisance” (City of Arlington, 2020), are the top two types of code violations and occur mostly in summertime. The results, when compared with those of water consumption and lockoffs, share similar, though not identical, temporal patterns. The temporal correlations are not surprising, as water consumption is also highest in the summer, which could trigger water lockoffs if water bills go unpaid, with a lag in time as shown in Fig. 2.
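The monthly tallies underlying Fig. 2 amount to a simple aggregation of report dates. A minimal sketch, with hypothetical records standing in for rows of the city’s code compliance export:

```python
from collections import Counter
from datetime import date

# Hypothetical records: (report date, violation category) tuples standing in
# for rows of the city's code compliance dataset
records = [
    (date(2016, 7, 3), "High weeds and grass"),
    (date(2016, 7, 19), "Unclean premises"),
    (date(2016, 8, 2), "High weeds and grass"),
    (date(2016, 1, 11), "Unclean premises"),
]

# Count reported violations per calendar month to expose the seasonal pattern
monthly = Counter(d.month for d, _ in records)

# Tally by violation category to find the dominant types (cf. Table 1)
by_type = Counter(cat for _, cat in records)
```

The same two Counters, applied to the full 37,536-case dataset, yield the monthly series plotted in Fig. 2 and the category shares summarized in Table 1.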

Fig. 2
figure 2

Top 8 code violations, water lockoffs, and water usage by month

4.2 Spatial patterns of demands for city services

The preceding results suggest a conceivable correlation between the demands for code compliance and water lockoff services. To further inspect the correlation and the likely causes, we analyze the spatial clusters of code violations and water lockoffs. The results suggest that the spatial pattern of demand for code compliance service is very similar to the spatial pattern of water lockoffs. As shown in the map on the left of Fig. 3, the majority of calls for code compliance services are on the east side of Cooper Street, north and south of Highway I-20 in Arlington. Another hot spot is on the north side between Highway I-30 and Division Street, near Downtown Arlington and the University of Texas at Arlington. These areas are also home to many poor neighborhoods where median household income was $50 K or below according to the 2016 ACS data. The spatial pattern of water lockoffs, as shown in the map on the right of Fig. 3, closely resembles the pattern of calls for code compliance service. These insights generated from the Wisdom of Crowds, along with professional input from city staff members, point to the possible causes of the problem and the theory base for predicting code violations, as outlined below.

Fig. 3
figure 3

Clusters of code violations and water lockoffs

5 Converge of observations and theory

5.1 Hypotheses

The observed phenomenon may be explained by the social equilibrium theory. According to Ioannides (2012), individuals make location decisions rationally by weighing numerous factors with utility maximization in mind. As a result, individuals that share similar socioeconomic characteristics and values tend to settle in similar locations thus “prevail at a social equilibrium across communities” (Ioannides, 2012, p.80). Under the social equilibrium state, the socioeconomic dynamic in a neighborhood reflects the characteristics and values of individuals in the neighborhood. “When different individuals tend to act similarly because they have similar characteristics (or face similar institutional environment), we say they are subject to correlated effects” (Ioannides, 2012, p.79). The theory suggests that the “correlated effects” may be the sources for spatial clusters of actions, which may be explained by the socioeconomic characteristics across neighborhoods.

Figure 4 illustrates the possible relationships between the dependent variable “code violations” and the explanatory variables based on the social equilibrium theory. We hypothesize that characteristics representing the economic wellbeing of individuals and their households are key factors related to code violations. Collectively, the socioeconomic characteristics of a community, measured by education, age, household income and type, as well as housing occupancy characteristics, are associated with the seasonal and spatial patterns of code violations that require city services. Housing cost represents a significant part of household expenses. In general, housing is considered unaffordable if its expense exceeds 30% of household income (HUD, 2023). All else being equal, higher income is associated with fewer code violation problems because more financial resources enable families or households to afford better and larger houses, take care of their properties, and have more space to hold household belongings. Similarly, a higher education level is expected to be associated with fewer code violations, because education level is closely related to income and employment (Vilorio, 2016), and thus the economic ability to take care of property. In addition, it is expected that, holding other factors constant, the higher the percentage of the aged population, the more likely code violations are, because of the physical limitations of the elderly in taking care of properties. Moreover, the more owner-occupied housing units in a neighborhood, the higher the number of reported violations, because the neighborhood environment affects the property values of individual homeowners. To protect their home values, homeowners are more likely to maintain their properties, pay more attention to the surrounding environment, and report code violations that negatively affect their property values.
On the contrary, the more renter-occupied units there are in a neighborhood, the fewer reports of violation problems, because renters are less attached to the community and the properties in which they reside (Rose & Harris, 2022). Following the “datafication” argument and its representation of unobserved facts by proponents of the Wisdom of Crowds (e.g., Caithness, 2018; Cukier & Mayer-Schoenberger, 2013; Höchtl et al., 2016), we presume that the demand for water lockoff service is a reflection of certain “correlated effects” under the social equilibrium framework. Its association with code violations is depicted as the dash-dotted line in Fig. 4. We hypothesize that this factor is positively associated with the calls for code compliance service. In this study, the dependent variable (LgCode) is the natural log of total code violations, as the raw data are skewed. We adopt the natural log of water lockoffs (LgWLoff) for the same reason. The other independent variables are the number of family households (Family HH), median household income (MedHHInc), renter-occupied housing units (HHRentOcc), percent population aged 65 and over (PctPop65+), and percent adults without a high school diploma (PctLessHS).
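The log transformation of the skewed counts can be sketched as follows. The block-group totals below are hypothetical, and adding 1 before taking the log (to guard against zero-count block groups) is our illustrative assumption rather than a step stated in the study:

```python
import math
from statistics import mean, median

# Hypothetical block-group totals of code violations (right-skewed counts)
code_counts = [2, 5, 7, 12, 30, 85, 240]

# Natural-log transform as used for LgCode and LgWLoff; the +1 guards
# against zero-count block groups (an assumption for this sketch)
lg_code = [math.log(c + 1) for c in code_counts]
```

The transform pulls in the long right tail, so the mean and median of the transformed variable sit much closer together than those of the raw counts, which is the motivation for modeling LgCode rather than the raw totals.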

Fig. 4
figure 4

Hypotheses

5.2 Analyses results

We conducted a Spearman’s correlation analysis prior to modeling, as it is “preferable when outliers are present” (De Winter et al., 2016). The results indicate that the highest correlation coefficient among all pairs of variables in the study is about 0.73. The variance inflation factor (VIF) indicators show that the highest VIF score is 2.41, with an average VIF score of 1.89 for all independent variables. The results suggest there is no serious multicollinearity problem among the independent variables, as all scores are less than the threshold of 5 according to Menard (1995). Table 2 presents the descriptive statistics and multicollinearity analysis results.
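Spearman’s rho is the Pearson correlation computed on ranks rather than raw values, which is why it is robust to outliers. A self-contained sketch with tie-aware ranking:

```python
def ranks(xs):
    """Average ranks (1-based), assigning tied values their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                              # extend the run of ties
        avg = (i + j) / 2 + 1                   # mean of tied positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because only ranks enter the computation, any monotone relationship yields rho = 1 (or -1), regardless of how extreme the raw values are.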

Table 2 Descriptive statistics and multicollinearity analysis results

Using the Forest-Based Classification and Regression tool in ArcGIS, we test two models: one includes only the variables whose relationships with the dependent variable have been informed by the relevant literature; the other also leverages the water service data. To ensure model stability, we adjusted the number of runs and trees, and found that a combination of 10 runs and 150 trees for the training model produces the most reliable results in the model outputs. As indicated in Table 3, the R2 values of the Regression Diagnostics of the training data are quite similar in both models, and the ranking order of the top variable importance is also consistent in both models. However, leveraging the water lockoff data improves the model performance. For example, the percentage of variation explained, as indicated in the “out of bag errors” section, is about 53% with an MSE of 0.12 for the model leveraging the water lockoff data, compared to about 28% with an MSE of 0.19 for the model without it. The R2 values of the Regression Diagnostics of the validation data indicate that the model with the water lockoff variable predicts code violations in the validation dataset with an accuracy of about 64%, compared to about 52% for the one without, while the mean squared error is similar in both models, 0.13 and 0.122 respectively.
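The diagnostics above rest on two standard quantities, the mean squared error and the share of variance explained (R2). A minimal sketch of both, applicable to any vector of observed and predicted LgCode values:

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    n = len(y_true)
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n

def r_squared(y_true, y_pred):
    """Share of variance explained: 1 - SSE/SST."""
    m = sum(y_true) / len(y_true)
    sse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    sst = sum((a - m) ** 2 for a in y_true)
    return 1 - sse / sst
```

A perfect prediction gives R2 = 1 and MSE = 0, while always predicting the mean gives R2 = 0; the roughly 0.64 validation R2 for the water-lockoff model thus means it explains about 64% of the variance that the mean-only baseline leaves unexplained.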

Table 3 Model results

The model outputs also report the importance score and percentage for the independent variables. The importance score “is calculated using GINI coefficients” (ESRI, 2020). The score for each explanatory variable is the sum of the GINI coefficients from all the trees for that particular variable, and the percentage importance is the proportion of the importance score for that variable over the total score for all the explanatory variables in a model (ESRI, 2020). The results of model 1 indicate that the renter-occupied variable is the most important in predicting the demand for code compliance service, with a score of 10.7, which accounts for about 25% of the total score in the model. The age variable score is about 9.4, accounting for about 22% of the total score. The scores for the family household, education, and median household income variables are within the 7–8 range, each accounting for about 17~18% of the total score. Adding the water lockoff variable does not change the order of these variables. However, the results of model 2 indicate that the demand for water lockoff service is the most important variable in predicting the demand for code compliance service, with a score of 18.2, which accounts for about 41% of the total score in the model. The model results also reveal that the importance of the renter-occupied housing, age, and family household variables accounts for about 18%, 15%, and 11% of the total score respectively. On the other hand, the importance of the education and median household income variables is about the same, each accounting for about 8% of the total score.
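The conversion from importance scores to percentages described above is a straightforward normalization. The scores below are illustrative reconstructions chosen to echo the reported magnitudes for model 2, not the actual model output:

```python
# Illustrative importance scores approximating those reported for model 2;
# only LgWLoff (18.2) is taken from the text, the rest are back-calculated
# from the reported percentages and are therefore approximate
scores = {
    "LgWLoff": 18.2, "HHRentOcc": 8.0, "PctPop65+": 6.7,
    "Family HH": 4.9, "PctLessHS": 3.6, "MedHHInc": 3.5,
}

# Percentage importance: each score as a share of the total across variables
total = sum(scores.values())
pct = {k: 100 * v / total for k, v in scores.items()}
```

With these magnitudes, LgWLoff accounts for roughly 41% of the total, matching the dominance of the water lockoff variable reported for model 2.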

6 Discussion

6.1 The debate on “Wisdom of Crowds”

There has been a debate over the role of the Wisdom of Crowds derived from big data and its applications. Contrary to the arguments for the Wisdom of Crowds, the traditional model of science holds that theories function in three ways: they prevent the risk of supporting a spurious fluke, help make sense of observed patterns, and shape and direct research efforts (Babbie, 2004). In traditional research, models must be built upon solid theories rather than simple observation of trends, patterns, or correlations. Models based on theories about causation can help predict what would happen in the future or elsewhere.

While others caution against the approach of abandoning theories and/or causation for correlation, some recognize the opportunities of ICTs in social research and call for leveraging data, big or small, to advance research. For example, citing works of Cukier (2010), Boyd and Crawford (2012), and many others, Kitchin (2014) recognized that “big data and new data analytics are disruptive innovations” that challenge and reshape the traditional ways of research in many fields. However, he also pointed out that a purely empiricist approach without reasoning, while attractive, would be risky because such an approach “is based on fallacious thinking” in terms of issues related to data representativeness, scientific reasoning, the justification of the analytics and algorithms being used, and the interpretation of results with knowledge. He therefore called for a data-driven science paradigm, a hybrid approach that both holds to the principles of the scientific method and is more open to new data analytical innovations “to advance the understanding of a phenomenon”. Following similar thinking and arguments, Kitchin and Lauriault (2015) discussed the utilities and futilities of small and big data and data analytics, and saw opportunities for combining the two to leverage data for research advancement. They argued that “small data will increasingly be made more big-datalike through the development of new data infrastructures that pool, scale and link small data in order to create larger datasets, encourage sharing and reuse, and open them up to combination with big data and analysis using big data analytics”, and that “the potential to link data across domains is high” (Kitchin & Lauriault, 2015). These arguments have been echoed and further refined by more recent studies (see, e.g., Athey, 2017; Kettl, 2016, 2018; Lemire & Petersson, 2017; Vydra & Klievink, 2019).

6.2 The role of Wisdom of Crowds in city service planning

Our research intends to explore the role of the Wisdom of Crowds in data analytics and its application to the prediction of code violations in city service planning. It follows a hybrid paradigm combining the Wisdom of Crowds and the traditional model of science. The study demonstrates the feasibility of the data-driven approach to unearth the temporal and spatial correlations between demands for code violation compliance and water lockoff services. The results indicate that leveraging the water lockoff data improves the explanatory power for the prediction of code violations. Our study suggests that the Wisdom of Crowds can play a significant role in discovering social phenomena that are unknown or less understood, and that it can be a good tool for inductive research.

The importance of data driven paradigm for research and practices in supporting sustainable cities has been demonstrated in some studies. For example, Mercader-Moyano et al. (2021) adopted an approach combining technical inspection performed by professional and street knowledge via participatory social survey from residents to diagnose and quantify the vulnerability of existing neighborhoods. Gandini et al. (2021) also involved multi-stakeholders in risk assessment and stressed that “multi-stakeholder context influences the definition of models”, data collection, decision-making, and implementation actions. This study adds an empirical application to the literature. It also contributes to the debate about “Wisdom of Crowds” and discussion on enhancement of data value.

The use of knowledge generated from data is the basis for administrative activities (Cortada, 2018). Information technology can affect the operational efficiency and organizational structure of public administration, and beyond (Cook, 2018). This research not only demonstrates the possibility of leveraging data enabled by contemporary ICTs through the data-driven paradigm, but also points to the potential for collaboration between city service departments and more efficient use of resources. In this particular example, staff members in both the code compliance and water departments have to be in the field to provide services. However, these departments often manage their operations and resources independently (i.e., decentralized management). Given limited city service resources (e.g., staff and budget), such a decentralized management strategy may reduce operational efficiency and effectiveness and increase costs. In contrast, a centralized management strategy holds great promise for improving coordination between departments and enabling shared decision-making in the planning and management of city services. As shown in the results of our spatial analysis, the demand for code compliance service resembles the demand for water lockoff service in terms of location. The temporal analysis also yields a similar pattern, with a lag in time for water service demand, which could make staff workloads for the two services complementary. The findings suggest that if coordination between the two departments can be established, there is potential for reducing duplicated trips, improving service response times, and lowering the cost of service provision.
For example, while staff members issue notices for code violations in a neighborhood, they could assist with the water lockoff service at the same time; given the demand and pattern of both services informed by our analysis, the number of staff members to assign to a specific neighborhood can be further investigated. While it is beyond the scope of this study, further analysis could examine options for collaboration and evaluate the extent of the benefits using techniques under the broad framework of “what if” analysis, also known as “sensitivity analysis”, “decision support”, “optimization”, or “artificial intelligence” tools. Examples of such applications in smart city or public policy/public management contexts can be found in Urbieta et al. (2017) and Jamei et al. (2017), to name a few.
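The lagged temporal relationship between the two demand series can be checked with a simple cross-correlation over monthly counts. The sketch below uses synthetic stand-in series (not the City of Arlington data) with a built-in two-month lag, and recovers that lag by scanning correlations over a window of offsets; the function name and the two-month lag are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
months = 120

# Synthetic monthly demand series: lockoffs trail violations by 2 months.
violations = rng.poisson(30, months).astype(float)
lockoffs = np.roll(violations, 2) + rng.normal(0, 2, months)

def lagged_corr(x, y, lag):
    """Pearson correlation of x[t] with y[t + lag]."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return np.corrcoef(x, y)[0, 1]

# Scan a window of offsets; the peak should sit at the built-in 2-month lag.
corrs = {lag: lagged_corr(violations, lockoffs, lag) for lag in range(-6, 7)}
best = max(corrs, key=corrs.get)
print("lag with peak correlation:", best)
```

On real monthly service counts, the lag at which the correlation peaks would indicate how far ahead one service’s demand foreshadows the other’s, which is the quantity relevant to coordinating staff workloads.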

6.3 Machine learning in urban studies

This research joins a limited number of studies in exploring machine learning techniques for urban studies. While not new in science and engineering, the use of spatial machine learning in urban studies is still in its infancy. Some studies have adopted this approach in recent years. For example, Helderop et al. (2019) used it to identify the spatial locations of prostitution activities in Phoenix, Arizona. Rummens and Hardyns (2020) utilized it to make spatiotemporal predictions of crime. Rose and Dolega (2021) examined the linkage between weather conditions and retail sales. Together, these studies offer new and efficient ways to pursue social science inquiries. The existing literature and our study provide insightful information for city planning, management, and policy making to better position a city’s economic competitive advantage and to increase the safety and sustainability of cities.

7 Summary and future studies

This research adopts a hybrid paradigm, combining the Wisdom of Crowds with the traditional model of science. Utilizing spatial analysis and data mining tools, the study reveals the temporal and spatial patterns of service data generated by the City of Arlington. The Wisdom of Crowds, along with “street knowledge” from those who know the context of neighborhoods, provides insights into forming a theory for the prediction of code violations. Under the “Social Equilibrium” framework, the study explores the possibility of linking the service data with census data and demonstrates that leveraging water data improves the explanatory power of the predictive models for future code compliance service demand. Leveraging service data for predictive modeling via the data-driven paradigm is a major departure from the traditional model of science. The Forest-Based Classification and Regression tool enables a novel application of machine learning for urban planning and city administration.

While this study has demonstrated the potential of leveraging data for proactive planning and management of city services, there is still much to explore. Although the knowledge generated from data mining has led to the initial identification of a theory for code violation prediction, the “correlated effects” in this specific context need further clarification. The reasons behind the importance of water lockoffs in this particular application require further investigation. In addition, the findings of this study are based on a case study of one city; the validity of leveraging water service data for the prediction of code violations requires further investigation and demonstration. Moreover, further studies are needed to investigate the feasibility of, and potential opportunities for, organizational collaboration. As some scholars have pointed out, big data may not guarantee accuracy and could provide misleading information for decision making. Researchers must recognize the importance of theory in the application of big data (Adolf & Stehr, 2018), and be mindful in research design to make the best use of big data (Lavertu, 2016; Zook, 2017). Nevertheless, this study contributes empirical evidence to the current debate on the “Wisdom of Crowds” and to research on the data-driven approach for smart cities. By integrating scientific reasoning with “street knowledge” and using spatial data analysis techniques and the Forest-Based Classification and Regression tool, the study offers an example of a data-driven approach to leveraging data from various sources to improve prediction of future service demand. It generates new insights for the ongoing discussion on the value and use of big data, and points to challenges and opportunities for further research on scenario planning and resource allocation.