An urban infrastructure assessment system built on geo-tagged images and machine learning

In the modern era, the maintenance of public infrastructure often takes up a large share of a city's financial budget. The management of these urban assets is supported by a frequently updated inventory reflecting facility conditions. Traditional methods relying on inspection staff or sensors face two main challenges: comprehensive and standardized data collection, and a quick and automatic assessment process. In this technical note, we introduce a unified method for condition assessment based purely on street views and machine learning, which develops perception quantification models from pairwise labeling datasets. In this way, the two problems could be solved with automatic and scalable processes, updatable algorithms, and affordable costs. The method has been tested in the city of Ulaanbaatar, where a benchmark covering the assessment of eight types of urban infrastructure (roadway, road curbs, road markings, road signs, sidewalks, catch basins, guardrails, and manholes) is demonstrated.


Introduction
The past urbanization process has witnessed large-scale investment in city construction, especially in public infrastructure (Wang, L. Zhang, Z. Zhang, & Zhao, 2011; Baatarzorig, Satoshi, Kajita, Oeda, & Matsunaga, 2014). As urbanization enters a new stage, urban renewal and the upgrading of urban assets, rather than new construction, become important issues for urban management. In this context, public infrastructure constitutes the largest asset of a city and requires regular assessment and maintenance. A comprehensive inventory and management of this asset will help improve the efficiency of urban renewal and cut down the cost of rebuilding it.

Challenges of assets condition assessment
Infrastructure asset management with traditional methods faces several challenges. Public facilities are widely distributed, large in quantity, and mixed in type, so comprehensive assessments are too costly with traditional methods. The detection process involves a large number of piecemeal objects and requires high accuracy. Careful inspections are required to distinguish, for example, damaged catch basins, worn road markings, and incomplete road signs from facilities in good condition, and follow-up work may also need to rate the degree of damage. As these facilities are used every day, their condition should be updated with high frequency, yet real-time monitoring is hard to implement. Obstacles or damage may appear at any time on infrastructure such as guardrails, road surfaces, and sidewalks, which could cause severe accidents or inconvenience. High-frequency inspections may be needed to avoid those problems.

Key problems for developing an extendable assessment system
To establish an extendable system for large-scale facility assessment, the main obstacles lie in two aspects: data collection and condition assessment (Sitanyiova & Mužík, 2013).

Collecting comprehensive and standardized data costs too much
First, an inventory of urban facilities is needed to keep up with their changing condition and to cover different facility types and locations. Data recorded during traditional inspection and maintenance processes are difficult to standardize across different facilities and different cities.
To establish an extendable system, the data collection process is expected to be lightweight, low cost, and efficient, acquiring a large amount of information about various facilities in a scalable way.

Quick and automatic assessment is hard to realize
At the same time, assessment methods are expected to process these large datasets quickly and, ideally, automatically. Evaluation by maintenance personnel alone cannot reach this goal. However, the automatic and standardized method should be extendable so that it can incorporate maintenance experience and keep up with new conditions and new facility types (Wei, Du, Mahesar, Ammari, Magee, Clarke, & Cohn, 2020).

New possibilities in geo-tagged images and machine learning technology
With the development of mobile devices, location-based services (LBS) provide a vast volume of geo-tagged data. Image data depict the appearance of the surrounding physical environment, human activities, and the interaction between them at a reasonable granularity. Massive image datasets such as Google Street View, social media data, and satellite images provide a fresh dimension to urban studies. Machine learning is the field of computer science that uses statistical techniques to enable computers to make data-driven decisions and progressively improve over time without being explicitly programmed. After training on human perception data, this technology could help extract information from images (Alfarrarjeh, Trivedi, Kim, & Shahabi, 2018).

Framework and approaches
With geo-tagged images as data sources and deep learning models as tools, a system could be developed to extract information from those images and to develop perception quantification models. In this way, the two problems mentioned in 1.2, data collection and asset assessment, could be solved with automatic and scalable processes, updatable algorithms, and affordable costs (Fig. 1).
The framework includes identifying facilities from images, developing deep learning models to quantify perception, and aggregating the process into an auto-assessment tool for customized data and different areas (Fig. 2). The process can be used for visual quality classification of diverse objects and can be extended by upgrading it with new input data and by adding new types of evaluation objects.

Identification of urban facilities from geo-tagged images
To evaluate the condition of different infrastructures, the first step is to acquire data on facilities and to identify their types. Geo-tagged images are one data source that meets this goal, and object detection and segmentation with computer vision could help extract information from those images.

Geo-tagged images collected by map service providers and from other sources
Geo-tagged images with streetscapes and urban facilities can be found in User Generated Content (UGC) or collected by map service providers. Typical examples of geo-tagged images driven by UGC are photos with location markings uploaded to social media by users worldwide. Map service providers provide street view images shot by professional equipment or trained collectors. The assessment process uses street view images as its main data source, as they come in a standard size and cover relatively wide areas. In a future version, UGC images can be added as a supplementary data source to provide more up-to-date and detailed information. The condition of public infrastructure, for example transportation assets, could be observed from those images, and the geo-tag links the condition reflected by the images to real geographic locations (Zamir & Shah, 2010; Baek, Ha, & Kim, 2019).

Object detection and segmentation with computer vision
Computer vision refers to the technology that trains computers to interpret images and videos, including identifying and classifying different objects (Neuhold, Ollmann, Rota Bulo, & Kontschieder, 2017; Krylov, Kenny, & Dahyot, 2018). Models of infrastructure in streetscapes have been developed based on large quantities of training data: images in which objects are labeled with their boundaries and types (Fig. 3). From geo-tagged streetscape images, various public facilities could be identified and anchored to certain locations on streets (Fig. 4). Condition assessment would be conducted using those images, and the results could be linked to the type and location extracted from them. Meanwhile, various data visualization methods can contribute to presenting those conditions in a more intuitive and spatial way.

Perception quantification with deep learning models
After identifying the evaluation targets, a series of perception quantification models could realize the auto-evaluation process by rating the targets into different condition levels. The models draw on perception data collected from the public and maintenance staff, digitizing their experience in evaluating the condition of a given facility.

Perception data collection from binary-choice surveys with images
The training data of the models are collected from surveys answered by participants acting as "evaluators", who could be professionals or members of the public (Zhou, Liu, Oliva, & Torralba, 2014). The surveys are organized as easy-to-answer binary choice questions developed from two-alternative forced-choice (2AFC) tasks (Lapid, Ulrich, & Rammsayer, 2008) and use images instead of words as content. Images with the same type of facility are shown in pairs, and participants are asked to select the one in better condition or to choose "hard to tell" (Fig. 5). The survey could be conducted through websites or cell phones to reach participants everywhere. The simple binary questions with image choices reduce the difficulty and time needed to understand and answer them. These characteristics make the survey easy to spread and allow a large amount of data to be acquired quickly. In one comparison, each picture gets one of three results: "better" (selected), "worse" (the other picture in the pair is selected), or "hard to tell" (neither picture is selected). When a certain facility or asset in each image has been compared with others enough times, its relative condition can be expressed as a score.
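The tallying of comparison results described above can be sketched in Python. This is a minimal illustration rather than the project's code: the vote record format (two image identifiers plus a choice of "left", "right", or "tie") and the function name `tally_votes` are assumptions made for this example.

```python
from collections import defaultdict

def tally_votes(votes):
    """Count, for every image, how often it was judged better, worse, or tied.

    `votes` is an iterable of (left_id, right_id, choice) tuples, where
    `choice` is "left", "right", or "tie" ("hard to tell"). This record
    format is an assumption made for illustration.
    """
    counts = defaultdict(lambda: {"better": 0, "worse": 0, "tie": 0})
    for left, right, choice in votes:
        if choice == "left":
            counts[left]["better"] += 1
            counts[right]["worse"] += 1
        elif choice == "right":
            counts[right]["better"] += 1
            counts[left]["worse"] += 1
        elif choice == "tie":
            counts[left]["tie"] += 1
            counts[right]["tie"] += 1
        else:
            raise ValueError(f"unknown choice: {choice!r}")
    return dict(counts)

# Three hypothetical comparisons among images "a", "b", and "c".
votes = [("a", "b", "left"), ("a", "c", "tie"), ("b", "c", "right")]
counts = tally_votes(votes)
```

Each image ends up with three counters, which correspond to the positive, negative, and equal click times used in the score calculation.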
The scores can be calculated in two steps.
Step 1. The positive rate ($P_i$) and the negative rate ($N_i$) of image $i$ are calculated as the ratio of the positive or negative click times to the total comparing times:
$$P_i = \frac{p_i}{p_i + n_i + e_i}, \qquad N_i = \frac{n_i}{p_i + n_i + e_i}$$
where $p_i$ denotes the positive click times (when "better" is chosen), $n_i$ denotes the negative click times (when "worse" is chosen), and $e_i$ denotes the equal click times (when "hard to tell" is chosen).
Step 2. The Q-score for image $i$ is defined as the positive rate $P_i$ corrected by the positive rate ($P_i'$) and negative rate ($N_i'$) of the images that it was compared with. Here $k_1$ denotes the number of times that image $i$ is selected, and $k_2$ denotes the number of times that image $i$ is not selected.
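As an illustration of the two steps, the sketch below computes Q-scores from raw votes. It assumes that the Step 2 correction takes the form commonly used for pairwise perception scores, $Q_i = \frac{10}{3}\left(P_i + \frac{1}{k_1}\sum_{j=1}^{k_1} P_j' - \frac{1}{k_2}\sum_{j=1}^{k_2} N_j' + 1\right)$, which maps scores onto a 0 to 10 range. This form, the 10/3 scaling, the reading of "not selected" as "the other image was selected", the vote record format, and the name `q_scores` are all assumptions for illustration and may differ from the project's implementation.

```python
from collections import defaultdict

def q_scores(votes):
    """Compute a Q-score per image from (left_id, right_id, choice) votes.

    `choice` is "left", "right", or "tie". The Step 2 formula used here is
    an assumed form (see the lead-in), not a confirmed project formula.
    """
    p = defaultdict(int)          # positive click times
    n = defaultdict(int)          # negative click times
    e = defaultdict(int)          # equal ("hard to tell") click times
    beaten = defaultdict(list)    # images that image i was selected over
    lost_to = defaultdict(list)   # images that were selected over image i

    for left, right, choice in votes:
        if choice == "tie":
            e[left] += 1
            e[right] += 1
            continue
        if choice not in ("left", "right"):
            raise ValueError(f"unknown choice: {choice!r}")
        winner, loser = (left, right) if choice == "left" else (right, left)
        p[winner] += 1
        n[loser] += 1
        beaten[winner].append(loser)
        lost_to[loser].append(winner)

    images = set(p) | set(n) | set(e)
    total = {i: p[i] + n[i] + e[i] for i in images}
    pos_rate = {i: p[i] / total[i] for i in images}   # Step 1: P_i
    neg_rate = {i: n[i] / total[i] for i in images}   # Step 1: N_i

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    # Step 2 (assumed form): correct P_i by the rates of the opponents.
    return {
        i: (10 / 3) * (
            pos_rate[i]
            + mean([pos_rate[j] for j in beaten[i]])
            - mean([neg_rate[j] for j in lost_to[i]])
            + 1
        )
        for i in images
    }

# Hypothetical votes: "a" beats "b" twice, "b" beats "c", "a" ties with "c".
scores = q_scores([
    ("a", "b", "left"),
    ("a", "b", "left"),
    ("b", "c", "left"),
    ("a", "c", "tie"),
])
```

Under this assumed form, an image that always wins against strong opponents approaches 10, and one that always loses to weak opponents approaches 0.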

Modeling the public perception of urban infrastructure
The labeled images would be used as training datasets. According to their Q-scores, a 'classification-then-regression' process would be conducted (Zhang, Zhou, L. Liu, Y. Liu, Fung, Lin, & Ratti, 2018). Deep learning models would be built to extract deep features from the labeled images of transportation assets in different conditions.
For each type of facility, a model would be trained separately to maintain accuracy, since different objects express their condition in different ways. Figure 6 shows an example for the manhole model. Given an input image in which manholes have been identified and marked, the model outputs a score representing their condition. The areas judged to be the main contributors to this score are also highlighted in the output image; ideally, those areas are the damaged parts of the manholes.

Extendable assessment system of urban facilities
With geo-tagged images and auto-evaluation models, the system could be built into an extendable system that lets users choose facility types, upload customized image files, and manage assessment results. The system can be used in different cities with universal models or with customized models upgraded with local datasets. The main pages and operations of a basic version of the system are shown below. The first step is to choose the types of transportation assets to evaluate (Fig. 7). Users could then upload images or compressed files as evaluation objects (Fig. 8). The uploaded files are processed at the back end; once the task is finished, downloadable data are provided, and users could choose to receive the outcome data by email or other methods (Fig. 9).

An assessment case in Ulaanbaatar
The system has been used in a transportation assessment project in collaboration with the World Bank team. In the project, 340,000 streetscape images of Ulaanbaatar have been used, and 8 types of facilities in the city have been identified and evaluated. The project consists of 6 stages: segmenting urban facilities from images, pairwise labeling on interactive websites, data cleaning and score calculation, model parameter verification, large-scale assessment with geo-tagged images of Ulaanbaatar, and visualizing the assessment results.

Segmenting urban facilities from images
The World Bank Transport Global Practice team (WB team) and the Road Development Department (RDD) of the Capital City of Ulaanbaatar (UB) collected 340,000 street-view images covering 1097 km of road network in 2018 to build an inventory of the transportation infrastructure of Ulaanbaatar. Annotation data obtained from Mapillary (https://www.mapillary.com) also contributed training sets to the segmentation process. Mapillary provides a dataset of 25,000 street-level images for semantic segmentation of 66 object classes in urban, countryside, and off-road scenes, and the data are available to anyone under a CC-BY-SA license agreement (Neuhold, Ollmann, Rota Bulo, & Kontschieder, 2017). The WB team and the Mapillary team completed the facility identification from those geo-tagged images with the computer vision technology mentioned in 2.1 before the modeling process described below. The DeepLabV3+ model (https://github.com/tensorflow/models/tree/master/research/deeplab) trained on the Mapillary Vistas dataset was used to identify facilities (Chen, Zhu, Papandreou, Schroff, & Adam, 2018). Its mean pixel accuracy was 85.32% and its mean IoU was 25.18%. The identified transportation infrastructure includes roadways, road curbs, road markings, road signs, sidewalks, catch basins, guardrails, and manholes, which are marked as segments. Figures 10-16 are examples of identified images for several facility types.
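The gap between a high pixel accuracy and a much lower mean IoU is typical when a few large classes dominate the pixels. The sketch below shows how both metrics can be derived from a confusion matrix. The exact metric variants used in the project are not stated, so the standard confusion-matrix definitions are assumed here, and the three-class matrix is invented purely for illustration.

```python
def pixel_accuracy(conf):
    """Share of all pixels whose predicted class matches the ground truth."""
    correct = sum(conf[c][c] for c in range(len(conf)))
    total = sum(sum(row) for row in conf)
    return correct / total

def mean_iou(conf):
    """Average over classes of TP / (TP + FP + FN)."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                        # class-c pixels missed
        fp = sum(conf[r][c] for r in range(n)) - tp   # pixels wrongly called c
        denom = tp + fp + fn
        if denom:                                     # skip absent classes
            ious.append(tp / denom)
    return sum(ious) / len(ious)

# Rows are ground truth, columns are predictions (invented toy numbers).
conf = [
    [900, 10, 0],   # large class, e.g. roadway
    [40, 10, 0],    # small class, e.g. manhole
    [35, 0, 5],     # small class, e.g. catch basin
]
acc = pixel_accuracy(conf)
miou = mean_iou(conf)
```

In this toy case the large class is segmented well and drives the pixel accuracy up, while the poorly segmented small classes pull the mean IoU down, because every class counts equally in the average.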
The numbers of images containing each facility type have been counted (Figs. 17 and 18). Given the actual street-view conditions and the data collection methods, images with roadways and road curbs are the most numerous. It is also reasonable that relatively small and uncommon facilities such as guardrails and manholes occur less frequently in the 340,000-image streetscape dataset.
The identified facilities appear along the road network and are distributed continuously in the central urban area. In suburban areas, the data are scattered along main roads and in other places. Since these facilities, especially sidewalks and road curbs, are more numerous and more varied in status in the central area than in suburban areas, the data distribution is largely consistent with the actual situation.

Pairwise labeling on interactive website
Websites with intuitive interfaces and simple operations have been designed as tools to collect labeling data (Fig. 19). The public, experts, and other calibration personnel from the WB team and RDD can participate in the process on their own equipment. The labeling process, with binary questions and image choices, is described in 2.2.1. Each type of facility has its own labeling questions and images.
The websites begin with guidance pages describing the project's background, along with graphic illustrations of "good condition" and "bad condition" to give users a reference. The question pages contain the question "Which image depicts a roadway in better condition?", two streetscape images with roadways, and a "hard to tell" button. Users answer the question by clicking the image with the roadway in better condition, or by clicking the button if it is hard to tell. Users could also change the evaluated facility type by clicking the blue word "roadway". For each type of transportation asset, the number of labels and the number of valid pictures are counted automatically to monitor the labeling progress. Figure 20 shows the labeling progress when the collection was almost finished. The "total" number represents the target, and the "collect" number shows the labeling data collected so far. The ratio shows the progress of the labeling and helps to monitor the current status and adjust the collection plan, for example by focusing on certain types or giving up others. In this project, road lighting poles were canceled as a type due to poor feedback.

Data cleaning and score calculation
The labeled images have been cleaned according to their content clarity and the number of times they were compared. A single picture should be compared at least 10 times to satisfy the requirements of the training process. Figure 21 shows the numbers of labeled and valid images for each facility under this "10 times" standard.
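The "10 times" rule can be expressed as a simple filter. The sketch below is illustrative only: the dictionary of comparison counts, the image identifiers, and the function name `filter_valid` are assumptions, and the clarity check, which is a visual judgment, is not modeled.

```python
MIN_COMPARISONS = 10  # threshold stated in the text

def filter_valid(comparison_counts, min_comparisons=MIN_COMPARISONS):
    """Keep only images compared at least `min_comparisons` times."""
    return {
        image_id: count
        for image_id, count in comparison_counts.items()
        if count >= min_comparisons
    }

# Hypothetical comparison counts per image.
comparison_counts = {"img_001": 14, "img_002": 9, "img_003": 10}
valid = filter_valid(comparison_counts)
```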
The Q-score of each image has been calculated according to the method described in 2.2.1. The histogram (Fig. 22) shows the distribution of the Q-scores of the labeled data for 4 types of assets as examples, with the x-axis showing the score of the picture and the y-axis showing the number of pictures. The data for guardrails and manholes present a typical normal distribution pattern. The data for roadways and catch basins present a pattern with two small peaks, which may be induced by the actual situation in Ulaanbaatar, the limited number of labels, and an inadequate number of histogram groups. How to group the scores adequately is discussed in the following parts.
With different numbers of histogram groups, the distribution of Q-scores shows different patterns. As an example, Fig. 23 shows the histograms of the catch basins' Q-scores when grouped into 3 to 10 groups. When scores are divided into fewer groups, more distribution features may be lost. When they are divided into more groups, each group contains a smaller sample and the features it represents are less typical. When determining the number of groups, it is often necessary to strike a balance between the pursuit of more accurate ratings (more groups) and the amount of data within each level (fewer groups). The number of groups influences the generated Q-scores (mentioned in 2.2.1, which reveal the condition of transport infrastructure assets) and the performance of the model, which is discussed in the next chapter.
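The trade-off between the number of groups and the sample size per group can be seen by binning the same scores in different ways. The sketch below assumes equal-width groups over a 0 to 10 score range; both the binning rule and the sample scores are assumptions made for illustration, since the grouping rule used in the project is not specified.

```python
def bin_scores(scores, n_groups, low=0.0, high=10.0):
    """Count scores per equal-width group over [low, high]."""
    width = (high - low) / n_groups
    counts = [0] * n_groups
    for s in scores:
        idx = int((s - low) / width)
        idx = max(0, min(idx, n_groups - 1))  # clamp the top edge into the last group
        counts[idx] += 1
    return counts

# Invented sample scores.
sample = [1.0, 2.5, 4.9, 5.0, 5.1, 7.5, 9.9, 10.0]
by_3 = bin_scores(sample, 3)
by_5 = bin_scores(sample, 5)
by_10 = bin_scores(sample, 10)
```

With 3 groups every group is populated, whereas with 10 groups several groups are empty, which mirrors the loss of typicality described above when small datasets are split too finely.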

Model parameter verification
According to previous studies (Zhang, Zhou, L. Liu, Y. Liu, Fung, Lin, & Ratti, 2018), DenseNet121 (https://github.com/liuzhuang13/DenseNet) was employed to build the evaluation models. The labeled pictures have been separated into two groups to train and test the algorithm. By convention, the ratio between the training dataset and the test dataset should be close to 8:2. The test dataset includes pictures and their Q-scores pre-calculated from labeling. Given the pictures of the test dataset as input, the models generate predicted Q-scores. If the predicted Q-score is close to the pre-calculated Q-score, the prediction is regarded as correct.
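The split and the correctness check can be sketched as follows. The text does not define "close", so this example assumes that a prediction counts as correct when the predicted and pre-calculated Q-scores fall into the same equal-width group on a 0 to 10 scale; that rule, the seed, and the function names are illustrative assumptions.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and split into (train, test) at roughly 8:2."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def to_group(score, n_groups, low=0.0, high=10.0):
    """Map a score to the index of its equal-width group."""
    width = (high - low) / n_groups
    return max(0, min(int((score - low) / width), n_groups - 1))

def group_accuracy(true_scores, predicted_scores, n_groups):
    """Share of predictions landing in the same group as the label."""
    hits = sum(
        to_group(t, n_groups) == to_group(p, n_groups)
        for t, p in zip(true_scores, predicted_scores)
    )
    return hits / len(true_scores)

train, test = train_test_split(range(100))
# Invented labels and predictions: the second prediction misses its group.
accuracy = group_accuracy([1.0, 5.5, 9.0], [1.8, 3.9, 8.1], n_groups=5)
```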
To select the models with the best performance, parameter verification has been conducted for the model of each facility type. Two parameters have been tested: model type (models built with different hyper-parameters) and the number of Q-score groups. To balance performance and time cost, commonly used hyper-parameter values and Q-score group numbers were chosen for further testing (Table 1). As an example, Table 2 shows the accuracy of the 5 types of deep learning models built on DenseNet121 with different hyper-parameters to assess roadway condition. Comparing the accuracy in Table 2, model 5 showed the best performance (68.0%) and has been chosen as the optimal model for roadway condition assessment.
Zhang et al. Computational Urban Science (2022) 2:30
As mentioned in 3.3, the accuracy of the algorithm is affected by the number of groups into which the dataset is divided. Table 3 shows the accuracy of the roadway condition scoring algorithm for different group numbers and top-k settings (k refers to the rank of the performance of each group, and top-1 refers to the group with the best performance), together with its improvement over the probability of a random selection. Table 3 shows that higher accuracy is achieved with fewer groups, while the improvement over the random-selection probability gets larger with more groups. Taking the balance between the group number/top-k selection and the accuracy into consideration, a group number of 5 has been chosen. Catch basins, guardrails, and manholes were divided into 3 groups, considering the small number of their valid data.
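The comparison against random selection can be made concrete with a small sketch. It reads top-k in its standard sense (a prediction is correct when the true group is among the k groups the model ranks highest) and takes k divided by the number of groups as the random baseline; both readings are assumptions, and the probabilities below are invented.

```python
def top_k_accuracy(probabilities, true_groups, k):
    """Share of samples whose true group is among the k most probable groups."""
    hits = 0
    for probs, truth in zip(probabilities, true_groups):
        ranked = sorted(range(len(probs)), key=lambda g: probs[g], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_groups)

def improvement_over_random(accuracy, n_groups, k):
    """Accuracy minus the chance of a random top-k guess being right."""
    return accuracy - k / n_groups

# Invented model outputs for 4 images and 5 groups.
probabilities = [
    [0.60, 0.30, 0.10, 0.00, 0.00],
    [0.10, 0.20, 0.30, 0.25, 0.15],
    [0.05, 0.05, 0.10, 0.30, 0.50],
    [0.40, 0.35, 0.10, 0.10, 0.05],
]
true_groups = [0, 3, 4, 1]
top1 = top_k_accuracy(probabilities, true_groups, k=1)
top2 = top_k_accuracy(probabilities, true_groups, k=2)
gain1 = improvement_over_random(top1, n_groups=5, k=1)
```

Fewer groups raise the random baseline, so the same raw accuracy represents a smaller gain, which is why accuracy and improvement over random are reported together.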

For the other facility types, a similar process was conducted to find reasonable models. Table 4 shows the performance of the final algorithms.
The accuracy of the roadway model has reached 68.0% owing to the large number of labeled pictures and collected labels. The accuracy was not raised further, mainly to preserve the generalization of the algorithm so that a single algorithm can be applied to different types of roadways.
The manhole and catch basin models show a large gap in performance (64% and 56%) despite their similar numbers of labeled pictures and collected labels. A major reason may be that manholes are easier to compare and show more variance in their condition. In contrast, more mistakes have been found in the detection of catch basins, and their images are often fuzzy and similar, leading to a relatively lower accuracy of the evaluation model.
The accuracy for guardrail assessment has reached only 50.1%, mainly because there are fewer labeled pictures and labels than for other facilities.

Large-scale assessment with geo-tagged images in Ulaanbaatar
The best-performing models have been used for large-scale evaluation of the condition of the 8 types of transportation facilities in Ulaanbaatar. In this chapter, catch basins, guardrails, manholes, and roadways are used as examples to show the auto-assessment results. The score distributions in Fig. 24 show different patterns for these 4 types of transport infrastructure assets. For catch basins and guardrails, the condition scores follow a normal distribution, indicating less difference in their condition and more difficulty in differentiating them further. Different statistical indicators have also been calculated to check the general condition of the assets (Table 5).
At the same time, several peaks could be identified in the score patterns of some facilities, such as manholes and roadways, meaning that major differences in those assets' conditions may exist. The roadway images with scores in the 3 typical score peaks are shown in Figs. 25, 26 and 27. According to the auto-assessment scores, facilities have been categorized into 5 classes: "Need Urgent Care", "Need Protection", "Avoid Further Damage", "Maintain Status Quo", and "Excellent Condition". These grades are easier to understand and can help with developing specific maintenance strategies.
• 'Need Urgent Care' means that the facilities are in extremely poor condition and actions should be taken immediately.
• 'Need Protection' class refers to the asset in poor condition and should be improved in time.
• 'Avoid Further Damage' class consists of the largest number of pictures, which means the current condition of the asset is acceptable and more damage should be avoided.
• 'Maintain Status Quo' refers to the asset in relatively good condition and should be maintained.
• 'Excellent Condition' means the asset is well maintained and is in excellent condition.
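The mapping from an auto-assessment score to one of the 5 classes can be sketched as a lookup. The score thresholds are not given in the text, so this example assumes 5 equal-width bands on a 0 to 10 scale; the thresholds and the function name `condition_class` are hypothetical.

```python
CLASSES = [
    "Need Urgent Care",
    "Need Protection",
    "Avoid Further Damage",
    "Maintain Status Quo",
    "Excellent Condition",
]

def condition_class(score, low=0.0, high=10.0):
    """Return the class for a score, assuming equal-width bands (hypothetical)."""
    width = (high - low) / len(CLASSES)
    idx = int((score - low) / width)
    idx = max(0, min(idx, len(CLASSES) - 1))  # keep edge scores in range
    return CLASSES[idx]

labels = [condition_class(s) for s in (0.5, 3.9, 5.0, 10.0)]
```

In practice the band edges could instead be chosen from the observed score peaks, which would change only the threshold logic and not the class names.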

The assessment results and visualizing
The assessment results in Table 6 and Fig. 28 reveal that the classification results of different facilities share some common patterns but vary in detail. The two extreme classes, "Need Urgent Care" and "Excellent Condition", account for small shares (less than 20%) across all facility types, which is consistent with common sense. Catch basins, manholes, and road markings are in relatively worse condition, with around 50%-70% needing protection or urgent care. Guardrails and road curbs need less care, with over 70% rated at the median level or better. If a unit cost is given, a rough maintenance investment and work plan could be estimated from the results in Table 6.
The evaluation results in Fig. 28 give a rough picture of the status of transportation assets in Ulaanbaatar. The maintenance condition of the different facilities falls into 3 groups. The best group, including road curbs, road signs, and guardrails, consists of facilities for protection or instruction; further maintenance work on them could focus only on the small portions in bad condition. The second group includes roadways and sidewalks, which are both large in quantity and directly bear the weight of cars and pedestrians. Almost one-third of them require protection or urgent care, which may make up the primary part of the whole maintenance work. The worst group is composed of manholes, catch basins, and road markings. They are all important functional parts of roadways and are easily abraded or damaged by vehicles. The evaluation reveals their bad condition, and mapping those small facilities becomes more meaningful for enhancing maintenance efficiency.
The results of the evaluation have been visualized on a map (Fig. 29): the overall distribution can be observed, and specific locations can be identified by zooming in on the map and clicking a data point.
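The rough investment estimate mentioned above, which combines class counts with a unit cost, amounts to a weighted sum. The sketch below uses placeholder counts and unit costs that are invented for illustration; they are not taken from Table 6, and the function name `estimate_budget` is hypothetical.

```python
def estimate_budget(class_counts, unit_costs):
    """Sum count * unit cost over the condition classes that have a cost."""
    return sum(
        class_counts.get(name, 0) * cost for name, cost in unit_costs.items()
    )

# Placeholder numbers, invented for illustration only.
manhole_counts = {
    "Need Urgent Care": 120,
    "Need Protection": 450,
    "Avoid Further Damage": 300,
}
unit_costs = {"Need Urgent Care": 200.0, "Need Protection": 80.0}
budget = estimate_budget(manhole_counts, unit_costs)
```

Classes without a unit cost, such as those needing no intervention, simply contribute nothing to the total.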
For roadways, the condition levels are closely related to road grades. The pattern is quite obvious: tertiary roads and paths inside blocks need more attention and maintenance. Different types of transportation assets can be chosen, and in addition to asset conditions, operation suggestions and road conditions are shown on the map. When a spot on the map is clicked, the corresponding data for that place are shown as well. Users could easily get an overall understanding of the city and check the conditions of certain places.
Furthermore, an open platform has been built for customized image data assessment. Users can upload images or zip files to the system and download the condition results after the assessment is finished. A progress bar indicates the progress of the download process (Fig. 30). The evaluation results are displayed on the outcome page, including the score, the maintenance rating, and a response JSON showing the possibly damaged areas.

Discussion and outlook
The case in Ulaanbaatar shows the application of the urban infrastructure assessment system. The results reveal that catch basins, manholes, and road markings need more maintenance work, while guardrails and road curbs are in relatively better condition. The location and corresponding quantified information could contribute to spatial statistical analysis for regional investment estimation and sequential maintenance work arrangement.
With regular data collection and continuously upgraded models, the system could give better performance in different application scenarios:
• Implementation of Refined Urban Management. The system and interface can be further developed for local governments to monitor and manage their urban transportation infrastructure assets with detailed and updated information on asset conditions.
• Rapid Troubleshooting of Public Facilities. The detection accuracy can be improved to help mitigate certain risks related to problems and conditions of transport assets, such as flooding and road safety.
• Building Comprehensive Urban Spatial Database. More types of assets can be added as objects of detection, and the results can be used to enrich the urban spatial database. The database could be overlaid with other urban data for cross-analysis, and even be used in projects like digital twin cities or smart cities.
• Humanized assessment of urban space quality. The scope of the public perception can be weighted to integrate the evaluation of asset quality and human spatial perception.
The proposed system provides a quantitative evaluation method for urban infrastructure assessment. With geo-tagged photos and deep learning models, data collection and automatic assessment could be quick and affordable. For different facility types, different models with unique parameters are built to enhance the evaluation accuracy for each type. The system could also be extended to more facility types and upgraded with more pairwise labeling. The labeling tools and assessment tools are user-friendly, with clear operation steps and sample images, and the visualized data could help with spatial analysis and data demonstration. There is still room to improve the process. In terms of data collection, geo-tagged images may be supplemented by customized collection or UGC images to enrich the datasets for some hard-to-find facilities, such as guardrails in this article. Street view images shot by map service providers are often updated only every 1-2 years, so assessment results based on them can only present a past status. Regular image collection by maintenance workers or patrols can provide up-to-date information. In terms of perception modeling, more weight may be given to labeling data from experts. For datasets with an uneven condition distribution, small-sample models may be needed to filter the raw dataset and rebuild one of better quality.