Intelligent IoT Systems for Civil Infrastructure Monitoring: A Research Roadmap

This paper addresses the problem of efficient and effective data collection and analytics for applications such as civil infrastructure monitoring and emergency management. This problem requires the development of techniques by which data acquisition devices, such as IoT devices, can: (a) perform local analysis of collected data; and (b) based on the results of such analysis, autonomously decide on further data acquisition. The ability to perform local analysis is critical in order to reduce transmission costs and latency, as the results of an analysis are usually smaller in size than the original data. As an example, in case of strict real-time requirements, the analysis results can be transmitted in real time, whereas the actual collected data can be uploaded later on. The ability to autonomously decide about further data acquisition enhances scalability and reduces the need for real-time human involvement in data acquisition processes, especially in contexts with critical real-time requirements. The paper focuses on deep neural networks and discusses techniques for supporting transfer learning and pruning, so as to reduce the time needed to train the networks and the size of the networks deployed at IoT devices. We also discuss approaches based on reinforcement learning techniques that enhance the autonomy of IoT devices.

Keywords Autonomous IoT Devices · Deep Neural Networks · Analytics at the Edge

Introduction
In future smart cities, many decision processes in critical infrastructure and emergency management will be based on machine learning (ML) techniques. One particular application is the processing of large datasets of visual images and other types of data for defect assessment, where the data is collected by a swarm of IoT devices (devices, for short), some of which can be mobile, e.g., small unmanned aerial vehicles (UAVs) and robots. In this context, examples of defective regions are corrosion and cracks in buildings and facilities [3], and potholes on roads. A critical requirement for the success of such assessment processes is the reliable detection, quantification, and localization of defective regions. Furthermore, in such applications, a real-time assessment is often critical so that the swarm can decide on the optimal strategy and corresponding actions for effective data collection in unknown environments, e.g., robots used for earthquake reconnaissance and rescue that enter buildings whose plan is unknown to them. On the other hand, for such assessments to be reliable it is critical that data be of good quality, since poor data may negatively affect the accuracy of classification and predictions, and consequently may introduce additional costs and time overhead.
In general, acquiring data and making sure that the data is of high quality, especially for real-time decisions, is expensive due to difficulties in reaching the regions where the objects of interest are located and the need for human-intensive assessment. However, today we have many technologies that can be leveraged to devise effective and inexpensive solutions, including: deep neural networks for image analysis; image processing techniques; devices able to acquire images and other types of data; 5G networks and edge computing [6]; and crowd-sourcing [1]. Mobile devices such as drones are becoming quite powerful in terms of the sensing capabilities they offer; as an example, the nano Black Hornet 3 drone is equipped with a microcamera core and a visible sensor to allow for enhanced image fidelity [11].
The use of devices for data acquisition is, of course, not new, and almost all application domains we may think of use these devices, including civil infrastructures, smart cities, smart agriculture, emergency management, and environmental protection. However, the common practice is to use the devices just as a means of data collection. Data is acquired by devices and then transmitted to some centralized large server, such as a cloud server, for processing and analysis. Such an approach is not optimal in many situations. Intermittent communications and communication disruptions, as in the case of battlefield and emergency management scenarios [7], may make it difficult to transmit data to a centralized server. In addition, in many such scenarios, it is critical to quickly analyze the data and, based on the analysis results, determine whether additional data needs to be collected or specific actions executed. For example, consider the case of the collapse of the Morandi Bridge in Genoa [5]. In such a scenario, being able to quickly detect anomalies in a bridge and notify incoming vehicles would save human lives, and a few seconds may make the difference. Approaches by which data has to be sent to a cloud server would not be viable.
Another important consideration is that for many applications it is critical to ensure that data be of good quality. For instance, in the case of image data, it is important that the image of an object of interest (e.g., a crack on a wall) not be occluded. Completeness (e.g., making sure that no relevant data is missing) is equally critical. For example, if a failure is detected on one side of a structure, it is critical to determine whether the failure extends to the other sides of the structure. Currently, assessing the quality and completeness of acquired data requires not only sending the data to remote servers, but often also involving human analysts to evaluate the data and provide further data acquisition instructions to the remote devices. This approach is not effective and requires extensive real-time human involvement. Furthermore, this approach would not be scalable as the number of data acquisition devices increases. Additionally, as new phenomena of interest arise in the application domain of interest, it is critical to reduce the time required to deploy the needed data analytic solutions. For example, machine learning approaches that require labeled training data sets that are expensive and/or time-consuming to collect and label may not be effective for many scenarios. We thus need approaches by which:
- Devices can directly perform analyses on the collected data. When there are real-time requirements, transmitting the analysis results requires significantly less communication bandwidth than transmitting the raw data and, in addition, may allow the devices to quickly send high-priority safety information to humans, vehicles, and other parties.
- Devices can autonomously decide which data to collect based on data they have already collected and locally analyzed.
- The time required for generating the analysis models deployed at the edge is minimized as much as possible.
In this paper, we discuss approaches addressing the above three requirements and outline a research roadmap. We focus on data analysis techniques based on deep convolutional neural networks (CNNs) and investigate the use of transfer learning (TL), so as to minimize the time and cost of training the networks, and CNN pruning, so as to reduce the size and inference costs of the networks for deployment at devices. We also discuss approaches supporting autonomous and adaptive data collection based on the use of reinforcement learning (RL).

Analytics at Devices
In this section, we focus on the problem of efficiently training CNNs and deploying them at devices. We first summarize initial results from one of our projects. We then briefly discuss related work and outline open research directions. Figure 1 shows the high-level steps of our approach for device-based failure detection [8]. Since multiple types of damage typically exist in infrastructures (e.g., cracks, corrosion, spalling, and exposed rebars on a concrete surface), it is often very difficult and/or expensive to acquire and label sufficient data for network training. To address this limitation, we have adopted TL, where a pre-trained deep CNN is used to detect a new type of damage. TL is a very popular choice for vision-based infrastructure assessment, since it requires less training data than training a CNN from scratch. To this end, we have used very large networks that have shown success in the ImageNet Challenge. The drawback is that the number of damage classes for civil infrastructure assessment is far smaller than the number of classes in ImageNet (i.e., 1000 classes). This means that while the feasibility of TL has been acknowledged for health monitoring of civil infrastructures, this solution is not efficient (i.e., the networks are unnecessarily deep for the problem of interest) and, consequently, TL alone is not suitable for detection at devices. To address this issue, we have used network pruning to enhance the resource efficiency of on-device analysis while still maintaining good detection accuracy. This approach allows one to deploy deep CNNs that are quite accurate, require low storage and computing resources, and can make decisions very quickly at devices. We have tested this approach for the detection of crack [3] and corrosion [9] surface defects. The methodology starts with a pre-trained network (e.g., VGG16 [10]) and iteratively reduces the network size by using the Taylor-expansion based pruning technique [12].
Since the pre-trained network is originally designed for the ImageNet 1000 image categories, it is very large in size and may contain redundant convolution kernels that do not contribute to the new detection problems of interest. The pruning technique evaluates the importance of the convolution kernels and removes the kernels with the least contribution. After removing the kernels, the pruned network is fine-tuned again to enhance its performance for damage detection. Based on the detection performance, the user can determine whether or not to further prune the network following the same procedure.
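The prune, fine-tune, and evaluate cycle described above can be sketched as a simple loop. This is only a minimal illustration under stated assumptions: `prune_step` and `fine_tune_and_eval` are hypothetical callbacks standing in for the actual pruning and fine-tuning code, the filter counts and accuracies in the usage lines are toy values, and the tolerated accuracy drop `max_drop` is a user-chosen parameter.

```python
from typing import Callable, List, Tuple


def iterative_prune(
    n_filters: int,
    prune_step: Callable[[int], int],
    fine_tune_and_eval: Callable[[int], float],
    baseline_acc: float,
    max_drop: float,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Repeat prune -> fine-tune -> evaluate until accuracy drops too far.

    prune_step(n) returns the filter count after one pruning round
    (hypothetical callback); fine_tune_and_eval(n) returns the accuracy
    of the fine-tuned network with n filters (hypothetical callback).
    Returns the last accepted filter count and the (count, accuracy) history.
    """
    history: List[Tuple[int, float]] = []
    kept = n_filters
    while True:
        candidate = prune_step(kept)
        if candidate >= kept or candidate <= 0:
            break  # nothing more can be removed
        acc = fine_tune_and_eval(candidate)
        history.append((candidate, acc))
        if baseline_acc - acc > max_drop:
            break  # reject this round and keep the previous size
        kept = candidate
    return kept, history


# Toy usage: halve the filters each round; accuracy collapses below 512 filters.
kept, history = iterative_prune(
    n_filters=4096,
    prune_step=lambda n: n // 2,
    fine_tune_and_eval=lambda n: 0.99 if n >= 512 else 0.85,
    baseline_acc=0.99,
    max_drop=0.03,
)
```

In this toy run the loop accepts the networks with 2048, 1024, and 512 filters and rejects the one with 256, so the caller keeps the 512-filter network.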

Initial Results
In our experiments we started from VGG16 and used 29,468 crack, 29,780 non-crack, 33,039 corrosion, and 34,148 non-corrosion image patches to train and test the network. In this case, pruning stops if the detection accuracy after fine-tuning drops by more than 3%. When up to 84% of the filters are pruned, the mean detection accuracy after fine-tuning for crack and corrosion is approximately 99%, with a standard deviation below 0.5%. This demonstrates the robustness of the proposed approach, as the variations in performance are quite small as long as the pruned network still has the capacity to deal with the detection task. When 97% of the filters are removed (i.e., only 128 filters are left), the mean accuracy of crack detection drops to 84.7% with a standard deviation of 19.86%, and the mean accuracy of corrosion detection drops to 96.0% with a standard deviation of 0.95%. This indicates that the pruning should be terminated due to the increasing variation and decreasing accuracy in detection performance. We also compared the inference time (i.e., the total time required to classify one 720 × 540 image) of VGG16 and ResNet18 [13] when deployed at devices (i.e., NVIDIA Jetson TX2 GPU) for damage detection. By removing 84% and 79% of the convolution kernels from VGG16 and ResNet18, respectively, the inference time for crack detection decreases from 279.7 sec to 31.6 sec for VGG16 and from 36.8 sec to 8.9 sec for ResNet18. For the corrosion dataset, the inference time decreases from 275.7 sec to 30.6 sec and from 34.1 sec to 9.0 sec for VGG16 and ResNet18, respectively. In terms of memory reduction, the memory demand of VGG16 drops from 525 MB to 125 MB, and the demand of ResNet18 drops from 44 MB to 2 MB. By utilizing network pruning, VGG16 achieves an 89% reduction in inference time and an 80% reduction in memory, while ResNet18 achieves a 76% reduction in inference time and a 95% reduction in memory demand, without decreasing damage detection performance.
Our results indicate that network pruning is an important step towards incorporating deep learning architectures into devices. However, there are still several open questions that need further research. One is the selection of the appropriate pruning algorithm to reduce the network size. Another is estimating the sensitivity of the pruning algorithm with respect to various network configurations. This sensitivity should be considered with respect to different aspects, e.g., inference performance and pruning efficiency.

Related Work
TL techniques have already been widely investigated [15], whereas pruning techniques have received wide attention more recently. Notable related work includes:
- Transfer Learning: Zhu et al. [16] addressed heterogeneous TL and used information from text data to improve a model's performance in image classification. Aytar et al. [17] and Tommasi [18] addressed the deficit of training samples for some categories by adapting classifiers trained for other categories. Oquab et al. [19] showed that layers trained on ImageNet [20] can be reused to extract the mid-level features of images in the PASCAL VOC dataset. Shin et al. [2] addressed two specific computer-aided detection problems in medical images by fine-tuning CNNs pre-trained on a large training set, such as CIFAR-10 [21], which has 60,000 images from ten different classes. They further explored different popular CNN architectures and their performance on datasets of different sizes, concluding that the trade-off between learning more accurate models and using more training data should be carefully considered. Collobert et al. [22] explored TL for natural language processing. Hwangbo et al. [23] showed how to enhance RL by applying TL. More recently, Singla et al. [14] developed a TL approach based on generative adversarial networks (GANs) for data other than images.
- Model Pruning: Network pruning has been investigated in the context of CNNs. The early work of LeCun et al. [27] showed that network pruning is a valid strategy to reduce network complexity and over-fitting by using a diagonal Hessian-based approximation. In this approach, neurons are removed based on quantities computed from the Hessian matrix. Recently, Han et al. [28] proposed an effective compression approach for CNNs. They tested their approach on both VGG16 and AlexNet [40] and on different hardware platforms. The experiments showed that their approach was able to substantially reduce the size of the networks without losing accuracy. He et al. [29] proposed a kernel pruning algorithm for CNNs. Their experimental results show that for VGG16 their approach was able to obtain a 5× speed-up with only a 0.3% increase in error.

Research Directions
In many applications, such as civil infrastructure monitoring and emergency management, the amount of training data is limited and thus TL is essential to still obtain good models even with limited data. Pruning is then critical to reduce the size of the models obtained from transfer learning. However, combining these two techniques requires analyzing the optimal ordering of TL and pruning steps, and assessing the impact that different strategies would have on the performance of different CNNs. Equally important is to assess and optimize the computing resources required for inference at different edge devices. In what follows we discuss some relevant research directions.

Optimal ordering for the execution of pruning and TL steps
It is important to determine whether it is better to first execute TL and then pruning (the strategy adopted in our preliminary work) or vice versa. Both strategies can be beneficial. By applying TL first, one can make sure that the model adapts well to the target dataset, and then pruning can remove redundant neurons or layers. In this approach, the probability of removing the right neurons or layers from the model is higher. However, if the target training dataset is small, it may be better to first reduce the size of the model by applying pruning. More neurons or layers in a model means more parameters, and thus longer training, fine-tuning, and inference times. In addition, training a large model with a small amount of data may cause the model to over-fit. Consequently, if the target dataset is small, it may be better to first remove redundant neurons or layers, and then train the network with the small dataset.

Optimal pruning strategy
There are three pruning strategies that can be adopted:
- Neuron removal. This is a common pruning strategy. There are various criteria for removing neurons:
  - Threshold-based pruning. This strategy analyzes the weights of neurons and removes those having a weight less than a threshold. Results by Han et al. [24] show that this strategy reduces the number of neurons by a factor of 9× without incurring accuracy loss.
  - Taylor-based pruning. This strategy, used in our preliminary work, uses a Taylor expansion [12] to determine the importance of neurons and thus allows one to remove the least important ones.
- Layer removal. Previous results by He et al. [25] have shown that most of the neurons in the middle layers of ResNet50 have zero weight. Such neurons not only do not perform any feature extraction, they may also result in information loss. Thus, it is possible to skip those redundant layers by introducing shortcut connections [26] that skip one or more layers.
- Kernel removal. Previous results by He et al. [29] have shown that kernel removal is another possible approach to prune CNNs, since images have three color channels (red, green, and blue) that correspond to three kernels in the CNN architecture. Thus, in some applications not every kernel is needed. For example, if one needs to detect red roses, the red kernel would play a more important role than the other kernels. Therefore one can keep the red kernel and remove some redundant kernels. Most of the time, a CNN has more than the three RGB kernels, so there are several redundant kernels that could be removed.
Experimental comparisons to determine which pruning strategy works better depending on specific datasets and networks are critical. In addition, an interesting approach to explore would be to combine the three different pruning strategies: neuron, kernel, and layer removal. For instance, in ResNet50, after removing some layers, redundant or useless kernels and neurons could still be left in the network. Removing those kernels and neurons may further reduce the network size and inference time.
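The two neuron-removal criteria discussed above can be illustrated with a small sketch. It assumes the commonly stated form of each criterion: threshold-based pruning keeps a weight only if its magnitude reaches a threshold, and the first-order Taylor criterion scores a filter by the absolute value of the mean product of its activations and the corresponding loss gradients. The function names and the toy numbers are illustrative only and do not come from the experiments reported in this paper.

```python
from typing import List, Sequence


def magnitude_keep_mask(weights: Sequence[float], threshold: float) -> List[bool]:
    """Threshold-based criterion: keep a weight only if |w| >= threshold."""
    return [abs(w) >= threshold for w in weights]


def taylor_scores(
    activations: Sequence[Sequence[float]],
    gradients: Sequence[Sequence[float]],
) -> List[float]:
    """First-order Taylor criterion, one score per filter.

    activations[k] and gradients[k] hold, for filter k, the activation
    values and the loss gradients with respect to those activations.
    """
    scores = []
    for act, grad in zip(activations, gradients):
        products = [a * g for a, g in zip(act, grad)]
        scores.append(abs(sum(products) / len(products)))
    return scores


def least_important(scores: Sequence[float], n_remove: int) -> List[int]:
    """Indices of the n_remove filters with the smallest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_remove])


# Toy example with three filters and two activation positions each.
acts = [[1.0, 2.0], [0.1, 0.1], [3.0, -1.0]]
grads = [[0.5, 0.5], [0.2, -0.2], [1.0, 1.0]]
scores = taylor_scores(acts, grads)
to_remove = least_important(scores, n_remove=1)
mask = magnitude_keep_mask([0.5, -0.01, 0.2, 0.001], threshold=0.05)
```

Here the second filter receives the lowest Taylor score because its activation-gradient products cancel out, so it would be the first candidate for removal.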

Multi-step Transfer Learning and Pruning
There can be scenarios where one might need multi-step TL and pruning (MSTLP). MSTLP refers to obtaining a target network, specialized for certain classification decisions, by performing multiple TL and pruning steps. In such an approach one would start from an initial general network. Then, an intermediate trained network is derived from the initial one by applying TL and pruning. From the intermediate network one can derive another intermediate network and continue this procedure. For example, one can use TL and network pruning on a network initially trained using a generic dataset, such as ImageNet, in order to obtain an optimized network model for civil infrastructure monitoring applications that can detect various defects such as cracks and corrosion. One can then use this intermediate model to train specialized binary classification models that just detect specific categories such as corrosion/no-corrosion or crack/no-crack. This multi-step approach has several potential advantages over a single-step TL approach that learns directly from a generic dataset: (a) TL and network pruning can be much faster if they start from an intermediate specialized network than if they start from a larger generic network trained for a large number of categories not relevant to the application of interest. (b) The use of intermediate networks enhances flexibility, since one can use the intermediate network when more than one class must be identified and a specialized network when just one category of objects or defects must be detected. For instance, one might be interested only in identifying cracks for civil infrastructure inspection purposes. (c) Detection accuracy may be enhanced, as MSTLP may help prevent overfitting. (d) The networks obtained from MSTLP are expected to be smaller than the ones obtained by training specialized classifiers directly from the generic dataset. Investigating MSTLP is an interesting research direction.
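The chain of networks in MSTLP can be pictured with a small bookkeeping sketch. It does not train anything: `transfer` and `prune` are hypothetical stand-ins that only track the class list and the filter count of each derived network, and the filter counts and keep fractions are made-up values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Net:
    """Minimal description of a network: name, output classes, filter count."""
    name: str
    classes: Tuple[str, ...]
    n_filters: int


def transfer(net: Net, name: str, classes: Sequence[str]) -> Net:
    """Stand-in for TL: reuse the filters, replace the output classes."""
    return Net(name, tuple(classes), net.n_filters)


def prune(net: Net, keep_fraction: float) -> Net:
    """Stand-in for pruning: keep a fraction of the filters (at least one)."""
    return Net(net.name, net.classes, max(1, int(net.n_filters * keep_fraction)))


def derive(net: Net, name: str, classes: Sequence[str], keep_fraction: float) -> Net:
    """One MSTLP step: TL to the new classes, then pruning."""
    return prune(transfer(net, name, classes), keep_fraction)


# Generic network -> infrastructure network -> specialized binary detectors.
generic = Net("generic", tuple(f"class_{i}" for i in range(1000)), 4096)
infra = derive(generic, "infrastructure", ["crack", "corrosion", "spalling", "none"], 0.25)
crack = derive(infra, "crack-detector", ["crack", "no-crack"], 0.5)
corrosion = derive(infra, "corrosion-detector", ["corrosion", "no-corrosion"], 0.5)
```

Both binary detectors start from the already reduced intermediate network rather than from the generic one, which is the source of the expected savings listed in points (a) and (d).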

Adaptive Data Acquisition
We now focus on the other crucial aspect of a smart health inspection system based on intelligent autonomous devices. Consider the case of an inspection device, such as a robot sent to inspect a civil infrastructure (e.g., a bridge). In this case, the device would carry some sensors (e.g., cameras) and perform on-board analyses using the approaches discussed in the previous section. The device will first need to identify a target of interest (e.g., crucial structural components of a bridge), navigate to a position closer to the target, and then collect data to determine the presence of damage as well as whether to proceed with a more detailed inspection. During the inspection process, the device faces a decision-making problem whenever an input data sample is acquired. For instance, what should be the next movement of the device if no damage of interest is present in the input image? Is the device close enough to the target? Is there any damage detected on the target? Is there any other potential damage near the target that requires more data collection? Is the collected data of good enough quality for the device to make the next decision? In such a context, where many decisions need to be taken, RL appears to be a suitable approach. RL, inspired by how humans learn, is a technique to determine the optimal decision by interacting with the environment. Compared to the conventional Q-learning algorithm, which becomes extremely inefficient for large-scale state-space problems [39], the incorporation of deep neural networks (DNNs) into RL makes the learning process able to deal with high-dimensional problems. In what follows we refer to DNN-based RL as DRL.

Related Work
Recent advances in the fields of robotic navigation and control have demonstrated the great potential of DRL for smart inspection systems and other similar applications. Tai et al. [30] demonstrated the capability of using DRL to train a robot to reach a pre-defined target location without collision in a map-less environment. Cheng and Zhang [31] used DRL to navigate a boat to a target while avoiding obstacles on the way. Mirowski et al. [32] used DRL for training agents to navigate in large and visually rich environments with various starting points and destinations. Zhu et al. [33] developed a robotic navigation system for indoor scenarios using DRL. The robot is trained in a simulated environment and is able to navigate to user-defined indoor objects (e.g., a sofa or a desk) in the testing stage. Hwangbo et al. [23] used DRL to control a quadrotor so that it stabilizes itself when subjected to extreme external forces. Moreover, DRL has been applied in various video games where a trained artificial intelligence (AI) system is able to outperform human players. van Hasselt et al. [34] proposed the double Q-learning algorithm to address the overestimation problem encountered by the deep Q-network (DQN). The performance of double Q-learning was tested on multiple Atari games and it was shown to be effective. Wang et al. [38] proposed a dueling network architecture to decouple the estimation of value and advantage in DQN. The dueling network outperformed the double Q-learning algorithm [34] in challenging Atari games. Schaul et al. [35] showed that by integrating prioritized experience replay into the double Q-learning algorithm proposed in [34], the performance of the AI system in playing Atari games can be further enhanced. In [36,37], DRL was used to play first-person shooter (FPS) games (e.g., the Doom video game). The agent was trained to explore the map, collect items, and search for and fight against enemies. Due to DRL's success in these challenging decision-making problems, we consider DRL a good fit for smart inspection systems.
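Two of the DQN refinements mentioned above are compact enough to sketch. The code below shows the double Q-learning target, in which the online network selects the next action and a separate target network evaluates it, and the dueling aggregation, which combines a state value with mean-centered advantages. Q-values are passed in as plain lists here; in practice they would be network outputs, and all names are illustrative.

```python
from typing import List, Sequence


def double_q_target(
    reward: float,
    gamma: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    done: bool = False,
) -> float:
    """Double Q-learning target for one transition.

    The online estimates choose the greedy next action; the target
    estimates supply its value, which reduces overestimation.
    """
    if done:
        return reward
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best]


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


# The online estimates prefer action 1, so the target estimate for action 1 is used.
target = double_q_target(1.0, 0.5, [0.2, 0.9, 0.1], [1.0, 2.0, 3.0])
q_values = dueling_q_values(1.0, [1.0, 2.0, 3.0])
```

Note that the target uses the value 2.0 of the action chosen by the online estimates, not the maximum 3.0 of the target estimates, which is the point of the double Q-learning correction.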

An Example Inspection Scenario and a DRL-based Framework
Consider an inspection device that starts in an initial state $S_0$ and observes data $D_0$ (e.g., an image). Based on $D_0$, the device selects an action $A_0$ from the action set $A$ and moves to a new state $S_1$. By interacting with the environment (or an environment simulator), an immediate reward $R_1$ is assigned to the device, and the device moves to new states by repeating the steps of acquiring data, selecting actions, and receiving rewards. The objective of RL is to find the optimal policy, i.e., the set of best actions based on the observed data, that maximizes the cumulative reward for the device. During training, the underlying driving mechanism is the Bellman equation [39], which updates the Q-value of picking an action $a$ at state $S_i$. In this update, $R(S_i, a, S_{i+1})$ is the immediate reward after choosing action $a$ at state $S_i$, $t(S_i, a, S_{i+1})$ is the transition time from state $S_i$ to $S_{i+1}$, $\max_{b \in A(S_{i+1})} Q(S_{i+1}, b)$ is the maximum Q-value among all the actions $b \in A(S_{i+1})$ in the next state $S_{i+1}$, $k$ is the iteration number, $\alpha_k$ is the learning rate at iteration $k$, $\rho_k$ is the average reward at iteration $k$, and $\gamma$ is the discount factor for the expected future rewards. At the early stage of training, the device has a higher tendency to choose exploratory actions in order to discover good actions in the state-action space. As the training proceeds, the device will gradually shift to greedy actions, i.e., the actions that have the highest Q-value. Note that the proposed DRL approach approximates $Q(S_i, a)$ with DNNs in order to account for the infinite number of state-action pairs in our problem. Figure 2 illustrates the fundamental concept of RL in the context of smart inspection systems.
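One update rule consistent with the symbols defined above is the discounted Q-learning update with an average-reward correction for semi-Markov decision processes. This particular form is an assumption, since the text only lists the terms involved; the placement of $\rho_k$ and $t(\cdot)$ in particular follows the standard formulation and may differ from the one intended in [39]:

```latex
Q_{k+1}(S_i, a) \leftarrow (1 - \alpha_k)\, Q_k(S_i, a)
  + \alpha_k \Big[ R(S_i, a, S_{i+1}) - \rho_k\, t(S_i, a, S_{i+1})
  + \gamma \max_{b \in A(S_{i+1})} Q_k(S_{i+1}, b) \Big]
```

When the average-reward term is dropped (i.e., $\rho_k = 0$), this reduces to the familiar one-step Q-learning update.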
In our inspection scenario, the device would first collect data from a greater distance to capture an overview of the building, and would then move to a closer position to thoroughly inspect the damage on a first-floor column. Such a scenario is reflected by our example DRL-based framework shown in Figure 3. A CNN is employed to extract features from the input image, and the features are sent to both the navigation network and the damage detection network. The navigation network deals with the identification of the target of interest (e.g., the first-floor columns shown in Figure 4) and determines the movement of the device. The detection network identifies the presence of damage on the target and determines whether the data is of good quality (e.g., lack of blurriness and/or completeness), and whether to proceed with more data collection. Consider the image shown in Figure 4. The actions for the device to take can be defined as: {$a_1$: a column exists and the device moves forward by distance $d_1$; $a_2$: a column exists and the device moves forward by distance $d_2 < d_1$; $a_3$: no column exists and the device randomly picks the next movement direction and distance}. Since the device does see a column and should move substantially closer to it, the corresponding rewards can be assigned as {$R_1 = 1.0$, $R_2 = 0.5$, $R_3 = 0$} so that the device learns to move farther when seeing images similar to the one in Figure 4. Note that the navigation and detection networks work jointly during the inspection, as the decision made by one network affects the decision of the other. For instance, the navigation network will keep making decisions if the detection network cannot identify the presence of damage in Figure 4. Once the device moves to a position closer to the column, e.g., Figure 5, the actions for the detection network can be defined as: {$a_1$: a crack is detected and the device should move slightly to collect more data; $a_2$: a crack is detected and no further data collection is required; $a_3$: no damage is detected}. Since the device only captures a portion of the damaged column and should collect more data, the corresponding rewards can be assigned as {$R_1 = 1.0$, $R_2 = 0.5$, $R_3 = 0$} so that the device learns to collect more data when seeing images similar to the one in Figure 5.
Fig. 3: The example DRL framework.
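The reward assignments in the two examples can be tried out in a toy tabular setting. The sketch below is a deliberate simplification and not the proposed DRL framework: it treats each example image as a one-step problem, uses a lookup table instead of a DNN, applies the plain Q-learning update without the average-reward and transition-time terms, and uses a linearly decaying exploration rate. The action names are hypothetical labels for the actions $a_1$, $a_2$, and $a_3$ of each network.

```python
import random
from typing import Dict

# Reward tables for the two example images (values as given in the text).
NAV_REWARDS = {"a1_move_d1": 1.0, "a2_move_d2": 0.5, "a3_random_move": 0.0}
DET_REWARDS = {"a1_collect_more": 1.0, "a2_no_more_data": 0.5, "a3_no_damage": 0.0}


def choose_action(q: Dict[str, float], epsilon: float, rng: random.Random) -> str:
    """Epsilon-greedy selection: explore with probability epsilon."""
    actions = sorted(q)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[a])


def q_update(q: Dict[str, float], action: str, reward: float,
             next_max_q: float, alpha: float, gamma: float) -> None:
    """Plain one-step Q-learning update (simplified, see lead-in)."""
    q[action] += alpha * (reward + gamma * next_max_q - q[action])


def train_on_image(rewards: Dict[str, float], episodes: int = 200,
                   alpha: float = 0.5, gamma: float = 0.9,
                   seed: int = 0) -> Dict[str, float]:
    """Learn Q-values for a single image treated as a one-step episode."""
    rng = random.Random(seed)
    q = {action: 0.0 for action in rewards}
    for k in range(episodes):
        epsilon = max(0.05, 1.0 - k / episodes)  # explore early, exploit later
        action = choose_action(q, epsilon, rng)
        # The episode ends after one action, so the next-state value is zero.
        q_update(q, action, rewards[action], 0.0, alpha, gamma)
    return q


nav_q = train_on_image(NAV_REWARDS)
det_q = train_on_image(DET_REWARDS)
best_nav = max(sorted(nav_q), key=lambda a: nav_q[a])
best_det = max(sorted(det_q), key=lambda a: det_q[a])
```

After training, the greedy choice for the distant view is the larger forward movement, and the greedy choice for the close-up view is to collect more data, matching the intent of the reward assignments.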

RL framework
Our DRL-based inspection framework then has to support two important tasks: 1) routine inspection, in which the device is trained to perform regular inspections by scanning through the structure and identifying all possible instances of damage; and 2) urgent inspection, in which the device is trained for rapid assessment (e.g., inspections after a natural hazard), evaluating the severity of damaged regions and performing more data collection around the damage of top priority (i.e., not a thorough inspection of whole structures).

Research Directions
DRL-based autonomous smart inspection system under two scenarios, i.e., routine inspection and urgent inspection.
Although DRL has been demonstrated to be effective in robotic navigation, there are challenges that need to be addressed in the context of health inspection. For instance, in the existing approaches the target or destination is predefined and is identical in the testing stage, e.g., the items and the enemies in the Doom game [36]. In our context, the target of interest and the damage encountered in the testing stage may have patterns that are similar, but not identical, to those in the training data. The sensitivity of DRL to changes in the objects of interest needs a thorough investigation. Various network configurations have been proposed in previous work to enhance the performance of DRL in playing Atari games. However, no particular configuration outperforms the others in all categories of games [34]. Therefore, research is needed to determine which configurations are best for inspection systems.

Evaluation of the benefit of an environment simulator.
Unlike past work in which an environment simulator, such as a gaming engine or a predefined map, is available, our adaptive DRL data collection framework infers the information directly from the acquired training data. This is particularly useful for urgent inspection after natural hazards as the environment often changes drastically. An interesting research direction is to analyze whether an approximate simulator can bring additional benefits to the inspection system.

Transfer learning for reinforcement learning agents.
A critical challenge related to real-time requirements is that RL has high convergence times. To address this issue, an approach is to use TL techniques based on GANs [14].
The application of such a methodology to quickly bootstrap the DQN in the control RL system has two variations: The key goal here is balancing domain loss and quality loss while training the GAN. Since there is only one learning step here, we need to introduce a bias (target domain over source domain) in the GAN itself; this can be done when selecting batches to train the GAN. It will be interesting to compare the two approaches and investigate additional specific aspects related to TL for RL agents used for inspection systems.

Concluding Remarks
In this paper we have discussed the use of devices in a critical application domain, that is, the monitoring of civil infrastructures for defect identification and assessment. We have focused on the use of ML techniques to allow devices to carry out data analyses on-board and to enhance the autonomy of devices. Device autonomy is critical when dealing with emergency situations in which network communications can become fragmented. The paper has discussed several research directions. However, ML for IoT systems is an important emerging application area with many other interesting research directions.