Application of regression methods for classification of sewers’ damages

The paper presents the possibilities of applying selected regression methods (classification trees, support vector machines, K-nearest neighbours, artificial neural networks) to the classification of sewer damage. Operational data for the years 2006–2011, obtained from a water utility, were used for the deterioration analysis. The modelling was carried out on the basis of the following independent variables: diameter, depth, year of construction, material and season in which the damage occurred. The following kinds of damage were classified: corrosion, crack, longitudinal crack, displacement, unsealing, failure, collapse. The main aim of the paper was to check whether the prediction methodology could be useful for classifying different kinds of sewer damage. The results indicate that the proposed classification methods are not appropriate for the qualitative analysis of registered sewer damage. Moreover, it is recommended that water and sewerage companies register types of failures using a unified notation, which would make a preliminary classification easier before a modelling approach is applied. The calculations were performed in Statistica 13.1 software.


Introduction
Sewage removal systems belong to the critical infrastructure, and their proper operation and maintenance are very important tasks from the point of view of both the operators and the users. For many years in Poland and in the world, quantitative research has been conducted to analyse in depth the causes and effects of failures and the number of failures and abnormalities occurring in sewers (Kuliczkowska 2016, 2017; Caradot et al. 2018; Miszta-Kruk 2016; Ana et al. 2009). The quality and moisture of soil surrounding the buried infrastructure have a great influence on the technical condition of pipes (Iwanek 2018). Much attention is devoted to problems relating to the renovation or upgrading of sewers (Ariaratnam and MacLeod 2002; Madryas and Wysocki 2008; Duchesne et al. 2014) and to planning and carrying out video camera pipe inspections (Plihal et al. 2016; Harwey and McBean 2014).
The main aim of this paper is to show how selected regression methods could be used to classify and predict types of sewer damage. Recently, an attempt was made to use modelling based on the multicriteria decision process to select sewer sections for replacement and renovation (Kessili and Benmamar 2016). Moreover, artificial intelligence techniques (Tran et al. 2006) were used to analyse the failure rate of a storm sewer system. Such a system (a drainage structure in urban areas) should be designed on the basis of the latest data concerning the amount of precipitation in the analysed area (Wartalska et al. 2020). Prediction methods, used to model phenomena which are difficult to describe because of the dynamically changing interdependences between the variables, are increasingly often used in broadly understood environmental engineering (Czapczuk et al. 2015), e.g. to analyse the failure frequency of pumping units (Bevilacqua et al. 2003), to locate water-pipe network leakages (Candelieri et al. 2014), in hydraulic calculations of water distribution networks (Czapczuk et al. 2017; Piasecki et al. 2018) and to predict the hydraulic conductivity of the ground (Elbisy 2015). The prediction methods include: classification trees (CT), support vector machines (SVM), K-nearest neighbours (KNN) and artificial neural networks (ANN). The particular methods are based on different assumptions and entail the proper selection of model parameters. In the case of artificial neural networks, it is necessary to determine the number of neurons in the hidden layer and select an appropriate training method. Classification trees are based on successive splits of the data on predictor values at the particular tree levels. The KNN method assigns a case on the basis of its distance, measured in the space of the independent variables, from cases with a known value of the dependent variable. The SVM method is based on the notion of decision planes that separate the classes.
The prediction of variables (by means of regression and classification methods) is a two-stage process: first, a model describing the investigated phenomenon is created; then, the model is applied to predict the selected dependent variable on the basis of the independent variables (predictors). More information about the above methods can be found in Hastie et al. (2008), Breiman et al. (1984) and Cristianini and Shawe-Taylor (2000). Hereafter, the terms "regression method" and "modelling method" are used interchangeably. Sewerage systems are relatively old, and operators all over the world, as well as scientists, should pay more attention to proper maintenance, rehabilitation and active damage localization. Statistical models have been used to assess damage to sewers elsewhere in the world, e.g. sewer pipe condition was established by means of neural networks (Hassan et al. 2019), the diagnosis of sewer defects was based on support vector machines (Yang and Su 2008), and random forest was used to model sewer deterioration (Hansen et al. 2020). Nevertheless, the literature on the subject indicates that so far broadly understood regression and classification methods (prediction methods) have not been used to evaluate the condition and failure frequency of sewerage networks in Poland. Considering this fact, the authors decided to undertake research in this area. The proper determination of the most important model parameters and the selection of the number and kinds of independent variables influencing the predicted values of the dependent variable seem to be the crucial problems in mathematical modelling using a machine learning approach. Moreover, it is necessary to check whether the application of selected prediction methods could be useful for the rational management of buried infrastructure.
The studies carried out showed that the operational data registered by water and sewerage companies need to be properly organised, which should increase the possibilities of using such information to construct reliability models in the future.

Material and methods
Operational data for the years 2006–2011 concerning the recorded sewerage network failures, obtained from one of the water and sewerage companies in a selected Polish town (with more than 300,000 inhabitants), were used to classify and predict types of sewerage network damage by means of the four regression methods mentioned above. The sewerage network is a gravitational separate system. There are several pumping stations because of differences in altitude. Sanitary sewers (with a total length of almost 568 km at the end of 2011) collect and transport sewage to the wastewater treatment plant. The length of sewer laterals amounted to nearly 198 km at the end of 2011. Some parts of the network were constructed at the beginning of the twentieth century. The sewerage network is still being expanded and renovated, and is made mainly of vitrified clay, concrete, reinforced concrete and PVC. Almost 81% of the total number of damage events occurred on sewers made of vitrified clay. The remaining failures were registered on concrete sewers (ca. 16%) and on PVC conduits (ca. 3%). The most deteriorated (circa 60% of all failures) were sewers built in the time span . The average sewer failure rate was 0.0444 fail./(km·a). Considering their different roles in the system and different damage occurrence determinants, separate classification models were built for sewers and sewer laterals. Such abnormalities in sewerage pipe functioning as clogging, choking or blockages were not taken into account. The focus was on typical damage, such as pipe breakage, corrosion, loss of integrity (leakage) and sewer cave-in. In some cases, the type of damage was not precisely defined: the documentation merely noted that a failure had occurred. In such cases, the general term "damage" was used without going into details as to the type and quality of the damage. These assumptions mean that not all damage events were taken into account in the statistical modelling.
The annual number of sewer failures considered in further investigations varied between 8 in 2011 and 29 in 2006. The type of damage (TD) was the dependent variable, while the independent variables (predictors), on the basis of which the classification was made, were the following parameters describing sewerage conduits: (1) quantitative parameters: conduit diameter (D), conduit laying depth (LD), year of conduit construction (YC) and (2) qualitative parameters: season in which the failure occurred (SE: winter (W), summer (SM)) and conduit material (M: vitrified clay (VC), cast iron (CI), concrete (C), PVC). The summer (non-heating) season comprised the months from March to October inclusive. Because of the relatively small sample size, V-fold cross-validation was used. In the case of the CT method, the following model parameters were adopted: equal misclassification costs, the Gini coefficient (a measure of random variable distribution concentration) as the goodness-of-fit measure, estimated a priori probabilities, tree pruning on misclassification error, a node count of 5, a maximum of 10 tree levels and a maximum of 1000 nodes. Four types of basis functions were used in SVM modelling: linear (L), sigmoidal (S), polynomial (P) and radial basis functions (RBF). The maximum number of iterations was 1000, the capacity was 10, and training ended at an error of 0.001. Rescaling of predictors and data narrowing were used. The number of support vectors and localized vectors is given when analysing the results for the selected SVM model. Four distance measures were used in the KNN method: Euclidean distance (KNN-E), squared Euclidean distance (KNN-E2), Manhattan distance (KNN-M) and Chebyshev distance (KNN-C). The number of nearest neighbours was set to 10 by trial and error; other values gave worse classification results.
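The four distance measures used in the KNN method can be illustrated with a minimal Python sketch. It is not the Statistica implementation: the function names, the toy records, the rescaled predictor values and the choice of k = 3 (instead of the 10 neighbours used in the study) are hypothetical and serve only to show how a majority vote among the nearest neighbours assigns a damage type.

```python
from collections import Counter


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def knn_classify(train_x, train_y, query, k, distance):
    """Majority vote among the k training cases closest to `query`."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: distance(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already rescaled predictors: (diameter, laying depth, year).
train_x = [(0.10, 0.20, 0.10), (0.20, 0.10, 0.20), (0.90, 0.80, 0.90),
           (0.80, 0.90, 0.70), (0.15, 0.25, 0.05)]
train_y = ["corrosion", "corrosion", "crack", "crack", "corrosion"]
query = (0.20, 0.20, 0.10)

measures = {"KNN-E": euclidean, "KNN-E2": squared_euclidean,
            "KNN-M": manhattan, "KNN-C": chebyshev}
predictions = {name: knn_classify(train_x, train_y, query, 3, dist)
               for name, dist in measures.items()}
print(predictions)
```

Because squaring preserves the ordering of non-negative distances, KNN-E and KNN-E2 always select the same neighbours and therefore give identical classifications. Agreement with the Manhattan and Chebyshev measures is not guaranteed in general, although it does occur for well-separated toy data such as the above.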
ANN models based on radial basis functions, in which the minimum and maximum numbers of hidden neurons amounted to, respectively, 10 and 14 (sewers) and 6 and 8 (sewer laterals), and multilayer perceptrons, for which the minimum and maximum numbers of hidden neurons amounted to 1 and 20, were built. The sum of squares and cross entropy were used as the error functions, together with the following activation functions: linear, logistic, hyperbolic tangent, exponential and sine. Altogether, 20 ANN models were built. In the analysed period, 95 failures of sewers and 39 failures of sewer laterals occurred. The whole data set was randomly divided into a training set (70% of the cases) and a testing set (30% of the cases). The training set data were used to build the models, while the testing set data were used to verify the modelling and its quality. Tables 1 and 2 show dependent variable and predictor values for, respectively, training and testing. The calculations were performed using Statistica 13.1.
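The random division into training and testing sets can be sketched as follows. This is an illustration only: the function name, the fixed seed and the rounding of the testing share down to a whole number of cases are assumptions, since the rule applied by Statistica is not stated here. Case identifiers stand in for the actual failure records.

```python
import random


def split_train_test(cases, test_percent=30, seed=0):
    """Randomly split `cases` into (training, testing) lists.

    The testing share is rounded down to a whole number of cases
    (an assumption made for this sketch).
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_test = (len(shuffled) * test_percent) // 100
    return shuffled[n_test:], shuffled[:n_test]


# 95 sewer failures, represented here only by hypothetical identifiers.
sewer_cases = list(range(95))
sewer_train, sewer_test = split_train_test(sewer_cases)
print(len(sewer_train), len(sewer_test))
```

For the 95 sewer failures, this rule gives 67 training and 28 testing cases, which agrees with the 28 testing cases and 67 support vectors reported for the SVM-L model in the results.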

Results and discussion
Figures 1 and 2 present the classification accuracy results for all the types of damage to the sewers and to the sewer laterals for the modelling methods: CT; SVM-L, SVM-S, SVM-P, SVM-RBF (the support vector machine method based on linear, sigmoidal, polynomial and radial basis functions); ANN-4 and ANN-5 (the selected optimal ANN models). The results of classification by means of the K-nearest neighbours method are for only the testing data. Therefore, they are presented in a tabular form (Table 3). The results for the training data for the KNN model are not available in Statistica.
The model and method most suitable for the classification of the types of damage to the considered sewerage conduits were selected on the basis of an analysis of classification accuracy. The latter is the ratio (expressed in per cent) of the correctly classified failures (with no division into the particular types) to all the failures, calculated separately for the training data and the testing data. Another parameter which should be taken into account when selecting a prediction method is the architecture of the given model. However, since four different methods were analysed as part of this study, a comparison of their structures is not a clear-cut task. Each of the regression methods is based on different assumptions and comprises different training parameters, which makes a comparative analysis quite difficult. Therefore, only a preliminary comparison based on the classification accuracy mentioned above was made. An analysis of Fig. 1 and Table 3 shows that none of the proposed methods accurately classifies types of sewer damage. In the KNN models, the classification accuracy results are the same for different measures of distance, which is surprising since each of the distance measures uses a different relation between the dependent variable and the predictors. The situation was different when water conduit damage intensity was predicted (Kutyłowska 2018). Thus, the above results put a question mark over the use of the KNN method for both regression and classification tasks. The best classification results were obtained using the support vector machine method based on the linear basis function (SVM-L) (Fig. 1). The damage type classification accuracy amounted to just under 42% and 32% for, respectively, the training data and the testing data. Detailed results of the classification of the particular types of damage by the SVM-L method, for which the number of support vectors and localized vectors amounted to, respectively, 67 and 0, are presented in Table 4.
Even though the general classification accuracy is at the level of 42% (training), the detailed results are not so promising. Only corrosion and cave-in were correctly classified by the model, whereas such abnormalities as shift, breakage and general damage were incorrectly classified. Unfortunately, the classification results for the testing data (Table 5) are even worse. The few types of damage which were correctly classified are in bold. The testing data set contained 28 cases, of which only 9 were unequivocally classified by the SVM-L model. Of the two longitudinal breakages, one was classified correctly, while corrosion and cave-in were assigned correctly in 100% of cases. Table 5 shows results for only the selected model. The results of classification by means of other methods and models were even worse.
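The accuracy measure used above, both overall and for the particular types of damage, can be reproduced with a short Python sketch. The function names and the five observed/predicted labels are hypothetical; only the figure of 9 correctly classified cases out of 28 testing cases comes from the results reported here.

```python
from collections import defaultdict


def classification_accuracy(observed, predicted):
    """Percentage of cases whose predicted damage type equals the observed one."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_type_accuracy(observed, predicted):
    """Accuracy (in per cent) calculated separately for each observed damage type."""
    totals, hits = defaultdict(int), defaultdict(int)
    for o, p in zip(observed, predicted):
        totals[o] += 1
        hits[o] += int(o == p)
    return {damage: 100.0 * hits[damage] / totals[damage] for damage in totals}


# Reported figure: 9 of the 28 testing cases were classified correctly.
overall_testing = 100.0 * 9 / 28
print(round(overall_testing, 1))

# Hypothetical labels, used only to show the per-type calculation.
observed = ["corrosion", "cave-in", "longitudinal breakage",
            "longitudinal breakage", "shift"]
predicted = ["corrosion", "cave-in", "longitudinal breakage",
             "corrosion", "corrosion"]
print(classification_accuracy(observed, predicted))
print(per_type_accuracy(observed, predicted))
```

The first value, about 32.1%, is consistent with the testing accuracy of 32% reported for the SVM-L model. The per-type breakdown shows how an overall figure can hide types of damage that are never classified correctly.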
The failures of the sewer laterals were classified somewhat better (Fig. 2). SVM-L was again chosen as the optimal model since the classification results for both the training and testing data were acceptable for all the types of damage. Although accuracy for the training data was higher in the case of SVM-RBF and SVM-P, the results for the testing data were slightly worse than for the model based on the linear basis function. As in the case of sewers, the SVM-L model had no localized vectors, while the number of support vectors was 29, i.e. it was equal to the sample size. Tables 6 and 7 show classification accuracy for the particular types of damage for the training data and the testing data.
Only the abnormality generally referred to as "damage" was correctly classified in all the considered cases. Cave-in and longitudinal breakage were classified correctly in 50% of cases, whereas most of the other types of damage were not predicted (classified) correctly by the chosen model. In the testing results, three of the four types of abnormalities (leakage, cave-in, damage) were classified correctly in about 50% of cases, which is not a wholly satisfying result. The causes of such poor classification results obtained with the selected regression methods should be sought in the peculiarities of the problem, as discussed below. In order to thoroughly investigate the problem of the unsatisfactory results of the classification of types of damage, one can use a table containing operational data. Table 8 shows a few illustrative cases (for training and testing) for sewer laterals. This quite representative sample is enough to demonstrate certain interdependences (or their absence) which had an unquestionable influence on the quality of the modelling of the classification problems by means of the regression methods.
An analysis of the exemplary operational data (Table 8) indicates their high variability. For example, a leakage occurred in a vitrified clay sewer lateral 150 mm in diameter laid 2.5 m under the ground surface in the non-heating season in 1980, whereas in the case of another sewer lateral made of the same material and having the same diameter, but laid 1.6 m below the ground in 1997, a shift of the conduit occurred. The classification results indicate that the selected regression methods are not flexible enough to correctly classify damage when there is a considerable discrepancy between the operational data used to build the models. Similar problems are encountered in the case of operational data and the modelling of types of sewer damage. The variability of the dependence between, e.g., the conduit material, its diameter and the abnormality to which it was subject in the analysed period, or the absence of such a dependence, is quite normal from the engineering and operational point of view. Nevertheless, such problems as indicated above should have no influence on the quality of modelling and the accuracy of classification, especially since as many as four different regression methods were selected for the modelling. The results presented in this paper are one of the first attempts at classifying damage types by means of regression methods, but they are not promising. Perhaps in the future this problem should be approached differently, i.e. a preliminary classification of damage should be made depending on what class of abnormalities it belongs to (e.g. a class of structural damage, a class of damage affecting conduit operation hydraulics, etc.), which could improve the quality of the modelling. Moreover, a separate problem is the damage classification system used in water and sewerage companies.
Its standardization and the use of precisely defined abnormalities and damage types would make it possible to obtain a more representative sample, which could significantly improve the quality of the results of analyses carried out using the prediction methods.

Conclusions
The obtained results show that the proposed regression methods (classification trees, the support vector machine method, the K-nearest neighbours method, artificial neural networks) do not classify types of damage to sewers in a satisfactory way, since the accuracy of the classification for all the types of abnormalities did not exceed 50% for the adopted SVM-L model. At the current stage of this research, one cannot unreservedly recommend any of the prediction methods for use in the qualitative evaluation of the condition and damageability of sewerage networks. Undoubtedly, further and more advanced research on the classification and prediction of qualitative variables is needed, also based on other operational data. Moreover, it seems necessary to standardize the system of classification and notation of damage types not only within a particular water and sewerage company, but also for the whole country. There are world guidelines and Polish recommendations proposed by Kuliczkowski and Kuliczkowska (2007) for classifying abnormalities occurring in sewerage networks, but their implementation in many water and sewerage companies still poses a difficulty.
Funding
The author(s) received no specific funding for this work.

Conflict of interest
The authors declare no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.