A fault prediction method for catenary of high-speed rails based on meteorological conditions

The fault frequency of a catenary is related to meteorological conditions. In this work, based on historical data, the catenary fault frequency and the weather-related fault rate are introduced to analyse the correlation between catenary faults and meteorological conditions, and further the effect of meteorological conditions on catenary operation. Moreover, machine learning is used for catenary fault prediction. Because each weak classifier built from a single decision tree classifies only a limited number of training samples correctly, the AdaBoost algorithm is adopted to adjust the weights of the misclassified samples and of the weak classifiers, and to train multiple weak classifiers. Finally, the weak classifiers are combined into a strong classifier, with which the final prediction result is obtained. To validate the prediction method, an example is provided based on historical data from a railway bureau of China. The results show that the AdaBoost algorithm can accurately establish the mapping between meteorological conditions and catenary faults, and can therefore predict a catenary fault when the meteorological conditions are provided.


Introduction
In recent years, the high-speed rails (HSRs) of China have developed rapidly, and both the scale of operation and the extent of the catenary have expanded greatly. The traction power supply system (TPSS) of HSRs requires very high reliability [1][2][3]. The catenary is a key component of the TPSS, yet the TPSS has no standby catenary, and the stability and reliability of the catenary system are directly related to the operating state of HSRs. Therefore, accurate fault prediction for the catenary system and timely warning are crucial to improving the reliability of the entire HSR system.
Zhao et al. [4] established a reliability model of the TPSS based on the Weibull distribution, used the proposed model to predict reliability, and obtained the reliability evolution process. However, this model is applicable only when fault occurrence follows the Poisson distribution, which is not the case in practice. Moreover, Zhao et al. ignored the influence of meteorological conditions. The catenary system is completely exposed to the external meteorological conditions, which have a significant influence on its operation [5]. Recently, Wang et al. [6]
In power systems, the influence of factors such as the external environment on power load forecasting, life prediction, and fault prediction has been highlighted [7,8]. Power load forecasting methods that consider the influence of weather conditions have made significant progress for weather-sensitive loads [9,10]. In addition, using the real-time electricity price, He et al. [11] proposed a method for forecasting the probability density of the power load. In terms of life prediction, scholars [12][13][14] used rough set theory, cross-entropy theory, stochastic process simulation, and other methods to predict the remaining life of equipment, and considered the influence of the external service environment on electrical equipment. Andre et al. [15,16] used Monte Carlo simulation to develop a model for predicting the fault rate, fault type, and fault duration of transmission lines and buses, and forecasted the annual outage times of the power system. Their model was based on historical fault data, but the influence of the external environment on transmission lines was ignored. In [17], indexes including the meteorological sensitivity rate, the difference of fault number, and the outage time were introduced to reflect the difference in transmission line risks for different meteorological disasters.
In [18,19], the temporal characteristics of transmission line faults were analysed, a time-varying fault rate simulation model was established, and the fault time distribution was simulated for risk assessment of a transmission line. A fault warning method based on the support vector machine (SVM) and AdaBoost was proposed in [20]. All the abovementioned studies consider the influence of the external meteorological environment on the power system at various levels, which can provide a reference for catenary fault prediction. With the great improvements in data acquisition, monitoring, and system management, catenary fault prediction can now be supported by comprehensive data. Thus, it is important to consider the overall influence of meteorological conditions in the fault prediction of the catenary system.
The main objective of this work is to develop a catenary fault prediction method which can accurately and timely predict the catenary fault based on the external meteorological conditions, and provide decision support for the operation and maintenance of HSRs. In this paper, based on the AdaBoost algorithm, a method is proposed to predict the catenary fault. The proposed method establishes the mapping relation between meteorological conditions and catenary faults. It can predict catenary fault accurately if the meteorological conditions are provided.
The remainder of this paper is organized as follows. Section 2 introduces the influence of meteorological conditions on catenary faults. Section 3 briefly describes the AdaBoost and single decision tree algorithms. Section 4 presents the pre-processing method for historical statistical data and construction of training samples. A case study and the result analysis are provided in Sect. 5, followed by the conclusions in Sect. 6.

2 Influence of meteorological conditions on catenary faults
The catenary system is completely exposed to a complex environment. According to field surveys by a railway bureau, meteorological conditions are among the influential factors that cause catenary faults. In this work, a trip of the TPSS caused by the catenary system is regarded as a catenary fault, and the influence of meteorological conditions on catenary fault occurrence is analysed quantitatively.

Temporal distribution characteristics of catenary faults
The number of catenary faults and their causes can be collected by field surveys. The results in [21] show that the working state of a catenary system is highly influenced by the external meteorological conditions, such as thunderstorms, gale, snow, and others. The number of catenary faults on a monthly basis under various meteorological conditions was collected by the railway bureau in northwest China from 2012 to 2015, as shown in Fig. 1. According to Fig. 1, the most influential meteorological conditions in northwest China are, respectively, the gale and dense fog from March to April, the thunderstorm and gale from May to October, and the snow and gale from November to February. Meanwhile, when the days of the most influential meteorological condition increase or decrease, the number of catenary faults changes correspondingly. Therefore, there is a strong correlation between the meteorological conditions and the number of catenary faults.

Spatial distribution characteristics of catenary faults
In order to depict the spatial distribution characteristics of catenary faults, the catenary fault frequency (CFF) is introduced and defined by Eq. (1), where $C_{FF}$ indicates the catenary fault frequency in a year per kilometre, $l_i$ the length of line $i$, $o_i$ the number of catenary faults on line $i$ in a year, and $z$ the number of lines. According to the data for central China in the period of 2012-2015, the corresponding CFF for each power supply section of Wuhan Bureau, calculated by Eq. (1), is shown in Fig. 2.
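Equation (1) itself does not survive in this copy. A form consistent with the symbol definitions above, offered as an assumed reconstruction rather than the authors' exact expression, is the total number of faults in a year divided by the total line length:

```latex
C_{FF} = \frac{\sum_{i=1}^{z} o_i}{\sum_{i=1}^{z} l_i}
```

Under this reading, two hypothetical lines with 6 faults on 10 km and 4 faults on 30 km would give a CFF of 10/40 = 0.25 times/km.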
As can be seen in Fig. 2, the catenary fault frequency varies across regions. The CFF of the Wuchang region is the largest, reaching a maximum of 0.85 times/km in 2012, followed by those of the Hanyang and Huangzhou regions, with a CFF of more than 0.5 times/km in three statistical years. There was no catenary fault in the Jingzhou region during 2013-2015 or in the Wuxue region in 2012 and 2013. Meanwhile, the CFF of the Huangpo region is the lowest over the whole statistical period. Therefore, it can be concluded that the CFF is strongly correlated with geographical location.
In order to reveal the temporal and geographical correlation between the meteorological conditions and the number of catenary faults, the fault data from the railway bureaux in northwest and central China are statistically analysed on a monthly basis, and the results are shown in Fig. 3. Figure 3 indicates that the catenary faults in these two regions are mainly concentrated in June, July, and August. However, in December and January, the proportion of catenary faults in northwest China is higher than in central China. In view of the meteorological characteristics of the two regions, the main reasons for these results may be summarized as follows. In both central and northwest China, thunderstorms, gales, rain, and high temperatures are most frequent in June, July, and August, whereas snow and low temperatures mainly occur in December and January. In central China, the summer lasts for a long time, and the weather conditions do not fluctuate drastically during winter. In addition, the catenary system is almost unaffected by icing because snow and low temperatures are infrequent. Therefore, the fault distribution of the catenary system in central China can be approximated by a "single-peak" model. In contrast, the northwest region has a longer winter with snow and ice. Therefore, the fault distribution of the catenary system in the northwest region can be approximated by a "peak-valley" interlaced model.

Analysis of meteorological conditions influence on catenary faults
The influence of meteorological conditions on catenary faults is reflected in factors such as precipitation (rainstorm, heavy rain, moderate rain, thunderstorm, shower, and light rain), wind speed, and temperature [21,22].
1. Influence of precipitation. On the one hand, precipitation affects air humidity and insulation performance, and causes flashover because of the damp. Moreover, water flowing over the equipment surface can easily cause a short circuit. On the other hand, if there is lightning on rainy days, the lightning may lead to overvoltage and insulation damage; moreover, the overvoltage may invade the substation and cause a trip.
2. Influence of wind speed. First, high wind speeds increase the catenary wire tension. Second, a gale causes vibration of the catenary wire and affects the current collection performance of the pantograph. Most importantly, branches, plastics, and other foreign bodies blown by the gale may hang from the catenary, resulting in a short circuit.
3. Influence of temperature. High temperature leads to large tension of the contact wires and a short insulation distance, resulting in a short circuit. Meanwhile, under low temperature, ice accumulates on the wire, which interrupts the current flow from the contact wire to the pantograph.

Statistical analysis on influential factors of catenary faults
The influential factors are analysed using the actual data of the Beijing-Shanghai HSR (with a length of 1318 km) collected in the period of 2012-2015. The statistical results are shown in Table 1. Moreover, the weather-related fault rate (WRFR) is introduced to represent the correlation between various meteorological conditions and the number of catenary faults. It indicates the frequency of catenary faults under a particular meteorological condition and is defined by Eq. (2), where $q_i$ denotes the number of catenary faults on line $i$ under the particular meteorological condition, $l_i$ denotes the length of line $i$, $t_{WB}$ is the statistical time of a certain weather condition, and $z$ is the number of lines. Using the statistical data given in Table 1 and Eq. (2), the WRFR can be calculated as shown in Fig. 4.
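Equation (2) is missing from this copy. The sketch below assumes the WRFR is the number of weather-related faults divided by the product of the total line length and the duration of that weather condition, which matches the symbol definitions but may differ from the authors' exact expression. The function name, the choice of days as the unit of $t_{WB}$, and the sample numbers are hypothetical.

```python
from typing import Sequence


def weather_related_fault_rate(
    faults: Sequence[int],        # q_i: faults on line i under one weather condition
    lengths_km: Sequence[float],  # l_i: length of line i in km
    weather_days: float,          # t_WB: duration of that weather condition (assumed in days)
) -> float:
    """Assumed form of Eq. (2): sum(q_i) / (t_WB * sum(l_i))."""
    if len(faults) != len(lengths_km):
        raise ValueError("faults and lengths_km must describe the same lines")
    total_length = sum(lengths_km)
    if total_length <= 0 or weather_days <= 0:
        raise ValueError("line length and weather duration must be positive")
    return sum(faults) / (weather_days * total_length)


# Hypothetical numbers, not taken from Table 1: 12 faults on a single
# 1318 km line during 40 days of heavy rain.
rate = weather_related_fault_rate([12], [1318.0], 40.0)
print(f"WRFR = {rate:.6f} faults per km per day")
```

Normalizing by both length and exposure time is what allows a rare condition such as heavy rain to be compared fairly with normal weather.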
As can be seen in Fig. 4, the WRFR under gale, dense fog, and rain is higher than that under normal weather. The highest fault rate occurs under the heavy rain condition. In general, the worse the weather is, the greater the possibility of a fault. The influence of multiple uncertain factors makes it difficult to build an accurate mathematical model for catenary faults. In fact, there is a coupling relationship between the various meteorological conditions. Catenary fault prediction determines whether the system can work healthily in the next period of operation given the current system state. It is often based on the massive multi-source data provided by the monitoring system. Fault prediction can be viewed as a classification problem with supervised learning. In most cases, the accuracy of a learner is significantly influenced by the training data and their distribution, and it is hard to build an accurate classifier directly. However, it is easier to generate a relatively accurate weak classifier. The AdaBoost algorithm is one of the most widely used machine learning methods for training different weak classifiers on the same training set. After training, the weak classifiers can be combined into a strong classifier. Namely, by combining the attributes of the weak classifiers, the resultant classifier can possess a stronger generalization ability.

AdaBoost algorithm
3.1 Basic theory of AdaBoost algorithm
The AdaBoost algorithm is an important characteristic classification algorithm for machine learning, and it is widely applied to the power system fault warning [20], wind speed prediction [23], and other fields [24,25]. Zhang et al. [26] compared the prediction accuracy of SVM, BP neural network, and AdaBoost, and indicated the superiority of AdaBoost algorithm.
The basic idea of the AdaBoost algorithm is to integrate a large number of weak classifiers that have a general classification ability to form a classifier with a strong classification ability. The specific steps of the AdaBoost algorithm are as follows.
In the classification error rate of Eq. (3), the indicator $I(C_n(a_j) \neq y_p)$ is equal to 1 when $C_n(a_j) \neq y_p$; otherwise, it is equal to 0.
5. Calculate the weight of $C_n(X)$ by
$$\alpha_n = \frac{1}{2}\ln\frac{1 - e_n}{e_n};$$
6. Update the sample weight distribution:
$$V_{n+1}(p) = \frac{V_n(p)}{Z_n} \times \begin{cases} e^{-\alpha_n}, & C_n(a_j) = y_p \\ e^{\alpha_n}, & C_n(a_j) \neq y_p \end{cases}$$
where $Z_n = \sum_{p=1}^{m} V_n(p)\, e^{-\alpha_n y_p C_n(a_j)}$ denotes the normalization factor, such that $\sum_{p=1}^{m} V_{n+1}(p) = 1$.
7. Repeat Steps 3-6 $N$ times to obtain $N$ different weak classifiers.
8. Combine all the trained weak classifiers into one strong classifier, which is defined by
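The expressions for the error rate that precedes Step 5 and for the strong classifier in Step 8 are missing from this copy. In the standard AdaBoost formulation, which the surviving definitions appear to follow, they take the form below. The symbol $F(X)$ for the strong classifier is an assumption, as is the exact notation the authors used.

```latex
e_n = \sum_{p=1}^{m} V_n(p)\, I\big(C_n(a_j) \neq y_p\big),
\qquad
F(X) = \operatorname{sign}\!\left( \sum_{n=1}^{N} \alpha_n\, C_n(X) \right)
```

With labels in $\{-1, 1\}$, a weak classifier with $e_n < 0.5$ receives a positive weight $\alpha_n$, so more accurate weak classifiers contribute more to the weighted vote.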

Construction of weak classifiers
In this work, the single decision tree [27,28] is chosen to construct weak classifiers. The decision tree makes a decision by applying the threshold division method to a single feature vector. This method has the following advantages: short computation time, fast calculation, and certain accuracy. In addition, this method can be well adapted to
where $s$ is the number of the characteristics.
3. Determine the threshold $H_k$ according to the data size of vector $a_j$:
where $k = 0, 1, 2, \ldots, K$, with $k$ the step number; $H_{step}$ is the step length; and $\max(a_j)$ and $\min(a_j)$ are the maximum and minimum values in the vector.
where $r$ is equal to 0 or 1, and it expresses the classification method.
6. Repeat Steps 3-5 $K$ times, and record the error rates of the classifiers with the corresponding thresholds and classification models.
7. Repeat Steps 2-6 $s$ times, and select the eigenvector $a_j$ whose threshold $H_K$ and classification model correspond to the minimum error rate. Finally, calculate the classification function of a weak classifier by

4 Fault prediction on catenary system
4.1 Statistics and processing of input data for AdaBoost
As the field data contain much complex information, it is difficult to predict catenary faults from them directly. Therefore, the data should first be screened for validity. The required data can be divided into two types: historical running-state data and meteorological data. They also include the catenary operating states, catenary fault types, protection information, catenary outage time, operation conditions, and weather information during the predicted period. The data types and sources are presented in Table 2. The meteorological data should be standardized and transformed into a mathematical form by attribute construction and discretization.

Attribute construction
The attribute sets of meteorological conditions include the precipitation grades, mean temperature grades, and wind scales during daytime and night.

Discretization of meteorological data
1. According to the rainfall intensity, the precipitation is divided into seven grades, as shown in Table 3.
2. Use the equal-width division method to discretize the continuous temperature variables according to Eq. (10), where $P_f$ refers to the range of temperature level $f$, $f = 1, 2, \ldots, F$, $F$ is the number of divisions, and $T_{max}$ and $T_{min}$ denote the maximum and minimum temperatures in the statistical time, respectively.
3. Classify the wind power into grades 0-12 according to the standard of the China Meteorological Administration.

Table 2 Data types and sources (recoverable entries only)
Meteorological data during the predicted period: meteorological information during the predicted period, including the precipitation, wind speed, and temperature. Sources: meteorological information system, weather forecasting.
Other sources listed: faults record from the railway bureau, meteorological monitoring system, meteorological information system.
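The equal-width division in step 2 can be sketched as follows. Eq. (10) is missing from this copy, so the sketch assumes that grade $f$ covers the interval from $T_{min} + (f-1)\Delta T$ to $T_{min} + f\Delta T$ with $\Delta T = (T_{max} - T_{min})/F$. The limits of 17 °C and 33 °C are those reported for the case study, while `n_grades = 4` and the function name are hypothetical, since Table 4 is not available here.

```python
def temperature_grade(temp_c: float, t_min: float, t_max: float, n_grades: int) -> int:
    """Return the grade f in 1..n_grades using equal-width bins on [t_min, t_max]."""
    if n_grades < 1 or t_max <= t_min:
        raise ValueError("need n_grades >= 1 and t_max > t_min")
    if not t_min <= temp_c <= t_max:
        raise ValueError("temperature lies outside the statistical range")
    width = (t_max - t_min) / n_grades
    # int() floors the non-negative offset; the upper edge t_max is folded
    # into the last grade instead of opening an extra one.
    grade = int((temp_c - t_min) / width) + 1
    return min(grade, n_grades)


# With the assumed 4 grades, each bin is 4 degrees wide.
grades = [temperature_grade(t, 17.0, 33.0, 4) for t in (17.0, 20.9, 21.0, 29.0, 33.0)]
print(grades)  # [1, 1, 2, 4, 4]
```

Treating each bin as closed on the left and open on the right is a design choice made here so that every temperature maps to exactly one grade.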

Construction of sample set
A catenary fault may be caused by the impact effect of weather conditions; for example, lightning or strong wind leads to a short-circuit trip of the TPSS. It may also be a product of the cumulative effects of external meteorological conditions, such as a short circuit due to low sag of the contact line after a long period of high temperatures, or flashover of an insulation device caused by continuous rainfall. The external meteorological conditions are considered as a characteristic vector $X$ that affects catenary fault occurrence, and $Y$ denotes whether there is a fault on the catenary. The sample set is constructed according to Sect. 4.1. Suppose that there are $m$ data samples; then, the constructed sample set can be expressed as a matrix $G$ with rows $p = 1, 2, \ldots, m$ and feature columns $j = 1, 2, \ldots, s$, where $s$ is the number of characteristics taken into account. In $G$, $x_{p-j}$ denotes the value of influential factor $j$ (such as precipitation, temperature, or wind scale) for sample $p$, and $y_p \in \{-1, 1\}$, where $-1$ means no catenary fault and $1$ means a catenary fault.
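The matrix expression itself is missing from this copy. A layout consistent with the definitions of $x_{p-j}$ and $y_p$, given here as an assumed reconstruction, places one sample per row with its label in the last column:

```latex
G =
\begin{bmatrix}
x_{1-1} & x_{1-2} & \cdots & x_{1-s} & y_1 \\
x_{2-1} & x_{2-2} & \cdots & x_{2-s} & y_2 \\
\vdots  & \vdots  &        & \vdots  & \vdots \\
x_{m-1} & x_{m-2} & \cdots & x_{m-s} & y_m
\end{bmatrix}
```

Column $j$ of the feature part is then the vector $a_j = (x_{1-j}, x_{2-j}, \ldots, x_{m-j})^T$ on which a single decision tree places its threshold.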

Catenary fault prediction based on AdaBoost
The catenary fault prediction based on the AdaBoost algorithm includes the following steps.
1. Input the training data, including the catenary fault data and meteorological data.
2. Set the initial weight $V_1$ and the iteration number $N$, and initialize the AdaBoost algorithm.
3. Update the weights through iterative computation. Train the optimal decision tree with the different weights $V_n$. Construct multiple weak classifiers, and combine them with their weights to generate a strong classifier.
4. Use the future meteorological data provided by the weather forecast as input data for fault prediction, and obtain the final prediction result using the trained strong classifier.
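The four steps above can be sketched in plain Python as a minimal AdaBoost with single-feature threshold classifiers. This is an illustration under stated assumptions, not the authors' implementation: the threshold grid `low + k * step` with `n_steps = 10`, the meaning given to the polarity flag, the function names, and the toy samples are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index j, threshold H, polarity r)


def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    """Single-feature threshold rule returning -1 (no fault) or 1 (fault)."""
    j, threshold, polarity = stump
    label = 1 if x[j] > threshold else -1
    return label if polarity == 1 else -label  # polarity 0 flips the rule (assumed meaning)


def best_stump(X, y, weights, n_steps: int = 10) -> Tuple[Stump, float]:
    """Scan every feature, threshold step, and polarity for the lowest weighted error."""
    best, best_err = (0, 0.0, 1), float("inf")
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        low, high = min(column), max(column)
        step = (high - low) / n_steps
        for k in range(n_steps + 1):
            threshold = low + k * step  # assumed form of the threshold grid
            for polarity in (0, 1):
                stump = (j, threshold, polarity)
                err = sum(w for row, label, w in zip(X, y, weights)
                          if stump_predict(stump, row) != label)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def train_adaboost(X, y, n_rounds: int) -> List[Tuple[float, Stump]]:
    m = len(X)
    weights = [1.0 / m] * m  # initial weights V_1 = (1, ..., 1) / m
    ensemble = []
    for _ in range(n_rounds):
        stump, err = best_stump(X, y, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # weight of this weak classifier
        ensemble.append((alpha, stump))
        # Misclassified samples are scaled up, correct ones down, then normalized.
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for row, label, w in zip(X, y, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def predict(ensemble, x) -> int:
    """Strong classifier: sign of the weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical samples: (precipitation grade, temperature grade, wind scale).
X_train = [(0, 2, 1), (1, 3, 2), (5, 2, 3), (6, 1, 2),
           (2, 2, 8), (0, 4, 9), (1, 1, 3), (4, 3, 7)]
y_train = [-1, -1, 1, 1, 1, 1, -1, 1]  # 1 = fault, -1 = no fault

model = train_adaboost(X_train, y_train, n_rounds=40)
train_accuracy = sum(predict(model, x) == label
                     for x, label in zip(X_train, y_train)) / len(y_train)
forecast = (5, 3, 8)  # hypothetical forecast for the next day
fault_expected = predict(model, forecast)
print(train_accuracy, fault_expected)
```

No single threshold separates the toy samples, because faults occur under either heavy rain or strong wind, so the example shows why several weighted weak classifiers are needed.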
The specific calculation flow chart is shown in Fig. 5.
Fig. 5 Calculation flow chart (recoverable steps: construct the optimal single-layer decision tree to generate weak classifier $C_n(X)$; calculate the classification error rate of $C_n(X)$ by Eq. (3); calculate the weight of classifier $C_n(X)$ by Eq. (4))
The historical data were pre-processed by the steps introduced in the previous section. The field data analysis revealed that there is no fog-related fault in the selected samples. At the same time, the Meteorological Information System showed that there was no foggy day in the seasons of study. Therefore, fog was not considered in the training and test data. In the selected samples, the lowest temperature was 17 °C, and the highest temperature was 33 °C. The detailed temperature classification calculated by Eq. (10) is given in Table 4.
The data samples include the recording time, precipitation grade, temperature grade, wind scale, and catenary state. Through data pre-processing, the training sample set and test sample set are presented in Tables 5 and 6.

Construction of strong classifier
The training data was divided into two categories. One category only shows the influence of precipitation, and the other one shows the joint influence of precipitation, wind scale, and temperature. For simplicity, we only take the influence of precipitation grade as an example to illustrate the processes of constructing the weak classifiers based on the single decision tree and training the weak classifier based on the AdaBoost.
In the representation matrix of the training data about precipitation grade, $R_{ptd}$, $R_{ptn}$, $R_{py}$, and $R_{pb}$ indicate, with respect to sample $p$, the precipitation grade in the current daytime, the precipitation grade in the current night, the average precipitation grade on the previous day, and the average precipitation grade for 2 days before the current day, respectively. Then, the weights were initialized as $V_1 = (1, 1, \ldots, 1)/43$. Following the weak classifier calculation process, the optimal decision feature vector of the first weak classifier was obtained as $a_2 = (x_{1-2}, x_{2-2}, \ldots, x_{p-2}, \ldots, x_{43-2})^T$, together with its classification function, in which $x_{p-2}$ represents the value of eigenvector $a_2$ in row $p$, and 5.4 is the threshold value calculated by Eq. (7). Finally, the error rate of each classifier was calculated and the weights were adjusted to obtain a strong classifier by the AdaBoost algorithm. Using the two above-mentioned categories, two different training sets were obtained. Then, the accuracy on each training set was calculated, as shown in Fig. 6.
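The four precipitation features can be assembled from daily records as sketched below. The interpretation is an assumption: $R_{py}$ is taken as the mean of the daytime and night grades of the previous day, and $R_{pb}$ as the same mean for the day two days earlier, which is only one possible reading of the description. The function name and the sample records are hypothetical.

```python
from typing import List, Sequence, Tuple


def precipitation_features(daily: Sequence[Tuple[int, int]], day: int) -> List[float]:
    """Return [R_ptd, R_ptn, R_py, R_pb] for index `day`.

    `daily` holds one (daytime grade, night grade) pair per calendar day.
    """
    if day < 2 or day >= len(daily):
        raise IndexError("two earlier days of records are required")
    r_ptd, r_ptn = daily[day]
    r_py = sum(daily[day - 1]) / 2  # average grade on the previous day
    r_pb = sum(daily[day - 2]) / 2  # average grade two days earlier (assumed reading)
    return [float(r_ptd), float(r_ptn), r_py, r_pb]


# Hypothetical grades for three consecutive days.
records = [(0, 1), (3, 5), (6, 4)]
features = precipitation_features(records, 2)
print(features)  # [6.0, 4.0, 4.0, 0.5]
```

Including the two earlier days is how the cumulative effect of continuous rainfall enters the sample, in addition to the impact effect of the current day.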
In Fig. 6, the accuracy on both training sets increases with the number of weak classifiers. In Fig. 6a, the maximum accuracy is 0.9535, and the curve becomes stable when the number of classifiers reaches 64. In Fig. 6b, the maximum accuracy of 1 is achieved when the number of classifiers reaches 53. Thus, under the joint influence of precipitation, wind, and temperature, the classification accuracy is higher and fewer weak classifiers are required than under the influence of precipitation alone.
By comparison, it is observed that the results of the first training set have more oscillations and lower accuracy. Thus, we select the precipitation, wind, and temperature as influential factors to construct weak classifiers.

Results of catenary faults prediction
The proposed fault prediction method was evaluated through a comparison with the decision tree and the BP neural network algorithm on the test data, and the obtained results are shown in Table 7. The bold numbers in Table 7 indicate inaccurate prediction results.
According to the results presented in Table 7, the prediction accuracy of AdaBoost was 88.89%, and all the catenary states were correctly predicted except for two errors. The first one concerned the data of 02 June 2015, and the second one the data of 26 June 2015. The AdaBoost algorithm predicted a high fault probability on the catenary under the meteorological conditions at the time, which was a false alarm. With more sample data, the prediction accuracy of the AdaBoost algorithm can gradually stabilize at about 90% [24,26].
The prediction accuracy was 77.8% for the decision tree and 83.3% for the BP algorithm, both lower than that of the AdaBoost algorithm. In this paper, the single decision tree is the weak classification algorithm from which the strong classifier is constructed; therefore, its prediction accuracy on its own is expected to be lower than that of the AdaBoost algorithm. For the BP neural network, although the training accuracy can reach 100%, the generalization is worse than that of the AdaBoost algorithm. Moreover, because of randomness in the learning phase, the BP algorithm may converge to local minima. In conclusion, the strong classifier constructed by the AdaBoost algorithm had a stronger generalization ability than the single decision tree and the BP neural network. However, the machine learning method needs to be improved in the following aspects. First, the AdaBoost algorithm uses the single decision tree to construct weak classifiers in this work. Since only the decision tree is used in the training process, the accuracy of the individual weak classifiers is not high enough, which limits the prediction accuracy of the strong classifier. This problem may be addressed by using better classification methods, such as the support vector machine (SVM), as weak classifiers. Furthermore, the AdaBoost algorithm constructs a strong classifier by updating the weights of different weak classifiers, but it pays more attention to the misclassified samples in the training process. Thus, the weights of samples that are easily misclassified gradually increase with the number of iterations. This leads to an imbalance of samples and decreases the classification accuracy. This problem may be addressed by optimizing the weight updating process of the classifiers.
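The accuracies reported above can be checked with simple arithmetic. The training set size of 43 is stated in the text; the test set size of 18 and the counts of correct predictions are inferred here from the reported percentages and the two AdaBoost errors, and are not stated explicitly in the paper.

```python
def accuracy_percent(n_correct: int, n_total: int) -> float:
    """Share of correctly classified samples, in percent."""
    if n_total <= 0 or not 0 <= n_correct <= n_total:
        raise ValueError("need 0 <= n_correct <= n_total and n_total > 0")
    return 100.0 * n_correct / n_total


N_TEST = 18  # inferred: two errors at 88.89% accuracy imply 16 of 18 correct

# Correct-prediction counts inferred from the reported percentages.
inferred_correct = {"AdaBoost": 16, "BP neural network": 15, "decision tree": 14}
test_accuracy = {name: round(accuracy_percent(n, N_TEST), 2)
                 for name, n in inferred_correct.items()}
print(test_accuracy)

# Training accuracy with precipitation only: 41 of the 43 samples (41 is inferred).
training_accuracy = round(accuracy_percent(41, 43) / 100.0, 4)
print(training_accuracy)  # 0.9535
```

On a test set this small, one additional correct prediction changes the accuracy by about 5.6 percentage points, so the differences between the three methods correspond to only one or two samples.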

Conclusions
The external meteorological conditions, including the precipitation, wind speed, and temperature, have a significant impact on catenary fault. In this paper, the relationship between the catenary fault and meteorological conditions is analysed. The cumulative effect of meteorological conditions on the catenary system is taken into account in catenary fault prediction, and the AdaBoost algorithm is utilized to construct a strong classifier to predict the catenary fault by using the historical meteorological data. The obtained prediction results demonstrate that the AdaBoost algorithm could provide prediction for the catenary faults with an accuracy of 88.89% by considering the external meteorological conditions.