Performance evaluation of machine learning for fault selection in power transmission lines

Learning methods have been increasingly used in power engineering to perform various tasks. In this paper, a fault selection procedure for double-circuit transmission lines employing different learning methods is proposed. In the proposed procedure, the discrete Fourier transform (DFT) is used to pre-process raw data from the transmission line before it is fed into the learning algorithm, which detects and classifies any fault based on a training period. The performance of different machine learning algorithms is then numerically compared through simulations. The comparison indicates that an artificial neural network (ANN) achieves a remarkable accuracy of 98.47%. As a drawback, the ANN method cannot provide explainable results and is also not robust against noisy measurements. Subsequently, it is demonstrated that explainable results can be obtained with high accuracy by using rule-based learners such as the recently developed quantitative association rule mining algorithm (QARMA). QARMA outperforms other explainable schemes while attaining an accuracy of 98%. Moreover, QARMA retains a very high accuracy of 97% for highly noisy data. The proposed method was also validated using data from an actual transmission line fault. In summary, the proposed two-step procedure using the DFT combined with either deep learning or rule-based algorithms can accurately perform fault selection tasks, with QARMA offering remarkable advantages due to its explainability and robustness against noise. These aspects are extremely important if machine learning and other data-driven methods are to be employed in critical engineering applications.


Introduction
Transmission lines are a fundamental part of today's power systems, as they ensure power supply to end consumers by connecting them to far-off large generation plants. Hence, it is crucial to have an adequate protective system that is capable of isolating faults quickly and reliably to prevent any possible damage to other electrical components [29]. The most commonly used device for protection of transmission lines is the distance relay, whose operation relies on the impedance between the fault location and the relay installation point. Depending on network conditions such as looped segments, double-circuit lines that share towers [4], short lines, and in-feed from the other end of the line, the measured fault impedance can suffer transitory variations that cause incorrect operation of the distance relay.
Many transmission line protection schemes are in use, but they do not provide intrinsic phase selection (e.g., negative-sequence and zero-sequence line differential, neutral over-current protections). However, information on the faulty phase is required to enable single-pole tripping. As any action performed by the protective system during real-time operations will directly affect the grid dynamics, correct tripping is critical to maintain system stability and reliability [30]. Distance relaying depends on a fault selector to calculate the impedance in the loop that leads to line tripping when the protective zone requirements are met. Therefore, a reliable distance relaying protection system for transmission networks must have a high-accuracy fault selector for correct operation in any protective zone and fast trip decision-making. In particular, faults in double-circuit lines and high-impedance faults pose significant technical challenges in terms of fault selection and proper relay operation [14,25,32]. In addition, mutual impedance from double-circuit transmission lines may affect relay performance. When a fault to ground occurs, the zero-sequence currents from one line induce a voltage in the coupled adjacent line, thereby causing a current to flow in the opposite direction, which may add to or subtract from the existing zero-sequence current [20].
Both researchers and relay manufacturers have made great efforts to improve fault classification algorithms to perform fault selection and thereby increase system robustness. The main difficulty in selecting the correct fault is related to the effect of high resistance on the fault parameters at any given point. This leads to a situation where the fault currents are similar to each other in magnitude, and thus their classification becomes a difficult computational task. Fault selection methods using one-end recordings can be classified according to the algorithm [27]. Following this approach, they can be divided into two broad classes, classical and emerging methods, primarily differing with respect to the balance between speed and accuracy. Some algorithms can perform fault selection faster than one cycle of the system frequency, but at lower accuracy. Others perform fault selection with high accuracy but lack speed, making them suitable only for post-fault analysis rather than real-time protection and trip decision-making based on the faulted loop (distance relays) [28]. These algorithms assume that all measurements are available; when some are missing, approaches such as the one in [21] can handle the missing values.
One remarkable example of a classical method is the symmetrical-component angle comparison, which checks whether the magnitudes of the sequence currents are sufficient to reliably perform the task by comparing them with a threshold. Depending on which currents exceed the threshold, the fault is selected by comparing the angles, as illustrated in Fig. 1: either the negative- and positive-sequence currents (I_2F and I_1F, respectively; see Fig. 1a) or the negative- and zero-sequence currents (I_2F and I_0F, respectively; see Fig. 1b) are compared. Another classical method is the so-called delta method, which uses transient components extracted from the faulted current or voltage signals relative to their pre-fault components. The components employed in this method are, for example, a decaying memory function (as illustrated in Fig. 2), superimposed signals, or Fourier transforms.
One of the most common classical methods is the impedance-based algorithm. Its main advantage is a speed below one cycle of the system frequency, which makes it very popular in distance relays; it is implemented in several commercial products for single-pole tripping actions. In this method, current and voltage measurements under the fault condition are used to determine the respective zone of operation for each phase (in the case of a single phase-to-earth fault) or for multiple phases in the loop R-X diagram. These measurements are extensively used in numerical relays. Single phase-to-earth impedance loop characteristics for relays, such as plain impedance, quadrilateral, self-polarised mho [15], offset mho/lenticular, fully cross-polarized mho, or partially cross-polarized mho, can be defined depending on the manufacturer and system conditions.
The other class of fault classification methods in electric power applications is based on emerging computational approaches such as machine learning (ML) or deep learning (DL) [5,7,17,18]. For example, in [6], the authors introduced a non-intrusive fault identification method for power transmission lines using PS-HST to extract high-frequency fault components. A feed-forward artificial neural network (ANN) was used to select the fault classes. The authors calculated the HST coefficients and obtained a power spectrum based on Parseval's theorem. In [1], a semi-supervised ML approach based on co-training of two classifiers is presented, with fault selection performed in both transmission and distribution systems. Feature extraction was performed using a wavelet transform of the current and voltage signals, and a nature-inspired meta-heuristic, harmony search, was used to determine the optimal parameters of the wavelets.
Another emerging method is pattern recognition, which has shown promising results compared with conventional methods. For instance, in [8], a summation-Gaussian extreme learning machine (SG-ELM) was used for transmission line diagnosis, including fault classification and fault location, by means of an iterative back-propagation learning algorithm. In [27], an intrinsic time decomposition (ITD) algorithm was employed to analyze the frequency and time content of non-stationary signals, and subsequently a probabilistic neural network (PNN) was developed to implement fault classification. The advantage of this approach lies in its training speed, which enables the entire process to be performed in real time. A power-spectrum-based hyperbolic S-transform (PS-HST) and a back-propagation artificial neural network (ANN) were used in [6] to extract high-frequency components of the electric signal generated by a fault, and fault classes in power transmission networks were then identified from one-end recordings. Three ML models (naive Bayes classifier, support vector machine, and extreme learning machine) were compared in [26] for fault classification based on the Hilbert-Huang transform.
Hybrid techniques can also play an important role. Control strategies that involve two or more of the methods described above can be used to increase the reliability and accuracy of fault selection. New numerical relays (with higher processing capabilities) are often employed to effectively select the proper fault and avoid undesirable tripping. Strategies based on ML use classic methods for pre-processing to improve their models [6].
Here, we will give particular attention to the quantitative association rule mining algorithm (QARMA), which has not yet been employed for fault selection in transmission lines. QARMA has already been tested in several application scenarios and use-cases in the health domain and, in particular, in predictive maintenance applications (see [10,11] for results relating to predicting tool Remaining Useful Life in the automotive manufacturing industry from the recently concluded PROPHESY project). Within the context of the EU-funded QU4LITY project, QARMA results have been tested against real-world data-sets ranging from tool wear-and-tear to body measurements used to compute morphotype fit scores in the fashion industry.
The main reason for choosing QARMA as a tool to study its applicability in the given domain is the success that QARMA-based classifiers and regressors have obtained in such varied domains, together with the natural appeal of its output in the form of easy-to-understand rules, which we consider more directly explainable than higher-order approaches to explainability/interpretability such as Shapley values for explaining otherwise black-box models. Still, we compare our two main approaches, deep neural networks and QARMA, to several other well-known classification algorithms; see Sect. 4.2. The main criteria for choosing these other algorithms were their prior use in this domain as established in the literature, their overall popularity in the ML field (as indicated by the number of results returned by Google Search for the respective terms), and their explainability/interpretability. This paper extends the above contributions by proposing a two-stage method. The first stage is the delta method discrete Fourier transform (DM-DFT), used to pre-process the raw data from the transmission line. The second stage applies a machine learning algorithm for fault selection. We studied different techniques in terms of accuracy and explainability. Our main contributions, also presented in Sect. 4, are as follows:
- We propose a general hybrid methodology based on the DM-DFT algorithm that works independently of network topology.
- We test and compare the performance of well-known ML techniques such as decision trees, neural networks, and support vector machines (SVM).
- We develop a fully explainable method that employs the quantitative association rule mining algorithm (QARMA) [12,13] and compare its performance with state-of-the-art (mostly non-explainable) ML algorithms.
- We demonstrate with several numerical examples, including real-world data, that the fault classification task can be solved by QARMA with very high accuracy even when only one-end currents are available or when the measurements are subject to high levels of noise.
The rest of this paper is organized as follows. Section 2 introduces the proposed methodology. Section 3 details the machine learning algorithms employed here, including a detailed description of QARMA. Section 4 presents the numerical results, and Sect. 5 concludes the paper.

Step 1: delta method discrete Fourier transform
To extract fault features (currents and voltages), a combined DM-DFT is employed to identify the fault instance. The DFT maps the input signal (i.e., current or voltage) into the frequency domain: for a pair x_n (input signal) and X_k (its DFT), 0 <= k <= T - 1, where T is the number of samples per cycle and n denotes the phase (a, b, or c). The DM-DFT uses a moving window of length T instead of the complete signal, thus allowing faster fault recognition. The fault point must lie within the fault time, considered here to be approximately 3.5 cycles. The sampling rate is 4 kHz, i.e., 80 samples per cycle (as usually used in commercial relays). To obtain a highly accurate signal point, 1.5 cycles, or about 120 samples after fault occurrence, are needed. When a transmission line is in a faulty state, the magnitudes of currents and voltages (the features used in the fault classification task) can change suddenly depending on the type of fault and its characteristics. Figure 3 illustrates a cycle of the periodic sinusoidal signal over which the DFT calculations are performed. Once the DFT is calculated, variations in the frequency domain can be detected as follows:

ΔI_n(j) = |I_n(j)| - threshold,    F_i = j for the first sample j at which ΔI_n(j) > 0,

where "threshold" refers to the current-signal threshold; ΔI_n are the changes in the current signal; I_n(j) is the current Fourier value of the jth sample, with 0 <= j <= S - T; S is the total number of samples in the signal; and F_i indicates the fault instance. The threshold for the ongoing signal is given by 1.5 times the Fourier pre-fault signal value. A different threshold value is selected, based on experimental results, for different fault conditions. The DM-DFT is applied to the three-phase currents without considering whether the ΔI_n have similar values; it only considers values above the threshold. If several ΔI_n are positive, the fault instant is taken from the phase n with the highest value.
If ΔI_n ≤ 0 for all three phases, it is assumed that there is no fault, and the features are extracted from a randomly chosen sample of each signal. The delta method can be seen as the detection of a high fluctuation in any quantity (such as temperature, current, or even a monetary value); here it is used only to identify the fault point, after which the method continues with feature extraction. At this stage, it is possible to miss fault points (through miscalculation of a given phasor due to a wrong time-series point) because of the threshold values; however, for the data-set obtained in this process, all faults were detected successfully. Note that the DFT phasor estimation is less sensitive to noise than the individual measurements and is robust to the presence of harmonics [23]. The threshold selection also performs well when parallel lines are out of service or when it is applied to transmission lines with different parameters or ratings [16]; in other words, the threshold is independent of the topology and geometry of the structure. Traditional protection relays already use the DFT for protection calculations [23], so using the same pre-processing technique minimizes hardware requirements while providing sufficient information to the neural network. However, the DFT depends on the sampling frequency, which may be problematic for real-time applications because of computational time limitations: the DFT calculation might take longer than 20 ms, which is the time within which the trip decision must be made in a real-time transmission line scenario.
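A minimal sketch of the DM-DFT detection described above (a one-cycle fundamental phasor plus the 1.5x pre-fault threshold) is given below, applied to a synthetic 50 Hz signal whose amplitude doubles at the fault instant; the signal and window length are illustrative, not the simulated data-set:

```python
import cmath, math

def fundamental_phasor(window):
    """Fundamental-frequency DFT phasor of one cycle (T samples)."""
    T = len(window)
    acc = sum(x * cmath.exp(-2j * math.pi * k / T) for k, x in enumerate(window))
    return 2 * acc / T   # peak-value phasor of the fundamental component

def detect_fault(samples, T, margin=1.5):
    """Slide a one-cycle window over the signal and flag the first sample
    whose fundamental magnitude exceeds `margin` times the pre-fault value."""
    pre = abs(fundamental_phasor(samples[:T]))      # pre-fault reference
    for j in range(len(samples) - T):
        delta = abs(fundamental_phasor(samples[j:j + T])) - margin * pre
        if delta > 0:            # corresponds to ΔI_n(j) > 0
            return j             # fault instance F_i
    return None                  # no fault detected

# Synthetic 50 Hz current, 80 samples per cycle; amplitude doubles at j = 200.
T = 80
samples = [(2.0 if j >= 200 else 1.0) * math.sin(2 * math.pi * j / T)
           for j in range(400)]
print(detect_fault(samples, T))  # some j between 120 and 200
```

On an unfaulted pure sinusoid the window magnitude never exceeds the threshold and the function returns None, matching the "no fault" branch described above.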
The process is applied to data-sets (such as the one obtained from the simulations explained in Sect. 4.1) that contain currents and voltages with or without a faulted state. The selection is done automatically after the DFT procedure is completed. For data-sets with faults, the voltage and current features are extracted at the fault instant; for those without faults, a random point within the signal is selected. The output data-set, outlined in Table 1, is the input for the ML methods described next. Table 1 contains the absolute values of the currents and voltages at the local and remote ends of the transmission line. The neutral currents were estimated as the phasor sum of the abc currents.

Table 1 Structure of the output data-set: the target variable (fault type) followed by the extracted feature values (Value, Value, …, Value)

Step 2: machine learning methods
A number of different established algorithms are then considered for the supervised learning: decision trees, artificial neural networks (both shallow and deep), support vector machines, rule-extraction systems (Ripper-k and QARMA), naive Bayes, logistic regression, and finally ensemble methods (AdaBoost). As already mentioned, the main criteria for selecting the above methods were their prior use in the domain, as established in the current literature, their popularity in the machine learning field, and their explainability/interpretability properties. These algorithms differ in accuracy and in the time needed to train each model. Moreover, the "explainability" of their models also varies. For example, explainable methods, such as decision trees, usually have poorer accuracy than ANNs. On the other hand, when the training data-set is large enough, an ANN often gives very high accuracy, but training is time consuming and the resulting model offers little in terms of explainability to humans. The algorithm should therefore be selected depending on the requirements regarding accuracy, time, and explainability of the outcome.
In this paper, the focus is on representatives of the class of DL methods and of the "explainable artificial intelligence (AI)" family; for an exposition of the latter class, see [22]. We built and tested models with multiple hidden layers of feed-forward nodes trained by mini-batch-based optimization methods (including classical stochastic gradient descent with momentum as well as the Adam [24] optimizer); we also built rule sets extracted using the QARMA algorithm for quantitative association rule mining [13]. These choices were made because of the proven capability of DL methods to obtain very high accuracy given enough data, and because QARMA has already been successfully tested for Predictive Maintenance (PdM) related tasks in industrial settings. The output data-set from the DM-DFT is used in all cases, and it contains all the produced features. The target variable, i.e., the fault type (see Table 1), has string values and is therefore one-hot encoded into eleven different classes (ten fault types plus one no_fault mode). Figure 4 presents the flowchart of the proposed two-step method. The DM-DFT is employed in all cases as the pre-processing stage, while the second stage is one of the different ML algorithms presented here.
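For illustration, the one-hot encoding of the target variable can be sketched as follows (the class labels follow the fault types listed in Sect. 4):

```python
# The ten fault types plus the healthy state, as listed in the test system
# description (Sect. 4): three-phase, bi-phase, and mono-phase faults.
CLASSES = ["ABCG", "ABG", "BCG", "CAG", "AB", "BC", "CA",
           "AG", "BG", "CG", "no_fault"]

def one_hot(label):
    """Map a string fault label to an 11-element 0/1 vector."""
    vec = [0] * len(CLASSES)
    vec[CLASSES.index(label)] = 1
    return vec

print(one_hot("AG"))        # 1 in the position of the mono-phase A-G fault
print(one_hot("no_fault"))  # 1 in the last position
```

Each training target thus becomes an 11-dimensional binary vector with exactly one non-zero entry, matching the 11 output units of the network described below.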

Selected learning algorithms
This section starts with the algorithm that is expected to have the highest accuracy: the ANN as proposed in [9]. Then, the QARMA algorithm [12] is presented in brief, as it is expected to provide reasonably high accuracy but with the added benefit of explainable outcomes.

Artificial neural networks
Artificial neural networks (ANNs), and in particular feed-forward ANNs, also known as multilayer perceptrons, are a powerful ML tool and have been used extensively for fault diagnosis problems such as those mentioned above. The ANN used here is a feed-forward network consisting of three stages. The first stage is the input layer, containing the voltages and currents from both ends at the time of fault occurrence given by the DM-DFT, along with a fault tag coded into binary form. The second stage is the set of hidden layers, where every node in a particular layer receives inputs from all the nodes in the layer immediately below it and sends its output to all nodes in the layer immediately above it. We experimented with various architectures, shallow and deep, using the open-source library popt4jlib (https://github.com/ioannischristou/popt4jlib), which allows for parallel and distributed evaluation over training instance pairs of both the network output and the gradient of the network computed via the classical back-propagation algorithm. The third and final stage is the output layer, which returns fault type (or no_fault) signals that are decoded back into the phase selection tag. Figure 5 illustrates the procedure, where "Local current A" refers to the current signal of phase A measured at the left end of the transmission line (see Fig. 6), "Remote current A" refers to the current signal of phase A measured at the right end, and similarly for the voltages and for phases B and C. Together, they form the features of the data-set. The multi-layer ANN in [9] has the following parameters: two fully connected layers with rectified linear unit (ReLU) activation, one output layer with softmax activation, a categorical cross-entropy loss function, and the Adam optimizer.
In our experiments, the best topology was achieved with a deeper network consisting of 4 layers in total: a first hidden layer consisting of a mixture of 5 linear activation units and 5 SoftPlus activation units (SoftPlus being a smoother version of ReLU), and two further hidden layers of 5 SoftPlus activation units each. The output layer, using one-hot encoding, comprised 11 sigmoid (logistic) activation units, each corresponding to one of the possible classification results for the problem (10 different fault types and one no_fault type).
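A forward-pass sketch of this topology in plain Python is given below; the weights are random and the feature count is illustrative (the actual feature set follows Table 1), and training is omitted:

```python
import math, random

def softplus(x):
    """Smooth version of ReLU: log(1 + e^x)."""
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, acts):
    """One fully connected layer; acts[i] is the activation of unit i."""
    return [act(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b, act in zip(weights, biases, acts)]

def init(n_in, n_out, rng):
    """Small random weights, zero biases (illustrative initialization)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

rng = random.Random(0)
n_features = 16                  # illustrative count of |I|, |V| features
identity = lambda x: x

# Layer sizes follow the topology reported in the text.
l1 = init(n_features, 10, rng)   # 5 linear + 5 SoftPlus units
l2 = init(10, 5, rng)            # 5 SoftPlus units
l3 = init(5, 5, rng)             # 5 SoftPlus units
out = init(5, 11, rng)           # 11 sigmoid units, one per class

def forward(x):
    h = dense(x, *l1, [identity] * 5 + [softplus] * 5)
    h = dense(h, *l2, [softplus] * 5)
    h = dense(h, *l3, [softplus] * 5)
    return dense(h, *out, [sigmoid] * 11)

scores = forward([0.5] * n_features)
print(len(scores))  # 11 class scores, each in (0, 1)
```

The predicted class would be the output unit with the highest score; in the actual system the weights are learned by back-propagation as described in Sect. 4.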

Quantitative association rule mining for fault diagnosis
Association rule mining (ARM) is a major and still very active research area; implementations of the algorithms developed over the years are found in most popular software packages for data mining, such as WEKA, MOA, KEEL, and Orange. ARM works on datasets that contain subsets of "items." A typical dataset applicable for ARM is a database containing supermarket basket data, i.e., the items in customers' shopping carts during check-out. Its major objective is to discover statistical rules that relate the presence of a set of such items to the presence of other items. A typical association rule for such market basket data would be Buys("Milk") ⇒ Buys("Bread"), where the implication is understood to hold in a statistical sense: the rule means that the percentage of baskets that contain both milk and bread is above a minimum threshold (the support of the rule) and that the ratio of baskets that contain both milk and bread to the baskets that contain at least milk is above another threshold (the confidence of the rule). The Apriori [3] algorithm is a famous early algorithm for discovering all such rules satisfying minimum support and confidence in a given dataset. In the following years, many different authors improved upon this first algorithm (see [19] for a notable example).
However, the above notion of association rules is a "qualitative" one: any quantitative attributes belonging to the items are not taken into account. Quantitative association rule mining (QARM) is an extension of standard ARM that allows items to quantify any attributes they may have in the rule antecedents and/or consequents, yielding more precise rules.
An illustrative example of a quantitative association rule would then be Buys(Milk).price ≤ 0.9 ∧ Buys(Bread).price ≤ 0.25 ⇒ Buys(Sugar).price ≤ 0.1, which says that (for a percentage of customers above the specified support) customers who buy milk at a price of at most USD 0.90 and bread at a price of at most USD 0.25 will also purchase sugar at a price of at most USD 0.10. This is significantly more information than simply knowing that when a customer buys bread and milk they are also likely to buy sugar.
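The support and confidence of such a quantitative rule can be computed directly; the toy baskets below are hypothetical:

```python
# Toy baskets: item -> price paid (a missing key means the item was not bought).
baskets = [
    {"milk": 0.85, "bread": 0.20, "sugar": 0.08},
    {"milk": 0.85, "bread": 0.22, "sugar": 0.09},
    {"milk": 0.95, "bread": 0.20},
    {"milk": 0.80, "bread": 0.24, "sugar": 0.15},
    {"bread": 0.20, "sugar": 0.05},
]

def antecedent(b):   # Buys(Milk).price <= 0.9  AND  Buys(Bread).price <= 0.25
    return b.get("milk", float("inf")) <= 0.9 and b.get("bread", float("inf")) <= 0.25

def consequent(b):   # Buys(Sugar).price <= 0.1
    return b.get("sugar", float("inf")) <= 0.1

both = sum(1 for b in baskets if antecedent(b) and consequent(b))
ante = sum(1 for b in baskets if antecedent(b))

support = both / len(baskets)   # fraction of all baskets matching the full rule
confidence = both / ante        # fraction of antecedent matches also matching
print(support, confidence)      # 0.4 and 2/3 on this toy data
```

Here two of the five baskets satisfy the whole rule (support 0.4), while two of the three baskets satisfying the antecedent also satisfy the consequent (confidence 2/3).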
QARMA [12,13] is a family of efficient novel cluster-parallel algorithms for mining quantitative association rules with a single consequent item, and many antecedent items with different attributes in large multidimensional datasets. Using the standard support-confidence framework of qualitative association rule mining [2], it extends the notions of support, confidence, and many other "interestingness" metrics so that they apply to quantitative rules.
QARMA is configured to produce rules of the form I_1.attr_1 ∈ [l_{1,1}, h_{1,1}] ∧ ⋯ ∧ I_n.attr_m ∈ [l_{n,m}, h_{n,m}] ⇒ J_0.p ∈ [l_0, h_0], or alternatively rules of the form I_1.attr_1 ∈ [l_{1,1}, h_{1,1}] ∧ ⋯ ∧ I_n.attr_m ∈ [l_{n,m}, h_{n,m}] ⇒ J_0.p = v. The latter form is very useful in supervised classification problems, where the value of the target item attribute is essentially the class variable being learned.
QARMA (fully specified in [13] and extended in [12]), within the particular context of grid fault diagnosis, works as follows. First, all subsets of variables that include the target variable (the fault indicator), of length 2, then 3, then 4, up to a user-specified length, are constructed; these are called "itemsets." The algorithm then proceeds sequentially to produce all valid quantitative association rules from each itemset of length 2, then 3, then 4, and so on. Within each phase of producing all valid rules of length l = 2, 3, ..., the algorithm considers in parallel all frequent itemsets of length l. For a given itemset, it produces all possible rules (with each attribute initially un-quantified); for each such un-quantified rule, a possibly different CPU core runs a procedure called QUANTIFY_RULE(), which maintains a local rule set R (initially empty) and runs a modified breadth-first search (BFS) that first assigns the consequent attribute the highest possible value and, as long as the resulting partially quantified rule has support above the required threshold, adds it to a queue data structure T.
While this queue is not empty, the first rule inserted into the queue is retrieved and removed. For each attribute not yet quantified in it, the algorithm creates as many new rules as there are distinct values of that attribute in the dataset, in ascending value order, and enters them in the queue T in that order, but only if the newly quantified rule meets the minimum support requirement. If a partially quantified rule also meets the minimum confidence (or any other metric), it is checked against the current set of local rules R to see whether it is dominated by another rule in R; if no rule in R dominates it, it is added to R. After this BFS process has run in parallel for all frequent itemsets of length l, the participating CPUs synchronize to exchange all rules before moving on to the frequent itemsets of length l + 1.
The resulting rule set has the theoretical property that it maximally covers the dataset it was produced from: no other rule of the form described above, meeting the required minimum support and confidence (or other specified interestingness metrics), can cover even a single extra instance of the dataset. Once the set of all non-dominated rules has been computed, a classifier based on their ensemble works as follows:
1. Select all the rules whose antecedent conditions are satisfied by the instance and add them to the set F;
2. Sort the rule set F in decreasing order of confidence and, secondarily, decreasing order of support on the training set;
3. Remove all but the top-100 rules of the sorted set F;
4. Assign each rule in F a weight equal to its confidence on the training set;
5. Decide the class of the instance by the weighted majority vote of the rules in F.
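The five steps above can be sketched as follows; the rule representation and the example rules over Table 1-style current features are hypothetical:

```python
# A rule: an antecedent predicate over a feature dict, a predicted class,
# and the confidence and support measured on the training set.
class Rule:
    def __init__(self, pred, cls, confidence, support):
        self.pred, self.cls = pred, cls
        self.confidence, self.support = confidence, support

def classify(instance, rules, top_k=100):
    # 1. Keep the rules whose antecedents the instance satisfies.
    fired = [r for r in rules if r.pred(instance)]
    # 2-3. Sort by confidence, then support, and keep the top-k rules.
    fired.sort(key=lambda r: (r.confidence, r.support), reverse=True)
    fired = fired[:top_k]
    # 4-5. Confidence-weighted majority vote decides the class.
    votes = {}
    for r in fired:
        votes[r.cls] = votes.get(r.cls, 0.0) + r.confidence
    return max(votes, key=votes.get) if votes else None

# Hypothetical rules over per-phase current magnitudes (in per-unit).
rules = [
    Rule(lambda x: x["I_A"] > 2.0, "AG", 0.95, 0.10),
    Rule(lambda x: x["I_A"] > 2.0 and x["I_B"] > 2.0, "ABG", 0.90, 0.05),
    Rule(lambda x: x["I_B"] < 1.2, "AG", 0.80, 0.20),
]
print(classify({"I_A": 2.5, "I_B": 1.0}, rules))  # "AG": both fired rules agree
```

When no rule fires, the sketch returns None; a production classifier would need a fallback (e.g., the majority class), which the paper does not specify.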

Test system description
A 400-kV, 50-Hz power system (Fig. 6) was simulated to extract features and generate the dataset of currents and voltages based on the DFT at the fault point (when there is a fault). Under this setting, 10 different fault types can occur involving the electrical phases A, B, or C and ground G of the transmission line: three-phase faults (ABCG), bi-phase faults (ABG, BCG, CAG, AB, BC, and CA), and mono-phase faults (AG, BG, and CG). They differ from each other in the phases involved and their parameters. The electrical system under study comprises a double-circuit transmission line typical of, for example, Finland and other European countries. It has two lines connected to a local end marked L and a remote end R. At each end, a source is connected representing a transmission network. Such lines pose a challenge for correct fault identification and selection owing to the strong impact of mutual impedance on the fault resistance. As for the communication channel, data were gathered by intelligent electronic devices (IEDs) at both ends and sent via a wireless link (e.g., 4G or 5G) to the fault selector, as shown in Fig. 6, which also presents the data-flow blocks showing how the fault selection is performed and the result returned to the smart devices for protective actions. The training and testing data-sets were collected in the pre-processing phase. All simulations were carried out in MATLAB/Simulink and prepared with the specifications shown in Table 2 and the transmission line parameters in Table 3. Both normal operation and the different fault types (10 in total) were simulated along with different fault resistances (24), fault inception angles (2), line parameter errors (5), high and low power flow (2), and fault locations along the line (9). The simulation comprised 20160 rounds to collect data of both faulted and non-faulted systems, with details presented in Table 2. The resulting data-set is publicly available.

Results
Two simulation scenarios and a real fault from a transmission line were used to test the proposed methodology. Note that, for these experiments, all machine learning algorithms ran on the same machine. The proposed implementations are fully parallel and take advantage of all CPU cores available in the computer running the code, which makes them computation-efficient. Notably, QARMA does not require any hyper-parameters to run. This is not the case for the ANN, for which the architecture (number of layers, number of nodes in each layer, type of each node, and so on) must be specified in advance and forms the set of hyper-parameters that need to be fine-tuned through experimentation and best-practice guidance.
Nevertheless, it is worth mentioning that we do not claim that the parameters of our ANN model are optimal, as they were found by manual search in repeated experiments; they simply provided excellent accuracy, and it is only this best set of results for the ANN that we report in this paper. Further, regarding the hyper-parameters required for the other classification algorithms that we experimented with: naive Bayes requires no hyper-parameters; Ripper-k requires only the number of FOLD iterations, which is set to 2 by default; decision trees, logistic regression, and support vector machines are famous for requiring very few hyper-parameters (the gain criterion function, and the penalty factors "w" and "C," respectively); finally, for the AdaBoost.M1 method, which does require the base weak learners to be fully specified, we left the default settings of the WEKA package.
In addition, all simulation scenarios were based on typical topologies and parameters used in the specialized literature, which represent real transmission lines and their operation.

Test system 1
In the first test system, the generated data-set was split into two subsets: 75% of a random shuffle of the data-set was kept for training and the remaining 25% was used to validate the accuracy of the trained models. The exact same split was used for all simulations with all different algorithms; experiments with fivefold cross-validation gave essentially identical results. Table 4 shows the results of running the above-mentioned supervised learning algorithms on the produced data-set (bold indicates the two most accurate methods; QARMA reached 98%), and Fig. 7 shows the confusion matrix for the fault classification task on test system 1. The accuracy achieved with the DL model setup was remarkably high, 98.33%. It was achieved by a 4-layer deep network, with 10 nodes in the first hidden layer (5 linear and 5 SoftPlus units), 5 SoftPlus units in the second layer, and 5 SoftPlus units in the third layer; the output layer had 11 sigmoid units corresponding to each of the 11 fault class types (including the "no-fault" type). This particular architecture was determined via trial-and-error, as the best observed among 50 different architectures. The total cost function of the network was the sum of squared errors of each output node over all training instances. The entire network was trained via stochastic gradient descent (SGD) as the weight-optimization algorithm, with the backpropagation algorithm used to compute the overall function gradient (the derivatives corresponding to each data instance within a batch were computed in parallel and then summed to form the total gradient). The open-source library popt4jlib (https://github.com/ioannischristou/popt4jlib) was used to train this network; it also contains the simulation data-sets used in this paper. Note that simpler methods such as Naive Bayes or logistic regression did not perform well on this data-set. The relatively deep neural network employed here has enough layers to produce an intermediate representation that makes it easy for the final layer to correctly classify the 11 different classes; similarly, for QARMA, the large number of produced high-confidence rules leads to majority votes that usually predict the fault correctly. The SGD method was used with a mini-batch size of 50 instances. In addition, normalizing the gradient vector g(w) = ∇E(w) to unit length before the steepest-descent update w ← w − αg(w) was important for quick convergence; the learning rate α decayed as the epochs progressed according to the formula α ← α · 500/(500 + epoch). The remarkable validation accuracy was achieved after only 10 epochs, in less than 8.6 s of wall-clock training time on an Intel i9-10920X processor using all its 24 logical cores. This high accuracy is due to the large size of the simulated fault data-set and, equally importantly, to the balance between the sample sizes of the various classes.
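The normalized-gradient update with the decaying learning rate described above can be sketched on a toy objective (a minimal illustration, not the popt4jlib implementation; the function and variable names are ours):

```python
import numpy as np

def sgd_normalized(grad_fn, w0, alpha0=0.1, epochs=50):
    """Steepest descent with the gradient normalized to unit length and the
    decaying learning rate alpha <- alpha0 * 500 / (500 + epoch)."""
    w = np.asarray(w0, dtype=float)
    for epoch in range(epochs):
        alpha = alpha0 * 500.0 / (500.0 + epoch)
        g = grad_fn(w)
        norm = np.linalg.norm(g)
        if norm > 0:
            g = g / norm  # normalize before the steepest-descent step
        w = w - alpha * g
    return w

# Toy example: minimize E(w) = ||w||^2, whose gradient is 2w;
# the iterates approach the origin.
w_final = sgd_normalized(lambda w: 2 * w, [3.0, -4.0], alpha0=0.5, epochs=200)
```

With unit-length gradients, the decaying α directly controls the step size, which is why the schedule matters for convergence.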
The strong success of the DL model is also because all the voltages and currents from both lines were available, including neutral currents. The faults that were not selected properly were all single-phase-to-ground faults. This can be explained as follows: in those fault cases where the fault resistance took the largest value, only one of the phases changed slightly compared to the other two, making the feature variation difficult for the model to detect. Further, perfect communication, free of any problems related to latency, availability, or synchronization, was assumed.
With this setup, the importance of availability of all features was tested. Table 5 lists the number of features tested, and Fig. 8 shows the results with an ANN.
With fewer features, the ANN does not perform as well, emphasizing the importance of neutral current estimation. However, even when only one-end currents are available, the validation accuracy of the algorithm is still adequate for the task.
We also ran an experiment to test the sensitivity of the neural network to measurement noise: we progressively added more Gaussian white noise (with zero mean and increasing sigma values) to each of the features in our training and/or test data, except for the class attribute (fault type). The results are tabulated in Table 6 and show that, for small σ values below 10, the trained model is still able to classify test data with nearly the same accuracy as when there is no noise in the measurements; however, for a large σ = 100, the accuracy of the neural network drops significantly, to around 89%, which indicates that the trained model is no longer able to accurately identify fault types when measurement noise reaches such high levels. The situation is the same or worse when the training data-set itself suffers from measurement noise: when the training data-set is "polluted" with white noise with a small σ = 10, even when the test data have no noise at all, the accuracy of the trained model drops to less than 94%. When both training and test data are "polluted" with white noise with σ = 100, the neural network accuracy drops to less than 80%. We ran QARMA on the same training set with a user-defined support threshold of 3.5% and a confidence threshold of 90%, obtaining 5333 rules covering 97.8% of the entire training set. Then, a slight variant of the decision-making algorithm described in the previous section, based on weighted voting, was used: for each instance in our test set, as long as the instance is covered by more than 100 rules, the instance's class is decided by the majority vote of the top 10 firing rules with the highest confidence on the training set; instances that fail the minimum-coverage requirement are not classified. This algorithm resulted in a high accuracy comparable to that obtained by the DL model, around 98%, but at the cost of a longer training time (around 15 min of wall-clock time on the same i9-10920X CPU with 24 logical cores).
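The noise-injection protocol above amounts to perturbing the feature matrix while leaving the class labels untouched; a minimal sketch (illustrative names; `model.score` is a hypothetical scoring call, not from the paper's codebase):

```python
import numpy as np

def add_measurement_noise(X, sigma, seed=None):
    """Add zero-mean Gaussian white noise with standard deviation `sigma`
    to every feature column; the class attribute is kept separately in y
    and is never perturbed."""
    rng = np.random.default_rng(seed)
    return X + rng.normal(0.0, sigma, size=X.shape)

# Sweep mirroring Table 6: score one trained model on progressively
# noisier copies of the test features.
# for sigma in (0, 1, 10, 100):
#     acc = model.score(add_measurement_noise(X_test, sigma), y_test)
```

Applying the same perturbation to the training set before fitting reproduces the "polluted training data" rows of Table 6.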
For a small percentage of testing instances, approximately 4%, QARMA was not able to provide a decision because of the small number of rules firing on them. However, we expect that QARMA and its decision-making components will compare equally well with, or even outperform, deep learning techniques on training sets that are more highly skewed.
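The coverage-gated voting step described above can be sketched as follows (a simplification under stated assumptions: rules are represented here as (predicate, label, confidence) triples, which is not QARMA's internal format, and the names are ours):

```python
from collections import Counter

def rule_vote(instance, rules, min_coverage=100, top_k=10):
    """Coverage-gated weighted vote: an instance is classified only when
    more than `min_coverage` rules fire on it; its class is then the
    majority label among the `top_k` firing rules with the highest
    training-set confidence."""
    firing = [(conf, label) for pred, label, conf in rules if pred(instance)]
    if len(firing) <= min_coverage:
        return None  # leave the instance unclassified
    top = sorted(firing, key=lambda t: t[0], reverse=True)[:top_k]
    return Counter(label for _, label in top).most_common(1)[0][0]
```

Returning `None` for under-covered instances is exactly what produces the ~4% of undecided test cases reported above.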
Another advantage of QARMA relates to the sensitivity of the produced rules to noise in the data. We already saw that, when the training and testing data suffer from Gaussian white noise with σ = 100, the performance of the neural network drops just below 80%. On the other hand, when QARMA ran on the same noise-polluted training data-set with σ = 100, and the resulting rule ensemble was then asked to classify an equally noise-polluted testing data-set (with σ = 100), QARMA's performance surprisingly remained very high, at 97.11%, making QARMA much more robust to measurement noise than the neural network. QARMA's performance is thus very little affected by noise in absolute terms, ranging from a 2% error in the best studied case (first row of Table 6) to 2.9% in the worst (last row).
Even though more research and experiments are needed to fully explain why this happens, we believe that the cause of the difference in robustness between the two classifiers probably lies in the underlying models' complexity: the NN, being a deeply composite function of many variables (connection weights and bias thresholds), is easier to over-fit when optimized on a noise-polluted training data-set, "learning" some of the noise into its weights. On the other hand, QARMA, being a rule extractor that learns rules with only a small number of different features in their antecedent conditions, provides an ensemble of simple if-then decision rules that are more likely to hold true in the presence of noise.
Besides, QARMA produces a model with a set of quantitative rules that are much easier to understand and reason about than most other models, and DL models in particular; this makes QARMA results much easier to explain to humans. Every extracted rule is trivially checked against the training data-set for validation purposes, and it is also trivial to understand "what it means," since the preconditions of the rule are nothing more than a conjunction of restrictions of the attributes in the rule's antecedent to certain intervals. This ease of understanding is what has made rules particularly attractive since the beginning of AI and ML research. In fact, since the 1980s there have been attempts to extract the knowledge embedded in neural network models into sets of rules [31], as rule sets were recognized from the beginning as one of the most natural knowledge representations. Therefore, QARMA is, in general, a particularly good fit for the newly emerging "eXplainable Artificial Intelligence" (XAI) paradigm, the term "explainable" meaning that the model the algorithm produces can be easily understood by humans.
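Checking an extracted rule against a data-set, as described above, amounts to counting interval membership; a minimal sketch (illustrative names, not QARMA's API) computes a rule's support and confidence:

```python
import numpy as np

def rule_stats(X, y, intervals, predicted_class):
    """Support and confidence of a quantitative rule on data-set (X, y).
    `intervals` maps a feature column index to a (low, high) range; the
    rule fires when every listed feature lies inside its interval."""
    fires = np.ones(len(X), dtype=bool)
    for col, (lo, hi) in intervals.items():
        fires &= (X[:, col] >= lo) & (X[:, col] <= hi)
    support = fires.mean()  # fraction of instances the rule covers
    confidence = (y[fires] == predicted_class).mean() if fires.any() else 0.0
    return support, confidence
```

Because validation reduces to this counting, every rule in the ensemble can be audited against new data by a human expert.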

Test system 2
A different line configuration was also tested in order to evaluate the generalization capabilities. Test system 2 consists of a single-circuit 400-kV transmission line connected to two Thevenin equivalents. Although this simpler system does not suffer the same impact of mutual impedance as double-circuit transmission lines, including this simulation in the data-set allows analyzing how well the solution generalizes to different systems. A data-set containing 990 rows was used for testing the original model; the ANN achieved an accuracy of 98.8%, while QARMA achieved 98.1% for all fault classes and non-faults in this system. The results were slightly better than the test performed on the original data-set, showing the viability of the DM-DFT for fault detection and of the ANN/QARMA for classification. The confusion matrix of this test can be seen in Fig. 9.
We also ran a symmetrical-components method on the model data-set of this paper. This method is used as the basis for comparison with our proposed approach because it is employed by one top relay manufacturer. The results can be seen in Fig. 10 (note that a confusion matrix like the one presented in Fig. 9 cannot be used for comparing all faults with the symmetrical method because the data-sets have different lengths; only AC, BC, and CG faults can be compared in that way). The accuracy of this method for single-phase faults is presented in Table 7, along with the false-positive single-phase detections. A false positive in this context is a single-phase fault selected by the symmetrical method when the real fault involved at least two phases. Under those conditions, the classification strategy takes the system to a less secure situation than tripolar tripping would.
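The false-positive notion defined above can be computed directly from the true and predicted fault labels; a minimal sketch (the function name and label encoding are ours, chosen to match the fault classes used in this paper):

```python
import numpy as np

def single_phase_false_positive_rate(y_true, y_pred,
                                     single_phase=("AG", "BG", "CG")):
    """Fraction of multi-phase faults that the selector labeled as
    single-phase: the 'false positive' notion used for Table 7."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    multi = ~np.isin(y_true, single_phase)  # true fault involves >= 2 phases
    if not multi.any():
        return 0.0
    return float(np.mean(np.isin(y_pred[multi], single_phase)))

# e.g., an ABG fault misread as AG counts as a false positive here.
fp = single_phase_false_positive_rate(["ABG", "AG", "BC"], ["AG", "AG", "BC"])
```

Each such false positive corresponds to a single-pole trip issued where a tripolar trip was required, which is the security concern raised above.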
In summary, the results shown in Fig. 7 indicate that the errors in the proposed method occurred as failures to identify some faults. Since fault selection systems are meant to be associated with protection algorithms, those errors can cause an unnecessary tripolar breaker opening, i.e., a security error. Considering an interconnected system, security errors are less likely to cause system-wide power outages than protection-dependability errors. Therefore, in comparison with the symmetrical method, the proposed solution will promote better system stability than the traditional method whose results are depicted in Table 7.

Real fault file
To test the proposed procedure, we used a real fault file from a transmission system located in Brazil, whose exact location cannot be disclosed. Real fault records are usually gathered by fault recorders in .cvg files; we used an algorithm to convert them into matrices (.mat) for easier processing. Once the voltage and current matrices are obtained, they can be injected into the DM-DFT algorithm, which yields the fault point and extracts the features, as seen in Fig. 11. In real situations, faults can suddenly reappear for reasons such as reclosure or reinsertion; this is the case for the CG-type fault shown in Fig. 11. The algorithm successfully detects the first fault occurrence and also locates the exact sample from which the phasors are extracted to perform selection. The NN and QARMA techniques were applied to the real fault data with a successful result: both correctly classified the fault as CG. In particular, QARMA yielded 1100 rules that predicted the class of the fault, resulting in the overall correct classification of the test case. One of the highest-confidence rules firing on this test case had a support of 2.72% on the training set and held with confidence 100%.

Implications of the results
Current implementations for real-time use cases, such as distance (ANSI 21) relays in transmission lines, usually employ a full cycle of phasor estimation plus around 4 ms of angle comparison between current/voltage components, as reported by some manufacturers. With either of the studied methods (DL or QARMA), once the model is generated from historical data of the target system, the time taken to perform phase evaluation given a single-phase fault in the system is as small as 4 ms. Therefore, both methods can reliably select the faulty phase in the relay to support additional trip decisions. The algorithms commonly implemented in relays in operation today employ a full-cycle Fourier phasor estimation for both protection and phase estimation. Because the phasor calculation is done in real time, it is based on the Fourier transform (or another filter with a similar output), which provides the root-mean-square (RMS) value required for the proposed phase estimation, implying that only the phase selection itself has to be calculated. The process of utilizing the algorithms' outputs only requires multiplications and addition operations, making it more computationally efficient than most phasor operations and suitable for real-time applications.

Communication systems
As for the requirements of a communication setup in which phase selection can be performed, no communication between the two ends is needed as a signal input, because once the model is generated, the evaluation is performed at the local end. However, communication is still needed to gather data from both ends for phasor estimation. Current advances in mobile communication could enable wireless links between the ends for instantaneous gathering of current and voltage data, and interface diversity could enable a centralized system that is cheaper to implement in a communication architecture, as shown in Fig. 6.

Explainable results
When comparing the results of the DL method against those provided by QARMA, it is clear that the rule set produced by QARMA leads to a slightly lower accuracy than the DL method, while still being highly accurate. However, the resulting QARMA model is by default much more "explainable" than the DL model and has the extra advantage that it can be "reverse-engineered" much more easily than any other model. As an example, consider the following QARMA-produced rule, whose antecedent is a conjunction of interval restrictions on the measured features: ⇒ fault_type = 0 (AG) (7), holding with support 4.19% and confidence 95.06%. A human can instantly understand what such a rule means. When the QARMA rule set leads to a false diagnosis, it is trivial to see which set of rules led to the wrong decision. These rules can then be individually checked by human experts to see whether their validity still holds in the face of new data and/or operating conditions. Thus, at least in principle, the entire model can be monitored and "debugged" in real time by human experts when it is put into production. This contrasts with models that make decisions based on the output of a highly nonlinear equation.
When fewer features are available in the training set, performance drops, as expected. In certain cases the degradation is graceful, but it can also be more serious. This degradation could be mitigated to a larger extent if our simulations allowed the design of the deep network to vary in all hyper-parameters, from the number of layers and the number of epochs to the optimization algorithm used for learning the network weights and threshold biases. However, with such an approach, the ML process shown in Fig. 4 would essentially have to be repeated anew. Instead, we show how a network with predefined hyper-parameters, in particular those proposed in Sect. 3, performs when trained on different subsets of the original training data-set containing fewer features, such as local-only information (local currents, or local voltages, and so on). Moreover, we also presented key aspects of the real data-set from a transmission line and showed how the proposed method can support power engineers in their operational decision-making, including the association rules provided by QARMA that "explain" the fault selection. Results from association rules improve our knowledge of how power systems behave in the face of stressful events. This is an important step if traditional engineering fields are to rely more on machine learning methods, and this sort of explanatory knowledge is likely to become ever more frequent in real-world applications as well as in academic research.

Conclusion
In this paper, we have proposed and analyzed a two-step methodology for selecting faults in double-circuit transmission lines. In the first step, the DFT is used to pre-process the raw data from the transmission lines. Subsequently, different learning algorithms are employed in the second step to detect and classify any fault based on a training period, and their performances were compared through numerical simulations. The presented two-step approach proved highly robust against high-resistance faults and faults occurring in lines with high mutual impedance. The results have shown high phase-selection accuracy for all types of faults, and the approach even identified recordings that do not present faulty states.
Among the different benchmarked learning methods, deep neural networks reached an accuracy of 98.33% of correct selections, while QARMA reached 98%. Interestingly, however, QARMA is also an explainable algorithm (i.e., its outcomes expose explicit internal relations between features) and, unlike ANNs, is robust against noisy measurements. This makes QARMA a highly suitable approach for achieving high robustness and high accuracy with explainable model outcomes. Future work will include the communication delay of the current and voltage signals sent from the IEDs to the central processing unit, to evaluate the performance of the proposed method under such conditions.