Introduction

According to the International Energy Agency (IEA) report for 2021, approximately 81% of global electricity production is based on the combustion of coal, oil, and natural gas. Within a year, the use of alternative energy sources such as photovoltaics and wind turbines [1] increased by 1% [2]. In the European Union, energy production from photovoltaic systems (PVS) increased by 1848% between 2008 and 2020 [3]. This increase can be attributed to the ability of PVS to operate with a zero carbon footprint, which makes their use consistent with the Paris Agreement. Furthermore, PVS are easy to install [4,5,6,7]. However, their low efficiency and low profit margin per MWh deter large investments in PVS [8, 9]. With the progress of embedded systems, a gradual transition to smart photovoltaic systems is taking place. Through power line communication (PLC), smart PVS can maximize the energy production of a PVS and provide additional control and parameterization of the array itself, as well as fault control at the PV cell level [10].

The advantages of machine learning (ML) methods over other artificial intelligence (AI) and threshold-based methods are many and include their data-driven nature, scalability, automation, continuous learning, and predictive accuracy [11, 12]. ML algorithms are designed to learn from data and make predictions based on patterns in the data, rather than relying on pre-programmed rules. This allows ML-based algorithms to make more accurate predictions and decisions [13]. Unlike other AI methods, which can be limited by pre-programmed rules, ML algorithms can handle large amounts of data and are suitable for processing big data sets, which makes them scalable [14]. It should be noted, however, that several AI algorithms such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), gradient boosting machines (GBMs), and rule-based systems can also be scaled up to handle large datasets and complex problems. Automation is another advantage of ML algorithms, as they can automate many tasks that would otherwise require human intervention, leading to increased efficiency and reduced costs [12]. In addition, ML algorithms can continue to learn and improve over time with new data, making them more adaptive and versatile than traditional AI methods [11].

Compared to threshold-based methods, ML algorithms can make more complex decisions based on patterns in the data, leading to improved predictive accuracy. Threshold-based methods rely on pre-defined thresholds to make decisions, which can lead to oversimplification and decreased accuracy [13, 14]. Unlike threshold-based methods, ML algorithms can also handle non-linear relationships in the data, making them more suitable for a wider range of applications [12].

In summary, the advantages of ML methods over other AI approaches and threshold-based methods make them a powerful tool for prediction and decision making in many fields, although the specific advantages vary depending on the application. Fast execution time and low memory usage are crucial for the success of ML algorithms in real-world applications [12, 15]. ML algorithms can be computationally intensive, and slow execution leads to longer processing times and decreased efficiency [15]. In addition, many real-world applications require large amounts of data, and the memory requirements of ML algorithms can be substantial [12]. If the memory requirements of an algorithm are too high, it may not be feasible to run it on the available hardware, leading to decreased performance and accuracy [15]. Fast execution and low memory usage are therefore essential for the successful deployment of ML algorithms, as they ensure that the algorithms are computationally feasible and efficient [12]. To sum up, the objective of this work is to implement a machine learning-based fault detection and identification algorithm that:

  a) detects at least the three main categories of faults (open-circuit faults, short-circuit faults, and mismatch faults) that arise on the DC side of a PVS;

  b) has a small computational cost, i.e. a short execution time per prediction and, at the same time, minimal memory consumption;

  c) achieves high accuracy;

  d) can be applied to smart PV arrays that transmit voltage and current measurements from each PV cell of the array individually. In this way, the operation of each cell of the smart PV array is monitored [10, 16]. When a fault occurs, the faulty PV cell is isolated from the PV array.

The structure of this paper is as follows. Section 2 reviews similar works presented in the literature. It first summarizes the most common types of faults that occur on the DC side of a PVS. The fault detection techniques proposed in the literature are then assessed for their accuracy, their memory and time requirements, and their ability to detect as many unique types of faults as feasible. Methods that can detect and identify a wide variety of faults are preferred over those that do not meet this criterion. Section 3 presents the methodology of the developed method. Section 4 presents the experimental results and their discussion, including a comparison of the proposed approach with other methods from the literature. Finally, Sect. 5 presents the conclusions of this research.

Literature review

Faults in PVS

The main purpose of fault detection and classification methods is to identify what is causing fluctuations in the energy production of a PVS [17]. Different types of faults can occur on both the AC and DC sides of a PVS [18]. Traditional protection systems are designed to address AC faults, but faults on the DC side can be harder to identify and fix [17, 19]. Typical faults on the DC side of the PVS are shown in Fig. 1 and briefly presented in Table 1.

Fig. 1 Manifestation of common types of faults on the DC side of a PVS: (a) degradation of the semiconductor, (b) discolorations, (c) microcracks, (d) particle accumulation, (e) shading, (f) short-circuit, (g) open-circuit

Table 1 Common types of DC faults and their causes [20]

One common category of DC faults is mismatch faults, which can significantly reduce the power output of a PVS. Mismatch faults can be temporary or permanent. Temporary mismatch faults can be caused by the accumulation of particles such as dust or bird droppings on the surface of a PVS, or by shading of the PVS by a tree or a cloud. Permanent mismatch faults can be caused by damage to the adhesive materials, surface cracks on the PVS, gaps between layers of the PV module that cause shading, or deterioration of the semiconductor material [21]. It should be noted that permanent mismatch faults can also occur as a result of another fault, such as an open-circuit fault. Short-circuit faults occur when problems with the connections in a PVS lead to an unintended connection between two points of the PVS [22]. An unintended short circuit between two voltage potentials across two neighboring strings, or between two voltage potentials inside a single string [23], is called a line-to-line fault. If the short circuit connects a current-carrying conductor to a non-current-carrying part, such as the PV frame, the fault is called a ground fault or line-to-ground fault [24].

Open-circuit faults can occur when there is a disconnection in a PV string [25] (usually caused by poor soldering). Under certain conditions, an open-circuit fault can also lead to arc faults, which produce high-frequency noise and rapid decreases in output voltage and current [26]. It is worth noting that arc faults can be mitigated using an Arc Fault Circuit Interrupter (AFCI) and ground faults can be monitored using a Residual Current Monitor (RCM) [27,28,29]. Arc faults and ground faults deserve special mention because both are particularly dangerous. The former can cause a fire, the spread of which can threaten the entire installation, while the latter can energize the PVS frames and put the lives of the installation's personnel at risk [18, 30]. Using high-quality materials and handling the PVS properly during transport and installation can also help reduce the risk of mismatch faults [31], since proper installation avoids microcracks on the PVS surface and better-quality materials greatly slow down the appearance of discoloration.

Fault detection algorithms

There are various techniques in the literature for detecting faults in PVS. They can be categorized into three main groups: electrical characterization, visual inspection, and thermal imaging. Visual inspection techniques [32, 33] require regular inspections to detect anomalies in the appearance of the PVS, so they cannot be used for real-time monitoring. Thermal imaging techniques [34,35,36,37] involve the use of specialized equipment that increases the cost of PVS installation. Electrical diagnostic methods, on the other hand, can be performed either on-site or remotely. They are based on monitoring the specific electrical signature that each fault produces and its effect on the output power of the PVS [38]. Many electrical diagnostic methods base their analysis on the I-V curve of the PVS, from which various faults can be detected [39].

Several machine learning (ML)-based techniques with high fault detection accuracy have been published in the literature. Many of these methods are trained using I-V curve data. Although ML-based approaches require a significant amount of processing power for training, their capacity to self-learn and adapt to a variety of inputs outweighs this drawback [40].

The method presented by Chen et al. [41] can identify a wide range of faults. It is based on a kernel-based extreme learning machine (KELM) with seven inputs (Voc, Isc, Vmpp, Impp, α1, Rs, and RMSE values) and five outputs (four faulty states and a normal state). Harrou et al. [42] employed an artificial bee colony (ABC) method to process irradiance and temperature data in order to develop the single-diode model for the monitored PV cells. The current, voltage, and power at the maximum power point are determined, and the disparity between simulated and measured values is used to indicate the presence of a fault. Voutsinas et al. [43] used data from I-V curves to train a multi-output feed-forward neural network consisting of 8 × 10 × 10 × 6 neurons. This implementation has four faulty states and a normal state. To identify line-to-line faults, Yi and Etemadi [44] used multi-resolution signal decomposition (MSD) for feature extraction combined with a support vector machine (SVM). Xia et al. [45] used wavelet decomposition in conjunction with an SVM to identify series DC arc faults. Harrou et al. [46] used a binary SVM classifier to detect irregularities in the DC output current and power, using a PSIM simulation of an installed grid-connected PVS. Wang et al. [47] employed a multi-class SVM to identify and categorize line-to-line faults and anomalous degradation faults in a PV module. Winston et al. [48] used a feed-forward back-propagation neural network combined with an SVM to detect micro-cracks and hotspots. Yi and Etemadi [49] proposed a method for detecting line-to-line and line-to-ground faults, based primarily on an MSD algorithm combined with a fuzzy inference system. Memon et al. [50] proposed a convolutional neural network (CNN) that uses parameters such as irradiance, temperature, voltage, and current to detect the presence of faults. Jia et al. [51] presented a method with near-perfect accuracy for detecting arc faults using logistic regression. Fadhel et al. [52] proposed a data-driven approach for detecting faults caused by shading on a PVS. The method is based on principal component analysis (PCA) and uses data from I-V curves to detect faults with significant accuracy. Finally, Dai et al. [53] suggested a data-driven PVS fault detection technique based on deep reinforcement learning. A fault diagnostic model of the PVS is created, and a deep neural network is used to approximate the decision network in order to find the optimal strategy for diagnosing faults in the photovoltaic power generation system.

The need to improve the reliability and performance of a smart PV array is the motivation for the development of a rapid and accurate fault detection and identification method based on ML. In the context of renewable energy systems, fault detection and identification are crucial for ensuring optimal energy generation and preventing catastrophic failures. However, traditional fault detection methods are often time-consuming and rely on human intervention, which can lead to delayed or inaccurate diagnoses. By leveraging the power of machine learning, the proposed approach can quickly and accurately identify faulty PV cells in real-time, based on the data transmitted by each cell. While there are limitations to machine learning, such as overfitting, the use of a rigorous cross-validation process can help mitigate these issues and improve the accuracy of the model. Therefore, the use of machine learning in fault detection and identification for smart PV arrays is a promising approach that can improve system reliability and performance.

Methods

To create the dataset, irradiance and temperature data are required, as well as a model that simulates the operation of each photovoltaic cell. The electrical output of a photovoltaic cell can be approximated by an equivalent circuit called the single-diode model (SDM), which has five parameters. These parameters are initially unknown, are required to predict the performance of the PV module, and are derived from the photovoltaic cell's current equation for a given temperature and irradiance. The SDM can simulate PV cell performance under low-voltage and/or high ambient temperature conditions [54]. Equation (1) gives the current equation of the single-diode model, and Fig. 2 depicts the equivalent circuit. In Eq. (1), Iph is the current generated by the incident irradiance, Io1 is the reverse saturation current of diode D1, q is the electron charge, k is Boltzmann's constant, α1 is the diode ideality factor, T is the temperature in kelvin, and Rs and Rsh are the series and shunt resistances. The approach of De Soto et al. [55] is used to determine the five parameters of Eq. (1) (Iph, Io1, α1, Rs, and Rsh).

Fig. 2 Single-diode model, equivalent circuit

$$I = I_{ph} - I_{o1}\left(e^{\frac{q\left(V + I R_{s}\right)}{\alpha_{1} k T}} - 1\right) - \frac{V + I R_{s}}{R_{sh}}$$
(1)
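Because the current I appears on both sides of Eq. (1), the equation has to be solved numerically for each voltage. The following is a minimal sketch of one way to do this, using plain bisection on the residual of Eq. (1). It is not the implementation used in this work (which relies on PVlib); the function names and the parameter values in the usage note are hypothetical and chosen only for illustration.

```python
import math

Q = 1.602176634e-19  # electron charge q (C)
K_B = 1.380649e-23   # Boltzmann constant k (J/K)


def sdm_residual(i, v, iph, io1, a1, rs, rsh, t_kelvin):
    """Right-hand side of Eq. (1) minus I; it is zero at the operating point."""
    vt = a1 * K_B * t_kelvin / Q      # a1*k*T/q, in volts
    vd = v + i * rs                   # voltage across the diode branch
    return iph - io1 * math.expm1(vd / vt) - vd / rsh - i


def sdm_current(v, iph, io1, a1, rs, rsh, t_kelvin, tol=1e-12):
    """Solve Eq. (1) for I at voltage V by bisection.

    The residual decreases strictly with I, so a sign change brackets
    exactly one root.
    """
    args = (v, iph, io1, a1, rs, rsh, t_kelvin)

    def f(i):
        return sdm_residual(i, *args)

    lo, hi = 0.0, iph
    step = 1.0
    while f(lo) < 0.0:                # widen downwards until f(lo) >= 0
        lo -= step
        step *= 2.0
    step = 1.0
    while f(hi) > 0.0:                # widen upwards until f(hi) <= 0
        hi += step
        step *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For example, with the hypothetical single-cell parameters `iph=8.5`, `io1=1e-9`, `a1=1.3`, `rs=0.005`, `rsh=100.0` and `t_kelvin=298.15`, calling `sdm_current(0.0, ...)` returns a short-circuit current slightly below `iph`, and sweeping `v` from 0 upwards traces the I-V curve.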

In Fig. 3, the green plots represent the Impp and Vmpp values from the manufacturer’s datasheet, while the yellow plots represent the values generated by the dataset-creation application using the De Soto method. The deviations in voltage are less than 10 mV and those in current are less than 100 mA, which makes them negligible.

Fig. 3 Comparison between the values of Impp and Vmpp from the datasheet of the PV cell and the Impp and Vmpp generated by the SDM

With the SDM validated, the next step is to collect irradiance and temperature data. The PVGIS service [56] is used to acquire irradiance and temperature data for the Egaleo region (Attica, Greece). PVGIS provided twelve files, one for each month from January to December, containing the average hourly temperature and irradiance over all days of each month. These files were consolidated, and the records corresponding to the hours after sunset and before sunrise were discarded. The concatenated yearly file served as input to a Python script, which used the SDM to construct I–V curves for the resulting 147 pairs of irradiance and temperature. Across the 147 pairs, temperature ranges from 1.32 to 35.06 °C, while irradiance ranges from 0.04 to 984.84 W/m².
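The consolidation step can be illustrated with a short sketch. This is not the script used in this work: the record layout `HourlyRecord` and the rule that night-time hours are those with zero irradiance are assumptions made only for illustration.

```python
from typing import NamedTuple


class HourlyRecord(NamedTuple):
    """Hypothetical layout of one monthly-average hourly record."""
    month: int      # 1..12
    hour: int       # 0..23
    ghi: float      # irradiance in W/m^2
    temp_c: float   # temperature in degrees Celsius


def daylight_pairs(monthly_files):
    """Concatenate the monthly records and keep only daylight hours.

    Returns (irradiance, temperature in kelvin) pairs.
    """
    pairs = []
    for records in monthly_files:      # one list of records per month
        for rec in records:
            if rec.ghi > 0.0:          # assumption: zero irradiance marks night-time
                pairs.append((rec.ghi, rec.temp_c + 273.15))
    return pairs
```

Each returned pair would then be passed to the SDM to generate one I–V curve.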

After all the data have been collected, the last script combines them into a dataset of 110,250 records. Each record has the format: Temperature (K), GHI (W/m²), VmppModel (V), ImppModel (A), VocModel (V), IscModel (A), VCell (V), ICell (A), and operational status. Temperature and GHI values are retrieved directly from the PVGIS service. VmppModel, ImppModel, VocModel, and IscModel are retrieved from each generated I–V curve. VCell and ICell are voltage and current values ranging from 0 to VocModel and from 0 to IscModel, respectively, and represent the measurements made in each PV cell of the array.
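As a consistency check, 110,250 records over 147 irradiance and temperature pairs correspond to 750 records per pair. The sketch below assembles records in the stated format under clearly hypothetical assumptions: VCell and ICell are drawn uniformly at random, the dictionary keys of `curves` are invented for the example, and the operational status is left as a placeholder because the labeling conditions of Table 2 are not reproduced here.

```python
import random
from typing import NamedTuple, Optional


class Record(NamedTuple):
    temperature_k: float
    ghi: float
    vmpp_model: float
    impp_model: float
    voc_model: float
    isc_model: float
    v_cell: float
    i_cell: float
    status: Optional[int]   # placeholder; Table 2 conditions are not applied here


def build_records(curves, samples_per_pair=750, seed=0):
    """Create samples_per_pair records for every I-V curve summary in curves."""
    rng = random.Random(seed)
    records = []
    for c in curves:
        for _ in range(samples_per_pair):
            v_cell = rng.uniform(0.0, c["voc"])   # assumption: uniform sampling
            i_cell = rng.uniform(0.0, c["isc"])   # assumption: uniform sampling
            records.append(Record(c["temperature_k"], c["ghi"], c["vmpp"],
                                  c["impp"], c["voc"], c["isc"],
                                  v_cell, i_cell, None))
    return records
```

With 147 curve summaries and the default of 750 samples per pair, `build_records` returns 110,250 records.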

The operational status codes are encoded according to Table 2. Fig. 4 depicts the percentages of the operational status codes within the full dataset, and in Fig. 5 the nine features of the dataset are grouped by operational status code.

Table 2 Operational status code encodings–conditions
Fig. 4 Distribution of operational states within the dataset

Fig. 5 Dataset features grouped by operational status code

Inspecting the values of the last field of the dataset (operational status) shows that this is a multi-class classification problem. To turn a multi-class problem into a set of binary tasks, either the one-vs-one (OVO) or the one-vs-rest (OVR) strategy can be used. With N = 5 operational statuses, the OVO strategy requires N(N-1)/2 = 10 classifiers, while the OVR strategy requires only 5, one for each operational status. Consequently, the OVR strategy is considered preferable.
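The classifier counts for the two strategies can be verified by enumerating the binary tasks. The sketch below uses hypothetical placeholder labels for the five operational statuses; the actual codes are those of Table 2.

```python
from itertools import combinations


def ovo_tasks(classes):
    """One binary task for every unordered pair of classes."""
    return list(combinations(classes, 2))


def ovr_tasks(classes):
    """One binary task per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


# Hypothetical placeholder labels for the five operational statuses.
statuses = ["status_0", "status_1", "status_2", "status_3", "status_4"]
n_ovo = len(ovo_tasks(statuses))   # N*(N-1)/2
n_ovr = len(ovr_tasks(statuses))   # N
```

The gap widens quickly as classes are added, since the OVO count grows quadratically while the OVR count grows linearly.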

Using logistic regression enhanced with cross-validation (CV) for fault detection in photovoltaic systems can be considered a constructive contribution. While logistic regression is a well-established machine learning technique for binary classification tasks [51], its use in the specific context of fault detection in PV systems is still developing. Adding CV to the logistic regression model helps to improve its performance by reducing overfitting and increasing its ability to generalize to new data, which could lead to new and more effective approaches for identifying faults in PV systems. The LogisticRegressionCV classifier is compatible with the OVR strategy. It is similar to plain logistic regression, but its regularization hyperparameter is tuned through CV: it tries several regularization strengths, chooses the best one based on the CV scores, and then refits a single model on the entire training set using that best C (the inverse of the regularization strength). The LogisticRegressionCV parameters were set as follows. The maximum number of solver iterations is set to 10,000. The number of CV folds is set to 3. The ‘ovr’ option is selected in order to follow the one-vs-rest strategy. Since this is a multi-class problem, the limited-memory BFGS (L-BFGS) optimization algorithm was chosen as the solver, combined with the ridge (L2) penalty as the regularization term. Finally, the number of candidate values (Cs) for the inverse of the regularization strength (C) is set to 10.
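A configuration along these lines can be sketched with scikit-learn as follows. This is an illustration rather than the code used in this work. As an assumption made for compatibility across scikit-learn releases, the one-vs-rest strategy is obtained by wrapping the classifier in `OneVsRestClassifier` instead of passing the 'ovr' option, which has been deprecated in more recent releases. The helper `make_toy_data` is hypothetical and generates synthetic, well-separated data that has nothing to do with the PV dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.multiclass import OneVsRestClassifier


def build_classifier():
    """One LogisticRegressionCV per class, combined one-vs-rest."""
    base = LogisticRegressionCV(
        Cs=10,             # ten candidate values of C
        cv=3,              # three cross-validation folds
        solver="lbfgs",    # L-BFGS; the penalty defaults to L2
        max_iter=10_000,   # upper bound on solver iterations
    )
    return OneVsRestClassifier(base)


def make_toy_data(n_per_class=40, n_classes=5, n_features=8, seed=0):
    """Synthetic stand-in data: one Gaussian blob per class (illustrative only)."""
    rng = np.random.default_rng(seed)
    x_parts, y_parts = [], []
    for k in range(n_classes):
        center = np.zeros(n_features)
        center[k] = 6.0    # each class is shifted along its own feature axis
        x_parts.append(rng.normal(center, 1.0, size=(n_per_class, n_features)))
        y_parts.append(np.full(n_per_class, k))
    return np.vstack(x_parts), np.concatenate(y_parts)
```

After `clf = build_classifier().fit(X, y)`, the fitted wrapper holds one binary estimator per class in `clf.estimators_`, and `clf.predict` returns the class whose estimator gives the highest score.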

The application and the dataset were developed in Python 3.9 using the sklearn 1.1.2 [57] and PVlib 0.9 [58] libraries. The photovoltaic cell used in the dataset is the Solar Cells Hellas SCH6P-60 Multicrystalline Solar Cell [59].

Results and discussion

The previous sections presented the development of a machine learning algorithm based on logistic regression with cross-validation, capable of detecting and identifying faults on the DC side of a PVS. In this section, the experimental measurements of the method are listed and discussed, and the methods presented in the literature review are compared with each other and with the proposed method.

The experiments were performed on an AMD Ryzen 3 5400U processor with 8.00 GB of DDR4 RAM and a PCIe M.2 SSD. During the measurements, no processes were running in the foreground apart from the basic processes of the operating system, so that the results would be as accurate as possible.

The experimental data are divided into two tables: Table 3 contains the qualitative characteristics and Table 4 the quantitative characteristics.

Table 3 Experimental results—qualitative characteristics
Table 4 Experimental results—quantitative characteristics

Regarding the quantitative characteristics of the measurements, the average training time and the memory required for the training process are shown in Table 4. In more detail, the logistic regression took 113 s and consumed 6.68 MB of memory, while the process of fitting and training the model lasted 205 s and consumed 2.48 MB of memory. Using the model to predict the operating state of a PV cell takes 8 ms per prediction call, with a memory consumption of 180 KB per prediction call. To obtain the execution time and the memory allocated when calling the trained model, a loop of 100 iterations was used; the reported values (execution time, memory consumption) are the averages over those 100 iterations.
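An averaging loop of this kind can be sketched with the Python standard library. The tools actually used for the measurements are not stated in the text, so the choice of `time.perf_counter` and `tracemalloc`, as well as the function name `profile_calls`, are assumptions made for illustration.

```python
import time
import tracemalloc


def profile_calls(fn, n_calls=100):
    """Return (mean seconds per call, mean peak traced bytes per call).

    Note that tracemalloc adds overhead, so timings taken while tracing
    are pessimistic; separate loops could be used for time and memory.
    """
    times, peaks = [], []
    for _ in range(n_calls):
        tracemalloc.start()
        t0 = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        times.append(elapsed)
        peaks.append(peak)
    return sum(times) / n_calls, sum(peaks) / n_calls
```

For a trained classifier `clf` and a single sample `x`, the call would look like `profile_calls(lambda: clf.predict(x), n_calls=100)`, where both names are placeholders.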

In terms of the qualitative features of the measurements, Fig. 6 shows the confusion matrix after the classification and fitting process. It shows the true positive and true negative predictions as well as the type 1 and type 2 errors (false positives and false negatives). Fig. 7 depicts the receiver operating characteristic (ROC) curves for the five classifiers that were used. These curves display the performance of each classifier across all classification thresholds. The greater the area under the curve (AUC), the better the model distinguishes between classes. Based on the experimental data, the AUC is 99.8%, which is close to the ideal case: the data from true positives (TP) and true negatives (TN) overlap by less than 0.2%. When TP and TN do not overlap at all, the model provides an ideal measure of separability. This means that each of the classifiers used differentiates almost perfectly between its positive and negative classes.
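For a single one-vs-rest classifier, the AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; it is a small illustrative implementation with made-up scores, not the routine used to produce Fig. 7.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. labels must contain 0s and 1s.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: three of the four (positive, negative) pairs are ordered correctly.
example_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; sorting-based formulas are preferable for large datasets.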

Fig. 6 5 × 5 confusion matrix

Fig. 7 ROC curves for the five classifiers

The F1-score (0.949) combines two simpler performance metrics: precision and recall. Precision (0.955) indicates the proportion of predicted positives that are genuinely positive, while recall (0.945) indicates the proportion of actual positives that were correctly detected. The F1-score is the harmonic mean of the two metrics; the goal is to produce a single metric that weights precision and recall evenly. The accuracy (97.11%) indicates the proportion of correct predictions over the full dataset.
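These definitions can be made concrete with a short sketch that derives per-class precision, recall, and F1 from a confusion matrix whose rows are true classes and whose columns are predicted classes. The 2 × 2 matrix in the example is made up for illustration and is not the matrix of Fig. 6. Note also that the harmonic mean of the reported precision and recall is about 0.950; an F1-score averaged over classes need not coincide exactly with the harmonic mean of the averaged precision and recall.

```python
def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def per_class_metrics(cm):
    """Return a list of (precision, recall, f1), one tuple per class.

    cm[r][c] counts samples of true class r predicted as class c.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, actually another class
        fn = sum(cm[k]) - tp                        # actually k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics.append((precision, recall, harmonic_f1(precision, recall)))
    return metrics


def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


toy_cm = [[8, 2],
          [1, 9]]   # made-up counts for illustration
```

For `toy_cm`, class 0 has a precision of 8/9, a recall of 0.8, and an F1-score of 16/19, while the overall accuracy is 0.85.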

The measurements presented in Table 3 are all remarkably high, which makes the algorithm particularly reliable.

Table 5 compares the implemented method with the methods presented in the literature. The comparison is based on the accuracy of each method, its ability to identify the three main categories of faults on the DC side of a PVS, and its computational cost in terms of memory usage and execution time. In addition, the “comments” field of Table 5 lists other characteristics, advantages, or disadvantages of each method.

Table 5 Comparison of the methods

In Table 5, the developed method is compared with thirteen other methods presented in the literature over the last 5 years (2017–2022). Six methods [41,42,43, 46, 50, 53] can detect the three main types of faults on the DC side of a PVS (open-circuit faults, short-circuit faults, and mismatch faults). One method [47] can detect two types of faults (short-circuit faults and mismatch faults), and six methods [44, 45, 48, 49, 51, 52] can detect only one type of fault (open-circuit fault, arc fault, mismatch faults, short-circuit fault, arc fault, and mismatch faults, respectively). Regarding the accuracy of fault detection, nine methods [41, 42, 45, 47,48,49,50,51, 53] show an accuracy of more than 95%, while one method [46] shows an accuracy that fluctuates with the type of fault (89.6–98%). Finally, only two methods [43, 50] provide information about their computational cost, and the method presented in [50] provides data only for its execution time.

Compared with the 13 methods from the literature, the method presented in this work can detect the three basic categories of faults and, in addition, the existence of other faults. This fourth category refers to faults that do not correspond to any of the three basic categories studied here, but that can be regarded as a transition from normal operation to some faulty state, so monitoring them can act as a warning indicator of future problems. As far as accuracy is concerned, the method performs above 95% (more specifically 97.11%), and information about its computational cost in memory and execution time is provided (180 KB of RAM per call, 8 ms per call). It should be noted that most of the methods presented in the literature do not report their execution time or memory consumption. The best execution time is achieved by the presented method (8 ms), followed by the method presented in [43] (44 ms) and then the method presented in [50] (160–70 ms; the execution time varies with the operating status of the PVS). The execution time naturally depends on the hardware on which a method is executed, but methods based on ANNs and CNNs tend to have longer execution times because of the complexity of their models. The method presented in this work, on the other hand, is based on logistic regression, which is a simple and fast machine learning method compared to complex models such as deep neural networks. Logistic regression is a linear model, which means that it has a relatively small number of parameters and can be trained relatively quickly. It can also be executed very quickly, even on large datasets, owing to its linear nature and the efficient optimization algorithms available for training the model. For this reason, the presented method exhibits the shortest execution time.

The developed method is considered to perform comparably to the corresponding methods in the literature, while in some criteria, such as the identification of additional faults or the computational cost, it goes beyond the limits set by the literature. Installing it in a PVS built with quality materials, in order to limit the appearance of permanent mismatch faults, and equipped with AFCI and RCM sub-units, in order to immediately detect arcing and current leakage to ground, will help ensure the smooth operation of the PVS.

As future work, it is proposed to increase the number of records in the dataset in order to further increase the accuracy of the method. Furthermore, an additional categorization of faults with the use of appropriate hardware could separate fault subcategories, for example permanent and temporary mismatch faults.

Conclusions

The purpose of this work is to design an algorithm for the early detection and identification of faults that may occur on the DC side of a PVS. The results are particularly encouraging, since the method identifies with 97.11% accuracy the three main categories of faults on the DC side of a PVS (open-circuit faults, short-circuit faults, and permanent and temporary mismatch faults) and can also detect the presence of additional undefined faults. The latter category includes faults that do not belong to any of the three main categories, or that signify the transition from the normal operation of the photovoltaic cell to some faulty state. Furthermore, the method is quick and memory-efficient when used for prediction (180 KB of RAM per call, 8 ms per call). Compared with the existing methods from the literature, it provides similar levels of accuracy, while in the majority of cases it identifies more faults. It should not be overlooked that the method can be applied, with minor modifications, to typical PVS installations and not only to smart PVS. In smart PVS, the method can be used at the photovoltaic string level and also at the PV cell level, which is very important since it gives full real-time control over the state of each cell of the PVS. The results indicate that the method can be used in PVS-based power plants.