Introduction

Smart Grids (SG) represent an evolution of the concept of traditional power grids. While traditional power grids were centralized power systems, modern SGs rely on two-way IT-supported communication between energy providers and customers, which allows the delivery of electricity in a more efficient, reliable, and sustainable way (Fang et al. 2011). An SG is composed of several components, which can include Advanced Metering Infrastructure (AMI) and smart meters, Supervisory Control and Data Acquisition (SCADA) systems, sensors, and a multitude of network communication protocols (Gao et al. 2012; Yu et al. 2011; Chren et al. 2016). The complexity of the SG infrastructure is reflected in the multitude of faults and failures that can emerge from different components and their interrelations, often leading to complex failure scenarios with potential cascading and disruptive effects (Rivas and Abrao 2020; Mousa et al. 2019; Otuoze et al. 2018). It is therefore important to classify the possible faults and failures that can impact SGs, to examine their causes in order to derive preventive measures, and to examine their consequences in order to derive countermeasures.

The goal of this paper is to review and classify existing faults and failures in SGs to provide a summary view of the causes, consequences, and countermeasures that can be applied. To review the existing SG failures and faults we adopt the Systematic Literature Review (SLR) approach, collecting and classifying information from 50 articles that were retained after filtering. The types of faults/failures and their assignment to specific categories in the SG system were synthesised primarily from the causes, impacts and descriptions collected from the articles. The synthesised data was then examined and summarised to provide information concerning the most common types of faults and failures. Afterwards, we looked at common detection techniques and countermeasures. From the list of all determined impacts, we identified the most recurrent general consequences and connected them to the faults/failures causing them, also building chains of faults and failures that can be represented as graphs.

The main contribution of this paper is the collection of 30 faults/failures from 50 articles in the context of Smart Grids. Such faults/failures were defined, categorized, and then linked to the areas and domains of the Smart Grid Conceptual Model and the Smart Grid Architecture Model (SGAM) (Bruinenberg et al. 2012). Among others, we cover aspects such as causes, countermeasures, impacts, and faults/failures chains. Unlike some of the similar studies surveying the fault or security issues in SG (Rivas and Abrao 2020; Otuoze et al. 2018), we do not aim to propose new fault classification and architecture schemes for SGs. Instead, we map the extracted faults and failures to an established dependability taxonomy (Avizienis et al. 2004) as well as the SGAM model (Bruinenberg et al. 2012). The final classification can be useful for both practitioners and researchers dealing with dependability engineering in the context of Smart Grids.

The paper is structured as follows. In “Related work” section we discuss previous reviews of faults and failures in SGs, which cover power and security faults and failures and give classifications of different fault/failure types and detection techniques. In “Background” section, we review the concept of SG, in particular the SGAM model, and provide the main concepts related to reliability engineering that will be used in the remaining parts of the paper. In “Literature survey” section, we define the research questions and describe the methodology that has been applied for building the catalogue of faults/failures. In “Faults and failures in smart grids” section we answer the main research questions of the study, determining major faults and failures in SG, their classification by type, causes, impacts, consequences, fault chains, and countermeasures. In “Conclusions” section we conclude the paper.

Related work

To the best of our knowledge, few papers provide a summary view of fault and failure types in SGs. We collected the major previous studies in Table 1. With the ongoing interest in SG cybersecurity, several studies have listed the main attack types against SGs together with countermeasures, such as Mathas et al. (2020) and Wang and Lu (2013), in some cases dealing with security threats leading to SG failures (Otuoze et al. 2018). Other reviews are more focused on power faults (Mousa et al. 2019), and on fault classification and detection (Sarathkumar et al. 2021; Rivas and Abrao 2020).

Mathas et al. (2020) classify attacks on Smart Grids into confidentiality, integrity, and availability attacks. Confidentiality attacks such as passive eavesdropping, man-in-the-middle, and spoofing attacks can be detected and mitigated with countermeasures such as cryptographic signatures and inspection of network packets. Integrity attacks such as false data injection attacks can involve tampered data packets carrying false measurements. They can be counteracted with machine learning models and software infrastructure that incorporates cryptographic techniques. Availability attacks such as Denial of Service (DoS) happening at different layers (physical, MAC, network and transport layers) can be detected and mitigated by means of several monitoring and self-healing approaches.

The extensive survey by Wang and Lu (2013) identifies several challenges for the detection and mitigation of security threats. These challenges concern proactive countermeasures for DoS attacks, cryptographic measures for Smart Grids, and the design of secure network protocols and architectures.

Otuoze et al. (2018) provide a review of various security threats and challenges that are not failures in themselves but may result in a failure. The authors distinguish between technical and non-technical sources of SG threats. Technical sources deal with infrastructure security (such as Advanced Metering Infrastructure (AMI), smart meters and power theft), technical operational security, and system data management security. Non-technical sources of security threats are related to environmental threats (e.g., earthquakes) and governmental regulatory policies.

In Mousa et al. (2019) different types of power faults, impacts, and countermeasures are presented. Power faults can be classified into short circuits and open circuit faults with incipient, abrupt and intermittent categories. These faults can be detected by means of several monitoring techniques, such as using wavelet transforms to detect the duration of disturbances in the power signal.

A brief summary of faults in smart grid infrastructure is provided by Hlalele et al. (2019). They distinguish between faults related to power distribution, photovoltaic systems and wind turbines, and outline options for fault identification.

The most comprehensive summaries of faults similar to the current review are Rivas and Abrao (2020) and Sarathkumar et al. (2021), which deal with the classification of faults and the discussion of countermeasures.

Rivas and Abrao (2020) mostly focus on monitoring and detection techniques, dividing Smart Grid faults into physical device, communication and hardware/software faults. Sensors and monitoring capabilities can be adopted to provide self-healing mechanisms. The authors provide 65 fault detection and location approaches that were discussed in previous research (e.g., real-time anomaly detection of smart meter data). Methods for fault detection and location are divided into impedance-based methods (e.g., waveform measurements to detect power disturbances), analytical methods (e.g., signal processing techniques), and learning-based methods (e.g., forecasting with Artificial Neural Networks (ANN)).

Sarathkumar et al. (2021) provide 15 common types of faults, which are discussed according to causes, effects and diagnosis methods. The faults are collected from previously published research papers. Similarly to previous reviews (e.g., Mousa et al. 2019), faults are classified into incipient, abrupt, and intermittent faults.

Compared to previous reviews, our survey of faults and failures makes the following main contributions:

  1.

    the provision of a list of 30 faults/failures in the context of Smart Grids that are linked to the areas and domains of the SGAM model (Bruinenberg et al. 2012) as well as to one of the main dependability taxonomies (Avizienis et al. 2004) (in “Faults and failures in smart grids” section). Unlike the related work (e.g. Rivas and Abrao (2020); Sarathkumar et al. (2021)), which proposed custom taxonomies, we believe that grounding our classification in already established taxonomies is more beneficial for practitioners, as they could find the context more familiar;

  2.

    the first review attempting to link the most common consequences and chains of faults and failures in the form of graphs (in “Causes and impacts (RQ4)” section);

  3.

    to the best of our knowledge, the first review of this kind conducted as a Systematic Literature Review (SLR), tracing the sources throughout the process.

Table 1 Related works

Background

Smart grids

In this study, we use SGAM (Bruinenberg et al. 2012) when referring to smart grid elements, as shown in Fig. 1. The SGAM is a three-dimensional, multi-layered framework that consists of interoperability layers mapped onto the smart grid plane. The smart grid plane is spanned by physical electrical domains and information management zones. The purpose of this model is to indicate in which zones of information management the interactions between domains take place. It allows the presentation of the current state of implementations in the electrical grid and also depicts the evolution towards future smart grid scenarios.

Fig. 1 The SGAM framework (Bruinenberg et al. 2012)

In this section, we briefly present the three SGAM dimensions that are used to organize the survey results and classification.

The SGAM domains represent a set of roles and services involved in the energy industry:

  • Generation: generators of electrical energy in bulk quantities (e.g. fossil, nuclear and large-scale hydropower plants) that are connected to the transmission system.

  • Transmission: infrastructure and organization responsible for carrying bulk electricity over long distances.

  • Distribution: infrastructure and organization responsible for delivering electricity to and from customers.

  • DER: small-scale distributed energy resources connected directly to the distribution system; may also include energy storage devices.

  • Customer premises: industrial, commercial and residential end-users of electricity managing their use of energy; they may also act as producers or store energy.

Smart grids largely depend on the interconnection and information exchange between different systems. Within SGAM, such interoperability is described by the five layers (Bruinenberg et al. 2012):

  • Business layer includes regulatory and economic structures and policies, business models, business portfolios of market parties involved. Business capabilities and business processes are also part of this layer.

  • Function layer represents functions and services provided by SG and their relationships from an architectural viewpoint. Functions are described as extracted use case functionalities separated from actors.

  • Information layer deals with the format and semantics of information exchanged between functions, services and components to ensure interoperable exchange of information during communication. It includes information objects and data models.

  • Communication layer is responsible for interoperable communication by describing appropriate protocols and mechanisms.

  • Component layer encompasses all the components of the SG and their physical distribution.

Finally, the SGAM zones represent the hierarchical levels of power system management, aggregation and functional separation. The aggregation can be on a data level or spatial level. The former deals with aggregating the data from the field zone to the station zone in order to reduce the volumes of data to be sent to and processed by the operation zone. The latter represents, for example, aggregation from distinct locations to wider areas or the aggregation of data from customers’ smart meters by data concentrators in the neighbourhood, as there are many data analysis techniques that can be applied in the context of SGs (Rossi and Chren 2019).

  • Process zone represents all the primary power grid equipment designed for energy generation, transmission and distribution (e.g. generators, transformers, circuit breakers, overhead lines, cables, electrical loads). Physical energy conversion is also part of this zone (electricity, solar, heat, water, wind).

  • Field zone consists of equipment to protect, control and monitor the process of the power system (e.g. protection relays, bay controllers and intelligent electronic devices (IEDs) which receive and utilize power system process data).

  • Station zone describes the aggregation level for fields, e.g. for data concentration, substation automation.

  • Operation zone consists of all sorts of management systems controlling different parts of the grid such as distribution management systems, energy management systems in generation and transmission systems, microgrid management systems, virtual power plant management systems (aggregating several DER), electric vehicle fleet charging management systems etc.

  • Enterprise zone refers to the commercial and organizational processes, services and infrastructures for enterprises (e.g. asset management, staff training, customer relation management, billing and procurement).

  • Market zone includes operations of the market domain such as energy trading, mass market, retail market.
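The three SGAM dimensions listed above lend themselves to a simple enumeration-based encoding, which can be convenient when tagging catalogue entries. The following Python sketch is only an illustration: the class names and the `SgamCell` helper are our own assumptions and are not defined by the SGAM itself.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    GENERATION = "Generation"
    TRANSMISSION = "Transmission"
    DISTRIBUTION = "Distribution"
    DER = "DER"
    CUSTOMER_PREMISES = "Customer premises"


class Layer(Enum):
    BUSINESS = "Business"
    FUNCTION = "Function"
    INFORMATION = "Information"
    COMMUNICATION = "Communication"
    COMPONENT = "Component"


class Zone(Enum):
    PROCESS = "Process"
    FIELD = "Field"
    STATION = "Station"
    OPERATION = "Operation"
    ENTERPRISE = "Enterprise"
    MARKET = "Market"


@dataclass(frozen=True)
class SgamCell:
    """One position in the SGAM cube (hypothetical helper type)."""
    domain: Domain
    layer: Layer
    zone: Zone


# Number of distinct positions spanned by the three dimensions: 5 * 5 * 6.
cell_count = len(Domain) * len(Layer) * len(Zone)

# Example position: a protection relay seen as a component in the field zone
# of the distribution domain (illustrative choice).
relay_cell = SgamCell(Domain.DISTRIBUTION, Layer.COMPONENT, Zone.FIELD)
```

Using frozen dataclasses makes the cells hashable, so they can serve as dictionary keys when counting faults per position.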

Reliability engineering

For the classification of faults and failures, we adopt the general taxonomy proposed by Avizienis et al. (2004), in which failures can be classified by four criteria:

  • Failure domain distinguishes content failures from timing failures. A content failure occurs when the information delivered by the service differs from the desired (implemented) form. A timing failure occurs when the service is delivered at the incorrect time or for the wrong duration; it can be further classified as early or late, depending on whether the service is delivered too soon or too late. A content and timing failure is a combination of the two: if the system’s activity is no longer perceptible, it is called a halt failure, whereas if the service is still delivered but its content and timing are off, it is called an erratic failure.

  • Consistency considers the view of different users. A Consistent failure is perceived identically by all users, whereas an Inconsistent failure is perceived differently by different users.

  • Detectability determines whether a failure was signalled to the user. A Signalled failure is detected and a warning signal is sent. Otherwise, it is an Unsignalled failure.

  • Consequences determine the severity of failure’s impact. Failures span the range from minor to catastrophic consequences.

Additionally, besides service failures, there are also Dependability (or security) failures, which occur when service failures are more frequent or more severe than is acceptable, and Development failures, which occur when the development process is terminated before the system is placed into service.
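The failure domain criterion described above amounts to a small decision rule over whether the content and the timing of the delivered service are correct. A minimal Python sketch of that rule follows; the function name and its boolean parameters are illustrative assumptions, and the early/late refinement of timing failures is left out for brevity.

```python
from enum import Enum
from typing import Optional


class FailureDomain(Enum):
    CONTENT = "content failure"
    TIMING = "timing failure"    # could be refined into early or late
    HALT = "halt failure"
    ERRATIC = "erratic failure"


def failure_domain(content_ok: bool, timing_ok: bool,
                   service_perceptible: bool = True) -> Optional[FailureDomain]:
    """Return the failure domain, or None when the service is correct."""
    if content_ok and timing_ok:
        return None
    if not content_ok and not timing_ok:
        # Combined content and timing failure: halt if nothing is perceptible
        # any more, erratic if a (wrong) service is still delivered.
        return FailureDomain.ERRATIC if service_perceptible else FailureDomain.HALT
    return FailureDomain.TIMING if content_ok else FailureDomain.CONTENT
```

For example, `failure_domain(content_ok=True, timing_ok=False)` yields a timing failure, while `failure_domain(False, False, service_perceptible=False)` yields a halt failure.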

Faults can be classified according to eight criteria (Avizienis et al. 2004):

  • Phase of creation determines when the fault occurred. It can occur either during system development and maintenance (Development fault) or during the system’s operation phase (Operational fault).

  • System boundaries show where the fault originates from. It can arise within the system (Internal fault) or from outside the system boundary (External fault).

  • Phenomenological cause depends on whether human activities were involved. A fault can be caused by natural phenomena (Natural fault) or result from human actions (Human-made fault).

  • Objective can be specified in the case of human-made faults. We distinguish faults introduced with the intention of causing harm (Malicious fault) from those without a malignant purpose (Non-malicious fault).

  • Intent refers to the intention behind non-malicious human-made faults. They can be the outcome of a harmful decision (Deliberate fault) or be caused without awareness (Non-deliberate fault).

  • Capability considers the competence behind non-malicious human-made faults. An Accidental fault happens by mistake, and an Incompetence fault results from a lack of professional competence.

  • Dimension distinguishes between a Hardware fault, affecting physical components, and a Software fault, occurring in programs or data.

  • Persistence considers the duration of faults, whose presence can be continuous in time (Permanent fault) or bounded in time (Transient fault).
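The eight criteria are not fully independent: as described above, the objective only applies to human-made faults, and intent and capability only to non-malicious ones. The Python sketch below encodes these constraints in a record type. It is an illustration under our own naming assumptions (`FaultClass` and the enumeration names are hypothetical), and the example values are not taken from the survey results.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

Phase = Enum("Phase", "DEVELOPMENT OPERATIONAL")
Boundary = Enum("Boundary", "INTERNAL EXTERNAL")
Cause = Enum("Cause", "NATURAL HUMAN_MADE")
Objective = Enum("Objective", "MALICIOUS NON_MALICIOUS")
Intent = Enum("Intent", "DELIBERATE NON_DELIBERATE")
Capability = Enum("Capability", "ACCIDENTAL INCOMPETENCE")
Dimension = Enum("Dimension", "HARDWARE SOFTWARE")
Persistence = Enum("Persistence", "PERMANENT TRANSIENT")


@dataclass(frozen=True)
class FaultClass:
    phase: Phase
    boundary: Boundary
    cause: Cause
    dimension: Dimension
    persistence: Persistence
    objective: Optional[Objective] = None    # human-made faults only
    intent: Optional[Intent] = None          # non-malicious faults only
    capability: Optional[Capability] = None  # non-malicious faults only

    def __post_init__(self) -> None:
        if self.cause is Cause.NATURAL and self.objective is not None:
            raise ValueError("objective applies to human-made faults only")
        refined = self.intent is not None or self.capability is not None
        if refined and self.objective is not Objective.NON_MALICIOUS:
            raise ValueError("intent and capability apply to non-malicious faults only")


# Illustrative classification of a jamming attack (assumed values).
jamming = FaultClass(Phase.OPERATIONAL, Boundary.EXTERNAL, Cause.HUMAN_MADE,
                     Dimension.HARDWARE, Persistence.TRANSIENT,
                     objective=Objective.MALICIOUS)
```

Rejecting inconsistent combinations at construction time keeps a catalogue from silently containing, for instance, a natural fault labelled as malicious.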

Apart from the classification of faults and failures, we further investigate details that could be helpful for smart grid stakeholders. In accordance with reliability engineering goals and inspired by failure mode and effects analysis (FMEA) (Stamatis 2003), we extract information about the causes, impacts, detection techniques and countermeasures of faults and failures.

Literature survey

We adopted the Systematic Literature Review (SLR) (Keele 2007) methodology. An SLR provides a structured method to conduct detailed surveys of a given topic and is a suitable approach for the goals of this article, as the identification of fault/failure types requires a detailed search of the published research to gather information about faults/failures in the context of Smart Grids.

To carry out the SLR, we followed the SLR guidelines (Keele 2007) for the planning, execution, and reporting of the review. We next describe the SLR process and provide the review protocol. First, we mentioned pre-existing studies that related to the topic in a previous part of the paper (“Related work” section). After that, we define research questions (“Research questions (RQs)” section) and specify the search strategy (“Search strategy” section) to clarify what information will be searched and how. Results of the search need to be examined and filtered with respect to a set of chosen selection criteria (“Study selection criteria” section).

Research questions (RQs)

With respect to the aim of the review, the following questions were considered:

  1. RQ1

    What faults and failures occur in smart grids? The aim of this RQ is to provide an extensive list of all the different faults and failures that are reported by SG research.

  2. RQ2

    In what component of smart grids are the faults/failures involved? The goal of this RQ is to classify the failures and faults into SGAM domains, layers, and zones to see how many failures and faults are propagating in these different contexts.

  3. RQ3

    What is the type of a particular discovered fault/failure? The goal of this RQ is to provide the types of failures and faults in SG according to an orthogonal classification (e.g., hardware/software related, operational, etc.).

  4. RQ4

    What are the causes and impacts of the faults/failures? The goal of this RQ is to provide a graphical representation linking the faults and failures to their usual causes and consequences.

  5. RQ5

    What detection techniques and countermeasures are used in connection with a given fault/failure? The goal of this RQ is to provide an extensive list of any detection techniques that are commonly adopted for the detection of faults/failures and then common countermeasures put into place to respond to the critical situation.

Search strategy

The search for the review was conducted within three digital libraries, namely ACM Digital Library, IEEE Xplore and Elsevier ScienceDirect. With regard to search terms, three variants were considered. After defining the best combination of terms, following the Patient, Intervention, Comparison and Outcome (PICO) suggestions for building the query (Frandsen et al. 2020), we adopted the following query:

  • (“smart grid” AND fault) OR (“smart grid” AND failure) OR (“power grid” AND fault) OR (“power grid” AND failure)
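The query is a disjunction of four conjunctions, which is equivalent to requiring one of the two grid terms together with one of the two fault terms. As a rough illustration, the Python sketch below applies this logic to a title or abstract. The function name is hypothetical, and plain case-insensitive substring matching is a simplifying assumption; the search engines of the digital libraries apply their own matching rules.

```python
GRID_TERMS = ("smart grid", "power grid")
FAULT_TERMS = ("fault", "failure")


def matches_query(text: str) -> bool:
    """True if the text contains a grid term together with a fault term."""
    lowered = text.lower()
    return any(grid in lowered and term in lowered
               for grid in GRID_TERMS
               for term in FAULT_TERMS)
```

For instance, a title such as "Fault detection in the smart grid" satisfies the query, whereas "Smart grid pricing" does not.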

As we needed very specific types of papers to collect faults and failures, we adopted a specific search strategy: first collecting a set of core relevant papers and then looking at the referenced papers that could provide more relevant results (so-called snowballing in SLR terminology (Wohlin 2014)). We ran the search query in each repository and shortlisted 20 studies per repository that were considered relevant after reviewing the abstracts (Fig. 2). This led to a core set of 60 papers. We then reviewed the reference lists of these core papers and added further papers that were considered relevant based on their titles. All included papers were then refined by looking at the abstracts.

Fig. 2 The SLR process with the number of articles in each phase

Study selection criteria

To determine which papers to accept or reject, inclusion and exclusion criteria were formulated. The inclusion criteria are:

  1. IC1

    studies that include a description of a fault/failure in a smart grid and possibly its causes and consequences,

  2. IC2

    studies published in journals and conference proceedings,

  3. IC3

    year of publication in the range of 2010 - 2021,

  4. IC4

    English language only.

As for the exclusion criteria, we defined the following:

  1. EC1

    studies that do not concern faults/failures in smart grids,

  2. EC2

    studies discussing only faults in general (e.g. fault tolerance, fault detection etc.) that do not mention any specific fault/failure.
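The inclusion and exclusion criteria can be read as a single predicate over a candidate study. The following Python sketch shows one possible encoding; the `Study` type and its field names are illustrative assumptions, and the two boolean fields stand for reviewer judgments that cannot be automated.

```python
from dataclasses import dataclass


@dataclass
class Study:
    title: str
    year: int
    venue_type: str             # e.g. "journal" or "conference"
    language: str
    describes_sg_fault: bool    # reviewer judgment (IC1 / EC1)
    names_specific_fault: bool  # reviewer judgment (EC2)


def is_selected(study: Study) -> bool:
    """Apply IC1-IC4 and EC1-EC2 to one candidate study."""
    included = (
        study.describes_sg_fault                             # IC1
        and study.venue_type in {"journal", "conference"}    # IC2
        and 2010 <= study.year <= 2021                       # IC3
        and study.language == "English"                      # IC4
    )
    excluded = (
        not study.describes_sg_fault      # EC1
        or not study.names_specific_fault  # EC2
    )
    return included and not excluded
```

Keeping the judgment-based fields explicit makes it easy to record, per paper, which criterion led to a rejection.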

Study selection process

Since the review was targeted at the identification of specific failures and faults, we followed a search approach that first attempted to include the largest possible number of papers and then filtered them based on the most relevant references. Because of the extensive number of search results, we sorted the results by relevance and thoroughly examined the most relevant papers in every digital library along with their promising references. The total number of examined studies was 189 (20 from each of the three digital libraries and 129 relevant references). The primary resulting studies often did not provide sufficient findings, but they provided many potentially relevant references.

Thereafter, the full texts of all chosen papers were read and checked for the fulfilment of the remaining criteria: only the papers containing a description of a fault/failure with at least a brief mention of its causes or consequences were included in the review’s results. Ultimately, 50 papers were selected, from which 30 different fault or failure types were extracted. During the whole review, a list of all examined papers was maintained with notes about their acceptance or rejection. The list of all papers surveyed and the final table with all the failures and faults collected are available in a downloadable dataset (Authors 2022).

Data extraction and synthesis

Data extracted from the selected studies encompass the following information about a fault or failure:

  • name,

  • description,

  • type,

  • causes,

  • detection techniques,

  • involved components,

  • impacts,

  • countermeasures.

While data items such as name, description, causes, detection techniques, impacts and countermeasures were usually extracted directly from a study, the involved components and the type of a fault/failure often had to be determined from the context, using the SGAM model of domains, layers and zones (Fig. 1) and the types of faults and failures collected during the review.
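The extracted data items can be thought of as one record per fault or failure. The Python sketch below shows such a record with the eight items listed above; the type name, field names and the `missing` helper are illustrative assumptions rather than the format of the published dataset. The example is filled only with what the description of F1 states, so the remaining items show up as still to be determined.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class FaultRecord:
    name: str
    description: str
    fault_type: List[str] = field(default_factory=list)
    causes: List[str] = field(default_factory=list)
    detection_techniques: List[str] = field(default_factory=list)
    involved_components: List[str] = field(default_factory=list)
    impacts: List[str] = field(default_factory=list)
    countermeasures: List[str] = field(default_factory=list)

    def missing(self) -> List[str]:
        """Names of the data items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


f1 = FaultRecord(
    name="Connection loss between the smart meter and local controller",
    description="Wireless communication between the smart meter and the "
                "controller is disrupted.",
    causes=["channel jamming attack"],
)
todo = f1.missing()
```

Here `todo` lists the items that would have to be determined from the context, such as the involved components and the fault type.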

Faults and failures in smart grids

In the next sections we answer the five research questions (set in “Research questions (RQs)” section) by building a catalogue of SG faults and failures, mapping them to the SGAM levels, and extracting information about causes, consequences, detection techniques and countermeasures.

Overview of faults and failures (RQ1)

In the list below, we answer question RQ1 by reporting the 30 faults and failures found, each with a brief description. We also list a total of 50 references to the studies from which the data about specific faults/failures were extracted.

F1:

Connection loss between the smart meter and local controller—Wireless communication between the smart meter and the controller is disrupted because the particular communication channel is unavailable due to a channel jamming attack (Alohali and Vassilakis 2017; Mathas et al. 2020; Wang and Lu 2013; Liu et al. 2017).

F2:

Connection loss from all IEDs to the substation gateway—IEDs are responsible for monitoring and controlling automated devices in distribution and they can perform operations such as tripping circuit breakers if they sense voltage, current, or frequency anomalies. If their connection to the substation is lost, those operations cannot be performed correctly (Mathas et al. 2020; Wang and Lu 2013).

F3:

Collision in a Wireless Sensor Network (WSN)—Collisions can occur when a large number of messages are sent (possibly premeditatedly by an attacker), that can interfere with normal protocol communication (Alohali and Vassilakis 2017).

F4:

Maliciously forged identities in a WSN—A single malicious node can forge many identities and therefore mislead the legal nodes (Alohali and Vassilakis 2017; Najafabadi et al. 2013).

F5:

Data aggregator’s buffer overflow—The event buffer of a data aggregator is filled, and therefore is unable to buffer critical alerts (Mathas et al. 2020; Wang and Lu 2013; Jin et al. 2011).

F6:

Black hole in the network—In a communication network a node can drop a certain portion (possibly all) of packets instead of forwarding them further (Kaplantzis and Şekercioğlu 2012).

F7:

Software Defined Network (SDN) controller failure—With the use of software-defined networking in SG communications, the SDN controller can be seen as a single point of failure, as it is solely responsible for flow control in a network (Ghosh et al. 2016).

F8:

Desynchronized measurements—Measurements like consumption and production values have to be synchronized, often with the use of GPS for obtaining a time stamp. If a GPS signal is forged, then measurements are sent to the WAMS (wide-area monitoring system) with wrong timestamps and therefore not synchronized (Mathas et al. 2020; Gong et al. 2012).

F9:

False state estimate—A key function in building a real-time network model in the energy management system is the state estimation, based on data periodically collected from remote meters. False state estimates can be a consequence of random errors in measurements or bad data injection attacks (Liu et al. 2013; Mathas et al. 2020; Wang and Lu 2013; Cui et al. 2012; Liu et al. 2013).

F10:

Programmable logic controller (PLC) hijacked—During the Stuxnet worm attack on Iran’s nuclear facilities discovered in 2010, the PLCs were hijacked by inserting a rogue code into the controllers. Thereafter, the controllers were monitored and eventually, the rogue code took control without the legitimate controller code noticing (Langner 2011; Trellix 2021).

F11:

Inconsistent energy consumption reports—Data concerning energy consumption can be tampered with locally or remotely either before being sent to smart meters, inside the smart meters or over the communication links. For example, the reported energy consumption can be smaller than the actual one which is done in order to pay less than the real price for the consumed energy (an energy theft) (Jokar et al. 2016).

F12:

Privacy leakage—Malicious users can access smart meters to obtain collected fine-grained power usage data and therefore invade customers’ privacy (Birman et al. 2015; Federal Office for Information Security (BSI) 2013; Wang and Lu 2013; McDaniel and McLaughlin 2009).

F13:

Compromised price signals—The real-time prices advertised to smart meters are compromised by a scaling factor (so that the meters will use the wrong prices) or by corrupted timing information (so that the meters will use old prices) (Tan et al. 2013).

F14:

Inconsistent state messages—In a distributed energy routing, nodes inform each other how much energy they request or supply. In addition, correct energy link state information is also needed for the energy routing process. Spreading incorrect information can disrupt the energy distribution process (Lin et al. 2012).

F15:

Frequency variation—A stable frequency synchronized throughout the whole electrical grid is required for the grid’s stability. Frequency pushed outside the 47-52Hz range (50Hz being the optimal value in Europe) can cause instability of the electrical grid possibly leading to a total blackout (Costache et al. 2011; Samarakoon and Ekanayake 2009).

F16:

Voltage variation—Tolerance limits for voltage variation are +10 % and -15 % around the optimal value (230V in Europe). Manifestations of voltage variations include short interruptions, flickers, voltage dips, supply voltage variations and harmonic disturbances (Costache et al. 2011).

F17:

Transformer failure—Transformers are crucial constituents of electrical transmission and distribution systems and they can fail due to many different causes (Bhatt et al. 2014).

F18:

Series fault—Also known as an open circuit fault, occurs when one or more conductors (phases) open in the system due to a broken line. It can be further divided into unsymmetrical and symmetrical series faults (Mousa et al. 2019; Gururajapathy et al. 2017; ElectronicsHub 2015).

F19:

Shunt fault—Alternatively called a short circuit fault, it represents an abnormal connection of very low impedance between two points of different potential, whether made intentionally or accidentally. There are different types of shunt faults, such as Single line to ground fault (most common, least severe); Line to line fault (second most common, less severe); Double line to ground fault (less common, more severe); and Three line to ground fault (Mousa et al. 2019; Gururajapathy et al. 2017; ElectronicsHub 2015).

F20:

Geomagnetically Induced Currents (GIC) in the power grid—Geomagnetic storms induce GIC in the power grid, that then flows through the power transformer causing half-wave saturation of the iron core and generating a large amount of reactive power loss, possibly resulting in cascading failures and large-scale blackouts (Kang et al. 2019).

F21:

Flashover fault in a transmission line—Various natural phenomena like forest fires can cause an electric discharge - a flashover in a transmission line (Yue et al. 2017).

F22:

Transmission line break off—A transmission line can break off due to weather factors like ice or wind that can increase the mechanical stress of the line (Jin et al. 2017).

F23:

Lightning stroke trip-out of a transmission line—Lightning stroke presents an important threat to the power grid infrastructure, specifically, transmission lines are often exposed to lightning (Li et al. 2016; Bakar et al. 2013).

F24:

Cascading failure—The effect of one or a few component (typically transmission line) failures, leading to the failure of a sequence of interconnected components in a networked system (Chen et al. 2014; Min and Varadharajan 2016; Bernstein et al. 2012, 2012; Wang et al. 2017; Eppstein and Hines 2012; Wei et al. 2019).

F25:

Fault current—The rising integration of renewable energy sources in the smart grid increases the fault current level of the system (Reddy and Chatterjee 2017; Jangale and Thakur 2017; Liu et al. 2019; Rajaei et al. 2014; Rajaei and Salama 2015).

F26:

Hurricane damage—Hurricanes can have devastating consequences on a power grid’s generation, transmission and distribution, like in the case of Hurricane Maria in Puerto Rico, 2017 (Kwasinski et al. 2019; Menasché et al. 2014).

F27:

Supply uncertainty in DERs power generation—Uncertainty arises when the amount of energy generated by the DERs deviates from the generation schedule due to factors such as changes in wind speed and sunlight intensity or equipment failure (Yang and Walid 2014).

F28:

Tripping of a distributed generator in a microgrid—Due to the intermittent nature of its distributed generators, a microgrid in islanding mode can suffer from severe frequency deviation during the post-fault condition, which can eventually lead to tripping of the generators (Mousa et al. 2019; Kabir et al. 2014; Arif and Aziz 2017).

F29:

DC series arc fault in photovoltaic (PV) systems—The rise of PV systems and the trend toward higher DC voltage levels may create DC arc faults. DC arcing appears across small gaps in connections (Lu et al. 2017).

F30:

Wind turbine gearbox failure—The wind turbine gearbox transfers mechanical energy to the generator at high speed. It is one of the most fragile components, being responsible for 59 % of total wind turbine failures (Wang et al. 2017).

From the list of discovered faults and failures, it is apparent that the faults and failures reported in the literature vary widely. However, half of them are referenced by multiple sources, with F24, F25, F1, and F12 being the most referenced. Additionally, the literature covers faults and failures at different levels of abstraction, ranging from general ones, such as F24, to very specific ones, such as F7 or F20.

Domain, layer and zone classification (RQ2)

The faults and failures were mapped into SGAM domains, layers and zones as defined in “Background” section based on their characteristics. In the case of zones and domains, one fault/failure could be assigned to more than one zone or domain, depending on its origin and the range of its impact. The resulting mapping is shown in Fig. 3.

Fig. 3 RQ2. Mapping of faults and failures to SGAM domains, zones and layers

All SGAM domains were covered by at least two faults or failures. The most frequent domain was Transmission with 18 distinct faults and failures, closely followed by Distribution with 16.

In terms of SGAM zones, we were able to map faults and failures to the Process, Station, Operation and Market zones. We did not find any suitable fit for the Enterprise and Field zones. In the latter case, some faults and failures initially seemed close to the Field zone, but after careful examination we attributed them to the Operation and Station zones. The most frequent were the Process and Operation zones with 11 and 10 faults and failures respectively, spread across all the domains.

From the SGAM layers perspective, the Component and Communication layers were the most prominent, with 16 and 9 faults and failures respectively. The Information and Function layers were rarer, with only two findings each. Additionally, we did not discover any faults and failures related to the Business layer. An interesting observation can be made about the relation between the layers and zones. All Component layer faults and failures are present only in the Process zone. The Communication layer faults and failures are distributed between the Operation and Station zones with frequent overlaps, especially for F3 and F4. On the other hand, there are no Communication layer faults and failures in the Process and Market zones.
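The mapping described above can be pictured as a small record per fault/failure that is then counted per zone, domain or layer. The following is a minimal Python sketch under stated assumptions: the names `Mapping`, `count_by` and `component_layer_only_in_process` are hypothetical, and the three entries use placeholder identifiers (X1 to X3) that only mirror the pattern reported in the text. They are not the values shown in Fig. 3.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One fault/failure mapped to SGAM; zones and domains may hold several values."""
    fault_id: str
    domains: frozenset
    zones: frozenset
    layer: str


def count_by(mappings, attribute):
    """Count how many faults/failures touch each value of a set-valued attribute."""
    counts = Counter()
    for m in mappings:
        counts.update(getattr(m, attribute))
    return counts


def component_layer_only_in_process(mappings):
    """Check the observation that Component layer findings sit only in the Process zone."""
    return all(
        m.zones == frozenset({"Process"})
        for m in mappings
        if m.layer == "Component"
    )


# Placeholder entries (assumptions for illustration, not data from the review).
EXAMPLE_MAPPINGS = [
    Mapping("X1", frozenset({"Transmission"}), frozenset({"Process"}), "Component"),
    Mapping("X2", frozenset({"Distribution"}),
            frozenset({"Operation", "Station"}), "Communication"),
    Mapping("X3", frozenset({"Distribution", "Transmission"}),
            frozenset({"Operation", "Station"}), "Communication"),
]

if __name__ == "__main__":
    print(count_by(EXAMPLE_MAPPINGS, "zones"))
    print(count_by(EXAMPLE_MAPPINGS, "domains"))
    print(component_layer_only_in_process(EXAMPLE_MAPPINGS))
```

Because zones and domains are stored as sets, a finding that spans several of them is counted once per zone or domain, which is why the per-domain totals can exceed the number of findings.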

Table 2 RQ3 (Types of faults according to taxonomies defined in “Reliability engineering” section)
Table 3 RQ3 (Types of failures)

Classification of faults and failures (RQ3)

The classification of faults and failures described in “Reliability engineering” section was applied to the findings listed earlier in “Overview of faults and failures (RQ1)” section. The findings were classified based on Tables 4 and 7. Additionally, the full texts of the related papers were consulted for better comprehension of each fault/failure’s characteristics.

First of all, we divided the findings into faults and failures. However, some failures could also be considered faults, since they may lead to another failure. As a result, some findings are labelled both as a fault and a failure.

Table 2 includes 27 faults classified by 12 attributes. All faults were operational; nevertheless, 3 of them (F7, F17, F19) could also be identified as development faults depending on the circumstances. A similar situation appears in other categories, such as internal-external and HW-SW, because one particular fault can have multiple different causes (more in “Causes and impacts (RQ4)” section) and can therefore fall into different categories. External faults appeared more frequently than internal ones. The same number of natural and human-made faults was found, although malicious faults significantly outweighed non-malicious ones (deliberate and non-deliberate). The capability category (accidental and incompetence faults) was not taken into consideration, as the available information about the faults was not sufficient to determine it. Hardware faults were slightly more common than software faults, and the persistence category was divided more or less equally.

Regarding failures, Table 3 presents 17 different failures categorized by failure domain and consistency. On top of that, failure F24 is assigned to a special category of Dependability failures, because it presents a very serious threat due to its severe consequences. Out of all the failures, 15 were also mentioned as faults. As for the domain category, most of the failures were designated as halt failures; some content and late timing failures also appeared. The consistency category ended up balanced. Just as in the case of faults, some failures were assigned to more than one type within a category.

After careful consideration, one finding, F27, was marked as neither a fault nor a failure but as an error, more specifically a content, inconsistent error. The supply uncertainty is caused by a perturbation of the amount of generated energy, which could be perceived as a fault, and it may lead to a failure such as a power outage.

Table 4 RQ4 (Impacts of found faults and failures)

Causes and impacts (RQ4)

We report the findings for RQ4 in Tables 4 and 7, listing every cause and impact of the found faults/failures that we were able to extract from the reviewed studies. We report only the causes and impacts that were extracted from the SG literature collected through the SLR. Other causes and impacts may exist for individual faults and failures, especially when the fault or failure is more generic and can occur in different domains.

We identified the most common consequences of the found faults or failures:

  • Power outage (14 causes)

  • Financial loss (9 causes)

  • Equipment damage (8 causes)

  • Loss of network connectivity (5 causes)

These consequences are pictured in Fig. 4 along with the faults/failures that may cause them.

Fig. 4 RQ4. Most common consequences and their causes

Furthermore, we present the results of our effort to assemble representative faults and failures into a chronological sequence based on their causes and impacts. The goal is to depict graphically how one fault/failure can lead to another, eventually forming a chain of subsequent faults/failures, similar to the fundamental chain of dependability and security threats described by Avizienis et al. (2004).

Fig. 5 RQ4. The fault chain

The outcome is presented in Fig. 5, which consists of two separate groups. The major group encompasses 15 faults or failures and 20 associations among them, where arrows point to the consequent event. In particular, it is worth mentioning the cascading failure F24, since it is associated with many others as a consequence. We can also notice the cyclic relationship with voltage variation F16, meaning that a fluctuation of voltage levels can cause a cascading failure, which may lead to further voltage variation. Additionally, the minor group contains a simple relationship of lost communication to smart meter F1 with false state estimate F9.
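A fault chain of this kind can be represented as a directed graph in which an edge points from a fault/failure to its consequent event. The sketch below, in Python, encodes only the relationships stated in the paragraph above: the cycle between F16 and F24, and the link between F1 and F9, whose direction (F1 leading to F9) is an assumption. The function names are hypothetical, and the remaining associations of Fig. 5 are not reproduced.

```python
from collections import defaultdict

# Edges point from a fault/failure to its possible consequence.
EDGES = [
    ("F16", "F24"),  # voltage variation can cause a cascading failure
    ("F24", "F16"),  # a cascading failure may lead to further voltage variation
    ("F1", "F9"),    # assumed direction: lost communication -> false state estimate
]


def build_graph(edges):
    """Adjacency sets keyed by node; nodes without outgoing edges are included."""
    graph = defaultdict(set)
    for cause, effect in edges:
        graph[cause].add(effect)
        graph[effect]  # touching the key registers sink nodes as well
    return dict(graph)


def groups(graph):
    """Split the graph into separate groups, ignoring edge direction."""
    undirected = {node: set(targets) for node, targets in graph.items()}
    for cause, effects in graph.items():
        for effect in effects:
            undirected[effect].add(cause)
    seen, result = set(), []
    for start in sorted(undirected):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(undirected[node] - group)
        seen |= group
        result.append(group)
    return result


def has_cycle(graph):
    """Depth-first search with three colours; reaching a grey node means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


if __name__ == "__main__":
    chain = build_graph(EDGES)
    print(groups(chain))
    print(has_cycle(chain))
```

Detecting cycles matters for this kind of analysis because a cyclic relationship, such as the one between F16 and F24, means that an event can reinforce its own cause.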

Table 5 RQ5 (Detection techniques for SG faults and failures)
Table 6 RQ5 (Countermeasures for SG faults and failures)

Detection and countermeasures (RQ5)

Concerning RQ5, we report in Tables 5 and 6 the detection techniques and countermeasures that are available in the reviewed articles. Only 9 of the reported faults/failures include both detection techniques and countermeasures; on the other hand, two are covered by neither. Half of all the findings mentioned some detection technique or approach. The situation concerning countermeasures was considerably better: for 22 faults/failures, the papers report some sort of countermeasure, such as prevention, mitigation or recovery.

Detection techniques cover a broad spectrum of approaches for the identification of faults and failures (Table 5). These techniques deal mostly with the identification of anomalies that can be linked to the presence of faults or the triggering of failures. For example, delays in data packets can indicate SDN controller failures (F7); consumption pattern analysis and anomaly detection of energy demand can be used for energy theft (F11); and energy link-state information can be used for F14. Other signal-based and traffic-monitoring detection techniques can be used for failures of smart meter communication with local controllers (F1) and for the presence of forged identities in a WSN (F4).
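To illustrate the idea of anomaly detection on consumption patterns, the following minimal Python sketch flags readings that deviate strongly from a customer's own history. It uses a generic z-score rule, which is an assumption made for illustration and not the method of any reviewed article. The function name `flag_anomalies`, the threshold of 3 standard deviations and the sample readings are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(history, recent, threshold=3.0):
    """Return the indices of recent readings whose z-score exceeds the threshold.

    history: past consumption readings (e.g. daily kWh) assumed to be normal
    recent:  new readings to be checked against that baseline
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 readings
    if sigma == 0:
        # A perfectly flat history: any different value is a deviation.
        return [i for i, value in enumerate(recent) if value != mu]
    return [
        i for i, value in enumerate(recent)
        if abs(value - mu) / sigma > threshold
    ]


if __name__ == "__main__":
    # Hypothetical daily readings in kWh.
    history = [10.2, 9.8, 10.5, 10.1, 9.9, 10.0, 10.3]
    recent = [10.1, 2.0, 9.7]
    # The sudden drop to 2.0 kWh is flagged; it could indicate under-reporting.
    print(flag_anomalies(history, recent))
```

A flagged reading is only an indication that needs further investigation, since legitimate changes in demand can produce the same pattern.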

Countermeasures are actions and techniques put in place to counteract the possible effects of faults and the impacts of failures (Table 6). For example, error correction codes can be used against communication collisions in a WSN (F3); smart meter load blocking schemes can counteract frequency variations of the grid (F15); and adding noise and using differential privacy techniques to mask the individual contributions of smart meters to the data aggregates sent to the utility can address potential privacy leakage (F12). More hardware-related countermeasures include the use of energy storage devices to minimize frequency deviations for wind turbine gearbox failures (F30) and the deployment of Energy Storage Systems (ESS) for cascading failures of interconnected networked components (F24). Many countermeasures also rely on machine learning techniques, which are commonly applied in the context of SGs (Rossi and Chren 2019), for example the weighted fuzzy C-means algorithm for the identification of fault chains and cascading failures (F24) (Table 7).
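The noise-based privacy countermeasure mentioned for F12 can be sketched with the Laplace mechanism, a standard building block of differential privacy. The Python code below is a minimal illustration under stated assumptions: the function names, the clipping bound `max_reading` and the privacy parameter `epsilon` are hypothetical choices, and the reviewed articles may use different schemes.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(readings, max_reading, epsilon, rng=None):
    """Return a noisy sum of smart meter readings.

    Each reading is clipped to [0, max_reading], so a single meter can
    change the sum by at most max_reading (the sensitivity). The noise
    scale is sensitivity / epsilon; a smaller epsilon means more noise.
    """
    rng = rng or random.Random()
    clipped = [min(max(r, 0.0), max_reading) for r in readings]
    sensitivity = max_reading
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical readings in kWh from five meters in one aggregation group.
    readings = [1.2, 0.8, 2.5, 1.9, 0.4]
    print(private_sum(readings, max_reading=5.0, epsilon=0.5))
```

The design trade-off is between privacy and accuracy: the utility receives an aggregate that remains useful for large groups of meters, while the contribution of any single meter is masked by the noise.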

Conclusions

The goal of this paper was to review and classify existing faults and failures in SGs to provide a summary view of all the causes, consequences, and countermeasures that can be applied. Following the SLR approach, we collected and classified information from 50 articles, arriving at a catalogue of 30 faults/failures. These were categorized and then linked to the areas and domains of the SGAM model (Bruinenberg et al. 2012) and to the general dependability taxonomy (Avizienis et al. 2004). Overall, the categorization provides an actionable catalogue that practitioners and researchers can use to pinpoint specific predictive activities and countermeasures, with pointers to the sources where additional knowledge about the proposed techniques can be found.

The definition of clear categories of faults and failures and their characteristics can help to cope better with such disruptive events and to enable self-healing capabilities of SG components, by having in place preventive measures for detection and activities for the restoration of the impacted services. Many researchers are still attempting to make SG systems more robust, secure, and resilient, but they are confronted with the heterogeneity of the devices and components involved in the different layers, as we have shown by mapping faults and failures to the SGAM levels. An overall and integrated view is necessary; however, most of the preventive measures are fine-grained techniques that need to be applied in a coordinated manner. The level of coverage is thus given by the combination of all these disparate techniques: simulation, optimization, and analysis techniques all need to be combined with engineering methods to build self-healing components in the SG. To reach such a level of coordination, categorizations such as the one presented in this paper are of fundamental importance.

Table 7 RQ4 (Causes of found faults and failures)