1 Introduction

Protection misoperations in power systems are a top reliability concern according to a recent state of reliability report of the North American Electric Reliability Corporation (NERC) [1]. Protection misoperations fall into two broad categories: failures to trip and false trips, where trip means the removal of a piece of supposedly faulty equipment from the system by opening some circuit breakers. Equipment faults considered in this paper are three-phase short to ground faults in transmission lines. The principles established, however, are applicable to other types of transmission line faults and faults in other types of equipment. A typical protection system is designed to be dependable (to trip whenever it should) at the expense of security (not to trip whenever it should not) [2]. False trips occur much more often than failures to trip in a stressed system [1]. False trips are sometimes referred to as hidden failures. A hidden failure is a permanent defect causing the incorrect removal of a circuit element as a direct consequence of another triggering event [3]. A system can become stressed when it is subject to an equipment fault, a switch, a change in load/generation power, or a severe event of nature.

Despite the efforts to analyze the impact of protection misoperations [3–5], to prevent them from occurring through adaptive relaying [4], and to deploy remedial action schemes for some specific scenarios [6], fundamental study of systematic recovery from false trips upon an equipment fault remains uncharted territory. Recovery from protection misoperations presents a significant challenge because no formal mathematical representation exists for false trips and recovery processes: the traditional reliability model typically captures only equipment failure and system restoration processes. The challenge also stems from the fraction-of-a-second response time required of protection functions.

This work leverages phasor measurement unit (PMU)-like sensing technology, which has recently made a significant entry into power grids. Typically, analog inputs to a PMU are 3-phase currents and voltages at the secondary windings of instrument transformers. They are filtered and converted into digital signals at thousands of samples per second. The samples are time-tagged with sub-microsecond accuracy according to the GPS clock, and then processed using a discrete Fourier transform or least-squares methods to produce estimates of positive-sequence voltage and current phasors at a reporting rate of about 30–120 samples/s. These so-called synchrophasors are the balanced three-phase steady-state sinusoidal components in terms of a magnitude and a phase angle for each of the three AC phases at a fixed frequency. A relay-misoperation detection method using PMU data from both ends of a transmission line is presented in [7]. The work presented in this paper does not require measurements at both ends of a transmission line. Although different fault diagnosis methods have been proposed recently [8, 9], the dynamics of the electrical transmission network have not been considered. This paper incorporates the electric network dynamics to promptly identify fault modes and protection misoperations. The current IEEE standard [10], however, does not specify requirements on PMU responses to power system transients. This work proposes the use of the time-tagged input waveform samples of PMU-like sensors, instead of the output samples of PMUs, for estimation of both the continuous state and the discrete state through diagnosis of fault modes, in order to gain speed and accuracy during severe transients. The development of the diagnosis is reported in [11]. These samples are processed to provide feedback information for decision support in the proposed selective secondary protective control strategy.

To clarify the scope of this paper, NERC's definition of reliability is adopted. It contains two functional aspects: adequacy (ability to provide uninterrupted service), and security (ability to withstand large disturbances). Thus reliability in this context is an overarching measure encompassing both steady-state availability (adequacy) defined for a discrete-state stochastic process and transient stability (security) defined for a continuous-state dynamic system whose mode of operation is determined by the prevailing discrete state. The fact that availability and security have been viewed as largely disconnected is evidenced by the separate bodies of power systems literature [12–15]. This disconnection is an obstacle hindering progress in the general area of mitigation of cascading failures. The issue is recognized in less than a handful of publications [16, 17].

This paper builds on a stochastic framework first proposed in [18] to tackle recovery from a protection misoperation through a selective secondary protective control under a maximum security criterion. Protective control actions may include high-speed fault clearing, high-speed reclosing, and regulated shunt/series switching of reactive devices [14]. The qualifier selective, to be suppressed in the following development, refers to the fact that the availability of PMU-like sensors may allow only partial implementation of the secondary protection. Primary protection and secondary protection are used hereafter to distinguish the existing and newly introduced protection mechanisms. The desired secondary protective control takes no action when the primary protection operates correctly, and takes a corrective action when the primary protection fails to trip or falsely trips. This approach is highly cost-effective, as it keeps the primary protection scheme intact, which has been both ingeniously and painstakingly designed to benefit dependability, while exploiting new technologies to offer the needed security.

The protection mechanism proposed in this paper takes the sampled current and voltage waveforms acquired by instrument transformers as the real-time measurements. Two types of computations are made in real time: the tracking of the electro-mechanical state (relative rotor phase angles and angular speed deviations of the generators), and the diagnosis of faults using a multiple-model approach. The control decisions are obtained by combining the real-time tracked electro-mechanical state and the pre-computed post-fault stability regions. Because the stability regions are computed off-line, this protection scheme drastically reduces the on-line decision time on transient stability in comparison with existing decision methods.

All admissible selective secondary protective control actions are actuated through switches. The existing study on switched systems [19] seeks to analyze or construct switching sequences for asymptotic stability [20, 21]. The nature of solutions sought here, however, is to maintain the power system synchronism [14]. Upon entering a transiently unstable configuration, transient stability must be established before the sojourn time to cascading failures expires. Such an issue has not been touched upon in the switched system literature.

The paper is organized as follows. Section 2 presents the necessary background and establishes a stochastic model in which security indices are embedded. Security indices are formally defined in the section. Their relations to diagnosis, estimation, and control are delineated. Section 3 examines the computational issues associated with evaluating the indices through an example of a three-area power system. Section 4 presents the results of a simulation study on the performance robustness of the proposed secondary protection. Section 5 draws conclusions on the technology readiness for implementing the secondary protection. The Appendix provides an aggregated discrete-state stochastic model used for reliability analysis, a typical continuous-state classical swing model used for transient stability analysis, and a typical electric transmission network model used for diagnosis.

2 A stochastic model controlled by the secondary protection

This section reviews a previously defined modeling principle [16], which uses security indices as protective control criteria. The new development shifts the emphasis to modeling the protection misoperations and recovery processes and to delineating the secondary protective control problem. Many concepts and definitions in this section are inherited from the authors’ recent preliminary study [22].

2.1 Security indices and reliability

The presentation of this subsection draws heavily from [16] for the purpose of review of background. An \(N-1\) secure system maintains its transient stability upon the prompt removal of a single piece (group) of equipment which has experienced a critical fault. A critical fault refers to one that inevitably leads to a system outage in the absence of an appropriate protective control action.

Fig. 1 Rate transition diagrams

The discrete state-space of a power system model shown in Fig. 1a consists of a normal state p (pre-fault), an aggregated outage state o and a set of N degraded yet operational states \(d_1\), \(\cdots\), \(d_i\), \(\cdots\), \(d_N\), each corresponding to a post-fault state upon the removal of the equipment experiencing a critical fault. Associated with each degraded discrete state \(d_i\), there is a set of parameters: transition rate \(\lambda _{i}\) from pre-fault state p, restoration rate \(\gamma\) or \(\bar{\gamma }_{i}\) into state p, and sequential fault transition rate \(\bar{\lambda }_{i}\) into outage state o. \(s_{i}\) is the conditional probability of successful state transition into \(d_i\) given the occurrence of critical fault i, and \(\bar{s}_{i}=1-s_{i}\). \(s_i\) and \(\bar{s}_i\) are named security index and risk index, respectively. They are introduced in a manner supported by the Poisson decomposition property [23]. The term security index is coined because it is directly proportional to a probabilistic measure of post-fault transient stability.

Define steady-state probabilities \(\pi _0 = \pi _p\), \(\pi _i= \pi _{d_i}\), \(i=1, \cdots , N\), and \(\pi _{N+1}=\pi _{o}\), or in vector form \(\pi =[ \begin{array}{ccc} \pi _0&\cdots&\pi _{N+1} \end{array} ]\). The grid availability (adequacy) is given by

\(A_s= \pi _0 + \pi _1 + \cdots + \pi _N,\) and unavailability \(\bar{A}_s=\pi _{N+1}\). The steady-state probabilities can be solved from the Chapman-Kolmogorov equation \(\dot{\pi }=\pi Q\) at steady state, \(\pi Q = 0, \;\; \sum _{i=0}^{N+1} \pi _i = 1\), where \(Q\) is the rate transition matrix, as explicitly expressed in Appendix A. Since the model in Fig. 1a is of finite state and irreducible, a unique steady-state probability distribution exists [24]. System availability can be expressed explicitly as a function of security indices upon exploiting the structure of the \(Q\) matrix [16],

$$\begin{aligned} A_s=\pi _0 + \sum _{i=1}^{N}\frac{s_{i}\lambda _{i}\pi _0}{\bar{\lambda }_{i}+\bar{\gamma }_{i}}, \; \pi _0 = \frac{\gamma }{\gamma +\sum _{i=1}^N \left\{ \lambda _{i} -\frac{s_{i}\lambda _{i}(\bar{\gamma }_{i}-\gamma )}{\bar{\gamma }_{i}+\bar{\lambda }_{i}}\right\}} \end{aligned}$$

The above establishes that the grid availability is monotonically increasing with respect to each security index associated with a critical fault, provided that restoration to the pre-fault state from a degraded state is faster than that from system outage, i.e., \(\bar{\gamma }_{i} > \gamma\), which holds for any well-designed power system. Computation of security indices and the control mechanism to maximize them are the focal points of discussion in [16]. This paper reexamines the definition and computation of security indices specific to the secondary protection that deals with misoperations of the existing primary protection and with their recovery processes.

2.2 Protection misoperation and recovery processes

Inability to recover from protection false trips in power systems is a top root cause of the modern-day cascading failures [1]. In this regard, the framework developed in [16] becomes inadequate for two reasons: (1) The stochastic discrete-state model there does not capture misoperation and recovery processes. As a consequence, our ability is impeded in terms of understanding the fundamentals and quantifying the computational and technological requirements for an effective secondary protective control strategy; (2) The traditional continuous-state model ignores the electric network dynamics and considers only the much slower electromechanical dynamics. This severely limits our ability to promptly identify fault modes necessary for correct execution of the secondary protection functions in the face of primary protection misoperations. Typical modern relays for primary protection operate in 8–10 ms, and circuit breakers clear a fault in 30–50 ms [13].

To include the processes of protection misoperations and their recoveries while keeping the model simple and scalable, the N degraded states in Fig. 1a are aggregated into a single degraded state d as depicted in Fig. 1b, resulting in a 3-state availability model with less detailed state information. The transition rates in the two models are related by

$$\begin{aligned} \lambda =\sum _{i=1}^{N}\lambda _i, \; \bar{\lambda }=\sum _{i=1}^{N}\bar{\lambda }_i, \; \bar{\gamma }=\sum _{i=1}^{N}\bar{\gamma }_i \end{aligned}$$

and the security indices are related by

$$\begin{aligned} s=\frac{\sum _{i=1}^{N}s_i\lambda _i}{\sum _{i=1}^{N}\lambda _i}, \;\; \bar{s}=1-s = \frac{\sum _{i=1}^{N}\bar{s}_i\lambda _i}{\sum _{i=1}^{N}\lambda _i} \end{aligned}$$

They are obtained using the superposition property of Poisson processes [24]. The monotonic dependence of availability on the aggregated security index s can be shown as [22]

$$\begin{aligned} A_s=\frac{\gamma \bar{\lambda } + \gamma \bar{\gamma } + s \lambda \gamma }{ \lambda \bar{\lambda } + \gamma \bar{\lambda } + \gamma \bar{\gamma } + s \lambda \gamma + \bar{s}\lambda \bar{\gamma }} \end{aligned}$$
(1)
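As an illustrative check of the aggregation formulas and of (1), the following minimal sketch computes the aggregated security index and compares the closed-form availability with a direct numerical solution of \(\pi Q = 0\) for the 3-state model. The transition structure assumed here (p to d at rate \(s\lambda\), p to o at rate \(\bar{s}\lambda\), d to p at rate \(\bar{\gamma }\), d to o at rate \(\bar{\lambda }\), o to p at rate \(\gamma\)) is inferred from the text rather than read from Fig. 1b, and all function names and numerical rates are hypothetical.

```python
import numpy as np

def aggregate(lam_i, s_i):
    """Aggregate per-fault rates and security indices (Poisson superposition)."""
    lam_i = np.asarray(lam_i, dtype=float)
    s_i = np.asarray(s_i, dtype=float)
    lam = float(lam_i.sum())
    s = float((s_i * lam_i).sum() / lam)   # rate-weighted average of s_i
    return lam, s

def availability_closed_form(lam, lam_bar, gam, gam_bar, s):
    """Availability of the aggregated 3-state model, Eq. (1)."""
    s_bar = 1.0 - s
    num = gam * lam_bar + gam * gam_bar + s * lam * gam
    den = (lam * lam_bar + gam * lam_bar + gam * gam_bar
           + s * lam * gam + s_bar * lam * gam_bar)
    return num / den

def availability_numeric(lam, lam_bar, gam, gam_bar, s):
    """Solve pi Q = 0, sum(pi) = 1 for states ordered (p, d, o).

    The rate matrix below is an assumed reading of the 3-state model.
    """
    s_bar = 1.0 - s
    Q = np.array([
        [-lam,     s * lam,               s_bar * lam],
        [gam_bar, -(gam_bar + lam_bar),   lam_bar],
        [gam,      0.0,                  -gam],
    ])
    # Keep two balance equations and replace the third by normalization.
    A = np.vstack([Q.T[:-1], np.ones(3)])
    b = np.array([0.0, 0.0, 1.0])
    pi = np.linalg.solve(A, b)
    return float(pi[0] + pi[1])            # A_s = pi_p + pi_d

# Hypothetical rates (per hour) for two critical faults
lam, s = aggregate([0.01, 0.03], [0.9, 0.7])
a_closed = availability_closed_form(lam, 0.002, 0.05, 1.0, s)
a_numeric = availability_numeric(lam, 0.002, 0.05, 1.0, s)
```

Under these assumptions the two availability values agree to numerical precision, and raising \(s\) with \(\bar{\gamma } > \gamma\) raises the availability, consistent with the monotonic dependence stated above.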

To observe protection misoperation and recovery processes, pre-fault state p is split into pre-fault p (or 0) and pre-fault false trip t (or 4) states, degraded state d is split into fault-on f (or 1) and degraded d (or 2) states, and system outage state o is split into misoperation fault-on m (or 3) and system outage o (or 5) states. The new 6-state model is shown in Fig. 2b. For an \(N-1\)–secure power system, discrete-states p (or 0), d (or 2), and t (or 4) are to be called secure (or transiently stable) states hereafter. During the holding time at each of these discrete-states, the continuous-state (Section 2.3) evolves in the neighborhood of a stable equilibrium. States f (or 1) and m (or 3) are to be called insecure states as they are transiently unstable. State o (or 5) alone is called an outage state. Upon entering a discrete-state i, the evolution of the system’s continuous-state \((x,\xi )\) is governed by a pair of differential equations [22].

Fig. 2 Rate transition diagram with protection misoperations

At any given moment, a power system of any size and complexity resides in one of the 6 states in the semi-Markov chain (or Markov chain if transition rates are constants) depicted in Fig. 2b. A state holding time (sojourn time) is the random amount of time that the chain stays at a state. The average holding time at state 2, for example, is \(1/\varLambda _2\) where \(\varLambda _2=\lambda _{20}+\lambda _{23}+\lambda _{25}\). Typically transition rates in Fig. 2b range from 1/weeks (\(\lambda _{01}\), \(\lambda _{15}\), \(\lambda _{25}\), \(\lambda _{35}\)), to 1/days (\(\lambda _{23}\), \(\lambda _{43}\), \(\lambda _{50}\)), to 1/hours (\(\lambda _{20}\), \(\lambda _{40}\)), to 1/AC-cycles (\(\lambda _{12}\) and \(\lambda _{32}\)). It is noted that the values for \(\lambda _{12}\) and \(\lambda _{32}\) reflect the speed of operation of the primary protection without any misoperations. The inter-event time distributions can be established by using standard statistical methods, such as parameter estimation and goodness-of-fit tests [25], based on the data collected.
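To make the holding-time statement concrete, the sketch below computes the mean holding time \(1/\varLambda _2\) and the probabilities of leaving state 2 toward states 0, 3, and 5, assuming constant rates (the Markov case). The numerical rates are hypothetical values chosen only to match the orders of magnitude quoted above (1/hour, 1/day, 1/week), and the function name is illustrative.

```python
def holding_time_stats(out_rates):
    """Mean holding time and jump probabilities for constant outgoing rates.

    out_rates maps destination state -> transition rate (same time unit).
    """
    total = sum(out_rates.values())                 # Lambda_i
    mean_hold = 1.0 / total                         # average holding time
    jump_prob = {dest: r / total for dest, r in out_rates.items()}
    return mean_hold, jump_prob

# Hypothetical rates out of state 2, in 1/hour
rates_2 = {0: 1.0, 3: 1.0 / 24.0, 5: 1.0 / 168.0}
mean_hold_2, jump_prob_2 = holding_time_stats(rates_2)
```

With these illustrative numbers, \(\varLambda _2 = 22/21\) per hour, so the chain stays in state 2 for a little under an hour on average and most often leaves by restoration to state 0.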

Some transitions in Fig. 2b can be influenced through systematic decision and control with real-time information feedback. These are called controllable transitions. This paper focuses on two types of controllable transitions. Referring to Fig. 2b, at the fault-on state f, a protection misoperation occurs when an outgoing transition either falsely trips into misoperation state m, or fails to trip correctly into degraded secure state d within a specified time limit set for the existing primary protection. Thus the concern about the high frequency of protection misoperations is reflected in the need to use a secondary protective control to reduce the risk index \(\bar{s}_{12}\) to as close to 0 as possible. The role of \(s_{12}\) in Fig. 2b is similar to that of s in Fig. 2a. At misoperated state m, on the other hand, the secondary protection is relied on to help the system recover into state d rather than allowing the system to cascade into outage state o. Thus the concern about the largely absent systematic recovery scheme from a misoperated primary protection system is reflected in the need to raise \(s_{32}\) (from 0) to as close to 1 as possible. Transition from outage state o or degraded state d to normal state p is referred to as restoration, which is in fact also controllable. Restoration from system outage [26], however, faces a different set of challenges that are outside the scope of this paper.

Despite its simplicity, the 6-state model contains the essential and quantifiable information necessary for imposing computational and technological requirements. For example, the event probabilities [24] can be solved as

$$\begin{aligned} \text {Prob}[d|f]=s_{12}, \;\; \text {Prob}[d|m] \approx s_{32},\;\; \lambda _{32} \gg \lambda _{35} \end{aligned}$$
(2)

and the first order approximation of the system’s availability can be derived from the steady-state Chapman-Kolmogorov equation associated with the Markov chain in Fig. 2b (Appendix A)

$$\begin{aligned} A_s \approx 1-\frac{\lambda _{01}}{\lambda _{50}}\left( 1-\frac{s_{32}\bar{s}_{12}+s_{12}}{(1+\frac{\lambda _{25}+\bar{s}_{32}\lambda _{23}}{\lambda _{20}})} \right) \pi _0 \end{aligned}$$
(3)

where \(\pi _0 \approx 1\) can be assumed for any functioning power system.
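The effect of the recovery index \(s_{32}\) on availability can be illustrated by evaluating (3) directly. The sketch below is only a transcription of (3) into a function; the rates are hypothetical values chosen to match the orders of magnitude discussed for Fig. 2b, and \(s_{12}=0.9\) is an arbitrary illustrative choice.

```python
def availability_first_order(l01, l50, l20, l23, l25, s12, s32, pi0=1.0):
    """First-order availability approximation, Eq. (3)."""
    s12_bar = 1.0 - s12
    s32_bar = 1.0 - s32
    recovered = (s32 * s12_bar + s12) / (1.0 + (l25 + s32_bar * l23) / l20)
    return 1.0 - (l01 / l50) * (1.0 - recovered) * pi0

# Hypothetical rates in 1/hour
rates = dict(l01=1.0 / 672.0,   # one critical fault per four weeks
             l50=1.0 / 24.0,    # restoration from outage in about a day
             l20=1.0,           # restoration from degraded state in about an hour
             l23=1.0 / 24.0,
             l25=1.0 / 168.0)

a_no_recovery = availability_first_order(s12=0.9, s32=0.0, **rates)
a_full_recovery = availability_first_order(s12=0.9, s32=1.0, **rates)
```

For these assumed numbers, moving \(s_{32}\) from 0 to 1 removes most of the unavailability attributed to misoperations, which is the quantitative motivation for the secondary protection.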

The modeling process discussed in this section is applicable to the higher resolution model in Fig. 1a, which contains N degraded states for N critical faults. Splitting each of these states yields a discrete state-space formed from the direct product of a composite set of binary equipment states \(\{faulty, degraded\}\) and a composite set of binary relay states \(\{restrain, operate\}\), respectively. Thus the 6-state model in Fig. 2b is expansible systematically to any desired resolution. Transition rates associated with failure and misoperation processes generally increase linearly with the system size in the aggregated model as dictated by the Poisson superposition property [24]. Thus the model in the form of Fig. 2b is also scalable, and the monotonic dependence of availability on security indices remains true.

2.3 Continuous-state dynamics during the holding time at a discrete-state

Upon entering each discrete-state i, the evolution of continuous-state vectors \((x,\xi )\) representing generator rotor dynamics, and electric dynamics of the transmission network, respectively, is governed by a pair of differential equations from the set \(\{\dot{x}=\phi ^i(x,\xi ), \dot{\xi }=\eta ^i(x,\xi )\}\) particular to discrete-state i, as seen in Fig. 2b. Set notation \(\{ . \}\) signifies that if state i is aggregated, it can have multiple continuous-state dynamics associated with different system configurations. The dimension and content in \((x,\xi )\) depend on the prevailing system configuration. Such a system is sometimes called a hybrid system [27].

The new development in this paper is centered around introducing secondary protective control functions to recover from a primary protection misoperation. To that end the following discussion focuses on the recovery process from m to d, using the simple framework of Fig. 2b. The discussion on the transition from f to d can follow a similar path, and in fact has, to a certain extent, been elaborated on in [16].

Transient stability of a power system is defined for the slower electro-mechanical dynamics [14], where state x contains typically relative rotor angles of all synchronous generators and deviations of their angular speeds; whereas the transients in the continuous-state \(\xi\) are governed by the dynamics of the passive electric transmission network, and are assumed to settle instantaneously with respect to the settling time of transients in x. This work abandons such steady-state notions as phasors. Instead, the time-domain waveforms of node voltages and lumped line currents constitute the components of \(\xi\). The two coupled differential equations at discrete-state d, for example, \(\dot{x}= \phi ^d(x,\xi )\) and \(\dot{\xi }= \eta ^d(x,\xi )\) can be replaced by a differential-algebraic system \(\dot{x}= \phi ^d(x,\xi )\) and \(0= \eta ^d(x,\xi )\), from which component \(\xi\) can be eliminated from the differential equation [14]. With some abuse of notation, \(\dot{x}= \phi ^d(x)\) is used to describe the electromechanical dynamics in the following development. On the other hand, it is shown in Section 3.1 that the slow dynamics that enter \(\dot{\xi }= \eta ^d(x,\xi )\) can be accurately estimated as a part of an input to the electric network using the synchronized high rate waveform samples of PMU-like devices [22]. Thus \(\dot{\xi }= \hat{\eta }^d(\hat{x},\xi )=\eta ^d(\xi )\) hereafter with some abuse of notation. From this point on, the dynamic descriptions of x and \(\xi\) are assumed decoupled, and they will be employed to evaluate control performance and diagnosis performance, respectively. The relevant continuous-state dynamics to the process of recovery from misoperations, among all expressed in Fig. 2b, are

$$\begin{aligned} \{ \dot{x}&= \phi ^{m}(x) \}, \; t_m< t < t_{d} \end{aligned}$$
(4)
$$\begin{aligned} \{ \dot{x}&= \phi ^d(x)\}, \; t_d \le t < \infty \end{aligned}$$
(5)

The notations above are borrowed from [15]. The holding times are those of the misoperated dynamics and the post-fault (degraded) dynamics, necessary for studying the recovery process. Also relevant are the pre-fault dynamics \(\{ \dot{x} = \phi ^{p}(x) \}\) for \(t < t_{f}\) and fault-on dynamics \(\{ \dot{x} = \phi ^{f}(x) \}\) for \(t_f< t < t_{d}\) if the post-fault dynamics \(\{ \dot{x} = \phi ^{d}(x) \}\) are entered at \(t = t_d\) with rate \(s_{12}\lambda _{12}\), or misoperated dynamics \(\{ \dot{x} = \phi ^{m}(x) \}\) are entered at \(t = t_{m}\) with rate \(\bar{s}_{12}\lambda _{12}\). Although the outage state can also be entered from state f, which occurs when there is a second equipment fault, this is considered much less likely because \(\lambda _{15}\) is very small. The operating principle for protection functions to achieve \(\bar{s}_{12} \ll s_{12}\) has been considered in [16].

The initial continuous-state (x) at \(t_m\), upon entering discrete-state m, is inherited from the final continuous-state arrived at the expiration of the holding time at the originating state f, or d, or t, whose dynamics are defined by \(\{(\phi ^{f/d/t},\;\eta ^{f/d/t})\}\). The initial continuous-state at \(t_d\) upon entering state d is inherited from the final continuous-state arrived at the expiration of the holding time at the originating state f or m defined by \(\{\phi ^{m/f}\}\). Since the holding time at state m is on average much shorter than that in state d, the latter is assumed to last till the infinitely remote future to simplify the discussion of transient stability and the definition of a stability region. Denote the stability region (region of attraction) [15] of a post-fault system around its stable equilibrium by \(A(x^e) \equiv \{x|\lim _{t \rightarrow \infty } \psi ^d(x,t)=x^e\}\), where \(\psi ^d(x,t)\) is the post-fault continuous-state governed by (5) initiated at x.

Our approach is to determine whether the fault-on electromechanical state lies within the boundary of the post-fault stability region \(A(x^e)\), which is characterized off-line [15], by tracking the state in real time. Further, we ask whether a secondary protection can be devised to establish an \(A(x^e)\) that encloses the electromechanical state in the face of misoperations of the primary protection. This is equivalent to driving the insecure state m to secure state d before the system enters outage state o within a small fraction of a second.

2.4 Maximally secure secondary protection

Consider an N–1–secure system with N critical equipment faults. Let \({\mathcal U}\) denote the set of admissible secondary control actions, and \({\mathcal U}_m\) the set of actions admissible at state m. To render a successful recovery from a protection misoperation by the secondary protection, the \(N+1\) operation modes from the set

$$\begin{aligned} \{f_0,\; f_1, \; \cdots , f_N \} \end{aligned}$$
(6)

must be distinguishable in the face of a misoperation, where \(f_0\) denotes the normal mode and \(f_i\), \(i=1, 2, \cdots , N\), denotes the \(i^{\rm th}\) fault mode. The system outage state is excluded as it requires a much different functionality from that of the secondary protection [26]. Note that states and modes are two different sets, because mode \(f_i\) may be related to one of many possible fault-on states corresponding to different protective control actions. Table 1 specifies the desired secondary protection functions. Consider the security profile [22] associated with fault mode \(f_i\) and admissible secondary protective control \(u \in {\mathcal U}_m=\{u_1, \cdots , u_M\}\) exerted at some \(t>t_m \ge t_f\) to attempt recovery from a protection misoperation

$$\begin{aligned} s_{f_i,u}(t)= p_{f_i}(t)c_{f_i,u}(t) \end{aligned}$$
(7)

\(p_{f_i}(t)\) in (7) is the mode probability distribution profile conditioned on control action u applied at time t

$$\begin{aligned} p(t)=(p_{f_0}(t), p_{f_1}(t), \cdots , p_{f_N}(t)) \end{aligned}$$
(8)

as the outcome of a diagnosis process, such as that computed by a multiple-model diagnosis algorithm [11]. \(c_{f_i,u}(t)\) in (7) is a fault coverage profile [18] defined by

$$\begin{aligned} c_{f_i,u}(t)=\int _{x} J_{f_i,u}(x)f_{(\hat{x})}(t,x){\text {d}}x \end{aligned}$$
(9)

representing the probability that the system enters a post-fault stability region defined by characteristic function \(J_{f_i,u}(x)\) associated with control action u exerted at time \(t>t_m > t_f\). Stability can be estimated using the energy function method [28, 29], in which generator parameter uncertainty can also be considered [16]. If the time \(t-t_m\) the system spends at the insecure state m is longer than the critical clearing time for fault mode \(f_i\), control u can no longer establish a stability region to enclose the departing state \(\psi ^m(x,t)\) and the system enters the outage state instead. \(f_{(\hat{x})}(t,x)\) in (9) is a snapshot at time t of the probability density function for the estimate of electromechanical state (or rotor state) \(x=(\delta ,\; {\mathrm {\Delta }}\omega )\). The distribution can be estimated by formulating a maximum likelihood problem [30, 31], where the parameters in a family of distributions are optimized to fit the data.
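The integral in (9) can be approximated by Monte Carlo sampling once a characteristic function and a density for the state estimate are available. The sketch below is a stand-in example, not the paper's 9-bus computation: it assumes a single machine against an infinite bus, uses the classical transient energy function with the critical energy taken at the nearest unstable equilibrium as the stability-region estimate, and assumes a Gaussian density for \(\hat{x}=(\delta , \mathrm {\Delta }\omega )\). All parameter values and function names are hypothetical.

```python
import numpy as np

# Hypothetical single-machine parameters (per unit); not taken from the paper
PM, PMAX, M = 1.0, 2.0, 0.05
DELTA_S = np.arcsin(PM / PMAX)        # stable equilibrium angle
DELTA_U = np.pi - DELTA_S             # nearest unstable equilibrium angle

def energy(delta, dw):
    """Transient energy relative to the stable equilibrium."""
    kinetic = 0.5 * M * dw ** 2
    potential = (-PM * (delta - DELTA_S)
                 - PMAX * (np.cos(delta) - np.cos(DELTA_S)))
    return kinetic + potential

V_CRIT = energy(DELTA_U, 0.0)         # critical energy at the unstable equilibrium

def inside_region(delta, dw):
    """Characteristic function J(x): 1 inside the estimated stability region."""
    in_well = (delta > -np.pi - DELTA_S) & (delta < DELTA_U)
    return (in_well & (energy(delta, dw) < V_CRIT)).astype(float)

def fault_coverage(mean, cov, n_samples=5000, seed=0):
    """Monte Carlo estimate of Eq. (9) under an assumed Gaussian density."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mean, cov, size=n_samples)
    return float(inside_region(x[:, 0], x[:, 1]).mean())

cov = np.diag([0.05 ** 2, 0.5 ** 2])  # assumed estimation covariance
c_near = fault_coverage([DELTA_S, 0.0], cov)   # estimate near the equilibrium
c_edge = fault_coverage([DELTA_U, 0.0], cov)   # estimate on the region boundary
c_far = fault_coverage([3.0, 0.0], cov)        # estimate beyond the boundary
```

In this toy setting the coverage is close to 1 when the estimated state sits near the stable equilibrium, drops to an intermediate value on the boundary, and is close to 0 once the state has left the region, which mirrors the role of the critical clearing time described above.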

Table 1 Desired secondary protection functions

Time \(t_d\), at which a secondary protective control action is exerted, can be determined by tracking each \(s_{f_i,u}(t)\) in real-time until one exceeds a prescribed threshold, and satisfies

$$\begin{aligned} \max _{i \in \{0,\; 1, \cdots , N\},u \in {\mathcal{U}}} s_{f_i,u}(t) > s_{th} \end{aligned}$$
(10)

persistently for a period of time up to \(t_d\). At \(t_d\), the optimal control action

$$\begin{aligned} u^*=\arg \max _{u \in {\mathcal{U}}}s_{f_i,u}(t_d) \end{aligned}$$
(11)

is applied. Referring to Fig. 2b, the aggregated security index \(s_{32}\) afforded by the secondary protective control u can be identified with \(s_{f_i,u}(t_d)=p_{32}(t_d)c_{32}(t_d)\), which measures the probability of successful recovery from a protection misoperation. \(\bar{s}_{32}=1-s_{32}=\bar{p}_{32}+p_{32}\bar{c}_{32}=\bar{c}_{32}+c_{32}\bar{p}_{32}\). \(u^*\) is optimal because it maximizes the security of the controlled transition.
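The decision logic of (7), (10), and (11) can be sketched as follows. The sketch assumes that mode probabilities and fault coverages are available as sampled time histories, counts persistence in consecutive samples, and selects the mode and action jointly at the decision time; the function name, array layout, and numerical values are hypothetical.

```python
import numpy as np

def select_action(p_hist, c_hist, s_th, persist):
    """Return (k_d, mode, action, s) at the first decision time, or None.

    p_hist: shape (T, N+1), mode probabilities p_{f_i}(t_k).
    c_hist: shape (T, N+1, M), fault coverages c_{f_i,u}(t_k).
    persist: number of consecutive samples for which Eq. (10) must hold.
    """
    p_hist = np.asarray(p_hist, dtype=float)
    c_hist = np.asarray(c_hist, dtype=float)
    s_hist = p_hist[:, :, None] * c_hist            # security profiles, Eq. (7)
    run = 0
    for k in range(s_hist.shape[0]):
        # Eq. (10): the largest security profile must exceed the threshold.
        run = run + 1 if s_hist[k].max() > s_th else 0
        if run >= persist:
            # Eq. (11): pick the maximizing mode/action pair at t_d.
            i_star, u_star = np.unravel_index(np.argmax(s_hist[k]),
                                              s_hist[k].shape)
            return k, int(i_star), int(u_star), float(s_hist[k, i_star, u_star])
    return None

# Hypothetical example: two modes, two admissible actions, five samples
p_hist = [[0.5, 0.5], [0.3, 0.7], [0.1, 0.9], [0.05, 0.95], [0.05, 0.95]]
c_hist = np.tile(np.array([[0.9, 0.2], [0.3, 0.95]]), (5, 1, 1))
decision = select_action(p_hist, c_hist, s_th=0.8, persist=2)
```

In the example the threshold is first exceeded at the third sample and the action is released one sample later, once the persistence requirement is met. A higher threshold that is never reached returns no decision, in which case the secondary protection takes no action.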

3 Computation of security indices

The feasibility of the proposed secondary protective control strategy hinges on our ability to compute security indices fast enough for real-time mitigation of cascading failures. As discussed above, the computation involves evaluating the fault coverage (9) and the fault mode probability (8) that enter the security profile (7). This section explains the various aspects of security index computation through a power system example.

Fig. 3 One-line diagram of the WSCC 9-bus system [32]

The one-line diagram of the WSCC 3-generator 9-bus model [32] in Fig. 3 is now used to demonstrate computational and other technological issues. Each generator is regarded as an aggregate for an area based on the concept of coherence [33]. The transmission lines are modeled as \(\varPi\)–circuits, and each load is modeled as a constant impedance based on the nominal power flow solution. Only three-phase-to-ground short-circuit faults on transmission lines are considered. The time-domain parameters and the design models are given in Appendix C. All time-domain circuit parameters are derived from the corresponding 60 Hz impedances in per unit on a 100-MVA base. Three PMU-like sensors are located at buses 4, 7, and 9 with fiber-optic links in between. The channel capacity, data transfer rates, and electromagnetic immunity of such links allow communication issues to be set aside in this discussion.

Three assumptions/conditions are stated for the following discussion. (1) The system under study has been planned to be \(N-1\)–secure, which meets the broad NERC standard, and all post-fault stability regions have been estimated off-line. (2) At most one piece of equipment is tripped at a given time. Thus any equipment removal decision by the secondary protection must be preceded by a reinstatement of a previously tripped piece of equipment. (3) The system has the knowledge of the protective control action taken and the action time (\(t_m\) or \(t_d\)), whereas it relies on the secondary protective control strategy to isolate an equipment fault and estimate fault onset time (\(t_f\)).

3.1 Electromechanical state tracking using input samples of PMU-like sensors

There are two reasons for tracking a generator's rotor angle and angular speed deviation in real time: (i) they are needed in Section 3.2 for profiling security using (9), and (ii) they are needed in Section 3.3 as inputs to the design models of diagnosis filters.

Express the internal voltage e(t) of a generator as \(e(t) = E(t) {\text{cos}}(\omega _0 t + \delta (t))\), where E(t) is the magnitude, \(\omega _0\) is the nominal angular speed, \(\delta (t)\) is the phase angle, and speed deviation \({\mathrm {\Delta }} \omega (t)={\text {d}}\delta (t)/{\text {d}}t\). E(t) and \({\mathrm {\Delta }} \omega (t)\) are assumed to vary slowly with time due to the machine inertia and magnetic flux, even in the event of transmission line faults [14, 32]. In this case, e(t) is considered to be a quasi-steady-state sinusoidal voltage signal, and can be used as a signal model to track \((\mathrm {\Delta } \omega (t), \delta (t), E(t))\), provided that the terminal voltage v(t) and current i(t) in the following equation:

$$\begin{aligned} e(t) = v(t) + L \frac{{\text{d}}i(t)}{{\text {d}}t} \end{aligned}$$
(12)

are both measured and sufficiently excited, where L is the sum of the transient inductance of the generator and the leakage inductance of its step-up transformer. Suppose the sample interval of the PMU-acquired signals from the secondary windings of instrument transformers is \({\mathrm {\Delta }} t\). The following nonlinear signal model is used to track the internal voltage (12), or equivalently, to estimate \(x(t_k) =(\mathrm {\Delta } \omega (t_k),\delta (t_k),E(t_k))\)

$$\begin{aligned} y(k)&\equiv v(k{\mathrm {\Delta }} t) + L \frac{i(k{\mathrm {\Delta }} t)-i((k-1){\mathrm {\Delta }} t)}{\mathrm {\Delta } t} \\&= E(k) \cos (\omega _0 k{\mathrm {\Delta }} t + {\mathrm {\Delta }} \omega (k) k{\mathrm {\Delta }} t + \phi (k)) + \epsilon (k) \\&\equiv h(x(k))+\epsilon (k) \end{aligned}$$
(13)

One approach to state tracking is to implement nonlinear recursive least-squares with an extended Kalman filter [30] using state evolution \(x(t_{k+1})=x(t_{k})+w(t_k)\), where w has a zero mean and an appropriately selected covariance to balance between the accuracy and speed of the estimate.

Because the estimates are obtained using only local PMU input samples at the high-side terminals of the generators, no remote data exchange is needed. It is important to note that the estimate does not involve any swing dynamics, and thus the issue of generator model uncertainty is circumvented.
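As an illustration of the tracking idea, the following is a minimal sketch of a scalar-measurement extended Kalman filter for the signal model in (13) with the random-walk state evolution described above. It is not the implementation used to produce the reported results. The function name `ekf_track`, the noise covariances, the initial guess, the 60 Hz nominal frequency, and the synthetic test signal are all assumptions made for illustration, and the samples \(y(k)\) are assumed to have already been formed from \(v\) and \(i\) as in the first line of (13).

```python
import math
import random


def ekf_track(y, dt, w0, x0, p0, q, r):
    """EKF for x = (d_omega, phi, E) with x(k+1) = x(k) + w(k).

    Measurement model: y(k) = E cos(w0*k*dt + d_omega*k*dt + phi) + noise.
    p0 and q hold the diagonals of the initial and process covariances.
    """
    x = list(x0)
    P = [[p0[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
    for k, yk in enumerate(y):
        # Time update: the random walk only inflates the covariance.
        for i in range(3):
            P[i][i] += q[i]
        t = k * dt
        theta = w0 * t + x[0] * t + x[1]
        h = x[2] * math.cos(theta)
        # Jacobian of h with respect to (d_omega, phi, E).
        H = [-x[2] * math.sin(theta) * t, -x[2] * math.sin(theta), math.cos(theta)]
        PHt = [sum(P[i][j] * H[j] for j in range(3)) for i in range(3)]
        S = sum(H[i] * PHt[i] for i in range(3)) + r  # scalar innovation variance
        K = [v / S for v in PHt]
        innov = yk - h
        x = [x[i] + K[i] * innov for i in range(3)]
        # P <- P - K (H P); H P equals PHt transposed because P is symmetric.
        P = [[P[i][j] - K[i] * PHt[j] for j in range(3)] for i in range(3)]
    return x


# Synthetic test signal (assumed values): 60 Hz, 24 samples/cycle, 0.2 s long.
dt = 1.0 / (60 * 24)
w0 = 2.0 * math.pi * 60.0
true_x = (0.5, 0.3, 1.0)  # (d_omega [rad/s], phi [rad], E [p.u.])
rng = random.Random(0)
n = 288
y = [
    true_x[2] * math.cos(w0 * k * dt + true_x[0] * k * dt + true_x[1])
    + rng.gauss(0.0, 0.01)
    for k in range(n)
]

x_hat = ekf_track(
    y, dt, w0,
    x0=(0.0, 0.2, 0.9),
    p0=(1.0, 0.1, 0.1),
    q=(1e-6, 1e-8, 1e-8),
    r=1e-2,
)
t_end = (n - 1) * dt
angle_true = true_x[0] * t_end + true_x[1]
angle_hat = x_hat[0] * t_end + x_hat[1]
```

The process covariance `q` plays the role of the covariance of w mentioned above: larger values let the estimate follow faster changes at the cost of noisier estimates.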

Fig. 4

Electro-mechanical state tracking results

A recursive least-squares algorithm is implemented using an extended Kalman filter for the 9-bus system of Fig. 3, and the simulation results are shown in Fig. 4. The simulated sample paths of the three generators’ terminal voltage and current waveforms, rotor angles, and angular speed deviations cover consecutive periods of normal, fault-on, protection false-trip, and post-fault operation. The system starts in normal operation. A mid-point short to ground fault in line 5-7 occurs 0.3 s into the simulation. Line 7-8 is falsely removed at 0.4 s, which drives the system into insecure state m. The fault is cleared at 0.5 s by reconnecting line 7-8 and removing line 5-7. The recovered system enters state d (post-fault) at 0.5 s and stays there until the end of the simulation. The electro-mechanical states of the three generators are estimated based on the measurement model in (13). The estimates are obtained using the local PMU-like sensor input waveforms, sampled at 24 samples/cycle, at the high-side terminals of each generator’s step-up transformer. The estimation algorithm does not involve any swing dynamics.

3.2 Stability region computation and security profiling

Stability regions (regions of attraction) are discussed in Section 2.3 and are defined based on the classical electro-mechanical dynamics described in Appendix B under normal or post-fault conditions. In Fig. 5, the regions enclosed by the color-coded curves are the intersections of the \(\delta _{31}\)-\(\delta _{21}\) plane, with \({\mathrm {\Delta }}\omega _{i}=0\), \(i=1,2,3\), and the post-fault stability regions obtained when one of the six transmission lines has been removed. The regions are obtained conservatively by the closest unstable equilibrium point (UEP) method [28]. The dashed curves emanating from the pre-fault equilibrium are the six fault-on trajectories without tripping the faulted lines. The trajectories are obtained by simulating the system’s fault-on continuous-state, starting from the pre-fault equilibrium. These trajectories cross the boundaries of their respective post-fault stability regions in 200–400 milliseconds (critical clearance times).

The window-shaped curves in Fig. 6 are coverage profiles evaluated using (9) with some simplifying approximations, when a mid-point short to ground fault in Line 5-7 (Fig. 3) has occurred. All but one of the profiles represent primary protection misoperations in which a line is falsely tripped (instantaneously at the onset of the fault, \(t=0\)) and the recovery from the false trip occurs at t, when the secondary protection takes a corrective control action. The largest window corresponds to the recovery from a false trip of Line 4-5, and the window size may be attributed to the complete loss of Load A. The profiles in Fig. 6 indicate that the window of opportunity for recovery from a misoperation (from m to d), i.e., for secondary protection, is almost as wide as that for primary protection (from f to d). On the other hand, both Fig. 5 and Fig. 6 reveal that the system’s electromechanical continuous-state and the coverage profiles are not responsive enough to identify transmission line fault modes within the window of opportunity. The dashed curve marked \(p_{(5,7)}(t)\) hints that if a mode probability (mode \(f_i\) = Line 5-7 short-circuited) can distinguish itself from the other mode probabilities well within the critical clearance time, the recovery probability \(s_{(5,7)}(t_d)=c_{(5,7),u_{(5,7)}}(t_d)p_{(5,7)}(t_d)\) as given in (7) can be high, where \(t_d\) is less than the associated critical clearance time.

Fig. 5

Stability regions and fault-on trajectories

Fig. 6

Coverage profiles for a short to ground fault at Line 5-7

3.3 Fault diagnosis based on PMU input samples and electric network dynamics

The necessity for including fault diagnosis in the secondary protection has been established in Section 3.2. The multiple-model filtering method [34, 35] is applied to fault mode diagnosis for the 9-bus system equipped with a network of 3 PMU-like sensors in [22]. More specifically, a bank of Kalman filters corresponding to different fault modes is executed in parallel. Each filter is built on a design model based on a particular configuration of the electric network.

An example of a design model of such filters is given in Appendix C, with the inputs being the estimated internal generator voltages, the states being the independent inductor currents and capacitor voltages in the fictitious lumped circuit models of the electric network, and the outputs being the time-tagged input samples of voltage and current waveforms of PMU-like sensors. The number of design models associated with each fault mode is minimized by heuristic means to balance between the complexity of a filter bank and the accuracy of the diagnosis outcome. Assuming a multivariate Gaussian distribution for the output residuals \(\tilde{\xi }_{f_i}(t_k)\) of a filter, the probability \(\rho _{\xi _{f_i}}(t_k)\), which indicates how likely the observed system inputs and outputs are to be associated with an assumed model, can be obtained [36]. The probability that a model infers a system mode, conditioned on the measurements, can then be calculated by [35, 37]

$$\begin{aligned} p_{f_i}(t_k) = \frac{p_{f_i}(t_{k-1}) \rho _{\xi _{f_i}}(t_k) }{\sum _{j} p_{f_j}(t_{k-1}) \rho _{\xi _{f_j}}(t_k)} \end{aligned}$$
(14)

This implements the mode probability distribution defined in (8).
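The recursion in (14) can be sketched as follows. This is a minimal illustration and not the filter bank of [22]: the residuals are treated as scalars rather than multivariate vectors, the residual sequences and the variance are made-up numbers, and the small probability floor (a common safeguard that keeps a mode from being locked out permanently) is an assumption that does not appear in (14).

```python
import math


def gaussian_likelihood(residual, variance):
    """Scalar Gaussian density of a filter residual (stand-in for rho in (14))."""
    return math.exp(-0.5 * residual ** 2 / variance) / math.sqrt(
        2.0 * math.pi * variance
    )


def update_mode_probabilities(prior, likelihoods, floor=1e-9):
    """One step of (14): posterior_i is proportional to prior_i * rho_i."""
    weighted = [p * rho for p, rho in zip(prior, likelihoods)]
    total = sum(weighted)
    if total == 0.0:
        # All likelihoods underflowed; keep the previous distribution.
        return list(prior)
    # Assumed safeguard: floor each probability, then renormalize.
    posterior = [max(w / total, floor) for w in weighted]
    norm = sum(posterior)
    return [p / norm for p in posterior]


# Hypothetical residual sequences from three filters; filter 1 matches the true mode.
residuals = [
    [0.9, 1.1, 1.0, 0.8],
    [0.1, -0.2, 0.05, 0.1],
    [0.6, 0.7, 0.5, 0.65],
]
p = [1.0 / 3.0] * 3  # uniform prior over the three modes
for k in range(4):
    rho = [gaussian_likelihood(residuals[i][k], 0.04) for i in range(3)]
    p = update_mode_probabilities(p, rho)
```

With these numbers the probability mass concentrates on the filter whose residuals stay small, which is the behavior the dashed curve \(p_{(5,7)}(t)\) in Fig. 6 relies on.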

Figure 7 shows an example of the fault diagnosis results for the aggregated discrete-state as defined in Fig. 2b, together with the model probabilities. The PMU input waveforms are sampled at 24 samples/cycle with a signal-to-noise ratio of 15 dB. The system experiences the same sequence of events as in Fig. 4. Based on the operation of breakers, three banks of filters are used. The first bank, consisting of a normal design model and 6 design models for the 6 line faults, runs from 0 to 0.4 s. The second and third banks, each consisting of a post-fault design model and 5 misoperated design models, run from 0.4 to 0.5 s and from 0.5 to 1.0 s, respectively. Note that only 4 model probabilities are indicated in the legend, while a total of 19 filters are used during the diagnosis process [22]. Our results show an average diagnosis latency due to information deficiency of 23 ms, a little over one AC cycle, without counting communication delays. The more subtle issues of this diagnosis approach, such as scalability and model selection, are reported separately in [11].

Fig. 7

Fault diagnosis results

4 Performance analysis of the secondary protection via hybrid simulation

A hybrid simulation program is developed to assess the performance of the secondary protection. Here, hybrid refers to the integration of both continuous-state simulation and discrete event simulation. The discrete event simulation is modeled as a finite state automaton based on Fig. 2b, where most transitions occur randomly based on the inter-event distributions whose parameters are given in Table 1 of [22].

A typical cycle of the hybrid simulation proceeds as follows. (1) Generate the next stochastic discrete event based on the inter-event distributions feasible at the current state. (2) Determine the power system configuration and generate the continuous-state trajectories until the discrete-state holding time expires. (3) Determine the controlled transition based on the outcomes of fault coverage evaluation and fault diagnosis. (4) Enter the next discrete-state, and a new cycle starts. Fig. 8 depicts the hybrid simulation schematic with the secondary protective control. The hybrid simulation is implemented in MATLAB. Fig. 9 illustrates some fine points in a typical simulation process. A simulated sample path of a one-year duration is shown in Fig. 10 based on the scalable model in Fig. 2b specialized to the 9-bus system in Fig. 3.
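The discrete-event part of such a cycle can be sketched as below. This is a toy illustration written in Python, not the MATLAB program used in this paper. The state labels p, f, m, d, and o follow the text, but the transition structure, the exponential holding times, the mean values, the false-trip probability, and the function name `simulate` are all hypothetical. The continuous-state simulation of step (2) is omitted, and the controlled transition of step (3) is replaced by a Bernoulli draw with an assumed recovery probability.

```python
import random


def simulate(horizon_h, recover_prob, seed=0):
    """Return the fraction of time spent in each discrete state.

    All numbers below are made up for illustration only.
    """
    rng = random.Random(seed)
    # Hypothetical mean holding times in hours for each state.
    mean_hold = {"p": 876.0, "f": 1e-5, "m": 1e-5, "d": 10.0, "o": 5.0}
    false_trip_prob = 0.1  # hypothetical probability that a fault leads to a false trip
    time_in = {s: 0.0 for s in mean_hold}
    state, t = "p", 0.0
    while t < horizon_h:
        # Step (1): draw the holding time of the current state.
        hold = min(rng.expovariate(1.0 / mean_hold[state]), horizon_h - t)
        # Step (2) would integrate the continuous-state trajectory over `hold`.
        time_in[state] += hold
        t += hold
        # Steps (3) and (4): choose and enter the next discrete state.
        if state == "p":
            state = "f"  # equipment fault
        elif state == "f":
            state = "m" if rng.random() < false_trip_prob else "d"
        elif state == "m":
            # Stand-in for the secondary protection decision.
            state = "d" if rng.random() < recover_prob else "o"
        else:
            state = "p"  # repair from d or restoration from o
    return {s: v / horizon_h for s, v in time_in.items()}


# Compare 100 simulated years with and without (idealized) secondary protection.
with_sp = simulate(8760.0 * 100, recover_prob=1.0)
without_sp = simulate(8760.0 * 100, recover_prob=0.0)
```

In this toy model the outage state o is reached only through an unrecovered misoperation, so the time fraction at o vanishes when every false trip is recovered.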

The top panel of Fig. 9 shows the evolution of the aggregated discrete-state in the 3-generator 9-bus system [32] as a sample path of an 8760-hour long hybrid simulation. The lower panel shows the evolutions of the discrete-state and the continuous-state over a 1-s time span at around the 1370th hour into the simulation. The continuous-state plot shows two rotor angle components relative to the third rotor angle. The rotor angles depart because of a brief holding time of a few tens of milliseconds at state ‘1’ due to a short circuit in a transmission line, followed by another brief holding time of a few tens of milliseconds at state ‘3’ due to a relay false trip. The recovery to secure state ‘2’ eventually brings the rotor angles to a new stable equilibrium, which takes a few hundred ms to settle [38].

Fig. 8

Hybrid simulation schematic with the secondary protective control in the loop

Fig. 9

A typical hybrid simulation

Fig. 10

Hybrid Simulation Results

The results in Table 2 are obtained from the output analysis of the hybrid simulation. In particular, the event probability, analytically expressed in (2) of Section 2.2, is estimated by performing output analyses of the sample paths of 1000 independent replications. In this case, the location of a short circuit fault is uniformly distributed along any section of a transmission line, whereas the diagnosis is carried out by using two representative models per transmission line. The choice of two models per transmission line is determined heuristically by the desired accuracy of the diagnosis outcome.

Table 2 Steady-state probabilities at state p (or 0), d (or 2), and o (or 5), as well as event probability for recovery from a primary misoperation, with and without the secondary protection

A dramatic increase in the recovery probability, expressed as \(\text {Prob}[d|m]\approx s_{32}\) in (2), is attained (from 0 to more than 0.4). As a result, the \(N-1\)–secure state probability \(\pi _2\) is also significantly increased and the outage state probability \(\pi _o\) is decreased. State probabilities at the two insecure states are not shown in Table 2 because they are negligibly small.

A significant advantage of using a hybrid simulation is its flexibility in specifying the clock structure and event lives, so that inter-event time distributions can closely conform to reality without being subject to the homogeneous Markovian assumption. The latter is desired to gain qualitative insights in analytic forms, as seen in (2) and (3) in Section 2.2. Simulation results presented here are mostly drawn from [38], where more details of our study through hybrid simulation can be found.

5 Conclusions and future work

New developments are made in the following areas in this paper: ① Misoperation and recovery processes are incorporated into a scalable stochastic availability model where security indices enter to quantify the effectiveness of the secondary protection for mitigation of cascading failures due to misoperations; ② Real-time computation of such security indices is tackled by simultaneous tracking of the fault-on electromechanical state to provide the probability of post-fault transient stability, and electric network state to provide the model probability distribution for a given secondary protective control action, both with the input samples of PMU-like sensors; ③ The proposed secondary protective control strategy capitalizes on the rapid entry of networked high sample rate sensors to provide real-time decision support cost-effectively for security enhancement without compromising the dependability offered by the existing primary protection. Therefore a new step has been taken towards realizing a cost-effective mitigation strategy of cascading failures and the goal is within our reach provided that current technological and computational potentialities are fully exploited.

Because of the extreme complexity of the mitigation problem at hand, many simplifications have been made in this paper. First, in the small 9-bus system, the state-space representation of the full electric network is used. For a larger system, the use of partitioned electric networks as the design models of diagnosis filters is necessary to reduce both computational and communications complexity. It is also desirable to develop a formal procedure for selecting the number of design models of diagnosis filters to balance between diagnosis accuracy and computational complexity.

As new sensing and control devices continue to enter the grid, the mitigation of protection misoperations should be incorporated into device placement criteria that involve the dynamics of the electric network [39] and the average sensor data availability [40]. Finally, the maximally secure mitigation strategy of protection misoperations can be regarded as a special solution of a Markov decision problem [24, 41] where a greedy policy is sought in the sense that it focuses on the immediate security concern. It would be desirable to investigate whether the generalization to a longer time horizon policy involving a sequence of high rate transitions, from state f to m to d, for example, could lead to a mitigation strategy that better benefits the power system reliability.