Cyber-Physical Power System (CPPS): A review on measures and optimization methods of system resilience

The Cyber-Physical Power System (CPPS) is one of the most critical infrastructure systems in a country because a stable and secure power supply is a key foundation for national and social development. In recent years, resilience has become a major topic in preventing and mitigating the risks caused by large-scale blackouts of CPPSs. Accordingly, this study first explains the concept and significance of CPPS resilience from the engineering perspective. Then, a review of representative quantitative assessment measures of CPPS resilience applied in the existing literature is provided. On the basis of these assessment measures, the optimization methods of CPPS resilience are reviewed from the three perspectives on which current research mainly focuses, namely, optimizing the recovery sequence of components, identifying and protecting critical nodes, and enhancing the coupling patterns between physical and cyber networks. The recent advances in modeling methods for cascading failures within the CPPS, which form the theoretical foundation for the resilience assessment and optimization research of CPPSs, are also presented. Lastly, the challenges and future research directions for the resilience optimization of CPPSs are discussed.


Introduction
A Cyber-Physical Power System (CPPS), such as the smart grid, consists of a large number of computing equipment (e.g., servers, computers, and embedded computing equipment), data acquisition equipment (e.g., sensors, phasor measurement units, and embedded data acquisition equipment), and physical equipment (e.g., generator sets, various distributed generations, microgrids, and loads). The computing and data acquisition equipment are connected through the communication network (i.e., cyber layer), and the physical equipment constitutes the interconnected power network (i.e., physical layer). A schematic of the structure of CPPSs is shown in Fig. 1. Power supply and control dependencies exist between the power and the communication networks. Specifically, each piece of computing equipment obtains real-time status information of the physical equipment from the data acquisition equipment connected to it and realizes the coordinated control and dynamic optimization of the power network through the two-way data interaction of the communication network. In turn, the power network provides the cyber layer with the electricity needed for normal operation (Wang et al., 2016a; Huang et al., 2017; Tu et al., 2019; Sturaro et al., 2020). Such interdependencies within the CPPS are conducive to the expansion of the network scale and the intelligent management of the power network. Meanwhile, the complex coupling relationship makes the power network more vulnerable in the face of natural disasters and malicious attacks (Vespignani, 2010; Huang et al., 2016; Nezamoddini et al., 2017; Li et al., 2019a).
Traditional risk management is dedicated to finding answers to the following five primary questions to prevent and mitigate the risk to CPPSs from natural disasters and malicious attacks (Kaplan and Garrick, 1981; Rausand, 2011): (i) What can happen (i.e., what can go wrong)? (ii) What is the likelihood that such a disruptive scenario will happen? (iii) What are the potential consequences if such a disruptive scenario does happen? (iv) How tolerable is the identified risk? (v) What are we going to do about it (i.e., what risk-reducing measures can be taken for the identified risk)? According to the answers to these five questions, risk management strategies focus on the following three aspects, or some combination of them, to mitigate risk: (i) reducing the likelihood of disruptive events (i.e., improving system reliability); (ii) reducing the potential consequences of disruptive events (i.e., reducing system vulnerability); and (iii) enhancing the ability of a system to withstand disruptive events (i.e., improving system robustness). In summary, the main objective of traditional risk management strategies is to prevent disruptive events from occurring, or to absorb them, in a preventive and protective way.
Although prevention and protection strategies are critical in preventing disruptive events or their consequences, recent events suggest that not all disruptive events can be prevented. The disturbances in CPPSs mainly come from natural disasters and malicious attacks. Hurricane Sandy is one of the well-known cases; it was the deadliest and most destructive hurricane of the 2012 Atlantic hurricane season and left more than 8 million customers across 21 states on the US East Coast without electricity for days and even weeks (Manuel, 2013; Henry and Ramirez-Marquez, 2016). After Hurricane Sandy, Hurricane Harvey in 2017 became the costliest hurricane on record in the US since Katrina in 2005. This disruptive event caused the worst electrical blackout in US history (Sebastian et al., 2017). In Ukraine in 2015, a blackout that affected more than 1.4 million people was caused by cyber-attacks on several substations (Lee et al., 2016). Given the consequences of these inevitable large-scale destructive events, rapid post-disaster response and recovery strategies become more important than prevention and protection strategies. Therefore, the US Department of Homeland Security, among others, has placed emphasis on resilience through preparedness, response, and recovery (US Department of Homeland Security, 2013; 2014).
The term resilience was introduced into the English language in the early 17th century from the Latin word "resiliere", which means to "bounce back" (McAslan, 2010). The common use of the word resilience implies the ability of an entity or system to return to a normal condition after the occurrence of an event that disrupts its state. Such a broad concept applies to diverse academic disciplines to express the property of different objects or systems; thus, various definitions of the notion of resilience have been proposed. The CPPS belongs to the engineering discipline, and it includes technical systems designed by engineers that interact with humans and technology (Holling, 1996). Therefore, this study focuses on the definition of resilience from the engineering perspective. Hollnagel et al. (2006) defined engineering resilience as "the intrinsic ability of a system to adjust its functionality in the presence of a disturbance and unpredicted changes". In addition, the US National Infrastructure Advisory Council (2009) defined the resilience of infrastructure systems as "their ability to predict, absorb, adapt, and/or quickly recover from a disruptive event such as natural disasters".
Based on the significance and concept of resilience in CPPSs discussed above, two research questions have become the major research topics in the last decade: (i) How is the resilience of CPPSs measured? (ii) How is the resilience of CPPSs optimized and enhanced? This review examines these two research questions. For the former question, a review of the representative assessment measures of CPPS resilience applied in the existing literature is provided. For the latter question, the latest advancements of optimization methods for CPPS resilience are reviewed from the three perspectives on which current research mainly focuses, namely, optimizing the recovery sequence of components, identifying and protecting critical nodes, and enhancing the coupling patterns between the physical and the cyber networks. These assessment measures and optimization methods depend on the system performance transition curve of CPPSs under the occurrence of disturbances. The description and quantification of the system performance curve further depend on the establishment of coupling dependencies and cascading failure models within the CPPS. Hence, the recent advances in modeling methods for cascading failures within the CPPS are also presented to provide the theoretical foundation for the resilience assessment and optimization research of CPPSs.
The overall scheme of this paper is shown in Fig. 2. The remainder of the paper is organized as follows. Section 2 first introduces the generic system performance transition curve of CPPSs under the occurrence of disturbances. Then, a review of the recent advances in modeling methods for cascading failures within the CPPS is provided. Section 3 reviews the representative curve-based quantitative assessment measures of CPPS resilience applied in the existing literature. Section 4 reviews the latest advances of optimization methods for CPPS resilience from three major research directions. Section 5 discusses the challenges and possible future research directions for the assessment and optimization of CPPS resilience.
2 System performance transition curve and cascading failure models of CPPSs
In this section, the generic system performance transition curve of CPPSs under the occurrence of disturbances is introduced to lay the foundation for the quantification and optimization research of CPPS resilience. Understanding the coupling dependencies within the CPPS and establishing the corresponding cascading failure models are the prerequisites to quantify and obtain the system performance transition curve. Hence, a review of the modeling methods for cascading failures of CPPSs in the existing literature is presented.

Generic system performance transition curve of CPPSs under the occurrence of disturbances
The transition of the system performance after a disturbance can be divided into three stages (Ouyang et al., 2012; Panteli and Mancarella, 2015; Li et al., 2017b) (Fig. 3). The vertical axis, $F(t)$, expresses the performance level of the system, and it can be evaluated by using different dimensions of the CPPS, such as the performance (e.g., the percentage of total or critical loads and the number of survived components), economic (e.g., loss due to power outages), and social (e.g., the percentage of unaffected population) dimensions. The first stage ($t_0 \le t < t_e$) is the prevention stage from normal operation to the onset of initial failure (the disturbance occurs at time $t_e$, i.e., point A, at which the robustness of the system can be shown), which shows the predictive capacity of the system (i.e., the ability to prevent any possible disturbances). The reliability of the system is reflected in this stage. The second stage ($t_e \le t \le t_d$) is the damage propagation stage after the occurrence of a disturbance. Specifically, if the system is sufficiently robust, then the disturbance can be adapted to and absorbed, and the performance level will not be affected by the disturbance. Point B represents the initial damage to the system caused by the disturbance occurring at time $t_e$, and point C at time $t_d$ denotes the maximum impact suffered by the system. These two points show the vulnerability of the system. Hence, the second stage reflects the absorptive and adaptive capacities of the system (i.e., the ability to absorb, adapt to, and mitigate the impacts of disturbances). The transfer from B to C is mainly caused by cascading failures of the system. For example, when a node of one network in the CPPS fails, this initial failure may cause the interdependent nodes in the other network to fail, and vice versa. This process may continue recursively, eventually leading to large-scale blackouts. The duration of this stage is much shorter than those of the other two stages, and it can be regarded as instantaneous. The third stage ($t_d < t \le t_r$) is the recovery stage, which restores the system from the maximum damage point C to the normal state D, where $t_r$ denotes the time when all repair tasks are completed. The final performance level after recovery, $F(t_r)$, may not be equal to the initial performance level of the system, $F(t_0)$, because the system may be improved during recovery. In addition, an increase in the slope of the curve in the recovery stage reflects the implementation of a rescue strategy, namely, an increase in recovery resources.
Fig. 2 Overall research roadmap for this paper.
Gongyu WU & Zhaojun S. Li. CPPS: A review on measures and optimization methods of system resilience
Section 1 indicated that resilience is defined as the system's ability to predict, absorb, adapt, and/or quickly recover from a disturbance. Consequently, resilience covers all three stages in Fig. 3. Current studies on resilience assessment and optimization are mostly carried out around the transition curve shown in Fig. 3. Understanding and modeling the propagation mechanism of cascading failures within CPPSs are the bases of investigating CPPS resilience because the transition curve in Fig. 3 must be obtained through the cascading failure model.
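The three stages described above can be illustrated with a minimal sketch of a performance curve. The class name `TransitionCurve`, the piecewise-linear shape between the points A, C, and D, and the sample numbers are assumptions made only for illustration; the reviewed literature obtains the actual curve from a cascading failure model.

```python
from dataclasses import dataclass


@dataclass
class TransitionCurve:
    """Hypothetical piecewise-linear stand-in for the curve F(t) in Fig. 3."""
    t0: float  # start of observation
    te: float  # disturbance occurs (point A)
    td: float  # maximum impact reached (point C)
    tr: float  # all repair tasks completed (point D)
    f0: float  # initial performance level F(t0)
    fd: float  # performance at maximum impact F(td)
    fr: float  # final performance after recovery F(tr)

    def stage(self, t: float) -> str:
        """Return the stage that time t falls into, using the same bounds as the text."""
        if self.t0 <= t < self.te:
            return "prevention"
        if self.te <= t <= self.td:
            return "damage propagation"
        if self.td < t <= self.tr:
            return "recovery"
        return "outside the observed period"

    def performance(self, t: float) -> float:
        """Performance level F(t), linearly interpolated inside each stage."""
        if t < self.te:
            return self.f0
        if t <= self.td:
            if self.td == self.te:  # instantaneous damage propagation
                return self.fd
            share = (t - self.te) / (self.td - self.te)
            return self.f0 + (self.fd - self.f0) * share
        if t <= self.tr:
            share = (t - self.td) / (self.tr - self.td)
            return self.fd + (self.fr - self.fd) * share
        return self.fr


# Illustrative numbers only: full performance is 1.0 and drops to 0.4.
curve = TransitionCurve(t0=0, te=2, td=3, tr=7, f0=1.0, fd=0.4, fr=1.0)
```

A steeper recovery segment (a smaller `tr - td` for the same `fr - fd`) corresponds to the increase in recovery resources mentioned above.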

Modeling methods for cascading failures of CPPSs
Research on cascading failures in actual CPPSs often faces challenges in terms of efficiency and cost. For this reason, the topology and coupling dependencies of CPPSs are abstracted into a graph structure with some reasonable assumptions via complex network theory, and approximate results of the failure cascade propagation can be obtained by simulation. Specifically, the power and communication networks are each modeled as an undirected graph (Shi et al., 2018). Each node represents a unique piece of physical equipment (e.g., a generating station or substation for the power network, and a control center for the communication network). Every branch denotes unique transmission towers, lines, or other transmission facilities for the power and communication networks. The power supply and control dependencies between the power and the communication networks are generally considered as two independent sets of directed branches.
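The graph abstraction described above can be sketched as a small data structure: two undirected adjacency maps for the two layers, plus two separate sets of directed branches for the power supply and control dependencies. The names `CPPSGraph`, `undirected`, and `directly_dependent_on`, and the node labels, are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field


def undirected(edges):
    """Build an undirected adjacency map from a list of (u, v) branches."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


@dataclass
class CPPSGraph:
    power: dict  # physical layer: node -> set of neighbouring power nodes
    cyber: dict  # cyber layer: node -> set of neighbouring communication nodes
    supply: set = field(default_factory=set)   # directed (power node, cyber node)
    control: set = field(default_factory=set)  # directed (cyber node, power node)

    def directly_dependent_on(self, node):
        """Nodes in the other layer that rely on `node` for power or control."""
        supplied = {c for p, c in self.supply if p == node}
        controlled = {p for c, p in self.control if c == node}
        return supplied | controlled


# Toy system: three substations in a line and two communication nodes.
cpps = CPPSGraph(
    power=undirected([("P1", "P2"), ("P2", "P3")]),
    cyber=undirected([("C1", "C2")]),
    supply={("P1", "C1"), ("P1", "C2")},  # P1 feeds both communication nodes
    control={("C1", "P2")},               # C1 controls P2
)
```

Keeping `supply` and `control` as two separate directed sets mirrors the modeling choice in the text; merging them into one undirected set would give the "one-to-one" style of model discussed later in this section.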
Existing modeling methods for cascading failures of CPPSs in the literature can be divided into two categories, namely, complex network-based and approximate dynamic behavior-based models. The former category, which includes the betweenness-based model (Chen et al., 2014), the dynamical redistribution model (Crucitti et al., 2004), and the Medium and Low voltage model (Pagani and Aiello, 2013), is characterized by high efficiency and few input parameters. However, this category of models contains a large number of unrealistic assumptions and approximations about power flows. An example is the betweenness-based model, which assumes that power always flows along the shortest path between two nodes. Hence, complex network-based models are difficult to apply in engineering because of the inevitable errors between their results and the behavior of the actual power network.
The approximate dynamic behavior-based models can be further divided into Direct Current (DC) and Alternating Current (AC) power flow models. DC models (Mei et al., 2009; Liu and Li, 2016) are tractable relaxations of AC models and have high computational efficiency and good linear features. However, DC models are sometimes unable to accurately simulate the actual cascading failure processes because they cannot reflect the reactive power characteristics of the system (Stott et al., 2009). Accordingly, AC models, which consider the reactive power characteristics, have been introduced and studied. Rios et al. (2002) studied the effect of weather conditions and time-dependent phenomena, such as cascade tripping due to overloads, malfunction of the protection system, and potential power network instabilities, on the security level of power networks. These random failures are obtained by Monte Carlo simulation. The AC power flow and bus load shedding were used in their cascading failure model. Nedic et al. (2006) investigated the criticality of power networks in a 1000-bus network by using similar cascading failure models. However, the self-regulating function of the generators and the economic power dispatch have not been considered in these cascading failure models. Li et al. (2017a; 2018) built an AC cascading failure model considering bus load shedding and AC optimal power flow analyses. The self-regulating function and economic dispatch are considered in their models. Wu and Li (2019) introduced multistate failures and the corresponding degradation of performance for each node and branch into the cascading failure model. Some other similar models for independent power networks can be found in Zhang et al. (2017), Li et al. (2018a), and Khederzadeh and Zandi (2019). All of these modeling methods focus on, and provide effective models for, the cascading failures of a single power network.
Existing research on the coupling effect of interdependencies within CPPSs is relatively scarce. Buldyrev et al. (2010) proposed a cascading failure model for the CPPS based on percolation theory. In their model, "one-to-one" undirected dependencies are considered. Specifically, they integrated the power supply and control dependencies into an undirected set of branches (i.e., each node is assumed to have one and only one bi-directional link to the interdependent network). If a node in any network fails, then the node connected to it in the interdependent network will also fail. However, the research of Buldyrev et al. (2010) does not consider the power flow distribution of power networks during the propagation of cascading failures. For this reason, Zhang and Yağan (2020) incorporated the equal load redistribution and overload failures of the power network into the "one-to-one"-based cascading failure model. Zang et al. (2019) introduced power dispatching and control strategies, such as the AC power flow analysis, AC optimal flow analysis, and bus load shedding, into their improved model. Some other improvements on the basis of the "one-to-one" undirected dependency-based model are presented in Yağan et al. (2012), Wang et al. (2016a), Liu et al. (2019), and Pan et al. (2020).
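The "one-to-one" dependency idea can be illustrated with a purely topological sketch. It assumes, as a simplification in the spirit of percolation-style models, that a node keeps working only if it belongs to the largest connected component of its own network and its counterpart in the other network is also working. The function names, the shared node labels, and the toy networks are hypothetical; no power flow is modeled here.

```python
def build(nodes, edges):
    """Undirected adjacency map that also keeps isolated nodes."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def largest_component(adj, alive):
    """Largest connected set of `alive` nodes (first found wins on ties)."""
    best, seen = set(), set()
    for start in sorted(alive):
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v in alive and v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def one_to_one_cascade(power, cyber, initially_failed):
    """Iterate failures between the two layers until no further node fails.

    Node i of the power layer is assumed to be coupled to node i of the
    cyber layer, so one set of labels describes the survivors of both.
    """
    alive = set(power) - set(initially_failed)
    while True:
        in_power = largest_component(power, alive)
        in_both = largest_component(cyber, in_power)
        if in_both == alive:
            return alive
        alive = in_both


nodes = range(1, 7)
power = build(nodes, [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)])
cyber = build(nodes, [(1, 2), (2, 3), (3, 4), (4, 5), (1, 6)])
survivors = one_to_one_cascade(power, cyber, initially_failed={3})
```

In this toy case, removing node 3 splits the power chain, and node 6 then fails only because its cyber counterpart is cut off from the rest of the communication network, which is the kind of cross-layer propagation the text describes.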
Some researchers have further investigated the complex dependencies in the CPPS. Huang et al. (2013) applied two sets of directed branches to represent the power supply and control dependencies. Their model allows each power node to supply power to multiple communication nodes. However, each communication node has only one power supply dependency and controls only one power node. Chen et al. (2018a) introduced the control threshold of the power network into their improved model on the basis of the directed dependency-based cascading failure model. They assumed that some power nodes must be controlled by multiple communication nodes to operate normally, as in multiple wide-area control. In these modeling methods of cascading failures within the CPPS, power nodes are typically classified into four categories, namely, power supply only nodes, load only nodes, power supply and load nodes, and nodes that are neither power supply nor load nodes. However, the communication nodes are assumed to have the same level and function, and their node types have not been well studied. In addition, these studies assumed that any power node can be selected to provide electricity to communication nodes; however, only distribution substations, i.e., load nodes, have the function of supplying power to end users in the actual CPPS.
For these reasons, Wang et al. (2016b) divided the nodes in the communication network into two categories, namely, routing nodes and a unique control center. In their model, each power node is configured with a corresponding routing node, and the data collected by all routing nodes are sent to the unique control center to monitor and control all power nodes and branches. Accordingly, the regional control for large CPPSs can be simulated. Sturaro et al. (2020) considered two types of communication nodes, namely, control centers and relays. The types of the power nodes are considered when selecting the power source for the communication nodes in their model. Guo et al. (2019) divided the communication nodes into three levels, namely, local control centers, area load dispatch centers, and regional control centers. In their model, random failures of power or communication nodes due to overload, hidden failures, and control center failures have been studied. Wu et al. (2020) proposed a more realistic cascading failure model for the CPPS, which involves not only the node types and interdependencies of the power and communication networks but also the multi-state failures, power dispatching, and control strategies of power networks. The evolution of the modeling methods of CPPS cascading failures and the corresponding representative methods are summarized in Table 1.

Curve-based quantitative assessment measures for CPPS resilience
The generic system performance transition curve in Fig. 3 shows the general trend in the performance level of a system over time under the occurrence of disturbances. The assessment of CPPS resilience depends on this transition curve. This section presents a review of the representative curve-based quantitative measures for the assessment of CPPS resilience applied in the existing literature.
Henry and Ramirez-Marquez (2012) measured the CPPS resilience through the proportion of the system performance that has been recovered from its disrupted state. In their method, the value of resilience $R(t_R)$ corresponding to the performance level at an arbitrary time $t_R$ during the recovery process, denoted as $F(t_R)$, can be computed via the following equation:
$$R(t_R) = \frac{F(t_R) - F(t_d)}{F(t_0) - F(t_d)} \quad (1)$$
where $F(t_0)$ and $F(t_d)$ represent the initial performance level of the CPPS and its performance level at the maximum impact caused by the disturbance, respectively. However, the duration of recovery measures is ignored. Shinozuka et al. (2004) introduced the duration of the recovery process (i.e., the duration from $t_d$ to $t_r$ in Fig. 3) into the measure for assessing CPPS resilience. In their measure, $F(t_r)$ expresses the final performance level of the CPPS after all recovery measures have been completed, so that both the amount of performance restored from the damaged state and the time required for that restoration are involved. Francis and Bekera (2014) proposed a measure that considers the recovery speed. In the mathematical expression of their method, $S_p$ is the speed recovery factor, $t_r^*$ is the time to complete recovery measures, $t_\delta$ is the slack time that represents the maximum amount of time post disaster that is acceptable before recovery, and $\alpha$ is a parameter controlling the decay in resilience attributable to the time to a new equilibrium. $F(t_r)/F(t_0)$ expresses the adaptive capacity of the CPPS as the proportion of the original system performance retained after the new stable performance level has been achieved. $F(t_d)/F(t_0)$ expresses the absorptive capacity in terms of the proportion of the original system performance retained immediately after the event. However, the measures discussed above only focus on the performance level of the CPPS at two moments during the recovery process, namely, the time when the impact suffered by the system reaches its maximum and the time when the system reaches its final new equilibrium. Meanwhile, the change in system performance during the recovery process is ignored (i.e., the change in the slope of the performance transition curve from C to D in Fig. 3).
Bruneau and Reinhorn (2007) considered the changes in the system performance during the system deterioration and recovery processes and quantified the CPPS resilience as the area between the target performance curve $TF(t)$ and the real performance curve $F(t)$ within the period from $t_0$ to $t_r$ (Eq. (5)). Some improvements to this measure can be found in Ayyub (2014), D'Lima and Medda (2015), and Bevilacqua et al. (2017). Ouyang and Dueñas-Osorio (2012) calculated the CPPS resilience as the ratio between the real performance and the target performance of a system (i.e., the ratio of the area between the curve $F(t)$ and the time axis to the area between the curve $TF(t)$ and the time axis) (Eq. (6)).
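The curve-based measures described in words above can be sketched numerically. The functions below are illustrative only: the recovered-proportion ratio follows the verbal definition attributed to Henry and Ramirez-Marquez (2012), the two area-based functions follow the verbal definitions attributed to Bruneau and Reinhorn (2007) and Ouyang and Dueñas-Osorio (2012), and the function names, the trapezoidal integration, and the sampled curve are assumptions, not the authors' implementations.

```python
def trapezoid(times, values):
    """Area under a sampled curve using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (values[i] + values[i - 1]) * (times[i] - times[i - 1])
    return area


def recovered_proportion(f_now, f_initial, f_damaged):
    """Share of the lost performance that has been recovered so far."""
    return (f_now - f_damaged) / (f_initial - f_damaged)


def resilience_loss_area(times, target, actual):
    """Area between the target curve TF(t) and the real curve F(t)."""
    gap = [tf - f for tf, f in zip(target, actual)]
    return trapezoid(times, gap)


def performance_ratio(times, target, actual):
    """Area under F(t) divided by the area under TF(t)."""
    return trapezoid(times, actual) / trapezoid(times, target)


# Hypothetical sampled curve: a drop to 0.5 followed by a two-step recovery.
times = [0, 1, 2, 3, 4]
actual = [1.0, 1.0, 0.5, 0.75, 1.0]
target = [1.0] * len(times)

loss = resilience_loss_area(times, target, actual)    # 0.75
ratio = performance_ratio(times, target, actual)      # 0.8125
halfway = recovered_proportion(0.75, 1.0, 0.5)        # 0.5
```

The point-based ratio only looks at single performance levels, whereas the two area-based functions respond to the whole shape of the curve between C and D, which is the distinction drawn in the text.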
Fang et al. (2016) defined the CPPS resilience as the ratio between the cumulative system performance that has been restored and the expected cumulative system performance (i.e., supposing that the system has not been affected by disturbances) during this time period.
This measure of Fang et al. (2016), Eq. (7), is the continuous-time representation of the measure in Eq. (1). Nevertheless, all measures discussed above focus only on a single disturbance and ignore the effect of the randomness and frequency of destructive events on system resilience. For these reasons, Ouyang et al. (2012) further incorporated multiple inter-related hazard events on the basis of the measure in Eq. (6) and proposed a novel time-dependent expected annual resilience measure, defined as the mean ratio of the area between the real performance curve and the time axis to the area between the target performance curve and the time axis during a year. In their measure, the occurrence of each type $h$ of hazard event is assumed to follow a Poisson process. In the mathematical expression of their measure, $E[\cdot]$ is the expected value, $T$ represents a year, $N(T)$ is the total number of event occurrences during $T$, $\lambda_h$ denotes the occurrence rate of hazard type $h$ per year, $AIA_n(t_n)$ is the area between the real performance curve and the target performance curve, called the impact area, for the $n$th event occurrence at time $t_n$, and $E[AIA_h]$ represents the expected impact area under hazard type $h$ accounting for all possible hazard intensities.
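The expected annual resilience idea can be illustrated with a simplified closed form. The sketch below is not the expression of Ouyang et al. (2012); it assumes a constant target performance over one year, independent Poisson arrivals for each hazard type, and impact areas that do not overlap in time. Under those assumptions the expected total impact area per year is the sum over hazard types of $\lambda_h E[AIA_h]$, because a Poisson process with rate $\lambda_h$ produces $\lambda_h$ events per year on average. The function name and the sample rates are hypothetical.

```python
def expected_annual_resilience(hazards, target=1.0):
    """Expected ratio of real to target performance area over one year.

    hazards: mapping of hazard type -> (rate per year, expected impact area),
             where the impact area is in (performance units x years).
    target:  constant target performance level, so the target area over
             one year equals `target`.
    """
    for name, (rate, area) in hazards.items():
        if rate < 0 or area < 0:
            raise ValueError(f"negative rate or impact area for {name!r}")
    expected_loss = sum(rate * area for rate, area in hazards.values())
    return 1.0 - expected_loss / target


# Illustrative values only, not taken from any cited study.
hazards = {
    "hurricane": (0.2, 0.05),     # one event every five years on average
    "cyber_attack": (0.5, 0.01),
}
annual = expected_annual_resilience(hazards)  # about 0.985
```

A frequent hazard with a small impact area and a rare hazard with a large one can contribute equally to the result, which is why both the frequency and the randomness of events matter for this kind of measure.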
The above-mentioned measures focus on a single dimension of the system performance level $F(t)$. Wu et al. (2020) proposed a measure that considers the performance and economic dimensions of the system and the occurrence frequency of hazard events. Specifically, two factors related to resilience, namely, the Systemic Impact (SI, i.e., the impact of disturbances on system performance) and the Total Recovery Effort (TRE, i.e., the cumulative cost during the recovery process), are integrated in their measure. The occurrence of the hazard event is also assumed to follow a Poisson process with occurrence rate $\lambda$ per year. In the expression of their resilience measure, $E(SI_A)$ and $E(TRE_A)$ denote the expected impact of disturbances on system performance and the expected cost, respectively, accounting for all possible disaster intensities $I$; $\beta$ is a nonnegative weighting factor that converts the units and balances the relative importance of SI and TRE; and $f(I)$ represents the probability mass function for disaster intensity $I$. The references for CPPS resilience assessment measures are compared in Table 2 in terms of the type, the stages or nodes involved, whether recovery speed and uncertainty are considered, and the assessment factor.

Optimization methods for CPPS resilience
According to the system performance transition curve introduced in Section 2 and the curve-based quantitative measures for assessing CPPS resilience reviewed in Section 3, the existing optimization methods for CPPS resilience are carried out around three aspects, namely, optimizing the recovery sequence of components, identifying and protecting critical nodes, and enhancing the coupling patterns between physical and cyber networks. Specifically, optimizing the recovery sequence focuses on the recovery stage after disturbances. This method is designed to speed up the recovery of system performance. The identification and protection of critical nodes focuses on the prevention and damage propagation stages. Reinforcing critical nodes (e.g., improving the disaster resistance level of buildings) or configuring redundancy for them (e.g., configuring backup power supply) can effectively reduce the frequency of system interruptions caused by disturbance events and mitigate the propagation of cascading failures. In this way, the influence of disturbances on system performance can be reduced; that is, the critical nodes are identified and protected to improve system robustness and reduce system vulnerability. Optimization of the coupling patterns focuses on the damage propagation stage. This method aims to prevent and mitigate the propagation of cascading failures by changing the connections between physical and cyber networks. In this section, an up-to-date review of existing optimization methods for CPPS resilience is presented from these three perspectives.
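The idea of identifying critical nodes can be illustrated with a simple removal-impact ranking. This sketch is not one of the reviewed methods: it uses the size of the largest connected component as a stand-in for system performance, removes one node at a time, and ranks nodes by the resulting drop. The function names and the toy network are hypothetical.

```python
def largest_component_size(adj, removed=()):
    """Size of the largest connected group of nodes after removing some nodes."""
    alive = set(adj) - set(removed)
    best, seen = 0, set()
    for start in alive:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v in alive and v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        best = max(best, len(comp))
    return best


def rank_critical_nodes(adj):
    """Rank nodes by how much their removal shrinks the largest component."""
    baseline = largest_component_size(adj)
    impact = {n: baseline - largest_component_size(adj, {n}) for n in adj}
    # Highest impact first; ties are broken by node name for reproducibility.
    return sorted(impact.items(), key=lambda item: (-item[1], item[0]))


# Toy network: a chain a-b-c-d-e with an extra leaf f attached to c.
network = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d", "f"},
    "d": {"c", "e"},
    "e": {"d"},
    "f": {"c"},
}
ranking = rank_critical_nodes(network)  # node "c" comes first
```

In a fuller study, the proxy would be replaced by the performance level obtained from a cascading failure model, and the highest-ranked nodes would be the candidates for reinforcement or redundancy.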
4.1 CPPS resilience optimization through optimizing the recovery sequence of components
Ouyang et al. (2012) proposed a framework to simulate the system performance transition curve during the recovery process. Their method accounts for the vulnerability of each component to a random disaster and the amount of recovery resources required. However, all resources are assumed to have the same effectiveness. They chose the power transmission grid in Harris County, Texas, US, as a case study, and their research showed that the recovery sequence plays a key role in system resilience under limited resources. They further investigated the recovery sequence of components and introduced some operation mechanisms of power networks, such as load shedding and power balance based on DC power flows, into their framework (Ouyang and Duenas-Osorio, 2014). The recovery sequence in their research is mainly determined by specific rules associated with the type of components, and quantitative methods and strategies are lacking. Figueroa-Candia et al. (2018) modeled the decision problem of the optimal recovery sequence as a mixed-integer optimization problem. Risk analysis, investment, and resilience targets are integrated in their method. Specifically, four factors, namely, First Fail First Repair, Shortest Repair Time, Fastest Customer Restoration, and Number of Customers, are investigated as the resilience targets. Fang et al. (2016) built a mixed-integer programming model to quantify the criticality of each damaged component based on two metrics, namely, the optimal repair time and the resilience reduction worth. The final repair priorities of damaged components are ranked according to the combination of these two metrics. Shi et al. (2019) also developed a power supply recovery programming model, based on their proposed description model of power supply paths, to obtain the optimal recovery strategy for distribution networks after disasters. The objective function in their method maximizes the supplied power load. Zhang et al. (2019) developed a reinforcement learning approach based on Q-learning to obtain the recovery sequence of power networks after cascading failures. However, some necessary electrical constraints of power networks, such as voltage amplitude and phase angle constraints, are not covered in this research. Gao et al. (2016) formulated the decision problem of the optimal recovery sequence as a two-objective chance-constrained program. The optimal recovery sequence obtained in their research simultaneously maximizes the amount of power received by weighted load nodes and minimizes the voltage variation. Wei et al. (2020) focused on tripping events of transmission lines caused by cyber-attacks and the impacts of reclosing these lines on the power network, such as current inrush and power swing. A recovery strategy based on a deep reinforcement learning framework is proposed to determine the optimal reclosing time of each tripped transmission line. Arab et al. (2015) investigated the recovery sequence of components in two stages, namely, the pre-positioning stage prior to a natural disaster and the real-time allocation stage after the natural disaster strikes. A two-stage stochastic problem with complete recourse is formulated, and the objective function minimizes the expected restoration operation cost, customer load interruption cost, and electricity generation cost. Other similar two-stage quantitative methods for the optimal recovery sequence are presented in Lei et al. (2016), Chen et al. (2017), and Jiang et al. (2020). Li et al. (2019b) studied the optimal routing of recovery resources and developed a joint optimization model to simultaneously determine the recovery sequence of damaged components and the route of each resource. Zhou et al. (2020) considered the impact of dependencies inside systems on cascading failures and built a cascading failure model with dependence clusters of nodes. Four different restoration prioritization strategies are applied and compared to maximize system resilience by mitigating the performance loss of systems. Nevertheless, the above-mentioned methods only focus on power networks (i.e., the physical layer) and ignore the coupling effects of interdependencies between physical and cyber layers.
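To make the role of the recovery sequence concrete, the following minimal sketch compares repair orders for a single repair crew. It is not taken from any of the cited studies: the component names, repair times, and restored loads are hypothetical, and the loss measure (unserved energy, i.e., load multiplied by the time at which it is restored) is a simplifying assumption that ignores power flow and other electrical constraints.

```python
from itertools import permutations

# Hypothetical damaged components: name -> (repair_hours, load_restored_MW).
DAMAGED = {
    "line_A": (4.0, 30.0),
    "substation_B": (6.0, 90.0),
    "line_C": (2.0, 10.0),
    "feeder_D": (3.0, 60.0),
}


def resilience_loss(sequence, damaged=DAMAGED):
    """Unserved energy (MWh) when one crew repairs components in order.

    Each load stays unserved until its component is repaired, so the loss
    is the sum over components of load * repair completion time.
    """
    clock = 0.0
    loss = 0.0
    for name in sequence:
        hours, load = damaged[name]
        clock += hours
        loss += load * clock
    return loss


def ratio_rule_sequence(damaged=DAMAGED):
    """Order repairs by load restored per repair hour, largest first."""
    return sorted(damaged, key=lambda n: damaged[n][1] / damaged[n][0],
                  reverse=True)


def brute_force_sequence(damaged=DAMAGED):
    """Check every permutation; only feasible for a handful of components."""
    return min(permutations(damaged),
               key=lambda seq: resilience_loss(seq, damaged))


if __name__ == "__main__":
    print("arrival order:", resilience_loss(list(DAMAGED)))
    print("ratio rule:   ", resilience_loss(ratio_rule_sequence()))
    print("brute force:  ", resilience_loss(brute_force_sequence()))
```

Under these assumptions the ratio rule matches the exhaustive search, which is why rule-based priorities can work well in simple settings. Once multiple crews, heterogeneous resources, or network constraints are added, the problem generally calls for the mixed-integer or learning-based formulations reviewed above.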
The method proposed by Ouyang and Wang (2015) is one of the few studies in the literature that integrated the coupling effects within the CPPS into the decision problem of the optimal recovery sequence. They applied the resilience assessment framework proposed in their previous research to interdependent power and gas systems. Five recovery strategies are investigated and compared through a multi-system joint restoration algorithm, which is based on genetic algorithms and assigns different weights to the two systems. González et al. (2016) built a mixed-integer programming model to obtain the recovery sequence of components. Their model simultaneously covers the resource allocation and recovery sequence in functionally interdependent infrastructure networks and the savings due to simultaneous collocated recovery jobs. However, only the "one-to-one" undirected dependencies between two networks are involved in the above-mentioned two studies. Wu et al. (2020) defined the optimization problem of the recovery sequence as a multi-mode resource-constrained project scheduling problem for maximizing CPPS resilience. The independent power supply and control dependencies (i.e., directed dependencies), electrical constraints, the diversity of recovery resources, the execution modes of recovery tasks, and the availability, cost, and time of recovery resources are involved in their method. The references for CPPS resilience optimization methods based on the optimization of the recovery sequence of components are summarized in Table 3 by considering the system type, coupling dependencies, method type, and the special factors involved.

4.2 CPPS resilience optimization through identifying and protecting critical nodes
Most existing research on the Critical Node Identification (CNI) problem is based only on the topology information of the system. Xu et al. (2018) integrated the k-shell decomposition, degree centrality, and structural holes to assess the local importance of a node. Yang et al. (2019) proposed a comprehensive importance measurement method that covers three classical measures of topology centrality, namely, degree, betweenness, and closeness centrality. In their method, the comprehensive importance of nodes is calculated as the weighted sum of these three measures, and the weights are determined by a proposed algorithm that integrates the entropy weighting method and the Vlsekriterijumska Optimizacija I Kompromisno Resenje (VIKOR) method. Chen et al. (2020) modeled the CNI problem as a nonconvex mixed-integer quadratic programming problem. Their method aims to obtain a set of critical nodes that can simultaneously maximize the weights between the nodes and minimize pairwise connectivity. Similar research based only on the topology information of the system can be found in Zhou et al. (2019), Jiang et al. (2019), and Sotoodeh and Falahrad (2019).
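The idea of combining several topology centralities through entropy weighting can be sketched as follows. This is a simplified illustration rather than the method of Yang et al. (2019): it uses only degree and closeness centrality (betweenness is omitted for brevity), it does not implement the VIKOR step, and the example graph and all function names are hypothetical.

```python
import math
from collections import deque

# Hypothetical undirected network as an adjacency list.
GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b", "e"],
    "d": ["a"],
    "e": ["c", "f"],
    "f": ["e"],
}


def degree_centrality(graph):
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """(n - 1) / sum of shortest-path lengths; assumes a connected graph."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        result[source] = (len(graph) - 1) / sum(dist.values())
    return result


def entropy_weights(columns):
    """Entropy weighting: measures that vary more across nodes weigh more.

    Assumes positive values and at least one non-uniform column.
    """
    n = len(columns[0])
    diversities = []
    for col in columns:
        total = sum(col)
        probs = [x / total for x in col]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        diversities.append(1.0 - entropy)
    norm = sum(diversities)
    return [d / norm for d in diversities]


def composite_importance(graph):
    """Entropy-weighted sum of the centrality measures for every node."""
    nodes = list(graph)
    measures = [degree_centrality(graph), closeness_centrality(graph)]
    columns = [[m[v] for v in nodes] for m in measures]
    weights = entropy_weights(columns)
    return {v: sum(w * col[i] for w, col in zip(weights, columns))
            for i, v in enumerate(nodes)}


if __name__ == "__main__":
    scores = composite_importance(GRAPH)
    for node in sorted(scores, key=scores.get, reverse=True):
        print(node, round(scores[node], 3))
```

In this toy graph, nodes "a" and "c" have the same degree, and the closeness term separates them, which is the kind of tie that motivates combining several measures.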
Methods based only on topology provide good guidance for the CNI problem in complex networks. However, such methods are difficult to apply to the CPPS. Specifically, in an actual CPPS, some critical nodes can only be located in marginal areas of the network because of their special geographical requirements or functions (e.g., wind power plants), which results in low topology centrality. Adebayo et al. (2018a) introduced three electrical parameters, namely, the admittance matrix of a power network, the voltage deviation of each load node, and the reactive power variation, into the CNI problem. Two measures for assessing the importance of nodes in a power network, namely, the network structural theory participation factor and the voltage critical bus index, are proposed. Liu et al. (2018) built an identification model for the CNI problem, which considers the dual role of the voltage anti-interference and influence abilities. The influence ability of a node involves the topological structure based on the electric betweenness and the operating status based on the entropy of power flow. Kang et al. (2018) modeled the power network as a directed graph and presented the concept of the giant efficiency subgraph of the power network. On this basis, they developed an algorithm based on four factors, namely, the node sharing degree, the cumulative distance between a node and all of the generation nodes, and the in-degree and out-degree of nodes, to calculate the critical degree of nodes. Yang et al. (2020a) modeled power networks based on electrical betweenness and defined the electric cactus structure to denote the effective control scope of a single node with less energy in power networks. Based on the network structure they defined, three measures, namely, the electric control capability, the electric dynamic characteristic, and the electric observe capability indexes, are provided to quantify the importance of power nodes. However, the methods discussed above do not apply to CPPSs because they do not consider the importance of communication nodes in the CPPS and the interdependencies between physical and cyber layers. Similar research on CNI problems only for power networks can be seen in Adebayo et al. (2018b) and Wang et al. (2020a).
Existing research on the CNI problem that involves the coupling effects and electrical characteristics of the CPPS is relatively scarce. One example is the method proposed by Fan et al. (2018), who identified the critical nodes of a CPPS by introducing hyper-network methods. In their method, a hyper-network model of the substation automation system is established according to hyper-graph theory. On this basis, the importance of nodes is calculated and ranked by the provided efficiency indexes and the functional impact on the CPPS model after a data attack on a communication node. Fan et al. (2020) divided the CPPS into three logical layers, namely, the physical topology, transport, and service layers. The importance of nodes in these three layers is quantified by the topology property, the flows that the node carries, and the service importance value, respectively. The comprehensive critical degree of each node is obtained by the proposed multi-layer critical node identification algorithm. The references for CPPS resilience optimization methods based on the identification and protection of critical nodes are summarized in Table 4 in terms of the system type, problem, assessment perspective, and the special factor involved.
4.3 CPPS resilience optimization through optimizing coupling patterns
Chen et al. (2015) investigated the effect of the coupling probability and three types of coupling patterns on the robustness of interdependent systems. Specifically, assortative (i.e., connect the node with the highest load in one network with the node with the highest load in the other network, and so on), disassortative (i.e., connect the node with the highest load in one network with the node with the lowest load in the other network, and so on), and random coupling patterns are compared on "one-to-one" undirected dependency-based systems. Their research shows that disassortative coupling is more robust for sparse coupling, whereas assortative coupling is more robust for dense coupling. Golnari and Zhang (2015) also investigated the effect of three types of coupling patterns, namely, "high-to-high" degree, "high-to-low" degree, and random coupling, on cascading failures within interdependent systems. The one-way dependencies of each node in one network on the other network are involved in their cascading failure model. The simulation results indicate that the "high-to-high" degree coupling results in smaller failure cascades than the other two types of coupling patterns. Liu et al. (2020) focused on the functionality and robustness of CPPSs and ranked and coupled the power and communication nodes through the average propagation latency. Meanwhile, a metric called the relative coupling correlation coefficient is provided to quantify the degree of coupling in the current system. The results show that a average propagation latency or a larger relative coupling correlation coefficient leads to less robustness. Wang et al. (2018) proposed a coupling strategy, namely, the neighbor node priority connection strategy, to obtain robust interdependent systems. Two factors that affect the system robustness are considered, namely, the degree distribution of nodes and the intensification effect of nodes on the cascading failure propagation. Other similar studies on optimal coupling patterns of interdependent systems can be found in Chattopadhyay et al. (2017) and Chen et al. (2018b).
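The three coupling patterns compared in these studies differ only in how the ranked nodes of the two networks are paired. The sketch below builds "one-to-one" dependency links for each pattern; the node names and load values are hypothetical, and the function `couple` is introduced here for illustration only.

```python
import random


def couple(loads_a, loads_b, pattern, seed=0):
    """Return one-to-one dependency links between two equally sized networks.

    loads_a and loads_b map node -> load (hypothetical values).
    pattern is "assortative", "disassortative", or "random".
    """
    if len(loads_a) != len(loads_b):
        raise ValueError("one-to-one coupling needs equally sized networks")
    # Rank both networks from the highest load to the lowest load.
    ranked_a = sorted(loads_a, key=loads_a.get, reverse=True)
    ranked_b = sorted(loads_b, key=loads_b.get, reverse=True)
    if pattern == "disassortative":
        ranked_b.reverse()  # highest load in A meets lowest load in B
    elif pattern == "random":
        random.Random(seed).shuffle(ranked_b)
    elif pattern != "assortative":
        raise ValueError(f"unknown pattern: {pattern}")
    return list(zip(ranked_a, ranked_b))


power = {"p1": 50.0, "p2": 20.0, "p3": 35.0}
cyber = {"c1": 5.0, "c2": 9.0, "c3": 1.0}
print(couple(power, cyber, "assortative"))
print(couple(power, cyber, "disassortative"))
```

Replacing the load values with node degrees gives the "high-to-high" and "high-to-low" degree patterns in the same way.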
The above-mentioned studies are instrumental in designing an interdependent system from the ground up. However, the large-scale reconstruction and rewiring of existing CPPSs to optimize their resilience are unacceptable in terms of cost and time. By contrast, changing a small number of branches between physical and cyber networks is a more practical way to withstand cascading failures and enhance CPPS resilience. Schneider et al. (2011) proposed a branch swapping strategy, which swaps the endpoints of a subset of branches, to mitigate the cascading failures caused by malicious attacks. Their research is based on single power networks. Moreover, the two key constraints in their research are keeping the number of branches and the degree of each node invariant, which ensures that the conductance distribution and transport properties of the improved system remain close to the original ones. Peng et al. (2020) investigated the performance of different branch swapping strategies under "one-to-one" undirected dependency-based system models. Specifically, seven different swapping strategies based on three types of classical network centrality measures (i.e., degree, betweenness, and eigenvector centralities) are designed. Their research found that the swapping strategy based on high eigenvector centrality has the best effect on improving network reliability.
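The two invariants mentioned above (the number of branches and the degree of each node) are exactly what a degree-preserving swap maintains. The following sketch shows only the swap mechanics on a hypothetical six-node ring; it accepts every valid swap, whereas a full strategy would accept a swap only if it improves a chosen robustness measure, and that acceptance step is deliberately left out here.

```python
import random


def degree_preserving_swap(edges, attempts=100, seed=0):
    """Rewire an undirected simple graph while keeping every node degree.

    Picks two branches (a, b) and (c, d) and replaces them with (a, d) and
    (c, b) whenever that creates no self-loop and no duplicate branch.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(attempts):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # a shared endpoint would give a self-loop or no change
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # the swap would duplicate an existing branch
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def degrees(edges):
    """Degree of every node that appears in the branch list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
swapped = degree_preserving_swap(ring)
print(swapped, degrees(swapped) == degrees(ring))
```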
The branch swapping strategies can significantly improve the system robustness with small changes in the system structure (i.e., with low cost). However, one limitation of these strategies is that they can only move around the existing branches within the system, which imposes an upper limit on system improvement; that is, even a system that has been made maximally robust through such swapping strategies can be made more robust by adding new branches. Consequently, the branch addition strategies have attracted the attention of scholars.
The branch addition strategies for a single system have been extensively studied. Kazawa and Tsugawa (2020) applied these existing methods to "one-to-one" undirected dependency-based systems and investigated their effectiveness for improving the robustness of interdependent networks against targeted attacks. Three extensions of existing branch addition strategies are proposed and verified. Meanwhile, two branch addition strategies for interdependent networks proposed by Ji et al. (2016) are also involved and compared. The above-mentioned methods aim to improve the robustness of interdependent networks by adding intra-layer branches rather than dependence branches (i.e., inter-layer branches). Wang et al. (2020b) proposed a dependence branch addition method based on the uniformity of node degree to improve the robustness of CPPSs. This method aims to homogenize the degree of nodes in the subnets of interdependent networks. Partial "one-to-one" undirected dependency-based systems, which consider redundancy and autonomous nodes, are adopted in their research. Cui et al. (2018) presented an optimal strategy based on attack modes for enhancing the robustness of directed dependency-based systems through the addition of intra-layer and dependence branches. A cost constraint and two attack modes, namely, the intra- and inter-degree priority modes, are involved.
The method proposed by Yang et al. (2020b) is one of the few studies in the literature that investigated branch reduction strategies for improving the robustness of interdependent networks. They indicated that the imbalance between the nodes at both ends of a dependence branch is one of the important reasons for a cascade of failures. To this end, they focused on removing key unbalanced dependence branches to improve the robustness of "one-to-one" undirected dependency-based systems. The unbalanced dependence branches in their research are defined as the dependence branches with the greatest degree difference between the nodes at their two ends, where the degree difference is calculated by their proposed dependency link imbalance index.
However, the branches between two networks are designed to meet the demand of the interdependencies and redundancies between power and communication networks. The removal of these branches not only fails to mitigate the spread of cascading failures across the networks but also may cause some nodes to lose power supply or control dependencies, thus affecting the normal operation of the CPPS. For these reasons, the branch reduction strategy lacks feasibility in an actual CPPS. From the above discussions, the references for CPPS resilience optimization methods based on the optimization of the coupling patterns between power and communication networks are summarized in Table 5 by considering the coupling dependency, optimization strategy, and the special factor involved.

Discussion and future research
Some challenges for current research on optimization methods of CPPS resilience still persist. This section discusses the research problems that can be considered in the future from four aspects, namely, the modeling methods of cascading failures, optimization methods of the recovery sequence, identification methods of critical nodes, and optimization of coupling patterns.

Modeling methods of cascading failures for CPPSs
Modeling the propagation mechanism of cascading failures is the basis of investigating CPPS resilience. Most current research on cascading failure models focuses on a single power network rather than the interdependent CPPS. The few studies that involve the coupling effects of interdependencies within the CPPS make a number of simple abstractions, approximations, and other assumptions about the interdependencies, such as "one-to-one" undirected dependencies. These modeling methods are difficult to apply to an actual CPPS. Therefore, the following problems concerning modeling methods of cascading failures should receive more attention in future research:
(i) How can the functionality of different types of communication nodes be further incorporated into cascading failure models? As in power networks, communication nodes also have different levels and functions, and the impacts of their failures on power networks are not equal.
(ii) How can the effects of different failure modes of communication networks on power networks be modeled? For example, communication congestion may cause delay and/or loss of the measurement and/or control signals (e.g., the phasor values from phasor measurement units), resulting in unawareness of the states of power networks.
(iii) How can the multi-state failure model of components and the state transitions be further established? Component failures present multiple states, each of which has a unique impact on the CPPS performance. In addition, the failure state of each component is variable during the propagation of cascading failures.
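For reference, the kind of simplified "one-to-one" undirected dependency model that these questions aim to move beyond can be written in a few lines. The sketch below is a generic percolation-style abstraction and not the model of any specific cited study: node i of the power network and node i of the cyber network are assumed to depend on each other, a node is assumed to work only if it belongs to the largest connected component of its own network, and the two six-node topologies are hypothetical. Node functions, communication delays, and multi-state failures are all absent, which is precisely the limitation raised in (i) to (iii).

```python
def largest_component(graph, alive):
    """Nodes of the largest connected component among the alive nodes."""
    best, seen = set(), set()
    for start in alive:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in graph[u]:
                if w in alive and w not in component:
                    component.add(w)
                    stack.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def cascade(power, cyber, attacked):
    """Surviving node labels after a cascade under one-to-one dependencies.

    A node survives only if it stays in the largest component of its own
    network and its counterpart in the other network also survives.
    """
    alive = set(power) - set(attacked)
    while True:
        survivors = largest_component(power, alive)
        survivors = largest_component(cyber, survivors)
        if survivors == alive:
            return alive
        alive = survivors


# Hypothetical topologies: a ring of power nodes and a chain of cyber nodes.
POWER = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
CYBER = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

print(cascade(POWER, CYBER, attacked={1}))
print(cascade(POWER, CYBER, attacked={3}))
```

In this example the power ring alone would tolerate the loss of any single node, but the dependency on the cyber chain turns one attacked node into two or three failed nodes.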

Optimization methods of the recovery sequence
The following problems concerning optimization methods of the recovery sequence should receive more attention in future research:
(i) How can the continuous impact of disasters on CPPSs during the recovery process be considered? Some disasters (e.g., hurricanes and earthquakes) have a periodic and persistent effect on CPPSs rather than a transient one. The sustained or secondary damage caused by such disasters during the recovery process has a great impact on determining (or modifying) the recovery sequence of components.
(ii) How can the reserves of different types of recovery resources be allocated in advance to cope with disasters? Sufficient recovery resources can effectively speed up the recovery of system performance. However, each resource incurs storage and maintenance costs.

Identification methods of critical nodes
The following problems concerning identification methods of critical nodes should be considered in future research:
(i) How can the importance of communication nodes be evaluated based on their coupling effects on power networks? Power and communication networks have their own operation mechanisms and measurement units. Evaluating the node importance for power and communication networks through the interdependencies within CPPSs remains a challenge.
(ii) How can a set of critical nodes be identified efficiently? Most studies on the CNI problem of CPPSs focus on the importance assessment and identification of a single node. However, the impact of a single node on the system is limited, and the greatest challenge is to identify the set of critical nodes whose simultaneous failures will cause system collapse or have the greatest impact on the system. Existing identification methods for the critical node set lack effective search strategies, which leads to low convergence speed and accuracy of the algorithms.
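The difference between ranking single nodes and identifying a critical node set can be seen in a small exhaustive search. The sketch below is an illustrative assumption rather than a method from the reviewed literature: the impact of a node set is measured only by the size of the largest connected component left after its removal, and the seven-node graph is hypothetical.

```python
from itertools import combinations


def largest_component_size(graph, removed):
    """Size of the largest connected component after removing some nodes."""
    seen, best = set(removed), 0
    for start in graph:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for w in graph[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def critical_node_set(graph, k):
    """Exhaustively find the k nodes whose joint removal fragments most."""
    return min(combinations(sorted(graph), k),
               key=lambda s: largest_component_size(graph, set(s)))


# Hypothetical network: two triangles joined through node 3.
GRID = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4],
    4: [3, 5, 6], 5: [4, 6], 6: [4, 5],
}

print(critical_node_set(GRID, 1))
print(critical_node_set(GRID, 2))
```

Here the most critical single node is node 3, yet the most critical pair is nodes 2 and 4, which does not contain node 3. The exhaustive search evaluates every combination of k nodes, so its cost grows combinatorially with network size, which is why more effective search strategies are needed.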

Optimization methods of coupling patterns
The following problems concerning optimization methods of coupling patterns should be considered in future research:
(i) How can the branch swapping and addition strategies be integrated, and how can the corresponding joint optimization method be developed, to mitigate the propagation of cascading failures? Branch swapping and addition strategies are two feasible optimization strategies for coupling patterns of CPPSs in the existing literature, and both have high economic benefits. The former is the more economical, whereas the latter has a higher upper limit on the improvement of CPPS resilience. The integration of these two strategies may lead to a better network structure.
(ii) How can the penalty factor and the constraints on the operation of CPPSs be integrated into the optimization methods of coupling patterns? In addition to the cost and time required to restructure the CPPS, other factors must be considered to maintain the normal operation of CPPSs, such as the type of node at each end of the branch, the power or communication area to which the node belongs, and the increased (or reduced) power transmission costs due to the change in the admittance matrix.

Fig. 1 Schematic of the structure of CPPSs.

Fig. 3 Generic system performance transition curve of CPPSs under the occurrence of disturbances.

Table 1
Evolution of the modeling methods of CPPS cascading failures and corresponding representative methods

Table 2
References analysis of CPPS resilience assessment measures

Table 3
References analysis of CPPS resilience optimization methods based on the optimization of the recovery sequence of components

Table 4
References analysis of CPPS resilience optimization methods based on the identification and protection of critical nodes

Table 5
References analysis of CPPS resilience optimization methods based on the optimization of the coupling patterns between power and communication networks