Introduction

Smart Manufacturing (SM) is still faced with several demands, such as the quick response to faults on the shop floor, inventory balancing while aiming for customized output, and flexible adjustment of operating schedules according to the product and shop floor state, which are difficult to satisfy simultaneously (Jeon et al., 2016; Tao et al., 2018; Wang et al., 2018). SM is becoming incredibly complex. It brings together different advanced technologies such as Internet of Things (IoT), Artificial Intelligence (AI), robots, machinery, cells, conveyors, new and innovative sensors, electronics, and Programmable Logic Controllers (PLC) (Angelopoulos et al., 2020; Cheng et al., 2018; Kang et al., 2016; Kusiak, 2018; Machado et al., 2020; Mittal et al., 2018; Qi & Tao, 2018; Radziwon et al., 2014; Tao et al., 2019). Therefore, SM is an interconnected and interoperable system which includes hardware and software components. Consequently, large losses in terms of time, cost, and production rejects due to the accumulation of faults can interrupt and even abruptly stop the entire production cycle. Indeed, SM systems require innovative solutions to improve the quality of the production process while reducing the cost (Cioffi et al., 2020; Davis et al., 2012; Machado et al., 2020; Zheng et al., 2018).

SM systems often consist of several layers that are interconnected and in communication with each other. These layers are: (1) the physical layer, which consists of manufacturing facilities such as robots, machines, cutting tools, sensors, and actuators; (2) the communication layer, which includes Machine-to-Machine (M2M) communication, Wireless Sensor and Actuator Networks (WSAN), Wireless Sensor Network (WSN), and Wireless Body Area Network (WBAN); and (3) the application layer, which includes the end user and the control center for monitoring (Jeon et al., 2016; Kim et al., 2019; Zheng et al., 2018).

The complex architecture and concept of SM make its supervision and monitoring challenging. In addition, the vulnerability of SM to many types of faults impacts the behavior of the entire manufacturing system. It can also affect the resilience and sustainability of the manufacturing system by increasing the machine run time, energy consumption, and maintenance cost, by shortening the lifespan of the equipment (hardware and software), and by increasing technical, financial, and environmental waste (Cioffi et al., 2020). To deal with these problems, different advanced approaches exist in the literature, such as Fault Detection and Diagnosis (FDD), Self-Healing and Fault-Tolerant (SH-FT) strategies, and smart methods, which make it possible to improve the manufacturing operation by incorporating aspects of resilience and robustness.

Related work

FDD approach

One of the approaches to overcoming the challenges of SM is the well-known FDD. FDD methods are made up of fundamental steps that correspond to crucial components of efficient monitoring systems. Generally, these methods are based on three steps. The first step is fault detection, which is the process of determining the occurrence of faults and the time of their occurrence in the system. The next step is fault isolation. The purpose of fault isolation is to pinpoint the source of the fault, which means extracting some information about the fault, such as its type and location. The third step is fault identification. Its objective is to determine the magnitude (size) and estimated time behavior of the fault (Gao et al., 2015; Tidriri et al., 2016). Because potential faults can lead to serious problems and failures in the manufacturing process, it becomes essential to create FDD techniques that are more resilient to normal system disturbances and responsive to various faults (Skliros et al., 2019). As a result, recent literature has placed strong emphasis on developing new FDD techniques and their applications in various fields. Several review papers in the field of fault detection and diagnosis were published recently. (Babaei et al., 2018) gave an overview of fault diagnosis approaches that are used to address faults in electric ship power systems. (Park et al., 2020) investigated recent research and development in FDD approaches for process monitoring in Industry 4.0. (Abid et al., 2021) provided an overview of the evolution of FDD techniques. The review discussed both conventional model-based and signal processing-based FDD methodologies, with special emphasis on AI-based FDD techniques. (Ahmad & Mohd-Mokhta, 2022) presented an overview of recent model-based fault diagnosis methods for linear time-invariant systems.
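The three FDD steps can be illustrated with a minimal sketch. It assumes that residual signals (differences between measured and expected behavior) are already available for each monitored source; the function names, the residual values, and the threshold are hypothetical and chosen only for illustration, and simple thresholding stands in for the far richer detection logic found in the reviewed papers.

```python
from statistics import mean


def detect_fault(residual, threshold):
    """Detection: return the first sample index whose absolute residual
    exceeds the threshold, or None if no fault is detected."""
    for k, r in enumerate(residual):
        if abs(r) > threshold:
            return k
    return None


def isolate_fault(residuals_by_source, threshold):
    """Isolation: return the names of the sources whose residual
    crosses the threshold at any time."""
    return [
        name
        for name, residual in residuals_by_source.items()
        if detect_fault(residual, threshold) is not None
    ]


def identify_fault(residual, onset):
    """Identification: estimate the fault magnitude as the mean
    residual from the detected onset onward."""
    return mean(residual[onset:])


# Hypothetical residuals for two sensors (illustrative values only).
residuals = {
    "sensor_1": [0.01, -0.02, 0.00, 0.01, -0.01, 0.02],
    "sensor_2": [0.00, 0.01, -0.01, 0.52, 0.49, 0.50],
}
THRESHOLD = 0.1  # assumed alarm threshold

faulty_sources = isolate_fault(residuals, THRESHOLD)
onset = detect_fault(residuals["sensor_2"], THRESHOLD)
magnitude = identify_fault(residuals["sensor_2"], onset)
print(faulty_sources, onset, round(magnitude, 3))
```

In this sketch, detection yields the time of occurrence, isolation yields the location, and identification yields an estimate of the fault size, mirroring the three steps described above.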

SH-FT approach

In addition to FDD techniques, other approaches based on self-healing techniques exist, with the purpose of improving the yield and performance of SM systems. Self-healing refers to the ability of a manufacturing process to detect system abnormalities and make the necessary adjustments to return to normal operation without the need for external intervention (Ghosh et al., 2007). Self-healing systems are emerging as a viable solution to the increasing complexity of system management requirements in manufacturing. These systems attempt to classify and analyze sensory data in order to autonomously detect and mitigate faults. As a result, little interaction between the systems and human administrators is required, minimizing operational costs and enhancing current fault mitigation techniques (Schneider et al., 2015). Without human intervention, a self-healing manufacturing technique may proactively monitor and identify a potential variance from its standard parameters, validate it with a high level of confidence, and restore regular operations (Qin & Lu, 2021). Such systems leverage a diverse set of methodologies to autonomously detect and recover from faults. Several methods for self-healing systems have been recently investigated in the literature, such as dual modular redundancy, triple modular redundancy, embryonic hardware, and an artificial hormone system, as specifically stated by (Rajput & Sikka, 2021). Nevertheless, these methods are based on a general framework and do not focus on the fundamental level of the self-healing approach based on the automatic control framework. (Abbaspour et al., 2020) reviewed the causes of faults and failures along with the most recent innovations in control systems. They also investigated Fault Detection and Isolation (FDI) methods and active fault-tolerant control approaches. 
Some survey papers (Benosman, 2010; Fourlas & Karras, 2021; Gao et al., 2015; Shraim et al., 2018) have reviewed the development of fault-tolerant control and studied their advantages. A review of fault-tolerant control of AC/DC microgrids was performed by (Ortiz et al., 2020). (Yu et al., 2022) discussed the most recent advancements in fault-tolerant cooperative control of multiple unmanned aerial vehicles.

Predictive maintenance and smart methods

The incorporation of a smart aspect into the manufacturing process involves some approaches. This smart technique is usually based on the algorithmic advances in Predictive Maintenance (PdM) and AI approaches (Yan et al., 2017). It has been recently applied in industries for handling the health status of industrial equipment. PdM is essential for sustainable SM. AI techniques have emerged as a promising tool in PdM applications for SM (Cinar et al., 2020). However, selecting the proper AI algorithms, data types, and data size for SM is still extremely challenging. Indeed, inappropriate selection of predictive maintenance techniques, datasets, and data size may cause major losses and make maintenance scheduling infeasible. Nevertheless, the literature has highlighted a range of approaches based on the development of a control framework and proposes the concept of preventative maintenance (Calabrese et al., 2021; Canizo et al., 2017; Chen et al., 2021a, 2021b; Cheng et al., 2020; Cohen et al., 2019; Hosamo et al., 2022; Huynh et al., 2019; Khorsheed & Beyca, 2021; Kiangala & Wang, 2020; Nguyen & Medjaher, 2019; Shcherbakov & Sai, 2022; Zonta et al., 2020). (Zonta et al., 2020) carried out a systematic review of predictive maintenance in Industry 4.0 and addressed its constraints and difficulties. (Li et al., 2017) investigated fault diagnosis and prognosis in machine centers using data mining techniques to develop a systematic method and acquire knowledge for predictive maintenance in Industry 4.0. (Bousdekis et al., 2019) reviewed and analyzed the literature on decision-making in PdM in the framework of SM. Intelligent techniques are used to develop a predictive maintenance model for sustainable manufacturing (Abidi et al., 2022). (Ayvaz & Alpay, 2021) developed a predictive maintenance strategy for production lines based on a machine learning approach. 
The findings demonstrated that the predictive maintenance system was capable of detecting warning signs of potential failures and preventing certain unplanned pauses in production. (Phan et al., 2022) undertook a systematic study of machine learning techniques for condition monitoring and predictive maintenance in manufacturing. (Yu et al., 2020) developed a big data ecosystem for fault detection and diagnosis in preventive maintenance, using actual industrial big data directly collected from worldwide manufacturing plants. The proposed system has been operating for several years in a cooperative company's real-time industrial production system, and it sets off an alarm several days before the defect occurs. (Divya et al., 2022) reviewed fault detection methods for predictive maintenance. (Richardson et al., 2021) explained how the one-class support vector machine algorithm and the low-data-rate Internet of Things may be used to achieve fault detection in data-driven predictive maintenance in remote and rural areas. (Ciaburro, 2022) reviewed machine fault detection based on machine learning algorithms. The study investigated various approaches to identify the most frequent mechanical failures, together with the most popular machine learning techniques. (Taqvi et al., 2021) provided a succinct overview of supervised and unsupervised data-driven methods for fault detection and diagnosis in chemical processes. (Singh et al., 2023) presented a review of AI applications in the fault diagnosis of rotating machines such as gears, induction motors, and bearings. (Zhang et al., 2021) suggested employing a transfer learning method for life prediction by utilizing deep representation regularization. (Zhang et al., 2023a, 2023b) proposed a blockchain-based, decentralized, federated transfer learning methodology for collaborative machinery fault diagnosis. The results showed the effectiveness of this methodology in data privacy-preserving collaborative fault diagnosis of multiple users. 
(Li, 2023) developed a deep learning-based remaining useful life prediction method for sensor malfunction. Experimental results showed that the proposed method was appropriate for real industrial applications.

Limitations of the previous works and main contributions

The literature review shows that 70% (see Fig. 1a) of the studied papers focus on the description or comparison of FDD approaches. For example, (Ahmad & Mohd-Mokhta, 2022) investigated model-based fault detection methods (parameter estimation, parity space and observer-based methods) for LTI systems. The remaining 30% of publications are devoted to SH-FT methods. One of the most recent review papers in this field, published in 2020, surveys active fault-tolerant control systems (Abbaspour et al., 2020).

Fig. 1 Percentage of papers published between 2010 and 2022 that are reviewed in this paper and those available on Scopus

We found 773 research papers on Scopus published between 2010 and 2022 (see Fig. 1b) that deal with FDD and SH-FT, with a focus on smart manufacturing and its applications. The ratio of FDD to SH-FT papers available on Scopus is more or less the same as the ratio of FDD to SH-FT studies reviewed in this paper (see Fig. 1). In detail, we reviewed and analyzed 163 research papers in the field of FDD, which includes topics such as Black-Box (BB) methods (55%), White-Box (WB) methods (35%), and signal processing (10%), as described in Fig. 2a. For the self-healing approach, in contrast, we investigated only 69 research papers that discuss the manufacturing context and its applications. As described in Fig. 2b, the highest percentage (60%) of these research papers focus on the framework method; 23% and 17% of the papers focus on active and passive Fault-Tolerant Control (FTC), respectively.

Fig. 2 The percentage of investigated FDD methods (163 studied papers) and SH-FT (69 studied papers)

The current paper lays particular emphasis on FDD and SH-FT approaches for SM applications. The first initiative is to give a bird’s eye view of such approaches. Moreover, this paper provides a qualitative benchmark for the FDD steps (detection, isolation, and identification) by considering fault types and SM applications.

To achieve an inclusive understanding, this paper answers the following Research Questions (RQ):

  • RQ 1: How can FDD and SH-FT approaches be integrated in the same concept for SM?

  • RQ 2: What are the popular taxonomies for faults in SM?

  • RQ 3: What are the most used FDD methods in SM applications?

  • RQ 4: What are the benefits and drawbacks of each method of FDD?

  • RQ 5: What are the most used SH-FT methods in SM applications?

  • RQ 6: What are the advantages and disadvantages of each method of SH-FT?

  • RQ 7: What are the promising research directions?

The evaluation of review papers in the fields of FDD and SH-FT for SM allows the exposition of the following limitations, which are considered and overcome in the current review. This adds value to this paper and distinguishes it from previous review papers.

  • The lack of models that bring together FDD and SH-FT for resilient smart manufacturing.

  • The previous studies do not review conventional and unconventional FDD approaches from the perspective of both FDD and SH-FT. However, a few review papers present some FDD methods. For example, (Abid et al., 2021) reviewed data-based models and some combined WB and BB models, called hybrid methods.

  • Most of the previous research papers on SH are oriented towards PdM methods based on the framework of passive redundancy of equipment for smart manufacturing applications.

These limitations motivate us to propose a new architecture of a conceptual model in order to define an infrastructure that is capable of integrating the most used advanced techniques of FDD and SH-FT for intelligent and resilient manufacturing. In this context, this paper provides a framework to compare existing solutions and highlights promising research directions in this area for better guidance on future related research. Table 1 illustrates the distinct merits of this work compared with recent related review/survey papers in the literature.

Table 1 Comparison of related previous review/survey papers on FDD and SH-FT in SM with the current review paper

It is rare to find previous review papers on SM. In addition, few of the review papers try to classify faults in manufacturing. Some review papers mainly explain the FDD approach, whereas other publications cover fault-tolerant control methods. Furthermore, most of the existing studies do not integrate self-healing and fault diagnosis techniques, which would have taken into account the autonomy of the monitoring system.

Main contributions of this paper

This paper aims to explain and understand the different approaches, sub-approaches and methods that could be used to develop an FDD and SH-FT strategy in smart manufacturing. It strives to fill the gaps in previous reviews by answering the RQs already mentioned:

  • This paper proposes a novel conceptual model that brings together FDD and SH-FT for smart manufacturing. (Response to RQ 1).

  • It introduces and classifies taxonomies of faults based on time behavior, faulty location on supervised process (actuator, system, sensor), and the mathematical relationships of faults with the studied system (additive, multiplicative). (Response to RQ 2).

  • It reviews and discusses the most used FDD approaches based on the physical models, data-driven approaches, and signal processing. A comparative study between these methodologies is carried out to highlight their advantages and disadvantages. (Response to RQ 3 & RQ 4).

  • It reviews the self-healing approach, with emphasis on the most recent achievements of SH-FT control systems in SM, in order to highlight their benefits and describe the best for manufacturing applications. (Response to RQ 5 & RQ 6).

  • It analyzes the existing reviews of FDD and SH-FT in the literature and promotes an orientation for future research that could be the key for resilient and smart manufacturing. (Response to RQ 7).

Reviewing methodology

This section provides the reviewing methodology used in this paper. We analyzed and interpreted about 256 relevant research papers published between 2010 and 2022, using a classification scheme that included keywords and key-characteristic clusters (Fig. 3). A bibliometric analysis of FDD and SH-FT in smart manufacturing was conducted to organize the data from the Web of Science core collection database using VOSviewer software (van Eck & Waltman, 2010). The VOSviewer software was used to disclose the thematic content of the research papers based on keyword identification. Author-supplied keywords that occurred more than 10 times in the Web of Science core database from 2010 to 2022 were exported into the Research Information Systems (RIS) format and used in the final analysis. The initial search identified 4702 keywords, 130 of which met the threshold. Keyword combinations were employed to provide a wide view of research trends in FDD and SH-FT in SM applications. Figure 3 illustrates the bibliometric analysis for author-supplied keywords; the size of a node represents the frequency of occurrence of its keyword. Connections between nodes describe their co-occurrence in the same article. The shorter the distance between two keywords, the more frequently they co-occur.

Fig. 3 Bibliometric analysis of author-supplied keywords
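The keyword thresholding and co-occurrence counting described above can be sketched as follows. This is only an illustration of the idea, not the procedure implemented by VOSviewer: the function name, the toy records, and the small threshold are assumptions made for the example (the review itself used a threshold of more than 10 occurrences).

```python
from collections import Counter
from itertools import combinations


def keyword_statistics(records, min_occurrences):
    """Count in how many records each keyword appears, keep the keywords
    that occur more than `min_occurrences` times, and count how often
    each pair of kept keywords appears in the same record."""
    occurrences = Counter(kw for keywords in records for kw in set(keywords))
    kept = {kw for kw, n in occurrences.items() if n > min_occurrences}
    cooccurrence = Counter()
    for keywords in records:
        selected = sorted(set(keywords) & kept)
        cooccurrence.update(combinations(selected, 2))
    return occurrences, kept, cooccurrence


# Hypothetical author-supplied keywords for four articles.
records = [
    ["fault detection", "smart manufacturing", "deep learning"],
    ["fault detection", "fault-tolerant control"],
    ["fault detection", "smart manufacturing"],
    ["fault-tolerant control", "smart manufacturing"],
]

occurrences, kept, cooccurrence = keyword_statistics(records, min_occurrences=1)
for pair, count in cooccurrence.most_common():
    print(pair, count)
```

In a keyword map, the occurrence count would drive the node size and the co-occurrence count would drive the link between two nodes.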

To answer the research questions, we mainly used journals indexed in Web of Science, such as the Journal of Intelligent Manufacturing, the International Journal of Advanced Manufacturing Technology, Mechanical Systems and Signal Processing, Production Planning and Control, Processes (MDPI), Reliability Engineering and System Safety, IEEE Transactions on Control Systems Technology, and Sensors and Actuators A: Physical.

The search was narrowed to the following keywords and titles that contain these keywords: fault detection, fault diagnosis, smart manufacturing, fault-tolerant control, conventional and unconventional approaches, data-driven model, active control, passive control, PCA for smart manufacturing, CNN in industrial process, parameter estimation, state estimation, and signal processing in manufacturing.

The inclusion criteria of the review method were as follows (Fig. 4):

  1. Articles published in English between 2010 and 2022.

  2. Search on published papers on fault classifications in SM.

  3. Search on survey, review and published papers on FDD in SM.

  4. Search on survey, review and published papers on SH-FT in SM.

Fig. 4 Descriptive schema of the review method

In this paper, we present and explain the SM concept and review papers related to the complexity of SM, which is composed of different layers and sub-systems (IoT, M2M, WSAN, WBAN, etc.). We then divide the reviewing method into three main parts. The first part involves research in general about faults in SM. We classify faults according to SM layers, and there are sub-sections on fault classification based on time behavior, fault locations in the process, and fault equations. The second part is concerned with the FDD approach. Relevant research articles are carefully reviewed and analyzed, while appraising their applications in SM. There is an appropriate classification of the FDD approach based on physical models, data-driven models, and signal processing. Moreover, a comparison of different techniques highlights their advantages and disadvantages and provides a qualitative benchmark for the FDD steps (detection, isolation, and identification). The last part investigates and reviews the SH-FT approach in SM under the general framework and FTC methodologies. Subsequently, it focuses on FTC approaches in SM, classifying (active and passive FTC) and comparing them. In the end, we conclude and highlight some promising directions for future work.

This paper is organized as highlighted in Fig. 5. Section 1 introduces the study, with a sub-section on related works and their limitations and the main contributions of the current study. In Section 2, a novel conceptual model of FDD and SH-FT is proposed. Section 3 presents fault classifications in SM (based on time behavior, fault locations and fault equations in the supervised system). In Section 4, the most used fault diagnosis approaches are reviewed and discussed, with a special focus on their development, advantages and disadvantages in SM. Section 5 presents and critiques the state of the art in self-healing and fault-tolerant approaches, highlighting their advantages and disadvantages. Section 6 concludes the study and proposes some directions for future work.

Fig. 5 Diagrammatic outlook of the organization of the paper

Proposed conceptual model of FDD and SH-FT for SM

A model often describes the proper operations of a manufacturing system without faults or failures. Different research studies integrate diagnostic approaches and methods into the modeling of a supervised system, taking the nominal conditions or just simple additive and/or multiplicative faults into consideration. This considerably limits the efficiency and usefulness of real applications, especially for smart manufacturing and complex operations. The principal challenge in smart manufacturing concerns not only the different types of faults but also the nature of the faults, their occurrence time (online or offline), and their combinations (multiple faults in series, in parallel, or both). These considerations show the difficulty in developing a smart manufacturing model that takes into account the different layers and sub-systems. It is against this backdrop that we propose a novel conceptual model of FDD and SH-FT for smart manufacturing to overcome these difficulties.

The proposed architecture makes it possible to conceptualize the flow of data between different layers of SM as well as the interaction between its interconnected sub-systems in order to ensure advanced monitoring based on multi-fault diagnosis and self-healing (see Fig. 6).

Fig. 6 Conceptual block model of smart fault diagnosis and self-healing for smart manufacturing systems

In Fig. 6, a control system in a supervised process (block 1) is made up of three parts: the process, the sensors and the actuators. The smart fault diagnosis (block 2) can be described in three steps: fault detection, isolation and identification. In the first step, the objective of this block is to detect faults in the different SM layers. These faults may be related to the sensors, actuators or even processes of the different heterogeneous and interconnected systems constituting the supervised manufacturing system. In the second step (fault isolation), the smart fault diagnosis block receives the different residual signals estimated according to the fault detection result. Subsequently, a fault matrix relating to the number of sensors and actuators and the type of the considered fault is generated. The signature of this matrix provides relevant information on the type of the defective sensor or actuator or on the faulty process. The third step is fault identification, which uses AI methods to process the data. Diagnostic procedures generally rely on some data collection options to obtain real-time information and data induced by any faults. This information (block 3) could be associated with different properties of the fault, such as its type, nature, time of occurrence, and origin. As shown in the data collection block (block 3), the first option is based on the data directly measured from the sensors. The second option relies on the diagnosed signals computed from the physical model, if it exists. The third option relies on real-time data saved in the cloud. The self-healing and fault-tolerant approach (block 4) preserves the stability and optimal operation of the supervised system even in the presence of faults. This can be realized by active and passive methods. A new concept is proposed to maintain and recover a faulty system by reconfiguration of the controller and/or the model without external intervention. 
The proposed conceptual model has a significant positive impact on robust and resilient manufacturing, improving the response time and minimizing potential faults in the manufacturing process. This reduces the maintenance cost and time.
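The isolation step based on a fault matrix can be sketched as a lookup of the observed residual pattern in a table of fault signatures. The signature table, residual values, and thresholds below are hypothetical and serve only to illustrate the principle; they are not taken from the proposed conceptual model.

```python
# Hypothetical fault signature matrix: each row is a fault, each column a
# residual. A 1 means the fault is expected to trigger that residual.
SIGNATURES = {
    "sensor_1 fault": (1, 0, 1),
    "sensor_2 fault": (0, 1, 1),
    "actuator fault": (1, 1, 0),
}


def residual_signature(residuals, thresholds):
    """Turn residual values into a binary pattern by thresholding."""
    return tuple(int(abs(r) > t) for r, t in zip(residuals, thresholds))


def isolate(residuals, thresholds, signatures=SIGNATURES):
    """Return the observed pattern and the faults whose signature matches it.
    An all-zero pattern means that no fault was detected."""
    observed = residual_signature(residuals, thresholds)
    if not any(observed):
        return observed, []
    candidates = [name for name, sig in signatures.items() if sig == observed]
    return observed, candidates


observed, candidates = isolate([0.8, 0.02, 0.6], [0.1, 0.1, 0.1])
print(observed, candidates)
```

Distinct rows in the signature matrix are what make the faults distinguishable; two faults with identical rows could be detected but not isolated from each other with this scheme.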

Fault classifications in SM

Several attempts have been made in the literature to classify faults in SM. These faults can be classified into: hardware faults, which mainly affect the physical layer of the SM model; networking and communication faults, which are observed in the communication layer and involve incompatibility of protocols between different applications, such as due to non-recoverable data and packet sending; and software faults, which can affect all layers of SM and include, for example, bit-flips (error in data source), failure during subroutine execution, run-time failure, and malfunction of some parts of the software (Abbas & Zhang, 2021; Abid et al., 2021; Kim et al., 2019). The literature review shows other criteria for fault classification. The classification could be based on the time behavior of the studied fault or on the location of the fault in the supervised process (actuator, system, or sensor). The third type of fault classification is based on the fault equation; under this classification, a fault could be additive or multiplicative.

Fault classification based on time behavior

The principle of fault classification based on time behavior is illustrated in Fig. 7. This figure shows three main kinds of faults: abrupt, intermittent, and incipient (ramp) faults. An abrupt fault can be defined as a sudden failure of the element (total or partial disconnection), i.e., a stepwise change. An abrupt fault has more serious consequences and can lead to damage in the machine or system that cannot be resolved unless there is effective repair or replacement of the faulty component. Abrupt faults are easy to detect (Heydarzadeh & Nourani, 2016). Intermittent faults are a special case of abrupt faults. Their symptoms appear only at certain times or under certain operational conditions in the system (Sedighi et al., 2013). Because of its slow evolution over time and its gradual derivative, an incipient fault is considered the most difficult fault to detect. Such a fault is typically described as a drift fault (Zhou et al., 2018).

Fig. 7 Fault classification based on time behavior
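The three time behaviors can be made concrete with a small sketch that generates each fault profile as a discrete-time signal. The function names and all parameter values (onset, magnitude, period, slope) are assumptions chosen for illustration.

```python
def abrupt_fault(k, onset, magnitude):
    """Step-like fault: zero before the onset, constant afterwards."""
    return magnitude if k >= onset else 0.0


def intermittent_fault(k, onset, magnitude, period, active):
    """Fault that appears for `active` samples out of every `period`
    samples once the onset has been reached."""
    if k < onset:
        return 0.0
    return magnitude if (k - onset) % period < active else 0.0


def incipient_fault(k, onset, slope):
    """Ramp (drift) fault that grows slowly after the onset."""
    return slope * (k - onset) if k >= onset else 0.0


samples = range(10)
abrupt = [abrupt_fault(k, onset=4, magnitude=1.0) for k in samples]
intermittent = [
    intermittent_fault(k, onset=2, magnitude=1.0, period=4, active=2)
    for k in samples
]
incipient = [incipient_fault(k, onset=4, slope=0.1) for k in samples]

print(abrupt)
print(intermittent)
print([round(v, 2) for v in incipient])
```

With a fixed alarm threshold, the abrupt profile crosses it at once, whereas the ramp crosses it only after several samples, which is one way to see why incipient faults are the hardest to detect.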

Fault classification based on fault location

Fault classification based on fault location (actuator, system, sensor) is described in Fig. 8. Actuator faults occur as a total or partial loss of control action. A situation whereby a ‘stuck’ actuator generates no (controllable) action despite the input commands is considered a total actuator fault. Actuator faults occur as a result of breakage, short circuits, cut wiring, or the presence of a foreign body in the actuator; rolling element bearing faults are described as defects in the inner and outer races and ball damage (Hagh et al., 2021; Khalil et al., 2022; Rai & Upadhyay, 2016; Sun et al., 2022).

Fig. 8 Classification of fault based on fault location

Sensor faults describe the difference between the measured and actual value of a system’s output variable. They are further categorized into total and partial faults. A total sensor fault generates information that is unrelated to the value of the measured physical parameter because of, for instance, lost contact with the surface or broken wires. A partial sensor fault provides readings that are connected to the observed signal in such a way that relevant data can still be extracted. Sensor freezing, degradation of performance or loss of accuracy, drift, and calibration error are some common sensor faults (Foo et al., 2013; Han et al., 2020; Jana et al., 2022; Kommuri et al., 2016; Li et al., 2020a, 2020b; Li et al., 2016; Liu et al., 2020; Liu & Shi, 2013; Okafor & Delaney, 2021; Saeed et al., 2021; Ye et al., 2020).

System or process faults are those that exist in the components of the process. The physical parameters of the process are altered by a process fault, which leads to variations in the usual system dynamics. Examples include leakage in tanks, cracks, and breakages in gearbox systems. All faults that cannot be classified as actuator or sensor faults are defined as system faults. The most typical causes of these faults are structural defects, such as wear and tear and component aging (Amin et al., 2018; Melo et al., 2022; Xu et al., 2022).

Fault classification based on fault equation in the supervised process

A fault may also be classified as additive (Block 1, Fig. 9) or multiplicative (Block 2, Fig. 9) based on the fault equation. Additive faults are unknown signals that could add to the input or output of the system and generate changes in the system’s output independent of the known input. Generally, sensor and actuator faults are considered as additive faults, whereas changes in process parameters are described as multiplicative faults (Hao et al., 2014; Li et al., 2018; Rotondo et al., 2016; Talebi & Khorasani, 2013; Yang et al., 2021; Zhang & Basseville, 2014; Zhang et al., 2019).

Fig. 9 Additive and multiplicative faults
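The distinction can be written compactly for a linear discrete-time state-space model. The notation below is a common textbook form and an assumption of this sketch rather than notation taken from the cited works: x is the state, u the known input, y the measured output, f_a and f_s are additive actuator and sensor faults, and ΔA, ΔB are fault-induced parameter changes.

```latex
\begin{aligned}
\text{Additive faults:}\quad
  x(k+1) &= A\,x(k) + B\,\bigl(u(k) + f_a(k)\bigr), \\
  y(k)   &= C\,x(k) + f_s(k), \\[4pt]
\text{Multiplicative faults:}\quad
  x(k+1) &= (A + \Delta A)\,x(k) + (B + \Delta B)\,u(k), \\
  y(k)   &= C\,x(k).
\end{aligned}
```

In the additive case, the fault terms B f_a(k) and f_s(k) enter as extra signals whose effect on the output does not depend on u(k). In the multiplicative case, the fault terms ΔA x(k) and ΔB u(k) scale with the state and the input, so their effect on the output depends on the operating point.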

Fault detection and diagnosis approaches

In recent decades, Fault Detection and Diagnosis (FDD) has attracted significant attention in automatic control, because systems with faults can lead to irreparable damage. FDD approaches are addressed in many areas, especially those that require critical security vigilance where no level of tolerance is acceptable. Different fields of applications can be cited such as industrial applications (Aguilera et al., 2022; Jiang & Yin, 2018; Jiang et al., 2018; Liang et al., 2018; Liu, 2020; Nor et al., 2017; Oliveira et al., 2017; Schubert et al., 2011; Sidhom et al., 2021; Tao et al., 2020; Yin et al., 2014; Zhou et al., 2022), energy applications (Elmasry & Wadi, 2022; Gopakumar et al., 2018; Kamga Sagoun., 2021; Kurukuru et al., 2020; Lin et al., 2020; Mansouri et al., 2021; Reddy & Raju, 2020; Sidhom et al., 2016), chemical industries (Fazai et al., 2019; Harinarayan & Shalinie, 2022; Jiang et al., 2013; Ng & Srinivasan, 2010; Nor et al., 2020; Taqvi et al., 2021) and biomedical sciences (Badri et al., 2016, 2017; Chihi & Benrejeb, 2018; Dhaou Olfa, 2016).

Research on FDD shows that fault detection and diagnosis approaches fall into three main classes: those based on physical models, those based on data, and those based on signal processing (see Fig. 10). The physical model-based and data-driven approaches are each classified into quantitative and qualitative methods. Here, we classify FDD approaches mainly according to the most used methods in SM applications.

Fig. 10 Classifications of FDD approaches

Fault diagnosis based on physical models

This FDD approach is also known as the White-Box (WB) method. The main challenge of this approach is the development of a mathematical model that accurately represents the studied process, considering the many reconfigurations involved in the manufacturing process or the complexity of the considered phenomena. Although physical models can effectively reproduce the behavior of the real system, their complexity causes some difficulty during the practical implementation of FDD techniques based on physical models. This is due to their high dependence on several environmental, physical, and mathematical assumptions (Enciso et al., 2021; Wu et al., 2022; Zhao & Shen, 2019). The FDD approach based on physical models includes two sub-categories: quantitative and qualitative methods (Sun et al., 2019).

Physical model-based quantitative methods

Parity space

The parity space method verifies the parity (coherence) of a process using sensor measurements and known inputs (control signals). A parity space consists of residuals determined by evaluating the analytical redundancy relations between the input and output signals of the real system. These residuals are highly sensitive to faults, so any incoherence indicates the presence of a fault. This method facilitates data analysis for fault isolation. It is well known and applied to linear dynamic systems (Blesa et al., 2016; Wu et al., 2022; Zhong et al., 2015, 2018, 2021, 2022). (Tolouei et al., 2017) presented a method for sensor fault detection based on a nonlinear parity technique that could be applied to pH neutralization systems. This method can quickly and precisely determine the time of fault occurrence, as well as efficiently identify and isolate the sensor fault on the pH channel. The conceptual model of this method is generally quite advanced, and simulation is mainly used to verify it. Despite these applications, the parity space is poorly suited to nonlinear models, non-additive faults and multiplicative faults (Enciso et al., 2021).
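To make the idea of analytical redundancy concrete, the following minimal sketch builds parity residuals for a small discrete-time linear system. The system matrices, window length, input signal, and fault size are illustrative assumptions and are not taken from any of the cited works. The parity matrix W is chosen so that W times the stacked observability matrix is zero, which makes the residual independent of the unknown state.

```python
import numpy as np

def parity_matrices(A, B, C, s):
    """Return W (left null space of the stacked observability matrix) and the
    Toeplitz matrix T that maps stacked inputs to stacked outputs."""
    p, m = C.shape[0], B.shape[1]
    O = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(s + 1)])
    T = np.zeros(((s + 1) * p, (s + 1) * m))
    for i in range(1, s + 1):
        for j in range(i):
            T[i*p:(i+1)*p, j*m:(j+1)*m] = C @ np.linalg.matrix_power(A, i - j - 1) @ B
    U, S, _ = np.linalg.svd(O)
    rank = int(np.sum(S > 1e-10))
    W = U[:, rank:].T          # every row w satisfies w @ O = 0
    return W, T

def simulate(A, B, C, u, x0):
    """Simulate x[k+1] = A x[k] + B u[k], y[k] = C x[k]."""
    x, ys = x0.copy(), []
    for uk in u:
        ys.append(C @ x)
        x = A @ x + B @ uk
    return np.array(ys)

def parity_residuals(W, T, y, u, s):
    """Residual norm for every sliding window [k-s, k]."""
    out = []
    for k in range(s, len(y)):
        Y = y[k - s:k + 1].reshape(-1)
        U = u[k - s:k + 1].reshape(-1)
        out.append(np.linalg.norm(W @ (Y - T @ U)))
    return np.array(out)

# Hypothetical second-order system with one input and one sensor.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
s, n_steps, fault_start = 2, 60, 40

u = np.sin(0.1 * np.arange(n_steps)).reshape(-1, 1)
y = simulate(A, B, C, u, x0=np.array([1.0, -1.0]))
y_faulty = y.copy()
y_faulty[fault_start:, 0] += 0.5   # additive sensor bias (assumed fault)

W, T = parity_matrices(A, B, C, s)
res_healthy = parity_residuals(W, T, y, u, s)
res_faulty = parity_residuals(W, T, y_faulty, u, s)
```

In this noise-free sketch the healthy residual is zero up to rounding, whereas the residual jumps as soon as a faulty sample enters the window. With measurement noise, a threshold on the residual would have to be chosen.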

State estimation

State estimation is another quantitative method within the model-based FDD approach. It is the most popular method and offers simple calculation while being robust to measurement noise; furthermore, it is applicable to linear and nonlinear systems (Chadli et al., 2017; Mann & Hwang, 2013; Tornil-Sin et al., 2012; Zhao & Shen, 2019). The basic idea of such a method is to use an observer, a filter, or an estimator to estimate system states from measurements. For linear systems with deterministic states, the observation problem was introduced in the 1960s by Luenberger (Goncalves et al., 2019). On the other hand, for a stochastic or random system, the Kalman filter is the most suitable (Qian et al., 2017). For nonlinear systems, the state estimation problem or the observation problem remains an active field of research. As a result, multiple solutions exist in the literature, and they are classified based on the system class. The fault diagnosis issue in nonlinear systems with multiple incipient faults in sensors was examined by (Wu et al., 2017), who suggested a new FDD technique based on sliding-mode observers and total measurable residual fault information. They divided the original system into two subsystems, one with sensor faults and the other with actuator faults, using a state and output transformation technique. (Piltan & Kim, 2018) presented a new observer-based FDD approach built on a variable-structure feedback linearization observer, in order to improve the robustness of the traditional feedback linearization observer method as well as the fault diagnosis performance in rotating machinery. To identify and diagnose actuator and sensor faults in nonlinear chemical processes, (Emanuel Bernardi, 2020) developed two types of observers based on the linear parameter-varying technique.
(Pignati et al., 2017) developed real-time fault detection and faulted line identification functionality using the concurrent computation of synchrophasor-based state estimators. The suggested method successfully identified the faulted line regardless of the neutral connection, fault type, fault impedance, or fault position along the line. The method is based on the state estimate, which does not automatically depend on the nature of the loads or generators. Therefore, the presence of distributed generation does not affect the accuracy of fault location. (He et al., 2013) presented least-squares FDD for networked sensing systems using a direct state estimation method. Any real physical system is exposed to unavoidable disturbances, the most common being measurement noise. These disturbances appear as a change in the system model, just as faults do. The drawbacks of this method include the need for an accurate and complete physical model and its poor suitability for complex processes (He et al., 2013; Sidhom., 2017; Xu et al., 2017).
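As an illustration of observer-based residual generation, the sketch below runs a discrete-time Luenberger observer next to a simulated plant and monitors the output estimation error. The plant matrices, the observer gain L, the fault, and the threshold are assumptions chosen for the example (the gain places the eigenvalues of A - LC inside the unit circle). The sketch does not reproduce any of the cited observers.

```python
import numpy as np

# Hypothetical plant: x[k+1] = A x[k] + B u[k], y[k] = C x[k]
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([0.0, 1.0])
C = np.array([1.0, 0.0])
L = np.array([0.5, 0.1])   # assumed observer gain; A - L C is stable

def run_observer(n_steps=150, fault_start=100, bias=0.5):
    """Return the residual r[k] = y[k] - C x_hat[k] for each step."""
    x = np.array([1.0, -1.0])   # true initial state, unknown to the observer
    x_hat = np.zeros(2)         # observer starts from a wrong guess
    residuals = np.empty(n_steps)
    for k in range(n_steps):
        u = np.sin(0.1 * k)
        # Additive sensor fault from fault_start onward (assumed scenario).
        y = C @ x + (bias if k >= fault_start else 0.0)
        r = y - C @ x_hat
        residuals[k] = r
        x_hat = A @ x_hat + B * u + L * r   # Luenberger correction
        x = A @ x + B * u
    return residuals

residuals = run_observer()
threshold = 0.05    # illustrative value, tuned by hand
settle = 50         # ignore the initial convergence transient
first_alarm = int(np.flatnonzero(np.abs(residuals[settle:]) > threshold)[0]) + settle
```

Once the initial estimation error has decayed, the residual stays near zero until the sensor bias appears. In this sketch the alarm is raised at the step where the fault starts.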

Parameter estimation

The parameter estimation method makes it possible to estimate the parameters rather than the state. It introduces an identification technique based on a system’s model and its input–output signals. Usually, the effect of faults shows up in the parameters of a system. Actual process parameters can then be estimated repeatedly using online parameter identification methods (Sidhom et al., 2021). The estimated parameters are then compared to those of the nominal model. Any substantial difference indicates a fault. Such a method is often used in the automotive industry because it is suitable for identifying multiplicative faults affecting the parameters, but it has a high number of variables. It also requires a persistent excitation of the physical system, which is not always available for systems operating in stationary mode. The disadvantage of parameter estimation lies in the definition of the relationship between mathematical and physical parameters, which is not always invertible. This makes it difficult to use for complex installations due to the large number of variables involved. According to (Gao et al., 2015), only the model’s structure needs to be known because it is presumed that faults will be reflected in the system parameters. The fundamental tenet of the detection approach is to determine the parameters of the actual process online and compare them with the reference parameters obtained under healthy conditions. If the model parameters have an explicit mapping with the physical coefficients, the parameter estimation-based fault diagnosis techniques are relatively simple (Chi et al., 2022). (Tran & Fowler, 2020) proposed a parameter estimation method combined with recursive least squares for sensor fault diagnosis in lithium-ion batteries in electric vehicles. (Duan & Zivanovic, 2016) used the parameter estimation method for fault detection in an induction motor stator.
(Ugwiri et al., 2022) proposed a parameter estimation algorithm for fault detection and classification in centrifugal pumps. (Liao et al., 2021) developed a parameter estimation method for injection molding machines.
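The principle of comparing online parameter estimates with nominal values can be sketched with recursive least squares (RLS) and a forgetting factor. The first-order model y[k] = a*y[k-1] + b*u[k-1], the parameter values, the fault (a drop of the input gain b), and the tolerance below are hypothetical and serve only to illustrate the mechanism.

```python
import numpy as np

def rls_identify(u, y, forgetting=0.95):
    """Recursive least squares for y[k] = a*y[k-1] + b*u[k-1].
    Returns the estimate [a, b] after each sample."""
    theta = np.zeros(2)
    P = 1e3 * np.eye(2)            # large initial covariance: no prior knowledge
    history = np.zeros((len(y), 2))
    for k in range(1, len(y)):
        phi = np.array([y[k - 1], u[k - 1]])          # regressor
        gain = P @ phi / (forgetting + phi @ P @ phi)
        theta = theta + gain * (y[k] - phi @ theta)   # correct with prediction error
        P = (P - np.outer(gain, phi) @ P) / forgetting
        history[k] = theta
    return history

# Simulated process with a multiplicative fault on the input gain (assumed values).
n_steps, fault_start = 400, 200
a_nominal, b_nominal, b_faulty = 0.8, 0.5, 0.3
rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, n_steps)   # random input keeps the system excited
y = np.zeros(n_steps)
for k in range(1, n_steps):
    b = b_nominal if k < fault_start else b_faulty
    y[k] = a_nominal * y[k - 1] + b * u[k - 1]

estimates = rls_identify(u, y)
# Flag a fault when the estimated gain leaves a tolerance band around nominal.
fault_flag = np.abs(estimates[:, 1] - b_nominal) > 0.05
```

The random input provides the persistent excitation that the method requires. With a constant input, the two parameters could not be separated.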

Physical model-based qualitative methods

Fuzzy logic

The fuzzy approach involves building a ‘fuzzy inference system’ that can imitate the decision-making of a human operator based on verbal rules that translate the operator’s knowledge of a given process. This technique allows an approximation of the behavior of a complex system with rules that have no clear semantic meaning (Abbas & Zhang, 2021; Adhikari et al., 2016). Fuzzy inference systems have the capacity to map nonlinear functions, indicating a connection between inputs (symptoms) and outputs (failure modes) by using fuzzy rules of the form “if condition, then conclusion”. A fuzzy inference system usually consists of four components: the fuzzifier, the inference engine, the rule base and the defuzzifier (Djelloul et al., 2018). Some research papers have studied the application of fuzzy logic to fault diagnosis. (Safarinejadian et al., 2015) proposed a fault detection method based on interval type-2 fuzzy sets for nonlinear systems. The results showed the effectiveness of the proposed method. (Adhikari et al., 2016) presented a fuzzy logic approach for online fault detection and classification of transmission lines. The results demonstrated that the proposed approach was capable of rapid fault type classification and right-tripping action, making it suitable for use in real-time applications. (Nasiri & Khosravani, 2019) presented fuzzy case-based reasoning for fault detection in injection molding. The results showed the accuracy of the proposed method. (Qu et al., 2020) suggested employing non-singleton fuzzy logic with extended linguistic concepts and rules to detect faults in wind turbines. The findings of the experiment demonstrated that the suggested method could successfully identify early faults in wind turbines and provide more details about fault severities. (Nasser et al., 2021) developed a method for intelligently diagnosing and locating faults in analog electronic circuits by using a fuzzy logic classifier. The test results for the proposed approach indicated an average F-score of 98% in diagnosing a faulty component in the circuit.
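The four components listed above (fuzzifier, rule base, inference, defuzzifier) can be sketched in a few lines. The example below is a simplified zero-order Sugeno-style system with two normalized symptoms, triangular membership functions, and three hand-written rules. The inputs, rules, and output levels are invented for illustration and do not come from the cited studies.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

def fuzzify(x):
    """Fuzzifier: map a symptom normalized to [0, 1] to LOW / MED / HIGH degrees."""
    return {
        "low": tri(x, -0.5, 0.0, 0.5),
        "med": tri(x, 0.0, 0.5, 1.0),
        "high": tri(x, 0.5, 1.0, 1.5),
    }

def fault_severity(temperature, vibration):
    """Return a severity score in [0, 1] from two normalized symptoms."""
    t, v = fuzzify(temperature), fuzzify(vibration)
    # Rule base (AND = min, OR = max); each rule concludes on a crisp level.
    rules = [
        (min(t["low"], v["low"]), 0.0),    # IF both LOW THEN healthy
        (max(t["med"], v["med"]), 0.5),    # IF either MED THEN warning
        (max(t["high"], v["high"]), 1.0),  # IF either HIGH THEN fault
    ]
    # Defuzzifier: firing-strength weighted average of the rule conclusions.
    total = sum(w for w, _ in rules)
    return sum(w * z for w, z in rules) / total

healthy = fault_severity(0.0, 0.0)
warning = fault_severity(0.5, 0.5)
faulty = fault_severity(1.0, 1.0)
mixed = fault_severity(0.9, 0.2)
```

A Mamdani system would replace the crisp rule conclusions with output fuzzy sets and a centroid defuzzifier. The weighted average is used here only to keep the sketch short.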

Fault diagnosis based on data

This approach is also called the ‘Black-Box’ (BB) or ‘empirical’ method. It is based on data that describes the behavior of the supervised system, which constitutes an efficient alternative (Brito et al., 2022), where the necessary process information can be directly extracted from enormous amounts of recorded process data (Cerone, 2017). Fault diagnosis based on data depends mainly on the quantity and quality of the data. It also requires a high computing time in the training step (Chen et al., 2021a, 2021b; Huang et al., 2022; Kou et al., 2020; Sinitsin et al., 2022; Wang et al., 2020a, 2020b; Wang et al., 2021a, 2021b). Indeed, the data-based approach has received huge attention in diverse manufacturing applications and has been widely applied to complex industrial process diagnosis and monitoring.

Methods related to this approach are also divided into qualitative and quantitative methods.

Data-driven quantitative methods

Principal component analysis (PCA)

Principal Component Analysis (PCA) is one of the various multivariate statistical FDD approaches that have been used in complex industrial processes to detect unidentified abnormalities occurring during operation (Harmouche et al., 2015; Ahmed, 2012). PCA aims to reduce the dimensions of the original dataset, which contains a large number of correlated variables, by projecting it onto a lower dimensional space of principal components while retaining as much of the variation in the dataset as possible (Herve Abdi, 2010; Miljković, 2011). To identify open circuit faults in modular multilevel converters, (Houchati et al., 2018) employed the PCA technique and a sliding mode observer. In this study, PCA outperformed the sliding mode observer in terms of fault detection speed, regardless of the fault location. According to (Du et al., 2022), more than 80% of all aviation system failures were caused by sensor faults. The authors introduced sensor fault diagnosis techniques based on a division of the flight status and used PCA to create a diagnostic model for each situation. The findings of the experiment indicated that the method could successfully enhance the quick identification of single faulty sensors. One disadvantage of this method is that it only works with linear systems. As a result, modeling a nonlinear system using PCA as a linear solution may decrease the fault diagnosis efficiency. To avoid this drawback, an extension of the basic PCA, called Kernel PCA (KPCA), was proposed by (Navi et al., 2015). The authors tested the sensors of an underwater vehicle. Results indicated that, when compared to PCA, the KPCA method could generate warning signals more effectively and was more sensitive to faults. (Sun et al., 2021) proposed an adaptive fault detection and root cause analysis scheme for complex industrial processes using moving window KPCA and information geometric causal inference. The results showed that the proposed scheme was able to reduce the false alarm and missed detection rates and to locate the root causes of faults.
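A common way to use PCA for monitoring is to fit the model on fault-free data and then track two statistics for new samples: Hotelling's T² inside the principal subspace and the squared prediction error (SPE) outside it. The sketch below follows this scheme on synthetic data. The data generator, the number of retained components, the empirical 99th-percentile limits (used instead of the theoretical control limits), and the sensor bias are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic fault-free data: 5 correlated sensors driven by 2 latent factors.
mixing = np.array([[1.0, 0.5, -0.3, 0.8, 0.2],
                   [0.2, -0.7, 0.9, 0.1, 0.6]])
latent = rng.standard_normal((500, 2))
train = latent @ mixing + 0.05 * rng.standard_normal((500, 5))

# Fit PCA on standardized data through the singular value decomposition.
mean, std = train.mean(axis=0), train.std(axis=0)
Z = (train - mean) / std
_, s, Vt = np.linalg.svd(Z, full_matrices=False)
k = 2                                   # retained components (assumed known)
P = Vt[:k].T                            # loading matrix, shape (5, k)
eigvals = s[:k] ** 2 / (len(Z) - 1)     # variance captured by each component

def monitor(x):
    """Return (T2, SPE) for one sample or for an array of samples."""
    z = (np.atleast_2d(x) - mean) / std
    scores = z @ P
    t2 = np.sum(scores ** 2 / eigvals, axis=1)
    residual = z - scores @ P.T
    spe = np.sum(residual ** 2, axis=1)
    return t2, spe

# Empirical 99th-percentile limits from the training data (a simplification).
t2_train, spe_train = monitor(train)
t2_limit = np.percentile(t2_train, 99)
spe_limit = np.percentile(spe_train, 99)

normal_sample = np.array([0.5, -0.5]) @ mixing
faulty_sample = normal_sample + np.array([3.0, 0.0, 0.0, 0.0, 0.0])  # bias on sensor 1
_, spe_normal = monitor(normal_sample)
_, spe_faulty = monitor(faulty_sample)
```

The sensor bias breaks the correlation structure learned from healthy data, so it shows up mainly in the SPE statistic. Because the model is linear, the same scheme would be less reliable on strongly nonlinear data, which is the limitation that KPCA addresses.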

Besides the PCA method, AI-based methods can solve issues related to complex and non-linear processes. For example, (Gravanis et al., 2022) proposed an FDD framework for non-linear industrial processes empowered by dynamic neural networks. They evaluated the proposed approach for 18 different faults. The simulation findings showed that this methodology performs better than current solutions for the majority of those faults. (Peng et al., 2022) provided a systematic review of data-based approaches for fault diagnosis and early warning. (Angelopoulos et al., 2020) reviewed machine learning solutions for faults in Industry 4.0. For high-noise industrial environments, (Lyu et al., 2022) proposed a new method for smart bearing fault diagnosis in motors based on a residual building unit, soft thresholding and a global context. (Li et al., 2022) proposed an intelligent fault diagnosis technique using deep learning for bearings under unbalanced data conditions.

Convolutional neural network (CNN)

Convolutional Neural Network (CNN) has been used in bearing fault diagnosis (Chen et al., 2021a, 2021b; Eren et al., 2019; Liu et al., 2019; Pan et al., 2018; Peng et al., 2019; Peng et al., 2020; Sinitsin et al., 2022; Wang et al., 2020a, 2020b; Wang et al., 2021a, 2021b; Zhao et al., 2020; Zhong et al., 2019). (Kou et al., 2020) developed a multi-dimensional, end-to-end CNN model for fault diagnosis in rotating devices in high-speed train bogies. (Huang et al., 2022) proposed a new fault diagnosis method based on a combination of CNN and long short-term memory network for complex systems. It was shown that the predictive accuracy and noise sensitivity of fault diagnosis could significantly improve when the proposed method was applied to the Tennessee Eastman chemical process. (Jin et al., 2021) developed a fault diagnosis method based on CNN for rotating machines to recognize fault types quickly and precisely and increase the efficiency of fault diagnosis. Two mechanical datasets were used in the experiments to test the efficacy of the suggested method. The method matched the best existing ones in terms of accuracy, achieving about 100% accuracy for data with typical signals, while maintaining good performance under various dynamic loadings. (Hsu & Liu, 2021) proposed a multi-time convolutional neural network model for fault diagnosis in semiconductor manufacturing. The experimental results show that the suggested method detects faults effectively compared with other multivariate time series methods. (Zhang et al., 2023a, 2023b) described a new method based on sample reliability assessment and improved CNN. The findings demonstrate that the designed method can reduce the negative impact of issues that arise during training, including imbalanced samples, overfitting and class imbalance. The performance of fault diagnosis is thereby improved.

Artificial neural network (ANN)

Artificial Neural Network (ANN) is considered an effective technique for the detection of faults, particularly incipient faults (Castresana et al., 2022; Rahman et al., 2019; Soto et al., 2019; Zakaria et al., 2012). A new technique developed by (Jayamaha et al., 2019) is based on ANN and a wavelet multiresolution analysis approach for quick fault detection and isolation in DC microgrids without de-energizing the existing network. The outcomes showed the effectiveness of the proposed scheme in terms of quick and accurate fault localization as well as fast and reliable fault detection. (Zhakov et al., 2020) applied ANN for fault detection on overhead hoist transport systems for semiconductors. The result showed that ANN offered precise real-time fault detection, enabling a needs-based, resource-saving, and effective maintenance process for robust overhead hoist transport systems and, consequently, constant semiconductor manufacturing. A few research works have proposed hybrid methods under certain conditions. For instance, (Capriglione et al., 2018) proposed a hybrid method combining NARX with ANN. This method was applied to the rear suspension stroke sensor in motorcycle design. The effectiveness of the scheme lies in its ability to identify various fault types, such as un-calibration faults, which are caused by slight variations in the input/output sensor curve, and hold faults, which are caused by the breaking of the potentiometer cursor, open circuit, and short circuit. All the results were obtained through experimental tests. (Lee et al., 2022) proposed a fault-detection system for malfunctioning induction motors based on an ANN, correlation and fitness value-based feature selection, and multi-resolution analysis.

Support vector machine (SVM)

Support Vector Machine (SVM) is a well-known machine learning classification method that works with a small number of samples. It is used in many sectors such as mechanical fault diagnosis, face recognition, biomedicine, brain-computer interfaces, and financial applications (Gupta et al., 2019; Li et al., 2020a, 2020b; Morra et al., 2010; Poursaeidi & Kundakcioglu, 2014; Shi Hong, 2011; Tanveer et al., 2021; Widodo & Yang, 2007; Zheng et al., 2017; Zhou et al., 2010). (Widodo & Yang, 2007) reviewed and summarized the development of SVM for monitoring and fault diagnosis. For wind turbine transmission systems, a new diagnosis technique based on manifold learning and Shannon wavelet SVM was proposed by (Tang et al., 2014). The effectiveness of the suggested method was demonstrated through the implementation of fault diagnosis in the gearbox of a wind turbine. The proposed method, which achieved up to 92% accuracy, showed greater accuracy than existing methods. (Zheng et al., 2017) proposed a new method of FDD for rolling bearings based on composite multi-scale fuzzy entropy and ensemble SVM. The composite multi-scale fuzzy entropy was used to examine the complexity of rolling bearing vibration signals and extract hidden (unknown) nonlinear features from the vibration signals. An ensemble SVM-based multi-classifier was developed for the purpose of effectively classifying fault features. This method successfully differentiated between various bearing fault types and degrees of severity. (Lin, 2021) proposed a medium Gaussian SVM approach for applying machine learning to motor bearing fault diagnosis. The findings demonstrated that, under situations of varying crack size and load, the medium Gaussian SVM approach enhanced the reliability and accuracy of motor bearing defect prediction, detection, and identification.
According to experimental findings, the medium Gaussian SVM intelligent diagnosis approach showed a 96% accuracy rate when using nine features of motor bearings, which is superior to the 89.6% and 93.6% accuracy rates of the fine and coarse Gaussian SVMs respectively. (Huerta-Rosales et al., 2021) used an approach based on statistical time features and SVM to diagnose a transformer in various short-circuited turns conditions, obtaining an accuracy of 96.82% for such an application. (Tanveer et al., 2022) highlighted the benefits of the SVM method, which uses the structural risk minimization principle to improve generalization and lower training phase error. Since it is created for binary-class classification, many SVMs must be integrated in a certain way to provide multi-class classification. SVM learning requires a lot of time for a huge volume of data, while some approximation approaches are employed to speed up the computing time. This reduces the classification performance.

Data-driven qualitative methods

Expert systems

The expert system is a computer program that simulates the decision-making of a human expert. Compared with traditional programs, expert systems are designed to handle and solve complex problems by using knowledge to reason like experts rather than by following a developer’s instructions. Classically, expert systems consist of four components: knowledge acquisition system, knowledge base, inference engine and man–machine interface (Li et al., 2013). (Venkatasubramanian et al., 2003) highlighted some advantages of designing expert systems, such as the ease of design, the ability to reason (make a decision) under uncertainty, and the ability to provide explanations for the solutions provided. However, expert systems have limitations, as they are difficult to update and have very specific applications. Some research articles have investigated the application of expert systems to fault diagnosis. One such fault diagnosis system was developed for wind turbines by using confidence production rules and an expert system self-learning method (Deng et al., 2017). (Al-Jonid et al., 2018) proposed a fault diagnosis expert system for semiconductor manufacturing equipment using a Bayesian network. The results proved the accuracy of the used method. (Wang, 2018) designed a fault diagnosis model of mechanical equipment fault features of a vibration system based on an expert system using Abaqus software. The results demonstrated that the system successfully increased the capability of fault diagnosis of the vibration system of mechanical equipment. (Berredjem & Benidir, 2018) proposed a fuzzy expert system for bearing fault diagnosis by using improved range overlaps and the similarity method. The result showed the efficiency and validity of the proposed method. (Xu et al., 2020) presented a belief rule-based expert system for fault diagnosis of marine diesel engines. The proposed system was applied to abnormal wear detection in a marine diesel engine. 
The performance of the proposed method was compared with other models (ANN, SVM, and binary logistic regression models), with fivefold cross-validation, and the result demonstrated that the proposed expert system outperformed the compared methods in terms of stability, accuracy and the effectiveness of concurrent fault detection.

Fault trees

Fault tree analysis identifies the potential causes of faults or failures in a system by analyzing the suspicious components and their related failure modes that may have caused the issue. Fault tree analysis is a common tool in reliability and risk management that can support decision-making in complex systems (Jimenez-Roa, 2022). When an error occurs, engineers carefully investigate all data during the operation to perform fault diagnosis (Lee et al., 2005). Fault tree analysis generally consists of four steps as follows: system definition, fault tree creation, qualitative evaluation, and quantitative evaluation. It provides a computational method for combining logic to investigate the faults in a system. Moreover, it is considered an interesting method because it allows the use of AND, OR, and XOR logic nodes rather than the predominantly OR node shown in digraphs. This reduces erroneous solutions and provides a better representation of the system. However, the main issue with fault trees is that they are prone to errors at various points in the development phase (Venkatasubramanian et al., 2003). With the intelligent industry, the availability of inspection and monitoring data is increasing, making techniques for extracting knowledge from large data sets relevant. (Gao et al., 2018) presented a fault diagnosis system based on fault tree for electric vehicle charging devices. The results showed that fault tree analysis could identify the fault location. Fault tree analysis for network fault diagnosis was employed to assist network maintenance managers in identifying faults with maximum probability and improving the effectiveness of network fault diagnosis (Wang, 2022).
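The quantitative evaluation step can be illustrated with a small recursive calculation of the top-event probability. The sketch below handles only AND and OR gates and assumes statistically independent basic events. The tree structure and the probabilities are hypothetical.

```python
from math import prod

def probability(node):
    """Probability of a fault-tree node.

    A node is either a basic-event probability (float) or a tuple
    (gate, children) with gate in {"AND", "OR"}. Basic events are
    assumed to be statistically independent."""
    if isinstance(node, (int, float)):
        return float(node)
    gate, children = node
    probs = [probability(child) for child in children]
    if gate == "AND":
        return prod(probs)                         # all inputs must fail
    if gate == "OR":
        return 1.0 - prod(1.0 - p for p in probs)  # at least one input fails
    raise ValueError(f"unsupported gate: {gate}")

# Hypothetical tree: the device fails if the power module fails,
# OR if the controller AND its watchdog both fail.
tree = ("OR", [0.01, ("AND", [0.02, 0.05])])
top_event = probability(tree)
```

For this tree, the AND branch contributes 0.02 × 0.05 = 0.001, and the top event has probability 1 − 0.99 × 0.999 = 0.01099.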

Fault diagnosis based on signal processing

The primary elements of FDD are symptoms, which are represented by signals or observers connected to the faults. The residual, which represents the deviation of a certain system characteristic from its fault-free status, is a widely used symptom. Thus, if the residual is not zero, the system is alerted to faults (Okada et al., 2021). Most residuals are produced by signal analysis-based approaches, which compare amplitudes in the frequency spectrum, signal amplitudes in the time domain, and statistical information (Brkovic et al., 2017; Fan et al., 2018; Heydarzadeh & Nourani, 2016). The Wavelet Transform (WT) technique is the most widely used signal processing-based FDD approach.

Wavelet transform (WT)

WT is an analytic technique for time-varying or non-stationary signals which uses a scaling concept to describe spectral decomposition (Bouzida et al., 2011). (Chen et al., 2016a, 2016b) reviewed a WT method, which is based on the inner product, for fault diagnosis in rotating machines and proposed a new WT methodology for use in the decomposition of the sensor signals in a process. Wavelet transforms are traditionally classified as Discrete Wavelet Transform (DWT), Continuous Wavelet Transform (CWT) and Wavelet Packet Transform (WPT) (Li & Chen, 2014). However, all wavelet techniques are limited by the choice of the wavelet basis used in the applications, which has a direct impact on fault detection accuracy, particularly in weak fault diagnosis (Chen et al., 2016a, 2016b). The advantages of WT are described in (Chen et al., 2016a, 2016b), including its enormous power in condition monitoring and fault detection of mechanical equipment due to its ability to perform multi-resolution analysis. This is helpful in finding weak fault features in noisy data. (Saravanan & Ramachandran, 2010) investigated the use of DWT for feature extraction and ANN for classification in gearbox fault diagnosis. (Anwarsha & Babu, 2022) reviewed and summarized a WT method, called the tunable q-factor wavelet transform, in rolling element bearings. Depending on the q-factor number, this method can divide any vibration signal into low q-factor, high q-factor, and residual components. This method can be applied to rolling element bearing fault diagnosis for feature extraction, signal denoising, and automatic defect detection. A hybrid method based on CNN and DWT was developed for fault diagnosis of power cables (Wang et al., 2022). The test results demonstrated that the described method had outstanding performance in terms of recognition accuracy, achieving 97.5%, and rapidly identified the fault status of power cables.
(Han et al., 2022) proposed a hybrid solution defined by a dual tree complex WPT and time-shifted multi-scale range entropy for fault diagnosis of rolling bearings. The results showed high effectiveness in the determination of different fault types in different bearings. The method was also able to pre-screen healthy bearings and improve the accuracy of identifying the types of bearing faults. A combination of flexible analytical WT and fuzzy entropy approaches for fault diagnosis of bearings was proposed by (Malhotra et al., 2021). Experiments showed that this method had advantages in identifying bearing faults and their severity.
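A one-level Haar DWT is enough to illustrate how wavelet detail coefficients localize a short transient in an otherwise smooth signal. The sketch below implements the Haar step directly in NumPy. The test signal, the spike that stands in for a fault signature, and the threshold of six times the median absolute detail coefficient are assumptions made for the example.

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform.
    Returns (approximation, detail); the signal length must be even."""
    even, odd = signal[0::2], signal[1::2]
    approximation = (even + odd) / np.sqrt(2.0)   # local average (low-pass)
    detail = (even - odd) / np.sqrt(2.0)          # local difference (high-pass)
    return approximation, detail

# Smooth 5 Hz sinusoid sampled at 1024 points, with a short spike
# injected at sample 600 to mimic an impulsive fault signature.
n_samples, fault_index = 1024, 600
t = np.arange(n_samples) / n_samples
signal = np.sin(2 * np.pi * 5 * t)
signal[fault_index] += 1.0

approximation, detail = haar_dwt(signal)

# Heuristic threshold: six times the median absolute detail coefficient.
threshold = 6.0 * np.median(np.abs(detail))
flagged = np.flatnonzero(np.abs(detail) > threshold)
estimated_fault_samples = 2 * flagged   # each coefficient covers samples 2i and 2i+1
```

The smooth component contributes only small detail coefficients, so the single large coefficient points to the sample pair that contains the spike. A different wavelet basis or more decomposition levels would change the sensitivity, which reflects the dependence on the wavelet basis noted above.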

Differentiator design

(Sidhom et al., 2018) presented a new method based on robust differentiator design. It is important to emphasize that all measured signals from a physical process represent useful information. This information is represented by a low-frequency signal compared to noise. Useless information includes disturbances and noises. The noises can have different sources of origin, such as electrical, thermal, digital, etc. The obvious presence of noise in the signal to be derived is one of the main sources of difficulty in the design of differentiation algorithms. For example, the well-known method based on a finite difference presents an exact differentiation in the absence of noise. However, the quality of the signal derivative is greatly degraded in the presence of noise. The definition of an ideal differentiator based on such a linear approach over a frequency band of the considered signal assumes that the frequency range of the noise must be known a priori. Therefore, it is possible to place a low-pass filter to remove the high frequency characteristics of the noise. This solution can provide satisfactory results in terms of noise reduction. On the other hand, the presence of a phase shift is inevitable, which is a particularly disadvantageous effect for the dynamic system. However, when no or minimal information about the dynamics of the signal/noise is known, an alternative approach based on the sliding mode technique can be used. A novel FD approach based on a higher-order sliding mode technique was proposed (Sidhom et al., 2018). This approach is defined by a new sliding mode differentiator schema compared to the classic one (Pisano & Usai, 2011). The aim of this new version is to overcome the problem of setting parameters while improving precision and robustness with respect to noise. By including a proper low-pass filter, such an improvement aids in achieving the best compromise between the phase shift and the error. 
Against this backdrop, a first-order Dynamic Gain Robust Differentiator (DGRD) has been proposed for open-circuit fault detection in a three-phase, 9-cell cascaded H-bridge (Sidhom et al., 2018). The proposed method calculates the first derivative of the current to quickly identify the amplified faults using a new sliding mode differentiator scheme. Given that the proposed algorithm is very robust to noise, it is possible to differentiate the measured signals while amplifying only the impact of the faults. Such a proposition can help in the real-time detection of micro faults in some processes. To evaluate the overall efficiency of this method, it is necessary to validate the proposed algorithm with other kinds of systems and consider other levels of measurement noise (Sidhom et al., 2018).
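The trade-off described above between noise amplification and phase shift can be reproduced numerically. The sketch below differentiates a sine wave by finite differences, with and without measurement noise, and then applies a first-order low-pass filter to the noisy derivative. It illustrates only the classical linear approach, not the sliding mode differentiator of the cited work. The sampling period, noise level, and filter time constant are arbitrary choices.

```python
import numpy as np

dt = 1e-3
t = np.arange(0.0, 10.0, dt)
clean = np.sin(t)
rng = np.random.default_rng(0)
noisy = clean + 1e-3 * rng.standard_normal(t.size)   # small measurement noise
true_derivative = np.cos(t[:-1])

def finite_difference(signal, dt):
    """Forward finite-difference estimate of the first derivative."""
    return np.diff(signal) / dt

def low_pass(signal, dt, tau):
    """First-order low-pass filter with time constant tau (discrete form)."""
    alpha = dt / (tau + dt)
    out = np.empty_like(signal)
    out[0] = signal[0]
    for k in range(1, signal.size):
        out[k] = out[k - 1] + alpha * (signal[k] - out[k - 1])
    return out

def rms(error):
    return float(np.sqrt(np.mean(error ** 2)))

err_clean = rms(finite_difference(clean, dt) - true_derivative)
err_noisy = rms(finite_difference(noisy, dt) - true_derivative)
err_filtered = rms(low_pass(finite_difference(noisy, dt), dt, tau=0.05) - true_derivative)
```

Without noise the finite difference is nearly exact. With noise, dividing by the small sampling period amplifies the error by several orders of magnitude. Filtering recovers most of the accuracy, but the remaining error is dominated by the lag that the filter introduces.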

Discussion and comparison of FDD approaches

Previous research studies indicate that FDD in manufacturing systems does not take multiple faults into account while considering the problem of interconnection and interpretability (Liang et al., 2018; Oliveira et al., 2017; Schubert et al., 2011). Despite the complexity of industrial processes, most of the proposed approaches in the literature are based on linear representations with simple additive or multiplicative faults. They are based on the modeling of the supervised systems without fault integration (Chadli et al., 2017; Piltan & Kim, 2018; Zhao & Shen, 2019).

In this section, we compare the three main FDD classes (Table 2), highlighting the advantages and disadvantages as well as the applications of each method. Moreover, we provide a qualitative benchmark regarding the different steps of the FDD approach (detection, isolation, and identification) according to the fault types. As described in Table 2, both state estimation and ANN are effective methods for the detection and diagnosis of additive, multiplicative, and incipient faults, while PCA is an effective method for the detection and identification of abrupt faults in linear systems. Fault tree is effective for the localization of abrupt and intermittent faults, whereas Differentiator Design (DD) is a powerful method for the detection of abrupt, intermittent and incipient faults. The rest of the methods demonstrate their effectiveness in the detection and isolation of various faults.

Table 2 Comparison of FDD approaches in SM

Self-healing systems and fault-tolerant approaches

SM systems are vulnerable to many kinds of hardware and software faults. These defects can be amplified by closed-loop control systems, and faults can develop into malfunctions of the loop. Self-healing approaches can maintain efficient behavior of the supervised system, even with faults. Self-healing has its origin in fault-tolerant and self-stabilizing systems research (Psaier & Dustdar, 2011). Fault-tolerant systems handle transient failures and mask permanent failures in order to return to a valid state (Luo et al., 2022). Self-stabilizing systems are considered a non-fault masking approach for fault-tolerant systems (Altisen et al., 2021). From the perspective of the process control framework, Self-Healing and Fault-Tolerant (SH-FT) approaches are categorized into two kinds of methods: passive and active (Fig. 11).

Fig. 11

Categories of self-healing and fault-tolerant approaches

Passive fault-tolerant control

Passive SH-FT methods consist in the development of robust control techniques. For such a method, the list of potential malfunctions is assumed to be known a priori as basic design defects and is considered in the design phase of the control system. The key role of the robust control technique is to guarantee the insensitivity of the closed-loop system to some known sets of faults. In such cases, no online fault information is used. Thus, the term “passive” indicates that no additional action is taken by the control system in response to the malfunction. In other words, the controller handles defects passively. This method is based on the simple idea that faults represent disturbances that the control law must consider from its initial design. Figure 12 shows the scheme of a passive FTC system. A reference signal, which represents the desired output, is applied to a supervised system composed mainly of three blocks. The actuator is the machine component responsible for converting the input signal received from the passive FTC block into a physical action. The system can be represented by a physical or behavioral model of the studied process. The sensor is a device that measures physical properties such as temperature, pressure, humidity, and acceleration, and converts them into electrical or digital signals. The red signs represent the different kinds of faults that can affect the actuator, system, and sensor. The green arrow shows the feedback signal from the sensor to the comparator, which calculates the error between the actual output and the reference signal. The passive FTC block is a controller designed offline for normal conditions and for the predefined faults that affect the supervised system, so its response to these faults is fixed at design time.

Fig. 12 Fault-tolerant passive control schema
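The loop in Fig. 12 can be illustrated with a minimal simulation sketch. The first-order plant, the PI gains, and the loss-of-effectiveness actuator fault used below are illustrative assumptions and are not taken from the cited works. The only point is that the controller is fixed at design time and receives no fault information online.

```python
def simulate_passive_ftc(effectiveness_after_fault=0.5, fault_time=5.0,
                         t_end=20.0, dt=0.001, kp=8.0, ki=12.0,
                         reference=1.0):
    """Fixed PI controller on the assumed plant dx/dt = -x + rho * u.

    rho is the actuator effectiveness (1.0 = healthy, hypothetical model).
    The gains kp and ki never change during the run: the fault is absorbed
    by the robustness of the design, not by any online reconfiguration.
    """
    x = 0.0          # plant state, also used as the sensor measurement
    integral = 0.0   # running integral of the tracking error
    rho = 1.0        # healthy actuator at start
    history = []
    for k in range(int(t_end / dt)):
        t = k * dt
        if t >= fault_time:
            rho = effectiveness_after_fault   # actuator fault appears
        error = reference - x                 # comparator
        integral += error * dt
        u = kp * error + ki * integral        # fixed control law
        x += dt * (-x + rho * u)              # forward-Euler plant update
        history.append((t, x))
    return history


history = simulate_passive_ftc()
final_output = history[-1][1]
lowest_after_fault = min(x for t, x in history if t >= 5.0)
```

With these assumed values, the output dips after the fault and then returns to the reference, because the integral action treats the fault like a constant disturbance. A fault outside the range anticipated at design time would not necessarily be tolerated.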

Passive FTC uses control techniques that are robust to parametric uncertainties and external disturbances, which here represent the faults (e.g., H∞ control, adaptive control, and sliding mode control) (Chen et al., 2022; Sidhom et al., 2016, 2022; Stefanovski, 2018; Cerone, 2017).

For linear time-invariant systems, Amin and Mahmood-ul-Hasan (2021) introduced a passive FTC system based on a Proportional-Integral (PI) controller with high-gain feedback for the air-fuel ratio control systems of internal combustion gasoline engines. Its effectiveness was evaluated by introducing noise in the sensor measurements. Simulations in MATLAB/Simulink showed that the system is robust to faults under both normal and noisy sensor conditions. Probabilistic reliability calculations were also carried out for the model. Other passive methods are available for linear systems, such as the H2 and H∞ robust control techniques (Sidhom et al., 2016; Cerone, 2017). These controllers ensure the system’s stability and perform well in the face of disturbances/defects in the external environment. Robustness is quantified via the H2 and H∞ norms of the system's transfer function. H2 control minimizes the transmission of disturbances/faults to the controlled output. The optimal H2 gain is obtained by solving Riccati equations or a convex optimization problem expressed as Linear Matrix Inequalities (LMIs). The H∞ norm, on the other hand, is the largest singular value of the transfer function from the system input to its output, maximized over all frequencies; that is, it corresponds to the peak gain of the frequency response.
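The interpretation of the H∞ norm as the peak gain of the frequency response can be checked numerically. The sketch below is an illustration only: it assumes a stable single-input, single-output transfer function (for which the largest singular value reduces to the magnitude), and the function names and the second-order example system are hypothetical. A frequency sweep of this kind yields a lower bound on the true norm, so the grid must be fine enough near resonances.

```python
import math


def frequency_response(num, den, omega):
    """Evaluate G(j*omega); num and den hold polynomial coefficients
    in descending powers of s."""
    s = complex(0.0, omega)

    def polyval(coeffs):
        value = 0j
        for c in coeffs:          # Horner's scheme
            value = value * s + c
        return value

    return polyval(num) / polyval(den)


def hinf_norm_sweep(num, den, w_min=1e-3, w_max=1e3, points=20001):
    """Approximate the H-infinity norm of a stable SISO system by taking
    the largest gain found on a log-spaced frequency grid."""
    log_min, log_max = math.log10(w_min), math.log10(w_max)
    peak_gain, peak_freq = 0.0, w_min
    for i in range(points):
        omega = 10.0 ** (log_min + (log_max - log_min) * i / (points - 1))
        gain = abs(frequency_response(num, den, omega))
        if gain > peak_gain:
            peak_gain, peak_freq = gain, omega
    return peak_gain, peak_freq


# Assumed example: G(s) = 1 / (s^2 + 0.2 s + 1), i.e. damping ratio 0.1
# and natural frequency 1 rad/s.
peak_gain, peak_freq = hinf_norm_sweep([1.0], [1.0, 0.2, 1.0])
```

For this lightly damped example, the sweep should land close to the analytical resonance peak 1/(2ζ√(1−ζ²)) ≈ 5.03 for ζ = 0.1. For a multi-input, multi-output system, the magnitude would be replaced by the largest singular value of the transfer matrix at each frequency.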

Various passive actuator FTC techniques have been developed for different classes of nonlinear systems (Jin & He, 2017; Nasiri et al., 2019; Tao, 2014). For a class of multi-input, multi-output nonlinear systems with uncertainties, Nasiri et al. (2019) introduced passive actuator FTC using adaptive sliding mode control. Such a controller is based on two layers: an adaptive layer and a robust layer. The robust layer is defined by the sliding mode technique, which is robust to disturbances or unknown dynamics that satisfy the matching condition. Faults at the process input or output can therefore be treated as disturbances by the control block. In this case, the robust layer can ensure the operation of the closed-loop system by taking these faults into account. The only information required is the upper bounds of these faults, so that the controller can account for them correctly. The adaptive layer, in turn, compensates for process faults that result from modifications of the system parameters.
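The role of the upper bound in the robust layer can be made concrete with a minimal first-order sliding mode sketch. It is not the controller of Nasiri et al. (2019): the scalar plant dx/dt = u + f(t), the sinusoidal matched fault, and all names and numerical values are assumptions chosen for illustration. The switching gain is set only from the assumed bound on the fault, never from the fault signal itself.

```python
import math


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def simulate_smc(fault_bound=1.0, gain_margin=0.5, x0=2.0, reference=0.0,
                 t_end=10.0, dt=1e-3):
    """Sliding mode regulation of the assumed plant dx/dt = u + f(t).

    f(t) is a matched fault with |f(t)| <= fault_bound. The controller
    knows fault_bound only; it never measures or estimates f(t).
    """
    switching_gain = fault_bound + gain_margin   # must exceed the bound
    x = x0
    trajectory = []
    for k in range(int(t_end / dt)):
        t = k * dt
        fault = fault_bound * math.sin(2.0 * t)  # hypothetical fault signal
        s = x - reference                        # sliding variable
        u = -switching_gain * sign(s)            # discontinuous control law
        x += dt * (u + fault)                    # forward-Euler update
        trajectory.append((t, x))
    return trajectory


trajectory = simulate_smc()
worst_late_error = max(abs(x) for t, x in trajectory if t >= 5.0)
```

Because the switching gain exceeds the fault bound, the sliding variable is driven to zero in finite time and then stays within a small chattering band whose width scales with the time step. In practice the sign function is often smoothed to reduce chattering, at the cost of a small residual error.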

Various research works address adaptive controller schemes. For instance, Kordestani et al. (2018) developed an adaptive form of passive FTC for an industrial steam turbine using an adaptive PCA-based inverse neural network control strategy. Guezmil et al. (2019) investigated passive fault-tolerant control for induction machines using the sliding mode method. The experimental results showed that the induction machine continued to operate even in the presence of inter-turn short-circuit faults.

Active fault-tolerant control

In contrast, the active FTC method responds to faults by reconfiguring the control law online to preserve the stability of the system and keep its performance close to nominal. Active FTC is designed to meet the control objectives with minimal degradation of system performance, either by using a pre-calculated control law or by synthesizing and updating a control strategy online. Its main objectives are to preserve the stability of the system and, when defects occur, to maintain an acceptable level of performance by acting on the control system in real time. This is achieved by using the information about actuator, sensor, or system defects collected by the FDD block, as described in Fig. 13. The term “active” indicates that the corrective action continuously adapts the control of the system to the faults that may occur in the process. Active FTC strategies therefore usually rely on real-time fault diagnosis systems to provide the most recent information on the status of the monitored system. This makes precise information available on any fault, including its type, time of appearance, and magnitude (Yu et al., 2022; Zhao et al., 2022).

Fig. 13 Fault-tolerant active control schema

The aims of active fault-tolerant control design can be summarized as follows: first, to develop an effective fault detection and diagnosis system (the FDD block) that provides fault information quickly and with low uncertainty; second, to reconfigure the existing control scheme effectively in order to ensure system stability and achieve acceptable closed-loop performance; and third, to reconfigure the control smoothly and minimize switching transients.

As illustrated in Fig. 13, an active FTC system generally consists of the supervised process (actuator, system, and sensor blocks), which is susceptible to various faults (shown in red), a Fault Detection and Isolation (FDI) block, a configuration mechanism block, and a controller configuration block. The FDI block provides the controller with online information about the faults that may affect the supervised system; its main role is to detect and isolate faults. The configuration mechanism block is responsible for reconfiguring the supervised system parameters to compensate for the faults. Different approaches to active FTC design exist in the literature, such as switching-based, hierarchical-structure, safe-parking, and analytical feedback compensation active FTC (Abbaspour et al., 2020). The controller (controller configuration block) is designed to make the necessary adjustments in real time when faults or disturbances occur in the supervised system. These three units (FDI, configuration mechanism, and controller configuration) must work in harmony to ensure successful completion of the control tasks. The green arrow represents the feedback signal from the sensor to the comparator, which computes the error between the actual output and the reference signal.
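The interaction between the blocks of Fig. 13 can be sketched on a toy example. Everything below is an assumption made for illustration: the first-order plant dx/dt = -x + rho * u, the residual-based estimator standing in for the FDI block, the rescaling rule standing in for the configuration mechanism, and all parameter names and values. None of it reproduces a specific method from the cited works.

```python
def simulate_active_ftc(reconfigure=True, fault_time=5.0, rho_fault=0.5,
                        t_end=15.0, dt=1e-3, kp=4.0, reference=1.0,
                        residual_threshold=0.05):
    """Toy active FTC loop on the assumed plant dx/dt = -x + rho * u.

    rho is the true actuator effectiveness; rho_hat is the estimate kept
    by the (hypothetical) FDI block. If reconfigure is False, the fault is
    still estimated but the controller ignores the estimate.
    """
    x = 0.0
    rho_true, rho_hat = 1.0, 1.0
    detected_at = None
    log = []
    for k in range(int(t_end / dt)):
        t = k * dt
        if t >= fault_time:
            rho_true = rho_fault                      # actuator fault
        # Nominal controller: feedforward plus proportional feedback.
        u_nominal = reference + kp * (reference - x)
        # Configuration mechanism: rescale the command by the estimate.
        u = u_nominal / (rho_hat if reconfigure else 1.0)
        # FDI block: compare the plant response with the model prediction.
        x_pred = x + dt * (-x + rho_hat * u)
        x_next = x + dt * (-x + rho_true * u)         # "measured" response
        residual = (x_next - x_pred) / dt             # (rho_true - rho_hat) * u
        if abs(residual) > residual_threshold and abs(u) > 1e-6:
            rho_hat = max(rho_hat + residual / u, 0.05)  # fault estimation
            if detected_at is None:
                detected_at = t                       # detection time
        x = x_next
        log.append((t, x, rho_hat))
    return log, detected_at


log_active, detected_at = simulate_active_ftc(reconfigure=True)
log_ignored, _ = simulate_active_ftc(reconfigure=False)
```

In this noise-free sketch, the residual crosses the threshold as soon as the fault appears, and the reconfigured loop returns to the reference, whereas the loop that ignores the estimate settles at a lower output. With measurement noise, the residual would need filtering, and the trade-off between detection delay and false alarms would become the central design issue.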

In general, active fault-tolerant control involves three steps. The first step is defect accommodation, which handles defects of small amplitude: the new control law is only an online adaptation of the parameters of the existing controller. The second step is system reconfiguration, which is used when defect accommodation is not effective; it modifies the structure of the system to compensate for defects. The third step is reconstruction, which corresponds to the synthesis of a new control law with new structures and parameters (Mekki et al., 2015; B. Wang et al., 2020a, 2020b).
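One possible reading of these three steps is an escalation rule, sketched below. The amplitude limit, the boolean effectiveness check, and all names are hypothetical; in particular, the text does not state when reconstruction is triggered, so treating it as the last resort is an assumption of this sketch.

```python
from enum import Enum


class Step(Enum):
    ACCOMMODATION = "adapt parameters of the existing controller online"
    RECONFIGURATION = "modify the structure of the system"
    RECONSTRUCTION = "synthesize a new control law"


def select_step(fault_amplitude, reconfiguration_is_effective,
                small_amplitude_limit=0.2):
    """Pick the recovery step for a diagnosed fault.

    fault_amplitude: normalized fault size reported by the diagnosis
        block (hypothetical scale, 0 = no fault).
    reconfiguration_is_effective: whether a structural change is expected
        to restore acceptable performance (assumed to be known).
    small_amplitude_limit: assumed limit below which accommodation is
        considered sufficient.
    """
    if fault_amplitude <= small_amplitude_limit:
        return Step.ACCOMMODATION
    if reconfiguration_is_effective:
        return Step.RECONFIGURATION
    return Step.RECONSTRUCTION   # assumed last resort


chosen = select_step(fault_amplitude=0.5, reconfiguration_is_effective=True)
```

A real supervisor would replace the boolean flag with a performance check on the closed loop after each attempted step.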

Active fault-tolerant control has a major disadvantage: it has limited time to act and adapt the new control law. Shen et al. (2019) designed an active fault-tolerant control system for spacecraft attitude maneuvers that focuses on actuator faults. Simulation results demonstrated the success of the proposed method. Based on adaptive sliding mode control and recurrent neural networks, B. Wang et al. (2020a, 2020b) proposed an active fault-tolerant control technique for a quadrotor helicopter that protects against actuator faults and model uncertainties while directly taking fault estimation errors into account. The usefulness of this technique was confirmed through experiments on a quadrotor helicopter exposed to actuator faults and model uncertainties. B. Wang et al. (2021a, 2021b) developed a new neural network-based active fault-tolerant control for fractional time-delayed systems. The experimental results highlighted the robustness and reliability of the proposed approach for nonlinear systems. Mrazgua et al. (2019) addressed the fuzzy H∞ fault-tolerant control problem for T-S fuzzy model-based active suspension systems with actuator faults. Han et al. (2021) introduced an active, physically realizable fault-tolerant controller based on a reduced-order observer for vehicle suspension in the discrete time domain. Zhu et al. (2020) addressed active fault-tolerant control and fault estimation for discrete-time systems in the finite frequency domain. Hagh et al. (2021) presented an active fault-tolerant control technique for robotic manipulators that are vulnerable to actuator faults. A simulated two-degree-of-freedom robotic manipulator subjected to various fault scenarios was used to evaluate the effectiveness of the suggested approach. Bensalem et al. (2021) employed a fuzzy controller to design an active fault-tolerant control for 5-phase Permanent Magnet Synchronous Motors (PMSM) subject to speed sensor faults. Experimental tests were performed in terms of the measured and estimated speed responses on the 5-phase PMSM drive. Simulation results demonstrated that the proposed method could achieve continuous operation of the 5-phase PMSM even in the event of a speed sensor fault.

Discussion and comparison of FTC approaches

Table 3 compares the two main SH-FT approaches while highlighting their advantages, disadvantages, and applications. Offering our analytical view on previous works, we emphasize that the methods are adaptive, robust, and ensure the stability of a supervised system when faced with certain faults and disturbances in the external environment. However, the majority of the methods proposed in the literature focus on actuator faults and system disturbances, and there is no general methodological framework that integrates FDD, fault-tolerant, and self-healing methods and is specific to manufacturing systems. We suggest that future studies should consider the internal and external behavior of the studied systems as well as the different types of faults that may have a serious effect on such systems, such as sensor and random faults.

Table 3 Comparison of Active FTC and Passive FTC

Conclusion

We have summarized and explained different FDD and SH-FT concepts and approaches for SM applications. This paper proposes a novel conceptual model of FDD and SH-FT for smart manufacturing. Moreover, it presents the advantages and disadvantages of each approach and highlights promising research directions.

From our analyses and interpretation of 256 papers, we arrive at the following conclusions:

Concerning fault combinations and taxonomies, the literature review shows that most studies consider abrupt and intermittent faults, whereas only a few focus on incipient faults, which are the most frequently occurring faults during the manufacturing process. Incipient faults are generally related to the degradation of equipment, which evolves slowly over time (e.g., motor degradation or electric cable degradation). In addition, this review shows that multi-faults have not been extensively investigated by previous studies, although they are frequently observed in SM due to the interconnection and interoperability problems of the multi-systems that constitute the smart manufacturing system. One promising research direction in this regard is the development of multi-fault diagnosis methods for SM multi-systems. These methods should respond effectively to various multi-faults during operation in real time.

For FDD approaches in SM, selecting an appropriate one gives rise to a dilemma. In this context, the main question is: What is the best FDD approach to adopt, one based on equations and physical laws (White Box models) or one based on data (Black Box models)? Here, a promising research direction could be to develop hybrid approaches that combine physical and data-driven models. However, the optimal solution depends mainly on various conditions and hypotheses related to the studied system, its industrial environment, and the interconnection between the different sub-systems that constitute the SM system.

Self-healing and fault-tolerant control is not a common research field in smart manufacturing. Only about 30% of the reviewed papers focus on SH-FT applications in SM. Here, we emphasize the value of fault-tolerant control as a replacement for framework-based techniques that require a significant number of components to resolve a defect. We suggest that future studies should focus on the development of new self-healing methods for smart and resilient manufacturing. SH-FT approaches are mainly based on the readjustment of the model, controller, and sensors of a supervised system without any external intervention.