1 Introduction

The goal in regression is to predict one or more variables \(\varvec{\textbf{y}}(k)=(y_{1}(k),\ldots ,y_{m}(k))^T\) from the information provided by measurements \(\varvec{\textbf{x}}(k)=(x_{1}(k),\ldots ,x_{p}(k))^T\) for a given sample k. Customarily, \(\varvec{\textbf{y}}(k)\) are referred to as targets, outputs, or dependent variables, while \(\varvec{\textbf{x}}(k)\) are commonly referred to as predictors, inputs, covariates, regressors, or independent variables. Regression models cover several application areas, such as economic growth problems [1, 2], air quality prediction [3, 4], medicine [5, 6], chemical industries [7, 8], and industrial processes [9, 10]. Recent studies show that regression models have become predominant in increasingly complex real-world systems due to the large availability of data, inclusion of nonlinear parameters, and other aspects intrinsic to the application area. For complex systems, traditional machine learning techniques (i.e., non-deep/shallow techniques) may become limited due to two main challenges: the big-data explosion (high dimensionality and a high number of observations) and the increase in complexity caused by the dynamics of present-day applications. Deep learning (DL) methods have gained prominence in recent years due to their ability to represent systems in complex structures with multiple levels of abstraction and high-level features. However, DL may have limitations such as dependence on a high number of samples, hyperparameter sensitivity, and interpretability issues [11, 12], which limit its application in critical systems or in systems that require accountability for their results. In this sense, deep fuzzy systems (DFS) have emerged as a viable alternative to DL for balancing accuracy and interpretability in complex real-world systems.

Regression models can be categorized as [13]: (i) “white-box” when the input–output mapping is built upon first-principle equations, (ii) “black-box” when the mapping is derived from the data (also referred to as data-driven modeling), or (iii) “grey-box” when the knowledge about the input–output mapping is known beforehand and integrated along with data-driven modeling. White-box models are advantageous for promoting interpretations of the internal mechanisms associated with input–output mapping. On the other hand, black-box models can address complex systems with predictive analysis without prior knowledge of the system. Grey-box models can combine the interpretability presented by “white-box” models and the ability to learn from data given by “black-box” models (e.g., fuzzy systems). Regarding data-driven modeling, the dependency between input and output can be built by linear or nonlinear models. A regression model is linear when the rate of change between input and output is constant because the output is a linear combination of the inputs. Examples include the multivariate linear regression models. Data-driven nonlinear regression is adopted when the input–output dependence is nonlinear and cannot be covered by linear modeling. There is a plethora of methods for nonlinear regression, and their applicability is problem-dependent. Examples include fuzzy systems, support vector regression, artificial neural networks (e.g., non-deep/shallow and deep networks), and rule-based regression (e.g., decision trees and random forests).

The following are brief discussions of recent surveys and reviews that have showcased the applicability of DL to various regression application domains. The work of Han et al. [14] reviews DL models for time-series forecasting, where the DL models are categorized as: (i) discriminative, where the learning stage is based on the conditional probability of the output/target given an observation, (ii) generative, which learn the joint probability of both output and observation, with the generation of random instances, or (iii) hybrid, a combination of generative and discriminative DL methods. The authors demonstrated that DL models are effective at discriminating complex patterns in time series with high-dimensional data by implementing them in benchmark systems and using a real-world use case from the steel industry. Sun et al. [11] discuss the use of DL for soft sensor applications, showing the trends, the scope of applications in industrial processes, and the best practices for model development. The authors established some directions for future research, such as working on DL solutions to overcome the limitation in learning in scenarios with a lack of labeled samples (e.g., semi-supervised methods), hyperparameter optimization, solutions to improve model reliability (e.g., model visualization), and the development of DL methods with distributed and parallel modeling. Torres et al. [15] survey DL architectures for time-series forecasting. Furthermore, the authors discuss the practical aspects that must be considered when using DL methods to solve complex real-world problems in big-data settings, which include the existing libraries/toolboxes, the techniques for automatic optimization of model structures, and the hardware infrastructure. Pang et al. [16] and Chalapathy et al. [17] present an overview of DL studies for anomaly detection; they also discuss the complexities and different types of DL models for different domains (e.g., classification, autoregressive, unsupervised, and semi-supervised) with application to cyber-security systems, medical monitoring, financial surveillance, and industrial processes. Additional review papers on DL for regression can be found in [18,19,20]. It is noted from these works that DL techniques have some advantages over non-deep methods, such as the ability to learn complex representations with automatic feature engineering, no requirement for prior experience or knowledge, and good performance as the dimensionality of the data increases, among other specific advantages depending on the type of framework and application [21].

Despite the advantages portrayed in the literature, deep learning has limitations, such as the need for a sufficiently large dataset for model training, sensitivity to hyperparameter selection, and lack of interpretability [11, 12]. Also, there is a lack of proper explanations of the internal structure of deep models, which raises concern in applications that directly or indirectly impact human life as well as in operational decisions [22, 23]. With this concern in mind, recent works address the interpretability issue of DL from the following principles of “eXplainable Artificial Intelligence” (XAI) systems [24]: an appropriate explanation must exist for each decision made; each explanation must be meaningful to the user; the explanation must accurately reflect what happens in the system; and the situations in which the system may or may not function properly must be identified (knowledge limits).

Fuzzy logic systems (FLS), composed of IF-THEN rules with linguistic terms mimicking human thought [25], are one of the research areas that contemplate the XAI principles. FLS offer a wide range of approaches to nonlinear systems, primarily in terms of interpretability, whether related to the complexity and semantics of fuzzy rules, notation readability, coverage of the input data space (operation regions), and so on [26, 27]. Moral et al. [28] show the main benefits of adopting FLS for the development of explainable methodologies, discussing fundamental concepts and definitions associated with XAI and FLS as well as how to design increasingly interpretable models. Therefore, DL techniques can be complemented with FLS to develop deep fuzzy systems (DFS), which provide an easy-to-understand and easy-to-implement interface to address the main drawbacks of DL and FLS, aiming at both good accuracy and good interpretability. The surveys of [29] and [30] investigated some recent trends in DFS models and real-world applications (e.g., time series forecasting, natural language processing, traffic control, and automatic control). However, despite the benefit of adopting FLS principles in DL systems, no comprehensive survey or review has been conducted focusing exclusively on deep fuzzy systems for regression problems.

This paper surveys and discusses the state-of-the-art on deep fuzzy techniques developed to deal with a diverse range of regression applications. Initially, Sect. 2 presents fundamental concepts about FLS and XAI. Then, an overview of DL techniques commonly used for regression will be presented in Sect. 3, namely Convolutional Neural Networks, Deep Belief Networks, Multilayer Autoencoders, and Recurrent Neural Networks. Next, Sect. 4 shows the literature on deep fuzzy systems in two ways: (i) standard deep fuzzy systems, based on fundamental FLS; and (ii) hybrid deep fuzzy systems, with the combination of FLS and the conventional deep models discussed in Sect. 3. Finally, Sect. 5 presents general discussions based on the state-of-the-art surveyed.

2 Background

2.1 Fuzzy Logic Systems

Developed initially by Lotfi A. Zadeh [31], FLS are rule-based systems composed mainly of an antecedent part, characterized by an “IF” statement, and a consequent part, characterized by a “THEN” statement, allowing the transformation of a human knowledge base into mathematical formulations, thus introducing the concept of linguistic terms related to the membership degree.

The basic configuration of a fuzzy logic system, shown in Fig. 1, depends on an interface that transforms the real input variables into fuzzy sets (fuzzifier), which are interpreted by a fuzzy inference model to perform an input–output mapping based on fuzzy rules. Thus, the mapped fuzzy outputs go through an interface that transforms them into real output variables (defuzzifier) [32, 33].

Fig. 1 Basic configuration of fuzzy logic systems

Some well-known fuzzy systems are Mamdani fuzzy systems [34], Takagi-Sugeno (T-S) fuzzy systems [35], and Angelov-Yager’s (AnYa) fuzzy rule-based systems using data clouds [36]. Among these, the T-S fuzzy models stand out for their ability to decompose nonlinear systems into a set of linear local models smoothly connected by fuzzy membership functions [37]. T-S fuzzy models are universal approximators capable of approximating any continuous nonlinear system, and they can be described by the following fuzzy rules [38]:

$$\begin{aligned} R^{i}:~&\textbf{IF}~x_1(k)~\text{is}~F_{1}^{i}~\text{and}~\ldots ~\text{and}~x_p(k)~\text{is}~F_{p}^{i} \\ &\textbf{THEN}~\varvec{\textbf{y}}^i(k)=\varvec{\textbf{f}}^{\,i}(x_{1}(k),\ldots ,x_{p}(k)), \end{aligned} \tag{1}$$

where \(R^{i}\) (\(i=1,\ldots ,N\)) represents the i-th fuzzy rule, N is the number of rules, \(x_1(k), \ldots ,x_p(k)\) are the input variables of the T-S fuzzy system, \(F^{i}_{j}\) (\(j=1,\ldots ,p\)) are the linguistic terms characterized by fuzzy membership functions \(\mu ^{i}_{F^{i}_{j}}\), and \(\varvec{\textbf{f}}^{\,i}(x_{1}(k),\ldots ,x_{p}(k))\) is the consequent function, i.e., the local model of the system associated with the i-th fuzzy rule [39].
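As an illustration, the inference implied by the rules in Eq. (1) can be sketched in Python. The sketch relies on choices that Eq. (1) leaves open and that are assumptions here: Gaussian membership functions, the product as the “and” operator, affine consequents \(\varvec{\textbf{f}}^{\,i}\), a single output (\(m=1\)), and a firing-strength-weighted average to combine the rule outputs. The names `ts_predict`, `centers`, `sigmas`, and `theta` are illustrative only.

```python
import numpy as np

def gaussian_mf(x, center, sigma):
    """Gaussian membership degree (an assumed shape for the sets F_j^i)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def ts_predict(x, centers, sigmas, theta):
    """Output of a first-order T-S system for one sample x of shape (p,).

    centers, sigmas: (N, p) membership parameters, one row per rule.
    theta: (N, p + 1) coefficients of the affine consequents (bias first).
    """
    # Firing strength of rule i: product of its p membership degrees.
    w = np.prod(gaussian_mf(x, centers, sigmas), axis=1)
    # Local outputs y^i = theta_i0 + theta_i1 * x_1 + ... + theta_ip * x_p.
    y_local = theta[:, 0] + theta[:, 1:] @ x
    # Weighted average of the local models.
    return float(np.sum(w * y_local) / np.sum(w))

# Two rules on one input: "x is near 0 -> y = x" and "x is near 1 -> y = 2 - x".
centers = np.array([[0.0], [1.0]])
sigmas = np.array([[0.5], [0.5]])
theta = np.array([[0.0, 1.0], [2.0, -1.0]])
y_mid = ts_predict(np.array([0.5]), centers, sigmas, theta)
```

At \(x=0.5\) both rules fire equally, so the prediction is the plain average of the two local outputs (0.5 and 1.5). Because each rule can be read as a local linear model valid in a named region of the input space, the contribution of every rule to a prediction can be inspected directly.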

2.2 Explainable Artificial Intelligence

Although the explainable artificial intelligence (XAI) concept is often associated with a homonymous program formulated by a group of researchers from the Defense Advanced Research Projects Agency (DARPA) [40], the principles related to explainability gained strength from the 1970s onwards. The earliest works presented rule-based structures and decision trees with human-oriented explanations, such as the MYCIN system proposed in [41], developed for infectious disease therapy consultation, and the tutoring program GUIDON proposed in [42], based on natural language studies, among numerous other systems [43,44,45,46]. Although many authors commonly use the terms “explainability” and “interpretability” as synonyms, Rudin [47] discusses the problem of using purely explainable methods that only provide explanations of the results obtained with black-box models (post hoc analysis), and stresses the importance of developing inherently interpretable methodologies with causality relationships that are understandable to human users.

3 Deep Regression Overview

Deep neural networks (DNN) have emerged due to their architecture with multiple levels of representation and their remarkable performance in a variety of tasks [48]. This section presents the DL techniques commonly employed in regression problems to provide a better understanding and context for the literature review of deep fuzzy regression performed in Sect. 4.

3.1 Convolutional Neural Networks

Convolutional neural networks (CNNs or ConvNets) are feedforward neural networks with a grid-like topology that are used for applications such as time-series data processing in 1-D grids and image data processing in 2-D pixel grids [49]. Figure 2 presents a CNN architecture for processing time-series data in 1-D grids.

Fig. 2 Convolutional neural network, with 1-D architecture

The “neocognitron” model proposed in [50] is frequently referred to as the inspiration model for what is currently known about CNNs. First proposed in [51], the neocognitron aimed to represent simple and complex cells from the visual cortex of animals which present a mechanism capable of detecting light in receptive fields [52].

CNNs use a mathematical operation on two functions called “convolution”, with the first function referred to as the input, the second function as the kernel, and the convolution’s output as the feature map [49]. Outputs from the convolution layer, whose kernel weights are shared across the input, go through a pooling layer (downsampling), which summarizes adjacent outputs with an overall statistic (e.g., their maximum or average), reducing the size of the data and, together with weight sharing, the number of associated parameters [11, 53]. After the data is processed through the layers that alternate between convolution and pooling, the final feature maps go through a fully connected (dense) layer to extract high-level features. For regression problems, the extracted features can be combined in a prediction mechanism with an activation function or a supervised learning model (e.g., support vector regression) to estimate the final output [54, 55].
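The convolution, pooling, and dense stages described above can be sketched for a 1-D signal as follows. This is a minimal NumPy illustration under stated assumptions, not a reference implementation: a single kernel, a ReLU activation, non-overlapping max pooling, and a linear output layer. All function and parameter names are hypothetical.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D cross-correlation, the operation usually called convolution in DL."""
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)])

def max_pool1d(x, size):
    """Non-overlapping max pooling; samples that do not fill a window are dropped."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

def cnn_regress(x, kernel, w_dense, b_dense):
    """Convolution -> ReLU -> pooling -> linear (dense) output."""
    feature_map = np.maximum(conv1d(x, kernel), 0.0)
    pooled = max_pool1d(feature_map, 2)
    return float(pooled @ w_dense + b_dense)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # a short time series
kernel = np.array([-1.0, 0.0, 1.0])            # responds to the local trend
feature_map = conv1d(x, kernel)
y_hat = cnn_regress(x, kernel, np.array([0.5, 0.5]), 1.0)
```

The same three kernel weights are reused at every position of the series, which is what keeps the number of parameters independent of the input length.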

CNNs in the context of regression have been explored in several domains, for example, traffic flow forecasting of real road networks [56, 57], prediction of natural environmental factors using data from meteorological institutes [58,59,60,61], industrial process optimization [62,63,64], electric load forecasting [65,66,67,68,69], and chemical process analysis [70, 71]. Despite showing good performance with the extraction of spatially organized features, essentially for pre-processing, CNN’s performance depends on a large amount of data and the correct choice of hyperparameters, being computationally intensive [21, 72].

3.2 Deep Belief Networks

Deep Belief Networks (DBNs) are probabilistic generative models proposed by Hinton et al. [73], which have a hierarchical structure as illustrated in Fig. 3.

Fig. 3 Deep belief networks architecture

The DBNs are composed of multiple layers of latent variables (hidden units) of binary values, organized into multiple learning modules called restricted Boltzmann machines (RBMs).

Each RBM comprises a layer of visible units for data representation and a layer of hidden units for feature representation, learned by capturing higher-order correlations from the data. The two RBM layers, with no connections within layers, are connected by a matrix of symmetrically weighted connections, with a total of L weight matrices \(\varvec{\textbf{W}} = \{\varvec{\textbf{w}}_{1},\varvec{\textbf{w}}_{2},\ldots ,\varvec{\textbf{w}}_{L}\}\), considering a DBN of L hidden layers. All units in each layer have a bidirectional connection with all units in neighboring layers, except for the last two layers, L and output, that have a unidirectional connection [49]. In the DBN architecture of Fig. 3, the layer of visible units in the first RBM represents the input variables \(\varvec{\textbf{x}}\) and the subsequent layers represent hidden units \(\varvec{\textbf{h}}\), progressing hierarchically until reaching the estimated output, with the output weights \(\varvec{\textbf{w}}_{out}\).
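A deterministic upward pass through such a stack can be sketched as below. The sketch assumes that each hidden layer propagates its activation probabilities (the sigmoid conditional of a binary RBM) rather than sampled binary states, and that the output layer is linear; the unsupervised pre-training of the RBMs is omitted. The names `dbn_forward`, `weights`, `biases`, `w_out`, and `b_out` are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dbn_forward(x, weights, biases, w_out, b_out):
    """Propagate inputs through L stacked RBM layers and a linear output layer."""
    h = x
    for W, b in zip(weights, biases):
        # Probability that each binary hidden unit is on, given the layer below.
        h = sigmoid(h @ W + b)
    return h @ w_out + b_out

rng = np.random.default_rng(0)
sizes = [4, 3, 2]  # p = 4 inputs followed by L = 2 hidden layers
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
w_out, b_out = rng.normal(size=sizes[-1]), 0.0
y_hat = dbn_forward(rng.normal(size=(5, 4)), weights, biases, w_out, b_out)
```

In practice the weight matrices would first be learned layer by layer without labels and then fine-tuned with the targets; here they are random only to keep the example self-contained.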

Recent DBN literature for regression has primarily addressed industrial problems in the context of process monitoring [74,75,76,77,78,79], soft sensing [80, 81], and prognostics [82, 83]. Also, DBNs were applied in time series forecasting problems such as traffic flow [84, 85], environmental prediction [86, 87], and stock price [88], as well as in modeling benchmark systems [89, 90]. Studies with DBNs have difficulties in explaining the effect of hidden units on the dynamics of the system, leading to interpretability issues. As Fig. 3 shows, a DBN needs to undergo both unsupervised and supervised learning, and this training process becomes slower as the DBN structure grows. As a result, the DBN can become sensitive to noisy inputs when its low-level parameters are not correctly readjusted [91].

3.3 Multilayer Autoencoders

Autoencoders (AEs) are feedforward neural networks used for dimensionality reduction and representation learning, which are trained to reproduce their inputs at their outputs [49]. A historical overview of deep learning in [92] presented some early works on AEs in the literature, such as the work in [93] that proposes unsupervised architectures to reconstruct the inputs through internal representations. Another early work was published in [94], where the authors explore the effect of hidden units in simple two-layer associative networks that map input patterns to a set of output patterns. As presented in Fig. 4, a single AE essentially has a structure composed of a layer with input variables \(\varvec{\textbf{x}}\), a layer with hidden units \(\varvec{\textbf{h}}\) (which performs an encoding used to represent the input), and an output layer with the reconstructed inputs represented with a “hat” symbol (e.g., \(\hat{\varvec{\textbf{x}}}\)). These layers are interconnected by an encoder function (between input and hidden) and by a decoder function (between hidden and output) [49]. As an evident constraint, the number of neurons in the input layer must be the same as the number of neurons in the output layer, which restricts the AE to unsupervised pre-training or feature extraction [95].
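The encoder and decoder functions of a single AE can be sketched as follows. The example is an assumption-laden illustration rather than a method from the cited works: it uses synthetic data, a tanh encoder, a linear decoder, and plain full-batch gradient descent on the mean squared reconstruction error. All names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumption): 5 measured inputs driven by 2 latent factors.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_in, n_hidden = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
b_dec = np.zeros(n_in)

def encode(X):
    """Encoder: inputs x -> hidden units h."""
    return np.tanh(X @ W_enc + b_enc)

def decode(H):
    """Decoder: hidden units h -> reconstructed inputs x_hat."""
    return H @ W_dec + b_dec

def reconstruction_mse(X):
    return float(np.mean((decode(encode(X)) - X) ** 2))

loss_before = reconstruction_mse(X)
lr = 0.02
for _ in range(1000):
    H = encode(X)
    err = decode(H) - X                      # x_hat - x
    dH = (err @ W_dec.T) * (1.0 - H ** 2)    # backpropagate through tanh
    W_dec -= lr * H.T @ err / len(X)
    b_dec -= lr * err.mean(axis=0)
    W_enc -= lr * X.T @ dH / len(X)
    b_enc -= lr * dH.mean(axis=0)
loss_after = reconstruction_mse(X)
```

No targets \(\varvec{\textbf{y}}\) appear anywhere in the training loop; the hidden code returned by `encode` is what a downstream regression model would consume.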

Fig. 4 Modified stacked autoencoder architecture for regression

A viable alternative to deal with increasingly complex data is to increase the number of hidden layers in standard AEs, enabling the development of deep network architectures. A well-known configuration of multilayer autoencoders found in the literature is the stacked autoencoder (SAE). As illustrated in Fig. 4, an SAE is built by grouping L AEs, where the hidden layers of the autoencoders are stacked hierarchically and trained with an unsupervised layerwise learning algorithm. The reconstruction of inputs after the L-th hidden layer can then be disregarded to address regression problems. As for CNNs, a prediction mechanism or a supervised learning model can be included after the L-th hidden layer to estimate the system output. Related parameters, such as the weights \(\varvec{\textbf{W}} = \{\varvec{\textbf{w}}_{1},\varvec{\textbf{w}}_{2},\ldots ,\varvec{\textbf{w}}_{L}\}\) between layers, are fine-tuned by a supervised method (e.g., backpropagation algorithm) [96].

Recent studies using AEs within a deep architecture for regression have considered soft sensing applications for quality variable prediction [97,98,99,100,101]. Other cases of applications in industrial processes include CNC turning machines [102, 103], end-point quality prediction [104, 105], and prognostics [106,107,108,109]. Other works were developed for time-series forecasting applications, such as natural environmental factors [110,111,112], electricity load forecasting [113], and tourism demand [114]. Some of the limitations of multilayer autoencoders include the sensitivity to errors or loss of information from the first layer, impairing learning as it progresses through the hidden layers. Thus, the nature of encoding and decoding by hidden layers can cause a loss of interpretability and an increase in computational cost [21].

3.4 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are artificial neural networks with an internal state that allows the use of feedback signals between neurons [115]. One of the early works that culminated in the popularization of RNNs was proposed in [116], with the development of content-addressable memory systems called Hopfield networks, whose dynamics have a Lyapunov function (or energy function) that drives the system states toward a local minimum of “energy” [117]. Another early work, published in [118], presented a new learning procedure, backpropagation, which adjusts the weights of connections of recurrent networks with “internal hidden units”. Due to the presence of an “internal memory”, RNNs are often used to process data in the time domain (sequential information), with weight sharing through hidden states [119]. Figure 5 shows the architecture of an RNN, whose hidden states \(\varvec{\textbf{h}}\) represent the “internal memory” of the system; as the inputs are sequentially observed, the corresponding outputs are estimated. The weights \(\varvec{\textbf{W}}^{in}\), \(\varvec{\textbf{W}}^{h}\) and \(\varvec{\textbf{W}}^{out}\) are auxiliary parameters that are shared across time [120]. The states are updated at every temporal instant until the whole input sequence of length K has been processed [121, 122]. The RNN learning process depends on a “backpropagation through time” algorithm, which is a gradient-based technique that updates parameters recursively starting from the last temporal instant and going backward in time [123].
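The state update with time-shared weights can be sketched as a forward pass over a sequence of length K. The sketch assumes a tanh activation, a linear output, and no bias terms, none of which are fixed by the description above; `rnn_forward` and its arguments are hypothetical names that mirror \(\varvec{\textbf{W}}^{in}\), \(\varvec{\textbf{W}}^{h}\), and \(\varvec{\textbf{W}}^{out}\).

```python
import numpy as np

def rnn_forward(X, W_in, W_h, W_out, h0=None):
    """Run a simple RNN over a sequence X of shape (K, p).

    Returns the outputs of shape (K, m) and the final hidden state.
    """
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    outputs = []
    for x_k in X:
        # The same three weight matrices are reused at every time step.
        h = np.tanh(W_in @ x_k + W_h @ h)
        outputs.append(W_out @ h)
    return np.array(outputs), h

rng = np.random.default_rng(0)
K, p, n_h, m = 7, 3, 4, 2
W_in = rng.normal(scale=0.5, size=(n_h, p))
W_h = rng.normal(scale=0.5, size=(n_h, n_h))
W_out = rng.normal(scale=0.5, size=(m, n_h))
Y_hat, h_last = rnn_forward(rng.normal(size=(K, p)), W_in, W_h, W_out)
```

Backpropagation through time would differentiate this loop from the last step to the first, which is where the repeated multiplication by \(\varvec{\textbf{W}}^{h}\) arises.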

In deep learning, there are many variations of standard RNNs, such as Long Short-Term Memory networks (LSTMs), Gated Recurrent Units (GRUs), and Echo State Networks (ESNs) [49]. Some of these variations were proposed to cope with some limitations present in standard RNNs, such as exploding and vanishing gradients (instability in networks caused by a large variation in model parameters), overfitting, and difficulty in storing low-level features in long data sequences [124]. In this document, only LSTMs and ESNs employed for regression problems will be discussed.

Fig. 5 Recurrent neural network architecture

3.4.1 Long Short-Term Memory

Long Short-Term Memory networks (LSTMs) were initially developed in [125] to overcome the issues of RNNs associated with vanishing/exploding gradients. These issues can occur during the training of long temporal sequences with the backpropagation through time algorithm. In this sense, during successive operations in compound functions with the weight matrices, gradients can exponentially reach very low values close to zero (vanishing) or very high values (exploding) [49]. Figure 6 illustrates the architecture of an LSTM.

Fig. 6 The long short-term memory cell

LSTMs introduced “gates”, nonlinear elements that control memory cells using sigmoidal functions \(\sigma\), hyperbolic tangent functions, current observation \(\varvec{\textbf{x}}(k)\) and hidden units \(\varvec{\textbf{h}}(k-1)\) from the previous time instant [122]. Each of these memory cells has input, output and forget gates (\(\varvec{\textbf{i}}\), \(\varvec{\textbf{o}}\) and \(\varvec{\textbf{f}}\), respectively), that protect the information from perturbations caused by irrelevant inputs and irrelevant memory contents [72]. The information stored in the cells represents the states \(\varvec{\textbf{s}}(k)\) obtained from the data processed in the time domain. Despite having the same inputs and outputs as a standard RNN, an LSTM cell has an internal recurrence (self-loop) to propagate the information flow through a long sequence and, therefore, more parameters to be adjusted [49].
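One time step of such a cell can be sketched as below, following the commonly used LSTM formulation with input, forget, and output gates and a tanh candidate; the exact variant used in each cited work may differ. The dictionaries `W`, `U`, and `b` (keyed by gate) and the function name `lstm_step` are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, s_prev, W, U, b):
    """One LSTM step: returns the new hidden units h(k) and cell state s(k).

    W[g]: (n_h, p) input weights, U[g]: (n_h, n_h) recurrent weights,
    b[g]: (n_h,) biases, for g in {'i', 'f', 'o', 'c'}.
    """
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    c = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])   # candidate content
    s = f * s_prev + i * c       # self-loop on the cell state
    h = o * np.tanh(s)           # exposed hidden units
    return h, s

p, n_h = 3, 2
W = {g: np.zeros((n_h, p)) for g in 'ifoc'}
U = {g: np.zeros((n_h, n_h)) for g in 'ifoc'}
b = {g: np.zeros(n_h) for g in 'ifoc'}
h, s = lstm_step(np.ones(p), np.zeros(n_h), np.ones(n_h), W, U, b)
```

With all parameters set to zero, every gate equals 0.5 and the candidate is zero, so the cell simply halves its previous state; learned gates replace this fixed factor with data-dependent ones. The four parameter sets per cell also make explicit why an LSTM has more parameters to adjust than a standard RNN of the same size.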

Due to the ability to process data sequentially and ensure long-term dependencies, LSTMs are primarily used for time-series forecasting in traffic flow [126, 127] and natural environment factors [128,129,130,131,132,133]. LSTM methodologies have also been applied in the industrial context for soft-sensing with attention mechanism [134, 135] and key quality prediction [136, 137], process monitoring [138, 139], and prognostics [140,141,142]. Other applications of LSTMs include electric load forecasting [143, 144]. LSTMs in regression problems, despite the feasibility, can suffer from vanishing gradient from the saturation of cell states, which must be reset occasionally. In addition, there is a computational cost issue in the application of LSTMs as they require a high memory bandwidth to perform the associated functions in complex systems [145].

3.4.2 Echo State Network

Echo State Networks (ESNs) are variations of RNNs that were developed in [146] and share the basic ideas of reservoir computing of Liquid State Machines from [147]. The term “reservoir computing” stands for a homonymous research stream that introduced the concept of a dynamic reservoir, in place of the hidden layer, with many sparsely connected neurons [148]. In addition, this reservoir must have a stability condition known as “echo state property”, which allows for a gradual reduction in the effect of previous states and inputs on future states over time [149]. Figure 7 illustrates the architecture of an ESN.

Fig. 7 The echo state network

An ESN induces nonlinear response signals from the input signals \(\varvec{\textbf{x}}(k)\), whose resulting state \(\varvec{\textbf{h}}(k)\) echoes the input information, estimating a desired output signal \(\varvec{\textbf{y}}(k)\) [146]. In addition to the input weight \(\varvec{\textbf{W}}^{in}\) and output weight \(\varvec{\textbf{W}}^{out}\) that are present in a standard RNN, the ESN features the reservoir weights \(\varvec{\textbf{W}}^{r}\) and, optionally, the output-to-reservoir feedback weights \(\varvec{\textbf{W}}^{back}\). The fundamental idea for the functioning of an ESN is to adjust only \(\varvec{\textbf{W}}^{out}\) during training, while the rest are randomly assigned and fixed before learning [150].
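The idea of training only \(\varvec{\textbf{W}}^{out}\) can be sketched as follows. The example makes several assumptions that are common in practice but not prescribed by the text above: a sparse random reservoir rescaled to a spectral radius of 0.9 (a widely used heuristic related to the echo state property), no output feedback \(\varvec{\textbf{W}}^{back}\), a ridge-regression readout, and a toy one-step-ahead sine prediction task. All names and hyperparameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, washout, ridge = 100, 50, 1e-6

# Toy task (assumption): predict the next value of a sine wave.
u = np.sin(0.1 * np.arange(600))
x, y = u[:-1], u[1:]

# Input and reservoir weights are drawn once and never trained.
W_in = rng.uniform(-0.5, 0.5, size=n_res)
W_r = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
W_r[rng.random((n_res, n_res)) > 0.1] = 0.0           # keep about 10% of the links
W_r *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_r)))   # spectral radius 0.9

def run_reservoir(x):
    """Collect the reservoir states h(k) produced by the input sequence."""
    h = np.zeros(n_res)
    states = []
    for x_k in x:
        h = np.tanh(W_in * x_k + W_r @ h)
        states.append(h)
    return np.array(states)

# Discard the initial transient, then fit the readout in closed form.
H = run_reservoir(x)[washout:]
Y = y[washout:]
W_out = np.linalg.solve(H.T @ H + ridge * np.eye(n_res), H.T @ Y)
rmse = float(np.sqrt(np.mean((H @ W_out - Y) ** 2)))
```

Since the readout is a linear least-squares problem, training requires no backpropagation through time; the price is that model quality depends on the randomly drawn reservoir and on hyperparameters such as its size and the input scaling.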

Recent applications of ESNs include time-series forecasting in chaotic systems [151,152,153,154,155], wind power systems [156,157,158,159,160], solar energy systems [161,162,163,164], and benchmark systems [165,166,167]. Other applications include chemical processes [168, 169], soft sensing [170], electrical/power systems [171,172,173], and the prognosis of turbofan engines [174, 175]. The recent literature shows that random connectivity within the dynamic reservoir of an ESN can lead to interpretability issues. Furthermore, ESN hyperparameters, such as the number of units in the reservoir and the input scaling, must be kept within a narrow operating range to ensure maximum model performance [176].

4 Deep Fuzzy Regression

Deep fuzzy systems (DFSs) are models built on top of DL structures with fuzzy logic systems (FLSs). DFS aims to overcome the lack of interpretability of DL systems and the limitations of FLS when dealing with high-dimensional data. DFS is often referred to in the literature as an explainable model due to the incorporation of fuzzy logic into its core. According to the definition in [177], an explainable AI (XAI) system should provide capabilities accessible to human understanding, reflecting positively on the health of the system’s processes and being able to operate even in unforeseen situations. Although DFS is promoted as an XAI system by definition, this is not necessarily the case, as evidenced by works in the literature. This section surveys recent applications of DFS for regression, with the main focus on its structure and on whether it adheres to the XAI principles.

4.1 Survey on Deep Fuzzy Systems

This survey will follow two stages to review the DFS for regression applications. First, the DFS will be categorized according to its structure. Secondly, the models will be categorized according to whether they follow the XAI principles. The structures of deep fuzzy models can be represented in several forms, such as those illustrated in Fig. 8. Here, deep fuzzy structures are summarized into two categories: (i) Standard DFS and (ii) Hybrid DFS. A model belongs to the first category when the blocks of fuzzy systems are stacked in series, in parallel, or hierarchically (see Fig. 8a). Also, there are cases where the architecture of such systems resembles neural network architecture, such as the dense DNN architecture (see Fig. 8d). The second category includes hybrid methodologies, where conventional DL models are combined with FLS. The combination of DL and FLS is commonly in an ensemble form (see Fig. 8b,c) or mixed form (see Fig. 8d).

Fig. 8 Examples of deep fuzzy system frameworks: (a) multiple FLSs organized hierarchically; (b) sequential ensemble models; (c) parallel ensemble models; and (d) fuzzy neural network fully integrated into a deep architecture or mixed with parts of a conventional deep model

Aside from the deep fuzzy structures, the discussed works will be examined to determine whether they are consistent with the XAI principles. If so, these works will be classified following the categories defined in [22]: understanding and scope of the explainability. In terms of understanding, the models will be classified as (i) transparent or (ii) opaque. Transparent models are those in which the decisions, predictions, or inner functioning are perceptible or visible; they are considered opaque otherwise. The scope is related to accessing the model’s interpretability through post hoc explanations, classified as (i) local, (ii) global, and (iii) visual. Local explanations facilitate comprehension of small regions of interest in the input space for a given decision/prediction. Global explanations apply when the desired understanding covers the entire sample space. Visual explanations are required when visual interfaces are needed to demonstrate the influence of features on decisions.

The following sections will discuss the methods containing DFS for regression problems. Section 4.1.1 will discuss Standard DFS, while Sect. 4.1.2 will discuss Hybrid DFS.

4.1.1 Standard Deep Fuzzy Systems

The methods discussed in this section adhere to the Standard DFS structure and have been applied to regression problems in any domain. They are discussed in order of structural similarity, with structurally similar methods following each other.

In [178], a model called Randomly Locally Optimized Deep Fuzzy System (RLODFS) is proposed. It is composed of a hierarchical, bottom-up, layer-by-layer structure of FLSs, similar to a fully connected DNN; the structure of RLODFS is shown in Fig. 9. In RLODFS, the input variables are divided among several fuzzy subsystems, which decreases the size of the antecedent of each fuzzy rule and thus reduces the learning complexity compared with learning from all input variables at once. The structure of RLODFS shows several groupings, with input variables shared randomly among the fuzzy subsystems of level 1, whose outputs are used as inputs of the fuzzy subsystems of the subsequent levels until reaching the last-level fuzzy system (used to estimate the model output). Each fuzzy subsystem was trained with the Wang-Mendel algorithm [179], which constructs fuzzy rules from a small-scale observation set (data pairs) with input–output mapping. Finally, a random local loop optimization strategy is performed to remove feature combinations and corresponding subsystems with low correlation to achieve fast convergence [178]. The performance of RLODFS on the prediction of 12 real-world datasets from the UCI repository was compared with other methods such as DBN, LSTM, and generalized regression neural network (GRNN). From the authors’ perspective, RLODFS has good interpretability due to its structure, the clear physical meaning of its parameters, and the ease of locating fuzzy rules that may fail for future optimizations. However, there is a lack of transparency in the use of input sharing strategies, which increases the complexity of the method. Furthermore, the number of features per fuzzy subsystem needs to be selected carefully, since a manual/arbitrary choice may not reflect the real needs of the case study under analysis.
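To give an idea of how fuzzy rules can be constructed from data pairs, the following is a simplified sketch in the spirit of the Wang-Mendel procedure [179] for a single fuzzy subsystem. It is not the implementation used in [178]: it assumes evenly spaced triangular fuzzy sets on each variable, at least two sets per variable with a non-constant data range, a rule degree equal to the product of the membership degrees, conflict resolution by keeping the highest-degree rule, and center-average defuzzification. All names are hypothetical.

```python
import numpy as np

def tri_memberships(v, centers):
    """Membership of a scalar v in evenly spaced triangular fuzzy sets."""
    width = centers[1] - centers[0]
    return np.clip(1.0 - np.abs(v - centers) / width, 0.0, 1.0)

def wang_mendel_fit(X, y, n_sets=5):
    """Generate one candidate rule per data pair and keep the strongest ones."""
    x_centers = [np.linspace(X[:, j].min(), X[:, j].max(), n_sets)
                 for j in range(X.shape[1])]
    y_centers = np.linspace(y.min(), y.max(), n_sets)
    rules = {}  # antecedent set indices -> (consequent set index, degree)
    for x_k, y_k in zip(X, y):
        mu_x = [tri_memberships(v, c) for v, c in zip(x_k, x_centers)]
        mu_y = tri_memberships(y_k, y_centers)
        antecedent = tuple(int(np.argmax(m)) for m in mu_x)
        degree = np.prod([m.max() for m in mu_x]) * mu_y.max()
        # Conflicting rules share an antecedent: keep the highest degree.
        if antecedent not in rules or degree > rules[antecedent][1]:
            rules[antecedent] = (int(np.argmax(mu_y)), degree)
    return x_centers, y_centers, rules

def wang_mendel_predict(x, x_centers, y_centers, rules):
    """Center-average defuzzification over all stored rules."""
    num = den = 0.0
    for antecedent, (consequent, _) in rules.items():
        w = np.prod([tri_memberships(v, c)[a]
                     for v, c, a in zip(x, x_centers, antecedent)])
        num += w * y_centers[consequent]
        den += w
    return num / den if den > 0 else float(np.mean(y_centers))

X = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y = 2.0 * X[:, 0]
x_centers, y_centers, rules = wang_mendel_fit(X, y)
y_hat = wang_mendel_predict(np.array([0.375]), x_centers, y_centers, rules)
```

Each stored rule reads as “IF \(x_1\) is set a THEN y is set c”, so the rule base can be listed and inspected directly. The number of possible antecedents grows exponentially with the number of inputs, which is the motivation for splitting the inputs among small subsystems as described above.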

Fig. 9 The structure of RLODFS based on grouping and input sharing, adapted from [178]
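To make the rule-construction step more concrete, the following is a minimal sketch of a Wang-Mendel-style procedure for a single fuzzy subsystem. It is not the implementation used in [178] or [179]: the uniformly spaced triangular membership functions, the number of fuzzy sets (`n_sets`), the center-average defuzzification, and all function names are assumptions made for illustration. The sketch also assumes that every input column has a non-zero range.

```python
import numpy as np

def tri_memberships(x, centers):
    """Membership of scalar x in uniformly spaced triangular fuzzy sets."""
    width = centers[1] - centers[0]
    return np.clip(1.0 - np.abs(x - centers) / width, 0.0, 1.0)

def wang_mendel_fit(X, y, n_sets=5):
    """Build a rule base {antecedent: (consequent center, rule degree)}."""
    in_centers = [np.linspace(X[:, j].min(), X[:, j].max(), n_sets)
                  for j in range(X.shape[1])]
    out_centers = np.linspace(y.min(), y.max(), n_sets)
    rules = {}
    for xk, yk in zip(X, y):
        mu_in = [tri_memberships(v, c) for v, c in zip(xk, in_centers)]
        mu_out = tri_memberships(yk, out_centers)
        # One candidate rule per data pair: the best-matching set per variable.
        antecedent = tuple(int(m.argmax()) for m in mu_in)
        degree = np.prod([m.max() for m in mu_in]) * mu_out.max()
        # Conflicting rules share an antecedent: keep the highest degree.
        if antecedent not in rules or degree > rules[antecedent][1]:
            rules[antecedent] = (out_centers[mu_out.argmax()], degree)
    return in_centers, rules

def wang_mendel_predict(x, in_centers, rules):
    """Center-average defuzzification over all stored rules."""
    mu_in = [tri_memberships(v, c) for v, c in zip(x, in_centers)]
    num = den = 0.0
    for antecedent, (center, _) in rules.items():
        w = np.prod([m[i] for m, i in zip(mu_in, antecedent)])
        num += w * center
        den += w
    return num / den if den > 0 else float("nan")

# Illustrative data only: y = x1 + x2 sampled on a regular grid.
grid = np.linspace(0.0, 1.0, 11)
X = np.array([[a, b] for a in grid for b in grid])
y = X[:, 0] + X[:, 1]
in_centers, rules = wang_mendel_fit(X, y)
print(len(rules), wang_mendel_predict(np.array([0.5, 0.5]), in_centers, rules))
```

With two inputs and five sets per input, the rule base holds at most 25 rules. With all inputs in a single rule base, this bound grows exponentially with the number of inputs, which is the motivation for splitting the inputs across small subsystems.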

The study in [180] proposes a stacked structure composed of double-input rule modules and interval type-2 fuzzy models, abbreviated as IT2DIRM-DFM. The proposed model is illustrated in Fig. 10, which shows four layers: the input layer, which deals directly with the original input data, grouped two-by-two in each rule module; the stacked layer, where the signals coming from the input layer become the inputs of the first layer, whose output becomes the input of the second layer, and so on; the dimension reduction layer, where the width of the IT2DIRM-DFM is hierarchically reduced until it becomes two; and the output layer, where the last IT2DIRM-DFM model produces the final forecasting results. The authors also discuss the interpretability of the model, pointing to its layered learning and to fuzzy rules with only two variables in the antecedent, partitioned into interval type-2 second-order rule partitions. The resulting model was evaluated in two real-world applications, using subway passenger data from Buenos Aires, Argentina, and traffic flow data from the California highway system. The results made it possible to verify interpretability, since it is consistently readable which partitions and bounds of the inputs are used in each fired rule of each rule module. However, the proposed method is interpretable only locally, in each rule module, and not globally, and its structure does not reflect the depth or number of layers needed to address the experiments. Furthermore, the motivation for considering a function, denoted as \(\text {f}\), in each layer related to the worst-performing module is not clear (this is shown at the bottom of Fig. 10).

Fig. 10 The structure of the double-input rule modules stacked deep interval type-2 fuzzy model (IT2DIRM-DFM), adapted from [180]
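As a brief illustration of what "interval type-2" means for a double-input rule, the sketch below computes the lower and upper membership degrees of a Gaussian fuzzy set with an uncertain width, and the resulting firing interval of a two-antecedent rule. It is a generic textbook construction, not the specific partitioning used in [180]: the Gaussian shape, the product T-norm, and the names `it2_gaussian` and `rule_firing_interval` are assumptions.

```python
import numpy as np

def it2_gaussian(x, center, sigma_lower, sigma_upper):
    """Return (lower, upper) membership of x in an interval type-2 Gaussian set.

    The set has a fixed center and an uncertain width in
    [sigma_lower, sigma_upper], with sigma_lower <= sigma_upper.
    """
    lower = np.exp(-0.5 * ((x - center) / sigma_lower) ** 2)
    upper = np.exp(-0.5 * ((x - center) / sigma_upper) ** 2)
    return lower, upper

def rule_firing_interval(x1, x2, set1, set2):
    """Firing interval of a rule with two antecedents (product T-norm)."""
    l1, u1 = it2_gaussian(x1, *set1)
    l2, u2 = it2_gaussian(x2, *set2)
    return l1 * l2, u1 * u2

# Hypothetical rule: "IF x1 is A AND x2 is B", each set given as
# (center, sigma_lower, sigma_upper).
low, up = rule_firing_interval(0.3, 0.7, (0.0, 0.5, 1.0), (1.0, 0.5, 1.0))
print(low, up)
```

Because each rule reads only two inputs, its firing interval can be inspected directly. This is the kind of local readability discussed above.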

Another work that explores double-input rule modules within a stacked deep fuzzy model (in a hierarchical way) was proposed in [181] using datasets related to photovoltaic power plants from Belgium and China. The authors investigate the interpretability of the resulting model, called DIRM-DFM, with conclusions similar to [180], mainly regarding the composition of fuzzy rules. However, in DIRM-DFM, they promote more transparency and simplicity. Some other studies deal with interval type-2 fuzzy models for deep learning in regression problems. In [182], a novel dynamic fractional-order deep learned type-2 FLS was proposed and constructed using singular value decomposition and uncertainty bounds type-reduction. The resulting model was evaluated on two chaotic benchmark system simulations, a simulation for the prediction of the glucose level of type-1 diabetes patients, and a dataset of a heat transfer system with an experimental setup. In addition to determining the limit values of the input data (upper and lower singular values), the authors used stability criteria of fractional-order systems, which made it possible to reduce the necessary number of fuzzy rules and the complexity of nonlinear systems. An evolving recurrent interval type-2 intuitionistic fuzzy neural network (FNN) was proposed in [183], and it was evaluated using regression datasets from the KEEL repository, the Mackey–Glass time series, and a simulated second-order time-varying system. Intuitionistic evaluation, the firing strength of the membership degree, and strategies for adding and removing fuzzy rules were considered to improve uncertainty modeling. Neither [182] nor [183] presented an analysis of the interpretability of their models.

Methods that use multiple neuro-fuzzy systems in a deep hierarchical structure were proposed in [184] and [185]. The work in [184] proposed a deep model that cascades multiple neuro-fuzzy systems modified as multivariable generalized additive models, applied to a real ecological time series, the Darwin sea level pressure. The resulting model manages to locally detail the mechanisms of each neuro-fuzzy system in each layer. However, its discussion and results are incomplete regarding the increase in network depth and the influence of the inputs and partial outputs of the layers on the final prediction. In [185], a hybrid cascade neuro-fuzzy network was proposed, which is composed of multiple extended neo-fuzzy neurons with adaptive training designed for handling online non-stationary data streams. Each layer has a generalization node that performs a weighted linear combination to obtain an optimal output signal. The experimental results in electrical load prediction, using Southern Ukraine’s data from 2012, show that the authors pursued better accuracy at the cost of more membership functions to cover the input space and more adjusted parameters (weights).

The work in [186] proposed a deep learning recurrent type-3 fuzzy system applied to modeling renewable energies (i.e., the power generation of a 660 kW wind turbine and the solar radiation generated by a sunlight simulator). The proposed methodology showed good performance compared to other methods, such as multilayer perceptron, type-1 FLS, type-2 FLS, and interval type-3 FLS, despite the lack of a more elaborate discussion of the presented results. Furthermore, transparency is not guaranteed regarding the modeling steps and the influence of the various parameters optimized during learning. The authors in [187] proposed a self-organizing FNN with incremental deep pre-training, abbreviated as IDPT-SOFNN, to promote efficient feature extraction and dynamic adaptation of the structure according to the error-reduction rate. IDPT-SOFNN was implemented for the prediction of the Mackey–Glass time series, total phosphorus concentration in a wastewater treatment plant, and air pollutant concentration. In [188], a deep fuzzy cognitive map was proposed for multivariable time-series forecasting applications, such as air quality indexes, traffic speed of six road segments in China, and two benchmark datasets from the UCI repository, the electric power consumption and the temperature of a monitor system. The analysis of the model’s interpretability was based on nonlinear and nonmonotonic influences of unknown exogenous factors. The work in [189] proposed a deep FNN composed of an input layer, four hidden layers (membership functions, T-norm operation, linear regression, and aggregation), and an output layer, designed exclusively for intra- and inter-fractional variational prediction of multiple patients’ breathing motion.
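The four hidden layers listed for the deep FNN (membership functions, T-norm operation, linear regression, and aggregation) match the usual forward pass of a Takagi-Sugeno-type fuzzy system. The sketch below shows that generic forward pass only. It is not the network of [189]: the Gaussian membership functions, the product T-norm, the weighted-average aggregation, and the parameter names are assumptions.

```python
import numpy as np

def tsk_forward(x, centers, sigmas, coeffs):
    """Forward pass of a generic first-order Takagi-Sugeno fuzzy system.

    x:       input vector, shape (p,)
    centers: Gaussian centers, shape (R, p), one row per rule
    sigmas:  Gaussian widths, shape (R, p)
    coeffs:  rule consequents [bias, w_1..w_p], shape (R, p + 1)
    """
    # Layer 1: membership degree of each input in each rule's fuzzy set.
    mu = np.exp(-0.5 * ((x - centers) / sigmas) ** 2)
    # Layer 2: product T-norm gives the firing strength of each rule.
    firing = mu.prod(axis=1)
    # Layer 3: linear regression consequent of each rule.
    consequents = coeffs[:, 0] + coeffs[:, 1:] @ x
    # Layer 4: aggregation as a firing-strength-weighted average.
    return float(firing @ consequents / firing.sum())

# Hypothetical two-rule, two-input system.
x = np.array([0.5, -1.0])
centers = np.array([[0.0, 0.0], [1.0, -1.0]])
sigmas = np.ones((2, 2))
coeffs = np.array([[0.0, 1.0, 1.0], [1.0, 2.0, 3.0]])
print(tsk_forward(x, centers, sigmas, coeffs))
```

The output is a convex combination of the rule consequents, so it always lies between the smallest and largest consequent value for the given input.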

Table 1 summarizes the literature on Standard DFS. The works are categorized by application domain and by interpretability category. The survey shows that the application domain is vast, covering industrial systems, power systems, traffic systems, and multivariable benchmark systems.

Table 1 State-of-the-art on methods with standard deep fuzzy systems for regression problems. XAI: explainable artificial intelligence; Disc.: discussion by the authors (Yes/No); Und.: how understandable is the model, whether it is transparent (T) or opaque (O); Scope: in post hoc scope, if the model promotes local explanations (L), global explanations (G) or visual explanations (V)

Table 1 shows that only four out of eleven works discuss interpretability, despite most of their proposed methods being transparent. Also, not all methods provide post hoc explanations, and of these, the local scope is more frequent. The Standard DFS architecture is easy to implement, with flexibility in the construction of fuzzy rules, having a similar structure to feedforward neural networks. However, these methods may suffer from loss of interpretability when using a non-intuitive hierarchical structure, the lack of investigation of interconnections between the variables involved, and the limitation of fuzzy rules that can disturb the estimation by not covering all operating regions (coverage of input data space) [190].

4.1.2 Hybrid Deep Fuzzy Systems

The Hybrid DFSs are discussed in this section. The selection and order of works follow the same principles used for Standard DFS. The common DL architectures used in combination with fuzzy systems are the Deep Belief Networks, Autoencoders, Long Short-Term Memory networks, and Echo State Networks.

In [191], a sparse Deep Belief Network (SDBN) with FNN is proposed for nonlinear system modeling (benchmark) and total phosphorus prediction in a wastewater treatment plant. The SDBN is used for unsupervised learning and pre-training to perform fast weight initialization and improve modeling robustness. The FNN is used for supervised learning to reduce layer-by-layer complexity. As shown in Fig. 11, the structure of the DBN resembles the structure shown in Fig. 3, except for additional constraints (sparsity) used to penalize fluctuations of values along the hidden neurons. The proposed method performed better than other similar methods, such as transfer learning-based growing DBN [192], DBN-based echo-state network [164], and self-organizing cascade neural network [193]. However, the authors noticed various fluctuations in the assignment of hyperparameters, which can compromise the stability of the proposed model, so its structure needs to be made dynamically robust to these fluctuations. In terms of interpretability, the proposed model guarantees a moderate number of membership functions and rules, allowing good consistency and readability of what happens within the FNN structure. The same cannot be said for the DBN structure, whose sparse representation does not make it intuitive to decide which features are more valuable than others.

Fig. 11 Examples of FLS application with conventional deep models: a with deep belief networks, adapted from [191]; and b with denoising autoencoders, adapted from [194]

The authors in [194] proposed a novel robust deep neural network (RDNN) for regression problems involving nonlinear systems, with the implementation of three strategies: a fuzzy denoising autoencoder (FDA) as a base-building unit for RDNN, improving the ability to represent uncertainties; a compact parameter strategy (CPS), designed to reconstruct the parameters of the FDA, reducing unnecessary learning parameters; and an adaptive backpropagation (ABP) algorithm to update RDNN parameters with fast convergence. As shown in Fig. 11, FDA has an input layer, a fuzzy layer (built by Gaussian membership functions), and an output layer, which can be divided into an encoder (“input” to “fuzzy”) and a decoder (“fuzzy” to “output”). Furthermore, at each FDA, due to the nature of this autoencoder, the data is partially corrupted by noise (represented with a “tilde” symbol) and is reconstructed using CPS (represented with a “hat” symbol), while the associated parameters (e.g., encoder/decoder weights and biases) are adjusted via ABP. The resulting model was evaluated through four prediction examples: air quality from the UCI repository, wind speed from the National Renewable Energy Laboratory, housing price from the KEEL repository, and water quality from a wastewater treatment plant in Beijing, China. Regarding the interpretability of the proposed model, there was no appropriate discussion by the authors, as they focused mainly on model performance/accuracy in the presence of uncertainties. In addition to a fuzzy rule base not being defined, the number of fuzzy neurons in each FDA is manually initialized and adapts through redundancies during the reconstruction of parameters via CPS. Finally, the architecture of the proposed model is not intuitive regarding the mechanisms behind the representation ability of each FDA, which would be crucial to determine the influence of input variables and learning parameters on the outputs.
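The encoder/decoder split of a single FDA can be sketched as a forward pass: corrupt the input, pass it through a layer of Gaussian fuzzy neurons, and decode linearly. This is only an illustration of the data flow described above. It does not reproduce CPS or ABP from [194], and the additive Gaussian noise, the product of per-input Gaussian memberships in each fuzzy neuron, the linear decoder, and all names are assumptions.

```python
import numpy as np

def fda_forward(x, centers, widths, W_dec, b_dec, noise_std=0.1, seed=0):
    """One forward pass of a hypothetical fuzzy denoising autoencoder.

    x:       clean input, shape (p,)
    centers: Gaussian centers of the fuzzy neurons, shape (H, p)
    widths:  Gaussian widths of the fuzzy neurons, shape (H, p)
    W_dec:   decoder weights, shape (p, H); b_dec: decoder bias, shape (p,)
    """
    rng = np.random.default_rng(seed)
    # Corruption step: the "tilde" version of the input.
    x_tilde = x + noise_std * rng.standard_normal(x.shape)
    # Encoder ("input" to "fuzzy"): one Gaussian activation per fuzzy neuron.
    d2 = ((x_tilde - centers) / widths) ** 2
    h = np.exp(-0.5 * d2.sum(axis=1))
    # Decoder ("fuzzy" to "output"): the "hat" reconstruction.
    x_hat = W_dec @ h + b_dec
    return x_tilde, h, x_hat

# Hypothetical sizes: 2 inputs, 3 fuzzy neurons.
x = np.array([0.2, 0.8])
centers = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
widths = np.ones((3, 2))
W_dec = np.full((2, 3), 0.1)
b_dec = np.zeros(2)
x_tilde, h, x_hat = fda_forward(x, centers, widths, W_dec, b_dec)
reconstruction_error = float(np.mean((x - x_hat) ** 2))
print(h, reconstruction_error)
```

A training procedure would adjust the centers, widths, and decoder parameters to reduce the reconstruction error measured against the clean input `x`, not the corrupted `x_tilde`.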

Some hybrid methods using FLS and conventional deep models have been applied in the literature for traffic flow prediction. In [195], the authors developed an algorithm based on Dolphin Echolocation optimization [196], where the input features are fuzzified into membership functions to obtain chronological data, which are integrated into the weight update process of the proposed algorithm so that it converges to a globally optimal solution with a Deep Belief Network. The method in [195] was evaluated using datasets of traffic-major roads in Great Britain and PEMS-SF (San Francisco bay area freeways). The study in [197] combined fuzzy information granulation and a deep neural network to represent the temporal-spatial correlation of mass traffic data and to adapt to noisy data. A Stacked Autoencoder is used to obtain the prediction results based on processed granules, which have a good capacity for interpretation, although this was not discussed by the authors. The method in [197] was evaluated using traffic flow data archived for the Portland–Vancouver Metropolitan region.

Other methods were implemented for energy forecasting, usually with LSTM networks. A novel fuzzy seasonal LSTM was proposed in [198], where a fuzzy seasonality index [199] and a decomposition method were employed to solve the seasonal time-series problem in a monthly wind power output dataset from the National Development Council in Taiwan. In [200], the authors used an LSTM network with rough set theory [201] and interval type-2 fuzzy sets for short-term wind speed forecasting (dataset from Bandar-Abbas City, Iran), with the aid of a mutual information approach for efficient input variable selection. A novel ultra-short-term photovoltaic power forecasting method was proposed in [202], where a T-S fuzzy model comprises a fuzzy c-means clustering algorithm and DBNs, with evaluation tests using a 433 kW photovoltaic matrix database. The studies in [198], [200], and [202] did not discuss the interpretability of their models, which have an ensemble-like structure whose fuzzy part is well suited to covering the input space and capturing data uncertainty.

The work in [203] proposed a deep type-2 FLS (D2FLS) architecture with greedy layer-wise training for high-dimensional input data. The D2FLS model was applied to two regression datasets (prediction of the performance of British Telecom’s work area and health insurance premium) and two binary classification datasets (Santander Customer Transaction prediction and British Telecom’s customer service). The authors showed how to extract interpretable explanations related to the contribution of fuzzy rules to the final prediction, developed for a two-layer D2FLS. However, the authors opted for a large number of fuzzy rules (100, in this case), which impairs interpretability by increasing the complexity of the model. The authors in [204] proposed an ensemble model composed of an Echo State Network (ESN), a T-S fuzzy model, and differential evolution for time-series forecasting problems: Mackey-Glass time series, nonlinear auto-regressive moving average (NARMA) time series, and Lorenz attractor. The differential evolution method, used to optimize the weight coefficients of the model, managed to reduce the number of fuzzy rules. The interpretability of the model can be impaired by its ensemble-like structure, which does not make the learning process transparent enough. A method based on the T-S fuzzy model and ESN was developed in [205] and tested with three benchmark examples: approximation of a nonlinear function, prediction of the Henon chaotic system, and identification of a dynamic system with and without a noise signal. The authors chose to balance model complexity and performance based on the parameters involved in learning (e.g., number of fuzzy rules and reservoir size), which yielded better results in comparison with other methods (e.g., traditional ESN and hybrid fuzzy ESN), despite the lack of discussion on interpretability.

In [206], a framework based on a sparse autoencoder (SpAE) and a high-order fuzzy cognitive map (HFCM) is proposed for time-series forecasting problems (e.g., sunspots, Mackey-Glass, S&P 500 stock index and Dow-Jones industrial index). SpAE is used to extract features from the original data, and these features are then processed via HFCM. Another study that uses SpAE with FLS was proposed in [207] and applied to the Mackey-Glass time series and the Iris dataset (classification). The proposed method reduces the number of fuzzy rules by reducing the data dimensionality with SpAE. Neither [206] nor [207] addresses the interpretability of the proposed models, which present good directions in data partitioning and construction of fuzzy rules but are not intuitive in feature extraction with sparse representation.

Table 2 summarizes the works presented in this section, in addition to evaluating them within the context of explainable artificial intelligence (XAI) systems.

Table 2 State-of-the-art on hybrid methods using fuzzy logic systems and conventional deep models for regression problems. XAI: explainable artificial intelligence; Disc.: discussion by the authors (Yes/No); Und.: how understandable is the model, whether it is transparent (T) or opaque (O); Scope: in post hoc scope, if the model promotes local explanations (L), global explanations (G) or visual explanations (V)

Only two works discuss interpretability in Hybrid DFS, half of the discussed works are opaque, and only one presents post hoc explanations. This occurs because the combination with a DNN adds a black-box layer to the system. However, the Hybrid DFS methods categorized as transparent show that the fuzzy component can promote good interpretability with efficient input–output mappings and the construction of rules that cover the universe of discourse for a given system. Inherently interpretable methodologies are difficult to achieve only with an ensemble of multiple methods, as seen in recent studies, in addition to the challenges of reducing the time complexity of systems, which is slightly mitigated by the reduction of fuzzy rules [208].

5 Conclusion

This study surveyed the literature on deep fuzzy systems (DFSs) for regression applications with an emphasis on interpretability. For the survey, the DFSs were categorized as (i) Standard DFS and (ii) Hybrid DFS. Regarding interpretability, each method was categorized as to whether it is transparent or not and whether it has post hoc explanations (under the definition of [22]). According to the survey, Standard DFSs promote more interpretability of their models than Hybrid DFSs. Indeed, Standard DFSs are based on fundamental fuzzy logic systems, which are inherently interpretable, whereas Hybrid DFSs include conventional deep learning (DL) methods, which lack flexibility in promoting interpretability. In terms of applications, DFSs can be implemented flexibly, either in simulation or in real-time systems, and their most recurrent applications involve time-series forecasting, such as traffic flow and energy modeling (e.g., photovoltaic and wind). Furthermore, DFSs are frequently referred to as interpretable by default, but only 5 of the 23 works surveyed here actually addressed this issue. The remaining works had a common goal: to improve prediction accuracy using their proposed methods. However, this survey presented the potential of using Standard DFS as a base for developing accurate models while promoting interpretability, since hybrid models are not straightforward to interpret.