1 Introduction

We live in the Data Age, surrounded by many different sensors that measure different aspects of our lives. In this scenario, efficient data fusion techniques are becoming essential to improve our decision-making capacity in several contexts [1, 2]. Unfortunately, the real world is far from perfect, and many measurements can be missing or corrupted because of communication latency, line interruptions, sensor malfunctioning, etc. In Smart Energy Grids, various kinds of sources, such as smart meters and sensors, are used by Smart Energy Managers to make decisions, but errors in the measurement systems or in the communication are frequent [3]. The resulting missing or corrupted measurements degrade the quality of the available information and hence the analysis phase and the decisions taken. Incorrect handling of missing values, in fact, can reduce the effectiveness of downstream tasks such as forecasting [4] or anomaly detection and, in the Smart Grid context, can lead to severe failures, such as a mismatch between supply and demand caused by incorrect scheduling, or a voltage violation caused by wrong regulation [5]. In the presence of multiple time series with different characteristics, and hence different possible issues, as in the Smart Grid context, the problem of managing the data in an aggregated way emerges naturally and lies at the basis of data fusion techniques. For this reason, we need a solution that fuses information coming from different sources and that is robust to missing data, leveraging the intra- and inter-signal correlations present in the observed data. In this way we obtain a method robust to missing data for the downstream tasks and, by using more information, improve the decision-making capability.

Regarding missing data handling, a best practice used in the power industry is to apply linear interpolation for missing or corrupted values in intervals shorter than two hours. For intervals longer than two hours, a typical profile is used, built from historical data and taking into account the day of the week and the presence of holidays [6]. Many other approaches to impute missing data in time series have been proposed over the years. Some of them leverage the similarity of neighbors [7], others use Expectation-Maximization (EM) methods, autoregressive integrated moving average (ARIMA) models, or Kalman filter models [8]. More recently, Deep Learning approaches have been applied to the problem of missing data imputation in time series [9], such as Recurrent Neural Networks [10] and Generative Adversarial Networks [11].

The handling of missing data in the fusion process has been investigated in several works [12, 13]. The most common approaches are based on classical data imputation techniques, such as interpolation or averaging, applied to the time series before they are combined.

In this work we improve on these classical methods using a machine learning approach based on data history. We propose a DaI-FeO/DaI-DaO fusion architecture (following the Dasarathy classification [14]), where the data inputs, i.e., the power consumption measured by different sub-meters, are fused to build a representation robust to missing values (feature output), in order to improve downstream tasks and to provide better imputation (data output).

We focus on an Autoencoder-based model because it is a powerful and flexible approach to data fusion and, at the same time, can be designed and trained to be robust to missing data. In this way, the observed data can be represented in a lower-dimensional space that can be used for downstream tasks such as clustering, classification, and forecasting.

Fusion models based on Autoencoders have been investigated in different works [15,16,17,18], also in the energy context [19,20,21,22]. Most of them use the Autoencoder as a feature extractor embedded in a more complex architecture to perform downstream supervised tasks.

In the presence of missing or corrupted data it is important to learn a data representation that is robust to the contamination. This can be achieved using the Denoising Autoencoder [23]: when trained with a suitable masking noise, the model can learn to fill in missing values thanks to the dependencies present in the input data.

The process of denoising also has an interesting geometric interpretation based on the manifold assumption (natural high-dimensional data lie on a non-linear low-dimensional manifold). Using the denoising criterion during training, we learn to map corrupted examples (likely lying far from the manifold) to their uncorrupted versions (on the manifold). If the corrupted example is very far from the manifold, the Denoising Autoencoder must make a considerable effort to generate a correct value [24]. The use of the Autoencoder as a model to replace missing values was already suggested in [23], and more recently other works have adopted a similar approach [25].

In [17] the authors investigated a data fusion architecture for missing data imputation in which the reconstruction task benefits from the availability of other signals. Contextual information, in fact, has proven very effective in various data processing applications, such as scene understanding [26] or language modeling [27]. It was also exploited in our previous work [28], where we investigated a Bayesian approach to FeI-DeO fusion based on the Factor Graph paradigm [29,30,31], showing how to effectively manage missing or wrong values, also taking into account the reliability of selected sensors or selected measurements.

In this work, we propose a new architecture based on an Autoencoder model, convolutional layers, skip connections, and ad-hoc augmented training sets, which imputes missing data using a shared embedding space that fuses information coming from different sensors. A similar approach was investigated in [17], but our work considers more signals to fuse, employs a convolutional architecture for both the specific autoencoders and the fusion layer, and uses an ad-hoc augmented training set.

The main contribution of this work is to show that the imputation of missing data in the energy context, in the presence of multiple sensors and in the very challenging but realistic situation where large portions of the signals are missing, can be helped by a proper data augmentation scheme and by properly combining the information carried by the other signals. Moreover, the Autoencoder approach yields a compressed representation of the input signals that is robust to missing data and can be used for other downstream tasks such as clustering and forecasting. To our knowledge, no previous work has used this type of architecture and performed this type of analysis in the energy context.

The description of the architecture used for the fusion of the sensor signals is presented in Section 2. The different missing data patterns and the several types of augmentation are presented in Section 3 and Section 4, respectively. The datasets and evaluation method used in our work are presented in Section 5 and Section 6, respectively. The obtained results and discussion of experiments are presented in Section 7. Finally, in Section 8, conclusions and suggestions for further work are presented.

Fig. 1: Conceptual scheme of Data Fusion Model based on Autoencoder

2 Sensor Fusion Model

The proposed feature fusion architecture is based on the idea of exploiting the intra- and inter-modal correlations of the signals involved [17].

As depicted in Fig. 1, we have S different sensors; the generic i-th sensor produces \(n_i\) measurements (e.g., related to \(n_i\) contiguous timestamps), globally denoted as the signal \(\textbf{x}_i = [ x_{i,1}, x_{i,2}, ..., x_{i,n_i} ]\). The dataset is composed of N records: \( \{ \textbf{x}_i^{(n)} \}_{i=1:S}^{n=1:N} \).

For each record of the i-th sensor, the \(n_i\) measurements feed \(n_{E_i}\) specific Encoding Layers \(\textbf{E}_i = \{E_{i,1}, E_{i,2},... , E_{i,n_{E_i}} \}\) (blue boxes in Fig. 1).

The details of the encoding process for the generic i-th signal are described in (1), where \(\beta _{i,j}\) is the bias of the j-th Encoding Layer of the i-th signal and \(\sigma _e\) is the activation function:

$$\begin{aligned} \begin{array}{rcl} \mathbf{e_{i,1}} & = & \sigma_e (E_{i,1} \cdot \mathbf{x_i} + \beta_{i,1}) \\ \mathbf{e_{i,2}} & = & \sigma_e (E_{i,2} \cdot \mathbf{e_{i,1}} + \beta_{i,2}) \\ & \vdots & \\ \mathbf{e_{i,n_{E_i}}} & = & \sigma_e (E_{i,n_{E_i}} \cdot \mathbf{e_{i,n_{E_i}-1}} + \beta_{i,n_{E_i}}) \end{array} \end{aligned}$$
(1)

All the encoded signals \(\{ \mathbf {e_{i,n_{E_i}}} \}_{i=1}^{S}\) are concatenated as described in (2), where \([\cdot , \cdot ]\) is the concatenation operator:

$$\begin{aligned} \mathbf{f_0} = [ \mathbf{e_{1,n_{E_1}}}, \mathbf{e_{2,n_{E_2}}}, \ldots , \mathbf{e_{S,n_{E_S}}} ] \end{aligned}$$
(2)

Then \(\mathbf {f_0}\) feeds the \(n_{F}\) shared Encoding Layers \(\textbf{F} = \{F_{1}, F_{2}, ..., F_{n_{F}} \}\) (green box in Fig. 1). The fused encoding process is described in (3), where \(\beta _{f,j}\) is the bias of the j-th shared Encoding Layer and \(\sigma _f\) is the activation function:

$$\begin{aligned} \begin{array}{rcl} \mathbf{f_1} & = & \sigma_f(F_{1} \cdot \mathbf{f_0} + \beta_{f,1}) \\ \mathbf{f_2} & = & \sigma_f(F_{2} \cdot \mathbf{f_1} + \beta_{f,2}) \\ & \vdots & \\ \mathbf{f_{n_F}} & = & \sigma_f(F_{n_F} \cdot \mathbf{f_{n_F-1}} + \beta_{f,n_F}) \end{array} \end{aligned}$$
(3)

Finally, \(n_{D_i}\) specific Decoding Layers \(\textbf{D}_i = \{ D_{i,1}, D_{i,2}, ..., D_{i,n_{D_i}} \}\) (red boxes in Fig. 1) reconstruct the \(n_i\) measurements of the generic i-th sensor: \({\hat{\textbf{x}}}_i = [ \hat{x}_{i,1}, \hat{x}_{i,2}, ..., \hat{x}_{i,n_i} ]\). In detail, the encoded signal \(\mathbf {e_{i,n_{E_i}}}\) of each sensor is concatenated with the fused encoded signal \(\mathbf {f_{n_F}}\) as described in (4):

$$\begin{aligned} \mathbf {d_{i,0}} = [ \mathbf {e_{i,n_{E_i}}}, \mathbf {f_{n_F}}] \end{aligned}$$
(4)

The decoding process is described in (5), where \(\zeta _{i,j}\) is the bias of the j-th Decoding Layer of the i-th signal and \(\sigma _d\) is the activation function:

$$\begin{aligned} \begin{array}{rcl} \mathbf{d_{i,1}} & = & \sigma_d(D_{i,1} \cdot \mathbf{d_{i,0}} + \zeta_{i,1}) \\ \mathbf{d_{i,2}} & = & \sigma_d(D_{i,2} \cdot \mathbf{d_{i,1}} + \zeta_{i,2}) \\ & \vdots & \\ \mathbf{d_{i,n_{D_i}}} & = & \sigma_d(D_{i,n_{D_i}} \cdot \mathbf{d_{i,n_{D_i}-1}} + \zeta_{i,n_{D_i}}) \end{array} \end{aligned}$$
(5)

Both the \(\textbf{E}_i\) and \(\textbf{F}\) can be Dense Layers or Convolutional Layers, while the various \(\textbf{D}_i\) can be Dense Layers or Transpose Convolutional Layers. Skip connections (using concatenation, as in DenseNet [32]) are introduced to accelerate the learning process and, for each signal, the last Decoding Layer \(D_{i,n_{D_i}}\) is a dense layer with dimension \(n_i\).
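For illustration, the following is a minimal PyTorch sketch of the dense variant of this architecture. Only the structure of (1)-(5) is faithful to the model; the layer widths, depths, and ReLU activations are illustrative assumptions (the actual hyperparameters are listed in Tables 1 and 2).

```python
import torch
import torch.nn as nn

class FusionAutoencoder(nn.Module):
    def __init__(self, sensor_dims, enc_dim=16, fused_dim=8):
        super().__init__()
        # One specific encoder E_i per sensor (blue boxes in Fig. 1), eq. (1).
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(n_i, 32), nn.ReLU(),
                          nn.Linear(32, enc_dim), nn.ReLU())
            for n_i in sensor_dims)
        # Shared encoder F acting on the concatenation f_0 (green box), eqs. (2)-(3).
        self.fusion = nn.Sequential(
            nn.Linear(enc_dim * len(sensor_dims), fused_dim), nn.ReLU())
        # One specific decoder D_i per sensor (red boxes); its input is the skip
        # concatenation d_{i,0} = [e_i, f_{n_F}] of eq. (4), and its last layer
        # is dense with output size n_i, as stated in the text.
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(enc_dim + fused_dim, 32), nn.ReLU(),
                          nn.Linear(32, n_i))
            for n_i in sensor_dims)

    def forward(self, xs):  # xs: list of S tensors of shape (batch, n_i)
        es = [enc(x) for enc, x in zip(self.encoders, xs)]   # eq. (1)
        f = self.fusion(torch.cat(es, dim=1))                # eqs. (2)-(3)
        return [dec(torch.cat([e, f], dim=1))                # eqs. (4)-(5)
                for dec, e in zip(self.decoders, es)]

model = FusionAutoencoder(sensor_dims=[24, 24, 24, 24])  # 4 sensors, 24 hourly values
```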

Fig. 2: Conceptual scheme of the specialized Autoencoder for each sensor. The blue and red boxes contain, respectively, the encoder and the decoder parts

Fig. 3: Missing data patterns for a configuration composed of three sensors with six measurements each. (a) No missing data; (b) 33% of missing data randomly distributed over the measurements of the first sensor; (c) 33% of missing data in the central part of the measurements of the second sensor; (d) 33% of missing data in the last part of the measurements of the third sensor; (e) 33% of missing data in the first part of the measurements of the second sensor

The loss function is computed as the weighted sum, over the sensors, of the MSE between the reconstructed signal and the ground truth. Since each sensor can suffer from specific problems and be corrupted following a particular pattern, we define the set of indexes of the corrupted components of the i-th signal as \(\mathcal {S}_i = \{j : j \in \{1, ..., n_i\}, x_{i,j} \text { is corrupted}\}\). Hence, the loss function becomes:

$$\begin{aligned} \mathcal{L} = \sum_{i=1}^{S} w_i \cdot \Bigg( \alpha \cdot \frac{1}{\vert \mathcal{S}_i \vert} \sum_{j \in \mathcal{S}_i} (x_{i,j} - \hat{x}_{i,j})^2 + \beta \cdot \frac{1}{n_i - \vert \mathcal{S}_i \vert} \sum_{j \notin \mathcal{S}_i} (x_{i,j} - \hat{x}_{i,j})^2 \Bigg) \end{aligned}$$
(6)

where \(w_i\) is the weight related to each sensor, and \(\alpha \) and \(\beta \) are the weights for the reconstruction error on components that are, respectively, corrupted or not. If the weights \(w_i\) are equal for all sensors, and \(\alpha \) and \(\beta \) are chosen properly (e.g., \(\alpha =\vert \mathcal {S}_i \vert \) and \(\beta = n_i - \vert \mathcal {S}_i \vert \)), we obtain a value that is proportional to the global MSE computed over all measurements of all sensors.
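A minimal sketch of this loss, assuming each sensor comes with a boolean mask marking its corrupted components (True at the indexes in \(\mathcal {S}_i\)); the default \(\alpha \) and \(\beta \) match the values used in Section 7:

```python
# With alpha = |S_i| and beta = n_i - |S_i| the two means below turn back
# into sums, recovering a value proportional to the global MSE, as noted above.
def fusion_loss(xs_true, xs_hat, masks, w, alpha=0.7, beta=0.3):
    total = 0.0
    for x, x_hat, m, w_i in zip(xs_true, xs_hat, masks, w):
        err = (x - x_hat) ** 2
        corrupted = err[m].mean() if m.any() else 0.0    # mean over j in S_i
        clean = err[~m].mean() if (~m).any() else 0.0    # mean over j not in S_i
        total = total + w_i * (alpha * corrupted + beta * clean)
    return total
```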

For each sensor there is a specialized Autoencoder, as depicted in Fig. 2, composed of an Encoder and a Decoder (respectively, the blue and red boxes in Fig. 2). At the end of the training process of each Autoencoder, the learned weights of the Encoding Layers \(\{E'_i\}_{i=1}^{n_{E'_i}}\) are used as the weights (or could be used as initial weights in case of fine tuning) of the Encoding Layers of the overall architecture (blue boxes in Fig. 1). The learned weights of the Decoding Layers \(\{D'_i\}_{i=1}^{n_{D'_i}}\) are instead discarded. In this work we have used a symmetric Autoencoder (\(n_{D'_i} = n_{E'_i} - 1\)), and the numbers of neurons, or filters, used for the Encoding Layers are reused for the Decoding Layers in inverse order. When the specialized Autoencoder is convolutional, the last layer is a 1x1 CONV 1D.
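The weight-transfer step can be sketched as follows, where `pretrained_aes` is a hypothetical list of trained per-sensor Autoencoders whose `encoder` submodule mirrors the layout of the corresponding `model.encoders[i]`:

```python
# Only the encoder weights of each specialized Autoencoder (Fig. 2) are kept;
# the specialized decoder weights are discarded, as described above.
for i, ae in enumerate(pretrained_aes):
    model.encoders[i].load_state_dict(ae.encoder.state_dict())
    # For pure weight transfer, freeze; omit this loop to fine-tune instead.
    for p in model.encoders[i].parameters():
        p.requires_grad = False
```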

3 Missing data patterns

Figure 3 shows an example with three sensors with six measurements each. The first row (Fig. 3(a)) has no missing data. The other rows show typical missing patterns that can occur in real data for a single sensor: randomly distributed (Fig. 3(b)), in the central part (Fig. 3(c)), in the last part (Fig. 3(d)), in the initial part (Fig. 3(e)).

The presence of missing data in real contexts is critical, as reported in [22], where a real smart meter dataset of 50 million load measurements contains a total of 420k missing points (1% of the total), with 34k isolated missing points and 38k missing contiguous blocks. Usually, random missing patterns (Fig. 3(b)) are observed when communication or sensor issues are of brief duration (intermittent failures). Contiguous values may be missing when a sensor, or its connection, stops working for a while before reconnecting (prolonged failure) [5].

These patterns should be read as typical daily data, where the central hours may correspond to peaks of energy demand or photovoltaic energy production. Having the central part of these time series completely missing is, hence, one of the worst situations to tackle.
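For illustration, these patterns can be encoded as boolean masks over the \(n_i\) measurements of a sensor (True marks a missing sample); the helper names below are our own, and the two functions cover the random and central patterns used in the experiments:

```python
import numpy as np

def random_mask(n, frac, rng):        # Fig. 3(b): intermittent failures
    m = np.zeros(n, dtype=bool)
    m[rng.choice(n, size=int(frac * n), replace=False)] = True
    return m

def central_mask(n, frac):            # Fig. 3(c): prolonged failure at mid-day
    k = int(frac * n)
    start = (n - k) // 2
    m = np.zeros(n, dtype=bool)
    m[start:start + k] = True
    return m

rng = np.random.default_rng(0)
print(central_mask(6, 1 / 3))         # [False False  True  True False False]
```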

To solve these problems, straightforward imputation methods have been suggested, such as replacing the missing values with template values obtained from the training set, leveraging statistics of the signal estimated on historical data (e.g., average, median) [6], or using algorithms such as the popular MICE [33]. In our approach, instead, the imputation relies completely on the trained Autoencoder, which fills in the missing parts automatically.

Fig. 4: Augmented training set for a configuration composed of three sensors with six measurements each. (a) Original record; new records obtained by removing the central 33% of the measurements of: (b) the first signal; (c) the second signal; (d) the third signal

Table 1 Configuration for the tested architectures with dense shared encoder

4 Augmented training set

When solving real problems with machine learning, the training algorithms need rich data sets that contain the patterns of interest with sufficient frequency. When this is not possible, it is often necessary to use data augmentation, i.e., to enrich the training set by artificially creating the critical situations to be addressed.

Following the discussion in Section 3, we focus on missing data in the central part. To reduce the dataset shift [34] between the training data and the situations that can actually occur, we have created an augmented training set as depicted in Fig. 4: for each original record, we create S new (synthetic) records, the s-th of which has the central part of the s-th signal completely removed while the rest is kept. The original record is then used as the desired output (ground truth).
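A sketch of this augmentation scheme, reusing numpy and the `central_mask` helper from Section 3; erased values are set to zero, the convention adopted in Section 6 (`frac` is 1/3 in the example of Fig. 4 and 0.5 in our experiments):

```python
# Each original record (a list of S per-sensor arrays) yields S synthetic
# records, the s-th with the central fraction of signal s erased; the
# untouched original serves as ground truth for all of them.
def augment_record(signals, frac=0.5):
    records = [(signals, [np.zeros(len(x), dtype=bool) for x in signals])]
    for s in range(len(signals)):
        corrupted = [x.copy() for x in signals]
        masks = [np.zeros(len(x), dtype=bool) for x in signals]
        masks[s] = central_mask(len(signals[s]), frac)
        corrupted[s][masks[s]] = 0.0          # Fig. 4(b)-(d)
        records.append((corrupted, masks))
    return records  # pairs (input signals, corruption masks); target = signals
```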

In the following sections we present the results using the network architecture depicted in Fig. 1, with the main hyperparameters listed in Table 1, and trained using different types of augmentation. Each type of augmentation defines a particular model to test:

  • AE: original training set (the records as depicted in Fig. 4(a))

  • AE-A: augmented training set (the records as depicted in Fig. 4(a), (b), (c), (d))

  • AE-A-ALL: augmented training set (the records as depicted in Fig. 4(a), (b), (c), (d)) and adding new records with missing data randomly distributed for each signal (\(K_{all}\) repetitions)

  • AE-A-ONLY-SYNTH: training set composed only of the synthetic records (the records as depicted in Fig. 4(b), (c), (d))

  • AE-A-CONTIG: training set composed only of synthetic records obtained by removing contiguous samples from each signal, with the center position randomly distributed (\(K_{contig}\) repetitions)

Two other models have been tested, with their main hyperparameters listed in Table 2:

  • AE-A-ALL-CNN: augmented training set as for the model AE-A-ALL

  • AE-A-ONLY-SYNTH-CNN: training set composed as for the model AE-A-ONLY-SYNTH

The following architectures have been tested as reference models:

  • AE-S: the model based on the Stacked Sparse Autoencoder, trained using the original training set (the records as depicted in Fig. 4(a)) and with the main hyperparameters listed in Table 3. This model is similar to the one described in [17]; some choices have been made to make it comparable with the other proposed architectures (e.g., no layer-wise pretraining and no fine-tuning procedure)

  • AE-D: the model based on the Denoising Autoencoder [23], trained using the original training set (the records as depicted in Fig. 4(a)) and using Dropout on the input layer

  • IMPUTER: the Multivariate Imputation by Chained Equations (MICE) iterative imputer [33]

  • BASELINE: the baseline that substitutes each missing value with the average of the signal at that time step, computed on the training set

Table 2 Configuration for the tested architectures with convolutional shared encoder
Table 3 Configuration for the model based on Stacked Sparse Autoencoder

5 Dataset

The datasets we have used for experiments are: REFIT [35] and the "Individual household electric power consumption Data Set" in the UCI Machine Learning Repository [36].

The first dataset includes cleaned electrical consumption data in Watts for 20 households, at aggregate and appliance level, sampled every 8 seconds over the period from October 2013 to June 2015. We have focused on house 15 and the following appliances: Appliance2 (tumble dryer), Appliance3 (dishwasher), Appliance5 (computer site) and Appliance6 (television site).

The second dataset contains minute-wise power consumption measurements gathered from a house located in France between December 2006 and November 2010 (47 months) with 3 sub_meters:

  • sub_metering_1 related to the kitchen, containing mainly a dishwasher, an oven, and a microwave;

  • sub_metering_2 related to the laundry room, containing a washing-machine, a tumble-dryer, a refrigerator, and a light;

  • sub_metering_3 connected to an electric water heater and an air conditioner.

The consumption of the house related to other rooms/appliances, named sub_metering_4, has been taken into account as the difference between the global active power and the active power measured by the three sub_meters.

For both datasets we have resampled the original time series on an hourly basis, using the average as the aggregation method. Hence we have 4 sensors, each one providing 24 measurements per day. From the complete dataset we have used, for each month, \(75\%\) of the data for training and \(25\%\) for testing. In this way the training and test sets contain information coming from all available months.
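A sketch of this preprocessing for the UCI dataset with pandas, assuming a DataFrame `df` indexed by timestamp with the original column names (the conversion of the global active power from kW to Wh per minute follows the dataset documentation):

```python
import numpy as np
import pandas as pd
from collections import defaultdict

def preprocess(df):
    df = df.copy()
    # sub_metering_4 is the residual consumption: global active power (kW)
    # converted to Wh per minute, minus the three sub-meters.
    df["Sub_metering_4"] = (df["Global_active_power"] * 1000 / 60
                            - df[["Sub_metering_1", "Sub_metering_2",
                                  "Sub_metering_3"]].sum(axis=1))
    cols = [f"Sub_metering_{i}" for i in range(1, 5)]
    hourly = df[cols].resample("1h").mean()        # hourly averages
    records = {day: g.to_numpy().T                 # one (4, 24) record per day
               for day, g in hourly.groupby(hourly.index.date)
               if len(g) == 24}
    by_month = defaultdict(list)
    for day, rec in records.items():
        by_month[(day.year, day.month)].append(rec)
    train, test = [], []
    for recs in by_month.values():                 # 75/25 split within each month
        k = int(0.75 * len(recs))
        train += recs[:k]
        test += recs[k:]
    return np.array(train), np.array(test)         # shapes (N, 4, 24)
```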

Fig. 5: Reconstruction results on the UCI test set for the 4 signals for model AE. Only signal sub_metering_3 contains erasures in the central part. Reconstruction for signal: (a) sub_metering_1; (b) sub_metering_2; (c) sub_metering_3; (d) sub_metering_4

Fig. 6: Reconstruction results on the UCI test set for the 4 signals for model AE-A. Only signal sub_metering_3 contains erasures in the central part. Reconstruction for signal: (a) sub_metering_1; (b) sub_metering_2; (c) sub_metering_3; (d) sub_metering_4

6 Model evaluation

For the i-th sensor, each signal \(\textbf{x}_i\) has been normalized using standardization, i.e., by subtracting the mean and dividing by the standard deviation, both computed over all signals \(\textbf{x}_i\) in the training set.
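A minimal sketch of this normalization, where `train` and `test` have shape (N, S, \(n_i\)) as produced by the preprocessing sketch of Section 5:

```python
# Statistics are computed on the training set only and reused on the test set.
mu = train.mean(axis=(0, 2), keepdims=True)   # one mean per sensor
sd = train.std(axis=(0, 2), keepdims=True)    # one standard deviation per sensor
train_norm = (train - mu) / sd
test_norm = (test - mu) / sd
```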

After the training phase, the models have been tested to reconstruct input signals belonging to the test set with and without missing values. The desired behavior is the following:

  • For input signals without missing values, the model should reconstruct the signal as well as possible, even though it has not been seen during the training phase.

  • For input signals with missing values distributed according to the patterns depicted in Fig. 3, the model should impute the missing values, obtaining results as close as possible to the ground truth signal.

To simulate situations that can happen in real contexts, we consider erasures with random patterns (Fig. 3(b)) and central patterns (Fig. 3(c)). The other two patterns (Fig. 3(d), (e)) are similar and do not add anything to our discussion. In the following experiments the missing values have been set to a fixed value, usually zero (the average value once the signal is denormalized).

These experiments have been performed with the information coming from the other sensors either fully available or completely absent. In this way we can observe the importance of the fused representation in the shared Encoding Layers for the imputation task.

Fig. 7: Reconstruction results on the UCI test set for the 4 signals for model AE. Only signal sub_metering_3 contains random erasures. Reconstruction for signal: (a) sub_metering_1; (b) sub_metering_2; (c) sub_metering_3; (d) sub_metering_4

Fig. 8: Reconstruction results on the UCI test set for the 4 signals for model AE-A. Only signal sub_metering_3 contains random erasures. Reconstruction for signal: (a) sub_metering_1; (b) sub_metering_2; (c) sub_metering_3; (d) sub_metering_4

7 Results and discussion

The simulations have been performed with weights \(w_i = 1\) for all sensors, weight \(\alpha = 0.7\) for the corrupted points in the loss computation, and \(\beta = 1 - \alpha\). The numbers of repetitions in the data augmentation are \(K_{all} = 3\) and \(K_{contig} = 10\). The percentage of missing data in the augmented dataset is \(50\%\), and the same percentage is used as the dropout rate in AE-D.
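Putting the previous sketches together, a training step with these settings could look as follows; the optimizer, learning rate, and the `loader` iterable over the augmented training set are illustrative assumptions:

```python
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for xs_corrupted, masks, xs_true in loader:   # batches from the augmented set
    xs_hat = model(xs_corrupted)
    loss = fusion_loss(xs_true, xs_hat, masks,
                       w=[1.0] * 4, alpha=0.7, beta=0.3)
    opt.zero_grad()
    loss.backward()
    opt.step()
```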

Fig. 9: Reconstruction results on the UCI test set for the 4 signals for model AE-D. Only signal sub_metering_3 contains random erasures. Reconstruction for signal: (a) sub_metering_1; (b) sub_metering_2; (c) sub_metering_3; (d) sub_metering_4

In the following figures we show the prediction results of the AE and AE-A models on the UCI dataset, with three signals without erasures (sub_metering_1, sub_metering_2, and sub_metering_4) and the sub_metering_3 signal with half of its samples completely removed:

  • In the central part (Figs. 5 and 6).

  • In random positions (Figs. 7 and 8).

In Figs. 5, 6, 7, 8, and 9: the black solid line is the ground truth; the blue line is the result of the prediction using the ground truth as input; the black dashed line is the input containing the erasures (if any); the red line is the result of the prediction using the input containing the erasures; the cyan dashed line is the average signal computed over the whole training set.

The AE-A model has been trained on a training set augmented with \(50\%\) of the samples erased in the central zone, i.e., including signals with the 12 central hours completely erased.

Figure 6(c) shows that, using the AE-A model with the central part of the sub_metering_3 input signal completely removed, the reconstructed signal (red line) does not follow the black dashed line (the signal with erasures), but tries to follow the black solid line representing the original signal without erasures (not provided as input to the model). This means that the model is able to partially impute the missing values, rather than simply replicating the input containing the missing values as the AE model does (Fig. 5(c)). The same behavior is also observed for the AE-A-ALL, AE-A-ONLY-SYNTH, AE-A-ALL-CNN, and AE-A-ONLY-SYNTH-CNN models, not shown here for brevity.

The AE-A model leverages the behavior observed in the central area during the training phase, and hence the estimate of the average of the corrupted sub_metering_3 signal (cyan line in Fig. 6(c)). Moreover, the erasure of sub_metering_3 does not impact too negatively the reconstruction of the other signals (red and blue lines in Fig. 6(a), (b), (d)).

Figures 7(c) and 8(c) show, respectively, the results of the AE and AE-A models when 50% of the samples of sub_metering_3, randomly distributed, have been removed. For the AE model, the lack of robustness to missing values is confirmed, but now the reconstruction results of AE-A are also not so good. The reason is that the erasures can occur in configurations that the model did not see during the training process (AE-A has been trained on a dataset containing signals with only the central part deleted).

If we consider the result of the AE-D model (Fig. 9(c)), we note that the model is more robust to the erasures, as it filters them out, but the other signals are reconstructed with more difficulty.

This behavior is also confirmed if we evaluate the reconstruction error on the whole test set, as reported in Tables 4 and 5, which list the RMSE values for the models AE, AE-A, AE-A-ONLY-SYNTH, AE-A-ALL, AE-A-CONTIG, AE-A-ONLY-SYNTH-CNN, AE-A-ALL-CNN, AE-S, AE-D, IMPUTER, and BASELINE applied to the UCI and REFIT datasets, respectively. The evaluation is performed on test set records with 50% of the i-th signal, in the central part, completely removed. The RMSE between the reconstructed and ground truth signals (without erasures) of the test set is computed considering only the samples that have been erased.
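The metric can be sketched as:

```python
# RMSE between reconstruction and ground truth, restricted to the erased
# positions of the i-th signal and computed over the test set.
def rmse_on_erased(x_true, x_hat, mask):
    diff = (x_true - x_hat)[mask]            # only the erased samples
    return float(torch.sqrt((diff ** 2).mean()))
```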

Table 4 Reconstruction error, only on the erased samples, on the test set of the UCI dataset for the considered models. One half of the input signal, in the central part, has been erased before being presented to the network. Column S_i stands for Sub_metering_i
Table 5 Reconstruction error, only on the erased samples, on the test set of the REFIT dataset for the considered models. One half of the input signal, in the central part, has been erased before being presented to the network. Column A_i stands for Appliance_i
Table 6 Reconstruction error, only on the erased samples, on the test set of the UCI dataset for the considered models. One half of the input signal, in random positions, has been erased before being presented to the network. Column S_i stands for Sub_metering_i
Table 7 Reconstruction error, only on the erased samples, on the test set of the REFIT dataset for the considered models. One half of the input signal, in random positions, has been erased before being presented to the network. Column A_i stands for Appliance_i

In particular, the AE model is confirmed not to be a good solution for obtaining a representation of the input data that is robust to missing data in the central part.

For the UCI dataset, the models with shared convolutional layers are the best choices. AE-A-ONLY-SYNTH-CNN outperforms the other models (except for Sub_metering_1 and Sub_metering_3, where it is the second best model after AE-A-ONLY-SYNTH and AE-A-ALL-CNN, respectively), while AE-A-ALL-CNN outperforms the other models for Sub_metering_3 and is the second best model, after AE-A-ONLY-SYNTH-CNN, for Sub_metering_2 and Sub_metering_4. Moreover, the AE-A-ALL-CNN model presents an average improvement of about 3% over IMPUTER, about 19% over AE, and about 13% over AE-D.

Fig. 10: Reconstruction error, only on the central erased samples, for (a) the UCI dataset and (b) the REFIT dataset, without data augmentation (AE), with data augmentation and the dense shared Encoder (AE-A-ALL), and with data augmentation and the convolutional shared Encoder (AE-A-ALL-CNN)

Also for the REFIT dataset, the models with shared convolutional layers are the best choices. AE-A-ALL-CNN outperforms the other models (except for Appliance_2), and AE-A-ONLY-SYNTH-CNN is among the top 5 models (except for Appliance_3). The AE-A-ALL-CNN model shows an average improvement of about 12% over IMPUTER, about 22% over AE, and about 19% over AE-D. These results show how a properly augmented dataset, with missing values distributed as expected in a real situation (in this case, in the central part), together with a convolutional fusion layer, is preferable to the other presented solutions. In particular, focusing only on the two most widely employed approaches (AE-D and IMPUTER), when one half of the input signal, in the central part, is completely erased, AE-A-ALL-CNN improves the imputation capability by about 12% on average.

If the erasure pattern changes, for instance when the samples are removed randomly (Tables 6 and 7), we can observe an interesting behavior. For the UCI dataset, AE-A-ALL-CNN is the second best model after IMPUTER, while on the REFIT dataset it outperforms the other approaches (except for Appliance_2). Moreover, for the UCI dataset the AE-A-ALL-CNN model presents an average improvement of about 19% over AE and about 5% over AE-D, and for the REFIT dataset it presents an average improvement of about 7% over IMPUTER, about 19% over AE, and about 13% over AE-D.

These simulations seem to confirm that training the model on inputs modified with the same percentage and missing pattern observable in the real context helps the imputation of missing values. Prior information on the missing data process can help construct a more robust representation. Often this information is not available, and in that case we can augment the dataset by adding records containing erasures distributed according to the several patterns that may occur, as done for the AE-A-ALL and AE-A-ALL-CNN models. The latter emerges as a very good model in both situations, i.e., when the missing values are concentrated in the central part and when they are randomly distributed.

These results suggest that, by properly designing the data augmentation phase, we can make the representation more robust to some missing patterns than to others. In this way we can, for example, impute missing data that follow a particular missing pattern while remaining transparent to other types of corruption that need to be “transmitted” to the following task in the pipeline (e.g., an anomaly detector).

To assess the impact of the augmentation and of the convolutional neural network as shared encoder, Fig. 10 shows the imputation results with the classical Autoencoder approach, then with the data augmentation introduced, and finally with the convolutional fusion layer. From the graph it is evident that the proposed solutions improve the imputation capability of the architecture and that both the convolutional shared encoder and the augmentation play an important role in the final results.

The architecture is also built to take advantage of the other available signals. In the following we assess the role of the other (uncorrupted) signals in the reconstruction capability of the proposed architecture. Figures 11 and 12 show how the RMSE values for AE, AE-A, AE-A-ONLY-SYNTH, AE-A-ALL, AE-A-CONTIG, AE-A-ONLY-SYNTH-CNN, AE-A-ALL-CNN, and AE-D vary when the i-th signal has 50% of its samples completely removed in the central part and the other signals are progressively removed completely. These figures confirm that the use of the other signals can improve the imputation performance of the considered models, even though some signals are more sensitive than others, probably because their correlation with the other signals is not very high. Usually, the lowest RMSE value for all four signals is obtained when the other three signals are available (0 signals erased) and the model can leverage their information. As the information coming from the other signals is removed, the reconstruction error on the i-th signal increases, confirming the importance of data fusion in the imputation of missing values.
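This experiment can be sketched as follows, reusing the helpers above, with `xs` a list of test tensors of shape (batch, \(n_i\)):

```python
# The i-th signal keeps its 50% central erasure while the other signals are
# completely zeroed out, one more at a time; the RMSE on the erased samples
# of signal i is recorded at each step (Figs. 11 and 12).
def progressive_removal(model, xs, i, frac=0.5):
    mask = torch.as_tensor(central_mask(xs[i].shape[1], frac))
    full_mask = mask.expand(xs[i].shape[0], -1)   # same erasure for every record
    others = [j for j in range(len(xs)) if j != i]
    scores = []
    with torch.no_grad():
        for k in range(len(others) + 1):          # k = number of other signals erased
            corrupted = [x.clone() for x in xs]
            corrupted[i][:, mask] = 0.0           # central erasure on signal i
            for j in others[:k]:
                corrupted[j].zero_()              # signal j completely removed
            x_hat = model(corrupted)[i]
            scores.append(rmse_on_erased(xs[i], x_hat, full_mask))
    return scores
```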

Fig. 11: Impact of the other signals on the reconstruction of each signal for the UCI dataset

Fig. 12: Impact of the other signals on the reconstruction of each signal for the REFIT dataset

8 Conclusion

A robust method to fill in missing data is important in the IoT context, and in particular when information coming from several sensors is used for decision making, as in the Smart Energy Grid. In this work we have proposed an Autoencoder-based data fusion architecture that achieves this objective. The model is completely data-driven and leverages information coming from the other sensors to improve the imputation performance. Our technique has been tested in very challenging situations where the most important part of a signal may be completely lost. We have shown that a dedicated data augmentation phase is a crucial step in making the Autoencoder representation robust to missing patterns, and specific augmentation patterns could be used to make this paradigm very versatile. The proposed approach, like any data-driven solution, depends on the quality and quantity of the training data: if the training data are biased, or not representative of the population or of the missing data pattern, the imputations generated by the model may also be biased or inaccurate. In future work we will evaluate the introduction of an attention mechanism into the model, to improve its ability to focus on the parts of the input data most helpful for the imputation capability of the overall architecture.