Introduction

Non-Intrusive Load Monitoring (NILM) (Hart, 1992) describes a source separation problem: the energy usage of single appliances is inferred from the aggregated load of the household measured at the household connection point (mains) (Mauch & Yang, 2016). Another term for NILM is energy disaggregation; in this abstract, we call a technique that implements NILM a disaggregator. Visualizing energy usage with NILM techniques raises awareness of energy consumption without the need for individual meters for each household appliance. However, whether this facilitates energy efficiency and reduces energy cost is disputed (Kelly & Knottenbelt, 2016).

Inspired by the successes of Deep Neural Networks (DNNs) in computer vision, audio, and natural language processing, researchers have applied DNNs to NILM (Kelly & Knottenbelt, 2015a; Mauch & Yang, 2015; do Nascimento, 2016; Zhang et al., 2016; Barsim & Yang, 2018), an approach Kelly termed Neural NILM (Kelly & Knottenbelt, 2015a). Recently, Bonfigli et al. (Bonfigli et al., 2018) showed that Kelly's Neural NILM approach can outperform state-of-the-art NILM approaches that are not based on DNNs, such as Additive Factorial Approximate Maximum A Posteriori estimation (AFAMAP) by Kolter and Jaakkola (Kolter & Jaakkola, 2012).

(Fig. 1) depicts how Neural NILM disaggregation is performed. Assume we have recorded c electrical features (channels) from mains at a fixed temporal resolution over a limited period of time, so that we obtain a history of T measurements. The measured values \( L^M \in \mathbb{R}^{c \times T} \) then form a time series with c channels. Current Neural NILM approaches split this time series into segments of fixed length S and run the disaggregation once per segment. The partial disaggregation results of all segments then have to be merged to form the final result. Neural NILM approaches usually perform the splitting with overlapping sliding windows.
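The split-and-merge scheme described above can be sketched as follows. This is a minimal single-channel (c = 1) illustration, assuming the disaggregator is any function that maps a window of S mains samples to S load estimates and that overlapping estimates are merged by averaging (the merging strategy Kelly uses); the function names are hypothetical.

```python
from typing import Callable, List


def window_starts(num_samples: int, length: int, stride: int) -> List[int]:
    """Start indices of overlapping windows of `length` samples, `stride` apart."""
    if length > num_samples:
        raise ValueError("series is shorter than one window")
    starts = list(range(0, num_samples - length + 1, stride))
    # Add a final window if the stride does not land exactly on the tail.
    if starts[-1] != num_samples - length:
        starts.append(num_samples - length)
    return starts


def disaggregate_by_averaging(
    mains: List[float],
    disaggregator: Callable[[List[float]], List[float]],
    length: int,
    stride: int,
) -> List[float]:
    """Run a sequence-to-sequence disaggregator on every window and merge the
    overlapping partial estimates by averaging them per time step."""
    total = [0.0] * len(mains)
    count = [0] * len(mains)
    for start in window_starts(len(mains), length, stride):
        estimate = disaggregator(mains[start:start + length])
        for offset, value in enumerate(estimate):
            total[start + offset] += value
            count[start + offset] += 1
    return [s / c for s, c in zip(total, count)]


# Toy usage with a hypothetical "disaggregator" that halves each mains window.
mains = [float(v) for v in range(10)]
estimate = disaggregate_by_averaging(mains, lambda w: [0.5 * x for x in w], length=4, stride=2)
```

Kelly's setup uses a stride of 16 samples; the toy values above are only chosen to keep the example small.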

Fig. 1

Information Flow in Neural NILM: The load profile of mains is split into sequences and fed to appliance-specific disaggregators. Later, the partial results have to be merged to form the final result. The gray area highlights the additional generator Ga of the Generative Adversarial Network that we use as appliance load sequence generator in our Neural NILM approach

For each appliance type a, a specific disaggregator Ya is used. This is in contrast to traditional NILM approaches (cf. (Kolter & Jaakkola, 2012; Zeifman & Roth, 2011; Zoha et al., 2012)) where appliance models are merged into a household model before disaggregation is conducted.

Analysis

The quality of NILM approaches can be assessed by two criteria: first, whether the disaggregator correctly detects the time intervals in which the target appliance consumes energy; second, how precisely the disaggregator reproduces the shape of the target appliance load.

With regard to the first criterion, Kelly’s denoising autoencoder (Kelly & Knottenbelt, 2015a) already achieves good results. In most cases, his approach can correctly identify and localize the energy consumption of the target appliance within the aggregated load sequence. However, with regard to the second criterion, the autoencoder has noticeable difficulties.

(Fig. 2) shows the disaggregation result of the washing-machine autoencoder on a window of test data. We show washing machine load sequences because they are complex and consist of multiple stages (heating, washing, spinning, rinsing). Kelly's approach splits mains into input sequences using a sliding window with a stride of 16 samples and applies the autoencoder to each sequence (cf. (Fig. 1)). In (Fig. 2), we see that the disaggregated estimate (left plot) differs from reasonably-shaped appliance load sequences such as the measured appliance load. Kelly merges the partial disaggregation results of the sliding windows by averaging. Zhang et al. (Zhang et al., 2016) criticize this practice and propose that the DNN should estimate only a single time point (Sequence-to-Point) instead of a whole target sequence (Sequence-to-Sequence). This eliminates the need to merge multiple estimates for one point in time.
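To make the contrast with averaging concrete, the following sketch shows the Sequence-to-Point idea: one window per output time step, with the model returning a single estimate for that step. It assumes the estimated point is the window's midpoint and that the series is zero-padded at both ends so boundary samples also get a centred window; both choices, and the function names, are illustrative assumptions.

```python
from typing import Callable, List


def disaggregate_seq2point(
    mains: List[float],
    point_model: Callable[[List[float]], float],
    length: int,
) -> List[float]:
    """Slide a window with stride 1 and let the model estimate only the load at
    the window's midpoint. Each time step receives exactly one estimate, so no
    merging of overlapping outputs is needed."""
    if length % 2 == 0:
        raise ValueError("use an odd window length so the midpoint is unique")
    half = length // 2
    # Zero padding (an assumption) keeps the output aligned with the input.
    padded = [0.0] * half + list(mains) + [0.0] * half
    return [point_model(padded[t:t + length]) for t in range(len(mains))]


# Toy usage: a hypothetical point model that halves the midpoint sample.
estimate = disaggregate_seq2point(
    [1.0, 2.0, 3.0, 4.0], lambda w: 0.5 * w[len(w) // 2], length=3
)
```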

Fig. 2

Application of the disaggregation approaches on an exemplary appliance load sequence of the washing machine from the test data set. The output of the Kelly’s autoencoder is compared to the output of our DC-GAN based approach

To conclude our analysis, we observe that Kelly's Neural NILM approach successfully decides whether the target appliance is active in the aggregate load and is able to localize it, whereas it performs poorly when the exact appliance load must be estimated. From a human perspective, the result does not look like a reasonably-shaped and valid appliance load sequence.

Concept

We propose to mitigate the problem stated in the previous section by using a generative neural model for appliance load sequence generation. We pre-train this model using a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) architecture and integrate it into the Neural NILM disaggregation process.

The functional principle of a GAN is depicted in (Fig. 3). A GAN consists of two neural networks, a generator G and a discriminator D. During disaggregation, we want G to generate load sequences La of a specific appliance a, such that the distribution of the generated appliance load sequences La matches the distribution of measured appliance load sequences \( {L}_a^M \) as closely as possible. For the generation process, G uses a source of randomness Z to express the variations in the distribution of \( {L}_a^M \). The dimensionality z of Z should be high enough to portray all the variations that real appliance load sequences may exhibit; we empirically choose z = 100 as an upper bound for the number of variance dimensions. During training, the inputs to the discriminator D are real appliance load sequences observed in the training data (\( {L}_a^M \)) as well as appliance load sequences generated by G (La). D's objective is to determine whether a load sequence was drawn from the training data (V ≔ 1) or generated by G (V ≔ 0).

Fig. 3

A Generative Adversarial Network to generate appliance load sequences

If the GAN training converges, both D and G internalize the distribution of the training data implicitly. Then, Z can be interpreted as a latent representation of an appliance load sequence. G and D are trained simultaneously in an unsupervised manner, where they play a minimax game against each other, hence the name Adversarial Networks. The objective of G is to deceive D, i.e. to generate data samples which make D believe that they were drawn from the real data set. D, on the contrary, strives to classify the data samples generated by G as fake samples and the data samples drawn from the training data set as real samples.
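For reference, the minimax game can be written with the value function of Goodfellow et al. (Goodfellow et al., 2014), expressed here in this section's notation. D(·) denotes D's estimated probability that its input is real (V ≔ 1), and \( p_Z \) is the prior distribution of the noise Z (commonly a standard normal or uniform distribution; the choice is an assumption here):

$$ \min_{G}\max_{D}\;\mathbb{E}_{L_a^M\sim p_{\mathrm{data}}}\left[\log D\left(L_a^M\right)\right]+\mathbb{E}_{Z\sim p_Z}\left[\log\left(1-D\left(G(Z)\right)\right)\right] $$

The inner maximization is equivalent to training D with binary cross entropy against the labels V ≔ 1 for real and V ≔ 0 for generated sequences, while G minimizes the second term, i.e. tries to make D(G(Z)) close to 1.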

To provide an intuition for the proposed approach, we apply the manifold assumption to appliance load sequences: we assume that reasonably-shaped appliance load sequences lie on a connected low-dimensional manifold embedded in \( \mathbb{R}^S \), where S is the length of the load sequences that each disaggregation step outputs.

Training the generator in the GAN architecture ensures that, with high probability, the output of the generator lies on the manifold of appliance load sequences. By integrating the pre-trained generator into the disaggregation process, we force the output of the disaggregator onto the manifold of reasonably-shaped load sequences.

As depicted in (Fig. 1), our approach consists of two main components, a disaggregator Ya and a generator Ga for a specific appliance a. During training, Ga learns a self-defined latent representation of the variations in the appliance load sequences. Ga is then used to map from that latent representation into the space of reasonably-shaped appliance load sequences.

Compared to previous Neural NILM approaches, the disaggregator Ya is relieved of the task of generating appliance load sequences. It can focus on the detection and representation tasks, which existing Neural NILM approaches already perform sufficiently well.
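The division of labour between Ya and the frozen Ga can be illustrated with a deliberately tiny, hypothetical sketch. In the real approach Ga is a pre-trained GAN generator and Ya a neural network; here both are replaced by one-parameter functions so that the gradient can be written by hand. The sketch only shows two properties described above: gradients of the mean squared error update Ya alone, and every estimate Ga(Ya(x)) lies in the range of Ga. All names and data are illustrative assumptions.

```python
from typing import List, Sequence

# Hypothetical stand-in for the pre-trained generator Ga: a one-dimensional
# latent code z scales a fixed activation shape. It is never updated below.
TEMPLATE = [0.0, 1.0, 2.0, 1.0, 0.0]


def generator(z: float) -> List[float]:
    return [z * v for v in TEMPLATE]


def disaggregator(window: Sequence[float], w: float) -> float:
    # Hypothetical stand-in for Ya: maps a mains window to a latent code.
    return w * sum(window)


def train_disaggregator(windows, targets, lr: float = 0.01, epochs: int = 200) -> float:
    """Fit only the disaggregator weight w by gradient descent on the mean
    squared error between Ga(Ya(x)) and the measured appliance load."""
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(windows, targets):
            s = sum(x)
            y_hat = generator(disaggregator(x, w))
            # Chain rule through the frozen generator:
            # d/dw mean((w*s*T[t] - y[t])^2) = 2/S * sum((y_hat[t] - y[t]) * s * T[t])
            grad = 2.0 / len(y) * sum(
                (yh - yt) * s * tv for yh, yt, tv in zip(y_hat, y, TEMPLATE)
            )
            w -= lr * grad
    return w


# Toy data: the true appliance load is 0.5 * sum(mains window) times the template.
windows = [[1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 1.0, 0.0]]
targets = [generator(0.5 * sum(x)) for x in windows]
w = train_disaggregator(windows, targets)
estimate = generator(disaggregator([2.0, 2.0, 0.0, 0.0, 0.0], w))
```

Whatever mains window is fed in, the estimate is always a scaled copy of the generator's shape, which is the toy analogue of forcing the output onto the manifold of reasonably-shaped load sequences.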

In contrast to the works of Barker et al. (Barker et al., 2013) and Buneeva and Reinhardt (Buneeva & Reinhardt, 2017), this approach does not need manual engineering of the characteristics of appliance load sequences. Instead, our approach relies on the ability of DNNs to find load sequence characteristics automatically.

Energy-based performance evaluation metrics

To compare different NILM approaches, we need informative metrics that capture specific performance aspects of these approaches. Binary classification metrics are very commonly used in NILM literature (Kelly & Knottenbelt, 2015a; Barsim & Yang, 2018; Bonfigli et al., 2015; Makonin & Popowich, 2015; Faustine et al., n.d.). The common practice is to quantize both the appliance load ground truth and the estimate using appliance-specific on/off thresholds. Unfortunately, these thresholds allow recall to be traded off against precision and lead to results that are hard to compare across NILM approaches. Moreover, the quantization discards the detailed load shape, so the metric does not take into account whether the shape of the estimated load matches the shape of the ground truth. Therefore, Bonfigli et al. (Bonfigli et al., 2018) propose energy-based precision and recall scores based on the correctly estimated amount of energy in each time interval. We generalize this idea and establish the complete energy-based binary confusion matrix as follows:

Let \( y^{max} > 0 \) be the upper load limit of the appliance, \( y(t)\ge 0 \) be the true appliance load at time t and \( \widehat{y}(t)\ge 0 \) be the load estimate at time t. Then the elements of the confusion matrix are:

$$ {\displaystyle \begin{array}{cc}T{P}^E={\sum}_{t=1}^T\min \left(\widehat{y}(t),y(t)\right)\kern0.75em ,& F{P}^E={\sum}_{t=1}^T\max \left(\widehat{y}(t)-y(t),0\right)\kern0.75em ,\\ {}F{N}^E={\sum}_{t=1}^T\max \left(y(t)-\widehat{y}(t),0\right)\kern0.75em ,& T{N}^E={\sum}_{t=1}^T\min \left({y}^{max}-\widehat{y}(t),{y}^{max}-y(t)\right)\kern0.75em .\end{array}} $$
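The four sums translate directly into code. The sketch below is a plain-Python rendering of the definitions above; the function name, the `NamedTuple`, and the example loads (in watts) are illustrative assumptions.

```python
from typing import List, NamedTuple


class EnergyConfusion(NamedTuple):
    tp: float
    fp: float
    fn: float
    tn: float


def energy_confusion_matrix(
    y_true: List[float], y_est: List[float], y_max: float
) -> EnergyConfusion:
    """Energy-based confusion matrix: sums over time steps, no on/off threshold."""
    if len(y_true) != len(y_est):
        raise ValueError("ground truth and estimate must have the same length")
    pairs = list(zip(y_true, y_est))
    tp = sum(min(e, y) for y, e in pairs)               # correctly estimated energy
    fp = sum(max(e - y, 0.0) for y, e in pairs)         # over-estimated energy
    fn = sum(max(y - e, 0.0) for y, e in pairs)         # missed energy
    tn = sum(min(y_max - e, y_max - y) for y, e in pairs)
    return EnergyConfusion(tp, fp, fn, tn)


cm = energy_confusion_matrix([0.0, 100.0, 200.0, 0.0], [10.0, 80.0, 250.0, 0.0], y_max=300.0)
print(cm)  # -> EnergyConfusion(tp=280.0, fp=60.0, fn=20.0, tn=840.0)
```

A useful sanity check follows from the definitions: for each time step the four terms add up to exactly \( y^{max} \) (provided \( \widehat{y}(t) \le y^{max} \)), so the matrix always sums to T·\( y^{max} \), here 4 · 300 = 1200.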

Now we can define arbitrary energy-based binary classification metrics that do not need an appliance-specific on/off threshold. Energy-based precision \( P^E \), recall \( R^E \) and F1-score \( F_1^E \) can be determined as follows:

$$ {P}^E=\frac{\sum_{t=1}^T\min \left(\widehat{y}(t),y(t)\right)}{\sum_{t=1}^T\widehat{y}(t)}\kern0.5em ,\kern1.5em {R}^E=\frac{\sum_{t=1}^T\min \left(\widehat{y}(t),y(t)\right)}{\sum_{t=1}^Ty(t)}\kern0.5em ,\kern1.5em {F}_1^E=2\cdotp \frac{P^E\cdotp {R}^E}{P^E+{R}^E}\kern0.5em . $$
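The three formulas can be computed directly from the load series. In this minimal sketch (function name and example loads are assumptions), the numerator is \( TP^E \), and the denominators equal \( TP^E + FP^E \) and \( TP^E + FN^E \), respectively.

```python
from typing import List, Tuple


def energy_precision_recall_f1(
    y_true: List[float], y_est: List[float]
) -> Tuple[float, float, float]:
    """Energy-based precision P^E, recall R^E and F1-score F1^E."""
    correct = sum(min(e, y) for y, e in zip(y_true, y_est))  # TP^E
    precision = correct / sum(y_est)   # total estimated energy = TP^E + FP^E
    recall = correct / sum(y_true)     # total true energy = TP^E + FN^E
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = energy_precision_recall_f1([0.0, 100.0, 200.0, 0.0], [10.0, 80.0, 250.0, 0.0])
```

Here \( P^E \) = 280/340, \( R^E \) = 280/300 and \( F_1^E \) = 0.875. The sketch raises `ZeroDivisionError` when either the estimate or the ground truth contains no energy at all; how such windows should be scored is left open.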

As Barsim and Yang (Barsim & Yang, 2018) point out, the F1-score does not account for true negatives; they therefore propose to use the Matthews Correlation Coefficient (MCC). An energy-based counterpart of MCC can be derived analogously.

Another metric that is able to cope with data imbalances is the balanced accuracy (BACC). Energy-based BACC is defined as follows:

$$ BAC{C}^E=\frac{1}{2}\cdotp \left(\frac{T{P}^E}{T{P}^E+F{N}^E}+\frac{T{N}^E}{T{N}^E+F{P}^E}\right)\kern1em . $$
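Both metrics that use the true negatives can be computed from the four energy-based confusion-matrix elements. For \( MCC^E \), the sketch assumes that "derived analogously" means plugging \( TP^E \), \( FP^E \), \( FN^E \) and \( TN^E \) into the standard MCC formula; the function names are hypothetical.

```python
import math


def energy_bacc(tp: float, fp: float, fn: float, tn: float) -> float:
    """Energy-based balanced accuracy BACC^E, as defined above."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def energy_mcc(tp: float, fp: float, fn: float, tn: float) -> float:
    """Standard MCC formula applied to energy-based elements (assumed MCC^E)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


# Example elements: TP^E = 280, FP^E = 60, FN^E = 20, TN^E = 840.
bacc = energy_bacc(280.0, 60.0, 20.0, 840.0)
mcc = energy_mcc(280.0, 60.0, 20.0, 840.0)
```

Because \( TN^E \) grows with \( y^{max} \), both scores depend on the chosen upper load limit, so comparisons should use the same \( y^{max} \) per appliance. \( BACC^E \) is undefined for windows in which the appliance consumes no energy (\( TP^E + FN^E = 0 \)).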

Results

We evaluate our approach on the UK-DALE data set (Kelly & Knottenbelt, 2015b), which consists of electric meter recordings of up to 1.8 years' duration from 5 households, sampled at 1/6 Hz. We use the same pre-processing, artificial data augmentation approach, and partitioning into training, validation and test folds as described in (Kelly & Knottenbelt, 2015a). Based on Kelly's own re-write of his denoising autoencoder, we re-implemented the neural networks using PyTorch. Our first GAN implementation is based on the Deep Convolutional GAN (DC-GAN) topology by Radford et al. (Radford et al., 2015). The generator and the discriminator each contain five convolutional layers and one fully-connected layer. The generator uses transposed convolutional layers, which mirror the convolutions of the discriminator. For the disaggregator's topology, we replaced the last layer of Kelly's autoencoder (Kelly & Knottenbelt, 2015a) so that it maps into the z-dimensional latent space. The loss function is binary cross entropy for the discriminator and mean squared error for the disaggregator. We use the Adam optimizer (Kingma & Ba, 2014) to train the generator and discriminator, and Stochastic Gradient Descent with Nesterov momentum to train the disaggregator.

At first, we tried to train DC-GAN with appliance load data, where each training sample contained an arbitrarily placed load sequence. The training did not converge properly and the DC-GAN could only output sequences with zero load. To mitigate this mode collapse, we trained the DC-GAN only on load sequences which contained a complete appliance activation cycle.
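Selecting training windows that contain a complete activation cycle can be sketched as a simple filter. The criterion below (appliance off at both window edges and on somewhere in between) and the `on_threshold` parameter are hypothetical choices for illustration, not necessarily the selection rule used in our experiments.

```python
from typing import List


def contains_complete_activation(window: List[float], on_threshold: float) -> bool:
    """True if the appliance is off at both ends of the window and on somewhere
    in between, i.e. at least one activation starts and ends inside the window."""
    if not window:
        return False
    off_at_start = window[0] < on_threshold
    off_at_end = window[-1] < on_threshold
    active_inside = any(v >= on_threshold for v in window)
    return off_at_start and off_at_end and active_inside


# Keep only windows with a complete activation for GAN training (toy data, watts).
windows = [[0.0, 0.0, 500.0, 2000.0, 0.0], [2000.0, 2000.0, 0.0], [0.0, 0.0, 0.0]]
training_windows = [w for w in windows if contains_complete_activation(w, on_threshold=20.0)]
```

Note that a filter like this removes all zero-load windows from the training data, which is consistent with the limitation discussed next.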

(Fig. 2) shows an example output of our DC-GAN-based disaggregator compared with Kelly's autoencoder (Kelly & Knottenbelt, 2015a), both evaluated on a single observation window. As can be seen, our approach has the potential to reproduce appliance load sequences more accurately than the autoencoder: because the generator has learned to output only valid load sequences, its output is more consistent. However, when we compare the F1 and BACC metrics in (Fig. 4), the overall performance of our DC-GAN-based disaggregator is worse than that of the autoencoder. Since we were forced to train the DC-GAN on complete appliance activation cycles, one cause of the worse performance is its inability to output sequences with zero load. To solve this problem, we applied the Auxiliary Classifier GAN (AC-GAN) (Odena et al., 2016). AC-GAN is an extension of GAN in which the generator is conditioned on additional class information; we supply the information whether the load sequence has zero load. The F1-scores in (Fig. 4) show that our AC-GAN-based approach can improve disaggregation of washing machines in buildings 2 and 5. In building 1, however, it did not outperform Kelly's autoencoder. Also, the balanced accuracy scores do not show a clear advantage of our approach.
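The zero-load conditioning can be sketched as follows: each training window receives a binary class label, and the generator input is the noise vector Z (z = 100) concatenated with a one-hot encoding of that label. Concatenating a one-hot label and drawing Z from a standard normal distribution are common conventions assumed here, and the `tolerance` parameter is a hypothetical knob for noisy measurements.

```python
import random
from typing import List, Optional


def zero_load_label(window: List[float], tolerance: float = 0.0) -> int:
    """Auxiliary class for the AC-GAN: 0 for a zero-load window, 1 otherwise."""
    return int(any(v > tolerance for v in window))


def generator_input(label: int, z_dim: int = 100,
                    rng: Optional[random.Random] = None) -> List[float]:
    """Concatenate a noise vector Z of dimension z_dim with a one-hot class label."""
    if rng is None:
        rng = random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
    one_hot = [1.0 if i == label else 0.0 for i in range(2)]
    return noise + one_hot


# Conditioning the generator on "has load" for one training window (toy data).
window = [0.0, 0.0, 1800.0, 2000.0, 0.0]
x = generator_input(zero_load_label(window), rng=random.Random(0))
```

In an AC-GAN, the discriminator additionally predicts this class label, so requesting label 0 at disaggregation time gives the generator a way to produce zero-load sequences.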

Fig. 4

Energy-based F1 and balanced accuracy scores for the proposed and Kelly’s (Kelly & Knottenbelt, 2015a) Neural NILM approaches for the appliances washing machine and fridge. The approaches were only trained on the buildings with solid bars, i.e., training did not use data of building 2 for the washing machine model and building 5 for the fridge model

Conclusion

In this work, we analyzed Kelly's Neural NILM approach and noticed that it has difficulties reproducing reasonably-shaped appliance load sequences. Based on this insight, we proposed to integrate the generator of a Generative Adversarial Network into the Neural NILM disaggregation process to support a more accurate reproduction of appliance load sequences. To this end, we stated the manifold assumption for appliance load sequences and generalized energy-based NILM performance metrics by defining the complete energy-based confusion matrix. We presented preliminary results of our ongoing research, which do not yet provide strong evidence that our approach effectively improves Neural NILM. However, we see promising indications of the potential of the proposed approach.