NILM as a regression versus classification problem: the importance of thresholding

Non-Intrusive Load Monitoring (NILM) aims to predict the status or consumption of domestic appliances in a household only by knowing the aggregated power load. NILM can be formulated as a regression problem or, more often, as a classification problem. Most datasets gathered by smart meters allow one to define naturally a regression problem, but the corresponding classification problem is a derived one, since it requires a conversion from the power signal to the status of each device by a thresholding method. We examine three different thresholding methods to perform this task, discussing their differences on various devices from the UK-DALE dataset. We analyze the performance of deep learning state-of-the-art architectures on both the regression and classification problems, introducing criteria to select the most convenient thresholding method.


Introduction
Non-Intrusive Load Monitoring (NILM) was proposed in 1992 by G. W. Hart [1] as a method to predict whether an individual appliance is active or inactive by observing the total aggregated power load, given information on the nominal consumption of each appliance in each state. The first approach to NILM employed Combinatorial Optimization, which at the time was the standard technique for disaggregation problems. For a historical review of the evolution of NILM techniques, see for instance [2,3,4]. This first approach had a major shortcoming: Combinatorial Optimization performed the power disaggregation at each instant independently of the others, without considering the load evolution through time. These original algorithms were very sensitive to noise and were only accurate for houses with very few appliances. Thus, they could not be applied to real-life scenarios.
NILM algorithms received renewed attention at the beginning of the 21st century, mostly thanks to the increased number of datasets coming from smart electric meters installed in domestic residences. These meters are able to record the power load from a household at short time intervals (one hour or less) and send those values to the electric company [5]. The first open-source NILM datasets were published in 2011, and they triggered further research activity by setting benchmarks for the comparison of models. These datasets stored high-frequency time series for both aggregated and appliance power load [6,7,8]. For a recent comprehensive review on NILM datasets, see [9,10]. Many of them are publicly available in the NILM-EU wiki. Soon after these datasets became available, the prevailing approach to NILM shifted from the combinatorial optimization problem mentioned above to a supervised learning problem with time series [11,12]. Traditional machine learning methods such as Hidden Markov Models were initially used [13,14,15,16], while in recent years deep learning algorithms have come to dominate the field [17,18,19,20,21,22,23].
Much of the recent effort of the NILM research community has been focused on improving the efficiency of prediction algorithms and computational speed, usually by presenting novel model architectures [24,22] or even trying unconventional approaches such as finite state machines [25] and hybrid programming [26].
However, fewer works are devoted to investigating the formulation of the problem, and how this can affect the algorithm's performance. For instance, uniformization of evaluation metrics for easy comparison of different models has been stressed in [27,28], while the dependence on sampling frequency has been treated in [29]. This becomes a relevant issue when trying to leverage theoretical NILM results on high frequency benchmark datasets to real life scenarios where records are sampled at lower rates.
NILM datasets typically include both the aggregated power load and that of each monitored device, but not the device status (i.e. whether it is ON or OFF). Thus, a regression problem to predict the consumption of each device is naturally defined by the data. However, most works in NILM address the classification problem of determining whether the device is ON or OFF, rather than its consumption at each time interval. Defining a classification problem requires establishing a threshold or some procedure to determine the output categorical variable from the continuous output power load. Our main observation is that this process involves an external choice of thresholding method which is not included in the initial problem formulation. Depending on how this preprocessing step is performed, the performance and interpretation of the final results may vary in a significant manner. The main contribution of this paper is to highlight this matter and to discuss several possible ways to define a classification problem from the native regression problem.
The paper is organized as follows: in Section 2 we formally introduce the necessary notation to define regression and classification problems for time series in supervised learning. Section 3 introduces three different thresholding methods. Two different deep learning models are introduced and explained in Section 4. The purpose of studying two different models is to have more robust results and to ensure that the reported variations are not model dependent. The detailed methodology is carefully explained in Section 5, including data preprocessing, definition of training, validation and test sets, loss functions, optimization algorithms, and evaluation metrics. The results are exhibited in Section 6 for three monitored devices with different characteristics. Together with the results, we include a discussion on the criteria to choose the most convenient thresholding method. In Section 7 we introduce architectures that optimize both classification and regression, to see how the two formulations interplay. Finally, we gather concluding remarks and outline open research problems in Section 8.

Problem Formulation
NILM is formulated as a supervised learning problem, where the model is trained to take the aggregate power as input signal and predict the power or state (ON/OFF) of each monitored appliance. This power load is measured by the smart meter at a constant rate τ*, which produces a series of power measurements P_i at each sampled interval. For the analysis, it is often convenient to resample the original series at a larger sampling interval τ, which is part of the preprocessing step. For instance, in this paper the native sampling interval for the UK-DALE dataset provided by the meters is τ* = 6 s, but we choose to resample the series at intervals of τ = 60 s.
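As an illustration, the resampling step can be sketched with pandas; the synthetic index and the choice of mean aggregation are assumptions made for this example, not prescriptions from the paper:

```python
import numpy as np
import pandas as pd

# Sketch: resample a 6-second power series to 1-minute bins.
# The synthetic index and mean aggregation are illustrative assumptions.
idx = pd.date_range("2014-01-01", periods=100, freq="6s")
power_6s = pd.Series(np.random.default_rng(0).uniform(0, 3000, size=100),
                     index=idx)
power_60s = power_6s.resample("60s").mean()   # 100 samples -> 10 bins
```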
The aggregate power P_j at instant j is the sum over all appliances:

P_j = ∑_{ℓ=1}^{L} P_j^(ℓ) + e_j,

where L is the total number of appliances in the building, P_j^(ℓ) is the power of appliance ℓ at time j, and e_j is the unidentified residual load. All of these quantities are expressed in watts.
After resampling, the training set comprises a sequence of n_tot records that we label as {P_i}, i = 0, …, n_tot. This series is split into chunks of size n that we group into vectors P_j = (P_{jn}, P_{jn+1}, …, P_{jn+n−1}). We have a total of n_train = n_tot / n such series, each of which will be an input to the model. The outputs of the model are the corresponding sequences of predicted power or status for each appliance.

Supervised learning problems are usually referred to as classification or regression problems depending on whether the output variables are categorical or continuous. In the NILM literature, both of these approaches have been considered in different contributions, but hardly any work has been devoted to the interplay between the two formulations. It is precisely this gap that we would like to fill with this analysis.
In the regression approach, the predicted quantities are the power loads P_j^(ℓ) of each device ℓ. In the classification approach, the focus is on predicting whether a given appliance ℓ is, at time j, in one of a number of possible states, typically ON or OFF. We assume, for the sake of simplicity, that the appliance can be in one of two states at time j: s_j^(ℓ) = 0 (OFF state) or s_j^(ℓ) = 1 (ON state). It is not obvious how to ascertain whether a given appliance is ON or OFF by just looking at the power load. Thus, the usual criterion is to establish a threshold λ^(ℓ) for each appliance and define

s_j^(ℓ) = 1 if P_j^(ℓ) ≥ λ^(ℓ), and s_j^(ℓ) = 0 otherwise.     (2)

Multi-state classification problems, where appliances may have more than one ON status (each of them with a different consumption), have also been considered in the literature [30]. A correct definition of the classification approach thus involves a choice of threshold λ^(ℓ) for each appliance ℓ. Ideally, this threshold should be determined by the series of data P_j^(ℓ) alone, rather than being externally fixed by human intervention. In this paper we review different algorithms to determine this threshold, which lead to rather different outcomes depending on the complexity of the input signal. We address these methods in the following section.

Thresholding
In this section we explore different methods for setting a threshold that determines the OFF and ON status of an appliance, given its power signal.

Middle-Point Thresholding (MP)
In Middle-Point thresholding we consider the set of all power values {P_j^(ℓ)} from appliance ℓ in the training set. We apply a clustering algorithm to split this set into two clusters and consider the centroid of each cluster. Typically, the k-means clustering algorithm can be applied for this purpose [31]. The two centroids obtained after applying k-means are denoted by m_0^(ℓ) (OFF cluster) and m_1^(ℓ) (ON cluster), and the threshold is placed at their middle point:

λ^(ℓ) = (m_0^(ℓ) + m_1^(ℓ)) / 2.
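A minimal sketch of MP thresholding on synthetic data, using a hand-rolled 1-D two-means loop instead of a library implementation (the initialisation and the toy power values are illustrative assumptions):

```python
import numpy as np

def middle_point_threshold(power, n_iter=50):
    """Middle-Point (MP) threshold: run 2-means on the 1-D power values
    and place the threshold halfway between the two centroids.
    Illustrative sketch, not the paper's exact implementation."""
    power = np.asarray(power, dtype=float)
    # Initialise the two centroids at the min and max observed power.
    m = np.array([power.min(), power.max()])
    for _ in range(n_iter):
        # Assign each sample to the nearest centroid (0 = OFF, 1 = ON).
        labels = (np.abs(power - m[1]) < np.abs(power - m[0])).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                m[k] = power[labels == k].mean()
    m0, m1 = sorted(m)
    return 0.5 * (m0 + m1), (m0, m1)

# Toy example: OFF readings near 2 W, ON readings near 2000 W.
rng = np.random.default_rng(0)
power = np.concatenate([rng.normal(2, 1, 500), rng.normal(2000, 50, 100)])
lam, (m0, m1) = middle_point_threshold(power)
```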

Variance-Sensitive Thresholding (VS)
Variance-Sensitive (VS) thresholding was recently proposed as a finer version of MP thresholding by Desai et al. [30]. It also employs k-means clustering to find the centroids of each class, but the determination of the threshold now takes into account not only the means but also the standard deviations σ_k^(ℓ) of the points in each cluster, according to the following formula:

λ^(ℓ) = (σ_1^(ℓ) m_0^(ℓ) + σ_0^(ℓ) m_1^(ℓ)) / (σ_0^(ℓ) + σ_1^(ℓ)).

The motivation is that, if σ_1 > σ_0, then the threshold should move towards m_0 in order not to misclassify the points in class 1 that are further away from the centroid m_1. As a matter of fact, points in the OFF cluster usually have less variance, so the VS approach often sets its threshold lower than the MP approach. A comparison of both thresholding methods on a specific set of power measurements can be seen in Figure 1. Also note that MP is a particular case of VS when σ_0 = σ_1.
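A sketch of the VS threshold under the weighted form given above, on synthetic clusters (the data, labels and spreads are illustrative assumptions):

```python
import numpy as np

def variance_sensitive_threshold(power, labels):
    """Variance-Sensitive (VS) threshold: weight each centroid by the
    spread of the opposite cluster, so the threshold moves towards the
    centroid of the tighter cluster. Sketch assuming the form
    lambda = (sigma_1 * m_0 + sigma_0 * m_1) / (sigma_0 + sigma_1)."""
    off, on = power[labels == 0], power[labels == 1]
    m0, m1 = off.mean(), on.mean()
    s0, s1 = off.std(), on.std()
    return (s1 * m0 + s0 * m1) / (s0 + s1)

rng = np.random.default_rng(1)
off = rng.normal(2, 1, 500)       # OFF cluster: low mean, small spread
on = rng.normal(2000, 300, 100)   # ON cluster: high mean, larger spread
power = np.concatenate([off, on])
labels = np.concatenate([np.zeros(500, int), np.ones(100, int)])
lam_vs = variance_sensitive_threshold(power, labels)
# Since sigma_1 > sigma_0, lam_vs lies well below the MP middle point.
```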

Activation-Time Thresholding
The two methods above only use the distribution of power measurements to fix the threshold of an appliance. It often happens that, due to noise in the smart meters or devices, some measurements during short time intervals are either absent while the device is operating, or produce abnormal peaks during the OFF state. For this reason, to ensure a smoother behaviour, Kelly and Knottenbelt [32] set both a power threshold and a time threshold. The power threshold could be fixed by MP or VS, or fixed externally by hand as done in [32]. The time threshold µ^(ℓ) specifies the minimum length of time that a device must remain in a given state: e.g. if a sequence of power measurements stays below λ^(ℓ) for a time t < µ_0^(ℓ), then that sequence is considered to remain in the previous state (ON in this binary case). In [32], both power and time thresholds are chosen empirically, after analysing the appliance behaviour. Table 1 shows the threshold values relevant to our work. The power threshold λ is usually set at lower values, as the time threshold already filters noisy records. It would be desirable to turn this thresholding method into a fully automated, data-driven algorithm, in order to remove all subjective inputs.
Threshold   Dishwasher   Fridge   Washing machine
λ (W)       10           50       20
µ_0 (s)     30           1        3
µ_1 (s)     30           1        30

Table 1: Activation time (AT) threshold values used in this work, taken from [32].

Figure 2 compares the three thresholding methods. Each graph shows the three values of λ^(ℓ) for a given device, together with the result of applying each thresholding method to the same input series. Observe that the same power data gives rise to rather different series for the ON/OFF status depending on the choice of thresholding method. We thus see that there are multiple ways to define a classification problem given the input signal.
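A sketch of AT thresholding on a toy series, with the time thresholds expressed in number of samples; the exact rule for merging too-short runs into the surrounding state is an assumption of this example:

```python
import numpy as np

def activation_time_status(power, lam, min_off, min_on):
    """Activation-Time (AT) thresholding sketch: first apply the power
    threshold lam, then discard ON/OFF runs shorter than the time
    thresholds (here in number of samples). Assumed merging rule:
    a too-short ON run is switched OFF, and a too-short OFF gap
    strictly inside activity is switched ON."""
    status = (np.asarray(power) >= lam).astype(int)
    # Locate run boundaries.
    change = np.flatnonzero(np.diff(status)) + 1
    starts = np.concatenate([[0], change])
    ends = np.concatenate([change, [len(status)]])
    for a, b in zip(starts, ends):
        length = b - a
        if status[a] == 1 and length < min_on:
            status[a:b] = 0          # too-short ON run -> keep OFF
        elif status[a] == 0 and length < min_off and 0 < a and b < len(status):
            status[a:b] = 1          # too-short OFF gap inside activity
    return status

power = np.array([0, 0, 500, 500, 500, 0, 500, 500, 500, 0, 0, 0])
status = activation_time_status(power, lam=100, min_off=2, min_on=2)
# The single-sample OFF gap inside the activation is filled in.

spike = activation_time_status(np.array([0, 0, 0, 500, 0, 0, 0, 0]),
                               lam=100, min_off=2, min_on=2)
# The isolated one-sample spike is discarded.
```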
A comparison and discussion of each thresholding method, after training state-of-the-art NILM models for the regression and classification problems, is carried out in Section 6.

Neural Networks
Almost all state-of-the-art models propagate their inputs through one or more convolutional layers [33,34]. This is done to ensure that the models are translation invariant. As NILM deals with time series, many studies also add recurrent layers (e.g. LSTM or GRU) to their networks [18,21]. These layers tend to obtain very good results on sequence-related problems. In this work, we try out two different models: one that relies purely on convolutions, and another that also applies recurrent layers after the convolutions.
It is important to stress that both of these neural networks can be used to train either a classification model that predicts device status or a regression model that predicts device power load; the only difference in their architecture lies in the last layer (see Figure 3b), where an additional softmax layer needs to be added for the classification problem.

Convolutional Network
Our first model relies solely on convolutional layers, inspired by the architecture of Massidda et al. [23]. The general scheme follows the classic approach to semantic segmentation of images. See Figure 3a to better follow the model description below.
The CONV model receives as input a vector of size L_in = 510, which represents the household aggregated power over an 8.5-hour interval. The vector is propagated through an encoder, characterised by alternating convolution and pooling modules. Each encoder layer begins with a convolution of kernel size 3 and no padding, then applies batch normalization and ReLU activation, and ends with a max pooling of kernel size 2 and stride 2. Only the last layer omits the max pooling step. Encoder layers increase the number of feature channels at the cost of decreasing the temporal resolution.
After that, the Temporal Pooling module aggregates the features at different resolutions, which is reminiscent of inception networks [35]. Four different average poolings are applied, with kernel sizes 5, 10, 20 and 30, each with stride equal to its kernel size. Each branch then propagates its values through a convolution layer of kernel size 1 and no padding, followed by batch normalization and ReLU activation. All of their outputs are then concatenated.
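A sketch of the Temporal Pooling module in PyTorch; upsampling each branch back to the input length before concatenation, and keeping the input features alongside the pooled branches, are assumptions of this example, and the channel sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPooling(nn.Module):
    """Sketch of the temporal pooling block described in the text:
    average pooling at several scales, a 1x1 conv + BN + ReLU per
    branch, then concatenation. Each branch is upsampled back to the
    input length so the concatenation is well defined (an assumption
    of this sketch)."""
    def __init__(self, in_ch, branch_ch, kernels=(5, 10, 20, 30)):
        super().__init__()
        self.kernels = kernels
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv1d(in_ch, branch_ch, kernel_size=1),
                          nn.BatchNorm1d(branch_ch), nn.ReLU())
            for _ in kernels)

    def forward(self, x):
        L = x.shape[-1]
        outs = [x]   # keep the input features alongside the branches
        for k, branch in zip(self.kernels, self.branches):
            y = F.avg_pool1d(x, kernel_size=k, stride=k)
            y = branch(y)
            outs.append(F.interpolate(y, size=L, mode="linear",
                                      align_corners=False))
        return torch.cat(outs, dim=1)

tp = TemporalPooling(in_ch=32, branch_ch=8)
x = torch.randn(4, 32, 60)   # batch of 4, 32 channels, length 60
out = tp(x)                  # -> (4, 32 + 4*8, 60)
```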
Finally, the decoder module applies one convolution of kernel size 8 and stride 8, followed by batch normalization. It then bifurcates into two different outputs: the appliance status and the appliance power. Both outputs are computed by propagating the network values through one last convolution layer of kernel size 1 and padding 1. In the case of status, we apply the softmax function. Both status and power load output vectors have the same sampling frequency as the input aggregate load, but have a shorter length L out = 480 as explained in Section 5.

Bidirectional GRU Network
Some authors connect convolutional and recurrent layers to extract temporal correlations from the input sequence [24,32]. This second model follows a prototypical GRU scheme, depicted in Figure 3b. The input and output layers are the same as in the previous model. For the processing units, the GRU model propagates the input vector through two convolutional layers with kernel size 20, padding 2 and stride 1, before applying the recurrent (bidirectional GRU) layers.
One can see that the model architecture is rather lightweight. However, GRU takes longer to train than CONV: updating the recurrent weights requires considerably more computation than updating the weights of the convolutional layers.
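A minimal sketch of the conv + BiGRU scheme; the kernel size 20 and padding 2 are taken from the text (and indeed map an input of length 510 to an output of length 480 after two convolutions), while the channel counts, hidden size and output heads are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ConvBiGRU(nn.Module):
    """Sketch of the conv + bidirectional GRU architecture: a
    convolutional front end, a BiGRU, and two output heads (power
    regression and ON/OFF classification)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=20, padding=2),  # 510 -> 495
            nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=20, padding=2),  # 495 -> 480
            nn.ReLU())
        self.gru = nn.GRU(16, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.power_head = nn.Linear(2 * hidden, 1)   # regression output
        self.status_head = nn.Linear(2 * hidden, 1)  # ON-probability output

    def forward(self, x):            # x: (batch, 1, L_in)
        h = self.conv(x)             # (batch, 16, L_out)
        h, _ = self.gru(h.transpose(1, 2))   # (batch, L_out, 2*hidden)
        power = self.power_head(h).squeeze(-1)
        status = torch.sigmoid(self.status_head(h)).squeeze(-1)
        return power, status

model = ConvBiGRU()
x = torch.randn(2, 1, 510)
power, status = model(x)             # both (2, 480)
```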

Preprocessing
In order to make our results reproducible and easy to compare with other works, we restrict ourselves to the UK-DALE dataset [8], which is a standard benchmark for NILM.

Table 3: UK-DALE dataset.
Only houses 1, 2 and 5 have been used in this work. Our target appliances, found in all three buildings, are: fridge, dishwasher and washing machine. This choice of houses and appliances is common in other works [23,32,36], which seek to monitor appliances with distinguishable load absorption patterns and a relevant contribution to the total power consumption.
Every power load series was downsampled from 6-second to 1-minute frequency. After this downsampling, every input sequence spans 8.5 hours, which amounts to L_in = 510 records. Since the models use convolutions with no padding, the first and last records of each series are dropped in the output, leading to an output sequence of L_out = 480 records, i.e. 8 hours. We have divided the original time series into input sequences with an overlap of 30 records between consecutive input sequences, so that the output sequences are continuous in time and have no gaps. The aggregate power load is normalized, dividing the load by a reference power value of 2000 W for numerical stability. Each input series is further normalized by subtracting its mean. Thus, we can define the following input and output series:

Regression: input P_j (normalized aggregate load), output P_j^(ℓ) (power of each device ℓ).

To define the classification problem, the target is the device status, which is computed from P_j^(ℓ) using the thresholding methods described in Section 3. More specifically, we have:

Classification: input P_j (normalized aggregate load), output s_j^(ℓ),

where s_j^(ℓ) is the series of binary values defined by (2). The training set for each problem is built from the first 80% of sequences from each of the three buildings, which amounts to 1941 sequences (a total of 687 days of measurements). The validation set consists of the subsequent 10% of records from house UK-DALE 1, for a total of 183 sequences (65 days), while the test set is composed of the last 10% of sequences of the same building, having the same size as the validation set.

In order to judge whether we are dealing with a balanced classification problem, it is useful to report how often a given device has been ON during the training and test sets, depending on which thresholding method has been applied. The results can be seen in Table 5.
It is important to stress a number of things from the observation of this table. First, the fraction of ON states clearly depends on the thresholding method, as was already apparent from Figure 2. Next, the dishwasher and washing machine are only sparsely activated, while the fridge is considered to be ON roughly half of the time. In the first two cases we are dealing with an imbalanced class problem, which should be taken into account when defining and interpreting the appropriate metrics. Finally, in all cases, but especially for the washing machine, the prevalence of the positive class differs greatly from the training to the test set. This happens because these periods have been chosen consecutive in time (in accordance with other works). If the train-validation-test split were done randomly over the 2307 sequences, we would observe a similar class distribution across sets. Splitting training and test sets for a time series always involves a delicate choice: whether to split records randomly (which ensures a homogeneous distribution) or chronologically (which is closer to real operating conditions).

Table 5: Fraction of activation time (in %) for each device over the train and test sets.
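The windowing and normalization described in this section can be sketched as follows (a simplified sketch; boundary handling in the real pipeline may differ):

```python
import numpy as np

def make_input_windows(series, l_in=510, l_out=480, p_ref=2000.0):
    """Split a resampled aggregate series into input windows of length
    l_in. Consecutive windows overlap by l_in - l_out = 30 samples, so
    the l_out-sample outputs (which lose 15 samples on each side to the
    unpadded convolutions) tile the series without gaps."""
    step = l_out
    starts = range(0, len(series) - l_in + 1, step)
    X = np.stack([series[s:s + l_in] for s in starts])
    X = X / p_ref                          # normalise by reference power
    X = X - X.mean(axis=1, keepdims=True)  # subtract each window's mean
    return X

series = np.arange(510 + 2 * 480, dtype=float)  # enough for 3 windows
X = make_input_windows(series)                  # -> shape (3, 510)
```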

Training
Each of the models described in Section 4 was trained for 300 epochs. Training data was fed to the model in batches of 32 sequences, shuffled randomly.
The loss function for the regression problem is the mean square error or L2 metric, given by

L_reg^(ℓ) = (1/n) ∑_j ( P̂_j^(ℓ) − P_j^(ℓ) )²,     (7)

The standard choice of loss function for the classification problem is the binary cross entropy:

L_class^(ℓ) = −(1/n) ∑_j [ s_j^(ℓ) log ŷ_j^(ℓ) + (1 − s_j^(ℓ)) log(1 − ŷ_j^(ℓ)) ],     (8)

where ŷ_j^(ℓ) is the predicted probability that device ℓ is ON at each time step. During the 300 training epochs, we keep the model that achieves the minimum loss over the validation set, using the Adam optimizer for weight updates with a starting learning rate of 10⁻⁴.
Both data preprocessing and neural network training were performed in Python. Specifically, the models were written in PyTorch and trained on a GPU NVidia GeForce GTX 1080 with 8 GB of VRAM, NVIDIA-SMI 440.95.01 and CUDA v10.2. The code for this paper is available online, and the data come from a public dataset, so all results reported in this paper are reproducible. Using the configuration stated in this section, CONV models took 7 to 8 minutes to train for 300 epochs, while GRU architectures took 16 minutes.

Metrics
Although metrics are closely related to the loss functions used for model training, the main difference is that reported metrics are not required to be differentiable. When the output is a continuous variable (power load in our case), we use the L1 error, or MAE, as the relevant metric rather than RMSE, since the latter gives too much weight to large deviations.
When the output variable is categorical, we use the F1-score to balance precision and recall in problems where there is no preference for achieving a better classification of the ON or OFF class. The F1-score is the harmonic mean of precision and recall, or directly

F1 = TP / (TP + (FP + FN)/2),     (10)

where TP, FP and FN are the numbers of true positives, false positives and false negatives, taking the ON state as the positive class. For imbalanced problems, such as NILM for the dishwasher and washing machine (see Table 5), the F1-score might not be the best metric to report. Rather, since those devices are ON roughly 1% of the time, it is better to consider a metric suited to imbalanced classification problems, such as the area under the ROC curve (AUC) [37].
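The F1-score can be computed directly from the confusion-matrix counts, taking the ON state as the positive class:

```python
import numpy as np

def f1_score_binary(y_true, y_pred):
    """F1 = TP / (TP + (FP + FN) / 2): the harmonic mean of precision
    and recall, computed directly from counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + 0.5 * (fp + fn))

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
f1 = f1_score_binary(y_true, y_pred)   # TP=2, FP=1, FN=1 -> 2/3
```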

Regression problem
We begin our discussion of results by reporting the metrics obtained over the test set by the two models (CONV and GRU) in the regression problem for each of the three appliances ℓ ∈ {dishwasher, fridge, washing machine}.

Table 6: MAE scores (in watts) for regression models on each appliance.

The analysis of these metrics needs a word of caution: it is natural to expect that a device which has been sparsely activated during the test set will yield an overall lower MAE for the predicted load than a similar device that has been used more often (see Table 5). For this reason, some authors have suggested other metrics such as the energy error [38,39]. For a recent review of all metrics that have been used in NILM, see [27].
To understand the complexity of disaggregating the power signal in NILM, we have plotted a time series from the test set in Figure 4. The plots show the input signal P_j (aggregated power load), together with the real power P_j^(ℓ) of each device and the predicted power P̂_j^(ℓ) obtained from the CONV network. In the first graph of Figure 4 we see that the model has correctly identified the two main power peaks of 2200 W corresponding to the dishwasher, while properly ignoring other similar large peaks occurring earlier in the series. In the second graph, we observe that the fridge power is often masked by the presence of other devices, having a mean value of just 100 W but a very periodic activation pattern. When the aggregate power load is small, the model is able to better resolve the signal coming from the fridge. The washing machine has a more complex consumption pattern during its activation period, but the model has also been able to identify correctly the activation peaks while ignoring other peaks of the same magnitude occurring earlier in the series that do not correspond to washing machine operation.

Classification problem
As mentioned above, the classification problem is not uniquely defined, since the raw data do not include the real intervals in which each device was ON or OFF, but only its consumption. Thus, we have three different classification problems depending on the choice of thresholding method described in Section 3. For each combination of (model, thresholding, device) we report the F1-score (10) over the test set in Table 7.
The next-to-last line shows that our results are in good agreement with those reported by the authors of [23]. Also, we can observe that CONV has a slightly better F1-score than GRU for the classification problem in all three devices and thresholding methods, although both show very good performance (see Figure 5). In general, the classification problem for the fridge is harder, for reasons that have already been mentioned above. We raised the idea of using AUC instead of the F1-score in Section 5.3; we computed the AUC and found it to be almost identical to the F1-score, so we keep the latter as it is more common in the literature.

Table 7: F1-scores for classification models on each appliance and threshold.
It is also instructive to represent the input signal P_j together with the real signal describing the status of device ℓ, s_j^(ℓ), and the predicted status ŝ_j^(ℓ), to grasp the nature of the NILM classification problem. We have plotted in Figure 5 the output of CONV on a given series of records from the test set in which the three devices have been activated (sometimes simultaneously). Observe that the model is able to discriminate, with very good precision, the periods in which each device is activated, just from the observation of the aggregated power signal.

Reconstructing the power signal
A very natural question to address is which of the three proposed thresholding methods should be preferred. One might naively think that the one leading to the best F1-score would be the best choice, if the decision were based solely on prediction performance. However, placing a trivial threshold of zero would yield the ON state for all time intervals in the training and test sets, and any decent machine learning method would immediately learn this, thus reaching a perfect F1-score while having no useful interpretation at all. Thus, we need to balance predictive performance with a way of judging which method is more meaningful. In the absence of any external information on when each device can be considered to be ON or OFF, we need an objective quantitative argument to tackle this question.
For this purpose, we propose to reconstruct the power signal of each device and compare the reconstructed signal with the original power load of the device. We compute the average power P̄_ON^(ℓ) (resp. P̄_OFF^(ℓ)) for device ℓ during the periods that are considered to be ON (resp. OFF) after applying the thresholding method, and reconstruct the power series from these two values.
More specifically, we have

P̄_ON^(ℓ) = ∑_j s_j^(ℓ) P_j^(ℓ) / ∑_j s_j^(ℓ),   P̄_OFF^(ℓ) = ∑_j (1 − s_j^(ℓ)) P_j^(ℓ) / ∑_j (1 − s_j^(ℓ)),

and reconstruct a binary power load for device ℓ as

BP_j^(ℓ) = s_j^(ℓ) P̄_ON^(ℓ) + (1 − s_j^(ℓ)) P̄_OFF^(ℓ).

The reconstructed power series can be seen, together with the original series, in Figure 6 for two of the thresholding methods, corresponding to the same data as Figure 2. We compute the MAE between the original series P^(ℓ) and the reconstructed series BP^(ℓ), averaged over the training set, which we call the intrinsic error, since it is prior to any prediction method. The results for the three devices and thresholding methods are shown in Table 8.

From this comparison, we see that Activation-Time (AT) thresholding has the largest intrinsic error, while Middle-Point (MP) thresholding offers the closest reconstructed power series. The fridge has a similar intrinsic error for all three methods, since the original series is very regular, being almost a binary series itself (see Figure 4).
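The reconstruction and the intrinsic error can be sketched as follows (toy values for illustration):

```python
import numpy as np

def reconstruct_binary_power(power, status):
    """Replace the power series by a two-level series: the mean ON
    power where status == 1 and the mean OFF power where status == 0."""
    p_on = power[status == 1].mean()
    p_off = power[status == 0].mean()
    return np.where(status == 1, p_on, p_off)

power = np.array([1.0, 3.0, 2000.0, 2200.0, 2.0])
status = np.array([0, 0, 1, 1, 0])
recon = reconstruct_binary_power(power, status)
# The intrinsic error is the MAE between original and reconstruction.
intrinsic_mae = np.abs(power - recon).mean()
```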
As mentioned above, it is not enough to look only at the classification metrics in Table 7 to judge which thresholding method is best for a NILM problem. For this reason, given the prediction output of the classification problem, we compute the reconstructed binary series and compare it with the original power series. MAE values averaged over the test set are reported in Table 9 for the two models (GRU and CONV).

Table 9: MAE scores (in watts) for classification models after reconstructing the power load, on each appliance and threshold.
The first thing to note is that, naturally, the MAE values are larger than the intrinsic errors, as they incorporate the classification errors. However, it is worth noting that for the dishwasher with AT thresholding this error is actually smaller than the intrinsic one. We find the explanation for this phenomenon in Figure 6: the current AT parameters contain a large time threshold for the dishwasher (see Table 1), which may lead to incorrect thresholding and large deviations in the reconstructed series. This suggests that the free parameters in AT thresholding should be optimized by minimizing the intrinsic error of each device over the training set. As shown in previous tables, CONV performs slightly better than GRU. It is also worth noting that the MAE for the washing machine has increased by a factor of 10 with respect to the intrinsic error, while in the other two devices the factor is close to 4. The most likely explanation for this deviation lies in the train-test split: most of the error comes from activation periods, and these occur twice as often in the test set as in the training set for the washing machine (see Table 5), while there is hardly any variation in the other two devices. This brings back the aforementioned remark on the importance of having a train-test split that preserves the class distribution.
Finally, these MAE values should be compared with the ones obtained by training our models on a pure regression problem (see Table 6). We observe that the results are comparable, and for the fridge they are even better in the reconstructed case. The explanation comes from the fact that the raw power signal of the fridge is almost binary, so the reconstructed signal matches this behaviour properly, and the ON/OFF values P̄_ON and P̄_OFF calculated over the training set are very close to the real values. Thus, in this case good metrics for the classification problem immediately translate into a good MAE for the reconstructed series. By contrast, the regression problem is harder for the fridge: the regression curve often fails to reconstruct the signal, especially when it is masked by larger signals coming from other devices (see Figure 4b).

Classification metrics on the regression problem
In the last section we trained the model for classification, but evaluated the MAE of the reconstructed power signal and compared it with the MAE obtained by directly training the model for regression. Now we do the opposite: we apply the thresholding methods to the real and predicted power signals to obtain the real and predicted status at each time interval, and then calculate the F1 metric over these values. The results can be seen in Table 10.

These scores are on average worse than the ones from the original classification approach (Table 7). In particular, the F1-scores of AT for the dishwasher and washing machine are extremely low, which is caused by the small power threshold set by this thresholding formulation (see Table 1). During periods of inactivity, regression models output values that, although relatively small compared to the power peaks of the dishwasher and washing machine, are high enough to surpass the AT power threshold, thus triggering the ON state and causing many false positives.

Balancing classification and regression
So far we have trained our models to solve a pure classification or regression problem. However, as explained in Section 4, our network architectures contain two different output layers, one for regression and one for classification. This means that we can also train the model to solve both problems simultaneously, with a weighted combined loss function. The total loss of the model is then

L^(ℓ) = w · L_class^(ℓ) + (1 − w) · k · L_reg^(ℓ),

where L_class^(ℓ) is the binary cross entropy (8), L_reg^(ℓ) is the MSE (7), and k is a constant that normalizes both losses so that they have comparable magnitude, estimated to be k = 0.0066. The constant w ∈ [0, 1] allows shifting between pure regression (w = 0) and pure classification (w = 1).
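A sketch of the combined loss in PyTorch, assuming the weighting enters exactly as w·L_class + (1−w)·k·L_reg (the tensor shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

def combined_loss(power_pred, power_true, status_logit, status_true,
                  w=0.5, k=0.0066):
    """Weighted combination of classification (BCE) and regression
    (MSE) losses: w=1 is pure classification, w=0 pure regression;
    k rescales the MSE so both terms have comparable magnitude."""
    l_class = F.binary_cross_entropy_with_logits(status_logit, status_true)
    l_reg = F.mse_loss(power_pred, power_true)
    return w * l_class + (1.0 - w) * k * l_reg

# Toy tensors: perfect regression, maximally uncertain classification.
p_pred = torch.zeros(4, 480); p_true = torch.zeros(4, 480)
s_logit = torch.zeros(4, 480); s_true = torch.ones(4, 480)
loss = combined_loss(p_pred, p_true, s_logit, s_true, w=1.0)
```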
We train both models using different values of the weight w, with MP thresholding; the other thresholding methods showed a similar behaviour. For each value of w, the model is trained five times with random weight initializations, as explained in Section 5. We show the output metrics, MAE and F1-score, for varying w in the figures below.
Note that when w = 0 the model does not train for classification. For this reason, at w = 0 we include a single point on the F1 curve, obtained by applying the thresholding method to the regression output, as in Section 6.4. Likewise, for w = 1 the model does not train for regression, so we include at w = 1 the MAE obtained by reconstructing the power signal from the classification output, as explained in Section 6.3.
Looking at the results in Figures 7-9, we observe a very different behaviour for the fridge than for the other two devices, due to the different characteristics already mentioned. For the dishwasher and washing machine, the F1-score grows monotonically with w for CONV, as one would expect, but is almost constant for GRU, with a small drop for values of w close to 1 that is hard to explain. Likewise, the MAE in both models and devices tends to grow for larger w, which is natural since the regression problem receives a smaller weight.

As for the extra points in the graphs, for the dishwasher the F1-score obtained by thresholding a pure regression output (red dot) is comparable to the best score obtained with larger classification weights. Rather surprisingly, the power signal reconstructed from a pure classification output (blue dot) has a lower MAE than the best regression weight for CONV, and still a comparable value for GRU. This discussion also holds for the washing machine, except that the reconstructed series behaves poorly in this case, as already discussed in connection with the activation pattern of the signal (see Table 9 and the ensuing remarks).

For the fridge, the behaviour is consistent across both models but different from the other two devices. While the F1-score behaves similarly, the MAE for regression decreases with w, which is clearly counter-intuitive: the model prioritizes the classification loss, and in doing so it also performs better on the regression problem. Moreover, the purely reconstructed signal for w = 1 (blue dot) has a better MAE than any of the models trained for regression. This fact can again be explained by the second graph in Figure 4: the regular (almost binary) activation pattern of the fridge is much better captured by a classification model with the right threshold than by a regression model, since the weak signal of the fridge is often masked by that of other devices.

Summary and conclusions
Non-Intrusive Load Monitoring is typically framed as a classification problem, where the input is the aggregated power load of the household and the output is the sequence of ON/OFF states of a given monitored device. It is important to stress that this problem is a derived one, as the raw data do not contain the variable to be predicted, only the power consumption. Creating a classification problem from the raw power signal therefore requires an external determination of the status by some thresholding method. We have discussed three such methods in Section 3 and shown how they lead to classification problems with different results.
A discussion of which thresholding method is most appropriate should not be based on the performance achieved by prediction models alone, but should also include some objective way to judge the interpretability of the results. As an objective criterion, we suggest using the intrinsic error, i.e. the MAE between the original power series and the reconstructed binary series.
We have shown that deep learning models can be trained to minimize the regression loss (7) or the classification loss (8), but it is also possible to combine both into a weighted loss by introducing an extra hyperparameter. This parameter balances the weight given to the two problems, which are then effectively solved at the same time. The optimal choice of this parameter depends strongly on the characteristics of the device.
To conclude we would like to mention possible improvements and future extensions of this work.
First, two of the thresholding methods (MP and VS) are entirely algorithmic, but AT needs some external parameters to be fixed. We suggest that these free parameters, the time thresholds µ0 and µ1 for each device, should be fine-tuned to minimize the intrinsic error defined in Section 6.3.
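One possible tuning procedure, sketched here under simplifying assumptions (a single power threshold, time thresholds expressed in samples as minimum OFF-gap and ON-run durations, and a one-pass run-length filter; all helper names are hypothetical), is to grid-search the pair (µ0, µ1) that minimizes the intrinsic error:

```python
import numpy as np
from itertools import groupby

def apply_time_thresholds(status, mu0, mu1):
    # One-pass run-length filter: drop ON runs shorter than mu1 samples,
    # fill OFF gaps shorter than mu0 samples (a simplification of AT)
    runs = [(k, sum(1 for _ in g)) for k, g in groupby(status)]
    out = []
    for k, n in runs:
        if k == 1 and n < mu1:
            out += [0] * n
        elif k == 0 and n < mu0:
            out += [1] * n
        else:
            out += [int(k)] * n
    return np.array(out)

def intrinsic_error(power, status):
    # MAE between the raw power and its binary reconstruction
    p_on = power[status == 1].mean() if status.any() else 0.0
    p_off = power[status == 0].mean() if (status == 0).any() else 0.0
    return float(np.abs(power - np.where(status == 1, p_on, p_off)).mean())

def tune_time_thresholds(power, thr, grid):
    # Pick the (mu0, mu1) pair minimizing the intrinsic error
    base = (power > thr).astype(int)
    return min(grid, key=lambda m: intrinsic_error(
        power, apply_time_thresholds(base, m[0], m[1])))
```

This keeps the selection of the time thresholds tied to the same objective criterion proposed above, rather than to an arbitrary manual choice.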
Second, some of the results highlight the importance of discussing chronological versus random splitting of the records that form the training, validation and test sets. This choice will become less significant in the limit of large data size, but for moderate sizes it can still lead to different results. Similar issues are key in fraud detection, where fraud techniques evolve in time and differ chronologically throughout the time span of the dataset.
Finally, we have chosen to address the simpler NILM problem of recognising the same devices in both the training and test sets. Training models on certain households and generalizing to unseen devices in different households is a harder problem that lies at the root of large-scale industrial applications of NILM.
Our purpose in this paper was to highlight an important factor at the foundations of NILM as a supervised learning problem, so we focused on a simple, well-known benchmark dataset. We envisage extending our study to larger, more recent datasets such as Pecan Street [40] or ECO [41], also addressing the generalization capacity of deep learning models to unseen devices in households not present in the training set.