1 Introduction

Energy management plays a crucial role in reducing carbon emissions and has drawn increasing attention from the machine learning community [2, 24]. Residential households account for a significant portion of energy consumption [37], and understanding how energy is consumed within them can yield a 5–20% reduction in total power consumption [16], ultimately improving energy efficiency. However, directly monitoring the power consumption of each appliance in households through smart meters is expensive, intrusive, and difficult to maintain [37, 38]. Non-Intrusive Load Monitoring (NILM), also known as energy disaggregation, offers a solution to these challenges. Introduced in the 1990s by George W. Hart [20], NILM aims to monitor the power consumption of individual appliances by analyzing aggregated household load data. This technique provides a foundation for energy conservation strategies and carbon emission reduction. For example, NILM allows residents to assess their power consumption behavior without installing multiple meters and to optimize appliance use during peak power hours [15].

Various NILM methods have been proposed over the years, including pattern-recognition-based NILM [23], heuristic search methods [46], decision tree algorithms [30], etc. However, the need to manually define appliance signatures with hand-crafted features has hindered the widespread adoption of NILM [40, 41, 51]. Recent advances in deep neural networks (DNN) and artificial intelligence (AI) have led to significant improvements in NILM performance [3, 22, 25, 26, 35, 49], as these models can automatically learn effective appliance signatures. They include denoising auto-encoders [5], generative adversarial networks (GAN) [3], recurrent neural networks [4], convolutional neural networks (CNN) [14, 35, 49], stochastic configuration networks [43], and so on. CNNs have shown the best performance among AI-based NILM models [44], and many advanced CNN-based NILM models have been proposed recently. Cui et al. [10] embedded background filtering into the DNN training process to better estimate power consumption at the appliance level. Chen et al. [7] applied deep residual networks in a convolutional sequence-to-sequence framework for NILM, which also improved performance. A recently proposed method in this area is Short Sequence-to-Point (Seq2Point) [49], which uses a sliding-window approach for real-time energy disaggregation: a window of aggregate energy data is input to predict the midpoint value of appliance consumption within the same window. It is considered one of the state-of-the-art models in NILM [2].

Most AI-based NILM methods are centralized and rely heavily on large and diverse training data [24, 44, 50], which is owned by various household power consumers. However, due to data silos and privacy concerns [48], it is impractical to collect data from each power consumer to train an AI-based NILM model. Therefore, most consumers must rely on their own local data for training, resulting in models with limited generalization [44] due to the lack of training data. Different from Centralized Learning and Individual Learning, there is another learning paradigm known as Federated Learning (FL) [31], which is a promising way to address the privacy concerns of Centralized Learning and the poor performance of Individual Learning. FL is a distributed machine learning framework that allows clients to cooperatively train AI-based models without sharing their appliance power consumption data [11].

Several previous studies have investigated FL in NILM. Wang et al. proposed Fed-NILM, which applied a generic FL algorithm, i.e., FedAvg, to the task of energy disaggregation [44] and demonstrated commendable improvements over Individual Learning. Meanwhile, Zhang et al. [50] designed efficient cloud model compression via filter pruning when using FL in NILM to improve communication efficiency. Kaspour [24] proposed a reweighting approach to enhance fairness in FL-based NILM. However, while FedAvg is straightforward to use in NILM, it may not yield precise results due to some significant challenges in adopting FL for energy disaggregation.

Challenge 1 One critical challenge of adopting traditional FL to NILM is the slow convergence of the training model, which is mainly due to the data heterogeneity of clients. Because of this heterogeneity, the local model update directions of clients differ extensively from each other during training; sometimes they even conflict with the global model’s update direction and lead to poor performance on these clients [29]. In energy disaggregation, the power consumption data are sampled by edge devices that are far from the server, so uploading model parameters is time-consuming. As a result, slow convergence significantly increases the number of communication rounds in the training process, wastes computation resources, and reduces clients’ willingness to join FL.

Challenge 2 The second challenge pertains to the poor performance of the training result, which can be attributed to the heterogeneity of the clients who participate in FL. The power consumption patterns of clients vary considerably, and their datasets often vary in size and sampling rate. This non-IID (not independent and identically distributed) data poses a challenge to conventional FL algorithms. As a result, a single global model struggles to accommodate the distinct datasets of each client [12, 28], leading to unsatisfactory performance for some clients.

Challenge 3 Partial client participation and client dropout. In reality, not every client is able to participate in FL at each communication round; at nearly every round, some clients are temporarily absent. Consequently, the server cannot obtain any information from the absent clients during model aggregation, which may cause the updated model to forget the knowledge learned from them. This can result in poor performance of the model for these absent clients when they return in the future [45].

To enhance the convergence of the training model (Challenge 1), we limit the update bias between clients and the global model during training by adding a proximal term to the local objective of each client, which brings no extra communication cost or privacy overhead. Subsequently, to tackle the low performance of the FL model (Challenge 2), we formulate a personalized improvement strategy to create personalized models for clients that better fit their local data. Specifically, we calculate dynamic weights to combine the personalized update direction and the global model update direction, which enhances personalization while avoiding conflicts between the personalized model update and the global model update. Finally, we address the common issue of client dropout (Challenge 3) by designing a reweighting approach that takes the absent clients’ historical information into consideration during FL aggregation, which preserves the historical information of the absent clients without hindering convergence.

In this work, we present the Personalized Federated Learning NILM algorithm (PerFedNILM), a practical FL algorithm for energy disaggregation. PerFedNILM aims to provide practical privacy-preserving FL for NILM and can mitigate the negative impact of data heterogeneity among clients and of client dropout. Our main contributions are summarized below.

  • We propose PerFedNILM, a practical Federated Learning framework designed for energy disaggregation among power consumers. With the inclusion of a personalization-enhancing strategy, PerFedNILM can establish personalized models for clients that accurately fit their local power consumption data.

  • We limit the update bias between clients and the global model in the training process by adding a proximal term to the local objective of each client, which mitigates the negative impact of client data heterogeneity and improves the convergence of the model in FL-based NILM.

  • We propose a practical way to protect the historical knowledge of absent clients by dynamically taking into account the absent clients’ historical information during the training process, which makes PerFedNILM more practical in reality.

  • We conduct extensive evaluations to verify the improvements made by PerFedNILM. The results validate that PerFedNILM outperforms previous FL-based NILM algorithms that simply adopt generic FL in NILM applications, especially when many clients are absent at each communication round.

The rest of the paper is organized as follows. In Sect. 2, we introduce the background knowledge of NILM and Federated Learning, and review related literature. Sect. 3 first introduces the NILM model; the PerFedNILM algorithm is then presented in detail, including the framework overview and the workflow. In Sect. 4, we describe the experimental setting of the case study, including the dataset we use and the evaluation metrics. Experimental results are then presented, and we analyze how the proposed PerFedNILM outperforms the previous Fed-NILM.

2 Background and related work

Fig. 1 The architecture of Sequence-to-Point learning

2.1 Background of NILM

Non-intrusive load monitoring (NILM) is a technique for analyzing the total electricity consumption of a household, as measured by a single main meter, to estimate the power consumption of individual appliances and devices. Let P(t) be the total power consumption of a house at time t. Assuming that P(t) is the sum of the active power consumption of all appliances in the household, the formulation of NILM is as follows (provided by [13]):

$$\begin{aligned} P(t) = \sum \limits _{i = 1}^{I} A_i(t) + \beta (t), \end{aligned}$$
(1)

where \(A_i(t)\) represents the power consumption of appliance i at time t, I denotes the total number of appliances, and \(\beta (t)\) indicates the noise at time t, which is generally modeled as Gaussian noise with mean 0 and variance \(\sigma ^2\). The goal of the energy disaggregation task is to recover the appliance-wise power readings from the observation of the total energy consumption measured by the main meter. This constitutes a difficult single-channel blind source separation problem [13].
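To make the formulation concrete, the sketch below builds a toy aggregate signal according to Eq. (1). The appliance traces, noise level, and sample count are purely illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, I = 2880, 3                       # illustrative: 2880 samples, 3 appliances

# Hypothetical appliance-level power traces A_i(t): simple on/off rectangles
A = np.zeros((I, T))
A[0, 200:900] = 90.0                 # e.g., a fridge compressor cycle
A[1, 1000:1100] = 2000.0             # e.g., a short kettle burst
A[2, 1500:2600] = 500.0              # e.g., a washing machine phase

beta = rng.normal(0.0, 10.0, size=T)      # Gaussian noise beta(t), mean 0, std 10
P = A.sum(axis=0) + beta                  # aggregate mains reading, Eq. (1)
```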

There are several approaches that can help address the challenge of disaggregating the total power signal measured by a single meter. For example, incorporating prior knowledge about appliance power signatures and usage patterns into the models can improve disaggregation accuracy. In addition, applying signal processing techniques such as filtering and feature extraction to the aggregated power data can help isolate and identify the consumption of specific devices. Another popular way to address this single-channel blind source separation problem is AI-based supervised learning, which aims to learn the following map G from P to A to solve NILM:

$$\begin{aligned} A = G(P). \end{aligned}$$
(2)

It is a non-linear regression problem. Several AI-based approaches have been proposed to find the map G, and the Sequence-to-Point (Seq2point) model is currently the state-of-the-art (SOTA) model, having demonstrated significant progress in disaggregating energy consumption [2, 49].

Figure 1 depicts the architecture of the Seq2point model, a data-driven model that requires data and labels for training. It is constructed using a Convolutional Neural Network (CNN) to extract features of the power consumption. The model takes window samples of the total electric load as input and produces the midpoint power consumption of each window for a specific appliance as output. More precisely, let W denote the window size. Each input sample of the network is a window of the total electrical load \(P_{t:t+W-1}\), and the model output is the midpoint element \(y = A_{i,t+W/2}\) of the corresponding window of the target appliance. The model formulation of Seq2point is:

$$\begin{aligned} y = G(P_{t:t+W-1}, \omega ) + \epsilon , \end{aligned}$$
(3)

where \(\omega\) represents the model parameters, and \(\epsilon\) is a term that accounts for the noise. Assume that \(\hat{y}\) is the network output; then the loss function of the model is

$$\begin{aligned} f(\omega ) = \frac{1}{T-W} \sum _{t=1}^{T-W} (\hat{y}_t - y_t)^2. \end{aligned}$$
(4)

The midpoint element is modeled as a non-linear function of the mains window, under the assumption that an appliance’s mid-element state depends on contextual information from the adjacent mains readings. Specifically, the status of the midpoint is associated with the mains power data both before and after that point. Compared to the Sequence-to-Sequence model [39], the Seq2Point model offers the advantage of producing a single prediction for each output point instead of averaging predictions across overlapping windows.
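As an illustration of this sliding-window setup, the following sketch builds (window, midpoint) training pairs from a mains series and one appliance series. The helper name and the default window length are ours, not from the paper.

```python
import numpy as np

def make_seq2point_samples(mains, appliance, window=30):
    """Build (window, midpoint) pairs for Seq2Point-style training.

    mains, appliance: 1-D arrays of the same length T.
    Returns X of shape (T - window, window) and y of shape (T - window,),
    where y[k] is the appliance reading at the midpoint of window k.
    """
    T = len(mains)
    X = np.stack([mains[t:t + window] for t in range(T - window)])
    y = np.array([appliance[t + window // 2] for t in range(T - window)])
    return X, y
```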

It should be noted that the previous Seq2point algorithm required building a separate model for disaggregating the power consumption of each appliance, which made it vulnerable to noise: when learning the mapping between the mains power consumption and the electric readings of a particular appliance, the power of the other appliances becomes additional noise, making the approach challenging to employ in practice. In contrast, we train a shared model for disaggregating the power consumption of all appliances.

2.2 Federated learning

Fig. 2 The main workflow of Federated Learning with averaging

Federated learning (FL) is an intriguing machine learning approach that enables clients to cooperatively train a shared global model without exposing their local data. It is a promising technique to tackle data isolation and privacy concerns [48], allowing clients to benefit from a more generalized model compared to training in isolation. By keeping data decentralized, federated learning allows collaborative learning while maintaining data privacy and security.

First proposed by [31], the goal of the traditional FL is to minimize the following weighted average objective of clients:

$$\begin{aligned} \min \limits _\omega ~ \sum \nolimits _{i = 1}^m p_i f_i(\omega ), \end{aligned}$$
(5)

where m is the number of clients, and \(p_i\ge 0\) with \(\sum \nolimits _{i=1}^{m} p_i=1\) are the aggregation weights, often set to \(p_i = \frac{1}{m}\) in practice. \(f_i(\omega )\) denotes the local objective of client i, which is usually defined by a specific loss function such as the cross-entropy loss and calculated as \(f_i(\omega )=\frac{1}{N_i}\sum \nolimits _{j=1}^{N_i} {f_{i_j}(\omega )}\), where \(N_i\) is the number of local samples and \(f_{i_j}(\omega )\) is the loss on the \(j^{th}\) sample.

Federated Averaging (FedAvg) is a classical and popular way to aggregate the local models of clients into a new global model [31]. The main procedure of FedAvg is shown in Fig. 2. At each communication round t, the clients participating in FL receive the global model \(\omega ^t\) from the cloud server and then train the model locally on their own data. Let \(\eta\) be the learning rate for model training; the local training result of client i is obtained by

$$\begin{aligned} \omega _i^{t} = \omega ^t - \eta g_i, \end{aligned}$$
(6)

where \(g_i\) represents the local gradient. After that, the clients upload their local training results (models or gradients) to the server. The new averaged model \(\omega ^{t+1}\) is then computed by the server using the following equation:

$$\begin{aligned} \omega ^{t+1} = \sum \limits _{i = 1}^m \frac{N_i}{N} \omega _i^t, \end{aligned}$$
(7)

where N is the sum of \(N_i\) over all clients, i.e., the total number of samples across clients.
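For concreteness, a minimal PyTorch sketch of one FedAvg communication round implementing Eqs. (6) and (7) is given below. The MSE loss, SGD optimizer, and data-loader interface are illustrative assumptions rather than the paper's exact implementation.

```python
import copy
import torch

def fedavg_round(global_model, clients, lr=0.01, local_epochs=1):
    """One FedAvg communication round.

    `clients` is a list of (dataloader, n_samples) pairs; aggregation weights
    are N_i / N. Assumes all state-dict entries are float tensors.
    """
    n_total = sum(n for _, n in clients)
    local_states = []
    for loader, n_i in clients:
        local = copy.deepcopy(global_model)                  # client receives omega^t
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _ in range(local_epochs):
            for x, y in loader:
                opt.zero_grad()
                loss = torch.nn.functional.mse_loss(local(x), y)
                loss.backward()
                opt.step()                                    # omega_i^t = omega^t - eta * g_i
        local_states.append((local.state_dict(), n_i / n_total))

    # Server aggregation: omega^{t+1} = sum_i (N_i / N) * omega_i^t  (Eq. (7))
    new_state = copy.deepcopy(local_states[0][0])
    for key in new_state:
        new_state[key] = sum(w * state[key] for state, w in local_states)
    global_model.load_state_dict(new_state)
    return global_model
```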

2.3 Model personalization

Wang et al. [44] first used FedAvg in NILM and proposed Fed-NILM, but a single global model obtained by FedAvg cannot fit each client’s data well due to client heterogeneity [6, 9]. To address this issue, many personalized FL (PFL) methods have been proposed. Some regularize personalized models toward the global model [19, 28, 29, 47]. Other studies share a model base and personalize the classifier layers for each client [1, 6, 9]. In addition, PFL has also been explored via clustering [17, 34, 36]. Some previous works also apply PFL to the NILM task, such as [24, 50]. However, Han et al. [18] showed that the personalized models trained by most PFL algorithms overfit each client’s local data, which is almost equivalent to Individual Learning. Differently, our proposed PerFedNILM builds the personalized model by determining a direction that conflicts neither with the client’s local update nor with the global model update, and thus achieves a better balance between personalization and generalization in FL.

3 Methodology

To address the three challenges of adopting FL in NILM, we propose the Personalized Federated Learning NILM (PerFedNILM) framework. Algorithm 1 summarizes the main steps of PerFedNILM. At each communication round t, the server sends the global model \(\omega ^t\) to each newly joined client, i.e., \(i \in S^t \backslash S^{t-1}\). After that, each online client performs local training on its own power consumption data and sends the training result to the server. After receiving all the local training results, the server performs model averaging (including the stored results of the absent clients who were online at the last communication round) and calculates the weight \(\lambda _i\) for each client i to build its personalized model. Finally, each client i receives the new global model update \(g^t = \omega ^t - \omega ^{t+1}\) and the weight \(\lambda _i\), and obtains its personalized model by \(\omega _i^{t+1} = \omega ^{t+1} + d_i^t\).

Algorithm 1 PerFedNILM Algorithm

3.1 Model architecture

Fig. 3 The architecture of our model

We adopt the SOTA model, Seq2point, for the NILM task in our algorithm. However, the previous Seq2point model only outputs the predicted energy consumption of a single appliance, requiring users to train a distinct Seq2point model for each appliance to be disaggregated, which is not practical in reality. Instead, we design a shared Seq2point model that can predict the power readings of multiple appliances simultaneously. The architecture of the model is shown in Fig. 3: the model takes window samples of the mains power as input and outputs the midpoint power consumption of each window for all appliances. We add dropout operators after the model layers to prevent the model from overfitting to particular power patterns.
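A minimal PyTorch sketch of such a shared Seq2point model is given below. The number of convolutional layers, channel widths, and dropout rate are illustrative assumptions; only the overall structure (CNN feature extractor, dropout after layers, and a multi-appliance midpoint output head) follows the description above.

```python
import torch
import torch.nn as nn

class SharedSeq2Point(nn.Module):
    """Sketch of a shared Seq2Point model: one convolutional feature extractor
    with dropout and a single head predicting the midpoint power of all
    appliances at once. Layer sizes are illustrative."""

    def __init__(self, window=30, n_appliances=6, p_drop=0.1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 30, kernel_size=5, padding=2), nn.ReLU(), nn.Dropout(p_drop),
            nn.Conv1d(30, 40, kernel_size=5, padding=2), nn.ReLU(), nn.Dropout(p_drop),
            nn.Conv1d(40, 50, kernel_size=5, padding=2), nn.ReLU(), nn.Dropout(p_drop),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * window, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, n_appliances),       # midpoint power of each appliance
        )

    def forward(self, x):                       # x: (batch, window) mains readings
        return self.head(self.features(x.unsqueeze(1)))
```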

3.2 Limiting the local update bias

Due to clients’ different appliance usage behaviors, a client’s local update direction \(g_i^t = \omega _i^t - \omega _i^{t+1}\) may differ greatly from the global model update \(g^t = \omega ^t - \omega ^{t+1}\); sometimes they even conflict with each other [45], i.e., \(g_i^t \cdot g^t < 0\). This reduces the performance of the global model on these clients. To mitigate the local update bias, we add a proximal term to the local objective of each client. Thus, instead of minimizing the original local objective (4), each client minimizes the following global-regularized objective:

$$\begin{aligned} \min \limits _{\omega } h_i(\omega , \omega ^t) = f_i(\omega ) + \frac{1}{2}\Vert \omega - \omega ^t\Vert ^2, \end{aligned}$$
(8)

where \(f_i(\omega )\) is the original FL local objective in the NILM task, which is equivalent to Eq. (4). The proximal term provides two key advantages. First, it addresses statistical heterogeneity by keeping local client updates close to the global model, eliminating the need to carefully tune the number of local epochs. Second, it allows for safe aggregation of the local models, which mitigates the update bias between the global model and the local models. It is worth noting that our method does not impose additional communication costs or privacy overheads, since the clients have already received the information needed for computing (8) when communicating with the server.
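A minimal sketch of the client-side training step that minimizes the global-regularized objective (8) is shown below; the optimizer, MSE loss, and data-loader interface are illustrative assumptions.

```python
import torch

def local_train_with_proximal(model, global_model, loader, lr=0.01, epochs=1):
    """Local training of client i with the objective of Eq. (8):
    h_i(omega, omega^t) = f_i(omega) + 1/2 * ||omega - omega^t||^2."""
    global_params = [p.detach().clone() for p in global_model.parameters()]
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)       # f_i(omega), Eq. (4)
            prox = sum(((p - g) ** 2).sum()                        # ||omega - omega^t||^2
                       for p, g in zip(model.parameters(), global_params))
            (loss + 0.5 * prox).backward()
            opt.step()
    return model
```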

3.3 Model personalization

In Personalized Federated Learning (PFL), generalization requires that the model have good average performance across clients, while personalization requires that the model adapt to each client’s power consumption characteristics. Thus, PFL is an FL paradigm that lies between traditional FL (such as FedAvg) and Individual Learning. However, due to the heterogeneous data of power consumers, balancing the trade-off between generalization and personalization in PFL is a big challenge [6, 9]: in the heterogeneous setting, the update direction of the local training easily conflicts with that of the global model updated by the server, so it is very easy to sacrifice generalization performance when enhancing the personalization of the model.

To address this issue, we propose a method to protect the generalization performance while enhancing model personalization. Let \(\lambda _i\) be a dynamic weight for mixing the local training result \(\omega _i^t\) obtained by client i and the global model \(\omega ^{t+1}\) updated by the server, and denote \(d_i^t = (-g_i^t-g^t)\lambda _i + g^t\) as the update for building the personalized model of client i, so that \(\omega _i^{t+1} = \omega ^{t+1} + d_i^t\). The dynamic weight \(\lambda _i\) is obtained by

$$\begin{aligned} \begin{array}{l} \max \limits _{\lambda _i} ~ \cos (g_i^t, d_i^t), \\ s.t. ~ g^t \cdot d_i^t \le 0, \end{array} \end{aligned}$$
(9)

where \(\cos (g_i^t, d_i^t)\) is the cosine similarity between the vectors \(g_i^t\) and \(d_i^t\). Figure 4 visualizes the relationship among \(g_i^t\), \(g^t\), and \(d_i^t\), showing that \(d_i^t\) is as close as possible to the local update \(g_i^t\) while not conflicting with the global model update, and thus it improves the model’s personalization while protecting its generalization. After that, each client i builds its personalized model by \(\omega _i^{t+1} = \omega ^{t+1} + d_i^t\).
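Since no closed-form solution for \(\lambda _i\) is given here, the following sketch finds it by a simple grid search over \([0, 1]\) on flattened update vectors, maximizing the cosine similarity of Eq. (9) subject to its constraint. The search range and grid resolution are our assumptions.

```python
import numpy as np

def personalization_direction(g_i, g, n_grid=101):
    """Choose d = (-g_i - g) * lam + g, picking the lam that maximizes
    cos(g_i, d) subject to g . d <= 0 (Eq. (9)). g_i, g: 1-D numpy arrays.
    Returns (None, None) if no feasible lam is found on the grid."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    best_lam, best_cos, best_d = None, -np.inf, None
    for lam in np.linspace(0.0, 1.0, n_grid):
        d = (-g_i - g) * lam + g
        if g @ d <= 0 and cos(g_i, d) > best_cos:     # feasibility + objective
            best_lam, best_cos, best_d = lam, cos(g_i, d), d
    return best_lam, best_d
```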

Fig. 4 A schematic diagram of the calculation of the model personalization direction \(d_1^t\) at the \(t^{th}\) communication round. \(g_1^t\) and \(g_2^t\) denote the local update directions of client 1 and client 2, and \(g^t\) denotes the global model update direction. \(-d_1^t\) is the direction closest to \(g_1^t\) that lies in the non-conflicting area

3.4 Tackling the client dropout

In practice, due to system heterogeneity, the training times of clients differ greatly, so not all clients can upload their model updates to the server at each communication round. In addition, intermittent FL participation is desirable for some users, e.g., because of outdoor work obligations that require the household power to be shut off during the day [8]. Hence, client dropout is a common occurrence in FL, posing a significant challenge to FL algorithms since the server receives no information from absent clients. In such cases, simply aggregating the local updates of the online clients using the traditional aggregation function (10) below cannot prevent the loss of the knowledge previously learned from the absent clients.

$$\begin{aligned} \omega ^{t+1} = \frac{1}{|S^t|}\sum \nolimits _{i \in S^t} \omega _i^t, \end{aligned}$$
(10)

where \(S^t\) represents the set of online clients at the current communication round t. As a result, the updated model can perform significantly worse when the absent clients return to FL in the future.

To address this issue, we take the historical information of these absent clients into account when the server performs aggregation. First, we measure the participation rate \(\mu _i\) of client i, calculated as \(\mu _i = t_i / t\), where \(t_i\) is the number of rounds in which client i has participated in FL and t is the current communication round. The aggregation function is then defined as

$$\begin{aligned} \omega ^{t+1} = \frac{1}{|S^t|}\sum \nolimits _{i \in S^t} \omega _i^t + \frac{1}{|S^{t-1}\backslash S^{t}|}\sum \nolimits _{j \in S^{t-1}\backslash S^{t}} \mu _j \omega _j^{t-1}. \end{aligned}$$
(11)

where \(\omega _j^{t-1}\) is the most recent local model uploaded by absent client j, who was online at round \(t-1\). After obtaining \(\omega ^{t+1}\) by (11), we check whether the resulting global model update \(g^t = \omega ^t - \omega ^{t+1}\) conflicts with the original global model update \((g^t)^\prime = \omega ^t - (\omega ^{t+1})^\prime\), where \((\omega ^{t+1})^\prime\) is the model updated by the traditional aggregation function (10). We choose \(\omega ^{t+1}\) as the final updated global model if \(g^t\) does not conflict with \((g^t)^\prime\), i.e., \(g^t \cdot (g^t)^\prime > 0\); otherwise, Eq. (10) is used as the aggregation function. This is because the objective of the global model is to minimize Problem (5): if \(g^t\) conflicts with \((g^t)^\prime\), convergence of \(\omega\) can no longer be ensured, since the obtained \(g^t\) is no longer a descent direction for Problem (5). Such a conflict can occur when the historical local updates of the absent clients differ greatly from the updates they would produce if they were currently online. Therefore, our strategy for handling client dropout does not affect the convergence of FL, while easing the negative impact of losing the knowledge learned from the absent clients.
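A minimal sketch of this dropout-aware aggregation with the conflict check, operating on flattened parameter vectors, is given below. The data structures holding the stored models and participation rates are illustrative assumptions.

```python
import numpy as np

def aggregate_with_dropout(online_updates, absent_history, participation, prev_global):
    """Server aggregation under client dropout (Eqs. (10)-(11)).

    online_updates: dict of online client id -> omega_i^t (1-D array).
    absent_history: dict of absent client id -> its last uploaded model.
    participation: dict of client id -> mu_i = t_i / t.
    prev_global: omega^t as a 1-D array.
    """
    online = np.mean(list(online_updates.values()), axis=0)              # Eq. (10)
    if not absent_history:
        return online
    historical = np.mean([participation[j] * w for j, w in absent_history.items()], axis=0)
    candidate = online + historical                                      # Eq. (11)

    # Keep the reweighted model only if its update direction does not conflict
    # with the plain-averaging update direction; otherwise fall back to Eq. (10).
    g_new, g_plain = prev_global - candidate, prev_global - online
    return candidate if float(g_new @ g_plain) > 0 else online
```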

4 Experiments and analysis

4.1 Experimental setup

All experiments are conducted on a server with Intel(R) Xeon Silver 4216 CPUs and NVidia RTX 3090 GPUs.

Case study The experiments use the REFIT electrical load dataset [33], which was collected from 20 households in Loughborough, England, between 2013 and 2015. The dataset provides high-resolution measurements of both aggregate and appliance-level power consumption, offering a rich source of data for training and evaluating energy disaggregation models. The diversity of monitored appliances and the longitudinal nature of the REFIT data enable robust model development and testing on realistic residential energy usage profiles. We utilize the model described in Sect. 3.1 for learning and set the window size to 30, which corresponds to roughly 4 minutes of data in the REFIT dataset.

To simulate the FL scenario, we set up four cases with 4, 8, 16, and 32 clients joining FL, respectively. Each client represents a data owner holding its own total and appliance-level load data. Six appliances from the REFIT dataset are selected: freezer, fridge, washing machine, television, microwave, and kettle. Table 1 provides the power thresholds used to determine the on/off status of each monitored appliance. The mains and appliance-level load signals are derived from the continuous time series in the REFIT dataset, taken from a random month between July 1, 2014 and May 1, 2015. For each client, the first 70% of the data is used for training and the rest for testing.

Table 1 Power thresholds for appliance state detection in the experiment

Baselines and experimental settings We compare PerFedNILM with the baseline Fed-NILM [44], which uses the traditional FL algorithm in NILM. We also compare with Individual Learning and Centralized Learning, as well as a recent method, FedDeepAR [32], which utilizes local fine-tuning for personalization in FL-based NILM while training the global model in the same way as Fed-NILM. The testing procedure is conducted on each client’s own local testing data. For Individual Learning, we do not test each client’s local model directly, since such models are prone to local overfitting and lack generalization ability; instead, we aggregate the local training results of Individual Learning as the final model. For Centralized Learning, we combine the data of all FL clients to simulate the scenario where the server can receive the data from all clients.

During the training process, all FL clients use Batch Gradient Descent (BGD) on their local datasets, with the ADAM optimizer [27] for local training. We set the learning rate \(\eta \in \{0.001, 0.01, 0.1\}\) with a decay of 0.999 at each communication round, and the number of local training epochs \(E\in \{1, 3, 5\}\). For each method, we report the result with the best performance. The batch size is set to 128, which is considered privacy-preserving enough in Federated Learning [21], since a small batch size would elevate the risk of gradient inversion attacks from malicious clients or an unreliable server.
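As a small illustration of this configuration, the sketch below builds the local optimizer with the stated per-round learning-rate decay; the exponential form of the decay and the helper name are assumptions.

```python
import torch

def make_local_optimizer(model, base_lr=0.01, round_idx=0, decay=0.999):
    """Local Adam optimizer with a base learning rate from {0.001, 0.01, 0.1},
    decayed by 0.999 at each communication round (assumed exponential decay)."""
    lr = base_lr * (decay ** round_idx)
    return torch.optim.Adam(model.parameters(), lr=lr)
```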

4.2 Evaluation metrics

We use two metrics to evaluate the algorithms.

  • Mean Absolute Error (MAE). It is a metric used to evaluate the performance of a regression model and is defined by

    $$\begin{aligned} MAE = \frac{1}{N_i}\sum _{j=1}^{N_i} \vert \hat{y}_j - y_j\vert , \end{aligned}$$
    (12)

    where \(N_i\) is the number of test samples of client i, \(y_j\) is the ground truth of the \(j^{th}\) test sample, and \(\hat{y}_j\) is the predicted value of the \(j^{th}\) test sample. A smaller MAE value indicates better regression capability. In the experiments of Sect. 4.2.1, we report the average MAE value across clients, which reflects the average performance of the FL model on each client’s local test data.

  • F1-score. The accuracy metric is often used to evaluate model performance in a classification task, but it is only valid when the classes of the dataset are balanced in size. In the NILM task, the appliance on/off states are quite unbalanced; for example, some appliances stay in the off state most of the time and are only on for short periods. To more accurately evaluate the quality of the model’s predictions, we adopt the F1-score, defined by

    $$\begin{aligned} \text {F1-score} = 2 \times \frac{Precision \times Recall}{Precision + Recall}, \end{aligned}$$
    (13)

    where Precision quantifies the proportion of correct “positive” predictions made by the model, and Recall assesses the ability of the model to correctly detect the target class (positive samples), calculated as the ratio of true positives to the total number of positive samples in the appliance’s power consumption time series. Here, “positive” represents the power-on state of the target appliance. The F1-score lies between 0 and 1, and a higher value indicates a stronger ability of the model to disaggregate the mains power consumption (a small sketch computing both metrics from thresholded power readings follows this list).
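A minimal sketch computing both metrics for a single appliance, assuming the on/off state is obtained by thresholding the power readings as in Table 1:

```python
import numpy as np

def nilm_metrics(y_true, y_pred, on_threshold):
    """MAE (Eq. (12)) and F1-score (Eq. (13)) for one appliance; the on/off
    state is derived by thresholding the power readings."""
    mae = float(np.mean(np.abs(y_pred - y_true)))

    on_true, on_pred = y_true >= on_threshold, y_pred >= on_threshold
    tp = np.sum(on_pred & on_true)
    precision = tp / max(np.sum(on_pred), 1)      # correct "on" predictions
    recall = tp / max(np.sum(on_true), 1)         # detected "on" samples
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return mae, f1
```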

4.2.1 Convergence efficiency

Fig. 5 The mean MAE across clients in the settings of (a) 4, (b) 8, (c) 16, and (d) 32 clients, with 100% of clients online at each communication round

Fig. 6 The mean MAE across 32 clients under participation rates of (a) 25%, (b) 50%, (c) 75%, and (d) 100%

4.3 Result analysis

We first visualize the average MAE across clients during the training process for the four cases with 4, 8, 16, and 32 clients, respectively. The average MAE of the global model is calculated at each communication round by applying the MAE metric to each client’s local testing data. From Fig. 5 we observe that PerFedNILM converges faster than the previous FL-based NILM algorithm and achieves a smaller average MAE. For Individual Learning, since each client trains its local model on its own data without sharing local updates at each communication round, the final aggregated model cannot converge.

We further compare the convergence efficiency under different client participation rates. Figure 6 visualizes the comparison results. It can be seen that when more clients drop out at each communication round, the performance of Fed-NILM degrades and the average MAE gets much higher. This is because some historical knowledge is lost during aggregation when clients drop out. In comparison, our PerFedNILM mitigates the negative impact of partial client participation, so convergence remains more stable when clients drop out of FL. We attribute this to the strategy of considering the historical local updates of absent clients when performing model aggregation. It provides a practical way to stabilize convergence under client dropout without compromising the convergence guarantee.

Table 2 F1-score of the total and each appliance for different NILM algorithms in the residential scenario with different numbers of clients
Table 3 F1-score of the total and each appliance for different NILM algorithms in the residential scenario with different participation rates of 32 clients

4.3.1 Quality of the energy disaggregation

In this part, we list the F1-score of each appliance for each algorithm. We also report the total F1-score over all appliances, obtained by treating all appliances as a single one when evaluating the F1-score. In Table 2, we list the F1-scores in four subcases with different numbers of clients. It can be seen that our proposed PerFedNILM achieves performance competitive with Centralized Learning and much better than Fed-NILM. For Individual Learning and Fed-NILM, the F1-scores of some appliances even drop to zero, which does not happen with PerFedNILM. The F1-scores of some appliances under PerFedNILM even slightly exceed those of Centralized Learning. This is not surprising: although Centralized Learning has access to all clients’ training data, it can sometimes be affected by the particular power patterns of individual clients. Because FL aggregates local model updates, it can mitigate the negative impact of such special data patterns, resulting in a better aggregated model.

In Table 3, we report the F1-scores of the appliances in four subcases in which 25%, 50%, 75%, and 100% of the 32 clients are online at each communication round. The results verify that PerFedNILM disaggregates energy more stably when clients drop out during training.

From Tables 2 and 3, we observe that it is hard for the algorithms to accurately recognize the power patterns of the washing machine and the kettle, as some of the corresponding F1-scores of Fed-NILM are near zero. Compared to Fed-NILM, PerFedNILM achieves a much higher F1-score for the washing machine. We attribute this to the model personalization in PerFedNILM, which reduces the update bias among clients while balancing generalization and personalization, so that the model can better fit each client’s local power patterns.

4.3.2 Visualization of NILM

Fig. 7 The power disaggregation results on one client using the final models obtained by Individual Learning, Fed-NILM, and PerFedNILM

In Fig. 7, we visualize the power pattern of one client in the experiment. The figures in the first column plot the mains power and each appliance’s power consumption. The figures in the next three columns show the true appliance power, the model’s predicted appliance power, and the power-on threshold for Individual Learning, Fed-NILM, and PerFedNILM. It can be seen that the aggregated model obtained by Individual Learning has no power disaggregation capability at all. For Fed-NILM and PerFedNILM, the disaggregation performance is similar for the freezer, fridge, washing machine, and kettle. However, for the microwave, Fed-NILM cannot detect the power pattern from the total power consumption, whereas our proposed PerFedNILM recognizes the power state of the microwave well.

Table 4 F1-score of the total and each appliance in the ablation experiment

4.3.3 Ablation analysis

In Table 4, we evaluate several variants of PerFedNILM to observe the effectiveness of each part.

M1 Remove the strategy of limiting the local update bias in Sect. 3.2. Without limiting the local update bias, the F1-score of the energy disaggregation drops markedly, verifying that it is necessary to handle the update bias of clients.

M2 Ignore the personalization strategy in Sect. 3.3. The results show that this variant performs better than M1 but much worse than PerFedNILM, because a single global model cannot fit all clients; with personalization, the models perform better on individual clients.

M3 Do not handle the issue of client dropout in Sect. 3.4. Its results are worse than PerFedNILM, verifying that taking into account absent clients can help improve the model performance.

4.3.4 Runtime analysis

Table 5 Computation time (seconds) of the server and clients for each algorithm during training over 1000 communication rounds

In Table 5, we report the computation time of the server and clients for each algorithm during training. It can be seen that the proposed PerFedNILM does not add much extra computation time on the server. For Centralized Learning, since the model is trained on the collected data of all clients at the server, its server-side computation time is much higher than that of the other methods.

5 Conclusion and future work

In this work, we analyze three challenges of adopting Federated Learning in NILM to improve data cooperation: slow convergence resulting from the local update bias, poor performance of the FL model, and partial client participation. We then propose a novel FL algorithm for NILM, named PerFedNILM, to tackle these challenges. To handle the first challenge, we restrict the local update bias across clients by incorporating a proximal term in the local objective. Next, we propose an effective method for computing the direction for updating personalized models without interfering with the global model update, which helps the model better fit each power consumer’s load pattern while preserving its generalization capability. Moreover, to handle the common issue of client dropout, we devise a reweighting approach that incorporates the historical information of absent clients into the FL aggregation. Empirical results on real-world energy data demonstrate that PerFedNILM achieves a better FL model for energy disaggregation in terms of both convergence and performance, and shows greater aptitude in identifying specific load patterns from the total power consumption data of energy consumers. Interesting future directions include designing better AI-based NILM models and reducing computation costs by using stochastic configuration machines [42] when adopting FL in NILM on edge devices.