1 Introduction

Recent advancements in machine learning have propelled the broad adoption of smart technologies, particularly the Internet of Things (IoT). Worldwide, the number of IoT devices is expected to nearly triple from 8.74 billion in 2020 to over 25 billion in 2030 [1]. On the one hand, the massive data collected from IoT devices are critical for constructing robust machine learning models and have promoted the growth of innovations in the era of big data. Indeed, real-world machine learning advances rely on the availability of huge amounts of well-labelled data, as exemplified by ImageNet [2] and AlphaZero [3]. On the other hand, big data, characterized by high volume, high velocity, and high diversity [4], cannot be used directly as high-quality, ready-made inputs, posing significant challenges to the development of data-driven real-world machine learning systems.

In the era of big data, the challenges of developing data-driven machine learning systems differ radically from those of classic theoretical frameworks, owing to the inherent characteristics of big data and the restrictions imposed by data regulations and laws, such as the new General Data Protection Regulation (GDPR) [5]. These distinctions have important effects on the assumptions and performance indicators of data-driven machine learning systems and may stimulate the development of more innovative and practical machine learning algorithms. We begin this review by identifying the modern challenges in real-world machine learning and then present an overview of our survey. Finally, we consider how our survey contributes to the related fields.

1.1 Modern challenges in real-world machine learning

Machine learning has been widely applied to various real-world applications with satisfactory results. This survey identifies significant modern challenges in the era of big data and discusses their impact on developing real-world machine learning models at both the data and model levels, since a good machine learning model requires plentiful training data and a well-designed model.

From the data standpoint, high-quality datasets provide the comprehensive information essential for building an effective machine learning model. However, in real-world machine learning applications, data may not be stored in a centralized location and may exhibit statistical disparities [6], a situation referred to as data silos. Medical records, for example, are private and stored in isolated medical facilities; some facilities may only contain unlabelled data, whereas others may only hold a few labelled records. Moreover, data labeling is prohibitively expensive, particularly in fields requiring human skill and domain expertise, such as the medical sector. The lack of labelled data is therefore another obstacle to the development of real-world machine learning, since model performance is highly dependent on labelled data [2]. Data collection has also become increasingly challenging from a legislative standpoint, an issue referred to as data governance. For example, the GDPR [5] contains several provisions that protect user privacy and restrict companies from transferring data without explicit user consent. Furthermore, the real-time data collected by IoT devices enable more effective resource allocation but pose additional challenges to conventional offline machine learning frameworks that rely on pre-given training data. For instance, real-time data on road conditions are collected and analysed to improve traffic management in smart cities, which necessitates a dynamic machine learning framework capable of handling streaming training samples [7].

From the model perspective, a well-designed model can make effective inferences and meet the needs of various tasks. However, the non-independent and identically distributed (non-IID) nature of real-world data complicates the training of a single model that can be applied to all tasks. For instance, a next-word prediction model applied to the same phrase should suggest completions tailored to each local user; local users label the same data differently, necessitating the development of customized models [8]. As a result, model personalization is increasingly popular for meeting the diverse needs of various users. Another current challenge in real-world machine learning is rapidly inferring a high-performance model for new users and effectively updating existing models, i.e. constructing models efficiently. For instance, in a distributed system, conventional machine learning models based on a pre-given dataset must be retrained every time a new user joins, wasting both bandwidth and computing resources.

Various solutions have been proposed to address the aforementioned challenges, including online transfer learning (OTL) [9, 10] and online federated learning (OFL) [11]. OTL and OFL extend the concepts of transfer learning (TL) [12] and federated learning (FL) [13] to the online setting, allowing these advanced methods to process online big data efficiently. By leveraging knowledge from source domains, OTL aims to develop an online model for the target domain, addressing the challenge of predicting sequentially arriving data in the target domain when well-labelled data are scarce. While OTL addresses the problem of data labeling in the online context, it still requires central access to data in both the source and target domains, which may violate data privacy and security standards in the era of big data. OFL, on the other hand, focuses on training a central model using real-time data generated by multiple distributed local devices without violating data privacy regulations. During each training round, only the updated parameters of each local model are transmitted to the central model, ensuring the performance of the central model while maintaining data privacy.

1.2 Literature retrieval strategy and results

The selection of relevant sources and publications was based on standard criteria and protocols. The following three search engines and databases were chosen: (1) ScienceDirect, (2) the Institute of Electrical and Electronics Engineers (IEEE), and (3) Google Scholar. The literature search covers work published between January 2010 (OTL was first proposed in 2010 [9]) and November 2021.

We selected “online transfer learning” and “online federated learning” as our key phrases and included their synonyms as supplementary terms to expand the search results. Accordingly, the following key search strings were used for OTL during the literature retrieval stage:

  • Online/dynamic/adaptive transfer learning

  • Online/dynamic/adaptive transformation learning

and the following key search strings were used for OFL during the literature retrieval stage:

  • Online/dynamic/adaptive federated learning

  • Online/dynamic/adaptive federated machine learning

The keywords within each search string are combined using the Boolean operator ‘AND’; the search strings themselves are connected with the Boolean operator ‘OR’. After excluding studies with incomplete title and abstract information, 35 papers were included in this review: 20 OTL studies and 15 OFL papers. Figure 1(a) and (b) show the total retrieved articles and journal papers published from January 2010 to November 2021, along with their linear growth trends, for OTL and OFL, respectively.

Fig. 1 Statistical results of the obtained papers

In summary, based on the linear growth trend of the total papers obtained, the development of OTL can be divided into two distinct phases: an initial stage from 2010 to 2016 and a developing stage from 2017 to the present. The field of OFL, by contrast, is still in its infancy, and the scarcity of journal publications indicates that there is still considerable potential for OFL research.

1.3 Overview of this survey

The purpose of this paper is to provide a detailed survey of various methods for addressing modern machine learning challenges, focusing on OFL and OTL.

Figure 2 illustrates the blueprint of this survey. The green areas are our main emphasis, whereas existing surveys concentrate only on the yellow areas. The red section is one of the critical future paths we suggest for further investigation. We consider federated and transfer learning in online scenarios. Rather than organizing OTL by the traditional TL categories [12, 14], i.e. transductive, inductive, and unsupervised TL, we discuss OTL from two viewpoints: domain-based OTL and task-based OTL. Furthermore, we review OFL from three aspects: statistical heterogeneity, system heterogeneity, and privacy guarantees, highlighting the most significant challenges. The main contributions of our work are summarized as follows:

  • To the best of our knowledge, this is the first survey to present recent advances in OTL and OFL studies, and to identify potential future research directions. It aims to serve as a resource for researchers and practitioners developing online federated and transfer learning frameworks.

  • We provide definitions of OTL and OFL, as well as new viewpoints on them. Additionally, we describe recent advances in online federated and transfer learning, and highlight the connections between different methods.

  • We summarize popular datasets and cutting-edge applications of OTL and OFL, discuss practical considerations and provide insights into potential future research directions.

Fig. 2 Blueprint of our survey. FTL: Federated transfer learning; OFTL: Online federated transfer learning

The remainder of this survey is structured as follows. Section 2 reviews and reports on related works, which provides the necessary background on OTL and OFL. Then, recent advances in OTL and OFL are reviewed in Sections 3 and 4, respectively. Practical considerations in datasets and applications of OTL and OFL are summarized and presented in Section 5. In Section 6, we conclude this survey and discuss future research directions.

2 Related work

In this section, we review related work on OTL and OFL, including TL, FL, FTL, and OL. Moreover, we summarize the implementation scenarios of these methods and identify the existing challenges.

2.1 Transfer learning

Most of the traditional machine learning algorithms assume that the training and test data have similar distributions and feature spaces. However, this assumption does not hold in the majority of real-world scenarios. Furthermore, traditional machine learning has been hampered by a lack of adequately labelled training data and mismatched computing capability. TL [12] was proposed to address these challenges by leveraging knowledge from a single or multiple source domains to enhance a training task in the target domain (Fig. 3). The knowledge transferred could be instances from source domains [15], shared features from source domains and the target domain [16, 17], parameters from the trained learners of source domains [18], or relations between source domains and the target domain [19].

Fig. 3 Transfer learning process

According to the implementation scenario, TL can be categorized into single-source TL and multi-source TL. Single-source TL transfers knowledge from a single source domain [20], whereas multi-source TL utilizes several source domains [21, 22]. Moreover, different TL techniques have been proposed to handle similar or different data structures between the source and target domains, i.e. homogeneous and heterogeneous TL [23, 24].

According to different label settings, a variety of TL methods have been proposed and can be classified into three major categories, i.e. transductive, inductive, and unsupervised TL [12].

Inductive TL is used when the target domain has well-labelled data and the source and target domains have different tasks. TrAdaBoost [25] is a well-known inductive TL technique that extracts valuable information from the source domain by re-weighting instances in both the source and target domains according to their predictions, as sketched below. However, this method utilizes only a single source domain, and the extracted information may not be sufficient for the training task in the target domain. To address this challenge, [26, 27] combined the transfer task with multiple source domains, which enhanced the training performance of the target model. Moreover, unlike [25], which retained only one base learner and discarded the rest, [28] assumed that all base learners are useful, on the grounds that older learners capture the major distributions of instances while newer learners provide accurate information about subsequent iterations.
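To make the re-weighting idea concrete, the sketch below implements one TrAdaBoost-style boosting round; the decay factors follow the commonly cited formulation, while the base learner, the per-instance errors, and the assumption that the target error lies in (0, 1/2) are illustrative simplifications.

```python
import numpy as np

def tradaboost_round(w_src, w_tgt, err_src, err_tgt, tgt_err, n_rounds):
    """One TrAdaBoost-style re-weighting round (an illustrative sketch).

    w_src, w_tgt     -- current instance weights for source / target data
    err_src, err_tgt -- per-instance errors |h(x) - y| in [0, 1]
    tgt_err          -- weighted error of the learner on the target data,
                        assumed to lie in (0, 1/2)
    """
    # Fixed decay factor for source instances, as in the boosting analysis
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(w_src)) / n_rounds))
    beta_tgt = tgt_err / (1.0 - tgt_err)          # AdaBoost-style factor
    # Down-weight misclassified source instances (likely unhelpful for the
    # target task) and up-weight misclassified target instances.
    w_src = w_src * beta_src ** err_src
    w_tgt = w_tgt * beta_tgt ** (-err_tgt)
    total = w_src.sum() + w_tgt.sum()
    return w_src / total, w_tgt / total           # renormalize the weights
```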

Transductive TL is used when the source domain data is labelled, but the target domain data is unlabelled, and both the source and target domains have the same task. Domain adaptation is the most well-known subfield of transductive TL [29], which aims to minimize the marginal distribution gap between the source and the target domains. Xia et al. [30] proposed a method for selecting and weighting instances based on PU learning to identify examples from the source domain that are most likely to improve the training task. However, this method was limited by the difficulty of dealing with high-dimensional distributions. A solution was provided by [15], using the logistic approximation to adapt the high-dimensional data from the source domain to the target domain.

In real-world situations, both the source and target domains may lack sufficient well-labelled data, which cannot be addressed by the TL techniques discussed so far. As a solution, unsupervised TL was introduced. Wang et al. [31] proposed transferred discriminative analysis (TDA), a method for generating class labels for unlabelled target data by leveraging knowledge from the source domain. Although unsupervised learning is a more practical solution in TL, it has received little attention from researchers over the last decade.

2.2 Federated learning

IoT devices, such as smart healthcare devices and smart meters, continuously collect vast amounts of data. Models trained on the aggregated data of these applications enable efficient management of smart city services; however, the aggregation process is complicated by a variety of legal constraints. In this context, FL has been proposed for training a global model from data distributed across multiple devices, with only intermediate updates periodically sent to a central server [13]. A typical FL paradigm is illustrated in Fig. 4: the central server distributes the initial model parameters to all local clients; each client then trains its local model and uploads the updated parameters to the central server; the global model is then updated and rebroadcast to the local clients. These steps are repeated continuously so that the global model is updated and optimized across all local clients.

Fig. 4 Federated learning process
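To make the loop above concrete, the following minimal sketch simulates a few FL rounds with a linear model and squared loss; the local update rule, the synthetic client data, and the equal-weight aggregation are illustrative assumptions rather than details of any cited system.

```python
import numpy as np

def local_train(w_global, X, y, lr=0.1, epochs=5):
    """Client-side step: start from the broadcast parameters and run a few
    gradient steps on the local data (linear model, squared loss)."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
w_global = np.zeros(3)
for _ in range(10):                            # repeated global rounds
    local_ws = [local_train(w_global, X, y) for X, y in clients]
    w_global = np.mean(local_ws, axis=0)       # simple equal-weight aggregation
```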

FL can be categorized into horizontal FL, vertical FL, and FTL, depending on how data are distributed among different devices in the sample and feature spaces. Since FTL is a novel combination of TL and FL, we discuss this technique in more detail in Section 2.3.

Horizontal federated learning (HFL) (Fig. 4) refers to the situation in which data from distributed devices share the same feature space but differ in samples. Google pioneered HFL by utilizing data distributed across many local Android devices to forecast text input without violating privacy regulations [32]. Abad et al. [33] then developed a hierarchical HFL architecture that extends HFL to heterogeneous environments, optimizing communication efficiency for local devices with heterogeneous networks. Additionally, [34] designed a secure aggregation scheme based on [32] to further enhance the privacy of aggregated intermediate updates. Further research [35, 36] has addressed the high cost of communication in the HFL framework.

Vertical federated learning (VFL) was proposed on the premise that heterogeneous data from various devices share common sample IDs but have distinct feature spaces; VFL thus focuses on the correlation between devices from different sectors. In a typical VFL process, data with common sample IDs are retrieved and used to train a machine learning model (Fig. 5). VFL is more difficult to implement than HFL since it requires encrypted user-ID alignment algorithms [37] for common entities [13] and the involvement of a fully trusted third party. To overcome these obstacles, [38] developed a framework that eliminates the need for a third-party coordinator and has proven to be efficient and scalable. Although VFL is capable of handling heterogeneous domains, the majority of VFL techniques rely on statistical models such as logistic regression rather than sophisticated machine learning frameworks, indicating that this field still demands considerable effort.

Fig. 5 Vertical federated learning process

Apart from data distribution, FL can be categorized in a variety of ways. Based on the network topology, FL can be classified into centralized FL and peer-to-peer (P2P) FL [39, 40]. Centralized FL generally relies on a central server to aggregate and broadcast the updated parameters. In contrast to centralized FL, P2P FL does not rely on the central server for local model updates but instead exchanges parameters directly between neighbours. Based on data availability, FL can be classified into cross-silo FL and cross-device FL [41]. The cross-silo FL is suitable for scenarios involving a small number of local clients, in which siloed data are sourced from geo-distributed data centres (e.g. local banks or medical centres) instead of a large number of distributed edge nodes (e.g. smartphones or laptops). This is because almost every local client within the cross-silo FL is considered indexed and available for constant updating at any time. On the other hand, cross-device FL is used when there are a large number of participants and the local clients are not always available. To compensate for the unreliability of local clients, the cross-device FL often employs resource allocation techniques [42] and incentive mechanisms [43] to improve the overall performance of the FL framework.

2.3 Federated transfer learning

Distinguished from HFL and VFL, FTL [44] refers to situations in which data across multiple devices differ in both feature spaces and sample IDs, and it is regarded as a significant extension of traditional FL frameworks [13]. By enabling users to leverage large datasets with well-trained machine learning model parameters, FTL goes beyond simply allowing users to exploit only matching data (i.e. data with overlapping feature spaces or sample IDs) [45]. Figure 6 depicts the general process of FTL. The use of TL in FL systems addresses the lack of well-labelled data in the source devices and enables various sectors to train more personalized local models in a secure and private manner. It is worth pointing out that, while TL and FL are natural complements, relatively little attention has been paid to the FTL framework.

Fig. 6 Federated transfer learning process

Similar to conventional FL methods, the major impediment to FTL development is training on data in heterogeneous settings, which is further complicated by the restrictive assumptions of FTL application scenarios. Gao et al. [46] developed a heterogeneous FTL framework to address feature heterogeneity by mapping the feature spaces of common features to those of uncommon features. Moreover, to enable FTL in heterogeneous intelligent manufacturing applications, [47] utilized pre-built models from a variety of smart environments as the central source domain; the central server would then select the best model to broadcast based on the similarity between the central source models and the local target models. Accordingly, each heterogeneous local device conducts TL to acquire an application-specific model. Furthermore, communication efficiency is another concern in FTL. In [48], secret sharing (SS) was adopted to improve the communication efficiency and increase the privacy level of FTL.

FTL has received growing interest in real-world applications, such as smart healthcare [49], traffic monitoring [50], smart energy [51], and image analysis [52]. The majority of current FTL systems are based on deep learning architectures [47, 49, 51, 52] that usually freeze the base layers of the global model and retrain the fully connected layers on local devices. Chen et al. [49] performed human activity recognition via FTL, replacing one of the fully connected layers with a correlation alignment layer to facilitate domain adaptation. FTL with deep learning architectures is efficient because features in the low-level layers of a deep network are highly transferable, while the high-level layers can capture task-specific features [53].

2.4 Online learning

OL is a machine learning paradigm for real-time data that uses feedback from sequentially arriving data to learn and update the best predictor for future data. The primary goal of OL is to minimize the cumulative error over the entire data sequence relative to the optimal model in hindsight [54]. Compared to conventional batch learning algorithms, which require pre-given training data, OL is generally more effective and scalable for large-scale real-world machine learning problems involving data of varying quantity and velocity.

OL has been extensively investigated for many years [55, 56]. There are two fundamental types of OL algorithms: first-order OL and second-order OL [56]. The Perceptron [57, 58] is one of the earliest first-order OL algorithms, relying on gradient feedback to update a linear classifier whenever a new sample is misclassified. Passive-Aggressive (PA) [59] was introduced as a family of first-order OL algorithms based on margin-based learning; it updates the model when the classification confidence on a new sample falls below a predefined threshold. Moreover, online gradient descent [60,61,62] was proposed to cast OL as an online convex optimization problem.
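As a concrete illustration of a first-order update, the minimal sketch below implements the classic Perceptron rule for a linear classifier with labels in {−1, +1}; the learning rate and the synthetic stream are illustrative assumptions.

```python
import numpy as np

def perceptron_step(w, x, y, lr=1.0):
    """Classic Perceptron update: adjust w only when (x, y) is misclassified."""
    if y * np.dot(w, x) <= 0:        # mistake (or zero margin)
        w = w + lr * y * x           # first-order, gradient-style correction
    return w

# Toy online run over a synthetic stream
rng = np.random.default_rng(0)
w = np.zeros(5)
for _ in range(100):
    x = rng.normal(size=5)
    y = 1.0 if x[0] > 0 else -1.0    # assumed ground-truth concept
    w = perceptron_step(w, x, y)
```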

Misclassified instances are retained as support vectors (SVs) in standard OL algorithms (e.g. Perceptron and PA). Despite their solid theoretical guarantees and efficiency, a fundamental issue is that the growing number of SVs over time may result in increasing computational overhead. To overcome this challenge, [63] discarded the oldest SVs, assuming that they were less representative of the data stream. Additionally, [64] presented bounded online gradient descent (BOGD), which keeps the number of SVs below a predefined budget.

Unlike first-order OL algorithms, which rely only on the first-order derivative information of the gradient, second-order OL algorithms accelerate convergence by exploiting both first-order and second-order information. The second-order Perceptron algorithm [65] was designed to exploit the geometric properties of data. To capture second-order information about the confidence of the features, the confidence-weighted (CW) algorithm [66] was developed to manage the updating of the classifier. However, second-order OL requires substantial space and time for updates, and the sketched online Newton (SON) [67] was introduced to address this issue. SON is an enhanced version of the online Newton step with running time linear in the dimension and sketch size, allowing dramatic improvements in second-order learning efficiency.

2.5 Frontier implementation scenarios and inter-connections of TL, FL, FTL, and OL

TL, FL, FTL, and OL are all innovative approaches built on standard machine learning techniques to address modern challenges in real-world applications. In this subsection, we will outline their implementation scenarios to investigate the underlying relationship between them and discuss the existing challenges. By doing this we hope to highlight the significance of our survey. Table 1 compares the implementation scenarios of traditional machine learning, TL, FL, FTL, and OL, which can be used as a guide to assist professionals in selecting the most appropriate method to apply to specific real-world problems.

Table 1 Frontier implementation scenarios of different techniques

Traditional machine learning relies on a massive amount of well-labelled centralized data and assumes that all data collected are homogeneous [29]. However, many real-world scenarios require a more scalable, private, and dynamic machine learning framework that can manage real-time data from a variety of IoT devices. TL, FL, and OL were therefore proposed as solutions to these modern challenges.

Although TL is rarely studied as a mechanism for knowledge transmission in a decentralized environment, when combined with FL, i.e. as FTL, it is capable of transmitting knowledge across distributed devices. Additionally, TL in non-federated contexts typically involves instance transmission [15, 27, 68], posing a risk of privacy leakage. FL, on the other hand, preserves privacy [69, 70] by sharing local model update parameters instead of raw instances from local clients [71]. TL enhances target model performance by providing learners in target domains with a baseline rather than starting from scratch, thereby reducing computational overhead [72]. Standard FL, by contrast, may involve tens of millions or even billions of local devices [73], and requiring all of these devices to meet eligibility requirements on computational power in order to participate in training is impractical, as demonstrated in [74]. As a result, it is logical to apply TL to this framework in order to enable FL with clients who have limited processing capabilities.

Real-world applications require machine learning models that are resilient to heterogeneous data [41]. One of the most challenging heterogeneous scenarios is cross-modality [29], in which the feature and/or label spaces of the source and target domains are completely different; this is one of the primary causes of data heterogeneity in real-world machine learning applications. The key idea for addressing this problem is to identify feature mapping functions that project the source and target feature spaces to a common latent space via matrix factorization [75], using labelled source data or co-occurrence data [76]. TL for cross-modality commonly transfers knowledge from easily labelled source domains to an expensively labelled target domain. An example is the well-known text-to-image TL [77], which leverages the semantic meaning of labelled text to improve model classification performance on sparsely annotated image data. VFL and FTL are also applicable to cross-modality scenarios. However, the former can be used only if certain conditions are met, i.e. a large number of sample IDs overlap between the source and target domains [13]. Additionally, while TL and FTL seek to leverage knowledge from source domains to enhance target model performance, the ultimate goal of VFL is to assist all source and target parties in developing a ‘common wealth’ strategy [13]. As shown in the table, TL, FL, and FTL can all be used in cross-modality scenarios, which explains why all of these strategies can help overcome the lack of well-labelled data.

Aside from cross-modality heterogeneity, FL is well suited to cross-model and cross-system scenarios owing to its decentralized nature. In cross-model scenarios, which are also prevalent in fundamental machine learning applications, the structure of the locally trained models varies with the diverse data-usage patterns of local clients [78]. In such scenarios, FL typically uses a global model with a predefined paradigm as reference information, and clients can update local models with different structures [79, 80]. Ensemble strategies are frequently used to enable TL in cross-model scenarios; they combine multiple learners from different source domains or learning algorithms with a weight assignment strategy so as to maximize the utility of the candidate learners that perform better in the target domain [27]. Furthermore, most TL paradigms require all learners to be trained in a centralized and consistent environment, whereas real-world situations are more complicated. FL, on the other hand, is applicable in the presence of system heterogeneity arising from differences in storage, computing, and battery capacities between individual client devices. Xie et al. [81] developed an asynchronous FL framework (FedAsync) that adaptively updates the weights of local models in response to stale information, making FL more effective, flexible, and scalable.

Moreover, TL can personalize models in non-federated environments by leveraging data from source tasks to improve performance in a related target domain. However, when TL is applied across highly unrelated domains, the target model may perform worse than a model trained without the transferred source data, a phenomenon known as negative transfer [82]. Similarly, when local clients come from highly unrelated domains or system settings, training their local models in FL using a single consistent scheme may reduce the ability of each local model to capture unique client characteristics [83], resulting in an aggregated model that is worse than local models trained exclusively on their own datasets, which is recognized as a drift problem [84]. One of the most widely used strategies for mitigating negative transfer is to use effective selection mechanisms to determine the relatedness (also known as transferability [85]) between the source and target domains prior to the transfer [25, 77]. The drift problem, on the other hand, is more complicated and can be approached differently. Rather than avoiding it, most researchers have turned it into a feature [41] by applying techniques such as multi-task learning to the FL framework [86,87,88]: they create personalized or device-specific local models [71] for clients that are intended to perform better than the aggregated global model.

In the new era of big data, a prominent application scenario is modelling real-time data, which typically become obsolete within hours or even minutes [89]; examples include recommendation systems for business websites [90] and real-time non-intrusive load monitoring systems for the elderly living alone [91]. Additionally, real-world machine learning applications face a cold start problem [71], which refers to new clients or datasets entering the system from the source domain. Existing TL and FL methods are generally based on pre-given datasets and must be retrained to achieve optimal results in the scenarios above, wasting bandwidth and computational resources [92]. Thus, it is vital to incorporate TL and FL into the OL paradigm to overcome these constraints. However, as this field is still in its infancy, few solutions have been proposed in recent years, and no prior research has summarized the area comprehensively. To fill this gap, and building on the relationships between the related techniques discussed above, the following sections provide detailed descriptions and summaries of current OTL and OFL studies.

3 Online transfer learning

OTL enables the standard TL paradigms to transfer knowledge from source domains, thereby enhancing the online learning task on the target domain [9, 10].

It is worth noting that the organization of OTL in this survey differs from the aforementioned traditional TL categories, as OTL is a developing field with research focusing on a more fundamental and specific perspective. The following sections provide an interpretation of OTL approaches from a domain-task perspective. In general, domain-based interpretation is based on different settings within the source domain, including single source (SS) OTL and multiple sources (MS) OTL. On the other hand, the task-based interpretation is based on different task types within the target domain, including binary classification (BC) OTL and multi-class classification (MC) OTL. While the majority of OTL research has concentrated on classification tasks, similar techniques can be applied to other machine learning tasks such as regression and clustering [10, 93, 94].

Figure 7 gives a relation map of OTL studies, which includes all the OTL papers obtained in Section 1.2. Specifically, the relation map consists of a root representing the cornerstone literature [9, 10], in which OTL was first proposed, and four stems representing the four sub-areas of OTL, namely SS-BC, SS-MC, MS-BC, and MS-MC, with each stem node containing leaf nodes that represent literature focused on various technical aspects of each area. According to the relation map, most existing OTL studies have focused on SS-BC, MS-BC, and MS-MC OTL, while studies on SS-MC OTL have been relatively scarce. Figure 8 summarizes the evolution timeline of OTL according to its sub-areas. The earliest research interest in OTL focused on SS-BC, after which several studies on SS-MC were proposed. Having addressed the research difficulties in the single source domain, researchers then began examining approaches to OTL that utilize information from multiple sources, i.e. MS-BC and MS-MC.

Fig. 7 Relation map for OTL. SS: Single source; BC: Binary classification; MS: Multiple sources; MC: Multi-class classification

Fig. 8 Evolution timeline of OTL. SS: Single source; BC: Binary classification; MS: Multiple sources; MC: Multi-class classification

3.1 Notations and problem definition

Table 2 summarizes the frequently used mathematical notations in OTL, and we keep these notations consistent and similar to the majority of existing works [9, 10, 97, 104, 106, 108, 110] to facilitate comparisons between different OTL methods.

Table 2 Summary of frequently used mathematical notations in OTL

Given n source domains denoted by \(D^{S}=\left \{ D^{S_{i}} \right \}^{n}_{i=1}\), where each source domain \(D^{S_{i}}\) contains \(n^{S_{i}}\) labelled instances, the problem of OTL is formulated as a single source (SS) task when n = 1 and as a multiple sources (MS) task when n > 1. The source data space of the i-th source domain is denoted by \(\mathcal {X}^{S_{i}}\times \mathcal {Y}^{S_{i}}\), where the feature space \(\mathcal {X}^{S_{i}} = \mathbb {R}^{d_{i}}\). The target domain is denoted by \(D^{T}\), with \(n^{T}\) instances. Similarly, we denote by \(\mathcal {X}^{T}\times \mathcal {Y}^{T}\) the target data space, where the feature space \(\mathcal {X}^{T} = \mathbb {R}^{d_{T}}\). Let k denote the number of classes in the label space \(\mathcal {Y}^{T}\); the problem of OTL is formulated as a binary classification (BC) task when k = 2 and as a multi-class classification (MC) task when k > 2. When \(\mathcal {X}^{S_{i}}=\mathcal {X}^{T}\) and \(\mathcal {Y}^{S_{i}}=\mathcal {Y}^{T}\), the problem is identified as homogeneous OTL (HomOTL). On the other hand, if the source and target domains have different feature spaces (\(\mathcal {X}^{S_{i}}\ne \mathcal {X}^{T}\)) or different label spaces (\(\mathcal {Y}^{S_{i}}\ne \mathcal {Y}^{T}\)), the problem is referred to as heterogeneous OTL (HetOTL) [9, 10, 111].

3.2 Single source-binary classification (SS-BC) OTL

SS-BC OTL was first proposed by [9, 10] and was considered in both homogeneous and heterogeneous scenarios (HomOTL and HetOTL). For HomOTL, as illustrated in Fig. 9, they first constructed the source model fS from the offline source data using a support vector machine (SVM) and employed the Passive-Aggressive (PA) algorithm to build the model fT on the target domain. PA formulates OL as a constrained convex optimization problem, and the weight ω of the online model on the target domain at a new time point t + 1 is updated by the solution:

$$ \omega_{t+1} = \omega_{t} + \tau_{t}y_{t}x_{t} $$
(1)

where \(\tau _{t}= \min \limits \left \{\mathcal {C},\frac {\ell ((x_{t},y_{t});\omega _{t})}{\left \|x_{t} \right \|^{2} } \right \}\), and \(\mathcal {C}\) is a positive regularization parameter. ℓ(⋅) is the hinge loss, which can be written as \(\ell ((x,y);\omega )=\max \limits \left \{ 1-y(\omega ^{\top } x),0 \right \} \). The resulting algorithm is passive when ℓ(⋅) = 0, and no update is needed. Otherwise, when ℓ(⋅) is positive, the algorithm is aggressive: the instance xt is added to the support vector set, and the model is forced to update to ωt+1. PA thus balances the progress achieved at each new time point against the information gathered in previous rounds [59].
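A minimal sketch of the PA-I update in (1), assuming a linear classifier with labels in {−1, +1}:

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) round for a newly arrived instance (x, y)."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss l((x, y); w)
    if loss > 0.0:                            # aggressive: x becomes an SV
        tau = min(C, loss / np.dot(x, x))     # step size tau_t from (1)
        w = w + tau * y * x
    return w                                  # passive: w unchanged if loss == 0
```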

Fig. 9 SS-BC homogeneous OTL framework

After obtaining both the source and the target models, [10] proposed a weight updating scheme to adjust the weights μ of the source model and v of the target model, respectively:

$$ \left\{\begin{array}{l} \mu_{t+1}= \frac{\mu_{t}s(f^{S}(x_{t}), y_{t})}{\mu_{t}s(f^{S}(x_{t}), y_{t}) + v_{t}s(f^{T}(x_{t}),y_{t})} \hfill \\ v_{t+1}= \frac{v_{t}s(f^{T}(x_{t}),y_{t})}{\mu_{t}s(f^{S}(x_{t}), y_{t}) + v_{t}s(f^{T}(x_{t}),y_{t})} \hfill \\ \qquad\quad~~~~~~\mu_{1}=v_{1}=\frac{1}{2} \hfill \end{array}\right. $$
(2)

where \(\mu_{t+1}\) and \(v_{t+1}\) are the weights of the source and target models, respectively, at time point t + 1. s(⋅) is a weight decay function that increases the weights of models that contribute significantly to the final forecast.
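The update in (2) can be sketched as follows; the exponential form of s(⋅) is an assumption made here for illustration, as the original work defines its own decay function.

```python
import numpy as np

def homotl_weights(mu, v, fS_pred, fT_pred, y, eta=0.5):
    """One round of the weight update in (2) for the source/target ensemble."""
    def s(pred, y):                        # assumed weight-decay function
        return np.exp(-eta * (pred - y) ** 2)

    wS, wT = mu * s(fS_pred, y), v * s(fT_pred, y)
    total = wS + wT                        # normalization keeps mu + v = 1
    return wS / total, wT / total

mu, v = 0.5, 0.5                           # mu_1 = v_1 = 1/2 as in (2)
mu, v = homotl_weights(mu, v, fS_pred=0.8, fT_pred=-0.3, y=1.0)
```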

Unlike [10], which used only a single source classifier, [98] proposed AB-HomOTL, inspired by the boosting algorithm, to learn multiple weak source classifiers. As illustrated in Fig. 10, this work focused on the learning strategy of the source model fS in the homogeneous scenario for SS-BC OTL.

Fig. 10 AB-HomOTL framework

Specifically, in the first stage, AB-HomOTL used PA as the base learning algorithm to train m weak source classifiers within AdaBoost. In the second stage, the source classifiers were integrated with the model fT trained on the target domain, and a weight was assigned to each combined model based on its performance on the new instance (xt, yt). Finally, the ensemble models were integrated to produce the final robust target classifier ft.

Rather than weighting classifiers dynamically according to their forecast accuracy alone, [99] emphasized that real-world data are cost-sensitive and incorporated the misclassification cost, presenting an OTL algorithm with adaptive cost (OLAC). Specifically, they utilized the proportions of minority and majority samples to calculate the misclassification cost, enabling dynamic classifier adjustment for different samples. OLAC has proven effective in improving the classification accuracy of minority samples, thereby increasing overall model performance.

Zhao et al. [10] also considered SS-BC OTL in the heterogeneous environment (HetOTL), assuming that the feature space of the source domain is a subset of that of the target domain. Given a newly arrived instance (xt, yt), HetOTL divides it into two instances (xt(1), yt) and (xt(2), yt), where \(x^{t(1)} \in \mathcal {X}^{S}\) and \(x^{t(2)} \in \mathcal {X}^{T} \setminus \mathcal {X}^{S}\). Then, inspired by the multi-view approach, HetOTL trains and updates two classifiers fT(1) and fT(2) from the two views simultaneously using the co-regularized optimization:

$$ \begin{array}{@{}rcl@{}} \left( f^{T(1)}_{t+1}, f^{T(2)}_{t+1}\right) &=& \underset{f^{T(1)},f^{T(2)}}{\arg\min}\frac{\gamma_{1}}{2}\left \| f^{T(1)}-f^{T(1)}_{t} \right \|^{2} \\ &&+\frac{\gamma_{2}}{2}\left \| f^{T(2)}-f^{T(2)}_{t} \right \|^{2}+\mathcal{C}\ell(f^{T(1)}, f^{T(2)};t) \end{array} $$
(3)

where γ1, γ2 and \(\mathcal {C}\) are predefined positive regularization parameters, and ℓ(⋅) is the loss function. During the updating, the classifier \(f^{T(1)}_{1}\) was initialized with the trained source classifier fS, and the classifier \(f^{T(2)}_{1}\) was initialized to 0. This updating rule ensures that the two view classifiers do not deviate excessively from their previous updates (the first two regularization terms) while maintaining prediction performance (the last term).

Similar to [98], [93] proposed heterogeneous ensembled OTL (HetEOTL), based on AdaBoost, to improve the performance of OTL models in the heterogeneous environment. Comparative experiments demonstrated that the ensemble strategy outperformed the earlier HetOTL framework of [10]. Although HetEOTL [93] improved the performance of the OTL model, it made the same assumption as [10], i.e. that the feature spaces of the source domain are a subset of those of the target domain.

To relax the above assumption, studies based on co-occurrence data have been proposed [95,96,97]. Consider a source domain DS and a target domain DT whose feature spaces are entirely different, i.e. \(\mathcal {X}^{S} \cap \mathcal {X}^{T} = \emptyset \). Unlabelled co-occurrence data \(\left \{ \left (\widetilde {x}^{S_{i}}, \widetilde {x}^{T_{i}} \right ) \right \}^{n_{c}}_{i=1} \in \mathcal {X}^{S}\times \mathcal {X}^{T}\), in which \(\widetilde {x}^{S_{i}} \in \mathcal {X}^{S}\) and \(\widetilde {x}^{T_{i}} \in \mathcal {X}^{T}\), are collected from offline sources to bridge the different feature spaces. For example, the website Flickr contains a massive collection of images with tags that can be used as co-occurrence data and are significantly less expensive to collect than labelled images (Fig. 11).

Fig. 11 An instance of co-occurrence text-image data from Flickr [112]

Yan et al. [96] proposed online heterogeneous transfer learning by hedge ensemble (OHTHE), which utilized co-occurrence data as auxiliary knowledge to build a correspondence map between the source and target domains, as illustrated in Fig. 12.

Fig. 12 OHTHE framework. The ⊕ marker denotes the measure of similarity between two instances

They first measured the heterogeneous similarity between the newly arrived instance xt and the offline source instance xs based on co-occurrence text-image data. The source model was then built by adding the weights of the k nearest neighbours of xt in the source domain. Meanwhile, the target model was trained by PA. OHTHE then utilized the Hedge(β) strategy [113] to dynamically update the weights μ and v:

$$ \left\{\begin{array}{l} \mu_{t+1}= \mu_{t}\beta^{\ell(y_{t}f^{S}(x_{t}))} \\ v_{t+1}= v_{t}\beta^{\ell(y_{t}f^{T}(x_{t}))} \\ \quad~~\mu_{1}+v_{1}=1 \end{array}\right. $$
(4)

where μ1 ∈ (0,1) and v1 ∈ (0,1) are the initial weights, and β is a weight decay factor that discounts each model according to its loss ℓ(⋅), so that models contributing more to the final prediction retain larger weights.
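A minimal sketch of the Hedge(β) update in (4), assuming a 0-1 loss on each model's predicted sign; the concrete loss function is a design choice of the algorithm rather than something fixed by (4).

```python
import numpy as np

def hedge_update(mu, v, fS_pred, fT_pred, y, beta=0.9):
    """Hedge(beta) step from (4): each weight decays by beta raised to the
    model's loss on the newly arrived instance."""
    loss_S = float(np.sign(fS_pred) != np.sign(y))   # assumed 0-1 loss
    loss_T = float(np.sign(fT_pred) != np.sign(y))
    mu, v = mu * beta ** loss_S, v * beta ** loss_T
    total = mu + v
    return mu / total, v / total      # renormalize so the weights sum to 1
```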

3.3 Multiple sources-binary classification (MS-BC) OTL

In real-world applications, it is difficult to extract sufficient knowledge from a single source domain, thus combining data from multiple source domains increases the reliability and robustness of source classifiers. However, combining all source domains directly may produce unsatisfactory forecasts since different source domains include information from different perspectives, and the data qualities within each source domain vary as well. As a result, OTL algorithms with multiple sources should be more sophisticated in order to distinguish critical source domains and thus construct a more robust source learner.

Wu et al. [105] trained a set of source classifiers using kernel SVMs, and each classifier was weighted according to its performance on the newly arrived instance from the target domain. The weighted source classifiers were then integrated to create an ensemble learner for the source domain. Simultaneously, PA was used to train the target classifier on the online data. In the second stage, the ensemble source classifier and the target classifier were integrated to generate an effective ensemble model. The weight updating rule at round t + 1 for the classifier from the i-th source domain, the ensemble source classifier, and the target classifier can be described as follows:

$$ \left\{\begin{array}{l} \mu_{t+1}^{i}= {\mu_{t}^{i}}\beta^{\ell(f^{S_{i}}(x_{t}), y_{t})} \\ \mu_{t+1}= \mu_{t}\beta^{\ell(f^{S}(x_{t}), y_{t})} \\ v_{t+1}= v_{t}\beta^{\ell(f^{T}(x_{t}),y_{t})} \\ \qquad{\mu_{1}^{i}}=\frac{1}{2n} \\ \quad\mu_{1}=v_{1}=\frac{1}{2} \end{array}\right. $$
(5)

where \(f^{S}={{\sum }_{i=1}^{n}}{\mu _{t}^{i}}f^{S_{i}} (x_{t}) \). β ∈ (0,1) is a weight decay factor that is applied when the classifier suffers a loss value, and \({\mu ^{i}_{t}}\) denotes the weight of the classifier from the i-th source domain at time point t.

In contrast to [105], which investigated only HomOTL, [104] adapted the OTL framework to a heterogeneous environment. Using a problem setting similar to that in [9, 10], [104] introduced heterogeneous OTL with multiple source domains (HetOTLMS), based on the premise that the feature spaces of each source domain are a subset of those of the target domain. Instead of training an ensemble source classifier, HetOTLMS combines the weak classifier from the i-th source domain with target classifiers trained by PA to form n ensemble classifiers. In particular, for the i-th source domain in the t-th round, each newly arrived instance is divided into two parts, the first of which shares the same feature space as the source domain, while the second covers the remainder of the target feature space. Two classifiers in the target domain are generated and then integrated with the source classifier, based on their weights, to form an ensemble classifier.

Most studies developed models based on PA, which is limited to numerical attributes. Inspired by the very fast decision tree (VFDT), which incorporates Hoeffding bounds to guarantee the performance of an incremental decision tree, [103] modified VFDT into VFDT-D in the following ways to provide an OTL framework that handles mixed attributes:

  • Cache a few instances to initialize the statistical information of newly constructed leaf nodes, in order to satisfy the Hoeffding constraint and to manage mixed attributes.

  • Modify the output of the VFDT into a posterior probability, equal to the ratio of positive training instances in a leaf node to the total number of training instances in that leaf node.

Then, using VFDT-D, decision trees were induced from the source domains and the target domain. The tree path and posterior probability of a newly arrived instance xt were then combined to determine the ideal source domain, i.e. the one with the highest degree of similarity to xt, which was integrated with the target domain classifier to construct the final prediction decision function. Comparative experiments demonstrated that the proposed algorithm was capable of overcoming the cold start problem [71], in which model performance degrades in the early stage of the data stream because few instances have arrived in the target domain.

It is worth noting that the target model initially performs worse than the source model, as it lacks prior knowledge about the target domain; as more instances arrive, the target model performs equally well or even better than the source model. However, most studies [9, 10, 96, 105] updated model weights solely based on cumulative error, ignoring the intrinsic timescale of online data. To address this issue, [106] proposed a new weight updating rule that assigns greater weight to later instances. They assumed that predictions on newer samples are more plausible than those on earlier samples and hence increased the weights over time to narrow the gap between the accuracy and the weights of the models. On the other hand, the traditional accumulating criteria ensure that newly arrived outliers have a negligible effect on model updating; investigating whether the same holds in this framework therefore remains necessary.

3.4 Single source multi-class classification (SS-MC) OTL and multiple sources multi-class classification (MS-MC) OTL

After reviewing binary classification OTL frameworks in the previous section, we will discuss multi-class classification OTL studies in this section. Multi-class tasks are common in the real world, such as document classification. Specifically, when an instance is relevant to a single subject, the classification problem is referred to as multi-class single-label classification; otherwise, the classification problem is referred to as multi-class multi-label classification [59], and the majority of existing OTL research has focused on multi-class single-label classification. Multi-class classification is more complicated than binary classification as it involves the development of offline and online models that consider multiple classes, necessitating the use of more sophisticated strategies to create a combined multi-class classifier with satisfactory performance [96].

Inspired by the online multi-class PA (MPA) algorithm [59], [102] presented an OTL algorithm for multi-class classification (OTLAMC) that adopted a novel loss function and weight updating mechanism to enable OTL in multi-class classification tasks. However, that work concentrated only on knowledge transfer from a single source domain. Kang et al. [110] then developed online multi-source transfer learning for multi-class classification (OMTL-MC), which incorporates data from multiple domains. While the OMTL-MC structure is similar to that of the HetOTLMS framework described in [104], there are two significant differences:

  • The OMTL-MC framework examined OTL in a homogeneous environment, whereas the HetOTLMS framework investigated OTL in both homogeneous and heterogeneous settings.

  • OMTL-MC was developed with an extended hinge loss (EHL) function to support multi-class classification tasks, whereas HetOTLMS is only suitable for binary classification tasks.

Zhang et al. [100] proposed an online PA feature transformation (OPAFT) algorithm to calculate similarity in a k-nearest-neighbour (k-NN) classifier. They further extended this algorithm to the online multiple kernel feature transformation (OMKFT) algorithm to improve the performance of OPAFT for cross-domain and multi-class object recognition. Another feature-based OTL framework was proposed in [108], which investigated multi-class classification OTL with multiple source domains. Specifically, they constructed an initial transformation matrix for the i-th source domain by utilizing the source and target data. The transformation matrix was then used to project the original data into a new feature space. Meanwhile, each newly arrived instance was projected into the appropriate feature space using all of the transformation matrices, and a new source classifier was trained in each new space. The projected instance was then trained using the MPA algorithm to generate the associated classifiers for the target domain. Finally, the source and target classifiers were combined using the Hedge strategy. Rather than updating the transformation matrices at each time step, this work used a time window to control the frequency of updates, thereby reducing computation costs.

In contrast to previous OTL architectures, which require the label of each target instance to be revealed after every prediction, [109] introduced an online multiple source transfer learning (OMS-TL) architecture that requires only a few labelled data points in the target domain as prior knowledge and does not require label revelation after each prediction. They employed a bipartite graph to represent the classification results from all the source domains and then estimated the likelihood of a sample belonging to each class using convex minimization. When a new instance is observed, the averaged probability over the source-domain classes to which the sample belongs is combined with the target prediction, based on a weighted average of previous predictions, to generate the final result.

OTL aims to enhance the online learning task in the target domain by leveraging knowledge from source domains. By applying standard TL in the online context, real-time data generated by various edge devices can be efficiently processed. However, as with traditional TL, OTL is constrained by the assumption that all data from the source and the target domains must be processed centrally, which is impractical in the real world due to data privacy regulations. As a result, the following section will introduce OFL, which enables real-time data processing in a distributed fashion while ensuring data privacy.

4 Online federated learning

FL holds significant promise for a variety of sophisticated applications, including smart traffic management [114], interactive social networks [115], and smart health monitoring [116], owing to the massive amounts of data generated by various edge devices (e.g. smartphones, wireless sensors, and wearable devices). However, standard FL is constrained by the premise that the training data at each local device are gathered offline and should be fully trained in each global round to deliver iteration round-efficient solutions [117]. It is unrealistic to assume that the training data at each local client remain constant throughout each round of training, as clients may have access to real-time data that become obsolete in a matter of hours or even minutes [115, 118]. In this case, standard FL models have difficulty capturing the fluctuations of real-time data, and their generalization performance is likely to decrease as the number of training rounds increases. Therefore, enabling the standard FL architecture in online scenarios (i.e. OFL) is critical in the era of big data. Instead of delivering iteration round-efficient solutions by simply waiting for training results from all the local clients, OFL studies increasingly focus on the real-time data processing efficiency of local clients, i.e. on delivering iteration process-efficient solutions [117].

OFL assumes that the data at each client are generated and collected in real time, and it seeks to capture a high degree of temporal information from data sources with various distributions. Due to the time-varying nature of online data, several of the challenges associated with standard FL become even more pronounced in OFL:

  • Statistical heterogeneity: non-IID and unbalanced properties of online time-varying data cause model/concept drift [119] in OFL, and capturing the dynamic change of the rapidly generated online data poses a significant challenge to OFL.

  • System heterogeneity: stragglers emerge due to device heterogeneity and network instabilities. Balancing the contribution of each local device to the local iteration against the communication cost of the global iteration is a critical challenge in OFL.

  • Privacy guarantees: the vast amount of online data generated makes it more challenging to guarantee privacy for OFL. Various privacy protection strategies, such as differential privacy (DP) [120], have been implemented in FL in order to strike a balance between data utility and privacy, and these techniques should be optimized for the online environment to be more reasonable and practical for OFL.

Different OFL studies prioritize these challenges differently; Table 3 summarizes current OFL studies along these three dimensions.

Table 3 Summary of studies on OFL

The table shows that the majority of OFL studies have focused on statistical and system heterogeneity, while more research is required on privacy guarantees in OFL. Figure 13 summarizes the evolution timeline of OFL according to its sub-categories. It should be noted that OFL research is still in its infancy: in 2019, OFL studies began to address statistical heterogeneity and privacy issues, followed by studies exploring system heterogeneity in 2020.

Fig. 13 Evolution timeline of OFL

4.1 Notations and problem definition

Consider a set \(\mathcal {K} =\left \{1, {\ldots}, K \right \}\) of distributed devices. At each round t, the global server broadcasts its most recent parameters \({\omega _{g}^{t}}\) to the K devices. Each local device k receives the global parameters \({\omega _{g}^{t}}\) and a time-varying local instance \(({x_{k}^{t}},{y_{k}^{t}})\), and updates the parameters \(\omega _{k}^{t+1}\) of its local model \(f_{k}(\omega _{k}^{t+1})\). Finally, the local devices upload the updated parameters \(\omega _{k}^{t+1}\) to the global server for dynamic aggregation:

$$ \omega_{g}^{t+1} = h(\omega_{1}^{t+1}, \ldots, \omega_{K}^{t+1}) $$
(6)

The mapping h should be carefully selected in accordance with the model parameter structures [127], so that each device k can estimate the label \(y_{k}^{t+1}\) of a newly arrived instance \(x_{k}^{t+1}\) in real-time. One of the most commonly used mappings in the standard FL system is FedAvg [32], which computes a weighted average of the aggregated local parameter sets:

$$ \omega_{g}^{t+1} = \sum\limits_{k=1}^{K}\frac{n_{k}}{N}\omega_{k}^{t+1} $$
(7)

where \(n_{k}\) is the number of data samples held on device k, and N is the total number of samples across the K local devices. In OFL scenarios, data are generated continuously on the local devices, increasing the uncertainty of local model updates relative to the central model [122]. As a result, more plausible mappings should be established to constrain such variances and thus improve the generalization performance of the model. Additionally, since not all devices are activated at each round t (e.g. due to network delays or device heterogeneity), strategies such as device selection should be used to minimize the negative impact on overall communication efficiency.
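To make the aggregation concrete, the following Python sketch implements the weighted averaging of (7); representing parameters as flat numpy arrays and the toy values below are our own illustrative assumptions, not part of [32].

```python
import numpy as np

def fedavg_aggregate(local_params, sample_counts):
    """Weighted average of local parameter vectors, as in (7).

    local_params  : list of K numpy arrays, one per device (same shape).
    sample_counts : list of K ints, n_k = number of samples on device k.
    """
    total = sum(sample_counts)  # N
    agg = np.zeros_like(local_params[0])
    for omega_k, n_k in zip(local_params, sample_counts):
        agg += (n_k / total) * omega_k
    return agg

# Toy usage: three devices with unequal data volumes.
params = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([0.0, 1.0])]
counts = [100, 50, 50]
omega_g = fedavg_aggregate(params, counts)  # -> [1.25, 1.25]
```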

4.2 OFL with statistical heterogeneity

Giorgas et al. [124] concentrated on the statistical heterogeneity associated with unbalanced data in OFL. Specifically, [124] assumed that the central server was provided with pre-given data for training the initial central model. After initialization, the central model was broadcast to local devices for training with new samples from different classes. The updated models on the local devices were then uploaded to the central server for integration. To ensure that the integrated model did not deviate substantially from the original central model, it could optionally be retrained using the pre-given training data on the central server. This strategy effectively addressed common OL challenges, such as catastrophic forgetting [132], which occurs due to the time-varying nature of online data.

To enable the OFL framework in non-IID scenarios, [123] designed a non-linear regression OFL framework based on the random Fourier feature-based kernel least-mean-square (RFF-KLMS). Specifically, they defined a non-linear local model \(f_{k}(\omega_{k}^{t})\) for an arriving instance \((x_{k}^{t}, y_{k}^{t})\) of a local device k at time t. The local parameter update can then be formulated as follows:

$$ \omega_{k}^{t+1} = \mathbb{E}\left[ \left| f_{k}(\omega_{k}^{t}) - \hat{f}_{k}(\omega_{k}^{t}) \right| \right] $$
(8)

where \(\mathbb{E}[\cdot]\) denotes the expectation and \(\hat{f}_{k}(\omega_{k}^{t}) = \omega_{k}^{T} z_{k}^{t}\), in which \(\omega_{k}\) is a linear representation of the non-linear model \(f_{k}\) in the random Fourier feature (RFF) space and \(z_{k}^{t}\) is the mapping of \(x_{k}^{t}\) into the RFF space. Hence, the global parameter update can be further constructed in the RFF space by:

$$ \omega_{g}^{t+1} = \frac{1}{K}\sum\limits_{k=1}^{K} \omega_{k}^{t+1}. $$
(9)
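As a rough illustration of the RFF-KLMS idea, the sketch below maps inputs into an RFF space approximating a Gaussian kernel and performs an online least-mean-square update of the linear surrogate \(\omega_{k}\); the feature dimension, kernel width, and step size are assumed values, and the exact update rule in [123] may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 8, 200            # input dimension and number of random features (assumed)
sigma, mu = 1.0, 0.1     # kernel width and LMS step size (assumed)

# Random Fourier feature map z(x) approximating a Gaussian kernel.
W = rng.normal(0.0, 1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def rff(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

omega = np.zeros(D)      # linear surrogate of the non-linear model in RFF space

def local_update(x_t, y_t):
    """One online LMS step on an arriving instance (x_t, y_t)."""
    global omega
    z_t = rff(x_t)
    err = y_t - omega @ z_t          # instantaneous prediction error
    omega = omega + mu * err * z_t   # stochastic-gradient (LMS) step
    return err
```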

Instead of training a single global utility model for all local devices, some works have concentrated on improving the performance of personalized local models in the OFL framework. Based on a worker-leader-core network hierarchy, researchers designed hierarchical nested personalized federated learning (HN-PFL) for unmanned aerial vehicles (UAVs) [119]. Intra-UAV swarm learning is embedded inside inter-UAV aggregation, following the worker-leader-core structure to train high-level personalized models for local devices. To enhance the learning of HN-PFL, model/concept drift was introduced to quantify the dynamic changes of local online time-varying data. For a local device k with local model \(f_{k}(\omega_{k}^{t})\), the online model drift at time t is denoted by \(\Lambda_{k}^{t} \in \mathbb{R}^{+}\), which captures an upper bound on the variation of gradients between two adjacent instances:

$$ \left\| \nabla f_{k}(\omega_{k}^{t+1}) - \nabla f_{k}(\omega_{k}^{t}) \right\|^{2} \le {\Lambda}_{k}^{t+1}. $$
(10)

Local models with a greater drift value are likely to become obsolete quickly, necessitating a shorter learning period and more frequent revisiting. Conversely, models with a small drift value exhibit lower local parameter fluctuations, implying that they require less attention than models with greater drift values. The model drift value was then utilized to estimate the online gradient for each local model, and a core network was created for each training sequence by storing the real-time properties of the network as reinforcement learning states. Additionally, to avoid the curse of dimensionality, they used a neural network to model the Q-table and determine the network states, rather than pre-building it using traditional reinforcement learning techniques [133].
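A minimal sketch of how the drift bound in (10) might be estimated empirically and used to prioritize revisiting is given below; the simple sorting rule is our simplification of the reinforcement-learning controller actually used in [119].

```python
import numpy as np

def model_drift(grad_prev, grad_curr):
    """Empirical drift between gradients at two adjacent instances, cf. (10)."""
    return float(np.linalg.norm(grad_curr - grad_prev) ** 2)

def revisit_order(drifts):
    """Rank devices so that high-drift (fast-ageing) local models come first."""
    return sorted(range(len(drifts)), key=lambda k: -drifts[k])
```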

Li et al. [92] emphasized the importance of developing individualized local models by combining multi-task learning with the OFL framework. Unlike previous works that analysed streaming data, [92] proposed an online federated multi-task learning framework (OFMTL) to address the problem of inferring effective local models for newly joined devices without affecting previous clients or the global server. Multi-task relationship learning [134] was used in the OFMTL to encode the relationships between the local models of all the devices in a relationship precision matrix. The OFMTL formulated the learning of model parameters for a newly joined device as a convex optimization problem over the weight matrix and the precision matrix, and an alternating optimization algorithm was proposed to alternately optimize the model parameters and precision matrix of the new device using information gained from previous devices. Additionally, to save computation resources while retaining the generalization performance of previous models, the model parameters were retrained only when the number of newly joined devices reached a fixed ratio with respect to the total number of previous devices.

4.3 OFL with system heterogeneity

The varying communication rates of heterogeneous local devices are another critical challenge for OFL; lagging devices with lower communication rates are known as stragglers [122]. Numerous solutions have been proposed for standard FL systems. However, in real-world scenarios where the data on each local device fluctuate, the updated model at each global round may display more inherent dynamic features. Therefore, more sophisticated algorithms for FL in the online context (i.e. OFL) are required to minimize the negative impact of stragglers in a dynamic environment. Based on the reviewed papers, two types of protocols can be used to address the issue of stragglers: (1) the synchronous protocol; (2) the asynchronous protocol.

To deal with stragglers in the OFL system, [117] proposed an adaptive batch sizing (ABS) solution based on the synchronous protocol. Typical synchronous FL systems require the central server to wait until all local devices (including stragglers) have been updated before performing a global update [135, 136], or simply ignore and drop the stragglers [32]. In contrast, ABS [117] limited the size of training data at each global round by allocating a batch-size bound to each device based on its processing speed and real-time data-generation speed, forcing all local devices, including stragglers, to be synchronous during each global communication round. Furthermore, ABS provided a buffer for each local device to retain or revisit local data depending on network settings, reducing volatility in the size of generated data during each training round. Despite the lack of mathematical definitions in [117], the proposed ABS structure is instructive.
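Since [117] gives no closed-form rule, the following is only a plausible sketch of an ABS-style batch bound, under the assumption that each device's per-round batch is capped by what it can both generate and process within one global round.

```python
def batch_bounds(proc_speed, gen_rate, round_time):
    """Hypothetical ABS-style per-device batch bound (assumed rule).

    proc_speed : samples/second each device can train on
    gen_rate   : samples/second of real-time data generation
    round_time : target wall-clock length of a global round (seconds)
    """
    # A device can neither train on more samples than it can process,
    # nor on more than it generates, within one round.
    return [int(min(p, g) * round_time) for p, g in zip(proc_speed, gen_rate)]
```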

Zhou et al. [125] proposed a cost-effective federated learning (CEFL) system capable of cooperatively reducing computation and communication overhead. Similar to [117], CEFL made dynamic decisions on local devices by limiting the entry of newly arrived training data, buffering, and scheduling the data according to the time-varying resource pricing of the local devices. Additionally, CEFL employed an additional optimization parameter to balance the computational overheads of local models with the overall communication cost.

Rather than leveraging all local devices for global model training, [128] highlighted that the key challenge for OFL is to distinguish effective local models and to determine the appropriate number of local epochs without prior knowledge. This study formulated participant selection as an optimization problem based on system capacity (local device availability, data volume, and network bandwidth) and the long-term convergence of both the local and aggregated global models. To extend this solution to online scenarios, an online schema was designed (Fig. 14) to dynamically select the participants and the number of local epochs for each device in the OFL system. Specifically, the online schema consists of two parts: online learning and online rounding. The first part produced fractional decisions based solely on prior knowledge, whereas the second part employed a compensation technique to randomly convert the fractional decisions to integers without violating any pre-defined constraints. The experimental results indicated that the proposed schema could dynamically adjust the upper bound on local convergence accuracy and select participants with superior local model performance and computation efficiency.

Fig. 14 Structure of the online schema proposed in [128]
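The online-rounding step can be pictured with the toy sketch below, which converts fractional participation decisions into 0/1 picks under a simple budget constraint; the budget is our stand-in for the paper's actual constraints, and the compensation technique of [128] is more involved.

```python
import numpy as np

rng = np.random.default_rng(1)

def randomized_round(fractions, budget):
    """Round fractional participation decisions to 0/1 picks.

    Each client is selected with probability equal to its fractional decision;
    if too many are picked, keep the `budget` clients with the largest fractions.
    """
    fractions = np.asarray(fractions)
    picks = (rng.random(len(fractions)) < fractions).astype(int)
    if picks.sum() > budget:
        keep = np.argsort(-fractions)[:budget]
        picks = np.zeros_like(picks)
        picks[keep] = 1
    return picks
```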

Asynchronous system design was also emphasized in studies addressing the heterogeneity of the OFL system.

Chen et al. [122] developed an asynchronous OFL framework (ASO-Fed) that enables a wait-free OFL system and improves the prediction performance and computation efficiency of local devices when data arrive continuously. ASO-Fed learned the inter-client interactions on the global server using feature representation learning inspired by attention mechanisms [137, 138] and weight normalization [139, 140]. A decay coefficient was utilized on the client side to balance the older and newer models when OL was performed on each local device. Additionally, ASO-Fed used a dynamic step size to minimize the negative impact of stragglers: the step size was determined by the data volume and communication capacity of each client, and a larger step size was assigned to local clients with a lower activation rate to compensate for the long latency and achieve higher performance. Experiments demonstrated that the proposed ASO-Fed framework converged faster than synchronous FL frameworks and significantly reduced the overall computational overheads.
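In spirit, the wait-free aggregation and client-side decay can be reduced to the two functions below; the mixing weight and decay coefficient are assumed hyperparameters, and the actual ASO-Fed update also involves the learned feature representation.

```python
def async_server_update(global_w, client_w, mix=0.5):
    """Fold a single client's parameters into the global model as they arrive,
    without waiting for the remaining clients (wait-free aggregation)."""
    return (1.0 - mix) * global_w + mix * client_w

def client_decay(old_w, new_w, decay=0.5):
    """Balance the older local model against the freshly trained one,
    in the spirit of ASO-Fed's decay coefficient."""
    return decay * old_w + (1.0 - decay) * new_w
```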

On the other hand, [11] emphasized the importance of incorporating contributions from all local clients, even stragglers. They developed FLEET, which consists of two components: I-PROF and ADASGD. The former forecasts and allocates computational overheads across all local devices. The latter is a novel stochastic gradient descent algorithm that weights stale gradients by a stale-aware dampening factor and a similarity-based boosting value. Stragglers with longer delays were assigned a smaller stale-aware dampening factor, indicating that they contribute less to the overall update, whereas a lower similarity value indicates a gradient containing more significant new data features. FLEET was shown to be effective in minimizing the negative effects of stragglers while capturing the vital information needed to improve the generalization capability of the system.
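A hedged sketch of the stale-gradient weighting idea behind ADASGD is shown below; the particular functional forms are ours, chosen only to convey that longer delays dampen a gradient while lower similarity boosts it, and they are not FLEET's exact formulas.

```python
def adasgd_weight(staleness, similarity, lam=0.5):
    """Illustrative weight for a stale gradient (not FLEET's exact formula).

    staleness  : number of global rounds the gradient lags behind
    similarity : in [0, 1]; low values suggest novel data features
    """
    dampening = 1.0 / (1.0 + lam * staleness)  # longer delay -> smaller factor
    boost = 2.0 - similarity                   # low similarity -> larger boost
    return dampening * boost
```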

4.4 OFL with privacy guarantees

Since the data are generated in an online fashion and the sequence of training data is unknown, ensuring privacy in the OFL framework requires a more sophisticated design of the privacy algorithms. Odeyomi and Zaruba [130] considered P2P FL in an online setting and proposed an online mirror descent algorithm with long-term constraints on the sequential decisions made by each device. Additionally, a modified online version of local differential privacy was utilized to ensure the privacy of the OFL system. By using only the private version of the loss gradients for the real-time data sequence at each global round, the online local differential privacy method provides global privacy guarantees without relying upon loss information across the entire data sequence. In each global training round, each user received new data and updated its local model; each updated local gradient was then subjected to local differential privacy to preserve privacy in the online scenario. Compared to the online gradient descent algorithm with differential privacy, the proposed algorithm was shown to achieve better long-run accuracy.
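The local differential privacy step can be sketched as clip-then-perturb on each gradient before it leaves the device; the clipping norm, privacy budget, and Laplace calibration below are illustrative assumptions rather than the modified online mechanism of [130].

```python
import numpy as np

rng = np.random.default_rng(0)

def local_dp_gradient(grad, clip=1.0, epsilon=0.5):
    """Release a privatized gradient: bound its L1 norm, then add Laplace noise.

    With the L1 norm clipped to `clip`, any two clipped gradients differ by at
    most 2 * clip in L1, so Laplace noise of scale 2 * clip / epsilon yields
    epsilon-local differential privacy per release.
    """
    l1 = np.abs(grad).sum()
    if l1 > clip:
        grad = grad * (clip / l1)
    noise = rng.laplace(0.0, 2.0 * clip / epsilon, size=grad.shape)
    return grad + noise
```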

In large-scale online distributed network settings, the dynamic growth of the online dataset complicates the process of incorporating noise into each associated data sequence to ensure privacy. Zhou et al. [131] utilized a trusted third party to protect the privacy of OFL in a recommendation system based on adaptive binary tree-based noise aggregation. They constructed a binary item-cluster tree for each local device to reduce the scale of incoming online big data at each global round. Specifically, the item space was partitioned into refined child clusters, and the optimal recommendation for the corresponding client was searched from top to bottom of the constructed tree. Then, to ensure privacy, a trusted third party was proposed as a middleware to provide safe model aggregation over all agents using an exponential mechanism, and two forms of attacks from internal local devices and external adversaries were evaluated to demonstrate the usefulness of the approach.

OFL enables real-time distributed data training while maintaining data privacy, and much of the current literature on OFL focuses on addressing statistical and system heterogeneity rather than providing privacy guarantees. Compared to OTL studies, there are relatively few historical studies of OFL. In the following section, we will describe OTL and OFL from a practical perspective based on our discussion of the methodology in the previous two sections.

5 Practical aspects in online federated and transfer learning

Although studies of OTL and OFL have been conducted with promising results in a variety of fields in recent years, there are still practical concerns that need to be addressed. This section discusses the practical issues associated with online federated and transfer learning from two perspectives: datasets and applications.

5.1 Datasets

In this section, we first summarize all the datasets based on the obtained literature on OTL and OFL. Then we will discuss practical considerations and concerns around datasets for OTL and OFL.

The commonly used datasets for OTL and OFL are listed in Tables 4 and 5. In addition, special datasets have been used in particular studies, as described below. Five papers adopted special datasets in the obtained OTL studies. [99] used various datasets in different scenarios: the landmine dataset for landmine detection, the wdbc dataset for breast cancer diagnosis, the german dataset for credit risk detection, the spambase dataset for spam email filtering, the a9a dataset for adult income prediction, and the w8a dataset for text categorization [141]. Similarly, to explore OTL in different tasks, [96] utilized a video dataset from YouTube along with two widely used datasets from Table 4: the multi-language and text-image datasets. Moreover, several studies use medical imaging datasets for disease diagnosis. For example, [109] used electrocardiogram (ECG) data for cardiac arrhythmia detection, and [107] used electrocorticogram (ECoG) data for epileptic seizure detection.

Table 4 Summary of the commonly used datasets in OTL studies
Table 5 Summary of the commonly used datasets in OFL studies

Nine papers adopted special datasets in the obtained OFL studies. Compared to OTL, the datasets used in OFL are more personalized, which implies that more publicly recognized benchmark datasets are needed in this field. Li et al. [92] used a human activity recognition dataset [142], an eating recognition via Google Glass dataset [143], and eating habits monitoring datasets [144] for various real-world tasks. In [125], an image dataset comparable to the average image size of the CIFAR-10 dataset was used. Apart from the listed MNIST and air quality datasets, [122] also adopted the FitRec and ExtraSensory datasets for human activity analysis. Similarly, [124] used a subset of the WISDM smartphone activity and biometrics dataset [145] for human activity recognition. To explore OFL in different application scenarios, apart from the air quality dataset, [127] also used data from Twitter and the conductivity dataset [146] for OFL regression tasks, and used power consumption, parking occupancy, and traffic datasets for OFL time-series forecasting tasks. Jin et al. [128] used data from a US-centric population [73] as the FL workload trace. [131] utilized the YFCC100M dataset, the largest released public multimedia dataset. Instead of real-world datasets, two OFL papers [123, 130] used synthetic datasets for experimental analysis.

5.1.1 Popular datasets for OTL

Five datasets have been commonly used in OTL: Multi-language dataset, 20Newsgroups dataset, sentiment analysis dataset, text-image dataset, and Office-Caltech dataset.

The multi-language dataset [147] contains feature characteristics of documents written in five different languages (English, French, German, Spanish, and Italian) but sharing the same set of categories. Each language contains indexes of the documents written or translated in that language.

The 20Newsgroups dataset contains about 20,000 newsgroup documents organized by subject and subcategory. It has mainly been used for MS-BC OTL tasks [104,105,106]. Typically, researchers focus on two primary subjects, each of which has multiple subtopics. To simulate multiple learning domains, a positive label is assigned to each subtopic of one primary subject, corresponding to a negative label assigned to a subtopic of the other primary subject.

Another commonly used dataset is the sentiment analysis dataset, which consists of Amazon product reviews for four product categories (books, DVDs, electronics, and kitchen). Each review includes a human rating score (0-5 stars), a review caption, position, timestamp, an item description, a reviewer name, and the review content. This dataset has been used for SS-BC OTL [10, 93, 98] and MS-BC OTL [104, 105] tasks.

The Office-Caltech dataset [148] is made up of real-world object domains gathered from the Berkeley Office dataset [149] and Caltech-256, and it has been widely utilized in OTL tasks requiring multi-class classification. Caltech-256 contains 30,607 images from 256 categories, and the real-world object domains include Amazon, webcam, and digital single-lens reflex (DSLR) camera images.

Different from the above datasets, the text-image dataset has been utilized in a wide range of cross-modality OTL scenarios, and it is sourced from the NUS-WIDE [150] collection on Flickr. This dataset comprises photos and tags that have been published on the internet and is often used in heterogeneous SS-BC OTL. More precisely, the unlabelled text-image data pairs in this dataset are often utilized as co-occurrence data to bridge the text samples from the source domain and the images from the target domain.

5.1.2 Popular datasets for OFL

Several datasets are commonly used in OFL: CIFAR-10 and CIFAR-100, MNIST, and the air quality dataset. MNIST and CIFAR are two public datasets often utilized in OFL tasks [11, 122], particularly in simulations of non-IID settings.

Both the CIFAR-10 and CIFAR-100 datasets contain 60,000 images, with the former having 10 classes of 6,000 images each and the latter having 100 classes of 600 images each.

MNIST is a database of handwritten digits containing a training set of 60,000 instances and a testing set of 10,000 instances. It is suitable for pattern recognition tasks as it requires minimal pre-processing.

Air quality datasets collected from weather sensors in different countries were used in [122] and [127] to predict the level of pollutants in the air.

5.1.3 Practical considerations

OTL tasks are generally conducted on public datasets such as Office-Caltech, which may have storage-format restrictions and are liable to become obsolete. It is also challenging to update these existing datasets or to re-collect fresh ones. Real-world datasets, on the other hand, are difficult to obtain due to privacy regulations. Furthermore, most datasets include only a limited number of labelled instances in the target domains, making it challenging to perform cross-validation to fine-tune the target model [100]. For example, OTL applications for healthcare systems are commonly based on publicly available hospital data, and these applications may be limited to patients in a particular geographical area, since people in different geographical regions may have varying physical conditions. Additionally, an OTL system may require target patients to upload their physical states in near real-time, which is highly unlikely in practice due to privacy concerns and system/infrastructure limitations.

When designing a comparative experiment, different domain types of OTL tasks require different data settings and must comply with the same data-dividing rule. Data settings for OFL, on the other hand, are relatively complex, requiring the simulation of both the statistical heterogeneity generated by non-IID or imbalanced data and the system heterogeneity caused by varying upload rates across numerous local devices in an online scenario. To introduce statistical heterogeneity, researchers often use the standard data decentralization method [73] to sort the data by class and partition individual categories into multiple shards of varying sizes, after which each local client is allocated different shards [11, 122], as sketched below. To simulate stragglers, a random delay timer may be used to reflect varying network delays across local clients [122]. Furthermore, a data growth rate should be predetermined to imitate the growth of online data. Data settings for OFL involve a variety of parameters, and establishing a unified standard for these parameters would facilitate comparative experiments.
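As an example of such a data setting, the sketch below applies the shard-based decentralization of [73] and draws random per-client delays to mimic stragglers; the shard counts and the exponential delay distribution are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def shard_partition(labels, num_clients, shards_per_client=2):
    """Non-IID split: sort sample indices by label, cut them into shards, and
    hand each client a few shards so it observes only a few classes."""
    order = np.argsort(labels)
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)
    shard_ids = rng.permutation(num_shards)
    return [np.concatenate([shards[i] for i in
                            shard_ids[c * shards_per_client:
                                      (c + 1) * shards_per_client]])
            for c in range(num_clients)]

# Straggler simulation: random per-client delay timers, as in [122];
# the exponential distribution is an assumed choice.
delays = rng.exponential(scale=1.0, size=10)
```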

5.2 Applications

It is important to note that, despite relatively few studies focusing purely on application-based scenarios, several major prospects for OTL and OFL can be drawn from the obtained studies and the datasets summarized in the previous section, which may in turn inspire future investigations and real-world applications. This section describes the identified cutting-edge applications and discusses relevant practical considerations. Based on an exhaustive summary of the obtained papers, existing studies fall into two sectors, industrial engineering and healthcare, and we cover the application scenarios and compatibility of both OTL and OFL in each context.

5.2.1 Applications in industrial engineering

Given the achievements of OTL in domain shift scenarios and OFL in data privacy protection, it is reasonable to apply these methods to industrial engineering tasks, and Table 6 summarizes the detailed sub-scenarios of OTL and OFL applications in industrial engineering.

Table 6 Summary of sub-scenarios of OTL and OFL in industrial engineering

OFL has been used in a variety of data-sensitive industrial domains, including environmental protection [122, 127] and unmanned aerial vehicle (UAV) control [119, 121]. OTL applications, on the other hand, are most commonly found in industrial situations involving domain shift problems, such as sentiment analysis [10, 93, 98, 104, 105]. There are other situations in industrial engineering where data are likely to be sensitive and a cross-domain task is required, such as image recognition [11, 94, 100, 108, 110, 122, 126] and online recommendation systems [101, 131].

By combining data from multiple weather sensors located at nine separate locations, [122] developed a novel collaborative OFL model for predicting air pollutant levels. Apart from environmental protection, OFL has been used to control UAVs in real-time [119, 121] for mission-critical applications such as first-aid packet dispatching and firefighting [151, 152].

Sentiment analysis has arisen as a hot topic in OTL, with applications ranging from spam detection [109] to document categorization [93, 105, 106]. For example, [109] developed a spam email filter by analysing real-world emails from fifteen different users. Such a system can help reduce labour costs while safeguarding users against fraud.

Image recognition is a trending topic in OTL. Transferring information from related domains makes it possible to conduct online image classification on the target domain. In addition, computer vision-based tasks [126] are a hot topic in industrial applications of OFL. Rather than uploading annotated personal visual data to a central database, participants in an object recognition task can train a local model on their personal site. Furthermore, by leveraging an online learning framework, OFL enables computer vision-based tasks to manage massive amounts of online image data that arrive sequentially from cameras.

Additionally, OFL and OTL have been implemented in online recommendation systems. [101] proposed SocialTransfer, a cross-domain OTL system for multimedia applications that learns from time-varying social stream data. To address the privacy concerns associated with information sharing, [131] developed a privacy-preserving recommendation system based on OFL that takes advantage of the privacy guarantees provided by the federated learning architecture while still capable of managing the streaming data.

5.2.2 Applications in healthcare

OFL and OTL are both promising solutions for healthcare. For OFL, the connection between real-time data monitoring from various edge devices and hospital records breaks down analysis barriers between various parties while maintaining data privacy. Furthermore, the information required to detect a disease differs from patient to patient. Given that the medical records of each patient constitute a unique domain, OTL is well-suited for disease diagnosis, as it can leverage multiple patient records to improve the diagnosis accuracy of the target patients. OTL has been used to diagnose a wide variety of diseases, including arrhythmias [99], breast cancer [99], and epileptic seizures [107].

Nowadays, with the rapid development of the storage capacity and computing power of edge devices such as smartphones and wearable devices (e.g. Google Glass), physical data about daily human life can be collected and analysed conveniently. These data, however, are sensitive and are at risk of being compromised by unauthorized access. On the other hand, real-time monitoring systems are required for special scenarios, such as remote health-condition monitoring for the elderly living alone, as certain acute-onset diseases (e.g. heart attack, stroke) must be detected instantly. With privacy guarantees, OFL is an excellent candidate for the aforementioned application scenarios, and it has been used in a variety of healthcare applications, including human activity recognition [92, 122] and eating habits monitoring [92].

5.2.3 Practical considerations

Existing research on OTL has primarily concentrated on text/image-based applications, which may not be applicable to certain scenarios involving users who are unfamiliar with text/image input. There are studies on TL that have recommended the use of more forms of input, such as voices [153] and gestures [154]. Future OTL research should consider extending these advanced applications to online contexts, which would accommodate a variety of inputs and facilitate human-machine interaction.

While current application domains of OTL and OFL are primarily focused on industrial engineering and healthcare, there are many other areas worth exploring in TL and FL, such as smart transportation [50]. Traditional offline frameworks for smart transportation may benefit from an online environment; for example, establishing an online autonomous driving system may capture the dynamic nature of the vehicle system and the inherent uncertainty of the real-life environment, allowing drivers to make more accurate and timely decisions.

With the widespread use of edge devices, device owners can easily annotate their data by simply tagging or labeling it on the device, which has been frequently utilized in OFL research. On the other hand, malicious and false tagging will become more prevalent as local users are able to tag on their own devices. As a result, OFL must concentrate on filtering out invalid tags to ensure the accuracy of model inferences. Moreover, fewer OTL applications utilize smart edge devices due to privacy regulations regarding personal data. We anticipate that OTL models trained on real-time data generated by edge devices would perform significantly better. Therefore, there are future research opportunities to combine OTL and OFL into an online FTL framework that takes advantage of both paradigms to accomplish this vision. Having investigated OTL and OFL from practical perspectives, we conclude this survey with a discussion of several areas of future work worthy of consideration. In particular, we present a vision for online FTL and describe the proposed framework in detail.

6 Discussion and conclusion

In this survey, we have provided a systematic and comprehensive overview of OTL and OFL. OTL employs knowledge from single or multiple source domains to train online target models for the target domain, while OFL facilitates the training of online models at the edge of distributed networks. We discussed the unique properties of OTL from a domain-task perspective and described existing research on OFL addressing several major challenges. Moreover, popular datasets and cutting-edge online federated and transfer learning applications were summarized, and practical considerations were presented from both the datasets and applications perspectives. In the following, we identify open problems worthy of future research efforts and propose a vision of online federated transfer learning - a new concept we have developed with the aim of addressing the most significant challenges faced by existing studies.

From the methodology perspective, existing OTL studies have mainly focused on SS-BC and MS-BC OTL, while studies of multi-class classification OTL tasks have been relatively scarce. Therefore, sophisticated OTL frameworks for various types of learning tasks should be developed in future research. Moreover, most current OTL frameworks rely on the kernel method to build their online target classifiers, which has the distinct benefit of being more accurate than linear models. On the other hand, the kernel method also has an acknowledged disadvantage of being resource-intensive in terms of support vector storage. It is recommended that efficient solutions such as budget online kernel learning [54], which restricts the number of support vectors to a fixed budget, be included in future OTL frameworks for their potential to significantly reduce computing overhead. Studies in the field of OFL, by contrast, have frequently focused on developing effective models for a variety of asynchronous devices. Moreover, all current OFL frameworks, whether synchronous or asynchronous, assume that local devices are available during their allocated ‘working period’, which is impractical since unforeseen events may occasionally occur, rendering these local devices unavailable. As a result, a feedback mechanism could be developed in future OFL frameworks to confer sufficient authority on the local device to commence the communication process.

From the practical perspective, existing OTL studies often utilize public datasets, and real-world datasets are difficult to obtain due to data privacy regulations, as OTL is based on the assumption that all models will be trained on a centralized platform. Therefore, there is a need to collect more state-of-the-art datasets for OTL tasks. On the other hand, OFL datasets are more diverse, since local clients can retain the dataset on their own devices. However, typical OFL tasks often require complex data settings to simulate the heterogeneous scenarios of the real world, and different dataset settings make it difficult to compare different OFL frameworks. Therefore, developing unified data-setting protocols is also necessary for future research. Moreover, the most prevalent learning type in real-world applications of OTL and OFL is supervised learning, which involves revealing the label after each prediction. Although significant progress has been made in online federated and transfer learning for handling distributed time-varying data with few labels, applications for unsupervised learning remain a challenge in this field. Methods such as [155], which used a selective pseudo-labeling strategy to achieve high performance in unsupervised TL, and federated unsupervised representation learning [156], which pre-trained deep neural networks using unlabelled data in a federated setting, have shown promising outcomes recently. FL and TL, as two forms of collaborative training, hold tremendous potential in the field of unsupervised learning. Given the dynamic requirements of real-world machine learning, it is reasonable to suggest that future research on FL and TL extensions for unsupervised learning in online contexts is necessary.

The implementation scenarios of TL, FL, FTL, OTL, and OFL are summarized in Table 7, together with the ideal implementation scenarios of online FTL. As can be seen from Table 7, OTL enables standard TL to handle real-time data efficiently. As with standard TL, OTL is rarely studied in decentralized environments and carries the risk of data privacy violations due to the instance transmission process. On the other hand, OFL is able to handle local data generated in real-time as well as provide privacy guarantees. However, similar to standard FL, OFL needs to utilize special techniques, such as TL, to create personalized local models. Since FTL has gained increasing attention as numerous studies have demonstrated its efficiency [44], we envisage that extending FTL to online scenarios will enable the development of an advanced machine learning framework with a dynamic nature that leverages both the OTL and OFL paradigms for the benefit of the general user.

Table 7 Frontier implementation scenarios of different techniques

Figure 15 illustrates the proposed online FTL framework, described below. The data in the source domain can be generated in real-time or drawn from pre-given datasets. It should be noted that retaining a portion of the source data is essential to ensure the benchmark performance of the source models. Each local device in the target domain generates data in an online fashion, and the real-time data are analysed by online learners, which then attempt to formulate an optimal strategy for online updating during each training round [54]. The global model enables model aggregation, heterogeneous computing, updating, and broadcasting. Local devices, such as smartphones and laptops, provide essential infrastructure tools, including local online/offline training, uploading, and distributed storage.

Fig. 15 A vision of the online FTL framework

Various applications may be developed on top of the proposed online FTL framework to provide critical human-machine interface services. By utilizing federated learning, machine learning models for multiple parties can be established without exporting local data, ensuring data security and privacy while providing users with tailored services. Meanwhile, TL enables FL to train models across a variety of different but related parties, which is practically important given that stakeholders within the same FL framework are usually from the same sector. Furthermore, classical batch/offline learning has low efficiency in terms of computing costs, as well as limited scalability for large-scale applications, due to the need for model retraining after online data sequences are generated. We envisage that extending FTL to online scenarios will help overcome the limitations of traditional batch learning by allowing online learners to update the local model safely and rapidly.

To summarize, this survey aims to serve as a resource for researchers and practitioners developing online federated and transfer learning frameworks. It provides a systematic and comprehensive description of OTL and OFL, and identifies open research questions worthy of future research efforts. Finding solutions to such new and arising research problems from methodologies to practical applications will necessitate collaborative and long-term efforts from various research communities.