1 Introduction

As one of the fastest-growing technical fields today, machine learning lies at the core of artificial intelligence and data science. It addresses the question of how computers can automatically improve through experience [1]. Machine learning has been widely applied in intrusion detection [2, 3], computer vision [4], data mining [5], text classification [6], spam detection [7], pattern recognition [8], and other fields. However, its further development in these fields is restricted by the shortcomings of traditional machine learning methods. Traditional machine learning classification tasks usually rest on two basic assumptions: there are enough samples in the training data set to train a high-precision classifier, and the training and test data come from the same feature space and follow the same distribution. In practical applications, however, training and test data usually come from different domains with differing marginal or conditional probabilities, so their distributions also differ. When the distribution changes, most machine learning algorithms need the training data to be re-collected. In many real-world applications, re-collecting training data and rebuilding the model is very expensive, or even impossible [9].

In this case, transferring knowledge between learning task domains is desirable. The motivation of transfer learning research is that knowledge learned previously can help people better solve new problems [9, 10]; its purpose is to build a model for the target domain by using labeled information from another related domain (the source domain). Accordingly, transfer learning is defined in Wikipedia as "a new machine learning method that uses existing knowledge to solve problems in different but similar fields" [11]. It no longer relies on the two basic assumptions of traditional machine learning; instead, existing knowledge is transferred to solve problems where the target field has only a small amount of labeled sample data. The difference between traditional machine learning and transfer learning is shown in Fig. 1: each learning task in traditional machine learning (Fig. 1a) starts from zero, whereas with transfer learning (Fig. 1b) knowledge from previous learning tasks can be transferred to the current target learning task. Representative algorithms in transfer learning research are as follows. Gao et al. [12] proposed an integrated framework for a locally weighted ensemble of multiple models (LWE). LWE can integrate the advantages of various algorithms and labels learned in multiple training domains into one model, and dynamically assigns weights according to the predictive ability of each model on each instance. Pan et al. [13] proposed a new domain adaptation learning method, transfer component analysis (TCA). TCA tries to learn several transfer components across domains in a reproducing kernel Hilbert space (RKHS) by minimizing the maximum mean discrepancy (MMD). In the subspace spanned by these components, data properties are preserved and the data distributions of different domains are close to each other.
Long et al. [14] proposed the adaptive regularization transfer learning algorithm (ARTL), based on the structural risk minimization principle and regularization theory. Li et al. [15] implemented the transfer learning algorithm RankRE-TL, based on a knowledge-reuse transfer mechanism and an error-set selection method with rank reduction. A new SVM-based model transfer method was proposed in [16], in which a large-margin classifier is trained on the labeled target samples and adjusted by an offset from the source classifier; this method is called the Heterogeneous Max-margin Classifier Adaptation method (HMCA). Xie et al. [17] proposed a new supervised domain adaptation method based on dual support vector machines, called adaptive dual support vector machines for the aggregation domain.

Fig. 1

Difference between transfer learning and traditional machine learning

However, the transfer learning algorithms above transfer knowledge from only one source domain to the target domain. In transfer learning, the performance of the target classifier largely depends on the correlation between the source domain and the target domain. If the correlation is strong, it helps to improve the learning effect in the target domain; otherwise, it reduces the learning effect, leading to the phenomenon of negative transfer [12]. One strategy to reduce negative transfer is to import knowledge from multiple sources, which increases the chance of discovering a source domain closely related to the target domain [18]. Typical research on multi-source transfer learning algorithms is as follows. A novel task-based instance transfer boosting technique, TransferBoost, was proposed in [19]; it selectively transfers knowledge from the source domains to the target task and performs boosting at both the instance level and the task level. Source tasks that show good transferability to the target task are assigned higher weights, and the weight of each instance in each source task is adjusted through AdaBoost. Yao et al. [20] introduced multiple source domains to solve the problem of negative transfer and proposed two new algorithms, MultiSource-TrAdaBoost and TaskTrAdaBoost. Duan et al. [21] proposed a multi-source domain adaptation method (DAM) that uses a set of pre-trained classifiers (called auxiliary/source classifiers) built from the labeling patterns of multiple source domains to learn a robust decision function (called the target classifier) for label prediction in the target domain. An incomplete multi-source transfer learning algorithm, IMTL, was proposed in [22], which transfers knowledge in two directions (i.e., cross-domain transfer from each source to the target, and cross-source transfer).
In [23], a Bayesian framework for transfer learning with neural networks is presented that considers single and multiple sources of data. Other research on multi-source transfer learning can be found in [24,25,26]. Today, transfer learning has been applied in speech recognition, computer vision, information retrieval, natural language processing, adaptive map coverage updating, fault diagnosis, automatic detection of COVID-19 infection, and other fields [27,28,29,30,31,32,33].

This paper draws on structural risk minimization theory, support vector machines, and multi-source transfer learning theory. Aiming at the application scenario where there is only a small amount of labeled data in the target domain but a large amount of labeled data in multiple source domains, a new multi-source fast transfer algorithm, MultiFTLSVM, is proposed. The idea of MultiFTLSVM is to integrate the knowledge within the target domain and the labeled data in the source domains into a structural risk minimization framework of support vector machines. The knowledge to be transferred from each source domain is selected by constructing a similarity distance term based on the MMD between the target domain and that source domain; the AESVM algorithm is used to reduce the sample size of the source domains and thus improve training efficiency; an optimizable objective function is then constructed. A theoretical proof shows that solving the objective function is a quadratic programming problem with an optimal solution.

Compared with previous work, the contributions of this paper include:

  1. Existing data sets are reused, which reduces the cost of collecting new data sets.

  2. AESVM no longer requires all training samples to train the learning model, which can greatly reduce the size of the training set and thus the training cost of the learning model.

  3. To prevent negative transfer and improve classification performance, multi-source transfer learning extracts knowledge from multiple source domains simultaneously to assist the learning task in the target domain. In this process, the knowledge of the source domains and samples most similar to the target domain is transferred to the greatest extent.

The rest of the paper is arranged as follows: The related work of multi-source transfer learning, approximate pole support vector machine (AESVM), and maximum mean discrepancy (MMD) are reviewed in Section 2. The construction and training process of multi-source fast transfer learning is introduced in detail in Section 3. The effectiveness of the algorithm on the 20-Newsgroups text data set, sentiment analysis data set and the spam data set is verified in Section 4. The main work of the paper is summarized in Section 5.

2 Overview of related work

In this section, we give a brief introduction to multi-source transfer learning, the approximate pole support vector machine (AESVM), and the maximum mean discrepancy (MMD).

2.1 Multi-source transfer learning

Transfer learning has been widely studied since it was proposed at NIPS-95 in 1995. Compared with traditional machine learning algorithms, transfer learning has significant advantages: useful knowledge can be taken from the source domain to significantly improve learning performance in the target domain and greatly reduce costly data-labeling work, and there is no need for the training and test data to follow the same distribution. At present, most transfer learning tasks transfer knowledge from only one source domain to the target domain [9]. However, in real-world applications we can easily collect auxiliary data from multiple source domains, so the study of multi-source transfer learning has gradually aroused the interest of researchers [18]. As shown in Fig. 2, multi-source transfer exploits the relationship between the source domains and the target domain to improve predictive performance on target-domain samples and to help the target domain establish a prediction model.

Fig. 2

Multi-source transfer learning

In Fig. 2, \( \left({D}_{S_1},{T}_{S_1}\right) \), \( \left({D}_{S_2},{T}_{S_2}\right) \), …, \( \left({D}_{S_n},{T}_{S_n}\right) \) represent n source domains and their corresponding learning tasks. (DT, TT) represents the target domain and its corresponding learning task. ft represents the target-domain classifier obtained by training on the data sets of the target domain DT and the source domains \( {D}_{S_i}\left(i=1,\dots, n\right) \).

Generally, multi-source transfer learning algorithms can be divided into two categories: boosting-based methods [19, 20] and regularization-based methods [21,22,23,24]. Additionally, [25] divided multi-source transfer learning into sample-based, parameter-based, and feature-based multi-source transfer methods. Regularized multi-source transfer methods need to design a regularization term, while boosting-based methods adjust the weights of different domains or instances to transfer knowledge. The focus of these methods is sample transferability, rather than studying which source domain transfers better. When there are multiple source domains, determining which source domain has better transferability is an important issue [26].

2.2 Approximate pole support vector machine (AESVM)

The basic idea of support vector machines is to find the hyperplane with the largest margin between the two classes; from a geometric point of view, computing the maximum-margin hyperplane is equivalent to finding the nearest points of the two convex hulls [34, 35]. For SVM, a prerequisite for good training results is a large number of training samples, which not only requires a lot of manpower for labeling but also consumes a lot of time in the training phase, so the training efficiency of SVM is not very satisfactory.

A training data set X = {x1, x2, …, xn} is given, with the corresponding class label set Y = {y1, y2, …, yn}, yi ∈ {1, −1}. The AESVM optimization problem is described in Eq. (1):

$$ \underset{w,b}{\min }{F}_{AESVM}\left(w,b\right)=\frac{1}{2}{w}^Tw+\frac{C}{M}\sum \limits_{i=1}^M{\beta}_il\left(w,b,\varphi \left({x}_i\right)\right) $$
(1)

In Eq. (1), M is the number of samples in the representative data set X∗ selected from the data set X; the parameters w, b, i and C are the normal vector, the displacement term, the sample number and the regularization coefficient; the vector β = [β1, β2, …, βM] is the weight vector corresponding to the representative data set samples, as defined in Eq. (4); l is the hinge loss function, l(w, b, ϕ(xi)) = max {0, 1 − yi(wTϕ(xi) + b)}, xi ∈ X∗; and ϕ(⋅) is the nonlinear mapping function. The kernel function can be written as Ki, j = k(xi, xj) = ϕ(xi)Tϕ(xj).

To obtain the representative data set X∗, the data set X needs to be grouped according to a chosen separation strategy: X = {X1, X2, …, Xn/V}, where n is the number of samples in X, V is the maximum number of samples in each group, and Xq(q = 1, 2, …, n/V) denotes group q. The sample data within each group have high similarity, while sample data in different groups have low similarity; here similarity means that the distance between samples within a subset is less than the distance between samples in different subsets. In each subset Xq, the representative data set \( X{}_q{}^{\ast } \) and the corresponding weight vector βq are calculated. Finally, the representative data sets \( X{}_q{}^{\ast } \) obtained from all subsets Xq are merged into the representative data set X∗.
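The separation strategy itself is left open above; as one illustrative possibility (an assumption of this sketch, not the paper's prescribed method), the partition of X into roughly n/V mutually similar subsets can be realized with a plain k-means clustering:

```python
import numpy as np

def group_dataset(X, V, n_iter=20, seed=0):
    """Partition X into about n/V groups of mutually similar samples.

    A simple k-means clustering stands in for the unspecified
    separation strategy (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n_groups = max(1, len(X) // V)
    # Initialize centers with randomly chosen samples.
    centers = X[rng.choice(len(X), n_groups, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Recompute centers; keep the old center if a group is empty.
        for q in range(n_groups):
            if (labels == q).any():
                centers[q] = X[labels == q].mean(0)
    return [X[labels == q] for q in range(n_groups)]
```

Each returned group then feeds the SVDD-based representative-set selection described below.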

The specific process of obtaining representative data sets from the data set X is as follows.

Firstly, the initial representative data set \( X{}_q{}^{\ast } \) is computed with the SVDD algorithm [36]; then each sample \( {x}_i\left({x}_i\in {X}_q\ \mathrm{and}\ {x}_i\notin {X}_q^{\ast}\right) \) is tested to determine whether or not it belongs to the representative data set \( X{}_q{}^{\ast } \), which is formally described by Eq. (2).

$$ \Big\{{\displaystyle \begin{array}{c}\underset{x_i\in {X}_q,{x}_i\notin {X}_q^{\ast }}{\max }f\left(\varphi \left({x}_i\right),{X}_q^{\ast}\right)=\underset{\mu_{it}}{\min }{\left\Vert \varphi \left({x}_i\right)-{\sum}_{j=1}^{\mid {X}_q^{\ast}\mid }{\mu}_{i,t}\varphi \left({x}_j\right)\right\Vert}^2\le \varepsilon \\ {}s.t.0\le {\mu}_{i,j}\le 1,\sum \limits_{j=1}^{\mid {X}_q^{\ast}\mid }{\mu}_{i,j}=1,{x}_j\in {X}_q^{\ast}\end{array}} $$
(2)

In Eq. (2), \( \mid {X}_q^{\ast}\mid \) is the number of samples in the set \( {X}_q^{\ast } \), ε is a small positive constant given in advance, μi, j is the coordination coefficient, and j is the index of samples in \( {X}_q^{\ast } \). For each xi in Xq meeting Eq. (2), the representative sample set is extended as \( {X}_q^{\ast }={X}_q^{\ast}\cup \left\{{x}_i\right\} \). For all samples \( {x}_i\left({x}_i\in {X}_q\ \mathrm{and}\ {x}_i\notin {X}_q^{\ast}\right) \), Eq. (3) follows from Eq. (2):

$$ \varphi \left({x}_i\right)=\sum \limits_{x_j\in {X}_q^{\ast }}{\gamma}_{i,j}\varphi \left({x}_j\right)+{\tau}_i $$
(3)

In Eq. (3), \( {\gamma}_{i,j}=\left\{\begin{array}{l}{\mu}_{i,j},\ {x}_j\in {X}_q^{\ast}\ \mathrm{and}\ {x}_i\in {X}_q\\ {}0,\ \mathrm{otherwise}\end{array}\right. \), and τi is the approximation error vector with ‖τi‖2 ≤ ε. The weight vector corresponding to the representative data set is obtained from γi, j as in Eq. (4):

$$ {\beta}_j=\sum \limits_{i=1}^n{\gamma}_{i,j} $$
(4)

2.3 Maximum mean discrepancy

In transfer learning, differences in sample distribution lead to the problem of negative transfer, so it is necessary to select a convenient measure of the distance between distributions. MMD is an effective measure for estimating the distance between two different distributions in a Hilbert space. Its value is the largest difference between the mean function values of the two distributions over a class of witness functions; the witness function that best separates the two distributions is restricted to a unit ball in an RKHS.

Given a set DS containing ns training samples and a set DT containing nt test samples, the formal definition of MMD in the Hilbert space H with nonlinear mapping function ϕ is as follows:

$$ MM{D}_H\left({D}_S,{D}_T\right)={\left\Vert \frac{1}{n_s}\sum \limits_{i=1}^{n_s}\phi \left({x}_i^s\right)-\frac{1}{n_t}\sum \limits_{j=1}^{n_t}\phi \left({x}_j^t\right)\right\Vert}_H $$
(5)

In Eq. (5), the empirical estimate of the discrepancy between the two distributions is taken as the distance between the two data distributions in the Hilbert space; an MMD value close to zero indicates that the two distributions match. The MMD measure is now often used in transfer learning to calculate distribution distances between domains.
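Using the kernel trick, the squared MMD of Eq. (5) expands into three mean kernel terms, so the mapping ϕ never has to be formed explicitly. A minimal sketch (the RBF kernel and its bandwidth are illustrative assumptions, not choices made by the paper):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd(Xs, Xt, gamma=1.0):
    # Empirical MMD of Eq. (5):
    # || (1/ns) sum phi(x^s) - (1/nt) sum phi(x^t) ||_H,
    # expanded via the kernel trick into three mean kernel terms.
    kss = rbf_kernel(Xs, Xs, gamma).mean()
    ktt = rbf_kernel(Xt, Xt, gamma).mean()
    kst = rbf_kernel(Xs, Xt, gamma).mean()
    # Clamp at zero to guard against tiny negative rounding errors.
    return np.sqrt(max(kss + ktt - 2.0 * kst, 0.0))
```

Identical sample sets give an MMD of zero, while samples drawn from well-separated distributions give a clearly positive value.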

3 Multi-source fast transfer support vector machine algorithm (MultiFTLSVM)

This section describes the multi-source fast transfer algorithm MultiFTLSVM in detail. The algorithm framework is shown in Fig. 3. As shown in Fig. 3, the input of the MultiFTLSVM framework consists of two parts: the labeled samples contained in the N source domains and a small number of labeled samples contained in the target domain. For convenience, the binary classification case is considered.

Fig. 3

Framework of MultiFTLSVM

N source domains are defined as \( {D}_S=\left\{{D}_{S_i}={\left({x}_j^{S_i},{y}_j^{S_i}\right)}_{j=1}^{n_{S_i}},i=1,\dots, N\right\} \), where \( {x}_j^{S_i} \) is the jth sample in the ith source domain and \( {y}_j^{S_i} \) is its corresponding label. \( {n}_{S_i} \) is the number of samples in the ith source domain, and the joint distribution of \( {D}_{S_i} \) is \( {P}_{S_i} \). Similarly, the target domain is defined as \( {D}_T={\left({x}_i^T\right)}_{i=1,..,{n}_T} \), with corresponding joint distribution PT. \( {P}_{S_i}\left({x}^{S_i}\right) \) and PT(xT) are the marginal probabilities of the source domain \( {D}_{S_i} \) and the target domain DT, with \( {P}_{S_i}\left({x}^{S_i}\right)\ne {P}_T\left({x}^T\right) \). The MultiFTLSVM algorithm addresses the difference between the source-domain samples and the target domain by reducing this marginal probability difference.

Figure 3 shows the framework of the MultiFTLSVM algorithm. Firstly, the weights reflecting each source-domain sample's marginal probability difference are calculated. Then, the objective function is constructed by combining support vector machines, structural risk minimization theory, and similarity distance minimization. Finally, the objective function is analyzed, proved convex, and solved. The detailed construction of MultiFTLSVM is as follows.

3.1 Re-weighting data samples based on marginal probability differences

For convenience of calculation, the similarity weights \( {\gamma}_j^{S_i} \) between the samples of each source domain \( {D}_{S_i} \) and the target domain are obtained with the MMD measure of Eq. (5) in Section 2.3, which is modified as follows:

$$ \underset{\gamma^{S_i}}{\min }{\left\Vert \frac{1}{n_{s_i}}\sum \limits_{j=1}^{n_{S_i}}{\gamma}_j^{S_i}\phi \left({x}_j^{s_i}\right)-\frac{1}{n_T}\sum \limits_{j=1}^{n_T}\phi \left({x}_j^T\right)\right\Vert}_H $$
(6)

ϕ is the mapping function into the Hilbert space H, \( {n}_{s_i} \) is the number of samples in the source domain \( {D}_{S_i} \), nT is the number of samples in the target domain DT, and \( {\gamma}^{S_i} \) is the weight vector of dimension \( {n}_{S_i} \). The minimization of Eq. (6) is a standard quadratic programming problem, which can be solved with many existing solvers.
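Expanding the squared norm in Eq. (6) with the kernel trick gives a quadratic objective in \( {\gamma}^{S_i} \). As a simplified sketch (the paper solves the constrained QP; here we take the unconstrained minimizer, stabilized with a small ridge term, which is an assumption):

```python
import numpy as np

def domain_weights(Kss, Kst, ridge=1e-6):
    # Kss: (ns, ns) kernel matrix over the source-domain samples.
    # Kst: (ns, nt) kernel matrix between source and target samples.
    # Setting the gradient of the squared Eq. (6) objective to zero gives
    #   Kss @ gamma = (ns / nt) * Kst @ 1,
    # solved here with a small ridge term for numerical stability.
    ns, nt = Kst.shape
    rhs = (ns / nt) * Kst.sum(axis=1)
    return np.linalg.solve(Kss + ridge * np.eye(ns), rhs)
```

As a sanity check, when the source and target samples coincide the recovered weights are all close to one, i.e. no re-weighting is needed.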

3.2 Objective function construction of MultiFTLSVM

Based on Section 3.1 and support vector machines, the objective function of MultiFTLSVM is constructed by combining structural risk minimization theory and similarity distance minimization, as follows:

$$ {\displaystyle \begin{array}{c}\underset{f_t,{f}_s\in {H}_k}{\min}\frac{1}{2N}\sum \limits_{i=1}^N{\left\Vert {f}_{S_i}\right\Vert}_K^2+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\beta}_j^{S_i}{l}_{S_i}\left({f}_{S_i},{y}_j\right)+\frac{1}{2}{\left\Vert {f}_t\right\Vert}_K^2\\ {}+{C}_t\sum \limits_{i=1}^{n_T}{l}_t\left({f}_t,{y}_i\right)+\frac{\lambda }{2N}\sum \limits_{i=1}^Nd\left({f}_t,{f}_{S_i}\right)\end{array}} $$
(7)

\( {f}_{S_i} \) is the decision function of the ith source domain and ft is the decision function of the target domain. \( {\left\Vert {f}_{S_i}\right\Vert}_K^2 \) and \( {\left\Vert {f}_t\right\Vert}_K^2 \) are the structural risk terms controlling classifier complexity in the source domains and target domain respectively, where ‖f‖2 denotes the squared norm. \( {C}_{S_i} \) and Ct are regularization coefficients, l(⋅) is the loss function, the function d(⋅) quantifies the difference between two domains, and λ is a trade-off parameter. Mi is the number of representative data set samples in the ith source domain, and \( {\beta}^{S_i}=\left[{\beta}_1^{S_i},{\beta}_2^{S_i},\dots, {\beta}_{M_i}^{S_i}\right] \) is the corresponding weight vector of the representative data set of each source domain.

Equation (7) contains three parts. The first part, \( \frac{1}{2N}\sum \limits_{i=1}^N{\left\Vert {f}_{S_i}\right\Vert}_K^2+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\beta}_j^{S_i}{l}_{S_i}\left({f}_{S_i},{y}_j\right) \), represents the knowledge learned from each source domain. The second, \( \frac{1}{2}{\left\Vert {f}_t\right\Vert}_K^2+{C}_t\sum \limits_{i=1}^{n_T}{l}_t\left({f}_t,{y}_i\right) \), represents the knowledge learned from the target domain. The third, \( \frac{\lambda }{2N}\sum \limits_{i=1}^Nd\left({f}_t,{f}_{S_i}\right) \), is the regularization term, which ensures good generalization performance by minimizing the difference between each source domain and the target domain. With the simple quadratic distance measure, \( \frac{\lambda }{2N}\sum \limits_{i=1}^Nd\left({f}_t,{f}_{S_i}\right) \) is expressed as \( \frac{\lambda }{2N}\sum \limits_{i=1}^N{\left\Vert {w}_t-{\beta}^{s_i}{w}_{s_i}\right\Vert}^2 \). To counter negative transfer, the weights \( {\gamma}^{S_i} \) obtained in Section 3.1 are combined with \( {\beta}^{S_i} \), and the combination replaces the weight vector \( {\beta}^{S_i} \) of the representative data set, as shown in Eq. (8).

$$ {\rho}^{S_i}={c}_1{\beta}^{S_i}+{c}_2{\gamma}^{S_i}\kappa \left({D}_{S_i}\right) $$
(8)

In Eq. (8), c1 and c2 are combination coefficients with c1 + c2 = 1, and \( \kappa \left({D}_{S_i}\right) \) is the mapping between the samples of the source domain \( {D}_{S_i} \) and their corresponding representative data set samples.
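A small sketch of Eq. (8): since \( {\beta}^{S_i} \) is indexed by representatives while \( {\gamma}^{S_i} \) is indexed by source samples, κ must carry sample weights onto representatives. Here κ is realized by averaging γ over the samples assigned to each representative, which is an assumption about its unspecified form:

```python
import numpy as np

def combine_weights(beta, gamma, assign, c1=0.5, c2=0.5):
    # rho^{S_i} = c1 * beta^{S_i} + c2 * gamma^{S_i} * kappa(D_{S_i}),
    # with c1 + c2 = 1.  kappa is realized by averaging gamma over the
    # source samples assigned to each representative (an assumption,
    # since the paper does not spell out kappa's form).
    # assign[i] = index of the representative for source sample i.
    beta, gamma, assign = map(np.asarray, (beta, gamma, assign))
    mapped = np.zeros(len(beta))
    for j in range(len(beta)):
        mask = assign == j
        if mask.any():
            mapped[j] = gamma[mask].mean()
    return c1 * beta + c2 * mapped
```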

With the weights of Eq. (8), Eq. (7) is rewritten as Eq. (9).

$$ {\displaystyle \begin{array}{c}\underset{w_t,{b}_t,{w}_s,{b}_s}{\min}\frac{1}{2}{\left\Vert {w}_t\right\Vert}^2+{C}_t\sum \limits_{i=1}^{n_T}{l}_t\left({w}_t^T\varphi (x)+{b}_t,{y}_i\right)+\frac{1}{2N}\sum \limits_{i=1}^N{\left\Vert {w}_{S_i}\right\Vert}^2\\ {}+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\rho}^{S_i}{l}_{S_i}\left({w}_{S_i}^T\varphi (x)+{b}_{S_i},{y}_j\right)+\frac{\lambda }{2N}\sum \limits_{i=1}^N{\left\Vert {w}_t-{w}_{S_i}\right\Vert}^2\end{array}} $$
(9)

In Eq. (9), a hinge loss function is introduced in each source domain and target domain. Therefore, Eq. (9) can be transformed into the optimization problem shown in Eq. (10):

$$ {\displaystyle \begin{array}{c}\underset{w_t,{b}_t,{w}_s,{b}_s}{\min}\frac{1}{2}{\left\Vert {w}_t\right\Vert}^2+{C}_t\sum \limits_{i=1+\sum \limits_j^N{M}_j}^{\sum \limits_j^N{M}_j+{n}_T}{\xi}_i+\frac{1}{2N}\sum \limits_{i=1}^N{\left\Vert {w}_{s_i}\right\Vert}^2+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\rho}_j^{S_i}{\xi}_j^{S_i}\\ {}+\frac{\lambda }{2N}\sum \limits_{i=1}^N{\left\Vert {w}_t-{w}_{S_i}\right\Vert}^2\\ {}s.t.\\ {}{y}_j^{s_i}\left({w}_{s_i}^T\varphi \left({x}_j^{s_i}\right)+{b}_{s_i}\right)\ge 1-{\xi}_j^{s_i},j=1,\dots, {n}_{s_i},{s}_i=1,\dots, N\\ {}\tilde{y_i}\left({w}_t^T\varphi \left({x}_i^t\right)+{b}_t\right)\ge 1-{\xi}_i,i=1,\dots, {n}_T\end{array}} $$
(10)

In Eq. (10), \( {\xi}_j^{s_i} \) (\( {\xi}_j^{s_i}\ge 0 \)) and ξi (ξi ≥ 0) are slack variables. The first constraint ensures that the learning tasks of each source domain are classified as correctly as possible, and the second constraint ensures the same for the target domain.

3.3 Objective function theorem proofing

Theorem 1:

The dual problem of Eq. (10) is a quadratic programming (QP) problem, as shown in Eq. (11).

(11)

Proof: The Lagrangian function of Eq. (10) is as follows:

$$ {\displaystyle \begin{array}{c}L\left({w}_t,{w}_s,{b}_t,{b}_s,\xi, {\xi}^s,\alpha, {\alpha}^s,r,{r}^s\right)=\\ {}\frac{1}{2}{\left\Vert {w}_t\right\Vert}^2+\frac{1}{2\mathrm{N}}\sum \limits_{i=1}^{\mathrm{N}}{\left\Vert {w}_{S_i}\right\Vert}^2+{C}_t\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}{\xi}_i+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\rho}_j^{S_i}{\xi}_j^{S_i}\\ {}+\frac{\lambda }{2N}\sum \limits_{i=1}^N{\left\Vert {w}_t-{w}_{{\mathrm{S}}_i}\right\Vert}^2-\frac{1}{N}\sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}{r}_j^{S_i}{\xi}_j^{S_i}-\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}{r}_i{\xi}_i\\ {}-\sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}{\alpha}_j^{{\mathrm{S}}_i}\left({y}_j^{{\mathrm{S}}_i}\left({w}_{{\mathrm{s}}_i}^T\varphi \left({x}_j^{{\mathrm{S}}_i}\right)+{b}_{s_i}\right)-1+{\xi}_j^{S_i}\right)\\ {}-\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}{\alpha}_i\left(\tilde{y_i}\left({w}_t^T\varphi \left({x}_i^t\right)+{b}_t\right)-1+{\xi}_i\right)\end{array}} $$
(12)

where \( {\alpha}^{S_i}=\left({\alpha}_1^{S_i},{\alpha}_2^{S_i},\dots, {\alpha}_{M_i}^{S_i}\right),\alpha =\left({\alpha}_1,{\alpha}_2,\dots, {\alpha}_{n_T}\right) \) and \( {r}^{S_i}=\left({r}_1^{S_i},{r}_2^{S_i},\dots, {r}_{M_i}^{S_i}\right),r=\left({r}_1,{r}_2,\dots, {r}_{n_T}\right) \) are the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) conditions [17]:

$$ \frac{\partial L}{\partial {\xi}_j^{S_i}}=0\Rightarrow \sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}\left({r}_j^{S_i}+{\alpha}_j^{S_i}\right)=\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\rho}_j^{S_i} $$
(13)
$$ \frac{\partial L}{\partial {\xi}_i}=0\Rightarrow \sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}\left({\alpha}_i+{r}_i\right)=\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}{C}_t $$
(14)
$$ \frac{\partial L}{\partial {\mathbf{w}}_{s_i}}=0\Rightarrow \frac{1}{\mathrm{N}}\sum \limits_{i=1}^{\mathrm{N}}{\mathbf{w}}_{s_i}-\frac{\lambda }{N}\sum \limits_{i=1}^N\left({\mathbf{w}}_t-{\mathbf{w}}_{s_i}\right)-\sum \limits_{i=1}^N\sum \limits_{j=1}^{{\mathrm{M}}_{\mathrm{i}}}{\alpha}_j^{S_i}{y}_j^{S_i}\varphi \left({\mathbf{x}}_j^{S_i}\right)=0 $$
(15)
$$ \frac{\partial L}{\partial {w}_t}=0\Rightarrow {w}_t+\frac{\lambda }{N}\sum \limits_{i=1}^N\left({w}_t-{w}_{s_i}\right)-\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}_T}{\alpha}_i\tilde{y_i}\varphi \left({x}_j\right)=0 $$
(16)
$$ \frac{\partial L}{\partial {b}_{s_i}}=0\Rightarrow \sum \limits_{i=1}^M\sum \limits_{j=1}^{n_{s_i}}{\alpha}_j^{S_i}{y}_j^{S_i}=0 $$
(17)
$$ \frac{\partial L}{\partial {b}_t}=0\Rightarrow \sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}_T}{\alpha}_i\tilde{y_i}=0 $$
(18)

Substituting Eqs. (13)–(18) back into the Lagrangian of Eq. (12) and simplifying yields the dual problem Eq. (11). Theorem 1 is thus proved.

Theorem 2:

The quadratic programming form of the optimization problem of Eq. (11) is a standard convex quadratic programming problem.

Proof: The matrix \( \tilde{\mathbf{K}} \) can be decomposed as \( \tilde{\mathbf{K}}={\tilde{\mathbf{K}}}_1+\tilde{{\mathbf{K}}_2}+\tilde{{\mathbf{K}}_3}+\tilde{{\mathbf{K}}_4} \), where the forms of \( {\tilde{\mathbf{K}}}_1 \), \( \tilde{{\mathbf{K}}_2} \), \( \tilde{{\mathbf{K}}_3} \) and \( \tilde{{\mathbf{K}}_4} \) are as follows:

$$ {\displaystyle \begin{array}{c}{\tilde{K}}_1=\frac{\lambda }{1+2\lambda N}{\left[\begin{array}{c}{K}_{s_1}{,}_{s_1},\dots, {K}_{s_1}{,}_{s_N},-{K}_{s_1,t}\\ {}\dots \\ {}{K}_{s_N}{,}_{s_1},\dots, {K}_{s_N}{,}_{s_N},-{K}_{s_N,t}\\ {}-{K}_{s_1,t}^T,\dots, -{K}_{s_N,t}^T,{K}_{t,t}\end{array}\right]}_{\left(\sum \limits_{i\in N}{M}_i+{n}_T\right)\times \left(\sum \limits_{i\in N}{M}_i+{n}_T\right)}\\ {}{\tilde{K}}_2=\frac{N}{1+2\lambda N}{\left[\begin{array}{c}{K}_{s_1}{,}_{s_1},\dots, {K}_{s_1}{,}_{s_N},0\\ {}\dots \\ {}{K}_{s_N}{,}_{s_1},\dots, {K}_{s_N}{,}_{s_N},0\\ {}0,\dots, 0,0\end{array}\right]}_{\left(\sum \limits_{i\in N}{M}_i+{n}_T\right)\times \left(\sum \limits_{i\in N}{M}_i+{n}_T\right)}\\ {}{\tilde{K}}_3=\frac{\lambda }{N}{\left[\begin{array}{c}1,\dots, 1,0,\\ {}\dots \\ {}1,\dots, 1,0\\ {}0,\dots, 0,0\\ {}0,\dots, 0,0\end{array}\right]}_{\left(\sum \limits_{i\in N}{M}_i+{n}_T\right)\times \left(\sum \limits_{i\in N}{M}_i+{n}_T\right)}\\ {}{\tilde{K}}_4=\frac{N}{1+2\lambda N}{\left[\begin{array}{c}0,\dots, 0,0\\ {}\dots \\ {}0,\dots, {K}_{t,t}\end{array}\right]}_{\left(\sum \limits_{i\in N}{M}_i+{n}_T\right)\times \left(\sum \limits_{i\in N}{M}_i+{n}_T\right)}\end{array}} $$

For the matrix \( {\tilde{\mathbf{K}}}_1 \), let \( {Q}_1=\sqrt{\frac{\lambda }{1+2\lambda N}}\left({y}_1^{S_1}\varphi \left({x}_1^{S_1}\right),\dots, {y}_{M_1}^{S_1}\varphi \left({x}_{M_1}^{S_1}\right),\dots, {y}_1^{S_N}\varphi \left({x}_1^{S_N}\right),\dots, {y}_{M_N}^{S_N}\varphi \left({x}_{M_N}^{S_N}\right),-\sum \limits_{i\in {n}_T}\varphi \left({x}_i\right),\dots, -\sum \limits_{i\in {n}_T}\varphi \left({x}_i\right)\right) \). It is obvious that \( {\tilde{\mathbf{K}}}_1={\mathbf{Q}}_1^T{\mathbf{Q}}_1 \), so \( {\tilde{\mathbf{K}}}_1 \) is a symmetric positive semidefinite matrix; by the same argument, \( \tilde{{\mathbf{K}}_2} \), \( \tilde{{\mathbf{K}}_3} \) and \( \tilde{{\mathbf{K}}_4} \) are symmetric positive semidefinite. Therefore, \( \tilde{\mathbf{K}} \) is symmetric positive semidefinite and Eq. (11) is a standard convex quadratic programming problem. Theorem 2 is thus proved.

Theorem 3:

The solution to the quadratic programming problem of Eq. (11) is the optimal solution.

Proof: Since Eq. (11) is a convex quadratic programming problem and the KKT conditions are also sufficient conditions, the solution obtained is the optimal solution. For the solution of convex quadratic programs, refer to [37].

The optimal value \( {\boldsymbol{\Gamma}}^{\ast }={\left({\alpha}^{s_1},{\alpha}^{s_2},\dots, {\alpha}^{s_N},\alpha \right)}^{\mathrm{T}} \) of Γ is obtained from Eq. (11), and the optimal solutions for the parameters wt and bt are as follows:

$$ {w}_t^{\ast }=\frac{\lambda N}{1+2\lambda N}\sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}{\tilde{\alpha}}_j^{s_i}{\rho}_j^{S_i}{y}_j^{S_i}\varphi \left({x}_j^{S_i}\right)+\frac{N+\lambda }{1+2\lambda N}\sum \limits_{i=1+\sum \limits_{j=1}^N{M}_j}^{\sum \limits_{j=1}^N{M}_j+{n}_T}{\tilde{\alpha}}_i\sum \limits_{j\in {n}_T}\varphi \left({x}_j\right) $$
(19)
$$ {\displaystyle \begin{array}{c}{b}_t^{\ast }={y}_i-\frac{\lambda N}{1+2\lambda N}\sum \limits_{i^{\prime }=1}^N\sum \limits_{j=1}^{M_{i\prime }}{\rho}_j^{S_{i\prime }}{\alpha}_j^{S_{i\prime }}{y}_j\sum \limits_{q\in {S}_{i\prime }}k\left({x}_j,{x}_q\right)\\ {}-\frac{\lambda +N}{1+2\lambda N}\sum \limits_{i^{\prime }=1}^N\sum \limits_{j=1+\sum \limits_{l\in M}{M}_l}^{\sum \limits_{l\in M}{M}_l+{n}_T}\tilde{\alpha_j}\sum \limits_{j^{\prime}\in {n}_T}\sum \limits_{q\in {n}_T}k\left({x}_{j\prime },{x}_q\right)\end{array}} $$
(20)

Finally, the decision function for the MultiFTLSVM algorithm is as follows:

$$ f(x)={\mathbf{w}}_t\varphi (x)+{b}_t $$
(21)

The optimal solutions in Eqs. (19) and (20) contain information from the N source domains and the target domain. For example, in \( {\mathbf{w}}_t^{\ast } \), \( \frac{\lambda N}{1+2\lambda N}\sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}{\tilde{\alpha}}_j^{s_i}{\rho}_j^{S_i}{y}_j^{S_i}\varphi \left({x}_j^{S_i}\right) \) is the knowledge learned from the source domains, and \( \frac{N+\lambda }{1+2\lambda N}\sum \limits_{i=1+\sum \limits_{j=1}^N{M}_j}^{\sum \limits_{j=1}^N{M}_j+{n}_T}{\tilde{\alpha}}_i\sum \limits_{j\in {n}_T}\varphi \left({x}_j\right) \) is the knowledge learned from the target domain.
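In practice Eq. (21) is evaluated through the kernel trick rather than in feature space: substituting the expansion of \( {\mathbf{w}}_t^{\ast } \) from Eq. (19) turns f(x) into a weighted sum of kernel evaluations against training points. A minimal sketch with the Gaussian kernel of Section 4.1 and hypothetical coefficients (each coefficient bundling the dual variable, weight ρ, and label y of one training point):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)), the kernel from Section 4.1
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def decision_function(x, train_points, coeffs, b, sigma=1.0):
    """f(x) = w_t^T phi(x) + b_t, with w_t expanded as in Eq. (19) as a
    linear combination of phi(training point), so f(x) reduces to
    sum_j coeffs[j] * k(train_points[j], x) + b."""
    return sum(a * gaussian_kernel(p, x, sigma)
               for a, p in zip(coeffs, train_points)) + b

# Hypothetical toy model with two training points of opposite sign.
points = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
coeffs = [1.0, -1.0]
pred = np.sign(decision_function(np.array([0.1, 0.1]), points, coeffs, b=0.0))
# A query near the first point inherits its sign, so pred is +1 here.
```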

3.4 MultiFTLSVM algorithm process

Based on Sections 3.1 through 3.3, the full process and training steps of the MultiFTLSVM algorithm are shown in Table 1.

Table 1 Steps of the MultiFTLSVM algorithm

4 Results of experiments

In this section, to test the generalization performance of the MultiFTLSVM algorithm, we compare MultiFTLSVM with the reference algorithms STL-SVM [38], RankRE-TL [15], HMCA [16], STIL [39], MultiDTNN [40], FastDAM [21], IMTL [22], SSL-MSTL [26] and SVM [41] on the 20-Newsgroups, emotion analysis, and spam data sets.

4.1 Experimental settings

To ensure the fairness of the experiments, all experiments adopted a five-fold cross-validation strategy, and each reported result is the final comparison result after 2 repeats of the strategy. For the transfer learning algorithms, all labeled source-domain data and a randomly selected 5% of unlabeled target-domain data were used as the training set. For the non-transfer algorithm SVM, only labeled data from the target domain was used for training. We took the mean classification accuracy and the corresponding standard deviation over 10 runs as the evaluation criteria. Classification accuracy is defined as follows [14, 17, 37]:

$$ Accuracy=\frac{\mid x:x\in {D}_t\wedge f(x)=y(x)\mid }{\mid x:x\in {D}_t\mid}\times 100\% $$

Dt is the target-domain data set, f(x) is the class label of sample x predicted by the classifier, and y(x) is the true class label of sample x.
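As a sketch, the accuracy formula above is simply the percentage of target-domain samples whose predicted label matches the true label:

```python
import numpy as np

def accuracy(y_pred, y_true):
    """|{x in D_t : f(x) = y(x)}| / |D_t| * 100%, as defined in Section 4.1."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred == y_true)) * 100.0

# Hypothetical labels for five target-domain samples: 4 of 5 match,
# i.e. an accuracy of 80%.
acc = accuracy([1, -1, 1, 1, -1], [1, -1, 1, -1, -1])
```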

In the experiments, all kernel functions use the Gaussian kernel k(xi, xj) = exp(−‖xi − xj‖²/2σ²). The values of the parameters Ct, \( {C}_{S_i} \) and λ are selected from the grid {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹, 10², 10³, 10⁴} by the grid search method commonly used in machine learning. The remaining parameters of the reference algorithms are set as in the corresponding literature. The hardware environment for all experiments was an Intel Core (TM) i3 at 3.6 GHz with 8 GB of RAM, running MATLAB R2014b under Windows 10.
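The grid search described above can be sketched as an exhaustive loop over all (Ct, Cs, λ) triples on the logarithmic grid, keeping the triple with the best cross-validated score. Here `fake_evaluate` is a hypothetical stand-in for one cross-validation run, constructed so that its peak coincides with the optima reported later in Section 4.4:

```python
import itertools
import math

# The logarithmic parameter grid from Section 4.1: 10^-4 ... 10^4.
GRID = [10.0 ** p for p in range(-4, 5)]

def grid_search(evaluate):
    """Score every (C_t, C_s, lambda) triple on the grid and keep the best."""
    best_params, best_score = None, -math.inf
    for c_t, c_s, lam in itertools.product(GRID, GRID, GRID):
        score = evaluate(c_t, c_s, lam)
        if score > best_score:
            best_params, best_score = (c_t, c_s, lam), score
    return best_params

def fake_evaluate(c_t, c_s, lam):
    # Hypothetical accuracy surface peaking at C_t = 10, C_s = 100, lambda = 10;
    # a real run would perform cross-validated training here instead.
    return -((math.log10(c_t) - 1) ** 2
             + (math.log10(c_s) - 2) ** 2
             + (math.log10(lam) - 1) ** 2)

best = grid_search(fake_evaluate)
# → (10.0, 100.0, 10.0)
```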

4.2 Data sets used for experiments

20-Newsgroups [14], emotion analysis [14], and spam [15] are data sets commonly used in transfer learning research, so all experiments in this paper were carried out on these 3 data sets.

  1) 20-Newsgroups

The 20-Newsgroups data set contains about 20,000 documents divided into four categories, comp (c), rec (r), sci (s), and talk (t), each of which is subdivided into four subcategories. In this experiment, binary classification task groups are constructed by randomly selecting two of the four categories, one as positive and the other as negative: comp vs rec, comp vs sci, comp vs talk, rec vs sci, rec vs talk, and sci vs talk. The cross-domain tasks within a task group A vs B are constructed as follows: A has four subclasses A1, A2, A3, and A4, and B has four subclasses B1, B2, B3, and B4. Two subcategories from A (e.g., A1 and A3) and two from B (e.g., B1 and B2) are randomly selected to form the target-domain data set, and the remaining data in A and B constitute the source-domain data set. Each task group A vs B can therefore generate \( {C}_4^2\times {C}_4^2=36 \) classification tasks. The target-domain and source-domain data sets obtained in this way are correlated, because they come from the same categories, and heterogeneous, because they come from different subcategories. See Table 2 for details.

Table 2 20-Newsgroups Data set
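The count \( {C}_4^2\times {C}_4^2=36 \) tasks per task group can be verified by enumerating the subcategory choices; the subcategory names below are placeholders:

```python
from itertools import combinations

# In a task group A vs B, pick 2 of A's 4 subcategories and 2 of B's 4
# for the target domain; the remaining subcategories form the source domain.
A_subs = ["A1", "A2", "A3", "A4"]  # placeholder subcategory names
B_subs = ["B1", "B2", "B3", "B4"]

tasks = [(a_pair, b_pair)
         for a_pair in combinations(A_subs, 2)
         for b_pair in combinations(B_subs, 2)]
n_tasks = len(tasks)
# → 36 tasks per task group, matching C(4,2) * C(4,2)
```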
  2) Emotion analysis data set

The emotion analysis data set consists of reviews of four different types of Amazon products, representing four domains: Books (B), DVDs (D), Electronics (E), and Kitchen (K). Each review contains a product name, title, reviewer name, date, place, and review text. Products rated above 3 stars (on a 0–5 star scale) are taken as positive examples and those rated below 3 stars as negative examples; ambiguous reviews are discarded. Each of the 4 domains contains 2000 annotated examples and about 4000 unannotated examples, with roughly equal numbers of positive and negative examples. The details of the data set are shown in Table 3.

Table 3 Emotion analysis data set
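The star-rating rule above can be sketched as follows, assuming (as the text implies) that the ambiguous band is exactly 3 stars:

```python
def label_review(stars):
    """Map a 0-5 star rating to a class label as described in Section 4.2:
    above 3 stars -> +1 (positive), below 3 stars -> -1 (negative),
    exactly 3 stars -> None (ambiguous, discarded)."""
    if stars > 3:
        return 1
    if stars < 3:
        return -1
    return None

# Hypothetical ratings; the 3-star review is dropped from the labeled set.
labels = [label_review(s) for s in [5, 4, 3, 2, 1] if label_review(s) is not None]
# → [1, 1, -1, -1]
```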
  3) Spam data set

The spam data set was distributed by the ECML/PKDD 2006 Discovery Challenge and consists of four separate user mailboxes: personal mailboxes U1, U2, and U3, and public mailbox U4. Each personal mailbox contains 1250 spam and 1250 legitimate emails, and the public mailbox contains 2000 spam and 2000 legitimate emails. All emails are represented as term-frequency vectors. The probability distribution of emails within each mailbox is similar, but the differences between mailboxes are large. Therefore, six cross-mailbox classification tasks are constructed in this paper: U1 → U4, U2 → U4, U3 → U4, U4 → U1, U4 → U2 and U4 → U3, where, for example, in U1 → U4, U1 is the source domain and U4 is the target domain. The details of the data set are shown in Table 4.

Table 4 Spam data set

4.3 Results of experiments and analysis

In this section, we compare the mean classification accuracy and training time (with standard deviations) of the MultiFTLSVM algorithm and the reference algorithms on 3 real data sets, and analyze the results.

For the 20-Newsgroups data set, we selected one of the data sets r, s, and t as the target domain. The single-source algorithms SVM, STL-SVM, RankRE-TL, STIL, and HCMA can use only one data set as the source domain, while the five multi-source transfer learning algorithms FastDAM, IMTL, MultiDTNN, SSL-MSTL, and MultiFTLSVM can use 3 data sets as source domains simultaneously. For the emotion analysis data set, task groups were constructed with each of Books, DVDs, Electronics, and Kitchen as the target domain; the single-source algorithms can select only one of the remaining 3 data sets as the source domain, while the multi-source algorithms can use all 3 as source domains simultaneously. Similarly, for the spam data set, the multi-source transfer learning algorithms used the 3 personal mailboxes as source domains and the public mailbox as the target domain, while the single-source algorithms used one of the 3 personal mailboxes as the source domain and the public mailbox as the target domain.

From the experimental results on the 20-Newsgroups data set in Table 5, the following conclusions can be drawn. The classification accuracy of the MultiFTLSVM algorithm on the 9 cross-domain classification tasks is improved over the benchmark algorithms, and its average accuracy exceeds 95%. Compared with the non-transfer algorithm SVM, the average accuracy increases by more than 10%, which shows that transfer learning algorithms have considerable advantages over traditional machine learning algorithms. Compared with the single-source transfer learning algorithms STL-SVM, STIL, RankRE-TL, and HCMA, the average accuracy is also improved, and among the multi-source transfer learning algorithms MultiDTNN, FastDAM, IMTL, SSL-MSTL, and MultiFTLSVM, the proposed algorithm retains an advantage. Because SVM has no cross-domain transfer capability, its average classification accuracy is the lowest; the single-source transfer algorithms outperform SVM; the multi-source transfer algorithms outperform the single-source ones; and the proposed algorithm performs best. The difficulty of transfer learning on these cross-domain tasks is closely related to the similarity of the text content: the higher the similarity of the text content in a classification task, the higher the classification accuracy of the transfer learning algorithm.

Table 5 The average classification accuracy (%) and standard deviation of the algorithm on 20Newsgroups

For the experimental results on the emotion analysis and spam data sets in Table 6, MultiFTLSVM has the highest average accuracy of all algorithms and shows clear advantages over both the non-transfer and the transfer learning benchmark algorithms. Compared with the non-transfer SVM, the average accuracy increases by about 12%; compared with STL-SVM, STIL, RankRE-TL, HCMA, MultiDTNN, FastDAM, IMTL, and SSL-MSTL, the average accuracy is also improved. On each individual cross-domain classification task, MultiFTLSVM achieves the highest classification accuracy among all benchmark algorithms.

Table 6 The average classification accuracy (%) and standard deviation of the algorithm on spam email and emotion analysis

According to the experimental analysis, we can draw the following conclusions:

  1. (1)

    Based on the average classification accuracy, it can be seen from Tables 5 and 6 that transfer learning algorithms can assist classification in the target domain by using knowledge from the source domain, so they achieve better classification results than the non-transfer algorithm SVM, which trains its classifier on target-domain data alone. In addition, the multi-source transfer learning algorithms MultiDTNN, FastDAM, IMTL, and SSL-MSTL show clear advantages in classification performance over the single-source algorithms STL-SVM, STIL, RankRE-TL, and HCMA. Finally, since the proposed MultiFTLSVM algorithm applies the combination weight information obtained from MMD to effectively suppress negative transfer, its classification results are superior to those of the majority of multi-source transfer learning algorithms across all learning tasks.

  2. (2)

    As for running time, SVM trains fastest since it uses only the training data of the target domain. Because transfer learning draws supplementary samples from the source domains, the training time of the transfer learning algorithms is longer than that of the non-transfer algorithm. Since multi-source transfer learning uses two or more source domains to assist training in the target domain, its training time is in turn longer than that of single-source transfer learning. Among the multi-source transfer learning algorithms, MultiFTLSVM uses representative source-domain data to shrink the training set, which reduces training time; its training time is therefore shorter than that of the multi-source benchmark algorithms (Tables 7 and 8).

    Table 7 Average score training time (s) and standard deviation of the algorithm on 20Newsgroups
    Table 8 The average training time (s) and standard deviation of the algorithm on the sentiment analysis data set and spam data set

4.4 Parameter sensitivity analysis

In this section, we perform a sensitivity analysis of three parameters in the MultiFTLSVM objective function, namely the regularization parameter of the target domain Ct, the mean value CS of the regularization parameters of the source domains, and the trade-off term λ, to describe their impact on the algorithm's performance. For each parameter, we fix the other two at the optimal values determined by cross-validation and observe the effect on classification results as the parameter varies. The results of the experiment are shown in Figs. 4, 5 and 6.

Fig. 4
figure 4

Sensitivity of Parameter Ct in MultiFTLSVM Algorithm in 20-Newsgroups, Emotion Analysis, and Spam Data Sets

Fig. 5
figure 5

Sensitivity of Parameter Cs in MultiFTLSVM Algorithm in 20-Newsgroups, Emotion Analysis, and Spam Data Sets

Fig. 6
figure 6

Sensitivity of Parameter λ in MultiFTLSVM Algorithm in 20-Newsgroups, Emotion Analysis, and Spam Data Sets

Conclusions drawn from Figs. 4, 5 and 6 are as follows:

  1. (1)

    We first fix λ = 10 and Ct = 10, search for the value of Cs on the grid Cs ∈ {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹, 10², 10³, 10⁴}, and record the results on the real data sets, shown in Fig. 5. Figure 5 shows that the classification performance changes with the value of Cs and is best on the 14 cross-domain tasks when Cs = 100. In the same way, we fix λ = 10 and Cs = 100 and obtain the results for different values of Ct, shown in Fig. 4, from which we conclude that the algorithm performs best on most cross-domain classification tasks when Ct = 10. This analysis shows that the average classification accuracy of MultiFTLSVM on the 14 cross-domain tasks differs significantly as CS and Ct vary: MultiFTLSVM is sensitive to the regularization parameters CS and Ct within a certain value range, and the parameter values that give the best classification performance can be found for the different cross-domain tasks.

  2. (2)

    For the parameter λ, we fix Cs = 100 and Ct = 10 and obtain the experimental results in the same way as in (1), shown in Fig. 6. When λ = 10, MultiFTLSVM achieves the best classification performance on the 14 cross-domain tasks. Analyzing Fig. 6 leads to the following conclusion: if λ is too small, the difference between the source and target domains is ignored and negative transfer occurs, so the classification performance is poor; conversely, if λ is too large, the distribution difference between the source and target domains is exaggerated, less knowledge from the source domains can be transferred to the target domain, and the classification performance is also poor.

In short, the MultiFTLSVM algorithm is sensitive to the regularization coefficients λ, Ct and Cs within a certain range, so it is important to determine the optimal values of these parameters through an effective strategy.

5 Conclusions

In this paper, we propose MultiFTLSVM, a fast multi-source transfer learning algorithm based on support vector machines, for transfer learning with multiple source domains. First, to mitigate the negative transfer problem, the similarity weight between each source-domain sample and the target domain is computed by minimizing the marginal probability differences. Then the approximate pole support vector machine is used to obtain from each source domain a representative data set, relatively important to model training, together with the corresponding weights, which improves the training efficiency of the algorithm. Finally, knowledge from the target domain and from the combinatorially weighted multi-source domains is integrated into the structural risk minimization framework of the support vector machine, the objective function is constructed, and its properties are proved theoretically. Experiments on the 20-Newsgroups, emotion analysis, and spam data sets show that MultiFTLSVM is superior to the benchmark algorithms in both classification accuracy and training efficiency. Although MultiFTLSVM outperforms the benchmark algorithms, the following aspects deserve further study: extending MultiFTLSVM to multi-class problems, and increasing the number of source domains.
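The MMD-based similarity weighting summarized above rests on an empirical estimate of the maximum mean discrepancy between a source sample and the target sample. A sketch of the standard (biased) Gaussian-kernel estimator, not the paper's exact weighting scheme, is:

```python
import numpy as np

def mmd_squared(X_s, X_t, sigma=1.0):
    """Biased empirical estimate of squared MMD with a Gaussian kernel:
    MMD^2 = mean k(s, s') + mean k(t, t') - 2 * mean k(s, t).
    Smaller values mean the source sample is distributed closer to the target."""
    def mean_k(A, B):
        # Pairwise squared distances via broadcasting, then kernel mean.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).mean()
    return mean_k(X_s, X_s) + mean_k(X_t, X_t) - 2.0 * mean_k(X_s, X_t)

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(200, 2))
near = rng.normal(0.0, 1.0, size=(200, 2))  # same distribution as the target
far = rng.normal(3.0, 1.0, size=(200, 2))   # shifted distribution

# A source domain drawn from the target's distribution yields a smaller MMD,
# so it would receive a larger combination weight.
assert mmd_squared(near, target) < mmd_squared(far, target)
```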