1 Introduction

As one of the fastest-growing technical fields today, machine learning lies at the core of artificial intelligence and data science. It addresses the question of how computers can automatically improve through experience [1]. Machine learning has been widely applied in intrusion detection [2, 3], computer vision [4], data mining [5], text classification [6], spam detection [7], pattern recognition [8], and other fields. However, its further development in these fields is restricted by the shortcomings of traditional machine learning methods. Traditional machine learning classification tasks usually rest on two basic assumptions: there are enough samples in the training data set to train a high-precision classifier, and the training and test data come from the same feature space and follow the same distribution. In practical applications, however, training and test data usually come from different domains with differing marginal or conditional probabilities, so their distributions also differ. When the distribution changes, most machine learning algorithms need the training data to be re-collected. In many real-world applications, re-collecting training data and rebuilding the model is very expensive, or even impossible [9].

In this case, transferring knowledge between learning task domains is desirable. The motivation of transfer learning research is that knowledge learned previously can help people better solve new problems [9, 10]; its purpose is to build a model for the target domain by using labeled information from another related domain (the source domain). Accordingly, transfer learning is defined in Wikipedia as "a new machine learning method that uses existing knowledge to solve problems in different but similar fields" [11]. It no longer relies on the two basic assumptions of traditional machine learning; instead, existing knowledge is transferred to solve problems where the target field has only a small amount of labeled sample data. The difference between traditional machine learning and transfer learning is shown in Fig. 1: each learning task in traditional machine learning (Fig. 1a) starts from zero, whereas with transfer learning (Fig. 1b) knowledge from previous learning tasks can be transferred to the current target learning task. Representative algorithms in transfer learning research are as follows. Gao et al. [12] proposed an integrated framework for a locally weighted ensemble of multiple models (LWE). LWE can integrate the advantages of various algorithms and labels learned in multiple training domains into one model, and dynamically assigns weights according to the predictive ability of each model on each instance. Pan et al. [13] proposed a new domain adaptation learning method, transfer component analysis (TCA). TCA tries to learn several transfer components across domains in a reproducing kernel Hilbert space (RKHS) by minimizing the maximum mean discrepancy (MMD). In the subspace spanned by these components, data properties are preserved and the data distributions of different domains are close to each other.
Long et al. [14] proposed the adaptive regularization transfer learning algorithm (ARTL), based on the structural risk minimization principle and regularization theory. Li et al. [15] implemented the transfer learning algorithm RankRE-TL, based on a knowledge-reuse transfer mechanism and an error-set selection method with rank reduction. A new SVM-based model transfer method was proposed in [16], in which a large-margin classifier is trained on the labeled target samples and adjusted by an offset from the source classifier; this method is called the Heterogeneous Max-margin Classifier Adaptation method (HMCA). Xie et al. [17] proposed a new supervised domain adaptation method based on dual support vector machines, called adaptive dual support vector machines for the aggregation domain.

Fig. 1

Difference between transfer learning and traditional machine learning

However, the transfer learning algorithms above transfer knowledge from only one source domain to the target domain. In transfer learning, the performance of the target classifier largely depends on the correlation between the source domain and the target domain. If the correlation is strong, it helps to improve the learning effect in the target domain; otherwise, it reduces the learning effect, leading to the phenomenon of negative transfer [12]. One strategy to reduce negative transfer is to import knowledge from multiple sources, which increases the chance of discovering a source domain closely related to the target domain [18]. Typical research on multi-source transfer learning algorithms is as follows. A novel task-based instance transfer boosting technique, TransferBoost, was proposed in [19]; it selectively transfers knowledge from the source domains to the target task and performs boosting at both the instance level and the task level. Source tasks that show good transferability to the target task are assigned higher weights, and the weight of each instance in each source task is adjusted through AdaBoost. Yao et al. [20] introduced multiple source domains to solve the problem of negative transfer and proposed two new algorithms, MultiSource-TrAdaBoost and TaskTrAdaBoost. Duan et al. [21] proposed a multi-source domain adaptation method (DAM) that uses a set of pre-trained classifiers (called auxiliary/source classifiers) built from the labeling patterns of multiple source domains to learn a robust decision function (called the target classifier) for label prediction in the target domain. An incomplete multi-source transfer learning algorithm, IMTL, was proposed in [22], which transfers knowledge in two directions (i.e., cross-domain transfer from each source to the target, and cross-source transfer).
In [23], a Bayesian framework for transfer learning with neural networks is presented that considers single and multiple sources of data. Other research on multi-source transfer learning can be found in [24,25,26]. Today, transfer learning has been applied in speech recognition, computer vision, information retrieval, natural language processing, adaptive map coverage updating, fault diagnosis, automatic detection of COVID-19 infection, and other fields [27,28,29,30,31,32,33].

This paper draws on structural risk minimization theory, support vector machines, and multi-source transfer learning theory. Aiming at the application scenario where there is only a small amount of labeled data in the target domain but a large amount of labeled data in multiple source domains, a new multi-source fast transfer algorithm, MultiFTLSVM, is proposed. The idea of MultiFTLSVM is to integrate the knowledge within the target domain and the labeled data in the source domains into a structural risk minimization framework of support vector machines. The knowledge to be transferred from each source domain is selected by constructing a similarity distance term based on the MMD between the target domain and that source domain; the AESVM algorithm is used to reduce the sample size of the source domains and thus improve training efficiency; an optimizable objective function is then constructed. A theoretical proof shows that solving the objective function is a quadratic programming problem with an optimal solution.

Compared with previous work, the contributions of this paper include:

  1. Existing data sets are reused, which reduces the cost of collecting new data sets.

  2. AESVM no longer requires all training samples to train the learning model, which can greatly reduce the size of the training set and thus the training cost of the learning model.

  3. To prevent negative transfer and improve classification performance, multi-source transfer learning extracts knowledge from multiple source domains simultaneously to assist the learning task in the target domain. In this process, the knowledge of the source domains and samples most similar to the target domain is transferred to the greatest extent.

The rest of the paper is arranged as follows: The related work of multi-source transfer learning, approximate pole support vector machine (AESVM), and maximum mean discrepancy (MMD) are reviewed in Section 2. The construction and training process of multi-source fast transfer learning is introduced in detail in Section 3. The effectiveness of the algorithm on the 20-Newsgroups text data set, sentiment analysis data set and the spam data set is verified in Section 4. The main work of the paper is summarized in Section 5.

2 Overview of related work

In this section, we give a brief introduction to multi-source transfer learning, the approximate pole support vector machine (AESVM), and the maximum mean discrepancy (MMD).

2.1 Multi-source transfer learning

Transfer learning has been widely studied since it was proposed at NIPS-95 in 1995. Compared with traditional machine learning algorithms, transfer learning has significant advantages: useful knowledge can be taken from the source domain to significantly improve learning performance in the target domain and greatly reduce costly data-labeling work, and there is no need for the training and test data to follow the same distribution. At present, most transfer learning tasks transfer knowledge from only one source domain to the target domain [9]. However, in real-world applications we can easily collect auxiliary data from multiple source domains, so the study of multi-source transfer learning has gradually aroused the interest of researchers [18]. As shown in Fig. 2, multi-source transfer exploits the relationship between the source domains and the target domain to improve predictive performance on target-domain samples and to help the target domain establish a prediction model.

Fig. 2

Multi-source transfer learning

In Fig. 2, \( \left({D}_{S_1},{T}_{S_1}\right) \), \( \left({D}_{S_2},{T}_{S_2}\right) \), …, \( \left({D}_{S_n},{T}_{S_n}\right) \) represent n source domains and their corresponding learning tasks. (DT, TT) represents the target domain and its corresponding learning task. ft represents the target-domain classifier obtained by training on the data sets of the target domain DT and the source domains \( {D}_{S_i}\left(i=1,\dots, n\right) \).

Generally, multi-source transfer learning algorithms can be divided into two categories: boosting-based methods [19, 20] and regularization-based methods [21,22,23,24]. Additionally, [25] divided multi-source transfer learning into sample-based, parameter-based, and feature-based multi-source transfer methods. Regularized multi-source transfer methods need to design a regularization term, while boosting-based methods adjust the weights of different domains or instances to transfer knowledge. The focus of these methods is sample transferability, rather than studying which source domain transfers better. When there are multiple source domains, determining which source domain has better transferability is an important issue [26].

2.2 Approximate pole support vector machine (AESVM)

The basic idea of support vector machines is to find the hyperplane with the largest margin between the two classes; from a geometric point of view, computing the maximum-margin hyperplane is equivalent to finding the nearest points of the two convex hulls [34, 35]. For SVM, a prerequisite for good training results is a large number of training samples, which not only requires a lot of manpower for labeling but also consumes a lot of time in the training phase, so the training efficiency of SVM is not very satisfactory.

A training data set X = {x1, x2, …, xn} is given, with the corresponding class label set Y = {y1, y2, …, yn}, yi ∈ {1, −1}. The AESVM optimization problem is described in Eq. (1):

$$ \underset{w,b}{\min }{F}_{AESVM}\left(w,b\right)=\frac{1}{2}{w}^Tw+\frac{C}{M}\sum \limits_{i=1}^M{\beta}_il\left(w,b,\varphi \left({x}_i\right)\right) $$
(1)

In Eq. (1), M is the number of samples in the representative data set X∗ selected from the data set X; the parameters w, b, i and C are the normal vector, the displacement term, the sample number and the regularization coefficient; the vector β = [β1, β2, …, βM] is the weight vector corresponding to the representative data set samples, as defined in Eq. (4); l is the hinge loss function, l(w, b, ϕ(xi)) = max {0, 1 − yi(wTϕ(xi) + b)}, xi ∈ X∗; and ϕ(⋅) is the nonlinear mapping function. The kernel function can be written as Ki, j = k(xi, xj) = ϕ(xi)Tϕ(xj).

To obtain the representative data set X∗, the data set X needs to be grouped according to a chosen separation strategy: X = {X1, X2, …, Xn/V}, where n is the number of samples in X, V is the maximum number of samples in each group, and Xq(q = 1, 2, …, n/V) denotes group q. The sample data within each group have high similarity, while sample data in different groups have low similarity; here similarity means that the distance between samples within a subset is less than the distance between samples in different subsets. In each subset Xq, the representative data set \( X{}_q{}^{\ast } \) and the corresponding weight vector βq are calculated. Finally, the representative data sets \( X{}_q{}^{\ast } \) obtained from all subsets Xq are merged into the representative data set X∗.
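The separation strategy itself is left open above; as one illustrative possibility (an assumption of this sketch, not the paper's prescribed method), the partition of X into roughly n/V mutually similar subsets can be realized with a plain k-means clustering:

```python
import numpy as np

def group_dataset(X, V, n_iter=20, seed=0):
    """Partition X into about n/V groups of mutually similar samples.

    A simple k-means clustering stands in for the unspecified
    separation strategy (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    n_groups = max(1, len(X) // V)
    # Initialize centers with randomly chosen samples.
    centers = X[rng.choice(len(X), n_groups, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Recompute centers; keep the old center if a group is empty.
        for q in range(n_groups):
            if (labels == q).any():
                centers[q] = X[labels == q].mean(0)
    return [X[labels == q] for q in range(n_groups)]
```

Each returned group then feeds the SVDD-based representative-set selection described below.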

The specific process of obtaining representative data sets from the data set X is as follows.

Firstly, the initial representative data set \( X{}_q{}^{\ast } \) is computed with the SVDD algorithm [36]; then each sample \( {x}_i\left({x}_i\in {X}_q\ \mathrm{and}\ {x}_i\notin {X}_q^{\ast}\right) \) is tested to determine whether or not it belongs to the representative data set \( X{}_q{}^{\ast } \), which is formally described by Eq. (2).

$$ \Big\{{\displaystyle \begin{array}{c}\underset{x_i\in {X}_q,{x}_i\notin {X}_q^{\ast }}{\max }f\left(\varphi \left({x}_i\right),{X}_q^{\ast}\right)=\underset{\mu_{it}}{\min }{\left\Vert \varphi \left({x}_i\right)-{\sum}_{j=1}^{\mid {X}_q^{\ast}\mid }{\mu}_{i,t}\varphi \left({x}_j\right)\right\Vert}^2\le \varepsilon \\ {}s.t.0\le {\mu}_{i,j}\le 1,\sum \limits_{j=1}^{\mid {X}_q^{\ast}\mid }{\mu}_{i,j}=1,{x}_j\in {X}_q^{\ast}\end{array}} $$
(2)

In Eq. (2), \( \mid {X}_q^{\ast}\mid \) is the number of samples in the set \( {X}_q^{\ast } \), ε is a small positive constant given in advance, μi, j is the coordination coefficient, and j is the index of samples in \( {X}_q^{\ast } \). For each xi in Xq meeting Eq. (2), the representative sample set is extended as \( {X}_q^{\ast }={X}_q^{\ast}\cup \left\{{x}_i\right\} \). For all samples \( {x}_i\left({x}_i\in {X}_q\ \mathrm{and}\ {x}_i\notin {X}_q^{\ast}\right) \), Eq. (3) follows from Eq. (2):

$$ \varphi \left({x}_i\right)=\sum \limits_{x_j\in {X}_q^{\ast }}{\gamma}_{i,j}\varphi \left({x}_j\right)+{\tau}_i $$
(3)

In Eq. (3), \( {\gamma}_{i,j}=\left\{\begin{array}{l}{\mu}_{i,j},\ {x}_j\in {X}_q^{\ast}\ \mathrm{and}\ {x}_i\in {X}_q\\ {}0,\ \mathrm{otherwise}\end{array}\right. \), and τi is the approximation error vector with ‖τi‖2 ≤ ε. The weight vector corresponding to the representative data set is obtained from γi, j as in Eq. (4):

$$ {\beta}_j=\sum \limits_{i=1}^n{\gamma}_{i,j} $$
(4)

2.3 Maximum mean discrepancy

In transfer learning, differences in sample distribution lead to the problem of negative transfer, so it is necessary to select a convenient measure of the distance between distributions. MMD is an effective measure for estimating the distance between two different distributions in a Hilbert space. Its value is the largest difference between the mean function values of the two distributions over a class of witness functions; the witness function that best separates the two distributions is restricted to a unit ball in an RKHS.

Given a set DS containing ns training samples and a set DT containing nt test samples, the formal definition of MMD in the Hilbert space H with nonlinear mapping function ϕ is as follows:

$$ MM{D}_H\left({D}_S,{D}_T\right)={\left\Vert \frac{1}{n_s}\sum \limits_{i=1}^{n_s}\phi \left({x}_i^s\right)-\frac{1}{n_t}\sum \limits_{j=1}^{n_t}\phi \left({x}_j^t\right)\right\Vert}_H $$
(5)

In Eq. (5), the empirical estimate of the discrepancy between the two distributions is taken as the distance between the two data distributions in the Hilbert space; an MMD value close to zero indicates that the two distributions match. The MMD measure is now often used in transfer learning to calculate distribution distances between domains.
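Using the kernel trick, the squared MMD of Eq. (5) expands into three mean kernel terms, so the mapping ϕ never has to be formed explicitly. A minimal sketch (the RBF kernel and its bandwidth are illustrative assumptions, not choices made by the paper):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd(Xs, Xt, gamma=1.0):
    # Empirical MMD of Eq. (5):
    # || (1/ns) sum phi(x^s) - (1/nt) sum phi(x^t) ||_H,
    # expanded via the kernel trick into three mean kernel terms.
    kss = rbf_kernel(Xs, Xs, gamma).mean()
    ktt = rbf_kernel(Xt, Xt, gamma).mean()
    kst = rbf_kernel(Xs, Xt, gamma).mean()
    # Clamp at zero to guard against tiny negative rounding errors.
    return np.sqrt(max(kss + ktt - 2.0 * kst, 0.0))
```

Identical sample sets give an MMD of zero, while samples drawn from well-separated distributions give a clearly positive value.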

3 Multi-source fast transfer support vector machine algorithm (MultiFTLSVM)

This section describes the multi-source fast transfer algorithm MultiFTLSVM in detail. The algorithm framework is shown in Fig. 3. As shown in Fig. 3, the input of the MultiFTLSVM framework consists of two parts: the labeled samples contained in the N source domains and a small number of labeled samples contained in the target domain. For convenience, the binary classification case is considered.

Fig. 3

Framework of MultiFTLSVM

N source domains are defined as \( {D}_S=\left\{{D}_{S_i}={\left({x}_j^{S_i},{y}_j^{S_i}\right)}_{j=1}^{n_{S_i}},i=1,\dots, N\right\} \), where \( {x}_j^{S_i} \) is the jth sample in the ith source domain and \( {y}_j^{S_i} \) is its corresponding label. \( {n}_{S_i} \) is the number of samples in the ith source domain, and the joint distribution of \( {D}_{S_i} \) is \( {P}_{S_i} \). Similarly, the target domain is defined as \( {D}_T={\left({x}_i^T\right)}_{i=1,..,{n}_T} \), with corresponding joint distribution PT. \( {P}_{S_i}\left({x}^{S_i}\right) \) and PT(xT) are the marginal probabilities of the source domain \( {D}_{S_i} \) and the target domain DT, with \( {P}_{S_i}\left({x}^{S_i}\right)\ne {P}_T\left({x}^T\right) \). The MultiFTLSVM algorithm addresses the difference between the source-domain samples and the target domain by reducing this marginal probability difference.

Figure 3 shows the framework of the MultiFTLSVM algorithm. Firstly, the weights reflecting each source-domain sample's marginal probability difference are calculated. Then, the objective function is constructed by combining support vector machines, structural risk minimization theory, and similarity distance minimization. Finally, the objective function is analyzed, proved convex, and solved. The detailed construction of MultiFTLSVM is as follows.

3.1 Re-weighting data samples based on marginal probability differences

For convenience of calculation, the similarity weights \( {\gamma}_j^{S_i} \) between the samples of each source domain \( {D}_{S_i} \) and the target domain are obtained with the MMD measure of Eq. (5) in Section 2.3, which is modified as follows:

$$ \underset{\gamma^{S_i}}{\min }{\left\Vert \frac{1}{n_{s_i}}\sum \limits_{j=1}^{n_{S_i}}{\gamma}_j^{S_i}\phi \left({x}_j^{s_i}\right)-\frac{1}{n_T}\sum \limits_{j=1}^{n_T}\phi \left({x}_j^T\right)\right\Vert}_H $$
(6)

ϕ is the mapping function into the Hilbert space H, \( {n}_{s_i} \) is the number of samples in the source domain \( {D}_{S_i} \), nT is the number of samples in the target domain DT, and \( {\gamma}^{S_i} \) is the weight vector of dimension \( {n}_{S_i} \). The minimization of Eq. (6) is a standard quadratic programming problem, which can be solved with many existing solvers.
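Expanding the squared norm in Eq. (6) with the kernel trick gives a quadratic objective in \( {\gamma}^{S_i} \). As a simplified sketch (the paper solves the constrained QP; here we take the unconstrained minimizer, stabilized with a small ridge term, which is an assumption):

```python
import numpy as np

def domain_weights(Kss, Kst, ridge=1e-6):
    # Kss: (ns, ns) kernel matrix over the source-domain samples.
    # Kst: (ns, nt) kernel matrix between source and target samples.
    # Setting the gradient of the squared Eq. (6) objective to zero gives
    #   Kss @ gamma = (ns / nt) * Kst @ 1,
    # solved here with a small ridge term for numerical stability.
    ns, nt = Kst.shape
    rhs = (ns / nt) * Kst.sum(axis=1)
    return np.linalg.solve(Kss + ridge * np.eye(ns), rhs)
```

As a sanity check, when the source and target samples coincide the recovered weights are all close to one, i.e. no re-weighting is needed.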

3.2 Objective function construction of MultiFTLSVM

Based on Section 3.1 and support vector machines, the objective function of MultiFTLSVM is constructed by combining structural risk minimization theory and similarity distance minimization, as follows:

$$ {\displaystyle \begin{array}{c}\underset{f_t,{f}_s\in {H}_k}{\min}\frac{1}{2N}\sum \limits_{i=1}^N{\left\Vert {f}_{S_i}\right\Vert}_K^2+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\beta}_j^{S_i}{l}_{S_i}\left({f}_{S_i},{y}_j\right)+\frac{1}{2}{\left\Vert {f}_t\right\Vert}_K^2\\ {}+{C}_t\sum \limits_{i=1}^{n_T}{l}_t\left({f}_t,{y}_i\right)+\frac{\lambda }{2N}\sum \limits_{i=1}^Nd\left({f}_t,{f}_{S_i}\right)\end{array}} $$
(7)

\( {f}_{S_i} \) is the decision function of the ith source domain and ft is the decision function of the target domain. \( {\left\Vert {f}_{S_i}\right\Vert}_K^2 \) and \( {\left\Vert {f}_t\right\Vert}_K^2 \) are the structural risk terms controlling classifier complexity in the source domains and target domain respectively, where ‖f‖2 denotes the squared norm. \( {C}_{S_i} \) and Ct are regularization coefficients, l(⋅) is the loss function, the function d(⋅) quantifies the difference between two domains, and λ is a trade-off parameter. Mi is the number of representative data set samples in the ith source domain, and \( {\beta}^{S_i}=\left[{\beta}_1^{S_i},{\beta}_2^{S_i},\dots, {\beta}_{M_i}^{S_i}\right] \) is the corresponding weight vector of the representative data set of each source domain.

Equation (7) contains three parts. The first part, \( \frac{1}{2N}\sum \limits_{i=1}^N{\left\Vert {f}_{S_i}\right\Vert}_K^2+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\beta}_j^{S_i}{l}_{S_i}\left({f}_{S_i},{y}_j\right) \), represents the knowledge learned from each source domain. The second, \( \frac{1}{2}{\left\Vert {f}_t\right\Vert}_K^2+{C}_t\sum \limits_{i=1}^{n_T}{l}_t\left({f}_t,{y}_i\right) \), represents the knowledge learned from the target domain. The third, \( \frac{\lambda }{2N}\sum \limits_{i=1}^Nd\left({f}_t,{f}_{S_i}\right) \), is the regularization term, which ensures good generalization performance by minimizing the difference between each source domain and the target domain. With the simple quadratic distance measure, \( \frac{\lambda }{2N}\sum \limits_{i=1}^Nd\left({f}_t,{f}_{S_i}\right) \) is expressed as \( \frac{\lambda }{2N}\sum \limits_{i=1}^N{\left\Vert {w}_t-{\beta}^{s_i}{w}_{s_i}\right\Vert}^2 \). To counter negative transfer, the weights \( {\gamma}^{S_i} \) obtained in Section 3.1 are combined with \( {\beta}^{S_i} \), and the combination replaces the weight vector \( {\beta}^{S_i} \) of the representative data set, as shown in Eq. (8).

$$ {\rho}^{S_i}={c}_1{\beta}^{S_i}+{c}_2{\gamma}^{S_i}\kappa \left({D}_{S_i}\right) $$
(8)

In Eq. (8), c1 and c2 are combination coefficients with c1 + c2 = 1, and \( \kappa \left({D}_{S_i}\right) \) is the mapping between the samples of the source domain \( {D}_{S_i} \) and their corresponding representative data set samples.
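A small sketch of Eq. (8): since \( {\beta}^{S_i} \) is indexed by representatives while \( {\gamma}^{S_i} \) is indexed by source samples, κ must carry sample weights onto representatives. Here κ is realized by averaging γ over the samples assigned to each representative, which is an assumption about its unspecified form:

```python
import numpy as np

def combine_weights(beta, gamma, assign, c1=0.5, c2=0.5):
    # rho^{S_i} = c1 * beta^{S_i} + c2 * gamma^{S_i} * kappa(D_{S_i}),
    # with c1 + c2 = 1.  kappa is realized by averaging gamma over the
    # source samples assigned to each representative (an assumption,
    # since the paper does not spell out kappa's form).
    # assign[i] = index of the representative for source sample i.
    beta, gamma, assign = map(np.asarray, (beta, gamma, assign))
    mapped = np.zeros(len(beta))
    for j in range(len(beta)):
        mask = assign == j
        if mask.any():
            mapped[j] = gamma[mask].mean()
    return c1 * beta + c2 * mapped
```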

With the weights of Eq. (8), Eq. (7) is rewritten as Eq. (9).

$$ {\displaystyle \begin{array}{c}\underset{w_t,{b}_t,{w}_s,{b}_s}{\min}\frac{1}{2}{\left\Vert {w}_t\right\Vert}^2+{C}_t\sum \limits_{i=1}^{n_T}{l}_t\left({w}_t^T\varphi (x)+{b}_t,{y}_i\right)+\frac{1}{2N}\sum \limits_{i=1}^N{\left\Vert {w}_{S_i}\right\Vert}^2\\ {}+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\rho}^{S_i}{l}_{S_i}\left({w}_{S_i}^T\varphi (x)+{b}_{S_i},{y}_j\right)+\frac{\lambda }{2N}\sum \limits_{i=1}^N{\left\Vert {w}_t-{w}_{S_i}\right\Vert}^2\end{array}} $$
(9)

In Eq. (9), a hinge loss function is introduced in each source domain and target domain. Therefore, Eq. (9) can be transformed into the optimization problem shown in Eq. (10):

$$ {\displaystyle \begin{array}{c}\underset{w_t,{b}_t,{w}_s,{b}_s}{\min}\frac{1}{2}{\left\Vert {w}_t\right\Vert}^2+{C}_t\sum \limits_{i=1+\sum \limits_j^N{M}_j}^{\sum \limits_j^N{M}_j+{n}_T}{\xi}_i+\frac{1}{2N}\sum \limits_{i=1}^N{\left\Vert {w}_{s_i}\right\Vert}^2+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\rho}_j^{S_i}{\xi}_j^{S_i}\\ {}+\frac{\lambda }{2N}\sum \limits_{i=1}^N{\left\Vert {w}_t-{w}_{S_i}\right\Vert}^2\\ {}s.t.\\ {}{y}_j^{s_i}\left({w}_{s_i}^T\varphi \left({x}_j^{s_i}\right)+{b}_{s_i}\right)\ge 1-{\xi}_j^{s_i},j=1,\dots, {n}_{s_i},{s}_i=1,\dots, N\\ {}\tilde{y_i}\left({w}_t^T\varphi \left({x}_i^t\right)+{b}_t\right)\ge 1-{\xi}_i,i=1,\dots, {n}_T\end{array}} $$
(10)

In Eq. (10), \( {\xi}_j^{s_i} \) (\( {\xi}_j^{s_i}\ge 0 \)) and ξi (ξi ≥ 0) are slack variables. The first constraint ensures that the learning tasks of each source domain are classified as correctly as possible, and the second constraint ensures the same for the target domain.

3.3 Objective function theorem proofing

Theorem 1:

The dual problem of Eq. (10) is a quadratic programming (QP) problem, as shown in Eq. (11).

(11)

Proof: The Lagrangian function of Eq. (10) is as follows:

$$ {\displaystyle \begin{array}{c}L\left({w}_t,{w}_s,{b}_t,{b}_s,\xi, {\xi}^s,\alpha, {\alpha}^s,r,{r}^s\right)=\\ {}\frac{1}{2}{\left\Vert {w}_t\right\Vert}^2+\frac{1}{2\mathrm{N}}\sum \limits_{i=1}^{\mathrm{N}}{\left\Vert {w}_{S_i}\right\Vert}^2+{C}_t\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}{\xi}_i+\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\rho}_j^{S_i}{\xi}_j^{S_i}\\ {}+\frac{\lambda }{2N}\sum \limits_{i=1}^N{\left\Vert {w}_t-{w}_{{\mathrm{S}}_i}\right\Vert}^2-\frac{1}{N}\sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}{r}_j^{S_i}{\xi}_j^{S_i}-\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}{r}_i{\xi}_i\\ {}-\sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}{\alpha}_j^{{\mathrm{S}}_i}\left({y}_j^{{\mathrm{S}}_i}\left({w}_{{\mathrm{s}}_i}^T\varphi \left({x}_j^{{\mathrm{S}}_i}\right)+{b}_{s_i}\right)-1+{\xi}_j^{S_i}\right)\\ {}-\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}{\alpha}_i\left(\tilde{y_i}\left({w}_t^T\varphi \left({x}_i^t\right)+{b}_t\right)-1+{\xi}_i\right)\end{array}} $$
(12)

where \( {\alpha}^{S_i}=\left({\alpha}_1^{S_i},{\alpha}_2^{S_i},\dots, {\alpha}_{M_i}^{S_i}\right),\alpha =\left({\alpha}_1,{\alpha}_2,\dots, {\alpha}_{n_T}\right) \) and \( {r}^{S_i}=\left({r}_1^{S_i},{r}_2^{S_i},\dots, {r}_{M_i}^{S_i}\right),r=\left({r}_1,{r}_2,\dots, {r}_{n_T}\right) \) are the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) conditions [17]:

$$ \frac{\partial L}{\partial {\xi}_j^{S_i}}=0\Rightarrow \sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}\left({r}_j^{S_i}+{\alpha}_j^{S_i}\right)=\frac{1}{N}\sum \limits_{i=1}^N\frac{C_{S_i}}{M_i}\sum \limits_{j=1}^{M_i}{\rho}_j^{S_i} $$
(13)
$$ \frac{\partial L}{\partial {\xi}_i}=0\Rightarrow \sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}\left({\alpha}_i+{r}_i\right)=\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}^T}{C}_t $$
(14)
$$ \frac{\partial L}{\partial {\mathbf{w}}_{s_i}}=0\Rightarrow \frac{1}{\mathrm{N}}\sum \limits_{i=1}^{\mathrm{N}}{\mathbf{w}}_{s_i}-\frac{\lambda }{N}\sum \limits_{i=1}^N\left({\mathbf{w}}_t-{\mathbf{w}}_{s_i}\right)-\sum \limits_{i=1}^N\sum \limits_{j=1}^{{\mathrm{M}}_{\mathrm{i}}}{\alpha}_j^{S_i}{y}_j^{S_i}\varphi \left({\mathbf{x}}_j^{S_i}\right)=0 $$
(15)
$$ \frac{\partial L}{\partial {w}_t}=0\Rightarrow {w}_t+\frac{\lambda }{N}\sum \limits_{i=1}^N\left({w}_t-{w}_{s_i}\right)-\sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}_T}{\alpha}_i\tilde{y_i}\varphi \left({x}_j\right)=0 $$
(16)
$$ \frac{\partial L}{\partial {b}_{s_i}}=0\Rightarrow \sum \limits_{i=1}^M\sum \limits_{j=1}^{n_{s_i}}{\alpha}_j^{S_i}{y}_j^{S_i}=0 $$
(17)
$$ \frac{\partial L}{\partial {b}_t}=0\Rightarrow \sum \limits_{i=\sum \limits_{j=1}^N{M}_j+1}^{\sum \limits_{j=1}^N{M}_j+{n}_T}{\alpha}_i\tilde{y_i}=0 $$
(18)

Substituting Eqs. (13)–(18) back into the Lagrangian of Eq. (12) and simplifying yields the dual problem Eq. (11). Theorem 1 is thus proved.

Theorem 2:

The quadratic programming form of the optimization problem of Eq. (11) is a standard convex quadratic programming problem.

Proof: The matrix \( \tilde{\mathbf{K}} \) can be decomposed as \( \tilde{\mathbf{K}}={\tilde{\mathbf{K}}}_1+\tilde{{\mathbf{K}}_2}+\tilde{{\mathbf{K}}_3}+\tilde{{\mathbf{K}}_4} \), where the forms of \( {\tilde{\mathbf{K}}}_1 \), \( \tilde{{\mathbf{K}}_2} \), \( \tilde{{\mathbf{K}}_3} \) and \( \tilde{{\mathbf{K}}_4} \) are as follows:

$$ {\displaystyle \begin{array}{c}{\tilde{K}}_1=\frac{\lambda }{1+2\lambda N}{\left[\begin{array}{c}{K}_{s_1}{,}_{s_1},\dots, {K}_{s_1}{,}_{s_N},-{K}_{s_1,t}\\ {}\dots \\ {}{K}_{s_N}{,}_{s_1},\dots, {K}_{s_N}{,}_{s_N},-{K}_{s_N,t}\\ {}-{K}_{s_1,t}^T,\dots, -{K}_{s_N,t}^T,{K}_{t,t}\end{array}\right]}_{\left(\sum \limits_{i\in N}{M}_i+{n}_T\right)\times \left(\sum \limits_{i\in N}{M}_i+{n}_T\right)}\\ {}{\tilde{K}}_2=\frac{N}{1+2\lambda N}{\left[\begin{array}{c}{K}_{s_1}{,}_{s_1},\dots, {K}_{s_1}{,}_{s_N},0\\ {}\dots \\ {}{K}_{s_N}{,}_{s_1},\dots, {K}_{s_N}{,}_{s_N},0\\ {}0,\dots, 0,0\end{array}\right]}_{\left(\sum \limits_{i\in N}{M}_i+{n}_T\right)\times \left(\sum \limits_{i\in N}{M}_i+{n}_T\right)}\\ {}{\tilde{K}}_3=\frac{\lambda }{N}{\left[\begin{array}{c}1,\dots, 1,0,\\ {}\dots \\ {}1,\dots, 1,0\\ {}0,\dots, 0,0\\ {}0,\dots, 0,0\end{array}\right]}_{\left(\sum \limits_{i\in N}{M}_i+{n}_T\right)\times \left(\sum \limits_{i\in N}{M}_i+{n}_T\right)}\\ {}{\tilde{K}}_4=\frac{N}{1+2\lambda N}{\left[\begin{array}{c}0,\dots, 0,0\\ {}\dots \\ {}0,\dots, {K}_{t,t}\end{array}\right]}_{\left(\sum \limits_{i\in N}{M}_i+{n}_T\right)\times \left(\sum \limits_{i\in N}{M}_i+{n}_T\right)}\end{array}} $$

For the matrix \( {\tilde{\mathbf{K}}}_1 \), let \( {Q}_1=\sqrt{\frac{\lambda }{1+2\lambda N}}\left({y}_1^{S_1}\varphi \left({x}_1^{S_1}\right),\dots, {y}_{M_1}^{S_1}\varphi \left({x}_{M_1}^{S_1}\right),\dots, {y}_1^{S_N}\varphi \left({x}_1^{S_N}\right),\dots, {y}_{M_N}^{S_N}\varphi \left({x}_{M_N}^{S_N}\right),-\sum \limits_{i\in {n}_T}\varphi \left({x}_i\right),\dots, -\sum \limits_{i\in {n}_T}\varphi \left({x}_i\right)\right) \). It is obvious that \( {\tilde{\mathbf{K}}}_1={\mathbf{Q}}_1^T{\mathbf{Q}}_1 \), so \( {\tilde{\mathbf{K}}}_1 \) is a symmetric positive semidefinite matrix; by the same argument, \( \tilde{{\mathbf{K}}_2} \), \( \tilde{{\mathbf{K}}_3} \) and \( \tilde{{\mathbf{K}}_4} \) are symmetric positive semidefinite. Therefore, \( \tilde{\mathbf{K}} \) is symmetric positive semidefinite and Eq. (11) is a standard convex quadratic programming problem. Theorem 2 is thus proved.

Theorem 3:

The solution to the quadratic programming problem of Eq. (11) is the optimal solution.

Proof: Since Eq. (11) is a convex quadratic programming problem and the KKT conditions are also sufficient conditions, the solution obtained is the optimal solution. For the solution of convex quadratic programs, refer to [37].

The optimal value \( {\boldsymbol{\Gamma}}^{\ast }={\left({\alpha}^{s_1},{\alpha}^{s_2},\dots, {\alpha}^{s_N},\alpha \right)}^{\mathrm{T}} \) of Γ is obtained from Eq. (11), and the optimal solutions for the parameters wt and bt are as follows:

$$ {w}_t^{\ast }=\frac{\lambda N}{1+2\lambda N}\sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}{\tilde{\alpha}}_j^{s_i}{\rho}_j^{S_i}{y}_j^{S_i}\varphi \left({x}_j^{S_i}\right)+\frac{N+\lambda }{1+2\lambda N}\sum \limits_{i=1+\sum \limits_{j=1}^N{M}_j}^{\sum \limits_{j=1}^N{M}_j+{n}_T}{\tilde{\alpha}}_i\sum \limits_{j\in {n}_T}\varphi \left({x}_j\right) $$
(19)
$$ {\displaystyle \begin{array}{c}{b}_t^{\ast }={y}_i-\frac{\lambda N}{1+2\lambda N}\sum \limits_{i^{\prime }=1}^N\sum \limits_{j=1}^{M_{i\prime }}{\rho}_j^{S_{i\prime }}{\alpha}_j^{S_{i\prime }}{y}_j\sum \limits_{q\in {S}_{i\prime }}k\left({x}_j,{x}_q\right)\\ {}-\frac{\lambda +N}{1+2\lambda N}\sum \limits_{i^{\prime }=1}^N\sum \limits_{j=1+\sum \limits_{l\in M}{M}_l}^{\sum \limits_{l\in M}{M}_l+{n}_T}\tilde{\alpha_j}\sum \limits_{j^{\prime}\in {n}_T}\sum \limits_{q\in {n}_T}k\left({x}_{j\prime },{x}_q\right)\end{array}} $$
(20)

Finally, the decision function for the MultiFTLSVM algorithm is as follows:

$$ f(x)={\mathbf{w}}_t\varphi (x)+{b}_t $$
(21)

The optimal solutions in Eqs. (19) and (20) contain information from the N source domains and the target domain. For example, in \( {\mathbf{w}}_t^{\ast } \), \( \frac{\lambda N}{1+2\lambda N}\sum \limits_{i=1}^N\sum \limits_{j=1}^{M_i}{\tilde{\alpha}}_j^{s_i}{\rho}_j^{S_i}{y}_j^{S_i}\varphi \left({x}_j^{S_i}\right) \) is the knowledge learned from the source domains, and \( \frac{N+\lambda }{1+2\lambda N}\sum \limits_{i=1+\sum \limits_{j=1}^N{M}_j}^{\sum \limits_{j=1}^N{M}_j+{n}_T}{\tilde{\alpha}}_i\sum \limits_{j\in {n}_T}\varphi \left({x}_j\right) \) is the knowledge learned from the target domain.
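In practice Eq. (21) is evaluated through the kernel trick rather than in feature space: substituting the expansion of \( {\mathbf{w}}_t^{\ast } \) from Eq. (19) turns f(x) into a weighted sum of kernel evaluations against training points. A minimal sketch with the Gaussian kernel of Section 4.1 and hypothetical coefficients (each coefficient bundling the dual variable, weight ρ, and label y of one training point):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)), the kernel from Section 4.1
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def decision_function(x, train_points, coeffs, b, sigma=1.0):
    """f(x) = w_t^T phi(x) + b_t, with w_t expanded as in Eq. (19) as a
    linear combination of phi(training point), so f(x) reduces to
    sum_j coeffs[j] * k(train_points[j], x) + b."""
    return sum(a * gaussian_kernel(p, x, sigma)
               for a, p in zip(coeffs, train_points)) + b

# Hypothetical toy model with two training points of opposite sign.
points = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
coeffs = [1.0, -1.0]
pred = np.sign(decision_function(np.array([0.1, 0.1]), points, coeffs, b=0.0))
# A query near the first point inherits its sign, so pred is +1 here.
```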

3.4 MultiFTLSVM algorithm process

Based on Sections 3.1 through 3.3, the full process and training steps of the MultiFTLSVM algorithm are shown in Table 1.

Table 1 Steps of the MultiFTLSVM algorithm

4 Results of experiments

In this section, to test the generalization performance of the MultiFTLSVM algorithm, we compare MultiFTLSVM with the reference algorithms STL-SVM [38], RankRE-TL [15], HMCA [16], STIL [39], MultiDTNN [40], FastDAM [21], IMTL [22], SSL-MSTL [26] and SVM [41] on the 20-Newsgroups, emotion analysis, and spam data sets.

4.1 Experimental settings

To ensure the fairness of the experiments, all experiments adopted a five-fold cross-validation strategy, and each reported result is the final comparison result after 2 repeats of the strategy. For the transfer learning algorithms, all labeled source-domain data and a randomly selected 5% of unlabeled target-domain data were used as the training set. For the non-transfer algorithm SVM, only labeled data from the target domain was used for training. We took the mean classification accuracy and the corresponding standard deviation over 10 runs as the evaluation criteria. Classification accuracy is defined as follows [14, 17, 37]:

$$ Accuracy=\frac{\mid x:x\in {D}_t\wedge f(x)=y(x)\mid }{\mid x:x\in {D}_t\mid}\times 100\% $$

Dt is the target-domain data set, f(x) is the class label of sample x predicted by the classifier, and y(x) is the true class label of sample x.
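As a sketch, the accuracy formula above is simply the percentage of target-domain samples whose predicted label matches the true label:

```python
import numpy as np

def accuracy(y_pred, y_true):
    """|{x in D_t : f(x) = y(x)}| / |D_t| * 100%, as defined in Section 4.1."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred == y_true)) * 100.0

# Hypothetical labels for five target-domain samples: 4 of 5 match,
# i.e. an accuracy of 80%.
acc = accuracy([1, -1, 1, 1, -1], [1, -1, 1, -1, -1])
```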

In the experiments, all kernel functions use the Gaussian kernel k(xi, xj) = exp(−‖xi − xj‖²/2σ²). The values of the parameters Ct, \( {C}_{S_i} \) and λ are selected from the grid {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹, 10², 10³, 10⁴} by the grid search method commonly used in machine learning. The remaining parameters of the reference algorithms are set as in the corresponding literature. The hardware environment for all experiments was an Intel Core (TM) i3 at 3.6 GHz with 8 GB of RAM, running MATLAB R2014b under Windows 10.
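The grid search described above can be sketched as an exhaustive loop over all (Ct, Cs, λ) triples on the logarithmic grid, keeping the triple with the best cross-validated score. Here `fake_evaluate` is a hypothetical stand-in for one cross-validation run, constructed so that its peak coincides with the optima reported later in Section 4.4:

```python
import itertools
import math

# The logarithmic parameter grid from Section 4.1: 10^-4 ... 10^4.
GRID = [10.0 ** p for p in range(-4, 5)]

def grid_search(evaluate):
    """Score every (C_t, C_s, lambda) triple on the grid and keep the best."""
    best_params, best_score = None, -math.inf
    for c_t, c_s, lam in itertools.product(GRID, GRID, GRID):
        score = evaluate(c_t, c_s, lam)
        if score > best_score:
            best_params, best_score = (c_t, c_s, lam), score
    return best_params

def fake_evaluate(c_t, c_s, lam):
    # Hypothetical accuracy surface peaking at C_t = 10, C_s = 100, lambda = 10;
    # a real run would perform cross-validated training here instead.
    return -((math.log10(c_t) - 1) ** 2
             + (math.log10(c_s) - 2) ** 2
             + (math.log10(lam) - 1) ** 2)

best = grid_search(fake_evaluate)
# → (10.0, 100.0, 10.0)
```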

4.2 Data sets used for experiments

20-Newsgroups [14], emotion analysis [14], and spam [15] are data sets commonly used in transfer learning research, so all experiments in this paper were carried out on these 3 data sets.

  1) 20-Newsgroups

The 20-Newsgroups data set contains about 20,000 documents divided into four categories, comp (c), rec (r), sci (s), and talk (t), each of which is subdivided into four subcategories. In this experiment, binary classification task groups are constructed by randomly selecting two of the four categories, one as positive and the other as negative: comp vs rec, comp vs sci, comp vs talk, rec vs sci, rec vs talk, and sci vs talk. The cross-domain tasks within a task group A vs B are constructed as follows: A has four subclasses A1, A2, A3, and A4, and B has four subclasses B1, B2, B3, and B4. Two subcategories from A (e.g., A1 and A3) and two from B (e.g., B1 and B2) are randomly selected to form the target-domain data set, and the remaining data in A and B constitute the source-domain data set. Each task group A vs B can therefore generate \( {C}_4^2\times {C}_4^2=36 \) classification tasks. The target-domain and source-domain data sets obtained in this way are correlated, because they come from the same categories, and heterogeneous, because they come from different subcategories. See Table 2 for details.

Table 2 20-Newsgroups Data set
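The count \( {C}_4^2\times {C}_4^2=36 \) tasks per task group can be verified by enumerating the subcategory choices; the subcategory names below are placeholders:

```python
from itertools import combinations

# In a task group A vs B, pick 2 of A's 4 subcategories and 2 of B's 4
# for the target domain; the remaining subcategories form the source domain.
A_subs = ["A1", "A2", "A3", "A4"]  # placeholder subcategory names
B_subs = ["B1", "B2", "B3", "B4"]

tasks = [(a_pair, b_pair)
         for a_pair in combinations(A_subs, 2)
         for b_pair in combinations(B_subs, 2)]
n_tasks = len(tasks)
# → 36 tasks per task group, matching C(4,2) * C(4,2)
```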
  2) Emotion analysis data set

The emotion analysis data set consists of reviews of four different types of Amazon products, representing four domains: Books (B), DVDs (D), Electronics (E), and Kitchen (K). Each review contains a product name, title, reviewer name, date, place, and review text. Products rated above 3 stars (on a 0–5 star scale) are taken as positive examples and those rated below 3 stars as negative examples; ambiguous reviews are discarded. Each of the 4 domains contains 2000 annotated examples and about 4000 unannotated examples, with roughly equal numbers of positive and negative examples. The details of the data set are shown in Table 3.

Table 3 Emotion analysis data set
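The star-rating rule above can be sketched as follows, assuming (as the text implies) that the ambiguous band is exactly 3 stars:

```python
def label_review(stars):
    """Map a 0-5 star rating to a class label as described in Section 4.2:
    above 3 stars -> +1 (positive), below 3 stars -> -1 (negative),
    exactly 3 stars -> None (ambiguous, discarded)."""
    if stars > 3:
        return 1
    if stars < 3:
        return -1
    return None

# Hypothetical ratings; the 3-star review is dropped from the labeled set.
labels = [label_review(s) for s in [5, 4, 3, 2, 1] if label_review(s) is not None]
# → [1, 1, -1, -1]
```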
  3) Spam data set

The spam data set was distributed by the ECML/PKDD 2006 Discovery Challenge and consists of four separate user mailboxes: personal mailboxes U1, U2, and U3, and public mailbox U4. Each personal mailbox contains 1250 spam and 1250 legitimate emails, and the public mailbox contains 2000 spam and 2000 legitimate emails. All emails are represented as term-frequency vectors. The probability distribution of emails within each mailbox is similar, but the differences between mailboxes are large. Therefore, six cross-mailbox classification tasks are constructed in this paper: U1 → U4, U2 → U4, U3 → U4, U4 → U1, U4 → U2 and U4 → U3, where, for example, in U1 → U4, U1 is the source domain and U4 is the target domain. The details of the data set are shown in Table 4.

Table 4 Spam data set

4.3 Results of experiments and analysis

In this section, we compare the mean classification accuracy and training time (with standard deviations) of the MultiFTLSVM algorithm and the reference algorithms on 3 real data sets, and analyze the results.

For the 20-Newsgroups data set, we selected one of the data sets r, s, and t as the target domain. The single-source algorithms SVM, STL-SVM, RankRE-TL, STIL, and HCMA can use only one data set as the source domain, while the five multi-source transfer learning algorithms FastDAM, IMTL, MultiDTNN, SSL-MSTL, and MultiFTLSVM can use 3 data sets as source domains simultaneously. For the emotion analysis data set, task groups were constructed with each of Books, DVDs, Electronics, and Kitchen as the target domain; the single-source algorithms can select only one of the remaining 3 data sets as the source domain, while the multi-source algorithms can use all 3 as source domains simultaneously. Similarly, for the spam data set, the multi-source transfer learning algorithms used the 3 personal mailboxes as source domains and the public mailbox as the target domain, while the single-source algorithms used one of the 3 personal mailboxes as the source domain and the public mailbox as the target domain.

From the experimental results on the 20-Newsgroups data set in Table 5, the following conclusions can be drawn. The classification accuracy of the MultiFTLSVM algorithm on the 9 cross-domain classification tasks is improved over the benchmark algorithms, and its average accuracy exceeds 95%. Compared with the non-transfer algorithm SVM, the average accuracy increases by more than 10%, which shows that transfer learning algorithms have considerable advantages over traditional machine learning algorithms. Compared with the single-source transfer learning algorithms STL-SVM, STIL, RankRE-TL, and HCMA, the average accuracy is also improved, and among the multi-source transfer learning algorithms MultiDTNN, FastDAM, IMTL, SSL-MSTL, and MultiFTLSVM, the proposed algorithm retains an advantage. Because SVM has no cross-domain transfer capability, its average classification accuracy is the lowest; the single-source transfer algorithms outperform SVM; the multi-source transfer algorithms outperform the single-source ones; and the proposed algorithm performs best. The difficulty of transfer learning on these cross-domain tasks is closely related to the similarity of the text content: the higher the similarity of the text content in a classification task, the higher the classification accuracy of the transfer learning algorithm.

Table 5 The average classification accuracy (%) and standard deviation of the algorithm on 20Newsgroups

For the experimental results on the emotion analysis and spam data sets in Table 6, MultiFTLSVM has the highest average accuracy of all algorithms and shows clear advantages over both the non-transfer and the transfer learning benchmark algorithms. Compared with the non-transfer SVM, the average accuracy increases by about 12%; compared with STL-SVM, STIL, RankRE-TL, HCMA, MultiDTNN, FastDAM, IMTL, and SSL-MSTL, the average accuracy is also improved. On each individual cross-domain classification task, MultiFTLSVM achieves the highest classification accuracy among all benchmark algorithms.

Table 6 The average classification accuracy (%) and standard deviation of the algorithm on spam email and emotion analysis

According to the experimental analysis, we can draw the following conclusions:

  1. (1)

    Based on the average classification accuracy, it can be seen from Tables 5 and 6 that transfer learning algorithms can assist classification in the target domain by using knowledge from the source domain, so they achieve better classification results than the non-transfer algorithm SVM, which trains its classifier on target-domain data alone. In addition, the multi-source transfer learning algorithms MultiDTNN, FastDAM, IMTL, and SSL-MSTL show clear advantages in classification performance over the single-source algorithms STL-SVM, STIL, RankRE-TL, and HCMA. Finally, since the proposed MultiFTLSVM algorithm applies the combination weight information obtained from MMD to effectively suppress negative transfer, its classification results are superior to those of the majority of multi-source transfer learning algorithms across all learning tasks.

  2. (2)

    As for running time, SVM trains fastest since it uses only the training data of the target domain. Because transfer learning draws supplementary samples from the source domains, the training time of the transfer learning algorithms is longer than that of the non-transfer algorithm. Since multi-source transfer learning uses two or more source domains to assist training in the target domain, its training time is in turn longer than that of single-source transfer learning. Among the multi-source transfer learning algorithms, MultiFTLSVM uses representative source-domain data to shrink the training set, which reduces training time; its training time is therefore shorter than that of the multi-source benchmark algorithms (Tables 7 and 8).

    Table 7 Average score training time (s) and standard deviation of the algorithm on 20Newsgroups
    Table 8 The average training time (s) and standard deviation of the algorithm on the sentiment analysis data set and spam data set

4.4 Parameter sensitivity analysis

In this section, we perform a sensitivity analysis of three parameters in the MultiFTLSVM objective function, namely the regularization parameter of the target domain Ct, the mean value CS of the regularization parameters of the source domains, and the trade-off term λ, to describe their impact on the algorithm's performance. For each parameter, we fix the other two at the optimal values determined by cross-validation and observe the effect on classification results as the parameter varies. The results of the experiment are shown in Figs. 4, 5 and 6.

Fig. 4
figure 4

Sensitivity of Parameter Ct in MultiFTLSVM Algorithm in 20-Newsgroups, Emotion Analysis, and Spam Data Sets

Fig. 5
figure 5

Sensitivity of Parameter Cs in MultiFTLSVM Algorithm in 20-Newsgroups, Emotion Analysis, and Spam Data Sets

Fig. 6
figure 6

Sensitivity of Parameter λ in MultiFTLSVM Algorithm in 20-Newsgroups, Emotion Analysis, and Spam Data Sets

Conclusions drawn from Figs. 4, 5 and 6 are as follows:

  1. (1)

    We first fix λ = 10 and Ct = 10, search for the value of Cs on the grid Cs ∈ {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹, 10², 10³, 10⁴}, and record the results on the real data sets, shown in Fig. 5. Figure 5 shows that the classification performance changes with the value of Cs and is best on the 14 cross-domain tasks when Cs = 100. In the same way, we fix λ = 10 and Cs = 100 and obtain the results for different values of Ct, shown in Fig. 4, from which we conclude that the algorithm performs best on most cross-domain classification tasks when Ct = 10. This analysis shows that the average classification accuracy of MultiFTLSVM on the 14 cross-domain tasks differs significantly as CS and Ct vary: MultiFTLSVM is sensitive to the regularization parameters CS and Ct within a certain value range, and the parameter values that give the best classification performance can be found for the different cross-domain tasks.

  2. (2)

    For the parameter λ, we fix Cs = 100 and Ct = 10 and obtain the experimental results in the same way as in (1), shown in Fig. 6. When λ = 10, MultiFTLSVM achieves the best classification performance on the 14 cross-domain tasks. Analyzing Fig. 6 leads to the following conclusion: if λ is too small, the difference between the source and target domains is ignored and negative transfer occurs, so the classification performance is poor; conversely, if λ is too large, the distribution difference between the source and target domains is exaggerated, less knowledge from the source domains can be transferred to the target domain, and the classification performance is also poor.

In short, the MultiFTLSVM algorithm is sensitive to the regularization coefficients λ, Ct and Cs within a certain range, so it is important to determine the optimal values of these parameters through an effective strategy.

5 Conclusions

In this paper, we propose MultiFTLSVM, a fast multi-source transfer learning algorithm based on support vector machines, for transfer learning with multiple source domains. First, to mitigate the negative transfer problem, the similarity weight between each source-domain sample and the target domain is computed by minimizing the marginal probability differences. Then the approximate pole support vector machine is used to obtain from each source domain a representative data set, relatively important to model training, together with the corresponding weights, which improves the training efficiency of the algorithm. Finally, knowledge from the target domain and from the combinatorially weighted multi-source domains is integrated into the structural risk minimization framework of the support vector machine, the objective function is constructed, and its properties are proved theoretically. Experiments on the 20-Newsgroups, emotion analysis, and spam data sets show that MultiFTLSVM is superior to the benchmark algorithms in both classification accuracy and training efficiency. Although MultiFTLSVM outperforms the benchmark algorithms, the following aspects deserve further study: extending MultiFTLSVM to multi-class problems, and increasing the number of source domains.
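The MMD-based similarity weighting summarized above rests on an empirical estimate of the maximum mean discrepancy between a source sample and the target sample. A sketch of the standard (biased) Gaussian-kernel estimator, not the paper's exact weighting scheme, is:

```python
import numpy as np

def mmd_squared(X_s, X_t, sigma=1.0):
    """Biased empirical estimate of squared MMD with a Gaussian kernel:
    MMD^2 = mean k(s, s') + mean k(t, t') - 2 * mean k(s, t).
    Smaller values mean the source sample is distributed closer to the target."""
    def mean_k(A, B):
        # Pairwise squared distances via broadcasting, then kernel mean.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).mean()
    return mean_k(X_s, X_s) + mean_k(X_t, X_t) - 2.0 * mean_k(X_s, X_t)

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(200, 2))
near = rng.normal(0.0, 1.0, size=(200, 2))  # same distribution as the target
far = rng.normal(3.0, 1.0, size=(200, 2))   # shifted distribution

# A source domain drawn from the target's distribution yields a smaller MMD,
# so it would receive a larger combination weight.
assert mmd_squared(near, target) < mmd_squared(far, target)
```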