1 Introduction

Most traditional classification methods rest on an ideal assumption: to achieve satisfactory performance, the labeled training data must be ample and follow the same distribution as the unlabeled test data. Since collecting labeled data is expensive and time consuming, these methods struggle to quickly establish reliable classification models when labeled data are scarce or the data distribution changes. Transfer learning [1, 2], which leverages knowledge from label-rich source domains to train better target learners, can help alleviate these problems. Many transfer learning studies [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] focus on single-source transfer learning and assume that all the labeled data come from a single source domain close to the target domain. In practical applications, however, the labeled data in any one source domain are often inadequate, while additional labeled data are available in other, different domains. Single-source transfer learning models are therefore hard to apply in realistic multi-source scenarios.

To improve the adaptability of transfer learning models, many multi-source transfer learning methods have been proposed [20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40]. A direct strategy adopted by some methods [20, 21, 38] is to mine the commonality of all sources, for example by combining multiple source domains into a single source domain through aligning all the raw features, or by mapping all source domains and the target domain into a low-dimensional space for knowledge transfer. With the help of this data expansion, these methods can improve learning performance. However, they ignore the mutual interference across source domains: first, the common features that can build shared structures shrink as the number of source domains grows; second, mining commonality does not yield domain-specific features that carry discriminative power for training classifiers. For instance, if the recommender system of TikTok is trained on the viewing habits of all customers, it may benefit from the huge amount of data and deliver funny videos that win most users' favor, but it cannot ensure that users with different characteristics are accurately fed the videos most appealing to them. To capture the specificity of each source domain, some methods [22, 23] construct the target classifier by combining multiple classifiers, each trained on a pair of one source domain and the target domain. These methods mitigate the negative effects of the aligning or mapping strategies and extract domain-specific information effectively. However, the distribution discrepancies among the sources make it difficult to fuse all the information. For example, relying on this strategy, TikTok can only recommend major-related videos to students; it can hardly broaden their interests or reveal their potential talents. Other methods [24,25,26] incorporate both strategies and thus exploit commonality as well as specificity across domains. However, they still fail to truly correlate the source domains with each other and to transfer knowledge from an integral source-domain framework. Under this strategy, for example, TikTok can attract users only with funny and major-related videos, but not with any other type of video.

In this paper, we propose a novel multi-source transfer learning method based on the power set framework (PSF-MSTL), which combines each source domain with the other source domains in multiple ways and fuses the knowledge learned from the different source combinations to train a shared classifier. First, we introduce the power set concept to construct a source-domain framework that enables different source domains to be interrelated. In mathematics, the power set of a set is the set of all its subsets. For instance, the power set of \(\{A,B,C\}\) is \(\mathscr {P}(\{A,B,C\})=\{\emptyset ,\{A\},\{B\},\{C\},\{A,B\},\{A,C\},\{B,C\},\{A,B,C\}\}\), where \(\emptyset \) denotes the empty set. With this concept, we view the collection of source domains as an original set and take its power set to identify the correlations among the source domains. Different types of training datasets are then generated according to the power set, forming a power set framework in which the source domains are associated with one another. Figure 1 presents the three-source case, and Fig. 1a shows how the source collection is extended to a power set framework.

Fig. 1 In the three-source case, a shows the extension of the three-source collection, b presents the framework of our method, and c, d, and e briefly visualize three existing multi-source transfer learning strategies

Second, to make the power set framework integral and able to provide comprehensive knowledge, we utilize a dual-promotion strategy to integrate all the training datasets. Specifically, PSF-MSTL mines the latent information from the corresponding latent feature spaces for pairs of each training dataset and the target domain, and fuses all the information to bring the knowledge in the framework to a state of mutual complementarity. Our method then uses the fused knowledge and label information to train a shared classifier, which is in turn exploited to further extract knowledge from every training dataset. In other words, refining the knowledge from the framework and training the shared classifier reinforce each other directly, and learning from different training datasets promotes each other indirectly through the shared classifier built on the complementary knowledge. Figure 1b shows that PSF-MSTL extracts and fuses latent information from seven training datasets to obtain complementary knowledge, which is used to train a shared classifier that helps further learning from each training dataset. Figure 1c–e visualize three existing multi-source transfer learning strategies, respectively. This comparison indicates that PSF-MSTL builds sounder relationships among multiple source domains than the other strategies.

The main contributions of our work are listed below:

(1) We systematically analyze the correlations among multiple source domains and propose PSF-MSTL, which constructs a power set framework that enables different source domains to be interrelated.

(2) To obtain complementary knowledge from an integral framework, we utilize a dual-promotion strategy to integrate all the training datasets. In addition, PSF-MSTL is formulated as an optimization problem, and an iterative algorithm is presented to solve it.

(3) We conduct extensive experiments on 20Newsgroups and Sentiment to verify the superiority and effectiveness of PSF-MSTL.

We organize the rest of this paper as follows: Sect. 2 summarizes related work. Section 3 gives the notations and preliminary knowledge. Section 4 presents our method PSF-MSTL. Section 5 reports the experimental results. Finally, Sect. 6 concludes the paper.

2 Related Work

In this section, we summarize several existing single-source and multi-source transfer learning approaches related to our research.

We refer to the authoritative reviews [1, 2] and introduce some single-source transfer learning methods as well as some theoretical studies related to domain adaptation. Dai et al. [3] extended co-clustering to obtain common word clusters and utilized them to propagate in-domain knowledge to the out-of-domain data. Chen et al. [9] presented topical correspondence learning, in which the common and domain-specific features jointly make up their corresponding topics and knowledge is transferred through the common features. Jiang et al. [4] proposed a two-step approach that first mines the common features shared across domains and assigns them appropriate weights, and then identifies the features specific to the target by semi-supervised learning. Uguroglu et al. [6] distinguished variant and invariant features to fit different data distributions and cast the distribution issue as a convex optimization problem. Fang et al. [17] proved a generalization error bound showing that labeled data in the source domain and unlabeled data in the target domain can reduce the target risk in heterogeneous scenarios, and proposed two novel algorithms based on this theory. Blitzer et al. [7] adopted structural correspondence learning and exploited pivot features to model the correlations between source and target. Zhuang et al. [8] argued that feature clusters have more stable associations with document classes and presented a novel classification method that clusters raw features into concepts. Pan et al. [10] analyzed word semantics in different scenarios and proposed quadruple transfer learning to reduce the marginal and conditional distribution discrepancies between domains. Fang et al. [18, 19] pioneered the learning bound for open set domain adaptation and proposed algorithms that regularize the open set difference bound.

However, in real-world applications, labeled data exist in multiple source domains whose data distributions differ from one another. To adapt to these more difficult scenarios, three common strategies are adopted in many multi-source classification methods. A straightforward strategy is to find the common information of all domains for knowledge transfer. Xu et al. [20] developed a novel domain adaptation method that maps all source and target domains into a common feature space to reduce data distribution discrepancies and then generates predictions under the distribution-weighted combining rule. Zhao et al. [21] presented an optimization model and developed it by smoothed approximation to obtain a generalization bound over all source domains. Another strategy is to capture the specificities of multiple sources by transferring knowledge from each source domain. Zhu et al. [22] developed a two-alignment-step framework in which the first step aligns the distributions of each source domain with the target domain and the second step aligns the classification results learned from each pair. Cheng et al. [23] trained multiple weak classifiers on pairs of each source and the target, and constructed the target classifier by co-training the results produced by each classifier. Other methods incorporate the first two strategies to refine both common and specific knowledge from multiple source domains. Zhang et al. [25] presented a multi-source knowledge transfer method that learns different feature extraction networks to align two-stage features and fuses the information from both parts. Xu et al. [27] combined joint and separate alignment strategies and made the two branches learn from each other through their complementarity. In addition, we introduce some innovative multi-source learning strategies. Dai et al. [24] presented a selective domain adaptation method that extracts private features from the source domain nearest to the target domain and then merges these features with the common features to train a target classifier. Li et al. [39] exploited an automatic sampling strategy to align classifiers and developed the cross-domain classification ability with the help of pseudo target labels. Wu et al. [40] proposed a novel method that uses multiple graphs to model partial information and learns low-rank embeddings through domain discrepancy and relevance. Kim et al. [41] mined the shared semantic space of the source and target domains and proposed a new concept-driven text classification algorithm based on deep neural networks. However, these methods ignore the fact that the intrinsic relationships among multiple source domains can have a large impact on the classification result. To address this problem, we propose the multi-source transfer learning method based on the power set framework (PSF-MSTL).

3 Preliminary Knowledge

In this section, we provide the mathematical notations with their denotations and briefly introduce the high-level concepts and the non-negative matrix tri-factorization (NMTF) technique, both of which support our method.

3.1 Notations

We use the calligraphic letter \(\mathcal {D}\) to denote domains and the script letter \(\mathscr {P}\) to abbreviate the power set framework. Data matrices are written in uppercase (such as \(X\)), and \(X_{[i,j]}\) denotes the element in the \(i\)-th row and \(j\)-th column of matrix \(X\). In addition, the sets of non-negative real numbers and real numbers are denoted as \(R_+\) and \(R\), respectively. Table 1 lists the notations and denotations frequently used in this paper.

Table 1 Notations and denotations

3.2 High-Level Concepts

Due to the data distribution discrepancies among different domains in multi-source transfer learning problems, it is hard to obtain satisfactory performance by learning from the raw features directly. Thus, to build a more stable bridge for transfer learning, we incorporate high-level concepts and transform raw features into concepts. Specifically, the high-level concepts rest on two fundamental definitions: the concept extension (CE) and the concept intension (CI) [12, 13]. CE refers to the distribution of the raw word features that express the same concept. CI refers to the association between document classes and concepts. In this paper, we exploit identical concepts and alike concepts to reduce the data distribution discrepancies and to train a shared classifier. Both are summarized in Table 2.

Table 2 Identical concepts and alike concepts

3.3 Non-negative Matrix Tri-factorization

To implement the high-level concepts mentioned above, we introduce the non-negative matrix tri-factorization (NMTF) technique, which has been widely used in text and image classification research [10,11,12,13, 16, 42]. NMTF decomposes a data matrix into the product of three non-negative factor matrices; its basic formula is:

$$\begin{aligned}&{X_{m \times n}} = {F_{m \times k}}{H_{k \times c}}G_{n \times c}^\top \end{aligned}$$
(1)

where \(X \in {R^{m \times n}}\), \(F \in {R^{m \times k}}\), \(H \in {R^{k \times c}}\), and \(G \in {R^{n \times c}}\). Here, \(m\), \(n\), \(k\), and \(c\) are the numbers of features, documents, high-level concepts, and document classes, respectively. \(X_{m \times n}\) is the feature-document matrix with \(m\) rows and \(n\) columns, \(F_{m \times k}\) is the feature-concept matrix whose columns give the distributions of raw features belonging to each concept, \(H_{k \times c}\) is the factor matrix that associates concepts with document classes, and \(G_{n \times c}^{\top }\) is the transpose of \(G_{n \times c}\), which can be regarded as a target classifier. In addition, \(F\) and \(H\) represent CE and CI, respectively.

Furthermore, NMTF is an optimization problem as follows:

$$\begin{aligned} \begin{aligned}&\mathop {\min }\limits _{F,H,G \ge 0} {\parallel X - FH{G^{\top }}\parallel }^{2}\\&s.t.\mathop \sum \limits _{i = 1}^m {F_{[i,j]}} = 1,\mathop \sum \limits _{j = 1}^c {G_{[i,j]}} = 1\\ \end{aligned} \end{aligned}$$
(2)
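As an illustration only, the following minimal NumPy sketch approximately solves problem (2) with the standard multiplicative update rules obtained from the gradients of \(\parallel X - FHG^{\top }\parallel ^{2}\), followed by the normalization required by the constraints; the dimensions are arbitrary, and this is not the algorithm derived in Sect. 4.

```python
import numpy as np

def nmtf(X, k, c, n_iter=100, eps=1e-9, seed=0):
    """Minimal NMTF sketch: X (m x n) ~= F (m x k) @ H (k x c) @ G.T (c x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F, H, G = rng.random((m, k)), rng.random((k, c)), rng.random((n, c))
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        F *= np.sqrt((X @ G @ H.T) / (F @ H @ G.T @ G @ H.T + eps))
        H *= np.sqrt((F.T @ X @ G) / (F.T @ F @ H @ G.T @ G + eps))
        G *= np.sqrt((X.T @ F @ H) / (G @ H.T @ F.T @ F @ H + eps))
        # Enforce the constraints of Eq. (2): columns of F and rows of G sum to one.
        F /= F.sum(axis=0, keepdims=True) + eps
        G /= G.sum(axis=1, keepdims=True) + eps
    return F, H, G

# Toy example: 500 word features, 80 documents, 20 concepts, 2 classes.
X = np.random.default_rng(1).random((500, 80))
F, H, G = nmtf(X, k=20, c=2)
print(np.linalg.norm(X - F @ H @ G.T))   # reconstruction error
```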

Since there are multiple source domains and one target domain in multi-source transfer learning tasks, NMTF is extended for better adaptability. The optimization problem becomes:

$$\begin{aligned} \begin{aligned}&\mathop {\min } \limits _{{F_{i_s}},{F_{t}},H,{G_{i_s}},{G_{t}} \ge 0} \mathop \sum \limits _{i_s = 1}^{s} {\parallel {X_{i_s}} - {F_{i_s}}HG_{i_s}^{\top }\parallel }^{2} + {\parallel {X_{t}} - {F_{t}}HG_{t}^{\top }\parallel }^{2}\\&s.t.\mathop \sum \limits _{i = 1}^m {F_{i_s[i,j]}} = 1,\mathop \sum \limits _{i = 1}^m {F_{t[i,j]}} = 1,\mathop \sum \limits _{j = 1}^c {G_{i_s[i,j]}} = 1,\mathop \sum \limits _{j = 1}^c {G_{t[i,j]}} = 1\\ \end{aligned} \end{aligned}$$
(3)

4 Multi-source Transfer Learning Based on the Power Set Framework

In this section, PSF-MSTL is proposed to address multi-source transfer learning problems. We first discuss how many training datasets are generated in a classification task. Then PSF-MSTL is defined as an optimization problem, and an iterative algorithm is presented to solve it.

4.1 The Number of Training Datasets

Since PSF-MSTL combines each source domain with the others in multiple ways, it is necessary to analyze all possible combinations of source domains to ascertain how many training datasets are generated. According to the theory of the power set in mathematics, the number of subsets in \(A\)'s power set is \(2^{{\vert }A{\vert }}\), where \({\vert }A{\vert }\) is the cardinality of set \(A\); hence \(A\)'s power set contains \(2^{{\vert }A{\vert }}-1\) non-empty subsets. Accordingly, PSF-MSTL produces \(2^s-1\) training datasets in a classification task, where \(s\) is the total number of source domains. For clarity, we call each training dataset an \(l\)-source domain, where \(l\) is the number of source domains it contains. According to \(l\), the training datasets can be divided into \(s\) types, denoted as \(l\)-source types. By counting the training datasets in each type, we can infer that the \(l\)-source type contains \(C_{s}^{l}\) \(l\)-source domains. For example, in the three-source task, a total of 7 (\(2^{3}-1\)) training datasets are generated, and they fall into three types: the one-source type contains 3 (\(C_{3}^{1}\)) one-source domains, the two-source type contains 3 (\(C_{3}^{2}\)) two-source domains, and the three-source type contains 1 (\(C_{3}^{3}\)) three-source domain.
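As a quick illustration of this counting, the short Python sketch below enumerates the \(2^s-1\) non-empty source combinations with itertools and groups them by \(l\)-source type; the domain names are placeholders.

```python
from itertools import combinations
from math import comb

sources = ["S1", "S2", "S3"]          # placeholder source-domain names
s = len(sources)

# Enumerate all non-empty subsets of the source collection (the power set minus the empty set).
training_datasets = [subset
                     for l in range(1, s + 1)
                     for subset in combinations(sources, l)]
assert len(training_datasets) == 2 ** s - 1          # 7 training datasets for three sources

# Group by l-source type: the l-source type contains C(s, l) l-source domains.
for l in range(1, s + 1):
    l_type = [d for d in training_datasets if len(d) == l]
    assert len(l_type) == comb(s, l)
    print(f"{l}-source type: {l_type}")
```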

4.2 Problem Definition

Suppose that, in a multi-source transfer learning problem, the data domain contains \(s\) source domains and one target domain, denoted as \(\mathcal {D}=(\mathcal {D}_1,\cdots ,\mathcal {D}_s,\mathcal {D}_{t})\). Since PSF-MSTL extends the source collection to the power set framework, which has \(2^s-1\) training datasets, we define a general formula for the framework, denoted as \(\mathscr {P}=(\mathcal {D}'_1,\mathcal {D}'_2,\cdots ,\mathcal {D}'_{2^s-1})\), where \(\mathcal {D}'\) represents a training dataset. As a result, the data domain can be extended to \(\mathcal {D}=(\mathcal {D}'_1,\cdots ,\mathcal {D}'_{2^s-1},\mathcal {D}_{t})\), where the first \(2^s-1\) domains are labeled training datasets, i.e., \({\mathcal {D}}'_{i_s} = \left\{ {x_i^{i_s},y_i^{i_s}} \right\} {\vert }_{i = 1}^{{n_{i_s}}}(1 \le i_s \le 2^s-1)\), and the last one is the unlabeled target domain, i.e., \({\mathcal {D}}_{t} = \left\{ {x_i^{t}} \right\} {\vert }_{i = 1}^{{n_{t}}}\). Here, \(n_{i_s}\) is the number of documents in the \(i_s\)-th training dataset, and \(n_{t}\) is the number of documents in the target domain. In addition, the feature-document matrices of the data domain are given as \(X = \left\{ X_{1},\cdots ,X_{2^s-1},X_{t}\right\} \). The objective function is then formulated as:

$$\begin{aligned} \begin{aligned}&\mathcal {L}=\mathop \sum \limits _{i_s = 1}^{2^s-1} \left({\parallel {X_{i_s}} - {F_{i_s}}{H_{i_s}}G_{i_s}^{\top }\parallel }^{2}+ {\parallel {X_{t}} - {F_{t{\vert }i_s}}{H_{t{\vert }i_s}}G_{t}^{\top }\parallel }^{2}\right)\\ \end{aligned} \end{aligned}$$
(4)

where \(X_{i_s} \in {R_+^{m \times n_{i_s}}}\), \(X_{t} \in {R_+^{m \times n_{t}}}\), \(F_{i_s} \in {R_+^{m \times k}}\), \(F_{t{\vert }i_s} \in {R_+^{m \times k}}\), \({H_{i_s}} \in {R_+^{k \times c}}\), \({H_{t{\vert }i_s}} \in {R_+^{k \times c}}\), \(G_{i_s} \in {R_+^{n_{i_s} \times c}}\), \(G_{t} \in {R_+^{n_{t} \times c}}\).

Since the identical concepts and the alike concepts are utilized in this paper, \(F_{i_s}\), \(F_{t{\vert }i_s}\), \(H_{i_s}\), and \(H_{t{\vert }i_s}\) are each divided into two parts. Specifically, in a pair of training dataset and target domain, an identical concept shares the same CE across domains, while an alike concept has a different CE in each domain. Thus, in the \(i_s\)-th pair, the feature-concept matrix of the identical concepts is denoted as \(F_{i_s}^{1}\) for both the training dataset and the target domain, and the feature-concept matrices of the alike concepts are denoted as \(F_{i_s}^{2}\) and \(F_{t{\vert }i_s}^{2}\) for the training dataset and the target domain, respectively. As a result, \(F_{i_s}=[F_{i_s}^{1},F_{i_s}^{2}]\) and \(F_{t{\vert }i_s}=[F_{i_s}^{1},F_{t{\vert }i_s}^{2}]\), where \(F_{i_s}^{1} \in {R_{+}^{m \times k_1}}\), \(F_{i_s}^{2} \in {R_{+}^{m \times k_2}}\), \(F_{t{\vert }i_s}^{2} \in {R_{+}^{m \times k_2}}\), and \(k_1+k_2=k\). Similarly, both the identical and the alike concepts keep the same CI within the \(i_s\)-th pair, and we use \(H_{i_s}^1\) and \(H_{i_s}^2\) to express these two types of associations between concepts and document classes, i.e., \(H_{i_s}=H_{t{\vert }i_s}=\left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] \), where \(H_{i_s}^1 \in {R_+^{k_1 \times c}}\) and \(H_{i_s}^2 \in {R_+^{k_2 \times c}}\). Thus, we can rewrite the objective function in (4):

$$\begin{aligned} \begin{aligned} \mathcal {L}= &\mathop \sum \limits _{i_s = 1}^{2^s-1} \left({\parallel {X_{i_s}} - {F_{i_s}}{H_{i_s}}G_{i_s}^{\top }\parallel }^{2}+ {\parallel {X_{t}} - {F_{t{\vert }i_s}}{H_{t{\vert }i_s}}G_{t}^{\top }\parallel }^{2}\right)\\=&\mathop \sum \limits _{{i_s} = 1}^{2^s-1} \left({\parallel {X_{i_s}} - [F_{i_s}^{1},F_{i_s}^{2}]\left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] G_{i_s}^{\top }\parallel }^{2}+ {\parallel {X_{t}} - [F_{i_s}^{1},F_{t{\vert }i_s}^{2}]\left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] G_{t}^{\top }\parallel }^{2}\right)\\ \end{aligned} \end{aligned}$$
(5)

There are five constraints on \(F_{i_s}^{1}\), \(F_{i_s}^{2}\), \(F_{t{\vert }i_s}^{2}\), \(G_{i_s}\), and \(G_{t}\). The optimization problem is therefore:

$$\begin{aligned} \begin{aligned}&\mathop {\min }\limits _{{F_{i_s}},{F_{t{\vert }i_s}},{H_{i_s}},{G_{i_s}},{G_{t}} \ge 0} \mathcal {L}\\&s.t.\mathop \sum \limits _{i = 1}^m {F_{{i_s}[i,j]}^1} = 1,\\&\sum \limits _{i = 1}^m {F_{{i_s}[i,j]}^{2}} = 1,\mathop \sum \limits _{i = 1}^m {F_{{t{\vert }i_s}[i,j]}^{2}} = 1,\\&\mathop \sum \limits _{j = 1}^c {G_{{i_s}[i,j]}} = 1,\mathop \sum \limits _{j = 1}^c {G_{{t}[i,j]}} = 1\\ \end{aligned} \end{aligned}$$
(6)

4.3 Solution to PSF-MSTL

Since \(\mathcal {L}\) is not jointly convex in all the factor matrices, we further expand the objective function and propose an iterative algorithm instead of resorting to general nonlinear optimization techniques. According to the properties of the trace and the Frobenius norm, we can expand the objective function as follows:

$$\begin{aligned} \mathcal {L}=&\mathop \sum \limits _{{i_s} = 1}^{2^s-1} ({\parallel {X_{i_s}} - [F_{i_s}^{1},F_{i_s}^{2}]\left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] G_{i_s}^{\top }\parallel }^{2} \nonumber \\&\quad + {\parallel {X_{t}} - [F_{i_s}^{1},F_{t{\vert }i_s}^{2}]\left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] G_{t}^{\top }\parallel }^{2})\nonumber \\ =&\mathop \sum \limits _{{i_s} = 1}^{2^s-1} (tr({X_{i_s}^{\top }}{X_{i_s}} - 2\cdot {X_{i_s}^{\top }}[F_{i_s}^{1},F_{i_s}^{2}] \left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] {G_{i_s}^{\top }}\nonumber \\&+{G_{i_s}}{\left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] }^{\top } {[F_{i_s}^{1},F_{i_s}^{2}]}^{\top }\nonumber \\&[F_{i_s}^{1},F_{i_s}^{2}] \left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] {G_{i_s}^{\top }}) + tr({X_{t}^{\top }}{X_{t}}\nonumber \\&- 2\cdot {X_{t}^{\top }}[F_{i_s}^{1},F_{t{\vert }i_s}^{2}] \left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] {G_{t}^{\top }}\nonumber \\&+{G_{t}}{\left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] }^{\top } {[F_{i_s}^{1},F_{t{\vert }i_s}^{2}]}^{\top }[F_{i_s}^{1},F_{t{\vert }i_s}^{2}] \left[ {\begin{array}{*{20}{c}}{H_{i_s}^{1}}\\ {H_{i_s}^{2}}\end{array}}\right] {G_{t}^{\top }}))\nonumber \\ =&\mathop \sum \limits _{{i_s} = 1}^{2^s-1} (tr({X_{i_s}^{\top }}{X_{i_s}}-2\cdot {X_{i_s}^{\top }}{A_{i_s}}-2\cdot {X_{i_s}^{\top }}{B_{i_s}}+G_{i_s}{H_{i_s}^{1}}^{\top }{F_{i_s}^{1}}^{\top }{A_{i_s}}\nonumber \\&+G_{i_s}{H_{i_s}^{2}}^{\top }{F_{i_s}^{2}}^{\top }{B_{i_s}} + 2\cdot {G_{i_s}}{H_{i_s}^{1}}^{\top }{F_{i_s}^{1}}^{\top }{B_{i_s}})\nonumber \\&\quad +tr({X_{t}^{\top }}{X_{t}}-2\cdot {X_{t}^{\top }}{A_{t}}\nonumber \\&-2\cdot {X_{t}^{\top }}{B_{t}} + G_{t}{H_{i_s}^{1}}^{\top }{F_{i_s}^{1}}^{\top }{A_{t}} \nonumber \\&+ G_{t}{H_{i_s}^{2}}^{\top }{F_{t{\vert }i_s}^{2}}^{\top }{B_{t}}+2\cdot {G_{t}}{H_{i_s}^{1}}^{\top }{F_{i_s}^{1}}^{\top }{B_{t}}))\nonumber \\ s.t.&\mathop \sum \limits _{i = 1}^m {F_{i_s[i,j]}^1} = 1,\sum \limits _{i = 1}^m {F_{i_s[i,j]}^{2}} = 1,\mathop \sum \limits _{i = 1}^m {F_{t{\vert }i_s[i,j]}^{2}} = 1,\mathop \sum \limits _{j = 1}^c {G_{i_s[i,j]}} = 1,\nonumber \\&\mathop \sum \limits _{j = 1}^c {G_{t[i,j]}} = 1 \end{aligned}$$
(7)

where \({A_{i_s}}={F_{i_s}^1}{H_{i_s}^1}{{G_{i_s}}^{\top }}\), \({B_{i_s}}={F_{i_s}^{2}}{H_{i_s}^2}{{G_{i_s}}^{\top }}\), \({A_{t}}={F_{i_s}^1}{H_{i_s}^1}{{G_{t}}^{\top }}\), and \({B_{t}}={F_{t{\vert }i_s}^{2}}{H_{i_s}^2}{{G_{t}}^{\top }}\). Since the true label information of the source domains is already available, \(G_{i_s}\) is fixed and only \(G_{t}\) needs to be solved among the label matrices. We then compute the partial derivatives of \(\mathcal {L}\) as follows:

$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial {F_{i_s}^{1}}}&= -2\cdot {X_{i_s}}{G_{i_s}}{H_{i_s}^{1}}^{\top }+2\cdot {A_{i_s}}{G_{i_s}}{H_{i_s}^{1}}^{\top }+2\cdot {B_{i_s}}{G_{i_s}}{H_{i_s}^{1}}^{\top }\\&\quad -2\cdot {X_{t}}{G_{t}}{H_{i_s}^{1}}^{\top }+2\cdot {A_{t}}{G_{t}}{H_{i_s}^{1}}^{\top }+2\cdot {B_{t}}{G_{t}}{H_{i_s}^{1}}^{\top } \end{aligned}$$
(8)
$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial {F_{i_s}^{2}}}&= -2\cdot {X_{i_s}}{G_{i_s}}{H_{i_s}^{2}}^{\top }+2\cdot {B_{i_s}}{G_{i_s}}{H_{i_s}^{2}}^{\top }+2\cdot {A_{i_s}}{G_{i_s}}{H_{i_s}^{2}}^{\top } \end{aligned}$$
(9)
$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial {F_{t{\vert }i_s}^{2}}}&= -2\cdot {X_{t}}{G_{t}}{H_{i_s}^{2}}^{\top }+2\cdot {B_{t}}{G_{t}}{H_{i_s}^{2}}^{\top }+2\cdot {A_{t}}{G_{t}}{H_{i_s}^{2}}^{\top } \end{aligned}$$
(10)
$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial {H_{i_s}^{1}}}&= -2\cdot {F_{i_s}^{1}}^{\top }{X_{i_s}}{G_{i_s}}+2\cdot {F_{i_s}^{1}}^{\top }{A_{i_s}}{G_{i_s}}+2\cdot {F_{i_s}^{1}}^{\top }{B_{i_s}}{G_{i_s}}\\&\quad -2\cdot {F_{i_s}^{1}}^{\top }{X_{t}}{G_{t}}+2\cdot {F_{i_s}^{1}}^{\top }{A_{t}}{G_{t}}+2\cdot {F_{i_s}^{1}}^{\top }{B_{t}}{G_{t}} \end{aligned}$$
(11)
$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial {H_{i_s}^{2}}}&= -2\cdot {F_{i_s}^{2}}^{\top }{X_{i_s}}{G_{i_s}}+2\cdot {F_{i_s}^{2}}^{\top }{B_{i_s}}{G_{i_s}}+2\cdot {F_{i_s}^{2}}^{\top }{A_{i_s}}{G_{i_s}}\\&\quad -2\cdot {F_{t{\vert }i_s}^{2}}^{\top }{X_{t}}{G_{t}}+2\cdot {F_{t{\vert }i_s}^{2}}^{\top }{B_{t}}{G_{t}}+2\cdot {F_{t{\vert }i_s}^{2}}^{\top }{A_{t}}{G_{t}} \end{aligned}$$
(12)
$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial {G_{t}}}&= -2\cdot {X_{t}}^{\top }\mathop \sum \limits _{i_s = 1}^{2^{s}-1}{F_{t{\vert }i_s}}{H_{i_s}}+2\cdot {G_{t}}\mathop \sum \limits _{i_s = 1}^{2^{s}-1}{H_{i_s}}^{\top }{F_{t{\vert }i_s}}^{\top }{F_{t{\vert }i_s}}{H_{i_s}} \end{aligned}$$
(13)

The iterative algorithm updates these factor matrices as follows:

$$\begin{aligned} F_{i_s\left[ {i,j} \right] }^1&\leftarrow F_{i_s\left[ {i,j} \right] }^1 \cdot \sqrt{\frac{[{X_{i_s}}{G_{i_s}}{H_{i_s}^{1}}^{\top }+{X_{t}}{G_{t}}{H_{i_s}^{1}}^{\top }]_{\left[ {i,j} \right] }}{[{A_{i_s}}{G_{i_s}}{H_{i_s}^{1}}^{\top }+{B_{i_s}}{G_{i_s}}{H_{i_s}^{1}}^{\top }+{A_{t}}{G_{t}}{H_{i_s}^{1}}^{\top }+{B_{t}}{G_{t}}{H_{i_s}^{1}}^{\top }]_{\left[ {i,j} \right] }}} \end{aligned}$$
(14)
$$\begin{aligned} F_{{i_s}\left[ {i,j} \right] }^{2}&\leftarrow F_{{i_s}\left[ {i,j} \right] }^{2} \cdot \sqrt{\frac{[{X_{i_s}}{G_{i_s}}{H_{i_s}^{2}}^{\top }]_{\left[ {i,j} \right] }}{[{B_{i_s}}{G_{i_s}}{H_{i_s}^{2}}^{\top }+{A_{i_s}}{G_{i_s}}{H_{i_s}^{2}}^{\top }]_{\left[ {i,j} \right] }}} \end{aligned}$$
(15)
$$\begin{aligned} F_{{t{\vert }i_s}\left[ {i,j} \right] }^{2}&\leftarrow F_{{t{\vert }i_s}\left[ {i,j} \right] }^{2} \cdot \sqrt{\frac{[{X_{t}}{G_{t}}{H_{i_s}^{2}}^{\top }]_{\left[ {i,j} \right] }}{[{B_{t}}{G_{t}}{H_{i_s}^{2}}^{\top }+{A_{t}}{G_{t}}{H_{i_s}^{2}}^{\top }]_{\left[ {i,j} \right] }}} \end{aligned}$$
(16)
$$\begin{aligned} H_{{i_s}\left[ {i,j} \right] }^{1}&\leftarrow H_{{i_s}\left[ {i,j} \right] }^{1} \cdot \sqrt{\frac{[{F_{i_s}^1}^{\top }{X_{i_s}}{G_{i_s}}+{F_{i_s}^1}^{\top }{X_{t}}{G_{t}}]_{\left[ {i,j} \right] }}{[{F_{i_s}^1}^{\top }{A_{i_s}}{G_{i_s}}+{F_{i_s}^1}^{\top }{B_{i_s}}{G_{i_s}}+{F_{i_s}^1}^{\top }{A_{t}}{G_{t}}+{F_{i_s}^1}^{\top }{B_{t}}{G_{t}}]_{\left[ {i,j} \right] }}} \end{aligned}$$
(17)
$$\begin{aligned} H_{{i_s}\left[ {i,j} \right] }^{2}&\leftarrow H_{{i_s}\left[ {i,j} \right] }^{2} \cdot \sqrt{\frac{[{F_{i_s}^{2}}^{\top }{X_{i_s}}{G_{i_s}}+{F_{t{\vert }i_s}^{2}}^{\top }{X_{t}}{G_{t}}]_{\left[ {i,j} \right] }}{[{F_{i_s}^{2}}^{\top }{B_{i_s}}{G_{i_s}}+{F_{i_s}^{2}}^{\top }{A_{i_s}}{G_{i_s}}+{F_{t{\vert }i_s}^{2}}^{\top }{B_{t}}{G_{t}}+{F_{t{\vert }i_s}^{2}}^{\top }{A_{t}}{G_{t}}]_{\left[ {i,j} \right] }}} \end{aligned}$$
(18)
$$\begin{aligned} G_{{t}\left[ {i,j} \right] }&\leftarrow G_{{t}\left[ {i,j} \right] } \cdot \sqrt{\frac{[{X_{t}}^{\top }\sum \nolimits _{{i_s} = 1}^{2^{s}-1}{F_{t{\vert }i_s}}{H_{i_s}}]_{\left[ {i,j} \right] }}{[{G_{t}}\sum \nolimits _{{i_s} = 1}^{2^{s}-1}{H_{i_s}}^{\top }{F_{t{\vert }i_s}}^{\top }{F_{t{\vert }i_s}}{H_{i_s}}]_{\left[ {i,j} \right] }}} \end{aligned}$$
(19)

We update all the factor matrices in each iteration and use Eq. (20) to normalize \(F_{i_s}^1\), \(F_{i_s}^2\), \(F_{t{\vert }i_s}^2\), \(H_{i_s}^1\), \(H_{i_s}^2\), and \(G_{t}\) as follows:

$$\begin{aligned} F_{i_s\left[ {i,j} \right] }^1&\leftarrow \frac{F_{i_s\left[ {i,j} \right] }^1}{\sum \nolimits _{{j} = 1}^{k_1}F_{i_s\left[ {i,j} \right] }^1},\quad F_{{i_s}\left[ {i,j} \right] }^{2} \leftarrow \frac{F_{{i_s}\left[ {i,j} \right] }^{2}}{\sum \nolimits _{{j} = 1}^{k_2}F_{{i_s}\left[ {i,j} \right] }^{2}},\quad F_{{t{\vert }i_s}\left[ {i,j} \right] }^{2} \leftarrow \frac{F_{{t{\vert }i_s}\left[ {i,j} \right] }^{2}}{\sum \nolimits _{{j} = 1}^{k_2}F_{{t{\vert }i_s}\left[ {i,j} \right] }^{2}},\\ H_{{i_s}\left[ {i,j} \right] }^{1}&\leftarrow \frac{H_{{i_s}\left[ {i,j} \right] }^{1}}{\sum \nolimits _{{j} = 1}^{c}H_{{i_s}\left[ {i,j} \right] }^{1}},\quad H_{{i_s}\left[ {i,j} \right] }^{2} \leftarrow \frac{H_{{i_s}\left[ {i,j} \right] }^{2}}{\sum \nolimits _{{j} = 1}^{c}H_{{i_s}\left[ {i,j} \right] }^{2}},\quad G_{{t}\left[ {i,j} \right] } \leftarrow \frac{G_{{t}\left[ {i,j} \right] }}{\sum \nolimits _{{j} = 1}^{c}G_{{t}\left[ {i,j} \right] }} \end{aligned}$$
(20)

According to Eqs. (14)–(20), an iterative algorithm is proposed in Algorithm 1. The data matrices are normalized such that \(X_{i_s}^\top 1_m = 1_n\) and \(X_{t}^\top 1_m = 1_n\). To initialize CE in each pair of training dataset and target domain, all the data in the pair are combined to run PLSA [44]. For example, we set the number of concepts in the \(i_s\)-th pair to \(k_{1}+k_{2}\) and obtain the feature-concept matrix \(W_{i_s}\), which is divided into two parts \(W_{i_s}=[W^{1}_{i_s},W^{2}_{i_s}]\), where \(W^{1}_{i_s} \in R_+^{m \times k_1}\) and \(W^{2}_{i_s} \in R_+^{m \times k_2}\). Then \(F^{1}_{i_s}\) is initialized as \(W^{1}_{i_s}\), while \(F^{2}_{i_s}\) and \(F^{2}_{t{\vert }i_s}\) are initialized as \(W^{2}_{i_s}\). In addition, we initialize the classifier of each pair by logistic regression [43], denoted as \(G_{t{\vert }i_s}\), and then integrate them via Eqs. (14)–(20) to obtain the initial \(G_{t}\). \(G_{i_s}\) is assigned the true label information of the documents.
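As a code-level companion to Algorithm 1, the sketch below implements one round of the multiplicative updates (14)–(19) and an Eq. (20)-style normalization with NumPy. It is a simplified, hedged illustration (random factors instead of PLSA and logistic regression initialization, batch-style updates within a pair, no convergence check), not the exact implementation used in our experiments.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero in the multiplicative updates

def update_pair(X_s, X_t, F1, F2s, F2t, H1, H2, G_s, G_t):
    """One round of Eqs. (14)-(18) for the i_s-th (training dataset, target) pair."""
    A_s, B_s = F1 @ H1 @ G_s.T, F2s @ H2 @ G_s.T   # A_{i_s}, B_{i_s}
    A_t, B_t = F1 @ H1 @ G_t.T, F2t @ H2 @ G_t.T   # A_t, B_t
    F1 *= np.sqrt((X_s @ G_s @ H1.T + X_t @ G_t @ H1.T) /
                  ((A_s + B_s) @ G_s @ H1.T + (A_t + B_t) @ G_t @ H1.T + EPS))
    F2s *= np.sqrt((X_s @ G_s @ H2.T) / ((A_s + B_s) @ G_s @ H2.T + EPS))
    F2t *= np.sqrt((X_t @ G_t @ H2.T) / ((A_t + B_t) @ G_t @ H2.T + EPS))
    H1 *= np.sqrt((F1.T @ X_s @ G_s + F1.T @ X_t @ G_t) /
                  (F1.T @ (A_s + B_s) @ G_s + F1.T @ (A_t + B_t) @ G_t + EPS))
    H2 *= np.sqrt((F2s.T @ X_s @ G_s + F2t.T @ X_t @ G_t) /
                  (F2s.T @ (A_s + B_s) @ G_s + F2t.T @ (A_t + B_t) @ G_t + EPS))

def update_shared_classifier(X_t, factors, G_t):
    """Eq. (19): update the shared target classifier over all 2^s - 1 pairs."""
    num, den = np.zeros_like(G_t), np.zeros_like(G_t)
    for F1, F2s, F2t, H1, H2 in factors:
        F_t = np.hstack([F1, F2t])    # F_{t|i_s} = [F^1_{i_s}, F^2_{t|i_s}]
        H = np.vstack([H1, H2])       # H_{i_s} = [H^1_{i_s}; H^2_{i_s}]
        num += X_t.T @ F_t @ H
        den += G_t @ H.T @ F_t.T @ F_t @ H
    G_t *= np.sqrt(num / (den + EPS))

def row_normalize(M):
    """Eq. (20)-style normalization: each row of a factor matrix sums to one."""
    return M / (M.sum(axis=1, keepdims=True) + EPS)
```

In a full training loop, each of the \(maxIter\) iterations would call `update_pair` for every training dataset, then `update_shared_classifier` once, and finally normalize \(F^1_{i_s}\), \(F^2_{i_s}\), \(F^2_{t{\vert }i_s}\), \(H^1_{i_s}\), \(H^2_{i_s}\), and \(G_t\) as in Eq. (20).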

4.4 Computational Complexity of the Iterative Algorithm

We verify the efficiency of our method by analyzing the computational complexity of Eqs. (14)–(19) in each iteration. For instance, the computational complexity of Eq. (14) is \(O((2^s-1)(5mnc+2mkc+6mk_1c+mk_1))\), where \(n=n_{i_s}+n_{t}\). In general, \(c<k_1 <k \ll n\) and \(s \ll n <m\), so this complexity can be written as \(O(mnc)\). Similarly, the computational complexities of Eqs. (15), (16), (17), (18), and (19) are, respectively, \(O(mn_{i_s}c)\), \(O(mn_{t}c)\), \(O(mnc+mk_1n)\), \(O(mnc+mk_2n)\), and \(O(mn_{t}k)\). Thus, the maximal computational complexity per iteration is \(O(mnc+mnk)\), and the overall computational complexity of Algorithm 1 is \(O(maxIter \cdot (mnc+mnk))\).
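To make the dominance argument concrete, the snippet below evaluates the terms of the Eq. (14) cost under hypothetical dimensions (chosen only to satisfy \(c<k_1<k \ll n\) and \(s \ll n<m\); they are not the dimensions of our datasets).

```python
# Illustrative magnitudes of the terms in the Eq. (14) complexity under assumed dimensions.
m, n, c, k, s = 10_000, 800, 2, 40, 3
k1 = k // 2

terms = {"5mnc": 5 * m * n * c, "2mkc": 2 * m * k * c,
         "6mk1c": 6 * m * k1 * c, "mk1": m * k1}
for name, value in terms.items():
    print(f"{name:>6}: {value:,}")
# The 5mnc term dwarfs the lower-order terms, which is why the per-pair cost of
# Eq. (14) is summarized as O(mnc); the same reasoning gives O(mnc + mnk) overall.
```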

Algorithm 1 The iterative algorithm of PSF-MSTL

5 Experimental Evaluation

In this section, we compare our method with other advanced multi-source transfer learning methods on two benchmark datasets. Furthermore, we systematically verify the effectiveness of PSF-MSTL.

5.1 Data Preparation

20NewsgroupsFootnote 1 is one of the international standard datasets used in text classification research. It collects approximately 20,000 news documents and divides them into 20 different newsgroups. Similar newsgroups are grouped into one topic, resulting in four topics. For instance, comp.graphics, comp.sys.mac.hardware, comp.sys.ibm.pc.hardware, and comp.os.ms-windows.misc are categorized into the topic comp. The topics and their corresponding newsgroups in 20Newsgroups are listed in Table 3. We design the three-source classification tasks as follows:

Table 3 Topics and newsgroups

Since documents are classified into two classes in the binary classification tasks, we first choose rec as the positive class and sci as the negative class. We then randomly select one newsgroup from rec and one from sci to construct the first source domain, with 200 news documents per newsgroup. The other two source domains and the target domain are generated in the same way. Thus, 576 \(({P_4^4}\times {P_4^4})\) three-source text classification tasks are produced. Since there are 6 \(({P_3^3})\) permutations of the three sources and different permutations hardly affect the experimental results, the number of tasks can be reduced to 96 \(({P_4^4}\times {P_4^4}\div {P_3^3})\). In summary, there are 96 three-source transfer learning tasks on 20Newsgroups.
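As a quick sanity check on this counting, the snippet below reproduces the 576 and 96 figures; the newsgroup names are placeholders for the four rec and four sci newsgroups.

```python
from itertools import permutations

rec = ["rec.a", "rec.b", "rec.c", "rec.d"]   # placeholders for the four rec newsgroups
sci = ["sci.a", "sci.b", "sci.c", "sci.d"]   # placeholders for the four sci newsgroups

# Assign one rec and one sci newsgroup to each of the 4 domains (3 sources + 1 target).
tasks = [tuple(zip(r, s)) for r in permutations(rec) for s in permutations(sci)]
print(len(tasks))                             # 576 = P(4,4) * P(4,4)

# Tasks that differ only in the ordering of the three source domains are duplicates.
unique = {(frozenset(t[:3]), t[3]) for t in tasks}
print(len(unique))                            # 96 = 576 / P(3,3)
```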

SentimentFootnote 2 is another dataset widely used in binary text classification. It is composed of four fields, i.e., books, electronics, kitchen, and DVD. We take three of them as source domains and the remaining one as the target domain, and each domain has 200 positive and 200 negative documents. Thus, we produce 4 \(({P_4^3}\div {P_3^3})\) three-source classification tasks on Sentiment.

Office-HomeFootnote 3 has four major categories, i.e., Art, Clipart, Product, and Real World. Each major category contains the same subcategories. We randomly choose two subcategories as the positive and negative classes; in our experiments, we use Alarm_Clock vs. Backpack. For the three-source transfer learning tasks, since there are four major categories in Office-Home, we choose three of them as source domains and the remaining one as the target domain. Thus, four classification tasks are constructed in one set of experiments. All images are first converted to grayscale, resized to 32*32 pixels, and normalized in grayscale value. The grayscale values of each image are then flattened into a feature vector, forming a matrix with features as rows and samples as columns.
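The preprocessing just described can be sketched as follows with Pillow and NumPy; the directory layout and file names are hypothetical, and this is only an illustration of the grayscale/resize/normalize/flatten pipeline, not the exact script used in the experiments.

```python
import numpy as np
from pathlib import Path
from PIL import Image

def load_category(image_dir, size=(32, 32)):
    """Build a feature-by-sample matrix from one Office-Home subcategory folder.

    Each image is converted to grayscale, resized to 32x32, normalized to [0, 1],
    and flattened, so the resulting matrix has 1024 feature rows and one column
    per image.
    """
    columns = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        img = Image.open(path).convert("L").resize(size)            # grayscale, 32x32
        columns.append(np.asarray(img, dtype=np.float64).flatten() / 255.0)
    return np.stack(columns, axis=1)                                 # shape: (1024, n_images)

# Hypothetical paths: positive vs. negative classes from one domain (e.g., Art).
X_pos = load_category("OfficeHome/Art/Alarm_Clock")
X_neg = load_category("OfficeHome/Art/Backpack")
X_art = np.hstack([X_pos, X_neg])
```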

5.2 Experimental Setting

5.2.1 Algorithms in Comparison

(1) Logistic regression (LR) [43]: LR is a traditional supervised classification method. We combine all source domains into one source domain to train a classifier and test on the target domain.

(2) Non-negative matrix tri-factorization (NMTF) [9]: We use NMTF, which can be trained on the source and target domains simultaneously, as the baseline method.

(3) Multi-source text classification methods: RCD-PLSA [38], MST3L [23], and MCPC [26] adopt the three strategies described in Sect. 1, respectively. In addition, PSF-MSTL is compared with the unsupervised text classification method SDA [24] and the deep-neural-network-based method TSM-DNN [41].

5.2.2 Parameter Settings

Since it is exceedingly difficult to quantify the high-level concepts and find an optimal setting, the numbers of identical and alike concepts in each pair of target domain and training dataset are empirically set to 20 (\(k_1=k_2=20\)), and the maximum number of iterations is set to 100 (\(maxIter=100\)). In addition, LR is run in Matlab.Footnote 4 NMTF is obtained from [9]. Their parameters are set to the default values. For fairness, RCD-PLSA, MST3L, MCPC, and SDA incorporate the NMTF technique and are trained with the same parameters as PSF-MSTL. The initial learning rate of TSM-DNN is set to 0.01.

5.2.3 Evaluation Metrics

We use two general evaluation metrics and count the occurrences of negative transfer to evaluate the experimental results.

(1) \(Accuracy\)

\(Accuracy=\displaystyle {\frac{{\vert }\{d:d \in D \wedge f(d)=y(d)\}{\vert }}{n}}\), where \(y(d)\) is the true label of document \(d\), \(f(d)\) is the predicted label, and \(n\) is the number of documents.

(2) \(F1-measure\)

\(F1\text{-}measure=\displaystyle {\frac{F_{1N}+F_{1P}}{2}}\), where \(F_{1N}\) (\(F_1\) on negative predictions) \(=(2\cdot P_{N}\cdot R_{N})/(P_{N}+R_{N})\), \(F_{1P}\) (\(F_1\) on positive predictions) \(=(2\cdot P_{P}\cdot R_{P})/(P_{P}+R_{P})\), \(R_{N}\) (recall on negative predictions) \(=a/(a+b)\), \(P_{N}\) (precision on negative predictions) \(=a/(a+c)\), \(R_{P}\) (recall on positive predictions) \(=d/(d+c)\), and \(P_{P}\) (precision on positive predictions) \(=d/(d+b)\).
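For concreteness, a small sketch of this macro-averaged \(F1\text{-}measure\) computed from binary labels is given below; here \(a\), \(b\), \(c\), and \(d\) are the confusion-matrix counts implied by the definitions above, which is an assumption on our part since the counts are not spelled out in the text.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the negative (0) and positive (1) classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    a = np.sum((y_true == 0) & (y_pred == 0))   # negatives predicted negative
    b = np.sum((y_true == 0) & (y_pred == 1))   # negatives predicted positive
    c = np.sum((y_true == 1) & (y_pred == 0))   # positives predicted negative
    d = np.sum((y_true == 1) & (y_pred == 1))   # positives predicted positive
    R_N, P_N = a / (a + b), a / (a + c)
    R_P, P_P = d / (d + c), d / (d + b)
    F1_N = 2 * P_N * R_N / (P_N + R_N)
    F1_P = 2 * P_P * R_P / (P_P + R_P)
    return (F1_N + F1_P) / 2

print(macro_f1([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))   # 0.5833...
```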

(3) \(Count \ of \ negative \ transfer\)

We count the occurrences of negative transfer to measure transfer learning methods from another perspective [10]. The number of negative transfers is denoted as \(NumNT\).

5.3 Experimental Results

Here, PSF-MSTL is compared with LR, NMTF, RCD-PLSA, MST3L, MCPC, and SDA on 20Newsgroups and Sentiment datasets, respectively. The average performances of PSF-MSTL and other transfer learning methods are shown in Table 4.

Table 4 Performances (%) on 20Newsgroups and Sentiment (10 repeated experiments)
(1) Comparison on 20Newsgroups

(a) The experimental results on 20Newsgroups are shown in Table 4. We can observe that PSF-MSTL outperforms all the compared methods, for two reasons. First, it constructs a power set framework that allows each single-source domain to join different training datasets, so that multiple source domains are preliminarily correlated from various perspectives. Second, PSF-MSTL integrates all the training datasets, enabling the target classifier to be trained on complementary knowledge from an integral framework. Consequently, PSF-MSTL not only mines the commonality and specificity across domains but also builds the intrinsic relevance of multiple source domains. The other methods hardly make full use of the latent information because they do not construct correlations across multiple source domains, which makes them inferior to PSF-MSTL on multi-source classification tasks.

(b) All the multi-source transfer learning methods outperform LR, which indicates that traditional machine learning methods can hardly address multi-source transfer learning problems.

(c) NMTF and RCD-PLSA obtain poorer performance in multi-source tasks, which verifies our analysis in Sect. 1 that the combining strategy may ignore the mutual interference across source domains.

(d) MST3L and MCPC achieve better performance than RCD-PLSA. This indicates that fusing all single-source knowledge can improve learning performance.

(2) Comparison on Sentiment

To verify that our method can handle more complex tasks, we construct classification tasks in which the discrepancies among the source domains are large. The experimental results are shown in Table 4, and PSF-MSTL shows even clearer advantages over all the compared methods. The other multi-source transfer learning methods fail in these challenging tasks because the common and domain-specific information is extracted from only a few raw features, which weakens its discriminative power for document classification. Only PSF-MSTL constructs a complete relation structure over the multiple source domains, which provides complementary latent information to reduce the learning bias in a harder scenario. Thus, the tasks on Sentiment further demonstrate that PSF-MSTL is more stable than the other transfer learning methods.

(3) Comparison on Office-Home

To further validate the adaptation capability of PSF-MSTL, we perform additional experiments on the image dataset Office-Home. The experimental results are shown in Table 5, and PSF-MSTL still obtains the best performance. The reason is that, even on a more challenging image dataset, PSF-MSTL can construct the power set framework that captures the correlations across multiple source domains and mine complementary knowledge from it to further improve classification performance. Therefore, our algorithm still achieves outstanding performance.

Table 5 Performances (%) on Office-Home (10 repeated experiments)

In summary, the comparison results verify the effectiveness and stability of PSF-MSTL on both traditional and challenging multi-source classification tasks.

5.4 Effectiveness of PSF-MSTL

To further evaluate PSF-MSTL, we design two-source and three-source experiments and perform several groups of single-source transfer learning tasks in which the source domain is set to one of the generated training datasets. According to the number of training datasets in the power set framework, 864 \((288\times {3})\) and 672 \((96\times {7})\) single-source tasks are produced in the two-source and three-source scenarios, respectively. The parameter settings in these tasks are the same as those of PSF-MSTL. Table 6 compares the learning performance of the complete framework and its components in the two-source scenario, and Table 7 does so for the three-source scenario. Since we removed duplicate tasks in Sect. 5.1, better or worse results may cluster in a certain group of tasks, so we report the mean accuracy of each type to make the results more objective.

Table 6 Average performances (%) comparison between the complete framework and its components for two-source (10 repeated experiments)
Table 7 Average performances (%) comparison between the complete framework and its components for three-source (10 repeated experiments)

First, both Tables 6 and 7 show that the learning accuracy on any single training dataset cannot exceed that on the power set framework, indicating that it is necessary to construct a complete framework containing richer knowledge. In addition, the reason why PSF-MSTL reduces negative transfer in two-source tasks and avoids it in three-source tasks is that our method builds more sophisticated intrinsic relevance when more source domains exist.

Second, by comparing the experimental results of the two-source and three-source types in Table 7, we can infer that increasing the number of source domains may lead to different consequences in different scenarios: in traditional multi-source tasks, data expansion yields only limited improvement in learning performance, whereas it brings a large improvement in challenging tasks. PSF-MSTL, however, significantly improves the learning performance on both datasets, indicating that it can solve the problem of mutual interference among multiple sources and effectively utilize the valuable information in each source domain wherever it is applied.

In summary, these results not only verify our ideas presented in Sect. 1 but also prove the effectiveness of PSF-MSTL.

5.5 Running Time

We randomly select six three-source tasks on 20Newsgroups and Sentiment, respectively, to check the running time of PSF-MSTL. The experimental results are shown in Table 8. PSF-MSTL has the longest running time because the speed of a high-level-concepts-based algorithm is positively correlated with the number of high-level concepts, and PSF-MSTL has to extract latent information from more latent feature spaces. Nevertheless, the running time of our method remains within acceptable limits.Footnote 5

Table 8 Running time of PSF-MSTL and other compared methods (s)

5.6 Parameter Sensitivity

Here, we analyze the parameter sensitivity of PSF-MSTL. To verify that PSF-MSTL is hardly affected by changing the parameters, we sample over a larger range of both \(k_1\) and \(k_2\). Specifically, we randomly sample ten pairs of \(k_1\) and \(k_2\) with \(k_1 \in [15,25]\) and \(k_2 \in [15,25]\), and randomly choose ten three-source tasks on 20Newsgroups for verification. The experimental results are shown in Table 9. The mean accuracy over the ten parameter pairs is almost equal to the accuracy obtained with the default parameters, and the variance is small. Consequently, PSF-MSTL is not sensitive to the parameters.

Table 9 The parameter influence on performance (%) of algorithm PSF-MSTL

5.7 Algorithm Convergence

In this section, we check the convergence of PSF-MSTL by randomly choosing six three-source tasks on rec vs. sci. The experimental results are shown in Fig. 2. The left y-axis represents the accuracy of PSF-MSTL, the right y-axis represents the objective value in Eq. (4), and the x-axis represents the number of iterations. As the iterations proceed, the accuracy of PSF-MSTL increases while the objective value decreases, and both converge within 100 iterations.

Fig. 2 The performance of PSF-MSTL and objective value vs. the number of iterations

6 Conclusion

In this paper, we propose a novel multi-source transfer learning method called PSF-MSTL. First, our method combines different source domains into various training datasets based on the power set concept, forming a source-domain framework. Second, PSF-MSTL integrates all the training datasets and transfers complementary knowledge from the integral framework using a dual-promotion strategy. To solve the resulting optimization problem, we propose an iterative algorithm based on non-negative matrix tri-factorization. Finally, we conduct extensive multi-source experiments demonstrating that PSF-MSTL outperforms other state-of-the-art multi-source text classification methods.

It is worth mentioning that although PSF-MSTL achieves excellent performance, the parameters of the high-level concepts are set empirically, and the weights of different training datasets default to being equal. In the future, we will explore automatic tuning of the optimal parameters and consider how to weight different types of training datasets in different scenarios to obtain better results.