1 Introduction

Over the past decade, the development of the Internet and social media has created enormous opportunities for companies of all sizes to interact with customers, advertise products, and conduct business transactions, as well as for individuals to learn more about products from reviews (Blitzer et al. 2012; Das and Chen 2007; Thomas et al. 2006). As a result, sentiment detection methods are becoming increasingly important for automatically analyzing and summarizing the sentiments expressed online.

Because domains vary widely, researchers who build sentiment classification systems need to collect and curate data for the domains they work on. However, labeling the collected data is usually expensive and inefficient. A feasible way to address this label insufficiency problem is to utilize examples from other domains. Depending on the availability of labels on different domains, there are mainly three scenarios in sentiment detection: (1) Labeled examples (with known sentiments) are available from other domains (the source domain), while no labeled example is available on the new domain (the target domain). In this case, the sentiments of the examples in the new domain can be predicted using domain adaptation methods (Blitzer et al. 2012; Huang et al. 2006; Pan et al. 2010). (2) Sufficient unlabeled examples are available in other domains, while a small number of labeled examples can be obtained in the new domain. A classifier can then be trained through self-taught learning (Raina et al. 2007). (3) Abundant labeled examples are available in the source domain, and several labeled examples can also be obtained in the target domain. In this scenario, the classifier can be trained by treating the source domain examples as auxiliary data and training a classifier that is consistent on both the source domain and the target domain.

In this paper, we focus on the third case. If the data distributions of the source and target domains are the same, a classifier can be trained directly on the labeled examples from both domains using methods such as support vector machines (SVM) (Scholkopf and Smola 2002). In practice, however, the distributions of the two domains are usually very different, and directly applying classifiers trained on other domains to the target domain normally leads to poor classification performance (Blum and Chawla 2001; Lafferty et al. 2001; Nigam et al. 2000).

In the Twitter sentiment classification problem, some labeled tweets are normally available in the target domain, while a rich set of labeled "complete" documents is available in the source domain. Since each tweet contains only a limited number of characters, classifiers trained directly on the labeled tweets will not be accurate due to the extreme sparseness of the tweet feature vectors. We therefore consider incorporating the examples from the source domain. However, the distribution of the source domain examples usually deviates considerably from that of the tweets, since the two domains probably cover different topics, and the large difference in feature sparseness between the two domains also poses a challenge. If we simply merge the labeled source and target domain examples and train a classifier on them, the sentiment classification results will suffer.

As another example, in the sentiment detection of product review comments, suppose we have some labeled review comments in the target domain as well as many labeled ones in the source domain. A natural question is whether the labeled review comments for other products can help us understand the review comments on the target product. Different products normally have different characteristics. For example, "sharpness" is a good descriptor for a knife, but it is not a meaningful criterion for evaluating laptops. Hence, a sentiment classifier trained on labeled knife reviews cannot be applied directly to classify the sentiments of laptop review comments.

A number of methods have been developed to deal with this domain difference problem. One of the most successful is structural correspondence learning (SCL) (Ando and Bartlett 2005; Blitzer et al. 2006). In this work, the authors first find a set of pivot features that frequently appear in both the source domain and the target domain. The correlations between the pivot features and the non-pivot features are then modeled through a set of linear classifiers. Hidden patterns underlying these classifiers are treated as the correlations between the different kinds of features. Based on these hidden patterns, another set of features is constructed and appended to the original feature space before the supervised training process. In Huang et al. (2006), the authors assume that the posterior probability is the same for the two domains and that the difference lies only in the data distribution, without considering the labels. Based on this assumption, they model the data distribution difference between domains through kernel mean matching.

These methods are reasonable and handle the distribution difference problem from different perspectives. However, the distribution difference problem persists in these works. For example, in SCL, even after the new features are appended to the original feature space, the distribution difference still exists on the original features, so a classifier trained on the new feature space is still not well suited to the target domain examples. In Huang et al. (2006), the assumption that the conditional probabilities \(P_S(y|{\bf x})\) and \(P_T(y|{\bf x})\), where \({\bf x}\) denotes an instance and \(y\) its label, are the same is too strong. Instead, in this paper, we propose a novel formulation, sentiment detection with auxiliary data (SDAD), which addresses this problem by modeling the joint distribution difference between the domains through kernel density estimation (KDE) (Bishop 2007) and incorporates the source domain examples more naturally into the objective function by reweighting them. The proposed formulation is solved with the bundle method (Smola et al. 2008; Teo et al. 2010). Important properties of the proposed method, such as its convergence rate and time complexity, are analyzed in the "Appendix". The experimental results clearly demonstrate the advantages of the proposed method.

The rest of this paper is organized as follows: Sect. 2 reviews the related work. Section 3 gives the problem statement and presents the proposed method. An extensive set of experiments is reported in Sect. 4. Finally, conclusions are drawn at the end of the paper.

2 Related work

2.1 Sentiment detection

After more than 10 years of development, sentiment detection (Argamon et al. 1998; Kessler et al. 1997; Spertus 1997) has become one of the major subfields of information management (Dimitrova et al. 2002; Hillard et al. 2003; Wilson et al. 2005), especially since 2001. This is mainly due to three reasons (Pang and Lee 2008): (1) the rise of machine learning techniques in natural language processing; (2) the availability of datasets owing to the popularity of the Internet, especially the development of social media; (3) the growing interest in commercial and business intelligence applications in this area. As a result, many approaches (Cardie et al. 2003; Das and Chen 2001; Morinaga et al. 2002; Pang and Lee 2004) have been developed to solve this problem.

In machine learning, sentiment detection can be viewed as a classification or regression problem, which mainly involves two subproblems, i.e., sentiment polarity classification and estimating the degree of positivity. Depending on the domains for which training examples are available, the concrete methods can be divided into two groups: those that deal with a single domain and those that use multiple domains. In the first group (Taboada et al. 2006; Whitelaw et al. 2005; Wiebe and Riloff 2005), training examples are normally available on the target domain. Machine learning methods that have been used to train the classifiers and have shown state-of-the-art performance in these tasks include naive Bayes, maximum entropy, and support vector machines (SVM) (Pang et al. 2002).

The second group, which is also the focus of this paper, considers training examples from several different domains. However, sentiment detection is a highly domain-specific problem, i.e., classifiers trained in one domain do not perform well in others (Blum and Chawla 2001; Lafferty et al. 2001; Nigam et al. 2000). This is mainly because the data distributions in different domains are usually different. For example, "sharpness" is a good word feature for describing knives, but not for evaluating computer products. A common way to deal with this problem is transfer learning (Pan and Yang 2010), which transfers knowledge from the training examples in other domains (the source domain) to the target domain.

However, previous transfer learning methods applied to sentiment detection only deal with the data distribution problem implicitly. For example, in Blitzer et al. (2012), the authors pick some pivot features that appear frequently in both the source domain and the target domain, and then model the correspondences between these pivot features and all the other features. These correlations are treated as new features in the training process (Ando and Bartlett 2005; Blitzer et al. 2006). Their method is reasonable; however, although alleviated, the distribution difference problem still exists, since they only append additional features to the original feature space, and the performance depends on the choice of pivot features. In this paper, we model the joint distribution difference between the target domain and the source domain by kernel density estimation, so that the training examples in the source domain can be better utilized and the problem of picking pivot features can be avoided.

2.2 Transfer learning

In traditional machine learning, such as supervised learning (Duda et al. 2001) and semi-supervised learning (Zhu 2006), a common assumption is that both the labeled and unlabeled data are sampled from the same distribution or lie on the same manifold. When the distribution changes, a new model has to be built, and it would be useful if previously trained models could be reused to guide its construction. This gives rise to transfer learning, a technique that transfers knowledge across domains, tasks, and distributions that are similar but not the same.

An important problem in transfer learning is what kind of knowledge can actually be transferred from the source domain to the target domain. Roughly speaking, the assumptions introduced in previous transfer learning work can be grouped into four categories (Pan and Yang 2010):

  • Feature Representation Transfer. In Argyriou et al. (2007), Dai et al. (2008), Duan et al. (2009), Pan et al. (2012), Raina et al. (2007), and Zhang and Si (2009), the authors assume that there exists a common feature space shared by the source domain examples and the target domain examples, and that this common feature space can serve as a bridge to transfer knowledge from the source domain to the target domain.

  • Parameter Transfer. By assuming shared parameters/hyper-parameters, such as in Gaussian process (GP) models, the authors of Bonilla et al. (2008) and Lawrence and Platt (2004) justify and estimate the parameters shared by the models in the source domain and the target domain.

  • Instance Transfer. Examples in the source domain are selected or reweighted for use in the target domain (Dai et al. 2007; Huang et al. 2006).

  • Relation Transfer. In Davis and Domingos (2009), Mihalkova et al. (2007), and Mihalkova and Mooney (2008), the authors build the relational map between the source and target domains, and relax the i.i.d. assumptions in these two domains.

In this paper, the proposed method is based on instance transfer: it models the distribution difference between the examples in the source domain and the target domain by reweighting the importance of the labeled examples in the source domain. Some previous works, such as Huang et al. (2006), are also devoted to modeling the difference between domains. However, they achieve this goal only indirectly through approximation methods, while the proposed method models the distribution difference directly through kernel density estimation. Furthermore, to simplify their formulations, these previous works assume that the conditional probabilities \(P_S(y|{\bf x})\) and \(P_T(y|{\bf x})\) are the same for both the source domain and the target domain. In this work, by taking advantage of kernel density estimation, we avoid this assumption elegantly.

2.3 Training with auxiliary data

SVM has been a popular machine learning technique for more than ten years. One question related to SVM, as well as to other supervised learning methods, is how training examples from other sources can be used to improve the classification accuracy on the target domain. In Wu and Dietterich (2004), the authors proposed a formulation that incorporates the source domain examples into the training process as follows:

$$ \mathop{\hbox{min}}\limits_{h} \sum\limits_{i=1}^{N^p} L(h({\bf x}_i^p), y_i^p) +\gamma \sum\limits_{i=1}^{N^a} L(h({\bf x}_i^a), y_i^a) + \lambda D(h), $$
(1)

where h is the classifier, \(({\bf x}_i^p, y_i^p)\) denotes the ith training example on the target domain, and \(({\bf x}_i^a, y_i^a)\) refers to the ith auxiliary example. L(•, •) is a predefined loss function, such as the hinge loss, D(h) is a complexity penalty to prevent overfitting, and γ and λ are two trade-off parameters.
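To make the objective concrete, the following is a minimal Python sketch of how objective (1) could be evaluated for a linear model, assuming the hinge loss for L(•, •) and an L2 penalty for D(h); the variable names are illustrative, not those of Wu and Dietterich (2004).

```python
import numpy as np

def auxiliary_objective(w, Xp, yp, Xa, ya, gamma, lam):
    """Value of objective (1) for a linear classifier h(x) = w^T x.

    (Xp, yp): target-domain examples and labels in {-1, +1}
    (Xa, ya): auxiliary (source-domain) examples and labels
    gamma, lam: the two trade-off parameters of Eq. (1)
    """
    hinge = lambda X, y: np.maximum(0.0, 1.0 - y * (X @ w)).sum()  # L as hinge loss
    return hinge(Xp, yp) + gamma * hinge(Xa, ya) + lam * (w @ w)   # D(h) = ||w||^2
```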

The problem with this method is that it incorporates the auxiliary data into the objective function without considering the distribution difference between domains. As suggested by Huang et al. (2006), modeling the difference between the training data distributions of different domains makes the model much more accurate. With the same motivation, in this paper we model the distribution difference between different sentiment detection domains and combine the examples in a more natural way.

2.4 Bundle method

The proposed formulation is a convex optimization problem. In this paper, we propose an optimization algorithm based on the bundle method (Smola et al. 2008; Teo et al. 2010), which has shown superior efficiency and effectiveness over state-of-the-art methods, to solve the proposed formulation. The basic idea of the bundle method is to approximate the objective function J(w), where w is the model parameter, by a set of linear functions. In particular, the objective function is lower bounded as follows:

$$ J({\bf w}) \geq \max\limits_{1\leq i \leq t}\{J({\bf w}_{i-1})+\langle {\bf w}- {\bf w}_{i-1}, {\bf a}_i\rangle\}, $$

where \({\bf w}_i\) denotes the points picked by the bundle method, and \({\bf a}_i\) is the gradient/subgradient at the point \({\bf w}_i\). The bundle method monotonically decreases the gap between J(w) and \(\max_{1\leq i \leq t}\{J({\bf w}_{i-1})+\langle {\bf w}- {\bf w}_{i-1}, {\bf a}_i\rangle\}\), so that the minimum of J(w) can be approximated by the minimum of the piecewise-linear lower bound \(\max_{1\leq i \leq t}\,\{J({\bf w}_{i-1})+\langle {\bf w}- {\bf w}_{i-1}, {\bf a}_i\rangle\}\).

Recent developments of the bundle method (Teo et al. 2010) show that if J(w) itself contains a regularizer, the bundle method is guaranteed to converge to precision \(\epsilon\) in \(O(1/\epsilon)\) steps. In this paper, we adapt the bundle method to solve the proposed problem, and it can also be proven to have an efficient convergence rate.
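To make the procedure concrete, the following is a minimal BMRM-style Python sketch of the regularized variant described in Teo et al. (2010), in which the cutting planes are built on the empirical risk while the regularizer is kept exact. It is an illustration under these assumptions, not the exact algorithm used in this paper.

```python
import numpy as np
from scipy.optimize import minimize

def bundle_minimize(remp_and_grad, dim, lam=1.0, eps=1e-3, max_iter=100):
    """Minimize J(w) = (lam/2)||w||^2 + R_emp(w) with a cutting-plane (bundle) scheme.

    remp_and_grad(w) returns (R_emp(w), a subgradient of R_emp at w).
    """
    w = np.zeros(dim)
    A, B = [], []                                       # planes: R_emp(w) >= <a_i, w> + b_i
    J_best = np.inf
    for t in range(1, max_iter + 1):
        r, a = remp_and_grad(w)
        J_best = min(J_best, 0.5 * lam * w @ w + r)     # exact objective at the iterate
        A.append(a)
        B.append(r - a @ w)                             # offset of the new cutting plane
        At, bt = np.array(A), np.array(B)
        # inner problem: min_w (lam/2)||w||^2 + max_i (<a_i, w> + b_i),
        # solved through its dual over the probability simplex
        def neg_dual(alpha, At=At, bt=bt):
            v = At.T @ alpha
            return 0.5 / lam * (v @ v) - bt @ alpha
        res = minimize(neg_dual, np.ones(t) / t,
                       bounds=[(0.0, None)] * t,
                       constraints={'type': 'eq', 'fun': lambda x: x.sum() - 1.0})
        w = -(At.T @ res.x) / lam
        lower = 0.5 * lam * w @ w + np.max(At @ w + bt) # piecewise-linear lower bound
        if J_best - lower < eps:                        # gap shrinks monotonically
            break
    return w
```

Only the routine `remp_and_grad` is problem specific; the adaptation to the proposed formulation is given in Sect. 3.2.2.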

3 Sentiment detection with auxiliary data

In this section, we first introduce the SDAD problem. Then, an optimization formulation is proposed, which integrates the source domain examples (i.e., auxiliary data) into the objective function in a principled way. This problem is then solved by the bundle method (Smola et al. 2008; Teo et al. 2010). In the "Appendix", we analyze important properties of the proposed method, such as its convergence rate.

3.1 Problem statement

In the proposed problem, we have labeled data from both the source domain and the target domain, where the source domain examples are denoted as: \(({\bf X}, {\bf Y})=\{({\bf x}_1, y_1), ({\bf x}_2, y_2), \ldots, ({\bf x}_n, y_n)\}\) and the target domain examples are referred to as \(({\bf Z}, {\bf Y}^{*})=\{({\bf z}_1, y_1^{*}), ({\bf z}_2, y_2^{*}), \ldots, ({\bf z}_m, y_m^{*})\}, \) where \(y_i \in \{1,-1\}\) and \(y_i^{*} \in \{1,-1\}\) represent the positive and negative attitudes on source and target domains respectively. Without loss of generality, our objective is to train a linear sentiment classifier w based on the labeled examples from both the source domain and the target domain.

3.2 Methodology

3.2.1 Formulation

In this subsection, we propose the formulation of SDAD, which incorporates examples from different domains by modeling the data distribution difference. In particular, the optimization problem of SDAD can be formulated as follows:

$$ \begin{aligned} \mathop{\hbox{min}}\limits_{{\bf w}, \xi_i \geq 0, \xi_j^{*} \geq 0}&\quad \frac{1}{2} \|{\bf w}\|^2 + \frac{C_1}{n}\sum\limits_{i=1}^n \xi_i + \frac{C_2}{m}\sum\limits_{j=1}^m \beta_j \xi_j^{*} \\ s.t.&\quad \forall i \in \{1, 2, \ldots, n\},\quad y_i{\bf w}^T {\bf x}_i \geq 1 - \xi_i,\\ &\quad\forall j \in \{1, 2, \ldots, m\},\quad y_j^{*}{\bf w}^T {\bf z}_j \geq 1 - \xi_j^{*}. \end{aligned} $$
(2)

where \(\beta_j\) refers to \(\frac{Pr_T({\bf x}_j, y_j^{*})}{Pr_S({\bf x}_j, y_j^{*})}\), the ratio between the joint data distribution on the target domain \(Pr_T({\bf x}_j, y_j^{*})\) and that on the source domain \(Pr_S({\bf x}_j, y_j^{*})\) for the source domain example \({\bf x}_j\).
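Note that problem (2) has the form of a weighted soft-margin SVM, so once the weights \(\beta_j\) are known it could, purely for illustration, be handed to an off-the-shelf solver by attaching a per-example slack penalty of \(C_1/n\) to the \(({\bf x}_i, y_i)\) block and \(C_2\beta_j/m\) to the \(({\bf z}_j, y_j^{*})\) block. The sketch below, assuming scikit-learn, is only meant to clarify the structure of (2); the optimization actually used in this paper is described in Sect. 3.2.2.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_weighted_svm(X, y, Z, ystar, beta, C1, C2):
    """Solve problem (2) as a sample-weighted linear SVM (illustrative only)."""
    n, m = X.shape[0], Z.shape[0]
    examples = np.vstack([X, Z])
    labels = np.concatenate([y, ystar])
    # per-example slack penalties: C1/n for (x_i, y_i), (C2/m) * beta_j for (z_j, y_j^*)
    weights = np.concatenate([np.full(n, C1 / n), (C2 / m) * np.asarray(beta)])
    clf = LinearSVC(C=1.0, loss="hinge", dual=True, fit_intercept=False)
    clf.fit(examples, labels, sample_weight=weights)
    return clf.coef_.ravel()   # the linear sentiment classifier w
```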

There are various ways to estimate \(\beta_j\), such as Gaussian mixture models (GMM) (Liu et al. 2002), kernel density estimation (Sheather and Jones 1991), and kernel mean matching (Huang et al. 2006). In our approach, without loss of generality, we adopt kernel density estimation. In particular, we have:

$$ \beta_j = \frac{Pr_T({\bf x}_j, y_j^{*})}{Pr_S({\bf x}_j, y_j^{*})} = \frac{Pr_T({\bf x}_j | y_j^{*}) \times P_T(y_j^{*})}{Pr_S({\bf x}_j | y_j^{*}) \times P_S(y_j^{*})}. $$
(3)

It is clear that \(\frac{P_T(y_j^{*})}{P_S(y_j^{*})}\) represents the ratio between the label probabilities on the two domains, which can be estimated from the labeled examples on both domains. As for \(\frac{Pr_T({\bf x}_j | y_j^{*})}{Pr_S({\bf x}_j | y_j^{*})}\), using kernel density estimation with a Gaussian kernel, it can be estimated as follows:

$$ \frac{Pr_T({\bf x}_j | y_j^{*})}{Pr_S({\bf x}_j | y_j^{*})} \propto \frac{\sum\nolimits_{i=1}^m I_{ij}^{*}exp(-\frac{\|{\bf x}_j-{\bf z}_i\|}{\sigma^2})}{\sum\nolimits_{k=1}^n I_{kj}exp(-\frac{\|{\bf x}_j-{\bf x}_k\|}{\sigma^2})-1}, $$
(4)

where σ is the bandwidth parameter of the Gaussian kernel. \(I_{ij}^{*}\) is an indicator function that equals 1 if \(y_i^{*}\) equals \(y_j\), and 0 otherwise. Similarly, \(I_{kj}\) is an indicator function that equals 1 if \(y_k\) equals \(y_j\), and 0 otherwise. It is clear that if a source domain example is close to the target domain examples, its importance is higher; otherwise, it is down-weighted. In this way, the data distribution of the training examples on the source domain is adjusted to follow the data distribution on the target domain as closely as possible.
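As a concrete illustration, the following Python sketch estimates \(\beta_j\) for every source domain example by combining the empirical label ratio in Eq. (3) with the Gaussian-kernel estimates of Eq. (4); it is a direct transcription of the formulas under the stated proportionality, with the normalizing constant dropped.

```python
import numpy as np

def estimate_beta(X, y, Z, ystar, sigma=1.0):
    """beta_j for each source example (x_j, y_j), following Eqs. (3)-(4).

    X, y     : source-domain examples (n, d) and labels in {-1, +1}
    Z, ystar : target-domain examples (m, d) and labels in {-1, +1}
    sigma    : bandwidth of the Gaussian kernel
    """
    n = X.shape[0]
    beta = np.zeros(n)
    for j in range(n):
        yj = y[j]
        # label ratio P_T(y_j) / P_S(y_j), estimated from empirical label frequencies
        label_ratio = np.mean(ystar == yj) / np.mean(y == yj)
        # Gaussian-kernel sums over same-label examples in each domain (Eq. 4)
        d_t = np.linalg.norm(Z - X[j], axis=1)
        d_s = np.linalg.norm(X - X[j], axis=1)
        num = np.sum((ystar == yj) * np.exp(-d_t / sigma**2))
        den = np.sum((y == yj) * np.exp(-d_s / sigma**2)) - 1.0  # -1 removes x_j's self-match
        beta[j] = label_ratio * num / max(den, 1e-12)
    return beta
```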

Some previous transfer learning works share a similar motivation. However, they only pursue it indirectly, by computing the probability ratios in other ways, such as kernel mean matching. Furthermore, to ease the complexity of their formulations, a common assumption in these works is that the conditional probabilities \(P_T(y_j^{*}|{\bf x}_j)\) and \(P_S(y_j^{*}|{\bf x}_j)\) are the same and that the difference between the domains lies only in their data distributions without considering the labels, which is too strong in most cases. In this paper, instead, we model the joint probability ratio directly through kernel density estimation. This effectively avoids the strong assumption on the conditional probability and directly models the distributions of the two domains, rather than approximating them implicitly.

3.2.2 Efficient optimization

There are several alternatives for solving problem (2) efficiently. Here, an adaptation of the bundle method is used to solve it. The concrete procedure is described in Table 1, where \(R_{emp}({\bf w})= \frac{C_1}{n}\sum\nolimits_{i=1}^n \max\{0, 1-y_i{\bf w}^T {\bf x}_i\} + \frac{C_2}{m}\sum\nolimits_{j=1}^m \beta_j \max\{0, 1-y_j^{*}{\bf w}^T {\bf z}_j\}\) and \(J_t({\bf w})=\frac{1}{2}\|{\bf w}\|^2 + \max_{1\leq i \leq t} \{\langle {\bf w}, {\bf a}_i \rangle + b_i\}\). Since \(R_{emp}({\bf w})\) is non-smooth, when calculating its gradient we use a subgradient instead, which can be computed as:

$$ \partial_{{\bf w}} R_{emp}({\bf w}) = -\frac{C_1}{n} \sum\limits_{i=1}^n I_i^{S} y_i {\bf x}_i -\frac{C_2}{m} \sum\limits_{j=1}^m I_j^{T}\beta_j y_j^{*} {\bf z}_j, $$
(5)

where \(I_i^{S}\) equals 1 if \(y_i{\bf w}^T {\bf x}_i \leq 1\), and 0 otherwise. Similarly, \(I_j^{T}\) equals 1 if \(y_j^{*}{\bf w}^T {\bf z}_j \leq 1\), and 0 otherwise. Important properties of the proposed method are elaborated in the "Appendix".
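For completeness, a small Python sketch of \(R_{emp}({\bf w})\) and the subgradient of Eq. (5) is given below; it follows the formulas as written and is the only problem-specific piece a bundle-style solver such as the sketch in Sect. 2.4 needs.

```python
import numpy as np

def sdad_remp_and_grad(w, X, y, Z, ystar, beta, C1, C2):
    """R_emp(w) and a subgradient (Eq. 5) for problem (2).

    (X, y)     : the examples penalized by C1/n
    (Z, ystar) : the examples penalized by (C2/m) * beta_j
    """
    n, m = X.shape[0], Z.shape[0]
    margin_x = y * (X @ w)
    margin_z = ystar * (Z @ w)
    Is = (margin_x <= 1.0).astype(float)          # I_i^S in Eq. (5)
    It = (margin_z <= 1.0).astype(float)          # I_j^T in Eq. (5)
    remp = (C1 / n) * np.maximum(0.0, 1.0 - margin_x).sum() \
         + (C2 / m) * (beta * np.maximum(0.0, 1.0 - margin_z)).sum()
    grad = -(C1 / n) * (Is * y) @ X - (C2 / m) * (It * beta * ystar) @ Z
    return remp, grad

# usage with the bundle sketch of Sect. 2.4 (names are those of that sketch):
# w = bundle_minimize(lambda w: sdad_remp_and_grad(w, X, y, Z, ystar, beta, C1, C2),
#                     dim=X.shape[1], lam=1.0)
```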

Table 1 Algorithm Description: SDAD

4 Experiments

In this section, we present an extensive set of experimental results to demonstrate the advantages of the proposed method.

4.1 Datasets

We use two datasets in our experiments: the product review dataset and the Twitter dataset.

Product Review Dataset This is a benchmark dataset for sentiment detection (Blitzer et al. 2012), selected from Amazon product reviews for four product types: books, DVDs, electronics, and kitchen appliances. Since each review comes with a 0–5 star rating, reviews with ratings higher than 3 are labeled as positive sentiment and the others as negative. Each domain contains 2,000 reviews, of which 1,000 are positive and the remaining 1,000 are negative. A detailed description of this dataset can be found in Table 2.

Table 2 Dataset description

Twitter Dataset This dataset contains tweets downloaded and labeled throughout October 2010. Tweets with the keywords "software" and "education" are used. When extracting features, we use the same feature space as for the product review dataset. For more details, please refer to Table 2.

For both datasets, tf-idf (normalized term frequency and log inverse document frequency) features (Manning et al. 2008) are extracted, stop words are removed, and the Porter stemmer is applied. Given these two datasets, 14 sentiment detection tasks are created by specifying different combinations of source domain and target domain subsets. Detailed descriptions of these 14 tasks are given in Table 3.
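A minimal sketch of this feature-extraction pipeline is given below, assuming scikit-learn and NLTK; the exact preprocessing used for the reported experiments may differ in details such as tokenization.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
base_analyzer = TfidfVectorizer(stop_words="english").build_analyzer()

def analyzer(doc):
    # tokenize, drop English stop words, then apply the Porter stemmer
    return [stemmer.stem(token) for token in base_analyzer(doc)]

# l2-normalized tf with log idf weighting (scikit-learn defaults)
vectorizer = TfidfVectorizer(analyzer=analyzer)
# X_source = vectorizer.fit_transform(source_texts)   # shared feature space
# X_target = vectorizer.transform(target_texts)       # reused for the target domain
```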

Table 3 Task description

4.2 Methods

We compare the proposed method with the following competitors: SVM on the target domain (SVMT); SVM on both the source domain and the target domain (SVMST); SCL on the target domain (SCLT); and SCL on both the source domain and the target domain (SCLST). In our experiments, we show that the proposed method can also be combined naturally with SCL by treating the whole SCL procedure, except the final training step, as a feature construction process. In particular, the pivot features are chosen on both the source domain and the target domain, and the correlations between the pivot features and the non-pivot features are learned. These correlations are converted into a set of features that are appended to the original feature space, as is done in SCL. In the final training step, we apply SDAD to the examples in this new representation. We name this method SCL-SDAD. A sketch of the feature-construction step is given after this paragraph.
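To clarify the feature-construction step of SCL-SDAD, the following is a simplified Python sketch of the SCL-style procedure described above: pivot features frequent in both domains are selected, a linear predictor of each pivot is learned from the non-pivot features, and a low-rank projection of the stacked predictor weights provides the appended features. Least squares stands in for the pivot predictors here; the pivot selection rule and the dimensionality k are illustrative assumptions.

```python
import numpy as np

def scl_projection(X_src, X_tgt, n_pivots=50, k=10):
    """Learn an SCL-style projection theta from dense feature matrices (illustrative)."""
    X_all = np.vstack([X_src, X_tgt])
    # pivot features: frequent in BOTH domains (minimum of the two document frequencies)
    df_src = (X_src > 0).sum(axis=0)
    df_tgt = (X_tgt > 0).sum(axis=0)
    pivots = np.argsort(np.minimum(df_src, df_tgt))[-n_pivots:]
    non_pivot = np.ones(X_all.shape[1], dtype=bool)
    non_pivot[pivots] = False
    # one linear predictor per pivot, learned from the non-pivot features
    W = []
    for p in pivots:
        target = (X_all[:, p] > 0).astype(float)
        coef, *_ = np.linalg.lstsq(X_all[:, non_pivot], target, rcond=None)
        W.append(coef)
    # the top singular vectors of the stacked weights capture the shared structure
    _, _, Vt = np.linalg.svd(np.array(W), full_matrices=False)
    return Vt[:k], non_pivot

# theta, non_pivot = scl_projection(X_src, X_tgt)
# X_aug = np.hstack([X, X[:, non_pivot] @ theta.T])   # augmented features, then train SDAD
```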

For both the proposed method and the baselines, all parameters are set by five-fold cross-validation. For each experiment, we use all the examples from the source domain and a specified ratio of the target domain examples as the training set, while the rest of the target domain examples are used for testing. The averaged results of 10 independent runs are reported.

4.3 Results and analysis

The sentiment detection results on the product review datasets are reported in Figs. 1 and 2. The results with the product review datasets as the source domains and the Twitter datasets as the target domains are reported in Fig. 3. The sentiment detection results with 90% of the target domain training examples and all of the source domain examples are further reported in Table 4.

Fig. 1

Sentiment detection accuracy, with different training ratios on the target domain and all of the examples on the source domain. The x-axis represents the different training ratios on the target domain, while the y-axis demonstrates the corresponding classification accuracy

Fig. 2

Sentiment detection accuracy, with different training ratios on the target domain and all of the examples on the source domain. The x-axis represents the different training ratios on the target domain, while the y-axis demonstrates the corresponding classification accuracy

Fig. 3

Sentiment detection accuracy, with different training ratios on the target domain and all of the examples on the source domain. The x-axis represents the different training ratios on the target domain, while the y-axis demonstrates the corresponding classification accuracy

Table 4 Sentiment detection accuracies using 90% of the target domain examples and all labeled source domain examples as the training set, with the remaining target domain examples as the testing set

As can be seen from these results, the proposed methods, SDAD and SCL-SDAD, show the best performance in most cases. This is because, by modeling examples on both the source and the target domains, examples from the source domain can be better incorporated to train a good classifier for the target domain. SVMST and SCLST can be considered special cases of SDAD and SCL-SDAD, respectively, with \(\beta_j\) set to 1. It is clear that by computing appropriate \(\beta_j\), the data distribution on the source domain can be tuned to fit that on the target domain.

From Fig. 3, it can be seen that the sentiment detection accuracies on the Twitter dataset are improved by incorporating "complete" examples (in contrast to the short texts in the Twitter dataset) from other domains. These results are also consistent with those of Zhang et al. (2010, 2011), in which the authors improve Twitter classification accuracies by transferring knowledge from labeled webpages.

SVMT and SCLT are two methods that only consider training classifiers on the target domain. The experimental results show that these two methods perform worse than SVMST and SCLST on the first twelve tasks in most cases, while they are very competitive on the last two tasks. This is because the similarities between the domains within the product review dataset are much higher than those between the product review dataset and the Twitter dataset. Therefore, on the first twelve tasks, even without considering the distribution difference between domains, the source domain examples can still be used directly to improve the performance on the target domain. On the last two tasks, however, the domain difference is relatively large, so incorporating the source domain examples without considering this difference will sometimes degrade the classification performance. On these two tasks, the proposed methods can still exploit the source domain examples and show the best performance in most cases.

Comparing SCLT, SCLST, and SCL-SDAD, we conclude that SCL-SDAD performs better than SCLT and SCLST. Although SCLT and SCLST are transfer learning methods, they do not model the distribution difference directly: even after the pivot features are picked and the correlations between pivot and non-pivot features are appended to the original feature space, the distribution difference still exists and deteriorates the performance of the classifier. In contrast, after the pivot selection and correlation learning steps, SCL-SDAD integrates the distribution difference into the objective function and reweights the examples in the source domain in a principled way. Therefore, SCL-SDAD is superior to SCLT and SCLST.

The proposed method has three parameters in total, i.e., σ, \(C_1\), and \(C_2\). To study the robustness of the proposed method (SDAD), parameter sensitivity experiments are conducted by fixing two parameters and varying the remaining one. The experimental results on Task 5 and Task 7 are reported in Fig. 4. These experiments show that the proposed method is relatively robust across different parameter values.

Fig. 4

The parameter sensitivity of the proposed method. The experiments are conducted using all of the source domain examples and 90% of the target domain examples as the training set, with the remaining examples as the testing set. For each experiment, we fix two parameters and tune the other one. The average performance of 10 independent runs is reported

5 Conclusions

Sentiment detection is an important technique for investigating what people think in opinion-rich resources such as online review sites, microblogging sites, and personal blogs. Transfer learning has been used in sentiment detection to transfer knowledge from a source domain with rich labeled information to a target domain. However, most existing transfer learning techniques for sentiment detection simply append correlation features to the original feature space, so the distribution difference problem still exists. Moreover, the commonly used assumption on the conditional probability is too strong for practical applications. To address these problems, this paper presents a new method that directly models the joint distribution difference between domains, together with an efficient optimization method for the proposed formulation. The proposed method is guaranteed to converge within a finite number of steps. An extensive set of experiments clearly demonstrates the advantages of the proposed method over state-of-the-art methods. In the future, we plan to extend the proposed method to multi-task learning, as well as to mood classification, in which more than two classes exist.