Effectively Constructing Reliable Data for Cross-Domain Text Classification

  • Fuzhen Zhuang
  • Qing He
  • Zhongzhi Shi
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 385)


Traditional classification algorithms often fail when the independent and identically distributed (i.i.d.) assumption does not hold, and cross-domain learning has recently emerged to deal with this problem. We observe that although a model trained on the training data may not perform well over all of the test data, it can give much better predictions on the subset of the test data for which its prediction confidence is high. Moreover, because this subset is drawn from the test data set, its distribution may be more similar to that of the test data. In this study, we propose to construct a reliable data set from the test instances predicted with high confidence, and to use this reliable data as training data. Furthermore, we develop an EM algorithm to refine the model trained on the reliable data. Extensive experiments on text classification verify the effectiveness and efficiency of our methods. Notably, the model trained on the reliable data achieves a significant performance improvement over the one trained on the original training data, and our methods outperform all the baseline algorithms.
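
The pipeline outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the base classifier, the confidence threshold, or the form of the EM updates. The sketch therefore assumes a multinomial Naive Bayes classifier, a hypothetical threshold of 0.8, and an EM loop that keeps the reliable labels fixed while soft-labelling the remaining test documents. All function names and the toy documents are invented for illustration.

```python
import math
from collections import Counter

CLASSES = ["sport", "finance"]  # toy label set, assumed for illustration

def train_nb(docs, weights, vocab, alpha=1.0):
    """Multinomial Naive Bayes with Laplace smoothing.
    weights[i] maps class -> weight of docs[i] (1.0 for a hard label)."""
    prior = {c: alpha for c in CLASSES}
    counts = {c: Counter() for c in CLASSES}
    for words, w in zip(docs, weights):
        for c in CLASSES:
            prior[c] += w.get(c, 0.0)
            for t in words:
                counts[c][t] += w.get(c, 0.0)
    z = sum(prior.values())
    model = {}
    for c in CLASSES:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_word = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
        model[c] = (math.log(prior[c] / z), log_word)
    return model

def posterior(model, words):
    """Return class -> P(class | words); out-of-vocabulary words are skipped."""
    scores = {c: lp + sum(lw[t] for t in words if t in lw)
              for c, (lp, lw) in model.items()}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def select_reliable(model, docs, threshold):
    """Split test documents into confidently predicted ones and the rest."""
    reliable, rest = [], []
    for words in docs:
        post = posterior(model, words)
        label = max(post, key=post.get)
        if post[label] >= threshold:
            reliable.append((words, label))
        else:
            rest.append(words)
    return reliable, rest

def em_refine(model, reliable, rest, vocab, iters=5):
    """Assumed EM variant: the E-step soft-labels the remaining documents,
    the M-step retrains on reliable (hard) plus remaining (soft) labels."""
    docs = [d for d, _ in reliable] + rest
    hard = [{y: 1.0} for _, y in reliable]
    for _ in range(iters):
        soft = [posterior(model, d) for d in rest]
        model = train_nb(docs, hard + soft, vocab)
    return model

# Toy source (training) domain and target (test) domain with shifted vocabulary.
source = [(["goal", "match", "team"], "sport"),
          (["team", "coach", "win"], "sport"),
          (["stock", "market", "bank"], "finance"),
          (["bank", "loan", "market"], "finance")]
target = [["team", "match", "league"], ["market", "stock", "fund"],
          ["league", "striker"], ["fund", "bond"]]
vocab = sorted({t for d, _ in source for t in d} | {t for d in target for t in d})

source_model = train_nb([d for d, _ in source],
                        [{y: 1.0} for _, y in source], vocab)
reliable, rest = select_reliable(source_model, target, threshold=0.8)
reliable_model = train_nb([d for d, _ in reliable],
                          [{y: 1.0} for _, y in reliable], vocab)
refined_model = em_refine(reliable_model, reliable, rest, vocab)

for d in rest:
    print(d, posterior(source_model, d), posterior(refined_model, d))
```

In this toy setting the source model is undecided on documents that contain only target-specific words, whereas the model retrained on the reliable subset has seen some of those words and can separate them. Whether the same effect holds on real corpora depends on the classifier and threshold actually used.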


Keywords: Cross-domain Learning, Reliable Data, EM Algorithm



Copyright information

© IFIP International Federation for Information Processing 2012

Authors and Affiliations

  • Fuzhen Zhuang (1)
  • Qing He (1)
  • Zhongzhi Shi (1)
  1. The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
