AutoPCS: A Phrase-Based Text Categorization System for Similar Texts

  • Zhixu Li
  • Pei Li
  • Wei Wei
  • Hongyan Liu
  • Jun He
  • Tao Liu
  • Xiaoyong Du
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5446)

Abstract

Nearly all text classification methods assign texts to predefined categories according to the terms that appear in them. State-of-the-art text classification methods usually take a single word as a term, since this performs well on several well-known datasets; some experts have even pointed out that phrases do not improve classification accuracy, or improve it only marginally. However, we found that this is not always true when categorizing texts about similar topics in the same domain. With words alone, such texts cannot be categorized effectively, since they share nearly the same word set. We therefore hypothesize that results can be improved by also using phrases as terms. To test this hypothesis, we propose our own phrase extraction method and select a suitable feature selection method and classifier through an experimental study on a dataset of paper abstracts in the field of databases. We also develop a system called AutoPCS, which helps experts choose relevant topics for incoming papers from a predefined topic list using only their abstracts.
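
The claim that similar texts share nearly the same word set, yet differ in their phrases, can be illustrated with a minimal sketch. It is not the phrase extraction method proposed in the paper, which the abstract does not describe; contiguous word bigrams are assumed here as a simple stand-in for phrases, and the two example texts and all function names are made up for illustration.

```python
def word_terms(text):
    """Lowercase, whitespace-split words (bag of words)."""
    return text.lower().split()


def phrase_terms(text, n=2):
    """Contiguous word n-grams, assumed here as a stand-in for phrases."""
    words = word_terms(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def jaccard(a, b):
    """Overlap between two term sets: 0.0 is disjoint, 1.0 is identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Two hypothetical texts on different database topics that reuse the same words.
doc_a = "query optimization for stream data processing"
doc_b = "stream query processing for data optimization"

word_overlap = jaccard(word_terms(doc_a), word_terms(doc_b))
phrase_overlap = jaccard(phrase_terms(doc_a), phrase_terms(doc_b))

print(f"word overlap:   {word_overlap:.2f}")    # 1.00: identical word sets
print(f"phrase overlap: {phrase_overlap:.2f}")  # 0.00: no shared bigram
```

With words as terms the two texts are indistinguishable, while with bigrams as terms they have nothing in common, which is the kind of separation a phrase-based representation is meant to exploit.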

Keywords

Text Classification, Phrase-based, BOP, Similar Texts

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Zhixu Li (1, 2)
  • Pei Li (1, 2)
  • Wei Wei (1, 2)
  • Hongyan Liu (3)
  • Jun He (1, 2)
  • Tao Liu (1, 2)
  • Xiaoyong Du (1, 2)

  1. Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China
  2. School of Information, Renmin University of China, Beijing, China
  3. Department of Management Science and Engineering, Tsinghua University, Beijing, China
