Language Resources and Evaluation, Volume 47, Issue 1, pp 127–149

A document is known by the company it keeps: neighborhood consensus for short text categorization

  • Gabriela Ramírez-de-la-Rosa
  • Manuel Montes-y-Gómez
  • Thamar Solorio
  • Luis Villaseñor-Pineda
Original Paper


Over the last decades the Web has become the largest repository of digital information. To organize this information, several text categorization methods have been developed, achieving accurate results in most cases and across very different domains. With the recent use of the Internet as a communication medium, short texts such as news, tweets, blogs, and product reviews are becoming more common every day. This setting poses two main challenges. First, because these documents are short, word frequencies are not informative enough, which makes text categorization even more difficult than usual. Second, topics change constantly and quickly, so adequate amounts of training data are often lacking. To deal with these two problems, we consider a text classification method based on the idea that similar documents may belong to the same category. Specifically, we propose a neighborhood consensus classification method that classifies a document by considering its own information as well as the categories assigned to other similar documents from the same target collection. The short texts used in our evaluation are news titles with an average of 8 words. Experimental results are encouraging: they indicate that leveraging information from similar documents helps to improve classification accuracy and that the proposed method is especially useful when labeled training resources are limited.
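The neighborhood idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words vectors, cosine similarity, summed-count class prototypes, the parameters `k` and `lam`, and every function name are assumptions chosen for illustration. Each target text is scored by its own similarity to each class prototype, combined with the average prototype similarity of its nearest neighbors in the same target collection.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a short text (illustrative representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_prototypes(labeled):
    """Sum the term counts of each class's training texts into one prototype."""
    protos = {}
    for text, label in labeled:
        protos.setdefault(label, Counter()).update(vectorize(text))
    return protos


def classify_with_neighbors(targets, protos, k=2, lam=0.5):
    """Label each target text from its own prototype similarity plus the
    average prototype similarity of up to k similar target texts.
    `k` and `lam` are hypothetical parameters, not values from the paper."""
    vecs = [vectorize(t) for t in targets]
    own = [{c: cosine(v, p) for c, p in protos.items()} for v in vecs]
    labels = []
    for i, v in enumerate(vecs):
        # Rank the other target documents by similarity to document i.
        ranked = sorted((j for j in range(len(vecs)) if j != i),
                        key=lambda j: cosine(v, vecs[j]), reverse=True)
        # Keep only neighbors that share at least one term with document i.
        neighbors = [j for j in ranked[:k] if cosine(v, vecs[j]) > 0]
        scores = {}
        for c in protos:
            nb = (sum(own[j][c] for j in neighbors) / len(neighbors)
                  if neighbors else 0.0)
            scores[c] = lam * own[i][c] + (1 - lam) * nb
        labels.append(max(scores, key=scores.get))
    return labels


labeled = [("stock market falls", "business"),
           ("team wins final match", "sports")]
targets = ["market shares drop sharply", "investors fear shares drop",
           "coach praises team", "coach resigns"]
predicted = classify_with_neighbors(targets, build_prototypes(labeled))
```

In this toy collection, "investors fear shares drop" and "coach resigns" share no words with any training title, so their own prototype similarities are all zero; they receive a label only through the neighbor term. Ties (for example, a text with no informative neighbor) are broken arbitrarily by class order in this sketch.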


Short text categorization; Unlabeled information; Prototype-based classification; News titles



Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  • Gabriela Ramírez-de-la-Rosa (1)
  • Manuel Montes-y-Gómez (2)
  • Thamar Solorio (1)
  • Luis Villaseñor-Pineda (2)
  1. Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, USA
  2. Department of Computational Sciences, National Institute for Astrophysics, Optics and Electronics, Puebla, Mexico
