Classification of Short Texts by Deploying Topical Annotations

  • Daniele Vitale
  • Paolo Ferragina
  • Ugo Scaiella
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7224)


We propose a novel approach to the classification of short texts that rests on two components. The first is the use of recently introduced Wikipedia-based annotators, which detect the main topics of an input text and represent them as Wikipedia pages. The second is a novel classification algorithm that measures the similarity between the input text and each output category using only their annotated topics and the Wikipedia link structure. Our approach waives the common practice of expanding the feature space with new dimensions derived from either explicit or latent semantic analysis. As a consequence, it is simple and maintains a compact, intelligible representation of the output categories. Our experiments show that it is efficient in construction and query time, as accurate as state-of-the-art classifiers (see e.g. Phan et al., WWW ’08), and robust with respect to concept drift and input sources.
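The abstract does not spell out the classification algorithm, so the following is only a minimal sketch of the general idea: a text and each category are reduced to sets of Wikipedia topics, and similarity is computed from the link structure alone. As an assumption, it uses the link-based relatedness measure of Milne and Witten (2008) and a simple "best match, then average" scoring rule; the names `link_relatedness`, `classify`, `inlinks`, and the toy page titles and in-link identifiers are all hypothetical and are not taken from the paper.

```python
import math

def link_relatedness(in_a, in_b, n_pages):
    """Milne-Witten style relatedness of two pages from their in-link sets."""
    common = len(in_a & in_b)
    if common == 0:
        return 0.0  # no shared in-links: treat the pages as unrelated
    big = max(len(in_a), len(in_b))
    small = min(len(in_a), len(in_b))
    denom = math.log(n_pages) - math.log(small)
    if denom <= 0:
        return 0.0  # degenerate case: a page linked from every page
    distance = (math.log(big) - math.log(common)) / denom
    return max(0.0, 1.0 - distance)

def classify(text_topics, category_topics, inlinks, n_pages):
    """Return the category whose topics are most related to the text's topics.

    Assumed scoring rule: for each topic of the text, take its best
    relatedness to any topic of the category, then average over the text.
    """
    if not text_topics:
        return None
    best_category, best_score = None, -1.0
    for category, topics in category_topics.items():
        total = 0.0
        for t in text_topics:
            total += max(
                (link_relatedness(inlinks[t], inlinks[c], n_pages) for c in topics),
                default=0.0,
            )
        score = total / len(text_topics)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Toy data (hypothetical): page title -> ids of pages linking to it.
inlinks = {
    "Football": {1, 2, 3, 4},
    "Goalkeeper": {2, 3, 4, 5},
    "Penalty kick": {2, 3, 5},
    "Stock market": {10, 11, 12},
    "Inflation": {11, 12, 13},
}
categories = {
    "sports": ["Football", "Goalkeeper"],
    "business": ["Stock market", "Inflation"],
}

# Topics that an annotator might return for a short text about a penalty.
label = classify(["Penalty kick"], categories, inlinks, n_pages=1000)
print(label)
```

In this sketch each category is just a short list of topic pages, which mirrors the compact, human-readable category representation the abstract emphasises; no extra feature dimensions are generated.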


Latent Semantic Analysis, Concept Drift, Input Text, Short Text, Output Category
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. Banerjee, S., Ramanathan, K., Gupta, A.: Clustering Short Texts using Wikipedia. In: ACM SIGIR, pp. 787–788 (2007)
  2. Bollegala, D., Matsuo, Y., Ishizuka, M.: Measuring semantic similarity between words using Web Search engines. In: WWW, pp. 757–766 (2007)
  3. Cilibrasi, R., Vitanyi, P.: The Google similarity distance. IEEE Trans. on Knowl. and Data Eng. 19(3), 370–383 (2007)
  4. Ferragina, P., Scaiella, U.: TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In: ACM CIKM, pp. 1625–1628 (2010)
  5. Gabrilovich, E., Markovitch, S.: Feature generation for text categorization using world knowledge. In: Int. Joint Conference on A.I., pp. 1048–1053 (2005)
  6. Gabrilovich, E., Markovitch, S.: Wikipedia-based Semantic Interpretation for Natural Language Processing. J. Artif. Intell. Res. 34, 443–498 (2009)
  7. Genc, Y., Sakamoto, Y., Nickerson, J.V.: Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) FAC 2011. LNCS, vol. 6780, pp. 484–492. Springer, Heidelberg (2011)
  8. Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust Disambiguation of Named Entities in Text. In: EMNLP, pp. 782–792 (2011)
  9. Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of Wikipedia entities in web text. In: ACM KDD, pp. 457–466 (2009)
  10. Medelyan, O., Milne, D., Legg, C., Witten, I.H.: Mining meaning from Wikipedia. Int. J. Hum.-Comput. Stud. 67(9), 716–754 (2009)
  11. Milne, D., Witten, I.H.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: AAAI Workshop on Wikipedia and Artificial Intelligence (2008)
  12. Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. In: WWW, pp. 91–100 (2008)
  13. Sahami, M., Heilman, T.D.: A web-based kernel function for measuring the similarity of short text snippets. In: WWW, pp. 377–386 (2006)
  14. Schlimmer, J.C., Granger, R.H.: Beyond Incremental Processing: Tracking Concept Drift. In: AAAI, pp. 502–507 (1986)
  15. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34, 1–47 (2002)
  16. Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M.: Short text classification in Twitter to improve information filtering. In: ACM SIGIR, pp. 841–842 (2010)
  17. Strube, M., Ponzetto, S.P.: WikiRelate! Computing Semantic Relatedness Using Wikipedia. In: AAAI, pp. 1419–1424 (2006)
  18. Sun, X., Wang, H., Yu, Y.: Towards effective short text deep classification. In: ACM SIGIR, pp. 1143–1144 (2011)
  19. Zelikovitz, S., Hirsh, H.: Improving short-text classification using unlabeled data for classification problems. In: ICML, pp. 1191–1198 (2000)
  20. Zelikovitz, S., Marquez, F.: Transductive Learning for Short-Text Classification problems using Latent Semantic Indexing. IJPRAI 19(2), 146–163 (2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Daniele Vitale (1)
  • Paolo Ferragina (1)
  • Ugo Scaiella (1)
  1. Dipartimento di Informatica, University of Pisa, Italy
