Improving Classification of Tweets Using Linguistic Information from a Large External Corpus

  • Hugo Lewi Hammer
  • Anis Yazidi
  • Aleksander Bai
  • Paal Engelstad
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 188)


The bag-of-words representation of documents is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. Improvements might be achieved by expanding the vocabulary with other relevant words, such as synonyms.

In this paper we use word-word co-occurrence information from a large corpus to expand the vocabulary of another corpus consisting of tweets. Several methods for including the co-occurrence information are constructed and tested on the classification of real Twitter data. Our results show that co-occurrence information reduces the number of erroneous classifications by 14%.
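
The general idea of expanding a tweet's vocabulary with co-occurrence information can be illustrated with a minimal sketch. The abstract does not specify the paper's actual methods, so everything below is an assumption made for illustration: document-level co-occurrence counts, the hypothetical helpers `build_cooccurrence` and `expand_bag_of_words`, and the `top_k` and `weight` parameters are not taken from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(corpus):
    """Count how often two distinct words appear in the same document.

    Assumption: co-occurrence is measured per document; the paper may
    use a different window or weighting.
    """
    cooc = defaultdict(Counter)
    for doc in corpus:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def expand_bag_of_words(tweet, cooc, top_k=2, weight=0.5):
    """Return the tweet's bag of words plus down-weighted related words.

    top_k and weight are illustrative parameters, not values from the paper.
    """
    bag = Counter(tweet.lower().split())
    expanded = {w: float(c) for w, c in bag.items()}
    for word in bag:
        # Rank neighbours by count (descending), then alphabetically
        # so that ties are broken deterministically.
        related = sorted(cooc.get(word, {}).items(),
                         key=lambda kv: (-kv[1], kv[0]))
        for other, _ in related[:top_k]:
            if other not in bag:
                expanded[other] = expanded.get(other, 0.0) + weight
    return expanded


# Toy stand-in for the large external corpus.
external_corpus = [
    "the flu virus spreads fast",
    "flu virus symptoms include fever",
    "stock market prices fall",
]
cooc = build_cooccurrence(external_corpus)
features = expand_bag_of_words("caught the flu", cooc, top_k=1)
```

Here the tweet never mentions "virus", yet the expanded representation contains it with a reduced weight because "flu" and "virus" co-occur in the external corpus. A classifier trained on such features could then relate tweets that share no literal terms. In practice one would likely filter stop words before expansion, since frequent words such as "the" also pull in neighbours.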


Keywords: Classification · Co-occurrence information · Text mining · Tweets



Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2017

Authors and Affiliations

  • Hugo Lewi Hammer (1)
  • Anis Yazidi (1)
  • Aleksander Bai (1)
  • Paal Engelstad (1)
  1. Department of Computer Science, Oslo and Akershus University College of Applied Sciences, Oslo, Norway
