Improving Classification of Tweets Using Linguistic Information from a Large External Corpus

  • Conference paper
Industrial Networks and Intelligent Systems (INISCOM 2016)

Abstract

The bag-of-words representation of documents is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. Improvements might be achieved by expanding the vocabulary with other relevant words, such as synonyms.

In this paper we use word-word co-occurrence information from a large corpus to expand the vocabulary of another corpus consisting of tweets. We construct several methods for incorporating the co-occurrence information and test them on the classification of real Twitter data. Our results show that using co-occurrence information reduces the number of erroneous classifications by 14%.
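
The general idea of expanding a tweet's bag of words with co-occurrence information can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not specify. The function names, the window size, the `top_k` cut-off, and the `weight` given to added words are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(corpus, window=2):
    """Count how often two distinct words appear within `window` tokens
    of each other in an external corpus (hypothetical helper)."""
    counts = defaultdict(Counter)
    for doc in corpus:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            # Look at the tokens up to `window` positions to either side.
            for other in tokens[max(0, i - window): i + window + 1]:
                if other != word:
                    counts[word][other] += 1
    return counts


def expand_bag_of_words(tweet, counts, top_k=2, weight=0.5):
    """Return the tweet's bag of words plus, for each tweet word, its
    `top_k` most frequent co-occurring words, down-weighted by `weight`
    (both parameters are illustrative assumptions)."""
    bag = Counter(tweet.lower().split())
    expanded = Counter({w: float(c) for w, c in bag.items()})
    for word, c in bag.items():
        neighbours = counts.get(word, Counter())
        for neighbour, _ in neighbours.most_common(top_k):
            expanded[neighbour] += weight * c
    return expanded


if __name__ == "__main__":
    # Toy stand-in for a large external corpus.
    external = ["flu fever cough", "flu fever", "goal match"]
    counts = cooccurrence_counts(external)
    print(expand_bag_of_words("flu season", counts, top_k=1))
```

The expanded vector gives the tweet a nonzero weight on words it never contained, so a classifier can relate it to other tweets that share those words without sharing the original terms.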

Author information

Corresponding author

Correspondence to Hugo Lewi Hammer.

Copyright information

© 2017 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Hammer, H.L., Yazidi, A., Bai, A., Engelstad, P. (2017). Improving Classification of Tweets Using Linguistic Information from a Large External Corpus. In: Maglaras, L., Janicke, H., Jones, K. (eds) Industrial Networks and Intelligent Systems. INISCOM 2016. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 188. Springer, Cham. https://doi.org/10.1007/978-3-319-52569-3_11

  • DOI: https://doi.org/10.1007/978-3-319-52569-3_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-52568-6

  • Online ISBN: 978-3-319-52569-3

  • eBook Packages: Computer Science, Computer Science (R0)
