Improving Classification of Tweets Using Linguistic Information from a Large External Corpus

Hammer, Hugo Lewi; Yazidi, Anis; Bai, Aleksander; Engelstad, Paal

doi:10.1007/978-3-319-52569-3_11

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 188)

Included in the following conference series:

International Conference on Industrial Networks and Intelligent Systems

Abstract

The bag-of-words representation of documents is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. Improvements might be achieved by expanding the vocabulary with other relevant words, such as synonyms.

In this paper we use word-word co-occurrence information from a large corpus to expand the vocabulary of another corpus consisting of tweets. Several methods for including the co-occurrence information are constructed and tested on the classification of real Twitter data. Our results show that we are able to reduce the number of erroneous classifications by 14% using co-occurrence information.
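
The general idea of vocabulary expansion can be illustrated with a minimal sketch. It is not the authors' method: the window size, the `top_k` cutoff, the `weight` factor, the function names, and the toy corpus are all assumptions chosen for illustration. The sketch counts word-word co-occurrences in a small external corpus and then adds the most frequent co-occurring words, down-weighted, to a short text's bag of words.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(corpus, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def expand_bag_of_words(text, counts, top_k=2, weight=0.5):
    """Bag of words for `text`, plus down-weighted co-occurring words.

    `top_k` and `weight` are hypothetical tuning parameters.
    """
    bag = Counter(text.lower().split())
    expanded = Counter(bag)
    for word, freq in bag.items():
        neighbours = counts.get(word, Counter())
        for neighbour, _ in neighbours.most_common(top_k):
            if neighbour not in bag:
                expanded[neighbour] += weight * freq
    return expanded


# Toy external corpus (an assumption, standing in for a large corpus).
external_corpus = [
    "car engine repair",
    "automobile engine repair",
    "flu fever cough",
]
counts = cooccurrence_counts(external_corpus)

# Two short texts that share no literal words.
bag_a = expand_bag_of_words("car", counts)
bag_b = expand_bag_of_words("automobile", counts)
shared = set(bag_a) & set(bag_b)
print(sorted(shared))
```

In this toy case the plain bags of words for "car" and "automobile" have no terms in common, while the expanded bags overlap on the words that co-occur with both in the external corpus. A classifier trained on the expanded representation could therefore relate the two texts.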

Author information

Authors and Affiliations

Department of Computer Science, Oslo and Akershus University College of Applied Sciences, Oslo, Norway
Hugo Lewi Hammer, Anis Yazidi, Aleksander Bai & Paal Engelstad

Corresponding author

Correspondence to Hugo Lewi Hammer.

Editor information

Editors and Affiliations

School of Computer Science Informatics, De Montfort University, Leicester, United Kingdom
Leandros A. Maglaras
Faculty of Technology, De Montfort University, Leicester, United Kingdom
Helge Janicke
Airbus Group, Cardiff, United Kingdom
Kevin Jones

About this paper

Cite this paper

Hammer, H.L., Yazidi, A., Bai, A., Engelstad, P. (2017). Improving Classification of Tweets Using Linguistic Information from a Large External Corpus. In: Maglaras, L., Janicke, H., Jones, K. (eds) Industrial Networks and Intelligent Systems. INISCOM 2016. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 188. Springer, Cham. https://doi.org/10.1007/978-3-319-52569-3_11

DOI: https://doi.org/10.1007/978-3-319-52569-3_11
Published: 19 January 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52568-6
Online ISBN: 978-3-319-52569-3
eBook Packages: Computer Science, Computer Science (R0)
