A Comparison of Pre-processing Techniques for Twitter Sentiment Analysis

  • Dimitrios Effrosynidis
  • Symeon Symeonidis
  • Avi Arampatzis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10450)


Pre-processing is considered the first step in text classification, and choosing the right pre-processing techniques can improve classification effectiveness. We experimentally compare 15 commonly used pre-processing techniques on two Twitter datasets. We employ three machine learning algorithms (Linear SVC, Bernoulli Naïve Bayes, and Logistic Regression) and report the classification accuracy and the resulting number of features for each pre-processing technique. Finally, we categorize the techniques according to their performance. We find that techniques such as stemming, removing numbers, and replacing elongated words improve accuracy, while others, such as removing punctuation, do not.
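To make the kind of technique compared here concrete, the following is a minimal sketch of three of the named steps (removing numbers, replacing elongated words, removing punctuation) together with a count of the resulting features. It is not the authors' implementation: the function names, the regular expressions, the rule of collapsing a run of three or more identical characters to two, the whitespace tokenizer, and the toy tweets are all assumptions made for illustration. Stemming and the three classifiers are omitted to keep the example within the standard library.

```python
import re
import string


def remove_numbers(text):
    # Delete every run of digits (assumed rule; other variants are possible).
    return re.sub(r"\d+", "", text)


def replace_elongated(text):
    # Simplified rule: collapse a character repeated 3+ times to 2 repeats,
    # e.g. "coool" -> "cool". A dictionary lookup would be more accurate.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def remove_punctuation(text):
    # Strip all ASCII punctuation characters.
    return text.translate(str.maketrans("", "", string.punctuation))


def vocabulary_size(tweets):
    # Number of distinct lowercase whitespace tokens, a stand-in for the
    # "number of features" of a bag-of-words representation.
    return len({token for tweet in tweets for token in tweet.lower().split()})


# Hypothetical toy tweets, not taken from the paper's datasets.
tweets = [
    "I looove this phone!!!",
    "Worst service in 2017 :(",
    "I love this phone",
]

techniques = {
    "baseline": lambda t: t,
    "remove numbers": remove_numbers,
    "replace elongated": replace_elongated,
    "remove punctuation": remove_punctuation,
}

for name, technique in techniques.items():
    processed = [technique(t) for t in tweets]
    print(f"{name}: {vocabulary_size(processed)} features")
```

Applying each technique in isolation and recording the feature count mirrors the per-technique reporting described above; measuring accuracy would additionally require a labeled dataset and a classifier.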


Keywords: Sentiment analysis, Text pre-processing, Machine learning, Text classification



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Dimitrios Effrosynidis (1), email author
  • Symeon Symeonidis (1)
  • Avi Arampatzis (1)
  1. Database and Information Retrieval Research Unit, Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece
