Big Text advantages and challenges: classification perspective

  • Marina Sokolova
Trends of Data Science


Big Text, i.e., large repositories of textual data, is a part of Big Data. In total, 80–85% of Big Text comes in unstructured form, with a significant contribution from social media. In this position paper, we discuss the advantages and challenges of Big Text with respect to text classification. We propose a new approach to evaluating the performance of classification algorithms when they are applied to Big Text, namely, using corpora comparison in the evaluation of results. We also discuss the significant increase in texts with comprehensive information and the challenges that Big Text methods face in analyzing such texts.


Big text, Machine learning, Classification, Performance evaluation



The author thanks anonymous reviewers for helpful comments.

Conflict of interest

The author states that there is no conflict of interest.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2017

Authors and Affiliations

  1. School of Epidemiology and Public Health, Ottawa, Canada
