Influence of Stop-Words Removal on Sequence Patterns Identification within Comparable Corpora

  • Daša Munková
  • Michal Munk
  • Martin Vozár
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 231)


Short texts such as advertisements are characterised by a number of slogans, phrases, words, symbols, etc. To improve the quality of textual data, noise must be filtered out from the important data. The aim of this work is to determine to what extent time-consuming data pre-processing is necessary when discovering sequential patterns in English and Slovak advertisement corpora. For this purpose, an experiment focusing on data pre-processing was conducted on these two comparable corpora. We examine to what extent removing stop words influences the quantity and quality of extracted rules. Stop-word removal has no impact on the quantity or quality of extracted rules in either the English or the Slovak advertisement corpus. Only language has a significant impact on the quantity and quality of extracted rules.
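The comparison described above can be illustrated with a minimal sketch: extract ordered word-pair rules from a small corpus once with and once without stop words, then compare the resulting rule sets. Everything in this sketch is an assumption made for illustration, including the toy stop-word list, the three example advertisements, the helper names `remove_stop_words` and `sequence_rules`, and the simplified support-only definition of a rule. It is not the procedure or the data used in the study.

```python
from collections import Counter

# Hypothetical toy stop-word list; not the list used in the study.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "to", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in stop_words]


def sequence_rules(sequences, min_support=0.5):
    """Return {(a, b): support} for ordered pairs where `a` occurs
    somewhere before `b`, keeping pairs found in at least
    `min_support` of the sequences (a simplified stand-in for
    sequence rule analysis)."""
    counts = Counter()
    for seq in sequences:
        pairs = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                pairs.add((a, b))
        counts.update(pairs)  # each pair counted once per sequence
    n = len(sequences)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}


# Made-up advertisement texts, tokenised on whitespace.
ads = [
    "buy the best phone for a low price".split(),
    "buy a phone and save".split(),
    "the best price for the best phone".split(),
]

raw_rules = sequence_rules(ads)
filtered_rules = sequence_rules([remove_stop_words(a) for a in ads])

print(len(raw_rules), "rules with stop words")
print(len(filtered_rules), "rules without stop words")
```

On a toy corpus like this one, the stop words themselves generate extra rules, so the two counts differ. Whether such a difference is statistically meaningful on real corpora is the question the experiment addresses.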


natural language processing, comparable corpora, text mining, data pre-processing, stop words, sequence rule analysis





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Constantine the Philosopher University in Nitra, Nitra, Slovakia
