Evaluating the Performance of SOBEK Text Mining Keyword Extraction Algorithm

  • Conference paper
Machine Learning and Knowledge Extraction (CD-MAKE 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13480)

Abstract

This article presents a validation study of the algorithm implemented in the text mining tool SOBEK, comparing it with YAKE!, a well-known unsupervised keyword extraction algorithm. Both algorithms identify keywords from single documents using mainly statistical methods, providing context-independent information. The article briefly reviews previous uses of SOBEK in the literature and presents a detailed description of its text mining algorithm. In the validation study, both systems were used to extract keywords from texts belonging to fourteen public text databases, each containing several documents. In general, their performance was found to be equivalent, with each algorithm outperforming the other in some tests and the two reaching similar results in others. Understanding why each algorithm outperformed the other in different circumstances may shed light on the advantages and disadvantages of specific features of keyword extraction methods.
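As an illustration of how two keyword extractors might be compared on annotated documents, the following is a minimal sketch only. The abstract does not state which metrics the study used; precision, recall and F1 against gold keywords are assumed here because they are a common choice for this task. The function names (`prf1`, `mean_f1`) and the toy data are hypothetical and do not come from SOBEK, YAKE! or the paper.

```python
def prf1(extracted, gold):
    """Precision, recall and F1 of extracted keywords against gold keywords.

    Matching is exact after lowercasing and trimming (an assumption;
    real evaluations often also apply stemming).
    """
    extracted_set = {k.strip().lower() for k in extracted}
    gold_set = {k.strip().lower() for k in gold}
    hits = len(extracted_set & gold_set)
    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def mean_f1(run, gold_by_doc):
    """Average F1 over all gold documents.

    Both arguments map a document id to a list of keywords; a document
    missing from `run` counts as an empty extraction.
    """
    scores = [prf1(run.get(doc, []), gold)[2] for doc, gold in gold_by_doc.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Invented toy data: two documents, two hypothetical systems.
gold = {"d1": ["text mining", "keyword extraction"], "d2": ["graph", "concept map"]}
system_a = {"d1": ["text mining", "algorithm"], "d2": ["graph", "concept map"]}
system_b = {"d1": ["text mining", "keyword extraction"], "d2": ["node", "edge"]}

print("system A mean F1:", mean_f1(system_a, gold))
print("system B mean F1:", mean_f1(system_b, gold))
```

On this toy data each system does better on a different document, which is the kind of per-dataset difference the abstract describes.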

Notes

  1. https://hub.docker.com/r/liaad/yake.

  2. http://SOBEK.ufrgs.br/#/.

Author information

Corresponding authors

Correspondence to Eliseo Reategui, Marcio Bigolin, Michel Carniato or Rafael Antunes dos Santos.

Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

Reategui, E., Bigolin, M., Carniato, M., dos Santos, R.A. (2022). Evaluating the Performance of SOBEK Text Mining Keyword Extraction Algorithm. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2022. Lecture Notes in Computer Science, vol 13480. Springer, Cham. https://doi.org/10.1007/978-3-031-14463-9_15

  • DOI: https://doi.org/10.1007/978-3-031-14463-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14462-2

  • Online ISBN: 978-3-031-14463-9

  • eBook Packages: Computer Science, Computer Science (R0)
