Cross-Evaluation of Automated Term Extraction Tools by Measuring Terminological Saturation

  • Victoria Kosa
  • David Chaves-Fraga
  • Dmitriy Naumenko
  • Eugene Yuschenko
  • Carlos Badenes-Olmedo
  • Vadim Ermolayev
  • Aliaksandr Birukou
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 826)

Abstract

This paper reports on the cross-evaluation of two software tools for automated term extraction (ATE) from English texts: NaCTeM TerMine and UPM Term Extractor. The objective was to find the software best suited to extracting the bags of terms within our instrumental pipeline for exploring terminological saturation in text document collections in a domain of interest. The choice of these particular tools from among the others available is explained in our review of the related work in ATE. The approach to measuring terminological saturation is based on the THD algorithm developed within the framework of our OntoElect methodology for ontology refinement. The paper presents the suite of instrumental software modules, the experimental workflow, two synthetic and three real document collections, the generated datasets, and the set-up of our experiments. Next, the results of the cross-evaluation experiments are presented, analyzed, and discussed. Finally, the paper offers some conclusions and recommendations on the use of ATE software for measuring terminological saturation in retrospective text document collections.

Keywords

Automated term extraction; Software tool; Experimental Cross-Evaluation; Terminological saturation; Retrospective document collection; OntoElect

Notes

Acknowledgements

The first author is funded by a PhD grant from Zaporizhzhia National University and the Ministry of Education and Science of Ukraine. The research leading to this paper has been done in part in cooperation with the Ontology Engineering Group of the Universidad Politécnica de Madrid within the framework of the FP7 Marie Curie IRSES SemData project (http://www.semdata-project.eu/), grant agreement No. PIRSES-GA-2013-612551. A substantial part of the instrumental software used in the reported experiments has been developed in cooperation with BWT Group. The collection of Springer journal papers dealing with Knowledge Management, including DMKD, has been provided by Springer-Verlag.


Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, Zaporizhzhia National University, Zaporizhzhia, Ukraine
  2. Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
  3. BWT Group, Zaporizhzhia, Ukraine
  4. Springer-Verlag GmbH, Heidelberg, Germany
