Information Retrieval Journal, Volume 19, Issue 1–2, pp 59–99

Biomedical term extraction: overview and a new methodology

  • Juan Antonio Lossio-Ventura
  • Clement Jonquet
  • Mathieu Roche
  • Maguelonne Teisseire
Medical Information Retrieval

Abstract

Terminology extraction is an essential task in domain knowledge acquisition and in information retrieval. It is also a mandatory first step in building or enriching terminologies and ontologies. Existing terminology extraction methods, as often proposed in the literature, combine linguistic and statistical aspects and solve some, but not all, of the problems related to term extraction, e.g. noise, silence, low frequency, large corpora, and the complexity of the multi-word term extraction process. In contrast, we propose a cutting-edge methodology to extract and rank biomedical terms that covers all of these problems. This methodology offers several measures based on linguistic, statistical, graph-based and web aspects. These measures extract and rank candidate terms with excellent precision: we demonstrate that they outperform previously reported precision results for automatic term extraction and work with different languages (English, French, and Spanish). We also demonstrate how using graphs and the web to assess the significance of a term candidate enables us to outperform previously reported precision results. We evaluated our methodology on the biomedical GENIA and LabTestsOnline corpora and compared it with previously reported measures.
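The abstract reports results in terms of precision over ranked candidate terms but does not spell out the computation. As a minimal sketch, assuming the common precision-at-k definition (the share of the top k ranked candidates that appear in a reference term list), the evaluation could look like the following. The function name `precision_at_k`, the case-insensitive matching, and the toy candidate and reference lists are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Sequence


def precision_at_k(ranked_terms: Sequence[str],
                   reference_terms: Iterable[str],
                   k: int) -> float:
    """Share of the top-k ranked candidates found in the reference term set.

    Hypothetical helper: matching is a simple case-insensitive string
    comparison, which is an assumption made for this sketch.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    reference = {term.lower() for term in reference_terms}
    # Count how many of the first k candidates are validated terms.
    hits = sum(1 for term in ranked_terms[:k] if term.lower() in reference)
    # Dividing by k (not by the number of candidates) penalises short lists.
    return hits / k


# Toy data, invented for illustration only.
ranked = ["gene expression", "cell line", "previous study", "transcription factor"]
reference = {"gene expression", "cell line", "transcription factor"}

p_at_2 = precision_at_k(ranked, reference, 2)
p_at_4 = precision_at_k(ranked, reference, 4)
```

Here the first two candidates are both reference terms, while one of the first four ("previous study") is not, so precision drops as k grows. A ranking measure is better, in this sense, when it pushes validated terms toward the top of the list.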

Keywords

Automatic term extraction, Biomedical terminology extraction, Natural language processing, BioNLP, Text mining, Web mining, Graphs

Notes

Acknowledgments

This work was supported in part by the French National Research Agency under the JCJC program, Grant ANR-12-JS02-01001, as well as by the University of Montpellier, CNRS, the IBC of Montpellier project and the FINCyT program, Peru. Finally, the authors thank DIST (Scientific and Technical Information Service) for the acquisition of the Cirad corpus.

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Juan Antonio Lossio-Ventura (1)
  • Clement Jonquet (1)
  • Mathieu Roche (1, 2)
  • Maguelonne Teisseire (1, 2)
  1. LIRMM, CNRS, University of Montpellier, Montpellier, France
  2. UMR TETIS, Cirad, Irstea, AgroParisTech, Montpellier, France
