
Automatic extraction of acronym definitions from the Web

Applied Intelligence

Abstract

Acronyms are widely used to abbreviate and emphasise important concepts. Discovering the definitions associated with an acronym is important for supporting language processing and knowledge-related tasks such as information retrieval, ontology mapping and question answering. Acronyms are a dynamic, unbounded and constantly evolving phenomenon, so manual attempts to compile a global-scale dictionary of acronym-definition pairs require an overwhelming amount of work and yield limited results. To address these shortcomings, this paper presents an automatic and unsupervised methodology to generate acronyms and extract their potential definitions from the Web. The method has been designed to minimise the set of constraints, offering a domain-independent and (partially) language-independent solution, and to exploit the Web in order to create large and general acronym-definition sets. Results have been manually evaluated against the largest manually built acronym repository, Acronym Finder. The evaluation shows that the proposed approach improves on the coverage of manual attempts while maintaining high precision.


Author information

Corresponding author

Correspondence to David Sánchez.

About this article

Cite this article

Sánchez, D., Isern, D. Automatic extraction of acronym definitions from the Web. Appl Intell 34, 311–327 (2011). https://doi.org/10.1007/s10489-009-0197-4

  • DOI: https://doi.org/10.1007/s10489-009-0197-4
