Abstract
Acronyms are widely used to abbreviate and stress important concepts. Discovering the definitions associated with an acronym is important for supporting language processing and knowledge-related tasks such as information retrieval, ontology mapping and question answering. Acronyms are a very dynamic and unbounded topic that is constantly evolving. Manual attempts to compose a global-scale dictionary of acronym-definition pairs require an overwhelming amount of work and yield limited results. To address these shortcomings, this paper presents an automatic and unsupervised methodology to generate acronyms and extract their potential definitions from the Web. The method has been designed to minimise the set of constraints, offering a domain-independent and partially language-independent solution, and to exploit the Web in order to create large and general acronym-definition sets. Results have been manually evaluated against the largest manually built acronym repository, Acronym Finder. The evaluation shows that the proposed approach improves on the coverage of manual attempts while maintaining high precision.
Cite this article
Sánchez, D., Isern, D. Automatic extraction of acronym definitions from the Web. Appl Intell 34, 311–327 (2011). https://doi.org/10.1007/s10489-009-0197-4
DOI: https://doi.org/10.1007/s10489-009-0197-4