Methods for automatic term recognition in domain-specific text collections: A survey
Abstract
Applications that process domain-specific text often rely on glossaries and ontologies, and the main step in constructing such resources is term recognition. This paper surveys existing definitions of the notion of a term and its linguistic features, formulates the task of term recognition, and analyzes currently available methods for automatic term recognition: methods for collecting term candidates, methods based on statistics and on the contexts of term occurrences, methods using topic models, and methods based on external resources (such as text collections from other domains, ontologies, and Wikipedia). The paper also gives an overview of standard methodologies and datasets for experimental research.
Keywords
Semantic Relatedness; Topic Model; Term Frequency; Term Candidate; Computational Linguistics