Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary

Dorji, Tshering Cigay; Atlam, El-sayed; Yata, Susumu; Fuketa, Masao; Morita, Kazuhiro; Aoe, Jun-ichi

doi:10.1007/s10115-010-0296-x

Regular Paper
Published: 24 April 2010

Knowledge and Information Systems, Volume 27, pages 141–161 (2011)

Abstract

Field Association (FA) Terms, words or phrases that serve to identify document fields, are effective in document classification, similar file retrieval and passage retrieval. However, there has been no effective method to extract and select relevant FA Terms for building a comprehensive FA Terms dictionary. This paper presents a new method to extract, select and rank FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules, corpora comparison and modified tf-idf weighting. In an experimental evaluation on 21 fields using 306 MB of domain-specific corpora obtained from English Wikipedia dumps, the method selected up to 2,517 FA Terms (single and compound) per field at a precision of 74–97% and a recall of 65–98%, which is better than the traditional methods. The FA Terms dictionary constructed using this method achieved an average accuracy of 97.6% in identifying the fields of 10,077 test documents collected from Wikipedia, the Reuters RCV1 corpus and the 20 Newsgroups data set.
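
The idea of comparing corpora and weighting terms by tf-idf can be illustrated with a minimal Python sketch. It is not the method of the paper: the abstract does not specify the POS pattern rules or the modification to tf-idf, so the sketch skips POS filtering and uses plain tf-idf, treating each field corpus as one document. The function name `rank_field_terms`, the input format and the toy corpora are assumptions made for illustration only.

```python
import math
from collections import Counter


def rank_field_terms(corpora):
    """Rank terms per field with a plain tf-idf score over field corpora.

    corpora: dict mapping field name -> list of tokens (hypothetical format).
    Returns a dict mapping field name -> list of (term, score), best first.
    Assumption: standard tf-idf, not the paper's modified weighting.
    """
    num_fields = len(corpora)
    # Term frequency within each field corpus.
    tf = {field: Counter(tokens) for field, tokens in corpora.items()}
    # Field frequency: the number of field corpora in which each term occurs.
    field_freq = Counter()
    for counts in tf.values():
        field_freq.update(counts.keys())

    ranked = {}
    for field, counts in tf.items():
        scores = {
            term: freq * math.log(num_fields / field_freq[term])
            for term, freq in counts.items()
        }
        # Sort by descending score, then alphabetically for a stable order.
        ranked[field] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked


# Toy corpora, invented for this example.
corpora = {
    "sports": "the striker scored a goal and the goalkeeper missed the ball".split(),
    "finance": "the bank raised the interest rate and the stock fell".split(),
    "medicine": "the doctor gave the patient a vaccine and a dose".split(),
}
ranking = rank_field_terms(corpora)
print(ranking["finance"][:3])
```

Terms that occur in every field corpus, such as "the", receive a score of zero because log(1) = 0, while terms confined to a single field rank highest. This is the intuition behind using corpora comparison to separate field-specific terms from general vocabulary.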

Author information

Authors and Affiliations

Department of Information Science and Intelligent Systems, Faculty of Engineering, University of Tokushima, Minamijosanjima 2-1, Tokushima, 770-8506, Japan
Tshering Cigay Dorji, El-sayed Atlam, Susumu Yata, Masao Fuketa, Kazuhiro Morita & Jun-ichi Aoe

Corresponding author

Correspondence to Tshering Cigay Dorji.

About this article

Cite this article

Dorji, T.C., Atlam, Es., Yata, S. et al. Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary. Knowl Inf Syst 27, 141–161 (2011). https://doi.org/10.1007/s10115-010-0296-x

Received: 19 May 2009
Revised: 11 January 2010
Accepted: 03 April 2010
Published: 24 April 2010
Issue Date: April 2011
DOI: https://doi.org/10.1007/s10115-010-0296-x
