Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary

  • Regular Paper
Knowledge and Information Systems

Abstract

Field Association (FA) Terms, words or phrases that serve to identify document fields, are effective in document classification, similar file retrieval and passage retrieval. However, there has been no effective method for extracting and selecting relevant FA Terms to build a comprehensive dictionary of FA Terms. This paper presents a new method to extract, select and rank FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules, corpora comparison and modified tf-idf weighting. In an experimental evaluation on 21 fields using 306 MB of domain-specific corpora obtained from English Wikipedia dumps, the method selected up to 2,517 FA Terms (single and compound) per field, with precision of 74–97% and recall of 65–98%, outperforming traditional methods. The FA Terms dictionary constructed using this method achieved an average accuracy of 97.6% in identifying the fields of 10,077 test documents collected from Wikipedia, the Reuters RCV1 corpus and the 20 Newsgroups data set.
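The abstract does not give the formula for the modified tf-idf weighting, so the sketch below is only an illustration of the general idea of ranking candidate terms per field. It assumes plain tf-idf with each field corpus treated as one document; the function name `rank_terms_by_field` and the toy corpora are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def rank_terms_by_field(field_corpora):
    """Rank candidate terms per field with plain tf-idf.

    field_corpora maps a field name to a list of candidate terms.
    Assumption: each field corpus counts as one "document"; this is
    standard tf-idf, NOT the paper's modified weighting.
    """
    n_fields = len(field_corpora)
    counts = {field: Counter(terms) for field, terms in field_corpora.items()}

    # Number of fields in which each term occurs at least once.
    field_freq = Counter()
    for term_counts in counts.values():
        field_freq.update(term_counts.keys())

    ranked = {}
    for field, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {
            term: (freq / total) * math.log(n_fields / field_freq[term])
            for term, freq in term_counts.items()
        }
        # Highest score first; ties broken alphabetically.
        ranked[field] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked


# Toy corpora, invented for illustration only.
corpora = {
    "sports": ["goal", "match", "goal", "team", "season"],
    "finance": ["stock", "market", "stock", "season", "bank"],
    "medicine": ["patient", "dose", "patient", "team", "drug"],
}
ranked = rank_terms_by_field(corpora)
print(ranked["sports"][0][0])
```

In this toy setup a term that occurs in every field receives a weight of zero, which mirrors the intuition that a useful FA Term should be concentrated in few fields.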

Author information

Corresponding author

Correspondence to Tshering Cigay Dorji.

About this article

Cite this article

Dorji, T.C., Atlam, Es., Yata, S. et al. Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary. Knowl Inf Syst 27, 141–161 (2011). https://doi.org/10.1007/s10115-010-0296-x

  • DOI: https://doi.org/10.1007/s10115-010-0296-x
