Knowledge and Information Systems, Volume 27, Issue 1, pp 141–161

Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary

  • Tshering Cigay Dorji
  • El-sayed Atlam
  • Susumu Yata
  • Masao Fuketa
  • Kazuhiro Morita
  • Jun-ichi Aoe
Regular Paper

Abstract

Field Association (FA) Terms, words or phrases that serve to identify document fields, are effective in document classification, similar file retrieval and passage retrieval. However, there has been no effective method for extracting and selecting relevant FA Terms to build a comprehensive FA Terms dictionary. This paper presents a new method to extract, select and rank FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules, corpora comparison and modified tf-idf weighting. In an experimental evaluation on 21 fields using 306 MB of domain-specific corpora obtained from English Wikipedia dumps, the method selected up to 2,517 FA Terms (single and compound) per field at a precision of 74–97% and a recall of 65–98%, outperforming traditional methods. The FA Terms dictionary constructed using this method achieved an average accuracy of 97.6% in identifying the fields of 10,077 test documents collected from Wikipedia, the Reuters RCV1 corpus and the 20 Newsgroups data set.
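
The abstract names corpora comparison and tf-idf weighting but does not give the formulas, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes candidate terms have already been extracted (for example by POS pattern rules) and that each document is a list of tokens. The function name `rank_candidate_terms`, the `min_ratio` threshold, the add-one smoothing and the plain tf-idf score are all illustrative assumptions.

```python
import math
from collections import Counter


def rank_candidate_terms(field_docs, reference_docs, min_ratio=2.0):
    """Rank candidate terms of one field (illustrative sketch only).

    field_docs and reference_docs are lists of token lists.
    Step 1 (corpora comparison): keep a term only if its relative
    frequency in the field corpus is at least `min_ratio` times its
    smoothed relative frequency in the reference corpus.
    Step 2 (weighting): score surviving terms with a plain tf-idf,
    standing in for the paper's modified tf-idf, which is not given here.
    """
    field_tf = Counter(tok for doc in field_docs for tok in doc)
    ref_tf = Counter(tok for doc in reference_docs for tok in doc)
    field_total = sum(field_tf.values())
    ref_total = sum(ref_tf.values())

    # Document frequency over both corpora, counting each term once per document.
    all_docs = field_docs + reference_docs
    doc_freq = Counter(tok for doc in all_docs for tok in set(doc))
    n_docs = len(all_docs)

    scored = []
    for term, freq in field_tf.items():
        rel_field = freq / field_total
        # Add-one smoothing so terms unseen in the reference corpus do not divide by zero.
        rel_ref = (ref_tf[term] + 1) / (ref_total + 1)
        if rel_field / rel_ref < min_ratio:
            continue  # not distinctive enough for this field
        idf = math.log(n_docs / doc_freq[term])
        scored.append((term, freq * idf))

    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


if __name__ == "__main__":
    sports = [["goalkeeper", "scores", "goal"],
              ["goalkeeper", "saves", "the", "ball"]]
    finance = [["the", "market", "scores", "gains"],
               ["the", "bank", "raises", "the", "rate"]]
    for term, score in rank_candidate_terms(sports, finance):
        print(term, round(score, 3))
```

In this toy run a common word such as "the" is filtered out by the comparison step because it is relatively more frequent in the reference corpus than in the field corpus. A real system would need far larger corpora and a tuned threshold.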

Keywords

Field Association (FA) Terms; Terms weighting and selection; Document classification; Terminology extraction; Information retrieval

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  • Tshering Cigay Dorji (1)
  • El-sayed Atlam (1)
  • Susumu Yata (1)
  • Masao Fuketa (1)
  • Kazuhiro Morita (1)
  • Jun-ichi Aoe (1)
  1. Department of Information Science and Intelligent Systems, Faculty of Engineering, University of Tokushima, Tokushima, Japan
