Improving Term Extraction by System Combination Using Boosting

  • Jordi Vivaldi
  • Lluís Màrquez
  • Horacio Rodríguez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2167)


Term extraction is the task of automatically detecting, in textual corpora, lexical units that designate concepts in thematically restricted domains (e.g. medicine). Current term extraction systems integrate linguistic and statistical cues to detect terms. The best results have been obtained by combining several simple base term extractors [14]. This paper shows that the combination can be further improved by posing an additional learning problem: how to find the best combination of base term extractors. Empirical results, using AdaBoost in the metalearning step, show that the resulting ensemble outperforms all individual extractors and simple voting schemes, obtaining significantly better accuracy figures at all levels of recall.
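The metalearning idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain discrete AdaBoost with threshold stumps, whereas the paper may use a different variant, and the three binary "extractor votes" per term candidate are hypothetical toy data. Each weak hypothesis looks at the output of a single base extractor, and boosting learns how much weight each one receives.

```python
import math

def stump(x, j, thr, pol):
    """Weak hypothesis: vote `pol` if extractor j's output reaches `thr`, else `-pol`."""
    return pol if x[j] >= thr else -pol

def train_adaboost(X, y, rounds):
    """Discrete AdaBoost; returns a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n                      # uniform weights over candidates
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for j in range(len(X[0])):
            for thr in sorted({x[j] for x in X}):
                for pol in (1, -1):
                    err = sum(wi for wi, x, yi in zip(w, X, y)
                              if stump(x, j, thr, pol) != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        if err >= 0.5:                     # no stump better than chance: stop
            break
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        # Increase the weight of misclassified candidates, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump(x, j, thr, pol))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote: +1 = term, -1 = non-term."""
    score = sum(a * stump(x, j, t, p) for a, j, t, p in ensemble)
    return 1 if score > 0 else -1

# Hypothetical toy data: each row holds the 0/1 votes of three base extractors
# for one term candidate; a candidate is a term (+1) when at least two agree.
X = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
     (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [1 if sum(x) >= 2 else -1 for x in X]

ensemble = train_adaboost(X, y, rounds=3)
predictions = [predict(ensemble, x) for x in X]
```

On this toy data no single extractor labels every candidate correctly, but three boosting rounds select one stump per extractor and the weighted vote fits all eight candidates. With real-valued extractor scores the same code would learn thresholds and unequal weights rather than a plain vote.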


Keywords: Medical Domain; Term Candidate; System Combination; Weak Hypothesis; AdaBoost Algorithm
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


References

  1. Ananiadou, S.: A Methodology for Automatic Term Recognition. In Proceedings of the 15th International Conference on Computational Linguistics, COLING, pages 1034–1038, Kyoto, Japan, 1994.
  2. Abney, S., Schapire, R.E. and Singer, Y.: Boosting Applied to Tagging and PP-attachment. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, EMNLP-VLC, pages 38–45, College Park, MD, 1999.
  3. Bourigault, D.: LEXTER, un Logiciel d’EXtraction de TERminologie. Application à l’acquisition des connaissances à partir de textes. PhD thesis, École des Hautes Études en Sciences Sociales, Paris, 1994.
  4. Carreras, X. and Màrquez, L.: Boosting Trees for Clause Splitting. To appear in Proceedings of the 5th Conference on Computational Natural Language Learning, CoNLL’01, Toulouse, France, 2001.
  5. Daille, B.: Approche mixte pour l’extraction de terminologie: statistique lexicale et filtres linguistiques. PhD thesis, Université Paris VII, 1994.
  6. Escudero, G., Màrquez, L. and Rigau, G.: Boosting Applied to Word Sense Disambiguation. In Proceedings of the 12th European Conference on Machine Learning, ECML, Barcelona, Spain, 2000.
  7. Justeson, J. and Katz, S.: Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text. Natural Language Engineering, 1(1), 1994.
  8. Kageura, K. and Umino, B.: Methods for Automatic Term Recognition: A Review. Terminology, 3(2):259–289, 1996.
  9. Kittler, J., Hatef, M., Duin, R. and Matas, J.: On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–238, 1998.
  10. Magnini, B. and Cavaglia, G.: Integrating Subject Field Codes into WordNet. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, LREC 2000, Athens.
  11. Maynard, D.: Term Recognition Using Combined Knowledge Sources. PhD thesis, Manchester Metropolitan Univ., Faculty of Science and Engineering, 1999.
  12. Schapire, R.E. and Singer, Y.: Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297–336, 1999.
  13. Schapire, R.E. and Singer, Y.: BoosTexter: A Boosting-based System for Text Categorization. Machine Learning, 39(2/3):135–168, 2000.
  14. Vivaldi, J. and Rodríguez, H.: Improving Term Extraction by Combining Different Techniques. In Proceedings of the Workshop on Computational Terminology for Medical and Biological Applications, pages 61–68, Patras, Greece, 2000.
  15. Vivaldi, J.: A Multistrategy Approach to Term Candidate Extraction. PhD thesis (forthcoming). Dep. LSI, Technical University of Catalonia, Barcelona, 2001.
  16. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht, 1998.
  17. Wolpert, D. H.: Stacked Generalization. Neural Networks, Pergamon Press, 5:241–259, 1992.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Jordi Vivaldi (1)
  • Lluís Màrquez (2)
  • Horacio Rodríguez (2)
  1. Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, Barcelona, Catalonia
  2. TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Catalonia
