Exploration of a Rich Feature Set for Automatic Term Extraction

  • Merley S. Conrado
  • Thiago A. S. Pardo
  • Solange O. Rezende
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8265)

Abstract

Despite the importance of term extraction methods and the considerable effort devoted to improving them, they still face four main problems: (i) noise and silence generation; (ii) difficulty dealing with a high number of terms; (iii) the human effort and time required to evaluate the terms; and (iv) still-limited extraction results. In this paper, we address these four problems in automatic term extraction by exploring a rich feature set in a machine learning approach. We minimized these problems and achieved state-of-the-art results for unigrams in Brazilian Portuguese.
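As a rough illustration of what a feature-based approach to unigram term extraction can look like, the sketch below computes a few simple statistical features per candidate word. The abstract does not list the features actually used, so the feature names (tf, df, tfidf), the function name unigram_features, and the toy corpus are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def unigram_features(docs):
    """Compute illustrative per-word features for a tokenized corpus.

    ASSUMPTION: tf, df, and tf-idf are stand-in features chosen for this
    sketch; the paper's actual feature set is not given in the abstract.
    """
    tf = Counter()  # total occurrences of each word in the corpus
    df = Counter()  # number of documents containing each word
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    n_docs = len(docs)
    return {
        word: {
            "tf": tf[word],
            "df": df[word],
            # A word present in every document gets log(1) = 0 here.
            "tfidf": tf[word] * math.log(n_docs / df[word]),
        }
        for word in tf
    }


# Hypothetical toy corpus: three already-tokenized documents.
docs = [
    ["nanotube", "synthesis", "the", "method"],
    ["the", "nanotube", "the", "results"],
    ["the", "method", "of", "analysis"],
]
features = unigram_features(docs)
```

Each candidate word then becomes one instance described by its feature vector, which a supervised classifier could label as term or non-term. A word such as "the", which occurs in every document, receives a tf-idf of zero in this sketch, while a domain word such as "nanotube" scores higher.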

Keywords

Automatic term extraction, classification, machine learning



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Merley S. Conrado (1)
  • Thiago A. S. Pardo (1)
  • Solange O. Rezende (1)
  1. Instituto de Ciências Matemáticas e de Computação - ICMC, Universidade de São Paulo - Campus de São Carlos, São Carlos, Brazil
