Exploration of a Rich Feature Set for Automatic Term Extraction
Conference paper
Abstract
Despite the importance of term extraction methods and the considerable effort devoted to improving them, they still face four main problems: (i) noise and silence generation; (ii) difficulty dealing with a high number of terms; (iii) the human effort and time needed to evaluate the terms; and (iv) still-limited extraction results. In this paper, we address these four problems in automatic term extraction by exploring a rich feature set in a machine learning approach. We minimized these problems and achieved state-of-the-art results for unigrams in Brazilian Portuguese.
Keywords
Automatic term extraction, classification, machine learning
Copyright information
© Springer-Verlag Berlin Heidelberg 2013