A Comparative Analysis of Balancing Techniques and Attribute Reduction Algorithms

Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 154)

Abstract

In this study we analyze several data balancing techniques and attribute reduction algorithms and their impact on the information retrieval process. Specifically, we study their performance in biomedical text classification using Support Vector Machines (SVMs) with Linear, Radial, Polynomial and Sigmoid kernels. From experiments on the public TREC Genomics 2005 biomedical text corpus we conclude that these techniques are necessary to improve the classification process. The kernels' results improve somewhat when attribute reduction algorithms are used. Moreover, when both balancing techniques and attribute reduction algorithms are applied, the results obtained with oversampling are better than those obtained with subsampling.
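
To make the two balancing strategies named in the abstract concrete, the following is a minimal sketch of random oversampling and random subsampling on a labeled document collection. It is an illustration only: the abstract does not say which variants of these techniques were used, so the random resampling shown here, the function names `oversample` and `subsample`, and the `seed` parameter are all assumptions rather than the authors' method.

```python
import random
from collections import defaultdict


def _group_by_label(docs, labels):
    """Collect documents into one list per class label."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[label].append(doc)
    return groups


def oversample(docs, labels, seed=0):
    """Random oversampling (assumed variant): duplicate randomly chosen
    documents of the smaller classes until every class is as large as
    the largest one."""
    rng = random.Random(seed)
    groups = _group_by_label(docs, labels)
    target = max(len(group) for group in groups.values())
    out_docs, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_docs.extend(group + extra)
        out_labels.extend([label] * target)
    return out_docs, out_labels


def subsample(docs, labels, seed=0):
    """Random subsampling (assumed variant): keep a random subset of each
    class so that every class is as small as the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_label(docs, labels)
    target = min(len(group) for group in groups.values())
    out_docs, out_labels = [], []
    for label, group in groups.items():
        out_docs.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_docs, out_labels
```

Oversampling keeps every original document and adds repeats of the minority class, while subsampling discards majority-class documents. In either case the resampling would normally be applied to the training split only, so that evaluation still reflects the original class distribution.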

Keywords

Biomedical text mining, classification techniques, balanced data, attribute reduction

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Computer Science Dept., Univ. of Vigo, Vigo, Spain
