Journal of Intelligent Information Systems, Volume 45, Issue 3, pp 379–396

Optimizing text classification through efficient feature selection based on quality metric

  • Jean-Charles Lamirel
  • Pascal Cuxac
  • Aneesh Sreevallabh Chivukula
  • Kafil Hajlaoui

Abstract

Feature maximization is a cluster quality metric that favors clusters with maximal feature representation with regard to their associated data. In this paper we show that a simple adaptation of this metric yields a highly efficient feature selection and feature contrasting model in the context of supervised classification. The method is evaluated on several types of textual datasets. In all the studied cases it provides a very significant performance increase over state-of-the-art methods, even when a single bag-of-words model is used for data description. Interestingly, the most significant performance gain is obtained when classifying highly unbalanced, highly multidimensional and noisy data with a high degree of similarity between the classes.
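The metric combines, for each feature and each class, a feature recall (the share of the feature's total weight that falls in the class) and a feature precision (the share of the class's total weight carried by the feature) into an F-measure, and keeps features whose best per-class F-measure stands out. The sketch below illustrates this selection idea; the function name, the mean-based threshold and the toy weighting are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def feature_maximization_select(X, y):
    """Illustrative feature-maximization selection (sketch, not the paper's exact rule).

    X: (n_samples, n_features) non-negative feature weights (e.g. TF-IDF).
    y: (n_samples,) class labels.
    Returns the indices of the selected features.
    """
    classes = np.unique(y)
    # Per-class sum of each feature's weight: shape (n_classes, n_features).
    W = np.vstack([X[y == c].sum(axis=0) for c in classes])
    eps = 1e-12
    # Feature recall: fraction of a feature's total weight captured by the class.
    FR = W / (W.sum(axis=0, keepdims=True) + eps)
    # Feature precision: fraction of the class's total weight carried by the feature.
    FP = W / (W.sum(axis=1, keepdims=True) + eps)
    # Feature F-measure: harmonic mean of recall and precision.
    FF = 2 * FR * FP / (FR + FP + eps)
    # Assumed selection rule: keep features whose best per-class F-measure
    # exceeds the mean of the non-zero F-measures.
    best = FF.max(axis=0)
    threshold = FF[FF > 0].mean()
    return np.nonzero(best > threshold)[0]
```

On a toy corpus where two features each dominate one class and a third is spread evenly, the spread-out feature falls below the threshold and is discarded, which is the contrasting effect the abstract refers to.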

Keywords

Feature maximization · Clustering quality index · Feature selection · Supervised learning · Unbalanced data · Text

Acknowledgments

This work was done under the QUAERO program supported by OSEO, the French national agency for research development.


Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Jean-Charles Lamirel (1)
  • Pascal Cuxac (2)
  • Aneesh Sreevallabh Chivukula (3)
  • Kafil Hajlaoui (3)

  1. SYNALP Team - LORIA, INRIA Nancy-Grand Est, Vandoeuvre-les-Nancy, France
  2. INIST-CNRS, Vandoeuvre-les-Nancy, France
  3. Center for Data Engineering, International Institute of Information Technology, Gachibowli, Hyderabad, India
