Visual Data Mining, pp. 136–153

Part of the Lecture Notes in Computer Science book series (LNCS, volume 4404)

Visual Methods for Examining SVM Classifiers

  • Doina Caragea
  • Dianne Cook
  • Hadley Wickham
  • Vasant Honavar

Abstract

Support vector machines (SVMs) offer a theoretically well-founded approach to automated learning of pattern classifiers. They have been shown to give highly accurate results in complex classification problems such as gene expression analysis. The SVM algorithm is also quite intuitive, with only a few inputs to vary in the fitting process and several outputs that are interesting to study. For many data mining tasks (e.g., cancer prediction), finding classifiers with good predictive accuracy is important, but understanding the classifier is equally important. By studying the classifier outputs we may be able to produce a simpler classifier, learn which variables are the important discriminators between classes, and find the samples that are problematic for the classification. Visual methods for exploratory data analysis can help us study these outputs and complement automated classification algorithms in data mining. We present the use of tour-based methods to plot aspects of the SVM classifier. This approach provides insights into the cluster structure in the data, the nature of the boundaries between clusters, and problematic outliers. Furthermore, tours can be used to assess variable importance. We show how visual methods can be used as a complement to cross-validation methods in order to find good SVM input parameters for a particular data set.
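
As a rough illustration of the idea behind tour-based methods, a tour shows a sequence of low-dimensional projections of high-dimensional data. The sketch below is an assumption-laden toy, not the chapter's implementation: it draws a few independent random orthonormal 2-D frames and projects a small, made-up data set onto each. The function names `random_projection_frame` and `project` and the data values are hypothetical; tour software typically interpolates smoothly between such frames rather than jumping from one to the next.

```python
import math
import random

def random_projection_frame(p, d=2, rng=None):
    """Return d orthonormal vectors in R^p (requires d <= p), built by Gram-Schmidt."""
    rng = rng or random.Random()
    frame = []
    while len(frame) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Remove the components along the vectors already in the frame.
        for u in frame:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # retry in the unlikely degenerate case
            frame.append([a / norm for a in v])
    return frame

def project(points, frame):
    """Project each p-dimensional point onto the frame, giving d coordinates."""
    return [[sum(x * u for x, u in zip(pt, vec)) for vec in frame] for pt in points]

# Hypothetical data with four variables (values invented for illustration).
points = [
    [1.0, 0.2, 3.1, 0.5],
    [0.9, 0.1, 2.8, 0.7],
    [4.2, 1.9, 0.3, 2.2],
    [3.8, 2.1, 0.1, 2.0],
]

rng = random.Random(0)
# Three independent 2-D views of the same data; each could be drawn as a scatterplot.
views = [project(points, random_projection_frame(4, 2, rng)) for _ in range(3)]
```

Each entry of `views` holds the 2-D coordinates of every point under one projection; plotting them in turn, colored by class or by SVM prediction, gives a crude stand-in for watching a tour.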

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Doina Caragea (1)
  • Dianne Cook (2)
  • Hadley Wickham (2)
  • Vasant Honavar (3)

  1. Dept. of Computing and Information Sciences, Kansas State University, Manhattan, USA
  2. Dept. of Statistics, Iowa State University, Ames, USA
  3. Dept. of Computer Science, Iowa State University, Ames, USA