Data Mining and Knowledge Discovery, Volume 13, Issue 2, pp 119–136

VizRank: Data Visualization Guided by Machine Learning

  • Gregor Leban
  • Blaž Zupan
  • Gaj Vidmar
  • Ivan Bratko

Abstract

Data visualization plays a crucial role in identifying interesting patterns in exploratory data analysis. Its use is, however, made difficult by the large number of possible data projections, each showing a different attribute subset, that the data analyst must evaluate. In this paper, we introduce a method called VizRank, which is applied to classified data to automatically select the most useful data projections. VizRank can be used with any visualization method that maps attribute values to points in a two-dimensional visualization space. It assesses possible data projections and ranks them by their ability to visually discriminate between classes. The quality of class separation is estimated by computing the predictive accuracy of a k-nearest neighbor classifier on the data set consisting of the x and y positions of the projected data points and their class information. The paper introduces the method and presents experimental results which show that VizRank's ranking of projections agrees closely with subjective rankings by data analysts. The practical use of VizRank is also demonstrated by an application in the field of functional genomics.
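
The ranking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the visualization method is a plain scatterplot (one projection per attribute pair), that accuracy is estimated by leave-one-out evaluation with Euclidean distance, and that k = 3. The function names `knn_accuracy` and `rank_scatterplots` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist


def knn_accuracy(points, labels, k=3):
    """Leave-one-out accuracy of a k-NN classifier on projected 2-D points."""
    correct = 0
    for i, p in enumerate(points):
        # Take the k nearest other points by Euclidean distance to p.
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        # Predict the majority class among those neighbors.
        votes = Counter(labels[j] for j in neighbors)
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(points)


def rank_scatterplots(rows, labels, k=3):
    """Score every attribute pair (one scatterplot each), best first."""
    scored = []
    for a, b in combinations(range(len(rows[0])), 2):
        # The projection: x and y positions of each data point.
        points = [(row[a], row[b]) for row in rows]
        scored.append((knn_accuracy(points, labels, k), (a, b)))
    return sorted(scored, reverse=True)


# Toy data (assumed): attributes 0 and 1 separate the classes,
# attribute 2 is large-scale noise unrelated to the class.
rows = [
    (0.0, 0.1, 0), (0.1, 0.0, 10), (0.2, 0.2, 20), (0.1, 0.2, 30),
    (1.0, 0.9, 1), (0.9, 1.0, 11), (1.0, 1.0, 21), (0.8, 0.9, 31),
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

ranking = rank_scatterplots(rows, labels)
for score, pair in ranking:
    print(f"attributes {pair}: accuracy {score:.2f}")
```

Sorting by score puts the projections with the clearest visual class separation first, which is the ordering an analyst would browse. For methods other than the scatterplot, only the step that computes `points` would change.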

Keywords

data visualization, data mining, visual data mining, machine learning, exploratory data analysis

Notes

Acknowledgments

The authors wish to thank Uros Petrovic for help with the analysis of the yeast gene expression data set and the twelve post-graduate students of the University of Ljubljana for participating in the experiments. We would also like to acknowledge the support of a Program Grant (P2-0209) from the Slovenian Research Agency.

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  • Gregor Leban (1)
  • Blaž Zupan (1, 2)
  • Gaj Vidmar (3)
  • Ivan Bratko (1, 4)

  1. Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
  2. Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, USA
  3. Institute of Biomedical Informatics, University of Ljubljana, Ljubljana, Slovenia
  4. Jozef Stefan Institute, Ljubljana, Slovenia
