Abstract
Data visualization plays a crucial role in identifying interesting patterns in exploratory data analysis. Its use is, however, hampered by the large number of possible data projections, each showing a different attribute subset, that the data analyst must evaluate. In this paper, we introduce a method called VizRank, which is applied to class-labeled data to automatically select the most useful data projections. VizRank can be used with any visualization method that maps attribute values to points in a two-dimensional visualization space. It assesses possible data projections and ranks them by their ability to visually discriminate between classes. The quality of class separation is estimated by computing the predictive accuracy of a k-nearest neighbor classifier on the data set consisting of the x and y positions of the projected data points and their class information. The paper introduces the method and presents experimental results showing that VizRank's ranking of projections agrees closely with subjective rankings by data analysts. The practical use of VizRank is also demonstrated by an application in the field of functional genomics.
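The ranking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that projections are plain scatterplots of attribute pairs, that accuracy is estimated by leave-one-out evaluation with majority voting, and that k = 3. The function names `knn_loo_accuracy` and `rank_scatterplots` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def knn_loo_accuracy(points, labels, k=3):
    """Leave-one-out accuracy of a majority-vote k-NN classifier on 2-D points."""
    correct = 0
    for i, (xi, yi) in enumerate(points):
        # Squared Euclidean distance from point i to every other projected point.
        dists = sorted(
            ((xi - xj) ** 2 + (yi - yj) ** 2, labels[j])
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(points)


def rank_scatterplots(rows, labels, k=3):
    """Score every attribute pair as a scatterplot and sort best-first."""
    scored = []
    for a, b in combinations(range(len(rows[0])), 2):
        # The projection: attribute a on the x axis, attribute b on the y axis.
        points = [(row[a], row[b]) for row in rows]
        scored.append((knn_loo_accuracy(points, labels, k), (a, b)))
    # Python's sort is stable, so ties keep their enumeration order.
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Toy data (made up): attributes 0 and 1 separate the classes, attribute 2 does not.
rows = [
    (0.0, 0.1, 0.5), (0.1, 0.0, 0.1), (0.2, 0.2, 0.9), (0.1, 0.2, 0.3),
    (1.0, 0.9, 0.4), (0.9, 1.0, 0.2), (0.8, 0.8, 0.8), (0.9, 0.8, 0.0),
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

ranking = rank_scatterplots(rows, labels)
for accuracy, pair in ranking:
    print(pair, round(accuracy, 2))
```

On this toy data the attribute pair (0, 1) comes first with a leave-one-out accuracy of 1.0. Other visualization methods would only change how `points` is computed; the scoring step stays the same.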
Acknowledgments
The authors wish to thank Uros Petrovic for his help with the analysis of the yeast gene expression data set, and the twelve postgraduate students of the University of Ljubljana for participating in the experiments. We would also like to acknowledge the support of a Program Grant (P2-0209) from the Slovenian Research Agency.
Cite this article
Leban, G., Zupan, B., Vidmar, G. et al. VizRank: Data Visualization Guided by Machine Learning. Data Min Knowl Disc 13, 119–136 (2006). https://doi.org/10.1007/s10618-005-0031-5
DOI: https://doi.org/10.1007/s10618-005-0031-5