Exploration, Visualization, and Preprocessing of High–Dimensional Data

  • Zhijin Wu
  • Zhiqiang Wu
Part of the Methods in Molecular Biology book series (MIMB, volume 620)


The rapid advances in biotechnology have given rise to a variety of high-dimensional data. Many of these data, including DNA microarray data, mass spectrometry protein data, and high-throughput screening (HTS) assay data, are generated by complex experimental procedures that involve multiple steps such as sample extraction, purification and/or amplification, labeling, fragmentation, and detection. Therefore, the quantity of interest is not directly obtained and a number of preprocessing procedures are necessary to convert the raw data into the format with biological relevance. This also makes exploratory data analysis and visualization essential steps to detect possible defects, anomalies or distortion of the data, to test underlying assumptions and thus ensure data quality. The characteristics of the data structure revealed in exploratory analysis often motivate decisions in preprocessing procedures to produce data suitable for downstream analysis. In this chapter we review the common techniques in exploring and visualizing high–dimensional data and introduce the basic preprocessing procedures.

Key words

Preprocessing exploratory data analysis visualization microarray proteomics 


  1. 1.
    Gentleman, R. and Biocore. geneplotter: Graphics related functions for Bioconductor R package version 1.20.0.Google Scholar
  2. 2.
    Ringnér, M. (2008) What is principal component analysis?. Nat. Biotechnol., 26, 303–304.PubMedCrossRefGoogle Scholar
  3. 3.
    Mutelo, R. M., Woo, W. L., and Dlay, S. S. (2008) Two dimensional principle component analysis of gabor features for face representation and recognition. Communication Systems, Networks and Digital Signal Processing, CNSDSP, p. 457–461.Google Scholar
  4. 4.
    Li, J., Tao, D., Hu, W., and Li, X. (2005) Kernel principle component analysis in pixels clustering. Web Intelligence, 2005. Proceedings, IEEE/WIC/ACM International Conference, 786–789.Google Scholar
  5. 5.
    Lee, J.-K., Kim, K.-H., Kim, T.-Y., and Choi, W.-H. (2003) Nonlinear principle component analysis using local probability. Science and Technology, Proceedings KORUS, 2, 103–107.Google Scholar
  6. 6.
    Shah, M. and Sorensen, D. C. (2005) Principle component analysis and model reduction for dynamical systems with symmetry constraints. Decision and Control on 2005 and 2005 European Control Conference, 2260–2264.Google Scholar
  7. 7.
    Yang, H., Zhang, J. Q., and Wang, B. (2007) Hypercomplex principle component weighted approach to multiplespectral and panchromatic images fusions. Geoscience and Remote Sensing Symposium on IEEE, 3096–3099.Google Scholar
  8. 8.
    Chen, T., Hsu, Y. J., Liu, X., and Zhang, W. (2002) Principle component analysis and its variants for biometrics. Image Processing. Proceedings. 2002 International Conference, 1, 61–64.Google Scholar
  9. 9.
    Friston, K. J., Frith, C. D., Liddle, P. F., and Frackowiak, R. S. (1993) Functional connectivity: The principal-component analysis of large (PET) data sets. J. Cereb. Blood Flow Metab., 13, 5–14.PubMedCrossRefGoogle Scholar
  10. 10.
    Alter, O., Brown, P. O., and Botstein, D. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA, 97, 10101–10106.PubMedCrossRefGoogle Scholar
  11. 11.
    Bolstad, B. M., Irizarry, R. A., Åstrand, M., and Speed, T. P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinfromatics, 19(2), 185–193.CrossRefGoogle Scholar
  12. 12.
    Verhaak, R. G., Sanders, M. A., Bijl, M. A., Delwel, R., Horsman, S., Moorhouse, M. J., van derSpek, P. J., Löwenberg, B., and Valk, P. J. (2006) Heatmapper: Powerful combined visualization of gene expression profile correlations, genotypes, phenotypes and sample characteristics. BMC Bioinformatics, 7, 337.PubMedCrossRefGoogle Scholar
  13. 13.
    Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.PubMedCrossRefGoogle Scholar
  14. 14.
    Kibbey, C. and Calvet, A. (2005) Molecular property explorer: A novel approach to visualizing sar using tree-maps and heatmaps. J. Chem. Inf. Model., 45, 523–532.PubMedCrossRefGoogle Scholar
  15. 15.
    Lee, P. S. and P. North, C. (2005) Visualization of graphs with associated timeseries data. Information Visualization, 2005. INFOVIS 2005. IEEE Symposium, 225– 232.Google Scholar
  16. 16.
    Fisher, D. (2007) Hotmap: Looking at geographic attention. IEEE Trans. Vis. Comput. Graph., 13(6), 1184–1191.PubMedCrossRefGoogle Scholar
  17. 17.
    Podowski, R. M., Miller, B., and Wasserman, W. W. (2006) Visualization of complementary systems biology data with parallel heatmaps. IBM J. Res. Dev., 50(6), 575–581.CrossRefGoogle Scholar
  18. 18.
    Phattarsukol, S. and Muenchaisri, P. (2001) Identifying candidate objects using hierarchical clustering analysis. Software Engineering Conference on APSEC, 381–389.Google Scholar
  19. 19.
    Werle, P., Borsi, H., and Gockenbach, E. (1999) Hierarchical cluster analysis of broadband measured partial discharges as part of a modular structured monitoring system for transformers. High Voltage Engineering, 1999. Eleventh International Symposium, 5, 29–32.Google Scholar
  20. 20.
    Hooper, E. (2007) An intelligent intrusion detection and response system using hybrid ward hierarchical clustering analysis. 2007 International Conference on Multimedia and Ubiquitous Engineering, 1187–1192.Google Scholar
  21. 21.
    Yanagida, R. and Takagi, N. (2005) Consideration on hierarchical cluster analysis based on connecting adjacent hyper-rectangles. 2005 IEEE International Conference on Systems, Man and Cybernetics, 3, 2795– 2800.CrossRefGoogle Scholar
  22. 22.
    Kobayasi, M. (1999) Classification of color combinations based on distance between color distributions. Image Processing, 3, 70–74.Google Scholar
  23. 23.
    Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine, A. J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. U.S.A., 96, 6745–6750.PubMedCrossRefGoogle Scholar
  24. 24.
    Hodge, D., Karim, N., and Reardon, K. F. (2003) Hierarchical cluster analysis to detect coordinated protein expression in metabolically engineered Zymomonas mobilis. Proc. Am. Control Con., 3, 2081–2082.Google Scholar
  25. 25.
    Muzinich, N. (2005) Discovery of prokaryotic relationships through latent structure of correlated nucleotide sequences. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 143.Google Scholar
  26. 26.
    Wang, Y. and Chen, H. (2008) Sex differences in hierarchical clustering of the spontaneous fluctuations in brain resting state. Bioinformatics and Biomedical Engineering on ICBBE, 2087–2090.Google Scholar
  27. 27.
    Liao, W., Chen, H., Yang, Q., and Lei, X. (2008) Analysis of fmri data using improved self-organizing mapping and spatio-temporal metric hierarchical clustering. Medical Imaging, IEEE Transactions, 27(10), 1472–1483.CrossRefGoogle Scholar
  28. 28.
    Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An introduction to cluster analysis, Wiley Series in Probability and Mathematical Statistics. Wiley.Google Scholar
  29. 29.
    Liu, J.-L., Bai, Y., Kang, J., and An, N. (2006) A new approach to hierarchical clustering using partial least squares. 2006 International Conference on Machine Learning and Cybernetics, 1125–1131.Google Scholar
  30. 30.
    Getz, G., Levine, E., and Domany, E. (2000) Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA, 97, 12079–12084.PubMedCrossRefGoogle Scholar
  31. 31.
    Dougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y., Bittner, M., and Trent, J. M. (2002) Inference from clustering with application to gene-expression microarrays. J. Comput. Biol., 9, 105–126.PubMedCrossRefGoogle Scholar
  32. 32.
    Durbin, B. P., Hardin, J. S., Hawkins, D. M., and Rocke, D. M. (2002) A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics, 18(Suppl. 1), S105–S110.Google Scholar
  33. 33.
    Huber, W. von Heydebreck, A. Sueltmann, H. Poustka, A. and Vingron, M. (2003) Parameter estimation for the calibration and variance stabilization of microarray data. Stat. Appl. Genet. Mol. Biol., 2(1), Article 3.Google Scholar
  34. 34.
    Rocke, D. M. and Durbin, B. (2001) A model for measurement error for gene expression arrays. J. Comput. Biol., 8(6), 557–569.PubMedCrossRefGoogle Scholar
  35. 35.
    Wu, Z. and Irizarry, R. A. (2007) A statistical framework for the analysis of microarray probe-level data. Ann. Appl. Stat. 1(2), 333–357.CrossRefGoogle Scholar
  36. 36.
    Naef, F., Hacker, C. R., Patil, N., and Magnasco, M. (2002) Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays. Genome Biol., 3, RESEARCH0018.Google Scholar
  37. 37.
    Naef, F., Lim, D. A., Patil, N., and Magnasco, M. (2002) Dna hybridization to mismatched templates: A chip study. Phys. Rev. E, 65, 040902.Google Scholar
  38. 38.
    Irizarry, R. A., B. Hobbs, F. C., Beaxer-Barclay, Y., Antonellis, K., Scherf, U., and Speed, T. P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264.PubMedCrossRefGoogle Scholar
  39. 39.
    Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003) Summaries of affymetrix GeneChip probe level data. Nucleic Acids Res., 31(4), e15.Google Scholar
  40. 40.
    Wu, Z., Irizarry, R., Gentlemen, R., Martinez-Murillo, F., and Spencer, F. (2004) A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc., 99(468), 909–917.CrossRefGoogle Scholar
  41. 41.
    Johnson, W. E., Li, W., Meyer, C. A., Gottardo, R., Carroll, J. S., Brown, M., and Liu, X. S. (2006) Model-based analysis of tiling-arrays for ChIP-chip. Proc. Natl. Acad. Sci. USA, 103, 12457–12462.PubMedCrossRefGoogle Scholar
  42. 42.
    Kapur, K., Xing, Y., Ouyang, Z., and Wong, W. H. (2007) Exon arrays provide accurate assessments of gene expression. Genome Biol., 8, R82.Google Scholar
  43. 43.
    Li, C. and Wong, W. H. (2001) Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Nat. Acad. Sci. USA, 98, 31–36.PubMedCrossRefGoogle Scholar
  44. 44.
    Calza, S., Valentini, D., and Pawitan, Y. (2008) Normalization of oligonucleotide arrays based on the least-variant set of genes. BMC Bioinformatics, 9, 140.PubMedCrossRefGoogle Scholar
  45. 45.
    Cope, L. M., Irizarry, R. A., Jaffee, H., Wu, Z., and Speed, T. P. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323–331.PubMedCrossRefGoogle Scholar
  46. 46.
    Irizarry, R. A., Wu, Z., and Jaffee, H. A. (2006) Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22, 789–794.PubMedCrossRefGoogle Scholar
  47. 47.
    Kooperberg, C., Fazzio, T. G., Delrow, J. J., and Tsukiyama, T. (2002) Improved background correction for spotted DNA microarrays. J. Comput. Biol., 9(1), 55–66.PubMedCrossRefGoogle Scholar
  48. 48.
    Glish, G. L. and Vachet, R. W. (2003) The basics of mass spectrometry in the twenty-first century. Nat. Rev. Drug Discov., 2, 140–150.PubMedCrossRefGoogle Scholar
  49. 49.
    Baggerly, K. A., Edmonson, S. R., Morris, J. S., and Coombes, K. R. (2004) High-resolution serum proteomic patterns for ovarian cancer detection. Endocr. Relat. Cancer, 11, 583–584.PubMedCrossRefGoogle Scholar
  50. 50.
    Baggerly, K. A., Morris, J. S., and Coombes, K. R. (2004) Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics, 20, 777–785.PubMedCrossRefGoogle Scholar
  51. 51.
    Coombes, K. R., Baggerly, K. A., and Morris, J. S. (2007) chapter Pre-Processing Mass Spectrometry Data, Fundamentals of Data Mining in Genomics and Proteomics, Springer US, 79–102Google Scholar
  52. 52.
    Yasui, Y., Pepe, M., Thompson, M. L., Adam, B. L., Wright, G. L., Qu, Y., Potter, J. D., Winget, M., Thornquist, M., and Feng, Z. (2003) A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics, 4, 449–463.PubMedCrossRefGoogle Scholar
  53. 53.
    Kwon, D., Vannucci, M., Song, J. J., Jeong, J., and Pfeiffer, R. M. (2008) A novel wavelet-based thresholding method for the pre-processing of mass spectrometry data that accounts for heterogeneous noise. Proteomics, 8, 3019–3029.PubMedCrossRefGoogle Scholar
  54. 54.
    Lange, E., Gropl, C., Reinert, K., Kohlbacher, O., and Hildebrandt, A. (2006) High-accuracy peak picking of proteomics data using wavelet techniques. Pac. Symp. Biocomput., 243–254.Google Scholar
  55. 55.
    Li, X., Li, J., and Yao, X. (2007) A wavelet-based data pre-processing analysis approach in mass spectrometry. Comput. Biol. Med., 37, 509–516.PubMedCrossRefGoogle Scholar
  56. 56.
    Coombes, K. R., Tsavachidis, S., Morris, J. S., Baggerly, K. A., Hung, M. C., and Kuerer, H. M. (2005) Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics, 5, 4107–4117.PubMedCrossRefGoogle Scholar
  57. 57.
    Du, P., Kibbe, W. A., and Lin, S. M. (2006) Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics, 22, 2059–2065.PubMedCrossRefGoogle Scholar
  58. 58.
    Cannataro, M. and Veltri, P. (2007) Ms-analyzer: preprocessing and data mining services for proteomics applications on the grid. Concurrency and Computation, 19(15), 2047–2066.CrossRefGoogle Scholar
  59. 59.
    Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y. H., and Zhang, J. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol., 5, R80.Google Scholar

Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Zhijin Wu
    • 1
  • Zhiqiang Wu
    • 2
  1. 1.Center for Statistical SciencesBrown UniversityProvidenceUSA
  2. 2.Department of Electrical EngineeringWright State UniversityDaytonUSA

Personalised recommendations