Data Analysis and Bioinformatics

  • Vito Di Gesù
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4815)

Abstract

Data analysis methods and techniques are revisited in the context of biological data sets, with particular emphasis on clustering and data mining. Clustering remains a subject of active research in several fields, including statistics, pattern recognition, and machine learning. Data mining adds to clustering the complications of very large data sets with many attributes of different types, a situation that is typical in biology. Some case studies are also described.

Keywords

Clustering, data mining, bio-informatics, kernel methods, Hidden Markov Models, Multi-Layers Model

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Vito Di Gesù
    • 1
  1. C.I.T.C., Università di Palermo, Italy
