Abstract
The interaction of multiple genetic variants is a major challenge in developing effective treatment strategies for complex disorders. Identifying the most promising genes improves understanding of the mechanisms underlying a disease, which in turn leads to better diagnostic and therapeutic predictions. Categorizing disease genes into meaningful groups also helps in analyzing correlated phenotypes, which further improves the power to detect disease-associated variants. Because experimental approaches are time-consuming and expensive, computational methods offer an accurate and efficient alternative for analyzing gene–disease associations from the vast amount of publicly available genomic information. Integrating the biological knowledge encoded in genes is necessary both for identifying significant groups of functionally similar genes and for an adequate biological interpretation of the patterns captured by these clusters. The aim of this work is to identify gene clusters by using diverse genomic information rather than a single class of biological data in isolation, and to improve performance through efficient feature selection and edge pruning. An optimized and streamlined procedure based on spectral clustering is proposed for the automatic detection of gene communities through a combination of weighted knowledge fusion, threshold-based edge detection, and entropy-based eigenvector subset selection. The proposed approach is applied to produce communities of genes related to Autism Spectrum Disorder and is compared with standard clustering solutions.
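As a rough illustration of the three ingredients named in the abstract, the following Python sketch fuses several gene-gene similarity matrices with weights, prunes edges below a threshold, computes the spectrum of the normalized graph Laplacian, and ranks eigenvectors by an entropy score. The abstract does not give the authors' formulas, so everything specific here is an assumption made for illustration: the function names, the toy matrices, the weights, the 0.5 threshold, and in particular the histogram-entropy criterion. The final clustering step (for example k-means on the rows of the retained eigenvectors) is omitted.

```python
import numpy as np


def fuse_similarities(matrices, weights):
    """Weighted average of equally shaped gene-gene similarity matrices."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so fused scores stay on the input scale
    stacked = np.stack([np.asarray(m, dtype=float) for m in matrices])
    return np.tensordot(w, stacked, axes=1)


def prune_edges(similarity, threshold):
    """Keep only edges whose fused similarity reaches the threshold."""
    adjacency = np.where(similarity >= threshold, similarity, 0.0)
    np.fill_diagonal(adjacency, 0.0)  # drop self-loops
    return adjacency


def laplacian_spectrum(adjacency):
    """Eigenvalues (ascending) and eigenvectors of L = I - D^-1/2 A D^-1/2."""
    degree = adjacency.sum(axis=1)
    scale = np.zeros_like(degree)
    connected = degree > 0
    scale[connected] = 1.0 / np.sqrt(degree[connected])
    n = len(adjacency)
    laplacian = np.eye(n) - scale[:, None] * adjacency * scale[None, :]
    return np.linalg.eigh(laplacian)


def histogram_entropy(values, bins=11):
    """Shannon entropy (bits) of a fixed-range histogram over [-1, 1].

    Assumed criterion: entries of a unit-norm eigenvector lie in [-1, 1],
    and a vector whose entries fall into few bins gets a low score.
    """
    clipped = np.clip(values, -1.0, 1.0)
    counts, _ = np.histogram(clipped, bins=bins, range=(-1.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def select_eigenvectors(vectors, n_keep, bins=11):
    """Column indices of the n_keep lowest-entropy eigenvectors."""
    entropies = np.array(
        [histogram_entropy(vectors[:, j], bins) for j in range(vectors.shape[1])]
    )
    return np.sort(np.argsort(entropies, kind="stable")[:n_keep])


# Toy data (hypothetical): two similarity sources for four genes.
go_sim = [[1.0, 0.8, 0.2, 0.1],
          [0.8, 1.0, 0.1, 0.2],
          [0.2, 0.1, 1.0, 0.6],
          [0.1, 0.2, 0.6, 1.0]]
pathway_sim = [[1.0, 0.6, 0.0, 0.3],
               [0.6, 1.0, 0.3, 0.0],
               [0.0, 0.3, 1.0, 0.8],
               [0.3, 0.0, 0.8, 1.0]]

fused = fuse_similarities([go_sim, pathway_sim], weights=[3, 1])
adjacency = prune_edges(fused, threshold=0.5)
eigenvalues, eigenvectors = laplacian_spectrum(adjacency)
keep = select_eigenvectors(eigenvectors[:, :3], n_keep=2)
```

In this toy case pruning leaves two disconnected gene pairs, so the Laplacian has a zero eigenvalue of multiplicity two, which is the usual spectral signature of two communities. A histogram entropy is only one possible ranking; it is sensitive to the bin count and to the arbitrary basis returned for repeated eigenvalues, so it should be read as a placeholder for whatever entropy measure the full paper defines.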
Acknowledgements
This study is supported by the Cognitive Science Research Initiative (CSRI) of the Department of Science and Technology (DST), Government of India, as part of the funded Project, SR/CSI/81/2011 at Department of Computer Science, School of Arts and Sciences, Amrita University, Kochi.
Ethics declarations
Conflict of interest
The author states that the present manuscript presents no conflict of interest.
About this article
Cite this article
Sreeja, A., Krishnakumar, U. & Vinayan, K.P. Functional Categorization of Disease Genes Based on Spectral Graph Theory and Integrated Biological Knowledge. Interdiscip Sci Comput Life Sci 11, 460–474 (2019). https://doi.org/10.1007/s12539-017-0279-7
DOI: https://doi.org/10.1007/s12539-017-0279-7