Advertisement

Data Mining and Knowledge Discovery

, Volume 32, Issue 1, pp 124–161 | Cite as

BSig: evaluating the statistical significance of biclustering solutions

  • Rui Henriques
  • Sara C. Madeira
Article

Abstract

Statistical evaluation of biclustering solutions is essential to guarantee the absence of spurious relations and to validate the high number of scientific statements inferred from unsupervised data analysis without a proper statistical ground. Most biclustering methods rely on merit functions to discover biclusters with specific homogeneity criteria. However, strong homogeneity does not guarantee the statistical significance of biclustering solutions. Furthermore, although some biclustering methods test the statistical significance of specific types of biclusters, there are no methods to assess the significance of flexible biclustering models. This work proposes a method to evaluate the statistical significance of biclustering solutions. It integrates state-of-the-art statistical views on the significance of local patterns and extends them with new principles to assess the significance of biclusters with additive, multiplicative, symmetric, order-preserving and plaid coherencies. The proposed statistical tests provide the unprecedented possibility to minimize the number of false positive biclusters without incurring on false negatives, and to compare state-of-the-art biclustering algorithms according to the statistical significance of their outputs. Results on synthetic and real data support the soundness and relevance of the proposed contributions, and stress the need to combine significance and homogeneity criteria to guide the search for biclusters.

Keywords

Biclustering Statistical significance Pattern mining 

Notes

Acknowledgements

This work was supported by FCT under the Neuroclinomics2 Project PTDC/EEI-SII/1937/2014, Research Grant SFRH/BD/75924/2011 to RH, INESC-ID plurianual Ref. UID/CEC/50021/2013, and LASIGE Research Unit Ref. UID/CEC/00408/2013.

Compliance with Ethical Standards

Conflicts of interest

The authors declare that they have no conflict of interest.

References

  1. Aggarwal CC, Yu PS (1998) A new framework for itemset generation. In: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, ACM, New York, NY, USA, PODS ’98, pp 18–24, doi: 10.1145/275487.275490
  2. Alzahrani M, Kuwahara H, Wang W, Gao X (2017) Gracob: a novel graph-based constant-column biclustering method for mining growth phenotype data. Bioinformatics. doi: 10.1093/bioinformatics/btx199 Google Scholar
  3. Balakrishnan S, Kolar M, Rinaldo A, Singh A, Wasserman L (2011) Statistical and computational tradeoffs in biclustering. In: NIPS 2011 workshop on computational trade-offs in statistical learning, vol 4Google Scholar
  4. Barkow S, Bleuler S, Prelić A, Zimmermann P, Zitzler E (2006) Bicat: a biclustering analysis toolbox. Bioinformatics 22(10):1282. doi: 10.1093/bioinformatics/btl099 CrossRefGoogle Scholar
  5. Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3):213–246. doi: 10.1023/A:1011429418057 MATHCrossRefGoogle Scholar
  6. Bellay J, Atluri G, Sing TL, Toufighi K, Costanzo M, Ribeiro PSM, Pandey G, Baller J, VanderSluis B, Michaut M, Han S, Kim P, Brown GW, Andrews BJ, Boone C, Kumar V, Myers CL (2011) Putting genetic interactions in context through a global modular decomposition. Genome Res 21(8):1375–1387. doi: 10.1101/gr.117176.110 CrossRefGoogle Scholar
  7. Ben-Dor A, Chor B, Karp R, Yakhini Z (2003) Discovering local structure in gene expression data: the order-preserving submatrix problem. J Comput Biol 10(3–4):373–384. doi: 10.1089/10665270360688075 CrossRefGoogle Scholar
  8. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society Series B (Methodological), pp 289–300, doi: 10.2307/2346101
  9. Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 1165–1188. doi: 10.1214/aos/1013699998
  10. Bolton RJ, Hand DJ, Adams NM (2002) Determining hit rate in pattern search. Springer, Berlin, pp 36–48. doi: 10.1007/3-540-45728-3_4
  11. Brown GW (1947) On small-sample estimation. Ann Math Stat 18(4):582–585MathSciNetMATHCrossRefGoogle Scholar
  12. Califano A, Stolovitzky G, Tu Y (2000) Analysis of gene expression microarrays for phenotype classification. Int Conf Intell Syst Mol Biol 8:75–85Google Scholar
  13. Carmona-Saez P, Chagoyen M, Rodriguez A, Trelles O, Carazo JM, Pascual-Montano A (2006) Integrated analysis of gene expression by association rules discovery. BMC Bioinform 7(1):54. doi: 10.1186/1471-2105-7-54 CrossRefGoogle Scholar
  14. Chen Y, Xu J (2016) Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. J Mach Learn Res 17(1):882–938MathSciNetMATHGoogle Scholar
  15. Cheng Y, Church GM (2000) Biclustering of expression data. Intell Syst Mol BiolPress 8:93–103Google Scholar
  16. DuMouchel W (1999) Bayesian data mining in large frequency tables, with an application to the fda spontaneous reporting system. Am Stat 53(3):177–190. doi: 10.2307/2686093 Google Scholar
  17. DuMouchel W, Pregibon D (2001) Empirical bayes screening for multi-item associations. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’01, pp 67–76, doi: 10.1145/502512.502526
  18. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci 95(25):14,863–14,868CrossRefGoogle Scholar
  19. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11(12):4241–4257. doi: 10.1091/mbc.11.12.4241 CrossRefGoogle Scholar
  20. Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3). doi: 10.1145/1297332.1297338
  21. Gnatyshak D, Ignatov D, Semenov A, Poelmans J (2012) Gaining insight in social networks with biclustering and triclustering. In: Perspectives in business informatics research, LNBIP, vol 128. Springer, Berlin Heidelberg, pp 162–171, doi: 10.1007/978-3-642-33281-4_13
  22. Hämälinen W, Nykänen M (2008) Efficient discovery of statistically significant association rules. In: 2008 Eighth IEEE international conference on data mining (ICDM), pp 203–212. doi: 10.1109/ICDM.2008.144
  23. Henriques R (2016) Learning from high-dimensional data using local descriptive models. PhD thesis, Instituto Superior Tecnico, Universidade de Lisboa, LisboaGoogle Scholar
  24. Henriques R, Madeira S (2014a) Bicspam: flexible biclustering using sequential patterns. BMC Bioinform 15(1):130. doi: 10.1186/1471-2105-15-130 CrossRefGoogle Scholar
  25. Henriques R, Madeira SC (2014b) Bicpam: pattern-based biclustering for biomedical data analysis. Algorithms Mol Biol 9(1):27. doi: 10.1186/s13015-014-0027-z CrossRefGoogle Scholar
  26. Henriques R, Madeira SC (2015) Biclustering with flexible plaid models to unravel interactions between biological processes. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 12(4):738–752. doi: 10.1109/TCBB.2014.2388206 CrossRefGoogle Scholar
  27. Henriques R, Madeira SC (2016a) Bic2pam: constraint-guided biclustering for biological data analysis with domain knowledge. Algorithms Mol Biol 11(1):23. doi: 10.1186/s13015-016-0085-5 CrossRefGoogle Scholar
  28. Henriques R, Madeira SC (2016b) Bicnet: flexible module discovery in large-scale biological networks using biclustering. Algorithms Mol Biol 11(1):1–30. doi: 10.1186/s13015-016-0074-8 CrossRefGoogle Scholar
  29. Henriques R, Antunes C, Madeira SC (2015) A structured view on pattern mining-based biclustering. Pattern Recognit 48(12):3941–3958. doi: 10.1016/j.patcog.2015.06.018 CrossRefGoogle Scholar
  30. Henriques R, Ferreira FL, Madeira SC (2017) Bicpams: software for biological data analysis with pattern-based biclustering. BMC Bioinform 18(1):82. doi: 10.1186/s12859-017-1493-3 CrossRefGoogle Scholar
  31. Hochreiter S, Bodenhofer U, Heusel M et al (2010) Fabia: factor analysis for bicluster acquisition. Bioinformatics 26(12):1520–1527. doi: 10.1093/bioinformatics/btq227 CrossRefGoogle Scholar
  32. Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70MathSciNetMATHGoogle Scholar
  33. Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1):1. doi: 10.1093/nar/gkn923 CrossRefGoogle Scholar
  34. Ihmels J, Bergmann S, Barkai N (2004) Defining transcription modules using large-scale gene expression data. Bioinformatics 20(13):1993. doi: 10.1093/bioinformatics/bth166 CrossRefGoogle Scholar
  35. Jaroszewicz S, Scheffer T (2005) Fast discovery of unexpected patterns in data, relative to a bayesian network. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, ACM, New York, NY, USA, KDD ’05, pp 118–127. doi: 10.1145/1081870.1081887
  36. Karian Z, Dudewicz E (2010) Handbook of fitting statistical distributions with R. Taylor & Francis, Milton ParkMATHCrossRefGoogle Scholar
  37. Kirsch A, Mitzenmacher M, Pietracaprina A, Pucci G, Upfal E, Vandin F (2012) An efficient rigorous approach for identifying statistically significant frequent itemsets. J ACM 59(3):12:1–12:22. doi: 10.1145/2220357.2220359 MathSciNetMATHCrossRefGoogle Scholar
  38. Koyuturk M, Szpankowski W, Grama A (2004) Biclustering gene-feature matrices for statistically significant dense patterns. In: Proceedings. 2004 IEEE computational systems bioinformatics conference (CSB), pp 480–484. doi: 10.1109/CSB.2004.1332467
  39. Lazzeroni L, Owen A (2002) Plaid models for gene expression data. Statistica Sinica 12(1):61–86. http://www.jstor.org/stable/24307036
  40. Lee JD, Sun Y, Taylor JE (2015) Evaluating the statistical significance of biclusters. In: Advances in neural information processing systems 28 (NIPS), Curran Associates, Inc., pp 1324–1332Google Scholar
  41. Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C (2007) A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet 39(10):1235–1244. doi: 10.1038/ng2117 CrossRefGoogle Scholar
  42. Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 1(1):24–45. doi: 10.1109/TCBB.2004.2 CrossRefGoogle Scholar
  43. Madeira SC, Oliveira AL (2007) An efficient biclustering algorithm for finding genes with similar patterns in time-series expression data. In: Asia Pacific bioinformatics conference, pp 67–80Google Scholar
  44. Madeira SC, Teixeira MC, Sa-Correia I, Oliveira AL (2010) Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 7(1):153–165. doi: 10.1109/TCBB.2008.34 CrossRefGoogle Scholar
  45. Mahfouz MA, Ismail MA (2009) Bidens: Iterative density based biclustering algorithm with application to gene expression analysis. Int J Comput Electr Automa Control Inf Eng 3(1):40–46Google Scholar
  46. Mankad S, Michailidis G (2014) Biclustering three-dimensional data arrays with plaid models. J Comput Graph Stat 23(4):943–965. doi: 10.1080/10618600.2013.851608 MathSciNetCrossRefGoogle Scholar
  47. Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of the fourth international conference on knowledge discovery and data mining, AAAI Press, KDD’98, pp 274–278Google Scholar
  48. Mitra S, Banka H (2006) Multi-objective evolutionary biclustering of gene expression data. Pattern Recognit 39(12):2464–2477. doi: 10.1016/j.patcog.2006.03.003 MATHCrossRefGoogle Scholar
  49. Noureen N, Kulsoom N, de la Fuente A, Fazal S, Malik SI (2009) Functional and promoter enrichment based analysis of biclustering algorithms using gene expression data of yeast. In: 2009 IEEE 13th international multitopic conference (INMIC), IEEE, pp 1–6, doi: 10.1109/INMIC.2009.5383144
  50. Ojala M, Vuokko N, Kallio A, Haiminen N, Mannila H (2008) Randomization of real-valued matrices for assessing the significance of data mining results. In: Proceedings of the 2008 SIAM international conference on data mining (SDM), SIAM, vol 8, pp 494–505. doi: 10.1137/1.9781611972788.45
  51. Okada Y, Fujibuchi W, Horton P (2007) A biclustering method for gene expression module discovery using closed itemset enumeration algorithm. IPSJ Trans Bioinform 3(SIG5):183–192. doi: 10.2197/ipsjdc.3.183 Google Scholar
  52. Pio G, Ceci M, D’Elia D, Loglisci C, Malerba D (2012) A novel biclustering algorithm for the discovery of meaningful biological correlations between mirnas and mrnas. EMBnetjournal 18(A). doi: 10.14806/ej.18.A.375
  53. Ramon J, Miettinen P, Vreeken J (2013) Detecting bicliques in gf[q]. In: Proceedings of the European conference on machine learning and knowledge discovery in databases, vol 8188, Springer New York, Inc., New York, NY, USA, ECML PKDD, pp 509–524. doi: 10.1007/978-3-642-40988-2_33
  54. Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM, Hurt EM, Zhao H, Averett L, Yang L, Wilson WH, Jaffe ES, Simon R, Klausner RD, Powell J, Duffey PL, Longo DL, Greiner TC, Weisenburger DD, Sanger WG, Dave BJ, Lynch JC, Vose J, Armitage JO, Montserrat E, López-Guillermo A, Grogan TM, Miller TP, LeBlanc M, Ott G, Kvaloy S, Delabie J, Holte H, Krajci P, Stokke T, Staudt LM (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-b-cell lymphoma. N Engl J Med 346(25):1937–1947. doi: 10.1056/NEJMoa012914 CrossRefGoogle Scholar
  55. Scheffer T (2005) Finding association rules that trade support optimally against confidence. Intell Data Anal 9(4):381–395. doi: 10.1007/3-540-44794-6_35 Google Scholar
  56. Serin A, Vingron M (2011) Debi: Discovering differentially expressed biclusters using a frequent itemset approach. Algorithms Mol Biol 6:1–12. doi: 10.1186/1748-7188-6-18 CrossRefGoogle Scholar
  57. Silberschatz A, Tuzhilin A (1996) What makes patterns interesting in knowledge discovery systems. IEEE Trans Knowl Data Eng 8(6):970–974. doi: 10.1109/69.553165 CrossRefGoogle Scholar
  58. Silverstein C, Brin S, Motwani R (1998) Beyond market baskets: generalizing association rules to dependence rules. Data Min Knowl Discov 2(1):39–68. doi: 10.1023/A:1009713703947 CrossRefGoogle Scholar
  59. Tanay A, Sharan R, Shamir R (2002) Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(suppl1):S136. doi: 10.1093/bioinformatics/18.suppl_1.S136 CrossRefGoogle Scholar
  60. Tavazoie S, Hughes J, Campbell M, Cho R, Church G (1999) Systematic determination of genetic network architecture. Nature Genet 22(3):281–285. doi: 10.1038/10343 CrossRefGoogle Scholar
  61. Wang H, Wang W, Yang J, Yu PS (2002) Clustering by pattern similarity in large data sets. In: Proceedings of the 2002 ACM SIGMOD international conference on management of data, ACM, New York, NY, USA, SIGMOD ’02, pp 394–405. doi: 10.1145/564691.564737
  62. Webb GI (2007) Discovering significant patterns. Mach Learn 68(1):1–33. doi: 10.1007/s10994-007-5006-x CrossRefGoogle Scholar
  63. Yang J, Wang W, Wang H, Yu P (2002) delta-clusters: capturing subspace correlation in a large data set. In: Proceedings 18th international conference on data engineering (ICDE), IEEE, pp 517–528. doi: 10.1109/ICDE.2002.994771
  64. Zhang H, Padmanabhan B, Tuzhilin A (2004) On the discovery of significant statistical quantitative rules. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’04, pp 374–383. doi: 10.1145/1014052.1014094

Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. 1.INESC-ID and DEI, Instituto Superior TécnicoUniversidade de LisboaLisbonPortugal
  2. 2.LASIGE, Faculdade de CiênciasUniversidade de LisboaLisbonPortugal
  3. 3.INESC-IDLisbonPortugal

Personalised recommendations