Boolean Property Encoding for Local Set Pattern Discovery: An Application to Gene Expression Data Analysis

  • Ruggero G. Pensa
  • Jean-François Boulicaut
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3539)


In the domain of gene expression data analysis, several researchers have recently emphasized the promising application of local pattern discovery techniques (e.g., association rules, closed sets) to boolean matrices that encode gene properties. Detecting local patterns by means of complete constraint-based mining techniques turns out to be an important complement to heuristic global model mining. To get the most from local set pattern mining approaches, a necessary step is the encoding of gene expression properties (e.g., over-expression). This preprocessing phase has a crucial impact on both the quantity and the quality of the extracted patterns. In this paper, we study the impact of discretization techniques through a sound comparison between dendrograms, i.e., the trees generated by a hierarchical clustering algorithm, built on the raw numerical expression data and on the various boolean matrices derived from it. Thanks to a new similarity measure, we can select the boolean property encoding technique that preserves the similarity structures holding in the raw data. The discussion relies on several experimental results for three gene expression data sets. We believe our framework is an interesting direction of work for the many application domains in which (a) local set patterns have proved useful, and (b) boolean properties have to be derived from raw numerical data.
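
The workflow outlined in the abstract (encode, cluster both representations, compare the trees) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the thresholding rule in `encode_over_expression`, the use of average linkage with Euclidean distance, the toy matrix, and above all the score, which is a simple shared-cluster fraction used as a stand-in and not the similarity measure proposed in the paper.

```python
from itertools import combinations


def encode_over_expression(matrix, fraction):
    # Hypothetical encoding rule: a gene (row) is "over-expressed" in a sample
    # when its value reaches `fraction` of that gene's maximum.
    return [[1 if v >= fraction * max(row) else 0 for v in row] for row in matrix]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dendrogram_clusters(rows):
    # Average-linkage agglomerative clustering. Returns the set of clusters
    # (frozensets of row indices) created by the successive merges.
    # Ties are broken by enumeration order, so tied trees are arbitrary.
    clusters = [frozenset([i]) for i in range(len(rows))]
    merged = set()

    def linkage(c1, c2):
        total = sum(euclidean(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merged.add(a | b)
    return merged


def shared_cluster_fraction(raw_rows, bool_rows):
    # Stand-in score: fraction of the raw dendrogram's internal clusters
    # (root excluded, since it is always shared) found in the boolean one.
    root = frozenset(range(len(raw_rows)))
    raw_tree = dendrogram_clusters(raw_rows) - {root}
    bool_tree = dendrogram_clusters(bool_rows) - {root}
    return len(raw_tree & bool_tree) / len(raw_tree) if raw_tree else 1.0


# Toy expression matrix (made-up values): genes 0 and 2 peak in the first two
# samples, genes 1 and 3 in the last two.
raw = [
    [10.0, 9.0, 1.0, 1.0],
    [1.0, 2.0, 10.0, 9.0],
    [9.0, 10.0, 2.0, 1.0],
    [1.0, 1.0, 9.0, 10.0],
]

for fraction in (0.5, 0.95):
    boolean = encode_over_expression(raw, fraction)
    print(fraction, shared_cluster_fraction(raw, boolean))
```

On this toy matrix the 0.5 threshold keeps both gene groups intact in the boolean dendrogram, whereas the 0.95 threshold leaves each gene with a single over-expressed sample, so all pairwise distances tie and the resulting tree no longer reflects the raw structure. Selecting an encoding would then amount to keeping the threshold with the highest score, under whatever tree similarity measure is actually adopted.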


Association Rule, Similarity Score, Hierarchical Cluster Algorithm, Discretization Technique, Gene Expression Data Analysis





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Ruggero G. Pensa (1)
  • Jean-François Boulicaut (1)
  1. INSA Lyon, LIRIS CNRS UMR 5205, Villeurbanne cedex, France
