Abstract
In gene expression data analysis, several researchers have recently emphasized the promise of local pattern discovery techniques (e.g., association rules, closed sets) applied to Boolean matrices that encode gene properties. Detecting local patterns by means of complete constraint-based mining techniques turns out to be an important complement to heuristic global model mining. To get the most out of local set pattern mining approaches, a necessary step is the encoding of gene expression properties (e.g., over-expression). This preprocessing phase has a crucial impact on both the quantity and the quality of the extracted patterns. In this paper, we study the impact of discretization techniques through a sound comparison between dendrograms, i.e., the trees generated by a hierarchical clustering algorithm, built on the raw numerical expression data and on the various Boolean matrices derived from it. Thanks to a new similarity measure, we can select the Boolean property encoding technique that preserves the similarity structures holding in the raw data. The discussion relies on experimental results for three gene expression data sets. We believe our framework is an interesting direction of work for the many application domains in which (a) local set patterns have proved useful, and (b) Boolean properties have to be derived from raw numerical data.
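The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not specify the encoding rules, the clustering settings, or the new similarity measure. The sketch assumes a hypothetical "above the gene's mean" over-expression rule, Euclidean distance with average linkage, and a simple stand-in score (the Jaccard ratio of clusters shared by two dendrograms). All function names are illustrative.

```python
from itertools import combinations


def encode_over_expression(rows):
    """Boolean encoding (assumed rule): 1 where a value exceeds its gene's mean."""
    encoded = []
    for row in rows:
        mean = sum(row) / len(row)
        encoded.append([1 if value > mean else 0 for value in row])
    return encoded


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(rows, a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    total = sum(euclidean(rows[i], rows[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def dendrogram_clusters(rows):
    """Agglomerative clustering; returns every merged cluster as a frozenset
    of row indices, i.e. the internal nodes of the dendrogram."""
    clusters = [frozenset([i]) for i in range(len(rows))]
    formed = set()
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(rows, pair[0], pair[1]))
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        formed.add(merged)
    return formed


def shared_cluster_ratio(tree_a, tree_b):
    """Stand-in similarity (not the paper's measure): Jaccard ratio of the
    clusters present in both dendrograms; 1.0 means identical trees."""
    union = tree_a | tree_b
    return len(tree_a & tree_b) / len(union) if union else 1.0


# Toy data (made up): four genes (rows) measured in four situations (columns).
raw = [
    [1.0, 5.0, 1.0, 5.0],
    [1.2, 4.8, 0.9, 5.1],
    [5.0, 1.0, 5.0, 1.0],
    [4.9, 1.1, 5.2, 0.8],
]
boolean = encode_over_expression(raw)
score = shared_cluster_ratio(dendrogram_clusters(raw),
                             dendrogram_clusters(boolean))
print(score)
```

Under these assumptions, one would compute the score for each candidate encoding and keep the one whose dendrogram stays closest to the dendrogram of the raw data. Note that the root cluster is always shared, so the score is never zero for non-empty inputs.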
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pensa, R.G., Boulicaut, JF. (2005). Boolean Property Encoding for Local Set Pattern Discovery: An Application to Gene Expression Data Analysis. In: Morik, K., Boulicaut, JF., Siebes, A. (eds) Local Pattern Detection. Lecture Notes in Computer Science, vol 3539. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11504245_8
DOI: https://doi.org/10.1007/11504245_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26543-6
Online ISBN: 978-3-540-31894-1
eBook Packages: Computer Science (R0)