Abstract
A typical mining problem is the extraction of patterns from subspaces of multidimensional data. Such patterns, known as biclusters, comprise subsets of objects that behave similarly across subsets of attributes, and may overlap each other, i.e., objects/attributes may belong to several patterns, or to none. For many miners, a key input parameter is the maximum allowed error, which greatly affects the quality, quantity and coherency of the mined clusters. As the error is dataset dependent, setting it demands either domain knowledge or some trial-and-error. The paper presents a new method for automatically setting the error to the value that maximizes the number of clusters mined. This error value is strongly correlated with the value for which performance scores are maximized. The correlation is extensively evaluated using six datasets, two mining algorithms, and seven prevailing performance measures, and the method is compared with five prior methods from the literature, demonstrating a substantial improvement in the mining score.
Keywords
- Biclustering
- Subspace Mining
- Error Setting
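The selection rule stated in the abstract (choose the error for which the miner returns the most clusters) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `pick_error` scans a fixed list of candidate errors, which is not necessarily the search strategy used in the paper, and `toy_miner` is a one-dimensional stand-in rather than a real subspace miner.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def toy_miner(values: Sequence[float], error: float) -> List[List[float]]:
    """Hypothetical stand-in for a miner: greedily groups sorted values whose
    distance from the first value in the group is at most `error`, and keeps
    only groups with two or more members."""
    clusters: List[List[float]] = []
    group: List[float] = []
    for v in sorted(values):
        if group and v - group[0] <= error:
            group.append(v)
        else:
            if len(group) >= 2:
                clusters.append(group)
            group = [v]
    if len(group) >= 2:
        clusters.append(group)
    return clusters


def pick_error(
    mine: Callable[[float], Sequence],
    candidates: Iterable[float],
) -> Tuple[float, int]:
    """Return (error, cluster_count) for the candidate error that yields the
    largest number of mined clusters. Ties keep the first candidate seen."""
    best_error, best_count = None, -1
    for err in candidates:
        count = len(mine(err))
        if count > best_count:
            best_error, best_count = err, count
    if best_error is None:
        raise ValueError("no candidate errors supplied")
    return best_error, best_count


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.3, 9.0, 9.4]
    candidates = [0.05, 0.25, 0.5, 6.0, 10.0]
    error, count = pick_error(lambda e: toy_miner(data, e), candidates)
    print(error, count)
```

In this toy setting a very small error leaves every value isolated, while a very large error merges everything into a single cluster, so the cluster count peaks at an intermediate error. Any real miner can be passed as `mine`, provided it accepts an error value and returns a collection of clusters.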
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Shaham, E., Sarne, D., Ben-Moshe, B. (2014). Efficient Error Setting for Subspace Miners. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2014. Lecture Notes in Computer Science, vol 8556. Springer, Cham. https://doi.org/10.1007/978-3-319-08979-9_1
DOI: https://doi.org/10.1007/978-3-319-08979-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08978-2
Online ISBN: 978-3-319-08979-9
eBook Packages: Computer Science (R0)