A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data

  • Guanhua Chen
  • Xiuli Ma
  • Dongqing Yang
  • Shiwei Tang
  • Meng Shuai
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5566)

Abstract

Data summarization is an important data mining task that aims to find a compact description of a dataset. Emerging applications place special requirements on data summarization techniques: the ability to find a concise and informative summary from high-dimensional data; the ability to deal with different types of attributes, such as binary, categorical and numeric attributes; end-user comprehensibility of the summary; insensitivity to noise and missing values; and scalability with data size and dimensionality. In this work, a general framework that satisfies all of these requirements is proposed to summarize high-dimensional data. We formulate the problem in a bipartite graph scheme, mapping objects (data records) and attribute values to two disjoint groups of nodes of a graph, in which a set of representative objects is discovered as the summary of the original data. Further, representativeness is measured using the MDL principle, which helps to yield a highly intuitive summary consisting of the most informative objects of the input data. While finding the optimal summary with minimal representation cost is computationally infeasible, an approximately optimal summary is obtained by a heuristic algorithm whose computation cost is quadratic in the size of the data and linear in its dimensionality. In addition, several techniques are developed to improve both the quality of the resulting summary and the efficiency of the algorithm. A detailed study on both real and synthetic datasets shows the effectiveness and efficiency of our approach in summarizing high-dimensional datasets with binary, categorical and numeric attributes.
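
The abstract gives no formulas or pseudocode, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a toy two-part cost in the spirit of MDL: each representative costs one unit per edge, and every other object is encoded either directly (one unit per edge) or as a pointer to its closest representative plus the edges that differ. The names build_bipartite, total_cost and greedy_summary, and the cost function itself, are hypothetical. Numeric attributes are not handled here; they would presumably need to be discretized first.

```python
from typing import FrozenSet, Hashable, List, Optional, Sequence, Tuple

# A node on the value side of the bipartite graph: (attribute index, value).
ValueNode = Tuple[int, Hashable]


def build_bipartite(
    records: Sequence[Sequence[Optional[Hashable]]],
) -> List[FrozenSet[ValueNode]]:
    """Map each object to the set of value nodes it is linked to.

    A value of None is treated as missing and produces no edge.
    """
    return [
        frozenset((j, v) for j, v in enumerate(rec) if v is not None)
        for rec in records
    ]


def total_cost(adj: List[FrozenSet[ValueNode]], reps: List[int]) -> int:
    """Toy description cost (an assumption, not the paper's measure)."""
    # Model part: state every edge of every representative.
    cost = sum(len(adj[r]) for r in reps)
    # Data part: encode each remaining object in the cheaper of two ways.
    for i, edges in enumerate(adj):
        if i in reps:
            continue
        direct = len(edges)
        if reps:
            # One unit for the pointer plus the differing edges.
            via_rep = 1 + min(len(edges ^ adj[r]) for r in reps)
            cost += min(direct, via_rep)
        else:
            cost += direct
    return cost


def greedy_summary(adj: List[FrozenSet[ValueNode]]) -> List[int]:
    """Add the representative that lowers the cost most; stop when none does."""
    reps: List[int] = []
    best = total_cost(adj, reps)
    while True:
        candidates = [i for i in range(len(adj)) if i not in reps]
        if not candidates:
            break
        # Ties are broken in favor of the lowest object index.
        cost, pick = min((total_cost(adj, reps + [i]), i) for i in candidates)
        if cost >= best:
            break
        reps.append(pick)
        best = cost
    return reps
```

For six records forming two groups, such as ("a", "x", 1), ("a", "x", 1), ("a", "x", 0), ("b", "y", 0), ("b", "y", 0), ("b", "y", 1), this sketch selects objects 0 and 3, one from each group. Re-evaluating the full cost for every candidate keeps the sketch short, at the price of a higher running time than the bounds stated in the abstract.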

Keywords

Data Summarization, High-Dimensional Data, Bipartite Graph, the MDL Principle

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Guanhua Chen
    • 1
  • Xiuli Ma
    • 1
    • 2
  • Dongqing Yang
    • 1
    • 3
  • Shiwei Tang
    • 1
    • 2
  • Meng Shuai
    • 2
  1. School of Electronics Engineering and Computer Science, Peking University, Beijing, China
  2. Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, China
  3. Key Laboratory of High Confidence Software Technologies (Ministry of Education), Peking University, Beijing, China
