Data Summarization Techniques for Big Data—A Survey

Abstract

In the current digital era, the rapid development of Internet and online technologies, such as large and powerful data servers, confronts us every day with a volume of information and data that was not available to humankind just a few decades ago. This data comes from the many online resources and services established to serve customers. Services and resources such as sensor networks, cloud storage, and social networks produce large volumes of data and also need to manage and reuse that data or some analytical aspects of it. Although this massive volume of data can be very useful for people and corporations, it can be problematic as well: big data requires large storage, and its volume makes analysis, processing, and retrieval difficult and highly time consuming. One way to overcome these problems is to summarize big data so that it needs less storage and much less time to process and retrieve. The summarized data is then a compact yet still informative version of the entire data. Data summarization techniques therefore aim to produce summaries of “good” quality. They would greatly benefit everyone, from ordinary users to researchers and the corporate world, by providing an efficient tool for dealing with large data such as news (for news summarization).

Author information

Corresponding author

Correspondence to Z. R. Hesabi.

Copyright information

© 2015 Springer Science+Business Media New York

About this chapter

Cite this chapter

Hesabi, Z., Tari, Z., Goscinski, A., Fahad, A., Khalil, I., Queiroz, C. (2015). Data Summarization Techniques for Big Data—A Survey. In: Khan, S., Zomaya, A. (eds) Handbook on Data Centers. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2092-1_38

  • DOI: https://doi.org/10.1007/978-1-4939-2092-1_38

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4939-2091-4

  • Online ISBN: 978-1-4939-2092-1

  • eBook Packages: Computer Science (R0)
