Data Summarization Techniques for Big Data—A Survey

Hesabi, Z. R.; Tari, Z.; Goscinski, A.; Fahad, A.; Khalil, I.; Queiroz, C.

doi:10.1007/978-1-4939-2092-1_38



Abstract

In the current digital era, rapid progress in internet and online technologies, such as large and powerful data servers, means that we face a huge volume of information and data every day from many different resources and services, a volume that was not available to humankind just a few decades ago. This data comes from the many online resources and services established to serve customers. Services and resources such as sensor networks, cloud storage, and social networks produce large volumes of data and also need to manage and reuse that data, or some analytical aspects of it. Although this massive volume of data can be very useful for people and corporations, it can be problematic as well: big data needs large storage, and its volume makes operations such as analysis, processing, and retrieval difficult and hugely time consuming. One way to overcome these problems is to summarize big data so that it needs less storage and much less time to process and retrieve. The summarized data is then a “compact format”, still informative version of the entire data. Data summarization techniques therefore aim to produce “good” quality summaries. They can benefit everyone from ordinary users to researchers and the corporate world, as they provide an efficient tool for dealing with large data such as news (for news summarization).




Author information

Authors and Affiliations

School of Computer Science and IT, RMIT University, Melbourne, Australia
Z. R. Hesabi, Z. Tari, A. Fahad & I. Khalil
School of Information Technology, Deakin University, Melbourne, Australia
A. Goscinski
IBM Research Laboratory, Melbourne, Australia
C. Queiroz


Corresponding author

Correspondence to Z. R. Hesabi.

Editor information

Editors and Affiliations

Department of Electrical and Computer Engineering, North Dakota State University, Fargo, North Dakota, USA
Samee U. Khan
School of Information Technologies, The University of Sydney, Sydney, New South Wales, Australia
Albert Y. Zomaya


About this chapter

Cite this chapter

Hesabi, Z., Tari, Z., Goscinski, A., Fahad, A., Khalil, I., Queiroz, C. (2015). Data Summarization Techniques for Big Data—A Survey. In: Khan, S., Zomaya, A. (eds) Handbook on Data Centers. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2092-1_38


DOI: https://doi.org/10.1007/978-1-4939-2092-1_38
Published: 17 March 2015
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-2091-4
Online ISBN: 978-1-4939-2092-1
eBook Packages: Computer Science (R0)
