Abstract
Astronomical sky surveys are expected to generate 2.5 petabytes of data a year. Electronic medical records hold the promise of comparing treatments and grouping patients by outcomes, but they will reside in petabyte-scale storage. We can store large amounts of data, and much of it won't have labels. How can we analyze or explore such data? Unsupervised clustering, whether fuzzy, possibilistic, or probabilistic, allows us to group data. However, these algorithms scale poorly in computation time as the data grows, and they are impractical without modification when the data exceeds the size of memory. We explore distributed clustering, stream data clustering, and subsampling approaches to enable scalable clustering. Examples show that one can build good models of the data without necessarily seeing all of it and that, if needed, modified algorithms can be applied to terabytes of data and more.
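To make the subsampling idea concrete, the following is a minimal sketch, not the chapter's own implementation. It assumes standard fuzzy c-means with Euclidean distance and fuzzifier m = 2, and it assumes that a uniform random sample is representative of the full data set. The function names (`memberships`, `fuzzy_c_means`, `cluster_by_subsampling`) and all parameter defaults are hypothetical choices made for illustration.

```python
import numpy as np


def memberships(X, centers, m=2.0):
    """Fuzzy memberships of each row of X in each cluster, shape (n, c)."""
    # Squared Euclidean distances between every point and every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Floor the distances so a point sitting on a center does not divide by zero.
    d2 = np.maximum(d2, 1e-12)
    inv = d2 ** (-1.0 / (m - 1.0))
    # Normalise so that each point's memberships sum to 1.
    return inv / inv.sum(axis=1, keepdims=True)


def fuzzy_c_means(X, c, m=2.0, n_iter=200, tol=1e-5, seed=0):
    """Plain fuzzy c-means on in-memory data; returns the cluster centers."""
    rng = np.random.default_rng(seed)
    # Initialise with c distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        w = memberships(X, centers, m) ** m
        # Each center is the membership-weighted mean of the data.
        new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers


def cluster_by_subsampling(X, c, sample_size, m=2.0, seed=0):
    """Fit centers on a uniform random sample instead of the full data set."""
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(X))
    idx = rng.choice(len(X), size=n, replace=False)
    return fuzzy_c_means(X[idx], c, m=m, seed=seed)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic stand-in for a large data set: two well-separated blobs.
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(10_000, 2)),
        rng.normal(10.0, 1.0, size=(10_000, 2)),
    ])
    centers = cluster_by_subsampling(X, c=2, sample_size=500)
    print(centers[np.argsort(centers[:, 0])])
```

Once the centers are fitted on the sample, memberships for the remaining data can be computed in a single pass, chunk by chunk, by calling `memberships` on each chunk, so the full data set never has to be held in memory at once.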
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hall, L.O. (2013). Exploring Big Data with Scalable Soft Clustering. In: Kruse, R., Berthold, M., Moewes, C., Gil, M., Grzegorzewski, P., Hryniewicz, O. (eds) Synergies of Soft Computing and Statistics for Intelligent Data Analysis. Advances in Intelligent Systems and Computing, vol 190. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33042-1_2
DOI: https://doi.org/10.1007/978-3-642-33042-1_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33041-4
Online ISBN: 978-3-642-33042-1
eBook Packages: Engineering, Engineering (R0)