K-Means over Incomplete Datasets Using Mean Euclidean Distance

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9729)

Abstract

Missing values are common in real-world data. In this research we developed a new version of the well-known k-means clustering algorithm that handles such incomplete datasets. Each iteration of k-means has two basic steps: it assigns each point to its closest centroid, and then it recomputes the centroids. Running it therefore requires a distance function and a formula for computing the mean. To measure the similarity between two incomplete points, we use the distribution of the incomplete attributes. We propose several approaches for computing the centroids. In the first, each incomplete point is treated as a single point and the centroid is computed with a formula derived in this research. In the second and the third, each incomplete point is replaced with a large number of points drawn according to the data distribution, and the centroid is computed from these points. Even so, the runtime complexity of the proposed k-means is the same as that of standard k-means over complete datasets. We experimented on six standard numerical datasets from different fields and compared the performance of the proposed k-means with that of other basic methods. Our experiments show that the proposed k-means algorithms outperform previously published methods.
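The two-step loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not state the distance or centroid formulas, so the sketch assumes that a missing attribute is an independent random variable with the attribute's observed mean and variance. Under that assumption the expected squared difference between a missing value and a centroid coordinate c is (mu - c)^2 + var, and the centroid that minimises the summed expected squared distance is the mean of the members with missing entries replaced by mu. The names `mean_sq_distance` and `kmeans_incomplete` are hypothetical.

```python
import numpy as np

def mean_sq_distance(x, c, mu, var):
    """Expected squared distance from point x (NaN = missing) to a complete centroid c.

    Assumption (not from the paper): a missing attribute is a random variable
    with mean mu and variance var, so E[(Y - c)^2] = (mu - c)^2 + var.
    """
    missing = np.isnan(x)
    per_attr = np.where(missing, (mu - c) ** 2 + var, (x - c) ** 2)
    return per_attr.sum()

def kmeans_incomplete(X, k, n_iter=20, seed=0):
    """Illustrative k-means over data with NaNs; same two steps as standard k-means."""
    rng = np.random.default_rng(seed)
    mu = np.nanmean(X, axis=0)   # per-attribute mean over observed values
    var = np.nanvar(X, axis=0)   # per-attribute variance over observed values
    # Initialise from k random rows, filling missing entries with attribute means.
    idx = rng.choice(len(X), size=k, replace=False)
    centroids = np.where(np.isnan(X[idx]), mu, X[idx])
    for _ in range(n_iter):
        # Step 1: assign each point to its closest centroid.
        dists = np.array([[mean_sq_distance(x, c, mu, var) for c in centroids]
                          for x in X])
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid from its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old centroid if a cluster is empty
            filled = np.where(np.isnan(members), mu, members)
            centroids[j] = filled.mean(axis=0)
    return labels, centroids

# Two well-separated groups, each with one incomplete point.
X = np.array([[0.0, 0.1], [0.2, np.nan], [0.1, 0.0],
              [10.0, 10.1], [np.nan, 9.9], [10.2, 10.0]])
labels, centroids = kmeans_incomplete(X, k=2)
```

Each iteration still costs one pass over points, centroids, and attributes, so this sketch has the same per-iteration complexity as standard k-means, in line with the abstract's claim about runtime.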


Correspondence to Loai AbdAllah.

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

AbdAllah, L., Shimshoni, I. (2016). K-Means over Incomplete Datasets Using Mean Euclidean Distance. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2016. Lecture Notes in Computer Science, vol 9729. Springer, Cham. https://doi.org/10.1007/978-3-319-41920-6_9

  • DOI: https://doi.org/10.1007/978-3-319-41920-6_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41919-0

  • Online ISBN: 978-3-319-41920-6

  • eBook Packages: Computer Science (R0)
