Towards a Compact and Effective Representation for Datasets with Inhomogeneous Clusters

  • Haimei Zhao
  • Zhuo Chen
  • Qiuhui Tong
  • Yuan Bo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11304)


Because computing resources are limited, it is often impractical to analyze massive datasets directly. Instead, a set of representatives can be extracted to approximate the spatial distribution of the data objects. Standard data mining algorithms are then run on these selected points only, which typically account for a small fraction of the original data, significantly reducing computation time. In practice, the boundary points of data clusters can be regarded as a compact and effective representation of the original data, with great potential in clustering, outlier or anomaly detection, and classification. Consequently, reliably identifying a set of effective boundary points in a complex dataset poses a new challenge in data mining. In this paper, we present a boundary extraction technique similar to the method in SCUBI (Scalable Clustering Using Boundary Information). The key difference is that our technique exploits the clustering information in a feedback loop to further refine the boundary. Experimental results show that our technique is more robust and produces more representative boundary points than SCUBI, especially on complex datasets whose clusters differ widely in density.
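To make the idea of boundary points as representatives concrete, the following minimal Python sketch scores each 2-D point by how one-sided its neighbourhood is and keeps the highest-scoring fraction. This is a generic k-nearest-neighbour heuristic chosen purely for illustration. It is not the technique proposed in the paper, it is not SCUBI, and it does not include the clustering feedback loop. The function names and the parameters `k` and `fraction` are assumptions introduced here.

```python
import math


def knn_indices(points, i, k):
    """Return the indices of the k nearest neighbours of points[i] (brute force)."""
    px, py = points[i]
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (points[j][0] - px) ** 2 + (points[j][1] - py) ** 2)
    return others[:k]


def boundary_scores(points, k=10):
    """Score each point by how one-sided its neighbourhood is (illustrative heuristic).

    Interior points have neighbours all around them, so the mean offset to
    the neighbours is close to zero. Boundary points have neighbours mostly
    on one side, so the mean offset is large. Dividing by the mean neighbour
    distance makes the score independent of the local density.
    """
    scores = []
    for i, (px, py) in enumerate(points):
        nbrs = knn_indices(points, i, k)
        if not nbrs:
            scores.append(0.0)
            continue
        n = len(nbrs)
        mean_dx = sum(points[j][0] - px for j in nbrs) / n
        mean_dy = sum(points[j][1] - py for j in nbrs) / n
        mean_dist = sum(
            math.hypot(points[j][0] - px, points[j][1] - py) for j in nbrs
        ) / n
        score = math.hypot(mean_dx, mean_dy) / mean_dist if mean_dist > 0 else 0.0
        scores.append(score)
    return scores


def extract_boundary(points, k=10, fraction=0.2):
    """Return the sorted indices of the top `fraction` of points by boundary score."""
    scores = boundary_scores(points, k)
    n_keep = max(1, int(len(points) * fraction))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

On a 7 × 7 integer grid, for example, `extract_boundary(grid, k=8, fraction=0.5)` selects exactly the 24 points on the outer ring, because every inner point has a symmetric neighbourhood and scores zero. The score is normalised by local distance because a fixed distance threshold would behave differently in dense and sparse clusters. Whether that normalisation is adequate for strongly inhomogeneous data is precisely the kind of question the paper addresses.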


Keywords: Boundary extraction; Clustering; SCUBI



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Intelligent Computing Lab, Division of Informatics, Graduate School at Shenzhen, Tsinghua University, Shenzhen, People’s Republic of China
