Deterministic Data Sampling Based on Neighborhood Analysis

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 297)


The volume of large-scale real-world data is growing rapidly, and with it the need to reduce that data to a representative sample. Such a sample makes it possible to apply a wide variety of analytical methods whose direct application to the original data would be infeasible. Many sampling methods exist, serving different purposes and yielding different results. In this paper, we outline a simple, flexible, and generally applicable approach based on analyzing nearest neighbors. Its general applicability is illustrated in experiments with synthetic and real-world datasets. The properties of the resulting representative sample show that the approach preserves internal data structures (e.g., clusters and density) very well. The key technical characteristics of the approach are low complexity and high scalability.
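The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a deterministic, neighborhood-based sampler could look like, not the authors' method. It assumes a brute-force nearest-neighbor search, a density proxy given by the mean distance to the k nearest neighbors, and a greedy rule in which each selected point "covers" its neighbors. The function names `knn_indices` and `neighborhood_sample` and the parameter `k` are hypothetical.

```python
import math


def knn_indices(points, k):
    """Return, for each point, its k nearest neighbors as (distance, index) pairs.

    Brute force, O(n^2): acceptable for illustration only.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append(dists[:k])
    return result


def neighborhood_sample(points, k):
    """Deterministically pick representative indices (illustrative assumption).

    Points are visited from the densest neighborhood to the sparsest; a point
    is selected only if no previously selected point has covered it.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")

    neighbors = knn_indices(points, k)
    # Small mean neighbor distance means a dense region.
    mean_dist = [sum(d for d, _ in nb) / len(nb) for nb in neighbors]
    # Ties are broken by index, so the result involves no randomness.
    order = sorted(range(len(points)), key=lambda i: (mean_dist[i], i))

    covered = [False] * len(points)
    sample = []
    for i in order:
        if covered[i]:
            continue
        sample.append(i)
        covered[i] = True
        for _, j in neighbors[i]:
            covered[j] = True  # neighbors are represented by point i
    return sample


if __name__ == "__main__":
    # Two well-separated groups of four points each.
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    print(neighborhood_sample(data, k=3))
```

Because each selected point stands in for roughly k neighbors, dense regions contribute more representatives than sparse ones, which is one simple way to keep clusters and relative density visible in the sample. A real implementation would replace the quadratic search with a spatial index.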


Keywords: sampling, data mining, density bias, nearest neighbor





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. VSB - Technical University of Ostrava, Ostrava, Czech Republic
