Deterministic Data Sampling Based on Neighborhood Analysis
The volume of large-scale real-world data is growing rapidly, and with it the need to reduce that data to a representative sample. Such a sample makes it possible to apply a wide variety of analytical methods whose direct application to the original data would be infeasible. Many sampling methods exist, serving different purposes and yielding different results. In this paper, we outline a simple, flexible, and straightforward approach, based on analyzing nearest neighbors, that is generally applicable. This general applicability is illustrated in experiments with synthetic and real-world datasets. The properties of the resulting representative sample show that the presented approach preserves internal data structures (e.g., clusters and density) very well. The key technical properties of the approach are low complexity and high scalability.
Keywords: sampling, data mining, density bias, nearest neighbor
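The abstract does not spell out the algorithm, so the following is only an illustrative sketch of what a deterministic, neighborhood-based sampler could look like; it is not the authors' method. It assumes a brute-force Euclidean neighbor search, a fixed scan order, and a hypothetical parameter `k`: each point that is not yet covered becomes a representative, and it then covers its `k` nearest neighbors. The function names `k_nearest` and `neighborhood_sample` are invented for this example.

```python
import math


def k_nearest(points, i, k):
    """Return indices of the k points closest to points[i].

    Brute-force search; ties in distance are broken by index,
    so the result is fully deterministic.
    """
    dists = [(math.dist(points[i], p), j)
             for j, p in enumerate(points) if j != i]
    dists.sort()
    return [j for _, j in dists[:k]]


def neighborhood_sample(points, k):
    """Pick representatives by scanning points in index order.

    Assumption (not from the paper): an uncovered point is selected
    and marks itself and its k nearest neighbors as covered.
    No randomness is involved, so repeated runs give the same sample.
    """
    covered = [False] * len(points)
    selected = []
    for i in range(len(points)):
        if covered[i]:
            continue
        selected.append(i)
        covered[i] = True
        for j in k_nearest(points, i, k):
            covered[j] = True
    return selected


# Two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
sample = neighborhood_sample(points, k=3)
print(sample)
```

In this sketch each representative accounts for roughly `k + 1` points regardless of how tightly they are packed, so dense regions contribute proportionally more representatives than sparse ones. The quadratic brute-force search is chosen for brevity only; a spatial index would be needed to approach the scalability the abstract refers to.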