Selection of Statistically Representative Subset from a Large Data Set
Selecting a representative subset of objects is an effective way to process large data sets. This applies both to time-consuming automatic algorithms and to the manual study of object properties by experts. ‘Representativity’ is understood here in a narrow sense: the statistical distributions of the object parameters in the subset equal those in the whole set. We propose a simple method for selecting such a subset based on testing complex statistical hypotheses, including an artificial hypothesis introduced to avoid ambiguity. We demonstrate the method on two data sets: one concerns mobile communication companies in Russia, and the other concerns intercity bus services in Peru.
Keywords: Sampling, Representative objects, Big data, Statistics, Complex hypothesis
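To make the narrow notion of representativity concrete, here is a minimal Python sketch of one possible check. It is not the procedure proposed in the paper: it swaps in a plain two-sample Kolmogorov-Smirnov test for each parameter and leaves out the complex and artificial hypotheses mentioned in the abstract. The function names, the significance level of 0.05, and the retry loop are all assumptions made for illustration. The sketch also treats the subset as if it were an independent sample, which is a simplification because the subset is drawn from the whole set.

```python
import bisect
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The supremum of the gap is attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Asymptotic critical value of the two-sample KS test at level alpha."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))


def select_representative_subset(rows, subset_size, alpha=0.05,
                                 max_tries=100, seed=0):
    """Draw random subsets until no parameter's distribution is rejected.

    rows: list of equal-length tuples of numeric parameters (hypothetical layout).
    Returns the first accepted subset, or None if every try was rejected.
    """
    rng = random.Random(seed)
    n_params = len(rows[0])
    threshold = ks_critical_value(subset_size, len(rows), alpha)
    for _ in range(max_tries):
        subset = rng.sample(rows, subset_size)
        # Accept only if every parameter passes the test against the whole set.
        if all(
            ks_statistic([r[j] for r in subset], [r[j] for r in rows]) <= threshold
            for j in range(n_params)
        ):
            return subset
    return None
```

A call such as `select_representative_subset(rows, 200)` returns a list of 200 rows whose per-parameter empirical distributions are not rejected against the full data set, or `None` if no such subset is found within `max_tries` draws. Testing each parameter separately ignores joint structure between parameters, so this check is weaker than one on the joint distribution.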