Selection of Statistically Representative Subset from a Large Data Set

  • Javier Tejada
  • Mikhail Alexandrov
  • Gabriella Skitalinskaya
  • Dmitry Stefanovskiy
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10125)


Selecting a representative subset of objects is an effective way to process large data sets. It helps both with time-consuming automatic algorithms and with the manual study of object properties by experts. ‘Representativity’ is understood here in a narrow sense: the statistical distributions of the object parameters in the subset equal those in the whole set. We propose a simple method for selecting such a subset, based on testing complex statistical hypotheses, including an artificial hypothesis introduced to avoid ambiguity. We demonstrate the method on two data sets: one concerns mobile communication companies in Russia, the other intercity bus services in Peru.
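The narrow sense of representativity described above can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the tests used, and the artificial hypothesis is not modelled here. The sketch assumes numeric parameters, a two-sample Kolmogorov-Smirnov check per parameter with its asymptotic critical value, and plain random resampling until a candidate passes. All function names and defaults (`ks_statistic`, `ks_critical`, `select_representative`, `alpha`, `max_tries`) are hypothetical. Because the subset is drawn from the whole set, the two samples are not independent, so the check is only approximate.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_critical(n, m, alpha=0.05):
    """Asymptotic two-sample critical value at significance level alpha."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))


def select_representative(rows, size, alpha=0.05, max_tries=100, seed=0):
    """Return the first random subset whose every parameter passes the check.

    rows is a list of equal-length numeric tuples, one tuple per object.
    Returns None if no candidate passes within max_tries draws.
    """
    rng = random.Random(seed)
    n_params = len(rows[0])
    full_columns = [[r[j] for r in rows] for j in range(n_params)]
    limit = ks_critical(size, len(rows), alpha)
    for _ in range(max_tries):
        subset = rng.sample(rows, size)
        if all(
            ks_statistic([r[j] for r in subset], full_columns[j]) <= limit
            for j in range(n_params)
        ):
            return subset
    return None
```

A call such as `select_representative(rows, size=50)` returns a list of 50 objects, or `None` when every candidate was rejected. Categorical parameters would need a different test, for example a chi-square comparison of category frequencies.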


Keywords: Sampling; Representative objects; Big data; Statistics; Complex hypothesis



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Javier Tejada (1)
  • Mikhail Alexandrov (2, 3)
  • Gabriella Skitalinskaya (4)
  • Dmitry Stefanovskiy (3)
  1. Catholic University of San Pablo, Arequipa, Peru
  2. Autonomous University of Barcelona, Barcelona, Spain
  3. Russian Presidential Academy of National Economy and Public Administration, Moscow, Russia
  4. Moscow Institute of Physics and Technology, Moscow, Russia
