Sampling Technique for Complex Data

  • Chapter
Sampling Techniques for Supervised or Unsupervised Tasks

Part of the book series: Unsupervised and Semi-Supervised Learning ((UNSESUL))

Abstract

In the context of Big Data, complex data from heterogeneous and distributed sources is potentially unlimited in volume. Analyzing these data is now a central concern in most domains where predictions and forecasts are decisive. In many key domains (security, air traffic control, road traffic, industry, medicine, telemonitoring, etc.), large complex data streams must be processed to provide near-immediate results. To meet this need, several works have applied on-the-fly processing to massive data. Sampling is widely used with large data sets to reduce the volume to be processed. However, with the emergence of complex data, traditional sampling techniques have shown some limitations. Overcoming these limitations calls for tools and techniques from statistics, mathematics, machine learning, and deep learning. In this chapter, we present a non-exhaustive state of the art of sampling for complex data. We also give an overview of integrating data from heterogeneous sources, because data integration is an implicit part of the problem of sampling data from such sources. The underlying idea is to combine these data in order to reduce both their heterogeneity and the uncertainty about the resulting data.

Author information

Corresponding author

Correspondence to A. Idarrou.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Idarrou, A., Douzi, H. (2020). Sampling Technique for Complex Data. In: Ros, F., Guillaume, S. (eds) Sampling Techniques for Supervised or Unsupervised Tasks. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-29349-9_6

  • DOI: https://doi.org/10.1007/978-3-030-29349-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29348-2

  • Online ISBN: 978-3-030-29349-9

  • eBook Packages: Engineering, Engineering (R0)
