Sampling Technique for Complex Data

Idarrou, A.; Douzi, H.

doi:10.1007/978-3-030-29349-9_6


Part of the book series: Unsupervised and Semi-Supervised Learning (UNSESUL)

Abstract

In the context of Big Data, complex data from heterogeneous and distributed sources are potentially unlimited in volume. Analyzing these data is now a central concern for actors in most activity domains where predictions and outlook are decisive. In many key domains (security, air traffic control, road traffic, industry, medicine, telemonitoring, etc.), large complex data streams must be processed to provide near-immediate results. To meet this need, several works have applied on-the-fly processing to massive data. Sampling is widely used with large data to reduce the volume to be processed. However, with the emergence of complex data, traditional sampling techniques have shown limitations. Overcoming them increasingly calls for tools and techniques from statistics, mathematics, machine learning, and deep learning. In this chapter, we present a non-exhaustive state of the art of complex data sampling. We also present an overview of integrating data from heterogeneous sources, since data integration is an implicit problem when sampling data from such sources. The underlying idea is to combine these data in order to reduce both their heterogeneity and the uncertainty about the resulting data.

Author information

Authors and Affiliations

Labo. Image et Reconnaissance de Formes – Systèmes Intelligents et Communicants (IRF-SIC), Ibn Zohr University, Agadir, Morocco
A. Idarrou & H. Douzi

Corresponding author

Correspondence to A. Idarrou.

Editor information

Editors and Affiliations

PRISME Laboratory, University of Orléans, Orléans, France
Frédéric Ros
UMR ITAP, Irstea, Montpellier, France
Serge Guillaume

About this chapter

Cite this chapter

Idarrou, A., Douzi, H. (2020). Sampling Technique for Complex Data. In: Ros, F., Guillaume, S. (eds) Sampling Techniques for Supervised or Unsupervised Tasks. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-29349-9_6

DOI: https://doi.org/10.1007/978-3-030-29349-9_6
Published: 27 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29348-2
Online ISBN: 978-3-030-29349-9
eBook Packages: Engineering, Engineering (R0)
