Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining

Alserafi, Ayman; Abelló, Alberto; Romero, Oscar; Calders, Toon

doi:10.1007/978-3-030-32065-2_3

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11815)

Included in the following conference series:

International Conference on Model and Data Engineering

Abstract

As the number of datasets stored in data repositories grows, there has been a trend of using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformation or preprocessing and make them accessible through schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We use a k-NN approach for this task, with a proximity score that computes the similarity of datasets based on their metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach successfully detects the correct categories for datasets, as well as outliers, with a precision of more than 90% and recall rates exceeding 75% in specific settings.
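
The general idea of k-NN categorization with outlier detection can be illustrated with a minimal sketch. It is not the DS-kNN algorithm itself: the abstract does not define the proximity score, so the sketch assumes a simple Jaccard similarity over attribute names as a stand-in. The function names, the `k` and `min_proximity` parameters, and the sample data are all hypothetical.

```python
from collections import Counter


def jaccard(attrs_a, attrs_b):
    """Stand-in proximity score (an assumption, not the paper's metric):
    Jaccard similarity between two collections of attribute names."""
    a, b = set(attrs_a), set(attrs_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def categorize(new_attrs, labeled, k=3, min_proximity=0.3, proximity=jaccard):
    """Assign a category by majority vote among the k most similar
    labeled datasets. Neighbours scoring below min_proximity are ignored;
    if none remain, the dataset is reported as an outlier (None)."""
    scored = sorted(
        ((proximity(new_attrs, attrs), category) for attrs, category in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = [(s, c) for s, c in scored[:k] if s >= min_proximity]
    if not neighbours:
        return None  # no sufficiently similar dataset: treat as outlier
    votes = Counter(category for _, category in neighbours)
    # Ties go to the category seen first, i.e. that of the closest neighbour.
    return votes.most_common(1)[0][0]


# Hypothetical labeled datasets: (attribute names, known category).
labeled = [
    (["name", "age", "diagnosis"], "health"),
    (["patient_id", "age", "diagnosis", "dose"], "health"),
    (["ticker", "price", "volume"], "finance"),
    (["ticker", "price", "date"], "finance"),
]

print(categorize(["age", "diagnosis", "dose"], labeled))
print(categorize(["latitude", "longitude"], labeled))
```

In this sketch the similarity threshold is what separates a categorized dataset from an outlier, which is presumably why the abstract reports precision and recall "in specific settings": both depend on how such parameters are chosen.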

Author information

Authors and Affiliations

Universitat Politècnica de Catalunya - BarcelonaTech, Barcelona, Catalunya, Spain
Ayman Alserafi, Alberto Abelló & Oscar Romero
Université Libre de Bruxelles (ULB), Brussels, Belgium
Ayman Alserafi
Universiteit Antwerpen (UAntwerp), Antwerp, Belgium
Toon Calders

Corresponding author

Correspondence to Ayman Alserafi.

Editor information

Editors and Affiliations

UIUC Institute, Zhejiang University, Zhejiang, China
Klaus-Dieter Schewe
INPT-ENSEEIHT/IRIT, Toulouse, France
Neeraj Kumar Singh

About this paper

Cite this paper

Alserafi, A., Abelló, A., Romero, O., Calders, T. (2019). Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining. In: Schewe, KD., Singh, N. (eds) Model and Data Engineering. MEDI 2019. Lecture Notes in Computer Science, vol 11815. Springer, Cham. https://doi.org/10.1007/978-3-030-32065-2_3

DOI: https://doi.org/10.1007/978-3-030-32065-2_3
Published: 21 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32064-5
Online ISBN: 978-3-030-32065-2
eBook Packages: Computer Science, Computer Science (R0)
