Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining

  • Conference paper
Model and Data Engineering (MEDI 2019)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11815)

Abstract

As the number of datasets stored in data repositories grows, Data Lakes (DLs) are increasingly used to store such data. DLs store datasets in their raw formats without any transformation or preprocessing and make them accessible through schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose in this paper an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We use a k-NN approach for this task, which relies on a proximity score to compute the similarity of datasets based on their metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers, with a precision of more than 90% and recall rates exceeding 75% in specific settings.
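The categorization step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the proximity scores between a new dataset and the already categorized datasets are precomputed and lie in [0, 1], and the names `categorize`, `k`, and `threshold` are hypothetical. A dataset with no sufficiently proximate neighbour is treated as an outlier.

```python
from collections import Counter


def categorize(proximities, categories, k=3, threshold=0.5):
    """Assign a category by majority vote among the k most proximate datasets.

    proximities: dict mapping a known dataset name to its proximity score
                 with the new dataset (assumed precomputed, in [0, 1]).
    categories:  dict mapping a known dataset name to its category label.
    Returns the winning category, or None when the dataset is an outlier.
    """
    # Keep only neighbours whose proximity reaches the (assumed) threshold.
    candidates = [
        (score, name) for name, score in proximities.items() if score >= threshold
    ]
    # Sort by descending proximity and keep the k nearest neighbours.
    top_k = sorted(candidates, reverse=True)[:k]
    if not top_k:
        return None  # no sufficiently similar dataset: treat as an outlier

    # Majority vote; ties go to the category seen first, i.e. the closest one.
    votes = Counter(categories[name] for _, name in top_k)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    known = {"a": "health", "b": "health", "c": "finance", "d": "finance"}
    scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2}
    print(categorize(scores, known, k=3, threshold=0.5))
    print(categorize({"d": 0.2}, known, k=3, threshold=0.5))
```

In this sketch, raising `threshold` trades recall for precision: fewer datasets receive a category, but those that do have closer neighbours.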

Notes

  1. http://www.openml.org

  2. https://github.com/AymanUPC/ds-knn

Corresponding author

Correspondence to Ayman Alserafi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Alserafi, A., Abelló, A., Romero, O., Calders, T. (2019). Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining. In: Schewe, KD., Singh, N. (eds) Model and Data Engineering. MEDI 2019. Lecture Notes in Computer Science, vol 11815. Springer, Cham. https://doi.org/10.1007/978-3-030-32065-2_3

  • DOI: https://doi.org/10.1007/978-3-030-32065-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32064-5

  • Online ISBN: 978-3-030-32065-2

  • eBook Packages: Computer Science, Computer Science (R0)
