An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13281)

Abstract

Dataset repositories continuously publish large numbers of datasets across a variety of domains, such as biodiversity and oceanography. To conduct multidisciplinary research, scientists and practitioners must discover datasets from disciplines that are unfamiliar to them. Well-known search engines, such as Google Dataset Search and Mendeley Data, try to support researchers with cross-domain dataset discovery based on dataset contents. However, as datasets typically contain scientific observations or data collected from service providers, their contextual information is limited. Accordingly, it can be impossible to index datasets effectively enough, using their contextual information alone, to increase their Findability, Accessibility, Interoperability, and Reusability (FAIRness). This paper presents an indexing pipeline that extends the contextual information of datasets according to their scientific domains, using topic modeling together with a set of rules and domain keywords (such as essential variables in environmental science) suggested by domain experts. The pipeline relies on an open ecosystem in which dataset providers publish semantically enhanced metadata on their data repositories. We aggregate, normalize, and reconcile such metadata, providing a dataset search engine that enables research communities to find, access, integrate, and reuse datasets. We evaluated our approach against a manually created gold standard and in a user study.
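
The normalize-then-enrich idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it covers only rule-based keyword matching and leaves out topic modeling and metadata reconciliation. The keyword lists, the field aliases, and the names `DOMAIN_KEYWORDS`, `FIELD_ALIASES`, `IndexedDataset`, `normalize`, and `enrich` are all hypothetical, chosen here for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical domain vocabulary in the style of essential variables.
# Real keyword lists would come from domain experts; these are placeholders.
DOMAIN_KEYWORDS = {
    "oceanography": {"sea surface temperature", "salinity", "sea level"},
    "atmosphere": {"carbon dioxide", "methane", "air temperature"},
    "biodiversity": {"species abundance", "species distribution"},
}

# Hypothetical mapping from repository-specific field names to one schema.
FIELD_ALIASES = {
    "title": ("title", "name", "dataset_title"),
    "description": ("description", "abstract", "summary"),
}


@dataclass
class IndexedDataset:
    title: str
    description: str
    domains: list = field(default_factory=list)
    keywords: list = field(default_factory=list)


def normalize(record: dict) -> IndexedDataset:
    """Map a raw metadata record onto the common schema."""
    values = {}
    for target, aliases in FIELD_ALIASES.items():
        # Take the first alias that is present and non-empty, else "".
        values[target] = next(
            (str(record[a]).strip() for a in aliases if record.get(a)), ""
        )
    return IndexedDataset(**values)


def enrich(dataset: IndexedDataset) -> IndexedDataset:
    """Attach every domain keyword found in the title or description."""
    text = re.sub(r"\s+", " ", f"{dataset.title} {dataset.description}").lower()
    for domain, terms in sorted(DOMAIN_KEYWORDS.items()):
        hits = sorted(t for t in terms if re.search(rf"\b{re.escape(t)}\b", text))
        if hits:
            dataset.domains.append(domain)
            dataset.keywords.extend(hits)
    return dataset


# Two made-up records that use different field names for the same information.
raw_records = [
    {"name": "Buoy series", "abstract": "Hourly salinity and sea level readings."},
    {"dataset_title": "Tower flux", "summary": "Methane and carbon dioxide fluxes."},
]
index = [enrich(normalize(r)) for r in raw_records]
for entry in index:
    print(entry.title, entry.domains, entry.keywords)
```

Keeping normalization separate from enrichment means a new repository needs only additional field aliases, while a new domain needs only an additional keyword set.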

Notes

  1. https://earthdata.nasa.gov/learn/backgrounders/essential-variables.

  2. https://data.icos-cp.eu/portal/.

  3. https://cdi.seadatanet.org/search.

  4. https://edmed.seadatanet.org/search/.

  5. https://metadatacatalogue.lifewatch.eu.

  6. We published the results of our observations, analysis, script, and contextual information on Mendeley Data [10].

References

  1. Altman, M., Castro, E., Crosas, M., Durbin, P., Garnett, A., Whitney, J.: Open journal systems and dataverse integration-helping journals to upgrade data publication for reusable research. Code4Lib J. 50(30) (2015)

  2. Balazinska, M., Howe, B., Koutris, P., Suciu, D., Upadhyaya, P.: A discussion on pricing relational data. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, W.-C., Fourman, M. (eds.) In Search of Elegance in the Theory and Practice of Computation. LNCS, vol. 8000, pp. 167–173. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41660-6_7

  3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  4. Borgman, C.L.: The conundrum of sharing research data. J. Am. Soc. Inform. Sci. Technol. 63(6), 1059–1078 (2012)

  5. Borgman, C.L.: Big Data, Little Data, No Data: Scholarship in the Networked World. MIT Press, Cambridge (2016)

  6. Brickley, D., Burgess, M., Noy, N.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference, pp. 1365–1375 (2019)

  7. Chapman, A., Simperl, E., Koesten, L., Konstantinidis, G., Ibáñez, L.-D., Kacprzak, E., Groth, P.: Dataset search: a survey. VLDB J. 29(1), 251–272 (2019). https://doi.org/10.1007/s00778-019-00564-x

  8. Codd, E.F., et al.: Relational completeness of data base sublanguages. IBM Corporation (1972)

  9. Data Catalog Vocabulary (DCAT) - Version 3. https://www.w3.org/TR/vocab-dcat-3/. Accessed 30 Sept 2021

  10. Farshidi, S.: The observations, analysis, script, and contextual information regarding this paper. Mendeley Data (2022). https://doi.org/10.17632/3yb7mhxtyf.1

  11. Farshidi, S., Jansen, S.: A decision support system for pattern-driven software architecture. In: Muccini, H., Avgeriou, P., Buhnova, B., Camara, J., Caporuscio, M., Franzago, M., Koziolek, A., Scandurra, P., Trubiani, C., Weyns, D., Zdun, U. (eds.) ECSA 2020. CCIS, vol. 1269, pp. 68–81. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59155-7_6

  12. Farshidi, S., Jansen, S., Deldar, M.: A decision model for programming language ecosystem selection: seven industry case studies. Inf. Softw. Technol. 139, 106640 (2021)

  13. Farshidi, S., Jansen, S., Fortuin, S.: Model-driven development platform selection: four industry case studies. Softw. Syst. Model. 20(5), 1525–1551 (2021). https://doi.org/10.1007/s10270-020-00855-w

  14. Find open data. https://data.gov.uk. Accessed 30 Sept 2021

  15. Gao, Y., Huang, S., Parameswaran, A.: Navigating the data lake with datamaran: automatically extracting structure from log datasets. In: Proceedings of the 2018 International Conference on Management of Data, pp. 943–958 (2018)

  16. Goel, S., Broder, A., Gabrilovich, E., Pang, B.: Anatomy of the long tail: ordinary people with extraordinary tastes. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 201–210 (2010)

  17. Gohar, M., Muzammal, M., Rahman, A.U.: Smart TSS: Defining transportation system behavior using big data analytics in smart cities. Sustain. Urban Areas 41, 114–119 (2018)

  18. Grubenmann, T., Bernstein, A., Moor, D., Seuken, S.: Financing the web of data with delayed-answer auctions. In: Proceedings of the 2018 World Wide Web Conference, pp. 1033–1042 (2018)

  19. Hendler, J., Holm, J., Musialek, C., Thomas, G.: US government linked open data: semantic.data.gov. IEEE Intell. Syst. 27(3), 25–31 (2012)

  20. Kacprzak, E., Koesten, L., Ibáñez, L.D., Blount, T., Tennison, J., Simperl, E.: Characterising dataset search-an analysis of search logs and data requests. J. Web Semant. 55, 37–55 (2019). Article no. 106640

  21. Kassen, M.: A promising phenomenon of open data: a case study of the Chicago open data project. Gov. Inf. Q. 30(4), 508–513 (2013)

  22. Lehmann, A., Masò, J., Nativi, S., Giuliani, G.: Towards integrated essential variables for sustainability (2020)

  23. Lehmberg, O., Bizer, C.: Stitching web tables for improving matching quality. Proc. VLDB Endowment 10(11), 1502–1513 (2017)

  24. Madhavan, J., Ko, D., Kot, Ł, Ganapathy, V., Rasmussen, A., Halevy, A.: Google’s deep web crawl. Proc. VLDB Endowment 1(2), 1241–1252 (2008)

  25. Mendeley data. https://data.mendeley.com/research-data/. Accessed 30 Sept 2021

  26. Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. (TOIS) 27(1), 1–27 (2008)

  27. Nguyen, T.T., Nguyen, Q.V.H., Weidlich, M., Aberer, K.: Result selection and summarization for web table search. In: 2015 IEEE 31st International Conference on Data Engineering, pp. 231–242. IEEE (2015)

  28. Open data monitor. https://www.opendatamonitor.eu/. Accessed 30 Sept 2021

  29. Open knowledge foundation (CKAN). https://ckan.org/. Accessed 30 Sept 2021

  30. Pasquetto, I.V., Randles, B.M., Borgman, C.L.: On the reuse of scientific data. Data Sci. J. 16, 8 (2017)

  31. Reynolds, P., Neuman, K.L., Officer, C.P.: DHS data framework. dhs.gov (2014)

  32. Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big data-AI integration perspective. IEEE Trans. Knowl. Data Eng. (2019)

  33. Sansone, S.A., et al.: DATS, the data tag suite to enable discoverability of datasets. Sci. Data 4(1), 1–8 (2017)

  34. Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Handbook of Latent Semantic Analysis, pp. 439–460. Psychology Press, Hove (2007)

  35. The linked open data cloud. https://www.lod-cloud.net/. Accessed 30 Sept 2021

  36. Zhao, W.X., et al.: Comparing Twitter and traditional media using topic models. In: Clough, P., et al. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 338–349. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20161-5_34

Acknowledgment

This work has been partially funded by the European Union’s Horizon 2020 research and innovation programme through the ARTICONF (825134), ENVRI-FAIR (824068), and BLUECLOUD (862409) projects.

Author information

Corresponding authors

Correspondence to Siamak Farshidi or Zhiming Zhao.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Farshidi, S., Zhao, Z. (2022). An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories. In: Gama, J., Li, T., Yu, Y., Chen, E., Zheng, Y., Teng, F. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2022. Lecture Notes in Computer Science, vol 13281. Springer, Cham. https://doi.org/10.1007/978-3-031-05936-0_37

  • DOI: https://doi.org/10.1007/978-3-031-05936-0_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-05935-3

  • Online ISBN: 978-3-031-05936-0

  • eBook Packages: Computer Science, Computer Science (R0)
