An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories

Farshidi, Siamak; Zhao, Zhiming

doi:10.1007/978-3-031-05936-0_37


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13281)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract

Dataset repositories continuously publish a significant number of datasets across a variety of domains, such as biodiversity and oceanography. To conduct multidisciplinary research, scientists and practitioners must discover datasets from disciplines with which they are unfamiliar. Well-known search engines, such as Google Dataset Search and Mendeley Data, try to support researchers with cross-domain dataset discovery based on dataset contents. However, as datasets typically contain scientific observations or data collected from service providers, their contextual information is limited. Consequently, it can be impossible to index datasets effectively enough to increase their Findability, Accessibility, Interoperability, and Reusability (FAIRness) based on their contextual information alone. This paper presents an indexing pipeline that extends the contextual information of datasets based on their scientific domains, using topic modeling together with a set of rules and domain keywords (such as essential variables in environmental science) suggested by domain experts. The pipeline relies on an open ecosystem, where dataset providers publish semantically enhanced metadata on their data repositories. We aggregate, normalize, and reconcile such metadata, providing a dataset search engine that enables research communities to find, access, integrate, and reuse datasets. We evaluated our approach against a manually created gold standard and in a user study.
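
The normalize-then-enrich idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it swaps topic modeling for plain keyword matching, and the field aliases, the keyword lists, and the names `normalize`, `enrich`, and `DatasetRecord` are all hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical domain vocabulary (assumption); the paper's keyword lists are not shown here.
DOMAIN_KEYWORDS = {
    "oceanography": {"sea surface temperature", "salinity", "ocean currents"},
    "biodiversity": {"species abundance", "habitat", "taxonomy"},
}

# Hypothetical field names that different repositories might use (assumption).
FIELD_ALIASES = {
    "title": ("title", "name", "dc:title"),
    "description": ("description", "abstract", "summary"),
}


@dataclass
class DatasetRecord:
    title: str
    description: str
    domains: list = field(default_factory=list)
    keywords: list = field(default_factory=list)


def normalize(raw: dict) -> DatasetRecord:
    """Map repository-specific field names onto one common schema."""
    def pick(target: str) -> str:
        for alias in FIELD_ALIASES[target]:
            value = raw.get(alias)
            if value:
                # Collapse runs of whitespace left over from harvesting.
                return " ".join(str(value).split())
        return ""
    return DatasetRecord(title=pick("title"), description=pick("description"))


def enrich(record: DatasetRecord) -> DatasetRecord:
    """Attach every domain whose keywords occur in the title or description."""
    text = f"{record.title} {record.description}".lower()
    for domain, terms in sorted(DOMAIN_KEYWORDS.items()):
        hits = sorted(t for t in terms if re.search(rf"\b{re.escape(t)}\b", text))
        if hits:
            record.domains.append(domain)
            record.keywords.extend(hits)
    return record


# Made-up harvested record, used only to exercise the two steps.
raw = {"name": "Argo float profiles",
       "abstract": "Salinity and sea surface temperature  measurements."}
record = enrich(normalize(raw))
print(record.domains, record.keywords)
```

In a fuller setting, the keyword-matching step would presumably sit alongside a topic model such as latent Dirichlet allocation, which the abstract names as the other source of contextual information.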


Notes

1. https://earthdata.nasa.gov/learn/backgrounders/essential-variables
2. https://data.icos-cp.eu/portal/
3. https://cdi.seadatanet.org/search
4. https://edmed.seadatanet.org/search/
5. https://metadatacatalogue.lifewatch.eu
6. We published the results of our observations, analysis, script, and contextual information on Mendeley Data [10].

References

1. Altman, M., Castro, E., Crosas, M., Durbin, P., Garnett, A., Whitney, J.: Open journal systems and dataverse integration-helping journals to upgrade data publication for reusable research. Code4Lib J. 50(30) (2015)
2. Balazinska, M., Howe, B., Koutris, P., Suciu, D., Upadhyaya, P.: A discussion on pricing relational data. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, W.-C., Fourman, M. (eds.) In Search of Elegance in the Theory and Practice of Computation. LNCS, vol. 8000, pp. 167–173. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41660-6_7
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
4. Borgman, C.L.: The conundrum of sharing research data. J. Am. Soc. Inform. Sci. Technol. 63(6), 1059–1078 (2012)
5. Borgman, C.L.: Big Data, Little Data, No Data: Scholarship in the Networked World. MIT Press, Cambridge (2016)
6. Brickley, D., Burgess, M., Noy, N.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference, pp. 1365–1375 (2019)
7. Chapman, A., Simperl, E., Koesten, L., Konstantinidis, G., Ibáñez, L.-D., Kacprzak, E., Groth, P.: Dataset search: a survey. VLDB J. 29(1), 251–272 (2019). https://doi.org/10.1007/s00778-019-00564-x
8. Codd, E.F., et al.: Relational completeness of data base sublanguages. IBM Corporation (1972)
9. Data Catalog Vocabulary (DCAT) - Version 3. https://www.w3.org/TR/vocab-dcat-3/. Accessed 30 Sept 2021
10. Farshidi, S.: The observations, analysis, script, and contextual information regarding this paper. Mendeley Data (2022). https://doi.org/10.17632/3yb7mhxtyf.1
11. Farshidi, S., Jansen, S.: A decision support system for pattern-driven software architecture. In: Muccini, H., Avgeriou, P., Buhnova, B., Camara, J., Caporuscio, M., Franzago, M., Koziolek, A., Scandurra, P., Trubiani, C., Weyns, D., Zdun, U. (eds.) ECSA 2020. CCIS, vol. 1269, pp. 68–81. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59155-7_6
12. Farshidi, S., Jansen, S., Deldar, M.: A decision model for programming language ecosystem selection: seven industry case studies. Inf. Softw. Technol. 139, 106640 (2021)
13. Farshidi, S., Jansen, S., Fortuin, S.: Model-driven development platform selection: four industry case studies. Softw. Syst. Model. 20(5), 1525–1551 (2021). https://doi.org/10.1007/s10270-020-00855-w
14. Find open data. https://data.gov.uk. Accessed 30 Sept 2021
15. Gao, Y., Huang, S., Parameswaran, A.: Navigating the data lake with datamaran: automatically extracting structure from log datasets. In: Proceedings of the 2018 International Conference on Management of Data, pp. 943–958 (2018)
16. Goel, S., Broder, A., Gabrilovich, E., Pang, B.: Anatomy of the long tail: ordinary people with extraordinary tastes. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 201–210 (2010)
17. Gohar, M., Muzammal, M., Rahman, A.U.: Smart TSS: Defining transportation system behavior using big data analytics in smart cities. Sustain. Urban Areas 41, 114–119 (2018)
18. Grubenmann, T., Bernstein, A., Moor, D., Seuken, S.: Financing the web of data with delayed-answer auctions. In: Proceedings of the 2018 World Wide Web Conference, pp. 1033–1042 (2018)
19. Hendler, J., Holm, J., Musialek, C., Thomas, G.: US government linked open data: semantic.data.gov. IEEE Intell. Syst. 27(03), 25–31 (2012)
20. Kacprzak, E., Koesten, L., Ibáñez, L.D., Blount, T., Tennison, J., Simperl, E.: Characterising dataset search-an analysis of search logs and data requests. J. Web Semant. 55, 37–55 (2019). Article no. 106640
21. Kassen, M.: A promising phenomenon of open data: a case study of the Chicago open data project. Gov. Inf. Q. 30(4), 508–513 (2013)
22. Lehmann, A., Masò, J., Nativi, S., Giuliani, G.: Towards integrated essential variables for sustainability (2020)
23. Lehmberg, O., Bizer, C.: Stitching web tables for improving matching quality. Proc. VLDB Endowment 10(11), 1502–1513 (2017)
24. Madhavan, J., Ko, D., Kot, Ł, Ganapathy, V., Rasmussen, A., Halevy, A.: Google’s deep web crawl. Proc. VLDB Endowment 1(2), 1241–1252 (2008)
25. Mendeley data. https://data.mendeley.com/research-data/. Accessed 30 Sept 2021
26. Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. (TOIS) 27(1), 1–27 (2008)
27. Nguyen, T.T., Nguyen, Q.V.H., Weidlich, M., Aberer, K.: Result selection and summarization for web table search. In: 2015 IEEE 31st International Conference on Data Engineering, pp. 231–242. IEEE (2015)
28. Open data monitor. https://www.opendatamonitor.eu/. Accessed 30 Sept 2021
29. Open knowledge foundation (CKAN). https://ckan.org/. Accessed 30 Sept 2021
30. Pasquetto, I.V., Randles, B.M., Borgman, C.L.: On the reuse of scientific data. Data Sci. J. 16, 8 (2017)
31. Reynolds, P., Neuman, K.L., Officer, C.P.: DHS data framework. dhs.gov (2014)
32. Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big data-AI integration perspective. IEEE Trans. Knowl. Data Eng. (2019)
33. Sansone, S.A., et al.: Dats, the data tag suite to enable discoverability of datasets. Sci. Data 4(1), 1–8 (2017)
34. Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Handbook of Latent Semantic Analysis, pp. 439–460. Psychology Press, Hove (2007)
35. The linked open data cloud. https://www.lod-cloud.net/. Accessed 30 Sept 2021
36. Zhao, W.X., et al.: Comparing Twitter and traditional media using topic models. In: Clough, P., et al. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 338–349. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20161-5_34


Acknowledgment

This work has been partially funded by the European Union’s Horizon 2020 research and innovation programme through the projects ARTICONF (825134), ENVRI-FAIR (824068), and BLUECLOUD (862409).

Author information

Authors and Affiliations

Multiscale Networked Systems, University of Amsterdam, Amsterdam, The Netherlands
Siamak Farshidi & Zhiming Zhao


Corresponding authors

Correspondence to Siamak Farshidi or Zhiming Zhao.

Editor information

Editors and Affiliations

Laboratory of Artificial Intelligence and Decision Support, University of Porto, Porto, Portugal
João Gama
School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China
Tianrui Li
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Yang Yu
School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Enhong Chen
JD iCity, JD Technology & JD Intelligent Cities Research, Beijing, China
Yu Zheng
School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China
Fei Teng


About this paper

Cite this paper

Farshidi, S., Zhao, Z. (2022). An Adaptable Indexing Pipeline for Enriching Meta Information of Datasets from Heterogeneous Repositories. In: Gama, J., Li, T., Yu, Y., Chen, E., Zheng, Y., Teng, F. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2022. Lecture Notes in Computer Science, vol 13281. Springer, Cham. https://doi.org/10.1007/978-3-031-05936-0_37


DOI: https://doi.org/10.1007/978-3-031-05936-0_37
Published: 11 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-05935-3
Online ISBN: 978-3-031-05936-0
eBook Packages: Computer Science, Computer Science (R0)
