Open dataset discovery using context-enhanced similarity search

  • Regular Paper
Knowledge and Information Systems

Abstract

Today, open data catalogs let users search for datasets with full-text queries over metadata records, combined with simple faceted filtering. With this combination, users can discover many of the datasets relevant to their search intent. However, some relevant datasets remain hard to find because their metadata are extremely sparse (e.g., only a few keywords). As an alternative, in this paper we propose an approach to dataset discovery based on similarity search over metadata descriptions enhanced by various semantic contexts. In general, the semantic contexts enrich the dataset metadata so that additional datasets relevant to a query can be identified that keyword or full-text search alone could not retrieve. In an experimental evaluation, we show that context-enhanced similarity retrieval methods increase the findability of relevant datasets, thus improving retrieval recall, which is critical in dataset discovery scenarios. As part of the evaluation, we created a catalog-like user interface for dataset discovery and recorded streams of user actions, which we used to create the ground truth. For the sake of reproducibility, we have published the entire evaluation testbed.
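The general idea of enriching sparse metadata with a semantic context before similarity search can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the toy context map `CONTEXT`, the dataset names, the weight of 0.5 for related terms, and the use of cosine similarity over term counts are all assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy context: maps a metadata term to related terms.
# In a real setting this role could be played by a thesaurus or a
# knowledge graph; the entries below are invented for illustration.
CONTEXT = {
    "air": ["atmosphere", "pollution"],
    "pollution": ["emissions", "air"],
    "smog": ["pollution", "air"],
}

# Hypothetical catalog with very sparse metadata (a few keywords each).
DATASETS = {
    "ds-smog": "smog alerts",
    "ds-air": "air quality measurements",
    "ds-bus": "bus timetables",
}


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def enrich(tokens, context, weight=0.5):
    """Build a term vector; related terms from the context get a lower weight."""
    vec = Counter()
    for term in tokens:
        vec[term] += 1.0
    for term in tokens:
        for related in context.get(term, []):
            vec[related] += weight
    return vec


def cosine(a, b):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, datasets, context=None, k=3):
    """Return up to k (name, score) pairs with non-zero similarity, best first."""
    context = context or {}
    q = enrich(tokenize(query), context)
    scored = [
        (cosine(q, enrich(tokenize(meta), context)), name)
        for name, meta in datasets.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:k] if score > 0]


# Without context, only the dataset sharing a literal query term is found.
plain = search("air pollution", DATASETS)
# With context, the sparsely described smog dataset becomes reachable too.
enhanced = search("air pollution", DATASETS, CONTEXT)
```

In this toy catalog, the query "air pollution" matches only `ds-air` without context, while the context-enhanced search also returns `ds-smog`, whose metadata share no literal term with the query. This mirrors the recall argument made above, under the stated assumptions.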

Notes

  1. https://data.europa.eu.

  2. https://data.gov.cz.

  3. https://www.data.gov/.

  4. Note that the data-engineering terminology can be somewhat misleading here. A database is generally a collection of database objects (or documents). In our case, a database object (document) is a particular dataset (e.g., a CSV file). Hence, in dataset discovery we search a database of datasets.

  5. https://data.europa.eu/en.

  6. https://gitlab.com/european-data-portal/metrics/edp-metrics-dataset-similarities/-/blob/master/src/main/java/io/piveau/metrics/similarities/SimilarityVerticle.java.

  7. https://github.com/mff-uk/simpipes-components/blob/main/processors/compute-similarity/basic/linda/tlsh.py.

  8. https://data.gov.cz.

  9. The numbers can be computed using query1.sparql published with the dataset [50].

  10. The numbers can be computed using query2.sparql published with the dataset [50].

  11. The numbers can be computed using query3.sparql published with the dataset [50].

  12. https://measuringu.com/sus/.

  13. https://www.usability.gov/how-to-and-tools/methods/system-usability-scale.html.

References

  1. Miller RJ, Nargesian F, Zhu E, Christodoulakis C, Pu KQ, Andritsos P (2018) Making open data transparent: Data discovery on open data. IEEE Data Eng Bull 41(2):59–70. http://sites.computer.org/debull/A18june/p59.pdf

  2. Brickley D, Burgess M, Noy NF (2019) Google dataset search: Building a search engine for datasets in an open web ecosystem. In: Liu, L., White, R.W., Mantrach, A., Silvestri, F., McAuley, J.J., Baeza-Yates, R., Zia, L. (eds.) The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pp. 1365–1375. ACM. https://doi.org/10.1145/3308558.3313685

  3. Chen X, Gururaj AE, Ozyurt B, Liu R, Soysal E, Cohen T, Tiryaki F, Li Y, Zong N, Jiang M, Rogith D, Salimi M, Kim H-E, Rocca-Serra P, Gonzalez-Beltran A, Farcas C, Johnson T, Margolis R, Alter G, Sansone S-A, Fore IM, Ohno-Machado L, Grethe JS, Xu H (2018) DataMed - an open source discovery index for finding biomedical datasets. J Am Med Inform Assoc 25(3):300–308. https://doi.org/10.1093/jamia/ocx121

  4. Chapman A, Simperl E, Koesten L, Konstantinidis G, Ibáñez LD, Kacprzak E, Groth P (2020) Dataset search: a survey. VLDB J 29(1):251–272. https://doi.org/10.1007/s00778-019-00564-x

  5. Gregory K, Groth P, Scharnhorst A, Wyatt S (2020) Lost or found? discovering data needed for research. Harvard Data Sci Rev. https://doi.org/10.1162/99608f92.e38165eb

  6. Gregory KM, Cousijn H, Groth P, Scharnhorst A, Wyatt S (2020) Understanding data search as a socio-technical practice. J Inf Sci 46(4):459–475. https://doi.org/10.1177/0165551519837182

  7. Koesten L (2018) A User Centred Perspective on Structured Data Discovery. In: Companion Proceedings of the The Web Conference 2018. WWW ’18, pp. 849–853. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE. https://doi.org/10.1145/3184558.3186574

  8. Klímek J (2019) DCAT-AP representation of Czech National Open Data Catalog and its impact. J Web Semant 55:69–85. https://doi.org/10.1016/j.websem.2018.11.001

  9. Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity Search - The Metric Space Approach. Advances in Database Systems, vol. 32. Springer, Boston, MA, USA. https://doi.org/10.1007/0-387-29151-2

  10. Hetland ML, Skopal T, Lokoc J, Beecks C (2013) Ptolemaic access methods: challenging the reign of the metric space model. Inf Syst 38(7):989–1006. https://doi.org/10.1016/j.is.2012.05.011

  11. Connor R, Vadicamo L, Cardillo FA, Rabitti F (2019) Supermetric search. Inf Syst 80:108–123. https://doi.org/10.1016/j.is.2018.01.002

  12. Skopal T, Bustos B (2011) On nonmetric similarity search problems in complex domains. ACM Comput Surv 43(4):34:1–34:50. https://doi.org/10.1145/1978802.1978813

  13. Das Sarma A, Fang L, Gupta N, Halevy A, Lee H, Wu F, Xin R, Yu C (2012) Finding Related Tables. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. SIGMOD ’12, pp. 817–828. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2213836.2213962

  14. Yakout M, Ganjam K, Chakrabarti K, Chaudhuri S (2012) InfoGather: Entity Augmentation and Attribute Discovery by Holistic Matching with Web Tables. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. SIGMOD ’12, pp. 97–108. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2213836.2213848

  15. Zhang S, Balog K (2018) Ad Hoc Table Retrieval Using Semantic Similarity. In: Proceedings of the 2018 World Wide Web Conference. WWW ’18, pp. 1553–1562. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE. https://doi.org/10.1145/3178876.3186067

  16. Fernandez RC, Abedjan Z, Koko F, Yuan G, Madden S, Stonebraker M (2018) Aurum: A Data Discovery System. In: 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, pp. 1001–1012. IEEE Computer Society, USA. https://doi.org/10.1109/ICDE.2018.00094

  17. Bogatu A, Fernandes AAA, Paton NW, Konstantinou N (2020) Dataset Discovery in Data Lakes. In: 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020, pp. 709–720. IEEE, USA. https://doi.org/10.1109/ICDE48307.2020.00067

  18. Mountantonakis M, Tzitzikas Y (2020) Content-based union and complement metrics for dataset search over RDF knowledge graphs. J Data Inf Qual. https://doi.org/10.1145/3372750

  19. Altaf B, Akujuobi U, Yu L, Zhang X (2019) Dataset recommendation via variational graph autoencoder. In: 2019 IEEE International Conference on Data Mining (ICDM), pp. 11–20. https://doi.org/10.1109/ICDM.2019.00011

  20. Degbelo A, Teka BB (2019) Spatial search strategies for open government data: a systematic comparison. CoRR arXiv:1911.01097

  21. Tekli J, Chbeir R (2012) Minimizing user effort in XML grammar matching. Inf Sci 210:1–40. https://doi.org/10.1016/j.ins.2012.04.026

  22. Tekli J, Chbeir R, Traina AJM, Traina C, Fileto R (2015) Approximate XML structure validation based on document-grammar tree similarity. Inf Sci 295:258–302. https://doi.org/10.1016/j.ins.2014.09.044

  23. Hovy E, Navigli R, Ponzetto SP (2013) Collaboratively built semi-structured content and Artificial Intelligence: the story so far. Artif Intell 194:2–27. https://doi.org/10.1016/j.artint.2012.10.002

  24. Tekli J, Chbeir R, Traina AJM, Traina C (2019) SemIndex+: a semantic indexing scheme for structured, unstructured, and partly structured data. Knowl-Based Syst 164:378–403. https://doi.org/10.1016/j.knosys.2018.11.010

  25. Berners-Lee T (2006) Linked Data. https://www.w3.org/DesignIssues/LinkedData.html

  26. Mountantonakis M, Tzitzikas Y (2018) Scalable methods for measuring the connectivity and quality of large numbers of linked datasets. J Data Inf Qual. https://doi.org/10.1145/3165713

  27. Wagner A, Haase P, Rettinger A, Lamm H (2014) Entity-Based Data Source Contextualization for Searching the Web of Data. In: The Semantic Web: ESWC 2014 Satellite Events, pp. 25–41. Springer, Cham. https://doi.org/10.1007/978-3-319-11955-7_3

  28. Ben Ellefi M, Bellahsene Z, Dietze S, Todorov K (2016) Dataset recommendation for data linking: an intensional approach. The Semantic Web. Latest Advances and New Domains. Springer, Cham, pp 36–51

  29. Ellefi MB, Bellahsene Z, Dietze S, Todorov K (2016) Dataset Recommendation for Data Linking: An Intensional Approach. In: Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) The Semantic Web. Latest Advances and New Domains - 13th International Conference, ESWC 2016, Heraklion, Crete, Greece, May 29 - June 2, 2016, Proceedings. Lecture Notes in Computer Science, vol. 9678, pp. 36–51. Springer, Cham. https://doi.org/10.1007/978-3-319-34129-3_3

  30. Martins YC, da Mota FF, Cavalcanti MC (2016) DSCrank: a method for selection and ranking of datasets, pp. 333–344. Springer, Cham. https://doi.org/10.1007/978-3-319-49157-8_29

  31. Leme LAPP, Lopes GR, Nunes BP, Casanova MA, Dietze S (2013) Identifying candidate datasets for data interlinking. In: Daniel, F., Dolog, P., Li, Q. (eds.) Web engineering, pp. 354–366. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39200-9_29

  32. Oliver J, Cheng C, Chen Y (2013) TLSH – A Locality Sensitive Hash. In: 2013 fourth cybercrime and trustworthy computing workshop, pp. 7–13. https://doi.org/10.1109/CTC.2013.9

  33. Dutkowski S, Schramm A (2015) Duplicate evaluation - position paper by Fraunhofer FOKUS. Technical report, Fraunhofer FOKUS. https://www.w3.org/2016/11/sdsvoc/SDSVoc16_paper_24

  34. Miller FP, Vandome AF, McBrewster J (2009) Levenshtein Distance. Alphascript Publishing. https://www.abebooks.com/products/isbn/9786130216900

  35. Straka M, Hajič J, Straková J (2016) UDPipe: Trainable pipeline for processing CoNLL-u files performing tokenization, morphological analysis, POS tagging and parsing. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 4290–4297. European Language Resources Association (ELRA), Portorož, Slovenia. https://www.aclweb.org/anthology/L16-1680

  36. Straka M, Straková J (2019) Universal Dependencies 2.5 Models for UDPipe (2019-12-06). LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11234/1-3131

  37. Sammut C, Webb GI (eds.) (2010) TF–IDF. In: Encyclopedia of Machine Learning, pp. 986–987. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_832

  38. About WordNet. Princeton University, USA (2010). Princeton University. https://wordnet.princeton.edu/

  39. Speer R, Chin J, Havasi C (2017) ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In: Singh, S.P., Markovitch, S. (eds.) Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 4444–4451. AAAI Press, California, USA. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972

  40. Škoda P, Matějík J, Skopal T (2020) Visualizer of Dataset Similarity Using Knowledge Graph. In: Satoh, S., Vadicamo, L., Zimek, A., Carrara, F., Bartolini, I., Aumüller, M., Jónsson, B.Þ., Pagh, R. (eds.) Similarity Search and Applications - 13th International Conference, SISAP 2020, Copenhagen, Denmark, September 30 - October 2, 2020, Proceedings. Lecture Notes in Computer Science, vol. 12440, pp. 371–378. Springer, Cham. https://doi.org/10.1007/978-3-030-60936-8_29

  41. Tekli J, Charbel N, Chbeir R (2016) Building Semantic Trees from XML Documents. J Web Semant 37:1–24. https://doi.org/10.1016/j.websem.2016.03.002

  42. Pilehvar MT, Navigli R (2014) A large-scale Pseudoword-based evaluation framework for state-of-the-art word sense disambiguation. Comput Linguist 40(4):837–881. https://doi.org/10.1162/COLI_a_00202

  43. Tekli J (2016) An overview on XML semantic disambiguation from unstructured text to semi-structured data: background, applications, and ongoing challenges. IEEE Trans Knowl Data Eng 28(6):1383–1407. https://doi.org/10.1109/TKDE.2016.2525768

  44. Mikolov T, Chen K, Corrado GS, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781

  45. Grover A, Leskovec J (2016) node2vec: Scalable Feature Learning for Networks. In: Krishnapuram, B., Shah, M., Smola, A.J., Aggarwal, C.C., Shen, D., Rastogi, R. (eds.) Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 855–864. ACM, USA. https://doi.org/10.1145/2939672.2939754

  46. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805

  47. Skopal T, Bernhauer D, Škoda P, Klímek J, Nečaský M (2021) Similarity vs. Relevance: From Simple Searches to Complex Discovery. In: Reyes, N., Connor, R., Kriege, N.M., Kazempour, D., Bartolini, I., Schubert, E., Chen, J. (eds.) Similarity Search and Applications - 14th International Conference, SISAP 2021, Dortmund, Germany, September 29 - October 1, 2021, Proceedings. Lecture Notes in Computer Science, vol. 13058, pp. 104–117. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_9

  48. Košarko O, Variš D, Popel M (2019) LINDAT Translation service. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11234/1-2922

  49. Klímek J, Škoda P (2018) LinkedPipes DCAT-AP Viewer: A Native DCAT-AP Data Catalog. In: van Erp, M., Atre, M., López, V., Srinivas, K., Fortuna, C. (eds.) Proceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks Co-located with 17th International Semantic Web Conference (ISWC 2018), Monterey, USA, October 8th to 12th, 2018. CEUR Workshop Proceedings, vol. 2180. CEUR-WS.org, Aachen, Germany. https://ceur-ws.org/Vol-2180/paper-32.pdf

  50. Škoda P, Klímek J (2021) Data collected from user evaluation of dataset search using similarity methods. Zenodo. https://doi.org/10.5281/zenodo.5788427

  51. Klímek J, Škoda P (2021) Dump of metadata from the Czech national open data catalog, 2020-04-20, State Administration of Land Surveying and Cadastre datasets removed. Zenodo. https://doi.org/10.5281/zenodo.4433464

  52. Klímek J, Bernhauer D (2021) Ground truths for dataset search using similarity methods generated from a user evaluation. Zenodo. https://doi.org/10.5281/zenodo.5788444

  53. Bechhofer S, Miles A (2009) SKOS Simple Knowledge Organization System Reference. W3C Recommendation, W3C. https://www.w3.org/TR/2009/REC-skos-reference-20090818/

  54. Baeza-Yates R, Ribeiro-Neto BA (2011) Modern Information Retrieval - the Concepts and Technology Behind Search, Second Edition. Pearson Education Ltd., Harlow, England. http://www.mir2ed.org/

  55. Zhang E, Zhang Y (2009) Eleven Point Precision-recall Curve. In: Liu, L., Özsu, M.T. (eds.) Encyclopedia of Database Systems, pp. 981–982. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_481

  56. Lewis JR (2018) The system usability scale: past, present, and future. Int J Human-Comput Interact 34(7):577–590. https://doi.org/10.1080/10447318.2018.1455307

Acknowledgements

This work was supported by the Czech Science Foundation (GAČR), grant number 19-01641S.

Author information

Corresponding author

Correspondence to Jakub Klímek.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bernhauer, D., Nečaský, M., Škoda, P. et al. Open dataset discovery using context-enhanced similarity search. Knowl Inf Syst 64, 3265–3291 (2022). https://doi.org/10.1007/s10115-022-01751-z

  • DOI: https://doi.org/10.1007/s10115-022-01751-z
