Detecting Similar Linked Datasets Using Topic Modelling

  • Michael Röder
  • Axel-Cyrille Ngonga Ngomo
  • Ivan Ermilov
  • Andreas Both
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9678)


The Web of Data is growing continuously with respect to both the size and the number of published datasets. Porting a dataset to five-star Linked Data, however, requires its publisher to link it with already available linked datasets. Given the size and growth of the Linked Data Cloud, the current, mostly manual approach to detecting relevant datasets for linking is obsolete. We study the use of topic modelling for dataset search experimentally and present Tapioca, a linked dataset search engine that automatically provides data publishers with similar existing datasets. Our search engine uses a novel approach for determining the topical similarity of datasets. This approach applies probabilistic topic modelling solely to the metadata of datasets in order to determine which datasets are related. We evaluate our approach on a manually created gold standard and with a user study. Our evaluation shows that our algorithm significantly outperforms a set of comparable baseline algorithms, including standard search engines, by 6% F1-score. Moreover, we show that it can be used on a large real-world dataset with comparable performance.
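The idea of ranking datasets by topical similarity can be illustrated with a minimal sketch. It assumes that a topic model has already assigned each dataset a topic distribution; the dataset names, the distributions, the function names, and the choice of cosine similarity as the comparison measure are all hypothetical and are not taken from the abstract, which does not state how Tapioca compares distributions.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_p * norm_q)


def rank_similar(query, corpus):
    """Return (dataset name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, dist))
              for name, dist in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical topic distributions over three topics (assumption: these
# would come from a topic model inferred on dataset metadata).
corpus = {
    "geo_dataset": [0.7, 0.2, 0.1],
    "bio_dataset": [0.1, 0.1, 0.8],
    "gov_dataset": [0.3, 0.6, 0.1],
}
query = [0.6, 0.3, 0.1]  # distribution of the dataset to be linked

ranking = rank_similar(query, corpus)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setting the query dataset shares most of its probability mass with `geo_dataset`, so that dataset is ranked first. Other measures for comparing probability distributions could be substituted for `cosine_similarity` without changing the ranking logic.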


Search Engine, Topic Modelling, Latent Dirichlet Allocation, Topic Distribution, Link Discovery
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Michael Röder (1)
  • Axel-Cyrille Ngonga Ngomo (1)
  • Ivan Ermilov (1)
  • Andreas Both (2)

  1. AKSW, Leipzig University, Leipzig, Germany
  2. Mercateo AG, Leipzig, Germany
