Searching the Big Data: Practices and Experiences in Efficiently Querying Knowledge Bases

  • Wei Emma Zhang
  • Quan Z. Sheng


Knowledge bases (KBs) are computer systems that store complex structured and unstructured facts, i.e., knowledge. KBs are described as open, shared databases of the world’s knowledge and typically use the entity-relational model. Most existing knowledge bases publish their data in the RDF format. Tools for querying, inferencing, and reasoning over facts have been developed to consume this knowledge. In this chapter, we introduce a client-side caching framework that aims to accelerate the overall query response speed. In particular, we improve a suboptimal graph edit distance function to estimate the similarity of SPARQL queries and develop an approach to transform SPARQL queries into feature vectors. Machine learning algorithms use these feature vectors to identify similar queries that could potentially be the subsequent queries. We adapt multiple dimensionality reduction algorithms to reduce the identification time. We then prefetch and cache the results of these queries to improve the overall querying performance. We also develop a forecasting method, namely Modified Simple Exponential Smoothing, to implement cache replacement. Our approach has been evaluated using a very large set of real-world queries. The empirical results show that our approach has great potential to enhance the cache hit rate and accelerate querying on SPARQL endpoints.
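As a rough illustration of forecasting-based cache replacement, the sketch below scores cached queries with standard simple exponential smoothing and evicts the entry with the lowest score. It is not the chapter's Modified Simple Exponential Smoothing, whose details are not given in this abstract. The class name `SmoothedCache`, the per-period hit counts, the default `alpha`, and the initial score of 0.0 are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SmoothedCache:
    """Toy cache that evicts the entry with the lowest smoothed hit score."""

    capacity: int
    alpha: float = 0.5  # smoothing factor in (0, 1]; assumed value
    results: dict = field(default_factory=dict)  # query string -> cached result
    scores: dict = field(default_factory=dict)   # query string -> smoothed score

    def end_period(self, hits):
        """Fold one period of observed hit counts into every cached score.

        Standard simple exponential smoothing:
            s_t = alpha * x_t + (1 - alpha) * s_{t-1}
        """
        for query in self.results:
            observed = hits.get(query, 0)
            previous = self.scores[query]
            self.scores[query] = self.alpha * observed + (1 - self.alpha) * previous

    def put(self, query, result):
        """Cache a result, evicting the lowest-scored entry if the cache is full."""
        if query not in self.results and len(self.results) >= self.capacity:
            coldest = min(self.scores, key=self.scores.get)
            del self.results[coldest]
            del self.scores[coldest]
        self.results[query] = result
        self.scores.setdefault(query, 0.0)  # assumed initial score for new entries
```

A larger `alpha` makes the score follow recent hits more closely, while a smaller one keeps more of the history, so the choice controls how quickly a once-popular query becomes a candidate for eviction.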



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. The University of Adelaide, Adelaide, Australia
