DAW: Duplicate-AWare Federated Query Processing over the Web of Data

  • Muhammad Saleem
  • Axel-Cyrille Ngonga Ngomo
  • Josiane Xavier Parreira
  • Helena F. Deus
  • Manfred Hauswirth
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8218)

Abstract

Over the last few years, the Web of Data has developed into a large compendium of interlinked datasets from multiple domains. Due to the decentralised architecture of this compendium, several of these datasets contain duplicated data. Yet, so far, little attention has been paid to the effect of duplicated data on federated querying. This work presents DAW, a novel duplicate-aware approach to federated querying over the Web of Data. DAW is based on a combination of min-wise independent permutations and compact data summaries. It can be directly combined with existing federated query engines to achieve the same query recall while querying fewer data sources. We extend three well-known federated query processing engines (DARQ, SPLENDID, and FedX) with DAW and compare our extensions with the original approaches. The comparison shows that DAW can greatly reduce the number of queries sent to the endpoints while maintaining high query recall. It can therefore significantly improve the performance of federated query processing engines. Moreover, DAW provides a source selection mechanism that maximises query recall when query processing is limited to a subset of the sources.
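The idea of combining min-wise independent permutations with per-source summaries can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of salted BLAKE2 hashes as stand-ins for random permutations, the signature length of 64, and the greedy ranking rule are all assumptions made for illustration. Each source is summarised by a small signature plus its triple count; the fraction of matching signature positions estimates the Jaccard overlap between two sources, and a greedy loop picks the sources expected to contribute the most new triples when only k sources may be queried.

```python
import hashlib


def _hash(value: str, seed: int) -> int:
    # A salted 64-bit hash stands in for one random permutation (assumption).
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=64):
    # For each simulated permutation, keep the minimum hash value of the set.
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    # The fraction of positions with equal minima estimates |A ∩ B| / |A ∪ B|.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def merge(sig_a, sig_b):
    # The position-wise minimum is the signature of the union of both sets.
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def select_sources(summaries, k):
    """Greedily pick up to k sources by estimated number of new triples.

    summaries maps a (hypothetical) source name to (signature, size).
    """
    candidates = dict(summaries)
    selected = []
    union_sig, union_size = None, 0.0

    def gain(name):
        sig, size = candidates[name]
        if union_sig is None:
            return float(size)
        j = estimate_jaccard(union_sig, sig)
        # From J = |A ∩ B| / |A ∪ B|: |A ∩ B| = J * (|A| + |B|) / (1 + J).
        overlap = j * (union_size + size) / (1.0 + j)
        return size - overlap

    while candidates and len(selected) < k:
        best = max(candidates, key=gain)
        new_triples = gain(best)
        sig, _ = candidates.pop(best)
        union_sig = sig if union_sig is None else merge(union_sig, sig)
        union_size += new_triples
        selected.append(best)
    return selected


if __name__ == "__main__":
    a = {f"s{i} p o{i}" for i in range(100)}
    c = {f"x{i} q y{i}" for i in range(50)}
    summaries = {
        "A": (minhash_signature(a), len(a)),
        "B": (minhash_signature(set(a)), len(a)),  # exact duplicate of A
        "C": (minhash_signature(c), len(c)),
    }
    print(select_sources(summaries, k=2))
```

In this toy setup, source B is an exact copy of A, so its estimated contribution after A is selected is zero and the smaller but disjoint source C is preferred. Only the signatures and sizes are compared, so no triples need to be fetched from a source to decide whether to skip it.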

Keywords

federated query processing, SPARQL, min-wise independent permutations, Web of Data

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Muhammad Saleem (1)
  • Axel-Cyrille Ngonga Ngomo (1)
  • Josiane Xavier Parreira
  • Helena F. Deus
  • Manfred Hauswirth

  1. IFI/AKSW, Universität Leipzig, Leipzig, Germany
