Towards a Multi-way Similarity Join Operator

Part of the Communications in Computer and Information Science book series (CCIS, volume 767)

Abstract

Increasing volumes of data consumed and managed by enterprises demand effective and efficient data integration approaches. Additionally, the number and variety of data sources impose further challenges for query engines. However, the majority of existing query engines rely on binary join-based query planners and execution methods whose complexity depends on the number of involved data sources. Moreover, traditional binary join operators cannot distinguish between similar and different tuples; they treat every incoming tuple as an independent object. Thus, tuples that are represented differently but refer to the same real-world entity are still considered unrelated objects. We propose MSimJoin, an approach towards a multi-way similarity join operator. MSimJoin accepts more than two inputs and is able to identify duplicates that correspond to similar entities from incoming tuples using Semantic Web technologies. Therefore, MSimJoin allows for the reduction of both the height of tree query plans and duplicated results.
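
To make the idea of a multi-way similarity join concrete, the following is a minimal illustrative sketch, not the MSimJoin algorithm itself. It assumes tuples are flat dictionaries, uses Jaccard similarity over attribute-value pairs as a stand-in for the semantic similarity the abstract alludes to, and picks an arbitrary threshold of 0.5. The names `jaccard` and `multiway_similarity_join` are hypothetical. A single operator consumes any number of inputs and merges tuples it considers to describe the same entity, where a plan built from binary joins would need one operator per additional source.

```python
from itertools import chain


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def multiway_similarity_join(sources, threshold=0.5):
    """Greedy sketch: merge similar tuples from any number of sources.

    Assumption: each tuple is a dict with hashable values, and two tuples
    refer to the same entity when the Jaccard similarity of their
    (attribute, value) pairs reaches the threshold.
    """
    clusters = []  # each cluster is a set of (attribute, value) pairs
    for tup in chain.from_iterable(sources):
        features = set(tup.items())
        for cluster in clusters:
            if jaccard(features, cluster) >= threshold:
                cluster |= features  # merge the duplicate into the cluster
                break
        else:
            clusters.append(features)  # no similar entity seen so far
    return clusters


# Three inputs handled by one operator (hypothetical example data).
source_a = [{"name": "Berlin", "country": "Germany", "type": "City"}]
source_b = [{"name": "Berlin", "country": "Germany", "population": "3.5M"}]
source_c = [{"name": "Paris", "country": "France", "type": "City"}]

result = multiway_similarity_join([source_a, source_b, source_c])
for entity in result:
    print(sorted(entity))
```

In this sketch the two Berlin tuples share two of four distinct attribute-value pairs, so they meet the assumed threshold and are merged into one result, while the Paris tuple stays separate. The greedy first-match strategy is order-dependent and is chosen only for brevity.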

Keywords

  • Semantic data management
  • Semantic Web
  • Join operators

Notes

  1. https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/.

  2. https://www.w3.org/TR/sparql11-query/.

Author information

Corresponding author

Correspondence to Mikhail Galkin.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Galkin, M., Vidal, ME., Auer, S. (2017). Towards a Multi-way Similarity Join Operator. In: , et al. New Trends in Databases and Information Systems. ADBIS 2017. Communications in Computer and Information Science, vol 767. Springer, Cham. https://doi.org/10.1007/978-3-319-67162-8_26

  • DOI: https://doi.org/10.1007/978-3-319-67162-8_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67161-1

  • Online ISBN: 978-3-319-67162-8

  • eBook Packages: Computer Science, Computer Science (R0)