The Journal of Supercomputing, Volume 71, Issue 10, pp 3695–3725

SigMR: MapReduce-based SPARQL query processing by signature encoding and multi-way join

Article

Abstract

Linked Data contains large numbers of Resource Description Framework (RDF) triples, and this volume can grow exponentially, which makes SPARQL query processing on a single machine infeasible. To address this scalability issue, SPARQL engines based on the MapReduce framework have been proposed, but these methods are limited in how they evaluate joins. The two-way join-based approach evaluates joins as a sequence of binary operations that requires multiple MapReduce jobs, with costly disk accesses between jobs. The multi-way join-based approach combines multiple two-way join operations so that the joins are evaluated simultaneously in one MapReduce job. However, the size of the data handled by that job may increase exponentially when the query is complex. In this study, we propose SigMR, a pruning method for multi-way join-based SPARQL query processing in MapReduce. With the proposed approach, a SPARQL query can be evaluated in a single MapReduce job in which the size of the data is reduced dramatically by pruning based on our signature encoding technique, thereby overcoming the weaknesses of the previous approaches. In experiments, query processing time was lower with our approach than with existing MapReduce-based methods.
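The idea of pruning before a single-pass multi-way join can be illustrated with a minimal sketch. This is not the SigMR algorithm itself, whose encoding is not described in the abstract. It assumes a star-shaped query in which every triple pattern shares the subject variable, and it uses a Bloom-filter-like bit signature (see Bloom 1970 in the article's references) as a stand-in for the signature encoding. The names `SIG_BITS`, `bit`, `signature`, and `star_join` are hypothetical, and the map and reduce phases are simulated in memory rather than run on Hadoop.

```python
from collections import defaultdict
from itertools import product
import zlib

SIG_BITS = 64  # hypothetical signature width, chosen only for illustration


def bit(value):
    """Map a join-key value to one bit of the signature (deterministic hash)."""
    return 1 << (zlib.crc32(value.encode()) % SIG_BITS)


def signature(triples, predicate):
    """OR together the bits of every subject that appears with this predicate."""
    sig = 0
    for s, p, _ in triples:
        if p == predicate:
            sig |= bit(s)
    return sig


def star_join(triples, predicates):
    """Join all patterns (?x, predicate, ?o) on ?x in one simulated map/reduce pass."""
    # A subject can only join if its bit is set in every pattern's signature.
    combined = ~0
    for p in predicates:
        combined &= signature(triples, p)

    # Map phase: emit (subject, (predicate, object)), pruning by signature.
    groups = defaultdict(lambda: defaultdict(list))
    shuffled = 0
    for s, p, o in triples:
        if p in predicates and bit(s) & combined:
            groups[s][p].append(o)
            shuffled += 1

    # Reduce phase: a subject joins only if every pattern matched it.
    # This check also discards false positives caused by hash collisions.
    results = []
    for s, by_pred in groups.items():
        if all(p in by_pred for p in predicates):
            for objs in product(*(by_pred[p] for p in predicates)):
                results.append((s, *objs))
    return sorted(results), shuffled


# Toy data, invented for this example.
triples = [
    ("alice", "type", "Person"),
    ("alice", "worksAt", "SNU"),
    ("bob", "type", "Person"),
    ("carol", "worksAt", "Hoseo"),
    ("alice", "knows", "bob"),
]
rows, shuffled = star_join(triples, ["type", "worksAt"])
print(rows)      # only "alice" has both a type and a worksAt triple
print(shuffled)  # number of triples that survived pruning
```

Because the signature is lossy, pruning may let some non-joining triples through, but it never drops a triple that contributes to a result; the reduce phase remains responsible for correctness.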

Keywords

Hadoop, MapReduce, Multi-way join, Signature encoding, SigMR, SPARQL


Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Biomedical Knowledge Engineering Laboratory, Dental Research Institute, Seoul National University, Seoul, Republic of Korea
  2. Department of Computer and Information Engineering, Hoseo University, Asan, Republic of Korea
  3. Institute of Human-Environment Interface Biology, Seoul National University, Seoul, Republic of Korea
