Abstract
This article introduces a novel approach to scaling complex calculations in extensive IT infrastructures and presents significant case studies from the SONCA and DISESOR projects. The described system enables parallel calculations by providing dynamic data sharding without the need for direct integration with storage repositories. The presented solution does not require one processing phase to complete before the next one starts; hence it is suitable for supporting many dependent calculations and can be used to provide scalability and robustness of whole data processing pipelines. The introduced mechanism is designed to support data that is still emerging, which makes it suitable for data streams, e.g. the transformation and analysis of data collected from multiple sensors. As will be shown in this article, this approach scales well and is very attractive because it can be easily applied to data processing between heterogeneous systems.
This research was partly supported by the Polish National Science Centre (NCN) grant DEC-2011/01/B/ST6/03867, as well as the Polish National Centre for Research and Development (NCBiR) grant PBS2/B9/20/2013 within the framework of the Applied Research Programmes. This publication has been co-financed with European Union funds by the European Social Fund.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Grzegorowski, M. (2014). Scaling of Complex Calculations over Big Data-Sets. In: Ślȩzak, D., Schaefer, G., Vuong, S.T., Kim, YS. (eds) Active Media Technology. AMT 2014. Lecture Notes in Computer Science, vol 8610. Springer, Cham. https://doi.org/10.1007/978-3-319-09912-5_7
DOI: https://doi.org/10.1007/978-3-319-09912-5_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09911-8
Online ISBN: 978-3-319-09912-5
eBook Packages: Computer Science (R0)