Abstract
Data Intensive (DI) applications are becoming increasingly important in several fields of science, in the economy, and even in everyday life. Although some technological frameworks are available for their development, we still lack solid software engineering approaches to support it and, in particular, to ensure that such applications offer the required properties in terms of availability, throughput, data loss, etc. In this paper we report our action research experience in developing, testing, and reengineering a specific DI application, Hegira4Cloud, that migrates data between widely used NoSQL databases. We highlight the issues we faced and show how cumbersome, expensive, and time-consuming the developing-testing-reengineering approach can be in this specific case. We also analyse the state of the art in the light of our experience and identify weaknesses and open challenges that could generate new research in the areas of software design and verification.
Notes
Repositories: monolithic version: https://github.com/deib-polimi/Hegira4Cloud; improved prototype: https://github.com/deib-polimi/hegira-components; REST API: https://github.com/deib-polimi/hegira-api
RemoteApiException: remote API call: unexpected HTTP response: 500
The total cost is obtained by multiplying, for each cost item in Table 6, the “Price per unit” by the “Resource usage”, and then summing the resulting values.
(or any other indexable property if allowed by the database)
Data migration users can choose the size of the VDPs before starting the migration; by doing so, they can trade data migration logging granularity for performance.
Service Level Agreement Legal and Open Model (SLALOM) European Project – http://slalom-project.eu
Acknowledgments
The authors would like to thank Stefano Ceri, Alfonso Fuggetta and Damian Andrew Tamburri for their advice and for reviewing preliminary versions of this paper. This work has been supported by the European Commission grant no. FP7-ICT-2011-8-318484 (MODAClouds), by the Windows Azure Research Pass 2013, and by various Amazon grants for supporting research activities.
Additional information
Communicated by: Luciano Baresi, Tim Menzies and Andreas Metzger
About this article
Cite this article
Scavuzzo, M., Nitto, E.D. & Ardagna, D. Experiences and challenges in building a data intensive system for data migration. Empir Software Eng 23, 52–86 (2018). https://doi.org/10.1007/s10664-017-9503-7
DOI: https://doi.org/10.1007/s10664-017-9503-7