Experiences and challenges in building a data intensive system for data migration

Scavuzzo, Marco; Nitto, Elisabetta Di; Ardagna, Danilo

doi:10.1007/s10664-017-9503-7

Published: 17 February 2017

Volume 23, pages 52–86, (2018)

Empirical Software Engineering

Abstract

Data Intensive (DI) applications are becoming increasingly important in several fields of science, in the economy, and even in everyday life. Unfortunately, even though some technological frameworks are available for their development, we still lack solid software engineering approaches to support that development and, in particular, to ensure that these applications offer the required properties in terms of availability, throughput, data loss, etc. In this paper we report our action research experience in developing, testing, and reengineering a specific DI application, Hegira4Cloud, that migrates data between widely used NoSQL databases. We highlight the issues we faced during this experience and show how cumbersome, expensive, and time-consuming the developing-testing-reengineering approach can be in this specific case. We also analyse the state of the art in the light of our experience and identify weaknesses and open challenges that could generate new research in the areas of software design and verification.

Notes

Repositories: Monolithic version: https://github.com/deib-polimi/Hegira4Cloud; Improved prototype: https://github.com/deib-polimi/hegira-components; REST API: https://github.com/deib-polimi/hegira-api
http://hadoop.apache.org/
http://spark.apache.org/
http://flink.apache.org/
https://cloud.google.com/datastore/
https://azure.microsoft.com/en-us/services/storage/tables/
http://cassandra.apache.org/
http://hbase.apache.org/
http://www.oracle.com/technetwork/database/migration/index-084442.html
https://github.com/flyway/flyway
http://www.liquibase.org
http://www.mysql.it/products/workbench/migrate/
https://www-01.ibm.com/marketing/iwm/iwm/web/signup.do?source=ibm-analytics&S_PKG=ov4921&S_TACT=M161001W&dynform=9816
https://chromium.googlesource.com/external/googleappengine/python/+/200fcb767bdc358a3acb5cf7cad1376fe69f12c5/google/appengine/tools/bulkloader.py
http://neo4j.com/docs/stable/import-tool.html
https://docs.mongodb.org/manual/reference/program/mongoimport/
http://goo.gl/Z307aS
https://docs.mongodb.com/manual/reference/program/mongoimport/
https://docs.mongodb.com/manual/reference/program/mongoexport/
RemoteApiException: remote API call: unexpected HTTP response: 500
https://github.com/esnet/iperf
http://www.rabbitmq.com/
https://www.rabbitmq.com/confirms.html
http://www.rabbitmq.com/blog/2012/04/25/rabbitmq-performance-measurements-part-2/
https://thrift.apache.org/
http://avro.apache.org/
https://developers.google.com/protocol-buffers/
The total cost is obtained by multiplying, for each cost item in Table 6, the “Price per unit” by the “Resource usage”, and then summing the resulting values.
(or any other indexable property if allowed by the database)
Data migration users are able to choose the size of the VDPs before actually starting the migration; by doing so, users are able to trade data migration logging granularity for performance.
http://zookeeper.apache.org
Service Level Agreement Legal and Open Model (SLALOM) European Project – http://slalom-project.eu
http://azure.microsoft.com/en-us/services/documentdb/
http://spark.apache.org/streaming/
http://www.dice-h2020.eu/
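
The total-cost note above (sum, over the cost items of Table 6, of price per unit times resource usage) can be sketched as follows. This is only an illustration of the arithmetic: the `CostItem` type, the item names, and all figures are hypothetical placeholders, not values taken from the paper's Table 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostItem:
    """One cost item (hypothetical structure mirroring a Table 6 row)."""
    name: str
    price_per_unit: float   # price charged per unit of the resource
    resource_usage: float   # number of units consumed during the migration


def total_cost(items):
    """Sum price_per_unit * resource_usage over all cost items."""
    return sum(item.price_per_unit * item.resource_usage for item in items)


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
items = [
    CostItem("vm_hours", price_per_unit=0.5, resource_usage=10.0),
    CostItem("storage_gb", price_per_unit=0.25, resource_usage=8.0),
]

print(total_cost(items))
```

With these placeholder figures the two items contribute 5.0 and 2.0 respectively; real values would come from the cloud providers' price lists and the measured resource usage.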

Acknowledgments

The authors would like to thank Stefano Ceri, Alfonso Fuggetta and Damian Andrew Tamburri for their advice and for reviewing preliminary versions of this paper. This work has been supported by the European Commission grant no. FP7-ICT-2011-8-318484 (MODAClouds), by the Windows Azure Research Pass 2013 and by various Amazon grants for supporting research activities.

Author information

Authors and Affiliations

Dipartimento di Elettronica Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci, 32, 20133, Milano, Italy
Marco Scavuzzo, Elisabetta Di Nitto & Danilo Ardagna

Corresponding author

Correspondence to Marco Scavuzzo.

Additional information

Communicated by: Luciano Baresi, Tim Menzies and Andreas Metzger

About this article

Cite this article

Scavuzzo, M., Nitto, E.D. & Ardagna, D. Experiences and challenges in building a data intensive system for data migration. Empir Software Eng 23, 52–86 (2018). https://doi.org/10.1007/s10664-017-9503-7

Published: 17 February 2017
Issue Date: February 2018
DOI: https://doi.org/10.1007/s10664-017-9503-7
