Experiences and challenges in building a data intensive system for data migration

Abstract

Data Intensive (DI) applications are becoming increasingly important in several fields of science, in the economy, and even in everyday life. Unfortunately, although some technological frameworks are available for building them, we still lack solid software engineering approaches to support their development and, in particular, to ensure that they offer the required properties in terms of availability, throughput, data loss, etc. In this paper we report our action research experience in developing, testing, and reengineering a specific DI application, Hegira4Cloud, that migrates data between widely used NoSQL databases. We highlight the issues we faced and show how cumbersome, expensive, and time-consuming the developing-testing-reengineering approach can be in this specific case. We also analyse the state of the art in the light of our experience and identify weaknesses and open challenges that could generate new research in the areas of software design and verification.

Notes

  1. Repositories. Monolithic version: https://github.com/deib-polimi/Hegira4Cloud; improved prototype: https://github.com/deib-polimi/hegira-components; REST API: https://github.com/deib-polimi/hegira-api

  2. http://hadoop.apache.org/

  3. http://spark.apache.org/

  4. http://flink.apache.org/

  5. https://cloud.google.com/datastore/

  6. https://azure.microsoft.com/en-us/services/storage/tables/

  7. http://cassandra.apache.org/

  8. http://hbase.apache.org/

  9. http://www.oracle.com/technetwork/database/migration/index-084442.html

  10. https://github.com/flyway/flyway

  11. http://www.liquibase.org

  12. http://www.mysql.it/products/workbench/migrate/

  13. https://www-01.ibm.com/marketing/iwm/iwm/web/signup.do?source=ibm-analytics&S_PKG=ov4921&S_TACT=M161001W&dynform=9816

  14. https://chromium.googlesource.com/external/googleappengine/python/+/200fcb767bdc358a3acb5cf7cad1376fe69f12c5/google/appengine/tools/bulkloader.py

  15. http://neo4j.com/docs/stable/import-tool.html

  16. https://docs.mongodb.org/manual/reference/program/mongoimport/

  17. http://goo.gl/Z307aS

  18. https://docs.mongodb.com/manual/reference/program/mongoimport/

  19. https://docs.mongodb.com/manual/reference/program/mongoexport/

  20. RemoteApiException: remote API call: unexpected HTTP response: 500

  21. https://github.com/esnet/iperf

  22. http://www.rabbitmq.com/

  23. https://www.rabbitmq.com/confirms.html

  24. http://www.rabbitmq.com/blog/2012/04/25/rabbitmq-performance-measurements-part-2/

  25. https://thrift.apache.org/

  26. http://avro.apache.org/

  27. https://developers.google.com/protocol-buffers/

  28. The total cost is obtained by multiplying, for each cost item in Table 6, the “Price per unit” by the “Resource usage”, and then summing the resulting values.

  29. (or any other indexable property if allowed by the database)

  30. Data migration users can choose the size of the VDPs before starting the migration; by doing so, they can trade data migration logging granularity for performance.

  31. http://zookeeper.apache.org

  32. Service Level Agreement Legal and Open Model (SLALOM) European Project: http://slalom-project.eu

  33. http://azure.microsoft.com/en-us/services/documentdb/

  34. http://spark.apache.org/streaming/

  35. http://www.dice-h2020.eu/
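
The cost computation described in note 28 can be sketched as follows. This is only an illustration: Table 6 is not reproduced here, so the `CostItem` type, the item names, and all figures below are hypothetical placeholders rather than values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostItem:
    """One cost item (hypothetical stand-in for a row of Table 6)."""
    name: str
    price_per_unit: float   # the "Price per unit" column
    resource_usage: float   # the "Resource usage" column


def total_cost(items):
    """Sum price_per_unit * resource_usage over all cost items."""
    return sum(item.price_per_unit * item.resource_usage for item in items)


# Placeholder figures chosen for illustration only; not taken from the paper.
example_items = [
    CostItem("compute hours", price_per_unit=0.5, resource_usage=4.0),
    CostItem("storage operations", price_per_unit=0.25, resource_usage=8.0),
]
example_total = total_cost(example_items)
```

With these placeholder figures each item contributes 2.0, so `example_total` is 4.0.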

Acknowledgments

The authors would like to thank Stefano Ceri, Alfonso Fuggetta and Damian Andrew Tamburri for their advice and for reviewing preliminary versions of this paper. This work has been supported by the European Commission grant no. FP7-ICT-2011-8-318484 (MODAClouds), by the Windows Azure Research Pass 2013 and by various Amazon grants for supporting research activities.

Author information

Corresponding author

Correspondence to Marco Scavuzzo.

Additional information

Communicated by: Luciano Baresi, Tim Menzies and Andreas Metzger

About this article

Cite this article

Scavuzzo, M., Nitto, E.D. & Ardagna, D. Experiences and challenges in building a data intensive system for data migration. Empir Software Eng 23, 52–86 (2018). https://doi.org/10.1007/s10664-017-9503-7

Keywords

  • Data intensive applications
  • Experiment-driven action research
  • Big data
  • Data migration