To succeed at big data you must be able to process large volumes of data, data that is very often unstructured. More importantly, you must be able to swiftly react to emerging opportunities and insights before your competitor does. A Disciplined Agile approach to big data is evolutionary and collaborative in nature, leveraging proven strategies from the traditional, lean, and agile canons. Collaborative strategies increase both the velocity and quality of work performed while reducing overhead. Evolutionary strategies – those that deliver incremental value through iterative application of architecture and design modeling, database refactoring, automated regression testing, continuous integration (CI) of data assets, continuous deployment (CD) of data assets, and configuration management – build a solid data foundation that will stand the test of time. In effect this is the application of proven, leading-edge software engineering practices to big data.
Why Disciplined Agile Big Data?
The big data environment is complex. You are dealing with overwhelming volumes of data arriving from a large number of disparate sources; the data is often of questionable quality and integrity, and it often comes from sources outside your scope of influence. You need to respond to quickly changing stakeholder needs without increasing the technical debt within your organization. At one extreme, traditional approaches to data management are insufficiently responsive; at the other, mainstream agile strategies (in particular Scrum) come up short for addressing your long-term data management needs. You need a middle ground that combines just enough modeling and planning, performed at the most responsible moments, with engineering techniques that produce high-quality assets that are easily evolved yet will still stand the test of time. That middle ground is Disciplined Agile Big Data.
Disciplined Agile (DA) (Ambler and Lines 2012) is a hybrid framework that combines strategies from Scrum, Agile Modeling, Agile Data, Unified Process, Kanban, traditional methods, and many other sources. DA promotes a pragmatic and flexible strategy for tailoring and evolving processes to reflect the situation that you face. A Disciplined Agile approach to big data leverages agile strategies for architecture and design modeling along with modern software engineering techniques. These practices, described below, are referred to as the agile database techniques stack. The aim is to quickly meet the dynamic needs of the marketplace without short-changing the long-term viability of your organization.
Be Agile: An Agile Mindset for Data Professionals
Being agile begins with mindset. Data professionals with an agile mindset are:
- Willing to work closely with others, pairing or working in small teams as appropriate
- Pragmatic, in that they are willing to do what needs to be done to the extent that it needs to be done
- Open-minded, willing to experiment and to learn new techniques
- Responsible, and therefore willing to seek the help of the right person(s) for the task at hand
- Eager to work iteratively and incrementally, creating artifacts that are sufficient to the task at hand
Do Agile: The Agile Database Techniques Stack
We say these techniques form a stack because, to be viable, each technique requires the one immediately below it. From top to bottom, the stack consists of continuous database deployment, vertical slicing, agile data modeling, clean architecture and design, database refactoring, automated database testing, continuous database integration, and database configuration management. For it to make sense to continuously deploy database changes you need to be able to develop small yet still valuable vertical slices, which in turn require clean architecture and design, and so on. Let's explore each technique in greater detail.
Continuous Database Deployment
The aim of continuous database deployment is to reduce the time, cost, and risk of releasing database changes. Continuous database deployment only works if you are able to organize the functionality you are delivering into small, yet still valuable, vertical slices.
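One common way to make database deployments repeatable enough to run continuously is a migration runner that records which schema changes have already been applied and applies only the new ones. The sketch below uses SQLite from Python's standard library; the table, migration names, and DDL are illustrative assumptions, not part of the chapter.

```python
import sqlite3

# Ordered list of schema migrations; each has a unique id so it is
# applied exactly once. Names and DDL are illustrative assumptions.
MIGRATIONS = [
    ("001_create_customer",
     "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002_add_customer_email",
     "ALTER TABLE customer ADD COLUMN email TEXT"),
]

def deploy(conn):
    """Apply any migrations not yet recorded in schema_version."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (migration_id TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in
               conn.execute("SELECT migration_id FROM schema_version")}
    for migration_id, ddl in MIGRATIONS:
        if migration_id not in applied:
            conn.execute(ddl)
            conn.execute("INSERT INTO schema_version VALUES (?)", (migration_id,))
    conn.commit()

conn = sqlite3.connect(":memory:")
deploy(conn)
deploy(conn)  # safe to run again: already-applied migrations are skipped
```

Because the runner is idempotent, the same script can be executed against every environment on every release, which is what shrinks a database deployment from a risky event into a routine, low-cost step.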
A vertical slice is a top to bottom, fully implemented and tested piece of functionality that provides some form of business value to an end user. It should be possible to easily deploy a vertical slice into production upon request. A vertical slice can be very small, such as a single value on a report, the implementation of a business rule or calculation, or a new reporting view. For an agile team, all of this implementation work should be accomplished during a single iteration/sprint, typically a one- or two-week period. For teams following a lean delivery lifecycle, this timeframe typically shrinks to days and even hours in some cases.
For a data warehouse (DW)/business intelligence (BI) solution, implementing a vertical slice typically requires work at each layer of your architecture:
- Extraction from the data source(s)
- Staging of the raw source data (if you stage data)
- Transformation/cleansing of the source data
- Loading the data into the DW
- Loading the data into your data marts (DMs)
- Updating the appropriate BI views/reports where needed
A key concept is that you only do the work for the vertical slice that you’re currently working on. This is what enables you to get the work done in a matter of days (and even hours once you get good at it) instead of weeks or months. It should be clear that vertical slicing is only viable when you are able to take an agile approach to modeling.
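To make the idea concrete, the steps above can be sketched end to end for one tiny slice, a single new summary value on a report. Everything here (table names, the cleansing rule, the use of SQLite) is an illustrative assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source system table for one small vertical slice (illustrative schema).
conn.execute("CREATE TABLE src_orders (id INTEGER, amount TEXT)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?)",
                 [(1, "10.50"), (2, "  7.25 "), (3, None)])

# 1. Extract from the data source
rows = conn.execute("SELECT id, amount FROM src_orders").fetchall()

# 2. Stage the raw source data
conn.execute("CREATE TABLE stg_orders (id INTEGER, amount TEXT)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", rows)

# 3. Transform/cleanse: trim whitespace, drop rows with no amount
clean = [(i, float(a.strip())) for i, a in rows if a is not None]

# 4. Load into the DW
conn.execute("CREATE TABLE dw_orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO dw_orders VALUES (?, ?)", clean)

# 5. Load the data mart: only the one aggregate this slice needs
conn.execute("""CREATE TABLE dm_order_summary AS
                SELECT COUNT(*) AS n, SUM(amount) AS total FROM dw_orders""")

# 6. Update the BI view/report
conn.execute("CREATE VIEW rpt_orders AS SELECT n, total FROM dm_order_summary")
print(conn.execute("SELECT n, total FROM rpt_orders").fetchone())  # → (2, 17.75)
```

Note how narrow each layer is: no speculative columns, no extra aggregates, just enough to deliver this one value.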
Agile Data Modeling
Agile data modeling is evolutionary and collaborative in nature, and includes the following practices:
- Initial requirements envisioning. This includes both usage modeling, likely via user stories and epics, and conceptual modeling. These models are high-level at first; their details are fleshed out later as construction progresses.
- Initial architecture envisioning. Your architecture strategy is typically captured in a free-form architecture diagram, network diagram, or UML deployment diagram. Your model(s) should capture potential data sources; how data will flow from the data sources to the target data warehouse(s) or data marts; and how that work flows through combinations of data extraction, data transformation, and data loading capabilities.
- Look-ahead modeling. Sometimes referred to as "backlog refinement" or "backlog grooming," the goal of look-ahead modeling is to explore work that is a few weeks in the future. This is particularly needed in complex domains where there may be a few weeks of detailed data analysis required to work through the semantics of your source data. For teams taking a sprint/iteration-based approach, this may mean that during the current iteration one or more people on the team explore requirements to be implemented one or two iterations in the future.
- Model storming. This is a just-in-time (JIT) modeling strategy where you explore something in greater detail, perhaps working through the details of what a report should look like or how the logic of a business calculation should work.
- Test-driven development (TDD). With TDD, your tests both validate your work and specify it. Specification can occur at the requirements level with acceptance tests and at the design level with developer tests. More on this later.
Clean Architecture and Design
Strategies for keeping your architecture and design clean include:
- Choose a data warehouse architecture paradigm. Although there is something to be said for both the Inmon and Kimball strategies, I generally prefer Data Vault 2.0 (Lindstedt and Olschimke 2015). Data Vault 2.0 (DV2) has its roots in the Inmon approach, incorporating lessons from Kimball and, more importantly, practical experience with DW/BI and big data in a range of situations.
- Focus on loose coupling and high cohesion. When a system is loosely coupled, it is easy to evolve its components without significant effects on other components. Components that are highly cohesive do one thing and one thing only; in data parlance, they are "highly normalized."
- Adopt common conventions. Guidelines such as data naming conventions, architectural guidelines, coding conventions, and user experience (UX) conventions promote greater consistency in the work produced.
- Train and coach your people. Unfortunately, few IT professionals these days get explicit training in architecture and design strategies, resulting in poor-quality work that increases your organization's overall technical debt.
Database Refactoring
A refactoring is a simple change to your design that improves its quality without changing its semantics in a practical manner. A database refactoring is a simple change to a database schema that improves the quality of its design or improves the quality of the data that it contains (Ambler and Sadalage 2006). Database refactoring enables you to safely and easily evolve database schemas, including production database schemas, over time by breaking large changes into collections of smaller, less risky changes. Refactoring enables you to keep existing clean designs of high quality and to safely address problems in poor-quality implementations.
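For example, a Rename Column refactoring is typically broken into small, low-risk steps: introduce the better-named column, keep the two columns synchronized during a transition period so that old and new application versions both keep working, and drop the original column only once everything has migrated. A minimal sketch using SQLite; the table and column names are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Ada')")

# Step 1: introduce the better-named column alongside the old one.
conn.execute("ALTER TABLE customer ADD COLUMN first_name TEXT")
conn.execute("UPDATE customer SET first_name = fname")

# Step 2: keep both columns in sync during the transition period so that
# applications reading either column see the same data. (A real transition
# would also need a companion trigger covering INSERTs.)
conn.execute("""CREATE TRIGGER sync_first_name
                AFTER UPDATE OF fname ON customer
                BEGIN
                  UPDATE customer SET first_name = NEW.fname
                  WHERE id = NEW.id;
                END""")

conn.execute("UPDATE customer SET fname = 'Grace' WHERE id = 1")
print(conn.execute("SELECT fname, first_name FROM customer").fetchone())
# → ('Grace', 'Grace')

# Step 3 (once the transition period ends): drop the trigger and the old
# column. SQLite 3.35+ supports ALTER TABLE ... DROP COLUMN.
```

Each step is individually deployable and reversible, which is what makes the overall change low-risk even on a production schema.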
Automated Database Testing
Quality is paramount for agility. Disciplined Agile teams will develop, in an evolutionary manner of course, an automated regression test suite that validates their work. They will run this test suite many times a day so as to detect any problems as early as possible. Automated regression testing like this enables teams to safely make changes, such as refactorings, because if they inject a problem they will be able to quickly find and then fix it.
In fact, very disciplined teams will take a test-driven development (TDD) approach where they write tests before they implement the functionality that the tests validate (Guernsey 2013). As a result, the tests do double duty: they both validate and specify. You can do this at the requirements level by writing user acceptance tests, a strategy referred to as behavior-driven development (BDD) or acceptance test-driven development (ATDD), and at the design level via developer tests. By rethinking the order in which you work, in this case by testing first rather than last, you streamline your approach while increasing its quality.
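In a test-first style, the test below would be written before the view it validates, fail, and then drive the implementation until it passes. A minimal sketch; the business rule, table, and view names are assumptions for illustration:

```python
import sqlite3

def create_schema(conn):
    # Implementation written to make the test below pass.
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
    conn.execute("""CREATE VIEW overdrawn_accounts AS
                    SELECT id FROM account WHERE balance < 0""")

def test_overdrawn_accounts_lists_only_negative_balances():
    # Written FIRST: it specifies the business rule, then validates
    # the implementation every time the regression suite runs.
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.executemany("INSERT INTO account VALUES (?, ?)",
                     [(1, 100.0), (2, -25.0), (3, 0.0)])
    rows = conn.execute("SELECT id FROM overdrawn_accounts").fetchall()
    assert rows == [(2,)], rows

test_overdrawn_accounts_lists_only_negative_balances()
print("all tests passed")
```

Run many times a day as part of the automated regression suite (for example under pytest), tests like this are what make evolving a database schema safe.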
Continuous Database Integration
Continuous integration (CI) is a technique where you automatically build and test your system every time someone checks in a change (Sadalage 2003). Disciplined agile developers will typically update a few lines of code, make a small change to a configuration file, or make a small change to a physical data model (PDM), and then check their work into their configuration management tool. The CI tool monitors the repository, and when it detects a check-in it automatically kicks off the build and the regression test suite in the background. This provides very quick feedback to team members, enabling them to detect problems early.
Database Configuration Management
Configuration management (CM) is at the bottom of the stack, providing a foundation for all of the other agile database techniques. There is nothing special about the assets that you are creating – ETL code, configuration files, data models, test data, stored procedures, and so on – in that if they are worth creating then they are also worth putting under CM control.
I would like to end with two simple messages. First, you can do this: everything described in this chapter is pragmatic, supported by tooling, and has been proven in practice in numerous contexts. Second, you need to do this: the modern, dynamic business environment requires you to work in a responsive manner that does not short-change your organization's future. The Disciplined Agile approach described in this chapter shows you how to do exactly that.
References
- Ambler, S. W. (2002). Agile modeling: Effective practices for extreme programming and the unified process. New York: Wiley.
- Ambler, S. W. (2013). Database testing: How to regression test a relational database. Retrieved from http://www.agiledata.org/essays/databaseTesting.html.
- Ambler, S. W., & Lines, M. (2012). Disciplined agile delivery: A practitioner's guide to agile software delivery in the enterprise. New York: IBM Press.
- Ambler, S. W., & Sadalage, P. J. (2006). Refactoring databases: Evolutionary database design. Boston: Addison-Wesley.
- Guernsey, M., III. (2013). Test-driven database development: Unlocking agility. Upper Saddle River: Addison-Wesley Professional.
- Lindstedt, D., & Olschimke, M. (2015). Building a scalable data warehouse with Data Vault 2.0. Waltham: Morgan Kaufmann.
- Sadalage, P. J. (2003). Recipes for continuous database integration: Evolutionary database development. Upper Saddle River: Addison-Wesley Professional.