Encyclopedia of Big Data

Living Edition
| Editors: Laurie A. Schintler, Connie L. McNeely

Agile Data

  • Scott W. Ambler
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-32001-4_216-1

To succeed at big data, you must be able to process large volumes of data, data that is very often unstructured. More importantly, you must be able to swiftly react to emerging opportunities and insights before your competitors do. A Disciplined Agile approach to big data is evolutionary and collaborative in nature, leveraging proven strategies from the traditional, lean, and agile canons. Collaborative strategies increase both the velocity and quality of the work performed while reducing overhead. Evolutionary strategies – those that deliver incremental value through iterative application of architecture and design modeling, database refactoring, automated regression testing, continuous integration (CI) of data assets, continuous deployment (CD) of data assets, and configuration management – build a solid data foundation that will stand the test of time. In effect, this is the application of proven, leading-edge software engineering practices to big data.

This chapter is organized into the following sections:
  1. Why Disciplined Agile Big Data?

  2. Be Agile: An Agile Mindset for Data Professionals

  3. Do Agile: The Agile Database Technique Stack

  4. Last Words


Why Disciplined Agile Big Data?

The Big Data environment is complex. You are dealing with overwhelming amounts of data coming in from a large number of disparate data sources; the data is often of questionable quality and integrity, and it often comes from sources outside your scope of influence. You need to respond to quickly changing stakeholder needs without increasing the technical debt within your organization. At one extreme, traditional approaches to data management are insufficiently responsive; at the other extreme, mainstream agile strategies (in particular Scrum) come up short in addressing your long-term data management needs. You need a middle ground that combines just enough modeling and planning, performed at the most responsible moments, with engineering techniques that produce high-quality assets that are easily evolved yet will still stand the test of time. That middle ground is Disciplined Agile Big Data.

Disciplined Agile (DA) (Ambler and Lines 2012) is a hybrid framework that combines strategies from a range of sources, including Scrum, Agile Modeling, Agile Data, Unified Process, Kanban, and traditional approaches. DA promotes a pragmatic and flexible strategy for tailoring and evolving processes that reflect the situation that you face. A Disciplined Agile approach to Big Data leverages agile strategies such as architecture and design modeling along with modern software engineering techniques. These practices, described below, are referred to as the agile database technique stack. The aim is to quickly meet the dynamic needs of the marketplace without short-changing the long-term viability of your organization.

Be Agile: An Agile Mindset for Data Professionals

In many ways agility is more of an attitude than a skillset. The common characteristics of agile professionals are:
  • Willing to work closely with others, working in pairs or small teams as appropriate

  • Pragmatic in that they are willing to do what needs to be done to the extent that it needs to be done

  • Open minded, willing to experiment and learn new techniques

  • Responsible and therefore willing to seek the help of the right person(s) for the task at hand

  • Eager to work iteratively and incrementally, creating artifacts that are sufficient to the task at hand

Do Agile: The Agile Database Technique Stack

Of course it isn’t sufficient to “be agile” if you don’t know how to “do agile.” The following figure overviews the critical technical techniques required for agile database evolution. These agile database techniques have been proven in practice and enjoy both commercial and open source tooling support (Fig. 1).
Fig. 1

The agile database technique stack

We say they form a stack because in order to be viable, each technique requires the one immediately below it. For it to make sense to continuously deploy database changes you need to be able to develop small and valuable vertical slices, which in turn require clean architecture and design, and so on. Let’s explore each one in greater detail.

Continuous Database Deployment

Continuous deployment (CD) refers to the practice whereby, when an integration build is successful (it compiles, passes all tests, and passes any automated analysis checks), your CD tool automatically deploys the changes to the next appropriate environment(s) (Sadalage 2003). This includes changes to your business logic code as well as to your database. As you see in the following diagram, if the build runs successfully on a developer’s workstation, their changes are propagated automatically into the team integration environment (which automatically invokes the integration build in that space). When that build is successful, the changes are promoted into an integration testing environment, and so on (Fig. 2).
Fig. 2

Continuous database deployment

The aim of continuous database deployment is to reduce the time, cost, and risk of releasing database changes. Continuous database deployment only works if you are able to organize the functionality you are delivering into small, yet still valuable, vertical slices.
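The promotion logic described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular CD tool: the stage names and placeholder commands are hypothetical, standing in for whatever build, test, and migration tooling a real pipeline would invoke.

```python
import subprocess

# Hypothetical promotion pipeline: each stage must succeed before database
# changes advance to the next environment. The echo commands are placeholders
# for real build/test/migration steps.
STAGES = [
    ("developer workstation", ["echo", "run build + tests locally"]),
    ("team integration",      ["echo", "apply migrations + run regression suite"]),
    ("integration testing",   ["echo", "apply migrations + run end-to-end tests"]),
]

def promote(stages):
    """Run each stage in order; halt promotion at the first failure."""
    for name, command in stages:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"stage '{name}' failed; halting promotion")
            return False
        print(f"stage '{name}' passed; promoting to next environment")
    return True

ok = promote(STAGES)
print("pipeline", "succeeded" if ok else "halted")
```

A real CD tool adds much more (environment provisioning, rollback, audit trails), but the core control flow – promote only on a green build – is exactly this.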

Vertical Slicing

A vertical slice is a top to bottom, fully implemented and tested piece of functionality that provides some form of business value to an end user. It should be possible to easily deploy a vertical slice into production upon request. A vertical slice can be very small, such as a single value on a report, the implementation of a business rule or calculation, or a new reporting view. For an agile team, all of this implementation work should be accomplished during a single iteration/sprint, typically a one- or two-week period. For teams following a lean delivery lifecycle, this timeframe typically shrinks to days and even hours in some cases.

For a Big Data solution, a vertical slice is fully implemented from the appropriate data sources all the way through to a data warehouse (DW), data mart (DM), or business intelligence (BI) solution. For the data elements required by the vertical slice, you need to fully implement the following:
  • Extraction from the data source(s)

  • Staging of the raw source data (if you stage data)

  • Transformation/cleansing of the source data

  • Loading the data into the DW

  • Loading into your data marts (DMs)

  • Updating the appropriate BI views/reports where needed

A key concept is that you only do the work for the vertical slice that you’re currently working on. This is what enables you to get the work done in a matter of days (and even hours once you get good at it) instead of weeks or months. It should be clear that vertical slicing is only viable when you are able to take an agile approach to modeling.
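The steps above can be sketched end to end for a single tiny slice. This is an illustrative sketch using an in-memory SQLite database; the table and column names are hypothetical, and the point is that only the data elements this slice needs (here, a customer name and an order total) flow through staging, the DW, and the DM.

```python
import sqlite3

# One vertical slice, fully implemented: extract -> stage -> transform ->
# load DW -> load DM -> report. Only the columns this slice needs are touched.
source_rows = [  # extraction: pretend this came from an operational system
    {"name": " Ada Lovelace ", "total": "100.50"},
    {"name": "Grace Hopper",   "total": "75.25"},
]

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE staging (name TEXT, total TEXT)")     # raw staging area
dw.execute("CREATE TABLE warehouse (name TEXT, total REAL)")   # cleansed DW
dw.execute("CREATE TABLE mart_sales (name TEXT, total REAL)")  # data mart

# Stage the raw source data untouched
dw.executemany("INSERT INTO staging VALUES (?, ?)",
               [(r["name"], r["total"]) for r in source_rows])

# Transform/cleanse (trim names, cast totals to numbers) and load into the DW
dw.execute("""INSERT INTO warehouse
              SELECT TRIM(name), CAST(total AS REAL) FROM staging""")

# Load the data mart from the warehouse
dw.execute("INSERT INTO mart_sales SELECT name, total FROM warehouse")

# The BI view/report for this slice: sales per customer
for name, total in dw.execute("SELECT name, total FROM mart_sales ORDER BY name"):
    print(f"{name}: {total:.2f}")
```

Each subsequent slice extends these tables and transformations only as far as its own data elements require.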

Agile Data Modeling

Many traditional data professionals believe that they need to perform detailed, up-front requirements, architecture, and design modeling before they can begin construction work. Not only has this been shown to be an ineffective strategy in general, it proves disastrous in the dynamically evolving world of Big Data. A Disciplined Agile approach strives to keep the benefits of modeling and planning, which are to think things through, yet avoid the disadvantages associated with detailed documentation and with making important decisions long before you need to. DA does this by applying lightweight Agile Modeling (Ambler 2002) strategies such as:
  1. Initial requirements envisioning. This includes both usage modeling, likely via user stories and epics, and conceptual modeling. These models are high-level at first; their details will be fleshed out later as construction progresses.

  2. Initial architecture envisioning. Your architecture strategy is typically captured in a free-form architecture diagram, network diagram, or UML deployment diagram. Your model(s) should capture potential data sources; how data will flow from the data sources to the target data warehouse(s) or data marts; and how that work flows through combinations of data extraction, data transformation, and data loading capabilities.

  3. Look-ahead modeling. Sometimes referred to as “backlog refinement” or “backlog grooming,” the goal of look-ahead modeling is to explore work that is a few weeks in the future. This is particularly needed in complex domains where there may be a few weeks of detailed data analysis required to work through the semantics of your source data. For teams taking a sprint/iteration-based approach, this may mean that during the current iteration one or more people on the team explore requirements to be implemented one or two iterations in the future.

  4. Model storming. This is a just-in-time (JIT) modeling strategy where you work something through in greater detail, perhaps the details of what a report should look like or how the logic of a business calculation should work.

  5. Test-driven development (TDD). With TDD, your tests both validate your work and specify it. Specification can be done at the requirements level with acceptance tests and at the design level with developer tests. More on this later.


Clean Architecture and Design

High-quality IT assets are easier to understand, to work with, and to evolve. In many ways, clean architecture and design are fundamental enablers of agility in general. Here are a few important considerations for you:
  1. Choose a data warehouse architecture paradigm. Although there is something to be said for both the Inmon and Kimball strategies, I generally prefer Data Vault 2.0 (Lindstedt and Olschimke 2015). Data Vault 2.0 (DV2) has its roots in the Inmon approach, bringing in learnings from Kimball and, more importantly, practical experience dealing with DW/BI and Big Data in a range of situations.

  2. Focus on loose coupling and high cohesion. When a system is loosely coupled, it is easy to evolve its components without significant effects on other components. Components that are highly cohesive do one thing and one thing only; in data parlance, they are “highly normalized.”

  3. Adopt common conventions. Guidelines around data naming conventions, architectural guidelines, coding conventions, user experience (UX) conventions, and others promote greater consistency in the work produced.

  4. Train and coach your people. Unfortunately, few IT professionals these days get explicit training in architecture and design strategies, resulting in poor-quality work that increases your organization’s overall technical debt.


Database Refactoring

A refactoring is a simple change to your design that improves its quality without changing its semantics in a practical manner. A database refactoring is a simple change to a database schema that improves the quality of its design OR improves the quality of the data that it contains (Ambler and Sadalage 2006). Database refactoring enables you to safely and easily evolve database schemas, including production database schemas, over time by breaking large changes into a collection of smaller less-risky changes. Refactoring enables you to keep existing clean designs of high quality and to safely address problems in poor quality implementations.

Let’s work through an example. The following diagram depicts three stages in the life of the Split Column database refactoring. The first stage shows the original database schema, where the Customer table has a Name column in which the full name of a person is stored. We have decided to improve the quality of this table by splitting the column into three – in this case FirstName, MiddleName, and LastName. The second stage, the transition period, shows how Customer contains both the original version of the schema (the Name column) and the new/desired version, along with scaffolding code to keep the columns in sync. The transition period is required to give the people responsible for any systems that access customer name time to update their code to work with the new columns instead. This approach is based on the Java Development Kit (JDK) deprecation strategy. The scaffolding code, in this case a trigger that keeps the four columns consistent with one another, is required so that the database maintains integrity over the transition period. There may be hundreds of systems accessing this information – at first they will all access the original schema, but over time they will be updated to access the new version – and because these systems cannot all be reworked at once, the database must be responsible for its own integrity. Once the transition period ends and the existing systems that access the Customer table have been updated accordingly, the original schema and the scaffolding code can be safely removed (Fig. 3).
Fig. 3

Example database refactoring
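The transition period described above can be sketched concretely. This is an illustrative sketch using SQLite; the trigger name is hypothetical, and only the new-to-old synchronization direction is shown – a full refactoring would also keep writes to the legacy Name column synchronized with the new columns.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Original schema: full name stored in a single column
db.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT)")

# Transition period begins: the new columns are added alongside the original
db.execute("ALTER TABLE Customer ADD COLUMN FirstName TEXT")
db.execute("ALTER TABLE Customer ADD COLUMN MiddleName TEXT")
db.execute("ALTER TABLE Customer ADD COLUMN LastName TEXT")

# Scaffolding: a trigger rebuilds Name whenever the new columns change, so
# systems still reading the original schema keep seeing consistent data
db.execute("""
CREATE TRIGGER SynchronizeCustomerName
AFTER UPDATE OF FirstName, MiddleName, LastName ON Customer
BEGIN
  UPDATE Customer
  SET Name = NEW.FirstName ||
             CASE WHEN NEW.MiddleName IS NULL OR NEW.MiddleName = ''
                  THEN '' ELSE ' ' || NEW.MiddleName END ||
             ' ' || NEW.LastName
  WHERE CustomerID = NEW.CustomerID;
END
""")

# A legacy system inserts via the old schema; a reworked system writes the new
db.execute("INSERT INTO Customer (CustomerID, Name) VALUES (1, 'Grace Hopper')")
db.execute("UPDATE Customer SET FirstName = 'Grace', LastName = 'Hopper' "
           "WHERE CustomerID = 1")
print(db.execute("SELECT Name FROM Customer").fetchone()[0])  # Grace Hopper
```

When the transition period ends, dropping the Name column and the trigger completes the refactoring.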

Automated Database Testing

Quality is paramount for agility. Disciplined Agile teams will develop, in an evolutionary manner of course, an automated regression test suite that validates their work. They will run this test suite many times a day so as to detect any problems as early as possible. Automated regression testing like this enables teams to safely make changes, such as refactorings, because if they inject a problem they will be able to quickly find and then fix it.

When it comes to testing a database, the following diagram summarizes the kinds of tests that you should consider implementing (Ambler 2013). Of course, there is more to testing Big Data implementations than this; you will also want to develop automated tests/checks for the entire chain from data sources through your data processing architecture into your DW/BI solution (Fig. 4).
Fig. 4

What to test in a database

In fact, very disciplined teams will take a test-driven development (TDD) approach where they write tests before they do the work to implement the functionality that the tests validate (Guernsey 2013). As a result, the tests do double duty – they both validate and specify. You can do this at the requirements level by writing user acceptance tests, a strategy referred to as behavior-driven development (BDD) or acceptance test-driven development (ATDD), and at the design level via developer tests. By rethinking the order in which you work, in this case by testing first rather than last, you can streamline your approach while increasing its quality.
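The test-first idea can be sketched for a database rule. This is an illustrative example, not from the chapter: the table, constraint, and function names are hypothetical. The test specifies a data-quality rule (order totals must be non-negative) and the schema work satisfies it.

```python
import sqlite3

def build_schema(db):
    """Schema work, written to make the pre-existing test pass."""
    db.execute("""CREATE TABLE OrderFact (
                    OrderID INTEGER PRIMARY KEY,
                    Total   REAL NOT NULL CHECK (Total >= 0))""")

def test_rejects_negative_totals():
    """Written first: specifies that the database must reject bad totals."""
    db = sqlite3.connect(":memory:")
    build_schema(db)
    db.execute("INSERT INTO OrderFact VALUES (1, 19.99)")  # valid row accepted
    try:
        db.execute("INSERT INTO OrderFact VALUES (2, -5.00)")
    except sqlite3.IntegrityError:
        return True   # the constraint did its job
    return False      # bad data slipped in: the test fails

assert test_rejects_negative_totals()
print("regression test passed")
```

Run many times a day as part of the regression suite, a test like this both documents the rule and catches any schema change that breaks it.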

Continuous Database Integration

Continuous integration (CI) is a technique where you automatically build and test your system every time someone checks in a change (Sadalage 2003). Disciplined agile developers will typically update a few lines of code, make a small change to a configuration file, or make a small change to a physical data model (PDM), and then check their work into their configuration management tool. The CI tool monitors this, and when it detects a check-in, it automatically kicks off the build and regression test suite in the background. This provides very quick feedback to team members, enabling them to detect issues early.
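The database half of such a build can be sketched as follows. This is a minimal illustration under assumed conventions – the migration scripts and table are hypothetical – showing the step a CI tool might run on every check-in: replay the versioned schema changes against a fresh database, then run a regression check.

```python
import sqlite3

# Migration scripts, kept under version control alongside application code.
# On every check-in, the CI build replays them in order against a clean
# database to prove the schema can still be built from scratch.
MIGRATIONS = [
    "CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT)",
    "ALTER TABLE Customer ADD COLUMN Email TEXT",
]

def integration_build():
    db = sqlite3.connect(":memory:")
    for migration in MIGRATIONS:          # build: apply every migration in order
        db.execute(migration)
    # Regression check: the resulting schema has the expected columns
    columns = [row[1] for row in db.execute("PRAGMA table_info(Customer)")]
    return columns == ["CustomerID", "Name", "Email"]

print("build", "passed" if integration_build() else "FAILED")
```

A check-in that breaks a migration, or leaves the schema inconsistent with the tests, fails the build within minutes rather than surfacing weeks later.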

Configuration Management

Configuration management (CM) is at the bottom of the stack, providing a foundation for all of the other agile database techniques. There is nothing special about the assets that you are creating – ETL code, configuration files, data models, test data, stored procedures, and so on – in that if they are worth creating, then they are also worth putting under CM control.

Last Words

I would like to end with two simple messages: First, you can do this. Everything described in this chapter is pragmatic, supported by tooling, and has been proven in practice in numerous contexts. Second, you need to do this. The modern, dynamic business environment requires you to work in a responsive manner that does not shortchange your organization’s future. The Disciplined Agile approach described in this chapter shows you how to do exactly that.

Further Readings

  1. Ambler, S. W. (2002). Agile modeling: Effective practices for extreme programming and the unified process. New York: Wiley.
  2. Ambler, S. W. (2013). Database testing: How to regression test a relational database. Retrieved from http://www.agiledata.org/essays/databaseTesting.html.
  3. Ambler, S. W., & Lines, M. (2012). Disciplined agile delivery: A practitioner’s guide to agile software delivery in the enterprise. New York: IBM Press.
  4. Ambler, S. W., & Sadalage, P. J. (2006). Refactoring databases: Evolutionary database design. Boston: Addison-Wesley.
  5. Guernsey, M., III. (2013). Test-driven database development: Unlocking agility. Upper Saddle River: Addison-Wesley Professional.
  6. Lindstedt, D., & Olschimke, M. (2015). Building a scalable data warehouse with Data Vault 2.0. Waltham: Morgan Kaufmann.
  7. Sadalage, P. J. (2003). Recipes for continuous database integration: Evolutionary database development. Upper Saddle River: Addison-Wesley Professional.

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Scott W. Ambler, Disciplined Agile Consortium, Toronto, Canada