Having characterised the kinds of data movement that can be problematic, the next step is to create a method for identifying the presence of the movement anti-patterns in a new development. In this section, we describe the modelling approach we have designed, aimed at capturing only the information needed to discover the movement patterns.
The core requirement is to identify the points in an information infrastructure where data is moved between two organisational entities which differ in some way significant to the interpretation of the data. These are the places where the portability of the data is put under stress, where errors can occur when the differences are not recognised, and where effort must be put in to resolve the differences. The model must therefore allow us to capture:
The movement of data across an information infrastructure, including the entities which “hold” data within the system, and the routes by which data moves between them; we call this the model’s landscape.
The points at which key differences in the interpretation of data occur, both social and technological.
5.1 The Model
We model the information infrastructure of an IT development in terms of the existing data containers, actors and links of the data journey model landscape. We use data containers to note the places where data rests when it is not moving. A data container can be a system’s database when the data is in electronic form, or a file cabinet, a pigeon hole, or a desk when the data is in physical form. For example, when a general practitioner (GP) requests blood test results from the pathology lab of a hospital, data needs to travel from the GP secretary’s desk (where the request card and the blood sample rest), to the hospital porter’s pigeon holes, to the lab’s database (where results are input by the lab analyst), and back to the GP’s database, where the GP reads the results in order to discuss them with the patient. We model data containers using a rectangular box, as shown in Fig. 3.
Actors are the people or systems that interact with the containers to create, consume, or transform the data resting in them. In the example described above, a lab analyst interacts with the lab system database to input the results of the analysis. The analyst is the creator of the test results data. The GP consumes that data by interacting with the GP system database. Actors are modelled using the actor symbol of the UML notation, and their interactions with containers using a dotted arrow, as shown in Fig. 3.
While data may be stored in one container, it may be consumed at several places in the landscape. Links are the routes that currently exist between two containers along which data can move, and are modelled as straight lines between two containers.
To move along a link, data must be represented in a medium of physical or electronic form. For example, the request card resting in the secretary’s office is moved to the pathology lab by post. The test results move from the lab’s database to the GP’s system through an internet connection.
Containers, actors and links make up the landscape of the existing infrastructure in which data moves. Often, a new movement must be implemented. A journey describes the movement required for a piece of data needed by some consumer to travel from its point of entry into the landscape to its point of use by the new actor. A data journey begins at the container storing the source data, and ends at the container with which the end consumer interacts. In Fig. 3, the initial container of the journey is the GP desk and the final consumer is the GP.
Sometimes no direct link exists between the source and target containers, so the data must move through intermediary containers using existing links. These intermediate movements are called legs, and a data journey is made up of a number of journey legs. Journey legs are modelled with an arrow connecting the containers between which data is moved. The direction of the arrow shows the direction in which data needs to move. A journey leg can follow an existing link or create a new link between two containers.
Figure 4 shows the meta-model for the data journey model, expressed in UML. A data journey diagram is a set of consecutive journey legs. A journey leg moves a piece of data from a source container to a target container through an electronic or physical medium. An actor interacts with a container to create, consume or transform the data stored in it.
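The meta-model can be made concrete with a small data structure. The following is a minimal sketch only, not the meta-model of Fig. 4 itself: the class and field names, the representation of a link as an unordered pair of containers, and the media assigned to each leg of the blood test example are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Medium(Enum):
    PHYSICAL = "physical"
    ELECTRONIC = "electronic"


class Interaction(Enum):
    CREATE = "create"
    CONSUME = "consume"
    TRANSFORM = "transform"


@dataclass(frozen=True)
class Container:
    name: str


@dataclass(frozen=True)
class Actor:
    name: str
    container: Container       # the container the actor interacts with
    interaction: Interaction


@dataclass(frozen=True)
class JourneyLeg:
    source: Container
    target: Container
    medium: Medium             # medium in which the data travels on this leg


@dataclass
class DataJourney:
    legs: list[JourneyLeg]

    def is_consecutive(self) -> bool:
        # Each leg must start at the container where the previous leg ended.
        return all(a.target == b.source for a, b in zip(self.legs, self.legs[1:]))

    def new_links(self, existing_links: set[frozenset]) -> list[JourneyLeg]:
        # Legs that do not follow an existing link of the landscape.
        return [leg for leg in self.legs
                if frozenset((leg.source, leg.target)) not in existing_links]


# Blood test example (container names and media are illustrative assumptions).
desk = Container("GP secretary's desk")
pigeon_holes = Container("Hospital porter's pigeon holes")
lab_db = Container("Lab database")
gp_db = Container("GP database")

# Assumption: a link is an undirected pair of containers.
existing_links = {
    frozenset((desk, pigeon_holes)),
    frozenset((pigeon_holes, lab_db)),
    frozenset((lab_db, gp_db)),
}

analyst = Actor("Lab analyst", lab_db, Interaction.CREATE)
gp = Actor("GP", gp_db, Interaction.CONSUME)

journey = DataJourney([
    JourneyLeg(desk, pigeon_holes, Medium.PHYSICAL),
    JourneyLeg(pigeon_holes, lab_db, Medium.PHYSICAL),
    JourneyLeg(lab_db, gp_db, Medium.ELECTRONIC),
])

print(journey.is_consecutive())
print(journey.new_links(existing_links))
```

Keeping links and legs as separate notions mirrors the distinction drawn above: links describe the landscape as it exists, while legs describe the movement a particular journey requires, which may or may not be supported by an existing link.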
5.2 Identifying Potential Costs
Having created a data journey model, the next step is to add in the information that can help us identify the legs where high cost or risk might be involved. We have seen from the case study analysis that costs and risks arise when data is moved between two entities that differ in some key way. Thus, when a human enters data into a software system, or two humans with very different professional backgrounds share data, or when software systems designed for different user sets communicate with each other, there is the potential need to transform or filter the data, to make it fit for its new context of use. However, to predict those places where costs might appear, we need information that is cheap to apply, since there is little value in predictions that cost a significant fraction of the actual development costs to create. We therefore focus on obtaining only the bare minimum of information needed, and ideally only on information that is readily available or cheap to acquire.
In the case studies, we found that high cost and risk occurred when data was shared between actors and containers with the following discrepancies:
Change of media: Containers using different media. For example, when a legacy X-Ray image on film must be scanned into a PDF for online storage and manipulation.
Discontinuity - external organisation: Containers belonging to different organisational units. For example, cancer data captured by a Foundation Trust that is needed for research purposes by another agency.
Change of context, clash of grammars: People using different vocabularies. For example, when a secretary is asked to transcribe notes dictated by a consultant.
We need low-cost ways of incorporating these factors into the data journey model. In some cases, the information is readily available. For example, it is normally well known to stakeholders whether information is stored on paper, in a filing cabinet, or in electronic form. However, other factors, like people’s vocabularies, are less obvious. For these factors we use a proxy: some piece of information which is cheap to obtain, and which approximates the relationship between the actors and containers captured by the original factor. For example, we use salary bands as a proxy indicator for the presence of “clash of grammars”, on the grounds that a large difference in salary bands between actors probably indicates a different degree of technical expertise.
We use the following rules and proxies for indicating the presence of a boundary between the source and target of a data journey leg. A boundary indicative of high cost/risk can be predicted to be present when:
the medium of the source container of a journey leg is different from the medium of the target,
the source container of a journey leg belongs to a different organisational unit from the target container, or
the actor creating the data at the source container has a different salary band than the actor consuming it at the target.
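The three rules above can be read as simple predicates over the two ends of a journey leg. The sketch below is one possible encoding under stated assumptions: the `LegEnd` record, the integer salary bands, the example values, and the `min_band_gap` parameter are all hypothetical. The default of 1 flags any difference in band, as the rule is stated; a larger value would flag only large differences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegEnd:
    """One end of a journey leg, with the cheap-to-obtain properties used by the rules."""
    container: str
    medium: str        # "physical" or "electronic"
    org_unit: str      # organisational unit owning the container
    salary_band: int   # band of the actor creating (source) or consuming (target) the data


def predicted_boundaries(source: LegEnd, target: LegEnd, min_band_gap: int = 1) -> list[str]:
    """Return the boundaries predicted between the source and target of one leg."""
    found = []
    if source.medium != target.medium:
        found.append("change of media")
    if source.org_unit != target.org_unit:
        found.append("external organisation")
    # Proxy rule: a salary band gap stands in for a "clash of grammars".
    if abs(source.salary_band - target.salary_band) >= min_band_gap:
        found.append("clash of grammars")
    return found


# Hypothetical values, for illustration only.
desk = LegEnd("GP secretary's desk", "physical", "GP", salary_band=3)
lab_db = LegEnd("Lab database", "electronic", "F.T.", salary_band=6)

print(predicted_boundaries(desk, lab_db))
print(predicted_boundaries(desk, lab_db, min_band_gap=4))
```

Returning the list of boundary names, rather than a single flag, keeps visible which factor is responsible for the predicted cost on each leg.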
To identify the places in which the above factors may impose costs, we group together the elements of the data journey diagram with similar properties. For example, we group together all physical containers, all electronic containers, clerical staff, clinical staff, elements belonging to the radiology department of a Foundation Trust (F.T.), elements belonging to the GP, and so on. These groupings are overlaid onto the landscape of the data journey model and form boundaries. For example, Fig. 5 shows the containers belonging to the GP organisation in blue and those belonging to the F.T. in orange. The places where a journey leg crosses from one grouping into another are the predicted locations of the cost/risk introduced by the external organisation factor. In Fig. 5, the costly journey legs are marked with a red warning sign.
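The grouping step can be sketched in the same spirit: group the elements by one property, then report the legs whose ends fall in different groups. The container names, the property values, and the two helper functions below are assumptions made for illustration; in particular, assigning the porter's pigeon holes to the F.T. is a guess based on the blood test example.

```python
from collections import defaultdict


def group_by(elements: dict[str, dict[str, str]], prop: str) -> dict[str, set[str]]:
    """Group element names by the value they have for one property."""
    groups = defaultdict(set)
    for name, props in elements.items():
        groups[props[prop]].add(name)
    return dict(groups)


def crossing_legs(legs, elements, prop):
    """Return the legs whose source and target lie in different groups."""
    return [(s, t) for s, t in legs if elements[s][prop] != elements[t][prop]]


# Hypothetical landscape for the blood test example.
containers = {
    "GP secretary's desk":   {"organisation": "GP",   "medium": "physical"},
    "Porter's pigeon holes": {"organisation": "F.T.", "medium": "physical"},
    "Lab database":          {"organisation": "F.T.", "medium": "electronic"},
    "GP database":           {"organisation": "GP",   "medium": "electronic"},
}

legs = [
    ("GP secretary's desk", "Porter's pigeon holes"),
    ("Porter's pigeon holes", "Lab database"),
    ("Lab database", "GP database"),
]

print(group_by(containers, "organisation"))
for prop in ("organisation", "medium"):
    print(prop, crossing_legs(legs, containers, prop))
```

Each property yields its own overlay, so the same leg can be flagged under one grouping and not under another.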
Boundaries stemming from factors other than those stated above are also likely to exist. However, we do not include them in this analysis, since evaluating them would require enough work to warrant a paper of its own. Both the boundaries described above and the data journey model have been evaluated in a retrospective study of a real-world case from the NHS domain. The study describes data moved from a GP organisation to the radiology department of a F.T. The results of the evaluation showed that our model can identify places of high cost and risk. A further description of the results is given in .