We are at the beginning of a series of interdependent steps, of which the project understanding phase is the first. In this initial phase of a data analysis project, we have to map a problem onto one or more data analysis tasks. In a nutshell, we conjecture that the nature of the problem at hand can be adequately captured by some data sets (which still have to be identified or constructed), that appropriate modeling techniques can successfully be applied to learn the relationships in the data, and finally that the gained insights or models can be transferred back to the real case and applied successfully. This endeavor relies on a number of assumptions and is threatened by several risks, so the goal of the project understanding phase is to assess the main objective, the potential benefit, and the constraints, assumptions, and risks. While the number of data analysis projects is rapidly expanding, the failure rate is still high, so this phase should be carried out with care: it allows us to rate the chances of success realistically and keeps the project on the right track.

We have already sketched the data analysis process (CRISP-DM in Sect. 1.2). There is a clear order among the steps in the sense that, before a later step can start, all preceding steps must have been executed. However, this does not mean that we can run once through all steps to deterministically achieve the desired results. There are many options and decisions to be made, and most of them rely on our (subjective and evolving) understanding of the problem at hand. Moreover, the line of argument will not always run from an earlier phase to a later one. For instance, if a regression problem has to be solved, the analyst may decide that a certain method seems a promising choice for the modeling phase. From the characteristics of this technique, he knows that all input data have to be transformed into numerical data, which has to be carried out beforehand (data preparation phase). This requires a careful look at the multivalued, ordinally scaled attributes already in the data understanding phase to see how the order of their values is best preserved. If such requirements are not considered in time, it may happen that only in the evaluation phase does it turn out that the project owner expected to gain insights into the input–output relationship rather than a black-box model. Had the analyst considered this requirement beforehand, he might have chosen a different method. Changing such a decision at any point later than this initial project understanding phase often renders some (if not most) of the earlier work in data understanding, data preparation, and modeling useless. While the time spent on project and data understanding is small compared to data preparation and modeling (20% : 80%), the importance to success is just the opposite (80% : 20%) [4].
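
For illustration, the following minimal sketch shows one way such an order-preserving transformation of an ordinal attribute might be carried out; it assumes pandas, and the attribute name and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"satisfaction": ["low", "high", "medium", "low"]})

# Declare the attribute as ordinal with an explicit value order ...
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True
)

# ... so that the numerical codes 0, 1, 2 preserve that order.
df["satisfaction_num"] = df["satisfaction"].cat.codes
print(df)
```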

3.1 Determine the Project Objective

As a first step, a primary objective (not a long list, but one or two goals) and some success criteria in terms of the project domain have to be determined (who will decide which results are desired and whether the original project goal has been achieved?). This is much easier said than done, especially if the analysis is not carried out by the domain expert himself. In such cases the project owner and the analyst speak different languages, which may cause misunderstandings and confusion. In the worst case, the communication problems lead to very soft project goals, just vague enough to allow every stakeholder to see his or her own perspective somehow accounted for. In the end, all of a sudden, the stakeholders recognize that the results do not fit their expectations. The challenge here is usually not a matter of technical but of communicative competence.

Table 3.1 shows some typical problems occurring in such projects. To overcome language confusion, a glossary of terms, definitions, acronyms, and abbreviations is indispensable. Knowing the terms, however, still does not imply an understanding of the project domain, the objectives, constraints, and influencing factors. One interviewing technique that may help to get the most out of the expert is to rephrase all of her statements, which often provokes additional qualifying statements. Another technique is to use explorative tools such as mind maps or cognitive maps to sketch beliefs, experiences, and known factors and how they influence each other.

Table 3.1 Problems faced in data analysis projects, excerpt from [1]

An example of a cognitive map for the shopping domain considered in Sect. 2 is given in Fig. 3.1. Each node of this graph represents a property of the considered product or the customer. The variable of interest is placed in the center: how often will a certain product be found in the shopping basket of the customer? This depends on various factors, which are placed around this node. The direction of influence is given by the arrows, and the line style indicates the way in which the variables influence each other: the higher the customer's affinity to the product, the more often it will be found in the basket. The author of the cognitive map conjectures that the product affinity itself is positively influenced by a high product quality and the customer's brand loyalty (a loyal customer is less likely to buy substitute products). On the other hand, the broader the range of offered substitutes, the more likely a customer is to try out a different product. Other relationships depend on the product itself: the higher the demand for a certain product, the more often it will be found in the shopping basket, but the demand itself may, depending on the product, vary with gender (e.g., razor blades, hairspray), age (e.g., denture cleaner), or family status (e.g., napkins, fruit juice). The development of such a map supports the understanding of the domain and the adjustment of expectations.

Fig. 3.1 A cognitive map for the shopping domain: how often will a certain product occur in the shopping basket of some customer? The positive correlation between income and affordability reads as "the higher the income, the higher the affordability", whereas an example of a negative correlation reads as "the broader the range of offered substitutes, the lower the product affinity"

While constructing a cognitive map, a few rules should be adhered to. First, to keep the map clear, only direct dependencies should be included in the graph. For instance, the size of the household influences the target variable, but only indirectly via the generated product demand and the affordability; therefore, there is no direct connection from size of household to frequency of product in shopping basket. Secondly, the labels of the nodes should be chosen carefully, so that they remain easily interpretable when plugged into relationship templates such as "the higher …, the higher …". As an example, the node size of household could have been named family status, but then it is not quite clear what "the more family status …" would actually mean.
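
As a minimal sketch (in plain Python; the node names follow Fig. 3.1, and some edge signs are illustrative assumptions), such a cognitive map can be represented as a signed directed graph:

```python
# Edge signs encode "the higher ..., the higher ..." (+1)
# versus "the higher ..., the lower ..." (-1).
cognitive_map = {
    ("income", "affordability"): +1,
    ("product quality", "product affinity"): +1,
    ("brand loyalty", "product affinity"): +1,
    ("range of substitutes", "product affinity"): -1,
    ("size of household", "product demand"): +1,
    ("size of household", "affordability"): -1,  # sign is an illustrative assumption
    ("product affinity", "frequency in basket"): +1,
    ("product demand", "frequency in basket"): +1,
    ("affordability", "frequency in basket"): +1,
}

def direct_influences(node):
    """Return the direct predecessors of a node together with their edge signs."""
    return {src: sign for (src, dst), sign in cognitive_map.items() if dst == node}

print(direct_influences("frequency in basket"))
```

Restricting the dictionary to direct edges mirrors the first rule above; indirect influences can be recovered by following paths through the graph.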

Once an understanding of the domain has been achieved, the problem and primary objective have to be identified (see Table 3.2). Again, it is often useful to discuss or model the current solution first, for instance, by using techniques from software engineering (business process modeling, UML use cases, etc.) [3]. When the current solution has been elaborated, its advantages and disadvantages can be explored and discussed. Often, the primary objective is assumed to be known beforehand; after all, the project would probably not have been initiated without a problem having been identified first. But as there are many different ways to attack a problem, the objective should be precise about the direction to follow. A general statement about the goal is easily made ("model the profitable customers to increase the sales"), but it is often not precise enough (how exactly do we identify a profitable customer?) and not actionable (how exactly shall this model help to increase the sales?). To render the objective more precise, it is necessary to sketch the target use already at this early stage. Thus it becomes clear what kind of result has to be delivered, which may range from a written technical report with interesting findings to user-friendly software that uses the final model to automate decisions.

Table 3.2 Clarifying the primary objectives

From the perspective of the project owner, some of these elaborate steps may appear unnecessary, since they master their domain already. However, these steps must be considered as preparation for the closely linked data understanding phase (see next section). All the identified factors, situations, and relationships that are assumed to be relevant must be present and recognizable in the data. If they cannot be found in the data, either there is a misconception in the project understanding or (even worse) the data is not substantial or detailed enough to reflect the important relations of the real-world problem. In both cases, it would be fatal to miss this point and proceed unconcerned.

3.2 Assess the Situation

The next step is to estimate the chances of a successful data analysis project. This requires a review of the available resources, requirements, and risks. The most important resources are data and knowledge, that is, databases and experts who can provide background information (about the domain in general and about the databases in particular). Besides a plain listing of databases and personnel, it is important to clarify the access to both: if the data is stored in an operational system, mining the data may paralyze the applications using it. To become independent of such systems, it is advisable to obtain a database dump. Experts are typically busy and difficult to get hold of, but an inaccessible knowledge source is useless, so a sufficient number of time slots for meetings should be arranged early.

Based on the domain exploration (cognitive map, business process model, etc.), a list of explicit and implicit assumptions and risks is created to judge the chances of a successful project and to guide the next steps. Data analysis lives on data, and this list shall help us convince ourselves that the data is meaningful and relevant to the project. Why should we undertake this effort, when we will see later anyway whether we can build a model from this data? Unfortunately, that is only half of the truth. After reviewing a number of reports in a data analysis competition, Charles Elkan noted that "when something surprising happens, rather than question the expectations, people typically believe that they should have done something slightly different" [2]. Expecting that the problem can be solved with the given data may lead to continuously changing and "optimizing" the model, rather than taking into account the possibility that the data is not appropriate for this problem. To avoid this pitfall, the conjectured relations and expert-proven connections can help us verify that the given data satisfy our needs, or put forward good reasons why the project will probably fail. This is particularly important because in many projects the available data were not collected to serve the purpose now intended. To prevent us from carrying out an expensive project with almost no prospect of success, we have to carefully track all assumptions and verify them as soon as possible. Typical requirements and assumptions include:

  • requirements and constraints

    • model requirements,

      e.g., model has to be explanatory (because decisions must be justified clearly)

    • ethical, political, legal issues,

      e.g., variables such as gender, age, race must not be used

    • technical constraints,

      e.g., applying the technical solution must not take more than n seconds

  • assumptions

    • representativeness:

      If conclusions about a specific target group are to be derived, a sufficiently large number of cases from this group must be contained in the database, and the sample in the database must be representative of the whole population (a simple check of this assumption is sketched after this list).

    • informativeness:

      To cover all aspects by the model, most of the influencing factors (identified in the cognitive map) should be represented by attributes in the database.

    • good data quality:

      The relevant data must be of good quality (correct, complete, up-to-date) and unambiguous thanks to the available documentation.

    • stability of external factors:

      We may assume that the external world does not change during the project: in a marketing project, for instance, we may assume that the competitors do not change their current strategy or product portfolio.

Every assumption inherently represents a risk (and there may be other risks besides). If possible, a contingency plan should be sketched in case an assumption turns out to be invalid, including options such as the acquisition of additional data sources.
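
To make the early verification of such assumptions concrete, the following sketch checks the representativeness assumption with a chi-square goodness-of-fit test; SciPy is assumed to be available, and the age groups, counts, and population shares are hypothetical reference values (e.g., from official statistics).

```python
from scipy.stats import chisquare

sample_counts = [480, 320, 200]        # observed cases per age group in our database
population_share = [0.50, 0.35, 0.15]  # assumed shares in the target population

n = sum(sample_counts)
expected = [p * n for p in population_share]

stat, p_value = chisquare(sample_counts, f_exp=expected)
if p_value < 0.05:
    print(f"sample deviates from population (p={p_value:.3f}): representativeness doubtful")
else:
    print(f"no significant deviation from population shares (p={p_value:.3f})")
```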

3.3 Determine Analysis Goals

Finally, the primary objective must be transformed into a more technical data mining goal. An architecture for the envisaged solution has to be found, composed of building blocks as discussed in Sect. 1.3 (data analysis tasks). For instance, this architecture might contain one component that first groups the customers according to some readily available attributes, another component that finds interestingly deviating subgroups within each of the groups, and a third component that predicts some variable of interest based on the customer data and the membership in the respective groups and subgroups. The better this architecture fits the actual situation, the better the chances of finding a model class that will prove successful in practice. To achieve this fit, the discussions about the project domain are of great help.
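
A minimal sketch of such a component architecture, assuming scikit-learn and synthetic stand-in data (the subgroup-discovery component is only indicated by a comment), might look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # stand-in for readily available customer attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for the variable of interest

# Component 1: group the customers.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Component 2 (indicated only): a subgroup-discovery step would inspect
# each cluster here for interestingly deviating subgroups.

# Component 3: predict the target from the attributes plus group membership.
X_ext = np.column_stack([X, clusters])
model = RandomForestClassifier(random_state=0).fit(X_ext, y)
print("training accuracy:", model.score(X_ext, y))
```

Feeding the group membership into the predictive component is one simple way to realize the coupling between the building blocks; in a real project, the choice and wiring of components follow the domain discussions described above.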

Again there is the danger of accepting a plausible-looking architecture too quickly, underestimating or even ignoring its great impact on the overall effort. Suppose that a company wants to increase the sales of some high-end product by direct mailing. One approach is to develop a model that predicts, using the company's own customer database, who will buy this product. Such a model might be interesting to interpret (useful for a report), but if it is used to decide to whom a mailing should be sent, most of the customers in that same database may have the product already. Applying the model to people not in the database is impossible, as we lack the information about them that the model needs. The predictive model may also find that customers buying the product have been loyal customers for many years, but artificially increasing the duration of the customer relationship to promote the purchase of the product is unfortunately impossible. If a foreseeable result is ignored, or a misconception with respect to the desired use of the model is not recognized, considerable time may be wasted building a correct model that turns out to be useless in the end.

For each of the building blocks, we can select a model class and a technique to derive a model of this class automatically from data. There is no single best method for predictive tasks, because all methods have their individual weaknesses and strengths, and it is impossible to combine all their desirable properties or remove all their biases (see Chap. 5). Although the final decision about the modeling technique will be made in the modeling phase, it should be clear already at this point of the analysis which properties the model should have and why. Methods and tools optimize the technical aspects of model quality (such as accuracy, see also Chap. 5). Other aspects, such as interestingness or interpretability, are often difficult to formalize and thus to optimize, so the choice of the model class has the greatest influence on these properties. Desirable properties may be, for instance:

  • Interpretability:

    If the goal of the analysis is a report that sketches possible explanations for a certain situation, the ultimate goal is to understand the delivered model. For some black-box models, it is hard to comprehend how the final decision is made; such models lack interpretability.

  • Reproducibility/stability:

    If the analysis is carried out more than once, we may achieve similar performance—but not necessarily similar models. This does no harm if the model is used as a black box, but hinders a direct comparison of subsequent models to investigate their differences.

  • Model flexibility/adequacy:

    A flexible model can adapt to more (and more complicated) situations than an inflexible model, which typically makes more assumptions about the real world and requires fewer parameters. If the problem domain is complex, the model learned from the data must also be complex to be successful. However, with flexible models the risk of overfitting increases (this will be discussed in Chap. 5).

  • Runtime:

    If restrictive runtime requirements are given (either for building or applying the model), this may exclude some computationally expensive approaches.

  • Interestingness and use of expert knowledge:

    The more an expert already knows, the more challenging it is to "surprise" him or her with new findings. Some techniques that look for associations (see Sect. 7.6) are known for their large number of findings, many of them redundant and thus uninteresting. So if there is a possibility of including any kind of prior knowledge, this may, on the one hand, ease the search for the best model considerably and, on the other, prevent us from rediscovering too many well-known facts.
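
To make some of these trade-offs tangible, the following sketch contrasts an interpretable model with a more flexible black-box one on synthetic data; it assumes scikit-learn and merely illustrates the interpretability and runtime properties discussed above.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Interpretable: a shallow decision tree whose rules could go into a report.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"x{i}" for i in range(5)]))

# Flexible but opaque: an ensemble of 500 trees; note the runtime difference.
start = time.perf_counter()
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
print(f"forest: {time.perf_counter() - start:.2f}s training,"
      f" training accuracy {forest.score(X, y):.2f}")
```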

When discussing the various modeling techniques in Chaps. 7–9, we will give hints about which of these properties they possess. The final choice is then up to the analyst.

3.4 Further Reading

The books by Dorian Pyle [4, 5] offer many suggestions and constructive hints for carrying out the project understanding phase; [5] contains a step-by-step workflow for business understanding and data mining, consisting of various action boxes. An organizationally grounded framework for formally implementing the business understanding phase of data mining projects is presented in [6]. In [1], a template set for eliciting and documenting project requirements is proposed.