Requirements Engineering, Volume 12, Issue 4, pp 231–244

Cognitive complexity in data modeling: causes and recommendations

Author: D. Batra, College of Business Administration, Florida International University

Original Article

DOI: 10.1007/s00766-006-0040-y

Cite this article as:
Batra, D. Requirements Eng (2007) 12: 231. doi:10.1007/s00766-006-0040-y

Abstract

Data modeling is a complex task for novice designers. This paper conducts a systematic study of cognitive complexity to reveal the important factors pertaining to data modeling. Four major sources of complexity principles are identified: problem solving principles, design principles, information overload, and systems theory. The factors that lead to complexity are listed in each category. Each factor is then applied to the context of data modeling to evaluate whether it affects data modeling complexity. Redundant factors from different sources are ignored, and closely linked factors are merged. The factors are then integrated into a comprehensive list. Factors that largely cannot be controlled are dropped from further analysis. The remaining factors are employed to develop a semantic differential scale for assessing cognitive complexity. The paper concludes with implications and recommendations on how to address the cognitive complexity caused by data modeling.

Keywords

Data modeling · Cognitive complexity · Problem solving · Design principles · Information overload · Systems theory

1 Introduction

In a comprehensive survey of usability studies in data modeling, Topi and Ramesh [1] have identified the research questions in this area that have traditionally been considered important: how do the characteristics of the available tools affect users’ ability to succeed in their tasks (i.e., what is the level of usability of the tools), and how satisfied are the users with the tools (e.g., what is the perceived ease of use)? Although each surveyed study defines one or more task variables, the focus is invariably on data modeling formalisms. This line of research has been able to identify the better formalisms. However, novice database designers engaged in data modeling still perform inadequately, and little is known about why data modeling is a cognitively complex task.

Several studies (e.g., [2, 3]) have found low performance of novice designers in modeling relationships. Batra and Antony [4] examined designer performance in open-ended modeling exercises and attributed the low performance not just to certain selected relationships such as unary and ternary, but to all kinds of relationships, including binary relationships. They found that when designers are asked to solve problems, they commit errors in modeling relationships because of biases resulting from naïve heuristics [5]. Other studies (see [1] for a complete survey) have also found that novices mainly face problems in modeling relationships; modeling entities, in contrast, has not been found to be a major source of difficulty. Batra and Wishart [6] found that the use of patterns [7] does not significantly improve designer performance. It is evident that despite a line of research in data model evaluation, the key complexity problems have not been outlined and novice designers still encounter difficulty in applying the models.

To address the above issue, we need to examine what causes cognitive complexity in general, and what aspects of it are relevant to the data modeling domain. The scope of this study is limited to novice database designers given that expert designers, by definition, are able to tackle difficult problems [8]. According to Lord and Maher [9], novices typically have only a few hundred hours of experience in an area. In the data modeling context, a novice database designer would be someone who has completed a database course but has very limited experience.

A few usability studies in data modeling have considered task complexity but have not defined it in detail. Task complexity was used as an independent variable in four studies [10–13]. All four studies found a main effect for complexity on performance, which essentially means that the experimental manipulation was successful. These studies do not describe the complexity construct in detail, however. For example, Shoval and Even-Chaime [12] describe the tasks as follows: “The two simple tasks involved a relatively simple DFD, with a single data store and a few data flows and examples only. The two more complicated tasks involved a more detailed DFD which included two data stores, more data flows and examples, more data elements, and more complex relationships/dependencies between them.”

Although task complexity has been considered in data modeling, cognitive complexity has not been studied. The two are, obviously, related. It is evident from past studies that novice designers encounter cognitive complexity in data modeling. A technique should effectively use a data modeling formalism to ease the difficulty encountered in modeling a task. We, thus, need to complement the better data modeling formalisms with the right techniques. A big hurdle in devising an effective technique is the lack of understanding of what causes cognitive complexity in data modeling. This paper conducts a systematic analysis of the causes of cognitive complexity to reveal the important factors that determine it. The factors are used to devise a semantic differential scale for assessing cognitive complexity.

2 What is cognitive complexity?

Cognitive complexity may be defined as the sum of the factors that make things hard to see, use, grasp, and understand, and that contribute directly to our neural load. It is a function of the content, structure, and amount of knowledge required to perform a task using a specific application [14]. An increase in complexity places heavy demands on working memory and leads to an increase in cognitive strain, which in turn lowers performance [15, 16]. Of course, complexity is inherently subjective: whatever complexity a system has is a joint property of the system and its interaction with other systems, most often an observer or controller [17]. In this paper, cognitive complexity is considered with respect to a novice designer.

Cognitive complexity is related to the generic term complexity. Flood and Carson [18] proposed that complexity is associated with anything people find difficult to understand, leading to the conclusion that complexity revolves around two main aspects: (1) features inherent in things, also called objects or systems, and (2) the characteristics of the way in which people perceive these objects. The term means composite as opposed to simple, in the sense of “consisting of many interrelated parts” [19]. This definition of complexity is portrayed in Fig. 1, from Flood and Carson [18].

Fig. 1 Disassembly of complexity (from Flood and Carson [18])

Regarding the human element, the main characteristics that affect complexity are interest, capability, and perceptions. The major constraint on capability is the limitation inherent in human cognition that only seven plus or minus two chunks of information can be processed without causing cognitive overload [20]. Uhr et al. [21] described this limit as the “inelastic limit of human capacity.” Complexity is encountered when this limit is exceeded. In order to deal with the complexity, individuals must engage in strategies such as reformulating the information into larger chunks [18, 20].

Pippenger [22] referred to complex situations as involving interconnected constructs of a large number of simple components. He stated that “the most important lesson of complexity theory is the demonstration of the diversity of phenomena that can arise through the interaction of simple components.” Figure 2, taken from Flood and Carson [18], illustrates complexity arising from the growth in number of parts and consequently in the number of possible relationships and states.

Fig. 2 Elements, possible relationships, and states as a measure of complexity (from Flood and Carson [18])
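
To make the growth that Fig. 2 depicts concrete, the short sketch below counts, for a handful of element counts, the possible binary relationships and, under the simplifying assumption that each relationship is merely present or absent, the possible configurations. The sketch is illustrative only and is not taken from Flood and Carson.

```python
from math import comb

# For n elements, each unordered pair is a potential binary relationship.
# Treating each potential relationship as simply present or absent gives a
# crude lower bound on the number of possible configurations ("states").
for n in range(2, 8):
    pairs = comb(n, 2)            # n(n-1)/2 potential relationships
    configurations = 2 ** pairs   # each relationship modeled or not
    print(f"{n} elements: {pairs:2d} possible relationships, "
          f"{configurations} possible configurations")
```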

Cognitive complexity is difficult to measure directly; instead, a surrogate like structural complexity has usually been used by considering the elements or constructs such as the classes, attributes, and relationships [23–25]. Time also has been used as a surrogate to estimate complexity [26]. Rossi and Brinkkemper [27] proposed 17 distinct definitions relating to the structural complexity of a diagramming technique. This paper looks beyond structural complexity and examines cognitive complexity in a holistic manner.

We first study theoretical or formal sources that have addressed complexity in general, and then apply the factors to the data modeling context. Two references [28, 29] provide a generic and relatively complete list of factors relevant to the issue. Based on these texts, four sources of complexity principles are identified: problem solving, design complexity, information overload, and systems theory. Not surprisingly, there is overlap among the factors proposed in each category. The paper identifies the overlap and proposes new names that are more relevant to the data modeling domain. A comprehensive set of factors is identified, and the ones that are difficult to control are dropped. The remaining factors are used to develop a semantic differential scale. The factors are classified according to their severity so that the more critical ones can be identified. The paper also identifies areas for future research.

3 Problem solving principles

This source is based on the work on heuristics [30] and human problem solving [31]. Problem solving is defined as a process of searching through the problem space for the right operators to transform the initial state into a solution. Cognitively complex problems are those that require a more difficult search through a more complicated maze of possible operators. Complexity, then, is a matter of the difficulty in finding the right operators that will eventually lead to the ultimate solution.

In evaluating complexity factors related to problem solving, these general ways of problem solving need to be considered. Based on Funke’s [32] list of elements, the following elements lead to complexity: intransparency, multiple goals, situation complexity, connectivity, dynamic nature of problems, and time delay. Of these, intransparency, which causes complexity because only some variables lend themselves to direct observation, is not relevant; the other factors are explicated below in the data modeling context.

3.1 Multiple goals

With multiple goals, some may be contradictory, and trade-offs are often required. This is a concern in physical data modeling, where the goal of controlling redundancy is traded off against efficiency. However, it is not a source of cognitive complexity in conceptual and logical data modeling. For business applications, normalization is usually the broad goal, and it is adapted for any special features or extensions.

3.2 Situation complexity

If the analyst/designer does not understand the requirements accurately because the domain is complex and requires a lot of knowledge, data modeling would obviously suffer. There are complex domains today, e.g., bioinformatics, which are difficult to understand. However, this paper does not address these emerging domains, because such domains may need fundamentally different data modeling formalisms. Cognitive complexity in data modeling exists even when the domains are simple and familiar.

3.3 Connectivity

Complex problems often contain a high degree of connectivity or interrelationships. Note that connectivity here does not mean cardinality, but refers to the large number of relationships among a limited number of entities. Consider, for example, a simple requirement from a vehicle sale application: A customer purchases a vehicle with options from a salesperson. The requirements are likely to lead to the following entities: Customer, Purchase (or Sale), Vehicle, Option, and Salesperson. The problem is that all entities are interrelated. Even if we consider binary links only, there are ten combinations even before cardinalities are considered (Fig. 3).

Fig. 3 The novice designer view of relationships among five interrelated entities
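
The ten candidate pairings shown in Fig. 3 can be enumerated mechanically; the following sketch is merely illustrative and uses the entity names from the vehicle sale example.

```python
from itertools import combinations

entities = ["Customer", "Purchase", "Vehicle", "Option", "Salesperson"]

# Every unordered pair of entities is a candidate binary relationship that a
# novice designer must consider before cardinalities even enter the picture.
candidates = list(combinations(entities, 2))
for left, right in candidates:
    print(f"{left} -- {right}")
print(f"{len(candidates)} candidate binary relationships")  # 10
```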

Note that there is a remarkable resemblance between the generic complexity problem shown in Fig. 2 and the specific data modeling problem shown in Fig. 3, which is the novice designer approach. This view also has empirical support given the way novices arbitrarily choose relationships [4]. The novice approach invariably leads to a wrong data modeling solution.

Based on the semantics and the constraints of the application, the designer needs to select certain relationships out of all possible relationships. Supplementary information provided with the application requirements (e.g., each purchase involves only one customer) can be used to arrive at the correct solution (see Fig. 4). But since novice designers use naïve heuristics, their solutions are arbitrary and unpredictable.

Fig. 4 The experienced designer solution

3.4 Dynamic nature of problems

The dynamic nature of problems is a factor that is extrinsic and is unlikely to cause complexity in data modeling. If user requirements are constantly changing, the designer faces additional work, but this will probably not directly add to the cognitive complexity of data modeling. In fact, the data aspect of an application is relatively stable because changes in requirements usually affect the interface (front tier) and the application (middle tier) rather than the database (back end).

3.5 Time delay

In data modeling problems, there is a delay between the action taken and the response or the appearance of consequences. This is not a problem in query writing, but it is certainly a problem in data modeling. In query writing, because of immediate feedback and a fair amount of flexibility in breaking down a query into manageable parts, the cognitive complexity of a problem can be controlled (Fig. 5). In data modeling, there is a large time gap between modeling and implementation, and the consequences of a bad design are not immediately apparent. Time delay may not directly cause complexity, but it certainly exacerbates the deleterious effects arising from a bad data model.

Fig. 5 A comparison of time delay involved in query writing and data modeling

4 Design complexity

In User Centered System Design, Norman and Draper [33] have included articles that list important design principles for everyday objects. There are additional principles suggested by Nielsen and Molich [34]. From Norman’s [35] list of good design principles, we can deduce the following elements of complexity: insufficient information, extensive use of memory, no visual display of what actions are possible, no mental aids, no visual feedback, response incompatibility, lack of constraints, no flexibility for error, and no standards. The factor insufficient information is ignored because it is assumed that the relevant user requirements are available; even when information is complete, data modeling is a complex problem. The factor no visual feedback is generalized to no direct feedback, and the factor no visual display of what actions are possible is generalized to the more generic no visual display. The factors response incompatibility and no standards are considered irrelevant to the context.

4.1 Extensive use of memory

According to Miller [20], memory overload can occur anytime the number of items (or chunks) to be tracked exceeds the magic number seven. Given that the number of items can easily exceed seven, the main cause of memory overload is connectivity, which was discussed under problem solving factors in the previous section. Since a limited number of entities can result in a large number of possible relationships, far greater than the magic number seven, strategies are needed to manage the load on memory. Memory overload can also occur in determining cardinality, especially for higher degree relationships. The extensive use of memory factor, together with the connectivity factor mentioned under problem solving, can sharply exacerbate cognitive complexity.

4.2 No visual display

The relational model does not have a visual display. Consequently, modeling becomes difficult. It has been well established now that graphical models like the entity relationship [36], as compared to text based models like the relational [37], lead to better designer performance. This is because graphical models are vastly superior in showing links, and it is well known that relationships, which can easily be represented using links, are the leading cause of data modeling complexity. Today, a graphical representation is the norm for conceptual data modeling.

4.3 No mental aids

This factor is prominent both in the relational and the entity relationship (ER) models, which provide practically no mental aid to help in data modeling because they are modeling formalisms rather than techniques. Relational databases follow the normalization approach, which prescribes certain properties that databases need to possess. In other words, normalization provides criteria that can be used to assess the quality of a database after data modeling has been done. Thus, normalization is more of an end point check rather than a technique that can serve as a mental aid. The ER model provides a visual representation, but there are still no mental aids, since normalization, the eventual target, is not inherently incorporated in the model. If memory overload is to be reduced, mental aids need to be provided in terms of simple rules as part of a modeling technique that a novice designer can follow easily.

4.4 No direct feedback

Complexity can result if there is no feedback or delayed feedback on the results of an action. This point is similar to the “time delay” issue listed in the previous category. Data modeling provides no immediate feedback, and the consequences of a bad design show up only at the end of the implementation. In other words, there is no direct and immediate feedback, and this can cause cognitive complexity. In contrast, when querying, a user can assess the result and gauge if the query needs to be corrected.

4.5 Unconstrained choices

Lack of constraints allows the designer choices among too many options. This is a common problem in data modeling, as was illustrated when discussing connectivity in the previous section. In data modeling, there is a vast gap between the problem space and the solution space. Constraints facilitate pruning of the problem space so that the solution space can be reached. A typical constraint would be in the form ‘A purchase invoice pertains to only one customer’. Empirical evidence shows that novice designers are inept at using the constraints to arrive at the correct solution. This is because some constraints may be useful (e.g., an invoice has many products) while others may lead to wrong answers if applied carelessly (e.g., a salesperson sells to several customers). The latter is likely to result in semantics that are derived and redundant.

4.6 No flexibility for errors

Once user requirements have been articulated unambiguously, there is not much flexibility in data modeling to allow for alternative solutions. In other words, for a given semantics, there is usually only one correct solution. Given the large problem space, a novice cannot guess the correct answer.

In addition to Norman’s design principles, there are additional heuristics suggested by Nielsen and Molich [34] as good design practices. Many of these factors have been covered, but three new complexity factors emerge from the analysis: lack of natural dialogue, lack of clearly marked exits, and lack of shortcuts.

4.7 Lack of natural dialogue

This is prominent in relational data modeling, which relies on the notion of functional dependency and multivalued dependency. This is clearly an unnatural dialogue for the designer. For example, the definition of the fourth normal form in terms of multivalued dependencies is totally alien to designers. Even the entity relationship concepts like cardinality may be difficult for novice designers when a higher degree relationship is encountered. Data modeling methods should rely on languages that are close to the way novices understand the world.

4.8 Lack of clearly marked exits

When is a data model complete? There is, indeed, no clearly marked exit. If, say, all seven entities in a data model have somehow been connected, is it complete? That is a reasonable heuristic. But since not all entities have to be connected for a data model to be complete, and since some entities may be connected to each other more than once, there is no visible milestone that signals the end of data modeling. Further, binary relationships may be present among the same entities that define a ternary relationship.

4.9 Lack of shortcuts

Merely using the ER model is not a shortcut. In fact, shortcuts are largely missing in data modeling techniques, which should employ heuristics that considerably reduce the cognitive complexity. It is possible that this may sometimes result in erroneous solutions that need to be corrected after working with a prototype. Nevertheless, heuristic based methods that use shortcuts are more likely to be understood and used by designers.

Heuristics are often faster than serial algorithms as a means of selecting relevant information from the totality of information available. Regarding individual user characteristics, some notable studies have produced interesting insights into the process that subjects employed while creating a data model. For example, Srinivasan and Te’eni [38] used verbalized protocols to clarify the problem-solving process undertaken by users during a modeling task. They reported that efficient data modelers use heuristics to reduce the complexity of the problem, test models at regular intervals, and make orderly transitions from one level of abstraction of problem representation to another.

Further, the use of patterns needs to be investigated although so far it has only limited empirical support. Pattern recognition reduces the effort of processing facts individually and speeds up understanding. A pattern based approach may be used to come up with an intelligent tool [39]. At times, the template nature of patterns will provide solutions that are erroneous in the first pass, so the pattern-based approaches need to be integrated with other approaches.

5 Information overload

Information overload is simply the excessive amount of information that an individual may encounter and which may cause stress and anxiety [40]. Complexity reduction refers to the phenomenon whereby social systems are exposed to a much greater ‘information pressure’ than they can handle by rational methods [41]; they must therefore reduce this complexity, which may be done arbitrarily. The following factors are reported in Reeves [29] to cause information overload: disorder, novelty, inconsistency, noise, and undifferentiated features.

5.1 Disorder

This pertains to the lack of categories. Categorization leads to cognitive economy [42]. Attribute based models like the relational data model can create information overload and lead to low designer performance. Data models like the ER model, by contrast, are based on the notion of abstraction and reduce the complexity by a significant factor: attributes are first categorized into entities, which are then related, thus considerably reducing the number of interrelationships to be evaluated by the designer. However, for large applications, the mere use of the ER model may not be enough.
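
To put a rough number on that reduction, the hedged sketch below contrasts the pairwise combinations a designer would face over raw attributes with those faced over entities after categorization; the attribute and entity counts are invented purely for illustration.

```python
from math import comb

# Hypothetical application: 40 attributes grouped into 5 entities.
attributes, entities = 40, 5

print("Attribute-level pairings:", comb(attributes, 2))  # 780 pairs to weigh
print("Entity-level pairings:   ", comb(entities, 2))    # 10 pairs to weigh
```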

5.2 Novelty

This pertains to situations that are new and unfamiliar to the designer. A new situation may not cause much complexity if an analogue can be found. However, in general, it will cause some complexity because the solution has to be worked either from the first principles, or by a deft handling of mapping and manipulating from an analogue [43].

5.3 Inconsistency

Models, by definition, are abstract representations. Data modeling is governed by rules that control redundancy and ensure that queries on a data model will not lead to spurious results. Such factors can sometimes cause representation inconsistency between the real world semantics and their data model representations. For example, a supervisor–subordinate relationship is between two persons, but the data model representation is not binary; instead it is unary.

Further, a data modeling relationship may seem inconsistent with the real world view. For example, because the options can be changed every time a vehicle is sold/resold, the transaction Purchase connects to the Option entity (see Fig. 4). This may seem inconsistent with the real world view that suggests options belong to the vehicle (see Fig. 6). Inconsistency may also result when similar kinds of relationships have very different representations. For example, in the relational model, a many-many relationship requires a separate relation; a one-many relationship does not.

Fig. 6 Option shown as belonging to Vehicle

5.4 Noise

This pertains to the presence of irrelevant information. In data modeling, noise can take on a different meaning, since even credible information may be considered noise. Consider the example that involves a student’s registration in sections of courses. In user requirements (e.g., use cases), it would be normal to write that a student registers for a course, suggesting a business relationship between the two; yet further detail and minimality requirements reveal that they do not have a direct data modeling relationship, which actually goes from Student to Registration, Registration to Section, and Section to Course. In Fig. 7, the valid relationships are shown by full lines, and noise is shown by dotted lines.

Fig. 7 Student Registration example

Note that a use case may mention requirements about a student taking courses as much as about a student enrolling in a section, or a registration involving courses. Although such statements are well accepted in user requirements, there is no direct data modeling relationship between the entities they mention. In data modeling terms, the statements constitute noise that can actually mislead the designer into modeling faulty relationships [4]. The indirect relationships are established using queries. However, a novice designer may not be able to make the distinction.
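
A small sketch can make the derived nature of the student–course link explicit: only the three direct relationships are stored, and the apparent relationship between student and course is recovered by navigating through them. The data values are hypothetical and purely illustrative.

```python
# Only the direct relationships are stored; Student--Course is derived.
registrations  = {"R1": "Alice", "R2": "Alice", "R3": "Bob"}   # Registration -> Student
reg_section    = {"R1": "S101", "R2": "S205", "R3": "S101"}    # Registration -> Section
section_course = {"S101": "Databases", "S205": "Modeling"}     # Section -> Course

def courses_of(student: str) -> set:
    """Derive the indirect Student--Course link by navigating the direct ones."""
    return {
        section_course[reg_section[reg]]
        for reg, stu in registrations.items()
        if stu == student
    }

print(courses_of("Alice"))  # {'Databases', 'Modeling'}
```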

5.5 Undifferentiated features

This pertains to the situation when features are not distinct. Data models like the relational do not have separate representations for different facets; everything is handled using relations. This can make modeling difficult, since different semantics need to be squeezed into the same structure. Data models like the extended entity relationship (EER) model [44] offer a richer set of constructs so that different semantics can be represented appropriately.

6 Systems theory

Complexity has also been addressed from the systems viewpoint [18, 45]. The following factors have been reported: significant interactions, high number of parts, nonlinearity, broken symmetry, lack of constraints, open versus closed to their environment, human versus machine, and emergence (characteristics of a whole that differ from those of its parts). Of these, the factors nonlinearity, broken symmetry, open versus closed to their environment, human versus machine, and emergence are not relevant. The factors significant interactions and lack of constraints have been considered earlier. Only one remaining factor, high number of parts, is therefore discussed.

6.1 High number of parts

There are two aspects to this issue: the metamodel level and the class (or entity) level. First, the metamodel of a data modeling technique may have a high number of constructs, leading to high structural complexity. This issue has been addressed in the literature analytically as well as empirically [25, 27]. Second, the issue arises at the class level. As discussed earlier, as the number of entities/classes in an application increases, complexity increases rapidly because the number of interactions grows at a combinatorial rate.

7 Integrating the complexity factors

The paper has listed factors from four sources of complexity principles (problem solving, design complexity, information overload, and systems theory) and examined the factors in the context of data modeling. Since some factors are the same or closely related across these sources while others are unique to a source, the factors need to be compared and integrated. The factors are tabulated (see Table 1) so that a comprehensive list can be obtained. The first four columns list the factors by source, and the fifth provides the most appropriate name. Across a row or within a cell, the factors are clustered based on similarity. It is readily noted that all four sources agree on the significant interactions factor; evidently, this must be the most important factor.

Table 1 Integrating the complexity factors

Problem solving | Design | Information overload | Systems theory | Proposed factor
Connectivity | Extensive use of memory; Unconstrained choices | Noise | Significant interactions; Lack of constraints; High number of parts | Significant interactions
Time delay | No feedback | | | No direct feedback
| No visual display | | | No visual display
| No mental aids | | | No mental aids
| No flexibility for errors | | | Very limited solutions
| Lack of natural dialogue | | | Lack of natural dialogue
| Lack of clearly marked exits | | | Lack of clearly marked exits
| Lack of shortcuts | | | Lack of shortcuts
| | Disorder | | Low abstraction
| | Novelty | | Novelty
| | Inconsistency | | Semantic mismatch
| | Undifferentiated features | | Undifferentiated features

8 Proposing a preliminary semantic differential scale

The analysis so far has given us the following list of factors: significant interactions, no direct feedback, no visual display, no mental aids, very limited solutions, lack of natural dialogue, lack of clearly marked exits, lack of shortcuts, low abstraction, novelty, semantic mismatch, and undifferentiated features. Some of these factors can be controlled by proposing effective modeling techniques and employing well designed software tools. This is addressed in the next section.

However, there are factors that cannot be controlled. For example, the fact that data modeling of a given scenario with clearly defined semantics has very limited solutions, and usually only one solution, is a given and largely cannot be controlled; it is inherent to data modeling. In examining the factor lack of clearly marked exits, it is noted that we have very limited control: we really do not know when to exit; we can only indicate whether an entity has or has not been connected to the rest of the model, and define a possible exit when all the entities have been connected. Whether a model has several relationships among the same entities depends on the scenario (say, the presence of binary relationships among the same entities involved in a ternary relationship) and is something that cannot be guided by a tool; further, there are scenarios in which an entity need not be connected with the rest of the model. The factor novelty depends on the domain, and can cause difficulty in modeling if the requirements are poorly understood. However, novelty is extrinsic to data modeling, and cannot be controlled.

The factor semantic mismatch can be controlled slightly by choosing a suitable representation. At least the inconsistency between the representation of a 1:m relationship versus an m:n relationship can be addressed by avoiding the relational data model at the conceptual stage and, instead, employing an ER kind of data model. Models like the ER are quite standard now. But even the ER model cannot address the inherent inconsistency in modeling unary relationships. Further, domain related inconsistency situations are beyond control. On the whole, the factor largely cannot be controlled.

Although the aforementioned factors largely cannot be controlled, thankfully, they are generally low in severity. In other words, the deleterious impact of these factors should be minimal, and they warrant less concern. The discussion, therefore, focuses on the factors that can be partially or fully controlled. These factors are: significant interactions, no direct feedback, no visual display, no mental aids, lack of natural dialogue, lack of shortcuts, low abstraction, and undifferentiated features.

These factors may be used to develop a questionnaire to measure cognitive complexity of data modeling techniques. A questionnaire in the semantic-differential style used by Guillemette [46] is presented in Table 2. The overall score can be used to assess to what extent a data modeling technique handles cognitive complexity. The questionnaire implicitly identifies dimensions that need to be considered to develop a data modeling technique that can reduce cognitive complexity. For example, a technique that is low on shortcuts is likely to cause cognitive strain.

Table 2 Semantic-differential scales for gauging complexity control in data modeling

Proposed factor | Semantic scale for data modeling
Significant interactions control | High 1 2 3 4 5 6 7 Low
Mental aids | Profuse 1 2 3 4 5 6 7 Absent
Shortcuts | Ample 1 2 3 4 5 6 7 None
Feedback | Sufficient 1 2 3 4 5 6 7 Absent
Abstraction level | Appropriate 1 2 3 4 5 6 7 Inappropriate
Display | Visual 1 2 3 4 5 6 7 Not visual
Dialogue | Natural 1 2 3 4 5 6 7 Unnatural
Differentiated features | High 1 2 3 4 5 6 7 Low
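
As a minimal sketch of how the ratings in Table 2 might be aggregated, the snippet below sums the eight scores into an overall measure; the example ratings and the choice of an unweighted sum (with lower totals indicating better complexity control) are assumptions made for illustration, not a prescription from the scale itself.

```python
# Ratings on the 1-7 scales of Table 2; 1 is the favourable pole.
# The example ratings below are invented purely for illustration.
ratings = {
    "significant interactions control": 3,
    "mental aids": 5,
    "shortcuts": 6,
    "feedback": 4,
    "abstraction level": 2,
    "display": 1,
    "dialogue": 3,
    "differentiated features": 2,
}

# Unweighted sum; lower totals suggest better handling of cognitive complexity.
total = sum(ratings.values())
print(f"Overall score: {total} (possible range {len(ratings)}-{len(ratings) * 7})")
```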

The questionnaire presented in Table 2 is, obviously, preliminary since its validity and reliability have not been established [47]. Assuming that the relevant data has been collected, a factor analysis may reveal fewer factors than the eight mentioned in the questionnaire. Further, a multiple regression study may show that some factors are more important than others; while some may be critical, others may be largely irrelevant. These future steps needed to develop a rigorous instrument are beyond the scope of this paper.

The analysis done in the current study may be used to better understand usability studies in data modeling that have considered complexity as a variable. For example, the Batra and Wishart [6] study considered only three factors to provide variations in task complexity: combinatorial, intricacy of form, and semantic mismatch. The factor semantic mismatch is the same in the two studies. The factor combinatorial is similar to the factor significant interactions and represents complexity arising because of relationships among entities. The factor intricacy of form is also similar to significant interactions, but represents complexity arising because the structure of some relationships is more intricate than that of others. Note that the current study is based on a number of reference disciplines, and it shows that the cognitive complexity construct has a much wider focus and a larger number of items.

9 Addressing data modeling complexity: a design science approach

Hevner et al. [48] have underscored the need for design science research in the Information Systems discipline. They state that two paradigms characterize much of the research in the Information Systems discipline: behavioral science and design science. The behavioral science paradigm seeks to develop and verify theories that explain or predict human or organizational behavior. The design science paradigm seeks to extend the boundaries of human and organizational capabilities by creating new and innovative artifacts. The discussion so far has taken a behavioral science approach and shown that a research instrument can be used to assess data modeling complexity. The following discussion takes a design science approach and develops guidelines for developing and evaluating software tools that address the complexity involved in data modeling.

It may be difficult for a researcher interested in extending this research to develop a software tool that addresses each of the eight factors. Design science focuses on utility [48]; thus, we need to prioritize the factors that may yield the most benefit. A problem here is that we do not have any research based on multiple regression or similar techniques that can reveal whether one factor is far more severe than the others. Thus, we need to depend mainly on past research in the area. Even if the classification is somewhat subjective, empirical evaluation can justify and evaluate the assessment. However, the scope of this paper does not include formal justification.

Addressing factors that are the most severe can yield the most benefit. In terms of severity, the eight factors seem to cluster into three categories—low severity, medium severity, and high severity. Each factor is discussed in this framework and recommendations are provided. The low severity factors are addressed first, the medium severity factors next, and the high severity factors at the end.

9.1 Low severity factors

These are factors that seem to have less impact. There is sufficient research in the area that reasonable solutions exist to address the problems.

9.1.1 No visual display

This can easily be solved by using a visual representation like the ER diagram. An ER diagram can easily be translated to a representation that can be implemented. There is empirical evidence that visual representations lead to better designer performance [2].

9.1.2 Lack of natural dialogue

Studies have shown that determining cardinality is a difficult task for novice designers [2]. The cardinality is always with respect to one instance; however, this does not appear so when a novice literally reads the cardinality. For example, if there is a one to many relationship between Customer and Repair, where Customer is the one side and Repair the many side, a student may interpret the cardinality to mean that many repairs map to one customer. Actually, many repairs map to many customers; however, one repair maps to (no more than) one customer. Such errors may be mitigated by using a tool that provides a natural language translation of the cardinality selected by the designer [49].
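
A hedged sketch of the kind of instance-level rephrasing such a tool might generate is shown below; the phrasing rules are this illustration's own and do not reproduce the tool described in [49].

```python
def explain_cardinality(one_side: str, many_side: str) -> str:
    """Render a 1:m relationship as instance-level sentences a novice can verify."""
    return (
        f"Each {many_side} is associated with at most one {one_side}; "
        f"each {one_side} may be associated with many {many_side}s."
    )

# The Customer-Repair example from the text: Customer is the 'one' side.
print(explain_cardinality("Customer", "Repair"))
# -> "Each Repair is associated with at most one Customer; each Customer may
#     be associated with many Repairs."
```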

9.1.3 Undifferentiated features

This problem arises when different semantic constructs in the real world are modeled by the same representation. The use of undifferentiated features is ontologically a poor modeling practice [50]. However, too many features may lead to cognitive overload [29]. This requires a precarious balancing act in which the essential features are retained. So far the ER model has served as a useful data modeling representation, and it has been extended as needed [44]. This is, thus, a low impact factor, and does not need a new solution.

9.2 Medium severity factors

These are moderately significant factors. There is little research available to address the problems.

9.2.1 No direct feedback

Data modeling does not provide direct feedback. To minimize delay, the testing phase can be moved closer to analysis and design. Currently popular methodologies such as iterative and agile development emphasize early testing [51]. A novice data modeler can go one step further and employ a handy desktop DBMS like Microsoft Access to test out a piece of the data model by defining the structure, loading a small number of rows, and trying out some typical queries, especially the ones involving joins. In a relatively short period of time, a prototype may reveal major errors. Of course, this method may not always be followed.
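
A minimal sketch of such a throwaway prototype is shown below, using Python's built-in sqlite3 in place of a desktop DBMS; the schema and rows are invented for illustration.

```python
import sqlite3

# Throwaway prototype: define a slice of the model, load a few rows,
# and run the joins the application will actually need.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (
        purch_id INTEGER PRIMARY KEY,
        cust_id  INTEGER REFERENCES customer(cust_id),
        vehicle  TEXT
    );
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO purchase VALUES (10, 1, 'Sedan'), (11, 1, 'Truck'), (12, 2, 'Coupe');
""")

# A typical join: if this query cannot be written, or returns surprising rows,
# the model needs rework before implementation proceeds.
for name, vehicle in con.execute("""
        SELECT c.name, p.vehicle
        FROM customer c JOIN purchase p ON p.cust_id = c.cust_id
        ORDER BY c.name, p.purch_id"""):
    print(name, vehicle)
```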

9.2.2 Low abstraction

The ER data model provides an appropriate level of abstraction for relatively small problems, say those involving fewer than ten entities. However, when the size of the problem increases, the level of abstraction needs to increase too. This issue has been addressed by using a clustering approach [52]. Although the approach has not been empirically validated, it is hard to see how large problems could be handled without clustering, so the approach has considerable face validity. The approach works by providing a name to each cluster of closely related entities and relationships. Novice designers are not expected to design large databases; hence the problem is not considered severe.

9.3 High severity factors

These are significant problems in conceptual data modeling. Although there is some research in addressing the problems, significant gains in designer performance are possible if additional research leads to conclusive solutions.

The three high severity factors (significant interactions, no mental aids, and lack of shortcuts) are related. In an earlier section, it was explained how significant interactions cause data modeling complexity. Formal techniques to address significant interactions are themselves too complex, so there is a need for shortcuts, that is, for a heuristic based approach. However, a heuristic based approach needs input from the database designer, so there is a need for a tool that provides mental aids, that is, guidance, prompts, and aids to move the designer along the modeling process. The three factors are, therefore, discussed together. Given that the other factors can be addressed without undue effort, this is the main challenge.

9.3.1 Significant interactions, lack of shortcuts, and no mental aids

Significant interactions are undoubtedly the most important factor causing complexity and consequent difficulty in data modeling. Unfortunately, it is a little researched and poorly understood issue in the data modeling area. Experience can probably overcome the difficulty caused by significant interactions, but it could take a fair amount of time and effort, which can be reduced by formal instruction. Experience can eventually make up for the inadequate understanding gained by modeling simple exercises like the seemingly ubiquitous Customer-Order-Product problem. Even a widely used book like Date [53] has conventionally relied on simple examples involving two entities and one relationship (supplier and product). For cases that are closer to the real world (e.g., those in the excellent casebook by Whitlock et al. [54]), novice designers will certainly face difficulty.

Attempts have been made to provide rules extracted and interpreted from relational theory in simple ER and natural language terms. The rules and heuristics in Batra and Zanakis [55] provide a sequence for modeling the various facets. The approach is based on Armstrong’s axioms [56] but uses the ER model to convey the approach in simple terms. Entities are modeled first. The designer then needs to look for only one kind of relationship among the entities at a time. For example, the designer looks for the 1:m relationships first. If one kind of relationship has been modeled, it may preclude another kind, thus reducing the search space. This implies that an intelligent tool can provide mental aids to manage the complexity. Such rules have been incorporated in the software CODASYS [57], which provides several mental aids to assist novice designers. Extensive empirical testing is required, however, to evaluate and justify such tools.

For example, in the vehicle invoice problem discussed earlier in the paper, we can model the four 1:m relationships first (Fig. 4). Thus, we do not have to waste time mulling whether we need to show the m:n relationship between Customer and Salesperson, which ends up as derived. Also the 1:n relationship between Vehicle and Purchase precludes the m:n relationship between Vehicle and Option (unless Option is not related to Purchase); instead the m:n relationship is between Purchase and Option. Such mental aids can reduce routine data modeling errors.
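
The following sketch illustrates, in simplified form, the kind of mental aid this heuristic suggests: once a set of relationships has been modeled, any still-unmodeled pair of entities that already shares a common neighbour is flagged as likely derived. The relationships listed are taken from the discussion of the vehicle example; the rule itself is an assumption of this sketch and does not reproduce the actual CODASYS rules.

```python
from itertools import combinations

# Relationships modeled first under the heuristic (illustrative; based on the
# text's discussion of Fig. 4, which may contain more detail).
modeled = {
    frozenset(p) for p in [
        ("Customer", "Purchase"),     # 1:m
        ("Salesperson", "Purchase"),  # 1:m
        ("Vehicle", "Purchase"),      # 1:m
        ("Purchase", "Option"),       # m:n
    ]
}
entities = ["Customer", "Purchase", "Vehicle", "Option", "Salesperson"]

def neighbours(entity):
    """Entities directly linked to `entity` in the current model."""
    return {e for rel in modeled if entity in rel for e in rel if e != entity}

# Mental aid: an unmodeled pair that already shares a common neighbour is
# probably derivable, so the tool warns the designer instead of suggesting it.
for a, b in combinations(entities, 2):
    if frozenset((a, b)) in modeled:
        continue
    common = neighbours(a) & neighbours(b)
    if common:
        print(f"{a} -- {b}: likely derived via {', '.join(sorted(common))}")
```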

In the Problem Solving section, it was shown that the number of relationships increases rapidly with respect to an increase in the number of entities. So, how is it possible that a large database, which may easily have more than 100 entities, gets modeled at all? This is possible because a given entity is not related to each and every entity in the application. Typically, an entity closely relates to only a small set of entities. Although even a small set like seven or eight entities can give rise to a large number of relationships, at least this puts some limit on the possible number of relationships. Closely related entities form a cluster [52]. Thus, a high number of entities may not directly be a critical factor in data modeling complexity, which may be more dependent on the number of entities in a typical cluster. Further, it is unlikely that a novice designer would be asked to model a very large database. However, even a small database will have a few clusters.

The Danish physicist Bak [58] proposed the theory of ‘self-organized criticality’, a robust phenomenon that turns up again and again: the size of avalanches is inversely proportional to their frequency, such that there are very few very large avalanches and very many small ones. If the same phenomenon applies to clusters, then most clusters should be of reasonable size, allowing mental aids to provide shortcuts that control the cognitive strain caused by significant interactions within a cluster. We can develop intelligent tools that can handle regular sized clusters.

10 Implications and future research

An interesting issue is the relationship of this study to research on complexity in modeling UML diagrams [23–25]. Past research has generally focused on structural complexity as a surrogate for cognitive complexity. However, both approaches have been used. For instance, the Siau and Cao [23] study followed a structural approach, while the Siau and Tian [24] study employed the GOMS model, a cognitive approach.

Siau and Cao [23] assessed structural complexity in the context of UML by considering elements in various UML diagrams such as class, and use case diagrams. This is similar to the intricacy of form argument employed in the Batra and Wishart [6] study. In UML, one can focus, say, on the class diagram, and by evaluating its elements come up with a measure of structural complexity. An empirical issue is whether structural complexity is a good surrogate of cognitive complexity. This can be helpful because it is easier to measure structural complexity.

Cognitive complexity can also be measured by methods like the GOMS model or by developing a questionnaire customized to a given context. The GOMS model is an attempt to represent specific human problem solving behavior in terms of goals, operators, methods, and selection rules, following the Newell and Simon [31] information processing approach. One issue with the GOMS model is that its successful application requires a certain amount of expertise in the subjects being evaluated. If the scope of a study is limited to novice designers, the GOMS model may be difficult to use.

The Siau and Tian [24] study employed the GOMS model, a cognitive approach, and found results similar to those of the Siau and Cao [23] study, which followed a structural approach, thus showing a correspondence between structural and cognitive complexity. The current study considers a much broader scope of cognitive complexity and takes a technique focus rather than a data formalism focus, and it will be interesting to investigate the relationship between structural and cognitive complexity in this setting.

There is some empirical evidence that designers tend to reduce the cognitive complexity by reducing structural complexity. In a study on UML diagrams, Siau et al. [25] found that designers tend to use a subset, usually three or four, of the about ten UML diagrams. Within a given diagram, they routinely use a subset of the features. The cognitive load of using all the features in all the diagrams would be immense. By reducing the structural complexity, the designers are able to control the cognitive complexity.

Another extension of this research would be to link and apply it to learning theories, especially the meaning theory of thinking. If cognitive complexity can be reduced, then learning can be enhanced. It is important to associate new learning with familiar knowledge in the reading level, terms, and concepts used [59]. The meaning theory of thinking recommends the following: assimilation (compare to the known), relating to the user’s experience, using terms familiar to the user, making concepts concrete, using imagery, discovery through activity, and priming prior to new material. These recommendations can be used to formulate a data modeling technique that addresses the complexity factors associated with data modeling.

11 Conclusion

Although enterprise resource planning (ERP) systems have become popular in recent years, their implementations have usually been accompanied by low end-user satisfaction and unsubstantiated changes in productivity [60, 61]. Data modeling continues to play an important role in custom built information systems. One can hypothesize that the quality of custom built information systems can suffer because of poor data modeling practices. In the data modeling area, usability studies have focused on comparisons of data modeling formalisms, but not on developing techniques to manage complexity.

We need to, thus, understand the cognitive complexity phenomenon in data modeling, propose effective data modeling techniques, develop corresponding tool prototypes, test the prototypes in a controlled laboratory environment, improve the prototypes, and validate the tools in realistic settings. This process subscribes to the design science paradigm [48].

The paper has considered four viewpoints of cognitive complexity (problem solving principles, design principles, information overload, and systems theory) to list factors that cause complexity in data modeling. A large number of factors are discussed. Some factors are beyond control and are therefore dropped from the analysis. The remaining factors are categorized by severity so that the critical ones can be outlined. A preliminary semantic differential scale has been proposed to measure cognitive complexity in data modeling.

The MIS literature has addressed cognitive complexity indirectly through the notion of perceived ease of use. Some researchers consider perceived ease of use as limited in scope. Soloway and Pryor [62] argue “We need to address the real issues of our times; nurturing the growth of children and adults [in school and organizational settings], supporting them as they grapple with ideas...and [in] developing all manners of expertise...Ease of use, valuable as it certainly is, is too limited a vision.” By directly addressing cognitive complexity, we can realize a broader vision in the data modeling domain.

Copyright information

© Springer-Verlag London Limited 2006