Definition

An inconsistent database is a database instance that does not satisfy those integrity constraints that have been declared together with the schema of the database.

Historical Background

Already in the classical and seminal paper by E. F. Codd [5] on the relational data model it is possible to find the notions of integrity constraint and consistency of a database. The idea of consistent query answering, consisting in characterizing and computing semantically correct answers to queries in inconsistent databases, was explicitly introduced in [1].

Foundations

A database can be seen as a model, i.e., as a simplified, abstract description, of an external reality. In the case of relational databases, one starts by choosing certain predicates of a prescribed arity. The schema of the database consists of this set of predicates, possibly attributes, which can be seen as names for the arguments of the predicates, together with an indication of the domains where the attributes can take their values. Having chosen the schema, the representation of the external reality is given in terms of relations, which are extensions for the predicates in the schema. This set of relations is called an instance of the schema.

For example, relational database for representing information about students of a university might be based on the schema consisting of the predicates Students(StNum,StName) and Enrollment(StName,Course). The attribute StNum is expected to take numerical values; StName, character string values; and Course, alphanumeric string values. In Fig. 1 there is a possible instance for this schema.

Inconsistent Databases. Figure 1
figure 1_1242

A database instance.

In order to make the database a more accurate model of the university domain (or to be in a more accurate correspondence with it), certain conditions are imposed on the possible instances of the database. Those conditions are intended to capture more meaning from the outside application domain. In consequence, these conditions are called semantic constraints or integrity constraints (ICs). For example, a condition could be that, in every instance, the student name functionally depends upon the student number, i.e., a student number is assigned to at most one student name. This condition, called a functional dependency (FD), is denoted with StuNumberStuName, or Students : StuNumber StuName, to indicate that this dependency should hold for attributes of relation Students. Actually, in this case, since all the attributes in the relation functionally depend on StuNum, the FD is called a key constraint.

Integrity constraints can be declared together with the schema, indicating that the instances for the schema should all satisfy the integrity constraints. For example, if the functional dependency Students : StuNumber StuName is added to the schema, the instance in Fig. 1 is consistent, because it satisfies the FD. However, the instance in Fig. 2 is inconsistent. This is because this instance does not satisfy, or, what is the same, violates the functional dependency (the student number 101 is assigned to two different student names.

Inconsistent Databases. Figure 2
figure 2_1242

Another instance.

Functional dependencies are particular cases of integrity constraints. It is also possible to consider with the schema a referential integrity constraint that requires that every student (number) in the relation Enrollment appears, associated with a student name, in relation Students, the official “table” of students. This is denoted with Enrollement[StNum]\(\subset\) Students[StNum]. If this IC is considered in the schema, the instance in Fig. 1 is inconsistent, because student 105 does not appear in relation Students. However, if only the referential constraint were in the schema, the instance in Fig. 2 would be consistent.

In can be seen that the notion of consistency is relative to a set of integrity constraints. When a database is said to be inconsistent, it is meant that the particular instance of the database at hand is inconsistent.

The two particular kinds of integrity constraints presented above and also other forms of ICs can be easily expressed in the language of predicate logic. For example, the FD above can be expressed by the symbolic sentence

$$\forall x\forall y\forall z(({\it Students}(x,y) \wedge {\it Students}(x,z))\ \mathrel{\rightarrow }\ y = z),$$
((1))

whereas the referential constraint above can be expressed by

$$\forall x\forall y({\it Enrollment}(x,y)\mathrel{\rightarrow }\exists z{\it Students}(x,z)).$$
((2))

Notice that this language of predicate logic is determined by the database schema, whose predicates are now being used to write down logical formulas. We may also use “built-in” predicates, like the equality predicate. Thus, ICs can be seen as forming a set Σ of sentences written in a language of predicate logic.

A database instance can be seen as an interpretation structure D for the language of predicate logic that is used to express ICs. This is because an instance has an underlying domain and (finite) extensions for the predicates in the schema. Having the database instance as an interpretation structure and the set of ICs as a set of symbolic sentences is crucial, and makes it possible to simply apply the notion of satisfaction of a formula by a structure of first-order predicate logic [6]. In this way, the notion of satisfaction of an integrity constraint by a database instance is a precisely defined notion. The database instance D is consistent if and only if it satisfies Σ, which is commonly denoted with D ⊨ Σ.

Since it is usually assumed that the set of ICs is consistent as a set of logical sentences, in databases the notion of consistency becomes a condition on the database instance. Thus, this use of the term “consistency” differs from its use in logic, where consistency characterizes a set of formulas.

Inconsistency is an undesirable property for a database. In consequence, one attempts to keep it consistent as it is subject to updates. There are a few ways to achieve this goal. One of them consists in declaring the ICs together with the schema, and the database management system (DBMS) will take care of the database maintenance, i.e., of keeping it consistent. This is done by rejecting transactions that may lead to a violation of the ICs. For example, the DBMS should reject the insertion of the tuple (101, sue jones) into the instance in Fig. 1 if the FD (1) was declared with the schema (as a key constraint). Unfortunately, commercial DBMSs offer limited support for this kind of database maintenance.

An alternative way of keeping consistency is based on the use of triggers (or active rules) that are stored in the database. The reaction to a potential violation is programmed as the action of the trigger: if a violation is about to be produced or is produced, the trigger automatically reacts, and its action may reject the violating transaction or compensate it with additional updates, to make sure that at the end, consistency is reestablished. Consistency can also be enforced through the application programs that interact with the DBMS. However, the correctness of triggers or application programs with respect to (with regard to) ensuring database consistency is not guaranteed by the DBMS.

It is the case that, for whatever reasons, databases may become inconsistent, i.e., they may violate certain ICs that are considered to be relevant to maintain for a certain application domain. This can be due to several reasons, e.g., poorly designed or implemented applications that fail to maintain the consistency of the database, or ICs for which a DBMS does not offer any kind of support, or ICs that are not enforced for better performance of application programs or DBMSs, or ICs that are just assumed to be satisfied based on knowledge about the application domain and the kind of updates on the database. It is also possible to have a legacy database on which semantic constraints have to be imposed; or more generally, a database on which imposing new constrains depending on specific needs, e.g., user constraints, becomes necessary.

In the area of data integration the satisfaction of desirable ICs by a database is much more difficult to achieve. One can have different autonomous databases that are separately consistent with regard to their own, local ICs. However, when their data is integrated into a single database, either material or virtual, certain desirable global ICs may not be satisfied. For example, two university databases may use the same numbers for students. If their data is put together into an integrated database, a student number might be assigned to two different students.

When trying to use an inconsistent database, the application of some data cleaning techniques may be attempted, to cleanse the database from data that participates in the violation of the ICs. This is done sometimes. However, data cleaning is a complex and non-deterministic process; and it may also lead to the loss of information that might be useful. Furthermore, in certain cases like virtual data integration, where the data stays at the autonomous data sources, there is no way to change the data without ownership of the sources.

One might try to live with inconsistent databases. Actually, most likely one will be forced to keep using it, because there is still useful information in it. It is also likely that most of the information in it is somehow consistent. Thus, the challenge consists in retrieving from the database only information that is consistent. For example, one could pose queries to the database at hand, but expecting to obtain only answers that are semantically correct, i.e., that are consistent with the ICs. This is the problem of consistent query answering (CQA).

The notion of consistency of a database is a holistic notion, that applies to the entire database, and not to portions of it. In consequence, in order to pursue this idea of retrieving consistent query answers, it becomes necessary to characterize the consistent data in an inconsistent database first. The idea that was proposed in [1] is as follows: the consistent data in an inconsistent data is the one that is invariant under all possible way of restoring the consistency by performing minimal changes on the initial database. That is, no matter what minimal consistency restoration process is applied to the database, the consistent data stays in the database. Each of the consistent versions of the original instance obtained by minimal changes is called a minimal repair, or simply, a repair.

It becomes necessary to be more precise about the meaning of minimal change. In between, a few notions have been proposed and studied (cf. [24] for surveys of CQA). Which notion to use may depend on the application. The notion of minimal change can be illustrated using the definition of repair given in [1]. First of all, a database instance D can be seen as a finite set of ground atoms (or database tuples) of the form P(\(\bar{c}\)), where P is a predicate in the schema, and \(\bar{c}\) is a finite sequence of constants in the database domain. For example, Students(101, john bell) is an atom in the database. Next, it is possible to compare the original database instance D with any other database instance D′ (of the same schema) through their symmetric difference DΔD′ = {A | A ∈ (D \ D′) ∪ (D′ \ D)}.

Now, a repair of an instance D with regard to a set of ICs Σ is defined as an instance D′ that is consistent, i.e., D′ ⊨ Σ, and for which there is no other consistent instance D′′ that is closer to D than D′, i.e., for which it holds DΔD′′ \({\subset \ne}\) DΔD′. For example, the database in Fig. 2 has two repairs with regard to the FD (1). They are shown in Fig. 3 and are obtained each by deleting one of the two conflicting tuples in relation Students (relation Enrollment does not change).

Inconsistent Databases. Figure 3
figure 3_1242

Two repairs.

Having defined the notion of repair, a consistent answer from an instance D to a query Q(\(\bar{x}\)) with regard to a set Σ of ICs is defined as an answer \(\bar{c}\) to Q that is obtained from every possible repair of D with regard to Σ. That is, if the query Q is posed to each of the repairs, \(\bar{c}\) will be returned as a usual answer to Q from each of them.

For example, if the query Q 1(x, y) : Students(x, y), asking for the tuples in relation Students, is posed to the instance in Fig. 2, then (104, claire stevens) and (107, pat norton) should be the only consistent answers wrt the FD (1). Those are the tuples that are shared by the extensions of Students in the two repairs. Now, for the query Q 2(x) : ∃yStudents(x, y), i.e., the projection on the first attribute of relation Students, the consistent answers are (101), (104) and (107).

There might be a large number of repairs for an inconsistent database. In consequence, it is desirable to come up with computational methodologies to retrieve consistent answers that use only the original database, in spite of its inconsistency. Such a methodology that works for particular syntactic classes of queries and ICs, was proposed in [1]. The idea is to take the original query Q that expects consistent answers, and syntactically transform it into a new query Q′, such that the rewritten query Q′, when posed to the original database, obtains as usual answers the consistent answers to query Q. The essential question is, depending on the language in which Q is expressed, what kind of language is necessary for expressing the rewriting Q′. The answer to this question should also depend on the kind of ICs being considered.

The idea behind the rewriting approach presented in [1] can be illustrated by means of an example. The consistent answers to the query Q 1(x, y) : Students(x, yc(1) can be obtained by posing the query Q′(x, y) : Students(x, y) ∧ \(\neg\)z(Students(x, z) ∧ zy) to the database. The new query collects as normal answers those tuples where the value of the first attribute is not associated to two different values of the second attribute in the relation. It can be seen that the answer set for the new query can be computed in polynomial time in the size of the database.

In this example, a query expressed in first-order predicate logic was rewritten into a new query expressed in the same language. It has been established in the literature that, for complexity-theoretic reasons, a more expressive language to do the rewriting of a first-order query may be necessary. For example, it may be necessary to do the rewritings as queries written in expressive extensions of Datalog [24].

If a database is inconsistent wrt referential ICs, like the instance in Fig. 1 and the constraint in (2), it is natural to restore consistency by deleting tuples or inserting tuples containing null values for the existentially quantified variables in the ICs. For example, the tuple (105, comp120) could be deleted from Enrollment or the tuple (105, null) could be inserted in relation Students. This requires a modification of the notion of repair and a precise semantics for satisfaction of ICs in the presence of null values [2,4].

Key Applications

Key applications of consistent query answering (CQA) are still missing. Applications to virtual data integration look promising, and also applications to data cleaning.

Future Directions

There are many open problems and research directions, among them, and most prominently, the development of key applications of CQA. A more precise characterization of the languages that are needed for doing CQA using query rewriting is also missing. It also becomes necessary to shed more light on the right kind of repair semantics to use depending on the application. CQA in a dynamic setting, when the databases is subject to updates, has not been investigated much. Integrity constraints and consistency issues for the relational model of data have been investigated for many years. However, there are other data models, e.g., spatial databases, for which much research of this kind is still necessary.

Cross-references

Active Databases

Data Cleaning

Logical Data Integration

Logics and Databases

Null Values

Relational Theory