Integrity Rules and Normalization

  • Elvis C. Foster
  • Shripad Godbole


In order to design high-quality databases, you need to be cognizant of the fundamental integrity and normalization rules for the database. We will discuss these rules in this chapter. The sections in this chapter include:


  • Fundamental Integrity Rules

  • Foreign Key Concept

  • Rationale for Normalization

  • Functional Dependence and Non-Loss Decomposition

  • First Normal Form

  • Second Normal Form

  • Third Normal Form

  • Boyce/Codd Normal Form

  • Fourth Normal Form

  • Fifth Normal Form

  • An Example

  • Other Normal Forms

  • Summary and Concluding Remarks

4.1 Fundamental Integrity Rules

Two fundamental integrity rules that the database designer must be cognizant of are the entity integrity rule and the referential integrity rule.

Entity Integrity Rule : The entity integrity rule states that no component of the primary key in a base relation is allowed to accept nulls. Put another way, in a relational model, we never record information about something that we cannot identify. Three points are worth noting here:
  • The rule applies to base relations.

  • The rule applies to the primary key, not alternate keys.

  • The primary key must be wholly non-null.

Recall from the previous chapter that the primary key may be a single attribute or a composite key; either way, a null value is not allowed. If the primary key is a single attribute, it should not be null for any tuple in the relation. If the primary key is a composite key, then no attribute comprising this composite key should be null for any tuple in the relation. Also note that for composite primary keys, the correct attributes must be chosen. Choosing fewer attributes than required causes the primary key to lose the uniqueness quality required for distinguishing tuples; choosing more attributes than required makes the primary key redundant and increases the database overhead.

Referential Integrity Rule : The referential integrity rule states that the database must not contain unmatched foreign key values. By unmatched foreign key value, we mean a non-null foreign key value for which there is no match in the referenced (target) relation. Put another way, if B references A, then A must exist. The following points should be noted:
  • The rule requires that foreign keys must match primary keys, not alternate keys.

  • The foreign key and referential integrity are defined in terms of each other. It is not possible to explain one without mentioning the other.
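The two rules can be made concrete with a small sketch. The functions, table layouts, and sample values below are hypothetical illustrations (a real DBMS enforces these constraints declaratively); they simply apply the two definitions to in-memory tables represented as lists of dicts.

```python
# Hypothetical sketch: checking the two fundamental integrity rules on
# in-memory tables. Table and attribute names are illustrative only.

def violates_entity_integrity(rows, pk_attrs):
    """Entity integrity: no component of the primary key may be null,
    and no two tuples may share the same primary key value."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in pk_attrs)
        if any(v is None for v in key) or key in seen:
            return True
        seen.add(key)
    return False

def unmatched_foreign_keys(referencing, fk_attr, referenced, pk_attr):
    """Referential integrity: return every non-null FK value that has
    no matching PK value in the referenced (target) relation."""
    targets = {row[pk_attr] for row in referenced}
    return [row[fk_attr] for row in referencing
            if row[fk_attr] is not None and row[fk_attr] not in targets]

department = [{"Dept#": 10, "DeptName": "Sales"}]
employee   = [{"Emp#": 1, "Emp_Dept#": 10},
              {"Emp#": 2, "Emp_Dept#": 99}]   # 99 has no match in department

print(violates_entity_integrity(department, ["Dept#"]))                    # False
print(unmatched_foreign_keys(employee, "Emp_Dept#", department, "Dept#"))  # [99]
```

Note that a null foreign key (as opposed to an unmatched one) passes the referential check, consistent with the rule as stated.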

4.2 Foreign Key Concept

The concept of a foreign key was introduced in the previous chapter. Let’s revisit this concept, by introducing a more formal definition:
  • Attribute FK of base relation R2 is a foreign key if and only if (denoted iff from this point) it satisfies the following conditions:

  a. Each value of FK is either wholly null or wholly non-null.

  b. There exists a base relation, R1, with primary key PK, such that each non-null value of FK is identical to the value of PK in some tuple of R1.


We will use the notation R1 → R2 to mean, the relation R1 references the relation R2. In this case, R1 is the referencing relation and R2 is the referenced (primary) relation. Since R1 is the referencing relation, it contains a foreign key. We will also use the notation R1{A, B, C, …} to mean, the relation R1 contains attributes A, B, C, and so on. Where specific examples are given, the relation name will be highlighted or placed in upper case; attribute names of specific examples will not be highlighted when stated with the related relation; however, they will be highlighted when reference is made to them from the body of the text.

This definition of a foreign key is quite loaded, and therefore deserves some clarification. Based on the stated definition, the following consequential points should be noted:
  1. The foreign key and the referenced primary key must be defined on the same domain. However, the attribute names can be different (do not be surprised if you see widespread violation of this principle).

  2. The foreign key need not be a component of the primary key for the host (holding) relation (in which case nulls may be accepted, but only with the understanding that they will be subsequently updated). To illustrate, in the following two relations, Employee references Department via a M:1 relationship; you could have an employee who is temporarily not assigned to a department.

     Department {Dept#, DeptName, . . .} PK[Dept#]

     Employee {Emp#, Emp_FName, Emp_LName, Emp_Dept#, . . .} PK[Emp#] FK[Emp_Dept#] references Department.Dept#

  3. If relations Rn, R(n-1), R(n-2), … R1 are such that Rn → R(n-1) → R(n-2) → … R2 → R1, then the chain Rn to R1 forms a referential path.

  4. In a referential path, a relation can be both referenced and referencing. Consider the referential path R3 → R2 → R1. In this case, R2 is both a referenced and a referencing relation.

  5. The simplest possible referential path is a self-referencing relation. To illustrate, consider a revised version of the aforementioned relation Employee where each employee tuple also includes a reference to the employee’s manager. In this case, we have a self-referencing relation. Thus,

     Employee {Emp#, Emp_FName, Emp_LName, Emp_Dept#, . . ., Emp_MgrEmp#} PK[Emp#] FK[Emp_Dept#] references Department.Dept# and FK[Emp_MgrEmp#] references Employee.Emp#

  6. More generally, a referential cycle exists when there is a referential path from Rn to itself: Rn → R(n-1) → … R1 → Rn.

  7. From this simple discourse, you should note the important role of foreign keys. Foreign keys are said to be the "glue" that holds the database together. As you will later see, relations are joined based on foreign keys.

Now that we have established the importance of foreign keys, we need to address a question: How will we treat deletion of referenced tuples? Three alternatives exist.
  • Restrict deletion to tuples that are not referenced.

  • Cascade deletion to all referencing tuples in referencing relations.

  • Allow the deletion but nullify all referencing foreign keys in referencing relations.

The nullification strategy is particularly irresponsible if applied to component relationships, as it could quite quickly plunge the integrity of the database into question, by introducing orphan records (i.e. records with unmatched references). However, it could be applied to some M:1 relationships that are not component relationships (more on this in Chapter  5). For instance, referring to the Employee and Department relations above, we could remove a particular department from the database, and then nullify the FK values for employees that belonged to that department; the assumption would be that these employees would be subsequently assigned to different departments.

Traditionally, DBMS suites implement the restriction strategy (and for good reasons). This is the most conservative and safest approach. If you desire to remove a referenced tuple from the database, you must first take appropriate action on all references to that referenced tuple, either by removing those referencing tuples or changing their references.

The cascading strategy has been surfacing in contemporary systems as an optional feature. It must be used with much care, as it is potentially dangerous when used without discretion. This strategy would not be the most prudent for the Department and Employee relations above: you may want to remove a department without removing all Employee tuples referencing that department. However, as you will see later in the chapter, an invoice is a perfect example where the cascading strategy would be applicable. An invoice typically consists of several line items (for instance, purchasing multiple items from a specific organization). The invoice is typically broken up into a summary portion along with corresponding details. If you desire to remove the invoice summary from the database, the details should also be removed, as otherwise the database would wind up with orphan records.
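The three strategies can be sketched procedurally. The function below is a hypothetical in-memory illustration using the Department/Employee relations from the text; in a real DBMS the same choices are expressed declaratively (e.g., as ON DELETE RESTRICT, CASCADE, or SET NULL clauses on the foreign key).

```python
# Hypothetical sketch of the three strategies for deleting a referenced tuple.
# Tables are lists of dicts; attribute names follow the text's Employee/Department.

def delete_department(dept_no, departments, employees, strategy):
    referencing = [e for e in employees if e["Emp_Dept#"] == dept_no]
    if strategy == "restrict" and referencing:
        # refuse the deletion: the department is still referenced
        raise ValueError("restricted: department is still referenced")
    if strategy == "cascade":
        # remove all referencing Employee tuples along with the department
        employees[:] = [e for e in employees if e["Emp_Dept#"] != dept_no]
    elif strategy == "nullify":
        # keep the employees but mark them as temporarily unassigned
        for e in referencing:
            e["Emp_Dept#"] = None
    departments[:] = [d for d in departments if d["Dept#"] != dept_no]

departments = [{"Dept#": 10, "DeptName": "Sales"}]
employees = [{"Emp#": 1, "Emp_Dept#": 10}, {"Emp#": 2, "Emp_Dept#": None}]
delete_department(10, departments, employees, "nullify")
print(employees)   # both employees remain, now awaiting reassignment
```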

4.3 Rationale for Normalization

Normalization is the process of ensuring that the database (conceptual schema) is defined in such a manner as to ensure efficiency and ease of data access. Normalization ensures and/or enhances the following database features:
  • Entity and Referential Integrity: The normalization process results in a set of relations that are logically structured such that each relation has its primary key or candidate keys identified. Foreign keys are strategically inserted into specific referencing relations to map back to related primary keys in the target (referenced) relations. The DBMS can then be allowed to enforce appropriate referential integrity constraints when the database is implemented.

  • Control of Redundancy: Through a process called non-loss decomposition , the normalization process is geared toward reducing unnecessary data duplication.

  • Logical Data Independence: Normalization leaves the logical structure of the database (i.e. the conceptual schema) in a state that easily facilitates the construction of logical integration and aggregation of data from various normalized relations, to meet the assorted needs of end users. These logical maneuvers do not necessarily mandate changes in the application programs that access the database; neither do changes in the application programs necessarily force structural changes in the database.

  • Avoidance of Modification Anomalies: As you will soon see, normalization resolves undesirable structure of data that causes data redundancy, which in turn causes various modification anomalies, as explained below.

  • Efficient Data Storage and Retrieval: As stated in the definition above, the primary objective of normalization is efficient storage and subsequent retrieval of data.

The contrasting argument is equally important: in the absence of normalization, various problems are likely to occur. Among these problems are the following:
  • Data redundancy that leads to modification anomalies

  • Modification anomalies which include
    • Insertion Anomaly: Data cannot be inserted when it is desirable to do so; one has to wait on some future data, due to organization of the data structure

    • Deletion Anomaly: Deletion of some undesirable aspect(s) of data necessarily means deletion of some other desirable aspect(s) of data

    • Update Anomaly: Update of some aspect(s) of data necessarily means update of other aspect(s) of data

  • Inefficient file manipulation; lack of ease of data access

  • Inhibition to the achievement of logical data independence

  • Compromise on the integrity of data

  • Pressure on programming effort to make up for the poor design

Figure 4-1 indicates the six most commonly used normal forms. The hierarchy is such that a relation in a given normal form is automatically in all normal forms prior to it. Thus, a relation in the second normal form (2NF) is automatically in the first normal form (1NF); a relation in the third normal form (3NF) is in 2NF and so on. Edgar Frank Codd defined the first three normal forms in the early 1970s; the Boyce-Codd normal form (BCNF) was subsequently deduced from his work. The fourth and fifth normal forms (4NF and 5NF) were subsequently defined by Ronald Fagin in the late 1970s.
Figure 4-1. Normal forms

The normalization procedure involves decomposing relations into other relations of repeatedly higher normal forms. The process is reversible. Moreover, normalization must be conducted in a manner that ensures that there is no loss of information.

4.4 Functional Dependence and Non-Loss Decomposition

Before discussion of the normal forms, we need to define and clarify two fundamental concepts: functional dependence and non-loss decomposition.

4.4.1 Functional Dependence

Given a relation, R{A, B, C, …}, then attribute B is functionally dependent on attribute A, written A → B (read as "A determines B") iff each value of A in R has precisely one B-value in R at any point in time. Attributes A and B may or may not be composite.

An alternate way to describe functional dependence (FD) is as follows: given a value of attribute A, one can deduce a value for attribute B since any two tuples that agree on A must necessarily agree on B.

To further clarify the concept of an FD, consider storing information about employees in an organization. Example 4-1 revisits the Employee relation of earlier discussion, showing how an FD may be defined on the attribute representing the employee’s identification number. In this example, the employee’s identification number (Emp#) is fulfilling the role of primary key.

Example 4-1. Illustrating the Concept of FD

Consider the relation Employee as outlined below:

Employee{Emp#, Emp_FName, Emp_LName, Emp_Dept#, … Emp_MgrEmp#}

The following FD holds

Emp# → Emp#, Emp_FName, Emp_LName, Emp_Dept#, Emp_MgrEmp#

What this means is that each employee is uniquely identified by his/her employee number. By knowing the employee number, all other attribute values for that employee can be deduced.

From the definition of primary key (which will be frequently abbreviated as PK from this point), all attributes of a relation are functionally dependent on the PK. This is precisely what is required; in fact, an attribute (or group of attributes) qualifies as a candidate key iff all other attributes of the entity are dependent on it.

We need to further refine the concept of FD by introducing two new terms, full functional dependence and partial functional dependence: let X and Y represent two distinct combinations of one or more attributes, and let X → Y. In such a scenario, Y is fully functionally dependent on X if it is functionally dependent on X and not functionally dependent on any proper subset of X. Any dependence on a subset of X is said to be a partial FD.

To illustrate full and partial FDs, consider the relation R{A, B, C, D, …} with PK[A, B]. Then [A, B] → C, D represents a full FD provided that neither A alone nor B alone determines C, D. However, any of the following FDs would represent a partial FD: A → C, A → D, B → C, or B → D.

Understanding what FDs are and how to identify them are both essential to designing a database. Based on the aforestated definition, here are three noteworthy consequences:
  1. FD constraints have similarities with referential constraints, except that here, reference is internal to the relation.

  2. FDs help us to determine primary keys.

  3. Each FD defines a determinant in a relation: the attribute(s) on the right are dependent on the attribute(s) on the left; the attribute(s) on the left constitute(s) a determinant.


Next are the trivial functional dependencies. An FD is trivial if the attribute(s) on the right form(s) a subset of the attribute(s) on the left. For instance, in relation R{A, B, C, D, …} with PK [A, B], the following FDs are trivial: [A, B] → A; [A, B] → B; and [A, B] → A, B.
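The defining test for an FD (any two tuples that agree on the left side must agree on the right side) translates directly into code. The sketch below is illustrative; the function and sample tuples are assumptions of this example, and bear in mind that checking sample data can refute an FD but never prove that it holds in general.

```python
# Illustrative check of whether X -> Y holds in a sample extension of a
# relation (tuples represented as dicts). Two tuples that agree on X but
# disagree on Y refute the FD.

def fd_holds(rows, lhs, rhs):
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False        # same X-value, different Y-value
    return True

employee = [
    {"Emp#": 1, "Emp_LName": "Jones", "Emp_Dept#": 10},
    {"Emp#": 2, "Emp_LName": "Singh", "Emp_Dept#": 10},
]
print(fd_holds(employee, ["Emp#"], ["Emp_LName", "Emp_Dept#"]))  # True
print(fd_holds(employee, ["Emp_Dept#"], ["Emp_LName"]))          # False
```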

4.4.2 Non-Loss Decomposition

We come now to the concept of non-loss decomposition (NLD) . An NLD of a relation is decomposition of the relation into multiple (at least two) relations in such a way that the new relations can be recombined to yield the original; there is no loss of data from the decomposition of the original relation. Let’s start with an example (see also Figure 4-2). Suppose that we have a relation R0 storing basic supplies to a business as follows:
Figure 4-2. FD diagram for relation R0

  • R0{Suppl#, SupplName, Item#, ItemName, Quantity, SupplStatus, Location} PK [Suppl#, Item#]

Functional dependencies of R0 are represented in the illustrative FD diagram of Figure 4-2; they may also be listed as follows:
  • [Suppl#, Item#] → {Quantity, SupplName, SupplStatus, Location, ItemName}

  • Suppl# → {SupplName, SupplStatus, Location}

  • Item# → ItemName

As you will soon see, storing R0 as a single entity (file) would introduce a duplication problem that would in turn introduce a number of other problems. As the size of the relation increases (i.e., more data is added over time), these related problems would be exacerbated to a virtually uncontrollable level. The reason is that R0 has too many unresolved FDs (i.e., the relation is not sufficiently normalized). As an alternative, we could have the following relations:
  • R1{Suppl#, SupplName, Location, SupplStatus} PK[Suppl#]

  • R2{Item#, ItemName} PK [Item#]

  • R3{Suppl#, Item#, Quantity} PK[Suppl#, Item#]

In this scenario, R1, R2, and R3 would constitute an example of a non-loss decomposition (NLD) of R0. What this means is that we can forgo storing R0 as originally proposed, and instead store R1, R2, and R3. Moreover, in so doing, we would avoid the problems related to storing R0 in its original form, but still be able to reconstruct it by combining the three replacement relations.

As you will soon see, making NLDs on relations is an integral part of the normalization process. Here is the formal definition of an NLD:
  • If R is a relation and P1, P2, … Pn are projections on R such that

  •      P1 JOIN P2 JOIN … JOIN Pn = R,

  • then P1, P2, … Pn constitutes a non-loss decomposition of R.

Notice that we have glided into two new terms, projection and join. These will be formally treated later in the course. Suffice it for now to say that a projection on a relation is an extraction (into a new relation) of some attributes of the relation. A join requires at least two relations and may be construed as the opposite of a projection: if you can project R into P1 and P2, then you may be able to join P1 and P2 to yield R.
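The projection/join relationship can be demonstrated in a few lines of code. The helper functions and sample tuples below are illustrative sketches (not part of the text): `project` extracts distinct sub-tuples over chosen attributes, and `natural_join` combines tuples that agree on all shared attributes. Rejoining the three projections of R0 recovers R0 exactly, which is what makes the decomposition non-loss.

```python
# Illustrative sketch: project the text's R0 into R1, R2, R3, then
# natural-join them back and confirm no information was lost.

def project(rows, attrs):
    # distinct sub-tuples over the chosen attributes
    seen, out = set(), []
    for r in rows:
        t = tuple((a, r[a]) for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(dict(t))
    return out

def natural_join(left, right):
    # join on all attributes the two relations have in common
    common = set(left[0]) & set(right[0])
    return [{**l, **r} for l in left for r in right
            if all(l[a] == r[a] for a in common)]

r0 = [
    {"Suppl#": "S1", "SupplName": "Acme", "Location": "NY", "SupplStatus": 20,
     "Item#": "I1", "ItemName": "Bolt", "Quantity": 5},
    {"Suppl#": "S1", "SupplName": "Acme", "Location": "NY", "SupplStatus": 20,
     "Item#": "I2", "ItemName": "Nut", "Quantity": 7},
]
r1 = project(r0, ["Suppl#", "SupplName", "Location", "SupplStatus"])
r2 = project(r0, ["Item#", "ItemName"])
r3 = project(r0, ["Suppl#", "Item#", "Quantity"])

def canon(rows):
    return sorted(sorted(r.items()) for r in rows)

rebuilt = natural_join(natural_join(r3, r1), r2)
assert canon(rebuilt) == canon(r0)   # the decomposition is non-loss
```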

So has it occurred to you that identifying FDs and managing NLDs are two imperatives for the normalization process? Indeed they are! Given this definition of an NLD, we need to address the following questions:
  1. How do we find non-loss decompositions?

  2. When should we replace a relation by a non-loss decomposition?

  3. What are the advantages?

Heath’s theorem (of [Heath, 1971]) addresses questions (1) and (2). The answer to the third question has already been stated in section 4.3 above. Heath’s theorem states
  • If we have a relation R{A, B, C, …} and if A → B and B → C, then projections P1{A, B} and P2{B, C} constitute a non-loss decomposition of R.


Heath’s theorem effectively addresses the questions of when to conduct an NLD and how to do so. Each of the letters [A, B, and C] in the theorem represents a single attribute or a combination of attributes. Example 4-2 provides a proof of the theorem.

Example 4-2. Proof of Heath’s Theorem

We wish to show that for R0 {A, B, C}, if A → B and B → C, then R0 = P1 {A, B} JOIN P2 {B, C}.

Let P1 {A, B} and P2 {B, C} be projections of R0 {A, B, C}. Assume further that A, B, C are single attributes.

Suppose that (a, b, c) is a tuple in R0. Then (a, b) is in P1 and (b, c) is in P2.

So (a, b, c) is in P1 JOIN P2  ................ (1)

Suppose that (a, b, c) is in P1 JOIN P2. Then (a, b, c1) is in R0 for some value c1, and (a1, b, c) is in R0 for some value a1. But B → C; therefore the B-value b corresponds to exactly one C-value, so c1 must be c.

Therefore (a, b, c) is in R0 ..................  (2)

We have shown that any tuple (a, b, c) that is in R0 is also in P1 JOIN P2, and that any tuple (a, b, c) that is in P1 JOIN P2 is in R0. Therefore R0 = P1 JOIN P2.

Corollary from Heath's Theorem

An important corollary from Heath’s theorem informs us when to discontinue or desist from further decomposition:
  • If P1, P2, … Pn is a non-loss decomposition of R and the relations P1, P2, … Pn all share a candidate key, then there is no reduction in data duplication.

Conclusion from Heath's Theorem and Its Corollary

Based on Heath’s theorem and its corollary, we can assert with confidence the following advice:
  • Decompose only when there is non-loss decomposition such that the resulting relations do not share a candidate key.

  • Do not decompose if each resulting relation does not have a candidate key.

Example 4-3 illustrates the importance of the above-mentioned corollary. The example shows two flawed decompositions of the relation Student {Stud#, StudName, StudGPA, StudDept#} PK [Stud#].

Example 4-3. Illustrating Usefulness of Heath’s Theorem Corollary

Consider the relation Student{Stud#, StudName, StudGPA, StudDept#} PK[Stud#].

Now consider projections S1{Stud#, StudName} PK[Stud#] and S2{Stud#, StudGPA, StudDept#} PK [Stud#].

S1 and S2 would not represent a useful pair of decompositions because they both share the same primary key; such decompositions would not curtail duplication but would worsen it by storing Stud# in two different relations.

Also, decomposition of Student into S3{Stud#, StudName} and S4{StudGPA, StudDept#} makes no sense: S4 has no candidate key, so the two projections could not be joined to reconstruct Student.

With this established background, let’s now proceed to discussing the normal forms. Starting with the first, a section is dedicated to each normal form.

4.5 First Normal Form

We begin with a definition of the first normal form:
  • A relation is in the first normal form (1NF) iff all its attributes are atomic (i.e., each attribute is defined on a single domain), and each record in the relation is characterized by a unique primary key value.

In defining atomic attributes, Codd was deliberate about eliminating repeating groups (the scenario where a column (attribute) could store an array or list of possible values); this was prevalent in CODASYL and COBOL. Atomicity means that each attribute is defined on a single domain and stores a single value for any tuple in the relation. Moreover, implicit in this definition is the enforcement of the entity integrity constraint (section 4.1). As such, tuples that have null values in their primary key will not be allowed to exist; neither will there be tuples with duplicate primary key values in the relation. (These are problems that persisted and perhaps still do outside of the context of 1NF.)

By this definition, all relations are in 1NF. This is by no means coincidental, but by design: we define a relation to consist of atomic attributes and to be subject to the entity integrity constraint and the referential integrity constraint. However, as you will soon see, having relations in 1NF only is often not good enough.

Example 4-4 provides an example of a relation that is in 1NF but is poorly designed. It may surprise you to learn that this problem was once very widespread in accounting software systems. As an exercise, try proposing a more efficient design for the relation described in the example.

Example 4-4. Example of a Poorly Designed 1NF Relation

Consider the relation EndOfMonth {Acct#, Dept#, Bal1, Bal2, … Bal13} PK [Acct#, Dept#]

Here are some insights:
  1. This relation is intended to store monthly balances on various accounts in an organization.

  2. Bal1 … Bal13 are defined on the same domain and therefore constitute a vast amount of wasted space (this scenario is often incorrectly described as a repeating group, but is different from the original meaning of the term).

  3. The only time that Bal1 … Bal13 are all non-null is after Bal13 is calculated; this is normally done at the end of the accounting year.

  4. At the end of each accounting period, this relation is cleared and reinitialized for the next accounting period.


4.5.1 Problems with Relations in 1NF Only

In the previous section, R0 was presented as a non-normalized file to be replaced with three other relations. Let’s revisit that discussion for further insight. For ease of reference, R0 is repeated here:
  • R0{Suppl#, SupplName, Item#, ItemName, Quantity, SupplStatus, Location} PK [Suppl#, Item#]

Relation R0 is in 1NF only. However, it is undesirable to store it in its current form due to a number of problems. For ease of reference, the functional dependencies of R0 as illustrated in Figure 4-2 are repeated here:
  • FD1: Suppl# → {SupplName, SupplStatus, Location}

  • FD2: Item# → ItemName

  • FD3: [Suppl#, Item#] → {Quantity, SupplName, SupplStatus, Location, ItemName}

Figure 4-3 shows some sample data depicting various shipments of inventory items for R0. A cursory analysis of the data reveals the following problems:
Figure 4-3. Sample data representing a shipment of supplies (for relation R0)

  • Replication of Data: Every time we record a supplier-item pair, we also have to record the supplier name and item name.

  • Insertion Anomaly: We cannot insert a new item until it is supplied; neither can we insert a new supplier until that supplier supplies some item.

  • Deletion Anomaly: We cannot delete an item or a supplier without destroying an entire shipment, as well as information about a supplier's location.

  • Update Anomaly: If we desire to update a supplier's location or item name, we have to update several records (in fact, an entire shipment) due to the duplication problem.

Insertion, deletion, and update anomalies constitute modification anomalies, caused by duplication of data due to improper database design. As stated earlier, these problems are exacerbated as more data are added, thus leading to other problems of data access and database performance — a tangled hellish web.

4.6 Second Normal Form

The second normal form draws on the concept of functional dependence to set a benchmark beyond the mere 1NF requirement. Here is the definition:
  • A relation is in the second normal form (2NF) iff it is in 1NF and every non-key attribute is fully functionally dependent on the primary key.

By non-key attribute, we mean that the attribute is not part of the primary key. Relation R0 (of the previous section), though in 1NF, is not in 2NF, due to FD1 and FD2. Using Heath's theorem, we may decompose relation R0 as follows (note that the abbreviation PK is used to denote the primary key):
  • R1{Suppl#, SupplName, Location, SupplStatus} PK[Suppl#]

  • R2{Item#, ItemName} PK[Item#]

  • R3{Suppl#, Item#, Quantity} PK[Suppl#, Item#]

We then check to ensure that the resulting relations are in 2NF. Relation R1 has a single attribute as its primary key, and so does R2; there is therefore no possibility of either relation being in violation of 2NF. As for relation R3, there is only one non-key attribute and it is dependent on the primary key. We may therefore conclude with confidence that all three relations (R1, R2, and R3) are in 2NF.
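The 2NF test (no non-key attribute partially dependent on the composite primary key) can itself be mechanized against sample data. The sketch below is illustrative; the function names and tuples are assumptions of this example, and, as with any check against sample data, it can only detect partial dependencies that the sample happens to exhibit.

```python
# Illustrative sketch of the 2NF test: look for a non-key attribute that is
# functionally dependent on a proper subset of the composite primary key.
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False
    return True

def partial_dependencies(rows, pk, nonkey):
    found = []
    for size in range(1, len(pk)):              # proper subsets of the PK
        for subset in combinations(pk, size):
            for attr in nonkey:
                if fd_holds(rows, list(subset), [attr]):
                    found.append((subset, attr))
    return found

r0 = [
    {"Suppl#": "S1", "Item#": "I1", "SupplName": "Acme", "Quantity": 5},
    {"Suppl#": "S1", "Item#": "I2", "SupplName": "Acme", "Quantity": 7},
    {"Suppl#": "S2", "Item#": "I1", "SupplName": "Brix", "Quantity": 3},
]
# SupplName depends on Suppl# alone: a partial dependency, so R0 is not in 2NF.
print(partial_dependencies(r0, ["Suppl#", "Item#"], ["SupplName", "Quantity"]))
# → [(('Suppl#',), 'SupplName')]
```

Running the same check on R3 (whose only non-key attribute, Quantity, depends on the whole key) would report no partial dependencies.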

So based on the definition of 2NF, and on the authority of Heath’s theorem, we would replace R0 with R1, R2, and R3. Please note the consequences of our treatment of R0 so far:
  1. The problems with relations in 1NF only have been addressed.

  2. By decomposing, we have introduced foreign keys in relation R3.

  3. JOINing is the opposite of PROJECTing. We can rebuild relation R0 by simply JOINing R3 with R1 and R3 with R2, on the respective foreign keys.

  4. From the definition of 2NF, two observations should be obvious. Firstly, if you have a relation with a single attribute as the primary key, it is automatically in 2NF. Secondly, if you have a relation with n attributes and n-1 of them form the primary key, the relation may very well be in 2NF, but you must first verify this.


4.6.1 Problems with Relations in 2NF Only

In this example, relations R2 and R3 are in 2NF (in fact, they are in 3NF), but we still have potential problems with R1. What if we have a situation where there are several suppliers from a given location? Or what if we want to keep track of locations of interest? In either case, we would have modification anomalies as described below:
  • Insertion Anomaly: We cannot record information about a location until we have at least one supplier from that location.

  • Deletion Anomaly: We cannot delete a particular location without also deleting supplier(s) from that location.

  • Update Anomaly: If we wish to update information on a location, we have to update all supplier records from that location.

These problems can be addressed if we take the necessary steps to bring R1 into the third normal form (3NF). But first, we must define 3NF.

4.7 Third Normal Form

The third normal form extends constraints related to functional dependence to non-key attributes (i.e. attributes that are not part of the primary key). The following are three alternate definitions:
  • A relation is in the third normal form (3NF) iff it is in 2NF and no non-key attribute is fully functionally dependent on other non-key attribute(s).

  • Put another way, a relation is in 3NF iff non-key attributes are mutually independent and fully functionally dependent on the primary key. (Two or more attributes are mutually independent if none of them is functionally dependent on any combination of the others.)

  • Put another way, a relation is in 3NF iff it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. (Non-transitivity implies mutual independence.)

Transitive dependence refers to dependence among non-key attributes. In particular, if A → B and B → C, then C is transitively dependent on A (i.e., A → C transitively).

Let’s revisit relations R1, R2, and R3 of the previous section (and recall that these are decompositions of R0). Relation R1 is problematic because it is not in 3NF. If it is desirable to store additional information about the locations as indicated in the previous section, then we must be smart enough to discern that location is to be treated as an entity with attributes such as location code, location name, and perhaps others. We may therefore rewrite R1 as follows:
  • R1{Suppl#, SupplName, LocationCode, LocationName, SupplStatus} PK[Suppl#]

Now observe that LocationCode → LocationName! R1 therefore violates 3NF and must be decomposed. Using Heath's theorem, we may therefore decompose R1 as follows:
  • R4{Suppl#, SupplName, LocationCode} PK[Suppl#]

  • R5{LocationCode, LocationName} PK[LocationCode]

We now check to ensure that the relations are in 3NF (and they are). Again, please take careful notice of the consequences of our actions to this point:
  1. The problems with relations in 2NF only have been addressed.

  2. Again, by decomposing, we have introduced a foreign key in relation R4.

  3. We can obtain the information that relation R1 represented by simply JOINing R4 with R5 on the foreign key.

  4. From the definition of 3NF, it should be obvious that if you have a 2NF relation with one candidate key and n mutually independent non-key attributes, or only one non-key attribute, that relation is in 3NF.


4.7.1 Problems with Relations in 3NF Only

Relations R2, R3, R4, and R5 above are all in 3NF. However, it has been found that 3NF-only relations suffer from certain inadequacies. It is well known that 3NF does not deal satisfactorily with cases where the following circumstances hold:
  • There are multiple composite candidate keys in a relation.

  • The candidate keys overlap (they have at least one attribute in common).

For these situations, the Boyce-Codd normal form (BCNF) provides the perfect solution. As you shall soon see, the BCNF is really a refinement of 3NF. In fact, where the above-mentioned conditions do not hold, BCNF reduces to 3NF.

4.8 Boyce-Codd Normal Form

The Boyce-Codd normal form (BCNF ) was developed thanks to the effort of Raymond F. Boyce and Edgar F. Codd. This normal form is really a refinement of the third normal form. Simply, the BCNF requirement states
  • A relation is in BCNF iff every determinant in the relation is a candidate key.

Drawing from subsection 4.4.1, a determinant is an attribute (or group of attributes) on which some other attribute(s) is (are) fully functionally dependent. Examination of R2, R3, R4, and R5 above will quickly reveal that they are in BCNF (hence 3NF). We therefore need to find a different example that illustrates the importance of BCNF.

Consider the situation where it is desirable to keep track of animals in various zoos, multiple keepers, and the keepers assigned to these animals. Let’s tentatively construct the relation R6 as
  • R6{Zoo, Animal, Keeper}

Assume further that a keeper works at one and only one zoo. We can therefore identify the following FDs:
  • [Zoo, Animal] → Keeper

  • Keeper → Zoo

Given the above, we conclude that [Zoo, Animal] is the primary key. Observe that R6 is in 3NF but not in BCNF, since Keeper is not a candidate key but is clearly a determinant. Using Heath's theorem, we may decompose R6 as follows:
  • R7{Animal, Keeper} PK[Animal]

  • R8{Keeper, Zoo} PK[Keeper]
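If you wish to mechanize the BCNF test, the classical attribute-closure algorithm suffices. The sketch below is an illustrative implementation (not part of the case itself); it confirms that Keeper is a determinant of R6 but not a candidate key:

```python
# Attribute-closure algorithm; the FDs below encode the zoo case:
# [Zoo, Animal] -> Keeper and Keeper -> Zoo.
def closure(attrs, fds):
    """Return the closure of attrs under fds (a list of (lhs, rhs) frozenset pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_candidate_key(attrs, all_attrs, fds):
    """A candidate key must determine every attribute of the relation."""
    return closure(attrs, fds) == set(all_attrs)

R6_attrs = {"Zoo", "Animal", "Keeper"}
fds = [
    (frozenset({"Zoo", "Animal"}), frozenset({"Keeper"})),
    (frozenset({"Keeper"}), frozenset({"Zoo"})),
]
```

Keeper determines Zoo, yet its closure falls short of R6’s full attribute set; so Keeper is a determinant that is not a candidate key, which is precisely the BCNF violation.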

Finally, let’s clean up by making the observation that Zoo, Keeper, and Animal are really disguised entities, not mere attributes (similar to Location in the previous section), and should be treated as such. Minimally, each would require a code and a name. Below is the revised list of normalized relations for the case:
  • R7{ZooID, ZooName, . . .} PK[ZooID]

  • R8{KeeperID, KeeperName, . . .ZooID} PK[KeeperID]

  • R9{AnimalCode, AnimalName, . . .KeeperID} PK[AnimalCode]

Notice that the test for BCNF is quite straightforward: you simply check to ensure that every determinant in each relation is a candidate key. As on previous occasions, let’s examine the consequences of our action:
  1. By achieving BCNF, we benefit from further reduction in data duplication and modification anomalies.

  2. A further advantage is that we can now store dangling tuples. In our example, a keeper can be assigned to a zoo even before he/she is assigned an animal.

  3. One drawback of this schema is that we cannot store an animal that has not been assigned a keeper or zoo. The schema can be further refined, but that requires a discussion of the fourth normal form (4NF); this will be done in the upcoming section.

  4. One possible drawback with BCNF is that more relations have to be accessed (joined) in order to obtain useful information. Again referring to the example, R9 must be joined with R8 in order to derive Zoo-Animal pairs; and to obtain the name of the zoo (and no doubt other information), a second join must be made with R7. In light of the fact that logical views can facilitate seamless end-user perspectives that obscure these underlying joins (more on this in Chapter  13), and that processing power continues on its upswing (according to Moore’s Law), the benefits of achieving BCNF tend to outweigh the drawbacks by a large margin.


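The join chain described in point 4 can be sketched as follows. The sample rows are hypothetical, and the helper implements a simple natural join over dictionary-based tuples:

```python
# Hypothetical rows for the normalized zoo relations R7, R8, R9
R7 = [{"ZooID": "Z1", "ZooName": "City Zoo"}]
R8 = [{"KeeperID": "K1", "KeeperName": "Jones", "ZooID": "Z1"}]
R9 = [{"AnimalCode": "A1", "AnimalName": "Lion", "KeeperID": "K1"}]

def natural_join(r, s):
    """Natural join on the attributes the two relations share."""
    common = set(r[0]) & set(s[0])
    return [{**a, **b} for a in r for b in s
            if all(a[c] == b[c] for c in common)]

# Two joins are needed to pair each animal with the name of its zoo:
# R9 joins R8 on KeeperID, and the result joins R7 on ZooID.
animal_zoo = natural_join(natural_join(R9, R8), R7)
```

The price of the extra joins buys the reduced duplication and the ability to store dangling tuples discussed above.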

The principle of BCNF is very simple but profound. By being guided by it, you can, in many circumstances, actually bypass obtaining 2NF and 3NF relations, and move directly into a set of BCNF relations. Adopting this approach will significantly simplify the analysis process. Moreover, in most practical situations, you will not be required to normalize beyond BCNF. This approach will be further clarified in the next chapter.

4.9 Fourth Normal Form

The fourth normal form (4NF) relates to the situation where mutually independent but related attributes form a relation and the inefficient arrangement causes duplication and hence modification anomalies. Consider the relation CTT-Schedule , representing course-teacher-text combinations in an educational institution. Assume the following business rules:
  a. A course can be taught by several teachers.

  b. A course can require any number of texts.

  c. Teachers and texts are independent of each other; in other words, the same texts are used irrespective of who teaches the course.

  d. A teacher can teach several courses.

Figure 4-4 provides some sample data for the purpose of illustration. Notice the high level of duplication due to the structure of the relation and the prevailing business rules.
Figure 4-4.

CTT schedule

Note that the theory so far does not provide a method of treating such a situation, except flattening the structure (by making each attribute part of the primary key) as shown below:
  • R10{Course, Teacher, Text} PK[Course, Teacher, Text]

Since R10 is keyed on all its attributes, it is in BCNF. Yet, two potential problems are data redundancy and modification anomalies (the former leading to the latter). In our example, in order to record that Calculus II is taught by both Professor Cross and Professor Maitland, four records are required. In fact, if a course is taught by p professors and requires n texts, the number of records required to represent this situation is p*n. This is extraordinary, and could prove to be very demanding on storage space, especially as the number of records increases.

Relation R10, though in BCNF, is not in 4NF, because it has a peculiar dependency called a multi-valued dependency (MVD). In order to state the 4NF, we must first define MVD.

4.9.1 Multi-Valued Dependency

A multi-valued dependency (MVD ) is defined as follows:
  • Given a relation R{A, B, C}, the MVD A -» B (read “A multi-determines B”) holds iff every B-value matching a given (A-value, C-value) pair in R depends only on the A-value and is independent of the C-value.

  • Further, given R{A, B, C}, A -» B holds iff A -» C also holds. MVDs always go together in pairs like this. We may therefore write A -» B/C.

MVDs do not occur as frequently as FDs. Moreover, as a consequence of the definition of MVD, it is important to note the following points:
  1. For an MVD to exist, the relation must have at least three attributes.

  2. All FDs are MVDs, but not all MVDs are FDs.

  3. A -» B reads “A multi-determines B” or “B is multi-dependent on A.”


Let’s get back to R10: Course -» Text/Teacher. Note that Course is the pivot of the MVD. Course -» Teacher holds since Teacher depends on Course, independent of Text; Course -» Text holds since Text depends on Course, independent of Teacher. So how do we resolve this? Fagin’s theorem provides the answer.

4.9.2 Fagin’s Theorem

Fagin’s theorem (named after Ronald Fagin, who proposed it) may be stated as follows:
  • Relation R{A, B, C} can be non-loss decomposed into projections R1{A, B} and R2{A, C} iff the MVDs A -» B/C both hold.

Note that like Heath’s theorem, which prescribes how to treat FDs, Fagin’s theorem states exactly how to treat MVDs. With this background, we can proceed to defining the 4NF:
  • A relation R is in 4NF iff whenever there exists an MVD, say A -» B, then all attributes of R are also functionally dependent on A.

  • Put another way, R{A, B, C…} is in 4NF iff every MVD satisfied by R is implied by the candidate key of R.

  • Put another way, R{A, B, C…} is in 4NF iff the only dependencies are of the form [candidate key] -» [other non-key attribute(s)].

  • Put another way, R{A, B, C …} is in 4NF iff it is in BCNF and there are no MVDs (that are not FDs).

The last formulation is particularly instructive. To paraphrase: whenever you encounter a relation that has an unresolved MVD, use Fagin’s theorem to resolve it by replacing the MVD with equivalent FDs. In the current example, R10 is not in 4NF: although it is in BCNF, an MVD exists. Using Fagin's theorem, we may decompose it as follows:
  • R11{Course, Text} PK[Course, Text]

  • R12{Course, Teacher} PK[Course, Teacher]
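We can verify both the MVD and the non-loss property of this decomposition with a short sketch. The sample rows are hypothetical, patterned after the CTT scenario:

```python
from itertools import product

# Hypothetical rows for R10{Course, Teacher, Text}
R10 = {
    ("Calculus II", "Cross", "Calculus Made Easy"),
    ("Calculus II", "Cross", "Advanced Calculus"),
    ("Calculus II", "Maitland", "Calculus Made Easy"),
    ("Calculus II", "Maitland", "Advanced Calculus"),
    ("Algebra", "Cross", "Modern Algebra"),
}

def mvd_holds(rows):
    """Check Course -» Teacher/Text: for each course, the (Teacher, Text)
    pairs must be exactly the cross product of its teachers and its texts."""
    for c in {c for c, _, _ in rows}:
        teachers = {t for cc, t, _ in rows if cc == c}
        texts = {x for cc, _, x in rows if cc == c}
        actual = {(t, x) for cc, t, x in rows if cc == c}
        if actual != set(product(teachers, texts)):
            return False
    return True

# Fagin's theorem: decompose into R12{Course, Teacher} and R11{Course, Text}
R12 = {(c, t) for c, t, x in R10}
R11 = {(c, x) for c, t, x in R10}

# Joining the two projections on Course reproduces R10 exactly (non-loss)
rejoined = {(c, t, x) for c, t in R12 for c2, x in R11 if c == c2}
```

Notice also the storage saving: the five triplets of R10 are replaced by three pairs in each projection, and the p*n blow-up disappears.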


Fagin’s theorem prescribes a method of decomposing a relation containing an MVD that is slightly different from the decomposition of an FD as prescribed by Heath's theorem. Figure 4-5 clarifies this.
Figure 4-5.

Treating MVDs

As an additional step to resolving the CTT-Schedule problem, it is a good idea to refine the solution by applying the following strategies: replace euphemistic relation names (R11 and R12) with more meaningful names; recognize that Course, Text, and Teacher are actually entities, not mere attributes; return to the use of unique attribute names by using the prefixing strategy that was introduced toward the end of Chapter  3. A more refined solution is as follows:
  • Course{Crs#, CrsName, CrsCredits, . . .} PK[Crs#]

  • Teacher{TeacherEmp#, Emp_FName, Emp_LName, . . .} PK[TeacherEmp#]

  • Text{TextCode, TextTitle, . . .} PK[TextCode]

  • CourseTextMap{CT_Crs#, CT_TextCode} PK[CT_Crs#, CT_TextCode]

  • CourseTeacherMap{CR_Crs#, CR_TeacherEmp#} PK[CR_Crs#, CR_TeacherEmp#]

From this proposed solution, it should be obvious to you that the attributes of CourseTextMap and CourseTeacherMap are foreign keys. You could then construct a RAL (review the latter part of Chapter  3) that clearly outlines the database conceptual schema for the stated problem; this is left as an exercise for you.

4.9.3 The Zoo Revisited

Let’s revisit the zoo problem of the previous section. It was mentioned that further refinement was needed to allow for storing animals not yet assigned to keepers or zoos. To facilitate this, we need to recognize the presence of an MVD:
  • Animal -» Keeper/Zoo

The partial solution given in section 4.8 is repeated here, with meaningful relation names introduced to replace the more cryptic (euphemistic) names used up to this point in the chapter.
  • Zoo{ZooID, ZooName, . . .} PK[ZooID]

  • Keeper{KeeperID, KeeperName, . . .ZooID} PK[KeeperID]

  • Animal{AnimalCode, AnimalName, . . .KeeperID} PK[AnimalCode]

Given the presence of the MVD Animal -» Keeper/Zoo, we need to refine the three relations by removing the prematurely inserted FK from Animal, and then apply Fagin’s theorem to introduce two new relations. The revised set of normalized relations follows (note the continued use of meaningful relation names and unique attribute names):
  • Zoo{ZooID, ZooName, . . .} PK[ZooID]

  • Keeper{KeeperID, KeeperName, . . .KeeperZooID} PK[KeeperID]

  • Animal{AnimalCode, AnimalName, . . .} PK[AnimalCode]

  • AnimalKeeperMap{AK_AnimalCode, AK_KeeperID} PK [AK_AnimalCode, AK_KeeperID]

  • AnimalZooMap{AZ_AnimalCode, AZ_ZooID} PK [AZ_AnimalCode, AZ_ZooID]

As in the previous subsection (with the CTT-Schedule problem), it should be obvious to you what the foreign keys are in entities Keeper, AnimalKeeperMap, and AnimalZooMap; construction of a detailed RAL for the stated problem is left as an exercise for you.

4.10 Fifth Normal Form

So far we have been treating relations that are decomposable into two other relations. In fact, there are relations that cannot be so decomposed, but can be decomposed into n other relations where n > 2. They are said to be n-decomposable relations (n > 2). The fifth normal form (5NF) is also commonly referred to as the projection-join normal form (PJNF), because it relates to decomposing a relation (not in 5NF) into n > 2 projections that can be rejoined to yield the original relation.

Recall the SupplierSchedule relationship (linking suppliers, inventory items, and projects) mentioned in Chapter  3 (Figures  3-3,  3-5,  3-13, and  3-17); it is represented here:
  • SupplierSchedule{Suppl#, Item#, Proj#} PK[Suppl#, Item#, Proj#]

Let’s assume that specific suppliers can supply specific items for specific projects. The relation represents a M:M relationship involving Suppliers, Items, and Projects. Observe the following features about the relation:
  1. SupplierSchedule is keyed on all attributes and therefore, by definition, is in BCNF. By inspection, it is also in 4NF. There is no unresolved MVD as was the case with the CTT-Schedule problem and the zoo case of earlier discussions. Here, the attributes are dependent on each other: suppliers supply inventory items for various projects.

  2. It is not possible to decompose this relation into two other relations without losing critical information.

  3. If there are S suppliers, N items, and J projects, then theoretically, there may be up to S*N*J records. Not all of these may be valid: a supplier may supply specific item(s) for specific project(s); not every item may be applicable to every project; and a supplier does not necessarily support every project.

  4. If we consider S suppliers, each supplying N items to J projects, then it does not take much imagination to see that a fair amount of duplication will take place, despite the fact that the relation is in 4NF.

Let’s examine a possible decomposition of SupplierSchedule as shown in Figure 4-6. If we employ the first two decompositions only, this will not result in a situation that will guarantee us the original SupplierSchedule. In fact, if we were to join these two decompositions (SI and IP), we would obtain a false representation of the original relation. The third projection (PS) is absolutely necessary if we are to have any guarantee of obtaining the original relation after joining the projections.
Figure 4-6.

Illustrating possible decompositions of SupplierSchedule

As you examine the figure, observe that the first join produces SupplierSchedule plus additional spurious tuples. The effect of the second join is to eliminate the spurious tuples. To put it into perspective, SupplierSchedule is subject to a (time-independent) 3-decomposable (3D) constraint, namely:
  • If (s, i) is in SI and (i, p) is in IP and (p, s) is in PS, then (s, i, p) is in SupplierSchedule.

This is an example of a join dependency (JD) constraint. This constraint exists because of the assumption made at the outset. If the prevailing business rules dictate otherwise, then the constraint would likely be different.
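The need for the third projection can be demonstrated concretely. In the sketch below (hypothetical sample tuples), joining SI and IP alone yields a spurious tuple; the further join with PS eliminates it and recovers the original relation:

```python
# Hypothetical SupplierSchedule tuples (Suppl#, Item#, Proj#)
SS = {("S1", "I1", "P2"), ("S1", "I2", "P1"),
      ("S2", "I1", "P1"), ("S1", "I1", "P1")}

SI = {(s, i) for s, i, p in SS}   # Supplier-Item projection
IP = {(i, p) for s, i, p in SS}   # Item-Project projection
PS = {(p, s) for s, i, p in SS}   # Project-Supplier projection

# First join: every (s, i, p) consistent with SI and IP
first = {(s, i, p) for (s, i) in SI for (i2, p) in IP if i == i2}

# Second join: keep only the triplets also consistent with PS
result = {(s, i, p) for (s, i, p) in first if (p, s) in PS}
```

Here the first join manufactures the spurious tuple (S2, I1, P2), which is not in SupplierSchedule; only after the join with PS does the result collapse back to the original four tuples.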

4.10.1 Definition of Join Dependency

A join dependency (JD ) constraint may be defined as follows:
  • Relation R satisfies the JD P1, P2, … Pn iff R = P1 JOIN P2 JOIN … JOIN Pn where the attributes of P1 … Pn are subsets of the attributes of R.

Relations that are in 4NF, but not in 5NF (such as SupplierSchedule) suffer from duplication, which in turn leads to modification anomalies. These problems are directly related to the presence of the JD constraint(s) in such relations. Fagin’s theorem for 5NF relations provides the solution.

4.10.2 Fagin's Theorem

Here now is Fagin’s theorem for the fifth normal form (5NF):
  • A relation R is in 5NF (also called PJNF) iff every JD in R is a consequence of the candidate keys of R.

  • In layman's terms, if a relation R is in 4NF and it contains a JD that renders it n-decomposable into P1, P2 … Pn, such that R = P1 JOIN P2 … JOIN Pn where n > 2, then such relation is not in 5NF. It may therefore be decomposed to achieve 5NF relations.

  • Put another way, a relation R is in 5NF iff it is in 4NF and is not further decomposable, except via decompositions based on a candidate key of R, with the minimum number of projections being 3.

Now examine relation SupplierSchedule. SupplierSchedule is not in 5NF because it has a JD (i.e., the JD constraint) that is not a consequence of its candidate key. In other words, SupplierSchedule can be decomposed, but this is not implied by its candidate key [Suppl#, Item#, Proj#]. We should therefore proceed with the decomposition represented in Figure 4-6 in order to achieve 5NF. Finally, we can refine the solution by observing the following strategies: assigning meaningful relation names; treating suppliers, projects, and items as entities instead of mere attributes; and observing the principle of unique attribute names. Accordingly, a revised list of relations is provided below:
  • Supplier{Suppl#, SupplName, . . .} PK [Suppl#]

  • InventoryItem{Item#, ItemName, . . .} PK [Item#]

  • Project{Proj#, ProjName, . . .} PK [Proj#]

  • SuppItemMap{SI_Supp#, SI_Item#, SI_Ref#} PK [SI_Ref#]

  • ItemProjMap{IP_Item#, IP_Proj#, IP_Ref#} PK[IP_Ref#]

  • ProjSuppMap{PS_Supp#, PS_Proj#, PS_Ref#} PK [PS_Ref#]

As for the previous cases (for instance sections 4.9.2 and 4.9.3), you should be able to easily identify the foreign keys; you should also be able to construct a detailed RAL for the case (review Figure  3-17 of Chapter  3). Also note that for relations SuppItemMap, ItemProjMap, and ProjSuppMap, a surrogate primary key has been introduced in each case as an alternative to using a composite candidate key consisting of the foreign keys.


For most practical purposes, you only have to worry about 5NF if you are trying to implement an M:M relationship involving more than two relations. Once in 5NF, further decompositions would share candidate keys and are therefore to no avail (recall the corollary of Heath's theorem). Notwithstanding this, other normal forms have been proposed, as will be discussed in section 4.12.

4.11 An Example

Now consider the case where it is desirable to keep track of students enrolled in various academic programs and courses in a college environment. Assume further that an academic program comprises various courses, and that multiple students may enroll in any given program; also, students may enroll in various courses each college term (semester or quarter). At minimum, it would be necessary to store for each program its code and name; similarly for each course (typically, it would be desirable to store additional information about each, but let’s keep the illustration simple for the time being). It would be desirable to define several attributes for each student record. However, for the purpose of this illustration, let’s keep these to the minimum, namely student number, name, and grade point average (GPA, which is normally calculated from other data, but we will sidestep that complication for this discussion). Let’s attempt to derive a set of normalized relations from this scenario.

One way to start the normalization process is to begin with the proposal of a 1NF relation, followed by identification of FDs and subsequent NLDs based on those FDs; this is the classical approach. As you become more experienced, you will be able to employ more expeditious pragmatic approaches (some of which are covered in Chapter  5), but for now, let’s take it slowly.

Step 1: Let’s propose an initial 1NF relation, InitialProgram, with four FDs from the narrative, as shown:
  • InitialProgram{Pgm#, PgmName, Crs#, CrsName, Stud#, StudFName, StudLName, StudGPA, Term#} PK[Pgm#, Crs#, Stud#]

  • FD0: [Pgm#, Crs#, Stud#] → PgmName, CrsName, StudFName, StudLName, StudGPA, Term#

  • FD1: Pgm# → PgmName

  • FD2: Crs# → CrsName

  • FD3: Stud# → StudFName, StudLName, StudGPA

Step 2: The second step is to derive 2NF relations from step 1. To do so, you must recall the definition of 2NF, and then examine the stated FDs to see if there is a violation of 2NF. Recall that 2NF requires full functional dependence on the primary key. Observe that the presence of FD1, FD2, and FD3 renders InitialProgram to be in 1NF only, and in clear violation of 2NF. To resolve this dilemma, you must decompose via Heath’s theorem to obtain the following relations:
  • Program{Pgm#, PgmName} PK[Pgm#]

  • Course{Crs#, CrsName} PK[Crs#]

  • Student {Stud#, StudFName, StudLName, StudGPA} PK[Stud#]

  • ProgramStudCourseMap{Pgm#, Crs#, Stud#, Term#} PK[Pgm#, Crs#, Stud#]

Note the use of meaningful relation names to improve readability. Also observe that the relation ProgramStudCourseMap is what is left of the original relation InitialProgram after it has been decomposed.
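As a quick sanity check on step 2, the sketch below (hypothetical sample rows) projects InitialProgram into the four relations and joins them back, recovering the original tuples:

```python
# Hypothetical rows for the 1NF relation InitialProgram
InitialProgram = [
    {"Pgm#": "P1", "PgmName": "CS", "Crs#": "C1", "CrsName": "Databases",
     "Stud#": "S1", "StudFName": "Ann", "StudLName": "Lee",
     "StudGPA": 3.5, "Term#": "T1"},
    {"Pgm#": "P1", "PgmName": "CS", "Crs#": "C2", "CrsName": "Networks",
     "Stud#": "S1", "StudFName": "Ann", "StudLName": "Lee",
     "StudGPA": 3.5, "Term#": "T1"},
]

def project(rel, attrs):
    """Projection with duplicate elimination."""
    seen, out = set(), []
    for t in rel:
        row = tuple((a, t[a]) for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(row))
    return out

def natural_join(r, s):
    """Natural join on the attributes the two relations share."""
    common = set(r[0]) & set(s[0])
    return [{**a, **b} for a in r for b in s
            if all(a[c] == b[c] for c in common)]

# The step-2 decomposition per Heath's theorem
Program = project(InitialProgram, ["Pgm#", "PgmName"])
Course = project(InitialProgram, ["Crs#", "CrsName"])
Student = project(InitialProgram, ["Stud#", "StudFName", "StudLName", "StudGPA"])
Map = project(InitialProgram, ["Pgm#", "Crs#", "Stud#", "Term#"])

# Joining all four projections reproduces InitialProgram (non-loss)
rebuilt = natural_join(natural_join(natural_join(Map, Program), Course), Student)
```

Observe also how the duplication of PgmName and the student attributes in InitialProgram has been squeezed out: each appears exactly once in its own relation.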

Step 3: The third step is to obtain 3NF relations from step 2. To do so, you must examine the relations derived from step 2 to see if there is any violation of 3NF. Remember, 3NF requires mutual independence among non-key attributes. A quick check shows that there is no violation, so you can assert with confidence that for this particular scenario, the relations of step 2 are also in 3NF.

Step 4: This step requires that you obtain BCNF relations. Recall that BCNF requires that every determinant serve as a candidate key within the relation in which it appears. A quick eye inspection of the relations from step 2 reveals no violation of this principle, so you can assert that for this particular case, the relations of step 2 are also in BCNF.

Step 5: This step requires obtaining higher normal forms where necessary. In most cases, you would stop the normalization process after attaining BCNF. However, this particular example reveals a peculiar case of data duplication that needs to be resolved. Take a close look at the proposal for relation ProgramStudCourseMap. It contains three attributes, each of which forms part of the primary key; thus, the relation is in BCNF. Why is it necessary to key on all three attributes? It is necessary because it is desirable to represent all program-student-course combinations at the college where this database is implemented. As in the previous section, we have a case here where if there are P programs, C courses, and S students, the total number of combinations possible will be P*C*S. However, it is highly unlikely and illogical that the database would require all these possibilities. The key point is to observe that there is a JD among Program, Course, and Student; that is what is reflected in relation ProgramStudCourseMap. Storing the relation as proposed would not be efficient: it should be possible to easily obtain program/course combinations, program/student combinations, and student/course combinations without having to read triplets as suggested by ProgramStudCourseMap. We therefore need to apply Fagin’s theorem to resolve the JD into the following projections:
  • ProgramStructure{Pgm#, Crs#, CrsSeqn} PK[Pgm#, Crs#] or PK[Pgm#, CrsSeqn]

  • ProgramStudentMap{Pgm#, Stud#} PK[Pgm#, Stud#]

  • StudentSchedule{Stud#, Crs#, Term#, …} PK[Stud#, Crs#, Term#]

Again note the use of meaningful relation names. As you view these proposed relations, here are a few additional points worth noting:
  1. These three relations would replace relation ProgramStudCourseMap. Notice that each new relation implements a binary relationship. Thus, a ternary relationship (represented by ProgramStudCourseMap) has been replaced by three binary relationships (represented by relations ProgramStructure, ProgramStudentMap, and StudentSchedule).

  2. In the ProgramStructure relation, the attribute CrsSeqn has been introduced to facilitate sequencing of courses in a particular major. This is optional; technically speaking, it is not part of the normalization process. The attribute has been introduced merely for convenience.

  3. In the StudentSchedule relation, the attribute Term# has been promoted to be part of the primary key; this is done for increased flexibility. For instance, a student may repeat a course in order to obtain a more favorable grade.

Final Step: The final step is to list the normalized relations that will form the conceptual schema of the database. Sticking with the convention of using unique attribute-names, the finalized relations for this case are as follows:
  • Program{Pgm#, PgmName, . . .} PK[Pgm#]

  • Course{Crs#, CrsName, . . .} PK[Crs#]

  • Student {Stud#, StudFName, StudLName, StudGPA, . . .} PK[Stud#]

  • ProgramStructure{PS_Pgm#, PS_Crs#, PS_CrsSeqn} PK[PS_Pgm#, PS_Crs#] or PK[PS_Pgm#, PS_CrsSeqn]

  • ProgramStudentMap{SP_Pgm#, SP_Stud#} PK[SP_Pgm#, SP_Stud#]

  • StudentSchedule{SS_Stud#, SS_Crs#, SS_Term#, …} PK[SS_Stud#,SS_ Crs#, SS_Term#]

As for the previous cases (sections 4.9.2, 4.9.3, and 4.10.2), you should be able to easily identify the foreign keys in the latter three relations; you should also be able to construct a detailed RAL for the case (review Figure  3-17 of Chapter  3). Finally, note that this case covers only a small section of a college database. It is not meant to provide comprehensive coverage of such a problem domain; rather, it is meant to help you develop mastery of the normalization process.

4.12 Other Normal Forms

The field of database systems is a potentially contentious one. Indeed, there are accounts of former friends or colleagues becoming foes over database quibbles. Various individuals have proposed several database theorems and methodologies, but not all have gained the universal acceptance enjoyed by the normal forms of the previous sections. Two additional normal forms that have been, and will no doubt continue to be, the subject of debate are the domain-key normal form (DKNF) and the sixth normal form (6NF). Without taking sides in the debate on these two normal forms, this section will summarize each.

4.12.1 The Domain-Key Normal Form

The domain-key normal form (DKNF) was proposed by Ronald Fagin in 1981. Unlike the other normal forms, which all relate to FDs, MVDs, and JDs, this normal form is defined in terms of domains and keys (hence its name). In his paper, Fagin showed that a relation in DKNF has no modification anomalies, and that a relation without modification anomalies must be in DKNF. He therefore argued that a relation in DKNF needs no further normalization (at least, not for the purpose of reducing modification anomalies). The definition of DKNF is as follows:
  • A relation is in DKNF if every constraint on the relation is a logical consequence of the definition of its keys and domains.

This loaded definition contains three important terms that need clarification; they are constraint, key, and domain:
  • A constraint is used to mean any rule relating to static values of attributes. Constraints therefore include integrity rules, editing rules, foreign keys, intra-relation references, FDs, and MVDs, but exclude time-dependent constraints, cardinality constraints, and constraints relating to changes in data values.

  • A key is a unique identifier of a row (as defined in Chapter  3).

  • A domain is a pool of legal attribute values (as defined in Chapter  3).

The implication of the DKNF is clear: if we have a relation that contains constraint(s) that is (are) not a logical consequence of its (candidate) key and domains, then that relation is not in DKNF, and should therefore be further normalized. The DKNF as proposed by Fagin therefore represents an ideal situation to strive for.

Unfortunately, a number of problems have arisen from consideration of DKNF and they are summarized below:
  • Any constraint that restricts the cardinality of a relation (such as the number of tuples in the relation) will render it in violation of DKNF. (It was perhaps for this reason that Fagin excluded time-dependent constraints and constraints relating to changes in data values from his definition of constraints.) However, there are many relations for which such constraints are required.

  • There is no known algorithm for converting a relation to DKNF. The conversion is intuitive and for this reason described as artistic rather than scientific.

  • Not all relations can be reduced to DKNF (relations with cardinality constraints fall in this category).

  • It is not precisely clear when a relation can be reduced to DKNF.

For these reasons, the DKNF has been compared by Date (see [Date, 2006]) to a “straw man… of some considerable theoretical interest but not yet of much practical ditto.”

4.12.2 The Sixth Normal Form

A sixth normal form (6NF) has been proposed by C. J. Date in [Date, 2003], after several years of exploration, expounding, and research in the field of database systems. It relates to so-called temporal databases. Date wrote a whole book on the subject; a summary of its essence is presented in this subsection. Date defines a temporal database as a database that contains historical data as well as, or instead of, current data. Temporal databases are often associated with read-only or update-once databases, but they could be used otherwise. In this sense, a temporal database may be considered a precursor to a data warehouse (discussed in Chapter  19).

For the purpose of illustration, assume that we are in a college or university setting and desire to store the relation Course as defined below:

Course{CourseNo, CourseName, CourseCred}

Suppose further that we desire to show different courses at the time they existed in the database. To make our analysis more realistic, let’s also make the following additional assumptions:
  • The primary key is CourseNo; for any given course, the attribute CourseNo cannot be changed.

  • For any given course, the attribute CourseName may be changed at any point in time.

  • For any given course, the attribute CourseCred may be changed at any point in time.

We may be tempted to introduce a timestamp on each tuple, and therefore modify the definition of Course as follows:

Course{CourseNo, CourseName, CourseCred, EffectiveDate}

Figure 4-7 provides some sample data for the Course relation. By introducing the attribute EffectiveDate, we have actually introduced a new set of concerns as summarized below:
  1. If we assume that the FD CourseNo → {CourseName, CourseCred, EffectiveDate} is (still) applicable, then Course is in 5NF. However, in this case, if the CourseName or CourseCred of a given Course tuple changes at a given effective date, there is no way of showing what it was before, unless we create a new course and assign a new CourseNo. Either way, this is clearly undesirable.

  2. Assume, for the moment, the presence of the FDs CourseNo → CourseName and [CourseNo, EffectiveDate] → CourseCred. Then the relation would be in violation of 2NF, and would therefore need to be decomposed into two decompositions:

     CourseDef {CourseNo, CourseName} PK [CourseNo]

     CourseTimeStamp {CourseNo, EffectiveDate, CourseCred} PK [CourseNo, EffectiveDate]

     Both of these relations would now be in 5NF. However, if we now desire to change the CourseName of a course for a given effective date, we would not be able to represent this in the current schema.

  3. We could introduce a surrogate (say CourseRef) into relation Course, and key on the surrogate, while ignoring the FDs stated in (1) and (2) above. In this case, Course would be in violation of 3NF, and if we attempted to decompose, we would revert to the situation in case (2) above.

Figure 4-7.

Sample data for the course relation

The reason for these problems can be explained as follows: the relation Course as described defines the following predicate:
  • Each course is to be accounted for (we say Course is under contract).

  • Each course has a CourseName which is under contract.

  • Each course has a CourseCred which is under contract.

The predicate involves three distinct propositions. We are attempting to use the timestamp attribute (EffectiveDate) to represent more than one proposition about the attribute values. This, according to Date, is undesirable and violates the sixth normal form.

Here now is Date’s theorem for the sixth normal form (6NF):
  • A relation R is in 6NF iff it satisfies no non-trivial JDs at all. (A JD is trivial iff at least one of its projections is over all of the attributes of the relation.)

  • Put another way, a relation R is in 6NF iff the only JDs that it satisfies are trivial ones.

Observe that 6NF as defined essentially refines 5NF. It is therefore obvious from the definition that a relation in 6NF is necessarily in 5NF also.

Let’s now revisit the Course relation. With the introduction of the timestamp attribute (EffectiveDate), and given the requirements of the relation, there is a non-trivial JD that leads to the following projections:
  • CourseInTime {CourseNo, EffectiveDate} PK [CourseNo, EffectiveDate]

  • CourseNameInTime {CourseNo, CourseName, EffectiveDate}     PK [CourseNo, EffectiveDate]

  • CourseCredInTime {CourseNo, CourseCred, EffectiveDate} PK [CourseNo, EffectiveDate]

Observe that the projection CourseInTime is, strictly speaking, redundant since it can be obtained by a projection from either CourseNameInTime or CourseCredInTime. However, in the interest of clarity and completeness, it has been included.
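The practical effect of these 6NF projections can be sketched as follows: each timestamped proposition carries its own history, so a course name can change on a date when the credits do not. The sample data and the as_of helper below are illustrative assumptions, not part of Date’s formulation:

```python
# Hypothetical timestamped course history (ISO dates compare lexicographically)
CourseNameInTime = [
    {"CourseNo": "M101", "CourseName": "Calculus I", "EffectiveDate": "2019-01-01"},
    {"CourseNo": "M101", "CourseName": "Calculus IA", "EffectiveDate": "2021-01-01"},
]
CourseCredInTime = [
    {"CourseNo": "M101", "CourseCred": 3, "EffectiveDate": "2019-01-01"},
]

def as_of(rel, course, date, attr):
    """Return the value of attr for a course as of a given date (the row with
    the latest effective date not exceeding it), or None if no row qualifies."""
    rows = [r for r in rel
            if r["CourseNo"] == course and r["EffectiveDate"] <= date]
    if not rows:
        return None
    return max(rows, key=lambda r: r["EffectiveDate"])[attr]
```

Because name and credit histories are stored separately, the 2021 renaming of the course is recorded without fabricating a change to its credits, which is exactly what the single timestamped Course relation could not express.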

This work by C. J. Date represents a significant contribution to the field of database systems, and will no doubt be a topical point of discussion in the future.

4.13 Summary and Concluding Remarks

This brings us near the conclusion of one of the most important topics in your database systems course. Take some time to go over the concepts. Figure 4-8 should help you to remember the salient points.
Figure 4-8. Summary of normalization

Traditionally, it has been widely accepted that attainment of 3NF is acceptable for most databases. This course recommends attaining at least BCNF, and that database designers strive for 5NF at every reasonable and feasible opportunity. Recall (from section 4.8) that BCNF is really a refinement of 3NF, and that the normalization process can bypass 2NF and 3NF and go straight to BCNF. In some circumstances, it may even be best to proceed all the way to 6NF.

Despite all the virtues of normalization, the methodology has its limitations. Here are three main ones:
  • The normalization process is quite tedious and challenging for inexperienced database designers. Because of this, the occurrence of inappropriately designed databases is quite high.

  • As the size of the database increases over time, the performance on queries involving logical joins tends to deteriorate. However, with the significant advances in processing power (Moore’s Law), this is not as critical a concern as it was historically.

  • In scenarios such as data warehousing (discussed in Chapter  19), where data volume is very high and accessibility is a primary objective, normalization is not always as desirable as it is for operational databases.

Normalization is a technique that must be mastered by database designers. It improves with practice and experience, and ultimately becomes almost intuitive. As your mastery of normalization improves, you will find a corresponding improvement in your ability to design and/or propose database systems for various software systems. Be aware that the converse is also true: failure to master the fundamental principles of database design will significantly impair one’s ability to design quality software systems involving underlying databases. Notwithstanding, be careful not to become antagonistic in defending your views, as informed as they may be. Database feuds are common, and passions can sometimes run high.

If after reading and reviewing this chapter, you arrive at the conclusion that normalization is integral to sound database design, you would be correct. Indeed it is! With time and practice, you should get quite good at applying the technique. The upcoming chapter builds on this topic as you learn about different approaches and methodologies related to database design.

4.14 Review Questions

  1. State the two fundamental integrity rules and explain their significance.

  2. What is a foreign key? Illustrate using an appropriate example. How should foreign keys be treated?

  3. Clarify the following: functional dependence; non-loss decomposition; Heath’s theorem. Provide examples that illustrate the significance of these terms.

  4. List the normal forms covered in this chapter. For each, state the rule, explain in your own words what it means, and provide an example of a relation that conforms to it and one that does not.

  5. What are the main advantages and challenges related to database normalization?

  6. Review the case presented in subsection 4.9.2 and propose a refined RAL for the CTT-Schedule problem.

  7. Review the case presented in subsection 4.9.3 and propose a refined RAL for the zoo problem.

  8. Review the case presented in section 4.11 and propose a refined RAL for the partial college database problem.

  9. Practice working with simple non-normalized database ideas and refining them to obtain normalized relations. Start off by identifying the main entity/entities, followed by the FDs; then apply the appropriate normalization rules.


4.15 References and/or Recommended Readings

[Codd 1972] Codd, Edgar F. 1972. “Further Normalization of the Database Relational Model.” Database Systems, Courant Computer Science Symposia Series 6. Englewood Cliffs, NJ: Prentice Hall.

[Codd 1974] Codd, Edgar F. 1974. “Recent Investigations into Relational Data Base Systems.” Proc. IFIP Congress, Stockholm, Sweden.

[Connolly 2015] Connolly, Thomas and Carolyn Begg. 2015. Database Systems: A Practical Approach to Design, Implementation, and Management 6th ed. Boston: Pearson. See Chapters  14 and  15.

[Coronel 2015] Coronel, Carlos and Steven Morris. 2015. Database Systems: Design, Implementation, and Management 11th ed. Boston: Cengage Learning. See Chapter  6.

[Date 2003] Date, Christopher J., Hugh Darwen and Nikos A. Lorentzos. 2003. Temporal Databases and the Relational Model. San Francisco, CA: Morgan-Kaufmann.

[Date 2004] Date, Christopher J. 2004. Introduction to Database Systems 8th ed. Menlo Park, California: Addison-Wesley. See Chapters 11-13, 23.

[Date 2006] Date, Christopher J. 2006. “Database Debunking.” Accessed June 2006.

[Elmasri 2011] Elmasri, Ramez and Shamkant B. Navathe. 2011. Fundamentals of Database Systems 6th ed. Boston: Pearson. See Chapters 15 and 16.

[Fagin 1977] Fagin, Ronald. 1977. “Multi-valued Dependencies and a New Normal Form for Relational Databases.” ACM TODS 2, No. 3, September 1977.

[Fagin 1979] Fagin, Ronald. 1979. “Normal Forms and Relational Database Operations.” Proc. 1979 ACM SIGMOD International Conference on Management of Data. Boston, MA, May-June 1979.

[Fagin 1981] Fagin, Ronald. 1981. “A Normal Form for Relational Databases that is Based on Domains and Keys.” ACM TODS 6, No. 3, September 1981.

[Garcia-Molina 2009] Garcia-Molina, Hector, Jeffrey Ullman and Jennifer Widom. 2009. Database Systems: The Complete Book 2nd ed. Boston: Pearson. See Chapter  3.

[Heath 1971] Heath, I. J. 1971. “Unacceptable File Operations in a Relational Database.” Proc. 1971 ACM SIGFIDET Workshop on Data Description, Access, and Control. San Diego, CA, November 1971.

[Hoffer 2013] Hoffer, Jeffrey A., Ramesh Venkataraman, and Heikki Topi. 2013. Modern Database Management 11th ed. Boston: Pearson. See Chapter  4.

[Kifer 2006] Kifer, Michael, Arthur Bernstein, and Phillip M. Lewis. 2006. Databases and Transaction Processing: An Application Oriented Approach 2nd ed. Boston: Pearson. See Chapter  6.

[Kroenke 2014] Kroenke, David M and David Auer. 2014. Database Processing: Fundamentals, Design, and Implementation 13th ed. Boston: Pearson. See Chapters  3 and  4.

[Pratt 2015] Pratt, Phillip J. and Mary Z. Last. 2015. Concepts of Database Management 8th ed. Boston: Cengage Learning. See Chapter  5.

[Silberschatz 2011] Silberschatz, Abraham, Henry F. Korth and S. Sudarshan. 2011. Database Systems Concepts 6th ed. Boston: McGraw-Hill. See Chapter  7.

[Ullman 2008] Ullman, Jeffrey D., and Jennifer Widom. 2008. A First Course in Database Systems 3rd ed. Boston: Pearson. See Chapters  3 and  4.

Copyright information

© Elvis C. Foster and Shripad Godbole 2016

Authors and Affiliations

  • Elvis C. Foster, Associate Professor of Computer Science, Keene State College, New Hampshire, USA
  • Shripad Godbole, Senior Manager of Database Administration, Merkle Inc., Marlborough, USA
