Abstract
In order to design high quality databases, you need to be cognizant of the fundamental integrity and normalization rules for the database. We will discuss these rules in this chapter. The sections in this chapter include.
Keywords
Normal Form Database System Functional Dependence Temporal Database Integrity Rule
Fundamental Integrity Rules

Foreign Key Concept

Rationale for Normalization

Functional Dependence and NonLoss Decomposition

First Normal Form

Second Normal Form

Third Normal Form

Boyce/Codd Normal Form

Fourth Normal Form

Fifth Normal Form

An Example

Other Normal Forms

Summary and Concluding Remarks
4.1 Fundamental Integrity Rules
Two fundamental integrity rules that the database designer must be cognizant of are the entity integrity rule and the referential integrity rule.

The rule applies to base relations.

The rule applies to the primary key, not alternate keys.

The primary key must be wholly nonnull.
Recall from the previous chapter that the primary key may be a single attribute or a composite key; either way, a null value is not allowed. If the primary key is a single attribute, it should not be null for any tuple in the relation. If the primary key is a composite key, then no attribute comprising this composite key should be null for any tuple in the relation. Also note that for composite primary keys, the correct attributes must be chosen. Choosing fewer attributes than required causes the primary key to lose the uniqueness quality required for distinguishing tuples; choosing more attributes than required makes the primary key redundant and increases the database overhead.

The rule requires that foreign keys must match primary keys, not alternate keys.

The foreign key and referential integrity are defined in terms of each other. It is not possible to explain one without mentioning the other.
4.2 Foreign Key Concept

Attribute FK of base relation R2 is a foreign key if and only if (denoted iff from this point) it satisfies the following conditions:
 a.
Each value of FK is either wholly null or wholly nonnull.
 b.
There exists a base relation, R1 with primary key PK such that each nonnull value of FK is identical to the value of PK in some tuple of R1.
We will use the notation R1 → R2 to mean, the relation R1 references the relation R2. In this case, R1 is the referencing (primary) relation and R2 is the referenced relation. Since R1 is the referencing relation, it contains a foreign key. We will also use the notion R1{A, B, C, …} to mean, the relation R1 contains attributes A, B, C, and so on. Where specific examples are given, the relation name will be highlighted or placed in upper case; attribute names of specific examples will not be highlighted when stated with the related relation; however, they will be highlighted when reference is made to them from the body of the text.
 1.
The foreign key and the referenced primary key must be defined on the same domain. However, the attribute names can be different (do not be surprised if you see widespread violation of this principle).
 2.
The foreign key need not be a component of the primary key for the host (holding) relation (in which case nulls may be accepted, but only with the understanding that they will be subsequently updated). To illustrate, in the following two relations, Employee references Department via a M:1 relationship; you could have an employee who is temporarily not assigned to a department.
Department {Dept#, DeptName, . . .} PK[Dept#]
Employee {Emp#, Emp_FName, Emp_LName, Emp_Dept#, . . .} PK[Emp#] FK[Emp_Dept#] references Department.Dept#
 3.
If relations Rn, R(n1), R(n2) .... R1 are such that Rn → R(n1) → R(n2) → .... R2 → R1, then the chain Rn to R1 forms a referential path.
 4.
In a referential path, a relation can be both referenced and referencing. Consider the referential path R3 → R2 → R1. In this case, R2 is both a referenced and a referencing relation.
 5.
The simplest possible referential path is a selfreferencing relation. To illustrate, consider a revised version of the aforementioned relation Employee where each employee tuple also includes a reference to the employee’s manager. In this case, we have a selfreferencing relation. Thus,
Employee {Emp#, Emp_FName, Emp_LName, Emp_Dept#, . . ., Emp_MgrEmp#} PK[Emp#] FK[Emp_Dept#] references Department.Dept# and FK [Emp_MgrEmp#] references Employee.Emp#
 6.
More generally, a referential cycle exists when there is a referential path from Rn to itself: Rn → R(n1) → .... R1 → Rn
 7.
From this simple discourse, you should note the important role of foreign keys. Foreign keys are said to be the "glue" that holds the database together. As you will later see, relations are joined based on foreign keys.

Restrict deletion to tuples that are not referenced.

Cascade deletion to all referencing tuples in referencing relations.

Allow the deletion but nullify all referencing foreign keys in referencing relations.
The nullification strategy is particularly irresponsible if applied to component relationships, as it could quite quickly plunge the integrity of the database into question, by introducing orphan records (i.e. records with unmatched references). However, it could be applied to some M:1 relationships that are not component relationships (more on this in Chapter 5). For instance, referring to the Employee and Department relations above, we could remove a particular department from the database, and then nullify the FK values for employees that belonged to that department; the assumption would be that these employees would be subsequently assigned to different departments.
Traditionally, DBMS suites implement the restriction strategy (and for good reasons). This is the most conservative and safest approach. If you desire to remove a referenced tuple from the database, you must first take appropriate action on all references to that referenced tuple, either by removing those referencing tuples or changing their references.
The cascading strategy has been surfacing in contemporary systems, as an optional feature. It must be used with much care, as it is potentially dangerous when used without discretion. This strategy would not be the most prudent for the Department and Employee relations above. You may want to remove a department without removing all Employee tuples referenced that department. However, as you will see later in the chapter, an invoice is a perfect example where the cascading strategy would be applicable. An invoice typically consists of several line items (for instance, purchasing multiple items from a specific organization). The invoice is typically broken up into a summary portion along with corresponding details. If you desire to remove the invoice summary from the database, the details should also be removed, as otherwise the database would wind up with orphan records.
4.3 Rationale for Normalization

Entity and Referential Integrity: The normalization process results in a set of relations that are logically structures such that each relation has its primary key or candidate keys identified. Foreign keys are strategically inserted into specific referencing relations to map back to related primary keys in the target (referenced) relations. The DBMS can then be allowed to enforce appropriate referential integrity constraints when the database is implemented.

Control of Redundancy: Through a process called nonloss decomposition , the normalization process is geared toward reducing unnecessary data duplication.

Logical Data Independence: Normalization leaves the logical structure of the database (i.e. the conceptual schema) in a state that easily facilitates the construction of logical integration and aggregation of data from various normalized relations, to meet the assorted needs of end users. These logical maneuvers do not necessarily mandate changes in the application programs that access the database; neither do changes in the application programs necessarily force structural changes in the database.

Avoidance of Modification Anomalies: As you will soon see, normalization resolves undesirable structure of data that causes data redundancy, which in turn causes various modification anomalies, as explained below.

Efficient Data Storage and Retrieval: As stated in the definition above, the primary objective of normalization is efficient storage and subsequent retrieval of data.

Data redundancy that leads to modification anomalies
 Modification anomalies which include

Insertion Anomaly: Data cannot be inserted when it is desirable to do so; one has to wait on some future data, due to organization of the data structure

Deletion Anomaly: Deletion of some undesirable aspect(s) of data necessarily means deletion of some other desirable aspect(s) of data

Update Anomaly: Update of some aspect(s) of data necessarily means update of other aspect(s) of data


Inefficient file manipulation; lack of ease of data access

Inhibition to the achievement of logical data independence

Compromise on the integrity of data

Pressure on programming effort to make up for the poor design
The normalization procedure involves decomposing relations into other relations of repeatedly higher normal forms. The process is reversible. Moreover, normalization must be conducted in a manner that ensures that there is no loss of information.
4.4 Functional Dependence and NonLoss Decomposition
Before discussion of the normal forms, we need to define and clarify two fundamental concepts: functional dependence and nonloss decomposition.
4.4.1 Functional Dependence
Given a relation, R{A, B, C, …}, then attribute B is functionally dependent on attribute A, written A → B (read as "A determines B") iff each value of A in R has precisely one Bvalue in R at any point in time. Attributes A and B may or may not be composite.
An alternate way to describe functional dependence (FD) is as follows: given a value of attribute A, one can deduce a value for attribute B since any two tuples that agree on A must necessarily agree on B.
To further clarify the concept of an FD, consider storing information about employees in an organization. Example 41 revisits the Employee relation of earlier discussion, showing how an FD may be defined on the attribute representing the employee’s identification number. In this example, the employee’s identification number (Emp#) is fulfilling the role of primary key.
Example 41. Illustrating the Concept of FD
Consider the relation Employee as outlined below:
Employee{Emp#, Emp_FName, Emp_LName, Emp_Dept#, … Emp_MgrEmp#}
The following FD holds
Emp# → Emp#, Emp_FName, Emp_LName, Emp_Dept#, Emp_MgrEmp#
What this means is that each employee is uniquely identified by his/her employee number. By knowing the employee number, all other attribute values for that employee can be deduced.
From the definition of primary key (which will be frequently abbreviated as PK from this point), all attributes of a relation are functionally dependent on the PK. This is precisely what is required; in fact, an attribute (or group of attributes) qualifies as a candidate key iff all other attributes of the entity are dependent on it.
We need to further refine the concept of FD by introducing two new terms, full functional dependence and partial functional dependence: let X and Y represent two distinct combinations of one or more attributes, and let X → Y. In such a scenario, Y is fully functionally dependent on X if it is functionally dependent on X and not functionally dependent on any proper subset of X. Any dependence on a subset of X is said to be a partial FD.
To illustrate full and partial FDs, consider the relation R(A, B, C, D,…} with PK[A,B]. Then [A, B] → C, D represents a full FD provided that neither A alone nor B alone determines C, D. However, any of the following FDs would represent a partial FD: A → C, A → D, B → C, or B → D.
 1.
FD constraints have similarities with referential constraints, except that here, reference is internal to the relation.
 2.
FDs help us to determine primary keys.
 3.
Each FD defines a determinant in a relation: the attribute(s) on the right are dependent on the attribute(s) on the left; the attribute(s) on the left constitute(s) a determinant.
Next are the trivial functional dependencies. An FD is trivial if the attribute(s) on the right form(s) a subset of the attribute(s) on the left. For instance, in relation R{A, B, C, D, …} with PK [A, B], the following FDs are trivial: [A, B] → A; [A, B] → B; and [A, B] → A, B.
4.4.2 NonLoss Decomposition

R0{Suppl#, SupplName, Item#, ItemName, Quantity, SupplStatus, Location} PK [Suppl#, Item#]

[Suppl#, Item#] → {Quantity, SupplName, SupplStatus, Location, ItemName}

Suppl# → {SupplName, SupplStatus, Location}

Item# → ItemName

R1{Suppl#, SupplName, Location, SupplStatus} PK[Suppl#]

R2{Item#, ItemName} PK [Item#]

R3{Suppl#, Item#, Quantity} PK[Suppl#, Item#]
In this scenario, R1, R2, and R3 would constitute an example of a nonloss decomposition (NLD) of R0. What this means is that we can effectively defer storing R0 as originally proposed, and instead store R1, R2, and R2. Moreover, in so doing, we would avoid the problems related to storing R0 in its original form, but still be able to reconstruct it by combining the three replacement relations.

If R is a relation and P1, P2, … Pn are projections on R such that

P1 JOIN P2 JOIN … JOIN Pn = R,

then P1, P2, … Pn constitutes a nonloss decomposition of R.
Notice that we have glided into two new terms, projection and join. These will be formally treated later in the course. Suffice it for now to say that a projection on a relation is an extraction (into a new relation) of some attributes of the relation. A join requires at least two relations and may be construed as the opposite of a projection: if you can project R into P1 and P2, then you may be able to join P1 and P2 to yield R.
 1.
How do we find nonloss decompositions?
 2.
When should we replace a relation by a nonloss decomposition?
 3.
What are the advantages?

If we have a relation R{A, B, C, …} and if A → B and B → C, then projections P1{A, B} and P2{B, C} constitute a nonloss decomposition of R.
Observe
Heath’s theorem effectively addresses the questions of when to conduct an NLD and how to do so. Each of the letters [A, B, and C] in the theorem represents a single attribute or a combination of attributes. Example 42 provides a proof of the theorem.
Example 42. Proof of Heath’s Theorem
We wish to show that for R0 {A, B, C}, if A → B and B → C, then R0 = P1 {A, B} JOIN P2 {B, C}.
Let P1 {A, B} and P2 {B, C} be projections of R0 {A, B, C}. Assume further that A, B, C are single attributes.
Suppose that (a, b, c) is a tuple in R0. Then (a, b) is in P1 and (b, c) is in P2.
So (a, b, c) is in P1 JOIN P2 ................ (1)
Suppose that (a, b, c) is in P1 JOIN P2. Then (a, b, c1) is in R0 for some value c1 and (a1, b, c) is in R0 for some value a1. But B → C therefore b → c so that c1 must be c.
Therefore (a, b, c) is in R0 .................. (2)
We have shown that any tuple (a, b, c) that is in R0 is also in P1 JOIN P2, and that any tuple (a, b, c) that is in P1 JOIN P2 is in R0. Therefore R0 = P1 JOIN P2.
4.4.2.1 Corollary from Heath's Theorem

If P1, P2, … Pn is a nonloss decomposition of R and relations P1, P2, … Pn all share a candidate key, then there is no reduction in data duplication.
4.4.2.2 Conclusion from Heath's Theorem and Its Corollary

Decompose only when there is nonloss decomposition such that the resulting relations do not share a candidate key.

Do not decompose if each resulting relation does not have a candidate key.
Example 43 illustrates the importance of the abovementioned corollary. The example shows two flawed decompositions of the relation Student {Stud#, StudName, StudGPA, StudDept} PK [Stud#].
Example 43. Illustrating Usefulness of Heath’s Theorem Corollary
Consider the relation Student{Stud#, StudName, StudGPA, StudDept#} PK[Stud#].
Now consider projections S1{Stud#, StudName} PK[Stud#] and S2{Stud#, StudGPA, StudDept#} PK [Stud#].
S1 and S2 would not represent a useful pair of decompositions because they both share the same primary; such decompositions would not curtail duplication but worsen it by storing Stud# in two different relations.
Also, decomposition of Student into S3{Stud#, StudName} and S4{StudGPA, StudDept#} makes no sense.
With this established background, let’s now proceed to discussing the normal forms. Starting with the first, a section is dedicated to each normal form.
4.5 First Normal Form

A relation is in the first normal form (1NF) iff all its attributes are atomic (i.e., each attribute is defined on a single domain), and each record in the relation is characterized by a unique primary key value.
In defining atomic attributes, Codd was deliberate about eliminating repeating groups (the scenario where a column (attribute) could store an array or list of possible values); this was prevalent in CODASYL and COBOL. Atomicity means that each attribute is defined on a single domain and stores a single value for any tuple in the relation. Moreover, implicit in this definition is the enforcement of the entity integrity constraint (section 4.1). As such, tuples that have null values in their primary key will not be allowed to exist; neither will there be tuples with duplicate primary key values in the relation. (These are problems that persisted and perhaps still do outside of the context of 1NF.)
By this definition, all relations are in 1NF. This is by no means coincidental, but by design: we define a relation to consist of atomic attributes, and subject to the entity integrity constraint and the referential integrity constraint. However, as you will soon see, having relations in 1NF only is often not good enough.
Example 44 provides an example of a relation that is in 1NF but is poorly designed. It may surprise you to learn that this problem was once very widespread in accounting software systems. As an exercise, try proposing a more efficient design for the relation described in the example.
Example 44. Example of a Poorly Designed 1NF Relation
Consider the relation EndOfMonth {Acct#, Dept#, Bal1, Bal2, … Bal13} PK [Acct#, Dept#]
 1.
This relation is intended to store monthly balances on various accounts in an organization.
 2.
Bal1 … Bal13 are defined on the same domain and therefore constitute a vast amount of space wasting (this scenario is often incorrectly described as a repeating group, but is different from the original meaning of the term).
 3.
The only time that Bal1 … Bal13 are all nonnull is after Bal13 is calculated; this is normally done at the end of the accounting year.
 4.
At the end of each accounting period, this relation is cleared and reinitialized for the next accounting period.
4.5.1 Problems with Relations in 1NF Only

R0{Suppl#, SupplName, Item#, ItemName, Quantity, SupplStatus, Location} PK [Suppl#, Item#]

FD1: Suppl# → {SupplName, SupplStatus, Location}

FD2: Item# → ItemName

FD3: [Suppl#, Item#] → {Quantity, SupplName, SupplStatus, Location, ItemName}

Replication of Data: Every time we record a supplieritem pair, we also have to record the supplier name and item name.

Insertion Anomaly: We cannot insert a new item until it is supplied; neither can we insert a new supplier until that supplier supplies some item.

Deletion Anomaly: We cannot delete an item or a supplier without destroying an entire shipment, as well as information about a supplier's location.

Update Anomaly: If we desire to update a supplier's location or item name, we have to update several records (in fact, an entire shipment) due to the duplication problem.
Insertion, deletion, and update anomalies constitute modification anomalies, caused by duplication of data due to improper database design. As stated earlier, these problems are exacerbated as more data are added, thus leading to other problems of data access and database performance — a tangled hellish web.
4.6 Second Normal Form

A relation is in the second normal form (2NF) iff it is in 1NF and every nonkey attribute is fully functionally dependent on the primary key.

R1{Suppl#, SuppName, Location, SupplStatus} PK[Suppl#]

R2{Item#, ItemName} PK[Item#]

R3{Suppl#, Item#, Quantity} PK[Suppl#, Item#]
We then check to ensure that the resulting relations are in 2NF. Relation R1 has a single attribute as its primary key, and so does R2; there is therefore no possibility of either relation being in violation of 2NF. As for relation R3, there is only one nonkey attribute and it is dependent on the primary key. We may therefore conclude with confidence that all three relations (R1, R2, and R3) are in 2NF.
 1.
The problems with relations in 1NF only have been addressed.
 2.
By decomposing, we have introduced foreign keys in relation R3.
 3.
JOINing is the opposite of PROJecting. We can rebuild relation R0 by simply JOINing R3 with R1 and R3 with R2, on the respective foreign keys.
 4.
From the definition of 2NF, two observations should be obvious. Firstly, if you have a relation with a single attribute as the primary key, it is automatically in 2NF. Secondly, if you have a relation with n attributes and n1 of them form the primary key, the relation may very well be in 2NF but you must first verify this.
4.6.1 Problems with Relations in 2NF Only

Insertion Anomaly: We cannot record information about a location until we have at least one supplier from that location.

Deletion Anomaly: We cannot delete a particular location without also deleting supplier(s) from that location.

Update Anomaly: If we wish to update information on a location, we have to update all supplier records from that location.
These problems can be addressed if we take the necessary steps to bring R1 into the third normal form (3NF). But first, we must define 3NF.
4.7 Third Normal Form

A relation is in the third normal form (3NF) iff it is in 2NF and no nonkey attribute is fully functionally dependent on other nonkey attribute(s).

Put another way, a relation is in 3NF iff nonkey attributes are mutually independent and fully functionally dependent on the primary key. (Two or more attributes are mutually independent if none of them is functionally dependent on any combination of the others.)

Put another way, a relation is in 3NF iff it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key. (Nontransitivity implies mutual independence.)
Transitive dependence refers to dependence among nonkey attributes. In particular, if A → B and B → C, then C is transitively dependent on A (i.e., A → C transitively).

R1{Suppl#, SupplName, LocationCode, LocationName, SupplStatus} PK[Suppl#]

R4{Suppl#, SupplName, LocationCode} PK[Suppl#]

R5{LocationCode, LocationName} PK[LocationCode]
 1.
The problems with relations in 2NF only have been addressed.
 2.
Again, by decomposing, we have introduced a foreign key in relation R4.
 3.
We can obtain the information that relation R1 represented by simply JOINing R4 with R5 on the foreign key.
 4.
From the definition of 3NF, it should be obvious that if you have a 2NF relation with one candidate key and n mutually independent nonkey attributes, or only one nonkey attribute, that relation is in 3NF.
4.7.1 Problems with Relations in 3NF Only

There are multiple composite candidate keys in a relation.

The candidate keys overlap (they have at least one attribute in common).
For these situations, the BoyceCodd normal form (BCNF) provides the perfect solution. As you shall soon see, the BCNF is really a refinement of 3NF. In fact, where the abovementioned conditions do not hold, BCNF reduces to 3NF.
4.8 BoyceCodd Normal Form

A relation is in BCNF iff every determinant in the relation is a candidate key.
Drawing from subsection 4.4.1, a determinant is an attribute (or group of attributes) on which some other attribute(s) is (are) fully functionally dependent. Examination of R2, R3, R4, and R5 above will quickly reveal that they are in BCNF (hence 3NF). We therefore need to find a different example that illustrates the importance of BCNF.

R6{Zoo, Animal, Keeper}

[Zoo, Animal] → Keeper

Keeper → Zoo

R7{Animal, Keeper} PK[Animal]

R8{Keeper, Zoo} PK[Keeper]

R7{ZooID, ZooName, . . .} PK[ZooID]

R8{KeeperID, KeeperName, . . .ZooID} PK[KeeperID]

R9{AnimalCode, AnimalName, . . .KeeperID} PK[AnimalCode]
 1.
By achieving BCNF, we benefit from further reduction in data duplication and modification anomalies.
 2.
A further advantage is that we can now store dangling tuples. In our example, a keeper can be assigned to a zoo even before he/she is assigned an animal.
 3.
One drawback of this schema is that we cannot store an animal that has not been assigned a keeper or zoo. The schema can be further refined but that requires a discussion of the fourth normal form (4NF); this will be done in the upcoming section.
 4.
One possible drawback with BCNF is that more relations have to be accessed (joined) in order to obtain useful information. Again referring to the example, R9 must be joined with R8 in order to derive ZooAnimal pairs; and to obtain the name of the zoo (and no doubt other information), a second join must be made with R7. In light of the fact that logical views can facilitate a seamless end user perspectives that obscure these underlying joins (more on this in Chapter 13), and processing power continues on its upswing (according to Moore’s Law), the benefits of achieving BCNF tend to outweigh the drawbacks by a large margin.
Observe
The principle of BCNF is very simple but profound. By being guided by it, you can, in many circumstances, actually bypass obtaining 2NF and 3NF relations, and move directly into a set of BCNF relations. Adopting this approach will significantly simplify the analysis process. Moreover, in most practical situations, you will not be required to normalize beyond BCNF. This approach will be further clarified in the next chapter.
4.9 Fourth Normal Form
 a.
A course can be taught by several teachers.
 b.
A course can require any number of texts.
 c.
Teachers and texts are independent of each other; in other words, the same texts are used irrespective of who teaches the course.
 d.
A teacher can teach several courses.

R10{Course, Teacher, Text} PK[Course, Teacher, Text]
Since R10 is keyed on all its attributes, it is in BCNF. Yet, two potential problems are data redundancy and modification anomalies (the former leading to the latter). In our example, in order to record that Calculus II is taught by both Professor Cross and Professor Maitland, four records are required. In fact, if a course is taught by p professors and requires n texts, the number of records required to represent this situation is p*n. This is extraordinary, and could prove to be very demanding on storage space, especially as the number of records increases.
Relation R10, though in BCNF, is not in 4NF, because it has a peculiar dependency called a multivalued dependency (MVD). In order to state the 4NF, we must first define MVD.
4.9.1 MultiValued Dependency

Given a relation R(A, B, C), the MVD A » B (read “A multidetermines B”) holds iff every Bvalue matching a given (Avalue, Cvalue) pair in R depends only on the Avalue and is independent of the Cvalue.

Further, given R(A B C), A » B holds iff A » C also holds. MVDs always go together in pairs like this. We may therefore write A » B/C.
 1.
For MVD, at least three attributes must exist.
 2.
All FDs are MVDs but all MVDs are not necessarily FDs.
 3.
A » B reads “A multidetermines B” or “B is multidependent on A.”
Let’s get back to R10: Course » Text/Teacher. Note that Course is the pivot of the MVD. Course » Teacher since Teacher depends on Course, independent of Text. Course » Text since Text depends on Course, independent of Teacher. So how do we resolve this? Fagin’s theorem provides the answer.
4.9.2 Fagin’s Theorem

Relation R{A, B, C} can be nonloss decomposed into projections R1{A, B} and R2{A, C} iff the MVDs A » B/C both hold.

A relation is in 4NF iff whenever there exists an MVD, say A » B, then all attributes of R are also functionally dependent on A.

Put another way, R{A, B, C…} is in 4NF iff every MVD satisfied by R is implied by the candidate key of R.

Put another way, R{A, B, C…} is in 4NF iff the only dependencies are of the form [candidate key] → [other nonkey attribute(s)].

Put another way, R{A, B, C …} is in 4NF iff it is in BCNF and there are no MVDs (that are not FDs).

R11{Course, Text} PK[Course, Text]

R12{Course, Teacher} PK[Course, Teacher]
Note

Course{Crs#, CrsName, CrsCredits, . . .} PK[Crs#]

Teacher{TeacherEmp#, Emp_FName, Emp_LName, . . .} PK[TeacherEmp#]

Text{TextCode, TextTitle, . . .} PK[TextCode]

CourseTextMap{CT_Crs#, CT_TextCode} PK[CT_Crs#, CT_TextCode]

CourseTeacherMap{CR_Crs#, CR_TeacherEmp#} PK[CR_Crs#, CR_TeacherEmp#]
From this proposed solution, it should be obvious to you that the attributes of CourseTextMap and CourseTeacherMap are foreign keys. You could then construct a RAL (review the latter part of Chapter 3) that clearly outlines the database conceptual schema for the stated problem; this is left as an exercise for you.
4.9.3 The Zoo Revisited

Animal » Keeper/Zoo

Zoo{ZooID, ZooName, . . .} PK[ZooID]

Keeper{KeeperID, KeeperName, . . .ZooID} PK[KeeperID]

Animal{AnimalCode, AnimalName, . . .KeeperID} PK[AnimalCode]

Zoo{ZooID, ZooName, . . .} PK[ZooID]

Keeper{KeeperID, KeeperName, . . .KeeperZooID} PK[KeeperID]

Animal{AnimalCode, AnimalName, . . .} PK[AnimalCode]

AnimalKeeperMap{AK_AnimalCode, AK_KeeperID} PK [AK_AnimalCode, AK_KeeperID]

AnimalZooMap{AZ_AnimalCode, AZ_ZooID} PK [AZ_AnimalCode, AZ_ZooID]
As in the previous subsection (with the CTTSchedule problem), it should be obvious to you what the foreign keys are in entities Keeper, AnimalKeeperMap, and AnimalZooMap; construction of a detailed RAL for the stated problem is left as an exercise for you.
4.10 Fifth Normal Form
So far we have been treating relations that are decomposable into two other relations. In fact, there are relations that cannot be so decomposed, but can be decomposed into n other relations where n > 2. They are said to be ndecomposable relations (n > 2). The fifth normal form (5NF) is also commonly referred to as the projectionjoin normal form (PJNF) because it relates to these (n > 2) projections (of a relation not in 5NF) into decompositions that can be rejoined to yield the original relation.

SupplierSchedule{Suppl#, Item#, Proj#} PK[Suppl#, Item#, Proj#]
 1.
SupplierSchedule is keyed on all attributes and therefore by definition is in BCNF. By inspection, it is also in 4NF. There is no unresolved MVD as were the cases for the CTTSchedule problem and the zoo case of earlier discussions. Here, the attributes are dependent on each other: suppliers supply inventory items for various projects.
 2.
It is not possible to decompose this relation into two other relations without losing critical information.
 3.
If there are S suppliers, N items, and J projects, then theoretically, there may be up to S*N*J records. Not all of these may be valid: a supplier may supply specific item(s) for specific project(s); not every item may be applicable to every project; and a supplier does not necessarily support every project.
 4.
If we consider S suppliers, each supplying N items to J projects, then it does not take much imagination to see that a fair amount of duplication will take place, despite the fact that the relation is in 4NF.

If (s, i) is in SI and (i, p) is in IP and (p, s) is in PS, then (s, i, p) is in SupplierSchedule.
This is an example of a join dependency (JD) constraint. This constraint exists because of the assumption made at the outset. If the prevailing business rules dictate otherwise, then the constraint would likely be different.
4.10.1 Definition of Join Dependency

Relation R satisfies the JD P1, P2, … Pn iff R = P1 JOIN P2 JOIN … JOIN Pn where the attributes of P1 … Pn are subsets of the attributes of R.
Relations that are in 4NF, but not in 5NF (such as SupplierSchedule) suffer from duplication, which in turn leads to modification anomalies. These problems are directly related to the presence of the JD constraint(s) in such relations. Fagin’s theorem for 5NF relations provides the solution.
4.10.2 Fagin's Theorem

A relation R is in 5NF (also called PJNF) iff every JD in R is a consequence of the candidate keys of R.

In layman's terms, if a relation R is in 4NF and it contains a JD that renders it ndecomposable into P1, P2 … Pn, such that R = P1 JOIN P2 … JOIN Pn where n > 2, then such relation is not in 5NF. It may therefore be decomposed to achieve 5NF relations.

Put another way, a relation R is in 5NF iff it is in 4NF and it is not decomposable, except the decompositions are based on a candidate key of R, and the minimum number of projections is 3.

Supplier{Suppl#, SupplName, . . .} PK [Suppl#]

InventoryItem{Item#, ItemName, . . .} PK [Item#]

Project{Proj#, ProjName, . . .} PK [Proj#]

SuppItemMap{SI_Supp#, SI_Item#, SI_Ref#} PK [SI_Ref#]

ItemProjMap{IP_Item#, IP_Proj#, IP_Ref#} PK[IP_Ref#]

ProjSuppMap{PS_Supp#, PS_Proj#, PS_Ref#} PK [PS_Ref#]
As for the previous cases (for instance sections 4.9.2 and 4.9.3), you should be able to easily identify the foreign keys; you should also be able to construct a detailed RAL for the case (review Figure 317 of Chapter 3). Also note that for relations SuppItemMap, ItemProjMap, and ProjSuppMap, a surrogate primary key has been introduced in each case as an alternative to using a composite candidate key consisting of the foreign keys.
Note
For most practical purposes, you only have to worry about 5NF if you are trying to implement an M:M relationship involving more than two relations. Once in 5NF, further decompositions would share candidate keys and are therefore to no avail (recall the corollary of Heath's theorem). Notwithstanding this, other normal forms have been proposed, as will be discussed in section 4.12.
4.11 An Example
Now consider the case where it is desirous to keep track of students enrolled in various academic programs and courses in a college environment. Assume further that an academic program is comprised of various courses, and that multiple students may enroll in any given program; also, students may enroll in various courses each college term (semester or quarter). At minimum, it would be necessary to store for each program its code and name; similarly for each course (typically, it would be desirable to store additional information about each, but let’s keep the illustration simple for the time being). It would be desirable to define several attributes about each student record. However, for the purpose of this illustration, let’s keep these to the minimum, namely student number, name, and grade point average (GPA, which is normally calculated from other data but we will sidestep that complication for this discussion). Let’s attempt to derive a set of normalized relations from this scenario.
One way to start the normalization process is to begin with the proposal of a 1NF relation, followed by identification of FDs and subsequent NLDs based on those FDs; this is the classical approach. As you become more experienced, you will be able to employ more expeditious pragmatic approaches (some of which are covered in Chapter 5), but for now, let’s take it slowly.

InitialProgram{Pgm#, PgmName, Crs#, CrsName, Stud#, StudFName, StudLName, StudGPA, Term#} PK[Pgm#, Crs#, Stud#]

FD0: [Pgm#, Crs#, Stud#] → PgmName, CrsName, StudFName, StudLName, StudGPA, Term#

FD1: Pgm# → PgmName

FD2: Crs# → CrsName

FD3: Stud# → StudFName, StudLName, StudGPA

Program{Pgm#, PgmName} PK[Pgm#]

Course{Crs#, CrsName} PK[Crs#]

Student {Stud#, StudFName, StudLName, StudGPA} PK[Stud#]

ProgramStudCourseMap{Pgm#, Crs#, Stud#, Term#} PK[Pgm#, Crs#, Stud#]
Note the use of meaningful relation names to improve readability. Also observe that the relation ProgramStudCourseMap is what is left of the original relation InitialProgram after it has been decomposed.
Step 3: The third step is to obtain 3NF relations from step 2. To do so, you must examine the relations derived from step 2 to see if there is any violation of 3NF. Remember, 3NF requires mutual independence among nonkey attributes. A quick check shows that there is no violation, so you can assert with confidence that for this particular scenario, the relations of step 2 are also in 3NF.
Step 4: This step requires that you obtain BCNF relations. Recall that BCNF requires that every determinant must serve as a candidate within the relation that it appears. A quick eye inspection of the relations from step 2 reveals no violation of this principle, so you can assert that for this particular case, the relations of step 2 are also in BCNF.

ProgramStructure{Pgm#, Crs#, CrsSeqn} PK[Pgm#, Crs#] or PK[Pgm#, CrsSeqn]

ProgramStudentMap{Pgm#, Stud#} PK[Pgm#, Stud#]

StudentSchedule{Stud#, Crs#, Term#, …} PK[Stud#, Crs#, Term#]
 1.
These three relations would replace relation ProgramStudCourseMap. Notice that each new relation implements a binary relationship. Thus, a ternary relationship (represented by ProgramStudCourseMap) has been replaced by three binary relationships (represented by relations ProgramStructure, ProgramStudentMap, and StudentSchedule).
 2.
In the ProgramStructure relation, the attribute CrsSeqn has been introduced to facilitate sequencing of courses in a particular major. This is optional; technically speaking, it is not part of the normalization process. The attribute has been introduced merely for convenience.
 3.
In the StudentSchedule relation, the attribute Term# has been promoted to be part of the primary key; this is done for increased flexibility. For instance, a student may repeat a course in order to obtain a more favorable grade.

Program{Pgm#, PgmName, . . .} PK[Pgm#]

Course{Crs#, CrsName, . . .} PK[Crs#]

Student {Stud#, StudFName, StudLName, StudGPA, . . .} PK[Stud#]

ProgramStructure{PS_Pgm#, PS_Crs#, PS_CrsSeqn} PK[PS_Pgm#, PS_Crs#] or PK[PS_Pgm#, PS_CrsSeqn]

ProgramStudentMap{SP_Pgm#, SP_Stud#} PK[SP_Pgm#, SP_Stud#]

StudentSchedule{SS_Stud#, SS_Crs#, SS_Term#, …} PK[SS_Stud#,SS_ Crs#, SS_Term#]
As for the previous cases (sections 4.9.2, 4.9.3, and 4.10.2), you should be able to easily identify the foreign keys in the latter three relations; you should also be able to construct a detailed RAL for the case (review Figure 317 of Chapter 3). Finally, note that this case covers only a small section of a college database. It is not meant to provide comprehensive coverage of such a problem domain; rather, it is meant to help you develop mastery of the normalization process.
4.12 Other Normal Forms
The field of Database Systems is potentially a contemptuous one. Indeed, there are accounts of former friends or colleagues becoming foes over database quibbles. Various individuals have proposed several database theorems and methodologies, but they have not all gained universal acceptance as have the normal forms of the previous sections. Two additional normal forms that have been, and will no doubt continue to be, the subject of debate are the domainkey normal form (DKNF) and the sixth normal form (6NF). Without picking sides of the debate on these two normal forms, this section will summarize each.
4.12.1 The DomainKey Normal Form

A relation is in DKNF if every constraint on the relation is a logical consequence of the definition of its keys and domains.

A constraint is used to mean any rule relating to static values of attributes. Constraints therefore include integrity rules, editing rules, foreign keys, intrarelation references, FDs, and MVDs, but exclude timedependent constraints, cardinality constraints, and constraints relating to changes in data values.

A key is a unique identifier of a row (as defined in Chapter 3).

A domain is a pool of legal attribute values (as defined in Chapter 3).
The implication of the DKNF is clear: if we have a relation that contains constraint(s) that is (are) not a logical consequence of its (candidate) key and domains, then that relation is not in DKNF, and should therefore be further normalized. The DKNF as proposed by Fagin therefore represents an ideal situation to strive for.

Any constraint that restricts the cardinality of a relation (such as the number of tuples in the relation) will render it in violation of DKNF. (It was perhaps for this reason that Fagin excluded from his definition of constraints, timedependent constraints, or constraints relating to data values.) However, there are many relations for which such constraints are required.

There is no known algorithm for converting a relation to DKNF. The conversion is intuitive and for this reason described as artistic rather than scientific.

Not all relations can be reduced to DKNF (relations with cardinality constraints fall in this category).

It is not precisely clear as to when a relation can be reduced to DKNF.
For these reasons, the DKNF has been compared by Date (see [Date, 2006]) to a “straw man… of some considerable theoretical interest but not yet of much practical ditto.”
4.12.2 The Sixth Normal Form
A sixth normal form (6NF) has been proposed by C. J. Date in [Date, 2003], after several years of exploitation, expounding, and research in the field of database systems. It relates to socalled temporal databases. Date wrote a whole book on the subject; a summary of the essence is presented in this subsection. Date defines a temporal database as a database that contains historical data as well as or instead of current data. Temporal databases are often associated with readonly databases or updateonce databases, but they could be used otherwise. In this sense, a temporal database may be considered as a precursor to a data warehouse (discussed in Chapter 19).
For the purpose of illustration, assume that we are in a college or university setting and desire to store the relation Course as defined below:
Course{CourseNo, CourseName, CourseCred}

The primary key is CourseNo; for any given course, the attribute CourseNo cannot be changed.

For any given course, the attribute CourseName may be changed any point in time.

For any given course, the attribute CourseCred may be changed any point in time.
We may be tempted to introduce a timestamp on each tuple, and therefore modify the definition of Course as follows:
Course{CourseNo, CourseName, CourseCred, EffectiveDate}
 1.
If we assume that the FD CourseNo → {CourseName, CourseCred, EffectiveDate} is (still) applicable, then Course is in 5NF. However, in this case, if the CourseName or CourseCred of a given Course tuple changes at a given effective date, there is no way of showing what it was before, unless we create a new course and assign a new CourseNo. In either case, this clearly, is undesirable.
 2.
Assume for the moment, the presence of FDs CourseNo → CourseName and [CourseNo, EffectiveDate] → CourseCred. Then, the relation would be in violation of 2NF, and would therefore need to be decomposed into two decompositions:
CourseDef {CourseNo, CourseName} PK [CourseNo]
and
CourseTimeStamp {CourseNo, EffectiveDate, CourseCred} PK [CourseNo, EffectiveDate]
Both of these relations would now be in 5NF. However, if we now desire to change the CourseName of a course for a given effective date, we would not be able to represent this in the current schema.
 3.We could introduce a surrogate (say CourseRef) into relation Course, and key on the surrogate, while ignoring the FDs stated in (1) and (2) above. In this case, Course would be in violation of 3NF, and if we attempt to decompose, we would revert to the situation in case (2) above.

Each course is to be accounted for (we say Course is under contract).

Each course has a CourseName which is under contract.

Each course has a CourseCred which is under contract.
The predicate involves three distinct propositions. We are attempting to use the timestamp attribute (EffectiveDate) to represent more than one proposition about the attribute values. This, according to Date, is undesirable and violates the sixth normal form.

A relation R is in 6NF iff it satisfies no nontrivial JDs at all. (A JD is trivial iff at least one of its projections is over all of the attributes of the relation.)

Put another way, a relation R is in 6NF iff the only JDs that it satisfies are trivial ones.
Observe that 6NF as defined essentially refines 5NF. It is therefore obvious from the definition that a relation in 6NF is necessarily in 5NF also.

CourseInTime {CourseNo, EffectiveDate} PK [CourseNo, EffectiveDate]

CourseNameInTime {CourseNo, CourseName, EffectiveDate} PK [CourseNo, EffectiveDate]

CourseCredInTime {CourseNo, CourseCred, EffectiveDate} PK [CourseNo, EffectiveDate]
Observe that the projection CourseInTime is, strictly speaking, redundant since it can be obtained by a projection from either CourseNameInTime or CourseCredInTime. However, in the interest of clarity and completeness, it has been included.
This work by C. J. Date represents a significant contribution to the field of database systems, and will no doubt be a topical point of discussion in the future.
4.13 Summary and Concluding Remarks
Traditionally, it has been widely accepted that for most databases, attainment of 3NF is acceptable. This course recommends a minimum attainment of BCNF, but that database designers strive for 5NF at every reasonable and feasible opportunity. Recall that as stated earlier (in section 4.8), BCNF is really a refinement of 3NF, and the normalization process can bypass 2NF and 3NF, and go straight to BCNF. In some circumstances, it may even be best to proceed all the way to 6NF.

The normalization process is quite tedious and challenging for inexperienced database designers. Because of this, the occurrence of inappropriately designed databases is quite high.

As the size of the database increases over time, the performance on queries involving logical joins tends to deteriorate. However, with the significant advances in processing power (Moore’s Law), this is not as critical a concern as it was historically.

In scenarios such as data warehousing (discussed in Chapter 19), where data volume is very high and accessibility is a primary objective, normalization is not always as desirable as it is for operational databases.
Normalization is a technique that must be mastered by database designers. It improves with practice and experience, and ultimately becomes almost intuitive. As your mastery of normalization improves, you will find that there is a corresponding improvement in your ability to design and/or propose database systems for various software systems. However, be aware that the converse is also true: failure to master fundamental principles of database design will significantly impair one’s ability to design quality software systems involving underlying databases. Notwithstanding, be careful not to be antagonistic about your views, as informed as they may be. Database feuds are common and passions can sometimes run high.
If after reading and reviewing this chapter, you arrive at the conclusion that normalization is integral to sound database design, you would be correct. Indeed it is! With time and practice, you should get quite good at applying the technique. The upcoming chapter builds on this topic as you learn about different approaches and methodologies related to database design.
4.14 Review Questions
 1.
State the two fundamental integrity rules and explain their significance.
 2.
What is a foreign key? Illustrate using an appropriate example. How should foreign keys be treated?
 3.
Clarify the following: functional dependence; nonloss decomposition; Heath’s theorem. Provide examples that will illustrate the significance of these terms.
 4.
State the normal forms covered in this chapter. For each normal form, state it, explain in your own words what it means, and provide an example of a relation that conforms to it and one that does not.
 5.
What are the main advantages and challenges related to database normalization?
 6.
Review the case presented in subsection 4.9.2 and propose a refined RAL for the CTTSchedule problem.
 7.
Review the case presented in subsection 4.9.3 and propose a refined RAL for the zoo problem.
 8.
Review the case presented in section 4.11 and propose a refined RAL for the partial college database problem.
 9.
Practice working with simple nonnormalized database ideas and refining them to obtain normalized relations. Start off by identifying the main entity/entities, followed by FDs; then apply the appropriate normalization rules.
4.15 References and/or Recommended Readings
[Codd 1972] Codd, Edgar F. 1972. “Further Normalization of the Database Relational Model.” Database Systems, Courant Computer Science Symposia Series 6. Englewood Cliffs, NJ: Prentice Hall.
[Codd 1974] Codd. Edgar F. 1974. “Recent Investigations into Relational Data Base Systems.” Proc. IFIP Congress, Stockholm, Sweden.
[Connolly 2015] Connolly, Thomas and Carolyn Begg. 2015. Database Systems: A Practical Approach to Design, Implementation, and Management 6^{th} ed. Boston: Pearson. See Chapters 14 and 15.
[Coronel 2015] Coronel, Carlos and Steven Morris. 2015. Database Systems: Design, Implementation, and Management 11^{th} ed. Boston: Cengage Learning. See Chapter 6.
[Date 2003] Date, Christopher J., Hugh Darwen and Nikos A. Lorentzos. 2003. Temporal Databases and the Relational Model. San Francisco, CA: MorganKaufmann.
[Date 2004] Date, Christopher J. 2004. Introduction to Database Systems 8^{h} ed. Menlo Park, California: AddisonWesley. See Chapters 1113, 23.
[Date 2006] Date, Christopher J. 2006. “Database Debunking.” Accessed June 2006. www.dbdebunk.com/page/page/621935.htm .
[Elmasri, 2011] Elmasri, Ramez and Shamkant B. Navathe. 2011. Fundamentals of Database Systems 6^{th} ed. Boston: Pearson. See Chapters 15 and 16.
[Fagin 1977] Fagin, Ronald. 1977. “Multivalued Dependencies and a New Normal Form for Relational Databases.” ACM TODS 2, No. 3, September 1977.
[Fagin 1979] Fagin, Ronald. 1979. “Normal Forms and Relational Database Operations.” Proc. 1979 ACM SIGMOD International Conference on Management of Data. Boston, MA, MayJune 1979.
[Fagin 1981] Fagin, Ronald. 1981. “A Normal Form for Relational Databases that is Based on Domains and Keys.” ACM TODS 6, No. 3, September 1981.
[GarciaMolina 2009] GarciaMolina, Hector, Jeffrey Ullman and Jennifer Widom. 2009. Database Systems: The Complete Book 2^{nd} ed. Boston: Pearson. See Chapter 3.
[Heath 1971] Heath. I. J. 1971. “Unacceptable File Operations in a Relational Database.” Proc. 1971 ACM SIGFIDET Workshop on Data Description, Access, and Control. San Diego, CA, November 1971.
[Hoffer 2013] Hoffer, Jeffrey A., Ramesh Venkataraman, and Heikki Topi. 2013. Modern Database Management 11^{th} ed. Boston: Pearson. See Chapter 4.
[Kifer 2006] Kifer, Michael, Arthur Bernstein, and Phillip M. Lewis. 2006. Databases and Transaction Processing: An Application Oriented Approach 2^{nd} ed. Boston: Pearson. See Chapter 6.
[Kroenke 2014] Kroenke, David M and David Auer. 2014. Database Processing: Fundamentals, Design, and Implementation 13^{th} ed. Boston: Pearson. See Chapters 3 and 4.
[Pratt 2015] Pratt, Phillip J. and Mary Z. Last. 2015. Concepts of Database Management 8^{th} ed. Boston: Cengage Learning. See Chapter 5.
[Silberschatz 2011] Silberschatz, Abraham, Henry F. Korth and S. Sudarshan. 2011. Database Systems Concepts 6^{th} ed. Boston: McGrawHill. See Chapter 7.
[Ullman 2008] Ullman, Jeffrey D., and Jennifer Widom. 2008. A First Course in Database Systems 3^{rd} ed. Boston: Pearson. See Chapters 3 and 4.