Database design refers to constructing a suitable database schema for a specific application according to the characteristics of the database system, and establishing the database and the corresponding applications, so that the whole system can effectively collect, store, process and manage data to meet the usage requirements of various users.

This chapter introduces the relevant concepts, overall objectives and problems to be solved in database design, and details the stage-by-stage work of requirement analysis, conceptual design, logical design and physical design according to the New Orleans design methodology. Finally, the concrete means of implementing a database design are introduced through relevant cases.

Through this chapter, the reader will be able to describe the characteristics and uses of data models, enumerate the types of data models, describe the criteria of a third normal form (3NF) data model, describe the common concepts in the logical model, distinguish the corresponding concepts in the logical and physical models, and enumerate the common means of denormalization in the physical design process.

7.1 Database Design Overview

Database design refers to the construction of an optimized database logical model and physical structure for a given application environment, and the establishment of the database and its application system accordingly, so that it can effectively store and manage data to meet the application needs of various users. It is worth noting that there is no universal "optimal" standard for database design; different designs and optimizations are needed for different applications. OLTP and OLAP scenarios are very different, and the methods and optimization tools for database design differ accordingly.

Readers need to first understand the most common methods and techniques, and then apply them in different practical scenarios.

7.1.1 Difficulties of Database Design

In practical applications, database design encounters many difficulties, mainly the following.

  (1) Technical staff who are familiar with the database often lack business and industry knowledge.

    Database design needs to be flexibly adjusted for different applications, which requires the people involved to have a good understanding of the application scenarios and business background, while technical staff familiar with the database often lack such business and industry knowledge.

  (2) People who are familiar with the business often lack understanding of database products.

    Conversely, people familiar with the business knowledge and business processes often lack understanding of database products and are not familiar with the database design process. Therefore, during data model design the two parties need to communicate fully with each other in order to do a good job of database design.

  (3) The scope of the database system's business requirements cannot be fully clarified at the initial stage.

    In the initial stage of a project, the application business is not particularly clear and the users' requirements are not yet established. The database system is improved gradually along with the users' requirements, which is also a difficult point in database design.

  (4) User requirements are constantly adjusted and modified during the design process; even after the database model has landed, new requirements will appear and affect the existing database structure.

    Because of this uncertainty of requirements, database adjustments are frequent, which causes trouble for database design. Database design is therefore a spiral, forward-moving process that needs to be constantly adjusted, improved and optimized to better meet the needs of the application.

7.1.2 Goal of Database Design

Database design is the technology of establishing a database and its application system, and it is the core technology in the development and construction of information systems. The goal of database design is to provide an information infrastructure and an efficient operation environment for users and various application systems. "Efficient operation environment" means achieving high efficiency in data access, storage space utilization, and system operation and management. The goal of database design must set a time range and a target boundary; a design goal without restrictive conditions will fail because its scope is too large. Setting reasonable goals for a database system is very difficult: goals that are too large or too high will be unachievable, while goals that are too small will be unacceptable to the customer. Therefore, the goals should be planned reasonably in stages and levels so as to form sustainable solutions for the construction process, ultimately meeting the needs of the users and achieving the goals.

7.1.3 Methods of Database Design

In October 1978, database experts from more than 30 countries gathered in New Orleans, USA, to discuss database design methods. Applying the ideas and methods of software engineering, they proposed a database design specification, the famous New Orleans design methodology, which is currently recognized as one of the more complete and authoritative database design specifications. The New Orleans design methodology divides database design into four phases, as shown in Fig. 7.1.

Fig. 7.1 The four phases in the New Orleans design methodology

These four phases are requirement analysis, conceptual design, logical design, and physical design. The requirement analysis phase mainly analyzes user requirements and produces requirement statements; the conceptual design phase mainly analyzes and defines information and produces conceptual models; the logical design phase mainly designs based on entity connections and produces logical models; and the physical design phase mainly designs physical structures based on physical characteristics of database products and produces physical models.

In addition to the New Orleans design methodology, there are also database design methods based on E-R diagrams and design methods based on 3NF. These are specific techniques used in different phases of database design, and they will be described in detail in later chapters.

7.2 Requirements Analysis

7.2.1 Significance of Requirement Analysis

In real life, a building without a good foundation will be crooked. Experience has proven that poor requirement analysis directly leads to incorrect design. If problems are not discovered until the system testing stage and the team then has to go back and correct them, the cost is high, so the requirement analysis stage must be given high priority.

The requirement analysis phase mainly collects information, then analyzes and organizes it to provide sufficient input for the subsequent phases. This stage is the most difficult and time-consuming, but it is also the basis of the whole database design. If the requirement analysis is not done well, the whole database design may have to be reworked.

The following points should be done in the requirement analysis phase.

  (1) Understand the operation of the existing system, such as the business it carries, its business processes and its deficiencies.

  (2) Determine the functional requirements of the new system, that is, understand the end users' ideas, functional requirements and desired results.

  (3) Collect the basic data and related business processes that can achieve the objectives, so as to prepare for a better understanding of the business processes and user requirements.

7.2.2 Tasks of the Requirement Analysis Stage

The main task of the requirement analysis phase is first to investigate user business behaviors and processes, then to conduct system research, collect and analyze requirements, and determine the scope of system development, and finally to prepare a requirement analysis report.

Investigating user business behaviors and processes requires understanding the users' expectations and goals for the new system and the main problems of the existing system. In the stage of system research, requirements are collected and analyzed and the scope of system development is determined, with the main tasks divided into the following three parts.

  (1) Information research. It is necessary to determine all the information to be used in the designed database system and to clarify the sources, methods, data formats and contents of the information. The main goal of the requirement analysis phase is to clarify what data is to be stored in the designed database, what data needs to be processed, and what data needs to be used by the next system.

  (2) Processing requirements. Translate the user's business functional requirements into a requirement statement that defines the functional points of the database system to be designed. That is, convert the requirements described by users in business language into design requirements that can be understood by computer systems or developers. It is necessary to describe the operational functions of data processing, the sequence of operations, the frequency and occasion of their execution, and the connection between operations and data, as well as to specify the response time and processing methods required by users. These contents form a necessary part of the user requirement specification.

  (3) Understand and record user requirements in terms of security and integrity. Writing the requirement analysis report requires going through the process of system research, collection and processing. Generally, the output of this stage is the requirement analysis report, including the user requirement specification and the data dictionary. The data dictionary here is a summary document of the data items and data of the existing business, not the data dictionary inside a database product.

7.2.3 Methods of Requirement Analysis

The focus of requirement analysis is to sort out the users' "information flow" and "business flow". The "business flow" refers to the current status of the business, including business policies, organization and business processes. The "information flow" refers to the data flow, including the source, flow and focus of data, the process and frequency of data generation and modification, and the relation between data and business processing. External requirements should also be clarified during the requirement analysis phase, including but not limited to data confidentiality requirements, query response time requirements and output report requirements.

According to the actual situation and the support available from users, the requirement investigation can combine several means, for example, reviewing the design documents and reports of existing systems, talking with business personnel, and questionnaire surveys. If conditions permit, sample data from the existing business systems should also be collected as part of the design process to verify some business rules and understand the quality of the data.

During the requirement analysis process, do not make assumptions or guesses about the user's ideas; always check with the user about any assumptions or unclear areas.

7.2.4 Data Dictionary

The data dictionary is the result obtained after requirement analysis, data collection and data analysis. Unlike the data dictionary in a database, the data dictionary here mainly refers to the description of the data, not the data itself, and includes the following contents.

  (1) Data items: These mainly include the data item name, meaning, data type, length, value range, unit and logical relations with other data items, which are the basis of model optimization in the logical design stage.

  (2) Data structure: The data structure reflects the combination relations between data items; a data structure can be composed of several data items and other data structures.

  (3) Data flow: The data dictionary is required to represent the data flow, that is, the transmission path of data in the system, including data source, flow direction, average flow, peak flow, etc.

  (4) Data storage: This includes data access frequency, retention duration, and data access methods.

  (5) Processing process: This includes the function of each data processing process and its processing requirements. The function refers to what the processing process is used to do, and the requirements include how many transactions are processed per unit of time, how much data volume is involved, time response requirements, etc.

There is no fixed document specification for the format of the data dictionary; in practice, it can refer to the content items above and can be reflected through different descriptive documents or in the model file. The data dictionary is thus a concept at the abstract level, a collection of documents. In the requirement analysis phase, the most important output is the user requirement specification, where the data dictionary often exists as an annex or appendix to provide a reference for the model designers in their subsequent work.
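Since the format is not fixed, a data dictionary entry can be kept in any structured document. The sketch below records one data item and one data structure as plain Python dictionaries; the field names and the "student_no" item are illustrative assumptions, not a prescribed standard.

```python
# One data-item entry of a data dictionary, holding the contents listed
# above: name, meaning, data type, length, value range, unit, and the
# logical relations with other data items.
data_item = {
    "name": "student_no",
    "meaning": "unique number assigned to a student at enrollment",
    "data_type": "CHAR",
    "length": 10,
    "value_range": "10 digits; first 4 digits are the enrollment year",
    "unit": None,
    "related_items": ["class_no"],  # logical relation with other data items
}

# A data structure groups several data items (and possibly other structures).
data_structure = {
    "name": "student",
    "items": ["student_no", "name", "gender", "class_no"],
}

print(data_item["name"], "->", data_item["data_type"])
```

In a real project these entries usually live in a requirements document or a modeling tool rather than in code; the point is only that each item carries the descriptive fields enumerated above.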

7.3 Conceptual Design

7.3.1 Conceptual Design and Conceptual Model

The task of the conceptual design phase is to analyze the requirements proposed by the users, and to synthesize, summarize and abstract them into a conceptual-level abstract model independent of any concrete DBMS, i.e., the conceptual data model (hereinafter referred to as the conceptual model). The conceptual model is a high-level abstraction: it is independent of any specific database product and is not bound by the characteristics or physical attributes of any particular database product.

The conceptual model has the following four main features.

  (1) It can truly and fully reflect the real world, including the connections between things, serving as a true model of the real world.

  (2) It is easy to understand, enabling discussion with users who are not familiar with databases.

  (3) It is easy to change: when the application environment and application requirements change, the conceptual model can be modified and expanded.

  (4) It is easy to convert into a relational data model.

The latter two are the basic conditions for the smooth progress of the next stage of work.

7.3.2 E-R Approach

The conceptual model is a conceptual-level abstract model, independent of any concrete database management system, generated by analyzing the requirements proposed by users and synthesizing, summarizing and abstracting them. One could organize the real world directly according to a concrete data model, but then many factors must be considered at the same time, and the design work becomes complicated with unsatisfactory results, so an approach is needed to describe the information structure of the real world.

In 1976, the E-R (Entity-Relation) approach was proposed. Because of its simplicity and practicality, it quickly became one of the commonly used methods for conceptual modeling, and it is now a common approach to describing information structures. The tool used in the E-R approach is called the E-R diagram, which consists of three main elements: entity, attribute and linkage. The E-R diagram is widely used in the conceptual design stage, and the database concepts it represents are very intuitive and easy for users to understand.

An entity is a collection of real-world objects that have common attributes and can be distinguished from each other. For example, teachers, students, and courses are all entities, as shown in Fig. 7.2. In an E-R diagram, specific entities are generally represented by rectangular boxes. Each specific record value in an entity, such as each specific student in the student entity, is called an instance of the entity.

Fig. 7.2


Attributes are data items that describe the nature or characteristics of an entity, and all instances belonging to the same entity have the same attributes. For example, the student number, name and gender shown in Fig. 7.3 are all attributes. In the conceptual model, attributes are generally represented by rectangular boxes with rounded corners.

In practice, the conceptual model need not be designed down to the attribute level; designing to the entity level is acceptable. Planning out all the attributes in detail in the conceptual model increases the workload. In a practical application project, the E-R diagram of the conceptual model should delineate and express the linkages between entities clearly, so it is sufficient for a general conceptual model to reach the level that reflects the linkages between entities.

Fig. 7.3


The linkages within and between entities are usually represented by diamond-shaped boxes. In most cases, the data model is concerned with the linkages between entities, which are usually divided into three categories.

  (1) One-to-one linkage (1:1): Each instance in entity A has at most one instance linked to it in entity B, and vice versa. For example, a class has one class advisor; this linkage is recorded as 1:1.

  (2) One-to-many linkage (1:n): Each instance in entity A has n instances linked to it in entity B, while each instance in entity B has at most one instance linked to it in entity A; this is recorded as 1:n. For example, a class contains n students.

  (3) Many-to-many linkage (m:n): Each instance in entity A has n instances linked to it in entity B, while each instance in entity B has m instances linked to it in entity A; this is recorded as m:n. Take the linkage between students and elective courses: a student can take more than one course, and a course can be taken by more than one student.
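The three linkage types eventually map onto table structures. A minimal sketch using SQLite in memory shows one common mapping: a UNIQUE foreign key for 1:1, a plain foreign key on the "many" side for 1:n, and a junction table for m:n. The table and column names (class, student, course, enrollment) are illustrative assumptions drawn from the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1:1 - a class has one advisor; UNIQUE on the foreign key enforces "at most one".
cur.execute("CREATE TABLE class (class_id INTEGER PRIMARY KEY, advisor_id INTEGER UNIQUE)")

# 1:n - a class contains many students; the foreign key sits on the "many" side.
cur.execute("""CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    class_id   INTEGER REFERENCES class(class_id))""")

# m:n - students and courses; a junction table holds one row per (student, course) pair.
cur.execute("CREATE TABLE course (course_id INTEGER PRIMARY KEY)")
cur.execute("""CREATE TABLE enrollment (
    student_id INTEGER REFERENCES student(student_id),
    course_id  INTEGER REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id))""")

cur.execute("INSERT INTO class VALUES (1, 10)")
cur.executemany("INSERT INTO student VALUES (?, ?)", [(1, 1), (2, 1)])
cur.executemany("INSERT INTO course VALUES (?)", [(101,), (102,)])
# Student 1 takes two courses; course 101 is taken by two students.
cur.executemany("INSERT INTO enrollment VALUES (?, ?)", [(1, 101), (1, 102), (2, 101)])

cur.execute("SELECT COUNT(*) FROM enrollment WHERE course_id = 101")
print(cur.fetchone()[0])  # 2
```

The junction table is what makes the m:n linkage expressible in a relational schema, since a single foreign key column can only carry a 1:n relation.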

Simply put, conceptual design is the conversion of realistic conceptual abstractions and linkages into the form of an E-R diagram, as shown in Fig. 7.4.

Fig. 7.4


7.4 Logical Design

7.4.1 Logical Design and Logical Models

Logical design is the process of converting the conceptual model into a concrete data model. According to the basic E-R diagram established in the conceptual design phase, the model is converted into the selected target data model (hierarchical, network, relational, or object-oriented), and the result is the logical data model (hereinafter referred to as the logical model). For relational databases, this conversion must conform to the principles of the relational data model.

The most important work in the logical design phase is to determine the attributes and primary keys of the logical model. The primary key uniquely identifies a row in the table and is also known as a code. A primary key can consist of a single field or multiple fields. A common way to carry out logical design is to use an E-R design tool and the IDEF1X method to build the logical model. Commonly used E-R diagram notations include IDEF1X, the Crow's Foot notation of IE models, and Unified Modeling Language (UML) class diagrams.

7.4.2 IDEF1X Method

The logical model in this book adopts the IDEF1X (Integration DEFinition for Information Modeling) method. IDEF, which stands for Integration DEFinition, originated in the US Air Force ICAM (Integrated Computer Aided Manufacturing) project. Three methods were initially developed: functional modeling (IDEF0), information modeling (IDEF1), and dynamic modeling (IDEF2). Later, as information systems developed, further IDEF family methods were introduced, such as the data modeling method (IDEF1X), process description capture method (IDEF3), object-oriented design method (IDEF4), OO design method using C++ (IDEF4C++), ontology description capture method (IDEF5), design rationale capture method (IDEF6), human-system interaction design method (IDEF8), business constraint discovery method (IDEF9), and network design method (IDEF14). IDEF1X is an extension of IDEF1 that adds some rules to the E-R method to make the semantics richer.

The IDEF1X method has several features when used for logical modeling.

  (1) It supports the semantic constructs necessary for developing conceptual and logical models, and has good extensibility.

  (2) Its representation of semantic concepts is concise and consistent.

  (3) It is easy to understand, enabling business personnel, IT technicians, database administrators and designers to communicate in the same language.

  (4) It lends itself to automation: commercial modeling software supports the IDEF1X design methodology and can quickly convert between models at all levels.

7.4.3 Entities and Attributes in the Logical Model

According to their characteristics, entities can be divided into two categories.

  (1) Independent entity, usually represented by a rectangular box with right-angled corners. An independent entity exists independently and does not depend on other entities.

  (2) Dependent entity, usually represented by a rectangular box with rounded corners. A dependent entity must depend on other entities, and its primary key must include part or all of the primary key of an independent entity.

The primary key of the independent entity appears in, and becomes part of, the primary key of the dependent entity. As shown in Fig. 7.5, the chapter entity depends on the book entity. For example, many books have a Chapter 2. If there were no book ID as part of the primary key to distinguish the Chapter 2 of different books, only one record for Chapter 2 could appear in the chapter entity. But in fact, the title, page count and word count of Chapter 2 differ from book to book, so the chapter entity must depend on the book entity in order to function.

Fig. 7.5 Entity categories
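The book/chapter dependency can be sketched in SQL: the chapter table's primary key is composite, containing the book's primary key, so "Chapter 2" of different books coexist as distinct records. Column names here are illustrative assumptions; SQLite is used only as a convenient in-memory engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT)")
cur.execute("""CREATE TABLE chapter (
    book_id    INTEGER REFERENCES book(book_id),  -- parent's key reappears here
    chapter_no INTEGER,
    title      TEXT,
    PRIMARY KEY (book_id, chapter_no))            -- composite key: chapter depends on book""")

cur.executemany("INSERT INTO book VALUES (?, ?)",
                [(1, "SQL Basics"), (2, "Data Modeling")])
# Two different "Chapter 2" records, one per book, both allowed by the composite key.
cur.executemany("INSERT INTO chapter VALUES (?, ?, ?)",
                [(1, 2, "Queries"), (2, 2, "Entities")])

cur.execute("SELECT COUNT(*) FROM chapter WHERE chapter_no = 2")
print(cur.fetchone()[0])  # 2
```

If chapter_no alone were the primary key, the second insert would collide with the first, which is exactly the overwrite problem the text describes for an independent chapter entity.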

Attributes are the characteristics of an entity, and the following types should be noted.

  (1) Primary key. The primary key is an attribute or group of attributes that uniquely identifies an entity instance. For example, the name of a student entity cannot be used as a primary key because names may be duplicated. The student number or ID number uniquely identifies the student and can therefore be used as a primary key.

  (2) Optional key. An attribute or group of attributes, other than the primary key, that can also identify an entity instance.

  (3) Foreign key. When two entities are linked, the foreign key of one entity is the primary key of the other entity. The entity holding the primary key can be called the parent entity, and the entity holding the foreign key the child entity.

  (4) Non-key attribute. An attribute inside an entity other than the primary key and foreign key attributes.

  (5) Derived attribute. An attribute that can be calculated or derived from other attributes.

The primary key of the book entity shown in Fig. 7.6 is the book ID, while other attributes are non-key attributes. The primary key of the chapter is the book ID plus the chapter number, while other attributes are non-key attributes. The book ID in the chapter entity is a foreign key.

Fig. 7.6 Attributes in entities

How are primary keys, foreign keys and indexes related and distinguished? A primary key uniquely identifies an instance, has no duplicate values, is non-null, and should not be updated. Its role is to determine the uniqueness of a record and ensure data integrity, so an entity can have only one primary key.

A foreign key is generally the primary key of another entity; within its own entity it may be duplicated or null. Its role is to establish referential consistency through the relation between the two entities, so an entity can have more than one foreign key. For example, suppose attribute A is a foreign key in table X; it may be duplicated in table X. Because it is a foreign key, it must be the primary key of another table, say table Y, and in table Y attribute A is not allowed to be duplicated.

Indexes are physical objects of the database and can be divided by uniqueness into unique indexes and non-unique indexes. A unique index is an object built on a table that allows no duplicate values but can contain a null value. A non-unique index is an object built on a table that can contain nulls and duplicate values. The purpose of an index is to improve query efficiency and thus speed up queries. The relations between primary keys, foreign keys and indexes are shown in Table 7.1.

Primary keys and foreign keys are logical concepts in the logical model, while indexes are physical objects. Many databases create the primary key when building a table, in which case the primary key attributes are backed by a unique non-null index.

Table 7.1 Relation between primary key, foreign key and index
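The difference between a unique and a non-unique index can be sketched directly in SQLite. The table and index names are illustrative assumptions; the unique index rejects duplicate values, while the non-unique index accepts them and exists only to speed up lookups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE student (student_id INTEGER PRIMARY KEY, id_card TEXT, name TEXT)")
cur.execute("CREATE UNIQUE INDEX ux_student_id_card ON student(id_card)")  # no duplicates allowed
cur.execute("CREATE INDEX ix_student_name ON student(name)")               # duplicates allowed

cur.execute("INSERT INTO student VALUES (1, 'A001', 'Li')")
cur.execute("INSERT INTO student VALUES (2, 'A002', 'Li')")  # duplicate name: fine

rejected = False
try:
    cur.execute("INSERT INTO student VALUES (3, 'A001', 'Wang')")  # duplicate id_card
except sqlite3.IntegrityError:
    rejected = True  # the unique index refused the duplicate value

print(rejected)  # True
```

This mirrors the table above: the primary key behaves like a unique non-null index, a foreign key column would typically carry a non-unique index, and plain indexes impose no uniqueness at all.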

After determining the entities and important attributes, you also need to understand the relations between the entities. Relations describe how entities are related to each other. For example, if a book "includes" several chapters, "includes" is the relation between these two entities. The relation is directional: the book "includes" the chapter rather than the chapter "includes" the book, so the relation from the chapter to the book is "belongs to".

Cardinality is a business rule that reflects the relation between two or more entities, and the relation cardinality is used to express the concept of "linkage" in the E-R method.

Figure 7.7 shows the cardinality notation in IDEF1X. Understanding the meaning of these labels helps to quickly clarify the relation between entities when reading a model structure. From left to right:

  • The first symbol represents a one-to-many relation in which the cardinality of the "many" party is 0, 1, or n.

  • The P symbol represents a one-to-many relation in which the cardinality of the "many" party is 1 or n. The difference between these two relations lies in the presence or absence of 0: if 0 is allowed, the relation is optional, meaning it may exist, expressed in English as "may"; otherwise it is mandatory, meaning it must exist, expressed in English as "must".

  • The Z symbol indicates that the cardinality of the "many" party is 0 or 1.

  • A number n indicates that there are exactly n related instances. For example, a rectangle has exactly four right angles, so rectangle and right angle are in a 1:4 relation.

  • The n-m symbol represents a range. For example, a month contains 28 to 31 days depending on the month and on leap years, so month and day are in a 1:(28-31) relation.

  • The {n} symbol is used when the cardinality cannot be given as a simple number and an annotation is needed to state the value range of n. In practical projects such annotations capture business rules, for example the relation between a month and securities trading days: how many valid trading days a month contains depends on the dates on which the stock exchange allows listed trading that month, which varies with annual policy changes and needs to be stated separately.

Fig. 7.7 Cardinality symbols in IDEF1X

In summary, the cardinality symbols reflect an important point: different cardinalities reflect different relations, and such relations are likely to capture important business rules or constraints.

A cardinality of 0 to n is the expression of "may", an optional requirement.

A cardinality of 1 to n is the expression of "must", a mandatory requirement.

In practice, a cardinality of 0 means that a null value (NULL) may occur when the two tables are joined.

The significance of cardinality is that it reflects the relation, as shown in Fig. 7.8. Both the left and right examples are "includes" relations, and on the book side the relation is 1:1, meaning a chapter must belong to one and only one book. In the example on the left, the values 0 to n express the optional requirement that a book may contain one or more chapters; a cardinality of 0 means the book is not divided into chapters, and in practice null values may then appear when the two tables are joined. The example on the right takes the values 1 to n, a mandatory requirement: the cardinality cannot be 0, so a book must contain one or more chapters.

Fig. 7.8 Different cardinalities reflect different relations
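The NULL behavior of an optional (cardinality 0) relation can be sketched with an outer join: a book with no chapters still appears in the result, with NULL in the chapter columns. Names are illustrative assumptions; SQLite is used as the in-memory engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT)")
cur.execute("CREATE TABLE chapter (book_id INTEGER, chapter_no INTEGER, title TEXT)")

cur.executemany("INSERT INTO book VALUES (?, ?)",
                [(1, "With Chapters"), (2, "No Chapters Yet")])
cur.execute("INSERT INTO chapter VALUES (1, 1, 'Intro')")

# LEFT JOIN keeps every book; a book with cardinality 0 on the chapter
# side comes back with NULL (None in Python) for the chapter columns.
cur.execute("""SELECT b.title, c.chapter_no
               FROM book b LEFT JOIN chapter c ON b.book_id = c.book_id
               ORDER BY b.book_id""")
for title, chapter_no in cur.fetchall():
    print(title, chapter_no)
```

With a mandatory (1 to n) relation, the design guarantees at least one matching chapter row per book, so the same join would never produce NULL.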

An identifying relation occurs between an independent entity and a dependent entity: the unique identification of a child-entity instance requires the parent entity, and the primary key attributes of the parent entity become part of the primary key of the child entity. As shown in Fig. 7.6, the primary key book ID of the parent entity book becomes part of the primary key of the chapter entity.

A non-identifying relation means that the child entity does not need the relation with the parent entity to determine the uniqueness of its instances; the two entities are then independent entities with no dependencies. In Fig. 7.6, if the chapter entity did not depend on the book entity and became independent, then each chapter number could have only one record, and the same chapter numbers of different books would overwrite each other, which is a design problem. In this case, the solution is to change the non-identifying relation into an identifying relation. To summarize: the entity holding the foreign key is the child entity, and the entity whose primary key is referenced is the parent entity. The location of the foreign key determines whether the relation between parent and child is identifying or non-identifying: if the foreign key is part of the primary key of the child entity, the relation is identifying; if the foreign key is a non-key attribute of the child entity, the relation is non-identifying.

A recursive relation means that the parent entity and the child entity are the same entity, forming a recursive or nested relation: the primary key of the entity also becomes its own foreign key. A recursive relation occurs when the entity's instances form a hierarchy. In practical applications, such recursive entities are very common. For example, an organization structure includes superior and subordinate departments: a department may have one or more subordinate departments, the lowest-level department has no subordinates, and the top-level department has no superior, as shown in Fig. 7.9.

Fig. 7.9 Recursive relation
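The department hierarchy can be sketched as a table whose foreign key points back at its own primary key; a recursive common table expression then walks the hierarchy. Table and column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The entity's own primary key reappears as its foreign key (self-reference).
cur.execute("""CREATE TABLE department (
    dept_id        INTEGER PRIMARY KEY,
    name           TEXT,
    parent_dept_id INTEGER REFERENCES department(dept_id))""")

cur.executemany("INSERT INTO department VALUES (?, ?, ?)",
                [(1, "Head Office", None),   # top department: no superior
                 (2, "R&D", 1),
                 (3, "Database Team", 2)])   # lowest level: no subordinates

# Walk up from the Database Team to the top of the hierarchy.
cur.execute("""WITH RECURSIVE chain(dept_id, name, parent_dept_id) AS (
    SELECT dept_id, name, parent_dept_id FROM department WHERE dept_id = 3
    UNION ALL
    SELECT d.dept_id, d.name, d.parent_dept_id
    FROM department d JOIN chain c ON d.dept_id = c.parent_dept_id)
    SELECT name FROM chain""")
print([row[0] for row in cur.fetchall()])  # ['Database Team', 'R&D', 'Head Office']
```

The NULL parent of the top department is what terminates the recursion, matching the rule that the top department has no superior.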

A subtype relation is the relation between a subclass entity and the parent entity to which it belongs. There are two kinds. One is the complete subtype relation, also called complete classification, where every instance of the parent entity is associated with an instance in the subtype group; all instances can be classified, with no exception. The other is the incomplete subtype relation, also called incomplete classification, where an instance of the parent entity is not necessarily associated with an instance in the subclass group; only some instances can be classified, and some cannot be classified or their classification is not of interest. Remember that in practice you must not force a catch-all "other" subclass just to pursue complete classification, as this will bring uncertainty to future business development.

The concepts of the logical model are summarized as follows.

  1. (1)

An entity is the metadata that describes a business object.

  2. (2)

    The primary key is an attribute or group of attributes that identifies the uniqueness of an entity instance.

  3. (3)

    Relations exist between entities only if there are foreign keys, and no relation can be established without foreign keys.

  4. (4)

The cardinality of a relation reflects the business rules between the entities.

Examples of business rules expressed by relation cardinalities in the logical model are as follows.

  • A customer can have only one type of savings account.

  • A customer can have more than one type of savings account.

  • An order can correspond to only one shipping order.

  • A product includes multiple parts.

7.4.4 NF Theory

According to the specific business requirements, database design must determine how to construct a schema that meets those requirements: how many entities are needed, which attributes make up each entity, and what the relations between entities are. Strictly speaking, these are the questions addressed in the logical design stage of a relational database. The relational model is based on strict mathematical theory, so designing it according to the normalization theory of relational databases yields a reasonable relational model. In the database logical design phase, the process of placing attributes in the correct entities is called normalization. Different NFs satisfy different levels of requirements.

Between 1971 and 1972, Dr. E.F. Codd systematically proposed the concepts of 1NF through 3NF, fully discussing model normalization. Others later deepened the theory and proposed higher-level NF standards, but for relational databases it is sufficient in practical applications to achieve 3NF.

A relational data model designed by following normalization theory brings the following benefits.

  1. 1.

    It can avoid the generation of redundant data.

  2. 2.

    The risk of data inconsistency can be reduced.

  3. 3.

    The model has good scalability.

  4. 4.

    It can be flexibly adjusted to reflect changing service rules.

In contrast to normalization during logical model design, building the physical model often involves denormalization, i.e., deliberately violating some normalization rules and adjusting physical attributes in order to improve the performance of the database in use.

When determining entity attributes, the question often faced is: which attributes belong to which entities? This is the question NF theory addresses. For example, a bank has many dealings with an individual: the same person may make deposits, spend on credit cards, buy wealth-management products, and take out car or housing loans. Within the bank, these businesses are run by different departments and systems. If you spend with a credit card, you have a credit card number and a customer number in the credit card system; if you do wealth management, you open a financial account; if you make a deposit, you open a savings account. Yet the individual the bank faces is one person. When building the model, how are these identities merged into a single customer entity? When counting a customer's assets, should there be three entities or one? For a customer who currently has no loan relation with the bank, what should the model anticipate in case a loan relation arises later? These are all questions addressed in logical design, and the theoretical basis for answering them is the NF model.

The relation pattern satisfying the minimum requirements is called the first NF (1NF); one that further satisfies additional requirements on top of 1NF is the second NF (2NF), and so on. A lower-level NF relation pattern can be transformed, through schema decomposition, into a collection of higher-level NF relation patterns. This process is called normalization, as shown in Fig. 7.10.

Fig. 7.10

Relations between NFs

A domain is the set of legal values of an attribute; it defines the attribute's valid range of values. Values inside the domain are legal data, and the domain reflects the relevant business rules.

For example, the domain of the employee ID shown in Fig. 7.11 is the set of integers greater than 0, so 0 and −10 fall outside the domain. Likewise, if cell phone numbers are 11-digit integers, then 12345678910 is formally legal data; considering the actual situation, however, it may still be invalid, because different operators own different number segments.

Fig. 7.11


A relation (table or entity) conforms to 1NF if and only if each attribute contains only atomic values (values that cannot be split further), and each attribute holds a single value from its domain (not a subset of it).

The rules satisfying the 1NF contain the following features.

  1. (1)

The attribute value is atomic (cannot be split further).

  2. (2)

Each attribute holds a single value, which cannot be a subset of the value domain.

  3. (3)

    A primary key is required to ensure that there are no duplicate records in the database.

  4. (4)

The attributes in the entity contain no repeating groups, because repeating groups tend to produce null values and an unstable structure; that is, ordinary business growth can destabilize the model, and repeating groups also lead to ambiguity in use.

For example, in the phone number column shown in Table 7.2, the phone number attribute has serious problems: the value format is not uniform and contains non-numeric characters. Worse, two people each have more than one phone number, and those number lists are subsets of the phone number domain, which violates the rule that "each attribute holds a single value, which cannot be a subset of the value domain". This kind of table structure is common in many practical scenarios. Take the follower list of an account in a social application as an example: this dynamic data is often designed as a single field holding a comma-separated series of account IDs.

Table 7.2 Customer information table (1)

If the two phone numbers are split into two fields, they are shown in Table 7.3.

Table 7.3 Customer information table (2)

This seems to solve the atomicity problem, but the repeating group problem arises. A repeating group is technically atomic in each value but conceptually repeats the same attribute. The reason to avoid repeating groups is that they introduce the following anomalies.

  1. (1)

    Some records produce null values. For example, some customers have only one phone number, without the second phone number, which would result in a null value in the Phone Number 2 field.

  2. (2)

The structure may be unstable. For example, some people have three or more phone numbers, requiring frequent changes to the table structure to adapt to new situations; business development thus destabilizes the model structure.

  3. (3)

Ambiguity arises when the data is used. Which number should be placed first? Which second? By what rule? Which number prevails when contacting the customer? All these questions lead to semantic confusion and ambiguity in business use of the data.

The solution to the above problems is to turn the repeating group into a tall table, putting all phone numbers into the same attribute. This conforms to 1NF, as shown in Table 7.4.

Table 7.4 Customer information table (3)
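The tall-table design can be sketched as follows; this SQLite example (table and column names are hypothetical) stores one phone number per row, so a customer may have any number of phones without null columns or schema changes:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# 1NF-compliant "tall" design: one phone number per row.
con.executescript("""
CREATE TABLE customer_phone (
    customer_id INTEGER,
    phone       TEXT,
    PRIMARY KEY (customer_id, phone)
);
""")
con.executemany("INSERT INTO customer_phone VALUES (?, ?)",
                [(1, "13800000001"),
                 (1, "13800000002"),
                 (2, "13900000003")])
# Customer 1 has two numbers: no NULLs, no Phone Number 2 column,
# and a third number would be just another row.
n = con.execute(
    "SELECT COUNT(*) FROM customer_phone WHERE customer_id = 1"
).fetchone()[0]
print(n)  # 2
```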

Atomicity means indivisibility. But to what degree should values be split? Many people misunderstand the concept of atomicity in practice. Generally speaking, codes with coding rules are composite codes, divisible according to those rules. For example, ID numbers and cell phone numbers can both be split into data of finer granularity, such as birth year and gender in the case of ID numbers. From the field perspective, however, an ID number field is legal as long as it conforms to the coding rules; it is atomic data and needs no further splitting.

2NF means that each table must have a primary key, with the other data elements corresponding to the primary key one to one. This relation is often referred to as functional dependence: all other data elements in the table depend on the primary key, i.e., each data element is uniquely identified by the primary key. 2NF emphasizes full functional dependence; simply put, every non-primary-key field depends on the primary key as a whole, not on part of it.

There are two necessary conditions for 2NF: first, 1NF must be satisfied; second, every non-primary attribute must be fully functionally dependent on every candidate key. This can be simply understood as: all non-primary-key fields depend on the whole primary key, not a part of it. Table 7.5 does not satisfy 2NF because the order date depends only on the order number and has nothing to do with the part number, so the table holds a lot of redundant data as the order number repeats.

A simple tip: if an entity's primary key consists of a single field, the entity basically satisfies 2NF.

Table 7.5 Order and part table 1

Modify Table 7.5 by moving the order date, together with the order number it depends on, into a separate entity; both entities then satisfy 2NF. This is normalization: a lower-level NF pattern is converted through schema decomposition into a collection of higher-level NF relational patterns, as shown in Tables 7.6 and 7.7.

Table 7.6 Order and part table (2)
Table 7.7 Order number table
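The decomposition can be sketched like this (SQLite for illustration; table names such as order_header and order_part are hypothetical stand-ins for Tables 7.6 and 7.7): the order date, which depends only on the order number, moves into an entity keyed by the order number alone, so it is stored once per order no matter how many part lines the order has.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- 2NF decomposition: order_date depends only on order_no (part of
-- the composite key), so it moves to its own entity.
CREATE TABLE order_header (
    order_no   INTEGER PRIMARY KEY,
    order_date TEXT
);
CREATE TABLE order_part (
    order_no INTEGER REFERENCES order_header(order_no),
    part_no  INTEGER,
    quantity INTEGER,
    PRIMARY KEY (order_no, part_no)
);
""")
con.execute("INSERT INTO order_header VALUES (1001, '2023-05-01')")
con.executemany("INSERT INTO order_part VALUES (?, ?, ?)",
                [(1001, 1, 10), (1001, 2, 5)])
# The date is stored exactly once even though the order has two lines:
date_rows = con.execute(
    "SELECT COUNT(*) FROM order_header WHERE order_no = 1001").fetchone()[0]
print(date_rows)  # 1
```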

3NF requires that every non-primary-key field depend on the whole primary key and not on other non-primary-key attributes. There are two necessary conditions: first, 2NF must be satisfied; second, no non-primary attribute may be transitively dependent on the primary key. In other words, under 3NF every non-primary-key field depends on the whole primary key rather than on a non-primary-key attribute. In Table 7.8 the customer name depends on the non-primary-key attribute customer ID, so 3NF is not satisfied.

Table 7.8 Order and customer table

3NF mainly constrains field redundancy: a table may not contain derived fields. If a table has redundant fields, updating data becomes less efficient because the redundant copies must be maintained, and data inconsistency easily follows. The solution is to split the table into two tables linked by a primary-foreign key relation, as shown in Tables 7.9 and 7.10.

Table 7.9 Order table
Table 7.10 Customer table
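The 3NF split can be sketched as follows (SQLite for illustration; names are hypothetical stand-ins for Tables 7.9 and 7.10). Once the customer name lives only in the customer table, renaming the customer is a single-row update and the orders can never disagree about the name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- 3NF split: customer_name depends on customer_id (a non-key
-- attribute of the order), so it moves into its own table.
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT
);
CREATE TABLE orders (
    order_no    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id)
);
""")
con.execute("INSERT INTO customer VALUES (7, 'Alice')")
con.executemany("INSERT INTO orders VALUES (?, 7)", [(1,), (2,)])
# Renaming the customer is now one update; every order sees it via a join:
con.execute("UPDATE customer SET customer_name = 'Alice Ltd' WHERE customer_id = 7")
name = con.execute("""
    SELECT c.customer_name FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id
    WHERE o.order_no = 2
""").fetchone()[0]
print(name)  # Alice Ltd
```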

In 1970, Dr. E.F. Codd, an IBM researcher, published the paper that introduced the relational model and laid its theoretical foundation. In the early 1970s he went on to define the concepts of 1NF, 2NF and 3NF. In practical applications it is sufficient for a relational model to satisfy 3NF.

  • The KEY—1st Normal Form (1NF)

  • The WHOLE Key—2nd Normal Form (2NF)

  • AND NOTHING BUT the Key—3rd Normal Form (3NF)—E.F. Codd

In practice, database design satisfies at most 3NF. It is generally believed that although higher NFs constrain data relations better, they also make database I/O busier because of the increased number of relation tables, so real projects basically never go beyond 3NF.

In a data warehouse, the application layer often uses a star-shaped or snowflake-shaped model. The snowflake model is a structure commonly used in business intelligence (BI) and reporting systems, so named because the fact table's dimensions, once expanded, resemble a snowflake, as shown in Fig. 7.12. This model basically meets the requirements of 3NF, or at least 2NF in many scenarios.

Fig. 7.12

Snowflake-shaped model

7.4.5 Logic Design Considerations

When designing a logical model, several matters of principle deserve attention. The first is establishing naming rules. As in other development work, it is advisable to establish naming rules before logical modeling and follow them throughout. The main purpose is to unify thinking, facilitate communication and achieve standardized development. For example, under unified naming, the amount attribute is named amount, abbreviated amt, with the physical type DECIMAL(9,2), accurate to two decimal places in calculations. If naming is inconsistent, however, for example some people defining the customer ID as cid and others as customer_id, it becomes doubtful whether the two attributes refer to the same object, and different roles form different understandings of the same model.

The naming suggestions for entities and attributes are as follows.

  1. (1)

    Entity name: capitalize the type domain + entity descriptor (full name, initial capitalization).

  2. (2)

Attribute name: use full words with initial capitals; agreed conventional abbreviations may follow, separated by spaces.

  3. (3)

    Avoid mixing English and Chinese Pinyin.

  4. (4)

If abbreviations are used, they must be abbreviations of English words; avoid acronyms formed from Pinyin.

Also design the logical model according to the design process, determining entities and attributes: define each entity's primary key (PK), define the non-key attributes, define non-unique attribute groups, and add the corresponding comments.

Finally, it is necessary to determine the relations between entities: use foreign keys to determine whether a relation is identifying, and determine whether its cardinality is 1:1, 1:n or n:m. When adding non-key attributes to an entity, consider whether they conform to 3NF; if an added attribute violates 3NF, split the entity and determine the relation between the new entity and the original one. Comments are generally a textual description of the business meaning, code values, etc.

7.5 Physical Design

7.5.1 Physical Design and Physical Models

Physical design is the adjustment of the physical attributes of the model, based on the logical model, to optimize database performance and improve the efficiency of business operation and application. The physical design should be adjusted in line with the physical attributes of the target database product, with the ultimate goal of generating deployable DDL for the target database.

The main contents include but are not limited to the following.

  1. (1)

Denormalization (non-regularized processing) of entities.

  2. (2)

    Physical naming of tables and fields.

  3. (3)

    Determining the type of fields, including attributes such as length, precision, and case sensitivity.

  4. (4)

    Adding physical objects that do not exist in the logical model, such as indexes, constraints, partitions, etc.

Table 7.11 shows the designations of the same concept at different stages. For example, relations in relational theory are called entities in the logical model and tables in the physical model. A tuple in relational theory is an instance in the logical model and a row in the physical model. Attributes in relational theory are called attributes in the logical model and fields of a table in the physical model.

Table 7.11 Names of the same concept at different stages

In the comparison of the logical and physical models shown in Table 7.12, the logical model contains entities and attributes, which correspond to tables and fields in the physical model. As for key values, the physical model generally does not use primary keys, relying instead on unique constraints and not-null constraints, because a primary key constraint imposes overly strict data quality demands; the constraint requirement is therefore relaxed in the physical implementation, and the primary key remains mainly a logical concept. In naming, the logical model follows business rules and real-world naming conventions, while the physical model must respect the limitations of the database product: no illegal characters, no database keywords, no over-length names. In terms of normalization, the logical model should aim for 3NF and be normalized; the physical model pursues high performance and may have to be denormalized, i.e., non-regularized.

Table 7.12 Comparison of logical and physical models

7.5.2 Denormalization of the Physical Model

Denormalization, also called non-regularization processing, is the process and set of techniques opposite to normalization, for example downgrading a model from 3NF to 2NF or 1NF. Physical model design must take into account the physical limitations of the database alongside performance and application requirements. Theoretically, if hardware were unlimited, with unlimited CPU speed, memory, storage and bandwidth, there would be no need to denormalize. It is precisely because resources and hardware are limited that the physical model requires denormalization; and denormalization must be applied in moderation, to limit the resulting data redundancy and the attendant risk of data inconsistency.

Frequent join operations can be avoided by adding redundant columns, as shown in Tables 7.13, 7.14 and 7.15. There is a primary-foreign key relation between the order table and the customer table; if a report can display only the customer number, it is very inconvenient for users, so a join is needed to display the customer name as well. Joins consume resources, however, and in practice it is common for a single query to join more than a dozen code tables. Without redundancy, a lot of real-time computing resources are spent on joins, hurting query efficiency. Adding redundant columns, i.e., pre-joining, therefore improves query efficiency.

Table 7.13 Order table
Table 7.14 Customer table
Table 7.15 Order and customer table
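A minimal sketch of the pre-joined design (SQLite for illustration; the orders_denorm table and its data are hypothetical): the customer name is copied into the order table, so the report query touches a single table and needs no join at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Denormalized order table: customer_name is copied in
-- ("pre-joined") so reports need not join the customer table.
CREATE TABLE orders_denorm (
    order_no      INTEGER PRIMARY KEY,
    customer_id   INTEGER,
    customer_name TEXT          -- redundant column
);
""")
con.execute("INSERT INTO orders_denorm VALUES (1, 7, 'Alice')")
row = con.execute(
    "SELECT order_no, customer_name FROM orders_denorm WHERE order_no = 1"
).fetchone()
print(row)  # (1, 'Alice') -- single-table query, no join needed
```

The price, as the text notes, is redundancy: if the customer is renamed, every matching order row must be updated too.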

The complexity of SQL can be reduced by adding redundant columns and using repeating groups, as shown in Tables 7.16 and 7.17. This example converts the tall table above into the wide table below, a technique often used for front-end report queries; it suits fixed-format reports whose style is determined in advance.

Table 7.16 Sales monthly report of a department
Table 7.17 Customer table
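The tall-to-wide conversion is a pivot, commonly written with conditional aggregation. A sketch (SQLite for illustration; the monthly_sales table and its values are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly_sales (dept TEXT, month INTEGER, amount REAL)")
con.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)",
                [("toys", 1, 100.0), ("toys", 2, 150.0), ("toys", 3, 80.0)])
# Pivot the tall table into one wide row per department, one column
# per month: the repeating group deliberately reintroduced for reporting.
wide = con.execute("""
    SELECT dept,
           SUM(CASE WHEN month = 1 THEN amount END) AS m1,
           SUM(CASE WHEN month = 2 THEN amount END) AS m2,
           SUM(CASE WHEN month = 3 THEN amount END) AS m3
    FROM monthly_sales GROUP BY dept
""").fetchone()
print(wide)  # ('toys', 100.0, 150.0, 80.0)
```

In practice the pivoted result would be materialized as the wide report table, so the front end reads it with a trivial single-row SELECT.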

Tables 7.18 and 7.19 show how adding derived columns reduces function computation, a very common scenario: extracting a customer's age from the ID card number; classifying users as VIP, platinum or ordinary customers based on spending amounts; or flagging suspicious transactions and accounts in an AML system. This method is typical of customer relationship management projects. In Table 7.19, users are divided by age into elderly, middle-aged and young groups.

Table 7.18 Original customer table
Table 7.19 Derived customer table
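The derived-column idea can be sketched as follows (SQLite for illustration; the age thresholds and the age_group column are hypothetical): the classification is computed once and stored, so later queries filter on the stored value instead of re-evaluating the expression every time.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE customer (customer_id INTEGER, age INTEGER, age_group TEXT)")
con.executemany("INSERT INTO customer (customer_id, age) VALUES (?, ?)",
                [(1, 23), (2, 45), (3, 71)])
# Populate the derived column once; queries then use age_group
# directly instead of recomputing the CASE expression.
con.execute("""
    UPDATE customer SET age_group = CASE
        WHEN age < 35 THEN 'young'
        WHEN age < 60 THEN 'middle-aged'
        ELSE 'elderly'
    END
""")
groups = [r[0] for r in con.execute(
    "SELECT age_group FROM customer ORDER BY customer_id")]
print(groups)  # ['young', 'middle-aged', 'elderly']
```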

Denormalization is commonly handled by the following means.

  1. (1)

    Adding duplicate groups.

  2. (2)

Performing pre-joins.

  3. (3)

    Adding derived fields.

  4. (4)

    Creating summary tables or temporary tables.

  5. (5)

    Horizontally or vertically splitting tables.

Denormalization is common in OLTP systems, where it is generally used to improve performance in high-concurrency, transaction-heavy scenarios; its negative impact, however, weighs more heavily in OLAP systems, where it must be considered more carefully for the following reasons.

  1. (1)

    Denormalization does not bring performance improvement to all processing processes, and the negative impact needs to be balanced.

  2. (2)

    Denormalization may sacrifice the flexibility of data models.

  3. (3)

    Denormalization poses the risk of data inconsistency.

7.5.3 Maintaining Data Integrity

Denormalization brings the increase of redundant data, which requires certain management measures to maintain data integrity. There are three common processing methods.

  1. (1)

Maintain by batch processing. After a certain period of time, a batch job or stored procedure is executed to refresh the replicated or derived columns. This approach can be used only where the real-time requirement is not strict.

  2. (2)

In the application, add, delete and modify all involved tables within the same transaction. Pay close attention to data quality, however: when requirements change frequently, this maintenance is easy to neglect, leading to data quality problems.

  3. (3)

Use triggers. Triggers give good real-time behavior: after the application updates the data of Table A, the database automatically triggers the update of Table B. The cost is pressure on the database; in practice triggers carry a significant performance penalty, so they are used in fewer and fewer scenarios.

7.5.4 Establishing a Physicalized Naming Convention

A naming convention should be established for physicalization. Naming should respect the physical characteristics of the database: avoid illegal characters and the database's reserved keywords, and prefer meaningful, memorable, descriptive, short and unique English words; Chinese Pinyin is not recommended. The convention should be unified and strictly observed within the project team, and name abbreviations should be agreed upon. Physical characteristics generally means case sensitivity and length limits; for example, GaussDB(DWS) specifies that a name cannot exceed 63 characters.

Using database reserved keywords may pass at the syntax level but brings uncontrollable risk to subsequent operations and maintenance, other automated management work and future system upgrades. Database object names are generally case-insensitive at the physical level, so do not force case differentiation with double quotes.

Table prefixes can be unified as t, view prefixes as v, and index prefixes as ix. When naming, add the appropriate prefix followed by a meaningful specific name, with the whole name in lowercase, as shown in Table 7.20. The examples here are for reference only and are not a mandatory convention.

Table 7.20 Object naming convention

7.5.5 Physicalizing Tables and Fields

The table-level physicalization operations listed here are only part of the work, not covering all table-level physicalization work.

There are several methods for table physicalization as follows.

  1. (1)

    Perform the denormalization operation using the methods described earlier.

  2. (2)

Decide whether to partition. Partitioning a large table reduces the I/O scanning workload and narrows the scope of queries, but finer partitioning is not always better: if you only query or summarize by month, partition by month rather than by day.

  3. (3)

Decide whether to split history tables from current tables. A history table holds cold data used at low frequency and can live on low-speed storage; a current table holds hot, frequently queried data and can use high-speed storage. History tables can also be compressed to reduce the storage space occupied.

For field-level physicalization, first prefer data types with short fields. Shorter data types not only reduce the size of data files and improve I/O performance, but also reduce memory consumption during related calculations and improve computational performance. For example, for integer data, prefer SMALLINT over INT, and INT over BIGINT, where the range allows. Second, use consistent data types: columns involved in joins should share the same data type, otherwise the database must dynamically convert them to a common type for comparison, incurring performance overhead. Last, use efficient data types. Generally speaking, integer operations (the conventional comparisons =, >, <, ≥, ≤, ≠, as well as GROUP BY) are more efficient than operations on strings and floating-point numbers.

The premise of using efficient data is that the data type must meet the service requirements of the value field. For example, the service context is the amount field with decimals, then you cannot force the use of integers in pursuit of high efficiency.

Integer data is an efficient type compared to strings. TINYINT occupies only 1 byte, with values in the range 0 to 255, and is the most efficient; but it is a GaussDB-specific type, and at present the GaussDB ODBC driver is the open-source ODBC driver, whose compatibility with TINYINT is poor. SMALLINT occupies 2 bytes, with values from −32 768 to +32 767, but such a field can hold only numbers and cannot later be extended with characters such as a, b, c. INT occupies 4 bytes, ranging from −2 147 483 648 to +2 147 483 647; BIGINT occupies 8 bytes, from −9 223 372 036 854 775 808 to +9 223 372 036 854 775 807. CHAR(1) occupies 1 byte, less efficient than an integer but supporting the characters 0-9 and A-Z; VARCHAR(1) occupies at least 3 bytes because of its length overhead. So there is no absolute standard; decide according to the actual scenario.


Consider: a certain flag field takes only the values 0 and 1. Which data type is appropriate for it?

Common field-level constraints are DEFAULT constraints, not-null constraints, unique constraints, primary key constraints and check constraints. If the field value can be supplied at the business level, DEFAULT constraints are not recommended, to avoid unintended results during data loading. A not-null constraint is recommended on any field where a null value clearly cannot occur. A primary key constraint equals a unique constraint plus a not-null constraint, so it should be added where conditions allow. A check constraint is a data quality requirement: inserting data that violates it causes the SQL statement to fail.
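These constraints can be sketched in DDL as follows (SQLite for illustration; the account table is hypothetical). The flag column's CHECK constraint rejects any value other than 0 or 1, which is exactly the failure mode described above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,           -- unique + not null
        owner      TEXT NOT NULL,                 -- null clearly impossible
        flag       INTEGER CHECK (flag IN (0, 1)) -- data quality rule
    )
""")
con.execute("INSERT INTO account VALUES (1, 'Alice', 0)")
try:
    con.execute("INSERT INTO account VALUES (2, 'Bob', 5)")  # violates CHECK
    failed = False
except sqlite3.IntegrityError:
    failed = True
print(failed)  # True: the offending row is rejected by the constraint
```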

Check constraints have little overall impact on GaussDB (for MySQL): in an OLTP system, if some rows cannot be inserted because they fail a constraint, the failures can be logged and handled separately. In GaussDB(DWS), however, they affect the whole job processing system and their impact is relatively larger: in an OLAP system, a large-volume operation may fail entirely because a few records violate a check constraint, disrupting the warehouse's overall data processing. Therefore, in OLAP systems, enforce local data quality in the data application as much as possible, and do not add such constraints on physical tables and fields.

For the creation and use of indexes, the following lists cases in which indexes may be added; these are not mandatory requirements. Even when an index exists, the database system makes its own optimization judgment about whether to use it: the index is used when it makes execution cheaper and faster, and ignored when its cost is higher and the efficiency gain insignificant.

The common index use scenarios are as follows.

  1. (1)

    Create indexes on columns that are frequently required to be searched and queried, which can speed up the search and query.

  2. (2)

Create an index on the column used as the primary key, which enforces the uniqueness of the column and organizes the arrangement of the data in the table.

  3. (3)

    Create indexes on columns that often use joins. These columns are mainly foreign keys, so the speed of association can be accelerated.

  4. (4)

    Create indexes on columns that often need to be searched based on ranges, because the indexes are already sorted and their specified ranges are contiguous.

  5. (5)

    Create indexes on columns that often need to be sorted, also because the indexes are already sorted, and these queries can shorten the query time of sorting by index sorting.

  6. (6)

    Create indexes on the columns that often use the WHERE clause to speed up the judgment of the condition.

The above scenarios allow the use of indexes but do not require it; whether an added index is actually used is determined by the database system itself.

However, creating more indexes has negative effects: indexes require additional space, and inserts into the base table become slower because the corresponding index entries must be inserted at the same time. Invalid indexes should therefore be dropped promptly to avoid wasting space.
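Both points, that the optimizer decides whether to use an index and that unused indexes should be dropped, can be observed directly. The sketch below uses SQLite's EXPLAIN QUERY PLAN as an illustrative stand-in for the target database's plan inspection tools (table, data and index name are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_no INTEGER, customer_id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, i % 100) for i in range(1000)])
# Index the frequently searched/joined column...
con.execute("CREATE INDEX ix_orders_customer ON orders (customer_id)")
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()
# The optimizer itself chose the index for this equality search:
uses_index = any("ix_orders_customer" in row[-1] for row in plan)
print(uses_index)
# ...and an index that turns out to be unused is dropped to save space:
con.execute("DROP INDEX ix_orders_customer")
```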

Other physicalization means are judged to be used according to the situation, for example, whether to further compress the data, whether to encrypt or desensitize the data, etc.

7.5.6 Using Modeling Software

During the design process, we typically use modeling software for both logical and physical modeling. Such software brings many benefits, such as forward generation of DDL and reverse engineering of an existing database, and supports the various needs that arise in modeling, so that the work can proceed efficiently.

Advantages of using modeling software for logical modeling and physical modeling are as follows.

  1. (1)

    Powerful and rich.

  2. (2)

    Forward DDL generation and reverse analysis of database.

  3. (3)

Free switching of views between the logical model and the physical model.

  4. (4)

    Comprehensive satisfaction of various requirements in modeling for efficient modeling.

The following are some of the commonly used modeling software.

  1. (1)

    ERwin's full name is ERwin Data Modeler, a data modeling tool from CA, which supports all major database systems.

  2. (2)

PowerDesigner is SAP's enterprise modeling and design solution. It uses a model-driven approach to align business and IT, helps deploy effective enterprise architecture, and provides powerful analysis and design techniques for R&D lifecycle management. PowerDesigner uniquely integrates multiple standard data modeling techniques (UML, business process modeling, and market-leading data modeling) with leading development platforms such as .NET, WorkSpace, PowerBuilder, Java and Eclipse, providing business analysis and standardized database design solutions for traditional software development lifecycle management.

  3. (3)

    ER/Studio is a set of model-driven data structure management and database design products that help companies discover, reuse and document data assets. It empowers data structures with the ability to fully analyze existing data sources through regressive database support, and design and implement high-quality database structures based on service requirements. The easy-to-read visual data structure facilitates communication between service analysts and developers at work. ER/Studio Enterprise enables enterprises and task teams to collaborate through a central repository.

  4. (4)

    dbeaver is a free, open source, universal database tool for developers and database administrators.

  5. (5)

    pgModeler is a dedicated modeling tool for PostgreSQL databases, developed using Qt and supporting Windows, Linux operating systems and OS X platforms, which uses the classic entity linkage diagram.

7.5.7 Physical Model Products

The products that should be output during the physical model design phase include the following:

  1. The physical data model itself, usually a project file of the modeling software used;

  2. The physical model naming convention, a standard that everyone on the project should follow;

  3. The design specification of the physical data model;

  4. The DDL table-creation statements for the target database.

7.6 Database Design Case

7.6.1 Scenario Description

This scenario involves a customer placing an order to purchase equipment. A sample order form is shown in Fig. 7.13. After purchasing the equipment, the customer needs to fill in the relevant information on the order form.

Fig. 7.13 Order form for a customer to purchase equipment

The current demand is to design the underlying database model according to this order form, taking into account the following three requirements.

  1. Record the relevant data information in the database.

  2. Enable querying of order information through the database system.

  3. Support statistical reports on sales volume.

7.6.2 Normalization

The entities and attributes that can be extracted from the order shown in Fig. 7.13 are: order number, order date, customer ID, customer name, contact information, ID number, customer address, part number, part description, part unit price, part quantity, part total price, and order total price.

If this information is placed directly into a single entity, so that the resulting design is one table covering all the information, then the part number, part description, part unit price, part quantity, and part total price form a repeating attribute group that appears multiple times in the entity, as shown in Fig. 7.14: Part Number 1, Part Description 1, Part Unit Price 1, Part Quantity 1, Part Number 2, Part Description 2, Part Unit Price 2, and so on. This situation does not satisfy the 1NF.

Fig. 7.14 List of extractable attributes

To resolve the repeating group of part information, the part-related attributes are extracted to form a separate entity; since each order contains several parts, the primary key of the new entity is the order number plus the part number, as shown in Fig. 7.15.


After eliminating the repeating group, which NF does the model now conform to?

Fig. 7.15 Order and order-part entities

The current model still contains partial dependencies in the part information, so normalization should continue in order to resolve them. Extract the attributes that depend only on the part number to form a new entity, the part entity, as shown in Fig. 7.16.


After eliminating the partial dependencies, which NF does the model now conform to?

Fig. 7.16 Elimination of partial dependencies

The problem with the current model is that the customer information depends on the customer ID, and the customer ID in turn depends on the order number. Such a dependency is transitive rather than direct, so a conversion from the 2NF to the 3NF has to be carried out to eliminate this transitive dependency.

To eliminate the transitive dependency, the customer information is extracted into a separate entity, the customer entity, as shown in Fig. 7.17.


After eliminating the transitive dependency, which NF does the model now conform to?

Fig. 7.17 Separate customer information

At this point, the logical model is essentially complete. Note, however, that the order total price and the part total price are derived fields, which strictly speaking do not meet the requirements of the 3NF, so they should be removed.

After normalization is completed, the entities of the 3NF model are obtained, with the primary keys and foreign keys marked in the two-dimensional tables, as shown in Tables 7.21 and 7.22.

Table 7.21 Order table
Table 7.22 Order-part table

Since the part total price attribute has been removed from the order-part table, obtaining the part total price now requires an operation on the order-part table, multiplying the part quantity by the sale price. The pseudo-SQL is as follows.

SELECT order_number, part_number, (sale_price * part_quantity) AS part_total_price FROM order_part;

If you now want to get the order total price, the pseudo SQL code is as follows.

SELECT order_number, SUM(sale_price * part_quantity) AS order_total_price FROM order_part GROUP BY order_number;
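These two statements can be exercised end to end, for example in SQLite; the English table and column names below are illustrative translations of the order-part table's attributes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative stand-in for the order-part table of the 3NF model.
conn.execute("""CREATE TABLE order_part (
    order_number  INTEGER,
    part_number   INTEGER,
    sale_price    REAL,
    part_quantity INTEGER,
    PRIMARY KEY (order_number, part_number))""")
conn.executemany(
    "INSERT INTO order_part VALUES (?, ?, ?, ?)",
    [(1, 101, 10.0, 2), (1, 102, 5.0, 4), (2, 101, 10.0, 1)])

# Part total price, computed per order line.
rows = conn.execute("""SELECT order_number, part_number,
                              sale_price * part_quantity AS part_total_price
                       FROM order_part
                       ORDER BY order_number, part_number""").fetchall()
print(rows)    # [(1, 101, 20.0), (1, 102, 20.0), (2, 101, 10.0)]

# Order total price: SUM() needs the accompanying GROUP BY.
totals = conn.execute("""SELECT order_number,
                                SUM(sale_price * part_quantity) AS order_total_price
                         FROM order_part
                         GROUP BY order_number
                         ORDER BY order_number""").fetchall()
print(totals)  # [(1, 40.0), (2, 10.0)]
```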

7.6.3 Data Types and Length

After the logical model design is complete, the physical model design begins. First, name the tables and fields according to an agreed convention, avoiding database keywords and applying consistent case rules. Then determine the data type of each field; for character fields, set the upper limit of the defined length according to the possible value range of the actual data. Finally, decide whether each field needs a non-null constraint, a unique constraint, or other constraints, as shown in Tables 7.23, 7.24, 7.25, and 7.26.

The samples in the above tables are only examples; adjust them as the actual situation requires in practice.


If the data type of the sale price is DECIMAL(5,2), what is its value range?

Table 7.23 ORDER table
Table 7.24 CUSTOMER table
Table 7.25 ORDER_ITEM table
Table 7.26 ITEM table
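As a sketch of how such decisions surface in DDL, the snippet below renders a hypothetical version of the CUSTOMER table; the column names, types, lengths, and constraints are assumptions for illustration, checked here with SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical DDL in the spirit of Table 7.24: explicit types and
# lengths, with NOT NULL and UNIQUE constraints chosen per field.
conn.execute("""CREATE TABLE CUSTOMER (
    Customer_ID   VARCHAR(20)  NOT NULL PRIMARY KEY,
    Customer_Name VARCHAR(100) NOT NULL,
    Phone_Number  VARCHAR(20),
    ID_Number     CHAR(18)     UNIQUE)""")

conn.execute("INSERT INTO CUSTOMER VALUES "
             "('C001', 'Alice', NULL, '110101199001010011')")

# Violating a NOT NULL constraint is rejected by the database.
try:
    conn.execute("INSERT INTO CUSTOMER VALUES ('C002', NULL, NULL, NULL)")
except sqlite3.IntegrityError as e:
    print(e)  # NOT NULL constraint failed: CUSTOMER.Customer_Name
```

Declaring the constraints in the DDL pushes data-quality checks down to the database itself, rather than leaving them to every application.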

7.6.4 Denormalization

The denormalization shown in Tables 7.27 and 7.28 solves some business problems by adding derived fields. For example, Total_Price stores the total price of a particular order, and Item_Total indicates the sales amount of a part within an order.

Table 7.27 Order table
Table 7.28 Order detail table

Whether to add further derived fields or perform other pre-join operations depends on the business problems to be solved, the computational complexity involved, and whether denormalization actually speeds up the queries concerned.


What is the average monthly sales for Q1? What are the top three parts by sales? You can further refine the derived fields on your own based on business questions like these.
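A minimal sketch of maintaining a derived field at write time (SQLite; the table and column names are illustrative, echoing Table 7.28):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized order-detail table: Item_Total is a derived field
# (Sale_Price * Item_Quantity) stored at write time.
conn.execute("""CREATE TABLE ORDER_ITEM (
    Order_ID INTEGER, Item_ID INTEGER,
    Sale_Price REAL, Item_Quantity INTEGER,
    Item_Total REAL)""")

def add_line(order_id, item_id, price, qty):
    # The application computes the derived value once on insert...
    conn.execute("INSERT INTO ORDER_ITEM VALUES (?, ?, ?, ?, ?)",
                 (order_id, item_id, price, qty, price * qty))

add_line(1, 101, 10.0, 2)
add_line(1, 102, 5.0, 4)

# ...so report queries read it directly instead of recomputing it.
rows = conn.execute(
    "SELECT Item_ID, Item_Total FROM ORDER_ITEM ORDER BY Item_ID").fetchall()
print(rows)  # [(101, 20.0), (102, 20.0)]
```

The cost is the risk that the stored value drifts out of step with its inputs, which is why denormalization is applied selectively rather than by default.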

7.6.5 Index Selection

Taking Tables 7.23 and 7.25 as examples, the results of adding indexes are shown in Tables 7.29 and 7.30, where some partition indexes and query indexes are added. There is no standard answer for index selection; it must be judged according to the actual scenario and data volume. For OLTP, every table needs a primary key, and if there is no natural primary key, a sequence-generated field can serve as a surrogate primary key. For an OLAP distributed database, each table additionally needs its distribution key chosen with care.

Table 7.29 Index selection (1)
Table 7.30 Index selection (2)
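The surrogate-key idea can be sketched as follows, with SQLite's AUTOINCREMENT standing in for the sequence object other systems provide; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A log-style table with no natural primary key: a sequence-like
# surrogate key (Log_ID) uniquely identifies each row instead.
conn.execute("""CREATE TABLE ORDER_LOG (
    Log_ID     INTEGER PRIMARY KEY AUTOINCREMENT,
    Order_ID   INTEGER NOT NULL,
    Event_Text VARCHAR(200))""")

# The database assigns the surrogate values automatically.
conn.execute("INSERT INTO ORDER_LOG (Order_ID, Event_Text) VALUES (1, 'created')")
conn.execute("INSERT INTO ORDER_LOG (Order_ID, Event_Text) VALUES (1, 'shipped')")

rows = conn.execute("SELECT Log_ID, Event_Text FROM ORDER_LOG").fetchall()
print(rows)  # [(1, 'created'), (2, 'shipped')]
```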

7.7 Summary

This chapter applies the New Orleans design methodology to database modeling and explains the four phases of requirement analysis, conceptual design, logical design, and physical design, with the tasks of each phase clearly stated. The significance of the requirement analysis stage is expounded, and the E-R approach is introduced for the conceptual design stage. For logical design, the important basic concepts and the 3NF are expounded, with each NF explained in depth through examples. For physical design, the denormalization means and the key working points are emphasized. The chapter concludes with a small practical case illustrating the main elements of logical and physical modeling.

7.8 Exercises

  1. [Single Choice] The next phase after the logical design phase in the New Orleans design methodology is ( ).
     A. Requirement analysis
     B. Physical design
     C. Conceptual design
     D. Logical design

  2. [Multiple Choice] In which of the following aspects is the efficiency of the database operating environment reflected? ( )
     A. Data access efficiency
     B. Time cycle of data storage
     C. Storage space utilization
     D. Efficiency of database system operation and management

  3. [Multiple Choice] Which of the following methods can be used in the process of requirement investigation? ( )
     A. Questionnaire survey
     B. Interviews with business personnel
     C. Sample data collection and data analysis
     D. Review of the User Requirement Specification

  4. [Multiple Choice] Which of the following options are included in the three elements of the E-R diagram in model design? ( )
     A.
     B.
     C.
     D.

  5. [Multiple Choice] The linkages between entities are ( ).
     A. One-to-one linkage (1:1)
     B. One-to-null linkage (1:0)
     C. One-to-many linkage (1:n)
     D. Many-to-many linkage (m:n)

  6. [True or False] An entity is a collection of real-world objects that have common attributes and can be distinguished from each other. For example, teachers, students, and courses are all entities, as shown in Fig. 7.2. ( )
     A. True
     B. False

  7. [Multiple Choice] The significance of normalized modeling in the logical model design process includes ( ).
     A. Improving the efficiency of database use
     B. Reducing redundant data
     C. Giving the model good scalability
     D. Reducing the possibility of data inconsistency

  8. [True or False] A model that satisfies the 3NF must satisfy the 2NF. ( )
     A. True
     B. False

  9. [Multiple Choice] Compared with the logical model, the physical model ( ).
     A. Strictly observes the 3NF
     B. Can contain redundant data
     C. Is mainly for database administrators and developers
     D. Can contain derived data

  10. [Multiple Choice] Which of the following are means of data denormalization? ( )
      A. Adding derived fields
      B. Creating a summary or temporary table
      C. Performing pre-joins
      D. Adding repeating groups

  11. [Multiple Choice] The effects of using indexes include ( ).
      A. Indexes take up more physical storage space
      B. With an index in effect, query efficiency can be greatly improved
      C. The efficiency of inserting into base tables is reduced
      D. Once an index is created, the database optimizer will definitely use it in queries

  12. [True or False] Because partitioning can reduce the I/O scan overhead during data queries, the more partitions created during the physicalization process, the better. ( )
      A. True
      B. False

  13. [True or False] The foreign key is the unique identifier that identifies each instance in an entity. ( )
      A. True
      B. False

  14. [True or False] The atomicity required by the 1NF means that each attribute is split to the smallest granularity and cannot be split further. ( )
      A. True
      B. False

  15. [True or False] A relation between entities exists only if a foreign key exists, and a relation between two entities cannot be established without a foreign key. ( )
      A. True
      B. False

  16. [Multiple Choice] In the process of building a logical model, which of the following fall within the scope of determining the attributes in an entity? ( )
      A. Defining the primary key of the entity
      B. Defining some of the non-key attributes
      C. Defining non-unique attribute groups
      D. Defining constraints on attributes

  17. [True or False] The data dictionary in the requirement analysis phase of the New Orleans design methodology has the same meaning as the data dictionary in a database product. ( )
      A. True
      B. False