Database technology is one of the earlier technologies developed in computer science, with nearly 60 years of history since its birth in the early 1960s. It has grown from the early stage of simply saving and processing data files into a rich, comprehensive discipline with data modeling and the DBMS at its core, and it now serves as the foundation and core of modern computer application systems. With the continuous advance of “Internet+”, big data, AI and data mining technologies in recent years, database technology and products are changing day by day. This chapter gives a brief introduction to the basic knowledge and concepts of databases.

1.1 Overview of Database Technology

Database technology is an effective technology used for data management. It studies how to manage data scientifically so as to provide people with shareable, secure and reliable data. It involves four important concepts, as shown in Fig. 1.1, which are introduced below.

Fig. 1.1 Overview of database technology

1.1.1 Data

Data refers to raw records that have not been processed. Generally speaking, such data is not clearly organized and classified, and thus cannot clearly express the meaning of what it represents. Data can be a pile of magazines, a stack of newspapers, the minutes of a meeting, or a copy of medical records. Early computer systems were primarily used for scientific calculation and dealt with numerical data, such as integers like 1, 2, 3, 4 and 5 and floating-point numbers like 3.14, 100.34 and −25.336.

In modern computer systems, the concept of data is generalized. In a broad sense, data includes numbers, text, graphics, images, audio, video and many other forms, all of which can be stored in the computer after digitization, as shown in Fig. 1.2.

Fig. 1.2 Information is stored in a computer after being digitized as data

In addition to its presentation, data also involves semantics, i.e., the meaning and implications of the data. Data and its semantics are closely related. For example, the number 88 can indicate that the total number of employees in a department is 88, that a student's score in a certain subject is 88, that the price of a product is 88 yuan, or that a person's weight is 88 kg.

1.1.2 Database

A database is a large, organized, shareable collection of data stored in a computer for a long time, with the following three characteristics.

  1. (1)

    Long-term storage: The database should provide a reliable mechanism to support long-term data storage, so that data can be recovered after a system failure and data loss is prevented.

  2. (2)

    Organization: Data should be organized, described and stored in a certain data model. Model-based storage endows data with less redundancy, higher independence and easy scalability.

  3. (3)

    Shareability: The data in the database is shared and used by all types of users rather than being exclusive to a single user.

The student information database shown in Fig. 1.3 should be accessible non-exclusively to different users such as students, teachers and parents simultaneously.

Fig. 1.3 Database

1.1.3 Database Management System

A database management system (DBMS) is system software located between the user and the operating system; it organizes and stores data scientifically and accesses and maintains data efficiently.

Like the operating system, the DBMS is also the basic software of the computer system, as shown in Fig. 1.4.

Fig. 1.4 Hierarchy diagram of computer system

The main functions of the DBMS are as follows.

  1. (1)

    Data definition. The DBMS provides a data definition language (DDL), through which the user can easily define the composition and structure of data objects in the database (a combined DDL and DML sketch follows this list of functions).

  2. (2)

    Data organization, storage and management. The DBMS is responsible for organizing, storing and managing data in a classified manner, involving data dictionaries, user data, data access paths, etc. The DBMS also has to determine which file structures and access methods to use to organize the data in storage space, and how to realize the linkages between the data. The basic goals of data organization and storage are to improve storage space utilization, facilitate data access, and provide a variety of access methods to improve access efficiency.

  3. (3)

    Data manipulation. The DBMS also provides a data manipulation language (DML), with which users can perform basic operations on data such as query, insertion, deletion and modification.

  4. (4)

    Transaction management and operation management of the database. The database is managed and controlled by the DBMS in a unified manner during its establishment, operation and maintenance to ensure the correct execution of transactions, the security and integrity of data, the concurrent use of data by multiple users, and system recovery after a failure.

  5. (5)

    Database establishment and maintenance. This function covers initial data input and conversion, database dump and recovery, database reorganization, performance monitoring and analysis, etc. These functions are usually implemented by certain programs or management tools.
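
To make the DDL and DML functions above concrete, here is a minimal Python sketch using the standard-library sqlite3 module (a lightweight relational DBMS bundled with Python); the employee table and its columns are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory example database
cur = conn.cursor()

# DDL: define the composition and structure of a data object (a table).
cur.execute("""
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        salary REAL
    )
""")

# DML: insert, modify, query and delete data.
cur.execute("INSERT INTO employee (emp_id, name, salary) VALUES (?, ?, ?)",
            (1, "Alice", 8800.0))
cur.execute("UPDATE employee SET salary = salary + 1000 WHERE emp_id = ?", (1,))
cur.execute("SELECT emp_id, name, salary FROM employee")
print(cur.fetchall())                # [(1, 'Alice', 9800.0)]
cur.execute("DELETE FROM employee WHERE emp_id = ?", (1,))

conn.commit()
conn.close()
```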

1.1.4 Database System

The database system (DBS) is a system for storing, managing, processing and maintaining data, composed of the database, the DBMS and its application development tools, applications, and database administrators.

In Fig. 1.5, the parts other than the user and the operating system are the components of the database system.

Though the operating system is not a component of the database system, the DBMS needs to call the interface provided by the operating system in order to access the database.

Fig. 1.5 Database System

1.2 History of Database Technology

1.2.1 Emergence and Development of Database Technology

Database technology emerged in response to the need of data management tasks. Data management refers to the classification, organization, coding, storage, retrieval and maintenance of data, which is the core of data processing.

The development of data management has gone through three stages, as shown in Fig. 1.6.

  1. (1)

    Manual management stage (from the emergence of computers to the mid-1950s). Before the mid-1950s, there was no software system responsible for data management. To perform data computation on a computer, programmers had to design their own programs: not only did the logical structure of the data have to be specified in the application, but the physical structure, including storage structures and access methods, also had to be designed. Thus, on the one hand, programmers had a very heavy workload; on the other hand, non-programmers were unable to use the computer system.

  2. (2)

    File system stage (from the late 1950s to the mid-1960s). In this phase, data was organized into separate data files, which were accessed by file name and read and written record by record, with support for opening, closing and accessing files provided by the file system.

  3. (3)

    Database system stage (from the late 1960s to present). In the late 1960s, database systems (dedicated software systems) emerged to allow large-scale data management. In this stage, hierarchical databases, mesh databases and the classic relational databases appeared successively as the times developed. In the last 20 years or so, new types of databases such as NoSQL and NewSQL databases have also emerged.

Fig. 1.6 Timeline of data management development

1.2.2 Comparison of the Three Stages of Data Management

A comparison of the three stages of data management is shown in Table 1.1.

Table 1.1 Comparison of the three stages of data management

Among the three stages, manual management is the most primitive stage, in which data is not shareable. A set of application-oriented data corresponds to one program; when multiple applications process the same data, each must define that data separately and cannot reuse the others' definitions, so there is a large amount of redundant data between programs. In addition, the data lacks independence: when the logical or physical structure of the data changes, the application must be modified accordingly, so the data is completely dependent on the application.

The file system stage supports sharing to some extent compared with the manual management stage, but such sharing is still poor because files remain application-oriented. In this phase, different applications must create their own files even if they use the same data; given the lack of file independence, the same data is stored repeatedly and data redundancy is high. Such a separate management approach is prone to data inconsistency.

The lack of file independence means that the file serves a specific application and the logical structure of the file is designed for this application. If the logical structure of the data changes, the definition of the file structure in the application must be modified, because the data depends on the application. In addition, files do not reflect the intrinsic linkage between things in the real world because they are independent of each other. From file system to database system, data management technology has made a leap.

In the database system stage, database technology is applied to data management on a large scale, and large-capacity disks and disk arrays begin to be used for data storage. Dedicated DBMSs have emerged that allow online real-time processing, distributed processing and batch processing. At this stage, data is well shared and less redundant, and data achieves a high level of physical independence and a certain degree of logical independence. The overall structure of the data can be described by a data model, and the database system is able to ensure data security and integrity and to provide concurrency control and data recovery.

1.2.3 Benefits of Database

The database delivers the following benefits.

  1. (1)

    Data structured as a whole. The data structure serves the whole organization, not a particular application. The structure of records and the links between records are maintained by the DBMS, thus reducing the workload of programmers.

  2. (2)

    High level of data sharing and easy expansion. Data can be shared by multiple applications, reducing data redundancy and saving storage space. Data sharing also avoids incompatibility and inconsistency between data. Expansion is easy because the data is structured with the overall needs of the system in mind; such a resilient, easily expandable database system can meet a variety of requirements.

  3. (3)

    Strong data independence. In terms of physical independence, the physical storage characteristics of data are managed by the DBMS and can be ignored by the application; the application only needs to deal with the logical structure of the data and does not need to change when the physical storage characteristics change. In terms of logical independence, the application can remain unchanged when the logical structure of the data in the database changes. Data independence simplifies application development and greatly reduces the complexity of applications. Making data independent of the application in fact decouples the two, whereas the original strongly coupled approach suffered from poor flexibility, a heavy development workload and heavy maintenance tasks.

  4. (4)

    Unified management and control. The database system facilitates users to manage and control data in a unified manner, including data security protection, data integrity checking, concurrency control, data recovery, etc. Data security protection refers to the protection of data to prevent data leakage or damage caused by unlawful use. Data integrity checking refers to checking the correctness, validity and uniformity of data. It controls the data within valid limits and ensures that certain relations are satisfied between the data. Concurrency control refers to the control and coordination of concurrent access operations by multiple users when they access the database at the same time, so as to avoid interfering with each other and affecting the results obtained from the access. Data recovery refers to the function that the DBMS restores the database from an error state to a known correct state when the database system has hardware failure, software failure, operation error, etc.

1.2.4 Development Characteristics of the Database

The database has become one of the important foundations and core technologies of computer information systems and intelligent application systems, as shown in Fig. 1.7.

Fig. 1.7 Applications and related technologies and models of database systems

The development of database systems presents the following three characteristics.

  1. (1)

    Database development centers on the development of data models. The data model is the core and foundation of the database system, so the development of database systems and the development of data models are inseparable, and the data model is an important criterion for classifying database systems.

  2. (2)

    Intersection and combination with other computer technologies. With new computer technologies constantly emerging, intersection and combination with them has become a significant feature of database system development, for example the distributed database arising from the combination with distributed processing technology, and the cloud database from the combination with cloud technology.

  3. (3)

    Development of new database technology for application fields. General-purpose databases cannot meet the application requirements of specific fields, so dedicated database systems need to be developed according to the specific requirements of the relevant fields.

1.2.5 Hierarchical Model, Mesh Model and Relational Model

The hierarchical model, mesh model and relational model are the three classical data models that have emerged throughout history.

  1. 1.

    Hierarchical model

    The hierarchical model presents a tree-like data structure, as shown in Fig. 1.8. It has two very typical features, as follows.

    1. (1)

      There is one and only one node that has no parent node; this node is called the root node.

    2. (2)

      Each node other than the root node has one and only one parent node. This hierarchical model is often used in common organizational structures.

  2. 2.

    Mesh model

    The mesh model has a data structure similar to a network diagram, as shown in Fig. 1.9. In the mesh model diagram, E represents an entity and R represents the relation between entities. In the mesh model, more than one node is allowed to have no parent node, and a node can have more than one parent node. As shown in Fig. 1.9, E1 and E2 have no parent node, while E3 and E5 each have two parent nodes. The mesh model is able to represent the many-to-many relations that are common in reality, such as students choosing courses and teachers teaching them.

  3. 3.

    Relational model

    A strict concept of relation is the basis of the relational model; the relation must be normalized, and each component of the relation must be an indivisible data item, as shown in Fig. 1.10.

    Explanation

    In 1970, Dr. Edgar Frank Codd, a researcher at IBM, published a paper entitled “A Relational Model of Data for Large Shared Data Banks” in Communications of the ACM. He introduced the concept of the relational model and laid its theoretical foundation. Dr. Codd later published several articles on normal form theory and 12 criteria for measuring relational systems, which grounded the relational model in mathematical theory.

    Built on set algebra, the relational model consists of a set of relations, each with a normalized two-dimensional table as its data structure. As with the student information table shown in Fig. 1.10, a relation usually corresponds to a table: a tuple is a row in the table, and an attribute is a column in the table; a key is also called a code, and a domain is a set of values of the same data type. A relation schema is written as relation name (Attribute 1, Attribute 2, ..., Attribute n); the relation in the example is written as student (student number, name, age, gender).

  4. 4.

    Comparison of the Three Models

    Table 1.2 shows a comparison of the three models.

    Both the mesh model and the hierarchical model are formatted models, whose data structures take the basic hierarchical linkage as the basic unit. A basic hierarchical linkage refers to two records and the one-to-many (including one-to-one) linkage between them. In the formatted models, entities are represented as records, the attributes of entities correspond to the data items (or fields) of the records, and the linkages between entities are converted into linkages between records. Data linkages are reflected by access paths, in the sense that any given record value can only be viewed via its access path, and no “child” record value can exist independently of its parent record value. The relational model, on the other hand, reflects data linkages not by access paths but by associations between values, so it is better able to reflect the linkages between things in the real world.

    The hierarchical and mesh models are efficient in querying because the application usually uses pointers to link the data, so record values can be found quickly by following the paths the pointers indicate. Despite this efficient access, the hierarchical and mesh models are not easy for the average user, because queries generally have to be written procedurally in a programming language, which makes for a poor user experience. In the early days, the query efficiency of the relational model was relatively low; however, with the development of hardware, this deficiency was gradually overcome and compensated for by the high flexibility and independence of the relational model. The structured query language provided by the relational model greatly reduces the development workload for programmers and lowers the threshold of use for general users. The relational model was therefore able to quickly replace the hierarchical and mesh models and become the dominant data model.

    Each tuple in a relational database should be distinguishable and unique, which is achieved through entity integrity. Relational databases enforce the following three types of integrity (a sketch in code follows this list).

    1. (1)

      Entity integrity: Simply put, the primary key cannot be null.

    2. (2)

      Referential integrity: Simply put, it is the linkage between primary and foreign keys.

    3. (3)

      User-defined integrity: a constraint specific to the application, such as requiring a value to be unique.
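
As a hedged sketch of how these three integrity rules can appear in table definitions, the following Python code again uses the built-in sqlite3 module; the student and course_choice tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")     # enable foreign-key checks in SQLite

# Entity integrity: the primary key (student_no) can never be null.
conn.execute("""
    CREATE TABLE student (
        student_no INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        id_card    TEXT UNIQUE              -- user-defined integrity: a unique value
    )
""")

# Referential integrity: course_choice.student_no must refer to an existing student.
conn.execute("""
    CREATE TABLE course_choice (
        student_no INTEGER REFERENCES student(student_no),
        course     TEXT
    )
""")

conn.execute("INSERT INTO student VALUES (1, 'Zhang San', 'ID0001')")
conn.execute("INSERT INTO course_choice VALUES (1, 'Database')")      # accepted

try:
    # Violates referential integrity: student 99 does not exist.
    conn.execute("INSERT INTO course_choice VALUES (99, 'Database')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```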

Fig. 1.8 Hierarchical model

Fig. 1.9 Mesh model

Fig. 1.10 Relational model

Table 1.2 Comparison of hierarchical model, mesh model, and relational model

1.2.6 Structured Query Language

Structured query language (SQL) is a high-level, non-procedural programming language that allows users to work on high-level data structures without specifying data storage methods or understanding specific data access methods. It allows relational databases with completely different underlying structures to use the same SQL as an interface for data manipulation and management, which is why SQL has become, and remains, the de facto universal language for relational databases. Given the large user base of SQL, many NoSQL products also provide SQL-compatible interfaces to make them easier for a wide range of users to adopt.

SQL can not only be nested but also support procedural programming through advanced objects; it is highly flexible and functionally rich, and is recognized as the de facto universal language standard for relational databases. The development timeline of the SQL standard is shown in Fig. 1.11.

Fig. 1.11 Development timeline of SQL standard

1.2.7 Characteristics of Relational Databases

The ACID characteristics of relational databases are as follows.

  1. (1)

    Atomicity. The transaction is the logical unit of work of the database; the operations in a transaction are either all performed or none of them are.

  2. (2)

    Consistency. The result of executing a transaction must be to move the database from one consistent state to another consistent state. For example, if User A transferring $100 to User B is a transaction, it must be ensured that Account A is reduced by $100 and Account B is increased by $100 at the same time; there must be no state in which Account A has been debited but Account B has not been credited (see the sketch after this list).

  3. (3)

    Isolation. The execution of a transaction in the database cannot be interfered with by other transactions, that is, the internal operation of a transaction and the use of data are isolated from other transactions; multiple transactions subject to concurrent execution cannot interfere with each other.

  4. (4)

    Durability. Once a transaction is committed, its changes to the data in the database are permanent. Subsequent operations or failures will not have any effect on the result of the committed transaction.
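
The transfer example from the consistency item above can be sketched with Python's sqlite3 module; the account table and the amounts are purely illustrative, and the `with conn:` block relies on sqlite3 wrapping it in a transaction that is committed on success and rolled back on error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (owner TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 500), ("B", 200)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    try:
        with conn:   # committed if the block succeeds, rolled back if it raises
            conn.execute("UPDATE account SET balance = balance - ? WHERE owner = ?",
                         (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE owner = ?",
                         (amount, dst))
            # Atomicity: either both updates take effect or neither does,
            # so the total balance never changes (consistency).
    except sqlite3.Error:
        pass         # the rollback has already restored the previous state

transfer(conn, "A", "B", 100)
print(conn.execute("SELECT * FROM account ORDER BY owner").fetchall())
# [('A', 400), ('B', 300)]
```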

1.2.8 Historical Review of Relational Database Products

The introduction of the relational model was an epochal and significant event in the history of database development. The great success in the research of relational theory and the development of relational DBMS has further promoted the development of relational database. The last 40 years have been the most “glorious” years for relational databases, during which many successful database products have been born, having a great impact on the development of society and our life. Some of the relational database products are shown in Fig. 1.12.

  1. (1)

    Oracle is the database product of Oracle Corporation and one of the most popular relational databases in the world. In 1977, Larry Ellison and his colleague Bob Miner founded Software Development Labs (SDL), where they developed the first version of Oracle in assembly language based on a paper published by Dr. Codd (released to the public in 1979).

    Oracle's success can be attributed to the following factors.

    1. (a)

      High openness. It supported running on all mainstream platforms at that time, fully complied with various industry standards and was highly compatible.

    2. (b)

      High security. It provided multiple security protections, including features for assessing risk, preventing unauthorized data leakage, detecting and reporting database activity, and enforcing data access control in the database through data-driven security.

    3. (c)

      Strong performance. On open platforms, the database has been a perennial leader in professional benchmark results. From the 1980s to the 1990s, it consistently followed and led the technical trends of relational databases. In addition, Oracle released the Oracle Exadata all-in-one product in 2009 to compete for the high-end online analytical processing (OLAP) market.

  2. (2)

    Teradata is a database product launched by Teradata Corporation of the US. Its database computer DBC/1012, released in 1984, was the first dedicated database platform with a massively parallel processing (MPP) architecture. In its early days, the Teradata database was primarily sold as an all-in-one machine positioned as a large data warehouse system. Proprietary software and hardware gave it excellent OLAP performance, but it was very expensive.

  3. (3)

    DB2 is the database product of IBM and the main relational database product promoted by IBM. At the beginning it served only IBM mainframes and minicomputers, and then started to support Windows, UNIX and other platforms in 1995. It is named DB2 because DB1 was a hierarchical database.

  4. (4)

    Ingres was originally a relational database research project initiated by the University of California, Berkeley in 1974, and the code of Ingres was available for free, so much commercial database software was produced based on it, including Sybase, Microsoft SQL Server, Informix, and the successor project PostgreSQL. It can be said that Ingres is one of the most influential computer research projects in history.

  5. (5)

    Informix was the first commercial Ingres product to appear in 1982, but was later acquired by IBM in 2000 due to management failures by its owner. The source code of Informix was then licensed to GBASE from China, which developed the Chinese-made Gbase 8t product on the basis of its source code.

  6. (6)

    Sybase is a database product of Sybase Inc. The company was founded in 1984, named after the combination of the words “System” and “Database”, and one of its founders, Bob Epstein, was one of the main designers of Ingres. Sybase first proposed and implemented the idea of the Client/Server database architecture. The company began working with Microsoft in 1987 to develop the Sybase SQL Server product. After the termination of the partnership, Microsoft continued to develop the MS SQL Server and Sybase continued to develop the Sybase ASE. Its subsequent relational database, Sybase IQ, designed especially for data warehousing, was a highly successful columnar database. In May 2010, Sybase was acquired by the German company SAP.

  7. (7)

    MySQL 1.0, the internal version of the MySQL database product, was released in 1996, and MySQL 3.11.1 was released in October of the same year. MySQL is allowed to be distributed commercially for free, but may not be bundled with other commercial products. MySQL was acquired by Sun in January 2008, and the latter was acquired by Oracle in 2009, so MySQL is now an Oracle product, but is still available as a free open source product.

  8. (8)

    The PostgreSQL database was born in 1989, inheriting many ideas from Ingres; its SQL engine was modified and the project was formally released to the community in 1995. Greenplum, Netezza, Amazon Redshift and GaussDB (DWS) are all databases developed on the basis of PostgreSQL versions.

  9. (9)

    Greenplum and Netezza are both distributed databases that adopt the MPP architecture and are based on PostgreSQL 8.x. Greenplum, offered as a pure software product, was acquired by EMC Corporation and, together with other products, formed the Pivotal family, in which Greenplum is the relational database product.

  10. (10)

    Netezza is a software and hardware all-in-one product with proprietary hardware optimization technology, which was later acquired by IBM.

  11. (11)

    Aster Data is a relational database product based on Greenplum, similar to Greenplum, with the main feature of providing SQL-based data discovery algorithms and powerful statistical analysis functions. The product was later acquired by Teradata Corporation.

  12. (12)

    Amazon Redshift is a cloud-based relational database from Amazon, developed based on PostgreSQL.

  13. (13)

    SAP HANA is SAP's self-developed in-memory database product, using columnar storage, data compression and parallel processing technologies.

  14. (14)

    Vertica is a columnar database suitable for OLAP scenarios.

Fig. 1.12 Some of the relational database products

1.2.9 Other Data Models

With the expansion of the database industry and the diversification of data objects, the traditional relational database model begins to reveal many weaknesses, such as poor identification capability for complex objects, weak semantic expression capability, and poor processing capability for data types such as text, time, space, sound, image and video. For example, multimedia data are basically stored as binary data streams in relational databases, but for binary data streams, the generic database has poor identification capability and poor semantic expression capability, which is not conducive to retrieval and query.

In view of this, many new data models have been proposed to adapt to the new application requirements, specifically the following.

  1. (1)

    Object oriented data model (OODM). This model combines the semantic data model with object-oriented programming methods, using a series of object-oriented methods and new concepts as its basis. However, the OODM operation language is too complex, which increases the burden of system upgrades for enterprises, and users find such a complex mode of use hard to accept. As a result, OODM has not been as universally accepted as the relational database, except in some specific application markets.

  2. (2)

    XML data model. With the rapid development of the Internet, a large number of semi-structured and unstructured data sources have appeared. Extensible markup language (XML) has become a common data model for exchanging data on the Internet and a hot spot of database research, and an XML data model for semi-structured data has been derived accordingly. A pure XML database, based on the XML node tree model, supports XML data management, but it likewise has to solve the various problems faced by traditional relational databases.

  3. (3)

    RDF data model. Information on the Internet lacks a unified representation, so the World Wide Web Consortium (W3C) proposed the resource description framework (RDF) to describe and annotate Internet resources. RDF is a markup language for describing Internet resources, with the triple of resource (subject), attribute (predicate) and attribute value (object) as its basic structure. Such a triple is also called a statement. An attribute value can be either a resource or a literal (a literal can only be an atomic value, such as a number or a date), and an attribute describes the relationship between the resource and the attribute value. A statement can also be represented as a graph: a directed edge points from the statement's resource to the attribute value, with the attribute on the edge; the attribute value of one statement can be the resource of another statement.
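
To make the triple structure concrete, here is a small Python sketch that stores a few hypothetical statements as (subject, predicate, object) tuples; the example.org resources and values are invented for illustration.

```python
# Each statement is a (subject, predicate, object) triple. An object may itself be
# a resource (and so appear as the subject of other triples) or a literal value.
triples = [
    ("http://example.org/book1", "title",  "A Relational Model of Data"),
    ("http://example.org/book1", "author", "http://example.org/codd"),
    ("http://example.org/codd",  "name",   "Edgar Frank Codd"),
    ("http://example.org/codd",  "born",   1923),
]

# Walk the graph: which predicates and objects hang off book1?
for subject, predicate, obj in triples:
    if subject == "http://example.org/book1":
        print(predicate, "->", obj)
```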

1.2.10 New Challenges for Data Management Technologies

Although new data models are emerging from time to time, all of them have failed to replace the relational database model as the common basic model for database products due to problems such as lack of convenience and generality.

New challenges for data management technologies are as follows.

  1. (1)

    With the automation, diversification and intelligence of data acquisition means, the volume of data is soaring, so databases need to provide a high degree of scalability and elasticity.

  2. (2)

    The ability to deal with diverse data types is needed. Data can be classified into structured, semi-structured and unstructured data, including text, graphics, images, audio, video and other multimedia data, stream data, queue data, etc. Diverse data types require database products to develop the ability to handle multiple data types and to process heterogeneous data.

  3. (3)

    The development of sensing, network and communication technologies has put forward higher requirements for data acquisition and processing in real-time.

  4. (4)

    At the advent of the era of big data, data problems such as massive heterogeneity, complicated forms, high growth rate, and low value density have posed comprehensive challenges to traditional relational databases. NoSQL technology has flourished in response to the needs of big data development. Big data has 4V characteristics, as shown in Fig. 1.13.

Fig. 1.13 Big data's 4V characteristics

The 4Vs are Volume (huge data volume), Variety (rich data types), Velocity (fast generation speed) and Veracity (widely varying veracity). Volume: the volume of data covered by big data processing is huge, having risen from the traditional terabyte level to the petabyte level. Variety: big data processing involves a wide range of data types; in addition to traditional structured data, there are Internet web logs, videos, pictures, geolocation information and so on, and semi-structured and unstructured data also need to be processed. Velocity: high processing speed is particularly significant in Internet of Things (IoT) applications, which require real-time processing. Veracity: big data processing pursues high-quality data, i.e., mining valuable data from massive, noisy data; because the value density of the data is low, high-value information must be mined from a large amount of low-value data.

1.2.11 NoSQL Database

To meet the challenges of the big data era, new models and technologies have sprung up, typically NoSQL database technology. The term first emerged in 1998, referring to a lightweight, open-source, non-relational database that did not provide SQL functionality. Around 2009 the concept returned, but with a completely different meaning from the original: the NoSQL widely accepted today stands for “Not Only SQL”, i.e., technology that is no longer limited to SQL.

Many different types of NoSQL database products have been created over the years; although they have different characteristics, their common features are that they are non-relational and distributed and are not guaranteed to satisfy the ACID properties.

NoSQL databases have the following three technical features.

  1. (1)

    Partitioning of data (Partition). Data can be distributed across multiple nodes of a cluster and then processed in parallel on a large number of nodes to achieve high performance; partitioning also makes it easy to scale the cluster horizontally.

  2. (2)

    Relaxation of the ACID consistency constraints. Based on the BASE principle, NoSQL accepts eventual consistency and allows temporary inconsistency.

    Explanation

    The BASE principle contains the following 3 levels of meaning.

    Basically available: Short-term data unavailability is tolerated, and no emphasis is placed on 24/7 service.

    Soft state: an intermediate, not-yet-synchronized state is allowed for a period of time.

    Eventual consistency: the data is required to become consistent eventually, rather than being strictly consistent at all times.

  3. (3)

    Backup for each data partition. The general principle of triple backup is followed to cope with node failures and improve system availability: three copies of the data are kept, one on the current node, one on another node in the same rack, and one on a node in a different rack, guarding against both node failure and rack failure. The more backups there are, the greater the data redundancy; weighing security against redundancy, triple backup is generally regarded as the most reasonable setting (a placement sketch follows this list).
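
A rough sketch, under simplified assumptions, of the triple-backup placement rule described above: one copy stays on the local node, one goes to another node in the same rack, and one goes to a node in a different rack. The cluster layout is hypothetical.

```python
import random

# Hypothetical cluster layout: rack id -> nodes in that rack.
CLUSTER = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(local_node, local_rack):
    """Choose three nodes for one data partition following the triple-backup rule."""
    first = local_node                                        # copy 1: local node
    second = random.choice(                                   # copy 2: same rack,
        [n for n in CLUSTER[local_rack] if n != local_node])  #         different node
    other_rack = random.choice([r for r in CLUSTER if r != local_rack])
    third = random.choice(CLUSTER[other_rack])                # copy 3: different rack
    return [first, second, third]

print(place_replicas("node1", "rack1"))   # e.g. ['node1', 'node3', 'node5']
```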

The four common types of NoSQL database technologies are divided by storage model, including key-value database, graph database, column family database, and document database, as shown in Fig. 1.14.

Fig. 1.14 Four common types of NoSQL database technologies

Table 1.3 briefly introduces the main NoSQL databases. Key-value databases are generally implemented with hash tables that map a key to a value; keeping the keys in memory enables extremely efficient key-based query and write operations, which suits application scenarios such as caching user information, session information, configuration files and shopping carts (a toy sketch follows Table 1.3). Column family databases, document databases and graph databases each have their own characteristics, but since this book is mainly concerned with relational databases, they are not covered here.

Table 1.3 Brief introduction of the main NoSQL databases
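
The hash-table idea behind key-value databases can be sketched with a toy in-memory store; the session-cache usage below is only illustrative.

```python
class TinyKeyValueStore:
    """An in-memory key-value store backed by a Python dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value                 # write by key

    def get(self, key, default=None):
        return self._data.get(key, default)    # read by key

    def delete(self, key):
        self._data.pop(key, None)

# Caching session information, one of the scenarios mentioned above.
cache = TinyKeyValueStore()
cache.put("session:1001", {"user": "alice", "cart": ["book", "pen"]})
print(cache.get("session:1001"))
```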

NoSQL was not created to replace the relational DBMS (RDBMS); it has both significant advantages and disadvantages, and it is designed to work alongside the RDBMS to build a complete data ecosystem.

1.2.12 NewSQL Database

Since the introduction of NoSQL, its high scalability has been widely recognized; if this kind of scalability is applied to traditional databases, their scalability can be greatly enhanced. Therefore, relational databases that combine the scalability of NoSQL with support for the relational model have been developed. This new type of database is mainly oriented to online transaction processing (OLTP) scenarios, which demand high speed and high concurrency, and it uses SQL as the primary language, hence the name NewSQL database.

“NewSQL” is only a description of this class of systems, not an officially defined name. A NewSQL database is a relational database system that supports the relational model (including the ACID properties) while achieving the scalability of NoSQL; it is mainly oriented to OLTP scenarios and uses SQL as the primary language.

The classification of NewSQL databases is as follows.

  1. (1)

    Databases re-constructed with new architecture.

    They may adopt technical architectures such as multi-node concurrency control, distributed processing, replication-based fault tolerance, and flow control. Such products include Google Spanner, H-Store, VoltDB, etc.

  2. (2)

    Databases adopting the middleware technology of transparent shard.

    The generation of data shards is transparent to users, and users do not need to make changes to their applications.

    These products include Oracle MySQL Proxy, MariaDB MaxScale, etc.

  3. (3)

    Database as a Service (DaaS).

    The database products provided by cloud service providers are generally such databases with NewSQL features.

    Such products include Amazon Aurora, Alibaba Cloud's Oceanbase, Tencent Cloud's CynosDB, Huawei's GaussDB (DWS) and GaussDB (for MySQL).

1.2.13 Database Ranking

Like programming languages, databases have popularity rankings, which are updated monthly and include overall database rankings and rankings under database types, such as special rankings for relational databases, key-value databases, temporal databases, graph databases, and so on, as shown in Fig. 1.15. The figure shows that in August 2019, the top 3 are firmly held by traditional relational databases; four NoSQL databases appear in the top 10; and the top 20 are generally evenly split between relational and NoSQL databases. Relational databases are extending their functionality and features.

Fig. 1.15 Database ranking

1.3 Architecture of Relational Databases

1.3.1 Development of Database Architecture

In the early days, when data sizes were not too large, the database system used a very simple stand-alone service: database software was installed on a dedicated server to provide external data access services. However, as the business expands, the data size in the database and the pressure on the database both grow, which requires the database architecture to change accordingly. The architecture classification shown in Fig. 1.16 distinguishes database architectures according to the number of hosts.

Fig. 1.16 Database architecture classification by number of hosts

An architecture with only one database host is a single-host architecture, while an architecture with more than one database host is a multi-host architecture. Within the single-host architecture, the combined single-host deployment places both the database application and the database on the same host, whereas the stand-alone database host deploys them separately, with the database occupying a dedicated database server. The multi-host architecture enhances the availability and service capability of the overall database service by increasing the number of servers. It can be divided into two models according to whether data shards are generated. One is the group architecture, in which, depending on the role each server plays, the servers are further divided into master-standby, master-slave and multi-master architectures; regardless of the grouping method, the databases share the same structure and store exactly the same data, essentially replicating data between multiple databases with synchronization techniques. The other is the sharding architecture, which spreads data shards across different hosts through a certain mechanism.

1.3.2 Single-Host Architecture

To prevent application services and database services from competing for resources, the single-host architecture evolved from the earlier combined single-host model to a stand-alone database host, which separates application services from data services. On the application side, the number of servers can be increased to balance the load and thus enhance the concurrency capability of the system. Single-host deployment offers flexibility and ease of deployment in R&D, learning and simulation environments, as shown in Fig. 1.17.

Fig. 1.17 Single-host architecture

The LAMP (Linux, Apache, MySQL and PHP) architecture of the early Internet is a typical single-host architecture, with the following obvious shortcomings.

  1. (1)

    Poor scalability. The single-host architecture only supports vertical expansion, improving performance by increasing the hardware configuration, but there is an upper limit to the hardware resources that can be configured on a single host.

  2. (2)

    Single point of failure. Expanding the single-host architecture often requires downtime, during which the service is suspended. In addition, a hardware failure can easily make the entire service unavailable and may even cause data loss.

  3. (3)

    As business expands, the single-host architecture is bound to encounter performance bottlenecks.

1.3.3 Group Architecture: Master-Standby Architecture

The master-standby architecture in the group architecture grew out of the single-host architecture to solve the single point of failure problem, as shown in Fig. 1.18.

Fig. 1.18 Group architecture—master-standby architecture

The database is deployed on two servers: the server that undertakes the data read/write service is called the host, and the other server, the standby, is used as a backup that copies data from the host through a data synchronization mechanism. Only one server provides data services at any given time.

This architecture has the advantage that the application does not require additional development to cope with database failures, plus it improves data fault tolerance compared to a stand-alone architecture.

The disadvantage is the waste of resources: the standby has the same configuration as the host, but its resources sit largely idle. In addition, the performance pressure is still concentrated on a single server, so the performance bottleneck is not addressed. When a failure occurs, switching between the host and the standby requires some manual intervention or monitoring. In short, this model only addresses data availability; performance is still limited by the hardware configuration of a single server and cannot be improved by adding servers.

1.3.4 Group Architecture: Master-Slave Architecture

The deployment model of the master-slave architecture is similar to that of the master-standby architecture, except that the standby is promoted to a slave role and provides certain data services. The application can adopt read/write separation, which requires the development model to be adjusted accordingly: the write operations (insert, modify and delete) must be performed on the write library (host), while query requests (read operations) are assigned to the read library (slave), as shown in Fig. 1.19; a simple routing sketch follows the figure.

Fig. 1.19 Group architecture—master-slave architecture
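
A simplified sketch of read/write separation at the application layer: statements that modify data go to the master (write library), while queries are balanced across the slaves (read libraries). Connection handling is omitted and the node names are hypothetical.

```python
import itertools

class ReadWriteRouter:
    """Route SQL statements to the master or to one of the slaves."""

    WRITE_PREFIXES = ("INSERT", "UPDATE", "DELETE")

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = itertools.cycle(slaves)    # round-robin over the read replicas

    def route(self, sql):
        if sql.lstrip().upper().startswith(self.WRITE_PREFIXES):
            return self.master                   # writes always go to the master
        return next(self.slaves)                 # reads are load-balanced

router = ReadWriteRouter("master-db", ["slave-db-1", "slave-db-2"])
print(router.route("INSERT INTO t VALUES (1)"))  # master-db
print(router.route("SELECT * FROM t"))           # slave-db-1
print(router.route("SELECT * FROM t"))           # slave-db-2
```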

This architecture brings the benefit of improved resource utilization and is suitable for application scenarios with many read operations and few write operations. In addition, read load can be balanced among multiple slaves in scenarios with highly concurrent reads. Slaves can be flexibly expanded, and the expansion generally does not affect the service.

However, the master-slave architecture also has the following disadvantages. First, data latency: there is a delay when synchronizing data to the slave, so the application must be able to tolerate brief inconsistencies, which makes it unsuitable for scenarios with very high consistency requirements. Second, the performance pressure of write operations is still concentrated on the host. Third, availability problems: when the host fails and a slave must take over, manual intervention takes time to respond, and implementing automatic switching is complex.

1.3.5 Group Architecture: Multi-Master Architecture

The multi-master architecture is also called the active-active or multi-active architecture, in which the database servers act as master and slave to each other and simultaneously provide complete data services, as shown in Fig. 1.20.

Fig. 1.20 Group architecture—multi-master architecture

The advantage of the multi-master architecture is higher resource utilization and a reduced risk of single point of failure. The disadvantage is that, since both hosts accept writes, bi-directional data synchronization must be implemented; bi-directional replication brings latency issues and, in extreme cases, even the risk of data loss must be considered. Moving from two hosts to more hosts further complicates data synchronization as the number of databases grows, so the dual-host model is more common in practical applications.

1.3.6 Shared Disk Architecture

Next, we introduce a special kind of multi-master architecture: the shared disk architecture. In this architecture, the database servers share the stored data, and load balancing is achieved across multiple servers, as shown in Fig. 1.21.

Fig. 1.21 Shared disk architecture

The advantage of shared disk is that multiple servers can provide highly available services at the same time, thus achieving a high level of availability and scalability as a whole and avoiding a single point of failure of server clusters. This architecture supports easy horizontal scaling, which in turn enhances the parallel processing capability of the overall system.

The disadvantage is that the technology is quite difficult to implement. In addition, when the memory interface bandwidth reaches saturation, adding nodes does not yield higher performance, and storage I/O can easily become a bottleneck that affects overall system performance.

1.3.7 Sharding Architecture

The sharding architecture is primarily a horizontal data sharding architecture: a sharding scheme spreads the data across multiple nodes, and each shard holds a part of the data. All nodes in this architecture share the same database structure, the data in different shards do not intersect, and the union of all shards forms the complete data set. Common sharding algorithms are based on list values, range intervals and hash values, as shown in Fig. 1.22; a sketch of the three algorithms follows the figure.

Fig. 1.22 Sharding architecture
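
A brief sketch of the three sharding algorithms named above (list, range and hash), applied to a hypothetical four-node cluster; the mapping rules are invented for illustration.

```python
NODES = ["node0", "node1", "node2", "node3"]

def shard_by_list(region):
    """List sharding: an explicit mapping from a column value to a node."""
    mapping = {"north": "node0", "south": "node1", "east": "node2", "west": "node3"}
    return mapping[region]

def shard_by_range(order_id):
    """Range sharding: contiguous key intervals go to the same node."""
    return NODES[min(order_id // 1_000_000, len(NODES) - 1)]

def shard_by_hash(key):
    """Hash sharding: the hash of the key decides the node."""
    return NODES[hash(key) % len(NODES)]

print(shard_by_list("south"))        # node1
print(shard_by_range(2_500_000))     # node2
print(shard_by_hash("customer:42"))  # some node; keys spread evenly on average
```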

The advantage of this architecture is that the data is scattered across the nodes of the cluster and each node can work independently, giving full play to the parallelism of the cluster.

1.3.8 Shared-Nothing Architecture

The shared-nothing architecture is a completely non-shared architecture in which each node (processing unit) of the cluster has its own independent CPU, memory and external storage and shares no resources with other nodes. Each node processes its own local data, and the results can be aggregated upward or exchanged among the nodes through communication protocols. The nodes are independent of each other and highly scalable, so the whole cluster obtains strong parallel processing capability, as shown in Fig. 1.23.

Fig. 1.23 Shared-nothing architecture

Hardware has evolved to the point where a node or a physical host can accommodate multiple processing units, so the smallest unit of the architecture may not be a physical host, but a logical virtual processing unit. For example, for a physical host with a quad-core CPU, four database instances can be deployed, which is equivalent to having four processing units.

1.3.9 Massively Parallel Processing Architecture

The massively parallel processing (MPP) architecture spreads tasks in parallel across multiple servers and nodes. After the computation on each node is completed, the results of each part are aggregated into the final result, as shown in Fig. 1.24.

Fig. 1.24 MPP architecture

The MPP architecture is characterized by parallel task execution and distributed computation. Two minor variants exist: the non-shared host architecture and the shared host architecture. In the non-shared host architecture, all nodes are peers, and data can be queried and loaded through any node; this generally avoids performance bottlenecks and single-point risks, but the technical implementation is more complex.
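
A much-simplified sketch of the scatter-gather idea behind MPP: each node aggregates its own data shard in parallel, and the partial results are then combined into the final answer. The shards and the use of a local process pool stand in for real cluster nodes.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical data shards, one per node (e.g. order amounts).
SHARDS = [
    [120, 80, 310],       # node 1
    [55, 400],            # node 2
    [90, 60, 75, 200],    # node 3
]

def partial_sum(shard):
    """Work done locally on one node: aggregate only its own shard."""
    return sum(shard)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(partial_sum, SHARDS))   # executed in parallel
    total = sum(partials)                                # final aggregation step
    print(partials, "->", total)                         # [510, 455, 425] -> 1390
```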

The common MPP architecture products are as follows.

  1. (1)

    Non-shared host architecture: Vertica and Teradata.

  2. (2)

    Shared host architecture: Greenplum and Netezza.

Teradata and Netezza are hardware-software all-in-one machines, while GaussDB (DWS), Greenplum and Vertica are software-only MPP architecture databases. The sharding architecture is the basis of the shared-nothing architecture; a shared-nothing cluster is only possible when the data is sharded.

Explanation

The concept of shared-nothing describes the architecture from the perspective of resource independence, while the concept of shard describes it from the perspective of data independence. MPP describes the architecture from the perspective of parallel computing and is the application and embodiment of parallel computing technology in distributed databases. The terms shard, shared-nothing and MPP can be regarded as standard terms of distributed database architecture.

1.3.10 Comparison of the Characteristics of Database Architectures

Finally, let's compare the characteristics of the above database architectures, as shown in Table 1.4.

Table 1.4 Comparison of the characteristics of database architectures

In terms of high availability, the more hosts there are, the better the availability. As for read/write performance, both the stand-alone architecture and the master-standby architecture depend on a single host's hardware, so both are limited by its performance bottleneck; the master-slave architecture can use read/write separation to enhance read performance, while the multi-master architecture and the sharding architecture have better read/write service capability and can provide strong parallel processing capability. In terms of data consistency, the stand-alone architecture has no data consistency issues, while the master-standby and master-slave architectures must synchronize data among multiple hosts, making data latency and consistency issues hard to avoid. The multi-master architecture also faces data inconsistency, unlike the shared disk architecture, which benefits from shared storage. Within the sharding architecture, data is scattered across the nodes and no synchronization is required between nodes, so there is no data inconsistency. Finally, scalability: the stand-alone and master-standby architectures only support vertical scaling and run into the hardware performance bottleneck of a single host; the master-slave architecture can improve concurrent read capability by scaling horizontally; the multi-master architecture scales well, but adding hosts sharply increases the complexity of data synchronization; the sharding architecture can theoretically scale linearly and has the best scalability among these architectures.

1.4 Mainstream Applications of Relational Databases

1.4.1 Online Transaction Processing

Online transaction processing (OLTP) is the main application of traditional relational databases. It is oriented to basic, daily transaction processing, such as the deposit/withdrawal and transfer transactions of a banking deposit service.

Database transaction is a basic logical unit in the database execution process. The database system needs to ensure that all operations in a transaction are completed successfully and the results are permanently stored in the database.

For example, someone wants to buy something worth $100 in a store using electronic money. This involves at least two operations: the person's account is reduced by $100; the store account is increased by $100. A transactional DBMS makes sure that both of these operations (i.e., the entire transaction) either complete or are canceled together; otherwise, the $100 will disappear or appear for nothing. But in reality, the risk of failure is high. The execution of database transactions may encounter operation failures, database system or operating system errors, or even storage media errors. This requires the DBMS to perform recovery operations on a failed transaction execution to restore its database state to a consistent state (the state in which data consistency is guaranteed), for which the DBMS usually needs to maintain transaction logs to track all operations in the transaction that affect the database data.

OLTP is characterized by high throughput: it supports a large number of short online transactions (inserts, updates, deletes) and very fast query processing, with high concurrency and real-time or near-real-time response.

The OLTP scenario places very high demands on the timeliness of response, requiring the database system to have the ability to quickly handle a large number of concurrent transaction operations, where the response of each transaction reaches millisecond level or even faster. Moreover, the transaction concurrency is very large, so high concurrency is also one of the most significant features in OLTP scenarios. For example, online ticketing systems, retail systems and flash sale campaigns are typical OLTP application scenarios.

1.4.2 Online Analytical Processing

The concept of online analytical processing (OLAP) was first proposed by Edgar Frank Codd in 1993 as a counterpart to the OLTP system, and refers to query and analysis operations on data. Such workloads usually query and analyze large amounts of historical data: the historical period involved is long, the volume of data is large, and aggregation at different levels makes the processing complex.

OLAP is characterized by its focus on complex queries and some “strategic” problem solving. In terms of data processing, it focuses on “analytical” data processing and operations such as data aggregation, grouping calculation, and window calculation, involving multi-dimensional data usage and analysis.

Common OLAP scenarios include reporting systems, customer relationship management (CRM) systems, financial risk prediction and early warning systems, anti-money laundering systems, data marts, data warehouses, etc. A reporting system is a platform or system that generates reports for a fixed period or in a fixed format, such as daily, weekly, and monthly reports, to provide electronic reporting data for business decision making. A CRM system is a comprehensive business system platform that provides customer maintenance services, stores customer-related information, analyzes customer behavior, responds to customers, and manages marketing activities. A data mart is generally an application geared toward the departmental needs of an organization, such as the analytical needs of a credit card department. Data warehouse is an enterprise-oriented analytic platform system created to build an analytic processing environment for the entire enterprise.

1.4.3 Database Performance Measurement Indicators

The specific architectural design and implementation of different databases vary with the scenario, so a more authoritative standard is needed to evaluate the merits of different databases in different scenarios. The Transaction Processing Performance Council (TPC) is responsible for developing benchmark specifications, performance and price metrics for business applications, and for managing the publication of test results. A TPC benchmark is a standard specification rather than code, so any manufacturer can construct its own system optimally and evaluate it according to the specification. The TPC has introduced a number of benchmarking standards, including the following two specifications for OLTP and OLAP, respectively.

  1. (1)

    The TPC-C specification targets OLTP systems. Its metrics include the throughput indicator tpmC (tpm: transactions per minute, i.e., the number of transactions the system under test processes per minute) and the price/performance indicator price/tpmC (the total price of the system under test divided by tpmC), the latter being the cost of one unit of throughput.

  2. (2)

    The TPC-H specification targets OLAP systems. Its test metric is the throughput indicator qphH (qph: queries per hour, i.e., the number of complex queries processed per hour). TPC-H takes the size of the test data set into account: the specification defines 22 query statements over data sets of different sizes, which can be fine-tuned for specific products. The test scenarios include data loading, power testing and throughput testing; the detailed test criteria are described in the specification documents, which are publicly available on the TPC website. A small numeric illustration of the tpmC and qphH metrics is given after this list.
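
To make the two headline metrics concrete, the Python sketch below plugs entirely hypothetical numbers into the formulas; none of the figures come from a published TPC result.

    # Hypothetical figures only, not taken from any published TPC result.
    tpmC = 1_200_000               # measured OLTP throughput: transactions per minute
    system_price_usd = 3_000_000   # total price of the system under test

    price_per_tpmC = system_price_usd / tpmC   # cost of one unit of throughput
    print(f"tpmC = {tpmC}, price/tpmC = {price_per_tpmC:.2f} USD")

    # For TPC-H, the headline metric is the number of complex queries processed per hour.
    qphH = 450_000
    print(f"qphH = {qphH}")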

A comparative analysis of OLTP and OLAP is shown in Table 1.5; similar analyses can be found on the web, most of which draw on Building the Data Warehouse by W. H. Inmon, the proposer of the data warehouse concept. From the perspective of granularity, OLTP works at the detail level, handling each elementary transaction event, whereas OLAP works at a summary level, with more integrated and aggregated analysis. In terms of timeliness, OLTP emphasizes the transient nature of transactions, which end as soon as they are completed. In terms of data updates, OLAP generally does not require them.

Table 1.5 Comparative analysis of OLTP and OLAP

To sum up, both OLTP and OLAP systems follow the ACID principles and are built on relational databases. They are functionally similar in that both support SQL statements, can handle large volumes of data, and implement highly consistent transaction processing. In terms of application scenarios, however, OLTP places more emphasis on real-time transactional requirements, while OLAP focuses more on the analysis of large volumes of data. Because the two pursue different goals in their respective scenarios, it is currently not advisable to use them interchangeably, for example by running OLAP analytics on an OLTP database or by using an OLAP system as a core transaction system with strict real-time requirements. An emerging class of hybrid transactional/analytical processing (HTAP) database systems aims to host both OLTP and OLAP workloads in a single system. Products applying this technology appear from time to time, and HTAP is one of the trends in the development of NewSQL database technology. Readers can consult related materials to expand their knowledge on their own.

1.5 Summary

This chapter introduces the basic concepts of database and database management system, reviews the decades-long development history of databases, traces the evolution from the early mesh and hierarchical models to the relational model, and introduces the NoSQL and NewSQL concepts that have emerged in recent years; it then compares the main architectures of relational databases and briefly explains their advantages and disadvantages in different scenarios; finally, it introduces and contrasts OLTP and OLAP, the mainstream application scenarios of relational databases.

Through the study of this chapter, readers are able to describe the concepts related to database technology, enumerate the main relational databases, distinguish different relational data architectures, and describe and identify the main application scenarios of relational databases.

1.6 Exercises

  1. 1.

    [Multiple Choice] The characteristics of the data stored in the database are ( ).

    1. A.

      Permanently stored

    2. B.

      Organized

    3. C.

      Independent

    4. D.

      Shareable

  2. 2.

    [Multiple Choice] The components of a database system include ( ).

    1. A.

      Database management system

    2. B.

      Database

    3. C.

      Application development tool

    4. D.

      Application

  3. 3.

    [True or False] Database applications can read database files directly, without using the database management system. ( )

    1. A.

      True

    2. B.

      False

  4. 4.

    [Multiple Choice] What are the stages in the development of data management? ( )

    1. A.

      Manual stage

    2. B.

      Intelligent system

    3. C.

      File system

    4. D.

      Database system

  5. 5.

    [Single Choice] In which data model is more than one node allowed to have no parent node, and a node allowed to have more than one parent node? ( )

    1. A.

      Hierarchical model

    2. B.

      Relational model

    3. C.

      Object-oriented model

    4. D.

      Mesh model

  6. 6.

    [Multiple Choice] Which of the following are NoSQL databases? ( )

    1. A.

      Graph database

    2. B.

      Document database

    3. C.

      Key-value database

    4. D.

      Column family database

  7. 7.

    [True or False] The emergence of NoSQL and NewSQL databases can completely subvert and replace the original relational database systems. ( )

    1. A.

      True

    2. B.

      False

  8. 8.

    [True or False] The master-standby architecture can improve the overall read/write concurrency by separating read and write. ( )

    1. A.

      True

    2. B.

      False

  9. 9.

    [Single Choice] Which database architecture has good linear scalability? ( )

    1. A.

      Master-slave architecture

    2. B.

      Shared-nothing architecture

    3. C.

      Shared disk architecture

    4. D.

      Master-standby architecture

  10. 10.

    [True or False] The characteristic of the sharding architecture is that data is distributed across the database nodes of the cluster by a certain algorithm, taking advantage of the number of servers in the cluster for parallel computing. ( )

    1. A.

      True

    2. B.

      False

  11. 11.

    [Multiple Choice] Test metrics used to measure OLTP systems include ( ).

    1. A.

      tpmC

    2. B.

      Price/tpmC

    3. C.

      qphH

    4. D.

      qps

  12. 12.

    [Multiple Choice] OLAP system is suitable for which of the following scenarios? ( )

    1. A.

      Reporting system

    2. B.

      Online transaction system

    3. C.

      Multi-dimensional analysis and data mining systems

    4. D.

      Data warehouse

  13. 13.

    [True or False] OLAP system can analyze and process a large volume of data, so it can also meet the processing performance requirements of OLTP for small data volume. ( )

    1. A.

      True

    2. B.

      False