In computer science, the storage uses different applications to save data to certain media in a reasonable, secure, and efficient manner and ensure effective access. In general, storage has two meanings. On the one hand, it is a physical medium in which data resides temporarily or permanently.

This chapter begins with an introduction to the basics of storage, followed by a complete introduction to the storage fundamentals involved in cloud computing in terms of basic storage units, networked storage, storage reliability technologies, storage virtualization, and distributed storage.

5.1 Basic Knowledge of Storage

5.1.1 Storage Development and Technological Evolution

As we all know, the development of civilization depends on the accumulation of knowledge, and the accumulation of knowledge cannot be separated from storage. Therefore, the ability to store information containing knowledge is an important part of the development of civilization; in a sense, it is one of the signs that human beings had entered civilized society. Historically, humans have created many ways to store information.

  1. Perforated Paper Tape

    In 1725, Bouchon, a French textile mechanic, came up with the brilliant idea of a "perforated paper tape," as shown in Fig. 5.1. Bouchon first arranged to control all movements with a row of knitting needles, then punched a row of holes in a roll of paper tape according to the pattern to be woven. After the machine was started, the needles aligned with the holes could pass through and hook up the thread, while the paper tape blocked the other needles. In this way, the needles automatically selected the threads according to the pre-designed pattern: Bouchon's "thoughts" were "passed" to the knitting machine, and the "program" of the woven pattern was "stored" in the holes of the perforated tape.

    Figure 5.2 shows an example of an early perforated tape with 90 columns of holes. As you can see, very little data can be stored on such a tape, and virtually no one actually used it to store data. Typically, it was used to hold configuration parameters for different computers.

    Alexander Bain, the inventor of the facsimile machine and the telegraph machine, first used the improved punched paper tape shown in Fig. 5.3 in 1846. Each line on the paper tape represents a character, and its capacity was significantly larger than before.

  2. Counting Tube

    In 1946, RCA began researching counting tubes, storage devices used in early vacuum-tube computers. A counting tube up to 10 in. (about 25 cm) long could hold 4096 bits of data, as shown in Fig. 5.4. Unfortunately, it was extremely expensive, so it was a flash in the pan and quickly disappeared from the market.

  3. Magnetic Tape

    In the 1950s, IBM first used magnetic tape for data storage, as shown in Fig. 5.5. Because a roll of tape could replace 10,000 punched cards, magnetic tape was a success and became one of the most popular computer storage devices until the 1980s.

  4. Floppy Disk

    The first floppy disk, invented in 1969, was 8 in. (about 20 cm) in diameter and held 80KB of read-only data. In 1973, a smaller floppy disk with a capacity of 256KB appeared, and it could be read and written repeatedly, as shown in Fig. 5.6. The trend continued: floppy disks became smaller in diameter while their capacity grew. By the late 1990s, 3.5 in. (about 9 cm) floppy disks with a capacity of 250MB had emerged.

    Fig. 5.1 Perforated paper tape

    Fig. 5.2 Example of early perforated tape

    Fig. 5.3 Improved perforated paper tape

    Fig. 5.4 Counting tube

    Fig. 5.5 Magnetic tape

    Fig. 5.6 Floppy disk

  5. Hard Disk

The hard disk drive is a storage device that is still evolving today. Figure 5.7 shows the Hitachi Deskstar 7K500, the first hard drive to reach 500GB.

Fig. 5.7 Hard disk

5.1.2 Cutting-Edge Storage Technologies and Development Trends

In the era of big data, it is important to store large-scale data securely, efficiently, and cost-effectively. Computer storage technology has developed rapidly, from the first mechanical hard disk in 1956, to the emergence of SAN in the 1970s, to the invention of NAS in the 1980s and the appearance of object storage in 2006. As this history shows, storage technology is constantly being integrated with applications. Newer technologies do not completely replace the older ones; rather, the range of applications keeps expanding, so even now hard disks, SAN, and NAS are still widely used in their respective fields.

Virtual storage and network storage are the two major themes of current storage technology development. The development of storage technology must not only meet users' basic requirements for large capacity and high speed, but also satisfy higher demands on cost-effectiveness, security, and the scalability of storage in both time and space. This will lead to the convergence of various storage devices and storage technologies, ultimately unified within a standard architecture.

  1. Virtual Storage

At present, the main direction of storage technology development is virtualization. As the amount of information grows exponentially, making efficient use of existing storage architectures and technologies to simplify storage management, and thereby reduce maintenance costs, has become the focus of attention. Virtual storage refers to integrating many different types of independently existing physical storage entities into one logical virtual storage unit through software and hardware technology; this unit is then managed centrally and made available to users. The storage capacity of a logical virtual storage unit is the sum of the capacities of the physical storage entities it manages, while its read/write bandwidth is close to the sum of the read/write bandwidths of those physical storage entities. The development and application of virtual storage technology helps exploit the capacity of existing storage devices more effectively and improves storage efficiency. At the heart of storage virtualization is the question of how to map physical storage devices to a single pool of logical resources (see the sketch after this list).

In general, virtualization is achieved by establishing a virtual abstraction layer that provides the user with a unified interface and hides the complex physical implementation. According to where the virtual abstraction layer resides in the storage system, storage virtualization can be implemented in three ways: storage-device-based virtualization, storage-network-based virtualization, and server-based virtualization.

  2. Network Storage

With the growing demand for information, storage capacity is expanding rapidly, and the storage network platform has become a core of development. Accordingly, applications place ever higher demands on these platforms, not only for storage capacity but also for access performance, transmission performance, manageability, compatibility, scalability, and more. It can be said that the overall performance of the storage network platform directly affects the normal and efficient operation of the whole system. Therefore, the development of economical and manageable advanced storage technology is an inevitable trend.

Network storage is one of many data storage technologies. It is a special kind of dedicated data storage server, including storage devices (such as disk arrays, tape drives, or removable storage media) and embedded system software, and it provides cross-platform file-sharing capabilities. Typically, networked storage has its own nodes on a local area network and requires no intervention from an application server, allowing users to access data directly over the network. In this mode, networked storage centrally manages and processes all data on the network, with the advantage of offloading work from the application or enterprise servers and reducing the total cost of ownership. There are three broad networked storage architectures: Direct Attached Storage (DAS), NAS, and SAN. All of them can use RAID to provide efficient, secure storage space. Because NAS is the most common form of networked storage for consumers, the term networked storage generally refers to NAS.

With the continuous development of storage technology and enterprises' changing needs, Server SAN is gradually becoming the mainstream form of enterprise storage. Whether in a public or private cloud, there is much interest in distributed storage. With the continuous development of new business, the supply model of resources has gradually shifted from the "chimney" (silo) mode to the "cloud" mode. Because traditional storage is built for a single application and scenario and cannot meet today's need for elastic scaling, cloud storage that can scale elastically on demand is bound to develop. A more advanced storage method is software-defined, fully converged cloud storage: a system based on a common hardware platform that provides block, file, and object services on demand. It is suitable for cloud resource pools in industries such as financial development and testing, government affairs, policing, and large enterprises, as well as scenarios such as carrier public clouds.
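To make the mapping at the heart of virtual storage (described above) concrete, the short Python sketch below models a logical pool that concatenates several physical devices into a single address space and translates logical block numbers into (device, physical block) pairs. It is only an illustration of the concept under simplified assumptions: the device names and capacities are invented, and real virtualization layers also handle striping, redundancy, and dynamic remapping.

```python
# Minimal sketch of a storage pool: several physical devices presented as one
# logical address space. Illustration only, not any vendor's implementation.

class StoragePool:
    def __init__(self, devices):
        # devices: list of (name, capacity_in_blocks), concatenated in order
        self.devices = devices
        self.total_blocks = sum(cap for _, cap in devices)

    def map_block(self, logical_block: int):
        """Translate a logical block number into (device_name, physical_block)."""
        if not 0 <= logical_block < self.total_blocks:
            raise ValueError("logical block out of range")
        for name, capacity in self.devices:
            if logical_block < capacity:
                return name, logical_block
            logical_block -= capacity
        raise AssertionError("unreachable")

pool = StoragePool([("disk_a", 1000), ("disk_b", 500), ("disk_c", 2000)])
print(pool.total_blocks)     # 3500: pool capacity is the sum of its members
print(pool.map_block(1200))  # ('disk_b', 200)
```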

5.1.3 Common Storage Products and Solutions

To ensure high availability, high reliability, and economy, cloud computing uses distributed storage to store data and redundant storage to ensure the reliability of stored data, that is, storing multiple copies of the same data. In addition, cloud computing systems need to meet the needs of many users at the same time and serve a large number of users in parallel. Therefore, cloud computing data storage technology must feature high throughput and high transmission rates.

Typical distributed data storage technologies include Google's non-open-source Google File System (GFS) and the open-source Hadoop Distributed File System (HDFS), an implementation of GFS developed by the Hadoop team. The "cloud" programs of most IT vendors, including Yahoo! and Intel, use HDFS. Future developments will focus on large-scale data storage, data encryption and security assurance, and continued improvements in I/O rates.

Huawei offers several widely used solutions in cloud storage.

  1. Storage as a Service

Storage as a Service (STaaS) delivers storage resources as an on-premises service that provides users with the same immediacy, scalability, and pay-as-you-go flexibility as public cloud storage, without the associated security and performance variability issues. In addition, when staffing becomes a problem, it can be consumed as a managed service. It also enables customers to access new technologies on demand rather than investing capital in technologies that become obsolete over time.

  2. OceanStor DJ

OceanStor DJ is Huawei's business-driven storage control software for cloud data centers. It manages data center storage resources in a unified way, provides business-driven, automated storage services, and improves the utilization of storage resources and the efficiency of storage services in a cloud environment. At its core is the enhancement of OpenStack-related services for unified storage resource management, on-demand allocation, and data protection services. OceanStor DJ decouples applications from the underlying storage, breaking the monopoly of traditional device and application vendors. In cloud scenarios, capabilities such as storage and data protection are provided as services, adapting to the shift of the storage value chain toward software and services.

  3. OceanStor Dorado V3

OceanStor Dorado V3 is an all-flash storage system for business-critical enterprise applications. It uses smart chips, an NVMe architecture, and the FlashLink intelligent algorithm to provide end-to-end acceleration with latency as low as 0.3 ms and a three-fold increase in business performance. It supports smooth expansion to 16 controllers to cope with unpredictable business growth. Supporting both SAN and NAS with enterprise-class features, it provides high-quality services for applications such as databases and file sharing. It supports a gateway-free active-active solution that can be smoothly upgraded to a two-site, three-data-center scheme and a converged data management scheme, delivering 99.9999% reliability assurance. With inline deduplication and inline compression providing data reduction ratios of up to 5:1, it cuts operating expense (OPEX) by 75%. It meets the needs of databases, virtual desktops, virtual servers, and file-sharing scenarios, helping finance, manufacturing, carriers, and other industries evolve smoothly into the flash era.

  4. OceanStor

OceanStor is a next-generation enterprise storage solution that is cost-effective and sophisticated enough to meet application needs such as OLTP/OLAP databases, Exchange, server virtualization, and video surveillance for large, medium, and small businesses. The rich Hyper family of data protection software meets users' local, off-site, and multi-site data protection needs, ensuring business continuity and data availability. The unique ease-of-use software SmartConfig dramatically simplifies the storage configuration process, removes the bottleneck of specialized IT operations skills, and meets the critical needs of small and medium-sized enterprises for IT simplicity and ease of use.

  5. FusionStorage

FusionStorage is distributed storage software that pools servers' local hard disks into a virtualized storage resource pool and provides network RAID protection for virtual machines. Based on the performance and capacity of a fully distributed architecture, a flexible infrastructure, and open compatibility for the cloud, FusionStorage enables on-demand data services that genuinely help customers meet cloud challenges. FusionStorage starts with the "three-in-one" convergence of distributed block storage, distributed file storage, and distributed object storage. FusionStorage also has an essential quality of a good storage system: elasticity. It features a fully symmetrical distributed architecture based on standard hardware that can easily scale to 4096 nodes, EB-level capacity, and 10 million IOPS (input/output operations per second) to support the clouding of enterprise-critical workloads such as high-performance data queries. Finally, FusionStorage is open: it is based on an open architecture design, supports mainstream virtualization software, provides open APIs with standard interface protocols, and integrates naturally with the OpenStack cloud infrastructure and the Hadoop big data ecosystem.

    Cloud storage has become a trend in the future of storage development. Today, cloud storage vendors combine search, application technologies, and cloud storage to deliver a range of data services to enterprises. The future of cloud storage is focused on providing better security and more convenient and efficient data access services.

5.1.4 Data Security Technology of Cloud Storage

Cloud storage enables the sharing of data stored in a network environment and the efficient storage and access of user data, thus providing users with more efficient, convenient, and higher-quality storage services. However, the security of the cloud storage system itself is often the focus of attention. Cloud storage service providers and related technical personnel must protect stored data well, taking the protection of data privacy and integrity as the entry point and applying a variety of existing security technologies to improve the security of data in cloud storage.

The data security technologies commonly used in cloud storage are as follows:

  1. Identity authentication technology

Authentication is the first line of defense for data security in cloud storage. There are three standard authentication techniques. The first, and simplest, is account and password verification before logging in to the system: users only need to enter the account and password required by the system to pass authentication, and if either is incorrect, verification fails. The second is Kerberos authentication. This method is essentially third-party authentication: a dedicated authentication server verifies the user's credentials according to clear criteria when resources are accessed, and once verified, the user can obtain the services provided by the system. The third is PKI-based authentication. This method is based on asymmetric keys (public/private key pairs) and is more complex; a typical PKI system includes a PKI policy, software/hardware systems, a certificate authority, a registration authority, a certificate issuing system, and PKI applications.

  2. Data encryption technology

Data encryption technology mainly includes symmetric encryption and asymmetric encryption, and there are important differences between the two. In symmetric encryption, encryption and decryption use the same key; the algorithms are simple, encryption and decryption are easy and efficient, and execution is fast. However, key transmission and management are inconvenient, a key leak easily compromises confidentiality, and symmetric encryption alone cannot provide digital signatures. Asymmetric encryption uses a public/private key pair: even if the ciphertext is intercepted and the public key is known, the private key cannot be derived and the ciphertext cannot be deciphered. It is therefore more secure, but the algorithms are complex and encryption/decryption efficiency is low. Symmetric and asymmetric encryption can be combined effectively, for example by using asymmetric encryption to transfer a symmetric key between the two parties, after which the sender and receiver use that symmetric key to encrypt and decrypt the subsequently transmitted data (see the hybrid-encryption sketch after this list). This guarantees the security of key transmission while preserving the efficiency of data encryption and decryption.

  3. Technology for data backup and integrity

Cloud computing data centers generally have backup and recovery capabilities for stored data, which largely avoids data loss and ensures data integrity. To allow data to be recalled at any time, such systems also apply snapshot technology and use storage devices that can be expanded to larger capacities. This allows data to be stored appropriately in physical space and allows the system's data management and control methods to be updated and improved in a timely and effective manner. Once applied, this technology can break through the space limitations of physical containers and thus improve the stability and security of the system.
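As an illustration of the hybrid approach described under data encryption above, the following Python sketch uses the third-party cryptography package (assumed to be installed): an RSA key pair protects a randomly generated symmetric key, and that symmetric key (here Fernet, an AES-based scheme) protects the bulk data. It is a minimal sketch of the idea, not a complete cloud storage encryption design (no signatures, key management, or integrity policy).

```python
# Hybrid encryption sketch: RSA wraps a symmetric key; the symmetric key
# encrypts the data. Requires the "cryptography" package.

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Receiver's long-term asymmetric key pair.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Sender: encrypt the data with a fresh symmetric key, then wrap that key with RSA.
symmetric_key = Fernet.generate_key()
ciphertext = Fernet(symmetric_key).encrypt(b"user file stored in the cloud")
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = public_key.encrypt(symmetric_key, oaep)

# Receiver: unwrap the symmetric key with the private key, then decrypt the data.
recovered_key = private_key.decrypt(wrapped_key, oaep)
plaintext = Fernet(recovered_key).decrypt(ciphertext)
assert plaintext == b"user file stored in the cloud"
```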

5.2 Basic Storage Unit

In modern computer systems, the common storage media are hard disks, optical discs, tapes, etc. Hard disks have advantages that other storage media cannot match, such as large capacity, low price, fast read speed, and high reliability, and are therefore regarded as an important storage device. This section describes the hard drive in detail from two angles: the mechanical hard disk drive and the SSD.

5.2.1 Hard Disk Drive

Mechanical hard drives date back to 1956, when IBM invented the world's first disk storage system, the IBM 305 RAMAC. It had fifty 24 in. (about 61 cm) disks, weighed about 1 t, and had a capacity of 5MB. In 1973, IBM successfully developed a new type of hard disk, the IBM 3340. This hard drive had several coaxial metal platters coated with magnetic material, sealed in a box together with a movable magnetic head that could read magnetic signals from the rotating platter surface. This is the closest "ancestor" of the hard drive we use today, and IBM called it the Winchester hard drive. Because the IBM 3340 had two 30MB storage units, and the cartridge designation of the well-known Winchester rifle also contained two 30s, the internal code name of this hard drive became "Winchester." In 1980, Seagate produced the first Winchester hard drive for personal computers. This hard drive was similar in size to the floppy drives of the time, with a capacity of 5MB. Figure 5.8 shows a Winchester hard drive.

Fig. 5.8 Winchester hard drive

At that time, the hard disk's read speed was limited by its rotational speed. Increasing the rotational speed can speed up data access, but the head and the platters were in contact, and excessive rotational speed would damage the disk. Technicians therefore thought of letting the head "fly" above the platters: the platters' high-speed rotation generates a flow of air, so as long as the head is shaped appropriately, the platters can rotate quickly without friction causing malfunctions. This is Winchester technology.

The Winchester hard disk uses this innovative technology. The magnetic head is fixed on an arm that can move in the platter's radial direction, and the head does not contact the platter. As the head moves relative to the platter, it can sense the magnetic poles on the platter surface and record or change their state to read or write data. Since the head moves at high speed relative to the platter and the distance between the two is very small, even a little dust can damage the disk. Therefore, the hard disk must be packaged in a sealed enclosure to ensure that the head and platter work reliably.

The hard disk we usually refer to is the mechanical hard disk, which is mainly composed of platters, the spindle and spindle motor, the preamplifier, the head assembly, the voice coil motor, and the interface, as shown in Fig. 5.9.

  • Platters and spindle motor. The platters and the spindle motor are two closely connected parts: a platter is a circular disc coated with a layer of magnetic material to record data, and the spindle, driven by the spindle motor, rotates the platters at high speed.

  • Head assembly. The head assembly consists of read/write heads, a drive arm, and a drive shaft. When the platters rotate at high speed, the drive arm, pivoting on the drive shaft, moves the read/write heads at its front end across the rotating platter surface, and the heads sense the magnetic signals on the platter to read data or change the magnetization of the magnetic material to write data.

  • Voice coil motor. It consists of the head actuator carriage, the motor, and a shock-proof mechanism. Its function is to drive and position the heads with high precision so that they can read and write quickly and accurately on the specified track.

  • Preamplifier. The preamplifier is an amplification circuit sealed in a shielded cavity; it mainly amplifies the head's induced signal and handles spindle motor speed control, head driving, and head positioning.

  • Interface. Interfaces typically include a power interface and a data transfer interface. The current mainstream interface types are SATA and SAS.

Fig. 5.9 Mechanical hard disk

The platters inside a hard drive used to store data are metal discs coated with magnetic material. The platter surface is divided into concentric tracks, and as the platter rotates at high speed driven by the motor, the head positioned over the platter surface reads and writes data along a track. When the system writes data to the hard disk, the current in the head changes with the data content, creating a magnetic field that changes the state of the magnetic material on the platter surface; this state remains after the magnetic field disappears, which is equivalent to saving the data. When the system reads data from the hard disk, the head passes over the designated area of the platter, and the magnetic field on the platter surface induces a change in current or coil impedance in the head; this signal is captured and, after some processing, the originally written data can be restored.
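Each track is further divided into sectors, and a location on a drive was traditionally addressed by cylinder, head, and sector (CHS); operating systems now use a flat logical block address (LBA) instead. The small Python sketch below shows the classic CHS-to-LBA conversion; the geometry numbers are made-up illustrative values, since modern drives expose LBA directly rather than their real physical geometry.

```python
def chs_to_lba(cylinder: int, head: int, sector: int,
               heads_per_cylinder: int, sectors_per_track: int) -> int:
    """Classic CHS -> LBA formula; sectors are 1-based, cylinders/heads 0-based."""
    return ((cylinder * heads_per_cylinder + head) * sectors_per_track
            + (sector - 1))

# Illustrative (made-up) geometry: 16 heads, 63 sectors per track, 512-byte sectors.
HEADS, SECTORS, SECTOR_SIZE = 16, 63, 512

lba = chs_to_lba(cylinder=2, head=3, sector=10,
                 heads_per_cylinder=HEADS, sectors_per_track=SECTORS)
print(lba)                # 2214: the 2215th sector on the drive (LBA is 0-based)
print(lba * SECTOR_SIZE)  # byte offset of that sector: 1,133,568
```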

5.2.2 Solid-State Drive

The world's first SSD appeared in 1989. SSDs were costly at the time, and their performance was much lower than that of ordinary hard drives, so they were not widely used. However, thanks to their unique shock resistance, silence, and low power consumption, SSDs did develop to some extent in particular markets, such as the medical and military fields.

With the increasing maturity of SSD technology, improvements in the manufacturing process, and falling production costs, SSDs gradually entered the consumer field. In 2006, Samsung released its first laptop with a 32GB SSD. In early 2007, SanDisk released two 32GB SSDs. In 2011, a major flood in Thailand forced the local factories of mechanical hard drive manufacturers such as Western Digital and Seagate to close, causing a sharp decline in mechanical hard drive production that year and soaring prices. This greatly stimulated demand for SSDs and ushered in a "golden age" for SSDs. Today, SSDs are significantly improved in capacity, cost, transfer rate, and service life compared with the original products. Common SSDs on the market now reach capacities from hundreds of gigabytes to a few terabytes, and the price per gigabyte is only a fraction of what it once was, which many consumers can afford. SSDs are one of the essential storage devices in ultra-thin laptops and tablets. It is foreseeable that SSDs will continue to receive great attention in the next few years.

SSDs consist of a master (controller) chip and memory chips; in short, they are drives made up of arrays of solid-state electronic chips. SSDs have the same interface specifications, definitions, functions, and usage methods as ordinary hard drives and are identical to them in form factor and size. Because SSDs have no rotating structure like ordinary hard drives, they are extremely shock-resistant, and their operating temperature range is very wide: extended-temperature electronic drives can operate from -45 °C to 85 °C.

SSDs are widely used in automotive, industrial control, video surveillance, network monitoring, network terminals, power, medical, aviation, navigation equipment, and other fields. Traditional mechanical hard drives store data in disk sectors, whereas the common storage media of SSDs are flash memory or Dynamic Random Access Memory (DRAM). SSDs are one of the trends in the future development of hard drives. The internal structure of an SSD is shown in Fig. 5.10.

Fig. 5.10 The internal structure of the solid-state drive

An SSD consists of memory chips and a master chip. The memory chips store the data, while the master chip controls the data read/write process. Memory chips fall into two types by storage medium: the most common use flash memory as the storage medium, and the others use DRAM.

  1. Flash-based SSDs

    The most common SSDs use flash memory as the storage medium. Depending on how it is used, flash memory can be made into various electronic products, such as SSDs, memory cards, and USB drives, all of which are small and highly portable. The SSDs discussed in this chapter are flash-based SSDs.

  2. DRAM-based SSDs

    This type of SSD uses DRAM as the storage medium. DRAM is currently widely used as main memory, performs very well, and has a long service life. The drawback is that it retains data only while powered; if power is lost, the information stored in DRAM is lost, so additional power protection is needed. At present, these SSDs are expensive and have a narrow range of applications.

SSDs have many advantages over traditional hard drives, as follows:

  (1) Fast read speed

    Because SSDs use flash memory as the storage medium and have no disk-and-motor structure, they save seek time when reading data, especially for random reads. The performance of SSDs is also not affected by disk fragmentation.

  (2) Good shock resistance

    There are no mechanical moving parts inside an SSD, so there is no mechanical failure and no fear of collisions, shocks, or vibrations. Normal use is not affected even at high speed or when tilted or flipped, and the possibility of data loss is minimized when a laptop is accidentally dropped or hits a hard object.

  (3) No noise

    There is no mechanical motor inside an SSD, so it is a truly silent drive.

  (4) Small size and light weight

    An SSD can be integrated on a small, lightweight circuit board.

  (5) Wider operating temperature range

    A typical hard drive can only operate in the range of 5-55 °C. Most SSDs can operate from -10 to 70 °C, and some industrial-grade SSDs can operate in temperatures ranging from -40 to 85 °C.

However, SSDs also have two major drawbacks that prevent them from fully replacing mechanical hard drives as system disks. One is high cost: at present, the price per unit capacity of SSDs is still significantly higher than that of traditional mechanical hard drives, and high-capacity SSDs are still relatively rare, so for applications that are not sensitive to data read/write speed, mechanical hard drives remain the first choice. The other is limited lifetime: high-performance flash memory can generally be erased 10,000 to 100,000 times, while ordinary consumer-grade flash memory can only be erased 3,000 to 30,000 times. As manufacturing processes continue to shrink the memory cell, the maximum erase count of flash memory will decrease further. Fortunately, the SSD's master chip can balance wear across the chips, allowing the memory cells to be consumed more evenly and extending service life; the sketch below illustrates the idea.
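The following toy Python sketch illustrates the wear-leveling idea mentioned above: the controller tracks how many times each flash block has been erased and steers new writes to the least-worn free block. It is a simplification for illustration only, not a real flash translation layer (which also handles garbage collection, bad blocks, and much larger mappings).

```python
# Toy wear-leveling sketch: always write to the least-erased free block.

class ToyFlash:
    def __init__(self, n_blocks: int):
        self.erase_count = [0] * n_blocks   # how many times each block was erased
        self.mapping = {}                   # logical page -> physical block
        self.blocks = {}                    # physical block -> stored data

    def write(self, logical_page: int, data: bytes):
        # Pick the least-worn physical block that is not currently mapped.
        used = set(self.mapping.values())
        candidates = [b for b in range(len(self.erase_count)) if b not in used]
        target = min(candidates, key=lambda b: self.erase_count[b])
        self.erase_count[target] += 1       # writing a new block implies an erase cycle
        self.blocks[target] = data
        self.mapping[logical_page] = target

flash = ToyFlash(n_blocks=4)
for i in range(8):
    flash.write(logical_page=i % 2, data=b"x")   # rewrite two hot pages repeatedly
print(flash.erase_count)   # [2, 2, 2, 2]: wear spreads instead of hammering one block
```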

As storage devices with higher read/write speeds than traditional mechanical hard drives, SSDs are now receiving widespread attention. Because an SSD works differently from a traditional mechanical hard drive and has no mechanical components, it offers a large performance improvement. At the same time, it has advantages that traditional mechanical hard drives lack, such as shock resistance, small size, silence, and low heat dissipation, so many people hope it will replace the traditional mechanical hard drive as a new generation of storage device. However, the cost of SSDs is still higher than that of traditional mechanical drives, and the performance of traditional mechanical drives can already meet most needs, so in the next few years traditional mechanical drives and SSDs will coexist and develop together.

5.3 Network Storage

5.3.1 DAS

DAS refers to connecting an external storage device directly to a server via a connecting cable, as shown in Fig. 5.11. A server with directly attached external storage is structured much like a personal computer: the external data storage device is connected directly to the internal bus using SCSI or Fibre Channel (FC) technology, and the storage device is part of the overall server structure. In this case, the data and the operating system are usually not separated. DAS, as a direct connection, meets the storage expansion and high-performance transfer needs of a single server, and the capacity of a single external storage system has grown from a few hundred gigabytes to several terabytes or more. With the introduction of high-capacity hard drives, the capacity of a single external storage system will continue to increase. In addition, DAS can form a two-node high-availability system based on disk arrays to meet requirements for highly available data storage. In terms of trends, DAS will continue to be used as a storage mode.

Fig. 5.11 DAS

DAS was the first storage technology adopted for open systems and has been used for nearly 40 years. Like the structure of a personal computer, DAS hangs external storage devices directly on the bus inside the server, making them part of the server structure. However, because this approach attaches storage devices directly to servers, as demand grows and more and more storage devices are added to the network environment, the server becomes a system bottleneck, resource utilization is low, and data sharing is severely restricted. As user data continues to grow, such systems increasingly plague system administrators with backup, recovery, scaling, disaster preparedness, and other issues. Therefore, DAS is suitable only for small networks.

DAS relies on the server host operating system for data read/write and storage maintenance management. Data backup and recovery consume server host resources (including CPU, system I/O, etc.), and the data flow must pass through the host on its way to the hard disk or tape drive connected to the server. Data backup typically consumes 20-30% of the server host's resources. As a result, many enterprise users run daily data backups late at night or when business systems are not busy so as not to interfere with normal operations. The greater the amount of data in DAS, the longer backup and recovery take, and the greater the dependence on, and impact of, the server hardware.

The connection channel between DAS and the server host is usually SCSI. As server CPUs become more powerful, storage hard disk space grows larger, and the number of disks in arrays increases, the SCSI channel becomes an I/O bottleneck. Moreover, the server host's SCSI IDs are a limited resource, so SCSI channel connectivity is limited. Figure 5.12 shows some common disk interfaces.

Fig. 5.12 Different types of SCSI cable interfaces

Whether it is expanding a DAS server host from one server to a cluster of multiple servers, or expanding the capacity of the storage array, the process can cause downtime of business systems and economic loss to the enterprise. This is unacceptable for key business systems that provide 24×7 service in banking, telecommunications, media, and other industries. These reasons have led to DAS being gradually replaced by more advanced storage technologies.

5.3.2 SAN

SAN is a high-speed, dedicated storage network independent of the business network and uses block-level data as its basic access unit. The main implementations of this network are the Fibre Channel storage area network (FC-SAN) and the IP storage area network (IP-SAN). The different implementations use different communication protocols and connections to transfer data, commands, and status between servers and storage devices.

Before SAN, DAS was the most widely used form of storage. Early data centers used disk arrays to scale storage capacity in the form of DAS, with each server's storage devices serving only a single application, creating isolated storage environments that were difficult to share and manage. With the growth of user data, the disadvantages of this approach in terms of expansion and disaster preparedness became evident. The emergence of SAN solved these problems: SAN connects these "storage silos" over a high-speed network shared by multiple servers, enabling off-site backup of data and excellent scalability. These factors have led to the rapid development of SAN.

As an emerging storage solution, SAN mitigates the impact of transmission bottlenecks on systems and greatly improves remote disaster backup's efficiency with its advantages of faster data transfer, greater flexibility, and reduced network complexity.

SAN is a network architecture consisting of storage devices and various system components, including servers that use the storage device resources, host bus adapter (HBA) cards for connecting to storage devices, and FC switches.

In a SAN, all traffic related to data storage is carried on a separate network isolated from the application network, which means that data transfers in the SAN do not impact the existing application data network. As a result, a SAN can improve the overall I/O capability of the network without reducing the efficiency of the original application network, while adding redundant links to storage systems and providing support for highly available cluster systems.

With the continuous development of SAN, three types of storage area network systems have been formed: FC-based FC-SAN, IP-based IP-SAN, and SAS-bus-based SAS-SAN. Here we discuss FC-SAN and IP-SAN.

In FC-SAN, two network interface adapters are typically configured on a storage server: an ordinary network interface card (NIC), which connects to the business IP network and through which the server exchanges data with clients, and an HBA, which connects to the FC-SAN and through which the server communicates with the storage devices in the FC-SAN. The FC-SAN architecture is shown in Fig. 5.13.

Fig. 5.13 FC-SAN architecture

IP-SAN is a network storage technology that has become popular in recent years. In the early SAN environment, data was transferred over Fibre Channel with blocks as the basic access unit; that is, the early SAN was FC-SAN. Because the FC protocol is not IP-compatible, FC-SAN must be procured and deployed separately, and its high price and complex configuration are a challenge for many small and medium-sized businesses. FC-SAN is therefore mainly used for high-end storage requirements with demands for high performance, redundancy, and availability. To increase the popularity and scope of SAN and take full advantage of its architectural strengths, the SAN community began to consider integration with the already popular and relatively inexpensive IP network. Thus IP-SAN, which uses the existing IP network architecture, emerged. IP-SAN combines standard TCP/IP and the SCSI instruction set to achieve block-level data transmission over IP networks.

The difference between IP-SAN and FC-SAN lies in the transport protocol and transport media. Common IP-SAN protocols include iSCSI, FCIP, and iFCP, of which iSCSI is the fastest-growing protocol standard. Usually, IP-SAN refers to an iSCSI protocol-based SAN.

The purpose of an iSCSI protocol-based SAN is to use a local iSCSI initiator (usually a server) to establish a SAN connection to an iSCSI target (usually a storage device) over an IP network. The IP-SAN architecture is shown in Fig. 5.14.

Fig. 5.14 IP-SAN architecture

Compared with FC-SAN, IP-SAN has the following advantages.

  • Standardized access. There is no need for dedicated HBA cards or FC switches; plain Ethernet cards and Ethernet switches suffice for connecting storage and servers.

  • Long transmission distance. In theory, IP-SAN can be used wherever the IP network, one of the most widely deployed networks, can reach.

  • Good maintainability. On the one hand, most network maintenance personnel have an IP network background, so IP-SAN is more readily accepted than FC-SAN. On the other hand, IP network maintenance tools are already very mature, and IP-SAN can take full advantage of them.

  • Easy future bandwidth expansion. Because the iSCSI protocol is carried over Ethernet, with the rapid development of Ethernet, expanding IP-SAN single-port bandwidth beyond 10Gbit/s is an inevitable result of development.

These benefits reduce the Total Cost of Ownership (TCO). For example, the total cost of ownership of a storage system includes purchasing disk arrays and access devices (HBA cards and switches), personnel training, routine maintenance, subsequent expansion, disaster-tolerance expansion, etc. Because of the wide deployment of IP networks, IP-SAN can significantly reduce the purchase cost of access equipment as well as maintenance costs, and subsequent expansion and network expansion costs are also significantly reduced. A comparison between IP-SAN and FC-SAN in other respects is shown in Table 5.1.

Table 5.1 Comparison between IP-SAN and FC-SAN

5.3.3 NAS

NAS is a technology that consolidates distributed, independent data into large, centrally managed data centers for access by different hosts and application servers. Typically, NAS is defined as a special dedicated file storage server that includes storage devices (such as disk arrays, CD/DVD drives, tape drives, or removable storage media) and embedded system software, and provides cross-platform file-sharing capabilities.

The emergence of NAS is inseparable from the development of the network. After the emergence of ARPANET, modern network technology developed rapidly. People share more and more data over the network, but sharing files over the network faces cross-platform access, data security, and many other problems. Early network sharing is shown in Fig. 5.15.

Fig. 5.15 Early network sharing

To solve this problem, a dedicated computer can be set up to hold a large number of shared files, connected to the existing network, and its storage space shared by all users on the network. Through this approach, the early UNIX network environment evolved toward relying on "file servers" to share data.

A specialized server that provides shared data storage needs a large amount of disk space and must ensure data security and reliability. At the same time, because a single server must handle the access needs of many clients, the file-sharing server needs to be optimized for file I/O; beyond an I/O-oriented operating system and a connection to the existing network, little else is required of such a server. Users on the network can access files on this server as if they were accessing files on their own workstations, essentially fulfilling the file-sharing needs of all users on the network. TCP/IP network sharing in the early UNIX environment is shown in Fig. 5.16.

Fig. 5.16 TCP/IP network sharing in the early UNIX environment

With the development of the network, there are more and more data-sharing needs between different computers on the network. In most cases, systems and users on the network expect to connect to a specific file system and access its data, so that remote files on a shared computer can be handled in the same way as local files in the local operating system, presenting users with a virtual collection of files. The files in this collection do not exist on the local computer's storage device; their location is virtual. One evolution of this storage approach is integration with traditional client/server environments that support Windows operating systems, which involves issues such as Windows networking capabilities, proprietary protocols, and UNIX/Linux-based database servers. In its early stages, a Windows network consisted of network file servers that are still in use today and use a dedicated network system protocol. An early Windows file server is shown in Fig. 5.17.

Fig. 5.17 Early Windows file server diagram

The advent of file-sharing servers has led to the development of data storage toward centralized storage, which has led to rapid growth in centralized data and business volumes. As a result, NAS, which focuses on file-sharing services, has emerged.

NAS typically has its own nodes on a local area network, allowing users to access file data directly over the network without an application server's intervention. In this configuration, NAS centrally manages and processes all shared files on the network, offloading that work from application or enterprise servers, effectively reducing the total cost of ownership and protecting users' investment. Simply put, a NAS device is a device connected to the network that provides file storage, hence the name "network-attached storage device." It is a kind of dedicated file storage server that takes the file as its core unit, realizes the storage and management of files, and completely separates the storage device from the server, thereby freeing up bandwidth and improving performance.

Essentially, NAS is a storage device, not a server, but it is not simply a stripped-down file server; it has features that some servers lack. The role of a server is to process business logic, while the role of a storage device is to store data. In a complete application environment, the two types of devices should be combined organically.

NAS's intrinsic value lies in its ability to leverage existing resources in the data center to deliver file storage services in a fast and low-cost manner. Today's solutions are compatible across UNIX, Linux, and Windows environments and easily connect to the user's TCP/IP network. A NAS deployment is shown in Fig. 5.18.

Fig. 5.18 NAS schematic

NAS must store and back up large amounts of data and, on that basis, provide a stable and efficient data transfer service. Such requirements cannot be met by hardware alone; NAS also needs software support.

NAS devices support reading and writing via CIFS or NFS, or both.

CIFS is a public, open file-sharing protocol developed by Microsoft from SMB. SMB is a set of file-sharing protocols defined by Microsoft on top of NetBIOS. CIFS allows users to access data on remote computers. In addition, CIFS provides mechanisms to avoid read/write conflicts and thus supports multi-user access.

To let Windows and UNIX computers share resources so that Windows clients can use resources on UNIX computers as if they were using a Windows NT server, without changing any settings, the best way is to install software in UNIX that supports the SMB/CIFS protocol. When all major operating systems support CIFS, "communication" between computers becomes convenient. The Samba software helps Windows and UNIX users achieve this. A CIFS-based shared server shares its resources with target computers; a target computer then mounts the shared resources from the CIFS server through a simple share mapping and uses them as if they were local file system resources. With a simple mapping, the client computer obtains all the shared resources it wants from the CIFS server.

NFS was developed by Sun to enable users to share files. It was designed to be used between different systems, so its communication protocol is designed to be independent of hosts and operating systems. When users want to use remote files, they only need the mount command to mount the remote file system under their own file system; using remote files is then no different from using local files.

The NFS platform-independent file-sharing mechanism is based on the XDR/RPC protocol.

External Data Representation (XDR) transforms data formats. Typically, XDR converts data into a uniform standard format to ensure a consistent representation of data across different platforms, operating systems, and programming languages.

Remote Procedure Call (RPC) requests service from the remote computer. The user transmits the request over the network to the remote computer, which processes the request.
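As a small illustration of what XDR-style encoding looks like, the Python sketch below packs an unsigned integer and a string the way XDR does: big-endian fixed-width integers, and variable-length data prefixed with its length and padded to a 4-byte boundary. It is a hand-rolled sketch of the encoding rules, not the actual wire format of any specific NFS procedure.

```python
import struct

def xdr_uint(value: int) -> bytes:
    """Encode an unsigned 32-bit integer as 4 big-endian bytes (XDR 'unsigned int')."""
    return struct.pack(">I", value)

def xdr_string(value: str) -> bytes:
    """Encode a string as XDR: 4-byte length, the bytes, then zero padding to a 4-byte boundary."""
    data = value.encode("utf-8")
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

# A toy "request" an NFS-like client might marshal before handing it to RPC
# (the field choice here is purely illustrative):
payload = xdr_uint(3) + xdr_string("/export/home/alice")
print(payload.hex())
print(len(payload))   # always a multiple of 4, as XDR requires
```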

Using the Virtual File System (VFS) mechanism, NFS sends user requests for remote data access to the server through a unified file-access protocol and remote procedure calls. NFS continues to evolve: since its emergence in 1985, it has undergone four major versions and been ported to all major operating systems, becoming the de facto standard for distributed file systems. NFS appeared in an era of unstable network conditions and was initially based on UDP rather than the more reliable TCP. While UDP works well on high-reliability LANs, it is not up to the task on less reliable WANs such as the Internet. Today, with improvements in TCP, NFS running over TCP offers high reliability and good performance.

5.4 Storage Reliability Technology

5.4.1 Traditional RAID Technology

RAID technology has developed continuously and now includes seven basic RAID levels, RAID 0 through RAID 6. In addition, there are combinations of the basic RAID levels, such as RAID 10 (a combination of RAID 1 and RAID 0) and RAID 50 (a combination of RAID 5 and RAID 0). Different RAID levels represent different storage performance, data security, and storage cost. Here we only cover RAID 0, RAID 1, RAID 5, and RAID 6.

  1. RAID 0

    RAID 0, also known as striping, combines multiple physical disks into one large logical disk. It offers the best storage performance of all RAID levels but provides no redundancy. When data is stored, it is divided into segments according to the number of disks that make up the RAID 0 array and written to those disks in parallel, which is why RAID 0 is the fastest of all levels. However, because there is no redundancy, if one physical disk fails, all data is lost. Theoretically, total performance should equal single-disk performance multiplied by the number of disks, but because of bus I/O bottlenecks and other factors, RAID 0 performance does not scale exactly with the number of disks. Assuming a single disk delivers 50MB/s, a two-disk RAID 0 reaches about 96MB/s and a three-disk RAID 0 about 130MB/s rather than 150MB/s, so the two-disk configuration gives the most prominent relative improvement. Figure 5.19 shows a RAID 0 layout with two disks, Disk1 and Disk2. RAID 0 divides the stored content according to the number of disks: D0 and D1 are written to Disk1 and Disk2, respectively; once D0 has been written, D2 is written to Disk1, and so on for the remaining data blocks. The two disks can therefore be treated as one large disk with I/O spread across both, but if one block of data is corrupted, the entire data set is lost.

    RAID 0 has good read and write performance but no data redundancy. It is therefore suitable for applications that can tolerate data loss or can regenerate data by other means, such as Web applications and streaming media.

  2. RAID 1

    RAID 1, also called mirroring, aims to maximize the availability and repairability of user data. Its principle is to automatically copy 100% of the data written by the user to another disk: while RAID 1 stores data on the primary disk, it also stores the same data on a mirror disk. When the primary disk is damaged, the mirror disk takes over its work. Because the mirror disk holds a backup of the data, RAID 1 provides the best data security of all RAID levels. However, no matter how many disks are used, the effective data space of RAID 1 is only the capacity of a single disk, so RAID 1 has the lowest disk utilization of all RAID levels.

    Figure 5.20 shows a schematic of RAID 1 with two disks, Disk1 and Disk2. When data is stored, the content is written to Disk1 and written again to Disk2 to provide a backup.

    RAID 1 has the highest unit storage cost of all RAID levels. However, because it provides high data security and availability, RAID 1 is suitable for read-intensive On-Line Transaction Processing (OLTP) and other applications that require high read/write performance and data reliability, such as e-mail, operating systems, application files, and random-access environments.

  3. RAID 5

    RAID 5, whose full name is "independent data disks with distributed parity blocks," is one of the most common levels in advanced RAID systems and is widely used because of its balance between performance and data redundancy. RAID 5 uses parity for error checking and correction, as shown in Fig. 5.21; in the figure there are three disks, P denotes the parity information, and D denotes the real data. RAID 5 does not keep a full copy of the stored data; instead, it distributes the data and the corresponding parity information across the disks that make up the array, with each piece of data and its corresponding parity stored on different disks. When the data on one of the disks is corrupted, the lost data can be recovered from the remaining data and the corresponding parity information (the sketch after this list shows how XOR parity makes such recovery possible). As a result, RAID 5 is a storage solution that balances storage performance, data security, and storage cost.

    Despite some capacity loss, RAID 5 provides good overall performance and is therefore a widely used data protection solution. It is suitable for I/O-intensive applications with a high read/write ratio, such as online transaction processing.

  4. RAID 6

    RAID 6 is a RAID level designed to further enhance data protection. Compared with RAID 5, RAID 6 adds a second independent parity block, so each data block is protected by two parity barriers, and its data redundancy is excellent. However, the additional parity makes write efficiency lower than RAID 5 and the design of the control system more complex, and the second parity area reduces the effective storage space.

    The common RAID 6 implementations are P+Q and DP. They compute their parity information in different ways, but both tolerate the loss of data on two disks in the array. RAID 6 is shown in Fig. 5.22.

    RAID 6 offers higher data security than RAID 5: even if two disks in the array fail, the array can continue to work and recover the data of the failed disks. However, the RAID 6 controller design is more complex, write speeds are not very fast, and computing the parity information and verifying data correctness take more time. Each write triggers two independent parity calculations, so the system load is heavy. Disk utilization is lower than RAID 5 and the configuration is more complex, which makes RAID 6 suitable for environments with strict requirements on data accuracy and completeness.
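The Python sketch below is a toy model, not a real RAID implementation, of the three mechanisms described above: round-robin striping (RAID 0), mirroring (RAID 1), and XOR parity with single-disk recovery (RAID 5). Block contents and sizes are arbitrary illustrative values.

```python
# Toy RAID sketch: disks are modelled as lists of equal-sized byte blocks.

from typing import List

def raid0_stripe(blocks: List[bytes], n_disks: int) -> List[List[bytes]]:
    """RAID 0: write blocks round-robin across n_disks; no redundancy."""
    disks = [[] for _ in range(n_disks)]
    for i, block in enumerate(blocks):
        disks[i % n_disks].append(block)
    return disks

def raid1_mirror(blocks: List[bytes]) -> List[List[bytes]]:
    """RAID 1: every block is written to both disks; usable capacity of one disk."""
    return [list(blocks), list(blocks)]

def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks (the RAID 5 parity operation)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# RAID 5 idea on one stripe of three disks: two data blocks plus their parity.
d0, d1 = b"DATA-000", b"DATA-001"
parity = xor_blocks([d0, d1])          # stored on a third disk

# Simulate losing the disk that held d1: XOR of the survivors restores it.
recovered_d1 = xor_blocks([d0, parity])
assert recovered_d1 == d1
print("recovered:", recovered_d1)
```

Because the parity block is simply the XOR of the data blocks in a stripe, any single missing block can be rebuilt by XOR-ing the surviving blocks with the parity, which is exactly what a RAID 5 rebuild does at scale.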

Fig. 5.19 RAID 0 schematic

Fig. 5.20 RAID 1 schematic

Fig. 5.21 RAID 5 schematic

Fig. 5.22 RAID 6 schematic

5.4.2 RAID 2.0+ Technology

RAID technology is a technique for storing the same data in different places on multiple disks. By storing data on multiple disks, input/output operations can overlap in a balanced manner, improving performance and increasing the mean time between failures.

However, a traditional RAID system has no reliability guarantee during reconstruction: if another disk fails before reconstruction completes, the data is lost and unrecoverable. Therefore, for a storage system, an important indicator of reliability is that the RAID reconstruction time be as short as possible, which reduces the probability of a second disk failure before reconstruction completes. Early storage systems mostly used FC disks with capacities of only tens of gigabytes, so reconstruction times were short and the probability of another disk failing during reconstruction was low. However, while disk capacity has grown rapidly, disk read/write speed, constrained by rotational speed and other factors, has grown slowly and can no longer meet the system's reconstruction time requirements. Over the past few years, companies in the storage field such as Huawei and 3PAR have evolved RAID technology from disk-based RAID to the more flexible RAID 2.0 and RAID 2.0+ technologies, which integrate data protection with cross-disk planning of data distribution and fully meet the requirements of storage applications in virtual machine environments.

RAID 2.0 is an enhanced RAID technology that effectively solves the problem that data is easily lost during reconstruction because mechanical hard drive capacities keep growing and the time needed to reconstruct one drive keeps increasing (i.e., the growing reconstruction window of traditional RAID groups). The basic idea is to cut each large mechanical hard drive into smaller chunks of fixed capacity, usually 64 MB, and to build RAID groups on these chunks; such a group of chunks is called a chunk group. The hard disks themselves no longer form a traditional RAID relationship. Instead, they form a larger disk group (a recommended maximum of 96 to 120 drives; more than 120 disks is not recommended), and different chunks on one disk can be combined with chunks on other disks of the same disk group into chunk groups of different RAID types, so the chunks of one disk can belong to multiple chunk groups with multiple RAID types. Organized this way, a storage system based on RAID 2.0 can, after a disk failure, reconstruct concurrently onto all the drives of the disk group rather than onto the single hot-spare disk of traditional RAID. This greatly shortens reconstruction time, reduces the risk of data loss caused by an ever-widening reconstruction window, and ensures the performance and reliability of the storage system even as disk capacities grow significantly. Figures 5.23 and 5.24 show the failure recovery mechanism of a storage array based on traditional RAID technology and on RAID 2.0 technology, respectively. RAID 2.0 does not change the traditional algorithms of the various RAID levels; it merely narrows the scope of a RAID group to a chunk group. RAID 2.0 technology therefore has the following technical characteristics:

  • Several dozen or even hundreds of mechanical hard drives form a disk group.

  • The hard drives in a disk group are divided into fixed-size chunks (64 MB in the description above), and chunks from different hard drives form a chunk group.

  • RAID calculations are performed within a chunk group. The system no longer has a dedicated hot-spare disk; it is replaced by hot-spare chunks reserved within the same disk group (a toy sketch of this layout follows the list).
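The following Python sketch is a toy model of the chunk and chunk-group idea described above; the disk count, chunk counts, and placement policy are assumptions chosen for illustration, not Huawei's actual layout algorithm. It shows how a failed disk's chunk groups can rebuild onto spare chunks spread across many surviving disks instead of onto a single hot-spare disk.

```python
# Toy model of RAID 2.0 style chunking: disks are cut into fixed-size
# chunks, chunk groups are formed from chunks on different disks, and a
# failed disk's chunks are rebuilt in parallel onto spare space that is
# spread over the whole disk group.
import random

DISKS = 24                  # disks in one disk group (assumed)
CHUNKS_PER_DISK = 100       # only part of each disk is modelled (100 x 64 MB)
GROUP_WIDTH = 9             # chunks per chunk group, e.g. a RAID 5 8+1 layout

free_chunks = {d: list(range(CHUNKS_PER_DISK)) for d in range(DISKS)}

def make_chunk_group() -> list[tuple[int, int]]:
    """Pick GROUP_WIDTH chunks, each taken from a different disk."""
    disks = random.sample([d for d, c in free_chunks.items() if c], GROUP_WIDTH)
    return [(d, free_chunks[d].pop()) for d in disks]

chunk_groups = [make_chunk_group() for _ in range(150)]

# If disk 3 fails, every affected chunk group rebuilds its missing member
# onto spare space of some *other* disk, so the rebuild load is shared.
rebuild_targets = []
for group in chunk_groups:
    if any(d == 3 for d, _ in group):
        surviving = {d for d, _ in group}
        candidates = [d for d, c in free_chunks.items() if d not in surviving and c]
        rebuild_targets.append(random.choice(candidates))

print("chunk groups affected by the failure:", len(rebuild_targets))
print("distinct disks sharing the rebuild:", len(set(rebuild_targets)))
```

Because many disks write the rebuilt chunks in parallel, the reconstruction window shrinks roughly in proportion to the number of participating drives, which is exactly the effect the text attributes to RAID 2.0.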

Fig. 5.23
figure 23

Storage array failure recovery mechanism based on traditional RAID technology

Fig. 5.24
figure 24

Storage array failure recovery mechanism based on RAID 2.0 technology

Because reconstruction after a drive failure can proceed concurrently onto the hot-spare space reserved on all the other drives of the same disk group, storage systems that use RAID 2.0 technology have the following advantages.

  • Fast reconstruction: all hard drives in the storage pool take part in reconstruction, which is significantly faster than traditional RAID reconstruction.

  • Automatic load balancing: RAID 2.0 lets the hard drives share the load evenly and eliminates dedicated hot-spare disks, improving system performance and drive reliability.

  • Improved system performance: LUNs are created from chunk groups and can be distributed across more physical drives than a traditional RAID group allows, so system performance rises effectively as the aggregate disk I/O bandwidth increases.

  • Self-healing: when a drive fails or raises an alert, no hot-spare disk is needed and the faulty disk does not have to be replaced immediately; the system reconstructs quickly and heals itself.

RAID 2.0+ provides even finer-grained resource units (as small as tens of kilobytes) on top of RAID 2.0, forming standard units for allocating and reclaiming storage resources, similar to virtual machines in compute virtualization; we call these units virtual blocks. These virtual blocks of uniform capacity make up a unified pool of storage resources, from which all applications, middleware, virtual machines, and operating systems can allocate and reclaim resources on demand. Compared with traditional RAID, RAID 2.0+ virtualizes and pre-configures storage resources, and the application for and release of storage resources is fully automated through storage pools, eliminating the time-consuming and error-prone manual steps of traditional RAID arrays such as creating RAID groups, creating LUNs, and formatting LUNs. RAID 2.0+ therefore meets the need for dynamic, on-demand allocation and release of storage resources in a virtual machine environment. Building on RAID 2.0, RAID 2.0+ has the following technical characteristics:

  • On the basis of RAID 2.0, chunk groups are divided into virtualized storage units with capacities of 256 KB to 64 MB.

  • Storage resources are automatically allocated and released in units of the particles mentioned above.

  • The units mentioned above can be placed within a storage pool or moved between different storage pools to implement fine-grained tiered storage.

  • After the system scales its performance or capacity by adding controllers, these standard units can be migrated automatically for load balancing.

Figure 5.25 shows a storage array based on RAID 2.0 plus technology.

Fig. 5.25
figure 25

Storage framework based on RAID 2.0+ technology

RAID 2.0+ technology is primarily used to allocate system resources intelligently to meet the storage needs of virtual machine environments. Its advantages are as follows:

  • Storage resources are automatically allocated and released on demand, meeting the most basic storage needs of virtual machines (see Fig. 5.26).

  • Different data can be stored in different tiers according to real-time business conditions, meeting high-performance business needs through flexible provisioning of high-performance storage resources such as SSDs (see Fig. 5.27).

  • Data is migrated automatically according to business characteristics, improving storage efficiency (see Fig. 5.28).

Fig. 5.26
figure 26

Storage capacity virtualization based on RAID 2.0 + technology

Fig. 5.27
figure 27

Real-time resource allocation based on RAID 2.0 + technology

Fig. 5.28
figure 28

Automatic data migration based on RAID 2.0 + technology

5.5 Storage Virtualization

Storage virtualization is the process of turning traditional hardware-based data storage into virtual storage. It integrates the individual functions within the system and improves the system's overall capability. Storage virtualization technology abstracts the complex and diverse physical storage devices of the real world and achieves reasonable control over them. From the consumer's point of view, storage virtualization virtualizes the disks and hard drives that originally stored the data; all data is stored through the virtualization layer, so consumers no longer need to worry about where data is placed or how it is protected. From the enterprise manager's perspective, storage virtualization places all enterprise data in virtualized storage pools and consolidates information management, enabling managers to use information more quickly and efficiently.

The virtualized storage architecture is shown in Fig. 5.29. In this architecture, the bottom layer consists of physical disks and the top layer is the cloud hard disk; the layers in between are produced by a series of operations such as logical partitioning and file system formatting.

Fig. 5.29
figure 29

Virtualized storage architecture

5.5.1 Virtualization of I/O Paths

I/O virtualization is a complex but essential part of virtualization technology. Overall, I/O virtualization comprises software-assisted virtualization and hardware-assisted virtualization. Software-assisted virtualization can be divided into full virtualization and para-virtualization. Broken down by device type, I/O virtualization can also be divided into character-device I/O virtualization (keyboard, mouse, monitor), block-device I/O virtualization (disk, CD-ROM), and network-device I/O virtualization (network card).

Both full virtualization and para-virtualization are implemented in software, so their performance is naturally limited. The best way to improve performance is through hardware. If a virtual machine exclusively owns a physical device and uses it just as the host would, performance is undoubtedly the best. A key technology in I/O virtualization, called I/O pass-through, enables a virtual machine to access a physical device directly through hardware without going through, or being intercepted by, the VMM. Because direct access to physical devices by multiple virtual machines involves shared memory access, an appropriate technology (the IOMMU architecture shown in Fig. 5.30) is required to isolate the memory accesses of the individual virtual machines.

Fig. 5.30
figure 30

IOMMU architecture

I/O pass-through requires appropriate hardware support, typically VT-d technology, which is implemented through chip-level changes. This approach brings a qualitative improvement in performance, requires no modification of the operating system, and offers excellent portability.

However, this approach has a limitation: if one virtual machine occupies a device, other virtual machines can no longer use it. To allow more virtual machines to share a single physical device, academia and industry have made many improvements. PCI-SIG released the SR-IOV specification (the SR-IOV architecture is shown in Fig. 5.31), which details how hardware vendors can share a single I/O device across multiple virtual machines. In general, the SR-IOV capability of a device is discovered and enabled through its Physical Function (PF); the physical resources, including send and receive queues, are divided into subsets according to the number of Virtual Functions (VFs); the PF driver then abstracts these resource subsets into VF devices, which can be assigned to virtual machines through an appropriate communication mechanism.
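As a concrete illustration, on Linux the SR-IOV capability of a PF is typically exposed through sysfs attributes such as sriov_totalvfs and sriov_numvfs. The sketch below shows how VFs might be created and listed this way; the PCI address is a hypothetical example, root privileges are required, it assumes no VFs are currently enabled, and the exact behaviour depends on the NIC and its driver.

```python
# Minimal sketch of enabling SR-IOV VFs on Linux via sysfs (requires root;
# the PF's PCI address below is a hypothetical example).
from pathlib import Path

pf = Path("/sys/bus/pci/devices/0000:03:00.0")          # assumed PF address

total_vfs = int((pf / "sriov_totalvfs").read_text())    # VFs the PF supports
print(f"PF supports up to {total_vfs} virtual functions")

# Ask the PF driver to create 4 VFs; each VF then appears as its own PCI
# device that can be passed through to a virtual machine.
(pf / "sriov_numvfs").write_text("4")

for vf_link in sorted(pf.glob("virtfn*")):              # virtfn0, virtfn1, ...
    print(vf_link.name, "->", vf_link.resolve().name)
```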

Fig. 5.31
figure 31

SR-IOV architecture

Although hardware I/O pass-through on the storage I/O path eliminates the extra overhead of VMM intervention in virtual machine I/O, I/O devices still raise a large number of interrupts during I/O operations. For security reasons, virtual machines cannot handle interrupts directly, so interrupt requests must be routed securely, in isolation from the VMM, to the appropriate virtual machines. In practice, therefore, a combination of software and hardware virtualization methods is generally used.

5.5.2 Block-Level and File-Level Storage Virtualization

  1. 1.

    Block-level storage virtualization

    As single-disk capacity increases, the problems of the traditional RAID approach become more and more apparent; in particular, the RAID reconstruction time after a disk failure keeps growing, which seriously challenges system reliability. Block-level storage virtualization breaks away from the traditional RAID approach: it hides the real physical addresses of storage from the user and presents the user's programs with logical storage. At the software level, it parses logical I/O requests and maps them to the correct physical addresses. As a result, storage virtualization lets administrators provide freely scalable storage capacity without users being aware of the details of storage expansion, data protection, and system maintenance behind their storage.

    Because of the advanced nature of block-level storage virtualization technology, many storage vendors have started to adopt it in their product lines, through in-house development or acquisition, for example:

    • On January 2, 2008, IBM acquired the Israeli storage technology company XIV, whose technology and personnel were incorporated into the storage business of IBM's Systems and Technology Group.

    • On January 29, 2008, Dell acquired Compellent for $1.4 billion.

    • In 2010, HP acquired 3PAR for $2.35 billion.

      These acquisitions show that traditional storage vendors value the storage revolution brought about by block-level storage virtualization technology, which enriches their product ranges and is better suited to modern big-data storage applications.

      The principle of block-level storage virtualization is to break each disk into many small chunks, with each chunk acting as the smallest cell for storing data; data is then distributed randomly and evenly across all chunks. A storage system of this kind is used as follows.

    • Step 1: The physical hard drives inside the system form three types of storage pools according to the performance of the media, and storage accessed from outside the system forms external storage pools.

    • Step 2: The space of each hard disk inside the system is cut into 64 MB logical blocks.

    • Step 3: Multiple logical blocks from different hard drives are combined into logical block groups according to the RAID policy.

    • Step 4: Each logical block group is cut into finer-grained logical blocks of 512 KB to 64 MB (4 MB by default; the size is configurable).

    • Step 5: One to N fine-grained logical blocks are combined on demand into volumes or files, so that the configured storage space can be used in the usual way.

      The logical structure of block-level storage virtualization is shown in Fig. 5.32.

  2. 2.

    File-level storage virtualization

    The bottom-level physical disks, and the centralized and distributed storage built from them, were described earlier. Whether the storage is centralized or distributed, using RAID or replica mechanisms yields a physical volume, but in most cases the entire physical volume is not mounted directly to the upper-layer application (the operating system or the virtualization system; here we refer only to virtualization systems), for reasons of data security. Physical volumes are typically grouped into volume groups, which are then divided into logical volumes, and the upper layer uses the space of the logical volumes.

    In cloud computing, the virtualization program formats the logical volumes, and the virtualized file system varies from vendor to vendor. VMware uses the Virtual Machine File System (VMFS) and Huawei uses the Virtual Image Management System; both are high-performance cluster file systems that allow virtualization to go beyond a single system, letting multiple compute nodes access a consolidated clustered storage pool at the same time. A cluster file system ensures that no single server or application can monopolize access to the file system.

    The Virtual Image Management System is a clustered file system based on SAN storage, so when FusionStorage provides the storage space, only non-virtualized storage can be used. FusionCompute manages each virtual machine's images and configuration files as files through the Virtual Image Management System, which ensures the consistency of data reads/writes in the cluster through a distributed lock mechanism. The minimum storage unit of the virtualization program is the LUN, which corresponds to a volume: the volume is the management object inside the storage system, and the LUN is the external embodiment of the volume. LUNs and volumes are divided from a pool of resources.

    With virtualization, LUNs can be divided into thick LUNs and thin LUNs.

    A thick LUN, also known as a traditional (non-thin) LUN, is a type of LUN that supports virtual resource allocation and allows easy creation, expansion, and compression operations. Once a thick LUN is created, its total storage space is allocated from the storage pool, i.e., the LUN size is exactly equal to the allocated space. As a result, it delivers high and predictable performance.

    A thin LUN is also a type of LUN that supports virtual resource allocation and allows easy creation, expansion, and compression operations. When a thin LUN is created, you can set its initially allocated capacity; the storage pool then allocates only that initial space, with the remaining space staying in the pool. When the usage of a thin LUN's allocated space reaches a threshold, the storage system allocates a further amount of space from the storage pool to the thin LUN, until the thin LUN reaches its full configured capacity. As a result, it achieves higher utilization of storage space.

    The main differences between a thick LUN and a thin LUN are as follows (a short sketch after the scenario lists below illustrates the difference in allocation).

    1. (1)

      Space allocation

      A thick LUN allocates all the storage space it needs at creation time, while a thin LUN uses an on-demand approach: its full capacity is defined at creation, but space is actually allocated dynamically according to usage (see Fig. 5.33).

    2. (2)

      Space recovery

      Space reclamation refers to freeing resources back to the storage pool so that other LUNs can reuse them. A thick LUN has no concept of space reclamation, because it occupies all the storage space allocated to it by the storage pool at creation time; even if data in the thick LUN is deleted, the allocated space remains occupied and cannot be used by other LUNs. If you manually delete an entire thick LUN that is no longer in use, its corresponding storage space is reclaimed.

      A thin LUN can automatically obtain new storage space as its space occupancy grows and can release storage space when files in the thin LUN are deleted, enabling the reuse of storage space and significantly improving storage utilization. Thin LUN space reclamation is shown in Fig. 5.34.

    3. (3)

      Performance difference

      Because a thick LUN has all of its space allocated from the start, it delivers high performance for sequential reads/writes but may waste some storage space.

      A thin LUN allocates space in real time, so each expansion requires additional capacity to be allocated and formatted in the background, which affects performance. Moreover, each allocation may leave the storage space discontinuous on the hard disk, so the disk spends more time seeking storage locations when reading/writing data, which also affects sequential read/write performance.

  3. 3.

    Application scenario

    The usage scenarios for a thick LUN are as follows:

    • Scenarios that require high performance.

    • Scenarios that are less sensitive to storage utilization.

    • Scenarios where the cost requirements are not too high.

      The usage scenarios for a thin LUN are as follows:

    • Scenarios with ordinary performance requirements.

    • Scenarios that are sensitive to storage utilization.

    • Scenarios that are sensitive to cost requirements.

    • Scenarios where it is hard to estimate the required storage space in real-world applications.
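The toy model below (not any vendor's implementation; the pool size, LUN sizes, and allocation policy are invented for illustration) contrasts the two allocation behaviours: a thick LUN reserves its full size from the pool at creation time, while a thin LUN draws space from the pool only as data is written.

```python
# Toy model contrasting thick and thin LUN allocation from a storage pool.

class StoragePool:
    def __init__(self, capacity_gb: int):
        self.capacity_gb = capacity_gb
        self.allocated_gb = 0

    def allocate(self, size_gb: int) -> None:
        if self.allocated_gb + size_gb > self.capacity_gb:
            raise RuntimeError("storage pool exhausted")
        self.allocated_gb += size_gb

class ThickLUN:
    def __init__(self, pool: StoragePool, size_gb: int):
        pool.allocate(size_gb)            # full capacity reserved up front
        self.size_gb = size_gb

class ThinLUN:
    def __init__(self, pool: StoragePool, size_gb: int):
        self.pool, self.size_gb, self.used_gb = pool, size_gb, 0

    def write(self, amount_gb: int) -> None:
        new_used = min(self.used_gb + amount_gb, self.size_gb)
        self.pool.allocate(new_used - self.used_gb)   # grow on demand
        self.used_gb = new_used                       # real systems grow in fixed grains

pool = StoragePool(capacity_gb=1000)
thick = ThickLUN(pool, 300)     # pool usage jumps to 300 GB immediately
thin = ThinLUN(pool, 500)       # pool usage unchanged until data lands
thin.write(50)
print(pool.allocated_gb)        # 350: 300 (thick) + 50 (thin so far)
```

Running the sketch prints 350, showing that the thin LUN has consumed from the pool only what was actually written, even though its configured size is larger than the thick LUN's.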

Fig. 5.32
figure 32

Logical structure of block-level storage virtualization

Fig. 5.33
figure 33

Space allocation of Thick LUN and Thin LUN

Fig. 5.34
figure 34

Thin LUN space recovery

In addition to virtualized cluster file systems, common file systems include NAS file systems (NFS and CIFS, described above) and operating system file systems.

A file system is a hierarchical organizational structure of many files. Once the operating system has a file system, the data we see can be presented as files or folders, which can then be copied, pasted, deleted, and recovered at any time. The file system organizes data into layers by means of directories, and a directory is where file pointers are kept. All file systems maintain such directories: an operating system maintains only its local directories, while a cluster must maintain the NAS or shared directories formed by the cluster file system.

Common operating system file formats include FAT32 (Windows), NTFS (Windows), UFS (UNIX), EXT2/3/4 (Linux), and so on.

Figure 5.35 shows the working process of an operating system file system, which includes the following steps (a small sketch after the list illustrates this mapping):

  • The user or application creates a file or folder.

  • The file or folder is saved in the file system.

  • The file system maps the data of the file to file system blocks.

  • The file system blocks correspond to logical areas of the logical volume.

  • The operating system or virtual machine maps the logical areas to physical areas of the physical disks; that is, the logical volume mentioned earlier corresponds to a physical volume.

  • The physical volume providing the physical areas may consist of one or more physical disks.
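The small sketch below walks through the mapping chain of Fig. 5.35 with made-up sizes and a trivial round-robin layout (both are assumptions for illustration only): file bytes are mapped to file system blocks, which fall into logical-volume areas, which in turn land on physical disks.

```python
# Toy illustration of the mapping chain: file bytes -> file system blocks
# -> logical-volume areas -> physical disks. Sizes and layout are assumed.

FS_BLOCK = 4 * 1024                  # 4 KB file system block (assumed)
EXTENT = 64 * 1024                   # 64 KB logical-volume area (assumed)

def file_to_fs_blocks(size_bytes: int) -> list[int]:
    """Which file system blocks a file of this size occupies (from block 0)."""
    count = -(-size_bytes // FS_BLOCK)           # ceiling division
    return list(range(count))

def fs_block_to_physical(block_no: int, disks: int = 2) -> tuple[int, int]:
    """Map a file system block to (disk index, byte offset) by rotating areas."""
    byte_offset = block_no * FS_BLOCK
    extent_no = byte_offset // EXTENT
    disk = extent_no % disks                     # round-robin areas over disks
    offset_on_disk = (extent_no // disks) * EXTENT + byte_offset % EXTENT
    return disk, offset_on_disk

blocks = file_to_fs_blocks(100_000)              # a 100 KB file
print(len(blocks), "file system blocks")
print(fs_block_to_physical(blocks[0]), fs_block_to_physical(blocks[-1]))
```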

Fig. 5.35
figure 35

Working process of the operating system file system

5.5.3 Host-Based Storage Virtualization

Today, storage virtualization has become the main direction for the future development of information storage technology. There are many ways to implement it; the technology is relatively mature and widely used in practice. Host-based virtual storage is an important storage virtualization technology based on volume management software. Most common host server operating systems, such as Windows and Linux, now ship with volume management software, so an enterprise that wants to implement host-based storage virtualization does not need to purchase additional commercial software; the operating system's own software is enough. Deployment is therefore much cheaper than buying commercial storage virtualization products. Moreover, because the virtualization layer and the file system reside on the same host server, their combination not only allows flexible management of storage capacity, but also lets logical volumes and file systems adjust their capacity dynamically without downtime; it offers high stability and supports heterogeneous storage systems.

But this approach has obvious drawbacks. Because the virtual volume management software is deployed on the host, it consumes part of the host's memory and processing time, which reduces performance, and storage scalability is relatively poor; this is a disadvantage of host-based storage virtualization. In addition, host upgrade, maintenance, and expansion are complex, and spanning multiple heterogeneous storage systems may require complex data migration processes that affect business continuity. Nevertheless, compared with other virtualization methods, host-based storage virtualization only requires installing virtual volume management software on the host, much of which comes with the operating system, so it is relatively simple and convenient and suits scenarios with modest storage requirements and a small number of users. Currently, Linux LVM is the most commonly used product in the host-based storage virtualization market.
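As a hedged illustration of host-based storage virtualization with Linux LVM, the outline below drives the standard LVM commands from Python. The device names, volume group name, and sizes are hypothetical, the commands need root privileges, and the script only prints the commands unless DRY_RUN is switched off; treat it as a sketch of the workflow rather than a ready-to-run production script.

```python
# Outline of the host-based LVM workflow: physical volumes are pooled into
# a volume group, and logical volumes are carved out of the pooled space.
import subprocess

DRY_RUN = True          # set to False only on a real host with the right devices

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    if not DRY_RUN:
        subprocess.run(cmd, check=True)

# 1. Mark two physical disks as LVM physical volumes.
run(["pvcreate", "/dev/sdb", "/dev/sdc"])
# 2. Pool them into one volume group, hiding the disk boundaries.
run(["vgcreate", "vg_data", "/dev/sdb", "/dev/sdc"])
# 3. Carve a logical volume out of the pooled capacity.
run(["lvcreate", "-L", "200G", "-n", "lv_app", "vg_data"])
# 4. Put a file system on the logical volume and mount it.
run(["mkfs.ext4", "/dev/vg_data/lv_app"])
run(["mount", "/dev/vg_data/lv_app", "/mnt/app"])
# Later the volume can be grown online without touching the application, e.g.
# lvextend -L +100G /dev/vg_data/lv_app followed by resize2fs.
```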

5.5.4 Storage Virtualization Based on Storage Devices

Another approach is storage virtualization based on storage devices, which adds virtualization capabilities to the storage controllers and is common in mid-to-high-end storage devices. Its purpose is to optimize user-oriented applications, integrate a user's different storage systems into a single platform, solve data management challenges, and implement information lifecycle management through tiered storage to further optimize the application environment. This technology is primarily used within the same storage device for data protection and data migration. Its advantages are that it is independent of the host, does not consume host resources, and offers rich data management functions. However, it also has drawbacks: first, it can generally only virtualize the disks inside the device; second, the data management functions of different vendors cannot interoperate; third, multiple sets of storage devices require multiple sets of data management software, so the cost is higher.

5.5.5 Network-Based Storage Virtualization

There is also network-based storage virtualization, implemented by adding a virtualization engine to the SAN, such as a virtual gateway, to provide virtualized network storage. With a virtual gateway, a SAN can establish storage volumes of different capacities in the virtual storage pool and manage the stored data virtually. Network-based storage virtualization is primarily used to consolidate heterogeneous storage systems and unify data management. Its advantages are: first, it is independent of the host and does not consume host resources; second, it supports heterogeneous hosts and heterogeneous storage devices; third, it unifies the data management functions of different storage devices; fourth, it can build a unified management platform with good scalability. Its disadvantage is that some vendors' products have weak data management functions, making it difficult to achieve the goal of unified data management through virtualization.

Depending on the network devices used, network-based storage virtualization can be divided into interconnect device virtualization and router device virtualization.

  1. 1.

    Interconnect device virtualization

    In interconnect device virtualization, whether control information and data travel over the same storage path depends on whether the virtualization approach is symmetric. Because interconnect devices are diverse and complex, symmetric storage virtualization is difficult to implement, so the asymmetric approach is generally used for them. Because its control information and data do not travel on the same path, the asymmetric storage virtualization method is easier to implement and offers better storage scalability than the symmetric method.

  2. 2.

    Router device virtualization

    Router device virtualization implements storage virtualization at the router level, which means that most of the virtualization functions are integrated into the router software. This approach can also be combined with host-based storage virtualization. In router device virtualization, the router is placed in the host's storage network path, where it can intercept and process the network storage commands issued by the host to realize storage virtualization. The router therefore becomes the host's service provider and performs the actual virtualized data storage for the computer system. Router device virtualization is more independent and stable, and has less impact on the host, than host-based storage virtualization and interconnect device virtualization. Although router device virtualization can also put host-protected data at risk, the risk is limited to the hosts connected to that router.

5.5.6 Storage Virtualization Products and Applications

This section focuses on the storage features of FusionCompute, Huawei's virtualization product.

The storage resources used by FusionCompute can come from local disks or dedicated storage devices. A dedicated storage device is connected to the host via network cables or optical fibers. A data store is FusionCompute's unified encapsulation of a storage unit in a storage resource. Once storage resources are encapsulated as data stores and associated with hosts, they can be further carved into virtual disks for use by virtual machines. The storage units that can be encapsulated as data stores include:

  • LUNs divided from SAN storage (iSCSI or Fibre Channel)

  • file systems divided from NAS storage

  • storage pools on FusionStorage Block

  • the host's local hard drives (virtualized)

These storage units are collectively referred to as "storage devices" in Huawei FusionCompute, while the physical storage media that provide storage space for virtualization are referred to as "storage resources," as shown in Fig. 5.36.

Fig. 5.36
figure 36

Huawei storage model

Before using a data store, you need to add the storage resource manually. If the storage resource is IP SAN, FusionStorage, or NAS storage, you need to add a storage interface to each host in the cluster; this interface then communicates with the service interface of the centralized storage controller or with the management address of FusionStorage Manager. If the storage resource is FC SAN, there is no need to add a separate storage interface.

After adding the storage resources, you need to scan for storage devices on the FusionCompute interface and finally add them as data stores.

A data store can be virtualized or non-virtualized, and a LUN on SAN storage can also be used directly as storage for a virtual machine instead of creating virtual disks on it; this is called raw device mapping. Raw device mapping currently supports only virtual machines running certain operating systems and is suitable for scenarios that require large disk space, such as building database servers.

5.6 Distributed Storage

The success of Internet companies such as Amazon, Google, Alibaba, Baidu, and Tencent gave birth to technologies such as cloud computing, big data, and artificial intelligence. A key goal of the infrastructure behind the various applications these companies provide is to build a high-performance, low-cost, scalable, and easy-to-use distributed storage system.

Although distributed storage systems have existed for many years, it is only in recent years, with the rise of big data and artificial intelligence applications, that they have been applied to engineering practice on a large scale. Compared with traditional storage systems, the new generation of distributed storage systems has two important characteristics: low cost and large scale. This stems mainly from the actual needs of the Internet industry; it can be said that Internet companies have redefined large-scale distributed storage systems.

5.6.1 Overview of Cloud Storage

Cloud storage is a new concept extended and developed from the concept of cloud computing. In fact, cloud storage is part of a cloud computing system, but whereas cloud computing emphasizes massive processing power, cloud storage emphasizes the storage behind the "cloud." Cloud storage refers to a system that uses functions such as cluster applications, grid technology, or distributed file systems to bring together, through application software, a large number of different types of storage devices in the network so that they work together to provide data storage and business access functions to the outside world.

Compared with a traditional storage device, cloud storage is not just a piece of hardware but a complex system comprising network devices, storage devices, servers, application software, public access interfaces, access networks, and client programs. Each part centers on the storage devices and provides data storage and business access services to the outside world through application software. The structural model of a cloud storage system is shown in Fig. 5.37.

Fig. 5.37
figure 37

Structure model of cloud storage system

The structural model of the cloud storage system consists of four layers.

  1. (1)

    Storage layer

    The storage layer is the most essential part of cloud storage. The storage devices can be FC storage devices, IP storage devices such as NAS and iSCSI devices, or DAS storage devices such as SCSI or SAS devices. Storage devices in cloud storage are often numerous and distributed in different places, connected through a wide area network, the Internet, or an FC network. Above the storage devices is a unified storage device management system, which implements logical virtualization management of the storage devices, multi-link redundancy management, and status monitoring and fault maintenance of the hardware.

  2. (2)

    Basic management layer

    The basic management layer is the core of cloud storage and the most difficult part to realize. It uses cluster systems, distributed file systems, and grid computing to make the many storage devices in cloud storage work together, so that multiple storage devices can provide the same service and deliver better data access performance. Content distribution and data encryption technologies ensure that unauthorized users cannot access the data in cloud storage, while various data backup and disaster recovery technologies and measures ensure that data in cloud storage is not lost and that the cloud storage system remains secure and stable.

  3. (3)

    Application interface layer

    The application interface layer is the most flexible part of cloud storage. Different cloud storage operators can develop different application service interfaces and provide different application services according to the actual business type.

  4. (4)

    Access layer

    Any authorized user can log in to the cloud storage system through a standard public application interface and enjoy cloud storage services. Different cloud storage operators provide different access types and access methods, such as video surveillance platforms, interactive Internet TV (IPTV) and video-on-demand platforms, network hard disk platforms, and remote data backup platforms.

5.6.2 HDFS

HDFS is a core subproject of the Hadoop project and the basis of data storage management in distributed computing. It was developed to meet the need for streaming access to and processing of large files and runs on inexpensive commodity servers. Its high fault tolerance, high reliability, high scalability, high availability, and high throughput provide a highly fault-tolerant storage solution for massive data and bring great convenience to the processing of large data sets. The overall structure of HDFS is shown in Fig. 5.38.

Fig. 5.38
figure 38

Overall structure of HDFS

HDFS consists of a name node (NameNode) and multiple data nodes (DataNode). The NameNode is the central server that manages the file system namespace (NameSpace) and client access to files, while a DataNode is usually an ordinary computer responsible for the actual data storage. Using HDFS is very similar to using a familiar stand-alone file system: you can create directories, create, copy, and delete files, or view file contents. Underneath, however, a file is cut into blocks that are scattered across different DataNodes, and each block can be replicated several times and stored on different DataNodes for fault tolerance and disaster tolerance. The NameNode is the heart of HDFS: by maintaining a set of data structures, it records how many blocks each file is cut into, which DataNodes those blocks can be obtained from, and important information such as the status of each DataNode.

The following introduces the implementation of HDFS from the point of view of its write process and its read process, respectively.

  1. (1)

    The writing process of HDFS

    The NameNode is responsible for managing the metadata of all files stored on HDFS. It confirms the client's request and records the file's name and the set of DataNodes that will store the file, keeping this information in a file allocation table in memory.

    For example, if the client sends a request to the NameNode to write a .log file to HDFS, the execution process is as shown in Fig. 5.39.

    One of the challenges in designing a distributed file system is ensuring data consistency. In HDFS, data is not considered written until all the DataNodes that are to hold it have confirmed that they have a copy of the file. Consistency is therefore established during the write phase, and a client will get the same data no matter which DataNode it reads from.

  2. (2)

    The reading process of HDFS

    To understand the HDFS read process, think of a file as being made up of blocks stored on DataNodes. The execution process for a client viewing what was previously written is shown in Fig. 5.40.

    The client fetches the blocks of the file from different DataNodes in parallel and then merges them into the complete file (a toy simulation after the step-by-step lists below illustrates both the write and read flows).

Fig. 5.39
figure 39

HDFS writing process

 • Step 1: The client sends a message to the NameNode indicating that it wants to write the .log file to HDFS (see (1) in the figure)

 • Step 2: The NameNode replies to the client, instructing it to write to DataNode A, DataNode B, and DataNode D, and to contact DataNode B directly (see (2) in the figure)

 • Step 3: The client sends a message to DataNode B asking it to save the .log file and to send copies to DataNode A and DataNode D (see (3) in the figure)

 • Step 4: DataNode A sends a message to DataNode B asking for the .log file (see (4) in the figure)

 • Step 5: DataNode B sends a message to DataNode A transmitting the .log file and sends a copy to DataNode D (see (5) in the figure)

 • Step 6: DataNode D sends a confirmation message to DataNode A

 • Step 7: DataNode A sends a confirmation message to DataNode B

 • Step 8: DataNode B sends a confirmation message to the client, indicating that the write is complete (see (6) in the figure)

Fig. 5.40
figure 40

HDFS read process

 • Step 1: The client asks the NameNode to read the file (see (1) in the figure)

 • Step 2: The NameNode sends the block information to the client (the block information contains the IP addresses of the DataNodes that hold copies of the file and the block IDs the DataNodes need in order to find the blocks on their local hard drives) (see (2) in the figure)

 • Step 3: The client checks the block information, contacts the relevant DataNodes, and requests the blocks (see (3) in the figure)

 • Step 4: The DataNodes return the file contents to the client and then close the connection, completing the read operation
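The toy simulation below mirrors the write and read flows just described: a NameNode that only hands out block locations and records the block map, and DataNodes that store blocks and forward replicas along a pipeline. It is a self-contained illustration, not the real HDFS code or client API, and the block size, replica count, and placement policy are deliberately simplified.

```python
# Toy simulation of the HDFS write pipeline and read path described above.
BLOCK_SIZE = 4            # bytes per block, tiny for illustration
REPLICAS = 3

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}                      # filename -> [(block_id, [nodes])]

    def allocate_block(self, filename, block_id):
        nodes = self.datanodes[:REPLICAS]        # trivial placement policy
        self.block_map.setdefault(filename, []).append((block_id, nodes))
        return nodes

class DataNode:
    def __init__(self, name):
        self.name, self.blocks = name, {}

    def write_block(self, block_id, data, pipeline):
        self.blocks[block_id] = data             # store locally ...
        if pipeline:                             # ... then forward downstream
            pipeline[0].write_block(block_id, data, pipeline[1:])

datanodes = [DataNode(n) for n in ("A", "B", "D")]
namenode = NameNode(datanodes)

def hdfs_write(filename, data):
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = f"{filename}#{i // BLOCK_SIZE}"
        first, *rest = namenode.allocate_block(filename, block_id)
        first.write_block(block_id, data[i:i + BLOCK_SIZE], rest)

def hdfs_read(filename):
    pieces = []
    for block_id, nodes in namenode.block_map[filename]:
        pieces.append(nodes[0].blocks[block_id])     # read each block from a replica
    return b"".join(pieces)

hdfs_write("test.log", b"hello hdfs!")
assert hdfs_read("test.log") == b"hello hdfs!"
```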

5.6.3 Peer Storage System

Peer-to-peer storage technology forms a storage network from storage nodes that are functionally equal to one another; it is a type of distributed storage.

Unlike the traditional centralized-control model, the storage nodes of a peer-to-peer storage system are equal in status. Specifically, a peer-to-peer storage system can be composed entirely of server nodes organized in a peer-to-peer manner, entirely of user desktop machines, or of a mixture of servers and desktops in a peer-to-peer manner. Any storage system organized in a functionally equivalent manner is a peer-to-peer storage system. Peer-to-Peer (P2P) technology is the foundation and physical support of peer-to-peer storage technology.

A peer-to-peer network, also known as an "overlay network," is a complex overlay with high dynamics. It is a virtual layer built on top of one or more existing physical networks. By overlaying a more abstract network on the original physical networks, different networks can be interconnected and resources shared without changing the existing network structure.

A peer-to-peer system built on a peer-to-peer network is a distributed system built at the application layer. Each node communicates through a unified routing protocol, and messages are transmitted along the logical connections of the peer-to-peer network.
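To make the unified routing idea concrete, the sketch below shows a simplified, Chord-style structured overlay: node IDs and data keys are hashed onto the same identifier ring, and each key is stored on the first node whose ID follows it on the ring. The ring size and peer names are arbitrary, and real protocols add finger tables, replication, and handling of nodes joining and leaving.

```python
# Simplified structured-overlay lookup: hash nodes and keys onto one ring,
# and store each key on its successor node (the next node ID clockwise).
import hashlib
from bisect import bisect_right

RING_BITS = 16                                   # small ring for illustration

def ring_id(name: str) -> int:
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)

nodes = {ring_id(n): n for n in ("peer-1", "peer-2", "peer-3", "peer-4")}
ring = sorted(nodes)

def locate(key: str) -> str:
    """Return the peer responsible for this key (its successor on the ring)."""
    kid = ring_id(key)
    idx = bisect_right(ring, kid) % len(ring)    # wrap around past the top
    return nodes[ring[idx]]

for key in ("movie.mkv", "paper.pdf", "backup.tar"):
    print(key, "->", locate(key))
```

Because every node can run the same lookup, any peer can find the node responsible for a key without a central directory, which is what makes the nodes functionally equal.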

Compared with the traditional client/server model, the peer-to-peer storage system has the following advantages.

  1. (1)

    High scalability

    The most important feature of a peer-to-peer storage system is that every network node plays two roles: client and server. In this mode, capacity can be expanded arbitrarily; if the system uses a structured topology, capacity can be expanded or shrunk at will without disturbing the system's normal operation at all.

  2. (2)

    Large system capacity

    Since there is no direct mapping between data and servers, storage capacity is not limited by the hardware of any one machine. Today a single IDE hard disk can exceed 1 TB, and statistics show that the utilization of storage media is under 50%, i.e., a large amount of storage capacity is wasted. A peer-to-peer storage system can pool the large amount of free disk space on different computers and share it with the users who need it.

  3. (3)

    Good service performance

    Severely unstable network conditions usually cause irreparable losses to a traditional client/server storage system. In a peer-to-peer storage system, because data is stored on many nodes that share the risk, changes in the number of nodes do not seriously affect system performance, and the system adapts well to dynamic changes in the network. Moreover, keeping file copies at different locations in the network lets nodes access data nearby, reducing access latency and improving data read performance.

  4. (4)

    High reliability

    Peer-to-peer storage systems are usually self-organizing and can adapt well to nodes dynamically joining and leaving. They also implement file fault tolerance, so the target file can still be obtained even if some nodes fail. Besides, because the system's nodes are distributed across different geographical areas, local catastrophes such as earthquakes or partial power outages cannot destroy the entire system; that is, the data has good disaster tolerance.

  5. (5)

    Low system cost

    Users do not need to spend large amounts of money on file servers and network equipment to build a large-capacity, high-performance file storage system. A peer-to-peer storage system does not place high demands on the performance of the participating devices: as long as their storage resources can be connected to the network and organized effectively, storage services can be provided. In addition, because nodes access data nearby, a large amount of network traffic is confined to a local area network such as a campus network, so most traffic does not have to traverse the external network, which significantly saves network costs under traffic-based billing.

    It is precisely because of these advantages that peer-to-peer storage systems have attracted researchers' general attention and in-depth research.

The software Napster, launched in 1999, gradually made the peer-to-peer model replace the client/server model as a research and application hotspot. With the widespread popularity of the Internet, the significant increase in network bandwidth, and the rapid growth of ordinary computers' computing power, end-user equipment has gradually become an available computing resource that breaks the storage bottleneck. Research on peer-to-peer networks has since become pervasive, covering roughly the following areas: data storage, parallel computing, and instant messaging. Well-known peer-to-peer storage systems include Gnutella, Napster, Kazaa, Morpheus, Freenet, Chord, CAN, Pastry, BitTorrent, FreeHaven, etc. Each has its own characteristics in terms of node concentration and network topology. An important criterion for classifying peer-to-peer storage systems is the degree of centralization, i.e., the degree of reliance on servers, which refers to whether a server is needed to mediate between communicating nodes. According to the degree of centralization of node organization, peer-to-peer storage systems can be divided into the following categories.

  1. (1)

    Completely decentralized

    There is no server at all, and all nodes in the network play the same role: each is both a server and a client. No central component intervenes in or regulates the communication of any node. Typical representatives of such peer-to-peer storage systems are Gnutella and Pastry.

  2. (2)

    Partially centralized

    The system contains SuperNodes that differ from ordinary nodes and play a more important role. A supernode holds the file directories of the other nodes in its network area. The supernodes are not appointed "for life" but are designated dynamically by a fixed election algorithm. Typical representatives of such peer-to-peer storage systems are Kazaa and Morpheus.

  3. (3)

    Mixed decentralized

    There is a file server in the system, and the file indexes of all nodes are recorded on that server. When the system runs, the server is responsible for looking up the location of the file required by a requesting node in the network; the requesting node then connects to the target node directly to communicate. Clearly, the relationship between the nodes and the server still follows the client/server model. Typical representatives of such peer-to-peer storage systems are Napster and BitTorrent.

    According to the degree of node concentration and the network topology, the existing peer-to-peer storage systems can be classified as shown in Table 5.2.

Table 5.2 Peer-to-peer storage system classification

5.7 Exercise

  1. (1)

    Multiple choices

    1. 1.

      When you create a port group in Huawei FusionCompute, the following operation is incorrect ( ).

      1. A.

        Set the VLAN ID to “5000”.

      2. B.

        Set the name of the port group to “ceshi”.

      3. C.

        Set the port type to “Normal”.

      4. D.

        Add “This is a test port” to the description.

    2. 2.

      In Huawei FusionCompute, the role of the uplink is ( ).

      1. A.

        Assign IP address

      2. B.

        Connect the virtual and physical networks for virtual machines

      3. C.

        Manage virtual machine MAC address

      4. D.

        Detect the status of the virtual network card

    3. 3.

      The following description of cloud computing is correct ( ).

      1. A.

        Cloud computing is a technology that enables easy, on-demand access to IT resources anytime, anywhere.

      2. B.

        Various IT resources in cloud computing are available for a fee.

      3. C.

        IT resources acquired in cloud computing need to be used over the network.

      4. D.

        In the process of acquiring IT resources, users need to negotiate repeatedly with cloud computing service providers.

    4. 4.

      In the development of the Internet, there have been many milestone events. The option that lists these milestones in the correct order is ( ).

      1. A.

        TCP/IP Establishment—The Birth of ARPANET—The World Wide Web Is Officially Open to the Public—The Birth of DNS

      2. B.

        The World Wide Web is officially open to the public—the birth of DNS—TCP/IP establishment—the birth of ARPANET

      3. C.

        The birth of ARPANET—TCP/IP establishment—the birth of DNS—the World Wide Web is officially open to the public

      4. D.

        The birth of DNS—TCP/IP establishment—the birth of ARPANET—the World Wide Web is officially open to the public

  2. (2)

    Answer the following questions

    1. 1.

      Briefly summarize the benefits of virtualization technology in physical devices.

    2. 2.

      List common RAID technologies and compare the differences between different RAID technologies.

    3. 3.

      What improvements does RAID 2.0 make, and how does it improve storage performance?

    4. 4.

      Explain the principles of GFS and HDFS, and briefly explain their respective advantages.