Synonyms

SIMD (Single Instruction, Multiple Data) Machines

Definition

The SX Series is the line of parallel vector supercomputers that NEC provides as its flagship high-performance computing system. Since the advent of the pioneering SX-1 and SX-2 models in 1983, NEC has continuously enhanced the series up to the SX-9, which has the world’s fastest single CPU core performance of 102.4 GFLOPS. Sixteen such CPUs make up a single node with a peak performance of 1.6 TFLOPS, and up to 512 nodes can be configured into one system, enabling almost petascale computing with a maximum vector performance of 839 TFLOPS. The technology of the SX Series formed the basis of the Earth Simulator at JAMSTEC (Japan Agency for Marine-Earth Science and Technology), which held the number one position on the Linpack benchmark five times in a row during 2002–2004, and of its successor model (the renewed Earth Simulator), which was put into operational use in 2009. The SX Series has been used for a spectrum of applications ranging from scientific research and mission-critical weather forecasting to engineering and material design.

Discussion

History of the Supercomputer SX Series

In April 1983, NEC entered the supercomputer market by simultaneously announcing the first two models in the series, the SX-1 and the SX-2, the world’s fastest supercomputers at the time. The SX-2 achieved a peak performance of 1.3 GFLOPS (1.3 billion floating-point operations per second), marking the beginning of the era of gigaflop computing.

The successor model, the SX-3, announced in April 1990, set a new world speed mark with a peak performance of 22 GFLOPS from four processors, each delivering a maximum of 5.5 GFLOPS. Offering a lineup of commercial application software, the SX-3 not only combined a shared-memory multiprocessor with parallel processing technology, but also pioneered the era of open systems with SUPER-UX, a 64-bit operating system developed on the basis of UNIX.

With an increase in expandability to a maximum of 512 processors, the SX-4 was announced in November 1994, achieving a maximum performance of 1 TFLOPS.

The SX-4 adopted CMOS for the CPU instead of the conventional silicon bipolar LSI, which both reduced power consumption and allowed higher-density packaging. This in turn lowered cost, because air cooling could replace the liquid-cooling systems used previously. The use of DRAM instead of SRAM as the memory device expanded the memory capacity, and the resulting improvement in the price-performance ratio was complemented by a comprehensive list of application software.

In June 1998, the SX-5, the fourth generation in this series, was unveiled. By doubling both the clock frequency and the number of vector pipelines, the new model achieved a CPU vector performance of 8 GFLOPS, giving a system peak vector performance of 4 TFLOPS in a parallel configuration with 512 CPUs. The fifth-generation SX-6, released in October 2001, integrated into a single chip what had required 30 LSI chips in the previous CPU model (SX-5).

With 1,024 CPUs each having a maximum CPU performance of 8 GFLOPS, the SX-6 delivered a peak performance of 8 TFLOPS. Equipped with a maximum of 32 CPUs per node and a maximum of 256 GB of shared memory combined with automatic parallelization, the SX-7 introduced in October 2002 realized further ease of operation.

Also during this period, the delivery of the Earth Simulator, for which NEC was in charge of the hardware and basic software development, was completed. The Earth Simulator began operation in March 2002, and its performance not only stunned the world, but also has continued to make significant contributions to understanding challenging environmental issues such as global climate change and advancing a wide spectrum of scientific findings and industrial applications.

Shipping of the sixth-generation SX-8 started at the end of 2004, achieving a CPU performance of 16 GFLOPS and a system peak performance of 65 TFLOPS with a maximum of 4,096 CPUs. Subsequently, in October 2007, the seventh-generation SX-9 was announced. Featuring the world’s fastest single-core performance of 102.4 GFLOPS, large-scale shared memory of up to 1 TB, and interconnects with a data transfer rate of 128 GB/s, it can deliver a near-petaflops peak performance of 839 TFLOPS. In addition, improved LSI technology and high-density packaging technology have reduced both power consumption and required installation space to approximately a quarter of that required for conventional supercomputers.

Figure 1 shows pictures of the major models of the SX Series. Table 1 summarizes the evolution of the SX Series together with its specifications.

NEC SX Series Vector Computers. Fig. 1 Pictures of the SX Series hardware

NEC SX Series Vector Computers. Table 1 SX Series system specification

SX-9 Architecture

The following sections describe the SX Series with a particular focus on the latest model, the SX-9. The development of the SX-9 aimed at both high performance for real application programs and economical operation.

The SX-9 has been developed with the following features:

  1. Fast single-chip vector processor

    Using the 65 nm CMOS technology, the SX-9 realizes 102.4 GFLOPS per processor – the world’s fastest single core performance.

  2. Multiprocessor system with excellent scalability

    A single SX-9 node has a flat shared memory configuration for a maximum of 16 CPUs with a peak single-node performance of 1.638 TFLOPS and a maximum memory capacity of 1 TB. This shared memory node is interconnected with a special high-speed switch at a maximum of 128 GB/s per node, enabling configurations of up to a maximum of 512 nodes (839 TFLOPS) to provide both shared and distributed memory systems with excellent scalability.

  3. Superior energy-saving and installation advantages

    In addition to the reduced processor energy consumption and heat generation, the system adopts the high-efficiency cooling technology, resulting in improved system power consumption and easy installation.
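The peak figures quoted in this list follow from simple multiplication of the per-CPU peak by the CPU and node counts. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative only and are not NEC terminology.

```python
# Peak-performance arithmetic for the SX-9, using only figures quoted in the text.
# All names are illustrative.

GFLOPS_PER_CPU = 102.4    # peak of one single-chip vector processor
MAX_CPUS_PER_NODE = 16    # CPUs sharing one flat memory
MAX_NODES = 512           # largest multinode configuration


def node_peak_gflops(cpus: int = MAX_CPUS_PER_NODE) -> float:
    """Peak performance of one shared-memory node in GFLOPS."""
    return GFLOPS_PER_CPU * cpus


def system_peak_tflops(nodes: int = MAX_NODES,
                       cpus_per_node: int = MAX_CPUS_PER_NODE) -> float:
    """Peak performance of a multinode system in TFLOPS (1 TFLOPS = 1000 GFLOPS)."""
    return node_peak_gflops(cpus_per_node) * nodes / 1000.0


if __name__ == "__main__":
    print(f"single node: {node_peak_gflops() / 1000.0:.3f} TFLOPS")
    print(f"512 nodes:   {system_peak_tflops():.0f} TFLOPS")
```

With the defaults, the node figure evaluates to 1638.4 GFLOPS and the system figure to about 838.9 TFLOPS, which the text rounds to 1.638 TFLOPS and 839 TFLOPS.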

Hardware Overview

Figure 2 shows the SX-9 system configuration. The product lineup ranges from a shared-memory single-node model, which tightly integrates a maximum of 16 CPUs and the main memory unit (MMU), to a multinode system model, which connects nodes in a cluster configuration via a high-speed internode crossbar switch (IXS). The single-node system is available in two models: model A, with a maximum of 16 CPUs (1.638 TFLOPS), a maximum main memory capacity of 1 TB, and a maximum of 32 input/output slots, and model B, with a maximum of eight CPUs (819.2 GFLOPS), a maximum main memory capacity of 512 GB, and a maximum of 16 input/output slots. The multinode system interconnects 2–512 nodes in a cluster configuration via the IXS, with up to 8,192 CPUs.

NEC SX Series Vector Computers. Fig. 2 SX Series system configuration

The CPU of the SX-9 uses eight sets of vector pipelines to achieve a peak performance of over 100 GFLOPS with a single unit. The memory with a maximum capacity of 1 TB can be shared by up to 16 CPUs, the maximum data transfer rate between the CPU and the MMU is 4 TB/s, and that between the I/O unit and the MMU is 64 GB/s ×2.

Processor

The CPU is composed of a vector unit and a scalar unit, both of which are connected to the MMU through the processor-memory network. The SX-9 features the addition of the ADB (Assignable Data Buffer) within a CPU for the selective buffering of data. Figure 3 shows the configuration of the CPU of the SX-9.

NEC SX Series Vector Computers. Fig. 3 SX-9 CPU configuration

Vector Unit

The vector unit is composed of the vector operation block and the vector control block.

Vector Operation Block

The vector operation block provides the basic Logical, Multiply, Add, and Divide/Sqrt operations, as well as the Mask and Load/Store functions. It also includes 16 mask registers and 72 vector registers, and it supports the IEEE double- and single-precision floating-point data formats.

Vector Operation Pipeline

The vector operation block consists of eight pipelines. To reduce the time to transfer the result of one vector operation to another operation, the vector unit has the enhanced data forwarding capability (chaining), improving the processing speed of programs with a short vector loop length. The vector pipelines have vector data compression and decompression capabilities for the efficient use of memory bandwidth.

Vector register

The vector operation block has 72 vector registers. Each vector register holds 256 elements, each 64 bits wide. The total capacity of the registers reaches 144 KB.
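The 144 KB total follows directly from the register count, the register length, and the element width. A minimal sketch of the calculation, with illustrative names and 1 KB taken as 1,024 bytes:

```python
# Capacity of the SX-9 vector register file, from the figures in the text.
# Names are illustrative; 1 KB is taken as 1024 bytes.

NUM_VECTOR_REGISTERS = 72
ELEMENTS_PER_REGISTER = 256
BITS_PER_ELEMENT = 64


def vector_register_file_bytes() -> int:
    """Total size of all vector registers in bytes."""
    bytes_per_element = BITS_PER_ELEMENT // 8
    return NUM_VECTOR_REGISTERS * ELEMENTS_PER_REGISTER * bytes_per_element


def bytes_to_kb(n_bytes: int) -> float:
    """Convert bytes to KB (1 KB = 1024 bytes)."""
    return n_bytes / 1024


if __name__ == "__main__":
    print(bytes_to_kb(vector_register_file_bytes()), "KB")
```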

Vector mask register

Some vector operations can be processed under a 256-bit mask, in which each mask bit enables or disables the operation for the corresponding vector element. Mask bits are generated in the pipeline according to the results of logical operations. There are 16 mask registers, each of which is 256 bits wide.
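The effect of a mask on a vector operation can be modeled in a few lines. The sketch below is a plain software model, not SX-9 code: the function names are hypothetical and do not correspond to actual SX instructions. A comparison generates the mask, and an addition is then applied only where the mask bit is set.

```python
# Software model of a masked vector operation on 256-element registers.
# Function names are hypothetical; they are not SX-9 instruction names.

VLEN = 256  # elements per vector register, and bits per mask register


def vector_compare_gt(a, b):
    """Generate a mask whose bit i is True where a[i] > b[i]."""
    return [x > y for x, y in zip(a, b)]


def masked_add(dst, a, b, mask):
    """Return a[i] + b[i] where mask[i] is set, and dst[i] unchanged elsewhere."""
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, mask)]


a = [float(i) for i in range(VLEN)]
b = [128.0] * VLEN
mask = vector_compare_gt(a, b)       # set for elements 129..255
result = masked_add(a, a, b, mask)   # adds 128 only where a[i] > 128
```

In this model the conditional inside a loop is expressed as data (the mask) rather than as a branch, which is what allows the whole loop to run as vector operations.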

Vector control block

The SX-9 CPU has a vector control block which enables the out-of-order execution of vector instructions, contributing to hiding memory latency. Furthermore, this block controls chaining to perform vector processing with high efficiency.

Scalar Unit

The scalar unit of the SX-9 employs the 64-bit RISC architecture that is compatible with the SX Series products and incorporates 128 general-purpose 64-bit registers and two 32 KB L1 caches for instructions and data. Its features are as described below.

6-way super-scalar configuration

The SX-9 uses a super-scalar configuration with a total of six execution pipelines (two for floating-point arithmetic operations, two for integer arithmetic operations, and two for memory operations), thus improving the performance of application codes with a high degree of instruction-level parallelism.

Instruction execution control

The out-of-order execution mechanism contributes to high performance by executing instructions as soon as they become executable, regardless of the instruction order. To find executable instructions over a wide range of the program, the SX-9 can search for and fetch instructions across up to eight branch instructions. It also adopts a speculative execution mechanism that tentatively executes subsequent instructions before a branch instruction is actually executed and, when the branch prediction fails, restarts execution under the correct condition. To expand the instruction window of the out-of-order execution mechanism, the reorder buffer of the SX-9 has been enlarged to 64 entries.

ADB (Assignable Data Buffer)

The ADB is located within the CPU for the selective buffering of data. Data transfer between the ADB and the vector unit has shorter latency than transfer between the processor and the memory, so storing frequently used data in the ADB yields higher performance and also reduces the memory bank contention that arises from concurrent processing. Depending on the running application, the ADB can be set to hold vector data only, scalar data only, or both vector and scalar data. Use of the ADB is assigned for each vector load/store instruction so that the necessary data are retained in the ADB as much as possible. A software-controlled prefetch instruction is also supported.

Multinode Configuration

The multinode system of the SX-9 can connect up to 512 nodes by grouping shared-memory single nodes into clusters and connecting them to the IXS ultra-high-speed crossbar switch. The multinode system features not only a wide bandwidth for internode data transfer but also reduced communication latency, owing to the RCU (remote access control unit), which is the IXS connection unit of each node. Figure 4 shows the configuration of the SX-9 multinode system. Each node incorporates up to 16 RCUs, which are connected to the IXS via cables. Each RCU forms a single lane with two connection ports of 4 GB/s ×2 each, offering a transfer rate of 8 GB/s ×2. As a result, each node can incorporate a maximum of 16 lanes with 32 connection ports, and the total transfer performance reaches a maximum of 128 GB/s ×2.
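The internode bandwidth figures can be reproduced from the port rate and the port and RCU counts. The sketch below assumes that "×2" denotes the same rate in each of two directions, so all values are per direction; that reading, and the names used, are assumptions made for illustration.

```python
# Internode (IXS) bandwidth arithmetic per node, from the figures in the text.
# Assumption: "x2" means the same rate in each direction, so values are per direction.

GB_PER_S_PER_PORT = 4   # one connection port, one direction
PORTS_PER_RCU = 2       # each RCU (lane) has two ports
MAX_RCUS_PER_NODE = 16


def lane_bandwidth_gb_s() -> int:
    """Bandwidth of one RCU lane, per direction."""
    return PORTS_PER_RCU * GB_PER_S_PER_PORT


def node_ports(rcus: int = MAX_RCUS_PER_NODE) -> int:
    """Number of IXS connection ports on a node with the given RCU count."""
    return rcus * PORTS_PER_RCU


def node_bandwidth_gb_s(rcus: int = MAX_RCUS_PER_NODE) -> int:
    """Aggregate internode bandwidth of a node, per direction."""
    return rcus * lane_bandwidth_gb_s()


if __name__ == "__main__":
    print(lane_bandwidth_gb_s(), "GB/s per lane")
    print(node_ports(), "ports,", node_bandwidth_gb_s(), "GB/s per node")
```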

NEC SX Series Vector Computers. Fig. 4 SX-9 multinode system configuration

Main Memory Unit (MMU)

The MMU adopts the shared memory system and is composed of 512 MMU cards for its maximum configuration. The memory of the SX-9 has a large capacity thanks to the use of DDR-SDRAM (Double Data Rate-SDRAM) devices for all the MMUs.

The memory capacity per MMU card is 2 GB, and the system supports the capacity from 256 GB up to 1 TB. The MMU cards are capable of handling concurrent operations of memory access requests from the CPU with a data transfer rate of 8 GB/s per MMU card, or a maximum of 4 TB/s for the entire system. The maximum 32,768-way parallel memory interleaving is adopted for the efficient data transfer with DDR3-SDRAM.
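The capacity and bandwidth totals are products of the per-card figures, and the benefit of interleaving can be illustrated with a toy address-to-bank mapping. In the sketch below the totals use only numbers from the text, whereas the mapping itself (consecutive 64-bit words assigned to consecutive banks) is an assumption chosen for illustration; the actual SX-9 mapping is not described here.

```python
# MMU totals from the text, plus a toy model of memory interleaving.
# The bank mapping is an ASSUMPTION for illustration, not the documented SX-9 scheme.

MMU_CARDS = 512
GB_PER_CARD = 2
GB_PER_S_PER_CARD = 8
INTERLEAVE_WAYS = 32_768
WORD_BYTES = 8  # assumption: 64-bit words are the interleaving unit


def total_capacity_gb() -> int:
    return MMU_CARDS * GB_PER_CARD            # 1024 GB = 1 TB


def total_bandwidth_gb_s() -> int:
    return MMU_CARDS * GB_PER_S_PER_CARD      # 4096 GB/s = 4 TB/s


def bank_of(byte_address: int) -> int:
    """Hypothetical mapping: consecutive words go to consecutive banks."""
    return (byte_address // WORD_BYTES) % INTERLEAVE_WAYS


def banks_touched(start: int, stride_words: int, count: int) -> int:
    """Number of distinct banks hit by a strided access of `count` words."""
    return len({bank_of(start + i * stride_words * WORD_BYTES) for i in range(count)})


if __name__ == "__main__":
    print(total_capacity_gb(), "GB,", total_bandwidth_gb_s(), "GB/s")
    print("stride 1:    ", banks_touched(0, 1, 256), "banks")
    print("stride 32768:", banks_touched(0, INTERLEAVE_WAYS, 256), "bank")
```

Under this toy mapping, a unit-stride vector load of 256 words spreads over 256 different banks and can proceed in parallel, whereas a stride equal to the number of banks sends every access to the same bank. This is the kind of bank contention that interleaving is meant to make rare.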

I/O Processing Unit

All of the processors in the system are allowed to access all of the I/O devices. The SX adopts the direct I/O method, with which the memory in the host bus adapter (HBA) is accessed directly. The I/O interface can be selected according to the types of packaged channel cards, facilitating system expansion. The I/O processing unit of the SX-9 is composed of the IO Features (IOFs: host bridge units) and the PCI Express control units. Up to 16 IOFs can be mounted per system, and two PCI Express control units can be connected to each IOF. As each IOF is virtually represented as a single-channel device, all of the functions incorporated in the IOFs can be set and modified at any desired time from software, using the same access method as for a general-purpose channel card.

Core Technology Behind the SX-9

Figure 5 summarizes some of the core technologies built into the SX Series for enhanced performance and ease of use. The SX Series has the advanced architecture of large-scale shared memory, high-speed data transfer between the CPU and the memory, and an ultra-high-speed network interconnecting nodes. One of the key technologies indispensable to achieving higher system efficiency is the interconnection between processors. NEC has developed an optical interconnection technology that achieves a 20 Gbps signal transmission rate between two LSIs of a high-performance computer, thus surpassing the communication speed made available with the existing electrical transmission.

NEC SX Series Vector Computers. Fig. 5 Technologies for the SX-9

Software Technology of the SX-9

Outline of the SUPER-UX Operating System for the SX-9

The SUPER-UX is an operating system based on UNIX System V that features functions inherited from BSD and SVR4.2MP as well as enhancements of the functions required to support supercomputers. The SUPER-UX features flexible resource management and highly parallel processing in the kernel and I/O in order to guarantee high scalability. Four page sizes are supported: 32 KB for general commands, 4 MB for compiler and system commands, and 64 MB and 256 MB for large-scale user programs. The large pages improve the execution performance of programs that use large arrays and reduce the overhead of memory management. The SUPER-UX of the SX-9 has expanded the user’s virtual space to 8 TB, enabling an efficient layout of parallel programs and MPI programs. The SX Memory File Facility (SX-MFF) is also provided for high-speed I/O; it builds a conventional file system on the large-capacity memory as a disk cache.

Basic OS Functions

The SUPER-UX supports not only large-scale multiprocessors with efficient resource management but also high-speed input/output and the sophisticated gang scheduling for maximizing parallel processing performance.

File Management Functions

The SUPER-UX makes it possible to create large-scale files and file systems. While retaining the advantages of a standard UNIX system, it realizes a high-speed file system. The high-speed shared file system gStorageFS enables efficient data transfer comparable to that of local file systems without involving the CPUs on remote servers. NFS V3 compatibility is maintained as a user interface.

Batch Processing Functions

The SUPER-UX supports the concepts of jobs (sets of processes) at the kernel level. NQSII is the batch processing system that enables appropriate processing by large-scale clusters, and JobManipulator is provided as an NQSII scheduler extension to maximize system operational efficiency with backfill scheduling. NQSII incorporates functional enhancements of NQS job management, resource management, and load balancing to adapt to cluster systems with the improved single system image (SSI) and system operability.
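The idea behind backfill scheduling is that a lower-priority job may start ahead of the queue's head job as long as it does not delay the time at which the head job can start. The sketch below illustrates a generic, textbook-style variant of this idea; it is not the algorithm used by JobManipulator, which is not described here, and the `Job` class, the function name, and the job data are all hypothetical.

```python
# Generic backfill sketch on a cluster of identical nodes.
# NOT the JobManipulator algorithm; all names and data are hypothetical.

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int     # nodes requested
    runtime: int   # user-declared run time, in arbitrary time units


def backfill_candidates(free_nodes, running, queue, now=0):
    """Return names of queued jobs that may start now without delaying queue[0].

    running: list of (end_time, nodes) pairs for jobs currently executing.
    queue:   waiting jobs in priority order; queue[0] is the head job.
    """
    head = queue[0]
    if head.nodes <= free_nodes:
        return [head.name]  # the head job itself can start; nothing to backfill

    # Earliest time at which enough nodes are free for the head job.
    available = free_nodes
    reservation = None
    for end_time, nodes in sorted(running):
        available += nodes
        if available >= head.nodes:
            reservation = end_time
            break
    if reservation is None:
        return []  # the head job cannot be placed even after all jobs finish

    spare = available - head.nodes  # nodes the head job leaves unused at that time

    started = []
    for job in queue[1:]:
        if job.nodes > free_nodes:
            continue
        finishes_in_time = now + job.runtime <= reservation
        fits_beside_head = job.nodes <= spare
        if finishes_in_time or fits_beside_head:
            started.append(job.name)
            free_nodes -= job.nodes
            if not finishes_in_time:
                spare -= job.nodes  # this job will still hold nodes at the reservation
    return started
```

For example, with two free nodes, one running job that releases six nodes at time 10, and a head job that needs all eight nodes, a two-node job with a declared run time of 8 can be backfilled, whereas a two-node job with a run time of 20 cannot, because it would still hold nodes that the head job needs at time 10.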

Operation Management Functions

SUPER-UX supports a checkpoint/restart function for the interruption of a program in progress at a user-specified point and its smooth resumption. The unified operation management software provides the integrated management of multiple host machines on the network from a single machine, as well as automatic operation control, thus resulting in reduced operation cost.

Software Development Environment

The compilers for Fortran, C, and C++ provide high-level optimization and automatic vectorization/parallelization features for maximizing performance. The vector data buffering function, which utilizes the ADB, is open to the user, thus enabling highly effective memory access optimization. Both the MPI library and the HPF compiler are provided for comprehensive distributed memory programming. The PSUITE tool is also available for GUI-based program development, debugging, and tuning.

Compiler and Library

The SX-9 provides FORTRAN90/SX, a Fortran compiler, and C++/SX, a C/C++ compiler; both feature vectorization and parallelization functions. HPF/SX V2 (a compiler for High Performance Fortran, which is a standard language for distributed parallel processing) is also provided, as well as MPI/SX and MPI2/SX (fully compliant with the distributed parallel processing interfaces MPI-1.3 and MPI-2.1). The system also supports a number of the Fortran 2003 features. Both FORTRAN90/SX and C++/SX support the OpenMP standard API for shared-memory parallel processing.

Two proprietary mathematical libraries are provided; one is the ASL scientific library aimed at a wide usage in scientific and engineering applications with enhanced performance on key functions, such as linear algebra and FFT. The other is MathKeisan, which collects a set of public domain mathematical library components, such as BLAS.

Application Program and Performance

The applications of the SX Series span a wide spectrum of areas, ranging from fundamental research such as nanomaterial sciences and environmental sciences to industrial design, drug design, and emerging life sciences. The SX Series delivers high performance on real application programs as well as well-balanced performance from various architectural perspectives. Meanwhile, the Earth Simulator at JAMSTEC has inherited the parallel vector architecture, and its updated model (160 nodes of SX-9/E) realizes a peak performance of 131 TFLOPS. This system is ahead of the pack on the HPC Challenge benchmark (HPCC), which is gaining popularity as a measure separate from the Linpack score. The updated Earth Simulator topped the Global FFT (Fast Fourier Transform), one of the measures of the HPC Challenge Awards, with a performance of 11.876 TFLOPS. The HPCC, developed under a US government project, is designed to give multiple measures of performance, covering such factors as memory bandwidth and interconnect as well as processor capability, and is therefore suited to evaluating the capabilities of a computer more comprehensively.

The Earth Simulator showcases excellent performance and peak performance ratio on scientific application programs in areas such as earth environmental modeling.

Figure 6 shows the performance on real scientific applications, evaluated for six benchmark programs on various computer platforms, including the SX Series. The target applications are: (1) the computation of inter-plate earthquakes in a subduction zone, (2) turbulent flow based on direct numerical simulation, (3) the electromagnetic analysis of various antennas, (4) land mine detection with Synthetic Aperture Radar (Maxwell’s equations), (5) unsteady flows around a turbine based on direct numerical simulation, and (6) wave-particle interactions between electrons and plasma waves.

NEC SX Series Vector Computers. Fig. 6 Comparison of single-core performance for scientific application programs. (a) Measured performance (GFLOPS), (b) Peak performance ratio (%). Each bar represents SX-7, SX-8, SX-9, Harpertown, Nehalem-EP, and Nehalem-EX respectively from left to right

Figure 6 compares the single-CPU performance of these benchmark programs for various NEC vector supercomputers and Intel multicore processors (Harpertown, Nehalem-EP, and Nehalem-EX), as indicated in Table 2. The Nehalem processors are directly connected to the memory system, enabling efficient data transfer. Each application program is set up to utilize all the cores of the Intel processors. As clearly seen in Fig. 6, the SX-9 processor shows much higher performance than the other processors for all six programs, owing to its high theoretical peak performance. In addition, the execution efficiency (peak performance ratio) is more than 40% for many application programs on the SX Series, which is significantly higher than on the commodity scalar processors. The actual performance varies with such factors as the memory access frequency of the program, and therefore depends on the ratio of the CPU-memory bandwidth to the floating-point operation capability.
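The dependence of achievable performance on the ratio of memory bandwidth to floating-point capability can be made concrete with a simple bound of the roofline type. This model is brought in here for illustration and is not used in the evaluation reported above. The per-CPU bandwidth of 256 GB/s is obtained by dividing the 4 TB/s node figure evenly among 16 CPUs, which is an assumption, and the operational intensities in the example are hypothetical.

```python
# Roofline-style bound: attainable performance is limited either by the
# processor peak or by memory bandwidth times operational intensity.
# This model is an illustrative addition, not the method behind Fig. 6.

SX9_PEAK_GFLOPS = 102.4
# Assumption: the 4 TB/s (4096 GB/s) node bandwidth is shared evenly by 16 CPUs.
SX9_BANDWIDTH_GB_S = 4096 / 16   # 256 GB/s per CPU


def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      flops_per_byte: float) -> float:
    """Upper bound on performance for a kernel with the given operational intensity."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def peak_ratio_percent(measured_gflops: float, peak_gflops: float) -> float:
    """Execution efficiency (peak performance ratio) in percent."""
    return 100.0 * measured_gflops / peak_gflops


if __name__ == "__main__":
    for intensity in (0.125, 0.25, 1.0):   # hypothetical kernels, in FLOP per byte
        bound = attainable_gflops(SX9_PEAK_GFLOPS, SX9_BANDWIDTH_GB_S, intensity)
        print(intensity, bound, peak_ratio_percent(bound, SX9_PEAK_GFLOPS))
```

Under these assumptions, a kernel performing 0.25 floating-point operations per byte of memory traffic is bounded at 64 GFLOPS, or 62.5% of peak, whereas the same kernel on a processor with less bandwidth per FLOP would be held to a smaller fraction of its peak.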

NEC SX Series Vector Computers. Table 2 Specifications of referenced hardware platforms for performance evaluation

Figure 7 shows the performance of a Lattice Boltzmann (LB) CFD program measured with a single CPU on different models of the SX Series for various numbers of fluid cells. The LB method emerges from a highly simplified gas-kinetic description. It represents the behavior of a fluid by the translational movement and the collision of virtual particles. The left axis indicates the fluid cell update rate (per second), while the right one represents the measured performance (in GFLOPS). Relative to the maximum theoretical performance of each CPU (SX-9: 102.4 GFLOPS, SX-8R: 32 GFLOPS, SX-8: 16 GFLOPS), the computational efficiency (peak performance ratio) is high (35–60%) over a wide range of the number of fluid cells, thus surpassing the efficiency on conventional scalar computers [1].
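The two ingredients of the LB method named above, collision and translational movement (streaming) of virtual particles, can be seen in a very small example. The following is a minimal one-dimensional sketch with three particle velocities (a D1Q3 lattice) and a BGK relaxation collision; it is an illustration only and is unrelated to the benchmarked program, whose lattice and collision model are not described here. The relaxation time `tau` and the initial condition are arbitrary choices.

```python
# Minimal 1-D lattice Boltzmann sketch (D1Q3 lattice, BGK collision, periodic domain).
# Illustrative only; not the benchmarked CFD code. tau and the setup are arbitrary.

W = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)   # lattice weights for velocities 0, +1, -1
C = (0, 1, -1)                           # discrete particle velocities


def equilibrium(rho, u):
    """Equilibrium populations for local density rho and velocity u."""
    return [w * rho * (1.0 + 3.0 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
            for w, c in zip(W, C)]


def step(f, tau=0.8):
    """Update every cell once: collide locally, then stream to neighbours.

    f[i][x] is the population moving with velocity C[i] at cell x.
    """
    n = len(f[0])
    post = [[0.0] * n for _ in C]
    for x in range(n):
        rho = sum(f[i][x] for i in range(3))
        u = sum(C[i] * f[i][x] for i in range(3)) / rho
        feq = equilibrium(rho, u)
        for i in range(3):
            post[i][x] = f[i][x] - (f[i][x] - feq[i]) / tau   # BGK relaxation
    # Streaming: each population moves one cell along its own velocity.
    return [[post[i][(x - C[i]) % n] for x in range(n)] for i in range(3)]


def density(f):
    """Density of every cell (sum of the three populations)."""
    return [sum(f[i][x] for i in range(3)) for x in range(len(f[0]))]


# Fluid at rest with a small density bump in the middle of 32 cells.
n_cells = 32
rho0 = [1.0 + (0.1 if x == n_cells // 2 else 0.0) for x in range(n_cells)]
f = [[equilibrium(r, 0.0)[i] for r in rho0] for i in range(3)]
for _ in range(10):
    f = step(f)
```

Each call to `step` performs one update of every fluid cell, so the cell update rate plotted on the left axis of Fig. 7 corresponds to the number of such per-cell updates completed per second. The per-cell work is identical and independent across cells in the collision phase, which is one reason such codes vectorize well.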

NEC SX Series Vector Computers. Fig. 7 Single-core performance of Lattice-Boltzmann CFD program. (Left) Cell update rate (in million updates per second) (Right) Measured performance (in GFLOPS)

Concluding Remarks

While commodity-chip-based parallelism is becoming pervasive, complex control mechanisms and massive hardware resources are necessary to realize high performance with the current scalar architecture, which is widening the gap between the theoretical peak performance of a processor and its performance on real scientific application programs [2, 3]. The vector mechanism is a possible solution to this issue. It is a highly developed single instruction, multiple data stream (SIMD) approach in which one instruction handles a large number of arithmetic operations, and it is further enhanced with multiple pipelines. It provides an easy way to express the concurrency needed to take advantage of the memory bandwidth, with more tolerance of memory system latency than scalar designs, facilitating the realization of the high performance needed for scientific computing.
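The statement that one vector instruction handles many arithmetic operations can be illustrated by counting instructions. The sketch below models a long loop being split into chunks of at most 256 elements, the length of an SX-9 vector register. This splitting, commonly called strip mining, is the usual way vectorizing compilers handle long loops; it is named here as an illustration and is not taken from the description above.

```python
# Strip-mining sketch: a loop over n elements is processed in chunks of at most
# 256 elements, each chunk standing for one vector instruction.
# Illustrative model only; real vectorization is performed by the compiler.

import math

MAX_VECTOR_LENGTH = 256  # elements per SX-9 vector register


def vector_instruction_count(n_elements: int, vlen: int = MAX_VECTOR_LENGTH) -> int:
    """Number of vector add instructions needed for n_elements additions."""
    return math.ceil(n_elements / vlen)


def strip_mined_add(a, b, vlen: int = MAX_VECTOR_LENGTH):
    """Compute c = a + b chunk by chunk, as a strip-mined vector loop would."""
    c = []
    for start in range(0, len(a), vlen):
        # One "vector instruction": all elements of the chunk are handled together.
        chunk = [x + y for x, y in zip(a[start:start + vlen], b[start:start + vlen])]
        c.extend(chunk)
    return c


if __name__ == "__main__":
    n = 1000
    print(n, "scalar adds versus", vector_instruction_count(n), "vector adds")
```

For 1,000 elements, the scalar loop issues 1,000 add instructions, whereas the strip-mined version issues four vector instructions (three full chunks of 256 and one of 232).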

Related Entries

Compilers

Cray Vector Computers

Earth Simulator

Fortran 90 and Its Successors

Fujitsu Vector Computers

HPC Challenge Benchmark

HPF (High Performance Fortran)

Hybrid Programming with SIMPLE

Instruction-Level Parallelism

MPI (Message Passing Interface)

OpenMP

Load Balancing

Shared-Memory Multiprocessors

SIMD (Single Instruction, Multiple Data) Machines

Bibliographic Notes and Further Reading

The architectural aspects of the SX Series are described in [4–6]. Fundamental descriptions of vector processors (or the SIMD approach) and of the general performance trends of high-performance computers are available, for example, in [2, 3]. Several papers focus on the computing performance of the SX Series on real scientific application programs [1–7]. For a more comprehensive performance evaluation of the SX-9, see, for example, [8]. In relation to such benchmarks, descriptions of the referenced scalar computers can be found in [9, 10]. Detailed descriptions of the HPC Challenge benchmark and its awards are available in [11, 12]. For a deeper understanding of the NEC vector supercomputers, it is also useful to refer to the overviews of the Earth Simulator, whose vector-processing capabilities are inherited from the SX Series [13, 14].