
1 Introduction

The development of deep learning has experienced three upsurges: from the 1940s to the 1960s, the idea of the artificial neural network was born in the field of control; from the 1980s to the 1990s, neural networks were reinterpreted under the banner of connectionism; and after entering the 21st century, the field was revived under the name of deep learning [1]. The concept of deep learning originates from research on deep neural networks, a core branch of machine learning; the multi-layer perceptron, for example, is a simple network learning structure. Generally speaking, deep learning realizes complex nonlinear mappings by stacking multiple layers of artificial neurons and extracting features layer by layer. In essence, compared with traditional artificial neural networks, deep learning does not introduce more complex logical structures; it significantly improves the feature extraction and nonlinear approximation capabilities of the model simply by adding hidden layers. Since Hinton formally proposed the concept of “deep learning” in 2006 [2], it has triggered a research upsurge in academia and heavy investment from industry, and many excellent deep learning algorithms have emerged. For example, in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) from 2010 to 2017, CNNs demonstrated powerful image processing capability and confirmed their leading position in computer vision [3]. In 2016, the Go program AlphaGo [4] developed by Google defeated the world Go champion Lee Sedol by a decisive margin; the success of AlphaGo marked the arrival of an era of artificial intelligence with deep learning at its core.

After years of development, the rise of deep learning has led to the creation of widely used programming frameworks such as TensorFlow, Caffe, Theano, MXNet, PyTorch and Keras, and has also driven the rapid development of AI hardware acceleration platforms and dedicated chips, including CPUs, GPUs, FPGAs and ASICs. This paper focuses on the current research hotspots and mainstream deep learning algorithms in the field of artificial intelligence. The basic principles and applications of the Autoencoder (AE), Boltzmann Machine (BM), Deep Belief Network (DBN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and Recursive Neural Network are summarized, and the performance characteristics and differences of deep learning frameworks, AI hardware acceleration platforms and dedicated chips are compared and analyzed.

2 Deep Learning Algorithms

2.1 Auto-Encoder (AE)

As a special multi-layer perceptron, the Auto-encoder (AE) consists mainly of an encoder and a decoder [5]. As shown in Fig. 1, the basic Auto-encoder can be regarded as a three-layer neural network: the mapping from the input ‘x’ to the code ‘a’ is the encoding process, and the mapping from ‘a’ to ‘y’ is the decoding process. Training an Auto-encoder means reducing the error between the output ‘y’ and the input signal ‘x’. Because the expected output of an Auto-encoder is its own input, it is generally regarded as an unsupervised learning algorithm and is mainly used for data dimensionality reduction and feature extraction. In the training of deep neural networks, Auto-encoders are often used to determine the initialization parameters of the network; the rationale is that if the encoded data can be restored accurately after decoding, the hidden-layer weights are considered to store the information of the data well.
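
As an illustration, a minimal fully connected Auto-encoder corresponding to the x → a → y structure of Fig. 1 might look like the following PyTorch sketch; the layer sizes, activations and the mean-squared-error reconstruction loss are illustrative assumptions rather than details taken from [5].

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal x -> a -> y autoencoder: the encoder compresses, the decoder reconstructs."""
    def __init__(self, n_in=784, n_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

    def forward(self, x):
        a = self.encoder(x)   # code 'a'
        y = self.decoder(a)   # reconstruction 'y'
        return y

model = AutoEncoder()
x = torch.rand(32, 784)                         # an illustrative batch of inputs
loss = nn.functional.mse_loss(model(x), x)      # training minimizes ||y - x||^2
loss.backward()
```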

Fig. 1. Auto-encoder (AE)

A stronger approximation of the input by the output is not always better: if the output of the Auto-encoder is exactly equal to the input, the network merely copies the original data and does not extract the inherent characteristics of the input information. Therefore, in order to make the Auto-encoder learn key features, constraints are usually imposed on it, which has led to a variety of improved Auto-encoders. The Sparse Auto-encoder (SAE) adds a penalty term so that most neurons remain inactive, and the number of hidden-layer nodes is smaller than that of the input layer, so that the input data are represented with fewer characteristic parameters [6]. The Stacked Auto-encoder (SAE) deepens the network by stacking multiple Auto-encoders in series, making it possible to extract deeper data features [7]. The Denoising Auto-encoder (DAE) improves robustness by adding noise to the input during training [8]. The Contractive Auto-encoder (CAE) learns mappings with stronger contraction by adding a regularization term [9]. In addition, there are the Deep Auto-encoder (DAE), Stacked Denoising Auto-encoder (SDAE), Sparse Stacked Auto-encoder (SSAE), and others [10,11,12].
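
For instance, the sparsity constraint mentioned above can be sketched as an extra penalty on the hidden activations, reusing the AutoEncoder model and batch x from the sketch above; the L1 penalty and its coefficient are illustrative assumptions (sparse Auto-encoders are also commonly trained with a KL-divergence penalty).

```python
# Hypothetical sparsity penalty added to the reconstruction loss of the
# AutoEncoder sketched earlier; the coefficient 1e-3 is an illustrative choice.
a = model.encoder(x)                        # hidden code
recon = model.decoder(a)
sparsity = a.abs().mean()                   # L1 penalty pushes most activations toward zero
loss = nn.functional.mse_loss(recon, x) + 1e-3 * sparsity
```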

2.2 Boltzmann Machine

The Boltzmann Machine (BM) is a generative stochastic neural network proposed by Hinton [13]. A traditional BM has no notion of layers: its neurons are fully connected and are divided into visible units and hidden units, both of which are binary variables whose state can only be 0 or 1. Because the fully connected structure of the BM is computationally complex, a variant of the BM, the Restricted Boltzmann Machine, is widely used at present (Fig. 2).

Fig. 2. BM (left) and RBM (right)

The Restricted Boltzmann Machine (RBM) was first proposed by Smolensky [14] and has been widely used in data dimensionality reduction, feature extraction, classification and collaborative filtering. The RBM is a shallow network similar in structure to the BM; the difference is that the RBM removes the connections within each layer, so that neurons in the same layer do not affect each other, which simplifies the model.
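
A minimal numerical sketch of RBM training with one step of contrastive divergence (CD-1) is shown below; the Bernoulli sampling, learning rate and layer sizes are illustrative assumptions rather than details from [14].

```python
import torch

n_visible, n_hidden, lr = 784, 128, 1e-2
W = 0.01 * torch.randn(n_visible, n_hidden)   # weights between visible and hidden units
b_v = torch.zeros(n_visible)                  # visible bias
b_h = torch.zeros(n_hidden)                   # hidden bias

def sample(p):
    return torch.bernoulli(p)                 # binary units: state is 0 or 1

def cd1_step(v0):
    """One CD-1 update for a batch of binary visible vectors v0."""
    p_h0 = torch.sigmoid(v0 @ W + b_h)        # hidden given visible (positive phase)
    h0 = sample(p_h0)
    p_v1 = torch.sigmoid(h0 @ W.t() + b_v)    # reconstruct visible
    v1 = sample(p_v1)
    p_h1 = torch.sigmoid(v1 @ W + b_h)        # hidden given reconstruction (negative phase)
    # Gradient approximation: positive-phase statistics minus negative-phase statistics
    W.add_(lr * (v0.t() @ p_h0 - v1.t() @ p_h1) / v0.size(0))
    b_v.add_(lr * (v0 - v1).mean(0))
    b_h.add_(lr * (p_h0 - p_h1).mean(0))

v0 = torch.bernoulli(torch.rand(32, n_visible))   # illustrative binary batch
cd1_step(v0)
```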

2.3 Deep Boltzmann Machine and Deep Belief Network

The Deep Boltzmann Machine (DBM) is a model composed of multiple stacked Restricted Boltzmann Machines in which the connections between adjacent layers are bidirectional [15]. Compared with the RBM, the DBM can learn higher-order features from unlabeled data and is more robust, so it is suitable for tasks such as object recognition and speech recognition.

The Deep Belief Network (DBN) is also a deep neural network composed of multiple RBMs; it differs from the DBM in that only the connections in the layers at the output end remain bidirectional, while the rest of the network propagates in one direction [16]. Different from general neural network models, the DBN aims to establish a joint distribution between the data and the expected output, so that the network generates the expected output as well as possible and thus extracts and restores data features at a more abstract level. The DBN is a practical deep learning algorithm whose scalability and compatibility have been demonstrated in feature recognition, data classification, speech recognition and image processing. For example, the combination of the DBN and the Multi-Layer Perceptron (MLP) performs well in facial expression recognition [17], and the combination of the DBN and the Support Vector Machine (SVM) performs well in text classification [18].

2.4 Convolutional Neural Network

The Convolutional Neural Network (CNN) is a deep learning algorithm originally derived from the discovery of the ‘receptive field’ [19], and it has excellent ability in image feature extraction. After the successful application of the LeNet-5 model to handwritten digit recognition, researchers in many fields began to study the application of CNNs to speech and image problems. In 2012, the AlexNet model proposed by Krizhevsky beat many excellent neural network models in the ImageNet image classification competition, which pushed application research on CNNs to a climax [20] (Fig. 3).

Fig. 3. Convolutional neural network [21]

A convolutional neural network is mainly composed of an input layer, convolutional layers, activation layers, pooling layers, fully connected layers and an output layer, among which the convolutional and pooling layers are the core structures of the CNN. Different from other deep learning algorithms, a CNN mainly uses convolution kernels (filters) for convolution, and uses pooling layers to reduce inter-layer connections and further condense features. It obtains high-level features through repeated extraction and compression, and then uses the resulting representation for classification or regression.
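
A minimal sketch of this layer sequence (convolution, activation, pooling, fully connected) in PyTorch is shown below; the channel counts, kernel sizes, input resolution and ten-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Input -> convolution -> activation -> pooling -> fully connected -> output."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.ReLU(),                                    # activation layer
            nn.MaxPool2d(2),                              # pooling layer halves H and W
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.rand(8, 1, 28, 28))   # e.g. a batch of 28x28 grayscale images
```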

The weight sharing mechanism and the local receptive field are two major features of the CNN. Like the pooling layer, they reduce the risk of overfitting by reducing inter-layer connections and network parameters. Weight sharing means that a single filter is reused many times: it slides across the feature map and performs repeated convolution computations [22]. The local receptive field is inspired by the way humans observe the outside world, from the local to the whole; a single filter therefore does not need to perceive the entire input, but only extracts local features, which are aggregated at higher levels.

In recent years, CNNs have gradually emerged in various industries, for example in AlphaGo, speech recognition, natural language processing, image generation and face recognition [23,24,25,26]. At the same time, many improved CNN models have been proposed, such as VGG, ResNet, GoogLeNet and MobileNet.

VGG.

In 2014, Simonyan and Zisserman [27] proposed the VGG model, which won first place in the localization task and second place in the classification task of the ImageNet Challenge. In order to improve the fitting ability, the depth of VGG is increased to 19 layers, and convolution kernels with a small receptive field (3 × 3) are used in place of larger ones (5 × 5 or 7 × 7), which increases the nonlinear expressive power of the network.
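
The parameter saving from replacing large kernels with stacked 3 × 3 kernels can be checked with a quick count; the channel width below is an illustrative assumption.

```python
# Two stacked 3x3 convolutions cover the same 5x5 receptive field as one
# 5x5 convolution but use fewer weights (biases ignored for brevity).
c = 64                             # illustrative channel count
one_5x5 = 5 * 5 * c * c            # 102,400 weights
two_3x3 = 2 * (3 * 3 * c * c)      # 73,728 weights, plus an extra nonlinearity
print(one_5x5, two_3x3)
```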

ResNet.

VGG proved that a deep network structure can effectively improve the fitting ability of a model, but deeper networks tend to suffer from vanishing gradients, which prevents the network from converging. In 2015, Kaiming He [28] proposed ResNet, which effectively alleviated the degradation problem of deep neural networks and won first place in the classification, localization, detection and segmentation tasks of the ILSVRC and COCO competitions by a clear margin. To address the vanishing gradient problem, He introduced the Residual Block, which uses a shortcut connection to implement identity mapping.
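
A minimal sketch of the residual block idea, where a shortcut adds the input back to the output of the convolutional path (identity mapping), is shown below; the layer configuration follows common practice rather than the exact design in [28].

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the shortcut lets gradients bypass the convolutional path."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)      # shortcut: identity mapping added back

y = ResidualBlock(64)(torch.rand(1, 64, 32, 32))
```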

GoogLeNet.

To address the problem of excessive parameters in large-scale network models, Google proposed the Inception V1 [29] architecture in 2014 and used it to construct GoogLeNet, which won first place in the classification and detection tasks of the ImageNet Challenge that year. Inception V1 abandons the fully connected layer and replaces the convolutional layer with a sparse network structure, which significantly reduces the number of network parameters. In 2015, Google proposed the Batch Normalization operation and used it to improve the original GoogLeNet, obtaining a better model, Inception V2 [30]. Inception V3 [31] was introduced in the same year; its core idea is to factorize convolution kernels into smaller convolutions, such as splitting a 7 × 7 kernel into 1 × 7 and 7 × 1 kernels, to further reduce the number of parameters. In 2016, Google launched Inception V4, which combines Inception with ResNet and improves both training speed and performance [32]. When the number of filters is too large (more than 1,000), training of Inception V4 becomes unstable, but this can be alleviated by adding an activation scaling factor.

MobileNet.

In recent years, in order to bring neural network models to mobile devices, models have been developed in a lightweight direction. In 2017, Google designed MobileNet V1 based on depthwise separable convolutions [33] and allowed users to change the network width and input resolution, thus achieving a trade-off between latency and accuracy. In 2018, Google introduced inverted residuals and linear bottlenecks on the basis of MobileNet V1 and proposed MobileNet V2 [34]. In 2019, Google proposed MobileNet V3, which combines depthwise separable convolutions, inverted residuals and linear bottlenecks [35]. MobileNet has been shown to perform well in multiple tasks, such as classification, object detection and semantic segmentation.
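
The depthwise separable convolution that MobileNet builds on can be sketched as a per-channel (depthwise) convolution followed by a 1 × 1 pointwise convolution; the channel counts and input size below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one filter per input channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv(32, 64)
y = block(torch.rand(1, 32, 56, 56))
# Roughly (3*3*32 + 32*64) weights versus 3*3*32*64 for a standard convolution.
```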

2.5 Recurrent Neural Network

The Recurrent Neural Network (RNN) is a class of deep learning models that is good at dealing with time series. An RNN unfolds the neurons of each layer along the time dimension: information is fed into the network sequentially and propagated forward, and a ‘long-term memory unit’ stores information so that sequential relations between data can be established.

Fig. 4. Recurrent neural network

As shown in Fig. 4, an RNN reduces the amount of computation by sharing the parameters (W, U, V) across time steps. The parameters of each node are mainly updated with the Back Propagation Through Time algorithm [36]. The forward propagation can be expressed as:

$$ S_t = \sigma (W S_{t - 1} + U X_t ) $$
(1)
$$ Q_t = {\text{softmax}} (V S_t ) $$
(2)
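
A direct sketch of the recurrence in Eqs. (1)–(2), with the shared parameters W, U, V reused at every time step, could be implemented as follows; the dimensions and random inputs are illustrative assumptions.

```python
import torch

n_in, n_hidden, n_out, T = 10, 32, 5, 6        # illustrative sizes
U = torch.randn(n_in, n_hidden) * 0.1          # input-to-hidden weights, shared over time
W = torch.randn(n_hidden, n_hidden) * 0.1      # hidden-to-hidden weights, shared over time
V = torch.randn(n_hidden, n_out) * 0.1         # hidden-to-output weights, shared over time

S = torch.zeros(1, n_hidden)                   # initial state S_0
for t in range(T):
    X_t = torch.rand(1, n_in)                  # input at step t
    S = torch.sigmoid(S @ W + X_t @ U)         # Eq. (1)
    Q_t = torch.softmax(S @ V, dim=1)          # Eq. (2)
```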

Although an RNN can take the correlation between pieces of information into account, a traditional RNN usually has difficulty preserving information over long periods. Because of the activation function and repeated multiplication, when the network has many layers or the time sequence is long, the gradient may grow or decay exponentially over the iterations, resulting in the vanishing gradient or exploding gradient problem [37].

LSTM.

In order to overcome the shortcomings of the traditional RNN, Hochreiter [38] proposed the LSTM. The LSTM introduces three types of gating units into the RNN to control how information is written, forgotten and stored over long time spans, which not only alleviates the vanishing and exploding gradient problems, but also improves the ability of the RNN to store information over long periods. Each memory cell in the LSTM contains one cell and three gates; a basic structure is shown in Fig. 5. Among the three gating units, the input gate controls the proportion of the current input data x(t) that enters the network; the forget gate controls the extent to which the long-term memory unit discards information as it passes through each neuron; and the output gate controls the output of the current neuron and the input to the next neuron.

The three types of gating units are computed as follows:

$$ i_t = \sigma (w_{ii} x_t + w_{ih} h_{t - 1} ) $$
(3)
$$ f_t = \sigma (w_{fi} x_t + w_{fh} h_{t - 1} ) $$
(4)
$$ o_t = \sigma (w_{oi} x_t + w_{oh} h_{t - 1} ) $$
(5)

The cell candidate is computed as:

$$ g_t = \tanh (w_{gi} x_t + w_{gh} h_{t - 1} ) $$
(6)

The long-term memory unit C and the hidden-layer output h are computed as follows:

$$ C_t = f_t C_{t - 1} + i_t g_t $$
(7)
$$ h_t = o_t \tanh (C_t ) $$
(8)
Fig. 5. LSTM memory cell [39]
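
Eqs. (3)–(8) translate almost line by line into code; the sketch below omits bias terms, as the equations above do, and uses illustrative sizes and randomly initialized weight matrices.

```python
import torch

n_in, n_hidden = 10, 20                        # illustrative sizes
w_ii, w_ih = torch.randn(n_in, n_hidden), torch.randn(n_hidden, n_hidden)
w_fi, w_fh = torch.randn(n_in, n_hidden), torch.randn(n_hidden, n_hidden)
w_oi, w_oh = torch.randn(n_in, n_hidden), torch.randn(n_hidden, n_hidden)
w_gi, w_gh = torch.randn(n_in, n_hidden), torch.randn(n_hidden, n_hidden)

def lstm_cell(x_t, h_prev, c_prev):
    i_t = torch.sigmoid(x_t @ w_ii + h_prev @ w_ih)   # input gate, Eq. (3)
    f_t = torch.sigmoid(x_t @ w_fi + h_prev @ w_fh)   # forget gate, Eq. (4)
    o_t = torch.sigmoid(x_t @ w_oi + h_prev @ w_oh)   # output gate, Eq. (5)
    g_t = torch.tanh(x_t @ w_gi + h_prev @ w_gh)      # cell candidate, Eq. (6)
    c_t = f_t * c_prev + i_t * g_t                    # long-term memory, Eq. (7)
    h_t = o_t * torch.tanh(c_t)                       # hidden output, Eq. (8)
    return h_t, c_t

h, c = torch.zeros(1, n_hidden), torch.zeros(1, n_hidden)
h, c = lstm_cell(torch.rand(1, n_in), h, c)
```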

The LSTM has many successful variants, among which the bi-directional LSTM is one of the most notable improvements. By propagating data in both directions along the time dimension, the bi-directional LSTM makes use of past and future information simultaneously [40], and on some problems its prediction performance is better than that of the one-way LSTM. Greff [39] compared eight variants of the vanilla LSTM experimentally on TIMIT speech recognition, handwritten character recognition and polyphonic music modeling. The results showed that none of the eight variants brought a significant improvement in performance; the forget gate and the output gate are the two most important parts of the LSTM model, and coupling these two gating units can simplify the LSTM structure without reducing performance.

GRU.

As a simplified model of the LSTM, the GRU uses only two gating units to retain and discard information: an update gate, which takes over the roles of the input and forget gates, and a reset gate [41]. Compared with the LSTM, replacing the forget and input gates with a single update gate simplifies the structure and reduces computation without reducing performance. At present there is no definitive conclusion on whether the LSTM or the GRU performs better, but extensive practice has shown that the two models often perform similarly on common problems [42].

2.6 Recursive Neural Network

The recursive neural network is a deep learning model with a tree-like hierarchical structure: information is aggregated layer by layer from the leaves of the tree and finally reaches the root, i.e., connections between pieces of information are established along the spatial dimension. Compared with the recurrent neural network, the recursive neural network can map words and sentences expressing different semantics into a common vector space and use the distance between statements to determine their semantics [43], rather than only considering word-order relations. Recursive neural networks have powerful natural language processing capabilities, but constructing such tree-structured networks requires sentences or words to be manually annotated as parse trees, which is relatively expensive (Fig. 6).
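
A minimal sketch of this recursive composition over a parse tree, where each parent vector is computed from its children with one shared composition function, is shown below; the tiny tree, the embedding dimension and the tanh composition are illustrative assumptions rather than the exact model of [43].

```python
import torch
import torch.nn as nn

dim = 8                                        # illustrative embedding size
compose = nn.Linear(2 * dim, dim)              # shared composition function

def encode(node):
    """Recursively combine child vectors from the leaves up to the root."""
    if isinstance(node, torch.Tensor):         # leaf: a word embedding
        return node
    left, right = (encode(child) for child in node)
    return torch.tanh(compose(torch.cat([left, right], dim=-1)))

# Illustrative parse tree ((w1 w2) w3) with random word embeddings as leaves.
w1, w2, w3 = (torch.rand(dim) for _ in range(3))
root = encode(((w1, w2), w3))                  # the root vector represents the sentence
```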

Fig. 6. Syntax parse tree and natural scene parse tree [44]

3 Deep Learning Framework

In the early stage of the development of deep learning, in order to simplify model building and avoid repeated work, some researchers and institutions packaged code implementing basic functions into frameworks for public use. Currently, commonly used deep learning frameworks include TensorFlow, Caffe, Theano, MXNet, PyTorch and Keras.

3.1 TensorFlow

TensorFlow is an open-source framework for machine learning and deep learning developed by Google. It builds models in the form of a data flow graph and provides tf.gradients for quickly computing gradients. TensorFlow is highly flexible and portable: it supports multiple language interfaces such as Python and C++, and it can be deployed on servers with multiple CPUs and GPUs as well as on mobile phones [48]. TensorFlow is therefore widely used in many fields such as speech and image processing. Although it is not superior to other frameworks in terms of running speed and memory consumption, it is relatively complete in terms of documentation, functionality, tutorials and peripheral services, which makes it suitable for most deep learning beginners.
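
As a small illustration of how TensorFlow differentiates through a computation graph, the sketch below uses the tf.GradientTape API of TensorFlow 2, which plays the role that tf.gradients plays in the older graph mode; the toy quadratic function is an illustrative assumption.

```python
import tensorflow as tf

w = tf.Variable(3.0)
x = tf.constant(2.0)
with tf.GradientTape() as tape:
    loss = (w * x - 1.0) ** 2        # toy quadratic loss recorded on the tape
grad = tape.gradient(loss, w)        # automatic differentiation: d(loss)/dw
print(grad.numpy())                  # 2 * x * (w*x - 1) = 20.0
```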

3.2 Caffe

Caffe is an open-source framework for deep learning maintained by the Berkeley Vision and Learning Center (BVLC). Caffe allows network structures to be defined and adjusted conveniently according to different requirements and is very suitable for modeling deep convolutional neural networks [49]. Caffe has demonstrated excellent image-processing performance in ImageNet competitions and has become one of the most popular frameworks in computer vision. Caffe models are usually defined in plain-text configuration files, which are easy to learn. In addition, Caffe can use the GPU for training acceleration through Nvidia’s CUDA architecture and the cuDNN library. However, modifying or adding new network layers in Caffe requires considerable extra effort, and the framework is not good at language modeling problems.

3.3 Theano

Theano is an efficient and convenient mathematical compiler developed at the University of Montreal, and it was the first framework to build network models with symbolic tensor graphs. Theano is developed in Python and relies on the NumPy toolkit; it is well suited for designing and modeling large-scale deep learning algorithms, especially language modeling problems [50]. Theano’s disadvantages are also obvious: importing the toolkit and compiling models are both slow, and the framework is no longer actively developed, so it is not recommended as a research tool.

3.4 MXNet

MXNet is a deep learning framework officially adopted and maintained by Amazon. It has a flexible and efficient programming model, supporting both imperative and symbolic programming [51], and it can combine the two styles to provide users with a more comfortable programming environment. MXNet has many advantages: it supports distributed training on multiple CPUs/GPUs, and it is genuinely portable from servers and workstations down to small devices such as smartphones. In addition, MXNet supports JavaScript, Python, Matlab, C++ and other languages, which meets the needs of different users. However, MXNet is not as widely adopted by the community because it is harder to get started with and its tutorials are less complete.

3.5 PyTorch

Facebook introduced the Torch framework early on, but it struggled to meet market demand because it lacked a Python interface. Facebook therefore built PyTorch, a deep learning framework designed for Python programming and GPU acceleration [52, 53]. PyTorch builds models with a dynamic data flow graph, giving users the flexibility to modify the graph at run time. PyTorch encapsulates code efficiently, runs faster than frameworks such as TensorFlow and Keras, and provides a more user-friendly programming environment than many other frameworks.
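
The dynamic graph mentioned above means the computation graph is built on the fly by ordinary Python control flow, so it can change from one forward pass to the next; a minimal illustrative sketch with an arbitrary data-dependent branch is shown below.

```python
import torch

x = torch.rand(4, requires_grad=True)
# The graph is constructed as the code runs, so data-dependent control flow is allowed.
y = x.sum() if x.mean() > 0.5 else (x ** 2).sum()
y.backward()                         # gradients flow through whichever branch actually ran
print(x.grad)
```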

3.6 Keras

Keras is a neural network library that originated from Theano. The framework is developed mainly in Python and offers a complete tool chain for building, debugging, validating and deploying deep learning algorithms. Keras is designed for object-oriented programming and encapsulates many functions in a modular manner, simplifying the construction of complex models. Meanwhile, Keras is compatible with the TensorFlow and Theano backends and supports most of the major algorithms, including convolutional and recurrent neural networks (Table 1).

Table 1. Deep learning framework

4 Hardware Platform and Dedicated Chip

4.1 CPU

The CPU is one of the core components of a computer. It usually consists of a control unit, arithmetic/logic units and registers, and its main function is to fetch and execute instructions and to process data. As a general-purpose chip, the CPU was originally designed to be compatible with all kinds of data processing and computation; it is not a processor specialized for neural network training and acceleration. The training of deep networks involves a large number of matrix and vector computations, for which the CPU is not efficient, and upgrading CPUs to improve performance is not cost-effective. Therefore, the CPU is generally suitable only for small-scale network training.

4.2 GPU

In 1999, NVIDIA launched the GeForce 256 as its first commercial GPU, and it began working on high-performance GPU technology in the early 2000s. By 2004, GPUs had evolved to the point where they could carry out early neural network computations. In 2006, Kumar Chellapilla [54] successfully used a GPU to accelerate a CNN, which is the earliest known attempt to use GPUs for deep learning.

The GPU is a microprocessor specialized for image computation. Unlike the general-purpose CPU, the GPU focuses on complex matrix and geometric computations and is especially good at image problems [55]. For complex deep learning models, the GPU can greatly increase training speed. For example, Coates [56] used GPUs to accelerate training in an object detection system and increased its running speed by nearly 90 times. Currently, companies such as Nvidia and Qualcomm have advanced capabilities in GPU hardware and acceleration technology and support multiple programming languages and frameworks. For example, PyTorch can use the GPU to accelerate model training through Nvidia’s CUDA architecture and the cuDNN library, which significantly reduces network training time.

4.3 ASIC

The ASIC is a special-purpose chip that can be customized to a very high degree: its design can be tailored to the actual problem to meet different computing-power requirements. Therefore, when dealing with deep learning problems, its performance and energy efficiency far exceed those of general-purpose chips such as the CPU and GPU. For example, the TPU [57] launched by Google in 2015 is a representative ASIC; its execution speed and efficiency have been shown to be dozens of times higher than those of CPUs and GPUs, and it has been applied and promoted in Google’s search, maps, browser and translation software. In recent years, Google has released the second and third generations of the TPU as well as the TPU Pod [58], which not only greatly improve chip performance but also extend its application to broader areas of artificial intelligence. In addition, the Cambricon series of chips [59] developed by the Chinese Academy of Sciences also offers great advantages in accelerating neural networks. ASICs have broad development prospects and application value, but because of the long development cycle, high investment risk and high technical requirements, only a few companies are currently able to develop them.

4.4 FPGA

The FPGA (Field Programmable Gate Array) is a reconfigurable circuit derived from custom integrated circuit (ASIC) technology. An FPGA operates directly through gate circuits, which gives it high speed and flexibility, and users can meet different needs by changing the wiring between the internal gate circuits [60]. FPGAs generally have lower performance than ASICs, but their development cycle is shorter, the risk is lower and the cost is relatively low; when handling specific tasks, efficiency can be further improved through parallel computing. Although the FPGA has many advantages and can adapt well to rapidly evolving deep learning algorithms, it is not recommended for individual users or small companies because of the high cost and difficulty of development (Table 2).

Table 2. Deep learning hardware technology comparison [61]

5 Conclusion

Focusing on currently popular research areas in artificial intelligence, this paper has summarized the basic principles and application scenarios of mainstream deep learning algorithms, and has introduced and compared common deep learning programming frameworks, hardware acceleration platforms and dedicated chips. Deep learning algorithms are clearly in a stage of rapid development and are also driving the rise of surrounding industries. However, problems such as limited model variety and insufficient algorithm performance still restrict the development of some industries, so how to innovate and improve algorithms remains the focus of future research. In addition, although the intelligence of deep learning algorithms brings much convenience to daily life, their application is not yet widespread, which means that promoting and using deep learning more efficiently still has a long way to go.