Introduction

Artificial neural networks have dramatically advanced the capabilities of artificial intelligence in many areas1,2,3,4,5. In artificial neural networks, the most widely used neuron activation functions include the Sigmoid function \(\mathrm{sgmd}(x) = 1/[1+\exp (-x)]\), the Hyperbolic Tangent function \(\tanh (x)\), and the Rectified Linear Unit function \(\mathrm{ReLU}(x)=\max (x,0)\), etc.6,7. All of these activation functions require multiple bits for storage and processing. In contrast, the binary Heaviside step-function, the simplest nonlinear activation,

$$\begin{aligned} \mathrm{H}(x)={\left\{ \begin{array}{ll} 1 & x\ge 0\\ 0 & x<0 \end{array}\right. }, \end{aligned}$$
(1)

takes only 1 bit to store its output. However, the Heaviside function is not suitable as an activation function because it is non-differentiable at \(x = 0\) and its derivative vanishes for \(x \ne 0\), which makes it impossible to use the back-propagation algorithm8 in the training process. For the same reason, the weights have to be smoothly tunable and cannot take binary values.
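As a quick illustration (ours, not part of the original text), the following TensorFlow snippet shows that no gradient flows through a hard step: the comparison inside tf.where breaks the differentiable path, and mathematically the derivative is zero almost everywhere.

import tensorflow as tf

# Minimal check that the Heaviside step gives no usable gradient for back-propagation.
x = tf.Variable([-0.5, 0.0, 0.5])
with tf.GradientTape() as tape:
    y = tf.where(x >= 0, 1.0, 0.0)  # Heaviside step H(x), Eq. (1)
print(tape.gradient(y, x))          # None: no differentiable path from x to y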

Despite these difficulties, binarization of neural networks is highly desirable. Modern large Deep Neural Networks (DNNs) require very large memory (hundreds of MB) to store weights and intermediate variables, as well as strong computing power (up to \(\sim\) 10 Gflops per sample). This is one of the major obstacles limiting the application of artificial neural networks, especially on mobile devices. Binarizing the weights and/or activations can clearly provide a huge performance boost, so it is of great interest to find ways to achieve binary weights and activations.

Previously, it was believed that networks with binary activations and/or binary weights cannot be trained, because good binary weights simply do not exist as a result of the inevitable loss of degrees of freedom. However, pioneering works such as BinaryConnect9 and BNN10 proposed by Courbariaux et al. have recently demonstrated that binarization of weights and activations in deep neural networks is indeed feasible. During training, these methods use binarized weights (deterministically or stochastically) and/or activations (deterministically) for the forward pass; for the backward pass, the Straight-Through Estimator (STE) is used as the gradient of the activations:

$$\begin{aligned} \frac{\mathrm{d}\,\mathrm{sign}(x)}{\mathrm{d}x} = {\left\{ \begin{array}{ll} 1 & |x| < 1\\ 0 & \text{otherwise} \end{array}\right. } \end{aligned}$$

and the gradients of the binarized weights are applied to the raw (un-binarized) weights. These methods can achieve good results on various tasks. Almost all works on binarization (quantization) rely on this technique, and a number of improved variants or extensions have been proposed, such as XNOR-Nets11, which add a scaling factor to cancel out the quantization error; LQ-Nets12, which automatically optimize the quantizers; DSQ13, which improves the STE itself; and DoReFa-Net14, which also uses low bit widths during backpropagation to reduce the training time. Other methods for binarizing (quantizing) the weights of neural networks include expectation backpropagation15 and ProxQuant16. Simons and Lee17 and Qin et al.18 provide reviews of recent progress in binary neural networks.
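For concreteness, here is a minimal TensorFlow sketch of how an STE-based sign activation is typically implemented (this illustrates the cited approach, not our method; the function name ste_sign is ours):

import tensorflow as tf

@tf.custom_gradient
def ste_sign(x):
    """Forward: binarize with sign(x). Backward: clipped identity gradient (STE)."""
    y = tf.sign(x)  # note tf.sign(0) = 0; in practice 0 is usually mapped to +1 or -1
    def grad(dy):
        # pass the incoming gradient through only where |x| < 1
        return dy * tf.cast(tf.abs(x) < 1.0, dy.dtype)
    return y, grad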

Apart from binarization or quantization, as in our work, there are other efforts to speed up deep neural networks. They use methods different from binarization/quantization, but the aim is essentially the same: to boost the performance of neural networks. Examples include using compact layers19,20,21 and compressing the network by removing redundant weights22,23,24,25.

Method

Here, we describe an efficient yet simple method to train existing networks to use a binary activation function and/or binary weights. In this method, we use parameterized Sigmoid and Hyperbolic Tangent functions with width w:

$$\begin{aligned} \mathrm{sgmd}_w(x) \equiv [1+\exp (-x/w)]^{-1} \quad \text{and}\quad \tanh _w(x) \equiv \tanh (x/w) \end{aligned}$$
(2)

to approach the Heaviside step-function \(\mathrm {H}(x)\) and the sign-function \(\mathrm{sign}(x)\) as \(w\rightarrow 0\). In order to train real-valued weights to work with Heaviside-activated neurons, the training process starts with the usual Sigmoid function of finite width \(w > 0\), and then w is decreased adiabatically until the Sigmoid function becomes a Heaviside-like function. In the end, the network may be trained once more using the Heaviside (binary) activation to further stabilize the weights. To achieve binarized weights, the network is slightly modified by replacing the raw weights W with the polarized weights \(W\mapsto a \tanh _w(W)\), where a is a real-valued constant for each layer, and the polarized weights instead of the raw weights are used for the connections between neurons. The raw weights W and the multiplier a are trained as usual. When the width w is large, the polarized weights behave essentially like the raw weights. But when the width w is gradually decreased during training, the polarized weights become binarized, because \(a \tanh _{w}(W)\rightarrow \pm a\) as \(w\rightarrow 0\). Unlike most previous works, our method does not require the STE and may avoid some of its limitations, such as gradient mismatch. The parametrized functions in Eq. (2) have some similarity to the Differentiable Soft Quantization (DSQ) method proposed by Gong et al.13, but the tunable width used in this work is viewed as a global, non-trainable, time-evolving parameter; ProxQuant by Bai et al.16 used a time-evolving regularizer to binarize the weights, but it cannot regularize the activations in a similar manner.
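A minimal TensorFlow sketch of these two ingredients, assuming a Keras-style dense layer; the class name PolarizedDense and all variable names are ours, and the width is stored as a non-trainable variable that is lowered during training:

import tensorflow as tf

def sgmd_w(x, w):
    """Parameterized Sigmoid of Eq. (2); approaches the Heaviside step as w -> 0."""
    return tf.sigmoid(x / w)

class PolarizedDense(tf.keras.layers.Layer):
    """Dense layer that uses polarized weights a*tanh(W/w) in place of the raw weights W."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        # width w: non-trainable, decreased adiabatically during training
        self.w = tf.Variable(1.0, trainable=False, name="width")

    def build(self, input_shape):
        self.W = self.add_weight(name="W", shape=(int(input_shape[-1]), self.units),
                                 initializer="glorot_uniform")   # raw trainable weights
        self.a = self.add_weight(name="a", shape=(), initializer="ones")  # per-layer scale

    def call(self, x):
        polarized = self.a * tf.tanh(self.W / self.w)  # -> +/- a as w -> 0
        return tf.matmul(x, polarized)

Lowering the non-trainable width variable during training (for example following Scheme V1 below) is what drives the polarization; the optimizer itself is unchanged.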

For small tasks such as the hand-written number recognition below, the width w can be adjusted manually. For the other, larger tasks, we adopt one of the following two self-adaptive schemes to adjust w. Scheme V1: a target accuracy is set for each w. The network is trained using the current value of w. Once the validation accuracy (using the present w) reaches the target accuracy, or the training exceeds 15–30 epochs (the exact number depending on the task), the value of w is decreased by a factor of 1.2–2 (depending on the task, but not crucial) and the learning rate is also reduced by a factor of 1–2. In the meantime, the next target accuracy is set to the present accuracy. Scheme V2: there is no target accuracy in this scheme. The width w and the learning rate are decreased as in Scheme V1 once the binary validation accuracy (i.e., at \(w=0\) instead of at the current width) saturates. Note that this scheme does not require the pre-training needed for V1 and may be viewed as a more flexible version of V1. A schematic sketch of Scheme V1 is shown below.
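This sketch is a minimal outline under our own naming: train_one_epoch and validate stand for the usual training and validation loops, and the numerical factors are placeholders within the ranges quoted above.

def scheme_v1(model, train_one_epoch, validate,
              w=1.0, lr=1e-3, w_min=1e-3,
              max_epochs_per_stage=20, shrink=1.5):
    """Scheme V1: shrink the width w once the validation accuracy (at the
    current w) reaches a moving target, or after a fixed number of epochs."""
    target_acc = validate(model, w)              # accuracy of the pre-trained model
    while w > w_min:
        for _ in range(max_epochs_per_stage):    # 15-30 epochs, task dependent
            train_one_epoch(model, w, lr)
            acc = validate(model, w)             # validate at the *present* width
            if acc >= target_acc:
                break
        target_acc = acc                         # next target = accuracy just reached
        w /= shrink                              # width reduced by a factor of 1.2-2
        lr /= shrink                             # learning rate reduced by a factor of 1-2
    return model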

Numerical experiments

We use the adiabatic training method described above to binarize the networks for the following four tasks, based on typical neural network structures: (1) a fully-connected network for recognizing hand-written numbers; (2) a convolutional neural network for recognizing dog and cat pictures; (3) a convolutional neural network for recognizing spoken numbers; (4) a ResNet-20 or VGG-Small neural network for recognizing pictures from the CIFAR-10 dataset with 10 classes. We compare the networks obtained from four different supervised training procedures: the conventional non-binary network trained with standard Sigmoid or ReLU activations and real-valued weights, the binary-activation network trained with the Heaviside (binary) activation and real-valued weights, the binary-weight network trained with Sigmoid or ReLU activations and binary weights, and the full-binary network trained with binary activations and binary weights. All networks are built using the TensorFlow module26.

Figure 1

(A) The fully-connected neural network with one hidden layer, used for the recognition of hand-written number images. (B) The comparison of validation accuracies (using the present width) as a function of epochs for the non-binary, binary-activation, binary-weight, and full-binary networks on the hand-written number images. The adiabatic change of the width w as a function of epochs for the full-binary case is shown at the bottom of the panel. The sharp dips in the full-binary training curve are due to the sudden changes in w. (C) The distribution of activation outputs (top) and weight values (bottom) for successive epochs (each color band represents one epoch) for the hand-written number recognition task (layer 1 only). The left and right panels are results from the non-binary network (with the hybrid Sigmoid-ReLU activation function) and the full-binary network, respectively. Each colored band contains 1800 data samples out of \(\sim 100\,\mathrm{k}\) weights or 1 million activations. The weights corresponding to the edges of the image remain unchanged during training.

Image recognition of hand-written numbers

A fully-connected network with one hidden layer (see Fig. 1A) is sufficient for this task27. 70 k image samples from the MNIST dataset28 are used, with 60 k for training/validating and 10 k for testing. Drop-out29,30 is used to alleviate over-fitting, and no data augmentation is used. For training the non-binary network, the neurons in the hidden layer use the ReLU activation function, and an accuracy of 98.2% can be reached. Since the ReLU function cannot go smoothly to a Heaviside step-function, to train the binary networks we use a hybrid Sigmoid-ReLU activation function,

$$\begin{aligned} f_w(x)=2\,\mathrm{sgmd}[\mathrm{ReLU}(x)/w]-1, \end{aligned}$$
(3)

which approaches the Heaviside function as \(w\rightarrow 0\). The width w is adjusted manually for this task. The same fully-connected network (of the same size and structure) is trained for 8 epochs using this hybrid activation with \(w=1/3\) for the neurons in the hidden layer, followed by 3 epochs using \(w=1/12\) and 2 epochs with the Heaviside activation. The realized binary-activation network with the Heaviside activation has a testing accuracy of 97.1%, as shown in Fig. 1B, similar to the conventional non-binary network using the ReLU activation. It should be noted that the initial epochs of training with finite w are crucial; otherwise the validation accuracy can only reach \(\sim 85\%\) if the Heaviside activation is used from the start. For the binary-weight network, the network is trained for 8 epochs using the polarized weights with \(w=1/3\), followed by 3 epochs each for \(w=1/10,1/20,1/50,1/100,1/300,1/500\) and 0, and the realized binary-weight network finally reaches a testing accuracy of 97.2% as well. The exact sequence of w is not important as long as it decreases. For the full-binary network, the network is trained with \(w_\text{ weights } = 2 w_\text{ activation }\), and \(w_\text{ weights }\) is decreased in the same way as in training the binary-weight network. A final validation accuracy of 96.0% is reached. Therefore, both the half-binary and full-binary networks can reach almost the same validation accuracy as the non-binary network.
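A minimal sketch of the hybrid activation of Eq. (3) in TensorFlow (the function name is ours):

import tensorflow as tf

def hybrid_sigmoid_relu(x, w):
    """Hybrid Sigmoid-ReLU activation of Eq. (3): 0 for x < 0, Heaviside-like as w -> 0."""
    return 2.0 * tf.sigmoid(tf.nn.relu(x) / w) - 1.0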

Figure 1C shows the distributions of the activation outputs and the weights from the non-binary and full-binary networks. For the full-binary network, both the activation outputs and the weights become progressively more binarized, and eventually purely binarized at the end of the training process. In comparison, both the activations and the weights in the non-binary network spread across the full range for all epochs, as expected. In the full-binary case, the weight scaling factor a also needs to be trained and is in general a real-valued number, but at the end of the training, a can be set to unity and all activations and/or weights become purely binary. We should mention that it is not desirable to have many weights or activation values close to the center (near zero) of the Sigmoid/Tanh function during training, because the center part cannot be accessed in binarized networks. Therefore, the ratio between the range of the initial weights \(\mathrm {Range}(\left| W \right| )\) and the initial value of the width w should satisfy \(\mathrm {Range}(\left| W \right| )/w \gtrsim 1\) (for the first band of the full-binary panel of Fig. 1C, \(\mathrm {Range}(\left| W \right| ) \simeq 1/2\) and \(w = 1/3\)), and in the meantime the learning rate should be moderately large to avoid the center part.

Figure 2

(A) The convolutional neural network with three hidden layers used for the dog-cat recognition and the spoken number recognition. (B) The validation accuracies as a function of the number of epochs for the dog-cat task using networks trained with the usual full precision (blue), binary activation (orange), binary weights (green), and full binary (red). All networks are trained from scratch. The gray stepwise curve shows the adiabatic change of the width w towards zero as a function of epochs. (C) Same as (B) for the spoken number task. (D) Results for the CIFAR-10 recognition task based on the ResNet-20 and VGG-Small networks; the binary-weight networks start from a pre-trained (not shown) model, the binary-activation networks are trained from scratch, and the full-binary networks start from the binary-activation networks; for the latter two, the batch normalization layers work in training mode for evaluation because drop-out is used (inference mode is used for testing). The binary-weight ResNet-20 stops early because it has already reached the target accuracy. The full-precision accuracies for these two networks are 92.4% and 93.8%12.

The dog-cat picture recognition

A convolutional neural network (CNN) with three hidden layers (see Fig. 2A) is used for this task. In this network, the convolution kernel is \(3\times 3\) and the pooling size is \(2\times 2\). 25 k pictures of dogs and cats from the Kaggle dataset31 (a subset of Asirra32) are used, with 23.4 k for training/validating and 2.6 k for testing. The pictures are resized to \(64\times 64\) pixels and converted to grayscale before training; the training pictures are randomly flipped horizontally and shifted by a maximum of 8 pixels for data augmentation.

For the conventional non-binary network, the standard ReLU activations with max-pooling are used in all hidden layers, and batch normalization33 is used to accelerate training. Within 120 epochs of training, the validation accuracy saturates at 89.7%, as shown in Fig. 2B. To train a binary-activation network with the Heaviside activation, the ReLU activation is replaced by the parametrized Sigmoid activation of Eq. (2) with varying width w in the hidden layers, except the last one, which uses the ReLU-Sigmoid hybrid activation (using the Sigmoid here produces mostly zero outputs when binarized and is thus not ideal). Because max-pooling makes no sense over values of 0 and 1, the order of pooling and activation is changed to convolution \(\rightarrow\) batch normalization \(\rightarrow\) pooling \(\rightarrow\) activation, the same as in11. Previously the width was adjusted by hand, but this is not practical for large networks; from this task on, we use the self-adaptive Scheme V1 mentioned above. With these operations, the realized binary-activation network equipped with Heaviside activation functions reaches a validation accuracy of 84.6% (see the orange curve in Fig. 2B). For the binary-weight network, we use the same method to reduce the width w, and a validation accuracy of 86.0% is achieved (see the green curve in Fig. 2B). In order to realize full binarization of both activations and weights, we have to leave the first (weights only) and last layers un-binarized, as in most other binarization methods12,13. The widths w for the activations and weights can be reduced to zero alternately (or simultaneously) to achieve full binarization, and a validation accuracy of 85.5% is realized (the red curve in Fig. 2B), close to the non-binary and half-binary networks.
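A minimal Keras-style sketch of the reordered block described above (the \(3\times 3\) kernel and \(2\times 2\) pooling come from the text, while the filter count is a placeholder); pooling is applied before the nearly binary activation so that max-pooling does not act on 0/1 values:

import tensorflow as tf

def conv_bn_pool_act(x, filters, w):
    """Convolution -> batch normalization -> pooling -> parameterized Sigmoid activation."""
    x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.MaxPooling2D(2)(x)
    return tf.sigmoid(x / w)  # sgmd_w of Eq. (2); Heaviside-like as w -> 0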

Voice recognition of spoken numbers

This task uses a convolutional neural network (see Fig. 2A) similar to the one used for the dog-cat recognition task above; a rough sketch is given below. The kernel length is 30, and the pool sizes for the three layers are 10, 8, and 5, respectively. 14 k audio waveforms of human spoken numbers (each resized to an array of length 8000) from the Speech Commands dataset are used, with 13 k for training/validating and 1 k for testing34,35; no data augmentation is used for this task. The non-binary network trained using the ReLU activation and real-valued weights along with max-pooling reaches a validation accuracy of 93.7% (see the blue curve in Fig. 2C). Following the same procedure as in the dog-cat case, the validation accuracies for the binary-activation (the orange curve in Fig. 2C), binary-weight (the green curve), and full-binary (the red curve) networks trained using the adiabatic method are 91.5%, 91.4%, and 93.0%, all of which are close to the non-binary network. The results for all the previous tasks are summarized in Table 1.
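In the sketch below, only the input length 8000, the kernel length 30, and the pool sizes 10/8/5 come from the text; the filter counts and the dense head are our placeholders.

import tensorflow as tf

def spoken_number_cnn(num_classes=10, w=1.0):
    """Three Conv1D blocks in the conv -> batch norm -> pool -> activation order."""
    inputs = tf.keras.Input(shape=(8000, 1))
    x = inputs
    for filters, pool in [(16, 10), (32, 8), (64, 5)]:
        x = tf.keras.layers.Conv1D(filters, 30, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.MaxPooling1D(pool)(x)
        x = tf.sigmoid(x / w)  # parameterized activation, Eq. (2)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)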

CIFAR-10 classification

In this task, we train a standard ResNet-2036 or VGG-Small network2,12 (with a structure similar to the CNN shown in Fig. 2A) to recognize 60 k (50 k for training/validation and 10 k for testing) \(32\times 32\) color images belonging to 10 classes from the CIFAR-10 dataset37,38. This task is much more challenging than the previous ones. However, the adiabatic method can be applied directly without modification, except that the first and last layers are kept un-binarized for both half- and full-binarization. For the data augmentation, as in36, we randomly flip the pictures horizontally and shift the images by a maximum of 4 pixels.
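A minimal sketch of this augmentation using standard Keras preprocessing layers (a shift of 4 pixels on a 32-pixel image corresponds to a factor of 0.125; the exact layer choice is our assumption):

import tensorflow as tf

# Random horizontal flip plus a random shift of up to 4 pixels (4/32 = 0.125).
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomTranslation(height_factor=4/32, width_factor=4/32),
])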

Using the adiabatic method and starting from a pre-trained full-precision network with ReLU activations, we succeed in training both the ResNet-20 and VGG-Small networks to function with binary weights, as shown in Fig. 2D, where the validation accuracy of the binary-weight network approaches the accuracy of the full-precision network. The test accuracies reached are 90.2% and 93.3%, respectively. Compared with VGG-Small, ResNet-20 is much deeper but has far fewer feature maps per layer, so binarization may cause more information loss; it is known to be very hard to binarize its activations. Since maintaining a target accuracy while pushing the activations to binary values is practically impossible, Scheme V2 is used to adjust the width w. We successfully train the binary-activation and full-binary ResNet-20 to testing accuracies of 85.7% and 83.0%. A slight modification of the network structure, adding a shortcut at every convolutional layer39, increases these accuracies to 86.2% and 84.1%. On the other hand, the VGG-Small network is more friendly to activation binarization, and the binary-activation network can be trained using the adiabatic method (adjusting the width using Scheme V1) starting with \(w=1\) (or pre-training with the Sigmoid activation), reaching 92.4%. The full-binary VGG-Small network (see the right panel of Fig. 2D) can also be trained using the same method, with an accuracy of 90.7%. For the full binarization of both ResNet-20 and VGG-Small, starting from the weights of the binary-activation network can boost the accuracy (a gain of \(\sim 1\%\)), since the binarization of the activations is the bottleneck of the whole task. Table 1 lists the accuracies for the CIFAR-10 task obtained with our method and with other existing methods; the accuracies from our method are better than or comparable to those from other methods and approach the accuracies of the full-precision networks. In the table, methods like BNN and XNOR-Net use a relatively vanilla STE for the gradient of the activations and thus have relatively low accuracy. Improvements on the STE, like DSQ, are more complicated but can increase the accuracy, while CL-BCNN introduces channel-wise interactions, effectively changing the network structure, and achieves the highest accuracy. Our method can be viewed as independent of the STE; it is simple yet still achieves relatively good performance.

Here, we discuss some tricks and analysis related to binarization. Firstly, for the parameterized binarizing function for the activations, there are many choices: the Sigmoid, the Sigmoid-ReLU hybrid, \({\mathrm {clip}}(x/w,0,1)\), etc. (see the sketch below). In the early layers, a symmetric function usually works better, as the mean value and the shape of the distribution of its output are not strongly affected. Secondly, it should be mentioned that the L2 regularization (or weight decay40), often used in ResNet-20 or VGG-Small networks, may reduce the magnitude of the raw weights over time or even flip their signs, or may push the network towards the center part of the Sigmoid function, which is not accessible after binarization. This may weaken the effect of reducing w in the adiabatic method. From our experiments, when binarizing weights only, this is not problematic at all, but when binarizing activations, regularizations such as drop-out are recommended instead in the adiabatic method.
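For reference, the clipped linear binarizer mentioned above can be written as follows (a one-line sketch; the function name is ours):

import tensorflow as tf

def clip_w(x, w):
    """Alternative binarizing activation clip(x/w, 0, 1); approaches the Heaviside step as w -> 0."""
    return tf.clip_by_value(x / w, 0.0, 1.0)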

To see whether the binarized neural network identifies the same defining part of the image as the full-precision network, we investigated the Gradient-weighted Class Activation Mapping (Grad-CAM)41 (for VGG-Small) and the usual CAM (for ResNet-20) of the full-precision and full-binary networks, as shown in Fig. 3. We find that the neural network indeed focuses on the defining part of the image that it correctly identifies. However, for the full-binary ResNet-20, we find that all the weights in the last (fully-connected) layer are negative, and the heat map is reversed. This suggests that the structure of the weights in the binary ResNet-20 may be completely different from that of its full-precision counterpart, which may be the reason that the activations of ResNet-20 are very difficult to binarize.

Table 1 Comparison of accuracy for all tasks.
Figure 3

Grad-CAM/CAM of VGG-Small (left) and ResNet-20 (right). In the top row, the images are correctly classified by both the full-precision (FP) and the full-binary (BN) networks, and the two networks identify the same defining part of the image. In the bottom row, the images are mis-identified by the full-binary network, and the BN network therefore highlights a different defining part from the FP network. The numbers are the confidences. Notice that for the binary ResNet-20 everything is reversed.

Discussion and conclusion

With the four different tasks demonstrated above, we showed that, by adiabatically reducing the width of the activation and weight-polarization functions towards zero, existing neural networks can be trained to work with binarized, Heaviside-activated neurons and/or binarized weights, and the realized binary networks have validation accuracies approaching those obtained from the conventional non-binary neural networks. The most distinguishing advantage of this adiabatic approach to training binary networks is its applicability to many (if not all) existing neural networks with no or only slight changes in the network size or structure. In the code, this corresponds to changing only a few lines of the training programs (see the program codes43). Our method can also be applied in conjunction with other existing tricks (e.g. adding channel-wise interactions as in42) to obtain better results.

Compared with non-binary neural networks, the benefits of binary networks are obvious. Firstly, a Heaviside activation simplifies the neuron to a 1-bit yes-or-no decision maker, which greatly speeds up and simplifies both software- and hardware-implemented neurons. For example, a CMOS-based hardware implementation takes several dozen transistors to construct a Sigmoid-activated neuron, but only one transistor for a Heaviside-activated neuron. Binarized weights heavily reduce the storage as well as the computation burden: compared with a float32 weight, a binary weight takes only about 1/32 of the storage space (although this requires a specialized hardware design). VGG-Small has 4.67 M parameters, costing 19 MB of memory; binarization can eliminate 4.57 M full-precision weights, reducing the storage to less than 1 MB. The memory saving for larger neural networks is even more drastic. Secondly, and more dramatically, the most time/area/energy-consuming part of a conventional non-binary neural network is the weight readout and the multiply-accumulate (MAC) operation44, which calculates the dot product of the outputs of the previous layer's neurons and the weights. If the number of neurons involved is N, then one MAC operation includes N multiplication and N addition operations. In VGG-Small, \(1.2\times 10^9\) flops are required for each sample. However, for a binary-activation neural network, since the neuron outputs are either 0 or 1, the multiplication is completely unnecessary: one only needs to sum the weights associated with neurons whose output is 1 and omit all the weights corresponding to zero outputs. The binarized weights in a binary-weight network reduce the matrix multiplication for a similar reason. For full-binary neural networks, since both the neuron outputs and the weights are binary, the addition is further simplified to a single type of counting (XNOR) operation (strictly speaking, only multiplication between \(\pm 1\) values is equal to XNOR; however, multiplication between \(\pm 1\) weights and any binary activation only differs by a bias term, which can be absorbed into the batch normalization layer). The fully binarized VGG-Small network requires only \(3\times 10^5\) flops per sample, for the un-binarized first and last layers as well as the batch normalization. The speed-up is around 7 times according to other works10,39. Moreover, because the Heaviside activation function simply turns on or off the contributions from some of the weights, there is no need to read out the weights corresponding to the off-neurons, which reduces the read-out operations from N to \(\sim N/2\) on average. Because of these benefits, binarized neural networks can have a huge advantage in deployment.
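To make the XNOR-and-count idea concrete, here is a small NumPy sketch (ours, for illustration only) that computes the dot product of two \(\pm 1\) vectors from their packed bit representations:

import numpy as np

def pack_pm1(v):
    """Pack a +/-1 vector into bits: +1 -> 1, -1 -> 0."""
    return np.packbits((v > 0).astype(np.uint8))

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two +/-1 vectors of length n: 2 * popcount(XNOR) - n."""
    xnor = np.invert(np.bitwise_xor(a_bits, b_bits))  # bit is 1 where the signs agree
    matches = int(np.unpackbits(xnor)[:n].sum())      # popcount over the n valid bits
    return 2 * matches - n

a = np.random.choice([-1, 1], size=100)
b = np.random.choice([-1, 1], size=100)
assert xnor_dot(pack_pm1(a), pack_pm1(b), len(a)) == int(np.dot(a, b))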