
Training Integer-Only Deep Recurrent Neural Networks

  • Original Research, SN Computer Science

Abstract

Recurrent neural networks (RNN) are the backbone of many text and speech applications. These architectures are typically made up of several computationally complex components such as non-linear activation functions, normalization, bi-directional dependence, and attention. In order to maintain good accuracy, these components are frequently run using full-precision floating-point computation, making them slow, inefficient, and difficult to deploy on edge devices. In addition, the complex nature of these operations makes them challenging to quantize using standard quantization methods without a significant performance drop. We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN). Our approach supports layer normalization, attention, and an adaptive piecewise linear (PWL) approximation of activation functions, to serve a wide range of state-of-the-art RNNs. The proposed method enables RNN-based language models to run on edge devices with a \(2\times\) improvement in runtime and a \(4\times\) reduction in model size, while maintaining similar accuracy to their full-precision counterparts.

Notes

  1. https://github.com/deepmind/lamb.

  2. https://github.com/deepmind/lamb/blob/254a0b0e330c44e00cf535f98e9538d6e735750b/lamb/experiment/mogrifier/config/c51c838b33a5+_tune_wikitext-2_35m_lstm_mos2_fm_d2_arms/trial_747/config.

  3. https://github.com/freewym/espresso.

  4. https://github.com/pytorch/pytorch/blob/1.7/binaries/speed_benchmark_torch.cc.

  5. https://github.com/freewym/espresso/blob/master/examples/asr_librispeech/run.sh.

References

  1. Rumelhart D, Hinton G.E, Williams R.J. Learning internal representations by error propagation. 1986

  2. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.

  3. Cho K, Van Merriënboer B, Bahdanau D, Bengio Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 2014

  4. Chen MX, Firat O, Bapna A, Johnson M, Macherey W, Foster G, Jones L, Schuster M, Shazeer N, Parmar N, Vaswani A, Uszkoreit J, Kaiser L, Chen Z, Wu Y, Hughes M. The best of both worlds: Combining recent advances in neural machine translation. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 76–86. Association for Computational Linguistics Melbourne, Australia 2018. https://doi.org/10.18653/v1/P18-1008. https://www.aclweb.org/anthology/P18-1008

  5. Wang C, Wu S, Liu S. Accelerating transformer decoding via a hybrid of self-attention and recurrent neural network. arXiv preprint arXiv:1909.02279 2019.

  6. Zhang L, Wang S, Liu B. Deep learning for sentiment analysis: a survey. CoRR. 2018;abs/1801.07883. arXiv:1801.07883

  7. You Q, Jin H, Wang Z, Fang C, Luo J. Image captioning with semantic attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016.

  8. Mao J, Xu W, Yang Y, Wang J, Huang Z, Yuille A. Deep captioning with multimodal recurrent neural networks (m-rnn) 2014. https://doi.org/10.48550/ARXIV.1412.6632

  9. He Y, Sainath T.N, Prabhavalkar R, McGraw I, Alvarez R, Zhao D, Rybach D, Kannan A, Wu Y, Pang R, et al. Streaming end-to-end speech recognition for mobile devices. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385 2019. IEEE

  10. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A.N, Kaiser Ł, Polosukhin I. Attention is all you need. In: Guyon I, Luxburg U.V, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. 2017. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

  11. Devlin J, Chang M.-W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics Minneapolis, Minnesota 2019. https://doi.org/10.18653/v1/N19-1423. https://www.aclweb.org/anthology/N19-1423

  12. Kasai J, Peng H, Zhang Y, Yogatama D, Ilharco G, Pappas N, Mao Y, Chen W, Smith N.A. Finetuning pretrained transformers into rnns. arXiv preprint arXiv:2103.13076 (2021)

  13. Jacob B, Kligys S, Chen B, Zhu M, Tang M, Howard AG, Adam H, Kalenichenko D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2018:2704–13.

  14. Krishnamoorthi R. Quantizing deep convolutional networks for efficient inference: A whitepaper. 2018; arXiv preprint arXiv:1806.08342

  15. Courville V, Nia VP. Deep learning inference frameworks for arm cpu. Journal of Computational Vision and Imaging Systems. 2019;5(1):3–3.

  16. Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y. Quantized neural networks: Training neural networks with low precision weights and activations. J Mach Learn Res. 2018;18(187):1–30.

  17. Darabi S, Belbahri M, Courbariaux M, Nia V.P. BNN+: improved binary network training. CoRR. 2018; abs/1812.11800. arXiv:1812.11800

  18. Esser S.K, McKinstry J.L, Bablani D, Appuswamy R, Modha D.S. Learned step size quantization. 2019; arXiv preprint arXiv:1902.08153

  19. Ott J, Lin Z, Zhang Y, Liu S.-C, Bengio Y. Recurrent neural networks with limited numerical precision. 2016; arXiv preprint arXiv:1608.06902

  20. He Q, Wen H, Zhou S, Wu Y, Yao C, Zhou X, Zou Y. Effective quantization methods for recurrent neural networks. 2016; arXiv preprint arXiv:1611.10176.

  21. Kapur S, Mishra A, Marr D. Low precision rnns: Quantizing rnns without losing accuracy. 2017; arXiv preprint arXiv:1710.07706

  22. Hou L, Zhu J, Kwok J.T.-Y, Gao F, Qin T, Liu T.-y. Normalization helps training of quantized lstm. 2019

  23. Ardakani A, Ji Z, Smithson S.C, Meyer B.H, Gross W.J. Learning recurrent binary/ternary weights. 2018; arXiv preprint arXiv:1809.11086

  24. Sari E, Partovi Nia V. Batch normalization in quantized networks. In: Proceedings of the Edge Intelligence Workshop, 2020; 6–9. https://www.gerad.ca/en/papers/G-2020-23

  25. Wu Y, Schuster M, Chen Z, Le Q.V, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. 2016; arXiv preprint arXiv:1609.08144

  26. Bluche T, Primet M, Gisselbrecht T. Small-footprint open-vocabulary keyword spotting with quantized lstm networks. 2020; arXiv preprint arXiv:2002.10851

  27. Ba J, Kiros J.R, Hinton G.E. Layer normalization. 2016; ArXiv abs/1607.06450

  28. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: Bengio Y, LeCun Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings 2015. http://arxiv.org/abs/1409.0473

  29. Chorowski J, Bahdanau D, Serdyuk D, Cho K, Bengio Y. Attention-based models for speech recognition. 2015; arXiv preprint arXiv:1506.07503

  30. Sari E, Courville V, Partovi Nia V. iRNN: Integer-only recurrent neural network. In: Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - ICPRAM, 2022; 110–121

  31. Marcus MP, Santorini B, Marcinkiewicz MA. Building a large annotated corpus of English: The Penn Treebank. Comput Linguist. 1993;19(2):313–30.

  32. Merity S, Xiong C, Bradbury J, Socher R. Pointer sentinel mixture models. CoRR abs/1609.07843 (2016) arXiv:1609.07843

  33. Melis G, Kočiský T, Blunsom P. Mogrifier LSTM. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=SJe5P6EYvS

  34. Wang Y, Chen T, Xu H, Ding S, Lv H, Shao Y, Peng N, Xie L, Watanabe S, Khudanpur S. Espresso: A fast end-to-end neural speech recognition toolkit. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 136–143 (2019). https://doi.org/10.1109/ASRU46091.2019.9003968

  35. Panayotov V, Chen G, Povey D, Khudanpur S. Librispeech: an asr corpus based on public domain audio books. In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference On, pp. 5206–5210 (2015). IEEE

  36. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)

  37. Mikolov T, Sutskever I, Deoras A, Le H.-S, Kombrink S, Cernocky J. Subword language modeling with neural networks. Preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf) 8, 67 (2012)

  38. Werbos PJ. Backpropagation through time: what it does and how to do it. Proc IEEE. 1990;78(10):1550–60.

  39. Krause B, Kahembwe E, Murray I, Renals S. Dynamic evaluation of neural sequence models. In: International Conference on Machine Learning, pp. 2766–2775 (2018). PMLR

  40. Gal Y, Ghahramani Z. A theoretically grounded application of dropout in recurrent neural networks. Advances in neural information processing systems 29 (2016)

  41. Yang Z, Dai Z, Salakhutdinov R, Cohen W.W. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (2018)

  42. Izmailov P, Podoprikhin D, Garipov T, Vetrov D, Wilson A.G. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407 (2018)

  43. Kingma D.P, Ba J. Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)

  44. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

  45. Hendrycks D, Gimpel K. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016)

Author information

Corresponding author

Correspondence to Vahid Partovi Nia.

Ethics declarations

Conflict of interest

All authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances on Pattern Recognition Applications and Methods 2022” guest edited by Ana Fred, Maria De Marsico and Gabriella Sanniti di Baja.

Appendices

Appendix A

A.1 Details on LSTM-Based Models

For BiLSTM cells, nothing stated in Sect. “Integer-Only LSTM Network” changes, except that we enforce the forward LSTM hidden state \(\overrightarrow{\textbf{h}}_t\) and the backward LSTM hidden state \(\overleftarrow{\textbf{h}}_t\) to share the same quantization parameters so that they can be concatenated into a single vector. If the model has embedding layers, they are quantized to 8-bit, as we found they were not sensitive to quantization. If the model has residual connections (e.g., between LSTM cells), they are quantized to 8-bit integers. In encoder-decoder models, the attention layers are quantized using the method described in Sect. “Integer-Only Attention”. The weights of the model's last fully connected layer are quantized to 8-bit to allow for 8-bit matrix multiplication. We do not quantize its outputs and let them remain 32-bit integers, since this is typically the point where the model's job is considered done and some post-processing (e.g., beam search) takes over.
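
To make the shared-parameter constraint concrete, below is a minimal NumPy sketch of quantizing the forward and backward hidden states with a single scale and zero-point so that their integer representations can be concatenated directly. The helper names and the asymmetric 8-bit convention are our own illustration, not the paper's implementation.

```python
import numpy as np

def qparams(rmin, rmax, n_bits=8):
    # Asymmetric quantization parameters covering [rmin, rmax] (assumed convention).
    qmax = 2 ** n_bits - 1
    scale = (rmax - rmin) / qmax
    zero_point = int(round(-rmin / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    return np.clip(np.round(x / scale) + zero_point, 0, 2 ** n_bits - 1).astype(np.int32)

# Hypothetical forward and backward hidden states of a BiLSTM time step.
h_fwd = np.random.uniform(-1.0, 1.0, size=200)
h_bwd = np.random.uniform(-1.0, 1.0, size=200)

# Compute one (scale, zero_point) pair over the joint range of both directions,
# so the two integer tensors can be concatenated without any rescaling.
scale, zp = qparams(min(h_fwd.min(), h_bwd.min()), max(h_fwd.max(), h_bwd.max()))
q_h = np.concatenate([quantize(h_fwd, scale, zp), quantize(h_bwd, scale, zp)])
```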

A.2 Experimental Details

Table 7 Details of the model parameters and training time for each experimental setup: language modeling on PTB (LM PTB), language modeling on WikiText2 (LM WikiText2), and automatic speech recognition on LibriSpeech (ASR LibriSpeech)

In this section, we provide further details of our experimental setups. The number of parameters and training time of the model are reported in Table 7.

A.2.1 LayerNorm LSTM on PTB

The dataset was preprocessed as in [37]. The vocabulary size is 10K. We report the best perplexity per word on the validation and test sets for a language model with embedding size 200 and one LayerNormLSTM cell of state size 200. The lower the perplexity, the better the model performs. These experiments focus on the relative increase of perplexity between the full-precision models and their 8-bit quantized counterparts. We did not aim to reproduce state-of-the-art performance on PTB and used a naïve set of hyper-parameters. The full-precision network is trained for 100 epochs with batch size 20 and a BPTT [38] window size of 35. We used the SGD optimizer with weight decay \(10^{-5}\) and learning rate 20, which is divided by 4 when the loss plateaus for more than two epochs without a relative decrease of \(10^{-4}\) in perplexity. We use gradient clipping of 0.25. We initialize the quantized models from the best full-precision checkpoint and train for another 100 epochs. For the first five epochs, we do not enable quantization; we only gather range statistics used to compute the quantization parameters.
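
For reference, a hedged PyTorch sketch of the optimizer and learning-rate schedule described above; the `model` placeholder and the exact scheduler arguments are our reading of the text, not the authors' released code.

```python
import torch

# Placeholder standing in for the LayerNormLSTM language model described above.
model = torch.nn.LSTM(200, 200)

optimizer = torch.optim.SGD(model.parameters(), lr=20.0, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="min",          # validation perplexity should decrease
    factor=0.25,         # divide the learning rate by 4
    patience=2,          # plateau for more than two epochs
    threshold=1e-4,      # relative decrease below 1e-4 counts as a plateau
    threshold_mode="rel",
)

# Per batch (inside the training loop):
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)  # gradient clipping of 0.25 (norm clipping assumed)
#   optimizer.step()
# Per epoch: scheduler.step(validation_perplexity)
```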

A.2.2 Mogrifier LSTM on WikiText2

We describe the experimental setup for the Mogrifier LSTM on WikiText2. Note that we follow the setup of [33], where neither dynamic evaluation [39] nor Monte Carlo dropout [40] is used. The vocabulary size is 33279. We use a two-layer Mogrifier LSTM with embedding dimension 272, state dimension 1366, and capped input gates. We use six modulation rounds per Mogrifier layer with low-rank dimension 48, and 2 Mixture-of-Softmax layers [41]. The input and output embeddings are tied. We use a batch size of 64 and a BPTT window size of 70. We train the full-precision Mogrifier LSTM for 340 epochs, after which we enable Stochastic Weight Averaging (SWA) [42] for 70 epochs. For the optimizer we used Adam [43] with a learning rate of \(\approx 3\times 10^{-3}\), \(\beta _1=0\), \(\beta _2=0.999\), and weight decay \(\approx 1.8\times 10^{-4}\). We clip the gradient norm to 10. We use the same hyper-parameters for the quantized models, which we initialize from a pre-trained full-precision checkpoint and train for another 200 epochs. During the first two epochs, we do not perform QAT but gather min and max statistics across the network to obtain a correct starting estimate of the quantization parameters. After that, we enable 8-bit QAT on every component of the Mogrifier LSTM: weights, matrix multiplications, element-wise operations, and activations. Then we replace the activation functions in the model with quantization-aware PWLs and continue training for 100 epochs (this schedule is sketched after this paragraph). We perform a complete ablation of our method to study the effect of each component. Quantizing the weights and matrix multiplications accounts for about 0.1 of the perplexity increase. There is a clear performance drop after adding quantization of element-wise operations, with a perplexity increase of about 0.3. This is because element-wise operations in the cell and hidden state computations affect the flow of information across timesteps and the residual connections across layers. Adding quantization of the activations does not impact the performance of the network.
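
A tiny helper summarizing that phased schedule; the phase names and exact epoch boundaries are an illustrative reading of the text, not the authors' training code.

```python
def qat_phase(epoch, stats_epochs=2, qat_epochs=200, pwl_epochs=100):
    """Return the training phase of the quantized Mogrifier run at a given epoch."""
    if epoch < stats_epochs:
        return "collect min/max statistics"          # observers only, no fake quantization
    if epoch < qat_epochs:
        return "8-bit QAT"                           # weights, matmuls, element-wise ops, activations
    if epoch < qat_epochs + pwl_epochs:
        return "8-bit QAT + quantization-aware PWL"  # activations replaced by PWL approximations
    return "done"

# e.g. qat_phase(1) -> 'collect min/max statistics'; qat_phase(250) -> '8-bit QAT + quantization-aware PWL'
```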

A.2.3 ESPRESSO LSTM on LibriSpeech

The encoder comprises 4 CNN-BatchNorm-ReLU blocks followed by 4 BiLSTM layers with 1024 units. The decoder consists of 3 LSTM layers with 1024 units, with Bahdanau attention over the encoder hidden states and residual connections between each layer. The dataset preprocessing is precisely the same as in [34]. We train the model for 30 epochs on one V100 GPU, which takes approximately six days. We use a batch size of 24 while limiting the maximum number of tokens in a mini-batch to 26000. Adam is used with a starting learning rate of 0.001, which is divided by 2 when the validation metric plateaus without a relative decrease of \(10^{-4}\) in performance. Cross-entropy with uniform label smoothing \(\alpha =0.1\) [44] is used as the loss function. At evaluation time, the model predictions are weighted using a pre-trained full-precision 4-layer LSTM language model (shallow fusion). We consider this language model an external component of the ESPRESSO LSTM and do not quantize it due to the lack of resources; our language modeling experiments have already shown that quantized language models retain their performance. We refer the reader to [34] and the training script (footnote 5) for a complete description of the experimental setup.

Fig. 3 Quantization-aware PWL for several functions: exponential function approximation over \(x \in [-10, 0]\) with a 4, b 8, c 16, d 32 pieces; cosine over \(x \in [-\pi , \pi ]\) with e 4, f 8 pieces; GeLU [45] over \(x \in [-2, 2]\) with g 4, h 5 pieces

We initialize the quantized model from the pre-trained full-precision ESPRESSO LSTM. Due to the lack of resources, we trained the quantized model for only four epochs. The quantized model is trained on 6 V100 GPUs, where each epoch takes two days, for a total of 48 GPU-days. The batch size is set to 8 per GPU with a maximum of 8600 tokens. We made these changes because the GPUs would otherwise run out of VRAM due to the added fake-quantization operations. For the first half of the first epoch, we gathered statistics for the quantization parameters; we then enabled QAT. The activation functions are swapped with quantization-aware PWLs in the last epoch. The number of pieces for the quantization-aware PWLs is 96, except for the exponential function in the attention, which uses 160, as we found more pieces were necessary because of its curvature. The number of pieces is higher than in the language modeling experiments; the difference is that the inputs to the activation functions are 16-bit rather than 8-bit, although the outputs are still quantized to 8-bit, so more pieces are needed to capture the input resolution. Note that it would not be feasible to compute the activation functions with a 16-bit look-up table (LUT) because of its size and the resulting cache misses, whereas using 96 pieces allows for a \(170\times\) reduction in memory consumption compared to such a LUT.
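
As a rough sanity check of that figure (under our own assumption that each PWL piece stores on the order of four parameters, e.g., a knot, a slope, an intercept, and padding): a full 16-bit LUT holds \(2^{16} = 65536\) entries, whereas 96 pieces hold roughly \(96 \times 4 = 384\) values, i.e., about \(65536 / 384 \approx 170\times\) less memory.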

Fig. 4 Example of an iteration of our proposed quantization-aware PWL Algorithm 1. The algorithm reduces the number of pieces by merging two similar adjacent pieces. In this figure, the slopes \(S_{12}\) and \(S_{23}\) are the most similar; therefore, the knot \(k_2\) is removed
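
A short Python sketch of this greedy knot-merging step follows; it illustrates the idea in the caption rather than reproducing the paper's exact Algorithm 1.

```python
import numpy as np

def reduce_pwl_knots(knots, values, num_pieces):
    """Repeatedly remove the interior knot whose two adjacent pieces have the
    most similar slopes, until only `num_pieces` pieces remain."""
    knots, values = list(knots), list(values)
    while len(knots) - 1 > num_pieces:
        slopes = [(values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
                  for i in range(len(knots) - 1)]
        # Slope difference on either side of each interior knot k_1 ... k_{n-1}.
        diffs = [abs(slopes[i + 1] - slopes[i]) for i in range(len(slopes) - 1)]
        j = int(np.argmin(diffs)) + 1   # knot whose neighbouring slopes are most alike
        del knots[j], values[j]
    return knots, values

# Example: start from 16 pieces over a sigmoid and merge down to 8 pieces.
x = np.linspace(-8.0, 8.0, 17)
k, v = reduce_pwl_knots(x, 1.0 / (1.0 + np.exp(-x)), num_pieces=8)
```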

Appendix B

The following section provides some examples of integer-only arithmetic and more details on fixed-point scaling.

B.1 Multiplication

To illustrate how integer-only multiplication is achieved, we work through an example using (4). Defining \(u \in [u_{\min }, u_{\max }]=[-1, 1]\) and \(w \in [w_{\min }, w_{\max }]=[0,5]\), the product of two numbers from these ranges falls into \([z_{\min }, z_{\max }]=[-5, 5]\). From (3), for 8-bit quantization, we have \(S_u \approx 0.0078\), \(Z_u=128\), \(S_w \approx 0.0196\), \(Z_w=0\), \(S_z \approx 0.0392\), \(Z_z=128\). Given \(u=-0.8\) and \(w=2.3\), we have \(q_u=25\) and \(q_w=117\). Therefore, following (4),

$$\begin{aligned} q_z&= \Big \lfloor \frac{S_u S_w}{S_z}\Big (q_u q_w - q_u Z_w - q_w Z_u + Z_u Z_w \Big ) \Big \rceil + Z_z \nonumber \\&= \Big \lfloor \frac{0.0078 \times 0.0196}{0.0392}\Big (25 \times 117 - 25 \times 0 - 117 \times 128 + 128 \times 0 \Big ) \Big \rceil + 128 \nonumber \\&= 81. \end{aligned}$$

Using (2), the floating-point representation of \(q_z\) is \(r_z=-1.8424\), which is close to \(uw=-1.84\). Note that we lose precision at two levels: first when quantizing u and w, and a second time when quantizing z, the multiplication output.
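
The worked example can be checked with a few lines of Python. This is a minimal sketch that reuses the rounded scales quoted above; it is not production integer-only inference code (the fixed multiplier \(S_u S_w / S_z\) would normally be folded into a fixed-point multiply and shift).

```python
import numpy as np

def quantize(r, scale, zero_point):
    # Affine quantization with round-to-nearest, following (3).
    return int(np.round(r / scale)) + zero_point

def dequantize(q, scale, zero_point):
    # Floating-point representation of a quantized value, following (2).
    return scale * (q - zero_point)

S_u, Z_u = 0.0078, 128   # u in [-1, 1]
S_w, Z_w = 0.0196, 0     # w in [0, 5]
S_z, Z_z = 0.0392, 128   # z = u * w in [-5, 5]

q_u = quantize(-0.8, S_u, Z_u)   # 25
q_w = quantize(2.3, S_w, Z_w)    # 117

# Quantized multiplication, following (4).
q_z = int(np.round((S_u * S_w / S_z)
                   * (q_u * q_w - q_u * Z_w - q_w * Z_u + Z_u * Z_w))) + Z_z
print(q_z, dequantize(q_z, S_z, Z_z))   # 81, approximately -1.8424
```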

B.2 Addition

As mentioned in Sect. “Integer-only arithmetic”, addition with quantized numbers can take two forms. The first form is when the two numbers to be added share the same scaling factor and zero-point. For instance, given \(x_1=-0.3, x_2=0.7\) from \([-1,1]\), and \(S_x=0.0078, Z_x=128\), we have \(q_{x_1}=90\) and \(q_{x_2}=218\). The result value y will fall into the range \([-2,2]\), therefore \(S_y \approx 0.0157\) and \(Z_y=128\). Then, because they share the same quantization parameters, following (5),

$$\begin{aligned} q_y&= \Big \lfloor \frac{1}{S_y}\Big (S_x(q_{x_1} - Z_x) + S_x(q_{x_2} - Z_x)\Big ) \Big \rceil + Z_y \\&= \Big \lfloor \frac{S_x}{S_y}\Big (q_{x_1} + q_{x_2} - 2Z_x\Big ) \Big \rceil + Z_y \\&= \Big \lfloor \frac{0.0078}{0.0157}\Big (90 + 218 - 256\Big ) \Big \rceil + 128 \\&= 154. \end{aligned}$$

We have \(r_y=0.4082\), while \(x_1 + x_2 = 0.4\). The second form is when the two numbers do not share the same scaling factor and zero-point. Defining \(a \in [a_{\min }, a_{\max }]=[-1, 1]\) and \(b \in [b_{\min }, b_{\max }]=[0,5]\), the sum of two numbers from these ranges falls into \([c_{\min }, c_{\max }]=[-1, 6]\). We get \(S_a \approx 0.0078\), \(Z_a=128\), \(S_b \approx 0.0196\), \(Z_b=0\), \(S_c \approx 0.0274\), \(Z_c=36\). For \(a=-0.9\) and \(b=3.9\), we have \(q_a=13\) and \(q_b=199\). The quantized addition result \(q_c\), following (6), is

$$\begin{aligned} q_c&= \Big \lfloor \frac{S_a}{S_c}(q_a - Z_a) + \frac{S_b}{S_c}(q_b - Z_b)\Big \rceil + Z_c \\&= \Big \lfloor \frac{0.0078}{0.0274}(13 - 128) + \frac{0.0196}{0.0274}199\Big \rceil + 36 \\&= 146 \end{aligned}$$

and \(r_c=3.0140\), while \(a+b=3.0\).
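
Both forms of the addition can be checked with a similar sketch, again reusing the rounded scales from the text:

```python
import numpy as np

def quantize(r, scale, zero_point):
    return int(np.round(r / scale)) + zero_point

# First form: shared quantization parameters, following (5).
S_x, Z_x = 0.0078, 128
S_y, Z_y = 0.0157, 128
q_x1, q_x2 = quantize(-0.3, S_x, Z_x), quantize(0.7, S_x, Z_x)    # 90, 218
q_y = int(np.round((S_x / S_y) * (q_x1 + q_x2 - 2 * Z_x))) + Z_y  # 154
print(S_y * (q_y - Z_y))                                          # ~0.408

# Second form: distinct quantization parameters, following (6).
S_a, Z_a = 0.0078, 128
S_b, Z_b = 0.0196, 0
S_c, Z_c = 0.0274, 36
q_a, q_b = quantize(-0.9, S_a, Z_a), quantize(3.9, S_b, Z_b)      # 13, 199
q_c = int(np.round((S_a / S_c) * (q_a - Z_a)
                   + (S_b / S_c) * (q_b - Z_b))) + Z_c            # 146
print(S_c * (q_c - Z_c))                                          # ~3.014
```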

B.3 Fixed Point Arithmetic

Even with the most careful rounding, fixed-point values represented with a scaling factor S may have an error of up to \(\pm 0.5\) in the stored integer, that is, \(\pm 0.5 S\) in the represented value. Therefore, smaller scaling factors generally produce more accurate results. On the other hand, a smaller scaling factor means a smaller range of values that can be stored in a given program variable. The maximum fixed-point value that can be stored in a variable is the largest integer value that can be stored in it, multiplied by the scaling factor, and similarly for the minimum value. For example, Table 8 gives the implied scaling factor S, the minimum and maximum representable values, and the accuracy \(\delta = S/2\) of the values that can be represented in 16-bit signed binary fixed-point format, depending on the number f of implied fraction bits.

Table 8 Fixed-point format with common signed scaling and bias format for an 8-bit mantissa

To convert a number from floating point to fixed point, one may divide it by the scaling factor S and round the result to the nearest integer. Care must be taken to ensure that the result fits in the destination variable or register. Depending on the scaling factor, the storage size, and the range of input numbers, the conversion may not entail any rounding. To convert a fixed-point number to floating point, in contrast, one may convert the integer to floating point and then multiply it by the scaling factor S. This conversion may entail rounding if the integer's absolute value is greater than \(2^{24}\) (for binary single-precision IEEE floating point) or \(2^{53}\) (for double precision). In addition, overflow or underflow may occur if \(\left|S\right|\) is very large or very small. However, most computers with binary arithmetic have fast bit-shift instructions that can multiply or divide an integer by any power of 2, in particular an arithmetic shift instruction. These instructions can be used to quickly change scaling factors that are powers of 2 while preserving the sign of the number.
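
As an illustration of these conversions, here is a small, generic Python sketch with a power-of-two scaling factor \(S = 2^{-f}\); the function names and the Q-format choice are our own, not code from the paper.

```python
def to_fixed(x, frac_bits=8, width=16):
    """Convert a float to a signed fixed-point integer with `frac_bits` implied
    fraction bits (scaling factor S = 2**-frac_bits), saturating to the width."""
    q = int(round(x * (1 << frac_bits)))            # divide by S and round to nearest
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, q))

def to_float(q, frac_bits=8):
    return q / (1 << frac_bits)                     # multiply by S = 2**-frac_bits

def rescale(q, from_frac_bits, to_frac_bits):
    """Change between power-of-two scaling factors with an arithmetic shift."""
    shift = to_frac_bits - from_frac_bits
    return q << shift if shift >= 0 else q >> -shift

q = to_fixed(3.14159)      # 804 in Q8.8, i.e. scale S = 1/256, accuracy S/2 ~= 0.002
print(to_float(q))         # 3.140625
```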

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nia, V.P., Sari, E., Courville, V. et al. Training Integer-Only Deep Recurrent Neural Networks. SN COMPUT. SCI. 4, 501 (2023). https://doi.org/10.1007/s42979-023-01920-z
