HALO: Hardware-Aware Learning to Optimize

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12354)


There has been explosive demand for bringing machine learning (ML) powered intelligence into numerous Internet-of-Things (IoT) devices. However, such intelligent functionality requires in-situ, continuous model adaptation to new data and environments, while on-device computing and energy resources are usually extremely constrained. Neither traditional hand-crafted optimizers (e.g., SGD, Adagrad, and Adam) nor existing meta optimizers are designed to meet these challenges: the former require tedious hyper-parameter tuning, while the latter are often costly due to the meta algorithms' own overhead. To this end, we propose hardware-aware learning to optimize (HALO), a practical meta optimizer dedicated to resource-efficient on-device adaptation. Our HALO optimizer features the following highlights: (1) faster adaptation speed (i.e., taking fewer data or iterations to reach a specified accuracy), achieved by a new regularizer that promotes empirical generalization; and (2) lower per-iteration complexity, thanks to an enforced stochastic structural sparsity regularizer. Furthermore, the optimizer itself is designed as a very lightweight RNN and thus incurs negligible overhead. Ablation studies and experiments on five datasets, six optimizees, and two state-of-the-art (SOTA) edge AI devices validate that HALO always achieves better accuracy (↑0.46%–↑20.28%) while greatly trimming the energy cost of adaptation (up to ↓60%), quantified on an IoT device or a SOTA simulator. Codes and pre-trained models are at
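To make the two highlights above concrete, the following is a minimal, self-contained sketch of a learned-optimizer loop in the spirit the abstract describes: a tiny coordinate-wise recurrent cell maps gradients to positive per-coordinate step sizes (one common learning-to-optimize parameterization), and whole coordinate groups whose updates are negligible are zeroed out, illustrating how structural sparsity can skip computation on-device. All weights, dimensions, and the group layout here are hypothetical stand-ins; the real HALO RNN is meta-trained as described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy optimizee: least-squares loss 0.5 * ||A x - b||^2 ---
A = rng.normal(size=(10, 5))
b = rng.normal(size=10)

def loss(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b)

# --- Tiny coordinate-wise recurrent optimizer cell.
# These weights are random stand-ins; in HALO they would be meta-trained.
HIDDEN = 8
W_g = rng.normal(scale=0.1, size=HIDDEN)            # gradient -> hidden
W_h = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))  # hidden -> hidden
w_out = rng.normal(scale=0.1, size=HIDDEN)          # hidden -> raw step size

def softplus(z):
    return np.log1p(np.exp(z))

def optimizer_step(x, h):
    g = grad(x)
    # Coordinate-wise recurrence: each parameter keeps its own hidden state,
    # while the cell weights are shared across coordinates.
    h = np.tanh(g[:, None] * W_g[None, :] + h @ W_h)
    # The cell emits a positive per-coordinate step size, so every coordinate
    # moves along its negative gradient direction.
    step = 0.02 * softplus(h @ w_out)
    update = -step * g
    # Structural (group) sparsity illustration: zero out whole coordinate
    # groups whose update norm is negligible, so their work can be skipped.
    for grp in np.split(np.arange(x.size), [2, 4]):  # illustrative grouping
        if np.linalg.norm(update[grp]) < 1e-3:
            update[grp] = 0.0
    return x + update, h

x = rng.normal(size=5)
h = np.zeros((5, HIDDEN))
losses = [loss(x)]
for _ in range(100):
    x, h = optimizer_step(x, h)
    losses.append(loss(x))

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the emitted step sizes are bounded and strictly positive, the update is a diagonally scaled gradient descent step, so the toy loss decreases monotonically here; in HALO the cell's behavior is instead shaped by meta-training with the generalization and sparsity regularizers.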


Keywords: On-device learning · Learning to optimize · Meta learning · Efficient training · Internet-of-Things



The work is supported by the National Science Foundation (NSF) through the Real-Time Machine Learning program (Award number: 1937592, 1937588).

Supplementary material

504446_1_En_29_MOESM1_ESM.pdf — Supplementary material 1 (PDF, 1.9 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Rice University, Houston, USA
  2. The University of Texas at Austin, Austin, USA
