Abstract
The growing popularity of Deep Learning (DL) in recent years has had a large environmental impact. Training models requires substantial computation and therefore substantial energy. Both the size of these models and the amount of data required to train them have grown exponentially, without a comparable improvement in performance. Recently, some model-centric approaches have been proposed to limit the environmental impact of AI. This paper complements them by proposing a data-centric “Green AI” approach that focuses on the data preparation phase of the DL pipeline. A general methodology, valid for any DL task, is proposed. It is based on analyzing data characteristics, mainly the data quality and data volume dimensions, and observing how they affect carbon emissions and performance across different models. Using this information, a human-in-the-loop (HITL) approach supports researchers in obtaining a modified and reduced version of a dataset that decreases the environmental impact of training while still achieving a specified performance goal. To demonstrate its validity, the methodology is applied to the time series classification task, and a prototype has been developed that shows carbon emissions of DL training can be reduced by up to 50%.
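The selection step described above, choosing a reduced dataset that still meets a performance goal, can be illustrated with a minimal sketch. It is not the authors' prototype: the `Trial` record, the `cheapest_config` helper, and all numbers are hypothetical and serve only to show the idea of picking the lowest-emission configuration among those that reach the goal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trial:
    """One hypothetical training run on a reduced copy of the dataset."""
    fraction: float      # share of the original training data that was kept
    accuracy: float      # measured model performance on held-out data
    emissions_kg: float  # measured CO2-equivalent emissions of the run


def cheapest_config(trials: list[Trial], goal: float) -> Optional[Trial]:
    """Return the lowest-emission trial that meets the performance goal."""
    feasible = [t for t in trials if t.accuracy >= goal]
    return min(feasible, key=lambda t: t.emissions_kg) if feasible else None


# Illustrative numbers only; they are not results from the paper.
trials = [
    Trial(fraction=1.00, accuracy=0.92, emissions_kg=0.40),
    Trial(fraction=0.50, accuracy=0.91, emissions_kg=0.21),
    Trial(fraction=0.25, accuracy=0.86, emissions_kg=0.11),
]

best = cheapest_config(trials, goal=0.90)
```

In a human-in-the-loop setting, a researcher would presumably inspect the feasible configurations rather than accept the minimum automatically; the helper returns `None` when no configuration reaches the goal, which signals that the goal or the data preparation needs revisiting.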
Acknowledgements
This research was supported by the EU Horizon Framework grant agreement 101070186 (TEADAL) and by the Spoke 1 “FutureHPC & BigData” of the Italian Research Center on High-Performance Computing, Big Data and Quantum Computing (ICSC) funded by MUR Missione 4 - Next Generation EU (NGEU).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Anselmo, M., Vitali, M. (2023). A Data-Centric Approach for Reducing Carbon Emissions in Deep Learning. In: Indulska, M., Reinhartz-Berger, I., Cetina, C., Pastor, O. (eds) Advanced Information Systems Engineering. CAiSE 2023. Lecture Notes in Computer Science, vol 13901. Springer, Cham. https://doi.org/10.1007/978-3-031-34560-9_8
DOI: https://doi.org/10.1007/978-3-031-34560-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34559-3
Online ISBN: 978-3-031-34560-9
eBook Packages: Computer Science, Computer Science (R0)