1 INTRODUCTION

While the global ESG agenda (Environment, Social, and Corporate Governance) is guided by agreements established between countries [1], the actual development of ESG principles occurs through corporate, research, and academic standards. Many companies have started to develop ESG strategies, setting up dedicated functions and departments, publishing annual sustainability reports, and providing additional funds for research, including research into digital technologies and AI.

Despite the growing influence of the ESG agenda, the problem of transparent and objective quantitative evaluation of ESG progress in the field of environmental protection remains. This is of particular importance for the IT industry, since about one percent of the world's electricity is consumed by cloud computing, and this share continues to grow [2]. Artificial intelligence (AI) and machine learning (ML), a major part of today's IT industry, are rapidly evolving technologies with massive potential for disruption. There are a number of ways in which AI and ML could mitigate environmental problems and human-induced impact. In particular, they could be used to generate and process large-scale interconnected data to study the Earth in greater detail and to predict environmental behavior under various scenarios [3]. This could improve our understanding of environmental processes and help us make more informed decisions. AI and ML could also be used to simulate the consequences of harmful activities, such as deforestation, soil erosion, flooding, and increased greenhouse gas concentrations in the atmosphere. Ultimately, these technologies hold great potential to improve our understanding and control of the environment.

A number of AI-based solutions are being developed to achieve carbon neutrality as part of the Green AI concept. The ultimate goal of these solutions is the reduction of greenhouse gas (GHG) emissions. AI can indeed help mitigate the effects of the climate crisis, for example, through smart grid design, the development of low-emission infrastructure, and climate change modelling [8]. However, it is also crucial to account for the CO2 emissions generated by AI itself as a result of training and applying AI models. AI is progressing towards ever larger models with increasing computational complexity and, thereby, growing electrical energy consumption and equivalent carbon emissions (eq. CO2). The ecological impact of AI is thus a major factor that needs to be accounted for among its eventual risks. For AI/ML models to be environmentally sustainable, they should be optimized not only for prediction accuracy but also for energy consumption and environmental impact. Therefore, tracking the ecological impact of AI is the first step towards Sustainable AI. A clear understanding of this impact motivates the data science community to search for optimal architectures consuming fewer computational resources. An explicit call to promote research on more computationally efficient algorithms has been made elsewhere [41].

To summarize the above, we present the concept of an AI-based GHG sequestrating cycle, which describes the relationship of AI with sustainability goals (Fig. 1). The demand from sustainability towards AI drives the development of models that are more optimized in terms of energy consumption, shaping the path we call “Towards Sustainable AI.” On the other hand, AI creates additional opportunities to achieve sustainability goals, and we suggest calling this path “Towards Green AI.” To clarify the role of the eco2AI library in this cycle, the right panel of Fig. 1 provides a scheme with both paths. First, eco2AI motivates optimization of the AI technology itself. Second, if AI is intended to sequestrate GHG, then the total effect should be evaluated with account for the eq. CO2 generated at least during training sessions (and, ideally, also during model application/inference). In this article we focus only on the path “Towards Sustainable AI” (see examples in Section 4).

Fig. 1. High-level schemes of AI-based GHG sequestrating. The left scheme corresponds to the AI-based GHG sequestrating cycle; the right scheme describes the role of eco2AI in this cycle.

Contribution. The contribution of our paper is threefold:

• First, we introduce eco2AI, an open-source Python library we developed for evaluating equivalent CO2 emissions during ML model training.

• Second, we define the role of eco2AI within the context of the AI-based GHG sequestrating cycle concept.

• Third, we describe practical cases where eco2AI has been used as an efficiency optimization tracker for the training of fusion models.

This paper is organized as follows. In Section 2 we review the existing solutions for CO2 assessment and describe how our library differs from them. Section 3 presents the calculation methodology, and Section 4 shows use cases of the library. Finally, in Section 5 we summarize our work. The appendix briefly describes the code usage.

2 RELATED WORK

In this section, we describe recent practices of CO2 emission evaluation for AI-based models. In what follows, we give a brief description of the existing open-source packages and provide a comparison summary.

2.1 Practice of AI Equivalent Carbon Emissions Tracking

Since the advent of DL models, their complexity has been increasing exponentially, with the number of learnable parameters doubling every 3–4 months since 2012 and exceeding a trillion in 2022. Among the most well known models are BERT-Large (Oct. 2018, \(3.4 \times 10^{8}\)), GPT-2 (2019, \(1.5 \times 10^{9}\)), T5 (Oct. 2019, \(1.1 \times 10^{10}\)), GPT-3 (2020, \(1.75 \times 10^{11}\)), Megatron-Turing (2022, \(5.30 \times 10^{11}\)), and Switch Transformer (2022, \(1.6 \times 10^{12}\)).

Data accumulation, labeling, storage, processing, and exploitation consume substantial resources over a model’s lifespan, from production to disposal. The global impact of such models is presented as a descriptive visual map built on Amazon’s infrastructure in [24]. Carbon emissions are only one of the footprints of this industry, but their efficient monitoring is important for compliance with emerging regulatory standards and laws as well as for self-regulation [20].

A large-scale study [41] quantified the approximate environmental costs of DL models widely used for NLP tasks. The examined architectures included Transformer, ELMo, BERT, NAS, and GPT-2. Total power consumption was evaluated as the combined GPU, CPU, and DRAM consumption multiplied by the data-center-specific power usage effectiveness (PUE) with a default value of 1. CPU and GPU consumption was sampled via the vendors’ specialized software interfaces: Intel Running Average Power Limit and NVIDIA System Management Interface, respectively. Energy was converted to carbon emissions as the product of total energy consumption and the carbon intensity of energy. The authors estimated the carbon footprint of training BERT (base) at about 652 kg, comparable to the per-passenger carbon footprint of a “New York ↔ San Francisco” flight.

The energy consumption and carbon footprint of the following NLP models were estimated in [30]: T5, Meena, GShard, Switch Transformer, and GPT-3. The key outcome was a set of opportunities to improve energy efficiency when training neural network models, such as sparsely activated DL; distillation techniques [22]; pruning, quantization, and efficient coding [19]; fine-tuning and transfer learning [9]; training large models in specific regions with cleaner energy; and the use of energy-optimized cloud data centers. The authors expect that the carbon footprint could be reduced by a factor of \(10^{2}\)–\(10^{3}\) if these suggestions are taken into account.

2.2 Review of Open-Source Emission Trackers

Several libraries have been developed to track the equivalent carbon footprint of AI. Here we focus on the most popular open-source libraries. They all share a common goal: to monitor CO2 emissions during model training (see Table 1).

Table 1. Features of open-source trackers for equivalent CO2 emission evaluation of machine learning models

Cloud Carbon Footprint [5] is an application that estimates the energy use and carbon emissions of public cloud workloads. It is intended to connect to various cloud service providers and provides estimates of both energy consumption and carbon emissions for all types of cloud usage, including embodied emissions from hardware production, with the option to drill down into emissions by cloud provider, account, service, and time period. It provides recommendations for AWS and Google Cloud on saving money and minimizing carbon emissions, and presents forecast cost savings and actual outcomes in terms of trees planted. For hyperscale data centers, it measures consumption at the service level using actual server utilization rather than an average. It offers a number of approaches for incorporating energy and carbon metrics into existing consumption and billing data sets, data pipelines, and monitoring systems.

CodeCarbon [6] is a Python package for tracking the carbon emissions produced by various computer programs, from simple algorithms to deep neural networks. By taking into account the computing infrastructure, location, usage, and running time, CodeCarbon provides an estimate of how much CO2 has been produced. It also provides comparisons with emissions from common transportation types.

Carbontracker [4] is a tool for tracking and predicting the energy consumption and carbon footprint of training DL models. The package enables a proactive, intervention-driven approach to reducing carbon emissions based on these predictions: model training can be stopped when the predicted environmental cost exceeds a chosen threshold. The library supports a variety of environments and platforms, such as clusters, desktop computers, and Google Colab notebooks, allowing a plug-and-play experience [2].

Experiment impact tracker [16] is a framework that provides information on the energy, computational, and carbon impacts of ML models. Its features include extraction of CPU and GPU hardware information, recording of the experiment start and end times, accounting for the energy grid region where the experiment is run (based on the IP address) and the average carbon intensity in that region, and tracking of memory usage and the real-time CPU frequency (in hertz) [20].

Green Algorithms [17] is an online tool that enables the user to estimate and report the carbon footprint of computation. It runs independently of the computational processes and does not interfere with existing code, while accounting for a range of CPUs, GPUs, cloud computing, local servers, and desktop computers [26].

Tracarbon [42] is a Python library that tracks the energy consumption of a device and calculates the resulting carbon emissions. It detects the location and the device model automatically and can be used as a command-line interface (CLI) with predefined metrics or with user metrics calculated via the API (application programming interface).

Similar to the above-described libraries, eco2AI monitors CO2 emissions during model training; however, it takes into account only those system processes that are related directly to model training (to avoid overestimation). It also relies on an extensive database of regional emission coefficients (365 territorial entities) and of CPU devices (3279 models).

3 METHODOLOGY

This section covers our approach to calculating electric energy consumption, extracting the emission intensity coefficient, and converting the result to equivalent CO2 emissions. Each part is described below.

3.1 Electric Energy Consumption

The energy consumption of a system can be measured in joules (J) or kilowatt-hours (kW h), a unit of energy equal to one kilowatt of power sustained for one hour. The problem is to evaluate the energy contribution of each hardware unit [20]. We focus on GPU, CPU, and RAM energy evaluation, as these have the most direct and significant impact on ML processes. When examining CPU and GPU energy consumption, we do not track the effect of terminating processes, as their impact on the overall power consumption is relatively small. We also do not account for data storage (SSD, HDD) energy consumption, since no tracked processes run directly on these devices.

GPU. The eco2AI library is able to detect NVIDIA devices. A Python interface for GPU management and monitoring functions is provided by the Pynvml library, a wrapper around the NVIDIA Management Library that detects most NVIDIA GPU devices and tracks the number of active devices, their names, memory usage, temperatures, power limits, and power consumption. Correct functionality requires a CUDA installation on the computing machine. The total energy consumption of all active GPU devices \(E_{\text{GPU}}\) (kW h) is obtained by integrating their power consumption over the loading time, \(E_{\text{GPU}} = \int_0^T P_{\text{GPU}}(t)\,dt\), where \(P_{\text{GPU}}\) is the total power consumption of all GPU devices determined by Pynvml (kW) and \(T\) is the GPU loading time (h). If the tracker does not detect any GPU device, the GPU power consumption is set to zero.
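As an illustration of the scheme above, the following minimal sketch integrates the power draw reported by NVML over all visible NVIDIA GPUs. The polling loop, sampling interval, and function name are illustrative assumptions and do not reproduce eco2AI's internal implementation.

```python
import time
import pynvml

def gpu_energy_kwh(duration_s=60.0, interval_s=1.0):
    """Sketch: approximate E_GPU = integral of P_GPU(t) dt by periodic sampling."""
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    energy_kwh = 0.0
    t_end = time.time() + duration_s
    while time.time() < t_end:
        # nvmlDeviceGetPowerUsage returns the current draw in milliwatts
        total_power_kw = sum(pynvml.nvmlDeviceGetPowerUsage(h) for h in handles) / 1e6
        energy_kwh += total_power_kw * (interval_s / 3600.0)  # rectangle rule
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return energy_kwh
```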

CPU. The Python modules os and psutil are used to monitor CPU energy consumption. To avoid overestimation, eco2AI takes into account only the running processes related to model training. The tracker takes the CPU utilization percentage of these processes and divides it by the number of CPU cores to obtain the CPU utilization fraction. We have collected the most comprehensive database of this kind, covering 3279 unique Intel and AMD processor models. For each CPU model it stores the thermal design power (TDP), which is equivalent to the power consumption under long-term loading. The total energy consumption of all active CPU devices \(E_{\text{CPU}}\) (kW h) is calculated as the product of the TDP and the integrated processor loading, \(E_{\text{CPU}} = TDP\int_0^T W_{\text{CPU}}(t)\,dt\), where \(TDP\) is the model-specific CPU power consumption under long-term loading (kW) and \(W_{\text{CPU}}\) is the total loading of all processors (fraction). If the tracker cannot match any CPU device, the CPU power consumption is set to a constant value of 100 W [27].
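A minimal sketch of the same idea for the CPU is given below: the utilization share of the tracked training processes is multiplied by the processor's TDP. The function signature is an illustration; in eco2AI the TDP value comes from the processor database rather than being passed in by the user.

```python
import time
import psutil

def cpu_energy_kwh(pids, tdp_w, duration_s=60.0, interval_s=1.0):
    """Sketch: approximate E_CPU = TDP * integral of W_CPU(t) dt for the given processes."""
    procs = [psutil.Process(pid) for pid in pids]
    n_cores = psutil.cpu_count()
    for p in procs:
        p.cpu_percent(None)  # first call primes psutil's per-process counters
    energy_kwh = 0.0
    t_end = time.time() + duration_s
    while time.time() < t_end:
        time.sleep(interval_s)
        # per-process cpu_percent is relative to one core, hence division by the core count
        load = sum(p.cpu_percent(None) for p in procs) / (100.0 * n_cores)
        energy_kwh += (tdp_w / 1000.0) * load * (interval_s / 3600.0)
    return energy_kwh
```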

RAM. Dynamic random access memory devices are an important source of energy consumption in modern computing systems, especially when a significant amount of data has to be allocated or processed. However, accounting for RAM energy consumption is problematic because its power draw depends strongly on the type of operation (data being read, written, or maintained). In eco2AI, RAM power consumption is considered proportional to the amount of memory allocated by the currently running processes and is calculated as \(E_{\text{RAM}} = 0.375\int_0^T M_{\text{RAM}}(t)\,dt\), where \(E_{\text{RAM}}\) is the energy consumption of all allocated RAM (kW h), \(M_{\text{RAM}}\) is the allocated memory (GB) measured via psutil, and 0.375 W/GB is the estimated specific power consumption of DDR3 and DDR4 modules [27].
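The RAM term can be sketched in the same way, integrating 0.375 W per allocated gigabyte over the tracked processes; again, the sampling loop below is illustrative rather than eco2AI's actual code.

```python
import time
import psutil

RAM_W_PER_GB = 0.375  # specific DDR3/DDR4 power consumption used in the formula above

def ram_energy_kwh(pids, duration_s=60.0, interval_s=1.0):
    """Sketch: approximate E_RAM = 0.375 * integral of M_RAM(t) dt for the given processes."""
    procs = [psutil.Process(pid) for pid in pids]
    energy_kwh = 0.0
    t_end = time.time() + duration_s
    while time.time() < t_end:
        allocated_gb = sum(p.memory_info().rss for p in procs) / 1024 ** 3
        energy_kwh += RAM_W_PER_GB * allocated_gb / 1000.0 * (interval_s / 3600.0)
        time.sleep(interval_s)
    return energy_kwh
```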

3.2 Emission Intensity

Emissions differ among countries due to various factors: geographical location, the type of fuel used, and the level of economic and technological development. To account for this regional dependence, we use the emission intensity coefficient \(\gamma\), defined as the weight in kilograms of CO2 emitted per megawatt-hour (MW h) of electricity generated by the power sector of a particular region. The emission intensity coefficient is determined by the regional energy mix, \(\gamma = \sum\nolimits_i f_i e_i\), where \(i\) indexes the energy sources (e.g., coal, renewables, petroleum, gas), \(f_i\) is the fraction of the \(i\)th energy source for the specific region, and \(e_i\) is its emission intensity coefficient. Therefore, the higher the fraction of renewable energy, the lower the total emission intensity coefficient; conversely, a high fraction of hydrocarbon energy resources implies a higher emission intensity coefficient. Thus, the carbon emission intensity varies significantly from region to region (see Table 2).
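As a worked example of the formula above, the snippet below computes \(\gamma\) for a hypothetical regional energy mix; both the mix fractions and the per-source coefficients are illustrative numbers, not entries from the eco2AI database.

```python
# gamma = sum_i f_i * e_i for an illustrative region
energy_mix = {            # f_i: fraction of each source in the regional grid (assumed)
    "coal": 0.30,
    "gas": 0.40,
    "renewables": 0.25,
    "petroleum": 0.05,
}
emission_per_source = {   # e_i: kg CO2 per MW h for each source (assumed values)
    "coal": 995.0,
    "gas": 465.0,
    "renewables": 0.0,
    "petroleum": 816.0,
}
gamma = sum(f * emission_per_source[src] for src, f in energy_mix.items())
print(f"regional emission intensity: {gamma:.1f} kg CO2 / MW h")
```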

Table 2. Emission intensity coefficients for selected regions

The eco2AI library includes a continuously maintained database of emission intensity coefficients for 365 regions, based on publicly available data for 209 countries [11] as well as regional data for Australia (Emissions factors sources [13], Science and Resources [39]), Canada (Emissions factors sources [13], UNFCCC [43]), Russia (Rosstat [34], EMISS [12], Minprirody (Russia) [28]), and the USA (Emissions factors sources [13], USA EPA [44]). Currently, this is the largest such database among the reviewed trackers, allowing more accurate emission estimates.

The database contains the following fields: country name, ISO-Alpha-2 code, ISO-Alpha-3 code, UN M49 code, and the emission coefficient value. As an example, the data for selected regions are presented in Table 2. The eco2AI library automatically determines the country and region of the user by IP address and chooses the corresponding emission intensity coefficient. If the coefficient cannot be determined, it is set to 436.5 kg/(MW h), the global average [11]. The region, country, and emission intensity coefficient can also be specified manually in eco2AI.

3.3 Equivalent Carbon Emissions

The total equivalent carbon footprint (\(CF\)) generated during AI model training is defined as the total energy consumption of the CPU, GPU, and RAM multiplied by the emission intensity coefficient \(\gamma\) (kg/(kW h)) and the \(PUE\) coefficient:

$$CF = \gamma \cdot PUE \cdot (E_{\text{CPU}} + E_{\text{GPU}} + E_{\text{RAM}}).$$

Here, \(PUE\) is the power usage effectiveness of the data center when training is performed in the cloud. PUE is an optional parameter with a default value of 1 and is specified manually in the eco2AI library.
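A minimal sketch of this final conversion is given below; note that \(\gamma\), usually tabulated in kg/(MW h), is divided by 1000 to obtain kg/(kW h) before being combined with the energy terms. The function and variable names are illustrative, not part of the eco2AI API.

```python
def carbon_footprint_kg(e_cpu_kwh, e_gpu_kwh, e_ram_kwh, gamma_kg_per_mwh, pue=1.0):
    """Sketch of CF = gamma * PUE * (E_CPU + E_GPU + E_RAM)."""
    gamma_kg_per_kwh = gamma_kg_per_mwh / 1000.0
    return gamma_kg_per_kwh * pue * (e_cpu_kwh + e_gpu_kwh + e_ram_kwh)

# example with the global average coefficient of 436.5 kg/(MW h)
cf = carbon_footprint_kg(e_cpu_kwh=1.2, e_gpu_kwh=8.5, e_ram_kwh=0.3,
                         gamma_kg_per_mwh=436.5)
print(f"{cf:.2f} kg eq. CO2")
```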

4 EXPERIMENTS

In this section, we present experiments in tracking equivalent CO2 emissions with eco2AI while training the Malevich (ruDALL-E XL 1.3B) and Kandinsky (ruDALL-E XXL 12B) models. Malevich and Kandinsky are large multimodal generative models [18] with 1.3 and 12 billion parameters, respectively, capable of generating arbitrary images from a textual description.

We present the results of fine-tuning Malevich and Kandinsky on the Emojis dataset [40] and of training Malevich with an optimized version of the GELU activation function [21].

4.1 Fine-Tuning of Multimodal Models

In this section, we present eco2AI use cases for monitoring the characteristics (e.g., CO2 in kg; power in kW h) of fine-tuning the Malevich and Kandinsky models on the Emojis dataset. Malevich and Kandinsky are multimodal pre-trained transformers that learn the conditional distribution of images given text. More precisely, they autoregressively model text and image tokens as a single stream of data (see, e.g., DALL-E [33]). These models are transformer decoders [45] with 24 and 64 layers, 16 and 60 attention heads, hidden dimensions of 2048 and 3840, respectively, and the standard GELU nonlinearity.

Both Malevich and Kandinsky work with 128 text tokens, which are generated from the text input using the YTTM tokenizer (YouTokenToMe), and 1024 image tokens, which are obtained by encoding the input image with the encoder part of the Sber-VQGAN generative adversarial network (sber-vq-gan), a pretrained VQGAN [15] with Gumbel-Softmax relaxation [25].

The Emojis dataset (russian-emoji) used for fine-tuning contains 2749 unique emoji icons and 1611 unique texts collected by web scraping (the difference in counts arises because some sets contain emojis that differ only in color and because some elements are homonyms).

The Malevich and Kandinsky models were trained in fp16 and fp32 precision, respectively. The 8-bit Adam optimizer [7] was used in both experiments; this implementation reduces the amount of GPU memory required to store optimizer statistics. The OneCycle learning rate scheduler was chosen with the following parameters: start learning rate (lr) \(4 \times 10^{-7}\), maximum lr \(10^{-5}\), final lr \(2 \times 10^{-8}\). The models were fine-tuned for 40 epochs with a warm-up of 0.3, a batch size of 4 for Malevich and 12 for Kandinsky, a large image loss coefficient of 1000, and frozen feed-forward and attention layers.
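For readers who want to reproduce a comparable schedule, the sketch below expresses the reported learning rates with PyTorch's OneCycleLR. Mapping the start/maximum/final learning rates onto div_factor and final_div_factor, reading the value 0.3 as the warm-up fraction (pct_start), and using plain Adam instead of the 8-bit Adam implementation [7] are our assumptions for illustration.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder module standing in for Malevich/Kandinsky
optimizer = torch.optim.Adam(model.parameters(), lr=4e-7)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-5,
    total_steps=10_000,      # in practice: epochs * steps_per_epoch
    pct_start=0.3,           # warm-up fraction
    div_factor=25,           # start lr = max_lr / 25 = 4e-7
    final_div_factor=20,     # final lr = start lr / 20 = 2e-8
)

for step in range(10_000):
    optimizer.step()         # the actual forward/backward pass is omitted here
    scheduler.step()
```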

The DeepSpeed library with ZeRO-3 optimization was used for distributed training of the Kandinsky model. The source code used for fine-tuning Malevich is available on Kaggle (emojich ruDALL-E). The Malevich and Kandinsky models were trained on 1 and 8 Tesla A100 GPUs (80 GB), respectively.

We named the fine-tuned versions of Malevich and Kandinsky Emojich XL and Emojich XXL, respectively. We compare image generation by Malevich vs Emojich XL and by Kandinsky vs Emojich XXL on several text inputs (see Figs. 2 and 3) in order to visually assess the quality of fine-tuning (how closely the style of the generated images matches the style of emojis). Image generation starts with a text prompt that describes the desired content; when the tokenized text is fed to Emojich, the model generates the remaining image tokens autoregressively.

Fig. 2. Images generated by Malevich (top) vs Emojich XL (bottom) for the text input “Tree in the form of a neuron.”

Fig. 3. Images generated by Kandinsky (top) vs Emojich XXL (bottom) for the text input “Green artificial intelligence.”

Each image token is selected sequentially from the predicted multinomial probability distribution over the image latent vectors using nucleus top-p and top-k sampling with temperature [23] as the decoding strategy. The image is rendered from the generated sequence of latent vectors by the decoder part of the Sber-VQGAN.

All examples below are generated automatically with the following hyperparameters for Malevich (and Emojich XL) and Kandinsky (and Emojich XXL), respectively: batch sizes 16 and 6, top-k 2048 and 768, top-p 0.995 and 0.99, temperature 1.0, and 1 Tesla A100 GPU.
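The decoding strategy described above can be sketched as follows; the helper below is an illustrative, single-token implementation of top-k/top-p sampling with temperature and is not taken from the Malevich/Kandinsky codebase.

```python
import torch

def sample_image_token(logits, top_k=2048, top_p=0.995, temperature=1.0):
    """Sketch: draw one image token from a top-k / top-p filtered distribution."""
    logits = logits / temperature
    # keep only the top_k most probable latent vectors
    top_vals, top_idx = torch.topk(logits, k=min(top_k, logits.size(-1)))
    probs = torch.softmax(top_vals, dim=-1)    # already sorted in descending order
    cumulative = torch.cumsum(probs, dim=-1)
    # nucleus filtering: drop the tail once the cumulative probability exceeds top_p
    keep = cumulative - probs < top_p          # always keeps at least one token
    probs = torch.where(keep, probs, torch.zeros_like(probs))
    probs = probs / probs.sum()
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx[choice]

# usage: next_token = sample_image_token(logits_for_last_position)
```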

A summary of the fine-tuning parameters, energy consumption, and eq. CO2 is given in Table 3. One can note that fine-tuning Kandinsky consumes more than 17 times more energy than fine-tuning Malevich.

Table 3. Carbon emissions and power consumption of the fine-tuning of Malevich and Kandinsky models

Thus, the eco2AI library makes it straightforward to monitor the energy consumption of training (and fine-tuning) large models not only on a single GPU but also on multiple GPUs, which is essential when using optimization libraries for distributed training such as DeepSpeed.

4.2 Pre-training of Multimodal Models

Training large models like Malevich is highly resource demanding. In this section we give an example of improving a model’s energy efficiency through low-precision computing, using the few-bit GELU activation function as an example. Few-bit GELU [29] is a variation of the GELU [21] activation function that preserves model gradients with few-bit resolution, thus allocating less GPU memory and consuming fewer computational resources (see Fig. 4). More precisely, we use the eco2AI library to compare training Malevich with the regular GELU against versions of Malevich with 4-bit, 3-bit, 2-bit, and 1-bit GELU.
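To convey the idea behind the few-bit GELU, the sketch below stores only a small integer code per element for the backward pass and replaces the GELU derivative with a piecewise-constant approximation. The 2-bit boundaries and derivative levels are illustrative placeholders, not the optimized values of [29], and the example keeps everything on the CPU for brevity.

```python
import torch
import torch.nn.functional as F

class FewBitGELU(torch.autograd.Function):
    """Sketch: GELU whose backward pass uses a piecewise-constant derivative."""
    boundaries = torch.tensor([-1.0, 0.0, 1.0])       # 3 boundaries -> 4 bins (2 bits)
    levels = torch.tensor([-0.05, 0.10, 0.90, 1.05])  # assumed derivative value per bin

    @staticmethod
    def forward(ctx, x):
        code = torch.bucketize(x, FewBitGELU.boundaries).to(torch.uint8)
        ctx.save_for_backward(code)                   # few-bit code instead of the full input
        return F.gelu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (code,) = ctx.saved_tensors
        approx_deriv = FewBitGELU.levels[code.long()]
        return grad_output * approx_deriv

x = torch.randn(3, requires_grad=True)
FewBitGELU.apply(x).sum().backward()  # x.grad now holds the approximated gradient
```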

Fig. 4. Optimized 4-bit piecewise-constant approximation of the derivative of the GELU activation function.

We used the same optimizer, scheduler, and training strategy as in the fine-tuning experiments. To ensure reproducibility, we ran each experiment 5 times with different random seeds. The training dataset consisted of 300 000 samples; each sample was passed through the model only once with a batch size of 4. The validation dataset consisted of 100 000 samples. The eco2AI library was used to track the carbon footprint during training in real time.

As can be seen in Fig. 5a, the validation loss of Malevich with 4-bit or 3-bit GELU is almost the same as that of Malevich with regular GELU (a 0.06% increase), whereas 2-bit and 1-bit GELU show an increase of about 0.42% in validation loss. At the same time, 4-bit and 3-bit GELU are about 15 and 17% more efficient, respectively, than the original GELU and accumulate less CO2 at the same training step (Fig. 5b).

Fig. 5. Comparison of GELU and few-bit GELU activation functions integrated into the Malevich model: (a) validation loss at every step of pre-training (the box plot in the inset indicates the deviation of the validation loss of each model at step 300 000); (b) accumulated CO2 at every step of pre-training (the box plot in the inset indicates the deviation of the accumulated CO2 of each model at step 300 000); (c) accumulated CO2 for the achieved validation loss of each model (the inset zooms in on the region of peak accumulated CO2 to stress the difference between models).

The use of 1-bit GELU resulted in an additional saving of only 0.05% CO2. The performance of the models is summarized in Fig. 5c and Table 4. The 3-bit GELU seems to provide model performance close to the original GELU, while consuming 17% less power and hence, producing less equivalent CO2 emissions.

Table 4. Carbon emissions and power consumption of pre-training the Malevich model on the 300 000-sample dataset for 1 epoch (NVIDIA A100 GPU, AMD EPYC 7742 64-core)

Thus, the eco2AI library can monitor the power consumption and carbon footprint of model training in real time and helps to implement and evaluate various memory and power optimization algorithms (such as quantization of activation function gradients).

5 CONCLUSIONS

In spite of the great potential of AI for solving environmental issues, AI itself can be a source of indirect carbon footprint. To help the AI community understand the environmental impact of AI models during training and inference and to systematically monitor equivalent carbon emissions, in this paper we introduced the eco2AI tool. eco2AI is an open-source library capable of tracking equivalent carbon emissions while training or running inference with Python-based AI models, accounting for the energy consumption of CPU, GPU, and RAM devices. In eco2AI we focused on the accuracy of energy consumption tracking and correct regional CO2 accounting, achieved through precise measurement of process loading and extensive databases of regional emission coefficients and CPU devices.

We presented examples of using eco2AI for tracking the fine-tuning of the large text2image models Malevich and Kandinsky, as well as for optimizing the GELU activation function integrated into the Malevich model. For example, using eco2AI we demonstrated that the 3-bit GELU decreases equivalent CO2 emissions by about 17%. We expect that eco2AI will help the ML community move towards Green and Sustainable AI within the presented concept of the AI-based GHG sequestrating cycle.