Abstract
With machine learning (ML) becoming a popular tool across all domains, practitioners are in dire need of comprehensive reporting on the state-of-the-art. Benchmarks and open databases provide helpful insights for many tasks; however, they suffer from several shortcomings: Firstly, they focus overly on prediction quality, which is problematic considering the demand for more sustainability in ML. Depending on the use case at hand, interested users might also face tight resource constraints and thus should be allowed to interact with reporting frameworks in order to prioritize certain reported characteristics. Furthermore, as some practitioners might not yet be well-skilled in ML, it is important to convey information on a more abstract, comprehensible level. Usability and extendability are key for moving with the state-of-the-art, and in order to be trustworthy, frameworks should explicitly address reproducibility. In this work, we analyze established reporting systems in consideration of the aforementioned issues. Afterwards, we propose STREP, our novel framework that aims at overcoming these shortcomings and paves the way towards more sustainable and trustworthy reporting. We use STREP's (publicly available) implementation to investigate various existing report databases. Our experimental results unveil the need for making reporting more resource-aware and demonstrate our framework's capability to overcome current reporting limitations. With our work, we want to initiate a paradigm shift in reporting and help make ML advances more considerate of sustainability and trustworthiness.
1 Introduction
The performance and usability of today's machine learning (ML) methods make them ubiquitous for business use across all sectors (Fischer et al. 2023). Software developers who realize these use cases, however, are often not (yet) skilled in ML engineering and need to acquire large amounts of knowledge about this technology in a short time. The same can be said for ML experts, who might be well-versed in certain domains, but every now and then still need to familiarize themselves with new fields that they have not worked in yet. The reporting and communication of state-of-the-art (SOTA) results plays a highly important role for both types of practitioners. As reading scientific papers in depth is hardly feasible for swiftly learning about ML intricacies, several approaches to higher-level reporting were established. They summarize the key properties of individual models (Arnold et al. 2019; Mitchell et al. 2019; Morik et al. 2022), provide overviews for specific benchmarks (Croce et al. 2020; Godahewa et al. 2021), or offer open databases for reported properties across various ML domains (Vanschoren et al. 2014; Stojnic et al. 2018). While the importance of documenting predictive quality is obvious, recent calls for more sustainability (van Wynsberghe 2021) and trustworthiness (Chatila et al. 2021) also need to be explicitly considered by reporting frameworks.
While the currently established reporting already serves practitioners with helpful overviews, we see dire need for improvement: (1) Information on computing setup and resource usage is largely missing from established reports. To make ML research and deployment more resource-efficient (Fischer et al. 2022) and sustainable (Kar et al. 2022), frameworks should explicitly inform on such aspects and provide means for comparing methods across different computing platforms. (2) Except for sorting results based on single properties, most reports, unfortunately, cannot be interactively customized. This is problematic because interested practitioners often bring their application-specific priorities (e.g., due to resource constraints (Marwedel and Morik 2022)). In addition, means for interaction were shown to benefit the understandability, and thus trustworthiness, of systems (Beckh et al. 2023). (3) Reporting frameworks are usually designed by and for experts in the field and are of only limited benefit to target audiences who are not well-skilled in ML fundamentals. To address these stakeholders, a more abstract level of reporting is necessary, for example in the form of labels (Morik et al. 2022). (4) Some of the established systems have shortcomings in terms of usability and extendability, which are key factors for the acceptance and long-term feasibility of any framework. (5) Even though reproducibility of reported results (Hutson 2018) is of utmost importance for constructing trustworthy systems (Avin et al. 2021), current frameworks do not address the matter sufficiently.
With this work, we take steps towards better reporting that explicitly considers aspects of sustainability and trustworthiness. In short, we contribute to the SOTA by
1. discussing current reporting approaches and their drawbacks (Section 2),
2. proposing STREP, a novel framework that addresses these flaws and allows for more (s)ustainable and (t)rustworthy (rep)orting (Section 3),
3. offering a readily-usable implementation of STREP, which is used for
4. exploring the resource-awareness of established report databases (Section 4).
Our proposed framework is schematically displayed in Fig. 1, and its practical implementation and results are publicly available at https://www.github.com/raphischer/strep. To address the identified flaws, we later formulate key questions that future reporting frameworks should consider. STREP manifests possible answers – it leverages index scaling to compare any given benchmark in a unified and resource-aware way, offers means for interactively customizing reports to align with users' priorities, and addresses diverse target audiences by adapting the idea of high-level labeling (Morik et al. 2022). Our experimental investigations comprise empirical reports from thousands of ML methods, hundreds of data sets, and several computing environments. In some cases, we successfully identify intricate performance trade-offs (Fischer and Saadallah 2023), while in others, we find clear evidence that reports of resource usage are under-represented. We deem our work and implementation highly extensible, invite fellow researchers to explore their own reported properties with our software, and hope that our work becomes a milestone towards more sustainable and trustworthy reporting in ML.
2 Current state of ML reporting
Reporting plays a vital role in bridging communication gaps between ML experts and less informed practitioners (Piorkowski et al. 2021). The EU AI Act (EU AI HLEG 2020) was established to ensure that ML is used in trustworthy (Chatila et al. 2021; Avin et al. 2021), accountable (Hauer et al. 2023), and responsible (Baum et al. 2022; Dignum 2019) ways, and demands that such systems be designed to be safe, transparent, traceable, non-discriminatory, and environmentally friendly (European Parliament 2023). The last aspect is especially noteworthy, as modern ML can act as a tool for making the world more sustainable (van Wynsberghe 2021; Kar et al. 2022), but has also been shown to significantly impact our climate (Strubell et al. 2020; Bender et al. 2021; Patterson et al. 2021). Many works have proposed to investigate resource efficiency explicitly (Fischer et al. 2022; Lacoste et al. 2019), as resource consumption was found to be generally under-reported (Schwartz et al. 2020; Castaño et al. 2023). The works above underline that the current paradigm of ML reporting needs further improvement. We start our contribution to the cause by characterizing the present state – our assessment is summarized in Table 1.
Our literature review found that ML reporting generally takes place on one of three levels: The first informs on properties of an individual method or model, which, in the most naive case, can be described in a scientific paper. To allow for faster comparison and understanding, fact sheets (Arnold et al. 2019) and model cards (Mitchell et al. 2019) were proposed as higher-level summaries. This idea is, for example, successfully implemented in the Hugging Face repository (Jain 2022). As these representations are still too technical to be understood by non-experts, the idea of establishing care labels (Morik et al. 2022), a more abstract and comprehensible form of communication, was put forward. In the context of sustainability, this idea was adapted for reporting on ML resource efficiency more comprehensibly (Fischer et al. 2022) (in analogy to the EU energy labels (European Commission 2019)). A central drawback of frameworks for individual reporting lies in their non-interactiveness – their results cannot be customized to users' interests. Consider a practitioner wanting to learn about a model's suitability for their specific use case and priorities, which might, for example, be subject to certain resource constraints. In this case, the option of interactively controlling the importance of single model properties would help to align the framework's output with their needs. Interacting with a system has also been shown to increase users' understanding and trust (Beckh et al. 2023), which are desirable attributes for reporting frameworks. In addition, most of the aforementioned works (Arnold et al. 2019; Mitchell et al. 2019) do not offer any automation for generating reports, which strongly impacts the usability for experts wanting to report novel results.
Benchmarks form the second level of reporting and summarize many individual reports to give an overview of specific ML application domains. These days, competitive leaderboards allow for comparing the performance of SOTA methods in domains such as robustness testing (Croce et al. 2020), time series forecasting (Godahewa et al. 2021), language understanding (Srivastava et al. 2022; Wang et al. 2019), and commonsense reasoning (Sakaguchi et al. 2021). Several works assembled such benchmarks to investigate the performance of different methods across multiple data sets (Demšar 2006; Ismail-Fawaz et al. 2023). Cross-domain databases are located on the third level and take the idea of benchmarks a step further, basically comprising different leaderboards across a wide range of domains. Thanks to their open nature and direct coding interfaces, representatives like Papers With Code (Stojnic et al. 2018) and OpenML (Vanschoren et al. 2014) became highly popular. As they offer direct access to data, methods, and training configurations (Feurer et al. 2021), these platforms are highly beneficial for new practitioners. Model and data set repositories (Jain 2022) as well as platforms for performing ML experiments in an accelerated and unified way (Zaharia et al. 2018; Demšar et al. 2013; Mierswa et al. 2006) help in assembling such databases and are sometimes directly integrated via interfaces. A recent investigation suggests that promoting scientific papers and code via such platforms (Stojnic et al. 2018) helps in distributing research and achieving more citations (Kang et al. 2023).
As opposed to individual reporting, these platforms offer only limited means for interaction: users can merely sort results based on their preference (Croce et al. 2020; Vanschoren et al. 2014) or configure custom plots (Vanschoren et al. 2014; Stojnic et al. 2018). More advanced mechanisms, like controlling the impact of different properties on the overall method rating, are unfortunately missing. The comprehensibility, and thus the benefit for non-experts, is questionable, as fundamental information on metrics and methods is only partially provided and high-level representations of individual results (e.g., via labels (Morik et al. 2022)) are not offered. As anyone can easily edit open databases, their usability is slightly impacted by the inconsistent reporting of properties. As a small teaser for Section 4, we found many leaderboards to be only sparsely populated and specific properties to be redundantly duplicated due to different spellings (e.g., "params" and "parameters"). Another big drawback of all mentioned reporting frameworks lies in their neglect of sustainability. Reporting on computing environment or consumed resources is only optional and not specifically encouraged, and we later show further evidence that, as a result, many reported benchmarks are overly focused on predictive capabilities (Schwartz et al. 2020). Lastly, we want to explicitly mention the problem of reproducibility (Hutson 2018; Pineau et al. 2021), which is of utmost importance in the context of trustworthiness. Established frameworks support linking code (Stojnic et al. 2018) and offer interfaces to selected ML libraries (Vanschoren et al. 2014), but ultimately, they do not test whether reported results can actually be reproduced. Even with available code, information on the utilized hardware and software setup is usually not provided – we argue that SOTA reporting should explicitly demand such information.
Table 1 lists our assessment of current reporting and its drawbacks at a glance. We want to stress that we consider all established works to be highly important and successful contributions that we want to build on. In the following section, we present STREP, our own take on sustainable and trustworthy reporting. In comparison to the discussed approaches, we specifically designed our framework to be more sustainable, since it offers a holistic approach for reporting any method's resource trade-offs in its respective computational environment; more interactive, since users can align the impact of each property on the overall assessment with their priorities; and more comprehensible, since it directly offers means for generating high-level labels. As motivated earlier, the latter two aspects are directly linked to the trustworthiness of any system. In terms of usability and reproducibility, we are aware of their importance but do not push improvements with this work. We still deem STREP highly usable – it can easily be applied to investigate any given database and generate different report representations automatically. However, it tackles neither the aforementioned inconsistencies in open databases nor the limited reproducibility, as both issues boil down to how well-organized and well-documented the originally reported results are.
3 STREP - Sustainable and Trustworthy Reporting
To make reporting more sustainable and trustworthy, we devised central questions that address the identified flaws and need to be explicitly considered: (Q1) How can reported method properties be characterized in a resource-aware way? (Q2) How can the practical computing environment be taken into account? (Q3) What means for interactive investigation can be offered? (Q4) How can users at different levels of ML understanding be informed? This section is structured accordingly, and our proposed STREP framework implements possible answers to Q1–Q4.
3.1 Characterizing ML properties
Good reporting requires first thoroughly characterizing the nature of ML experiments and the properties that methods may exhibit. While other works have addressed this challenge (Fischer et al. 2022; Morik et al. 2022), we aligned the characterization in STREP with established terminology (Stojnic et al. 2018), for which thousands of reported evaluations can already be found online.
Definition 1
Any ML evaluation is defined by the tuple (d, t, m), which represents: (i) a data set d, (ii) a task type t, i.e., an abstract action to be performed on d, and (iii) an ML method m, that can practically perform this action. The theoretical space of all possible evaluations (d, t, m) is denoted by \(\mathcal {D}\times \mathcal {T}\times \mathcal {M}\); however, certain combinations might be inadmissible.
Concisely, evaluations (d, t, m) represent the abstraction of using a specific method m to solve a particular task t on given data d. An example would be to classify (t) a fixed number of ImageNet images (d) with MobileNetV2 (m) (Howard et al. 2017). The last sentence in Definition 1 indicates that fixing any d, t, or m potentially limits the other options, as, for example, image classification cannot be performed on time series data sets. Similarly, the options for applicable methods m might be limited; however, we also find certain approaches, like neural networks, being utilized across all tasks. Also note that in our understanding, m already encompasses specific hyperparameters (e.g., optimizer, preprocessing, ...), which might have a strong impact on practical performance. For the sake of clarity, we do not explicitly characterize ML methods further here, but want to emphasize the need for reporting them in detail – as a good example, OpenML's database distinguishes between different method setups (Vanschoren et al. 2014). Differentiating between tasks (training, inference, architecture search, ...) is important because each task has its own quantities that describe practical performance:
Definition 2
Any combination of data d and task t determines a space of properties \(\varvec{p}(d, t) = \{p_{(d, t)}(m)\} \subset \mathbb {R}^n\) in which methods m score via property functions \(p_{(d, t)_i}: \mathcal {M} \rightarrow \mathbb {R}\) for \(1 \le i \le n\). A specific evaluation result \(\varvec{p}_{(d, t)}(m)\) thus is an n-dimensional vector of properties that m exhibits when being applied to perform t on d.
Where others report metrics (Stojnic et al. 2018) or measures (Vanschoren et al. 2014), we chose the term properties, as the corresponding denotation p is easier to formally distinguish from the method choice m. We want to stress that these p only represent measurable, numeric properties like accuracy or running time, to stay with the ImageNet example above. Obviously, certain additional information also needs to be reported – these meta properties might be textual descriptions, related literature, or implementation details, to name but a few. While they play an important role for comprehensibility and reproducibility, we do not formally denote them, as they are irrelevant for the numerical comparison of methods. In practice, different properties can be grouped into either describing predictive quality (error, accuracy, ...) or resource usage (running time, model size, ...). Usually, we expect these two groups to compete, as deploying ML efficiently requires investigating the trade-off between predictive capabilities and resource demand (Fischer et al. 2022). As such, we support others in their argumentation that explicitly reporting on properties in both groups is mandatory for sustainable ML development (Schwartz et al. 2020). As opposed to the aforementioned frameworks, STREP gives clear incentives to do so (e.g., via error messages when no resource-allied properties are found).
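The grouping and the incentive mechanism described above can be sketched in a few lines of Python (a minimal sketch; the group assignments, names, and warning text are illustrative, not STREP's actual behavior):

```python
# illustrative group assignments for common property names (hypothetical)
QUALITY_PROPS = {"accuracy", "error", "f1_score", "precision", "recall"}
RESOURCE_PROPS = {"running_time", "power_draw", "model_size", "parameters"}

def split_property_groups(properties):
    """Split reported property names into quality and resource groups,
    flagging reports that lack any resource-allied property."""
    quality = [p for p in properties if p in QUALITY_PROPS]
    resources = [p for p in properties if p in RESOURCE_PROPS]
    if not resources:
        print("warning: no resource-allied properties reported")
    return quality, resources
```

A report containing only predictive-quality properties would thus immediately trigger the warning, nudging contributors towards resource-aware reporting.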
3.2 Taking execution environments into account
In practice, properties exhibited by a method are subject to the real-world hardware and software used for performing the evaluation (Fischer et al. 2024). Even though some methods possess certain theoretical (i.e., static (Morik et al. 2022)) attributes, it is not guaranteed that they hold in practice (e.g., due to coding mistakes (Sun et al. 2017)). Unfortunately, we find that most established benchmarks neglect the importance of this computational environment, which is why we explicitly consider it in STREP. We extend the definitions above accordingly:
Definition 3
Any evaluation (d, t, m) is executed in an environment \(e \in \mathcal {E}\), which is defined by the choice of hardware and software. As this impacts how any method performs in practice, we adjust Definition 2 accordingly: \(\varvec{p}_{(d, t, e)} = \{p_{(d, t, e)_i}: \mathcal {M} \times \mathcal {E} \rightarrow \mathbb {R} \}\) is now also subject to e.
In addition to earlier restrictions, we may find that environments impose further limitations (and vice versa). As examples, consider SOTA deep learning models that demand several gigabytes of memory capacity (Strubell et al. 2020), or the limited range of readily usable methods for edge (Buschjäger et al. 2020) or quantum hardware (Mücke et al. 2023). For improved readability, from now on we omit the underlying data, task, and environment choice (d, t, e) and resort to simply denoting method properties as \(\varvec{p}(m) = \{p_i(m)\}\). From the diversity of environments (e.g., desktops, compute clusters, embedded devices), a new problem arises: resource-allied properties like running time can scale across multiple orders of magnitude when performing a fixed evaluation in different environments. While the impact on resource demand is most obvious, note that ML predictions could also be affected by the environment, e.g., due to algorithmic changes across library versions. In order to compare properties better, even across multiple environments, STREP therefore adapts relative index scaling (Fischer et al. 2022):
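With \(\mu_i(m)\) denoting the real measurement of method m on the i-th property, \(m^*\) the empirically best-performing method, and \(\sigma_i\) the optimization direction explained below, the index value can be written as (a reconstruction consistent with the surrounding description):

```latex
p_i(m) = \left(\frac{\mu_i(m^*)}{\mu_i(m)}\right)^{\sigma_i} \quad (1)
```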
Where the original work calculates index values \(p_i(m)\) based on global references, we instead propose to relate the real measurements \(\mu _i(m)\) to the empirically best-performing method \(m^*\) on the i-th property. We deem this approach advantageous over global referencing, since it neatly solves the problem of choosing good references. With (1), property values \(p_{i}\) become bounded by the interval (0, 1] and describe the behavior in the given environment relatively; the higher the value, the closer it is to the best empirical result, which receives \(p_i(m^*) = 1\). On the index scale, properties are more easily comparable – a value of 0.6 always indicates that the method achieves 60% of the best empirically known result, regardless of which property is discussed or which environment was used. Index scaling also addresses another problem – most properties (e.g., running time and power draw) require minimization for improvement, but some exceptions (e.g., accuracy) demand maximization. To flip these scales and unify the performance space, (1) uses the constant \(\sigma _i\), which denotes whether the corresponding i-th property should be maximized (\(\sigma _i=-1\)) or minimized (\(\sigma _i=+1\)). When measured properties and additional information via meta properties are provided to STREP, it automatically computes the environment-specific index values (and updates them if new best scores are added). Note that many users might also want to investigate the real measurements as encountered in other reports, which is why STREP computes index values as an alternative representation that is offered in addition to the original results.
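The scaling referenced as (1) can be sketched in a few lines of Python (function and variable names are hypothetical, not taken from the STREP implementation):

```python
def index_scale(measurements, sigma):
    """Relative index scaling in the spirit of (1): relate each real
    measurement mu_i(m) to the empirically best method m* on property i.
    sigma = +1 for properties to minimize (e.g., running time),
    sigma = -1 for properties to maximize (e.g., accuracy)."""
    best = min(measurements.values()) if sigma == 1 else max(measurements.values())
    # resulting values fall into (0, 1]; the best method receives exactly 1
    return {m: (best / mu) ** sigma for m, mu in measurements.items()}

# a method needing twice the running time of the best one scores 0.5
runtimes = index_scale({"MobileNetV2": 2.0, "VGG16": 4.0}, sigma=+1)
```

When a new best score is reported, all index values of that property have to be recomputed, which is why such an update step belongs in the framework rather than the report itself.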
3.3 Offering means for interaction
While methods should generally perform well across all properties, there can also be use cases where certain aspects are more important. To give some examples, consider safety-critical applications that demand highly robust predictions (Croce et al. 2020), or embedded systems with tight memory constraints (Buschjäger et al. 2020). Unfortunately, as listed in Table 1, established systems only support interactions to a certain degree. STREP assesses any method’s suitability for a given task and data under consideration of the practitioner’s priorities via a weighted compound performance score:
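With index values \(p_i(m)\) from (1) and user-chosen weights \(w_i \ge 0\), the compound score can be written as (a reconstruction consistent with the normalization described below):

```latex
P(m) = \sum_{i=1}^{n} \frac{w_i}{\sum_{j=1}^{n} w_j}\, p_i(m) \quad (2)
```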
Normalizing the weights \(w_i\) ensures that the compound score is also bounded by (0, 1], with \(P(m)=1\) indicating the unlikely case that m achieves the best possible results across all reported properties. With the property grouping introduced in Section 3.1, users are enabled to control the importance of resource efficiency. In an extreme setting, resource-allied properties can be disregarded by setting their weight to zero. Likewise, users can concentrate purely on a method's compute demand and leave out all non-resource properties. To generally achieve good trade-offs, STREP by default assigns weights such that both groups are equally impactful in (2).
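A minimal sketch of such a weighted compound score (hypothetical helper; missing properties are imputed with zero, as in the experiments of Section 4):

```python
def compound_score(index_values, weights=None):
    """Weighted compound score in the spirit of (2): a normalized,
    weighted sum of index values, bounded by (0, 1]."""
    if weights is None:
        weights = [1.0] * len(index_values)  # uniform default
    total = sum(weights)
    # unreported properties (None) contribute zero to the score
    return sum(w / total * (p if p is not None else 0.0)
               for w, p in zip(weights, index_values))

# setting a weight to zero disregards the corresponding property entirely
score = compound_score([1.0, 0.5], weights=[1.0, 1.0])  # 0.75
```

Adjusting the weight vector is exactly the interaction STREP exposes: shifting mass between the quality and resource groups re-ranks methods without touching the underlying reports.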
Another opportunity for making the reporting framework more interactive lies in the representation of results. Like other frameworks, STREP provides plain textual descriptions, numeric tables, and plots that display reported data more visually and intuitively (Cui 2019). Via its interactive interface (Dabbas 2021), users can explore the reported results and switch between the real measurement and relative index scales, and thus better understand both. Lastly, STREP also interactively displays individual results in a more comprehensible representation.
3.4 Informing on different levels of understanding
While experts from the field can investigate in-depth tables and plots, many works have called for establishing more comprehensible means of reporting, such as labels (Morik et al. 2022). We support this call and formulate several ideas for implementing it practically. First of all, labels can inform on properties with the help of intuitive badges (Morik et al. 2022), which express the underlying meaning metaphorically. To make numeric values less abstract, each index scale can be divided into bins that color-code the performance along each property (Fischer et al. 2022). Naturally, labels should also report the evaluation setting (task, data, environment, method) via text, badges, or logos. Additional details on the issuing authority and the listed method can easily be embedded via QR codes. As the index values can become outdated when novel results are added to the database, labels also need to feature the date of issuance. Since the fast and easy generation of such summaries is key for establishing them, STREP was designed to put all suggested ideas into effect and produce labels automatically. Note that these labels are merely a scientific proof of concept – establishing a truly trustworthy labeling system or even certification would obviously require the support of renowned authorities with appropriate legitimation.
4 Experimental evaluation
We now demonstrate what insights the introduced ideas provide in practice by investigating established report databases with the help of our STREP software.
4.1 Experiment setup
Our Python library (available at https://www.github.com/raphischer/strep) handles property databases via pandas (The pandas development team 2022) and dynamically reports tables, interactive plots, and labels via Plotly and Dash (Dabbas 2021). The user interface is displayed in Fig. 2 – users can bring their own custom property databases or explore four readily processed databases: (1) reported efficiency of ImageNet models across multiple environments (Fischer et al. 2022), (2) the RobustBench leaderboard (Croce et al. 2020), (3) results from forecasting Monash time series (Fischer and Saadallah 2023; Godahewa et al. 2021) with deep neural networks, and (4) an extract from Papers With Code (Stojnic et al. 2018). Following our characterization as proposed in Section 3.1, we provide the database statistics in Table 2. They list total numbers across all task \(\times \) data combinations, each of which determines the applicable properties and methods. As shown in Fig. 3, Papers With Code is only sparsely populated, with few evaluations and properties reported for most cases (i.e., data \(\times \) task combinations). For this reason, we identified the most interesting benchmarks via the filtering steps described in Fig. 4. With incompleteness, we denote the average share of NaN values found in the evaluations of each data \(\times \) task combination. The weights were chosen such that each property group (resources or predictive quality) is equally impactful for (2) and the properties within each group are also equally weighted. The "most important" properties discussed later were chosen heuristically as the ones with the highest number of reported values. The Papers With Code and RobustBench databases can be re-created with our linked code, while the other two were pre-assembled. Keep in mind that this work investigates the state of reporting on a rather abstract level – in-depth experimental discussions and further information can be found in the respective literature.
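The incompleteness statistic can be sketched with pandas (toy values; the column names are hypothetical, not taken from the processed databases):

```python
import pandas as pd

# toy evaluations of one data x task combination; None marks unreported values
evals = pd.DataFrame({
    "accuracy":   [0.91, 0.88, 0.95],
    "power_draw": [None, 34.0, None],
    "model_size": [14.2, None, 23.1],
})
# incompleteness: average share of missing entries across all evaluations
incompleteness = evals.isna().to_numpy().mean()  # here: 3 of 9 cells are NaN
```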
Following the call for openly reporting any paper’s environmental impact (Schwartz et al. 2020), we estimate the total carbon emissions of this work to about 6 CO\(_2\)e – four full work weeks of running a CPU-only laptop (Lacoste et al. 2019) for software development and writing. We invite readers to also use our STREP application for exploring the databases in detail, as this paper is limited to investigating exemplary results.
4.2 Trade-offs and correlation of properties
We start by investigating trade-offs among important reported properties in the databases. Figure 5 depicts the power draw versus accuracy comparison of ImageNet models in two environments. It showcases the effectiveness of index scaling (right plot) as given by (1), which allows for more intuitive comparisons. All results are put in relation to the best empirical performance per metric (value 1), which bounds all values and ensures that improving any property corresponds to a higher score. The node colors indicate the compound score over all properties, as introduced in (2). The ImageNet models seem to be scattered along a multi-dimensional Pareto front, with the two shown axes clearly trading against each other. For the other databases, we display similar scatter plots for single data sets in Fig. 6. In cases where properties are not reported (i.e., NaN values in the data frame), we assign \(p_i(m) = 0\). For RobustBench, the properties seem to be strongly correlated, whereas no clear correlation is visible in the other two plots. We display the complete properties of the best and worst options for each data set via star plots in Fig. 7. They demonstrate how the proposed relative index scale can unify the reporting of multi-dimensional ML properties, despite their origin from very different domains and tasks. Even though the axes describe totally different phenomena like running time, prediction errors, or model size, their index values remain easily comparable. Where users of other reporting frameworks are restricted to sorting reports based on single properties, these STREP plots allow for exploring the multi-dimensional relations between properties at arbitrary depth.
With the help of the Pearson correlation coefficient (PCC), we can numerically quantify trade-offs in the form of pairwise property correlation. Values near one, zero, or minus one respectively indicate strong correlation, no relation, or trade-offs (i.e., negative correlation). For the ImageNetEff and Forecasting data, this is displayed in Fig. 8. We find strong correlation between (1) model parameters and file size, (2) running time and power draw, as well as (3) accuracy and error properties. The latter also tend to be un- or even negatively correlated with other properties, which indicates resource trade-offs. This also supports our call for grouping and equally reporting properties for predictive capabilities and resource consumption.
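Computing such pairwise PCCs only takes a call to pandas' built-in correlation (toy values shown; the real databases hold thousands of evaluations):

```python
import pandas as pd

# toy property table: strongly linked size measures plus running time
props = pd.DataFrame({
    "parameters":   [3.5, 25.6, 138.0, 60.0],    # millions of weights
    "file_size":    [14.0, 98.0, 528.0, 230.0],  # MB on disk
    "running_time": [0.8, 2.1, 6.5, 3.0],        # seconds per batch
})
pcc = props.corr(method="pearson")  # symmetric matrix with values in [-1, 1]
```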
We show the distribution of PCCs across all databases and properties in Fig. 9, computed once for the original measurements and once for the scaled index values. For the ImageNetEff and Forecasting data, we observe bimodal shapes, as already indicated in the more in-depth results from Fig. 8. On the other hand, the distributions for RobustBench and Papers With Code are heavily tilted towards high PCCs. Index scaling seems to spread the distributions and make their shapes more distinct. We assume that the tilted shapes originate from reporting overly focused on predictive performance, which can already be seen in the vanishingly small number of resource properties in Table 2 – on average, we found only 2% of the reported properties in Papers With Code to fall into the resource category. We test our assumption with Fig. 10, where the database was split into evaluations that have at least some reported resource usage versus ones that solely focus on predictive quality. The assessed PCC distributions for both groups clearly support our intuition – the resource-aware distribution is much less tilted and features many correlations near zero. This indicates that interesting trade-offs between different properties occur here, but are under-reported for all other evaluations.
4.3 Resource-aware and comprehensible scoring
Next, we want to explore the effect of using our compound score (2) to assess overall method performance, under consideration of all reported properties and trade-offs. This is opposed to concentrating on just a single property, as is the case with established frameworks that only allow for sorting (recall Table 1). We experimentally investigate this by assessing the rank correlation (Spearman’s \(\rho \)) of sorting applicable methods for any task based on either (1) the most important property or (2) our compound score. The resulting distribution of \(\rho \) values across all databases is depicted in Fig. 11. Firstly, it shows that on average both ranking approaches are positively correlated, which is to be expected – the single property is also considered by the compound score. The distributions of ImageNetEff and RobustBench are rather simple (only few benchmark rankings are considered), but clearly demonstrate that considering all properties strongly impacts the understanding of “good” performance. This is even more dramatically observed in the other databases, where some tasks and data sets even have low or negative correlation – the ranking can change completely when all properties are taken into account. On Papers With Code, which we already showed to be heavily biased towards optimizing predictive capabilities, we find the highest rank correlations. However, even the average value of \(\rho =0.75\) is far from perfect agreement, which demonstrates the importance of considering all properties when assessing method performance. We want to once more emphasize that our compound performance assessment is customizable – here we use the heuristic weights explained earlier, but STREP allows users to interactively align the importance of properties with their priorities (e.g., to favor methods with low resource demand).
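The rank comparison can be sketched in a few lines of pure Python; the score values below are invented for illustration, and the compound scores do not follow the paper's actual weighting:

```python
def rank(values):
    """Ranks (1 = highest value); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman's rho via the classic rank-difference formula (tie-free case)."""
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for six methods on one task:
top_property = [0.91, 0.88, 0.85, 0.82, 0.80, 0.78]  # e.g., accuracy only
compound     = [0.55, 0.70, 0.62, 0.74, 0.58, 0.50]  # all properties weighted

rho = spearman(top_property, compound)
print(round(rho, 2))  # 0.26 – the ranking changes substantially
```

A low \(\rho \) like this is exactly the situation observed for some tasks in Fig. 11: the best method under a single property is far from the best under the compound view.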
As explained in Section 3.2, both the compound score and single properties are subject to the execution environment in which the method is deployed. For this reason, Fig. 12 investigates this effect in the ImageNetEff database. The vertical distance between model points indicates how stable their efficiency is across different environments. Models like MobileNetV2 (Howard et al. 2017) seem to generally perform rather efficiently, whereas certain architectures like ResNets and VGG show very mixed compound performances. These results are evidence that the environment greatly impacts method properties and thus should be explicitly reported. The RobustBench and Forecasting experiments were all performed on unified hardware – it would be interesting to test how different environments might change the empirical behavior. For Papers With Code, the environment is, in some cases, described in the associated paper. However, it would be much more beneficial if this information was explicitly demanded whenever people report new results, ideally together with the highly environment-specific compute utilization.
Lastly, let us consider how properties can be reported at a higher level of abstraction, in order to also inform users without fundamental ML knowledge. For that, we implement our suggestions from Section 3.4 and display prototypical labels as generated by STREP in Fig. 13. They are inspired by the EU energy labels (European Commission 2019) and display efficiency and robustness information for selected models and data. The colored badges convey the intuitive meaning behind each property in a more comprehensible way (e.g., a battery for power draw, a target circle for accuracy, a stopwatch for running time). For determining the rating colors, we used rating boundaries based on the 20%, 40%, 60%, and 80% quantiles of all scored values for the specific property – these boundaries are also displayed as colored rectangles in Figs. 5 and 6. While boundaries and ratings were determined from the index values, the labels display the more intuitive real measurements and hide away the complexity of relative scaling. In addition, they textually inform about the evaluation and environment setting and feature a QR code linking to additional information material. These labels demonstrate how intricate trade-offs between properties of any ML method can be communicated to less informed users of the reporting framework.
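The quantile-based rating can be sketched as follows; the index values are invented for illustration, the five-grade scale and the linear-interpolation quantile are assumptions, and we assume (as in our index scaling) that higher values are better:

```python
def quantile(sorted_vals, q):
    """Quantile with linear interpolation over pre-sorted values."""
    idx = q * (len(sorted_vals) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = idx - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def rating(value, all_values):
    """Map an index value to a label grade (A = best ... E = worst),
    using the 20/40/60/80% quantiles of all scored values as boundaries."""
    vals = sorted(all_values)
    boundaries = [quantile(vals, q) for q in (0.2, 0.4, 0.6, 0.8)]
    grade = sum(value < b for b in boundaries)  # count boundaries above value
    return "ABCDE"[grade]

# Hypothetical index values of ten methods for one property:
index_values = [0.31, 0.45, 0.52, 0.61, 0.66, 0.71, 0.80, 0.88, 0.93, 0.97]
print(rating(0.95, index_values))  # A – above the 80% quantile
print(rating(0.31, index_values))  # E – below the 20% quantile
```

On the generated label, only the color grade and the real measurement are shown; the relative scaling behind the boundaries stays hidden from the user.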
5 Conclusion
Practitioners who want to develop sustainable and trustworthy ML systems require elaborate reporting on the current SOTA in their specific application domain. While many frameworks for publishing ML results exist, they also have certain drawbacks: they are overly focused on predictive capabilities, only offer limited means for interactive use, and inform on a very technical level that is hard to understand for non-experts. In this work, we analyzed the current state of reporting and suggested novel means for improving it. Our proposed STREP framework addresses (s)ustainability via index scaling and compound scoring, which enables resource-aware comparisons of methods even across different software and hardware environments. We improve on (t)rustworthiness by allowing users to interactively control the importance of single method properties for the overall assessment. In addition, our (rep)orting framework automatically generates high-level labels, which communicate intricate results to users without fundamental ML knowledge in a more comprehensible way. We used STREP to practically investigate established report databases in terms of their resource-awareness, where we uncovered major biases towards predictive quality. Our experimental evaluation also demonstrates the feasibility of our proposed ideas. The implementation of STREP is publicly available and can be easily extended – we invite fellow researchers to use it for exploring the trade-offs occurring with methods in their own ML domains. While we specifically focused this paper on characterizing the current state, proposing means to improve it, and testing our suggestions in practice, we also see much potential for future work. Firstly, we will further investigate reproducibility aspects of current reporting and try to improve our framework along this direction. In addition, we plan on evaluating the usability of different reporting frameworks and our label representations via user studies.
A closer integration with established frameworks like Papers With Code and OpenML is a further option for future extensions. We hope that our work is a helpful contribution towards establishing a better reporting paradigm, which makes future ML applications and methods more sustainable and trustworthy.
Data Availability
No datasets were generated or analysed during the current study.
References
Arnold M, Bellamy RK, Hind M, Houde S, Mehta S, Mojsilović A, Nair R, Ramamurthy KN, Olteanu A, Piorkowski D et al (2019) Factsheets: Increasing trust in AI services through supplier’s declarations of conformity. IBM J Res Dev 63(4/5):6–1
Avin S, Belfield H, Brundage M, Krueger G, Wang J et al (2021) Filling gaps in trustworthy development of AI. Science 374(6573):1327–1329. American Association for the Advancement of Science
Baum K, Mantel S, Schmidt E, Speith T (2022) From responsibility to reason-giving explainable artificial intelligence. Philos Technol 35(1):12
Beckh K, Müller S, Jakobs M, Toborek V, Tan H, Fischer R, Welke P, Houben S, Rueden L (2023) Harnessing prior knowledge for explainable machine learning: An overview. In: First IEEE conference on secure and trustworthy machine learning
Bender EM, Gebru T, McMillan-Major A, Shmitchell S (2021) On the dangers of stochastic parrots: can language models be too big? In: Conference on fairness, accountability, and transparency, pp 610–623. https://doi.org/10.1145/3442188.3445922
Buschjäger S, Pfahler L, Buss J, Morik K, Rhode W (2020) On-site gamma-hadron separation with deep learning on fpgas. In: European conference on machine learning and knowledge discovery in databases, pp 478–493
Castaño J, Martínez-Fernández S, Franch X, Bogner J (2023) Exploring the carbon footprint of hugging face’s ML models: a repository mining study. Preprint arXiv:2305.11164
Chatila R, Dignum V, Fisher M, Giannotti F, Morik K, Russell S, Yeung K (2021) Trustworthy AI. In: Reflections on artificial intelligence for humanity, pp 13–39. Springer
Croce F, Andriushchenko M, Sehwag V, Debenedetti E, Flammarion N, Chiang M, Mittal P, Hein M (2020) Robustbench: a standardized adversarial robustness benchmark. Preprint arXiv:2010.09670
Cui W (2019) Visual analytics: a comprehensive overview. IEEE Access 7:81555–81573. https://doi.org/10.1109/ACCESS.2019.2923736
Dabbas E (2021) Interactive dashboards and data apps with plotly and dash
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30. JMLR.org
Demšar J, Curk T, Erjavec A, Gorup U, Hočevar T et al (2013) Orange: data mining toolbox in Python. J Mach Learn Res 14(1):2349–2353. JMLR.org
Dignum V (2019) Responsible artificial intelligence: how to develop and use AI in a responsible way. https://doi.org/10.1007/978-3-030-30371-6
EU AI HLEG (2020) Assessment List for Trustworthy Artificial Intelligence (ALTAI) for self-assessment. https://futurium.ec.europa.eu/en/european-ai-alliance/pages/altai-assessment-list-trustworthy-artificial-intelligence
European Commission (2019) Commission Delegated Regulation (EU) 2019/2014 with regard to energy labelling of household washing machines and household washer-dryers. https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:32019R2014
European Parliament (2023) A step closer to the first rules on artificial intelligence. European Parliament News. https://www.europarl.europa.eu/news/en/press-room/20230505IPR84904/ai-act-a-step-closer-to-the-first-rules-on-artificial-intelligence
Feurer M, Rijn JNv, Kadra A, Gijsbers P, Mallik N et al (2021) OpenML-Python: an extensible Python API for OpenML. J Mach Learn Res 22(100):1–5
Fischer R, Jakobs M, Mücke S, Morik K (2022) A unified framework for assessing energy efficiency of machine learning. Machine learning and principles and practice of knowledge discovery in databases. Springer, Cham, pp 39–54
Fischer R, Pauly A, Wilking R, Kini A, Graurock D (2023) Prioritization of identified data science use cases in industrial manufacturing via C-EDIF scoring. In: IEEE international conference on data science and advanced analytics, pp 1–4
Fischer R, Saadallah A (2023) AutoXPCR: Automated multi-objective model selection for time series forecasting. Preprint arXiv:2312.13038
Fischer R, van der Staay A, Buschjäger S (2024) Stress-testing USB accelerators for efficient edge inference. Research Square preprint. https://doi.org/10.21203/rs.3.rs-3793927
Godahewa R, Bergmeir C, Webb GI, Hyndman RJ, Montero-Manso P (2021) Monash time series forecasting archive. In: Neural information processing systems track on datasets and benchmarks. forthcoming
Hauer MP, Krafft TD, Zweig K (2023) Overview of transparency and inspectability mechanisms to achieve accountability of artificial intelligence systems. Data Policy 5:36. Cambridge University Press
Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W et al (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. Preprint arXiv:1704.04861
Hutson M (2018) Artificial intelligence faces reproducibility crisis. Science 359(6377):725–726. https://doi.org/10.1126/science.359.6377.725
Ismail-Fawaz A, Dempster A, Tan CW, Herrmann M, Miller L et al (2023) An approach to multiple comparison benchmark evaluations that is stable under manipulation of the comparate set. Preprint arXiv:2305.11921
Jain S (2022) Hugging face, pp 51–67. https://doi.org/10.1007/978-1-4842-8844-3_4
Kang D, Kang T, Jang J (2023) Papers with code or without code? Impact of GitHub repository usability on the diffusion of machine learning research. Inf Process Manag 60(6):103477. https://doi.org/10.1016/j.ipm.2023.103477
Kar AK, Choudhary SK, Singh VK (2022) How can artificial intelligence impact sustainability: A systematic literature review. J Clean Prod 134120. Elsevier
Lacoste A, Luccioni A, Schmidt V, Dandres T (2019) Quantifying the carbon emissions of machine learning. Preprint arXiv:1910.09700
Marwedel P, Morik K (2022) Machine learning under resource constraints - volume 1: fundamentals. https://doi.org/10.1515/9783110785944
Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T (2006) YALE: rapid prototyping for complex data mining tasks. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2006), pp 935–940. ACM Press, New York, USA. ACM. http://rapid-i.com/component/option,com_docman/task,doc_download/gid,25/Itemid,62/
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L et al (2019) Model cards for model reporting. In: Proceedings of the conference on fairness, accountability, and transparency, FAT* 2019, pp 220–229. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3287560.3287596
Morik KJ, Kotthaus H, Fischer R, Mücke S, Jakobs M, Piatkowski N, Pauly A, Heppe L, Heinrich D (2022) Yes we care!-certification for machine learning methods through the care label framework. Front Artif Intell 5. https://doi.org/10.3389/frai.2022.975029
Mücke S, Heese R, Müller S, Wolter M, Piatkowski N (2023) Feature selection on quantum computers. Quantum Mach Intell 5(1):11
Patterson D, Gonzalez J, Le Q, Liang C, Munguia L-M, Rothchild D, So D, Texier M, Dean J (2021) Carbon emissions and large neural network training. Preprint arXiv:2104.10350
Pineau J, Vincent-Lamarre P, Sinha K, Larivière V, Beygelzimer A et al (2021) Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program). J Mach Learn Res 22(1):7459–7478. JMLR.org
Piorkowski D, Park S, Wang AY, Wang D, Muller M, Portnoy F (2021) How ai developers overcome communication challenges in a multidisciplinary team: A case study. Proceedings of the ACM on human-computer interaction 5(CSCW1), pp 1–25. ACM New York, NY, USA
Sakaguchi K, Bras RL, Bhagavatula C, Choi Y (2021) Winogrande: An adversarial winograd schema challenge at scale. Commun ACM 64(9):99–106. ACM New York, NY, USA
Schwartz R, Dodge J, Smith NA, Etzioni O (2020) Green AI. Commun ACM 63(12):54–63
Srivastava A, Rastogi A, Rao A, Shoeb AAM, Abid A et al (2022) Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Preprint arXiv:2206.04615
Stojnic R, Taylor R, Kardas M, Saravia E, Cucurull G, Westbury A, Scialom T (2018) Papers With Code - The latest in Machine Learning. https://paperswithcode.com/
Strubell E, Ganesh A, McCallum A (2020) Energy and Policy Considerations for Modern Deep Learning Research. In: AAAI conference on artificial intelligence, pp 13693–13696
Sun X, Zhou T, Li G, Hu J, Yang H, Li B (2017) An Empirical Study on Real Bugs for Machine Learning Programs. In: 2017 24th Asia-Pacific software engineering conference (APSEC), pp 348–357. https://doi.org/10.1109/APSEC.2017.41
The pandas development team (2022) pandas-dev/pandas: Pandas 1.4.1. Zenodo. https://doi.org/10.5281/zenodo.6053272
Vanschoren J, Van Rijn JN, Bischl B, Torgo L (2014) Openml: networked science in machine learning. ACM SIGKDD Explor Newslett 15(2):49–60
Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, Levy O, Bowman S (2019) Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems 32
van Wynsberghe A (2021) Sustainable AI: AI for sustainability and the sustainability of AI. AI Ethics 1(3):213–218. https://doi.org/10.1007/s43681-021-00043-6
Zaharia M, Chen A, Davidson A, Ghodsi A, Hong SA et al (2018) Accelerating the machine learning lifecycle with MLflow. IEEE Data Eng Bull 41(4):39–45
Funding
Open Access funding enabled and organized by Projekt DEAL.
Contributions
R.F. wrote the main manuscript text and T.L. and K.M. reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Responsible editor: Panagiotis Papapetrou.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Fischer, R., Liebig, T. & Morik, K. Towards more sustainable and trustworthy reporting in machine learning. Data Min Knowl Disc 38, 1909–1928 (2024). https://doi.org/10.1007/s10618-024-01020-3