1 Introduction

The performance and usability of today’s machine learning (ML) methods make them ubiquitous for business use across all sectors (Fischer et al. 2023). Software developers who realize these use cases, however, are often not (yet) skilled in ML engineering and need to acquire a large amount of knowledge about this technology in a short time. The same holds for ML experts, who might be well-versed in certain domains but still need to familiarize themselves with new fields from time to time. The reporting and communication of state-of-the-art (SOTA) results plays a highly important role for both types of practitioners. As reading scientific papers in depth is hardly feasible for swiftly learning about ML intricacies, several approaches to higher-level reporting have been established. They summarize the key properties of individual models (Arnold et al. 2019; Mitchell et al. 2019; Morik et al. 2022), provide overviews for specific benchmarks (Croce et al. 2020; Godahewa et al. 2021), or offer open databases for reported properties across various ML domains (Vanschoren et al. 2014; Stojnic et al. 2018). While the importance of documenting predictive quality is obvious, recent calls for more sustainability (van Wynsberghe 2021) and trustworthiness (Chatila et al. 2021) also need to be explicitly considered by reporting frameworks.

While the currently established reporting already serves practitioners with helpful overviews, we see a dire need for improvement: (1) Information on computing setup and resource usage is largely missing from established reports. To make ML research and deployment more resource-efficient (Fischer et al. 2022) and sustainable (Kar et al. 2022), frameworks should explicitly inform on such aspects and provide means for comparing methods across different computing platforms. (2) Except for sorting results based on single properties, most reports, unfortunately, cannot be interactively customized. This is problematic because interested practitioners often bring application-specific priorities (e.g., due to resource constraints (Marwedel and Morik 2022)). In addition, means for interaction were shown to benefit the understandability, and thus trustworthiness, of systems (Beckh et al. 2023). (3) Reporting frameworks are usually designed by and for experts in the field and offer only limited benefit to target audiences who are not well-versed in ML fundamentals. To address these stakeholders, a more abstract level of reporting is necessary, for example in the form of labels (Morik et al. 2022). (4) Some of the established systems have shortcomings in terms of usability and extensibility, which are key factors for the acceptance and long-term feasibility of any framework. (5) Even though reproducibility of reported results (Hutson 2018) is of utmost importance for constructing trustworthy systems (Avin et al. 2021), current frameworks do not address the matter sufficiently.

With this work, we take steps towards better reporting that explicitly considers aspects of sustainability and trustworthiness. In short, we contribute to the SOTA by

  1. discussing current reporting approaches and their drawbacks (Section 2),

  2. proposing STREP, a novel framework that addresses these flaws and allows for more (s)ustainable and (t)rustworthy (rep)orting (Section 3),

  3. offering a readily-usable implementation of STREP, which is used for

  4. exploring the resource-awareness of established report databases (Section 4).

Our proposed framework is schematically displayed in Fig. 1, and its practical implementation and results are publicly available at https://www.github.com/raphischer/strep. To address the identified flaws, we later formulate key questions that future reporting frameworks should consider. STREP manifests possible answers – it leverages index scaling to compare any given benchmark in a unified and resource-aware way, offers means for interactively customizing reports to align with users’ priorities, and addresses diverse target audiences by adapting the idea of high-level labeling (Morik et al. 2022). Our experimental investigations comprise empiric reports from thousands of ML methods, hundreds of data sets, and several computing environments. In some cases, we successfully identify intricate performance trade-offs (Fischer and Saadallah 2023), while in others, we find clear evidence that reports of resource usage are under-represented. We deem our work and implementation highly extensible, invite fellow researchers to explore their own reported properties with our software, and hope that our work becomes a milestone towards more sustainable and trustworthy reporting in ML.

Fig. 1

Schematic visualization of STREP, our proposed sustainable and trustworthy reporting framework. Internally, it processes databases of reported properties, which stem from applying implemented ML methods to specific data sets. Users can interact with the framework to gain an understanding of model properties in terms of their own priorities. Resource trade-offs are explicitly considered, and results can be represented more abstractly for audiences with limited ML knowledge

2 Current state of ML reporting

Reporting plays a vital role in bridging communication gaps between ML experts and less informed practitioners (Piorkowski et al. 2021). The EU AI Act (EU AI HLEG 2020) was put forward to ensure that ML is used in trustworthy (Chatila et al. 2021; Avin et al. 2021), accountable (Hauer et al. 2023), and responsible (Baum et al. 2022; Dignum 2019) ways, and demands that these systems be designed to be safe, transparent, traceable, non-discriminatory, and environmentally friendly (European Parliament 2023). The last aspect is especially noteworthy, as modern ML can act as a tool for making the world more sustainable (van Wynsberghe 2021; Kar et al. 2022), but has also been shown to significantly impact our climate (Strubell et al. 2020; Bender et al. 2021; Patterson et al. 2021). Many works have set out to investigate resource efficiency explicitly (Fischer et al. 2022; Lacoste et al. 2019), as resource consumption was found to be generally underreported (Schwartz et al. 2020; Castaño et al. 2023). The works above underline that the current paradigm of ML reporting needs further improvement. We start our contribution to the cause by characterizing the present state; our assessment is summarized in Table 1.

Table 1 Established reporting approaches and their drawbacks

Our literature review found that ML reporting generally takes place on one of three levels: The first informs on properties of an individual method or model, which, in the most basic case, are described in a scientific paper. To allow for faster comparison and understanding, fact sheets (Arnold et al. 2019) and model cards (Mitchell et al. 2019) were proposed as higher-level summaries. This idea is, for example, successfully implemented in the Hugging Face repository (Jain 2022). As these representations are still too technical to be understood by non-experts, the idea of establishing care labels (Morik et al. 2022), a more abstract and comprehensible form of communication, was put forward. In the context of sustainability, this idea was adapted for reporting on ML resource efficiency more comprehensibly (Fischer et al. 2022) (in analogy to the EU energy labels (European Commission 2019)). A central drawback of frameworks for individual reporting lies in their lack of interactivity: their results cannot be customized to users’ interests. Consider a practitioner wanting to learn about a model’s suitability for their specific use case and priorities, which might, for example, be subject to certain resource constraints. In this case, the option of interactively controlling the importance of single model properties would help to align the framework’s output with their needs. Interacting with any system has also been shown to increase users’ understanding and trust (Beckh et al. 2023), which are desirable attributes for reporting frameworks. In addition, most aforementioned works (Arnold et al. 2019; Mitchell et al. 2019) do not offer any automation for generating reports, which strongly impacts the usability for experts wanting to report novel results.

Benchmarks form the second level of reporting and summarize many individual reports to give an overview of specific ML application domains. These days, competitive leaderboards allow comparing the performance of SOTA methods in domains such as robustness testing (Croce et al. 2020), time series forecasting (Godahewa et al. 2021), language understanding (Srivastava et al. 2022; Wang et al. 2019), and commonsense reasoning (Sakaguchi et al. 2021). Several works assembled such benchmarks to investigate the performance of different methods across multiple data sets (Demšar 2006; Ismail-Fawaz et al. 2023). Cross-domain databases are located on the third level and take the idea of benchmarks a step further, essentially comprising different leaderboards across a wide range of domains. Thanks to their open nature and direct coding interfaces, representatives like Papers With Code (Stojnic et al. 2018) and OpenML (Vanschoren et al. 2014) became highly popular. As they offer direct access to data, methods, and training configurations (Feurer et al. 2021), these platforms are highly beneficial for new practitioners. Model and data set repositories (Jain 2022) as well as platforms for performing ML experiments in an accelerated and unified way (Zaharia et al. 2018; Demšar et al. 2013; Mierswa et al. 2006) help in assembling such databases and are sometimes directly integrated via interfaces. Recent investigations suggest that promoting scientific papers and code via such platforms (Stojnic et al. 2018) helps to disseminate research and attract more citations (Kang et al. 2023).

In contrast to individual reporting, these platforms offer at least limited means for interaction, as users can sort results based on their preference (Croce et al. 2020; Vanschoren et al. 2014) or configure custom plots (Vanschoren et al. 2014; Stojnic et al. 2018). More advanced mechanisms, like controlling the impact of different properties on the overall method rating, are unfortunately missing. The comprehensibility and thus the benefit for non-experts is questionable, as fundamental information on metrics and methods is only partially provided and high-level representations of individual results (e.g., via labels (Morik et al. 2022)) are not offered. As anyone can easily edit open databases, their usability is slightly impacted by the inconsistent reporting of properties. As a small teaser for Section 4, we found many leaderboards to be only sparsely populated and specific properties to be redundantly duplicated due to different spellings (e.g., “params” and “parameters”). Another big drawback of all mentioned reporting frameworks lies in their neglect of sustainability. Reporting on the computing environment or consumed resources is only optional and not specifically encouraged, and we later show further evidence that, as a result, many reported benchmarks are overly focused on predictive capabilities (Schwartz et al. 2020). Lastly, we want to explicitly mention the problem of reproducibility (Hutson 2018; Pineau et al. 2021), which is of utmost importance in the context of trustworthiness. Established frameworks support linking code (Stojnic et al. 2018) and offer interfaces to selected ML libraries (Vanschoren et al. 2014), but ultimately, they do not test whether reported results can actually be reproduced. Even with available code, information on the utilized hardware and software setup is usually not provided – we argue that SOTA reporting should explicitly demand such information.

Table 1 lists our assessment of current reporting and its drawbacks at a glance. We want to stress that we understand all established works to be highly important and successful contributions that we want to build on. In the following section, we present STREP, our own take on sustainable and trustworthy reporting. In comparison to the discussed approaches, we specifically designed our framework to be more sustainable (we offer a holistic approach for reporting any method’s resource trade-offs in its respective computational environment), more interactive (users can align the impact of each property on the overall assessment with their priorities), and more comprehensible (we directly offer means for generating high-level labels). As motivated earlier, the latter two aspects are directly linked to the trustworthiness of any system. In terms of usability and reproducibility, we are aware of their importance but do not push improvements with this work. We still deem STREP to be highly usable – it can easily be used to investigate any given database and generate different report representations automatically. However, it neither tackles the aforementioned inconsistencies in open databases nor the limited reproducibility, as both issues boil down to how well-organized and well-documented the original reported results are.

3 STREP - Sustainable and Trustworthy Reporting

To make reporting more sustainable and trustworthy, we devised central questions that address the identified flaws and need to be explicitly considered: (Q1) How can reported method properties be characterized in a resource-aware way? (Q2) How can the practical computing environment be taken into account? (Q3) What means for interactive investigation can be offered? (Q4) How can users at different levels of ML understanding be informed? This section is structured accordingly, and our proposed STREP framework implements possible answers to Q1–Q4.

3.1 Characterizing ML properties

Good reporting first requires a thorough characterization of ML experiments and the properties that methods may exhibit. While other works have addressed this challenge (Fischer et al. 2022; Morik et al. 2022), we aligned the characterization in STREP with established terminology (Stojnic et al. 2018), for which thousands of reported evaluations can already be found online.

Definition 1

Any ML evaluation is defined by the tuple \((d, t, m)\), which represents: (i) a data set d, (ii) a task type t, i.e., an abstract action to be performed on d, and (iii) an ML method m that can practically perform this action. The theoretical space of all possible evaluations \((d, t, m)\) is denoted by \(\mathcal {D}\times \mathcal {T}\times \mathcal {M}\); however, certain combinations might be inadmissible.

Concisely, evaluations \((d, t, m)\) represent the abstraction of using a specific method m to solve a particular task t on given data d. An example would be to classify (t) a fixed number of ImageNet images (d) with MobileNetV2 (m) (Howard et al. 2017). The last sentence in Definition 1 indicates that fixing any d, t, or m potentially limits the other options; for example, image classification cannot be performed on time series data sets. Similarly, the options for applicable methods m might be limited; however, we also find certain approaches like neural networks being utilized across all tasks. Also note that in our understanding, m already encompasses specific hyperparameters (e.g., optimizer, preprocessing, ...), which might have a strong impact on practical performance. For the sake of clarity, we do not explicitly characterize ML methods further here, but want to emphasize the need for reporting them in detail – as a good example, OpenML’s database distinguishes between different method setups (Vanschoren et al. 2014). Differentiating between tasks (training, inference, architecture search, ...) is important because each task has its own quantities that describe practical performance:

Definition 2

Any combination of data d and task t determines a space of properties \(\varvec{p}_{(d, t)} = \{p_{(d, t)}(m)\} \subset \mathbb {R}^n\) in which methods m score via property functions \(p_{(d, t), i}: \mathcal {M} \rightarrow \mathbb {R}\) for \(1 \le i \le n\). A specific evaluation result \(\varvec{p}_{(d, t)}(m)\) thus is an n-dimensional vector of properties that m exhibits when being applied to perform t on d.

Where others report metrics (Stojnic et al. 2018) or measures (Vanschoren et al. 2014), we chose the term properties, as the corresponding denotation p is easier to formally distinguish from the method choice m. We want to stress that these p only represent measurable, numeric properties like accuracy or running time, to stay with the ImageNet example above. Obviously, certain additional information also needs to be reported – these meta properties might be textual descriptions, related literature, or implementation details, to name but a few. While they play an important role for comprehensibility and reproducibility, we do not formally denote them, as they are irrelevant for the numerical comparison of methods. In practice, different properties can be grouped into either describing predictive quality (error, accuracy, ...) or resource usage (running time, model size, ...). Usually, we expect these two groups to compete, as deploying ML efficiently requires investigating the trade-off between predictive capabilities and resource demand (Fischer et al. 2022). As such, we support others in their argumentation that explicitly reporting on properties in both groups is mandatory for sustainable ML development (Schwartz et al. 2020). As opposed to the aforementioned frameworks, STREP gives clear incentives to do so (e.g., via error messages when no resource-related properties are found).
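To make this concrete, the following minimal sketch shows one possible way to store evaluations and their measurable properties in a pandas data frame and to group property columns into predictive quality and resource usage; all column names, keyword lists, and values are illustrative assumptions rather than STREP’s actual data model.

```python
import pandas as pd

# Hypothetical rows: one evaluation (d, t, m) per row, one numeric property per column.
evaluations = pd.DataFrame([
    {"dataset": "dataset_A", "task": "classification", "method": "method_1",
     "accuracy": 0.81, "running_time_s": 120.0, "model_size_mb": 95.0},
    {"dataset": "dataset_A", "task": "classification", "method": "method_2",
     "accuracy": 0.77, "running_time_s": 35.0, "model_size_mb": 14.0},
])

QUALITY_KEYWORDS = ("accuracy", "error", "f1", "top-1")
RESOURCE_KEYWORDS = ("time", "power", "energy", "size", "parameters")

def group_properties(property_names):
    """Split reported property names into predictive quality and resource usage."""
    groups = {"quality": [], "resources": [], "other": []}
    for name in property_names:
        lowered = name.lower()
        if any(key in lowered for key in RESOURCE_KEYWORDS):
            groups["resources"].append(name)
        elif any(key in lowered for key in QUALITY_KEYWORDS):
            groups["quality"].append(name)
        else:
            groups["other"].append(name)
    if not groups["resources"]:
        # clear incentive to also report resource usage
        raise ValueError("No resource-related properties found - please also "
                         "report, e.g., running time, energy, or model size.")
    return groups

properties = [c for c in evaluations.columns if c not in ("dataset", "task", "method")]
print(group_properties(properties))
# {'quality': ['accuracy'], 'resources': ['running_time_s', 'model_size_mb'], 'other': []}
```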

3.2 Taking execution environments into account

In practice, properties exhibited by a method are subject to the real-world hardware and software used for performing the evaluation (Fischer et al. 2024). Even though some methods possess certain theoretical (i.e., static (Morik et al. 2022)) attributes, it is not guaranteed that they hold in practice (e.g., due to coding mistakes (Sun et al. 2017)). Unfortunately, we find that most established benchmarks neglect the importance of this computational environment, which is why we explicitly consider it in STREP. We extend the definitions above accordingly:

Definition 3

Any evaluation \((d, t, m)\) is executed in an environment \(e \in \mathcal {E}\), which is defined by the choice of hardware and software. As they impact how any method performs in practice, we adjust Definition 2 accordingly: \(\varvec{p}_{(d, t, e)} = \{p_{(d, t, e), i}: \mathcal {M} \times \mathcal {E} \rightarrow \mathbb {R} \}\) is now also subject to e.

In addition to the earlier restrictions, we may find that environments impose further limitations (and vice-versa). As examples, consider SOTA deep learning models that demand several gigabytes of memory capacity (Strubell et al. 2020), or the limited range of readily usable methods for edge (Buschjäger et al. 2020) or quantum hardware (Mücke et al. 2023). For improved readability, we from now on omit the underlying data, task, and environment choice \((d, t, e)\) and resort to simply denoting method properties as \(\varvec{p}(m) = \{p_i(m)\}\). From the diversity of environments (e.g., desktops, compute clusters, embedded devices), a new problem arises: Resource-related properties like running time can vary by multiple orders of magnitude when performing a fixed evaluation in different environments. While the impact on resource demand is most obvious, note that ML predictions could also be affected by the environment, e.g., due to algorithmic changes across library versions. In order to compare properties better, even across multiple environments, STREP therefore adapts relative index scaling (Fischer et al. 2022):

$$\begin{aligned} p_i(m)=\left( \frac{\mu _i(m^*)}{\mu _i(m)}\right) ^{\sigma _i}=\left( \frac{ (\min _{m'\in \mathcal {M}} \mu _i(m')^{\sigma _i})^{\sigma _i}}{\mu _i(m)}\right) ^{\sigma _i}. \end{aligned}$$
(1)

Whereas the original work calculates index values \(p_i(m)\) based on global references, we instead propose to relate the real measurements \(\mu _i(m)\) to the empirically best-performing method \(m^*\) on the i-th property. We deem this approach advantageous over global referencing, since it neatly solves the problem of choosing good references. With (1), property values \(p_{i}\) become bounded by the interval (0, 1] and describe the behavior in the given environment relatively: the higher the value, the closer it is to the best empiric result, which receives \(p_i(m^*) = 1\). On the index scale, properties are more easily comparable – a value of 0.6 always indicates that the method achieves 60% of the best empirically known result, regardless of which property is discussed or which environment was used. Index scaling also neatly solves another problem: most properties (e.g., running time and power draw) require minimization for improvement, but some exceptions (e.g., accuracy) demand maximization. To flip these scales and unify the performance space, (1) uses the constant \(\sigma _i\), which denotes whether the corresponding i-th property should be maximized (\(\sigma _i=-1\)) or minimized (\(\sigma _i=+1\)). When measured properties and additional information via meta properties are provided to STREP, it automatically computes the environment-specific index values (and updates them if new best scores are added). Note that many users might also want to investigate the real measurements as encountered in other reports, which is why STREP offers index values as an alternative representation in addition to the original results.
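For clarity, a minimal NumPy sketch of the index scaling in (1), vectorized over all methods reported for one property in one environment, could look as follows (the function and its interface are our own illustration, not STREP’s API):

```python
import numpy as np

def index_scale(measurements, sigma):
    """Relative index scaling (Eq. 1) for one property in one environment.

    measurements: raw values mu_i(m) for all methods m.
    sigma: +1 if the property should be minimized (e.g., running time),
           -1 if it should be maximized (e.g., accuracy).
    Returns index values in (0, 1]; the empirically best method m* receives 1.
    """
    mu = np.asarray(measurements, dtype=float)
    best = np.min(mu ** sigma) ** sigma  # mu_i(m*), the best empiric result
    return (best / mu) ** sigma

# Running times (minimize): index_scale([12.0, 30.0, 60.0], sigma=+1) -> [1.0, 0.4, 0.2]
# Accuracies (maximize):    index_scale([0.90, 0.72, 0.45], sigma=-1) -> [1.0, 0.8, 0.5]
```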

3.3 Offering means for interaction

While methods should generally perform well across all properties, there can also be use cases where certain aspects are more important. To give some examples, consider safety-critical applications that demand highly robust predictions (Croce et al. 2020), or embedded systems with tight memory constraints (Buschjäger et al. 2020). Unfortunately, as listed in Table 1, established systems only support interactions to a certain degree. STREP assesses any method’s suitability for a given task and data under consideration of the practitioner’s priorities via a weighted compound performance score:

$$\begin{aligned} P(m) = \sum _{i=1}^n w_i p_i(m) \text { with } w_i\ge 0 \; \forall i \text { and } \sum _{i=1}^n w_i=1 \end{aligned}$$
(2)

Normalizing the weights \(w_i\) ensures that the compound score is also bounded by (0, 1], with \(P(m)=1\) indicating the unlikely case that m achieves the best possible results across all reported properties. With the property grouping introduced in Section 3.1, users can control the importance of resource efficiency. In an extreme setting, resource-related properties can be disregarded by setting their weight to zero. Likewise, users can concentrate purely on a method’s compute demand and leave out all non-resource properties. To generally achieve good trade-offs, STREP by default assigns weights such that both groups are equally impactful in (2).
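A minimal sketch of this compound scoring and of the default group weighting described above (property grouping as in Section 3.1); function names and interfaces are illustrative, not STREP’s implementation:

```python
import numpy as np

def compound_score(index_values, weights):
    """Weighted compound score P(m) as in Eq. (2); weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w / w.sum(), np.asarray(index_values, dtype=float)))

def default_weights(groups):
    """Equal impact for the quality and resource groups, equal weights within each group."""
    weights = {}
    for group_name in ("quality", "resources"):
        names = groups[group_name]
        for name in names:
            weights[name] = 0.5 / len(names)
    return weights
```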

Another opportunity for making the reporting framework more interactive lies in the representation of results. Like other frameworks, STREP provides plain textual descriptions, numeric tables, and plots that display reported data more visually and intuitively (Cui 2019). Via its interactive interface (Dabbas 2021), users can explore the reported results and switch between the real measurement and relative index scales, which helps in understanding both. Lastly, STREP also interactively displays individual results in a more comprehensible representation.
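To hint at how such interaction can be realized with Dash (Dabbas 2021), the following is a deliberately reduced sketch with a single weight slider; the layout, component identifiers, and callback are our own illustration and do not mirror the actual STREP interface:

```python
from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
app.layout = html.Div([
    html.Label("Importance of resource efficiency"),
    dcc.Slider(id="resource-weight", min=0.0, max=1.0, step=0.1, value=0.5),
    html.Div(id="weight-info"),
])

@app.callback(Output("weight-info", "children"), Input("resource-weight", "value"))
def update_info(resource_weight):
    # In a full application, this callback would recompute the compound scores (Eq. 2).
    return (f"Resource group weight: {resource_weight:.1f}, "
            f"quality group weight: {1.0 - resource_weight:.1f}")

if __name__ == "__main__":
    app.run(debug=True)
```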

3.4 Informing on different levels of understanding

While experts from the field can investigate in-depth tables and plots, many works have called for establishing more comprehensible means of reporting, such as labels (Morik et al. 2022). We support this call and formulate several ideas for implementing it practically. First of all, labels can inform on properties with the help of intuitive badges (Morik et al. 2022) that express the underlying meaning metaphorically. To make numeric values less abstract, each index scale can be divided into bins that color-code the performance along each property (Fischer et al. 2022). Naturally, labels should also report the evaluation setting (task, data, environment, method) via text, badges, or logos. Additional details on the issuing authority and listed method can easily be embedded via QR codes. As the index values can become outdated when novel results are added to the database, the labels also need to feature the date of issuance. As the fast and easy generation of such summaries is key for establishing them, STREP was designed to put all suggested ideas into effect and produce labels automatically. Note that these labels are merely a scientific proof-of-concept – establishing a truly trustworthy labeling system or even certification would obviously require the support of renowned authorities with appropriate legitimation.

Fig. 2

User interface of our STREP software. It depicts the trade-offs among properties, automatically generates a tabular and label representation of any method’s performance (upon hovering over single points) and allows for interaction via controlling the weights and customizing the plots

4 Experimental evaluation

We now demonstrate what insights the introduced ideas provide in practice by investigating established report databases with the help of our STREP software.

Table 2 Statistics of the investigated property databases
Fig. 3

Papers With Code is only sparsely populated, with only few results reported for thousands of (data \(\times \) task) combinations

4.1 Experiment setup

Our Python library (available at https://www.github.com/raphischer/strep) handles property databases via pandas (The pandas development team 2022) and dynamically reports tables, interactive plots, and labels via Plotly and Dash (Dabbas 2021). The user interface is displayed in Fig. 2 – users can bring their own custom property databases or explore four readily processed databases: (1) reported efficiency of ImageNet models across multiple environments (Fischer et al. 2022), (2) the RobustBench leaderboard (Croce et al. 2020), (3) results from forecasting Monash time series (Fischer and Saadallah 2023; Godahewa et al. 2021) with deep neural networks, and (4) an extract from Papers With Code (Stojnic et al. 2018). Following our characterization as proposed in Section 3.1, we provide the database statistics in Table 2. They list total numbers across all task \(\times \) data combinations, each of which determines the applicable properties and methods. As shown in Fig. 3, Papers With Code is only sparsely populated, with few evaluations and properties reported for most cases (i.e., data \(\times \) task combinations). For this reason, we identified the most interesting benchmarks via the filtering steps described in Fig. 4. With Incompleteness, we denote the average share of NaN values found in the evaluations of each data \(\times \) task combination. The weights were chosen such that each property group (resources or predictive quality) is equally impactful for (2) and the properties within each group are also equally weighted. The “most important” properties discussed later were chosen heuristically as the ones with the highest number of reported values. The Papers With Code and RobustBench databases can be re-created with our linked code, while the other two were pre-assembled. Keep in mind that this work investigates the state of reporting on a rather abstract level – in-depth experimental discussions and further information can, however, be found in the respective literature. Following the call for openly reporting any paper’s environmental impact (Schwartz et al. 2020), we estimate the total carbon emissions of this work at about 6 kg CO\(_2\)e – four full work weeks of running a CPU-only laptop (Lacoste et al. 2019) for software development and writing. We invite readers to also use our STREP application for exploring the databases in detail, as this paper is limited to investigating exemplary results.
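As a reference for how the Incompleteness statistic can be computed, consider the following pandas sketch; the column names are assumptions about the database layout, not STREP’s actual schema:

```python
import pandas as pd

def incompleteness(db: pd.DataFrame, property_columns):
    """Average share of NaN property values per (dataset, task) combination."""
    grouped = db.groupby(["dataset", "task"])[list(property_columns)]
    return grouped.apply(lambda group: group.isna().to_numpy().mean())
```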

Fig. 4

Filtering steps applied to the sparse database of Papers With Code for identifying interesting benchmarks for our investigations

Fig. 5

Effect of index scaling for the comparison of power draw versus accuracy of ImageNet models (Fischer et al. 2022). Each point indicates the high-dimensional properties of a single model, with two dimensions shown as axes. Each model was benchmarked in two different environments; the respective points are connected. When comparing real measurements (left), the differences are hard to understand, whereas relative index scaling allows for an easier comparison (right). Point colors depict the compound score across all properties, and rectangles indicate bins for discrete rating as explained in Section 3.4

Fig. 6

Further property trade-offs on selected data sets in the other databases (only reported for a single environment). RobustBench shows a clear correlation, while the other points appear more scattered

4.2 Trade-offs and correlation of properties

We start by investigating trade-offs among important reported properties in the databases. Figure 5 depicts the power draw versus accuracy comparison of ImageNet models in two environments. It showcases the effectiveness of index scaling (right plot) as given by (1), which allows for more intuitive comparisons. All results are put in relation to the best empiric performance per metric (value 1), which bounds all values and ensures that improving any property corresponds to a higher score. The node colors indicate the compound score over all properties, as introduced in (2). The ImageNet models seem to be scattered along a multi-dimensional Pareto front, with the two shown axes clearly trading against each other. For the other databases, we display similar scatter plots for single data sets in Fig. 6. In cases where properties are not reported (i.e., NaN values in the data frame), we assign \(p_i(m) = 0\). For RobustBench, the properties seem to be strongly correlated, whereas no clear correlation is visible in the other two plots. We display the complete properties of the best and worst options for each data set via star plots in Fig. 7. They demonstrate how the proposed relative index scale can unify the reporting of multi-dimensional ML properties, despite their origin from very different domains and tasks. Even though the axes describe totally different phenomena like running time, prediction errors, or model size, their index values remain easily comparable. Where users of other reporting frameworks are restricted to sorting reports based on single properties, these STREP plots allow exploring the multi-dimensional relations between properties at arbitrary depth.

Fig. 7

All properties of the best and worst performing methods displayed via star plots on selected data. Due to the relative index scaling, all properties are equally bounded and easily comparable

Fig. 8

Correlation between model properties reported in the ImageNet (Fischer et al. 2022) and Forecasting (Fischer and Saadallah 2023) databases. Certain characteristics like number of parameters and model file size are positively correlated, while trade-offs in terms of accuracy and resource usage are negatively correlated

With the help of the Pearson correlation coefficient (PCC), we can numerically quantify trade-offs in the form of pairwise property correlation. Values near one, zero, or minus one respectively indicate strong positive correlation, no relation, or trade-offs (i.e., negative correlation). For the ImageNetEff and Forecasting data, this is displayed in Fig. 8. We find strong correlation between (1) model parameters and file size, (2) running time and power draw, as well as (3) accuracy and error properties. The latter group also tends to be uncorrelated or even negatively correlated with the other properties, which indicates resource trade-offs. This also supports our call for grouping and equally reporting properties for predictive capabilities and resource consumption.
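The pairwise correlations can be obtained directly from a property table, for example via pandas; the sketch below assumes one row per method and one numeric column per property, which is our own simplification of the database layout:

```python
import pandas as pd

def property_correlations(properties: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlation coefficients (PCC) between all reported properties.

    Values near +1 indicate aligned properties (e.g., parameters vs. file size),
    values near -1 indicate trade-offs (e.g., accuracy vs. power draw).
    """
    return properties.corr(method="pearson")
```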

Fig. 9

Distributions of pairwise correlation between properties across all databases. The number of correlations N is the sum of the squared number of reported properties for each data set and task combination. Index scaling is more suitable for comparing trade-offs, as these distributions have a more bimodal and stretched-out shape and tend to entail either positive or negative correlation

Fig. 10

Property correlations for Papers With Code tasks that also report on resource usage versus tasks that solely focus on prediction quality. N is the number of tasks in the distribution

We show the distribution of PCCs across all databases and properties in Fig. 9, computed once for the original measurements and once for the scaled index values. For the ImageNetEff and Forecasting data, we observe bimodal shapes, as already indicated by the more in-depth results in Fig. 8. On the other hand, the distributions for RobustBench and Papers With Code are heavily tilted towards high PCCs. Index scaling seems to spread the distributions and make their shapes more distinct. We assume that the tilted shapes originate from reporting that is overly focused on predictive performance, which can already be seen in the vanishingly small number of resource properties in Table 2 – on average, we found only 2% of the reported properties in Papers With Code to fall into the resource category. We test our assumption with Fig. 10, where the database was split into evaluations that have at least some reported resource usage versus ones that solely focus on predictive quality. The assessed PCC distributions for both groups clearly support our intuition: the resource-aware distribution is much less tilted and features many correlations near zero. This indicates that interesting trade-offs between different properties occur here, while they remain under-reported for all other evaluations.
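The split underlying Fig. 10 can be sketched as follows; the grouping keys, the method column, and the set of resource property names are assumptions about the database layout rather than STREP’s actual code:

```python
import numpy as np
import pandas as pd

def pcc_by_resource_awareness(db: pd.DataFrame, resource_columns):
    """Collect pairwise PCCs separately for (data, task) combinations with and
    without any reported resource usage."""
    with_resources, without_resources = [], []
    for _, group in db.groupby(["dataset", "task"]):
        props = group.drop(columns=["dataset", "task", "method"], errors="ignore")
        props = props.select_dtypes("number").dropna(axis=1, how="all")
        if props.shape[1] < 2:
            continue  # need at least two properties to correlate
        pcc = props.corr(method="pearson").to_numpy()
        pcc = pcc[np.triu_indices_from(pcc, k=1)]  # upper triangle, no diagonal
        has_resources = any(c in props.columns for c in resource_columns)
        (with_resources if has_resources else without_resources).append(pcc)
    return np.concatenate(with_resources), np.concatenate(without_resources)
```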

4.3 Resource-aware and comprehensible scoring

Next, we want to explore the effect of using our compound score (2) to assess overall method performance, under consideration of all reported properties and trade-offs. This is opposed to concentrating on just a single property, as is the case with established frameworks that only allow for sorting (recall Table 1). We experimentally investigate this by assessing the rank correlation (Spearman’s \(\rho \)) of sorting the applicable methods for any task based on either (1) the most important property or (2) our compound score. The resulting distribution of \(\rho \) values across all databases is depicted in Fig. 11. Firstly, it shows that on average both ranking approaches are positively correlated, which is to be expected – the single property is also considered by the compound score. The distributions of ImageNetEff and RobustBench are rather simple (only a few benchmark rankings are considered), but clearly demonstrate that considering all properties strongly impacts the understanding of “good” performance. This is even more dramatically observed in the other databases, where some tasks and data sets even have low or negative correlation – the ranking seems to change completely when all properties are taken into account. On Papers With Code, which we already showed to be heavily biased towards optimizing predictive capabilities, we find the highest rank correlations. However, even the average value of \(\rho =0.75\) is far from perfect agreement, which demonstrates the importance of considering all properties when assessing method performance. We want to emphasize once more that our compound performance assessment is customizable – we here use the heuristic weights explained earlier, but STREP allows users to interactively align the importance of properties with their priorities (e.g., to favor methods with low resource demand).
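A sketch of this rank comparison using SciPy; the function name and inputs are illustrative, with the single property assumed to be the most frequently reported one per benchmark:

```python
from scipy.stats import spearmanr

def ranking_agreement(single_property_scores, compound_scores):
    """Spearman's rho between ranking methods by a single (index-scaled) property
    and ranking them by the compound score of Eq. (2); both inputs are
    'higher is better' values for the same list of methods."""
    rho, _p_value = spearmanr(single_property_scores, compound_scores)
    return rho
```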

Fig. 11

Correlation of ranking methods based on a single property versus based on compound score. The ranking is performed for each data set and task combination, of which N exist (cf. Table 2)

Fig. 12

Impact of environment choice on the efficiency of different ImageNet models (Fischer et al. 2022). The given compound scores are calculated from the environment-specific property index values. The vertical distance between model instances shows their performance stability across environment choice

Fig. 13

Exemplary ML labels generated by STREP, which convey information on model characteristics more comprehensibly. Index scaling allows to indicate relative performance via colored badges and thus communicate intricate property trade-offs in a more intuitive way

As explained in Section 3.2, both the compound score and single properties are subject to the execution environment in which the method is deployed. For this reason, Fig. 12 investigates this effect in the ImageNetEff database. The vertical distance between model points indicates how stable their efficiency is across different environments. Models like MobileNetV2 (Howard et al. 2017) seem to generally perform rather efficiently, whereas certain architectures like ResNets and VGG have very mixed compound performances. These results are evidence that the environment greatly impacts method properties and thus should be explicitly reported. The RobustBench and Forecasting experiments were all performed on unified hardware – it would be interesting to test how different environments might change the empiric behavior. For Papers With Code, the environment is, in some cases, described in the associated paper. However, it would be much more beneficial if this information were explicitly demanded whenever new results are reported, ideally together with the environment-specific compute utilization.

Lastly, let us consider how properties can be reported at a higher level of abstraction, to also inform users without fundamental ML knowledge. For that, we implement our suggestions of Section 3.4 and display prototypical labels as generated by STREP in Fig. 13. They are inspired by the EU energy labels (European Commission 2019) and display efficiency and robustness information for selected models and data. The colored badges convey the intuitive meaning behind each property in a more comprehensible way (e.g., a battery for power draw, a target circle for accuracy, a stopwatch for running time). For determining the rating colors, we used rating boundaries based on the 20%, 40%, 60%, and 80% quantiles of all scored values for the specific property – these boundaries are also displayed as colored rectangles in Figs. 5 and 6. While boundaries and ratings were determined from the index values, the labels display the more intuitive real measurements and hide away the complexity of relative scaling. In addition, they textually inform on the evaluation and environment setting and feature a QR code with additional information material. These labels demonstrate how intricate trade-offs between properties of any ML method can be communicated to less informed users of the reporting framework.
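A minimal sketch of this quantile-based rating; the function names are our own, and the concrete color mapping on STREP’s labels may differ:

```python
import numpy as np

def rating_boundaries(index_values, quantiles=(0.2, 0.4, 0.6, 0.8)):
    """Bin boundaries from the 20/40/60/80% quantiles of a property's index values."""
    return np.quantile(np.asarray(index_values, dtype=float), quantiles)

def rating(index_value, boundaries):
    """Discrete rating from 0 (worst) to 4 (best), e.g., mapped to red-to-green badges."""
    return int(np.searchsorted(boundaries, index_value, side="right"))
```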

5 Conclusion

Practitioners who want to develop sustainable and trustworthy ML systems require elaborate reporting on the current SOTA in their specific application domain. While many frameworks for publishing ML results exist, they also have certain drawbacks: they are overly focused on predictive capabilities, only offer limited means for interactive use, and inform on a very technical level that is hard to understand for non-experts. In this work, we analyzed the current state of reporting and suggested novel means for improving it. Our proposed STREP framework addresses (s)ustainability via index scaling and compound scoring, which enables resource-aware comparisons of methods even across different software and hardware environments. We improve on (t)rustworthiness by allowing users to interactively control the importance of single method properties for the overall assessment. In addition, our (rep)orting framework automatically generates high-level labels, which communicate intricate results to users without fundamental ML knowledge in a more comprehensible way. We used STREP to practically investigate established report databases in terms of their resource-awareness, where we uncovered major biases towards predictive quality. Our experimental evaluation also demonstrates the feasibility of our proposed ideas. The implementation of STREP is publicly available and can be easily extended – we invite fellow researchers to use it for exploring trade-offs occurring with methods in their own ML domains.

While we specifically focused this paper on characterizing the current state, proposing means to improve it, and testing our suggestions in practice, we also see much potential for future work. Firstly, we will further investigate reproducibility aspects of current reporting and try to improve our framework along this direction. In addition, we plan on evaluating the usability of different reporting frameworks and our label representations via user studies. A closer integration with established frameworks like Papers With Code and OpenML is a further option for future extensions. We hope that our work is a helpful contribution towards establishing a better reporting paradigm that makes future ML applications and methods more sustainable and trustworthy.