Background

Environmental risk assessment of pharmaceuticals

In 2006, the European Medicines Agency (EMA) decided that all new marketing authorisation applications for human pharmaceuticals should be accompanied by an environmental risk assessment [1]. The EMA risk assessment follows a PEC/PNEC (predicted environmental concentration/predicted no effect concentration) approach and is divided into two phases. In phase II, data on the substance's physicochemical properties, persistence and bioaccumulation, and ecotoxicity are reviewed and the PNEC is estimated. All relevant data should be taken into account. Experimental studies should preferably follow standard test protocols, but it is recognized that there are other acceptable methods. However, their use should be justified, and studies should be conducted in compliance with good laboratory practice (GLP) [1].
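
To make the PEC/PNEC logic concrete, the following minimal Python sketch derives a risk quotient from invented numbers; applying an assessment factor of 10 to the lowest of three long-term NOECs reflects common practice rather than the exact EMA procedure.

    # Minimal PEC/PNEC sketch (illustrative values only).
    def pnec(noecs_ug_per_l, assessment_factor=10):
        """PNEC = lowest NOEC divided by an assessment factor."""
        return min(noecs_ug_per_l) / assessment_factor

    def risk_quotient(pec_ug_per_l, pnec_ug_per_l):
        """RQ = PEC/PNEC; RQ >= 1 indicates a potential risk."""
        return pec_ug_per_l / pnec_ug_per_l

    # Hypothetical long-term NOECs for algae, daphnia, and fish (ug/L).
    noecs = [4.6, 1.2, 0.8]
    rq = risk_quotient(pec_ug_per_l=0.05, pnec_ug_per_l=pnec(noecs))
    print(f"RQ = {rq:.2f}")  # RQ = 0.62, i.e., below 1 in this sketch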

Standard and non-standard ecotoxicity tests

Ecotoxicological testing can be done using a variety of methods and models. There are two general approaches: using standard or non-standard testing methodologies. Standard tests are tests performed and reported according to a method described and provided by an official international or national harmonization or standardization organization, such as the OECD (Organisation for Economic Co-operation and Development), US EPA (United States Environmental Protection Agency), ASTM (American Society for Testing and Materials), AFNOR (Association Française de Normalisation), or ISO (International Organization for Standardization). The test standard establishes a uniform specification of the experimental setup and execution, the methods for data analysis, and the reporting format for the test data. Non-standard tests, on the other hand, are tests performed according to any other test method.

Regardless of whether a test is performed according to a standard, it should meet some general scientific quality criteria to demonstrate the reliability and reproducibility of the test results. Examples of such criteria are a clear description of the endpoints, inclusion of appropriate controls, appropriate identification of the test substance and test organism, a stated exposure duration and administration route, and transparent reporting of effect concentrations.

The major advantages of using standard tests are that the results are directly comparable across substances and that the data they generate will be readily accepted across jurisdictions. Test guidelines also help promote the reliability of the data: the detailed test procedures and extensive reporting they require make it easier to repeat an experiment if needed. The major disadvantage of standard test methods is that they do not always represent the most biologically relevant testing approach for the type of endpoint under investigation. Results from non-standardized tests may therefore in some cases be more sensitive and thereby contribute additional and significant information to a risk assessment. Other disadvantages of standard tests are that they are inflexible, leaving no room for case-by-case adjustments, and that developing a new standard test may take up to 10-15 years.

Given the characteristics and purposes of standard tests, it is not surprising that they are mostly performed by commercial laboratories, while non-standardized methods are typically performed by research scientists and published in scientific journals. Standard tests are often performed according to GLP, whereas non-standard tests seldom are. The primary objective of the OECD Principles of Good Laboratory Practice is to ensure the generation of high-quality and reliable test data related to the safety of chemical substances and preparations [2]. However, concerns have also been raised as to whether GLP is synonymous with good scientific practice, accurate reporting, and valid data [3, 4].

Testing for environmental effects caused by pharmaceutical substances

Pharmaceutical substances have a number of inherent properties that make them interesting from a regulatory perspective. First, pharmaceuticals are carefully designed to interact with biological processes. Second, this interaction should be as specific as possible, ideally influencing only one well-defined target molecule or cellular process, with as few side effects as possible. Third, this interaction should be achieved at low concentrations, meaning that the substance has to be relatively potent. Fourth, to achieve this, the active pharmaceutical ingredient must be sufficiently persistent to remain unmetabolized long enough to reach the target organ in the human body.

It is fundamental for risk identification, and thus a crucial part of the risk assessment process, to have toxicity test methods that are adapted to their purpose. Currently available standard test methods for deriving regulatory toxicity data for the aquatic environment are in many cases not sufficiently sensitive to the very specific types of effects that can be expected from pharmaceutical substances [see e.g., [5]]. The EMA guideline [1] recommends that standard tests measuring growth inhibition and reproduction failure (OECD test guidelines 201, 210, and 211) be used in environmental risk assessment of pharmaceutical substances. However, test data are for many pharmaceuticals still limited or not publicly available. The sex hormone ethinylestradiol is one of few substances for which a significant amount of both standard and non-standard test data is available. Table 1 presents the lowest reported standard and non-standard effect values for ethinylestradiol (according to the Wikipharma database [6] and the environmental classification system at fass.se [7]), both no-observed effect concentration (NOEC) values and EC50 values (the lowest identified concentration at which 50% of the tested population is affected).

Table 1 The lowest publicly available standard and non-standard effect values for ethinylestradiol.

When comparing the toxicity values, the non-standard NOEC value is 32 times lower than the standard NOEC value, and the non-standard EC50 value is over 95,000 times lower than the standard EC50 value. Ethinylestradiol can therefore be seen as an example where non-standard tests with more substance-specific endpoints are more sensitive than the standard tests.
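
The fold differences above are simple ratios of the reported effect concentrations; the sketch below only illustrates the arithmetic, with placeholder values rather than the actual Table 1 data.

    # Fold difference between standard and non-standard effect values
    # (placeholder values, not the Table 1 data).
    standard_noec = 1.0            # hypothetical standard NOEC (ug/L)
    non_standard_noec = 1.0 / 32   # hypothetical non-standard NOEC (ug/L)

    fold = standard_noec / non_standard_noec
    print(f"non-standard NOEC is {fold:.0f} times lower")  # 32 times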

There is a need to carefully evaluate the regulatory process of identifying pharmaceuticals that might pose a risk to non-target species in the aquatic environment, and to make sure that relevant and sufficiently sensitive tests are used in the regulatory environmental risk assessment of pharmaceuticals. As we see it, there are at least three ways forward: (1) to develop new standard ecotoxicity tests better suited for pharmaceuticals, (2) to adjust existing standard tests by supplementing them with additional endpoints relevant for different pharmacological modes of action, or (3) to increase the use of non-standard tests for risk assessment purposes.

The development of new test standards is costly and may take 10 to 15 years [8]. Since pharmaceuticals are a diverse group of substances in terms of how they affect biological processes, it is unlikely that new standards covering all relevant endpoints could be developed in the near future.

Adjustments of current standard tests could increase their biological relevance for testing pharmaceutical substances. A potential way forward could therefore be that the standardization organizations initiate additional validation and expert commenting rounds to standardize such adjustments. Still, such minor additions to existing standards would likely not be sufficient to ensure that the specific biological effects of most pharmaceuticals are covered by the tests.

Hence, in our view, an important and realistic way forward is to make increased use of non-standard test data to ensure a scientifically well-founded environmental risk assessment of pharmaceuticals. To enable the use of non-standard tests in risk assessments, two things are needed: legislation designed so that non-standard tests can be included in a systematic and predictable way, and non-standard tests reported in a transparent and comprehensive way, much as is required for the standard test methods.

Reliability and relevance evaluation of (eco-)toxicity data

According to the TGD [9], an evaluation of data reliability should assess "the inherent quality of a test relating to test methodology and the way that the performance and results of the test are described". Basically, this evaluation should answer the question: has the experiment generated and reported a true and correct result?

The assessment of the relevance of the data should describe "the extent to which a test is appropriate for a particular hazard or risk assessment" [9], e.g., answer questions like: Is the measured endpoint a valid indicator of environmental risk? Is the experimental model sufficiently sensitive for detecting the expected effects? Does the experimental model have sufficient statistical power? How representative is the experimental model of the environment that is to be protected?

Evaluation of data can be done within different frameworks. It usually relies to a significant extent on case-by-case assessments based on expert judgment. However, there have also been attempts to make the evaluation process more structured. Such an approach can include checklists or even pre-defined evaluation criteria. A major advantage of using a structured way of evaluating data is increased transparency and predictability of the risk assessment process. For instance, both a checklist and pre-defined criteria will contribute to ensuring that at least a minimum and similar set of aspects are considered in each evaluation. Pre-defined evaluation criteria may also contribute to increased transparency of the evaluation process to the extent that these criteria are clearly reported to the relevant parties. Disadvantages of using pre-defined evaluation criteria or checklists are that they are obviously less flexible and need to focus on the general aspects of a study. In general, there is a need to strike a balance between flexibility and predictability in the data quality evaluation process; it will always include an element of expert judgment, but it is also, in our view, important to continuously seek to increase the predictability and transparency of this process.
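
As a minimal illustration of what such a structured evaluation could look like in practice (our own sketch, not one of the published methods; the criterion names are invented), pre-defined criteria can be represented as data, with each answer recorded together with a note, so that every evaluation covers the same minimum set of aspects while the expert judgment behind each answer remains visible:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Criterion:
        text: str        # e.g., "Appropriate controls included"
        mandatory: bool  # pre-defined weight: "must" versus optional

    # A pre-defined checklist guarantees a minimum, shared set of aspects.
    CHECKLIST = [
        Criterion("Test substance identified", mandatory=True),
        Criterion("Appropriate controls included", mandatory=True),
        Criterion("Photoperiod reported", mandatory=False),
    ]

    def acceptable(answers):
        """answers maps Criterion -> (fulfilled, note); a study is only
        acceptable if all mandatory criteria are fulfilled."""
        return all(answers[c][0] for c in CHECKLIST if c.mandatory)

    answers = {c: (False, "not reported in the paper") for c in CHECKLIST}
    print(acceptable(answers))  # False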

Aim

The overall aim of this study was to investigate if the reliability of non-standard ecotoxicity data can be evaluated systematically in environmental risk assessments of pharmaceutical substances. Our hypothesis was that evaluation and reporting criteria can contribute to making the evaluation more systematic, predictable, and transparent, and facilitate the use of non-standard data for risk assessment purposes.

Method

This study is divided into two parts: (1) an evaluation of the usefulness of four methods for reliability evaluation of test data that have been proposed in the scientific literature, and (2) an investigation of whether recently published non-standard ecotoxicity studies from the open scientific literature fulfill these reliability criteria.

Evaluation of existing reliability evaluation methods

The four evaluation methods used in this study are described by Klimisch et al. [10], Durda and Preziosi [11], Hobbs et al. [12], and Schneider et al. [13]. The reporting requirements from the OECD guidelines 201, 210, and 211 were used as a reference in the evaluation of the four methods. The reporting requirements were merged and generalized into 37 criteria so that they could be applied to all types of endpoints and organisms.

Investigation of whether published non-standard ecotoxicity studies fulfill proposed reliability criteria

The non-standard test data evaluated in this study (presented in Table 2 [14-22]) were selected for the current analyses because they were either (1) used in risk assessments of active pharmaceutical ingredients within the Swedish environmental classification and information system for pharmaceuticals [[23]; available at http://www.fass.se] or (2) used in a previous evaluation of this classification system [24]. These selection criteria resulted in a total of nine references, which were then evaluated according to the four methods. Some references contain several ecotoxicological studies, but only the part of the reference relevant for the chosen effect value was considered in this evaluation. Some of the evaluated studies may have been conducted according to a standardized method, but since this is not reported in the reference, such studies are treated as non-standard tests.

Table 2 Overview of the non-standard effect data evaluated in this study.

Results

The results section is divided into the following sections: hypothesis and endpoints (Table 3), protocol (Table 4), test substance (Table 5), test environment (Table 6), dosing system (Table 7), test species (Table 8), controls (Table 9), statistical design (Table 10), and biological effect (Table 11). Each section is reported and discussed in two parts: (1) an evaluation of the usefulness of existing proposed criteria for reliability evaluation of test data, and (2) an investigation of whether the evaluated non-standard studies fulfill the proposed reliability criteria. The two parts are also summarized at the end (the Summary of the evaluation of existing reliability evaluation methods and Summary of the reliability evaluation of the non-standard test data sections).

Table 3 Modified evaluation criteria/questions concerning endpoints and a summary of the evaluation results.
Table 4 OECD's reporting requirements, modified evaluation criteria/questions concerning protocol, and a summary of the evaluation results.
Table 5 OECD's reporting requirements, modified evaluation criteria/questions concerning test substance, and a summary of the evaluation results.
Table 6 OECD's reporting requirements, modified evaluation criteria/questions concerning test environment, and a summary of the evaluation results.
Table 7 OECD's reporting requirements, modified evaluation criteria/questions concerning dosing system, and a summary of the evaluation results.
Table 8 OECD's reporting requirements, modified evaluation criteria/questions concerning test species, and a summary of the evaluation results.
Table 9 OECD's reporting requirements, modified evaluation criteria/questions concerning controls, and a summary of the evaluation results.
Table 10 OECD's reporting requirements, modified evaluation criteria/questions concerning statistical design, and a summary of the evaluation results.
Table 11 OECD's reporting requirements, modified evaluation criteria/questions concerning biological effect, and a summary of the evaluation results.

Hypothesis and endpoint

Usefulness of proposed criteria

All four evaluation methods consider that study endpoints should be stated and described. Durda and Preziosi [11] also consider a stated hypothesis important for studies where NOEC or LOEC (lowest observed effect concentration) values are identified, and include a relevance criterion by asking whether the chosen endpoint is appropriate for the hypothesis. No OECD guideline criteria matched this category.

Result of the study evaluations

The nine selected studies report NOEC/LOEC values and EC50 values, but none of them clearly stated a hypothesis. Instead, endpoints were described, in some cases very thoroughly and in other cases hardly at all. Describing why a specific endpoint was used helps clarify the importance of the conducted study. It should be noted that it has been argued that hypothesis testing should be replaced by dose-response analysis when deriving ecotoxicological benchmarks [25-28].

Protocol

Usefulness of proposed criteria

Both Klimisch et al. [10] and Durda and Preziosi [11] have evaluation criteria that are broad and imprecise, which opens them up to a variety of interpretations and opinions.

The Schneider et al. [13] evaluation question is also broad and concerns relevance rather than reliability. In the accompanying guidance material to the question, a variety of aspects are included: the chosen test system and its applicability domain, consideration of the physicochemical properties and stability of the test substance, the number of replicates, the number of concentration levels and their range and spread, the suitability of the administration method, the inclusion of all relevant endpoints, and the statistical evaluation. The method would benefit from separating these aspects into several questions.

Result of the study evaluations

For three of the selected studies, a clear description of the test procedure was lacking, and the majority of the studies would benefit from improved reporting. The chosen study designs were in all cases relevant to the data sought.

Test substance

Usefulness of proposed criteria

Identification of the test compound, its source, physicochemical properties such as purity and stability, and other substances used are factors related to the "test compound" that the four evaluation methods together consider important, but none of the methods covers all of them. Purity is the only factor covered by all four methods. Water solubility is not mentioned explicitly but can be read into the Schneider et al. [13] method. The OECD guidelines do not list any factors in addition to those mentioned by the four methods.

Result of the study evaluations

The purity of the test substance was not stated in six of the nine studies. Neither the stability nor the by-products of the test substance were stated in any of the studies. Other factors such as vapor pressure, water solubility, the octanol/water partitioning coefficient (log Kow), and the bioconcentration factor were missing in several or all of the studies.

Test environment

Usefulness of proposed criteria

Oxygen concentration, conductivity, temperature, pH, water hardness, salinity, light intensity, photoperiod, physical structure of the test chamber, test media, test organism density, food composition and food availability are abiotic and biotic factors that the four methods specify in their evaluation schemes. Several of the reported factors could modify the toxicity of chemicals [29]. None of the four evaluation methods include all factors. The OECD guidelines recommend that light quality, residual chlorine levels, total organic carbon (TOC), and chemical oxygen demand (COD) are reported in addition to the other factors.

Schneider et al. [13] did not specify which factors related to the test environment should be evaluated and only consider them relevant for repeated-dose toxicity studies. Klimisch et al. [10] gather the factors into only two criteria, while Durda and Preziosi [11] and Hobbs et al. [12] have divided the factors into several criteria/questions. Having one factor per criterion/question facilitates the evaluation since only one thing at a time has to be considered. Durda and Preziosi [11] have, unlike Klimisch et al. [10] and Hobbs et al. [12], made feeding protocols for long-term tests a "must" criterion.

Result of the study evaluations

None of the nine studies reported all "test environment" factors. Temperature was reported more often than the other factors, whereas pH and dissolved oxygen (DO) were the factors that most of the studies failed to report. In some studies, information was given on the conditions for the cultivation/breeding stock but not for the experimental setup.

Dosing system

Usefulness of proposed criteria

Administered concentrations, concentration control analysis, administration route, frequency and duration of exposure, and information on the dosing type are aspects related to the "dosing system" that the four evaluation methods together consider important. Durda and Preziosi [11] is the only method that covers all aspects. Both Durda and Preziosi [11] and Schneider et al. [13] consider administered concentrations, administration route, and frequency and duration of exposure as "must" criteria. The OECD guidelines also recommend that the date of the start of the test, the method of preparation of stock solutions, the recovery efficiency of the method, and the limit of quantification in the test matrix be reported.

Result of the study evaluations

The tested concentrations were not stated in two studies, and in two other studies only the concentration range was reported. Concentration control analyses were performed in five of the nine studies. The administration routes were rarely stated explicitly, possibly because they are considered self-evident in aquatic toxicity testing.

Test species

Usefulness of proposed criteria

Identification of the test species, number of individuals, investigated period of the life cycle, reproductive condition, sex, strain, source, and body weight, length, or mass are aspects related to the "test species" that the four evaluation methods together consider important. None of the methods covers all aspects. Only Schneider et al. [13] have a "must" criterion: identification of the test species. The OECD guidelines also recommend that the culture conditions and the methods of collecting the test species be described.

Result of the study evaluations

Complete information about the test organism was missing in all nine studies.

Controls

Usefulness of proposed criteria

The reported aspects connected to "controls" are: use of control (positive, negative and/or solvent), acceptability criteria, control media identical to test media in all respect except the treatment variable, and origin of the control and test organisms. Only Durda and Preziosi [11] cover all these aspects. Klimisch et al. [10] does not mention any of the aspects. Both Durda and Preziosi [11] and Schneider et al. [13] have "must" criteria/questions. The four methods all together covered the same aspects as the OECD guidelines.

When it comes to acceptability criteria for the test, e.g., control mortality, two different approaches can be seen. Hobbs et al. [12] only require that the acceptability criteria are stated, while Durda and Preziosi [11] and Schneider et al. [13] ask whether the results connected to the acceptability criteria are reliable or acceptable. Durda and Preziosi [11] provide a percentage limit for control mortality, while Schneider et al. [13], in the guidance material, ask whether the variability of the results and controls was acceptable and whether the control values were within a reasonable range.

Result of the study evaluations

In two of the evaluated studies, the authors did not report whether controls were used. Acceptability criteria were not stated in any of the studies, but since Schneider et al. [13] ask whether the variability of the results and controls was acceptable, one of the studies was considered to fulfill this evaluation question. The origin of the control and test organisms, and whether the control media were identical to the test media in all respects except the treatment variable, was not specifically stated in any of the studies. A possible reason could be that this is considered self-evident.

Statistical design

Usefulness of proposed criteria

The reported aspects connected to "statistical design" are: statistical method and results including significance levels and estimates of variability, (sufficient) sample size and replicates, randomized treatments and independent observations. None of the methods cover all aspects. The four methods all together covered the same aspects as the OECD guidelines. Schneider et al. [13] has a "must" criterion: the number of animals has to be stated. Hobbs et al. [12] and Durda and Preziosi [11] instead ask for at least two replicates or sufficient sample size/replicates.

Result of the study evaluations

The statistical model used was not stated in two studies, and the results from the statistical calculations were missing in another. The significance level or the estimate of variability was missing in five studies. Eight of the nine studies used a sufficient sample size, but none stated whether treatments were randomized or how observations were made. A possible reason for not reporting this could be that it is considered self-evident, or that the researchers used traditional and widely accepted statistical methods.

Biological effect

Usefulness of proposed criteria

The reported aspects connected to "biological effect" are: stated and quantified effects for all endpoints, concentration-response relationship, and whether the results have been reproduced by others or are consistent with other findings. Durda and Preziosi [11] have covered all aspects. The OECD guidelines also recommend that calculated response variables for each treatment replicate, with mean values and coefficient of variation for replicates is reported.

Result of the study evaluations

All but one study had a determined EC50, LOEC, NOEC, or NOEL (no observed effect level) value. No concentration-response relationship was reported in one study, and the relationship was unclear in three others.

Summary of the evaluation of existing reliability evaluation methods

The four reliability evaluation methods are described and compared in Table 12. The results from the evaluation are discussed in three parts: scope; how evaluation criteria are weighted and summarized; and user friendliness. Conclusions from this evaluation are presented in the last section.

Table 12 Description and comparison of the four reliability evaluation methods.

Scope

The four methods differ in their scope. Durda and Preziosi [11] have twice as many criteria as the other methods. Still, the four methods all include criteria from the same categories with the exception of Klimisch et al. [10] and Hobbs et al. [12] that lack criteria concerning controls and protocol, respectively. The Schneider et al. [13] method also includes aspects that are related to relevance.

The criteria vary in extent and specificity; e.g., Hobbs et al. [12] have one criterion in the test species category while Durda and Preziosi [11] use eight different criteria for the same issue (see Table 8). A disadvantage of broad or unspecific criteria is that aspects can be overlooked. A disadvantage of overly precise criteria is decreased flexibility.

Another example concerns criteria about dose/concentration. Hobbs et al. [12] and Klimisch et al. [10] do not explicitly state that the doses should be reported. The Schneider et al. [13] criteria could be interpreted in different ways; it is not clear whether nominal or measured concentrations are required.

There are also examples of criteria where a minimum level is specified; e.g., the Hobbs et al. [12] criteria ask whether each control and chemical concentration is at least duplicated. Others leave more to the evaluator by asking whether the sample size and number of replicates are sufficient (see [11]).

When it comes to acceptability criteria, e.g., 10% mortality in the control, two different approaches can be seen: either the acceptability criteria must be stated [12], or the results connected to the acceptability criteria must be reliable or acceptable [11, 13]. The different approaches put different demands on the evaluator, as the sketch below illustrates.
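
A small sketch of the two approaches (using the illustrative 10% control mortality limit from above): the first only checks that a criterion is stated, the second checks the reported result against it.

    control_mortality = 0.07  # reported fraction of dead control organisms
    stated_limit = 0.10       # acceptability criterion stated in the study

    # Approach 1 (as in [12]): is an acceptability criterion stated at all?
    criterion_stated = stated_limit is not None

    # Approach 2 (as in [11, 13]): is the reported result acceptable
    # against the criterion? This requires judging the result itself.
    result_acceptable = control_mortality <= stated_limit
    print(criterion_stated, result_acceptable)  # True True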

Durda and Preziosi [11], Hobbs et al. [12] and Schneider et al. [13] all have unique evaluation criteria that the other methods are missing.

The method described by Durda and Preziosi [11] was developed from guidelines, and it is therefore not surprising that it has the highest resemblance to the OECD reporting requirements, matching 22 of the 37 criteria. The other three methods have each included less than half of the OECD reporting requirements in their lists of criteria.

Some of the reporting requirements in the OECD guidelines were not covered by any of the four evaluation methods: the date of the start of the test; the method of preparation of stock solutions; the recovery efficiency of the method; the limit of quantification in the test matrix; culture conditions; methods of collection; light quality; residual chlorine levels; TOC and COD; and the coefficient of variation. Several of these reporting requirements could be important for a general reliability evaluation.

How evaluation criteria are weighted and summarized

The four methods differ in how criteria are weighted and summarized.

Klimisch et al. [10] have not weighted their criteria and have not stated how to summarize the evaluation; this could result in a wide range of reliability evaluation results for the same test data. Klimisch et al. [10] have reserved the highest reliability category for studies carried out according to accepted guidelines, preferably under good laboratory practice (GLP), or for methods that are very similar to a guideline method. This means that a study with a design that differs significantly from standard test methods can never be placed in the highest reliability category.

Durda and Preziosi [11] distinguish between mandatory and optional criteria. All mandatory criteria have to be fulfilled to receive the lowest acceptable reliability category, and the highest reliability category is reserved for studies that fulfill all 40 evaluation criteria. GLP is preferred, and the highest reliability category applies to standard tests or tests closely related to standard tests. However, in practice, it is possible that a non-standard test fulfills all criteria except the one regarding GLP, regardless of whether the test is closely related to a standard test or not.

Hobbs et al. [12] have weighted the criteria by assigning scores between 0 and 10. The total score for each study is then divided by the total possible score, which varies depending on the type of chemical, test organism, and test medium; this results in a quality score and a quality class.
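
A minimal sketch of such a score-based summary follows; this is our reading of the general mechanism, and the criteria, weights, and the 0.5 class cut-off below are invented, not those of Hobbs et al. [12].

    # Score-based summary: each applicable criterion carries a maximum
    # score; the quality score is the obtained total divided by the
    # total possible score for the study type.
    def quality_score(obtained, possible):
        return sum(obtained.values()) / sum(possible.values())

    possible = {"controls": 10, "replication": 8, "chemical analysis": 6}
    obtained = {"controls": 10, "replication": 4, "chemical analysis": 0}

    score = quality_score(obtained, possible)  # 14/24, roughly 0.58
    quality_class = "acceptable" if score >= 0.5 else "unacceptable"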

Schneider et al. [13] have divided their evaluation questions into mandatory and optional. All mandatory questions have to be fulfilled to receive the lowest acceptable reliability category. The highest reliability category is reserved for studies that fulfill all mandatory evaluation questions and at least 18 of the 21 evaluation questions in total. This method differed from the others in that it was, in our evaluation, the only one to assign studies to the highest reliability category (see the Summary of the reliability evaluation of the non-standard test data section).
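
The categorical schemes can be sketched in the same spirit; the 18-of-21 threshold is the one stated above for Schneider et al. [13], while everything else is simplified.

    # Categorical summary: all mandatory questions must be fulfilled for
    # the lowest acceptable category; the highest category additionally
    # requires at least 18 of the 21 questions in total.
    def reliability_category(mandatory_fulfilled, total_fulfilled,
                             threshold=18):
        if not mandatory_fulfilled:
            return "not reliable"
        if total_fulfilled >= threshold:
            return "reliable without restrictions"
        return "reliable with restrictions"

    print(reliability_category(mandatory_fulfilled=True, total_fulfilled=19))
    # -> reliable without restrictions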

User friendliness

User friendliness is defined in this study as a method that has clear instructions and an uncomplicated procedure.

All four methods include evaluation criteria that are broad and imprecise, which opens them up to a variety of interpretations. Klimisch et al. [10] and Schneider et al. [13] use compound criteria, i.e., criteria that include several aspects. Having more delimited criteria/questions facilitates the evaluation since only one aspect at a time has to be considered.

Additional information that complements the evaluation criteria/questions makes an evaluation method easier to use. Both Durda and Preziosi [11] and Schneider et al. [13] provide useful additional guidance.

The Klimisch method [10] lacks information on how to summarize the evaluation, which complicates the work of the evaluator. Schneider et al. [13] summarize the results automatically in a pre-formatted Excel sheet. Both Durda and Preziosi [11] and Hobbs et al. [12] apply manual summarization of the evaluations.

Conclusions from the evaluation of existing reliability evaluation methods

The evaluation methods differ in their scope, their user friendliness, and how their criteria are weighted and summarized. Depending on the evaluator's previous experience and knowledge, the outcomes of the different methods can therefore differ. It is important for evaluators to be aware of the different methods' strengths and limitations.

Durda and Preziosi [11] provide the method with the broadest scope, and it also had the highest resemblance to the OECD guidelines. Durda and Preziosi [11], Hobbs et al. [12], and Schneider et al. [13] differ in how evaluation criteria are weighted and summarized, but all three methods are functional and understandable. Durda and Preziosi [11] and Schneider et al. [13] both provide useful guidance information to the risk assessor, which enhances user friendliness.

Summary of the reliability evaluation of the non-standard test data

All four methods require some degree of expert judgment. They were developed to help risk assessors evaluate data, not to replace the risk assessor. It is therefore likely that two experts evaluating the same study will end up with slightly different results depending on their expertise and previous experience. In our evaluation, we have strived to treat the evaluation methods and the selected studies uniformly. The evaluation method described by Klimisch et al. [10] requires more expert judgment when the evaluation is summarized, since instructions for this are lacking.

A striking result of this exercise is that many of the aspects considered important in the different evaluation methods are not reported by the authors of the selected studies. Examples of aspects often omitted are information about the controls, results from statistical evaluations, whether there is a dose-response relationship or not, the tested concentrations, and a clear description of the test environment. To safeguard against under-reporting, we recommend using a checklist containing all applicable reliability criteria.

Overall, the evaluation of the nine selected non-standard tests resulted in a low number of studies with acceptable reliability. The nine selected studies were evaluated with the four different methods, resulting in 36 evaluations. Only 14 (39%) of these resulted in a rating of acceptable quality, reliable with restrictions, or reliable without restrictions (Table 13).

Table 13 Summary of the reliability evaluation of non-standard test data.

Furthermore, the results from the four evaluation methods differed surprisingly often. The four methods led to the same evaluation result for only two studies [14, 15]; both were rated as unacceptable quality/not reliable. The evaluation results differed by one quality level, from unacceptable quality/not reliable to acceptable quality/reliable with restrictions, for five studies [16-20], and by two quality levels, from unacceptable quality/not reliable to high quality/reliable without restrictions, for two studies [21, 22] (Table 13).

Durda and Preziosi [11] did not accept any of the studies since the mandatory criterion of acceptable control mortality/morbidity was not reported. Hobbs et al. [12] have a similar criterion, but since it is not mandatory it does not have the same effect on the summarized evaluations.

Other reasons why the reliability was considered unsatisfactory according to one or more evaluation methods were lack of information about: chemical concentration control analyses, physical and chemical test conditions, the specification of the test substance, a clear description of the test procedure, and the investigated period of the life cycle of the test organisms.

Discussion and conclusions

Standard test data are still preferred for regulatory environmental risk assessments of pharmaceutical substances. Accepting non-standard test data is likely to increase the regulatory agencies' workload, since these data are more complicated to evaluate than data generated under standards. More structured evaluation methods can help risk assessors and evaluators use non-standard test data. However, as we have shown in this study, the design of the evaluation method is crucial, since it can significantly affect the outcome of the evaluation.

The evaluation methods scrutinized in this study all require expert judgment. In our view, it is neither possible nor desirable to develop a method that completely leaves out expert judgment, but we can strive towards a method that reduces vagueness and elements of case-by-case interpretation. Both Hobbs et al. [12] and Schneider et al. [13] tested and modified their respective methods during the development process in order to increase the likelihood that evaluators arrive at similar conclusions.

The actual use of the four evaluation methods is unclear, but Klimisch et al. [10] have been cited 62 times, Schneider et al. [13] seven times, Durda and Preziosi [11] twice, and Hobbs et al. [12] once (ISI Web of Knowledge, 2010-11-02). The method described by Hobbs et al. [12] has, according to the authors themselves, been used by several Australian and New Zealand authorities.

The problem with non-standard data found to have low reliability could be either that the studies are poorly performed or that they are under-reported. Under-reporting could be a consequence of journals' desire to publish concise papers. However, an increasing number of journals allow additional data to be included as supplementary electronic information, which means that this should not be a major obstacle to making such information publicly available. It is also well known that environmental factors, such as oxygen saturation, salinity, pH, hardness, and temperature, can have a drastic impact on the uptake and effects of chemical substances [e.g., [30-32]]; therefore, as prescribed by most standard protocols, these aspects need to be monitored and reported in order to ensure the reproducibility of the test data, a key issue in the scientific process.

For the nine studies investigated in the present paper, under-reporting could very well be a significant reason for the evaluation outcome. Aspects like the use of controls, results from statistical evaluations, whether there is a concentration-response relationship or not, the tested concentrations, and a clear description of the test organism and test environment are important and should be included in all publications presenting ecotoxicity data. Reliability evaluation methods can be used as checklists by authors and reviewers to ensure that all important aspects of the test method are included in their reports. A more structured reporting format could ensure the reliability of the test data without limiting the researcher's creativity in designing a non-standard study.

As it is today, data with low reliability will not be included in regulatory environmental risk assessments. We believe that none of the nine selected non-standard studies could have been used in an environmental risk assessment of pharmaceuticals according to the EMA guideline. However, we still see that the studies could contribute to the risk assessment as supporting information. We are currently developing a new proposal for how to report and evaluate ecotoxicity data from the open scientific literature in regulatory risk assessment of pharmaceuticals. The new set of criteria is being developed in collaboration between regulators at the German Federal Environment Agency (UBA) and researchers within the Swedish research program MistraPharma (http://www.mistrapharma.se) [33]. The criteria are based on the four methods evaluated in this study and the OECD reporting requirements, and have been further developed to cover both the reliability and the relevance of test data. The intended users are risk assessors and researchers performing ecotoxicological experiments, but the criteria can also be used for educational purposes and in the peer-review process for scientific papers. This approach intends to bridge the gap between the regulators' and the scientists' needs and ways of working.

Finally, it is important to remember that much of the research done within the field of ecotoxicology and risk assessment is financed through taxpayers' money; not finding a way to use these data in risk assessments would be an inefficient and irresponsible handling of resources.