Background

Environmental risk assessment of pharmaceuticals

In 2006, the European Medicines Agency (EMA) decided that all new marketing authorisation applications for human pharmaceuticals should be accompanied by an environmental risk assessment [1]. The EMA risk assessment follows a PEC/PNEC (predicted environmental concentration/predicted no effect concentration) approach and is divided into two phases. In phase II, data on the substance's physicochemical properties, persistence and bioaccumulation, and ecotoxicity are reviewed and the PNEC is estimated. All relevant data should be taken into account. Experimental studies should preferably follow standard test protocols, but it is recognized that there are other acceptable methods. However, their use should be justified, and studies should be conducted in compliance with good laboratory practice (GLP) [1].
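
To make the PEC/PNEC logic concrete, the following minimal Python sketch derives a risk quotient from invented numbers; applying an assessment factor of 10 to the lowest of three long-term NOECs reflects common practice rather than the exact EMA procedure.

    # Minimal PEC/PNEC sketch (illustrative values only).
    def pnec(noecs_ug_per_l, assessment_factor=10):
        """PNEC = lowest NOEC divided by an assessment factor."""
        return min(noecs_ug_per_l) / assessment_factor

    def risk_quotient(pec_ug_per_l, pnec_ug_per_l):
        """RQ = PEC/PNEC; RQ >= 1 indicates a potential risk."""
        return pec_ug_per_l / pnec_ug_per_l

    # Hypothetical long-term NOECs for algae, daphnia, and fish (ug/L).
    noecs = [4.6, 1.2, 0.8]
    rq = risk_quotient(pec_ug_per_l=0.05, pnec_ug_per_l=pnec(noecs))
    print(f"RQ = {rq:.2f}")  # RQ = 0.62, i.e., below 1 in this sketch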

Standard and non-standard ecotoxicity tests

Ecotoxicological testing can be done using a variety of methods and models. There are two general approaches: using standard or non-standard testing methodologies. Standard tests are tests performed and reported according to a method described and provided by an official international or national harmonization or standardization organization, such as the OECD (Organisation for Economic Co-operation and Development), US EPA (United States Environmental Protection Agency), ASTM (American Society for Testing and Materials), AFNOR (Association Française de Normalisation), or ISO (International Organization for Standardization). The test standard establishes a uniform specification of the experimental setup and execution, the methods for data analysis, and the reporting format for the test data. Non-standard tests, on the other hand, are tests performed according to any other test method.

Regardless of whether a test is performed according to a standard, it should meet some general scientific quality criteria to demonstrate the reliability and reproducibility of the test results. Examples of such criteria are a clear description of the endpoints, inclusion of appropriate controls, appropriate identification of the test substance and test organism, a stated exposure duration and administration route, and transparent reporting of effect concentrations.

The major advantages of using standard tests are that the results are directly comparable across substances and that the data they generate will be readily accepted across jurisdictions. Test guidelines also help promote the reliability of the data: the detailed test procedures and extensive reporting they require make it easier to repeat an experiment if needed. The major disadvantage of standard test methods is that they do not always represent the most biologically relevant testing approach for the type of endpoint under investigation. Results from non-standardized tests may therefore in some cases be more sensitive and thereby contribute additional and significant information to a risk assessment. Other disadvantages of standard tests are that they are inflexible, leaving no room for case-by-case adjustments, and that developing a new standard test may take up to 10-15 years.

Given the characteristics and purposes of standard tests, it is not surprising that they are mostly performed by commercial laboratories, while non-standardized methods are typically performed by research scientists and published in scientific journals. Standard tests are often performed according to GLP, whereas non-standard tests seldom are. The primary objective of the OECD Principles of Good Laboratory Practice is to ensure the generation of high-quality and reliable test data related to the safety of chemical substances and preparations [2]. However, concerns have also been raised as to whether GLP is synonymous with good scientific practice, accurate reporting, and valid data [3, 4].

Testing for environmental effects caused by pharmaceutical substances

Pharmaceutical substances have a number of inherent properties that make them interesting from a regulatory perspective. First, pharmaceuticals are carefully designed to interact with biological processes. Second, this interaction should be as specific as possible, ideally influencing only one well-defined target molecule or cellular process, with as few side effects as possible. Third, this interaction should be achieved at low concentrations, meaning that the substance has to be relatively potent. Fourth, to achieve this, the active pharmaceutical ingredient must be sufficiently persistent to remain unmetabolized long enough to reach the target organ in the human body.

It is fundamental for risk identification, and thus a crucial part of the risk assessment process, to have toxicity test methods that are adapted to their purpose. Currently available standard test methods for deriving regulatory toxicity data for the aquatic environment are in many cases not sufficiently sensitive to the very specific types of effects that can be expected from pharmaceutical substances [see e.g., [5]]. The EMA guideline [1] recommends that standard tests measuring growth inhibition and reproduction failure (OECD test guidelines 201, 210, and 211) be used in environmental risk assessment of pharmaceutical substances. However, test data are for many pharmaceuticals still limited or not publicly available. The sex hormone ethinylestradiol is one of few substances for which a significant amount of both standard and non-standard test data is available. Table 1 presents the lowest reported standard and non-standard effect values for ethinylestradiol (according to the Wikipharma database [6] and the environmental classification system at fass.se [7]), both no-observed effect concentration (NOEC) values and EC50 values (the lowest identified concentration at which 50% of the tested population is affected).

Table 1 The lowest publicly available standard and non-standard effect values for ethinylestradiol.

When comparing the toxicity values, the non-standard NOEC value is 32 times lower than the standard NOEC value, and the non-standard EC50 value is over 95,000 times lower than the standard EC50 value. Ethinylestradiol can therefore be seen as an example where non-standard tests with more substance-specific endpoints are more sensitive than the standard tests.
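
The fold differences above are simple ratios of the reported effect concentrations; the sketch below only illustrates the arithmetic, with placeholder values rather than the actual Table 1 data.

    # Fold difference between standard and non-standard effect values
    # (placeholder values, not the Table 1 data).
    standard_noec = 1.0            # hypothetical standard NOEC (ug/L)
    non_standard_noec = 1.0 / 32   # hypothetical non-standard NOEC (ug/L)

    fold = standard_noec / non_standard_noec
    print(f"non-standard NOEC is {fold:.0f} times lower")  # 32 times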

There is a need to carefully evaluate the regulatory process of identifying pharmaceuticals that might pose a risk to non-target species in the aquatic environment, and to make sure that relevant and sufficiently sensitive tests are used in the regulatory environmental risk assessment of pharmaceuticals. As we see it, there are at least three ways forward: (1) to develop new standard ecotoxicity tests better suited for pharmaceuticals, (2) to adjust existing standard tests by supplementing them with additional endpoints relevant for different pharmacological modes of action, or (3) to increase the use of non-standard tests for risk assessment purposes.

The development of new test standards is costly and may take 10 to 15 years [8]. Since pharmaceuticals are a diverse group of substances in terms of how they affect biological processes, it is unlikely that new standards covering all relevant endpoints could be developed in the near future.

Adjustments of current standard tests could increase their biological relevance for testing pharmaceutical substances. A potential way forward could therefore be that the standardization organizations initiate additional validation and expert commenting rounds to standardize such adjustments. Still, such minor additions to existing standards would likely not be sufficient to ensure that the specific biological effects of most pharmaceuticals are covered by the tests.

Hence, in our view, an important and realistic way forward is to make increased use of non-standard test data to ensure a scientifically well-founded environmental risk assessment of pharmaceuticals. To enable the use of non-standard tests in risk assessments, two things are needed: legislation designed so that non-standard tests can be included in a systematic and predictable way, and non-standard tests reported in a transparent and comprehensive way, much as is required for the standard test methods.

Reliability and relevance evaluation of (eco-)toxicity data

According to the TGD [9], an evaluation of data reliability should assess "the inherent quality of a test relating to test methodology and the way that the performance and results of the test are described". Basically, this evaluation should answer the question: has the experiment generated and reported a true and correct result?

The assessment of the relevance of the data should describe "the extent to which a test is appropriate for a particular hazard or risk assessment" [9], e.g., answer questions like: Is the measured endpoint a valid indicator of environmental risk? Is the experimental model sufficiently sensitive for detecting the expected effects? Does the experimental model have sufficient statistical power? How representative is the experimental model of the environment that is to be protected?

Evaluation of data can be done within different frameworks. It usually relies to a significant extent on case-by-case assessments based on expert judgment. However, there have also been attempts to make the evaluation process more structured. Such an approach can include checklists or even pre-defined evaluation criteria. A major advantage of using a structured way of evaluating data is increased transparency and predictability of the risk assessment process. For instance, both a checklist and pre-defined criteria will contribute to ensuring that at least a minimum and similar set of aspects are considered in each evaluation. Pre-defined evaluation criteria may also contribute to increased transparency of the evaluation process to the extent that these criteria are clearly reported to the relevant parties. Disadvantages of using pre-defined evaluation criteria or checklists are that they are obviously less flexible and need to focus on the general aspects of a study. In general, there is a need to strike a balance between flexibility and predictability in the data quality evaluation process; it will always include an element of expert judgment, but it is also, in our view, important to continuously seek to increase the predictability and transparency of this process.
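
As a minimal illustration of what such a structured evaluation could look like in practice (our own sketch, not one of the published methods; the criterion names are invented), pre-defined criteria can be represented as data, with each answer recorded together with a note, so that every evaluation covers the same minimum set of aspects while the expert judgment behind each answer remains visible:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Criterion:
        text: str        # e.g., "Appropriate controls included"
        mandatory: bool  # pre-defined weight: "must" versus optional

    # A pre-defined checklist guarantees a minimum, shared set of aspects.
    CHECKLIST = [
        Criterion("Test substance identified", mandatory=True),
        Criterion("Appropriate controls included", mandatory=True),
        Criterion("Photoperiod reported", mandatory=False),
    ]

    def acceptable(answers):
        """answers maps Criterion -> (fulfilled, note); a study is only
        acceptable if all mandatory criteria are fulfilled."""
        return all(answers[c][0] for c in CHECKLIST if c.mandatory)

    answers = {c: (False, "not reported in the paper") for c in CHECKLIST}
    print(acceptable(answers))  # False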

Aim

The overall aim of this study was to investigate if the reliability of non-standard ecotoxicity data can be evaluated systematically in environmental risk assessments of pharmaceutical substances. Our hypothesis was that evaluation and reporting criteria can contribute to making the evaluation more systematic, predictable, and transparent, and facilitate the use of non-standard data for risk assessment purposes.

Method

This study is divided into two parts: (1) an evaluation of the usefulness of four methods for reliability evaluation of test data that have been proposed in the scientific literature, and (2) an investigation of whether recently published non-standard ecotoxicity studies from the open scientific literature fulfill these reliability criteria.

Evaluation of existing reliability evaluation methods

The four evaluation methods used in this study are described by Klimisch et al. [10], Durda and Preziosi [11], Hobbs et al. [12], and Schneider et al. [13]. The reporting requirements from the OECD guidelines 201, 210, and 211 were used as a reference in the evaluation of the four methods. The reporting requirements were merged and generalized into 37 criteria so that they could be applied to all types of endpoints and organisms.

Investigation of whether published non-standard ecotoxicity studies fulfill proposed reliability criteria

The non-standard test data evaluated in this study (presented in Table 2 [14-22]) were selected for the current analyses because they were either (1) used in risk assessments of active pharmaceutical ingredients within the Swedish environmental classification and information system for pharmaceuticals [[23]; available at http://www.fass.se] or (2) used in a previous evaluation of this classification system [24]. These selection criteria resulted in a total of nine references, which were then evaluated according to the four methods. Some references contain several ecotoxicological studies, but only the part of the reference relevant for the chosen effect value was considered in this evaluation. Some of the evaluated studies may have been conducted according to a standardized method, but since this is not reported in the reference, such studies are treated as non-standard tests.

Table 2 Overview of the non-standard effect data evaluated in this study.

Results

The results section is divided into the following sections: hypothesis and endpoints (Table 3), protocol (Table 4), test substance (Table 5), test environment (Table 6), dosing system (Table 7), test species (Table 8), controls (Table 9), statistical design (Table 10), and biological effect (Table 11). Each section is reported and discussed in two parts: (1) an evaluation of the usefulness of existing proposed criteria for reliability evaluation of test data, and (2) an investigation of whether the evaluated non-standard studies fulfill the proposed reliability criteria. The two parts are also summarized at the end (the Summary of the evaluation of existing reliability evaluation methods and Summary of the reliability evaluation of the non-standard test data sections).

Table 3 Modified evaluation criteria/questions concerning endpoints and a summary of the evaluation results.
Table 4 OECD's reporting requirements, modified evaluation criteria/questions concerning protocol, and a summary of the evaluation results.
Table 5 OECD's reporting requirements, modified evaluation criteria/questions concerning test substance, and a summary of the evaluation results.
Table 6 OECD's reporting requirements, modified evaluation criteria/questions concerning test environment, and a summary of the evaluation results.
Table 7 OECD's reporting requirements, modified evaluation criteria/questions concerning dosing system, and a summary of the evaluation results.
Table 8 OECD's reporting requirements, modified evaluation criteria/questions concerning test species, and a summary of the evaluation results.
Table 9 OECD's reporting requirements, modified evaluation criteria/questions concerning controls, and a summary of the evaluation results.
Table 10 OECD's reporting requirements, modified evaluation criteria/questions concerning statistical design, and a summary of the evaluation results.
Table 11 OECD's reporting requirements, modified evaluation criteria/questions concerning biological effect, and a summary of the evaluation results.

Hypothesis and endpoint

Usefulness of proposed criteria

All four evaluation methods consider that study endpoints should be stated and described. Durda and Preziosi [11] also consider a stated hypothesis important for studies where NOEC or LOEC (lowest observed effect concentration) values are identified, and include a relevance criterion by asking whether the chosen endpoint is appropriate for the hypothesis. No OECD guideline criteria matched this category.

Result of the study evaluations

The nine selected studies report NOEC/LOEC values and EC50 values, but none of them clearly stated a hypothesis. Instead, endpoints were described, in some cases very thoroughly and in other cases hardly at all. Describing why a specific endpoint was used helps clarify the importance of the conducted study. It should be noted that it has been argued that hypothesis testing should be replaced by dose-response analysis when deriving ecotoxicological benchmarks [25-28].

Protocol

Usefulness of proposed criteria

Both Klimisch et al. [10] and Durda and Preziosi [11] have evaluation criteria that are broad and imprecise, which opens them up to a variety of interpretations and opinions.

The Schneider et al. [13] evaluation question is also broad and concerns relevance rather than reliability. In the accompanying guidance material to the question, a variety of aspects are included: the chosen test system and its applicability domain, consideration of the physicochemical properties and stability of the test substance, the number of replicates, the number of concentration levels and their range and spread, the suitability of the administration method, the inclusion of all relevant endpoints, and the statistical evaluation. The method would benefit from separating these aspects into several questions.

Result of the study evaluations

For three of the selected studies, a clear description of the test procedure was lacking, and the majority of the studies would benefit from improved reporting. The chosen study designs were in all cases relevant to the data sought.

Test substance

Usefulness of proposed criteria

Identification of the test compound, its source, physicochemical properties such as purity and stability, and other substances used are factors related to the "test compound" that the four evaluation methods together consider important, but none of the methods covers all of them. Purity is the only factor covered by all four methods. Water solubility is not mentioned explicitly but can be read into the Schneider et al. [13] method. The OECD guidelines do not list any factors in addition to those mentioned by the four methods.

Result of the study evaluations

The purity of the test substance was not stated in six of the nine studies. Neither the stability nor the by-products of the test substance were stated in any of the studies. Other factors such as vapor pressure, water solubility, the octanol/water partitioning coefficient (log Kow), and the bioconcentration factor were missing in several or all of the studies.

Test environment

Usefulness of proposed criteria

Oxygen concentration, conductivity, temperature, pH, water hardness, salinity, light intensity, photoperiod, physical structure of the test chamber, test media, test organism density, food composition and food availability are abiotic and biotic factors that the four methods specify in their evaluation schemes. Several of the reported factors could modify the toxicity of chemicals [29]. None of the four evaluation methods include all factors. The OECD guidelines recommend that light quality, residual chlorine levels, total organic carbon (TOC), and chemical oxygen demand (COD) are reported in addition to the other factors.

Schneider et al. [13] did not specify which factors related to the test environment should be evaluated and only consider them relevant for repeated-dose toxicity studies. Klimisch et al. [10] gather the factors into only two criteria, while Durda and Preziosi [11] and Hobbs et al. [12] have divided the factors into several criteria/questions. Having one factor per criterion/question facilitates the evaluation since only one thing at a time has to be considered. Durda and Preziosi [11] have, unlike Klimisch et al. [10] and Hobbs et al. [12], made feeding protocols for long-term tests a "must" criterion.

Result of the study evaluations

None of the nine studies reported all "test environment" factors. Temperature was reported more often than the other factors, whereas pH and dissolved oxygen (DO) were the factors that most of the studies failed to report. In some studies, information was given on the conditions for the cultivation/breeding stock but not for the experimental setup.

Dosing system

Usefulness of proposed criteria

Administered concentrations, concentration control analysis, administration route, frequency and duration of exposure, and information on the dosing type are aspects related to the "dosing system" that the four evaluation methods together consider important. Durda and Preziosi [11] is the only method that covers all aspects. Both Durda and Preziosi [11] and Schneider et al. [13] consider administered concentrations, administration route, and frequency and duration of exposure as "must" criteria. The OECD guidelines also recommend that the date of the start of the test, the method of preparation of stock solutions, the recovery efficiency of the method, and the limit of quantification in the test matrix be reported.

Result of the study evaluations

The tested concentrations were not stated in two studies, and in two other studies only the concentration range was reported. Concentration control analyses were performed in five of the nine studies. The administration routes were rarely stated explicitly, possibly because they are considered self-evident in aquatic toxicity testing.

Test species

Usefulness of proposed criteria

Identification of the test species, number of individuals, investigated period of the life cycle, reproductive condition, sex, strain, source, and body weight, length, or mass are aspects related to the "test species" that the four evaluation methods together consider important. None of the methods covers all aspects. Only Schneider et al. [13] have a "must" criterion: identification of the test species. The OECD guidelines also recommend that the culture conditions and the methods of collecting the test species be described.

Result of the study evaluations

Complete information about the test organism was missing in all nine studies.

Controls

Usefulness of proposed criteria

The reported aspects connected to "controls" are: use of control (positive, negative and/or solvent), acceptability criteria, control media identical to test media in all respect except the treatment variable, and origin of the control and test organisms. Only Durda and Preziosi [11] cover all these aspects. Klimisch et al. [10] does not mention any of the aspects. Both Durda and Preziosi [11] and Schneider et al. [13] have "must" criteria/questions. The four methods all together covered the same aspects as the OECD guidelines.

When it comes to acceptability criteria for the test, e.g., control mortality, two different approaches can be seen. Hobbs et al. [12] only require that the acceptability criteria are stated, while Durda and Preziosi [11] and Schneider et al. [13] ask whether the results connected to the acceptability criteria are reliable or acceptable. Durda and Preziosi [11] provide a percentage limit for control mortality, while Schneider et al. [13], in the guidance material, ask whether the variability of the results and controls was acceptable and whether the control values were within a reasonable range.

Result of the study evaluations

In two of the evaluated studies, the authors did not report whether controls were used. Acceptability criteria were not stated in any of the studies, but since Schneider et al. [13] ask whether the variability of the results and controls was acceptable, one of the studies was considered to fulfill this evaluation question. The origin of the control and test organisms, and whether the control media were identical to the test media in all respects except the treatment variable, was not specifically stated in any of the studies. A possible reason could be that this is considered self-evident.

Statistical design

Usefulness of proposed criteria

The reported aspects connected to "statistical design" are: statistical method and results including significance levels and estimates of variability, (sufficient) sample size and replicates, randomized treatments and independent observations. None of the methods cover all aspects. The four methods all together covered the same aspects as the OECD guidelines. Schneider et al. [13] has a "must" criterion: the number of animals has to be stated. Hobbs et al. [12] and Durda and Preziosi [11] instead ask for at least two replicates or sufficient sample size/replicates.

Result of the study evaluations

The statistical model used was not stated in two studies, and the results from the statistical calculations were missing in another. The significance level or the estimate of variability was missing in five studies. Eight of the nine studies used a sufficient sample size, but none stated whether treatments were randomized or how observations were made. A possible reason for not reporting this could be that it is considered self-evident, or that the researchers used traditional and widely accepted statistical methods.

Biological effect

Usefulness of proposed criteria

The reported aspects connected to "biological effect" are: stated and quantified effects for all endpoints, concentration-response relationship, and whether the results have been reproduced by others or are consistent with other findings. Durda and Preziosi [11] have covered all aspects. The OECD guidelines also recommend that calculated response variables for each treatment replicate, with mean values and coefficient of variation for replicates is reported.

Result of the study evaluations

All but one study had a determined EC50, LOEC, NOEC, or NOEL (no observed effect level) value. No concentration-response relationship was reported in one study, and the relationship was unclear in three others.

Summary of the evaluation of existing reliability evaluation methods

The four reliability evaluation methods are described and compared in Table 12. The results from the evaluation are discussed in three parts: scope; how evaluation criteria are weighted and summarized; and user friendliness. Conclusions from this evaluation are presented in the last section.

Table 12 Description and comparison of the four reliability evaluation methods.

Scope

The four methods differ in their scope. Durda and Preziosi [11] have twice as many criteria as the other methods. Still, the four methods all include criteria from the same categories with the exception of Klimisch et al. [10] and Hobbs et al. [12] that lack criteria concerning controls and protocol, respectively. The Schneider et al. [13] method also includes aspects that are related to relevance.

The criteria vary in extent and specificity; e.g., Hobbs et al. [12] have one criterion in the test species category while Durda and Preziosi [11] use eight different criteria for the same issue (see Table 8). A disadvantage of broad or unspecific criteria is that aspects can be overlooked. A disadvantage of overly precise criteria is decreased flexibility.

Another example concerns criteria about dose/concentration. Hobbs et al. [12] and Klimisch et al. [10] do not explicitly state that the doses should be reported. The Schneider et al. [13] criteria could be interpreted in different ways; it is not clear whether nominal or measured concentrations are required.

There are also examples of criteria where a minimum level is specified; e.g., the Hobbs et al. [12] criteria ask whether each control and chemical concentration is at least duplicated. Others leave more to the evaluator by asking whether the sample size and number of replicates are sufficient (see [11]).

When it comes to acceptability criteria, e.g., 10% mortality in the control, two different approaches can be seen: either the acceptability criteria must be stated [12], or the results connected to the acceptability criteria must be reliable or acceptable [11, 13]. The different approaches put different demands on the evaluator, as the sketch below illustrates.
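
A small sketch of the two approaches (using the illustrative 10% control mortality limit from above): the first only checks that a criterion is stated, the second checks the reported result against it.

    control_mortality = 0.07  # reported fraction of dead control organisms
    stated_limit = 0.10       # acceptability criterion stated in the study

    # Approach 1 (as in [12]): is an acceptability criterion stated at all?
    criterion_stated = stated_limit is not None

    # Approach 2 (as in [11, 13]): is the reported result acceptable
    # against the criterion? This requires judging the result itself.
    result_acceptable = control_mortality <= stated_limit
    print(criterion_stated, result_acceptable)  # True True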

Durda and Preziosi [11], Hobbs et al. [12] and Schneider et al. [13] all have unique evaluation criteria that the other methods are missing.

The method described by Durda and Preziosi [11] was developed from guidelines, and it is therefore not surprising that it has the highest resemblance to the OECD reporting requirements, matching 22 of the 37 criteria. The other three methods have each included less than half of the OECD reporting requirements in their lists of criteria.

Some of the reporting requirements in the OECD guidelines were not covered by any of the four evaluation methods: the date of the start of the test; the method of preparation of stock solutions; the recovery efficiency of the method; the limit of quantification in the test matrix; culture conditions; methods of collection; light quality; residual chlorine levels; TOC and COD; and the coefficient of variation. Several of these reporting requirements could be important for a general reliability evaluation.

How evaluation criteria are weighted and summarized

The four methods differ in how criteria are weighted and summarized.

Klimisch et al. [10] have not weighted their criteria and have not stated how to summarize the evaluation; this could result in a wide range of reliability evaluation results for the same test data. Klimisch et al. [10] have reserved the highest reliability category for studies carried out according to accepted guidelines, preferably under good laboratory practice (GLP), or for methods that are very similar to a guideline method. This means that a study with a design that differs significantly from standard test methods can never be placed in the highest reliability category.

Durda and Preziosi [11] distinguish between mandatory and optional criteria. All mandatory criteria have to be fulfilled to receive the lowest acceptable reliability category, and the highest reliability category is reserved for studies that fulfill all 40 evaluation criteria. GLP is preferred, and the highest reliability category applies to standard tests or tests closely related to standard tests. However, in practice, it is possible that a non-standard test fulfills all criteria except the one regarding GLP, regardless of whether the test is closely related to a standard test or not.

Hobbs et al. [12] have weighted the criteria by assigning scores between 0 and 10. The total score for each study is then divided by the total possible score, which varies depending on the type of chemical, test organism, and test medium; this results in a quality score and a quality class.
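
A minimal sketch of such a score-based summary follows; this is our reading of the general mechanism, and the criteria, weights, and the 0.5 class cut-off below are invented, not those of Hobbs et al. [12].

    # Score-based summary: each applicable criterion carries a maximum
    # score; the quality score is the obtained total divided by the
    # total possible score for the study type.
    def quality_score(obtained, possible):
        return sum(obtained.values()) / sum(possible.values())

    possible = {"controls": 10, "replication": 8, "chemical analysis": 6}
    obtained = {"controls": 10, "replication": 4, "chemical analysis": 0}

    score = quality_score(obtained, possible)  # 14/24, roughly 0.58
    quality_class = "acceptable" if score >= 0.5 else "unacceptable"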

Schneider et al. [13] have divided their evaluation questions into mandatory and optional. All mandatory questions have to be fulfilled to receive the lowest acceptable reliability category. The highest reliability category is reserved for studies that fulfill all mandatory evaluation questions and at least 18 of the 21 evaluation questions in total. This method differed from the others in that it was, in our evaluation, the only one to assign studies to the highest reliability category (see the Summary of the reliability evaluation of the non-standard test data section).
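
The categorical schemes can be sketched in the same spirit; the 18-of-21 threshold is the one stated above for Schneider et al. [13], while everything else is simplified.

    # Categorical summary: all mandatory questions must be fulfilled for
    # the lowest acceptable category; the highest category additionally
    # requires at least 18 of the 21 questions in total.
    def reliability_category(mandatory_fulfilled, total_fulfilled,
                             threshold=18):
        if not mandatory_fulfilled:
            return "not reliable"
        if total_fulfilled >= threshold:
            return "reliable without restrictions"
        return "reliable with restrictions"

    print(reliability_category(mandatory_fulfilled=True, total_fulfilled=19))
    # -> reliable without restrictions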

User friendliness

User friendliness is defined in this study as a method that has clear instructions and an uncomplicated procedure.

All four methods include evaluation criteria that are broad and imprecise, which opens them up to a variety of interpretations. Klimisch et al. [10] and Schneider et al. [13] use compound criteria, i.e., criteria that include several aspects. Having more delimited criteria/questions facilitates the evaluation since only one aspect at a time has to be considered.

Additional information that complements the evaluation criteria/questions makes an evaluation method easier to use. Both Durda and Preziosi [11] and Schneider et al. [13] provide useful additional guidance.

The Klimisch method [10] lacks information on how to summarize the evaluation, which complicates the work of the evaluator. Schneider et al. [13] summarize the results automatically in a pre-formatted Excel sheet. Both Durda and Preziosi [11] and Hobbs et al. [12] apply manual summarization of the evaluations.

Conclusions from the evaluation of existing reliability evaluation methods

The evaluation methods differ in their scope, their user friendliness, and how their criteria are weighted and summarized. Depending on the evaluator's previous experience and knowledge, the outcomes of the different methods can therefore differ. It is important for evaluators to be aware of the different methods' strengths and limitations.

Durda and Preziosi [11] provide the method with the broadest scope, and it also had the highest resemblance to the OECD guidelines. Durda and Preziosi [11], Hobbs et al. [12], and Schneider et al. [13] differ in how evaluation criteria are weighted and summarized, but all three methods are functional and understandable. Durda and Preziosi [11] and Schneider et al. [13] both provide useful guidance information to the risk assessor, which enhances user friendliness.

Summary of the reliability evaluation of the non-standard test data

All four methods require some degree of expert judgment. They were developed to help risk assessors evaluate data, not to replace the risk assessor. It is therefore likely that two experts evaluating the same study will end up with slightly different results depending on their expertise and previous experience. In our evaluation, we have strived to treat the evaluation methods and the selected studies uniformly. The evaluation method described by Klimisch et al. [10] requires more expert judgment when the evaluation is summarized, since instructions for this are lacking.

A striking result of this exercise is that many of the aspects considered important in the different evaluation methods are not reported by the authors of the selected studies. Examples of aspects often omitted are information about the controls, results from statistical evaluations, whether there is a dose-response relationship or not, the tested concentrations, and a clear description of the test environment. To safeguard against under-reporting, we recommend using a checklist containing all applicable reliability criteria.

Overall, the evaluation of the nine selected non-standard tests resulted in a low number of studies with acceptable reliability. The nine selected studies were evaluated with the four different methods, resulting in 36 evaluations. Only 14 (39%) of these resulted in a rating of acceptable quality, reliable with restrictions, or reliable without restrictions (Table 13).

Table 13 Summary of the reliability evaluation of non-standard test data.

Furthermore, the results from the four evaluation methods differed surprisingly often. The four methods led to the same evaluation result for only two studies [14, 15]; both were rated as unacceptable quality/not reliable. The evaluation results differed by one quality level, from unacceptable quality/not reliable to acceptable quality/reliable with restrictions, for five studies [16-20], and by two quality levels, from unacceptable quality/not reliable to high quality/reliable without restrictions, for two studies [21, 22] (Table 13).

Durda and Preziosi [11] did not accept any of the studies since the mandatory criterion of acceptable control mortality/morbidity was not reported. Hobbs et al. [12] have a similar criterion, but since it is not mandatory it does not have the same effect on the summarized evaluations.

Other reasons why the reliability was considered unsatisfactory according to one or more evaluation methods were lack of information about: chemical concentration control analyses, physical and chemical test conditions, the specification of the test substance, a clear description of the test procedure, and the investigated period of the life cycle of the test organisms.

Discussion and conclusions

Standard test data are still preferred for regulatory environmental risk assessments of pharmaceutical substances. Accepting non-standard test data is likely to increase the regulatory agencies' workload, since these data are more complicated to evaluate than data generated under standards. More structured evaluation methods can help risk assessors and evaluators use non-standard test data. However, as we have shown in this study, the design of the evaluation method is crucial, since it can significantly affect the outcome of the evaluation.

The evaluation methods scrutinized in this study all require expert judgment. In our view, it is neither possible nor desirable to develop a method that completely leaves out expert judgment, but we can strive towards a method that reduces vagueness and elements of case-by-case interpretation. Both Hobbs et al. [12] and Schneider et al. [13] tested and modified their respective methods during the development process in order to increase the likelihood that evaluators arrive at similar conclusions.

The actual use of the four evaluation methods is unclear, but Klimisch et al. [10] have been cited 62 times, Schneider et al. [13] seven times, Durda and Preziosi [11] twice, and Hobbs et al. [12] once (ISI Web of Knowledge, 2010-11-02). The method described by Hobbs et al. [12] has, according to the authors themselves, been used by several Australian and New Zealand authorities.

The problem with non-standard data found to have low reliability could be either that the studies are poorly performed or that they are under-reported. Under-reporting could be a consequence of journals' desire to publish concise papers. However, an increasing number of journals allow additional data to be included as supplementary electronic information, which means that this should not be a major obstacle to making such information publicly available. It is also well known that environmental factors, such as oxygen saturation, salinity, pH, hardness, and temperature, can have a drastic impact on the uptake and effects of chemical substances [e.g., [30-32]]; therefore, as prescribed by most standard protocols, these aspects need to be monitored and reported in order to ensure the reproducibility of the test data, a key issue in the scientific process.

For the nine studies investigated in the present paper, under-reporting could very well be a significant reason for the evaluation outcome. Aspects like the use of controls, results from statistical evaluations, whether there is a concentration-response relationship or not, the tested concentrations, and a clear description of the test organism and test environment are important and should be included in all publications presenting ecotoxicity data. Reliability evaluation methods can be used as checklists by authors and reviewers to ensure that all important aspects of the test method are included in their reports. A more structured reporting format could ensure the reliability of the test data without limiting the researcher's creativity in designing a non-standard study.

As it is today, data with low reliability will not be included in regulatory environmental risk assessments. We believe that none of the nine selected non-standard studies could have been used in an environmental risk assessment of pharmaceuticals according to the EMA guideline. However, we still see that the studies could contribute to the risk assessment as supporting information. We are currently developing a new proposal for how to report and evaluate ecotoxicity data from the open scientific literature in regulatory risk assessment of pharmaceuticals. The new set of criteria is being developed in collaboration between regulators at the German Federal Environment Agency (UBA) and researchers within the Swedish research program MistraPharma (http://www.mistrapharma.se) [33]. The criteria are based on the four methods evaluated in this study and the OECD reporting requirements, and have been further developed to cover both the reliability and the relevance of test data. The intended users are risk assessors and researchers performing ecotoxicological experiments, but the criteria can also be used for educational purposes and in the peer-review process for scientific papers. This approach intends to bridge the gap between the regulators' and the scientists' needs and ways of working.

Finally, it is important to remember that much of the research done within the field of ecotoxicology and risk assessment is financed through taxpayers' money; not finding a way to use these data in risk assessments would be an inefficient and irresponsible handling of resources.