“Biotest results are much less precise and accurate than chemical data”
If chemical analysis were a precise and accurate tool, different laboratories should detect very similar concentrations of compounds in the same sediment sample, and the same laboratory should detect the same concentration when measuring a sample several times. Interlaboratory proficiency-testing exercises allow participating laboratories to test their regular in-house analytical methods. For regulatory bodies, they serve as an external quality control evaluation of monitoring data.
Table 1 summarizes the outcomes of the interlaboratory exercises in the analysis of contaminants in sediments, organized by the International Atomic Energy Agency (IAEA) in 1998, 1999, and 2001. The interlaboratory coefficient of variation (CV) for organic compounds exceeded 100% for some analytes in all chemical substance groups tested. According to de Mora et al. (2007), these outcomes were in line with those of other proficiency tests, e.g., from Quality Assurance of Information in Marine Environmental Monitoring (QUASIMEME), and no improvement in analytical performance was detected between 1996 and 2006. However, for interlaboratory comparisons, laboratories perform analyses according to their in-house protocols. Thus, methods and potentially sediment pretreatment may differ. Consequently, interlaboratory differences tend to be larger than intralaboratory differences. In an exercise conducted by the National Institute of Standards and Technology (NIST) on analysis of PAHs, PCBs, chlorinated pesticides, and PBDE congeners, the CVs for three replicates, indicating the precision of in-house analyses, were below 10% for most laboratories but sometimes exceeded 50%, depending on the substance (Schantz et al. 2006).
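The distinction between intra- and interlaboratory CVs can be illustrated with a minimal sketch. The concentrations and laboratory names below are hypothetical and are not taken from the IAEA or NIST exercises; the CV is assumed to be the sample standard deviation divided by the mean, expressed in percent.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV in percent: sample standard deviation / mean * 100."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return stdev(values) / m * 100.0


# Hypothetical triplicate concentrations of one analyte in the same sediment
# (illustrative values only, not data from any proficiency test).
lab_replicates = {
    "lab_A": [10.0, 11.0, 9.0],
    "lab_B": [20.0, 22.0, 18.0],
    "lab_C": [5.0, 5.5, 4.5],
}

# Intralaboratory CV: spread of the replicates within each laboratory.
intra_cv = {lab: coefficient_of_variation(v) for lab, v in lab_replicates.items()}

# Interlaboratory CV: spread of the laboratory means.
lab_means = [mean(v) for v in lab_replicates.values()]
inter_cv = coefficient_of_variation(lab_means)

for lab, cv in intra_cv.items():
    print(f"{lab}: intralaboratory CV = {cv:.1f}%")
print(f"interlaboratory CV = {inter_cv:.1f}%")
```

In this constructed case every laboratory is internally precise, yet the laboratory means disagree, so the interlaboratory CV is several times larger than any intralaboratory CV.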
Table 1 Reproducibility of chemical analyses of sediment contaminants in IAEA interlaboratory assessments (coefficient of variation, CV range in % for different compounds), compiled from Villeneuve et al. (2000, 2002), de Mora et al. (2007), and Wyse et al. (2004)
For trace metals, the reproducibility and accuracy of analyses are better when outliers are removed. Table 2 shows results from the IAEA-405 intercomparison exercise for arsenic and heavy metals that are routinely analyzed in most sediment and dredged material samples. With the exception of cadmium, all CVs were below 20% and thus were considered acceptable (Wyse et al. 2004). The data also showed a wide range of results when outliers were not removed, thus demonstrating the uncertainty that can accompany chemical data, particularly if quality control procedures are omitted.
Table 2 Extracted results of the intercomparison exercise IAEA-405 for commonly regulated trace metals in sediments (Wyse et al. 2004)
With regard to sediment bioassays, Dillon (1994) has stressed the necessity for intra- and interlaboratory comparison before these bioassays are included in regulatory decision support systems. Standardization procedures, and in this context round robin tests, are usually a prerequisite before bioassays are relied upon. Interlaboratory comparisons that precede standardization of bioassays are usually conducted on spiked rather than natural sediment or water samples. For this discussion, however, the focus is on the reproducibility of results for natural (environmental) samples obtained with selected biotests that are being or have been used for sediment and dredged material assessment in Europe (Table 3): whole sediment assays with Corophium volutator (amphipod), Echinocardium cordatum (sea urchin), and Caenorhabditis elegans (nematode); solid-phase tests with Aliivibrio fischeri (luminescent bacteria); sediment contact tests with Myriophyllum aquaticum (aquatic plant); and elutriate tests with an embryo-larvae development bioassay with Crassostrea gigas (oyster), Paracentrotus lividus (sea urchin), and Daphnia magna (water flea).
Table 3 Examples of reproducibility (coefficient of variation, CV in %) of sediment toxicity tests in interlaboratory assessments
Inter- and intralaboratory CVs differ strongly between bioassays. The variability would be expected to be highest for tests with larger organisms, and/or if test organisms are sampled from the field and not cultured in a laboratory. The latter is often the case for marine test species. Field-collected test organisms are genetically more diverse and usually have longer life cycles, and fewer organisms are used per test replicate. However, the CVs for tests with marine organisms are most often within the range reported for ISO sediment toxicity tests performed with freshwater laboratory-cultured organisms (e.g., Feiler et al. 2014).
Because of the heterogeneity of sediments, lower precision might have been assumed for direct contact tests. The available data (Table 3) do not support this general assumption. Moreover, most results are well within the commonly accepted criterion of a CV of less than 30 to 40% (Environment Canada 1990; Moore et al. 2000).
On the basis of these interlaboratory comparisons, the assumption that ecotoxicological results in general are less reliable than chemical data can thus not be confirmed. Despite the more recent use of biotesting compared to chemical analytics, and although biological organisms naturally have variable phenotypes, CVs in chemical and ecotoxicological results for sediment quality assessment are in the same range, and sediment contact tests are not necessarily less reproducible than elutriate tests.
However, sampling methodology (pooling of subsamples, homogenization, and sample volume) has been shown to have a large effect on the reproducibility of solid-phase toxicity, and its effects cannot be corrected after a sample is brought to the laboratory (Ferrari et al. 1999). Similarly, sampling has a key influence on the variability of the results of chemical characterization, often to a greater extent than analytical variability (Schiavone et al. 2011). Moreover, sediment storage and pretreatment significantly affect test results (e.g., De Lange et al. 2008).
Another major source of interlaboratory variability in ecotoxicological testing, as suspected by Stronkhorst et al. (2004), is the degree of experience of laboratory technicians with bioassays. Effort should be made to provide specific training in performing ecotoxicological tests if the results are used for regulatory purposes. This aspect is particularly important when the evaluation of test endpoints has some degree of subjectivity, such as the development of sea urchin larvae (Casado-Martínez et al. 2006b). One possibility for improving the performance of ecotoxicological testing of laboratories is the initiation of frequent interlaboratory comparisons for bioassays.
Although here we compare the precision of numerical endpoint results with those from analytical techniques through CVs, the use of CVs alone for assessing the results of toxicity tests has been challenged. Whereas extremely toxic or nontoxic samples may result in very low CVs (Burton Jr. et al. 1996), good agreement in the classification of samples according to toxicity and no toxicity may also be achieved with high CVs (Thursby et al. 1997; Casado-Martínez et al. 2006a, b). As Norberg-King et al. (2006) indicated: “it is important to keep in mind that the purpose of a toxicity test is not to find statistical differences; rather, it is to decide, with an acceptable degree of uncertainty, whether a sample is toxic.”
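Why a high CV need not prevent agreement on classification can be shown with a small constructed example. All numbers below, including the 20% inhibition threshold, are assumptions chosen for illustration and do not come from the cited studies.

```python
from statistics import mean, stdev

TOXICITY_THRESHOLD = 20.0  # hypothetical cut-off in % inhibition


def cv_percent(values):
    """Coefficient of variation in percent (sample standard deviation / mean)."""
    return stdev(values) / mean(values) * 100.0


def classify(inhibition):
    """Binary classification against the assumed threshold."""
    return "toxic" if inhibition >= TOXICITY_THRESHOLD else "nontoxic"


# Hypothetical % inhibition reported by four laboratories for two samples.
results = {
    "sample_1": [30.0, 60.0, 90.0, 45.0],
    "sample_2": [2.0, 4.0, 1.0, 5.0],
}

summary = {}
for sample, values in results.items():
    classes = {classify(v) for v in values}
    summary[sample] = {
        "cv_percent": cv_percent(values),
        "classes": classes,
        "unanimous": len(classes) == 1,  # True if all laboratories agree
    }

for sample, info in summary.items():
    print(sample, f"CV = {info['cv_percent']:.1f}%",
          sorted(info["classes"]), "unanimous:", info["unanimous"])
```

Both hypothetical samples have CVs above the 30 to 40% criterion, yet all four laboratories assign each sample to the same class, because every individual result lies on the same side of the threshold.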
In conclusion
The statement that biotest results are generally less precise and accurate than chemical data cannot be confirmed. Nevertheless, more intra- and interlaboratory comparisons would help to harmonize procedures (sampling, pretreatment, and standard operation procedures) and to train technicians.
“The low number of test organisms cannot represent the ecosystem sensitivity”
This statement refers to the application of biotests to assess in situ sediment quality and to protect the environment against stress from contaminants. The sensitivity and stress levels of an ecosystem can be best assessed by studying the benthic community. However, changes in diversity can also be due to noncontaminant stressors, such as temperature or light; therefore, the “triad approach” combines benthic community data with toxicity data (and chemical data) (Chapman et al. 1997). Sometimes hypothetical “most sensitive test organisms” reflecting the sensitivity of the biological community have been desired to allow for cost-efficient and fast determination of the chemical stress in situ.
This statement misinterprets the importance of ecotoxicological testing, and the search for the most sensitive organism will not be successful anyway, as Cairns (1986) has explained. Species differ in sensitivity toward chemicals with different modes of action; the same species may be very sensitive to substance A yet tolerant to substance B. Consequently, the search for a species “representative” of an ecosystem’s status is necessarily flawed. What we can expect from a biotest is information, such as the presence or availability of (undefined) substances that have the potential to disturb and affect organisms in the field. If basic biological traits are inhibited, such as photosynthesis, reproduction, or energy metabolism, the probability of implications for the ecosystem rises.
For the selection of a test species for sediment toxicity test development, practical reasons will prevail (e.g., availability and handleability). The utility of such tests can be greatly improved if the proposal for a species is accompanied by appropriate information regarding its sensitivity to contamination, its ecological importance, and its exposure pathways (Dillon 1994). As an indicator of potential risk to the biological community, a given biotest must be sensitive to chemical stress. An excessive tolerance would increase the likelihood of false-negative responses. Accordingly, field validation is needed, during which reactions of biotests are compared with measurable changes in the biological community, so that regulatory agencies can assess the relevance of bioassay results. Although there has been some debate regarding the need for field validation for sediment toxicity testing (Chapman 1995), a workshop to evaluate the uncertainty of measurement endpoints used in sediment ecological risk assessment highlighted the inadequate field validation of sediment toxicity tests in 1996 (Ingersoll et al. 1997). To overcome this bottleneck, several initiatives in the USA have demonstrated the ecological relevance of amphipod sediment toxicity testing. Long et al. (2001) have studied the relationship between acute sediment toxicity tests with marine and estuarine amphipods and benthic community structure metrics (abundance and diversity) in more than 1400 samples from studies conducted in the USA. Although the authors found considerable variability among the datasets, they concluded that ecologically relevant losses in the abundance and diversity of the benthic infauna frequently corresponded to decreased amphipod survival in laboratory tests. In > 90% of the samples classified as toxic, at least one measure of benthic diversity or abundance was < 50% of the average reference value. 
No amphipods were found in 39% of samples classified as toxic, although amphipods were also absent from 28% of the nontoxic samples. However, the abundance of crustaceans (notably amphipods) decreased in the infauna as amphipod survival decreased in the laboratory tests in many of the studied areas. A break point in the data indicated that, in general, amphipod abundance in the field was lowest when survival in the laboratory tests decreased below 50% that of controls.
A field validation study was also completed at a PAH-contaminated Superfund site in the USA, involving a 10-day toxicity test with the marine amphipods Leptocheirus plumulosus and Rhepoxynius abronius (Ferraro and Cole 2002). Both toxicity tests were validated as indicators of changes in several macrofaunal community metrics that had low but sufficient statistical power to discriminate ecologically important effects: the percentage loss of the indices increased relative to values determined for nontoxic reference areas, as the average survival in laboratory toxicity tests decreased. Losses of benthic resources reached 50% when the survival dropped to 0%.
According to Borgmann et al. (2005), the freshwater amphipod Hyalella azteca is frequently one of the most sensitive organisms in sediment toxicity tests, as indicated by the results of risk assessments for chemical substance registration. Toxicity to H. azteca in laboratory tests has been shown to correlate closely with an abnormally low abundance of sensitive benthic species, such as amphipods, mayflies, sphaeriid clams, and tanytarsid midges, in the field, and thus to predict effects on sensitive species in situ.
The nematode Caenorhabditis elegans has become another frequently used organism in freshwater sediment toxicity tests. Haegerbaeumer et al. (2018) have compared the sensitivity of 27 wild nematode species extracted from freshwater sediments with that of C. elegans toward metals and PAHs. Although C. elegans is more tolerant to chemical stress than the average freshwater nematode species, the sensitivity of the extracted animals varied over a wide range, and the C. elegans responses were well within that range, except for benzo[a]pyrene.
In conclusion
Single-test species should not be expected to represent the sensitivity of ecosystems but should be regarded as indicators of available and harmful substances that may affect biological communities. From this perspective, more information must be compiled and provided on the sensitivity of test species toward relevant substances in comparison to that of biological communities, to provide information on the possibility of false-negative outcomes in batteries of biotests.
“Because of altered environmental conditions, laboratory testing does not reflect natural conditions and thus cannot be related to bioavailability in situ”
As indicated above, a direct extension of laboratory results to situations in situ is certainly not possible, as also described by Ferrari et al. (2019). The above statement reveals a misunderstanding of the purpose of performing biotests. These tests are intended to show whether there is a hazard for the aquatic or benthic community.
Therefore, experimental conditions may change as long as they remain environmentally relevant, and different scenarios may be tested.
In Europe, discussions of biotests in a regulatory context apply primarily to dredged material assessment. Deciding on management options for dredged material requires deciding on treatments, during which the material undergoes several physico-chemical changes, as do sediment samples in preparation for ecotoxicological testing. Bioassays usually require oxic conditions and more water than was present in the original sediment. Resuspension in a greater water volume and oxidation of samples will also occur during relocation of dredged material, and thus preparation of samples for biotesting simulates realistic conditions. Test conditions such as pH, temperature, or salinity, however, depend on the requirements of the given test organisms and must be kept within a certain range, even if that range does not reflect the environmental situation. Ecotoxicological tests must be understood not to predict with high certainty what will happen in the environment but to characterize environmental samples on the basis of their properties under fixed conditions. The toxicity measured in the laboratory reflects the capability of the sediment to do harm under certain conditions and thus indicates toxicity potential. The information on ecotoxicity becomes meaningful in an environmental context, considering the exposure. Management decisions, e.g., to dredge or to relocate sediment, should be made on the basis of its toxic potential to ensure that the material’s properties do not adversely affect the environment.
The same applies to chemical data for sediments. Bioassays, elution, or leaching tests are performed in standardized conditions that do not necessarily represent the in situ bioavailability of contaminants. Moreover, as reported above, sampling strategy, storage, and pretreatment may also alter contaminant bioavailability, thus affecting the results of chemical analysis. For example, De Lange et al. (2008) have reported the analysis of acid volatile sulfide (AVS) and simultaneously extracted metals (SEM) stored under different conditions: AVS increased significantly during cool storage, whereas SEM was not affected. The authors found different AVS values according to the sediment layer (i.e., 0–2 cm vs. 2–5 cm). In addition, the choice of digestion procedure may significantly affect the results of trace element analysis (e.g., Mossop and Davidson 2003). Therefore, chemical analysis, like ecotoxicological tests, is performed to characterize environmental samples under fixed conditions, which may not reflect the in situ status.
In conclusion
Both biotesting and chemical analysis characterize sediments under standardized conditions that do not necessarily represent in situ conditions. Despite this limitation, both can reveal potential hazards to aquatic communities. However, ecotoxicological tests are more powerful in detecting the effects of pollutant mixtures and of chemicals that are not assayed.
“Agreement on how to assess biotest data is lacking”
Interpretation of individual test results within a biotest battery is performed differently depending on the laboratory and the guidelines. For whole organism tests, the results are often expressed as percentage inhibition of a certain endpoint such as mortality, photosynthesis, growth, or reproduction, compared with that of an unaffected control. Thresholds that differentiate “toxicity classes,” indicating, e.g., low, moderate, or high toxicity, often appear to be set arbitrarily and not to account for the characteristics of the test systems.
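The percentage inhibition relative to a control, and the effect of fixed class thresholds, can be sketched as follows. The formula is the usual (control − sample) / control × 100; the class boundaries of 20% and 50% are hypothetical and are included only to show how a fixed threshold turns a continuous response into a class.

```python
def percent_inhibition(control_response, sample_response):
    """Inhibition of an endpoint (e.g. growth) relative to the control, in %."""
    if control_response <= 0:
        raise ValueError("control response must be positive")
    return (control_response - sample_response) / control_response * 100.0


def toxicity_class(inhibition, low=20.0, high=50.0):
    """Assign a class using hypothetical, test-independent thresholds."""
    if inhibition < low:
        return "low"
    if inhibition < high:
        return "moderate"
    return "high"


# Illustrative endpoint values (arbitrary units), not measured data.
control_growth = 80.0
sample_growth = 60.0

inhibition = percent_inhibition(control_growth, sample_growth)
print(f"{inhibition:.1f}% inhibition -> class: {toxicity_class(inhibition)}")
```

Applying the same `low` and `high` values to every test ignores differences in response range and precision between test systems, which is the concern raised in this section.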
Different toxicity endpoints of different organisms have different response ranges, sensitivity, and precision, thus calling into question the use of strict threshold values in ecotoxicology (Ahlf and Heise 2005; Höss et al. 2010).
The issue becomes even more complicated in interpreting the results of biotest batteries, because the bioassays usually yield differing responses. Interpretation of multi-test results providing information on sediment or dredged material quality ranges from always considering the most sensitive organisms in a test battery (e.g., in Germany, according to GÜBAK-WSV (2009)) to integrating biotest data by more complex classification techniques, such as the Hasse diagram technique, fuzzy logic expert systems (Hollert et al. 2002), or toxicity profiling (Hamers et al. 2010). These integrating assessment approaches, although more complicated and less transparent, have the potential to improve decision-making on the basis of sound science and have found acceptance, e.g., in the Italian regulation for disposal of dredged marine sediments at sea in other than National Relevance Sites (SedNet and Sullied Sediments 2018).
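The simplest of these interpretation rules, taking the response of the most sensitive organism as the result for the whole battery, can be sketched in a few lines. The class names, their ordering, and the battery results are hypothetical; this is not the procedure of any specific guideline.

```python
# Hypothetical ordered toxicity classes, from least to most severe.
CLASS_ORDER = ["none", "low", "moderate", "high"]


def worst_case_class(battery_results):
    """Return the most severe class observed in any test of the battery."""
    if not battery_results:
        raise ValueError("the test battery is empty")
    return max(battery_results.values(), key=CLASS_ORDER.index)


# Illustrative battery with differing responses per test organism.
battery = {"bacteria": "low", "algae": "high", "nematode": "none"}

overall = worst_case_class(battery)
print("overall classification:", overall)
```

Such a rule is transparent but discards the information carried by the other tests, which is what the more complex integrating techniques attempt to retain.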
Chemical analyses face a problem similar to that of quantifying single and multiple bioassay responses: their results must likewise be related to the environmental toxicity of sediments. Sediment quality guidelines (SQGs) are intended to relate the chemical concentrations in sediments to hazards. They have been developed to protect the biological community, to predict effects on benthic organisms, or both. Most have been derived through empirical or theoretical/mechanistic approaches.
Many (controversial) discussions have debated the design, implementation, and limitations of SQGs, thus resulting in a large variation in guidelines. DelValls et al. (2004) have reviewed SQGs from different European countries and have shown that they differ by two orders of magnitude for some substances (e.g., As, Cu, and seven PCBs). Most of the limitations listed and discussed at the Pellston Workshop on “use of sediment quality guidelines and related tools for the assessment of contaminated sediments” in 2002 have not been addressed to date for existing SQGs; e.g., they deliver no or limited information on the ecologically important aspects of chronic toxicity to sediment-dwelling organisms and cause-effect relationships, in addition to the questionable transferability of SQGs, derived from one endpoint in the laboratory, to, e.g., effects on organisms in the environment (Wenning et al. 2005). Moreover, existing SQGs cover tens of substances at best, and therefore substances of emerging concern cannot be reliably assessed with this tool. The same applies to chemical guidelines developed by countries to manage dredged material (action levels). These action levels vary substantially among countries and cover only a very limited number of substances (see Röper and Netzband 2011).
In conclusion
There is indeed no agreement yet on how to assess biotest results, although several approaches that account for test-specific characteristics have been reported. Contrary to the common perception, however, sediment quality guidelines and action levels also substantially vary among countries and, even if effect-based, have limited ability to predict adverse effects or protect benthic communities. Complementary application of chemical analyses and ecotoxicological testing still appears to be the best way to decrease the probability of false-negative results from sediment or dredged material analyses. Sediment toxicity testing with carefully selected organisms to target contaminants with a special mode of action could become a cost-effective monitoring technique.
“Biotesting significantly increases the costs of sediment management”
A brief study of testing costs was performed during drafting of a guidance document dedicated to the hazard assessment of sediments in French waterways (Stamm and Babut 2019). Two types of biotests were considered: miniaturized tests intended for a screening tier and classic biotests intended for an in-depth assessment if the screening tier did not lead to making a decision. The associated costs are shown in Table 4.
Table 4 Expected unit costs (excl. VAT) for a range of biotests. na not available
Well-known, commonly used tests, such as ostracod (ISO 14371) or Microtox™, have unit costs similar to the costs of “simple” analyses, such as those of trace elements, PCBs (except dioxin-like congeners) or PAHs, which are mostly automated in chemical laboratories. Other tests appear to be more expensive for several reasons. A longer test duration (e.g., Gammarus), which entails a higher workload, leads accordingly to a more expensive test. The cost cited by potential contractors is also influenced by the potential demand (i.e., the number of tests the contractor expects to perform), which in turn is associated with the investment needed and the number of laboratories accredited for those tests.
Thus, currently, according to these tariffs, the cost of the screening tier would amount to approximately 2000 € per sample when biotests are included or 1000 € when they are not, in which case the list of chemical analyses would be limited to trace elements, PCBs (except dioxin-like congeners), and PAHs. If the concentrations of more chemicals of emerging concern are required in a screening tier, only the cost of chemical analyses would increase, because the biotest battery would have the same level of response (i.e., taking into account that bioassays assess the potential effects of all chemicals in the samples).
Costs can be further cut substantially with the use of smaller test organisms, such as bacteria and algae, thus enabling the test procedures to be miniaturized (Rojíčková et al. 1998; Heise and Ahlf 2005; Wadhia and Thompson 2007; Paixão et al. 2008).
More broadly, including any additional lines of evidence in the assessment framework would increase the overall expense of reaching a decision. This trend would hold not only for biotests but also for chemicals of emerging concern beyond the current lists of priority substances.
In conclusion
In our view, the cost issue should be discussed in relation to the needs (what information is required to reach a decision) and the cost of mismanagement, that is, of making a wrong decision. With analytical and bioassay data complementing each other, the risk of false-negative results, which would guide the decision in the wrong and costly direction, would be reduced. Objecting that biotests are expensive does not make much sense: it is a simplistic argument with no rational grounds.