In this issue of the Archives of Toxicology, Miriam Lohr and colleagues from TU Dortmund University in Germany present a method for identifying sample mix-ups in gene expression datasets (Lohr et al. 2015). Today, large sets of genome-wide expression data obtained by either gene arrays or RNA-seq are publicly available. This makes it possible to reuse previously analyzed data and to combine several datasets in order to improve statistical power (Lohr et al. 2015). However, there is a risk that samples in public databases have been misannotated. Such errors are particularly critical when the same error compromises several studies that all rely on the same publicly available expression data. Therefore, Lohr et al. (2015) established a biostatistical method that allows the retrospective identification of misannotated samples. The authors show that two types of error occur surprisingly often in public databases: (1) A sample is analyzed twice, and the duplicate is erroneously labeled with the identification number of another patient. (2) Two samples are mixed up. When such a mix-up occurs between samples from male and female individuals, it can be clearly identified by a set of sex-specific genes. The authors applied the mix-up identification procedures to 45 publicly available datasets comprising 4913 individuals. Erroneous sample annotation was identified in a surprisingly high fraction, 40 %, of the analyzed datasets (Lohr et al. 2015). The authors also demonstrate that removing the identified erroneous samples may critically influence the results of the statistical analysis of individual studies.
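The two error types described above can be illustrated with a minimal Python sketch. It is not the procedure of Lohr et al. (2015), whose statistical details are not reproduced here. It only assumes that duplicates show up as almost perfectly correlated expression profiles under different patient identifiers, and that a sex mix-up shows up as a contradiction between the annotated sex and the expression of sex-specific genes. The function names, the correlation threshold, the toy values, and the choice of XIST and RPS4Y1 as example marker genes are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def find_suspected_duplicates(expression, threshold=0.999):
    """Return pairs of sample IDs whose profiles are almost identical.

    `expression` maps sample ID -> expression values in the same gene order.
    The threshold is an illustrative assumption, not a published cut-off.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(expression), 2)
        if pearson(expression[a], expression[b]) >= threshold
    ]


def find_sex_mismatches(female_marker, male_marker, annotated_sex):
    """Flag samples whose marker-based sex contradicts the annotation.

    `female_marker` and `male_marker` map sample ID -> expression of a
    female-specific gene (assumed here: XIST) and a male-specific gene
    (assumed here: RPS4Y1). Comparing the two values directly is a crude
    rule chosen for brevity.
    """
    flagged = []
    for sample, sex in annotated_sex.items():
        predicted = "F" if female_marker[sample] > male_marker[sample] else "M"
        if predicted != sex:
            flagged.append(sample)
    return flagged


# Toy data (hypothetical): P3 is a near copy of P1 under another patient ID,
# and P2 is annotated as female but shows a male marker pattern.
expression = {
    "P1": [2.0, 5.0, 7.0, 1.0],
    "P2": [3.0, 1.0, 4.0, 6.0],
    "P3": [2.0, 5.0, 7.0, 1.1],
}
female_marker = {"P1": 9.0, "P2": 2.0, "P3": 8.5}
male_marker = {"P1": 1.0, "P2": 8.0, "P3": 1.2}
annotated_sex = {"P1": "F", "P2": "F", "P3": "F"}

duplicates = find_suspected_duplicates(expression)
mismatches = find_sex_mismatches(female_marker, male_marker, annotated_sex)
print(duplicates)
print(mismatches)
```

On this toy input, the sketch reports P1 and P3 as a suspected duplicate pair and P2 as a suspected sex mismatch. Real datasets would need normalized expression values, several marker genes per sex, and cut-offs calibrated on the data.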

Publicly available datasets have been used particularly often in clinical cancer research to identify prognostic genes (Schmidt et al. 2008, 2012; Cadenas et al. 2010, 2014; Stewart et al. 2012; Godoy et al. 2014; Stock et al. 2015). However, genome-wide expression data are also increasingly used in toxicological research (Song et al. 2013; Faust et al. 2013; Shao et al. 2014; Shinde et al. 2015; Reif 2014; Stöber 2014; Marchan 2014; Hammad and Ahmed 2014; Hengstler 2011; Stewart 2010). Cutting-edge topics include the establishment of classifiers and signatures in developmental toxicity (Krug et al. 2013; Rempel et al. 2015; Balmer et al. 2014; Zimmer et al. 2014; Waldmann et al. 2014) and hepatotoxicity (Campos et al. 2014; Yafune et al. 2013; Doktorova et al. 2012; Zellmer et al. 2010; Godoy et al. 2015). Large public databases are available for these purposes (Grinberg et al. 2014). Applying the novel methods for identifying sample annotation errors described by Lohr et al. (2015) will improve the reliability of genome-wide biostatistical analyses in the future.