Introduction

The exponential growth of human mitochondrial DNA (mtDNA) sequences available in public databases (Brandon et al. 2005; van Oven and Kayser 2009; Rubino et al. 2012) is probably the best current hallmark of the central role the mitochondrial genome plays in medicine, forensics and anthropology. In particular, clinicians have recently rediscovered the ‘neglected genome’ (Pesole et al. 2012) as a pivotal determinant or modifier of an increasing number of pathologies, including Alzheimer’s disease (Adeghate et al. 2013), cancer (Verschoor et al. 2013), diabetes (Patti and Corvera 2010; Mercader et al. 2012; Adeghate et al. 2013), spinocerebellar ataxias (Mancini et al. 2013) and several types of sclerosis (Patti and Corvera 2010). Because the role of mtDNA variants in most of these phenotypes is still a matter of debate, it is essential that both novel and previously reported mtDNA variants of interest are highlighted and brought forward to subsequent analyses, since functional studies aimed at ascertaining their potential pathogenicity often require considerable effort. After sequencing and assembling an mtDNA genome, the first analytical step is usually the identification of positions that differ from the chosen reference sequence. Next, to identify variants of potential interest for a disease or a particular phenotype, such positions should be further filtered after determining the correct genetic background (haplogroup), so that fixed evolutionary allelic variants can be promptly recognized. Indeed, human mitochondrial phylogeny is described by a haplogroup classification based on clusters of closely related evolutionary haplotypes, defined by the pattern of genetic markers occurring across the entire mtDNA and reflecting the migration of human populations across continents (Watson et al. 1997; Balter 2011).

Although clinicians are seldom familiar with the complexity of haplogroups, evolutionary and adaptive aspects should not be ignored in clinical studies, given that genetic associations between haplogroup-defining variants and clinical phenotypes have been traced (Ghelli et al. 2009; Khan et al. 2013; Peng et al. 2013; Zhang et al. 2013). However, in the search for clinically relevant mtDNA variants, it is useful to rule out those that are evolutionarily fixed, thus limiting variability analyses to a few variants, annotated with estimates of conservation and predictions of pathogenicity, to obtain a shortlist of candidates that affect the function of the gene/protein. The heteroplasmic fraction of a sequence variant, whenever available, should not be neglected either. Indeed, the advent of high-throughput sequencing technologies in mitochondrial genetics has revealed that a wide range of mtDNA variants at low heteroplasmy also occurs in healthy individuals (Payne et al. 2013; Diroma et al. 2014), and varies among tissues (He et al. 2010).

Overall, there is an urgent need for a common workflow based on stringent criteria, to be implemented by researchers who face the challenge of in-depth analysis of mtDNA sequences, which would allow them to recognize the influence of a few variants on the phenotype, thus facilitating functional assays.

In this paper, we propose and validate a workflow for prioritizing functionally important non-synonymous variants, starting from the variant annotation process already implemented in MToolBox, which is based on the use of the Macro Haplogroup Consensus Sequences (MHCS) (Calabrese et al. 2014), and taking into account pathogenicity predictors. A nucleotide variability cutoff and a disease score threshold are established to prioritize a pool of candidate variants affecting function, which may then be further investigated. Disease scores and prioritization criteria are now implemented in both the standalone version (http://sourceforge.net/projects/mtoolbox/) and the web version of MToolBox at the MSeqDR portal (https://mseqdr.org/mtoolbox.php) (Falk et al. 2014).

Materials and methods

The MToolBox variant annotation process

MtDNA variants identified by the MToolBox pipeline (Calabrese et al. 2014) are parsed through an annotation process that is mainly based on comparison with the two widely used reference sequences, rCRS and RSRS, and on the recognition of alleles that are not shared with the sample-specific MHCS. MHCSs, integrated in the MToolBox package, were generated from 32 multiple alignments of complete mitochondrial sequences from 14,144 healthy individuals available in HmtDB (November 2013 update) belonging to the 32 chosen macro-haplogroups. Each macro-haplogroup-specific multi-alignment was then subjected to nucleotide composition analysis by applying the SiteVar algorithm (Pesole and Saccone 2001) to determine the allele occurring most frequently at each position, thus generating the MHCSs.
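The idea of taking the most frequent allele at each alignment position can be illustrated with a minimal sketch. This is not the SiteVar algorithm and not MToolBox code; the function name `consensus_sequence` and the toy alignment are hypothetical, and the tie-breaking rule (first allele seen in the column) is an assumption made only for this example.

```python
from collections import Counter

def consensus_sequence(aligned_sequences):
    """Return the most frequent symbol at each column of an alignment.

    Illustrative only: ties are resolved in favour of the symbol that
    appears first in the column, which is an assumption of this sketch.
    """
    if not aligned_sequences:
        raise ValueError("alignment is empty")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned_sequences):
        # most_common(1) yields the (symbol, count) pair with the highest count.
        symbol, _count = Counter(column).most_common(1)[0]
        consensus.append(symbol)
    return "".join(consensus)

# Toy alignment of four hypothetical sequences from one macro-haplogroup.
toy_alignment = ["ACGTA", "ACGTA", "ATGTA", "ACGCA"]
mhcs_like = consensus_sequence(toy_alignment)
```

In a real setting the input would be one multi-alignment per macro-haplogroup, so the function would be called once for each of the 32 alignments.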

The annotations provided by MToolBox for each variant include:

  • nucleotide site-specific variability estimated on the multi-alignment of the updated healthy genomes reported in HmtDB;

  • predictions of pathogenicity for non-synonymous variants by applying MutPred (Li et al. 2009), HumDiv- and HumVar-trained PolyPhen-2 models (Adzhubei et al. 2013), SNPs&GO, PhD-SNP (Capriotti et al. 2013), and PANTHER algorithms (Thomas and Kejariwal 2004). Each predictor assigns to each sequence variant a probability score of pathogenicity as well as a qualitative prediction ‘disease’, ‘neutral’ or ‘unclassified’.

  • Mitomap (Lott et al. 2013) annotations referring to disease-associated mutations, occurring in coding and control regions and somatic mutations together with their state of homoplasmy/heteroplasmy;

  • links to OMIM (http://omim.org).

All these data are available in the ‘patho_table’, a tab-delimited file provided by the MToolBox package, listing all 24,195 possible non-synonymous nucleotide substitutions that may occur within the 13 human mitochondrial protein-encoding genes, as previously reported (https://sourceforge.net/projects/mtoolbox/; Pereira et al. 2011). Links to Mamit-tRNA (Pütz et al. 2007) web resources were added to provide the user with a general view of the localization of variants within the mitochondrial tRNA sequence structure.
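Because the annotations are distributed as a tab-delimited file, they can be loaded with standard tooling. The sketch below only illustrates that access pattern: the column names (`position`, `ref`, `alt`, `locus`, `variability`) and the two rows are hypothetical placeholders and do not reproduce the real header or content of the patho_table.

```python
import csv
import io

# Hypothetical excerpt: column names and values are illustrative placeholders,
# not the actual layout of the MToolBox patho_table.
TOY_TABLE = (
    "position\tref\talt\tlocus\tvariability\n"
    "100\tA\tG\tGENE1\t0.0010\n"
    "200\tC\tT\tGENE2\t0.0200\n"
)

def load_table(handle):
    """Index the rows of a tab-delimited table by (position, alternative allele)."""
    rows = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (int(row["position"]), row["alt"])
        row["variability"] = float(row["variability"])
        rows[key] = row
    return rows

table = load_table(io.StringIO(TOY_TABLE))
```

Indexing by position and allele makes it cheap to look up the annotation of each variant observed in a sample; with a file on disk, `io.StringIO(TOY_TABLE)` would be replaced by an open file handle.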

Macro-haplogroup consensus sequences (MHCS) phylogeny

The robustness of the 32 MHCSs was tested by generating a phylogenetic tree including all MHCSs and, for each macro-haplogroup, two complete mitochondrial sequences derived from Phylotree (van Oven and Kayser 2009). Those sequences are also available in the Human mitochondrial Data Base (HmtDB; Rubino et al. 2012), which reports all publicly available human mitochondrial genomes. Genomes associated with population studies are stored and analyzed as ‘healthy’; genomes from subjects affected by mitochondriopathies are reported in a separate category and annotated as ‘patient’. All data required to establish whether an mtDNA genome sequence belongs to the ‘healthy’ category are obtained from the GenBank entry, from the published papers and upon request to the authors. In addition, the multi-alignment and its manual editing allow the quality of sequences to be checked and sequencing errors to be detected.

The MHCS phylogenetic tree was produced according to the Maximum Likelihood method, based on the Jukes–Cantor substitution model (Jukes and Cantor 1969). The quality of the tree topology was assessed by bootstrap analysis of 500 replicates (Felsenstein 1985). Analyses were performed through the functions implemented in MEGA5 software (Hall 2013).
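For readers unfamiliar with the Jukes–Cantor model, the sketch below computes the pairwise distance it implies, d = -(3/4) ln(1 - (4/3) p), where p is the proportion of differing sites. It is only meant to make the substitution model concrete: it is not the Maximum Likelihood tree search or the bootstrap procedure run in MEGA5, and the function name and toy sequences are hypothetical.

```python
import math

def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance between two aligned DNA sequences.

    Columns containing anything other than A, C, G or T (for example gaps)
    are skipped; this handling is an assumption of the sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("distance is undefined when p >= 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

# Toy pair differing at 1 of 10 sites (p = 0.1).
d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
```

The correction matters little for closely related sequences such as human mtDNA genomes, where p is small and d stays close to p.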

Datasets

The prioritization workflow was validated on a dataset of 125 mtDNA genomes belonging to individuals affected by Leber’s Hereditary Optic Neuropathy (LHON), sequenced by Sanger technology and stored in HmtDB (http://www.hmtdb.uniba.it) (Rubino et al. 2012) (Supplementary Table 1, SampleData sheet). This dataset was chosen to evaluate the performance of the workflow in identifying the known LHON-causative mutations, since 42 % of these genomes were expected to harbor at least one of the primary mutations included in the panel of the ‘Top 14 LHON’ annotated in Mitomap (Lott et al. 2013).

The efficiency of the workflow in prioritizing and recognizing tumor-specific variants with a functional impact was also tested on mtDNA sequences obtained from 20 ovarian cancer samples, collected within a concluded clinical study at S. Orsola-Malpighi Hospital, Bologna, Italy, during the period 2012–2013. Informed consent had been obtained in compliance with the Helsinki Declaration and the study had been approved by the local ethical committee. DNA was also available from the corresponding non-tumor tissue of patients and was used in the context of this study to test the germline nature of identified variants. The list of specimens and the HmtDB identifiers, obtained after submission of sequences to HmtDB, are reported in Supplementary Table 2, SampleData sheet. Sanger sequencing of the whole mtDNA was performed as previously described (Kurelac et al. 2013), to prevent nuclear mitochondrial sequence (NumtS) (Simone et al. 2011) co-amplification. The somatic (tumor-specific) nature of mitochondrial variants was ascertained by sequencing mtDNA from matched non-tumor tissue and validated on a second PCR product.
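The tumor versus matched non-tumor comparison reduces, in its simplest form, to a set difference. The sketch below assumes each variant is represented as a `(position, reference allele, alternative allele)` tuple and ignores heteroplasmy levels; the function name and the toy variants are hypothetical and are not results from this study.

```python
def split_somatic_germline(tumor_variants, normal_variants):
    """Classify tumor variants by their presence in the matched non-tumor tissue.

    A variant seen only in the tumor is called somatic; one seen in both
    tissues is called germline. Heteroplasmy is deliberately ignored here.
    """
    tumor = set(tumor_variants)
    normal = set(normal_variants)
    somatic = sorted(tumor - normal)
    germline = sorted(tumor & normal)
    return somatic, germline

# Toy variant calls for one hypothetical patient.
tumor_calls = [(100, "A", "G"), (200, "C", "T")]
normal_calls = [(100, "A", "G"), (300, "G", "A")]
somatic, germline = split_somatic_germline(tumor_calls, normal_calls)
```

In practice a call made this way would still need the experimental confirmation described above, such as validation on a second PCR product.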

Finally, to test our workflow on data generated by high-throughput technologies as well, Whole Exome Sequencing (WXS) BAM (Binary Alignment/Map) files from 90 matched samples (primary solid tumor/peripheral blood) from Colorectal Adenocarcinoma (COAD) patients were downloaded from the online repository of the consortium dbGaP (https://cghub.ucsc.edu/). These data were generated by the Baylor College of Medicine (BCM) center of The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) on the Illumina platform and mapped on the GRCh37-lite reference (Genome Reference Consortium Human Build 37, accession = “GCA_000001405.1”). The samples featuring a mean read depth >10X across the mitochondrial genome after its extraction through MToolBox were brought forward in the analysis (Supplementary Table 3, BloodSamples and TumorSamples sheets).
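The depth criterion above can be expressed as a small filter. The sketch assumes that per-position read depths have already been extracted for each sample; the sample names and depth values are hypothetical, and only the >10X cutoff comes from the text.

```python
from statistics import mean

def mean_depth(per_site_depth):
    """Mean read depth across the positions of one assembled genome."""
    if not per_site_depth:
        raise ValueError("no depth values supplied")
    return mean(per_site_depth)

def select_samples(depth_by_sample, min_mean_depth=10.0):
    """Keep samples whose mean depth is strictly greater than the cutoff."""
    return sorted(
        sample
        for sample, depths in depth_by_sample.items()
        if mean_depth(depths) > min_mean_depth
    )

# Toy per-site depths for three hypothetical samples (a real mtDNA genome
# has about 16.5 thousand positions).
toy_depths = {
    "sample_A": [12, 15, 9, 20],
    "sample_B": [3, 4, 10, 8],
    "sample_C": [10, 10, 10, 10],
}
kept = select_samples(toy_depths)
```

The comparison is strict, so a sample with a mean depth of exactly 10X is excluded, matching the ">10X" wording.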

Denaturing high-performance liquid chromatography (dHPLC) analysis on OC samples

For the variants found in Ovarian Cancer (OC) samples, PCR was performed using AmpliTaq Gold polymerase (Applied Biosystems). The following primers were used: for m.3380G>A in MT-ND1, fw-5′-ATACCCACACCCACCCAAGA-3′ and rv-5′-AGATGTGGCGGGTTTTAGGG-3′; for m.9837G>A in MT-CO3, fw-5′-TCAATCACCTGAGCTCACCA-3′ and rv-5′-ACCACATCTACAAAATGCCAGT-3′; and for m.14969T>A in MT-CYB, fw-5′-AACTTCGGCTCACTCCTTGG-3′ and rv-5′-TCACGGGAGGACATAGCCTA-3′. The amplification product was analyzed with the WAVE Nucleic Acid Fragment Analysis System (Transgenomic, Omaha, NE, USA). Data analysis was performed as previously described (Frueh and Noyer-Weidner 2003; Kurelac et al. 2012).

Mitochondrial DNA extraction, variant detection and annotation

The MToolBox pipeline (Calabrese et al. 2014), which includes several steps such as read mapping and NumtS filtering, post-mapping processing, genome assembly, haplogroup prediction and variant annotation, was used to extract the off-target mitochondrial genomes from the WXS COAD BAM files obtained from the TCGA repository, and then to annotate each variant allele. FASTA files from LHON and ovarian cancer samples were also used as input for MToolBox to annotate mitochondrial variants and related features.

Disease Score definition for non-synonymous variants

A training dataset of 53 mtDNA non-synonymous variants (Table 1; Supplementary Table 4), previously validated as affecting function, was used to define the ‘disease score’ of any non-synonymous mtDNA variant (as described in the “Results” section ‘Disease Score definition’). The dataset included 28 disease-associated mutations annotated in Mitomap as ‘confirmed’ to be pathogenic by two or more independent laboratories (Lott et al. 2013), and 25 clearly pathogenic cancer-associated mutations (Gasparre et al. 2007; Porcelli et al. 2010; Pereira et al. 2012). The disease score was obtained by weighting the six above-listed pathogenicity predictions (Thomas and Kejariwal 2004; Adzhubei et al. 2013; Capriotti et al. 2013) available in the ‘patho_table’ implemented in MToolBox (https://sourceforge.net/projects/mtoolbox/). These six methods were chosen from among the most widely used pathogenicity predictors available online for a fast evaluation of large-scale sequencing data, although their often-contradictory predictions demand a way to weigh their reliability. More details regarding the features used by each method to predict the impact of amino acid allelic variants on protein structure/function are available in the original publications (Thomas and Kejariwal 2004; Adzhubei et al. 2013; Capriotti et al. 2013).
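To make the notion of weighting several predictors concrete, the sketch below combines six pathogenicity probabilities into a single value through a weighted mean. This is only one plausible form and is an assumption of the example: the actual disease score definition is given in the “Results” section, and the weights and probabilities used here are hypothetical placeholders.

```python
PREDICTORS = (
    "MutPred",
    "PolyPhen-2 HumDiv",
    "PolyPhen-2 HumVar",
    "SNPs&GO",
    "PhD-SNP",
    "PANTHER",
)

def weighted_score(probabilities, weights):
    """Weighted mean of per-predictor pathogenicity probabilities.

    Assumed form for illustration; not the published disease score formula.
    """
    missing = [
        name for name in PREDICTORS
        if name not in probabilities or name not in weights
    ]
    if missing:
        raise KeyError(f"missing predictors: {missing}")
    total_weight = sum(weights[name] for name in PREDICTORS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * probabilities[name] for name in PREDICTORS)
    return weighted_sum / total_weight

# Hypothetical probabilities for one variant and hypothetical equal weights.
toy_probabilities = dict(zip(PREDICTORS, (0.9, 0.8, 0.7, 0.6, 0.5, 0.4)))
equal_weights = {name: 1.0 for name in PREDICTORS}
score = weighted_score(toy_probabilities, equal_weights)
```

With unequal weights, a predictor that performs better on the training dataset would contribute more to the combined value, which is the purpose of weighting contradictory predictions.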

Table 1 List of 53 non-synonymous variants composing the training dataset

Disease Score threshold

To derive the Disease Score Threshold (DST) for assessing the potential functional impact of non-synonymous variants, a mixture model of two normal distributions (McLachlan and Peel 2000) was fitted to the disease scores of 1872 non-synonymous variants observed in 15,385 mtDNA genomes from healthy individuals, carefully selected as complete sequences (http://webservice.cloud.ba.infn.it/hmtdb/HmtDBHealthyGenomes_References.xlsx) and stored in HmtDB (May 2014, Rubino et al. 2012). It was not possible to define a DST from the non-synonymous variants found in the mtDNA genomes from patients stored in HmtDB, because this dataset suffered from a sampling bias in which certain pathologies (such as LHON and Alzheimer’s disease) were over-represented, while others (such as MELAS) were under-represented. The analysis was performed using the normalmixEM function from the mixtools package (Benaglia et al. 2009) implemented in R version 3.1.1.
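The fitting step can be sketched as a plain expectation-maximization (EM) loop for a two-component normal mixture. This is a from-scratch stand-in written for illustration, not the normalmixEM implementation; the initialization, the fixed iteration count and the synthetic scores are assumptions of the example, and the rule used to read the DST off the fitted mixture is not reproduced here.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_normals(data, iterations=200):
    """Fit a two-component univariate normal mixture by EM.

    Returns (weights, means, sds). Component 0 is initialized on the lower
    half of the sorted data and component 1 on the upper half.
    """
    n = len(data)
    if n < 4:
        raise ValueError("need at least four observations")
    ordered = sorted(data)
    half = n // 2
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    grand_mean = sum(data) / n
    pooled_sd = max(math.sqrt(sum((x - grand_mean) ** 2 for x in data) / n), 1e-6)
    sds = [pooled_sd, pooled_sd]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability that each point belongs to component 0.
        resp = []
        for x in data:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            total = p0 + p1
            resp.append(p0 / total if total > 0.0 else 0.5)
        # M-step: update weights, means and standard deviations.
        n0 = max(sum(resp), 1e-12)
        n1 = max(n - sum(resp), 1e-12)
        means = [
            sum(r * x for r, x in zip(resp, data)) / n0,
            sum((1.0 - r) * x for r, x in zip(resp, data)) / n1,
        ]
        var0 = sum(r * (x - means[0]) ** 2 for r, x in zip(resp, data)) / n0
        var1 = sum((1.0 - r) * (x - means[1]) ** 2 for r, x in zip(resp, data)) / n1
        sds = [max(math.sqrt(var0), 1e-6), max(math.sqrt(var1), 1e-6)]
        weights = [n0 / n, n1 / n]
    return weights, means, sds

# Synthetic scores drawn from two known normals, for illustration only.
rng = random.Random(0)
toy_scores = [rng.gauss(0.2, 0.05) for _ in range(300)]
toy_scores += [rng.gauss(0.8, 0.05) for _ in range(300)]
weights, means, sds = fit_two_normals(toy_scores)
```

On well-separated synthetic data the estimated means land close to the generating values; real disease scores may overlap far more, which is why a dedicated package with multiple restarts and convergence checks is preferable in practice.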

Discussion

The assessment of the role of mitochondrial variants in the onset and/or progression of human diseases and cancer is a difficult task. It requires robust and statistically reliable methods to support clinicians in the functional investigation of variants potentially affecting protein function. The question of determining the potential pathogenicity of mtDNA variants is in fact a matter of debate among clinicians (McCormick et al. 2013). Because of the peculiarities of mitochondrial polyplasmic genetics, assignment of pathogenicity should take into account the degree of heteroplasmy, the haplogroup background, and even environmental factors (Wallace et al. 2003). Several methods have recently been published to meet these goals, such as mit-o-matic (Vellarikkal et al. 2015), MitImpact (Castellana et al. 2015) and MtSNPscore (Bhardwaj et al. 2009). Here, we contribute a robust statistical approach based on the introduction of two thresholds, the nucleotide variability cutoff (NVC) and the disease score threshold (DST). The NVC is calculated from the updated site-specific nucleotide variability values in a large dataset of complete healthy genomes available in the HmtDB database, whose sequence quality was supported by a correct haplogroup prediction (see “Materials and methods”); the DST is a statistically estimated robust threshold based on the pathogenicity predictions estimated by MutPred (Li et al. 2009), HumDiv- and HumVar-trained PolyPhen-2 models (Adzhubei et al. 2013), SNPs&GO, PhD-SNP (Capriotti et al. 2013), and PANTHER (Thomas and Kejariwal 2004), combined with the nucleotide variability distribution in the same HmtDB healthy dataset. To date, the information on biochemical and structural features of mitochondrial respiratory chain proteins is not as rich as for nuclear ones. Accordingly, in silico pathogenicity predictions may not be fully reliable and should be considered with caution, since the determinants of pathogenicity for an mtDNA variant are unclear.

The integration of pathogenicity predictions with functional assays is an essential and strongly recommended step to consolidate the validation of the pathogenicity of an mtDNA variant. The prioritization workflow is fully implemented in MToolBox (Calabrese et al. 2014), thus helping to highlight mitochondrial DNA variants as suitable candidates for subsequent functional analyses in clinical studies, an issue that has often proved problematic in analyses seeking correlations between mtDNA genotypes and disease phenotypes.

The in silico prioritization criteria developed here were established on the assumption that rare variants are more likely to affect gene/protein function than polymorphic variants (Bannwarth et al. 2013). In this context, we used the Macro Haplogroup Consensus Sequences to move the fixed evolutionary variants to the background and highlight rare ones. These rare variants may be characterized by nucleotide variability values below the established cutoff, while their potential functional impact, particularly for non-synonymous variants, may be assessed from the disease score. Variants mapping to non-protein-coding regions may also be prioritized by taking into account only the filtering against the three references and the nucleotide variability cutoff (NVC). Pathogenicity data regarding tRNA and rRNA variants will soon be integrated in the proposed pipeline.
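The criteria described in this paragraph can be summarized as a simple predicate. The sketch below is a hedged reading of the text rather than the MToolBox implementation: the `Variant` fields, the single flag that summarizes the comparison with rCRS, RSRS and the MHCS, the strict inequalities and the threshold values passed in the example are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Variant:
    name: str
    differs_from_all_references: bool  # assumed summary of the rCRS/RSRS/MHCS filter
    variability: float                 # site-specific nucleotide variability
    disease_score: Optional[float]     # None outside protein-coding genes

def is_prioritized(variant, nvc, dst):
    """Apply reference filtering, then the NVC, then the DST where available."""
    if not variant.differs_from_all_references:
        return False
    if variant.variability >= nvc:
        return False
    if variant.disease_score is None:
        # Non-protein-coding variants: references and NVC only.
        return True
    return variant.disease_score > dst

# Hypothetical variants and placeholder thresholds (not the published values).
toy_variants = [
    Variant("rare_damaging", True, 0.0005, 0.90),
    Variant("rare_benign", True, 0.0005, 0.10),
    Variant("polymorphic", True, 0.0500, 0.90),
    Variant("haplogroup_marker", False, 0.0005, 0.90),
    Variant("rare_trna", True, 0.0005, None),
]
prioritized = [v.name for v in toy_variants if is_prioritized(v, nvc=0.01, dst=0.5)]
```

Keeping the three filters as separate early returns makes it straightforward to disable one of them, for example to inspect which variants are recovered without the haplogroup-based step.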

The application of the NVC and DST on the dataset of all possible non-synonymous variants (patho_table, https://sourceforge.net/projects/mtoolbox/) suggested that the loci MT-ATP6, MT-ATP8 and MT-CYB harbor the variants featuring the highest nucleotide variability values and number of sites. Accordingly, these genes seem to be the least conserved regions of the human mtDNA and the most prone to harbor non-synonymous variants (Supplementary Fig. 4a), as previously reported in Mitomap. On the other hand, the distribution of disease scores related to all the potentially pathogenic variants occurring in the thirteen protein-coding regions did not show any gene-specific peculiarity (Supplementary Fig. 4b).

We applied the prioritization workflow to LHON, COAD and ovarian cancer sample sets to validate the robustness of our approach.

The majority of the causative mutations expected in the LHON dataset (8 out of 11 LHON-causing mutations) were recognized as affecting function by applying the whole stepwise prioritization workflow, with the exception of the mutations m.3700G>A (MT-ND1), m.14502T>C (MT-ND6) and m.14484T>C (MT-ND6) (Supplementary Table 1). These mutations were excluded by the chosen variability and/or disease score thresholds but were retained by the other filtering steps. Specifically, the mutation m.14484T>C was found in healthy individuals (0.11 % of total mtDNA genomes stored in HmtDB) despite its high disease score (DS), which is suggestive of a potential functional impact. This mutation was formerly considered a non-pathogenic variant, resulting in a conservative change to an amino acid with similar physicochemical properties (p.M64V) (Mackey and Howell 1992), and was found in population surveys without any associated pathological phenotype (Achilli et al. 2012). However, it was previously assumed that it may exert a pathogenic potential if found in association with haplogroups J and I (Achilli et al. 2012). Of the LHON-derived sequences analyzed here, 12 % harbored this mutation; among these, only two belong to haplogroup J1, whereas most of the genomes belong to macro-haplogroup M (26 % of total). The mutation m.3700G>A, although absent from the healthy population and for this reason more likely to have a functional impact, was suggested to be benign by the disease score. This may be because half of the pathogenicity predictors used to calculate the disease score predicted this mutation to be benign (Supplementary Table 1, AllVariants sheet).

These cases show that pathogenicity predictions should always be treated with caution and that functional validation of variants is of paramount importance to clarify their contribution to the phenotype, especially in the context of specific haplogroups and considering that penetrance is a pressing clinical issue, particularly in canonical mitochondriopathies such as LHON. The mutation m.14502T>C, excluded because it was predicted to be benign and found in the healthy population, is reported in the literature in sporadic LHON cases and/or in combination with other primary LHON mutations (Zhao et al. 2009; Zhang et al. 2010). In addition, our study suggested a pool of new variants not previously associated with LHON, which warrant further investigation with the aim of recognizing novel causative mutations that could be added to the list of “top 14” LHON-specific mutations (Brandon et al. 2005). Finally, two additional variants were prioritized when the haplogroup-filtering step was removed. This result does not imply that our prioritization process may give false negatives, because the output of the pipeline reports the entire list of variants with the prioritized ones at the top. Users are free to filter the variants according to their preferred criteria. The system simply suggests and reports the thresholds that may facilitate the selection of functionally important variants.
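The reporting behaviour described here, in which every variant is kept and the prioritized ones are listed first, can be sketched as a stable sort. The row layout and the secondary ordering by position are assumptions of this example, not a description of the actual MToolBox output format.

```python
def order_report(rows):
    """Return all rows, prioritized variants first, then by genomic position.

    Nothing is discarded: the ordering only moves prioritized rows to the top.
    """
    return sorted(rows, key=lambda row: (not row["prioritized"], row["position"]))

# Hypothetical report rows for one sample.
toy_rows = [
    {"variant": "v1", "position": 300, "prioritized": False},
    {"variant": "v2", "position": 200, "prioritized": True},
    {"variant": "v3", "position": 100, "prioritized": False},
    {"variant": "v4", "position": 150, "prioritized": True},
]
report = [row["variant"] for row in order_report(toy_rows)]
```

Because the full list is retained, a user can still recover a variant that the default thresholds placed below the prioritized block by applying different criteria.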

With respect to cancer, it should be underlined that our workflow may provide an additional advantage, namely the efficient selection of variants potentially affecting function that are candidates for being somatic, and that only transformed cells may withstand in the context of a deregulated cell metabolism. For too long, associations between cancer types and mtDNA variants have been reported without verifying whether the variants were somatic or germline, in which latter case they may be speculated to predispose to transformation. Furthermore, many polymorphisms have been classified in past years as cancer-associated mtDNA mutations (Máximo and Sobrinho-Simões 2000; Setiawan et al. 2008), a risky statement likely driven by the lack of well-curated databases and of a commonly agreed protocol to define such associations. The high frequency of mtDNA variants often detected in cancer samples may also be a great obstacle to selecting a few candidates to bring forward to functional studies aimed at assessing how they may impact metabolism. This is a key step that this workflow intends to simplify, as we have shown in both the COAD and ovarian cancer sample sets by verifying the somatic nature of prioritized variants. However, we have also shown that our criteria are not too stringent, as they allow the inclusion of a few variants that were not found to be somatic. Although this finding may somewhat decrease the performance of our workflow in specifically highlighting cancer-associated mutations, it may on the other hand allow the selection of variants that may still affect gene/protein function, even though they are not associated with the patient’s disease.

In this paper, we demonstrate that the prioritization of human mtDNA variants based on the workflow proposed here is able to recognize variants potentially affecting function. Compared to the existing pipelines, which are capable of annotating mitochondrial variations from next-generation datasets exclusively (mit-o-matic, Vellarikkal et al. 2015), or of extracting information from lists of variants or FASTA sequences exclusively (MtSNPscore, Bhardwaj et al. 2009), our prioritization workflow, currently being implemented in MToolBox, appears to be more flexible, allowing a functional annotation of variants on large datasets from both next-generation (in BAM, SAM, FASTQ format) and Sanger sequencing (in FASTA format) data.

Finally, the recent publication of MitImpact (Castellana et al. 2015) reports a database of the entire dataset of non-synonymous human mtDNA variants annotated with the nucleotide variability available through the HmtDB resource and with pathogenicity scores estimated by applying the same predictors implemented in MToolBox, with three additional ones. The level of concordance among predictions is estimated through two alternative “uniformity scores”, similar to our percentage of concordance, previously implemented in MToolBox and now replaced by the disease scores reported here. The different types of input required by mit-o-matic, MitImpact and MtSNPscore did not allow us to report the results of a quantitative comparison with MToolBox here. Those results may be roughly comparable in terms of the predicted pathogenicity of the considered variants.

In conclusion, our prioritization workflow is based on (1) the use of MHCSs together with RSRS and rCRS, to filter out variants fixed during evolution; (2) the availability of the NV scores obtained through the HmtDB annotated genomes; and (3) the implementation of the NVC and DST, which reduce the large number of variants from both NGS and conventional Sanger technologies. It offers one of the most complete tools to guide clinicians in selecting the non-synonymous mtDNA variants with a potential functional impact. However, functional assays are strongly required to confirm the pathogenicity of all mtDNA variants prioritized by this workflow, as well as by any automated method, with the aim of establishing their exact role and involvement in disease phenotypes.