Background

Genome and exome sequencing are both currently being used as molecular diagnostic tools for patients with rare, undiagnosed diseases [1-3]. Typically, these technologies are applied clinically by following workflows consisting of blood draw, sequencing, alignment, variant calling, variant annotation, variant filtering, and variant prioritization [4, 5]. Then, clinical analysts usually perform the more manual processes of inspecting and then reporting variants based on a set of patient phenotypes from the referring doctor.

In general, commonly used pipelines exist for the steps from sequencing through variant calling [6, 7]. Despite differences in performance, most of these pipelines are relatively uniform in that they start with the same inputs (i.e. read files, commonly FASTQ format) and produce the same outputs (i.e. a set of variants, commonly Variant Call Format). In contrast, methods for variant annotation and/or variant filtering are quite diverse [8-11]. These methods use a wide range of annotation sources including but not limited to population allele frequencies [12], conservation scores [13-15], haploinsufficiency scores [16, 17], deleteriousness scores [17, 18], transcript impact scores [19-23], and previously associated disease annotation [24-26]. Variant prioritization is also quite diverse with some methods relying only on the variant annotations to prioritize variants [9] and some relying only on patient phenotype to rank the variants [27-30]. There are also methods which combine both variant annotations and phenotype score to rank the variants [31-34], a selection of which are benchmarked on the same simulated datasets in [35].

Given a prioritized list of variants, analysts manually inspect each one and curate a subset to ultimately report to the ordering physician. Unfortunately, manual curation is a time-consuming process where analysts must inspect each variant while maintaining a mental picture of the patient’s phenotype. One group reported an average of 600 variants per case analyzed by two people (one analyst and one director) over three hours, meaning a throughput of ≈100 variants per man-hour [36]. If causative variants can be identified earlier due to a high rank from prioritization, it is possible that the full filtered variant list can be short-circuited, reducing the total number of variants reviewed and therefore the time to analyze a case. Additionally, accurate prioritization is a step towards the ultimate goal of fully automating the analysis of the sequencing data for rare disease patients.

One of the issues with previously published ranking methods is that they were primarily tested on simulated datasets with known, single-gene, pathogenic variants injected into real or simulated background genomic datasets. Additionally, when phenotype terms were used, they tended to select all matching phenotype terms for the simulated disease and then inject/remove a few terms (typically 2-3) in order to provide some variability. In practice, rare disease patients often have much more variability in their phenotype terms for a wide variety of reasons such as multiple genetic diseases, variability in disease presentation, phenotypes of non-genetic origin, and/or variability in the standards describing a phenotype.

In this paper, we focus on real patient data from the multi-site collaboration of the Undiagnosed Diseases Network (UDN) [1]. Patients accepted into the UDN are believed to have rare, undiagnosed diseases of genetic origin. Because the UDN is not focused on a single particular disease, the patient population has a diverse range of phenotypes represented. Additionally, the exact phenotype terms associated with an individual patient are highly variable for the reasons described above. Because the UDN is a research collaboration, there is also variability in reported variants, which range in pathogenicity from “variant of uncertain significance” (VUS) through “pathogenic” as defined by the ACMG guidelines [37]. The sum of this real-world variation means that accurately identifying and/or prioritizing variants is challenging due to uncertainty and variation in phenotype inputs and variation in the pathogenicity of reported variant outputs.

Methods

Overview

We tested the application of classification algorithms for identifying clinically reported variants in real-world patients in two ways: 1) predicting whether a variant observed by an analyst would be clinically reported and 2) prioritizing all variants seen by the clinical analysts. In particular, we focused our analyses on real patients with a diverse collection of rare, undiagnosed diseases that were admitted to the Undiagnosed Diseases Network (UDN) [1]. We limited our patients to those who received whole genome sequencing and had at least one primary variant (i.e. not secondary or incidental) on their clinical report. We extracted data directly from the same annotation and filtering tool used by the analysts in order to replicate their data view of each variant in a patient. Additionally, we incorporated phenotype information into the models using two scoring systems that are based on ranking genes by their association with a set of patient phenotypes. Finally, each variant was labeled as either “returned” or “not returned” depending on whether it was ultimately reported back to the clinical site.

Given the above variant information, we split the data into training and testing sets for measuring the performance of classifiers to predict whether a variant would be clinically reported or not. We tested four classifiers that are readily available in the sklearn [38] and imblearn [39] Python modules. Of note, our focus was not on picking the “best” classifier, but rather on analyzing their overall ability to handle the variability of real-world patient cases from the UDN.

Each classifier calculated probabilities of a variant belonging to the “returned” class, allowing us to measure their performance as both a classifier and a prioritization/ranking system. After tuning each classifier, we generated summaries of the performance of each method from both a binary classification perspective and a variant prioritization perspective. Additionally, we tested four publicly available variant prioritization algorithms and two single-value ranking methods for comparison. All of the scripts to train classifiers, test classifiers, and format results are contained in the VarSight repository. A visualization of the workflow for gathering features, training the models, and testing the models can be found in the Additional file 1.

Data sources

All samples were selected from the cohort of Undiagnosed Diseases Network (UDN) [1] genome sequencing samples that were sequenced at HudsonAlpha Institute for Biotechnology (HAIB). In short, the UDN accepts patients with rare, undiagnosed diseases that are believed to have a genetic origin. The UDN is not restricted to a particular disease, so there are a diverse set of diseases and phenotypes represented across the whole population. The phenotypes annotated to a patient are also variable compared to simulated datasets for a variety of reasons including: 1) patients may have multiple genetic diseases, 2) phenotype collection is done at seven different clinical sites leading to differences in the standards of collection, 3) patients may exhibit more or fewer phenotypes than are associated with the classic disease presentation, and 4) patients may have phenotypes of non-genetic origin such as age- or pathogen-related phenotypes. For more details on the UDN, we refer the reader to Ramoni et al., 2017 [1].

DNA for these UDN patients was prepared from whole blood samples (with few exceptions) and sequenced via standard operating protocols for use as a Laboratory-Developed Test in the HAIB CAP/CLIA lab. The analyses presented in this paper are based on data that is or will be deposited in the dbGaP database under dbGaP accession phs001232.v1.p1 by the UDN.

Alignment and variant calling

After sequencing, we followed GATK best practices [40] to align to the GRCh37 human reference genome with BWA-mem [41]. Aligned sequences were processed via GATK for base quality score recalibration, indel realignment, and duplicate removal. Finally, SNV and indel variants were joint genotyped, again following GATK best practices [40]. The end result of this pipeline is one Variant Call Format (VCF) file per patient sample. This collection of VCF files is used in the following sections.

Variant annotation and filtering

After VCF generation, the clinical analysts followed various published recommendations (e.g. [4, 5]) to annotate and filter variants from proband samples. For variant annotation and filtering, we used the same tool that our analysts used during their initial analyses. The tool, Codicem [42], loads patient variants from a VCF and annotates the variants with over fifty annotations that the analysts can use to interpret pathogenicity. These annotations include: variant level annotations such as CADD [18], conservation scores [13, 14], and population frequencies [12]; gene level annotations such as haploinsufficiency scores [16, 17], intolerance scores [15], and disease associations [24-26]; and transcript level annotations such as protein change scores [19-22] and splice site impact scores [23]. Additionally, if the variant has been previously curated in another patient through Human Gene Mutation Database (HGMD) or ClinVar [24, 26], those annotations are also made available to the analysts.

Codicem also performs filtering for the analysts to reduce the number of variants that are viewed through a standard clinical analysis. We used the latest version of the primary clinical filter for rare disease variants to replicate the standard filtering process for patients in the UDN. In short, the following criteria must be met for a variant to pass through the clinical filter: sufficient total read depth, sufficient alternate read depth, low population frequency, at least one predicted effect on a transcript, at least one gene-disease association, and absence from a list of known, common false positives from sequencing. In general, the filter reduces the number of variants from the order of millions to hundreds (anecdotally, roughly 200-400 variants per proband after filtering). For details on the specific filter used, please refer to Additional file 1.
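
To make the filter criteria concrete, the sketch below expresses them as a simple rule-based check in Python. The field names and threshold values are hypothetical placeholders chosen for illustration; they do not reproduce the actual Codicem filter configuration (see Additional file 1 for the real settings).

```python
# Hypothetical sketch of a rule-based rare-disease variant filter.
# Field names and thresholds are illustrative only and do not reflect
# the actual Codicem clinical filter configuration.

def passes_clinical_filter(variant, false_positive_sites):
    return (
        variant["total_depth"] >= 10                     # sufficient total read depth
        and variant["alt_depth"] >= 4                    # sufficient alternate read depth
        and variant["pop_freq"] <= 0.01                  # low population allele frequency
        and len(variant["transcript_effects"]) > 0       # >=1 predicted transcript effect
        and len(variant["gene_disease_assoc"]) > 0       # >=1 gene-disease association
        and variant["site"] not in false_positive_sites  # not a known sequencing artifact
    )

example = {
    "site": ("chr1", 12345), "total_depth": 42, "alt_depth": 19,
    "pop_freq": 0.0002, "transcript_effects": ["missense"],
    "gene_disease_assoc": ["OMIM:123456"],
}
print(passes_clinical_filter(example, false_positive_sites=set()))  # True
```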

Phenotype annotation

The Codicem annotations are all agnostic of the patient phenotype. As noted earlier, we do not expect the patient phenotypes to exactly match the classic disease presentation due to the variety and complexity of diseases, phenotypes, and genetic heritage tied to UDN patients. Despite this, we made no effort to alter or condense the set of phenotypes provided by the corresponding clinical sites. In order to incorporate patient phenotype information, we used two distinct methods to rank genes based on the Human Phenotype Ontology (HPO) [43]. We then annotated each variant with the best scores from their corresponding gene(s).

The first method uses phenotype-to-gene annotations provided by the HPO to calculate a cosine score [44] between the patient’s phenotypes and each gene. Given P terms in the HPO, this method builds a binary, P-dimensional vector for each patient such that only the phenotype terms (including ancestral terms in the ontology) associated with the patient are set to 1, and all other terms are set to 0. Similarly, a P-dimensional vector for each gene is built using the phenotype-to-gene annotations. Then, the cosine of the angle between the patient vector and each gene vector is calculated as a representation of similarity. This method tends to be more conservative because it relies solely on curated annotations from the HPO.
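
As a concrete illustration of this score, the sketch below computes the cosine similarity between a toy patient vector and a few toy gene vectors. The HPO terms and phenotype-to-gene annotations are invented for the example, and ancestral-term expansion is assumed to have been applied already.

```python
import numpy as np

# Toy HPO term universe (P terms); real vectors span the full ontology
# and include ancestral terms, which we assume are already expanded here.
hpo_terms = ["HP:0001", "HP:0002", "HP:0003", "HP:0004"]
term_index = {t: i for i, t in enumerate(hpo_terms)}

def binary_vector(terms):
    v = np.zeros(len(hpo_terms))
    for t in terms:
        v[term_index[t]] = 1.0
    return v

patient_vec = binary_vector(["HP:0001", "HP:0003"])

# Hypothetical phenotype-to-gene annotations from the HPO.
gene_terms = {"GENE_A": ["HP:0001", "HP:0002"], "GENE_B": ["HP:0003"]}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

scores = {g: cosine(patient_vec, binary_vector(ts)) for g, ts in gene_terms.items()}
print(scores)  # {'GENE_A': 0.5, 'GENE_B': ~0.707}
```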

The second method, an internally-developed tool called PyxisMap [30], uses the same phenotype-to-gene annotations from the HPO, but adds in automatically text-mined annotations from NCBI’s PubTator [45] and performs a Random-Walk with Restart [46] on the ontology graph structure. The PyxisMap method has the added benefit of incorporating gene-phenotype connections from recent papers that have not been manually curated into the HPO, but it also tends to make more spurious connections due to the imprecision of the text-mining from PubTator. Each method generates a single numerical feature that is used in the following analyses.
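
The sketch below shows a generic Random-Walk with Restart on a toy graph, purely to illustrate the idea; the graph, edge weights, and restart probability are placeholders and do not reproduce PyxisMap's actual implementation.

```python
import numpy as np

# Toy adjacency matrix for a small phenotype/gene graph (symmetric, unweighted).
# PyxisMap's real graph (HPO annotations plus PubTator text mining) is far larger.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

# Column-normalize to obtain a transition matrix.
W = A / A.sum(axis=0, keepdims=True)

def random_walk_with_restart(W, seed_indices, restart_prob=0.15, tol=1e-8):
    n = W.shape[0]
    restart = np.zeros(n)
    restart[seed_indices] = 1.0 / len(seed_indices)  # restart at patient phenotype nodes
    p = restart.copy()
    while True:
        p_next = (1 - restart_prob) * W @ p + restart_prob * restart
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Steady-state visit probabilities given seed node 0 (a hypothetical phenotype term).
print(random_walk_with_restart(W, seed_indices=[0]))
```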

Patient selection

In the clinical analysis, each patient was fully analyzed by one director and one analyst. After the initial analysis, the full team of directors and analysts reviewed flagged variants and determined their reported pathogenicity. In our analysis, we focused on variants that were clinically reported as “primary”, meaning the team of analysts believed the variant to be directly related to the patient’s phenotype. Note that secondary and/or incidental findings are specifically not included in this list. The team of analysts assigned each primary variant a classification of variant of uncertain significance (VUS), likely pathogenic, or pathogenic, adhering to the recommendations in the American College of Medical Genetics and Genomics (ACMG) guidelines for variant classification [37].

We required the following for each proband sample included in our analyses: 1) at least one clinically reported primary variant that came through the primary clinical filter (i.e. it was not found through some other targeted search) and 2) a set of phenotypes annotated with Human Phenotype Ontology [43] terms using the Phenotips software [47]. At the time of writing, this amounted to 378 primary-reported variants and 87819 unreported variants spanning a total of 237 proband samples.

Feature selection

For the purposes of classification, all annotations needed to be cleaned, reformatted, and stored as numerical features. For single-value numerical annotations (e.g. float values like CADD), we simply copied the annotation over as a single-value feature. Missing annotations were assigned a default value that was outside the expected value range for that feature. Additionally, these default values were always on the less impactful side of the spectrum (e.g. a default conservation score would err on the side of not being conserved). The one exception to this rule was for variant allele frequencies, where a variant absent from a database was considered to have an allele frequency of 0.0. For multi-value numerical annotations, we reduced the values (using minimum or maximum) to a single value corresponding to the “worst” value (i.e. most deleterious value, most conserved value, etc.), and that value was used as the feature.
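
A minimal sketch of this cleaning step is shown below. The annotation names and default values are hypothetical; the real defaults for each annotation are documented in Additional file 1.

```python
# Hypothetical annotation names and defaults; defaults sit outside the
# expected value range, on the "less impactful" side of each spectrum,
# except allele frequency where absence is treated as 0.0.
def clean_variant(raw):
    return {
        "cadd_scaled": raw.get("cadd_scaled", -1.0),   # single-value, default errs toward benign
        "allele_freq": raw.get("allele_freq", 0.0),    # absent from database -> frequency 0.0
        # Multi-value annotation reduced to the single "worst" (most conserved) value.
        "conservation": max(raw.get("conservation_scores", [-1.0])),
    }

print(clean_variant({"cadd_scaled": 23.7, "conservation_scores": [0.2, 0.9]}))
# {'cadd_scaled': 23.7, 'allele_freq': 0.0, 'conservation': 0.9}
```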

For categorical data, we relied on bin-count encoding to store the features. We chose bin-count encoding because there are many annotations where multiple categorical labels may be present at different quantities. For example, a single ClinVar variant may have multiple entries where different sites have selected different levels of pathogenicity. In this situation, we desired to capture not only the categorical label as a feature, but also the number of times that label occurred in the annotations.
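
As an illustration, the sketch below bin-count encodes a set of ClinVar-style pathogenicity labels for a single variant; the label vocabulary is simplified for the example.

```python
from collections import Counter

# Fixed label vocabulary for this annotation; illustrative only.
CLINVAR_LABELS = ["benign", "likely_benign", "uncertain", "likely_pathogenic", "pathogenic"]

def bin_count_encode(labels):
    """Return one count per label in the vocabulary, preserving how many
    times each categorical value appeared in the annotations."""
    counts = Counter(labels)
    return [counts.get(label, 0) for label in CLINVAR_LABELS]

# A variant with three ClinVar submissions from different sites.
print(bin_count_encode(["pathogenic", "pathogenic", "uncertain"]))
# [0, 0, 1, 0, 2]
```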

After converting all annotations to numerical features, we had a total of 95 features per variant. We then pruned down to only the top 20 features using univariate feature selection (specifically the SelectKBest method of sklearn [38]). This method evaluates how well an individual feature performs as a classifier and keeps only the top 20 features for the full classifiers. Note that only the training set was used to select the top features and that selection was later applied to the testing set prior to final evaluation. Table 1 shows the list of retained features ordered by feature importance after training. Feature importance was derived from the random forest classifiers, which automatically report how important each feature was for classification. The entire set of annotations, along with descriptions of how each was processed prior to feature selection, is detailed in the Additional file 1.
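
The selection step follows the standard sklearn pattern of fitting SelectKBest on the training data only and applying the same transform to the test data. The sketch below uses synthetic data and the sklearn default score function; the exact configuration used here is described in Additional file 1.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 95))    # 95 numerical features per variant
y_train = rng.integers(0, 2, size=1000)  # 1 = returned, 0 = not returned
X_test = rng.normal(size=(500, 95))

# Keep the 20 highest-scoring features, judged on the training set only.
selector = SelectKBest(score_func=f_classif, k=20).fit(X_train, y_train)
X_train_top = selector.transform(X_train)
X_test_top = selector.transform(X_test)      # same selection applied to the test set
print(X_train_top.shape, X_test_top.shape)   # (1000, 20) (500, 20)
```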

Table 1 Feature selection

Classifier training and tuning

As noted earlier, there are generally hundreds of variants per proband that pass the filter, but only a few are ever clinically reported. Across all 237 proband samples, there were a total of 378 clinically reported variants and another 87819 variants that were seen but not reported. As a result, there is a major imbalance in the number of true positives (variants clinically reported) and true negatives (variants seen, but not clinically reported).

We split the data into training and test sets on a per-proband basis with the primary goal of roughly balancing the total number of true positives in each set. Additionally, the cases were assigned to a particular set by chronological order of analysis in order to reduce any chronological biases that may be introduced by expanding scientific knowledge (i.e. there are roughly equal proportions of “early” or “late” proband samples from the UDN in each set). In the training set, there were a total of 189 returned variants and 44593 not returned variants spanning 120 different probands. In the test set, there were a total of 189 returned variants and 43226 not returned variants spanning 117 different probands. In our results, the returned test variants are further stratified by their reported levels of pathogenicity.
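
The sketch below illustrates one simple way to perform such a proband-level split: cases are processed in chronological order while the reported-variant counts are kept roughly balanced. It is a simplified stand-in for, not a reproduction of, the assignment procedure actually used.

```python
# probands: chronologically ordered list of (proband_id, n_reported_variants).
# Alternating greedy assignment keeps reported-variant counts and "early"/"late"
# cases roughly balanced between the two sets; this is a simplified sketch only.
def split_probands(probands):
    train, test = [], []
    train_tp, test_tp = 0, 0
    for proband_id, n_reported in probands:
        if train_tp <= test_tp:
            train.append(proband_id)
            train_tp += n_reported
        else:
            test.append(proband_id)
            test_tp += n_reported
    return train, test

train_ids, test_ids = split_probands([("UDN_A", 2), ("UDN_B", 2), ("UDN_C", 1), ("UDN_D", 1)])
print(train_ids, test_ids)  # ['UDN_A', 'UDN_C'] ['UDN_B', 'UDN_D']
```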

We then selected four publicly available binary-classification models that are capable of training on imbalanced datasets: the RandomForest model by sklearn [38], the LogisticRegression model by sklearn, the BalancedRandomForest model by imblearn [39], and the EasyEnsembleClassifier model by imblearn. These classifiers were chosen for three main reasons: 1) their ability to handle imbalanced data (i.e. far more unreported variants than reported variants), 2) their ability to scale to the size of the training and testing datasets, and 3) they are freely available implementations that can be tuned, trained, and tested with relative ease in the same Python framework. The two random forest classifiers build collections of decision trees that weight each training input by its class frequency. Logistic regression calculates the probability of a value belonging to a particular class, again weighting by the class frequency. In contrast to the other three tested methods, the ensemble classification balances the training input using random under-sampling and then trains an ensemble of AdaBoost learners. For more details on each classifier, please refer to the sklearn and imblearn documentation [38, 39].
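
The sketch below shows how these four classifiers can be instantiated with class weighting or internal resampling to address the imbalance. The hyperparameter values are illustrative defaults, not the tuned settings reported in Additional file 1.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

classifiers = {
    # Both sklearn models weight each sample inversely to its class frequency.
    "RandomForest": RandomForestClassifier(class_weight="balanced", n_estimators=100),
    "LogisticRegression": LogisticRegression(class_weight="balanced", max_iter=1000),
    # The imblearn models handle the imbalance via random under-sampling internally.
    "BalancedRandomForest": BalancedRandomForestClassifier(n_estimators=100),
    "EasyEnsemble": EasyEnsembleClassifier(n_estimators=10),
}

# Each model exposes fit(X, y) and predict_proba(X); the probability of the
# "returned" class is what is later used for ranking, e.g.:
# for name, clf in classifiers.items():
#     clf.fit(X_train_top, y_train)
#     scores = clf.predict_proba(X_test_top)[:, 1]
```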

Initially, we also tested the support vector classifier by sklearn (SVC), the multi-layer perceptron by sklearn (MLPClassifier), and the random under-sampling AdaBoost classifier by imblearn (RUSBoostClassifier). Each of these was excluded from our results due to, respectively, scaling issues with the training size, failure to handle the data imbalance, and overfitting to the training set. While we did not achieve positive results using these three implementations, it may be possible to use the methods through another implementation.

For each of our tested classifiers, we selected a list of hyperparameters to test and tested each possible combination of those hyperparameters. For each classifier and set of hyperparameters, we performed stratified 10-fold cross validation on the training variants and recorded the balanced accuracy (i.e. weighted accuracy based on inverse class frequency) and the F1 scores (i.e. harmonic mean between precision and recall). For each classifier type, we saved the hyperparameters and classifier with the best average F1 score (this is recommended for imbalanced datasets). These four tuned classifiers were then trained on the full training set and tested against the unseen set of test proband cases. The set of hyperparameters tested along with the highest performance setting for each hyperparameter can be found in the Additional file 1.
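
One way to implement this tuning step is sklearn's GridSearchCV with stratified 10-fold cross validation refit on F1; the sketch below uses an illustrative parameter grid for one classifier and is not necessarily the exact tuning code used (the actual grids are listed in Additional file 1).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Illustrative grid only; the real hyperparameter grids are in Additional file 1.
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10, 20]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced"),
    param_grid,
    scoring=["balanced_accuracy", "f1"],  # record both metrics...
    refit="f1",                           # ...but keep the model with the best mean F1
    cv=StratifiedKFold(n_splits=10),
)
# search.fit(X_train_top, y_train)  # then evaluate search.best_estimator_ on the test set
```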

Results

Classifier statistics

The hyperparameters for each classifier were tuned using 10-fold cross validation and the resulting average and standard deviation of balanced accuracy is reported in Table 2. After fitting the tuned classifiers to the full training set, we evaluated the classifiers on the testing set by calculating the area under the receiver operator curve (AUROC) and area under the precision-recall curve (AUPRC) (also shown in Table 2). Figure 1 shows the corresponding receiver operator curves and precision-recall curves for the results from the testing set on all four classifiers.
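
Both metrics follow directly from the predicted class probabilities on the test set; the sketch below shows how they can be computed with sklearn on toy labels and scores.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

# y_test: 1 = clinically reported, 0 = not reported; scores: P(reported) from a classifier.
y_test = np.array([0, 0, 1, 0, 1, 0, 0, 0])
scores = np.array([0.1, 0.3, 0.8, 0.2, 0.4, 0.05, 0.6, 0.15])

auroc = roc_auc_score(y_test, scores)
auprc = average_precision_score(y_test, scores)   # average precision, a standard AUPRC estimate
fpr, tpr, _ = roc_curve(y_test, scores)           # points for the ROC plot
precision, recall, _ = precision_recall_curve(y_test, scores)  # points for the PR plot
print(f"AUROC={auroc:.3f}, AUPRC={auprc:.3f}")
```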

Fig. 1
figure 1

Receiver operator and precision-recall curves. These figures show the performance of the four classifiers on the testing set after hyperparameter tuning and fitting to the training set. On the left, we show the receiver operator curve (true positive rate against the false positive rate). On the right, we show the precision-recall curve. Area under the curve (AUROC or AUPRC) is reported beside each method in the legend

Table 2 Classifier performance statistics

From these metrics, we can see that all four classifiers have a similar performance with regards to AUROC. However, all classifiers have a relatively poor performance from a precision-recall perspective (best AUPRC was 0.2458). This indicates that from a classification perspective, these classifiers would identify a high number of false positives relative to the true positives unless a very conservative cutoff score was used. Practically, we would not recommend using these trained classifiers to perform automated reporting because doing so would either report a large number of false positives or miss a large number of true positives.

Ranking statistics

We also quantified the performance of each classifier as a ranking system. For each proband, we used the classifiers to calculate the probability of each class (reported or not reported) for each variant and ranked those variants from highest to lowest probability of being reported. We then calculated median and mean rank statistics for the reported variants. Additionally, we quantified the percentage of reported variants that were ranked in the top 1, 10, and 20 variants in each case. While the classifiers were trained as a binary classification system, we stratified the results further to demonstrate differences between variants that were clinically reported as a variant of uncertain significance (VUS), likely pathogenic, and pathogenic.
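
The sketch below shows this per-proband ranking evaluation on a toy example; the data layout (one probability and one reported/not-reported label per filtered variant) is hypothetical.

```python
import numpy as np

def reported_ranks(scores, reported):
    """Rank one proband's variants by P(reported), highest first, and
    return the 1-based ranks of the clinically reported variants."""
    order = np.argsort(-np.asarray(scores))     # descending by probability
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return [int(r) for r, rep in zip(ranks, reported) if rep]

# Toy example: one proband with five filtered variants, one of which was reported.
ranks = reported_ranks(scores=[0.1, 0.9, 0.3, 0.2, 0.05], reported=[0, 1, 0, 0, 0])
all_ranks = ranks  # in practice, concatenate ranks across all test probands
print(np.median(all_ranks), np.mean(all_ranks))
print(f"top-20 rate: {np.mean([r <= 20 for r in all_ranks]):.0%}")
```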

For comparison, we chose to run Exomiser [33], Phen-Gen [48], and DeepPVP [34]. For each tool, we input the exact same set of phenotype terms used by the classifiers we tested. Additionally, we used the same set of pre-filtered variants from Codicem as input to each ranking algorithm. As a result, all external tools and our trained classifiers ranked identical phenotype and variant information.

For Exomiser, we followed the installation instructions on their website to install Exomiser CLI v.11.0.0 along with version 1811 of the hg19 data sources. We ran Exomiser twice, once using the default hiPhive prioritizer (which incorporates knowledge from human, mouse, and fish) and once using the human-only version of the hiPhive prioritizer (this was recommended instead of the PhenIX algorithm [32]). Phen-Gen V1 was run from the pre-compiled binary using the “dominant” and “genomic” modes to maximize the output. Of note, Phen-Gen was the only external method that did not fully rank all variants, so we conservatively assumed that any absent variants were at the next best possible rank. Thus, the reported Phen-Gen comparisons are an optimistic representation for this test data. Finally, DeepPVP v2.1 was run using the instructions available on their website. Details on the exact installation and execution of each external tool can be found in the Additional file 1.
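
One way to express this conservative handling is to assign every variant the tool did not rank the rank immediately after the worst rank it did report; the sketch below implements that reading on a toy example and reflects our interpretation rather than Phen-Gen's own behavior.

```python
def fill_missing_ranks(tool_ranks, all_variant_ids):
    """Assign every variant the tool did not rank the next best possible rank
    (one past the worst reported rank), an optimistic assumption for the tool."""
    next_best = max(tool_ranks.values(), default=0) + 1
    return {v: tool_ranks.get(v, next_best) for v in all_variant_ids}

# The tool ranked only two of four filtered variants; the rest share rank 3.
print(fill_missing_ranks({"var1": 1, "var2": 2}, ["var1", "var2", "var3", "var4"]))
# {'var1': 1, 'var2': 2, 'var3': 3, 'var4': 3}
```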

Finally, we added two control scores for comparison: CADD scaled and HPO-cosine. These scores were inputs to each classifier, but also represent two common ways one might naively order variants after filtering (by predicted deleteriousness and by similarity to phenotype). The results for the two control scores, all four external tools, and all four trained classifiers are shown in Tables 3 and 4. A figure visualizing all ranking results can be found in the Additional file 1.

Table 3 Ranking performance statistics
Table 4 Top variant statistics. This table shows the ranking performance statistics for all methods evaluated on our test set (same order as Table 3)

Across the full test set, all four classifiers outperformed the single-value measures and the external tools. The median rank ranged from 6-10 for the trained classifiers compared to 15 for the best externally tested tool. The classifiers ranked 16-23% of all reported variants in the first position and 65-72% in the top 20. As one would intuitively expect, all classifiers performed better as the returned pathogenicity increased, ranking 33-52% of pathogenic variants in the first position and 80-94% of pathogenic variants in the top 20.

Discussion

There are two major factors that we believe are influencing the classifiers’ performance relative to the externally tested tools. First, all results were generated using real-world patients from the UDN, but only our four classifiers were trained on real-world patients from the UDN. In contrast, the four external tools were primarily evaluated and/or trained using simulations that do not capture the variation and/or uncertainty that is apparent in the UDN patient datasets. Second, the four classifiers we tested have far more information (i.e. features) available to them than the external tools. As noted in our methods, we tried to reflect an analyst’s view of each variant as much as possible, starting with 95 features that were pruned down to 20 features used by each classifier. Incorporating the same set of features and/or training on real-world patients may improve the externally tested tools with respect to these classifiers.

We expect these classification algorithms could be refined in a variety of ways. First, adding new features could lead to increased performance in the classifiers. Additionally, some of the features represent data that is not freely available to the research community, so replacing those features with publicly accessible sources would likely influence the results. Second, there may be better classification algorithms for this type of data. The four selected classifiers were all freely available methods intended to handle the large class imbalance in the training set, but other algorithms that are not as readily available may perform better.

Finally, training the classifier on different patient populations will likely yield different results, especially in terms of feature selection and feature importances. The patient phenotypes were gathered from multiple clinical sites, but the reported variants were generated by one clinical laboratory. While multiple analysts worked on each case and a team review process was in place, we suspect that a classifier trained on results from multiple laboratories would produce different results. Furthermore, our classifiers were trained on a wide range of rare disease patients, so restricting to a particular disease type (based on inheritance, phenotype, impacted tissue, etc.) may allow the classifiers to focus on different feature sets that yield better results.

Conclusion

We assessed the application of binary classification algorithms for identifying variants that were ultimately returned on a clinical report for rare disease patients. We trained and tested these algorithms using real patient variants and phenotype terms obtained from the Undiagnosed Diseases Network. From a classification perspective, we found that these methods tend to have low precision scores, meaning a high number of false positives were identified by each method. However, when evaluated as a ranking system, all four methods outperformed the single-measure ranking systems and external tools that were tested. The classifiers had median ranks of 6-10 for all reported variants and ranked 65-72% of those variants in the top 20 for the case. For “Pathogenic” variants, the median ranks were 1-4 and 80-94% of those variants were ranked in the top 20 for the case.

Overall, we believe the classifiers trained in VarSight represent a significant step forward in tackling real clinical data. The tested classifiers improved our ability to prioritize variants despite the variability and uncertainty injected by real-world patients. Ultimately, we believe implementing these classifiers will enable analysts to assess the best candidate variants first, allowing for faster clinical throughput and increased automation in the future.