Whole genome sequencing of one complex pedigree illustrates challenges with genomic medicine
Human Phenotype Ontology (HPO) has risen as a useful tool for precision medicine by providing a standardized vocabulary of phenotypic abnormalities to describe presentations of human pathologies; however, there have been relatively few reports combining whole genome sequencing (WGS) and HPO, especially in the context of structural variants.
We illustrate an integrative analysis of WGS and HPO using an extended pedigree, which involves Prader–Willi Syndrome (PWS), hereditary hemochromatosis (HH), and dysautonomia-like symptoms. A comprehensive WGS pipeline was used to ensure reliable detection of genomic variants. Beyond variant filtering, we pursued phenotypic prioritization of candidate genes using Phenolyzer.
Regarding PWS, WGS confirmed a 5.5 Mb de novo deletion of the paternal allele at 15q11.2 to 15q13.1. Phenolyzer successfully returned the diagnosis of PWS, and pinpointed clinically relevant genes in the deletion. Further, Phenolyzer revealed how each of the genes is linked with the phenotypes represented by HPO terms. For HH, WGS identified a known disease variant (p.C282Y) in HFE of an affected female. Analysis of HPO terms alone failed to provide a correct diagnosis, but Phenolyzer successfully revealed the phenotype-genotype relationship using a disease-centric approach. Finally, Phenolyzer also revealed the complexity behind dysautonomia-like symptoms, and seven variants that might be associated with the phenotypes were identified by manual filtering based on a dominant inheritance model.
The integration of WGS and HPO can inform comprehensive molecular diagnosis for patients, eliminate false positives and reveal novel insights into undiagnosed diseases. Due to extreme heterogeneity and insufficient knowledge of human diseases, it is also important that phenotypic and genomic data are standardized and shared simultaneously.
Keywords: Whole genome sequencing, Precision medicine, Human phenotype ontology, Phenolyzer, Variant calling, Prader–Willi Syndrome, Dysautonomia, Hemochromatosis
CIPA: Congenital insensitivity to pain with anhidrosis
CNV: Copy number variation
HPO: The human phenotype ontology
INDELs: Insertions and deletions
Kb: Kilo base pairs
PCR: Polymerase chain reaction
WES: Whole exome sequencing
WGS: Whole genome sequencing
Many genetic tests have been commonly performed on individuals that have phenotypes overlapping with known diseases, especially for cancer and rare diseases [1, 2, 3, 4]. Physicians have also been routinely prescribing prenatal genetic tests and newborn screenings in clinics [5, 6, 7]. However, there is a degree of uncertainty inherent in most genetic testing regarding the development, age of onset, and severity of disease. In addition, current genetic testing has not yet established predictive or even diagnostic value for common complex diseases. Some groups have begun to leverage the power of next-generation sequencing (NGS) to help diagnose rare diseases [10, 11, 12, 13]. Many studies have used whole exome sequencing (WES) to facilitate the molecular diagnosis of individuals with diseases that appear to have a single large-effect size mutation contributing substantially to the development of the disease, referred to by many as “Mendelian disorders” [14, 15]. Of course, such disorders also have an extraordinary phenotypic variability and spectrum brought about by genetic background, environmental differences and stochastic developmental variation (SDV) [16, 17, 18, 19, 20].
Despite much success using NGS-based techniques to identify mutations, there are still practical issues with the analytic validity of exome- or genome-wide NGS-based techniques, particularly in clinical settings [21, 22]. The clinical utility of genomic medicine is also uncertain, prompting some to suggest the need for better standards and benchmarking [23, 24]. However, the genetic architecture behind human disease is heterogeneous, and there are many reports of regulatory variants in the non-coding genome and splicing variants in the intronic regions that have a large-effect size on particular phenotypes [25, 26, 27, 28, 29, 30]. In hypothesis-driven research studies, one might gain higher statistical power with a larger sample size by using cheaper NGS assays like WES or gene panels. But whole genome sequencing (WGS) has a unique strength in its ability to cover a broader spectrum of variants: small insertions and deletions (INDELs), structural variants (SVs), and copy number variants (CNVs). This becomes extremely valuable in studies where disease-associated variants might not necessarily be SNVs [31, 32, 33]. In particular, from a study design perspective, WGS results in a more uniform coverage and better detection of INDELs, and is free of exome capture deficiency issues. Of course, cost and technical considerations are still practical issues for WGS, but this will eventually become the optimal assay to address the extreme heterogeneity of different genetic architectures for different diseases.
Human Phenotype Ontology (HPO) has risen as a useful technique for precision medicine by providing a standardized vocabulary bank of phenotypic abnormalities to describe presentations of human pathologies [35, 36, 37]. Some showed that phenotypic matching can help interpret CNV findings based on integrated cross-species phenotypic information. The potential clinical usage of HPO derives from a wealth of medical literature and databases such as Online Mendelian Inheritance in Man (OMIM). Computational tools like Exomiser and PhenIX were developed to aid disease-associated variant prioritization from exome sequencing data [39, 40, 41], and this has been recently extended with the development of Genomiser for WGS data. Another tool is Phenolyzer, which uses prior biological knowledge and phenotype information to implicate genes involved in diseases. Phenolyzer reveals the hidden connection of genotypes and phenotypes by examining gene-gene, gene-disease and disease-phenotype interactions. Based on standardized phenotypic reports, Phenolyzer can be used to further prioritize WGS findings for disease-associated variants.
We report here a comprehensive analysis of an extended pedigree, including genomic filtering on WGS data and phenotypic prioritization of candidate genes using Phenolyzer. The pedigree involves probands with Prader–Willi Syndrome (PWS) [44, 45], hereditary hemochromatosis (HH), dysautonomia-like symptoms, Tourette Syndrome (TS) and other illnesses. We specifically chose this family for whole genome sequencing due to the phenotypic complexity in the family, including at least one genetic syndrome with a known genetic etiology, which on some level serves as a positive control among a range of diseases of unknown (or controversial) genetic architecture. Nine members of the family underwent WGS, enabling a wide scope of variant calling from SNVs to large copy number events. Notably, this is the first report of an Illumina HiSeq WGS experiment on a PWS individual carrying the paternally-inherited deletion. The use of WGS enables the reconstruction of the recombination event in this imprinting hotspot, which provides a better understanding of the PWS disease mechanism. This report emphasizes the effectiveness of Phenolyzer, which can be used to integrate and share WGS and HPO data. Neither technique is yet perfect for clinical diagnosis, but combining the two can help eliminate false positives and reveal novel insights into human diseases.
Clinical phenotyping of individuals participating in this study
The family was interviewed by the corresponding author, GJL, a board-certified child, adolescent and adult psychiatrist. Medical records were obtained and reviewed, in conjunction with further interviews with the family. The interviews were videotaped and later reviewed to facilitate further diagnostic efforts. Various clinical diagnostic tests were performed on K10031-10133, including a tilt table test, brain MRI, ultrasound of the kidneys and chest X-ray. In addition, her cholesterol level, thyroid profile, urine vanillylmandelic acid (VMA), catecholamines panel (urine-free), basic metabolic panel (BMP), and epinephrine and norepinephrine levels were also screened. Other clinical tests included electrocardiogram (EKG), polysomnographic report, and echocardiogram. For K10031-10232, the following diagnostic evaluations were performed: multiple sleep latency test (MSLT), Autism Diagnostic Observation Schedule (ADOS) module 2, the Childhood Autism Rating Scale (CARS), Behavior Assessment System for Children (BASC), Intelligence Quotient (IQ), and Abnormal Involuntary Movement Scale (AIMS).
Generation of WGS and microarray data
Blood and saliva samples were collected from nine individuals (K10031-10143, 10144, 10145, 10235, 10133, 10138, 10231, 10232, 10233) from the extended pedigree described in the results. Two CLIA-certified WGS tests (K10031-10133 and K10031-10138) were performed at Illumina, San Diego. The other seven WGS runs were performed at the sequencing center at Cold Spring Harbor Laboratory (CSHL). All libraries were constructed with PCR amplification, and sequenced on one Illumina HiSeq2000 with an average paired-end read length of 100 bp. Since the DNA extracted from saliva samples contains a certain proportion of bacterial DNA, these samples were sequenced on additional lanes to achieve an average coverage of 40X after removing unmapped reads (Additional file 1: Table S1). Microarray data for the same samples were generated with the Illumina Omni 2.5 microarray at the Center for Applied Genomics Core of the Children’s Hospital of Philadelphia (CHOP). Illumina Genome Studio was used to extract the SNV calls, log R ratio (LRR) and B allele frequency (BAF) from the microarray data. The general analysis work-flow is shown in Additional file 1: Fig. S1.
Alignment and variant calling of WGS data
All of the unmapped raw reads were excluded to remove the sequence reads coming from the bacterial DNA (step 2 of Additional file 1: Fig. S1). The remaining reads were aligned to the human reference genome (build hg19) with BWA-MEM (v0.7-6a). In parallel, reads were also aligned with NovoAlign (v3.00.04) to reduce false negatives resulting from alignment artifacts. All of the alignments were sorted with SAMtools (v0.1.18) and PCR duplicates were marked with Picard (v1.91). For the BWA-MEM bam files, INDELs were realigned with the GATK IndelRealigner (v2.6-4) and base quality scores were recalibrated. For variant calling with FreeBayes, the alignment files were not processed with INDEL-realignment and base quality recalibration as these additional steps are not required by FreeBayes. Qualimap (v2.0) was used to perform QC analysis on the alignment files.
In order not to miss potentially disease-contributory variants, more than one pipeline was used to detect SNVs, INDELs, SVs, and CNVs [56, 57]. All variants were included in the downstream analysis and orthogonal validations were performed to confirm the variants of interest (step 3 to step 5 of Additional file 1: Fig. S1). First, SNVs and INDELs were jointly called from nine genomes with GATK HaplotypeCaller (v3.1-1) from the BWA-MEM alignment following best practices. Second, a default parameter setting was used to call variants using FreeBayes from the NovoAlign alignment. Third, Scalpel (v0.1.1) was used with the BWA-MEM bam files to identify INDELs in the exonic regions with sizes up to 100 bp. Each exon was expanded by 20 bp upstream and 20 bp downstream to reveal possible INDELs harboring splicing sites. Following the benchmarking results as recently reported, Scalpel INDEL calls were filtered out if they had an alternative allele coverage less than five and a Chi-Square score greater than 10.8. Fourth, RepeatSeq (v0.8.2) was utilized to detect variants near short tandem repeat regions in the genome using default settings. Fifth, Lumpy (v0.2.6) and CNVnator were both used to call SVs with sizes >100 bp [62, 63]. Among Lumpy calls, events supported by >50 reads or fewer than four reads were excluded because regions of either too low or too high coverage are more likely to contain biases in sequencing or alignment. Sixth, ERDS (v1.1) was used to call CNVs from the BWA-MEM bam files with default settings. Among ERDS calls with a confidence score >300, duplications with sizes <200 Kb and deletion calls with sizes <10 Kb were excluded from downstream analysis. CNVnator (v0.3) was used to identify smaller CNVs that are present in the WGS data using the parameters -his 100, -stat 100, -partition 100, -call 100. Seventh, to achieve high confidence CNV calls, PennCNV (2011Jun16 version) was used to call CNVs from the microarray data.
Each CNV was supported by at least 10 markers, excluding CNVs with an inter-marker distance of >50 Kb. SVs and CNVs that overlapped with segmental duplication regions by 50% were also filtered out with BEDtools .
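The read-support, size and segmental-duplication filters described above can be illustrated with a short Python sketch. This is not the code used in the study: the Call record, its field names and the function names are hypothetical, the treatment of ERDS calls with a score of 300 or less as discarded is our reading of the text, and "overlapped by 50%" is interpreted here as at least half of the call length being covered (the study performed this step with BEDtools).

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical record for one SV/CNV call (0-based, half-open interval)."""
    chrom: str
    start: int
    end: int
    kind: str            # "DEL" or "DUP"
    support: int = 0     # number of supporting reads (Lumpy-style calls)
    score: float = 0.0   # confidence score (ERDS-style calls)


def keep_lumpy(call):
    """Drop events supported by more than 50 reads or fewer than four reads."""
    return 4 <= call.support <= 50


def keep_erds(call):
    """Keep calls with score > 300 (assumed), then apply the size cut-offs."""
    if call.score <= 300:
        return False
    size = call.end - call.start
    if call.kind == "DUP":
        return size >= 200_000   # duplications < 200 Kb are excluded
    return size >= 10_000        # deletions < 10 Kb are excluded


def segdup_fraction(call, segdups):
    """Fraction of the call covered by (chrom, start, end) segdup intervals."""
    spans = sorted(
        (max(s, call.start), min(e, call.end))
        for c, s, e in segdups
        if c == call.chrom and s < call.end and e > call.start
    )
    covered, cursor = 0, call.start
    for s, e in spans:
        s = max(s, cursor)       # skip bases already counted
        if e > s:
            covered += e - s
            cursor = e
    return covered / (call.end - call.start)


def keep_outside_segdups(call, segdups, max_fraction=0.5):
    """Filter out calls with at least max_fraction of their length in segdups."""
    return segdup_fraction(call, segdups) < max_fraction
```

Counting overlapping segmental duplication intervals only once keeps the covered fraction from exceeding one when annotation tracks contain nested or overlapping entries.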
Genomic filtering and annotations of the variants
To annotate the variants of interest, GEMINI (v0.11.0) and ANNOVAR (2013Aug23 version) were used (step 6 of Additional file 1: Fig. S1) [67, 68]. The circos plot of K10031-10232’s genome was generated using circlize in R. The population allele frequencies (AF) were loaded with GEMINI from the 1000G database (http://www.1000genomes.org/) and Exome Aggregation Consortium (ExAC) database (http://exac.broadinstitute.org/). GEMINI also served to import the CADD C-scores, loss-of-function variants defined by LOFTEE, and the reported pathogenicity information from the ClinVar database [71, 72]. There were several steps in filtering variants with respect to the segregation pattern, population frequency, allele deleteriousness prediction, and ClinVar annotation. First, variants were partitioned by the following disease inheritance models: autosomal dominant, autosomal recessive, de novo, compound heterozygous, and X-linked dominant. Second, autosomal or X-linked dominant and de novo variants were excluded if they had an alternative allele frequency (AAF) >0.01 in either the ExAC or 1000G database, while the cut-off was increased to 0.05 for autosomal recessive and compound heterozygous variants. Third, only the variants that met the following criteria were considered in the downstream analysis: 1) called by at least one pipeline and validated with a second pipeline, 2) had an adjusted p-value lower than 0.05 reported by pVAAST, 3) defined as medium or high impact by GEMINI, or defined as loss-of-function by LOFTEE, 4) with a CADD C-score greater than 15. Fourth, we also searched for variants that were considered as pathogenic, probably-pathogenic, mixed, or drug-response in the ClinVar database. Lastly, the VCF files were also uploaded to the Omicia Opal platform and the Tute Genomics platform for online annotation, filtering, and pharmacogenomic analysis. The Tute Genomics variant interpretation report for each individual can be found in Additional file 2.
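The frequency and deleteriousness steps can be summarized in a minimal Python sketch. It is illustrative only: variants are represented as plain dictionaries whose keys (af_exac, af_1kg, impact, is_lof, cadd) are hypothetical names rather than GEMINI column names, a missing population frequency is assumed to mean the allele was not observed, and only criteria 3 and 4 of the third step are shown, combined with a logical AND as an assumption.

```python
# AAF cut-offs per inheritance model, as stated in the text.
AAF_CUTOFF = {
    "autosomal_dominant": 0.01,
    "x_linked_dominant": 0.01,
    "de_novo": 0.01,
    "autosomal_recessive": 0.05,
    "compound_heterozygous": 0.05,
}


def passes_frequency(variant, model):
    """Exclude a variant if its AAF exceeds the cut-off in either database."""
    cutoff = AAF_CUTOFF[model]
    # Assumption: None (absent from a database) is treated as frequency 0.0.
    exac = variant.get("af_exac") or 0.0
    kg = variant.get("af_1kg") or 0.0
    return max(exac, kg) <= cutoff


def passes_deleteriousness(variant):
    """Medium/high impact or loss-of-function, and CADD C-score above 15."""
    damaging = variant.get("impact") in {"MED", "HIGH"} or bool(variant.get("is_lof"))
    return damaging and (variant.get("cadd") or 0.0) > 15


def filter_variants(variants, model):
    """Return the variants that survive both steps for one inheritance model."""
    return [
        v for v in variants
        if passes_frequency(v, model) and passes_deleteriousness(v)
    ]
```

Keeping the cut-offs in a single table makes the looser threshold for recessive and compound heterozygous models explicit, since carriers of a recessive allele are expected to be more common in the population.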
Phenotypic prioritization of candidate genes using Phenolyzer
Table 1 Main Clinical Presentation of Proband K10031-10232
Development and growth
Delayed speech and language development
Growth hormone deficiency
Poor fine motor coordination
Mild intellectual disability
Downslanted palpebral fissures
Other physical features
Excessive daytime sleepiness
Obstructive sleep apnea syndrome
Impaired ability to form peer relationships
Impaired social reciprocity
Inflexible adherence to routines or rituals
Low frustration tolerance
Poor eye contact
Short attention span
Table 2 Main Clinical Presentation of Proband K10031-10133
Patent foramen ovale
Gynecologic & genitourinary
To find out which HPO terms affect our results the most, we performed a ranking analysis with Phenolyzer. We used each individual HPO term as input and compared the Phenolyzer scores of the CNV containing NDN and SNRPN. Ideally, the higher the score, the more important this HPO term is to this CNV. Further, to understand the impact of the number of HPO terms on the final result, we randomly downsampled to a smaller number (one to six) of HPO terms from the entire set of 21. Then we used each combination as an input for Phenolyzer analysis. We defined the confidence level of a result based on the Phenolyzer score of the correct CNV: ‘High confidence’ (≥ 0.5), ‘Medium confidence’ (0.1 ≤ Phenolyzer score < 0.5) and ‘Low confidence’ (< 0.1). For each scenario (one to six HPO terms), we counted the number of times when the correct CNV was prioritized at high/medium/low confidence levels. Finally, we computed and summarized the percentage of each (Fig. 6).
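The downsampling procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used in the study: score_fn is a hypothetical stand-in for running Phenolyzer on a subset of HPO terms and reading off the score of the correct CNV, and the number of random draws per subset size (draws) is an arbitrary choice, since the text does not state how many combinations were evaluated.

```python
import random
from collections import Counter


def confidence_level(score):
    """Map a Phenolyzer score to the confidence bins defined in the text."""
    if score >= 0.5:
        return "high"
    if score >= 0.1:
        return "medium"
    return "low"


def downsample_summary(terms, score_fn, sizes=range(1, 7), draws=100, seed=0):
    """For each subset size, report the percentage of draws in each bin.

    terms    -- list of HPO terms (21 in the study)
    score_fn -- hypothetical callable: list of terms -> score of the correct CNV
    draws    -- assumed number of random subsets per size
    """
    rng = random.Random(seed)
    summary = {}
    for k in sizes:
        counts = Counter(
            confidence_level(score_fn(rng.sample(terms, k)))
            for _ in range(draws)
        )
        summary[k] = {
            level: 100.0 * counts[level] / draws
            for level in ("high", "medium", "low")
        }
    return summary
```

With a real scoring function, each entry of the returned dictionary corresponds to one scenario (one to six HPO terms) and holds the three percentages that would be plotted.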
Results and discussion
Clinical presentation (with HPO annotation) and family history
Proband K10031-10232 is a 25-year-old (25 y.o.) male. He is the son of a Caucasian father (K10031-10231), and an Asian mother (did not participate in the study). He has two older male siblings, namely K10031-10233 and K10031-10234. This proband was diagnosed with PWS at 11 months old, and has dysmorphic facial features including a narrow forehead, downslanted palpebral fissures and almond-shaped eyes. A description of a video recording (HDV_0073) illustrating his clinical manifestations can be found in the supplemental section, and the video can be provided on request to qualified investigators. Since the PWS diagnosis, his behavior has been assessed in great detail (Table 1, and Additional file 1: Supplemental Data), and the following diagnoses have been given: obsessive-compulsive disorder (OCD), depression, anxiety disorder, pervasive developmental disorder (PDD), hyperphagia, trichotillomania, and daytime hypersomnolence. He has an IQ ranging between 60 and 65, categorized as mild mental retardation. He also has diagnoses of mild dysarthria, obstructive sleep apnea syndrome (OSAS), and severe scoliosis. The latter has been corrected surgically. He has also undergone orchiopexy, tonsillectomy, and adenoidectomy. His physical exam is otherwise unremarkable. He has denied having significant psychotic symptoms, including auditory or visual hallucinations, delusions, ideas of grandiosity, or paranoid ideation.
In an effort to help standardize phenotype reporting, we used Human Phenotype Ontology (HPO) annotation. See Table 1 and Additional file 1: Table S5 for a list of clinical phenotype features collected from this proband. The Phenomizer tool ranked the diagnosis for Prader-Willi Syndrome as the highest priority diagnosis for this proband (see Additional file 3), supporting the fact that highly specific and annotated phenotype information can yield accurate diagnoses, at least for a characteristic syndrome like PWS. As presented below, the genomic analysis of proband K10031-10232 further confirmed deletions in the chromosome regions from 15q11.2 to 15q13.1, making PWS the most credible diagnosis for him at present.
Proband K10031-10133 is a 26 y.o. female, born to a Caucasian mother (K10031-10145) and a Caucasian father (K10031-10144). She is the eldest child amongst her two sisters and two brothers. Prior to age 18, K10031-10133 had a fairly unremarkable medical history. Arthralgia and episodes of fatigue and dizziness started at around 18 years of age. At age 20, she started to have refractory syncopal events, which led to multiple body injuries. During the same period of time, she also developed postural orthostatic tachycardia syndrome (POTS), heart palpitations, gastroparesis, urinary incontinence, diplopia, and seizures. In addition, she reported experiencing auditory and visual hallucinations. She underwent a dysautonomia evaluation, which revealed a positive tilt table test. Other tests revealed unusual changes to her optic disks but without an elevated intraocular pressure, and nonspecific findings on her brain MRI, including a subtle focus of T2 signal abnormality involving the subcortical white matter of the right parietal lobe without associated enhancement. See Table 2 and Additional file 1: Table S6 for proband K10031-10133’s clinical phenotype list with HPO annotations, and Additional file 1: Supplemental Data for a full report of HPO analysis on her. Descriptions of video recordings (HDV_0079) of this proband illustrating her medical presentation and (HDV_0072) in which conditions in other family members are discussed are included in the supplemental videos section, and these videos can be provided on request to qualified investigators.
As for her family history (Additional file 1: Table S6), there are some noticeable symptoms that are shared by all her siblings and her mother, including dysautonomia-like symptoms such as dizziness and fainting, as well as tremors and asthma. In addition, anxiety, attention deficit, arthritis, dyslexia, gastroesophageal reflux, seizures and TS are other diagnoses found among her siblings. Her mother (K10031-10145), on the other hand, has HH and OCD traits. Her father has significant migraines, gastroesophageal reflux, hiatal hernia, and right sensorineural hearing loss. See detailed descriptions of her family members in Additional file 1: Supplemental Data. We are highlighting here that extensive characterization of families, including videotaping and the collection of collateral information from other relatives, yields a rich texture of findings that are not always easily captured in written medical records.
Summary statistics of the WGS data
WGS identified de novo CNV deletions in 15q11.2 to 15q13.1 of proband K10031-10232
Phenolyzer discovered interaction between PWS deletions and disease subtypes
Phenolyzer revealed the relationship between p.C282Y variant and HH in individual K10031-10145, which was missed by HPO analysis alone
Results from analyzing the WGS data showed that the mother’s brother (K10031-10231) is also homozygous for the p.C282Y variant in HFE. However, his clinical test results have not yet provided any evidence to support the diagnosis of HH, even though male p.C282Y homozygotes are considered more likely to develop iron-overload–related diseases due to the lack of iron clearance events like menstruation and pregnancy in women. This is in line with the fact that even family members can have variable expressivity of disease, including different ages of onset. This instance with the brother and the sister again highlights the point that the phenotypic expression of a given mutation in HFE may vary widely, influenced in part by unidentified modifier loci [83, 84, 85, 86, 87, 88, 89]. Some studies previously estimated that less than 1% of individuals in the U.S. carrying homozygous mutations present clearly with clinical diagnoses of hemochromatosis. In contrast to studies that have searched for the “causal” gene, some have reported that genetic variations can instead have large effects on phenotypic variability, suggesting underlying genomic complexity from multiple interacting loci [91, 92, 93, 94]. Understanding such diseases thus requires probabilistic thinking about the risk of developing the clinical manifestation, rather than deterministic genotype-phenotype “causation” [16, 95, 96, 97], and there will always be some level of stochasticity as well. Further, alongside the primary research-focused analysis, the participating subjects and families also received the research findings (Additional file 1: Table S8). Of course, we cannot exclude the possibility that we might have missed some variants, including possibly non-coding variants, and we expect that the future phenotyping, sequencing, and collation of data from millions of people will reveal associations that are not currently known.
Analysis of dysautonomia-like symptoms
None of the family members with dysautonomia-like symptoms carry any previously reported variants in IKBKAP that are implicated in the autosomal recessive transmission of familial dysautonomia (FD), which is also called hereditary sensory and autonomic neuropathy type III (HSAN-III). The WGS data have effective sequence coverage (average coverage >40X) for this gene, but no novel rare variants were identified. Notably, both the mother (K10031-10145) and the male proband (K10031-10138) carry heterozygous variants of p.H604Y and p.G613V in the protein product of NTRK1, which has been proven to contribute to HSAN-IV (congenital insensitivity to pain with anhidrosis). HSAN-IV is a disease closely resembling FD (HSAN-III), and is characterized by a lack of pain sensation, anhidrosis, unexplained fever since childhood, and self-mutilating behavior [98, 99]. Both variants are located within the intracellular tyrosine kinase domain of the encoded protein, but neither site is conserved. Both variants have also been reported before in healthy individuals, so they are considered to be polymorphisms in the population and seem to be in linkage disequilibrium [100, 101, 102, 103, 104]. The mother’s brother (K10031-10231, unaffected) also carries these two variants, which provides further evidence that they are likely to be polymorphisms. Most importantly, neither variant is present in the proband K10031-10133, who reported the most severe dysautonomia-like symptoms.
Instead of the NTRK1 variants, manual filtering found seven other putative variants in PLCG2, ATXN2, VWA8, LRRIQ1, MYO1H, OR1J4, and RFX4 which follow a dominant inheritance model (Additional file 1: Table S4). Variants in PLCG2, ATXN2, and VWA8 were previously reported to be associated with certain disease phenotypes, including cold-induced urticaria, antibody deficiency, susceptibility to infection and autoimmunity, spinocerebellar ataxia type 2, celiac disease, and susceptibility to amyotrophic lateral sclerosis [98, 99, 100]. However, the variants we identified in this family are not the same variants as those in the literature, and all of these predicted diseases have only partially overlapping manifestations with dysautonomia-like symptoms. For the remaining four genes mentioned above, LRRIQ1, MYO1H, OR1J4, and RFX4, there have been, to our knowledge, no reports published to date discussing any variants in these genes associated with human disease. Therefore, the functional impact of these variants remains unclear.
Lastly, Phenolyzer analysis was carried out using the phenotype of proband K10031-10133 as input. It successfully prioritized two of the genes identified in the manual filtering analysis discussed above, ATXN2 and VWA8, and further revealed the complexity of such diseases (Additional file 1: Fig. S8).
These results together suggest that the genetic inheritance of dysautonomia-like symptoms in this pedigree may not consist of only one high-effect size mutation, but rather could be polygenic and/or environmentally influenced. It is possible that multiple variants including those we mentioned above are acting together or in conjunction with modifiers in these individuals’ genomes to give rise to a spectrum of complex clinical manifestations.
This research report provides insights into using WGS as a genetic test to investigate PWS and other phenotypes. In our study, three de novo deletions were discovered at single base pair resolution. WGS enables the reconstruction of the recombination event in this imprinting hotspot 15q11-13, which provides deeper insights into the mechanism of PWS. Notably, this is the first report of an Illumina HiSeq WGS experiment on an individual with PWS with the paternal allele deletion. In principle, the use of WGS, once standardized, could eventually simplify the molecular diagnosis procedure for patients with PWS and other genetic syndromes, as one would no longer need to go through the several steps of the standard genetic testing for PWS [77, 105, 106]. Since Angelman syndrome (AS) and PWS share a similar cytogenetic anomaly in 15q11.2 to 15q13 [107, 108], WGS could potentially help reveal the sub-types of both syndromes, given that the breakpoints of the CNVs can be mapped at the nucleotide level and one could distinguish which allele (paternal or maternal) has been deleted. However, WGS alone would not be enough to detect either uniparental paternal disomy with heterodisomy or imprinting defects in this genomic region for non-deleted PWS individuals [77, 109].
However, WGS might not always pinpoint the exact disease-relevant variants, due to the limitation of cohort size and disease complexity. Phenotype and genotype matching across cohorts is needed for confirming causal relationships. HPO has emerged as a standardized way to compare phenotypes, although it can only marginally solve the phenotype issue and cannot be directly used for WGS analysis. Fortunately, the development of phenotype-analysis tools such as Phenolyzer makes it possible to bridge the gap between the two on top of rich prior information across multiple databases. During the selection process for a particular patient’s features, one is able to query a surplus of clinical and scientific knowledge about the diseases linked to the feature in question. In addition, integration of four types of gene-gene interaction databases in Phenolyzer makes it possible to find more candidate genes beyond the existing gene-disease knowledge and generate new biological hypotheses. While the common drawback of all the gene prediction tools is the balance between sensitivity and specificity, Phenolyzer uses a modified logistic regression model to address this problem, ensuring that well-established genes are recommended among a large set of predictions.
This report about integrating WGS and HPO data demonstrates the effectiveness of such an approach and shows its potential for clinical implementation. Neither technique on its own is ideal for clinical diagnosis, but fortunately they complement each other and thus help eliminate false positives and reveal novel insights into human diseases. The potential for HPO remains in the development of a more multi-dimensional depiction of subjects that takes into account the past and present human presentation, and will aid in efforts for early diagnoses and intervention. As the field of medical genetics advances, researchers will need to find an efficient way to capture phenotypic information that allows for the use of computational algorithms to search for phenotypic similarity between genomics studies. For WGS, with ever-increasing sequencing capacity, a scalable and reliable informatic solution is key to analyzing millions of genomes simultaneously. To maximize this potential in clinical settings, data from WGS and HPO should be integrated and shared in a unified fashion.
The authors acknowledge Gareth Highnam and Jason O’Rawe for bioinformatics support and comments on the manuscript. The authors would like to thank the Exome Aggregation Consortium and the groups that provided exome variant data for comparison. A full list of contributing groups can be found at http://exac.broadinstitute.org/about.
The laboratory of G.J.L. is supported by funds from the Stanley Institute for Cognitive Genomics at Cold Spring Harbor Laboratory (CSHL). The CSHL genome center is supported in part by a Cancer Center Support Grant (CA045508) from the NCI. K.W. is supported by NIH grant HG006465.
Availability of data and materials
All of the sequence reads can be downloaded under project accession number [SRP058003] from the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra). Administrative permission was received from the Utah Foundation for Biomedical Research to access the medical records reviewed in this study.
GJL and RR helped to recruit the family and conduct clinical phenotyping. HF analyzed the sequencing data. YW analyzed the clinical data, performed Sanger sequencing validation experiment, assisted in the WGS experiment and the HPO analysis. HY conducted the computational analysis with Phenolyzer. MY performed the HPO analysis and helped with the clinical data analysis. LJB and DM helped analyze the WGS and microarray data. HF, YW, HY, MY and GJL wrote the manuscript. KW and GJL supervised the data analysis. All of the authors have read and approved the final manuscript.
Competing interests
G.J.L. serves on advisory boards for GenePeeks, Inc., Omicia, Inc., and Seven Bridges Genomics, Inc., is a consultant to Genos, Inc., and previously served as a consultant to Good Start Genetics, Inc. R.R. and K.W. were board members and shareholders of Tute Genomics, Inc. D.M. was an investor in Tute Genomics.
Consent for publication
Written consent was received from all study subjects (parental consent for children under the age of 18) to publish their personal and clinical details relevant to the study, including parents' ethnicity, along with video and facial photography, which are provided upon request to qualified investigators.
Ethics approval and consent to participate
The collection and analysis of the DNA used in this study were conducted by the Utah Foundation for Biomedical Research under Protocol #100, approved by Ethical & Independent Review Services, Inc. Written informed consent (parental consent for children under the age of 18) to participate in research, including sample collection, was obtained from all participants in the study. Research was carried out in compliance with the Federal Policy for the Protection of Human Subjects, 45 C.F.R. 46.
The Human Phenotype Ontology (HPO): http://human-phenotype-ontology.github.io/page2/
1000G database: http://www.1000genomes.org/
Exome Aggregation Consortium (ExAC): http://exac.broadinstitute.org/
ClinVar database: http://www.ncbi.nlm.nih.gov/clinvar/
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.