Haplous
Haplous is designed to detect haplotypes inherited by individuals who have the same familial disease predisposition and a distant common ancestor. It can also be used to compare haplotypes shared by a particular group of affected individuals to haplotypes of unaffected, unrelated controls. The same haplotype can be found in several samples because of a common ancestor or a random event; that is, the haplotype is either identical by descent (IBD; common ancestor) or identical by state (IBS; random event). The IBD haplotypes can be identified by filtering against the IBS haplotypes in matched controls and focusing on long haplotypes.
Haplous searches for a shared haplotype (SH) between individuals by using a sliding window and compares the SHs among sample sets, such as cases and controls. The comparison is based on rules, which formulate homozygous and heterozygous haplotype composition within and between sample groups. Each SH is assigned a score that allows ranking and prioritization. The score is based on the length of the SH and its abundance in cases and controls. The input to Haplous is phased genotypes, and the output is ranked lists of SHs and the corresponding chromosomal regions.
Haplous is implemented in Java, and it is freely available as an independent Java library. In addition, it is included as a pipeline in the Anduril bioinformatics workflow engine [23]. Anduril compliance allows straightforward integration of Haplous analysis and results to other studies. The Haplous Java library, Anduril components and the user guide are freely available [17].
Haplous parameters
Haplous uses phased haplotypes and predicts the SHs by comparing consecutive SNP alleles with each other using a fixed-sized sliding window with user-adjustable parameters for window size (w), mismatches (m), and length of identical regions (l). Each sample has the phased SNPs for maternal and paternal chromosomes, which are effectively two allele vectors for each sample. Each allele vector in turn is used as a reference to which all other allele vectors are compared. The window of w markers is slid over the reference vector and compared to an allele vector, and windows having at most m differences are identified as SHs. Shorter identical regions of l markers are identified as SHs as well.
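The window comparison described above can be sketched as follows. This is a minimal illustration, not the Haplous implementation: the function names are hypothetical, merging overlapping windows into one region is an assumption, and the separate rule for identical regions of l markers is left out. Missing markers are represented as None and counted as matches.

```python
from typing import List, Optional, Sequence, Tuple

Allele = Optional[str]  # None marks a missing marker


def count_mismatches(ref: Sequence[Allele], other: Sequence[Allele]) -> int:
    """Count differing alleles; a missing marker on either side counts as a match."""
    return sum(
        1
        for a, b in zip(ref, other)
        if a is not None and b is not None and a != b
    )


def shared_windows(ref: Sequence[Allele], other: Sequence[Allele],
                   w: int, m: int) -> List[int]:
    """Return start indices of windows of w markers with at most m mismatches."""
    if len(ref) != len(other):
        raise ValueError("allele vectors must cover the same markers")
    hits = []
    for start in range(len(ref) - w + 1):
        if count_mismatches(ref[start:start + w], other[start:start + w]) <= m:
            hits.append(start)
    return hits


def merge_windows(starts: List[int], w: int) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent windows into half-open (start, end) regions.

    Assumption: the text does not say how neighbouring windows are combined.
    """
    regions: List[Tuple[int, int]] = []
    for s in starts:
        if regions and s <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], s + w))
        else:
            regions.append((s, s + w))
    return regions


reference = list("AACGTTAC")
candidate = list("AACGATAC")  # differs from the reference at marker index 4
strict = shared_windows(reference, candidate, w=4, m=0)
relaxed = shared_windows(reference, candidate, w=4, m=1)
```

With m = 0, only the window that avoids the differing marker is reported; with m = 1, every window qualifies and the merged region spans the whole vector.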
For each reference vector, this produces a pair-wise map of SHs between all other allele vectors, and all the pairs are collapsed into a single data structure that identifies the vectors that have the same SH and the location of this SH. This process is illustrated in Figure 1. This structure enables an easy lookup of all samples in each SH. The SHs are always defined against the reference vector. Haplous allows mismatches in SHs, meaning that if an allele vector A has a SH with an allele vector B (A = B) and if B = C, it still might be that A ≠ C. Thus, having multiple pair-wise SHs in a region does not mean that all the allele vectors are the same. Missing markers are treated as matches. The maternal and paternal vectors in each sample are also compared with each other, which reveals the location of homozygous regions.
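The non-transitivity that follows from allowing mismatches can be shown with a toy example. The helper name and the four-marker vectors below are made up for illustration; only the matching rule (at most m mismatches, missing markers counted as matches) comes from the text.

```python
from typing import Optional, Sequence


def within_mismatches(a: Sequence[Optional[str]],
                      b: Sequence[Optional[str]], m: int) -> bool:
    """True if the two vectors differ in at most m non-missing markers."""
    mismatches = sum(
        1 for x, y in zip(a, b)
        if x is not None and y is not None and x != y
    )
    return mismatches <= m


A = list("AAAA")
B = list("AAAT")  # one mismatch against A
C = list("AATT")  # one mismatch against B, two against A

a_shares_b = within_mismatches(A, B, m=1)
b_shares_c = within_mismatches(B, C, m=1)
a_shares_c = within_mismatches(A, C, m=1)
```

Here A and B share the region, B and C share it, but A and C do not, which is why every SH has to be read relative to its reference vector.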
The running time and the memory requirements of Haplous grow quadratically with the number of samples and linearly with the number of markers. Therefore, Haplous is applicable in genome-wide studies in rare diseases, where the number of samples is less than 500. For larger sample sizes, Haplous allows the user to define those samples that are used as a reference in the comparison, and the rest of the samples are skipped. This is a useful option in case-control settings, since Haplous can be used to compare cases to controls but skip the control-case and the control-control comparisons. Additionally, each reference sample can be analyzed independently of the others, which enables parallelization of each run. Haplous is also available in the Anduril [23] bioinformatics workflow engine, which automatically parallelizes the execution. Thus far, Haplous has been used to analyze datasets of 930 cases and 960 controls (data not shown).
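The effect of restricting the reference samples can be estimated with simple counting. The sketch below assumes that every sample contributes two allele vectors and that each reference vector is compared with every other vector once; the function name is hypothetical and the actual bookkeeping in Haplous may differ.

```python
def vector_comparisons(n_samples: int, n_reference_samples: int) -> int:
    """Number of reference-vector versus other-vector comparisons.

    Assumption: two allele vectors per sample, and each reference vector
    is compared with all other vectors (including its own homologue).
    """
    if not 0 <= n_reference_samples <= n_samples:
        raise ValueError("reference samples must be a subset of all samples")
    total_vectors = 2 * n_samples
    reference_vectors = 2 * n_reference_samples
    return reference_vectors * (total_vectors - 1)


all_as_reference = vector_comparisons(n_samples=3, n_reference_samples=3)
one_as_reference = vector_comparisons(n_samples=3, n_reference_samples=1)
```

Using all samples as references gives a count that is quadratic in the number of samples, whereas fixing a small reference set makes the count grow linearly with the remaining samples.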
Haplous gives an estimate for the informativeness of each SH. This informativeness describes how rare a given SH is. Informativeness is defined as the joint probability of the alleles in the SH, estimated from the allele frequencies by multiplying the allele frequencies $a_k$ from the first marker of the SH ($k = i$) to the last marker ($k = n$), that is, $\prod_{k=i}^{n} a_k$. The user may set a threshold (t) for informativeness. In this case, only those SHs that have informativeness below t are included in the results. The value of informativeness is between zero and one. Zero denotes that the alleles of the SH could never be seen in the population, while one means that the alleles of the SH are always observed in that region and thus the region is completely uninformative.
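The informativeness of an SH, as defined above, is a plain product of allele frequencies. The sketch below uses hypothetical function names. The inclusive comparison against t is an assumption, chosen so that t = 1 keeps every SH and thereby disables the filter.

```python
import math
from typing import Sequence


def informativeness(allele_freqs: Sequence[float]) -> float:
    """Product of the population frequencies of the alleles in the SH."""
    for f in allele_freqs:
        if not 0.0 <= f <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    return math.prod(allele_freqs)


def passes_threshold(allele_freqs: Sequence[float], t: float) -> bool:
    """Keep SHs that are rare enough.

    Assumption: the comparison is inclusive, so t = 1 keeps every SH.
    """
    return informativeness(allele_freqs) <= t


rare_sh = informativeness([0.5, 0.5, 0.2])
common_sh = informativeness([1.0, 1.0, 1.0])
```

Because every additional marker multiplies in a value of at most one, longer SHs can never be less informative than their sub-regions under this definition.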
Simply scanning all SHs that meet the criteria would produce a huge list of SHs that are abundant in the population but not particularly interesting regarding the phenotype in question. To find the interesting SHs, we make use of the expert knowledge of the user: the user defines the rules that determine which SHs are interesting. The rules define the features that an interesting SH needs to have, and these features are defined for cases and controls separately. If an SH has these features, it is considered interesting. The rules are set as thresholds for the number of cases or controls that share the SH.
These rules can be better understood through an analogy with basic parametric linkage analysis. The main differences are that in Haplous the 'parameters' are presented as counts instead of percentages, and IBD sharing expectations between families can be controlled at the same time. We need to find the IBD haplotype carrying a predisposing mutation that segregates with the disease trait. These rules include the inheritance model (dominant and recessive model - that is, heterozygotes and homozygotes), assumed penetrance (proportion of mutation-carrying individuals who have the disease in question - that is, the number of cases and controls sharing the same haplotype), phenocopies (the number of cases that do not share the same haplotype) and mutation frequency (the total number of cases and controls that share the same haplotype). Using the rules, the user can tune the parameters to correspond to the hypothesis of the current analysis.
The pseudo-code of this inference is given in Figure 2. Briefly, the inference algorithm takes the thresholds and list of cases and controls as an input, calculates the number of cases and controls sharing each SH, and evaluates whether a SH has the features of an interesting SH taking into account both the cases and controls. These evaluations for cases and controls are produced with the same function but the return value is negated for the control rule. The rules follow a natural deduction, an example of which could be: 'the SH is interesting if it is shared by one or more homozygous case samples and not shared by any control samples'.
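The rule evaluation can be sketched as follows. This is an interpretation rather than the pseudo-code of Figure 2: the parameter names follow the USER INPUT notation used in this paper, while the function names and the reading of each threshold as "at least this many samples" are assumptions.

```python
from typing import Dict, Union

Rules = Dict[str, Union[int, str]]


def meets_thresholds(het_count: int, hom_count: int,
                     het_min: int, hom_min: int, operator: str) -> bool:
    """Evaluate one rule: are enough heterozygous and/or homozygous samples sharing the SH?"""
    het_ok = het_count >= het_min
    hom_ok = hom_count >= hom_min
    if operator == "AND":
        return het_ok and hom_ok
    if operator == "OR":
        return het_ok or hom_ok
    raise ValueError(f"unknown operator: {operator}")


def is_interesting(case_het: int, case_hom: int,
                   control_het: int, control_hom: int, rules: Rules) -> bool:
    """Same function for both groups; the result is negated for the controls."""
    case_rule = meets_thresholds(
        case_het, case_hom,
        rules["caseHet"], rules["caseHom"], rules["caseOperator"])
    control_rule = not meets_thresholds(
        control_het, control_hom,
        rules["controlHet"], rules["controlHom"], rules["controlOperator"])
    return case_rule and control_rule


rules = {
    "caseHet": 2, "caseHom": 0, "caseOperator": "AND",
    "controlHet": 4, "controlHom": 1, "controlOperator": "OR",
}
kept = is_interesting(3, 0, 3, 0, rules)     # three heterozygous controls are tolerated
dropped = is_interesting(3, 0, 0, 1, rules)  # one homozygous control is not
```

Under this reading, the control rule (controlHet = 4, controlHom = 1, controlOperator = OR) rejects an SH as soon as four controls carry it as heterozygotes or one control carries it as a homozygote.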
After filtering the most interesting SHs, the result set may still include many regions that are almost equally promising for further studies. Haplous gives scores for SHs, which are stored in a file with information about the range, score and samples sharing the particular haplotype. This allows straightforward identification and post-processing of the most interesting homozygous and heterozygous chromosomal regions.
The score calculated by Haplous emphasizes the number of cases and controls that share the haplotype as well as the length of the SH. The score is calculated according to the formula $M(C_a - C_o)$, where $M$ is the number of markers in the chromosomal region of a SH, $C_a$ is the number of times cases share the SH and $C_o$ is the number of times controls share the SH. Note that in the case of homozygous loci, each homozygous sample shares the SH twice, which increases either $C_a$ or $C_o$. Instead of a specific haplotype, we assign a score for the chromosomal region carrying the different haplotypes. This score is assigned similarly as $M_r(C_{ar} - C_{or})$, where $C_{ar}$ is the number of cases and $C_{or}$ the number of controls having any SH in a given marker. $M_r$ is the number of markers in the chromosomal region that receives the same score from $(C_{ar} - C_{or})$. The upper or lower limits of the scores depend on the parameter values.
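As a worked example of the haplotype score, the sketch below counts each heterozygous carrier once and each homozygous carrier twice, then applies M(C_a - C_o). The function names and the sample counts are hypothetical.

```python
def count_shares(het_samples: int, hom_samples: int) -> int:
    """Number of times a group shares the SH; homozygous samples share it twice."""
    return het_samples + 2 * hom_samples


def sh_score(n_markers: int, case_shares: int, control_shares: int) -> int:
    """Score M * (C_a - C_o) for one shared haplotype."""
    return n_markers * (case_shares - control_shares)


# An SH of 100 markers carried by three heterozygous cases, one homozygous
# case and one heterozygous control.
c_a = count_shares(het_samples=3, hom_samples=1)
c_o = count_shares(het_samples=1, hom_samples=0)
score = sh_score(n_markers=100, case_shares=c_a, control_shares=c_o)
```

The score becomes negative when controls share the SH more often than cases, so long haplotypes that are common in controls rank lowest.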
HapMap data
We used the HapMap phase 3 European population (CEU) chromosome 12 [18] phase-known dataset that has been created for each family trio in the database. Computational methods are needed to estimate haplotype phases from high-throughput SNP data. In many cases, genotypes of the family members are not available and population data are used as a reference instead [24]. However, estimates based on family genotypes are more accurate, since inheritance of most of the SNPs can be estimated based on the pedigree. The HapMap database provides unphased genotypes and corresponding haplotypes inferred from the families [18]. In HapMap, the phases of 94% of SNPs are known through the family information in the parents-child trios [25]. On average, 28% of the SNPs are heterozygous, and the trios reveal the phase for 80% of the heterozygous SNPs. We treat the more accurate family-based haplotypes as phase-known haplotypes, and use Haplous to compare them with the estimated phase-predicted haplotypes of the same HapMap samples.
Here we used only data from parents, as children convey redundant information. To create a population of unrelated individuals for the haplotype phase estimation, we randomly picked one individual from each family and used their unphased genotypes. Next, we selected SNPs that were present in both the phase-known and the unphased datasets. These formed a dataset of 60,704 SNPs for a population of 41 unrelated individuals. The phase-predicted haplotypes were estimated with the HaploRec software [24] based on the 41 population samples. The 29 samples present in both the phase-known and the phase-predicted datasets were selected to test Haplous performance, and from these samples we created two datasets, one having phase-known and the other phase-predicted haplotypes.
Data simulation
We used the simulation software GENOME [26] to generate 6,000 SNP haplotypes from a single chromosome spanning 78 cM. The population had an effective population size of 100,000 with a mutation rate of $10^{-8}$ per generation. From this population, we chose the 4,068 SNPs that best matched the Illumina SNP map of human chromosome 22 and had a minor allele frequency of at least 0.05. A random set of 1,500 haplotypes was used as the founder haplotypes for the affected pedigree, and 500 other randomly chosen haplotypes were paired as 250 healthy controls.
We created the pedigree by selecting random members A and B from both lymphoma families 2 and 3. Both A and B had at most one direct ancestor, and their alleles could be inherited by at least 20% of the youngest generations. The new family members in this pedigree were added so that A and B had a common ancestor 10 to 27 generations away from the youngest members.
The mutated allele was inserted into the common ancestor and the rest of the founders were non-carriers. For both families, we chose ten random paths from the youngest individuals to the common ancestor and forced the mutation to be passed through generations in these paths. Then we used our own simulator to simulate the inheritance of non-founder alleles from the oldest generation to the youngest based on the genetic map of chromosome 22.
The individuals from the youngest generation who had the mutation were inserted into the case dataset. The mutation allele was set to the same allele in all the final samples. This simulation was repeated 100 times, each time varying the position of the mutation.
Lymphoma data
Blood-derived DNA was collected from nine lymphoma patients, of whom six had nodular lymphocyte predominant Hodgkin lymphoma (NLPHL) and three had either T-cell/histiocyte rich B-cell lymphoma (TCRBCL), non-Hodgkin lymphoma (NHL) or classical Hodgkin lymphoma (cHL). When possible, samples were also collected from the children or parents of these patients for phase determination (Figure 3). Samples were also collected from the children and siblings of four deceased lymphoma patients, one of whom had had NLPHL, one TCRBCL and two NHL (Figure 3). Altogether, 29 samples were available, of which 20 were from unaffected family members. The slightly modified pedigrees and sample information are shown in Figure 3. The samples and patient information were obtained with approval from the Ethics committees of the Helsinki University Central Hospital and Hospital District of Helsinki and Uusimaa (Dnro 408/13/03/03/2009). All blood samples were obtained after signed informed consent in accordance with the Declaration of Helsinki. Genotypes from 250 unaffected Finnish control individuals were also available from the Nordic Center of Excellence in Disease Genetics control database [27].
Lymphoma data processing
Genome-wide SNP data were available from four lymphoma patients in family 1, two in family 2, and three in family 3, as well as from 20 unaffected family members (Figure 3). In the cases of four lymphoma patients whose DNA was not available, SNP data from their children and siblings were used to define their haplotypes. This created uninformative gaps in the haplotype data of the deceased individuals. In this study, however, we decided to consider all the uninformative regions as SHs in the haplotype screening in order not to lose any information.
Genomic DNA extracted from 29 blood samples was used for SNP genotyping with Illumina's HumanCNV370-Duo DNA Analysis BeadChip using the Infinium Assay (Illumina Inc., San Diego, CA, USA). Genotyping was performed according to the manufacturer's standard protocol in the Institute for Molecular Medicine Finland (FIMM) Genome and Technology Centre (Finland). Genotype calling was carried out with BeadStudio software (Illumina Inc.) using the default GenCall score cutoff of 0.15. All samples passed the quality filtering. SNP genotypes were exported from BeadStudio to the Progeny database (Progeny Software LLC, South Bend, IN, USA), in which pedigree, phenotype, sample and SNP data were integrated. Mendelian error checking was performed on the genotype data using tools integrated in Progeny. The markers with Mendelian errors were removed from further analysis. Pedigree-based haplotypes were constructed using Merlin [28] with the '--best' mode, which estimates the most likely haplotype vector. For the haplotype estimation, the large pedigree of family 3 was first split into smaller overlapping sub-pedigrees. Unlikely genotypes that cause double recombinants were predicted with the Merlin error detection tool and subsequently excluded from the final analysis. The haplotypes for controls were estimated using HaploRec [24] with default parameters and the chromosome split into regions 500 markers long that overlapped by 10 markers and had an extra 20 markers at the tails of each split. In the phase-known family-based haplotypes, the uninformative loci were transformed into missing markers.
Analysis of simulated data
From each of the 100 simulated datasets, we selected the first 11 cases for the analysis and used all controls from each simulated dataset. The Haplous SH scan was executed using parameters 'm = 0, w = {20,30,50,100,150,180,200}'. The rules for controls were (USER INPUT = {controlHet = 4, controlHom = 1, controlOperator = OR}) and the rules for cases were varied from (USER INPUT = {caseHet = {1,2,3}, caseHom = {1,2,3}, caseOperator = AND}) to (USER INPUT = {caseHet = {1,2,3,4,5,6,7,8,9,10,11}, caseHom = {1,2,3,4,5,6,7,8,9,10,11}, caseOperator = AND}). The rules (USER INPUT = {caseHet = {1,2,3,4,5,6}, caseHom = {1,2,3,4,5,6}, caseOperator = AND}) are comparable with the lymphoma analysis, and they were used for the analysis of Haplous robustness (Tables 1 and 2). BEAGLE was executed using the parameter 'fastibd = true'.
Table 1 Summary of the results from the simulated dataset using different numbers of controls
Table 2 Summary of the results from the simulated dataset mixing cases and controls
Lymphoma analysis by Haplous
A schematic of the six-staged lymphoma data analysis is presented in Figure 4. In stage 1, genotypes were imported to the analysis. In stage 2, the haplotypes were estimated from the genotypes. In stage 3, haplotypes of the cases and controls were combined and all SHs were extracted from the data using the following parameters: window size 100, no more than one mismatch within a window and identical haplotype length 80 (w = 100, m = 1 and l = 80). Informativeness was not considered (t = 1). In stage 4, SHs present in the controls were excluded by filtering, allowing the maximum of three SHs as heterozygous or none in homozygous form (USER INPUT = {controlHet = 4, controlHom = 1, controlOperator = OR}). At this stage we did not set any threshold for lymphoma cases. In stage 5, we applied in parallel six different rules to identify five SHs in cases in all conformations. As the identified families do not provide a clear clue to the mode of inheritance, we varied the number of required heterozygous SHs (hetSHs) and homozygous SHs (homSHs) in cases (USER INPUT = {caseHet = {0,1,2,3,4,5}, caseHom = {0,1,2,3,4,5}, caseOperator = AND}). In stage 6, we selected the ten highest scoring SHs that are more than 30 markers long.
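The final selection in stage 6 amounts to a length filter followed by a ranking. The sketch below is illustrative only: the record type and function name are hypothetical, and the length cutoff is taken as strictly more than 30 markers, following the wording of stage 6.

```python
from typing import List, NamedTuple


class SharedHaplotype(NamedTuple):
    region: str      # hypothetical label for the chromosomal region
    n_markers: int   # length of the SH in markers
    score: int       # score assigned to the SH


def top_hits(shs: List[SharedHaplotype],
             min_markers: int = 30, n: int = 10) -> List[SharedHaplotype]:
    """Keep SHs longer than min_markers and return the n highest scoring ones."""
    long_enough = [sh for sh in shs if sh.n_markers > min_markers]
    return sorted(long_enough, key=lambda sh: sh.score, reverse=True)[:n]


candidates = [
    SharedHaplotype("region_a", 120, 480),
    SharedHaplotype("region_b", 25, 900),   # high score but too short
    SharedHaplotype("region_c", 45, 180),
    SharedHaplotype("region_d", 31, 310),
]
selected = top_hits(candidates, min_markers=30, n=2)
```

Filtering on length before ranking prevents short, high-scoring fragments from displacing longer haplotypes that are more likely to be IBD.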
The filtering breaks the SHs into shorter segments based on samples that share the SHs. Therefore, SHs encompassing at least 30 SNPs were considered to be sufficiently long to be genetically interesting, that is, to represent potential IBD haplotypes. The SHs were filtered six times, and the ten highest scoring hits from each run were examined in more detail. Region boundaries were retrieved using SNP identifiers and a list of genes located in these regions was collected from the Ensembl database (release 59) [29]. A downstream analysis of these genes was performed. Genes that had a UniProt identifier were considered as protein coding. In order to find candidate genes that could be interesting considering what is known about their function in the literature, we performed a SNPs3D [22] search for lymphoma-related features. The search terms were 'nodular lymphocyte predominant Hodgkin lymphoma', 'T-cell rich B-cell lymphoma', 'histiocyte rich B-cell lymphoma', 'non-Hodgkin lymphoma', 'Hodgkin lymphoma' and 'B-cell'. We also used three known NHL or NLPHL related genes (BCL6 [30, 31], A20 [32] and SOCS1 [33]) as search words as well as 'NFkB', a pathway that appears to be activated in both NLPHL and non-Hodgkin lymphomas [31, 33, 34].