Genome characterization of the selected long- and short-sleep mouse lines

The Inbred Long- and Short-Sleep (ILS, ISS) mouse lines were selected for differences in acute ethanol sensitivity using the loss of righting response (LORR) as the selection trait. The lines show an over tenfold difference in LORR and, along with a recombinant inbred panel derived from them (the LXS), have been widely used to dissect the genetic underpinnings of acute ethanol sensitivity. Here we have sequenced the genomes of the ILS and ISS to investigate the DNA variants that contribute to their sensitivity difference. We identified ~2.7 million high-confidence SNPs and small indels and ~7000 structural variants between the lines; variants were found to occur in 6382 annotated genes. Using a hidden Markov model, we were able to reconstruct the genome-wide ancestry patterns of the eight inbred progenitor strains from which the ILS and ISS were derived, and found that quantitative trait loci that have been mapped for LORR were slightly enriched for DNA variants. Finally, by mapping and quantifying RNA-seq reads from the ILS and ISS to their strain-specific genomes rather than to the reference genome, we found a substantial improvement in a differential expression analysis between the lines. This work will help in identifying and characterizing the DNA sequence variants that contribute to the difference in ethanol sensitivity between the ILS and ISS and will also aid in accurate quantification of RNA-seq data generated from the LXS RIs. Electronic supplementary material The online version of this article (doi:10.1007/s00335-016-9663-6) contains supplementary material, which is available to authorized users.


Detailed HMM methods
We sought to infer the ancestral origin of distinct regions within both the ILS and ISS genomes. The ILS and ISS strains were selectively bred for their response to alcohol from eight ancestor strains: A, AK, BALB/c, C3H, C57BL, DBA/2, Is/Bi, and RIII [McClearn, 1981]. Of the ancestor strains, A, AK, BALB/c, C3H, C57BL, and DBA/2 have been previously sequenced and genotyped [Keane, 2011], while Is/Bi and RIII have not. As the sequenced ancestor strains lack information for the Y chromosome, we removed it from our analysis of ILS and ISS. The strain designations used here are based on Keane et. al. [Keane, 2011].
Intuitively, we can say that regions of the ISS and ILS genome with SNPs largely consistent with variants from a single ancestor were likely inherited from that ancestor. Using a hidden Markov model (HMM) approach we can probabilistically infer the most likely ancestral origin of each genome segment. Because C57BL is genotypically very similar to the mm10 reference genome, it was underrepresented in the SNP sets and very sensitive to de novo mutations in the dataset. Because of this, we combined C57BL and the Unknown into a single state, Unk/C57. Therefore our HMM consists of six states: one state for each sequenced ancestor and one "Unknown/C57" state to capture the two unsequenced ancestors and C57BL.
For our HMM training set, we use all ILS or ISS high quality SNP positions and the SNP positions of the six sequenced ancestor strains. Conceptually, transitions between distinct states correspond to recombination events during breeding, while transitions to the same state indicate adjacent SNPs belong to the same ancestral haploblock. Therefore, finding a maximum-likelihood path through the HMM results in the haploblock ancestral origins of the highest confidence.
The model parameters are set initially using biological intuition and refined using Expectation-Maximization. State emissions report on the consistency of the observed ILS or ISS strain with the designated ancestor. For example, when running the HMM for the ILS strain, if a given position in the genome contains a SNP consistent with the DBA/2 ancestor then state DBA/2 is given a high emission rate for that position. We note that inconsistencies can arise from sequencing errors and de novo mutations, thus non-ancestral variations in each state have non-zero probability. As adjacent SNPs are most likely to be from the same inherited region, we set the initial self-transition probabilities high. Additionally, we incorporate fine-scale mouse recombination rates [Brunschwig, 2012] to adjust the transition probabilities in a position specific fashion. As recombination rates are per-base and region-specific, this has the added benefit of naturally incorporating the distance between SNPs into our transition probabilities. To converge on optimal transition and emission probabilities, the HMM is run through an Expectation-Maximization loop. In each iteration, the maximum-likelihood path through the HMM is found using the Viterbi algorithm (maximization step), then transition and emission probabilities are recalculated based on the results (expectation step).
We note that SNPs inherited from one of the unsequenced ancestors will only appear in the data if those SNPs are not consistent with any of the six sequenced ancestor strains. Therefore, if a region was inherited from an unsequenced ancestor, but is genotypically similar to a sequenced ancestor, there is a bias for the HMM to label the region as coming from the sequenced ancestor rather than Unk/C57. We attempt to correct for this by relabeling regions with a certain percentage (we utilized 90%) of unique ISS/ILS SNPs as Unk/C57. Furthermore, we identify regions within the ILS and ISS genomes that are identical by descent (IBD). Sometimes large segments of the ancestor genomes are absolutely identical and indistinguishable. We sought to identify these regions within the HMM output, where a particular ancestor is chosen (by Viterbi) based on little to no informative positions. In these cases, there is no way to truly distinguish the ancestor of origin so we reclassify the segment as IBD for the set of nearly identical ancestors.
To assess our model, each haplotype block classified by the HMM is evaluated for its consistency relative to insertions and deletion (indel) structural variations observed in the ancestor strains. That is, for each haploblock classified by the model, we found all indels in the ILS/ISS strain and the ancestor strains that overlap that haploblock region. Then for each unique indel, we check for consistency with the model classification and the ancestor indels. We score a "hit" if the indel region contains the indel of the ancestor classified from the HMM and a "miss" if the region does not contain that indel. The final ratio of hits to misses gives the HMM output a score.
Source code for the HMM can be found at the Dowell lab website: http://dowell.colorado.edu/resources.html

SV scoring method
The events output by SVmerge [Wong 2010], namely 22,912 events called in ISS and 25,092 events called in ILS, were further analyzed to determine strain specific structural variants. First, 16,916 candidate strain-specific events were selected from the SVmerge results by identifying those deletions, inversions, translocations, and gains greater than 1 Kb in length called in only ISS or ILS but not both. Events were deemed the same if the start and stop were within 600 bp of each other, otherwise the event was considered strain-specific.
Strain specific SV candidates were scored for paired-end support or coverage support depending upon applicability of the short insert library read data. Paired-end support considers insert size and orientation while coverage support considers read-depth. Paired-end support scores were generated for calls originally identified by discordant paired-end mapping. For those events evaluated by paired-end support, the short insert library reads were used as input to the R package targetSeqView [Harper-Stromberg 2014] to generate scores. For those events evaluated by coverage support, coverage was summed within non-overlapping 100 bp windows tiled across the length of each event. The coverage score was the median coverage statistic from all windows across the event.
Score cutoffs for declaring an event to be a high confidence were generated by adapting a method from Ramskold et al., previously used to determine differential gene-expression cutoffs [Ramskold 2009]. This method was originally intended to compare the expression values (in RPKM) of genes versus intergenic 'background' regions. This method has the attraction of attempting to optimize the distinction between two distributions, one with mostly positives (the 'signal'), and one with all negatives ('background'). Our data was similarly constructed as two score distributions. For each candidate event, the score from the mouse of origin was denoted as 'signal' (i.e the mouse where the event was originally called by SVmerge) and a score for the alternate mouse was denoted as 'background' (i.e the same locus from the other strain). The original publication was interested in optimizing the false negative rate whereas the objective here was to optimize the false discovery rate (FDR). False discovery rate was defined for increments of score value, i: FDR i = cumulative_alternate_mouse_events i /cumulative_mouse_of_origin_events i * (1-cumulative_mouse_of_origin_events i )/(1-cumulative_alternate_mouse_events i ).
The latter term was a correction for negatives in the mouse of origin distribution. FDR values were plotted (Fig S-3) against score values and change-points were determined using the changepoint R package with the command: cpt.var(diff(fdrs), method="PELT") [Killick 2013]. Score cutoffs were set to be the change-point in the FDR curve with the greatest absolute difference between the two strains in the number of events exceeding that score cutoff. A strain-specific event was determined to be a high confidence event if the score cutoff was exceeded in one strain but not the other based upon testing the candidate event independently in both mice.
To summarize, a highly confident strain-specific call required strong evidence of strain specificity based upon scores generated from coverage metrics and paired-end support metrics. Based upon the aforementioned threshold criteria, 7,153 strain specific events were categorized as high confidence, namely 6,214 deletions, 773 gains, 60 inversions, and 106 translocations (Table S-

Figure S2: Genome wide distribution of all variants (SNP+indel, SVs)
All variants plotted genome-wide summarized by a sliding window approach (500k windows with 100k stepsize). Concentric circles (from outer to inner): SNPs and small indels, deletions, gains, losses and inversions. Left: Distribution plots of scores for mouse of origin SV calls (labeled signal, in blue) and corresponding loci tested in the alternate mouse (labeled background in red). Plots were generated using a method developed by Ramskold et al.
Ascending scores signify stronger evidence for the event based upon paired-end mapping metrics. The threshold point, indicates the minimum score for assigning an SV as high confidence. Right: Cumulative distribution plot of scores for mouse of origin SV calls (labeled signal, in blue) and corresponding loci tested in the alternate mouse (labeled background in red) with false discovery rate curve in green. The threshold point, indicates the minimum score for assigning an SV as high confidence. In order to make a strain-specific high confidence call, the score from the mouse of origin event needed to be at the threshold point or above, and the score from the alternate mouse for the same event needed to be less than the threshold point.

Figure S4: Inferred ancestry of ILS and ISS.
Genome-wide distribution of ancestor assignments based on consistency of ILS and ISS specific variations with ancestor sequence in particular regions. Ancestry is inferred using a hidden Markov model on SNPs and validated by consistency of indels. Regions that are identical by descent between two or more strains are labeled IBD and then subdivided by the possible parental labels.   Figure 4. (C) The Darlington et. al. data, which was conducted in naïve mice, is then compared to the saline treated mice of this study using the strain-specific genomes.