Background

Human leukocyte antigen (HLA) genes play a key role in immunological responses by presenting peptides to T cells. It is well known that HLA loci are highly polymorphic, and the polymorphism patterns define several thousands of types within HLA genes. HLA genotyping is a process that determines a pair of HLA types for an HLA gene. Since the relationships between HLA types and diseases have now been intensively investigated [15], HLA genotyping is considered as a fundamental step in immunological analysis. Further analysis enables us to identify novel HLA types and detect somatic mutations, which potentially affect the efficacy of immune therapy.

Recently, next generation sequencing-based approaches have been developed for HLA genotyping. These can be generally separated into two categories: those based on amplicon sequencing of HLA loci [6, 7] and others based on unbiased sequencing methods such as whole exome sequencing (WES) and RNA sequencing (RNA-seq) [815]. The amplicon sequencing-based methods are the most accurate owing to the sufficient coverage of sequence data, but are relatively expensive to perform and require specialized materials and equipment. The unbiased sequencing ones can be used without additional costs, but the accuracy of the results depends on the amount and quality of sequence reads generated from HLA loci. Previous papers have shown that the accuracy can reach 95% at 2nd field resolution from WES and RNA-seq data [10, 12, 13, 15] However, Bauer et al. has reported that these methods cannot achieve 80% accuracy from whole genome sequence (WGS) data [16]. Thus, HLA genotyping from WGS data remains a significant challenge, although this approach would provide more information of HLA loci than possible with WES and RNA-seq data, including details of the non-coding regions such as the introns and the untranslated regions.

To achieve high accuracy for WGS-based HLA genotyping and further analysis of HLA genes, we developed a series of computational methods, which involve collection of sequence reads that are potentially generated from a target HLA gene followed by HLA genotyping, using a novel Bayesian model termed ALelle Prediction in HLA Regions from sequence Data (ALPHLARD). This model was found to yield comparable accuracy to those based on WES and RNA-seq data at 3rd field resolution. Together with HLA genotyping, a notable feature of ALPHLARD is that it can estimate the personal HLA sequences of the sample. This enables achieving high accuracy for a sample whose HLA sequence is not included in the reference databases and further allows for calling rare germline variants not stored in the databases. We can also detect somatic mutations by comparing the HLA sequences of paired normal and tumor sequence data.

We illustrate the capability of our method by comparing the performance of ALPHLARD and existing methods using WES data from 253 HapMap samples and WGS data from the normal samples of 25 cancer patients. We also applied ALPHLARD to WGS data of the tumor samples of the cancer patients and detected three somatic point mutations and one case of loss of heterozygosity (LOH) in the HLA genes, which were validated by the Trusight HLA Sequencing Panels [17] and the Sanger sequencing.

Methods

Overview of our pipeline

Our pipeline consists of two steps as shown in Fig. 1. First, for each read and each HLA type, the HLA read score (HR score) is calculated, which quantifies the likelihood that the read comes from the HLA type. Based on the calculated HR scores, it is determined whether or not the read comes from a certain HLA gene. For example, by aligning read x to the reference sequences in HLA databases, we obtained the HR scores as shown in the bar graph of Fig. 1a. Then, if the maximum HR score for the HLA-A gene is large enough and the difference in the maximum scores for the HLA-A gene and the other HLA genes is also large, we conclude that read x is most likely a specific read of the HLA-A gene. Otherwise, read x is judged to be a read produced from other regions. HLA genotyping is then performed using the collected reads for each HLA gene, as shown in Fig. 1b. ALPHLARD outputs candidate pairs of HLA types according to the Bayesian posterior probabilities.

Fig. 1
figure 1

Schematic overview of ALPHLARD: a For each read and each HLA type, the HLA read score (HR score) is calculated, which quantifies the likelihood that the read comes from the HLA type. Based on the calculated HR scores, it is determined whether or not the read comes from a certain HLA gene. b For each read and each HLA type, the HLA read score (HR score) is calculated, which quantifies the likelihood that the read comes from the HLA type. Based on the calculated HR scores, it is determined whether or not the read comes from a certain HLA gene

HLA reference data

We used HLA reference information that can be obtained from the IPD-IMGT/HLA database (release 3.28.0) [18]. There are two types of HLA reference sequences in the database: one is a complete genomic reference and the other is an exonic reference without non-coding regions. Some HLA types have both genomic and exonic reference information, but most HLA types have only exonic reference information.

The database also provides multiple sequence alignments (MSAs) at the genomic and the exonic levels for each HLA gene. We combined the two MSAs into a common MSA as follows: First, some gaps were inserted into exons of the genomic MSA for consistency with the exonic reference sequences. Then, missing non-coding sequences were replaced with the most similar genomic reference sequences. This integrated MSA is then used for alignment and realignment of the reads.

Collection and realignment of reads

First, all reads are mapped to a human reference genome, and reads mapped to the HLA region and unmapped reads are used at the next step. We use hg19 [19] as the reference sequence and define the HLA region as chr6:28,477,797-33,448,354, which covers HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 genes.

Next, the filtered reads are mapped to all HLA genomic and exonic reference sequences. We use BWA-MEM (version 0.7.10) [20] with the -a option to output all found alignments. Then, each mapped read is filtered based on whether or not it is likely to be produced by the target HLA gene. This filtering is performed according to the HR score si,j for the ith read xi and the jth HLA type tj, which is similar to the filtering procedure used in HLAforest [11]. If xi is not aligned to tj, si,j is −. Otherwise, let \((\tilde {x}_{i,j}, \tilde {t}_{i,j})\) be the alignment of xi and tj, which might include some gaps. \(\tilde {x}_{{i,j,n}}\) and \(\tilde {t}_{{i,j,n}}\) are defined as the nth bases or gaps of \(\tilde {x}_{i,j}\) and \(\tilde {t}_{i,j}\), respectively, and \(\tilde {b}_{{i,j,n}}\) is defined as the base quality of \(\tilde {x}_{{i,j,n}}\). We suppose that \(\tilde {p}_{{i,j,n}}\) is the probability of a mismatch between \(\tilde {x}_{{i,j,n}}\) and \(\tilde {t}_{{i,j,n}}\), which can be calculated by

$$ \tilde{p}_{i,j,n} = 10^{-\frac{\tilde{b}_{i,j,n}}{10}}. $$

Then, the HR score si,j is given by

$$ s_{i,j} = \sum\limits_{n} \left(\tilde{\alpha}_{i,j,n} + \tilde{\beta}_{i,j,n}\right), $$

where

$$\begin{array}{*{20}l} \tilde{\alpha}_{i,j,n} & =\left\{ \begin{array}{ll} \log (\frac{\tilde{p}_{i,j,n}}{3}) & (\text{if } \tilde{x}_{i,j,n}, \tilde{t}_{i,j,n} \in B \text{ and } \tilde{x}_{i,j,n} \neq \tilde{t}_{i,j,n}) \\ \alpha^{\texttt{d,o}} & (\text{if } \tilde{x}_{i,j,n} = {-} \text{ and } \tilde{x}_{i,j,n-1} \neq {-}) \\ \alpha^{\texttt{d,e}} & (\text{if } \tilde{x}_{i,j,n} = {-} \text{ and } \tilde{x}_{i,j,n-1} = {-}) \\ \alpha^{\texttt{i,o}} & (\text{if } \tilde{t}_{i,j,n} = {-} \text{ and } \tilde{t}_{i,j,n-1} \neq {-}) \\ \alpha^{\texttt{i,e}} & (\text{if } \tilde{t}_{i,j,n} = {-} \text{ and } \tilde{t}_{i,j,n-1} = {-}) \\ \alpha^{\texttt{N}} & \left(\begin{array}{l} \text{if } \tilde{x}_{i,j,n} = \mathtt{N} \text{ and } \tilde{t}_{i,j,n} \in B^{\texttt{N}} \\ \quad \text{or }\tilde{x}_{i,j,n} \in B^{\mathtt{N}} \text{ and } \tilde{t}_{i,j,n} = \mathtt{N} \end{array}\right) \\ 0 & \text{(otherwise)} \end{array}\right.\! {,} \\ \tilde{\beta}_{i,j,n} & = \left\{\begin{array}{ll} \beta & (\text{if } \tilde{x}_{i,j,n} \in B^{\mathtt{N}}) \\ 0 & \text{(otherwise)} \end{array}\right. {.} \end{array} $$

Here, B={A,C,G,T} and BN={A,C,G,T,N}. The parameters αd,o, αd,e, αi,o, αi,e, and αN take negative values as penalties for opening a deletion, extending a deletion, opening an insertion, extending an insertion, and N in the read or the HLA type, respectively. β is a positive constant reward for read length, which prefers longer reads. Then, the score of xi for the target HLA gene \(s^{*}_{i}\), and the score of xi for the non-target HLA genes \(\bar {s}^{*}_{i}\) are defined by

$$ s^{*}_{i} = \max_{t_{j} \in T} s_{i,j}, \quad \bar{s}^{*}_{i} = \max_{t_{j} \notin T} s_{i,j}, $$

where T is the set of HLA types in the target HLA gene. \(s^{*}_{i}\) and \(\bar {s}^{*}_{i}\) indicate how likely xi is to be produced by the target HLA gene and the non-target HLA genes, respectively.

Thus, when xi is an unpaired read, it is used for HLA genotyping if

$$ s^{*}_{i} > \theta^{u,m}, \quad s^{*}_{i} - \bar{s}^{*}_{i} > \theta^{u,d}, $$

where θu,m and θu,d are constant thresholds. When xi and \(x_{i^{\prime }}\) are paired, they are used for HLA genotyping if

$$ s^{*}_{i} + s^{*}_{i^{\prime}} > \theta^{p,m}, \quad \left(s^{*}_{i} + s^{*}_{i^{\prime}}\right) - \left(\bar{s}^{*}_{i} + \bar{s}^{*}_{i^{\prime}}\right) > \theta^{p,d}, $$

where θp,m and θp,d are constant thresholds. Paired reads are generally more effective than unpaired reads; hence, θp,m and θp,d should be less than θu,m and θu,d, respectively.

In the next step, all of the collected reads are realigned as follows. First, \(\phantom {\dot {i}\!}t_{j^{*}}\) is defined as the best type for xi in the target gene, which is obtained by

$$ j^{*} = {\underset{j: t_{j} \in T}{\arg\max}} s_{i,j}. $$

Then, xi is realigned to be consistent with the alignment \(\left (\tilde {x}_{i,j^{*}}, \tilde {t}_{i,j^{*}}\right)\) and the integrated MSA of the target HLA gene.

Bayesian model for analyzing HLA genes

Analysis of the target HLA gene by ALPHLARD is performed using the collected and realigned reads. Let \(\hat {x}_{i}\) be the ith paired (or unpaired) read(s) collected and realigned with the previous procedure, \(\hat {x}_{i,n}\) be the nth base or gap of \(\hat {x}_{i}\), and \(\hat {b}_{i,n}\) be the base quality of \(\hat {x}_{i,n}\). Note that, hereafter, we regard paired reads as one sequence. The probability of mismatch \(\hat {p}_{i,n}\) can be calculated by

$$ \hat{p}_{i,n} = 10^{-\frac{\hat{b}_{i,n}}{10}}. $$

Suppose that \(R^{r}_{1}\) and \(R^{r}_{2}\) are the HLA types of the sample, and that \(S^{r}_{1}\) and \(S^{r}_{2}\) are the true HLA sequences of the sample, which are introduced because the HLA sequences of the sample might not be registered in the reference (IPD-IMGT/HLA) database. Let \(R^{d}_{1}, R^{d}_{2},...,\) be decoy HLA types and \(S^{d}_{1}, S^{d}_{2},...,\) be decoy HLA sequences. These parameters could make this HLA analysis robust when reads from non-target homologous regions are misclassified into the target HLA gene at the previous filtering step. We will sometimes use R1,R2,R3,R4,..., and S1,S2,S3,S4,..., instead of \(R^{r}_{1}, R^{r}_{2}, R^{d}_{1}, R^{d}_{2},...,\) and \(S^{r}_{1}, S^{r}_{2}, S^{d}_{1}, S^{d}_{2},...,\) for convenience. Ii is defined as a parameter to indicate which sequence produced the read \(\hat {x}_{i}\); that is, Ii=k means that \(\hat {x}_{i}\) was generated from Sk. Then, the posterior probability of the parameters \(p(\mathcal {R}, \mathcal {S}, \mathcal {I} | \hat {X})\) is given by

$$ p\left(\mathcal{R}, \mathcal{S}, \mathcal{I} | \hat{X}\right) \propto p\left(\hat{X} | \mathcal{S}, \mathcal{I}\right) p\left(\mathcal{R}, \mathcal{S}\right) p\left(\mathcal{I}\right), $$

where \(\hat {X} = \left \{ \hat {x}_{1}, \hat {x}_{2},... \right \}\), \(\mathcal {R} = \left \{ R_{1}, R_{2},..., \right \}\), \(\mathcal {S} = \left \{ S_{1}, S_{2},..., \right \}\), and \(\mathcal {I} = \left \{ I_{1}, I_{2},... \right \}\).

The likelihood function \(p\left (\hat {X} | \mathcal {S}, \mathcal {I}\right)\) is defined by

$$ p\left(\hat{X} | \mathcal{S}, \mathcal{I}\right) = \prod_{i} p\left(\hat{x}_{i} | S_{I_{i}}\right). $$

The likelihood of each read \(p(\hat {x}_{i} | S_{k})\) is given by

$$ p\left(\hat{x}_{i} | S_{k}\right) = \prod_{n} p\left(\hat{x}_{i,n} | S_{k,n}\right), $$

where Sk,n is the nth base or gap of Sk. The likelihood of each base \(p\left (\hat {x}_{i,n} | S_{k,n}\right)\) is calculated by

$${{}\begin{aligned} & p(\hat{x}_{i,n} | S_{k,n} \! \in \! B) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \gamma^{\mathtt{d}}) (1 \! - \! \gamma^{\mathtt{N}}) (1 \! - \! \hat{p}_{i,n}) & (\text{if } \hat{x}_{i,n} \! = \! S_{k,n}) \\ (1 \! - \! \gamma^{\mathtt{d}}) (1 \! - \! \gamma^{\mathtt{N}}) \frac{\hat{p}_{i,n}}{3} & (\text{if }\hat{x}_{i,n} \! \in \! B \text{ and } \hat{x}_{i,n} \! \neq \! S_{k,n}) \\ (1 \! - \! \gamma^{\mathtt{d}}) \gamma^{\mathtt{N}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{N}) \\ \gamma^{\mathtt{d}} & (\text{if }\hat{x}_{i,n} \! = \! \mathtt{-}) \end{array}\right.\!\! {,} \\ & p(\hat{x}_{i,n} | S_{k,n} \! = \! \mathtt{N}) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \gamma^{\mathtt{d}}) (1 \! - \! \gamma^{\mathtt{N}}) \frac{1}{4} & (\text{if } \hat{x}_{i,n} \! \in \! B) \\ (1 \! - \! \gamma^{\mathtt{d}}) \gamma^{\mathtt{N}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{N}) \\ \gamma^{\mathtt{d}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{-}) \end{array}\right. {,} \\ & p(\hat{x}_{i,n} | S_{k,n} \! = \! \mathtt{-}) \\ & \ \ = \left\{\begin{array}{ll} \gamma^{\mathtt{i}} (1 \! - \! \gamma^{\mathtt{N}}) \frac{1}{4} & (\text{if } \hat{x}_{i,n} \! \in \! B) \\ \gamma^{\mathtt{i}} \gamma^{\mathtt{N}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{N}) \\ 1 \! - \! \gamma^{\mathtt{i}} & (\text{if } \hat{x}_{i,n} \! = \! \mathtt{-}) \end{array}\right. {.} \end{aligned}} $$

Here, γd, γi, and γN are the probabilities of a deletion error, an insertion error, and N, respectively.

The prior probability of the HLA types and the HLA sequences \(p(\mathcal {R}, \mathcal {S})\) is defined by

$$ p\left(\mathcal{R}, \mathcal{S}\right) = \prod_{k} p\left(R_{k}\right) p\left(S_{k} | R_{k}\right). $$

Here, \(p(R^{r}_{k})\) is the prior probability of the HLA type, which is calculated using The Allele Frequency Net Database [21]. On the other hand, \(p(R^{d}_{k})\) is the prior probability of the decoy HLA type, which we assume as constant. The prior probability of the HLA sequence p(Sk|Rk) is given by

$$ p\left(S_{k} | R_{k}\right) = \prod_{n} p\left(S_{k,n} | R_{k,n}\right), $$

where and Rk,n is the nth base or gap of Rk in the integrated MSA. The probability of a germline variant p(Sk,n|Rk,n) is calculated by

$${{}\begin{aligned} & p(S_{k,n} | R_{k,n} \! \in \! B) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \delta^{\texttt{d}}) (1 \! - \! \delta^{\texttt{N}}) (1 \! - \! \delta^{\texttt{s}}) & \text{(if \(S_{k,n} \! = \! R_{k,n}\))} \\ (1 \! - \! \delta^{\texttt{d}}) (1 \! - \! \delta^{\texttt{N}}) \frac{\delta^{\texttt{s}}}{3} & \text{(if \(S_{k,n} \! \in \! B\) and \(S_{k,n} \! \neq \! R_{k,n}\))} \\ (1 \! - \! \delta^{\texttt{d}}) \delta^{\texttt{N}} & \text{(if \(S_{k,n} \! = \! \texttt{N}\))} \\ \delta^{\texttt{d}} & \text{(if \(S_{k,n} \! = \! \texttt{-}\))} \end{array}\right. \! {,} \\ & p(S_{k,n} | R_{k,n} \! = \! \texttt{N}) \\ & \ \ = \left\{\begin{array}{ll} (1 \! - \! \delta^{\texttt{d}}) (1 \! - \! \delta^{\texttt{N}}) \frac{1}{4} & \text{(if \(S_{k,n} \! \in \! B\))} \\ (1 \! - \! \delta^{\texttt{d}}) \delta^{\texttt{N}} & \text{(if \(S_{k,n} \! = \! \texttt{N}\))} \\ \delta^{\texttt{d}} & \text{(if \(S_{k,n} \! = \! \texttt{-}\))} \end{array}\right. {,} \\ & p(S_{k,n} | R_{k,n} \! = \! \texttt{-}) \\ & \ \ = \left\{\begin{array}{ll} \delta^{\texttt{i}} (1 \! - \! \delta^{\texttt{N}}) \frac{1}{4} & \text{(if \(S_{k,n} \! \in \! B\))} \\ \delta^{\texttt{i}} \delta^{\texttt{N}} & \text{(if \(S_{k,n} \! = \! \texttt{N}\))} \\ 1 \! - \! \delta^{\texttt{i}} & \text{(if \(S_{k,n} \! = \! \texttt{-}\))} \end{array}\right. {.} \end{aligned}} $$

Here, δs, δd, δi, and δN are the probabilities of a true substitution, a true deletion, a true insertion, and a true N, respectively. Also, different values are used as these hyperparameters depending on whether Rk,n was imputed from the most similar reference base, which reduces the influence of misimputations. Sk,n tends to become N when it is ambiguous.

The prior probability of the indicator variables \(p(\mathcal {I})\) is defined by

$$ p(\mathcal{I}) = \prod_{i} p(I_{i}) $$

Here, p(Ii) is the prior probability of the indicator variable, which is calculated by

$${\begin{aligned} p(I_{i}) \propto \left\{\begin{array}{ll} 1 & \text{(if } I_{i} = 1 \text{ or } I_{i} = 2) \\ \epsilon & \text{(otherwise)} \end{array}\right. {.} \end{aligned}} $$

ε reflects how likely the reads are to be produced by non-target homologous regions.

Efficient sampling with elaborate MCMC schemes

The parameters of the model above are sampled using two Markov chain Monte Carlo (MCMC) schemes, Gibbs sampling and the Metropolis-Hastings algorithm, with parallel tempering to make the parameter sampling efficient. Gibbs sampling is mainly used for local search, and Metropolis-Hastings sampling is periodically used for more global search. For the Metropolis-Hastings algorithm, we constructed two novel proposal distributions that enable the parameters to jump from mode to mode and lead more efficient sampling.

One of the proposal distributions is focused on positions not covered with any read. First, \(S^{\texttt {N}}_{k}\) is defined as a modified HLA sequence whose bases are replaced with Ns at positions not covered with any read produced by Sk, which is given by

$${\begin{aligned} S^{\texttt{N}}_{k,n} = \left\{\begin{array}{ll} S_{k,n} & \text{(if } \exists i; I_{i} = k \text{ and } \hat{x}_{i} \text{ covers the } n^{\text{th}} \text{ base of } S_{k}) \\ \texttt{N} & \text{(otherwise)} \end{array}\right. {.} \end{aligned}} $$

A candidate HLA type and a candidate HLA sequence are then sampled based on

$$\begin{array}{*{20}l} R^{*}_{k} & \sim p\left(R^{*}_{k} | S^{\texttt{N}}_{k}\right), \\ S^{*}_{k} & \sim p\left(S^{*}_{k} | R^{*}_{k}, \mathcal{I}, \hat{X}\right). \end{array} $$

Then, the acceptance rate r can be calculated based on the Metropolis-Hastings algorithm, which is given by

$$\begin{array}{*{20}l} r & = \min \left(1, r^{*}\right), \\ r^{*} & = \frac{p\left(R^{*}_{k}, S^{*}_{k} | \mathcal{I}, \hat{X}\right) p\left(R^{*}_{k}, S^{*}_{k} \to R_{k}, S_{k} | \mathcal{I}, \hat{X}\right)} {p\left(R_{k}, S_{k} | \mathcal{I}, \hat{X}\right) p\left(R_{k}, S_{k} \to R^{*}_{k}, S^{*}_{k} | \mathcal{I}, \hat{X}\right)} \\ & = \frac{p\left(S^{\texttt{N}}_{k} | R_{k}\right) {\sum\nolimits}_{S} p\left(\hat{X} | S, \mathcal{I}\right) p\left(S | R^{*}_{k}\right)} {p\left(S^{\texttt{N}}_{k} | R^{*}_{k}\right) {\sum\nolimits}_{S} p\left(\hat{X} | S, \mathcal{I}\right) p\left(S | R_{k}\right)}. \end{array} $$

This proposal distribution makes the sampling more efficient when there is ambiguity in the HLA types attributed to some uncovered positions. For example, let tj and \(\phantom {\dot {i}\!}t_{j^{\prime }}\) be HLA types that only differ with one mismatch at the nth position. If a sample has tj as an HLA type but there are no reads from tj covering the nth position, we cannot determine whether the HLA type is tj or \(\phantom {\dot {i}\!}t_{j^{\prime }}\). However, once Rk becomes \(\phantom {\dot {i}\!}t_{j^{\prime }}\), Sk,n becomes the nth base of \(\phantom {\dot {i}\!}t_{j^{\prime }}\) with high probability. Then, Rk becomes \(\phantom {\dot {i}\!}t_{j^{\prime }}\) with high probability, and this process is repeated. This is because Rk and Sk are separately sampled in the Gibbs sampling in spite of their high correlation. Thus, the proposal distribution prevents the parameters from getting stuck by sampling them simultaneously.

The other proposal distribution swaps non-decoy and decoy parameters. In this proposal distribution, indices for non-decoy and decoy parameters are uniformly sampled, and the HLA types and the HLA sequences at the indices are swapped. After swapping, candidate indicator variables are sampled based on the conditional distribution given the swapped parameters. Suppose that \(\mathcal {R^{*}}\) and \(\mathcal {S^{*}}\) are HLA types and HLA sequences after swapping, and that \(\mathcal {I^{*}}\) is a set of candidate indicator variables. Then, the acceptance rate r can be calculated by

$$\begin{array}{*{20}l} r & = \min \left(1, r^{*}\right), \\ r^{*} & = \frac{p\left(\mathcal{R^{*}}, \mathcal{S^{*}}, \mathcal{I^{*}} | \hat{X}\right) p\left(\mathcal{R^{*}}, \mathcal{S^{*}}, \mathcal{I^{*}} \to \mathcal{R}, \mathcal{S}, \mathcal{I} | \hat{X}\right)} {p\left(\mathcal{R}, \mathcal{S}, \mathcal{I} | \hat{X}\right) p\left(\mathcal{R}, \mathcal{S}, \mathcal{I} \to \mathcal{R^{*}}, \mathcal{S^{*}}, \mathcal{I^{*}} | \hat{X}\right)} \\ & = \frac{p\left(\mathcal{R^{*}}\right) {\sum\nolimits}_{\mathcal{I}} p\left(\hat{X} | \mathcal{S^{*}}, \mathcal{I}\right) p\left(\mathcal{I}\right)} {p\left(\mathcal{R}\right) {\sum\nolimits}_{\mathcal{I}} p\left(\hat{X} | \mathcal{S}, \mathcal{I}\right) p\left(\mathcal{I}\right)}. \end{array} $$

This proposal distribution enables quickly distinguishing reads from the target HLA gene and non-target homologous regions.

Some procedures are used in the burn-in period to avoid getting stuck in local optima. At the beginning of sampling, a multi-start strategy is used to reduce the influence of initial parameters. Specifically, some MCMC runs are carried out, and initial parameters are sampled from the last parameters of the MCMC runs. In addition, reference sequences are periodically copied to HLA sequences in the burn-in period. This works well to get better parameters because there are many local optima where the parameters of the HLA sequences are twisted as if some crossovers occurred. Although these two approaches do not satisfy the detailed balance condition, since they are carried out only in the burn-in period, sampled parameters correctly reflect the posterior distribution.

After sampling the parameters, HLA genotyping can be performed by counting \(R^{r}_{1}\) and \(R^{r}_{2}\). We used the most sampled HLA genotype in the MCMC process as the candidate. The HLA sequences of a sample can be also inferred by counting \(S^{r}_{1}\) and \(S^{r}_{2}\). Even though misimputations in HLA reference sequences can cause mistyping, the HLA sequence can still be determined correctly, and the misimputations and the mistyping can be corrected.

Results

WES and WGS datasets

To evaluate the capability of our method, we obtained 253 WES data with the HLA genotypes from the International HapMap Project [22] that had been used by Szolek et al. [13] and Shukla et al. [15]. We further downsampled these data to 1/2, 1/4, 1/8, and 1/16 to simulate low-coverage data.

We also used paired normal and tumor WGS data of 25 Japanese cancer patients, including 20 liver cancer and 5 microsatellite-unstable colon cancer samples. These data were obtained from an Illumina HiSeq system with a 101-bp pair-end read length. The sequence data were deposited into the International Cancer Genome Consortium (ICGC) database (https://dcc.icgc.org/).

The sequencing-based typing (SBT) approach, which is guaranteed to be accurate at 2nd field resolution, was used for validation of the 20 liver cancer samples. Additional HLA genotyping using the TruSight HLA Sequencing Panels, which are theoretically guaranteed to be accurate at full (4th field) resolution, was performed for 7 out of the above 20 liver cancer samples to reduce ambiguity of the SBT genotyping. The 5 microsatellite-unstable samples were genotyped using the TruSight HLA Sequencing Panels, in order to verify not only the HLA genotypes but also the presence of somatic mutations. We regarded the results of the SBT approach and/or the TruSight HLA Sequencing Panels as the correct information. If the results differed between the two methods, we assumed that the result of the TruSight HLA Sequencing Panel was correct.

WES- and WGS-based HLA genotyping

For performance comparison, we used three existing methods, OptiType [13], PHLAT [12], and HLA-VBSeq [14] because it has been reported that they achieve the highest accuracy for WES- and WGS-based HLA genotyping [16]. First, we applied ALPHLARD and the existing methods to the original and the downsampled WES data (Additional file 1: Tables S1-S5). Because the gold standard HLA genotypes were determined from exon 2 and 3, we used only the exons as the reference sequences in ALPHLARD. Figure 2 shows the performance of the methods. ALPHLARD kept higher accuracy compared with the other methods even when the downsampling ratio was low. The accuracy of the existing methods was consistent with the preceding paper [16].

Fig. 2
figure 2

WES-based HLA genotyping of ALPHLARD, OptiType, PHLAT, and HLA-VBSeq. Each WES data was downsampled to 1/2, 1/4, 1/8, and 1/16, and the four methods were applied to all of the original and the downsampled WES data

We also applied the methods to the normal WGS data and compared the determined HLA genotypes with those obtained by the SBT approach and the TruSight HLA Sequencing Panel (Additional file 2: Tables S6-S13). The performance of the four methods is shown in Table 1 and Additional file 3: Table S14. ALPHLARD clearly achieved a higher accuracy rate than the other methods. Moreover, the HLA-B genotype of one sample was inferred differently between the SBT approach and the TruSight HLA Sequencing Panel, and the result of ALPHLARD for this sample was identical to that of the TruSight HLA Sequencing Panel. This suggests that ALPHLARD could be potentially superior to the SBT approach in some cases. HLA-VBSeq achieved higher accuracy from the WGS data than from the WES data. This would be because HLA-VBSeq uses non-coding information such as the introns and the untranslated regions. The accuracy of the existing methods was consistent with the preceding paper [16].

Table 1 WGS-based HLA genotyping accuracy that indicates how many HLA types were correctly determined with ALPHLARD, OptiType, PHLAT, and HLA-VBSeq

Detection of somatic mutations

Next, we searched for somatic point mutations in the HLA genes. They were detected by comparing the inferred HLA sequences between paired normal and tumor samples of each patient. We detected three somatic point mutations in the microsatellite-unstable samples: two single-base deletions and one single-base insertion (Fig. 3 and Additional file 4: Figures S1 and S2). One of the deletions occurred in a homopolymeric region in exon 1 of the HLA-A gene, and the other occurred in a homopolymeric region in exon 1 of the HLA-B gene. Both of these mutations caused a frameshift, leading to an early stop codon and ultimate loss of function of the HLA allele. It is known that the HLA-A and HLA-B genes are homologous, and we found that the two deletions occurred at homologously the same position. Moreover, one of the HLA-A types (A*68:11N) has a single-base deletion at exactly the same homopolymeric position. These observations suggest that the homopolymeric regions are deletion hotspots. The insertion occurred in a homopolymeric region at the beginning of exon 4 of the HLA-A gene, which changed the HLA-A allele from A*31:01:02 to A*31:14N. This region is known as an insertion hotspot in some HLA types such as A*01:04N and B*51:11N, and the insertion causes no expression of the allele [2326]. The three indels identified were validated by the TruSight HLA Sequencing Panels and the Sanger sequencing.

Fig. 3
figure 3

A single-base deletion in exon 1 of the HLA-B gene of patient RK363. IGV screenshots were taken at the position for a the WGS data and b the TruSight HLA Sequencing Panel data. In each of the screenshots, the upper and lower tracks correspond to the normal and tumor samples, respectively

We further sought cases of LOH in the HLA genes as follows. First, we focused on two types of patients: (i) those for which HLA genotypes were uniquely determined for the normal sample but not for the tumor sample, and (ii) those for which HLA genotypes of both the normal and the tumor samples were uniquely but not identically determined. Then, we checked whether the collected reads of the tumor sample supported the HLA genotype inferred for the normal sample.

We next checked the influence of misimputations in HLA reference sequences. The HLA types of the used samples had complete genomic references, and hence had no imputed bases. Therefore, in each sample, we made artificial misimputations to the reference sequences of the true HLA types by replacing one base at a SNP position with another variant of a similar HLA reference sequence. This replacement possibly has a large effect to HLA genotyping. We applied ALPHLARD to the WES and the WGS data using the misimputed references. The accuracy was 85.3% for the WES data and 80.4% for the WGS data at 2nd field resolution, which implies ALPHLARD has the capability of accurate HLA genotyping even if there occurred misimputations in HLA reference sequences. We note that this problem could be solved as the HLA reference database becomes complete.

We were able to detect one likely case of LOH in the tumor sample of a patient, RK069. At each heterozygous single nucleotide polymorphism (SNP) position in each HLA locus, the log odds ratio was calculated for the WGS data and the Trusight HLA Sequencing Panels based on the number of reads that supported the SNP (Fig. 4 and Additional file 5: Figures S3-S7). These figures suggest that A*26:01:01, B*35:01:01, C*03:03:01, DPA1*01:03:01, DQA1*03:02, and DRB1*12:01:01 might be lost in the tumor sample of RK069.

Fig. 4
figure 4

The log odds ratios of the depths at heterozygous SNP positions in the HLA-A gene of patient RK069. The log odds ratios were calculated for a the WGS data and b the TruSight HLA Sequencing Panel data. These log odds ratios correspond to the relative quantities of observed A*26:01:01 SNPs in the tumor sample compared with the normal sample. The red dots indicate the mean values of the log odds ratios, and the vertical lines indicate the 95% confidence intervals

Discussion

In this paper, we presented a new Bayesian method, ALPHLARD, which performs not only HLA genotyping but also infer the HLA sequences of a sample. The results showed that our method ALPHLARD achieved higher accuracy for HLA genotyping from both WES and WGS data than existing methods. We presume that the high performance of ALPHLARD originates from the following reasons. First, the search space of ALPHLARD is all possible HLA allele pairs. Some methods treat an HLA allele pair as two independent HLA alleles; that is they give a score to each HLA allele and output the most and the second most probable HLA alleles without directly considering the combinations. This approximation reduces the computation time but works well only when the coverage of the sequence data is sufficient. Therefore, such methods would not achieve high accuracy for HLA genotyping from WGS data. Second, ALPHLARD takes into account whether or not bases and gaps are observed at each position by inserting the parameters for HLA sequences between the parameters for HLA genotypes and collected reads. Most of read count-based HLA genotyping algorithms consider only the number of reads mapped to each HLA allele. However, even if a lot of reads are mapped to an HLA allele, it does not seem to be the true HLA type if there are several regions not covered by any read. We believe that what is really important is not the number of reads but the range covered by sufficient reads. Third, ALPHLARD uses some decoy parameters in addition to non-decoy ones. This is why ALPHLARD can robustly and accurately perform HLA genotyping even if there exist some reads from non-target homologous regions that are similar to the target HLA gene.

Besides HLA genotypes, ALPHLARD gives us beneficial information that cannot be obtained from other methods. First, somatic mutations such as point mutations and LOHs can be detected by comparing the sampled HLA sequences of paired normal and tumor samples. We detected three indels and one case of LOH, which lead to loss of function of the HLA alleles. These mutations are biologically important because they weaken the immune function and would be related to tumor progression. Second, novel HLA types not registered in HLA databases can be identified by comparing the inferred HLA genotype and HLA sequences. Unfortunately, no novel HLA type was observed in our analysis. However, ALPHLARD would be flexible enough to detect the difference between novel HLA types and known ones because the process of novel HLA type identification is theoretically the same as that of HLA somatic mutation detection.

Although ALPHLARD performed better HLA genotyping than other existing methods, its accuracy would not be sufficient for clinical applications including hematopoietic stem cell transplantation. However, ALPHLARD could be a effective tool in combination with other methods such as SBT since the sample might have a novel HLA type.

Conclusion

Our new Bayesian-based HLA analysis method, ALPHLARD, showed good performance for HLA genotyping. It also has a potential to detect rare germline variants and somatic mutations in HLA genes. A large amount of WGS data has been recently produced by big projects such as the ICGC. Applying our method to such big data would help to fill in the current gaps in HLA reference databases and unveil the immunological significance of somatic mutations identified in HLA genes.