Background

Cancer cells are fraught with genomic variants in all regions of the genome with high degrees of heterogeneity in a spatially complex tumor. This intra-tumor heterogeneity (ITH) realizes a fitness landscape upon which natural selection can act (reviewed by [1]). Neoantigens, epitopes derived from proteins translated from non-synonymous variants, are able to make their way to the cell surface in the hopes of stimulating an immune response after a number of cellular processing steps have occurred, primarily proteosomal cleavage and binding with major histocompatibility complexes (MHC) I or II. This binding depends upon the patient specific human leukocyte antigen (HLA) alleles. From here, the bound neoantigen with its MHC-Class I complex makes its way to the cell surface where it may bind with cytotoxic T-cell receptors thereby eliciting infiltration of cytotoxic T-cells capable of detecting and eliminating cells carrying the neoantigen in the absence of immune evading tactics. The immune response is strongly influenced by the total number of neoantigens within a tumor, especially in hyper-mutated cancers ([2]), as well as the ITH of antigenic mutations ([3]). Recent advances in sequencing techniques allow for multi-region sequencing approaches whereby adjacent regions of the same tumor or tissue are able to provide greater insights into variant clonality (i.e. truly clonal, subclonal, or shared). There is increasing evidence that the neoantigen landscape of tumours can be highly heterogeneous, containing regions of subclonal immune escape and significantly different neoantigen load that can influence a patient’s response to immunotherapy [46].

A number of tools are available that provide mutated peptide annotation, binding affinity prediction, wild-type and mutant peptide comparison, and neoantigen ranking based on these measures [710]. Their input varies from raw sequencing files (e.g. fastq) [7, 8, 10] to highly annotated vcf files [9]; some provide HLA-typing as part of their pipeline [7, 10], but require further dependencies for HLAtyping software. Most rely on a version of netMHC or netMHCpan for binding prediction, but [9] offers a choice of additional software. For an in-depth comparison of available pipelines for neoantigen calling, we refer the reader to the recent review of Lancaster et al. [11].

Despite the increasing number and diversity of neoantigen-prediction tools, none of them possess the capability of providing predicitions on multi-region sequence data and assessing ITH of the antigenic landscape of tumours. Here, we present NeoPredPipe, a pipeline connecting commonly used bioinformatic software via custom python scripts to allow for the processing of single and multi-region variant call format (VCF) files, variant annotations, neoantigen predictions, cross-referencing with known epitopes, and performing in silico TCR recognition potential predictions in a single, clear, and proficient workflow (Fig. 1).

Fig. 1
figure 1

NeoPredPipe workflow differentiating between user steps (green) and execution processes (purple). NeoPredPipe provides low level details and high level summary statistics as output for downstream analysis (red)

Implementation

The first stage in neoantigen identification from a VCF file is the proper annotation of variants to identify non-synonymous variants. To this end, NeoPredPipe employs the widely used and efficient genomics tool, ANNOVAR ([12]). Specifically, ANNOVAR processes samples in a way that prioritizes exonic variants, this step provides a useful means for quickly partitioning variant calls for downstream applications. The user is able to specify the genome build that they would like to use, provided it is compatible with ANNOVAR. Finally, using the coding_change function of ANNOVAR and custom code, the mutated amino acid sequence is predicted from annotated nonsynonymous variant calls, and the peptide sequence surrounding the newly introduced amino acid is extracted for epitope prediction. From this step, mutations that give rise to a single amino acid change, and mutations that mutate a larger peptide segment (e.g. indels and stop-losses) are handled separately and reported in separate files to help further assessment.

Once the VCF files have been annotated and partitioned with ANNOVAR, the program determines if HLA haplotypes have been provided by the user containing the HLA-A, -B, and -C haplotypes. NeoPredPipe does not include HLA allele identification as this step in the pipeline is highly dependent upon the source of the data (WES, WGS, targeted gene panels, transcriptome data, or conducted via experimental methods), but the pipeline’s github page provides detailed advice on haplotyping from WES/WGS data using the popular tool POLYSOLVER [13], and the output of POLYSOLVER is automatically processed in NeoPredPipe. In cases where no HLA haplotype information is available the most common alleles of each haplotype are assessed; while in cases where the HLA haplotypes are homozygous only that HLA haplotype is used for prediction. HLA haplotypes are cross-referenced with available HLA haplotypes prior to executing netMHCpan ([14]) for the primary neoantigen predictions. As with the primary tool, the user is able to specify the epitope lengths to conduct predictions for (typically epitopes of 8-, 9-, or 10-mers). The output from this process yields a single file containing either filtered or unfiltered (dependent on user options) neoantigen predictions with information on the sample possessing the neoantigen and, in the case of multi-region variant calling, a presence/absence indicator for each of the sequenced regions. These predicted neoantigens are then, optionally, cross-referenced with normal peptides utilizing PeptideMatch ([15]), whereby the candidate epitopes are assessed for novelty against a reference proteome that can be supplied by the user as a fasta file (e.g. from Ensembl or UniProt). When available, users may also provide expression data as a tsv file specific to each sample (or a single reference file) to quickly assess expression levels of the gene carrying a predicted neoantigen. This information is included in the final output table.

The steps outlined above deliver candidate information for neoantigens from provided variant calls that may be presented to cytotoxic T-cells, however, this does not inform the likelihood of a neoantigen eliciting an immune response (i.e. being recognised by a TCR). In order to predict the recognition potential we employ the algorithms and process utilized by [16]. The recognition potential is defined as the product of A and R, where A is the amplitude of the ratio of the relative probabilities of binding for the wild-type and mutant epitopes to the MHC-class I molecules; and R is a measure of similarity to pathogenic peptides, meant to represent the probability that the neoantigen in question is recognised by a TCR clone already present in the tissue/blood. To define A it is necessary to perform neoantigen predictions for the wildtype and mutant epitope: this is not performed by default by NeoPredPipe, but is supplied as an option to employ as a contiguous pipeline. To define R, NeoPredPipe utilizes the multistate thermodynamic model employed by [16], which requires alignment scores for each epitope to a curated Immune Epitope Database list of known epitopes (can be refined and updated by the user, but is provided).

In order to incorporate the ability to assess ITH in regards to both effective mutations (non-synonymous variants and indels) and neoantigen burdens, NeoPredPipe is capable of handling multi-region VCF files; further these files can be multi-region in only a select number of samples and differ in the number of regions. Similarly, NeoPredPipe can process multi-region expression data for samples where information on regions are compiled into separate columns. Thus NeoPredPipe is able to efficiently handle various, potentially multi-region experimental designs for neoantigen prediction and assessments providing a summary table and an optional web-based visualization tool for downstream statistical and in-depth analysis.

Results

The output of the pipeline depends largely on the options set by the user, but at the very least, NeoPredPipe provides two tables of putative neoantigens and their predicted binding affinities, one for single nucleotide/amino acid, and one for indel(-type) variants. With additional options selected it is possible to include, within a single output, whether an epitope matches a reference proteome, its expression on the RNA level and the neoantigen’s recognition potential. In additon, for rapid assessment, NeoPredPipe yields summary statistics on the neoantigen burden for each sample, a rapidly executed web-based visualization, as well as information to assess ITH by reporting neoantigen burdens for clonal, subclonal, and shared variants for multi-region samples. A detailed description of NeoPredPipe’s output tables and each field in these can be found at https://github.com/MathOnco/NeoPredPipe.

Use Case

While a small, two sample, multi-region example dataset is provided with the source code for users, we demonstrate the usefulness of NeoPredPipe by applying it to a previously published dataset examining the evolutionary landscape of colorectal tumors [17]. We select two exemplary patient samples (Adenoma 3 and Carcinoma 7 in the original paper) from the dataset, and apply our pipeline using default parameters to evaluate neoantigens in each sample. Figure 2 illustrates the information included in the standard output of NeoPredPipe and potential analysis that can be performed if NeoPredPipe is combined with the output of other standard bioinformatic methods.

Fig. 2
figure 2

Analysis of neoantigens in two colorectal tumors using NeoPredPipe. a Venn diagram of all neoantigens in the five regions of Adenoma 3. b Number of neoantigens in the two samples that are clonal (present in all regions, shown in blue), shared (present in at least two regions, in yellow) or subclonal (present in a single region, red). Separate counts of weak and strong MHC-binding neoantigens (WB and SB, respectively) are also shown. c Distribution of recognition potential values of neoantigens present in Adenoma 3 (green) and Carcinoma 7 (red). The boxplots represent the median and upper and lower 25 percentile. Only neoantigens with recognition potential higher than zero are shown. d Phylogenetic tree reconstructed from all exonic mutations for Adenoma 3 (left) and Carcinoma 7 (right). Pie-charts and the bar-charts represent the number of weak (orange) and strong (red) binder neoantigens assigned to each branch. The size of each circle is proportional to the percentage of total neoantigens on that branch

Figure 2a provides a summary of the complex interactions between different regions of Adenoma 3, and highlights both Region 4, which harbours the highest amount of subclonal (only present in a single region) neoantigens, and the overall clonality of the sample, with 72 neoantigens detected in all regions. For quick analysis, NeoPredPipe directly outputs a summary of the clonality of neoantigens, also divided into categories of strong and weak binders (peptides with a netMHCpan percentile rank ≤0.5 and ≤2, respectively, as recommended in [14]). Figure 2b visualizes this summary on two bar-charts for Adenoma 3 and Carcinoma 7. We find that whilst the number of shared neoantigens (present in more than one, but not all regions) is highly similar between the two samples, Carcinoma 7 harbours both more clonal (present in all regions) and subclonal neoantigens; and in total 26% of the neoantigens are clonal, compared to 16% of Adenoma 3. Figure 2c shows the recognition potential value for all neoantigens in the two samples. NeoPredPipe identified 10 peptides in Adenoma 3 and 9 in Carcinoma 7 with a recognition potential value above 1. In Fig. 2d, we provide an example of integrating NeoPredPipe outputs with downstream multi-region variant analysis. By inferring phylogenetic trees of each tumor, constructed using all exonic mutations with a variant allele frequency above 0.05 (see [17] for full methods), we find that neoantigen distributions across regions can reflect the phylogenetic distance of regions and clonal structure of samples. 31% and 23.5% of total exonic mutations are clonal in Carcinoma 7 and Adenoma 3, similarly to the clonality of neoantigens shown in Panel B. This approach also highlights regions with neoantigen loads different from their closest neighbors, such as Region61 and Region62 of Carcinoma 7. Therefore the analysis can inform future experimental and bioinformatic investigations of samples allowing for new evolutionary and mechanistic insights into tumor development, evolution, and progression.

Conclusions

We present NeoPredPipe, an efficient, high-throughput, and user-friendly pipeline for neoantigen prediction and interrogation for single and multi-region tumor VCF files. By tying together commonly utilized bioinformatics toolsets and integrating recent advances in neoantigen assessment, NeoPredPipe yields concise information typically required by researchers and clinicians. Through user options, based on the individuals own computational limitations, the pipeline is scalable for a high performance computing (HPC) cluster environment and customizable for individual research questions. Furthermore, unlike existing methods[710], NeoPredPipe can process a directory containing numerous samples in a single command; therefore provides a user-friendly way for not computer-proficient users to analyse the output of large studies or compare against reference datasets. All source code and an extensive read me for each component of NeoPredPipe with all pipeline options are available at https://github.com/MathOnco/NeoPredPipe.

Availability and requirements

Project name: NeoPredPipe

Project home page: https://github.com/MathOnco/NeoPredPipe

Operating system: Unix-based operating system

Programming languages: Python and Bash

Other requirements: Python 2.7, ANNOVAR, netMHCpan, PeptideMatch, and (optionally) NCBI BlastX+.

License: GNU GPLv3

Any restrictions to use by non-academics: None