Introduction

Major histocompatibility complex (MHC) molecules have attracted a lot of attention due to their central role in discriminating between self and non-self and their enormous polymorphism. Classical MHC class I molecules are present on most cell types and present peptide fragments from self- and non-self-proteins to CD8+ T cells, thus initiating a protective immune response when presenting peptides from foreign proteins (Klein 1986). Classical MHC class II molecules are present on specialised antigen presenting cells and stimulate CD4+ T cells when presenting peptides originating from foreign endocytosed proteins.

The IPD-MHC Database contains sequence data from classical and non-classical MHC class I and class II genes from non-human species such as important farmed animals, experimental animals or pets (Maccari et al. 2017). For a locus to be defined as classical, the gene must be highly polymorphic, the alleles must be peptide binders and the molecules must be membrane proteins. Classical MHC class I molecules are expressed on most cells so the expression patterns must comply with this expectation. For class II, the classical molecules are only expressed on specialised antigen presenting cells, influencing the expected transcriptional patterns. For species without a clear-cut understanding of the number of genes and their genomic organization, it is more difficult to link a nucleotide sequence to a specific locus. Thus, the species included in the IPD-MHC Database have a well-defined number of classical genes, so that included nucleotide sequences can be linked to a given locus. In tetrapods, the MHC class I and class II genes are physically linked in one genomic region, but in teleost fishes there is no major histocompatibility complex as such, because the classical class I and II loci identified reside on different chromosomes (Bingulac-Popovic et al. 1997).

Currently, only the salmonids Atlantic salmon Salmo salar and rainbow trout Oncorhynchus mykiss represent ray-finned and teleost species in the IPD-MHC fish section. Due to their economic importance in aquaculture, identifying the classical MHC loci and defining their alleles, peptide binding ability and transcriptional patterns have been a priority (Grimholt et al. 2015; Kiryu et al. 2005; Lukacs et al. 2007; Shiina et al. 2005).

Genomes of a large number of other salmonid species are becoming available (e.g. Chinook salmon Oncorhynchus tshawytscha (Christensen et al. 2018a), Arctic char Salvelinus alpinus (Christensen et al. 2018b), grayling Thymallus thymallus (Savilammi et al. 2019)) making them prime candidates for inclusion in the IPD-MHC Database once MHC alleles are defined such as already initiated for brown trout Salmo trutta (O’Farrell et al. 2013).

Among other ray-finned or teleost species, medaka Oryzias latipes and zebrafish Danio rerio have the most reliable information regarding number of classical loci. Medaka has two defined classical MHCI loci denoted UAA and UBA (Nonaka and Nonaka 2010), but only one classical MHCII alpha and beta gene (Bannai and Nonaka 2013). Zebrafish has haplotypes with varying number of assumed classical MHCI genes (McConnell et al. 2014) while the number of classical MHCII genes is still undefined (Dijkstra et al. 2013; Ono et al. 1992; Sultmann et al. 1994; Sultmann et al. 1993). In other species, the number of classical MHC genes is mostly undefined with the exception of Atlantic cod Gadus morhua where the number of class I genes has greatly expanded while the MHCII genes, invariant chain and CD4 have been lost entirely (Star et al. 2011).

In salmonids, there are six characterised MHC class I lineages defined based on sequence identity denoted U, Z, S, L, P and H lineages (Dijkstra et al. 2007; Grimholt et al. 2015; Grimholt et al. 2019; Shum et al. 1999; Stet et al. 1998). Of these six lineages, only the U and Z lineages are peptide binders where the single classical locus UBA identified in rainbow trout and Atlantic salmon belongs to the U lineage (Grimholt et al. 2015). Other U lineage loci, i.e. the UCA, UDA, ULA, UGA and UHA, in these salmonids are defined as non-polymorphic where all but UGA have more restricted tissue distribution patterns. Also for MHC class II, there are multiple lineages denoted A, B and E lineages where the single classical MHC class II alpha (DAA) and class II beta (DAB) loci belong to the A lineage in Atlantic salmon and rainbow trout (Dijkstra et al. 2013; Grimholt 2016). There is a close physical linkage between these salmonid DAA and the DAB genes so their alleles segregate as a functional haplotype (Stet et al. 2002). The stability of these MHC class II haplotypes remains to be established.

For the salmonids Atlantic salmon and rainbow trout, the IPD-MHC Database currently includes 96 MHC class I sequences from the UBA locus and 89 MHC class II sequences originating from the DAA and DAB loci. At least for Atlantic salmon, the small number of 112 MHC class I and class II sequences is mainly due to a limited number of studies on MHC diversity. The overall MHC diversity in wild or farmed populations is currently unknown.

MHC has been firmly linked to pathogen resistance in salmonids (Croisetiere et al. 2008; Grimholt et al. 2003; Kjoglum et al. 2008; Langefors et al. 2001; Lohm et al. 2002) making it an important aspect to consider when cultivating wild stock. Specific MHC class II alleles were found to confer resistance towards furunculosis in Atlantic salmon and brook charr Salvelinus fontinalis, while other alleles were associated with susceptibility. Most likely, associations between MHC alleles and other pathogens also exist in salmonids, making it highly sensible to ensure that the MHC diversity is preserved in present and future populations.

More than 400 Norwegian watercourses harbour genetically distinct populations of wild Atlantic salmon. A selection of these populations once formed the basis of commercial breeding programmes for farmed Atlantic salmon (Gjedrem 2000). At least for one of our main breeding populations, the continued selection for production traits such as growth has biased the population to a few dominant river strains (Gjedrem et al. 1991). Many wild salmon populations are endangered or vulnerable due to anthropogenic factors and reduced marine survival (Forseth et al. 2017). Release of hatchery-produced eggs, fry or smolt of wild origin has been used to enhance stocks, compensate for the negative effects of hydropower development and to restore stocks decimated by acid precipitation, G. salaris and most recently salmon farming. One of the challenges in stock restoration is to ensure genetic representativeness and diversity. Guidelines have thus been developed and implemented by the Norwegian Environment Agency (Norwegian Environment Agency 2014). Genetic tools aimed at excluding salmon of farmed origin, avoiding inbreeding and maximising effective population size are now routinely used.

Infectious diseases are severely hampering aquaculture production (Hjeltnes et al. 2019) and there is a growing concern about the impact on wild salmonids (Garseth et al. 2013). In addition, it has been inferred that climate change can lead to the introduction of new hosts and pathogens, increase pathogen development, survival and transmission, and also affect host susceptibility (Harvell et al. 2002). It is therefore imperative to avoid loss of immune diversity in connection with stock restoration programmes. MHC typing and monitoring of immune diversity thus represent a necessary and timely tool in wild salmon restoration programmes.

The Norwegian national salmon river Vosso once held the largest Atlantic salmon in the world, with a unique cultural legacy and considerable local impact on business and recreation (Barlaup 2008). The salmon population collapsed during the 1980s, and although the circumstances were not fully understood, it has been inferred that the population was negatively affected by acid precipitation, hydropower development, road construction and salmon lice during the past 20–30 years (Barlaup 2008). The spawning stock was at a very low level in the 1990s and 2000s, and genetic analysis suggests that the original wild population was replaced by a population affected by escaped farmed salmon during this period (Glover et al. 2012). A rescue operation was thus launched aimed at restoring the Vosso salmon with material collected by the Genebank programme for wild Atlantic salmon during the late 1980s (http://tema.miljodirektoratet.no/en/Areas-of-activity1/Species-and-ecosystems/Salmon-trout-and-Arctic-char/Gene-banks-for-wild-salmon/) (Barlaup 2008).

Both the number of fish species and the number of alleles in the IPD-MHC Database are expected to grow considerably due to the advances in sequencing technology. High-throughput sequencing (HTS) such as Illumina provides a quick and easy way of genotyping many samples in a limited period. Various NGS technologies have been tested and compared for human HLA typing (Carapito et al. 2016; Duke et al. 2016), and NGS approaches are also applied to non-human species, including MHC class II beta typing for the teleost fish guppy (Lighten et al. 2014).

HTS also provides new challenges when including such transcripts in the IPD-MHC Database. Previously, the sequences should have been identified in three separate PCR reactions thus eliminating jumping PCR artefacts and the sequences needed to include both alpha 1 and alpha 2 domains for class I and at least the alpha 1 or beta 1 domains for class II. What is required for including sequences originating from Illumina studies is currently undefined for fish. Here we develop an Illumina sequence typing protocol for cDNA typing of Atlantic salmon MHC class I and class II alleles, rename existing alleles in the IPD-MHC Database to accommodate HTS and identify aspects needing special attention.

Material and methods

Study animals

This study includes head kidney tissue preserved in RNAlater (ThermoFisher) from ten Atlantic salmon captured in River Vosso during the period 2007 to 2009 (denoted AS1-AS10). Samples were obtained during routine health control of brood fish in the stock restoration programme. During this period, scale characters were used to distinguish between salmon of wild, hatchery-reared and farmed origin (Lund and Hansen 1991), and were also used to determine the number of years spent in the river and in the sea, i.e. smolt-age and sea-age, respectively (Lea 1910; Lee 1920). Catch-year, in combination with smolt-age and sea-age, was subsequently used to select ten individuals that were not from the same family group (i.e. not siblings).

For the purpose of this study, head kidney samples underwent genetic analyses to identify individuals having farmed salmon in their pedigree (Karlsson et al. 2014; Karlsson et al. 2011). The method has been mandatory in stock enhancement of anadromous salmon since 2014 (Norwegian Environment Agency 2014). The method generates a p(wild) value that reflects the “probability of being wild”, with a high value reflecting a high probability of being wild, while salmon with p(wild) values < 0.71 are unlikely to be of pure wild origin (Table 1). Based on scale reading, four animals are considered wild and five hatchery-reared. Based on genetic tests (p(wild)), six are wild and two to three are the product of variable genetic introgression from farmed salmon.

Table 1 Classification of Atlantic salmon animals based on scale readings and probability of being wild (pWild)

Preparing the sequence library

RNA was isolated from head kidney tissue preserved in RNAlater according to the manufacturer’s recommendation (RNeasy, Qiagen, NL). One of the ten selected animal samples did not pass the RNA quality control (sample AS#4) and was not included further. cDNA was synthesised using 10 ng total RNA according to the manufacturer’s recommendation (QuantiTect Reverse Transcription Kit, Qiagen, NL) and the resultant cDNA was eluted in 35 μl TE. Due to known and unknown sequence variation in the primer regions, we initially tested different forward primers to ensure detection of all allelic variants in the study material (Table 2; primer testing) prior to ordering the Illumina adapter primers. The primers were chosen to comply with overlapping 300 bp paired end sequences for MiSeq v3 sequencing and the design of primer pairs is based on the successful 16S amplicon project described elsewhere (de Muinck et al. 2017).

Table 2 PCR primer sequences used in this study

We used 10 ng of the cDNA in 10 μl PCR reactions for each of the three UBA, DAA and DAB genes with 0.625 units OneTAQ DNA polymerase (NEB Inc., USA), one times standard reaction buffer, 200 μM dNTPs and 0.2 μM each primer. Based on the initial testing, two different forward primers each were chosen for amplifying DAB and UBA fragments for Illumina sequencing. Twenty-five microliter reactions were performed with the first Illumina primer sets (Table 2; 1F/1R primers) using a PCR reaction mix as described above. Products were verified on a 1% agarose gel prior to cleanup using 1.8 × PCR volume of Agencourt AMPure XP PCR purification kit (Beckman Coulter, Brea, CA, USA) according to the manufacturer’s recommendation and dissolved in 20 μl TE. DNA concentrations and fragment sizes were measured on a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA) and Agilent Bioanalyzer (Santa Clara, CA, USA), respectively.

Libraries containing each of the UBA, DAA and DAB PCR products were blended in proportions to ensure similar coverage and subjected to 10 cycles of PCR using the second set of primers (Table 2; F2+R2), thus adding one unique Illumina index for each animal. This second amplification was carried out with 0.625 units OneTAQ DNA polymerase (NEB Inc., USA), one times standard reaction buffer, 200 μM dNTPs, 0.2 μM each primer and 10 μl template pool at 3 ng/μl. The following programme was used for amplification: 94 °C for 2 min; ten cycles of 94 °C for 30 s, 58 °C for 30 s, 72 °C for 60 s; 72 °C for 10 min.

The nine resultant PCR pools were subjected to two additional AMPure cleanups using a 1:1 ratio to eliminate shorter fragments. Based on data from Bioanalyzer and Qubit, the nine PCR pools were mixed totalling 2 μg DNA in 130 μl TE-1 buffer (10 mM Tris-HCl, 0.1 mM disodium EDTA, pH 8.0) as recommended for Illumina MiSeq sequencing. qPCR was performed to check the library size before proceeding with sequencing. Sequencing was performed on the Illumina MiSeq (Illumina, USA) platform using the v3 chemistry to achieve 300 bp paired end reads.

Bioinformatic analyses

Illumina raw reads (fastq format; read 1 and read 2 pairs) were obtained for each animal as the sequence data was demultiplexed using the Illumina index introduced during the second PCR reaction. The bioinformatic pipeline described below is also explained in Fig. 1 as a flowchart. Sequence data has been submitted to NCBI SRA under the BioProject accession number PRJNA578031.

Fig. 1

Data analysis flowchart. Flowchart describes the analysis workflow used in this study and explained in detail under the “Bioinformatic analyses” section within the “Materials and Methods” section. All the tools are available as open source software/tool and the custom script used in the last step can be found at https://github.com/NorwegianVeterinaryInstitute/Salmonid_MHC_classifier

Data for each animal was processed using BBDuk v34.56 (part of BBTools; https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/) to remove/trim bad quality reads and sequencing adapter sequences. Cleaned reads were further demultiplexed based on the primers used during the first PCR reaction using demultiplexer v1.7 (https://github.com/nsc-norway/triple_index-demultiplexing) allowing zero mismatches between the primers and the sequenced reads. This step separates the reads into the MHC subgroups targeted during the first PCR reaction and removes the primer sequences from the reads.
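The logic of the primer-based demultiplexing step can be illustrated with a minimal Python sketch. This is not the demultiplexer tool cited above; the function name and the primer sequences are hypothetical placeholders (the actual primers are listed in Table 2), and only exact matching at the start of a read is shown.

```python
def demultiplex_by_primer(reads, primers):
    """Sort reads into groups by exact (zero-mismatch) forward-primer match.

    The matched primer is stripped from the read. Reads that do not start
    with any of the primers are discarded.
    """
    groups = {name: [] for name in primers}
    for read in reads:
        for name, primer in primers.items():
            if read.startswith(primer):
                groups[name].append(read[len(primer):])
                break
    return groups


# Hypothetical primers and reads, for illustration only.
primers = {"UBA": "ACGTAC", "DAA": "TTGCAA", "DAB": "GGATCC"}
reads = ["ACGTACGGGTTT", "TTGCAACCCAAA", "ACGAACGGGTTT", "GGATCCTTTT"]
groups = demultiplex_by_primer(reads, primers)
```

In this toy input, the third read carries a single mismatch in the primer region and is therefore dropped, mirroring the zero-mismatch setting described above.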

Read 1 and read 2 for each MHC subgroup were combined using FLASH v1.2.11 (Magoc and Salzberg 2011) with default settings (-r 300 was used to specify the read length). FLASH uses the overlapping information between the paired reads to combine them into one long read. The resulting full-length amplified reads were collapsed to identify all the unique reads and sorted based on the number of times each was present in the data using fastx_collapser (part of FASTX Toolkit v0.0.13; http://hannonlab.cshl.edu/fastx_toolkit/). The top five most represented (1–2% of the FLASHed reads) full-length amplified reads (Fasta format) were further processed.
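The collapsing and ranking step can be sketched in a few lines of Python. This is an illustration of the principle only, not a replacement for fastx_collapser; the function name and the toy reads are assumptions made for the example.

```python
from collections import Counter


def collapse_reads(reads, top_n=5):
    """Collapse identical reads and return the top_n unique sequences.

    Each entry is (sequence, read count, fraction of all reads),
    sorted from the most to the least abundant sequence.
    """
    counts = Counter(reads)
    total = len(reads)
    return [(seq, n, n / total) for seq, n in counts.most_common(top_n)]


# Toy merged reads, for illustration only.
merged = ["AAA"] * 5 + ["CCC"] * 3 + ["GGG"] * 2
top = collapse_reads(merged, top_n=2)
```

The fraction column makes it straightforward to compare the support for the first and second most abundant sequences within one animal.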

Potential MHC allele sequences identified using the above approach were evaluated using custom python scripts (https://github.com/NorwegianVeterinaryInstitute/Salmonid_MHC_classifier). 0–2 nucleotides were removed from either/both ends of the sequences to accommodate frame shift issues while converting to amino acid sequences before proceeding with the analyses.

The scripts were developed in collaboration with the IPD-MHC Database to use a library of official alleles to identify the closest match to the input sequence. The scripts automatically retrieve relevant information from the IPD-MHC Database, thus facilitating the analysis and identification of novel sequences against the up-to-date dataset.

The input fasta records were converted to amino acid sequences using transeq [part of EMBOSS v6.6.0.0; (Rice et al. 2000)] followed by multiple sequence alignment (nucleotide only) with relevant IPD-MHC Database entries using MUSCLE v3.8.1551 (Edgar 2004). Closest clade/sibling information from the tree produced by MUSCLE was extracted using the python module ETE toolkit (Huerta-Cepas et al. 2016). Sequence similarity and identity between the fasta record and the closest sibling were calculated for both nucleotide and amino acid sequences using the Water alignment tool [part of EMBOSS v6.6.0.0; (Rice et al. 2000)]. A report file was generated with all the relevant information for each fasta record, so that the user can make an educated evaluation regarding the nomenclature and submit the probable MHC sequences to the IPD-MHC Database for further verification and official name assignment.

Phylogenetic analysis

Amino acid sequence alignments were performed in ClustalX (Larkin et al. 2007) after 5′ and 3′ sequences including primer sequences were removed using Jalview 2 (Waterhouse et al. 2009). The evolutionary history was inferred using the maximum likelihood method based on the Whelan and Goldman model (Whelan and Goldman 2001). The percentage of trees in which the associated taxa clustered together is shown next to the branches. Initial tree(s) for the heuristic search were obtained automatically by applying Neighbour-Join and BioNJ algorithms to a matrix of pairwise distances estimated using a JTT model, and then selecting the topology with superior log likelihood value. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 0.0515)). The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. All positions with less than 95% site coverage were eliminated. That is, fewer than 5% alignment gaps, missing data and ambiguous bases were allowed at any position. Evolutionary analyses were conducted in MEGA7 (Kumar et al. 2016).

Sanger sequencing

To distinguish between some MHC class II alpha alleles, we amplified cDNA fragments from selected animals using primers shown in Table 2. The PCR was performed as described above. Fragments were cloned into the pCR2.1 vector (ThermoFisher) and transformed into Oneshot Top10 competent cells (ThermoFisher), and individual clones were sequenced using BigDye Terminator 3.1 (Applied Biosystems) according to the manufacturer’s protocol.

Results and discussion

New nomenclature

Salmonid MHC alleles are currently denoted according to the set of guidelines promoted by the MHC Nomenclature committee (Maccari et al. 2018), where a unique four-letter code identifying the organism is followed by the gene name and an allele number. For example, MHC-Sasa-DAA*0101 or Sasa-DAA*0101 for short, Sasa-DAA*0102 etc. where Sasa denotes Salmo salar, DAA is the locus MHC class II alpha and *0101 denotes the first sequence in this sequence group. The second allele in this sequence group is denoted *0102. To represent a new sequence group, 4 amino acid differences are required between class I alleles, and 3 amino acid differences are required for class II alleles. In the case of silent mutations, when the amino acid sequences are identical but the nucleotide sequences are different, a third double-digit number is added e.g. Sasa-DAA*010101, Sasa-DAA*010102 etc.

In order to facilitate the comparison of genomic data, this paper introduces the use of the human MHC nomenclature (Marsh et al. 2010) to describe allele variation in fish, as suggested for non-human species (Maccari et al. 2018). The gene prefix name is identical, i.e. MHC-Sasa-UBA* defines an allele originating from the UBA gene of Atlantic salmon. The allele group is shown using a two-digit number (Sasa-UBA*01) where individual groups have four amino acid differences for class I and three amino acid differences for class II. The specific protein is shown with an additional two digits separated by a colon (Sasa-UBA*01:01). Synonymous substitutions in the coding region are represented using an additional two- or three-digit number introduced after an additional colon (Sasa-UBA*01:01:01). Additional differences in non-coding regions are shown using another two-digit number (Sasa-UBA*01:01:01:01) (see Supplementary File 1). The same nomenclature will also be applied to the MHC alleles from rainbow trout (see Supplementary File 1).
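The relation between the previous digit-only names and the colon-separated format can be illustrated with a short Python sketch. It assumes that every field of an old name is exactly two digits wide and simply inserts colons; the function name is hypothetical, and the authoritative mapping of old to new names remains the one given in Supplementary File 1.

```python
import re

# Old-style name: optional MHC- prefix, four-letter species code, locus, digits.
OLD_NAME = re.compile(r"((?:MHC-)?[A-Za-z]{4}-[A-Z]+)\*(\d+)")


def to_colon_nomenclature(name):
    """Convert e.g. Sasa-DAA*0101 to Sasa-DAA*01:01 (two digits per field)."""
    match = OLD_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised allele name: {name!r}")
    prefix, digits = match.groups()
    if len(digits) < 4 or len(digits) % 2:
        raise ValueError(f"expected an even number (>= 4) of digits: {name!r}")
    fields = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return f"{prefix}*{':'.join(fields)}"


new_name = to_colon_nomenclature("Sasa-DAA*010102")
```

Names that do not follow the assumed pattern raise an error rather than being converted silently, so that they can be checked by hand.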

Sequence analyses—scripts linked to IPD-MHC

Illumina sequencing yielded 0.65–1.4 M read pairs per animal and out of these more than 82% were retained after removing/trimming low-quality reads and adapter sequences. Further demultiplexing the data using the primers used for amplification provided more than 97,000 read pairs per MHC group for each animal. FLASH was able to combine 50–90% of these read pairs based on the overlapping regions, which provide 95–395,000 full-length amplified reads per group for each animal (Supplementary file 2: Sheet1).

Collapsing the reads using fastx_collapser to identify the most represented unique reads showed that 26–40% of the reads were represented by one allele (Supplementary file 2: Sheet2). Out of the 27 groups (3 groups for 9 animals), 8 had one over-represented allele while the rest had two or more. The percentage difference between the first and second most over-represented allele was much more pronounced in DAA and UBA, while the difference was negligible across DABs. The top five nucleotide and deduced amino acid sequences for each gene in each animal are listed in Supplementary file 3.

We established our library and bioinformatics pipeline using only nine animals. However, based on these results, this Illumina MiSeq v3 approach could most likely be adapted to 96 animals, three genes each.

MHC analysis

To prepare for Illumina typing of MHC alleles in a new population, one needs to identify primers that both comply with the read length of the Illumina sequencing mode and ensure that all alleles in that population are identified using the chosen set of primers. In particular, the sequence variation in the leader sequence region of UBA alleles (Supplementary file 4) makes it necessary to test a variety of forward primers to ensure that all alleles are represented in the final library. We found that only two of the five tested primers produced fragments for UBA in our animals. We also used two different MHC class II beta forward primers as primer efficiency varied between animals. For MHC class II alpha, all animals showed good amplification using just one primer set.

The sequences discussed below are given names that identify the animal, i.e. AS1 to AS10 (AS4 did not pass RNA quality control and is thus not present in the analysis), and the allele class as follows: class II alpha is DAA plus a number referring to the sequence number found in the MiSeq v3 data analysis. Thus, AS1_DAA_s1 would refer to the collapsed DAA sequence with the highest number of reads found in the data analysis of animal number 1. Class II beta is denoted DAB and class I is denoted UBA, making AS1_DAB1_s1 and AS1_UBA1_s1 the DAB and the UBA sequences with the highest number of reads found in the analyses of sequences from animal number 1 using the forward primers DAB1 and UBA1.
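For downstream handling, sequence labels of this kind can be split into their components. The Python sketch below is an assumption-laden illustration: the regular expression is inferred from labels such as AS1_DAA_s1 and AS2_UBA1_s1_46537 used in this paper, the digit after the gene name is taken to be the forward primer number, and the optional trailing number is assumed to be the collapsed read count.

```python
import re

# Pattern inferred from the labels used in this study (an assumption).
LABEL_PATTERN = re.compile(
    r"(?P<animal>AS\d+)_(?P<gene>DAA|DAB|UBA)(?P<primer>\d*)"
    r"_s(?P<rank>\d+)(?:_(?P<reads>\d+))?"
)


def parse_label(label):
    """Split a sequence label into animal, gene, primer, rank and read count."""
    match = LABEL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognised sequence label: {label!r}")
    return {
        "animal": match["animal"],
        "gene": match["gene"],
        "primer": int(match["primer"]) if match["primer"] else None,
        "rank": int(match["rank"]),
        "reads": int(match["reads"]) if match["reads"] else None,
    }


parsed = parse_label("AS2_UBA1_s1_46537")
```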

Seven of nine animals were heterozygous for MHC class II alpha (Table 3); only animals AS5 and AS10 were homozygous. We found six alleles in the material, all present in the current IPD-MHC Database (Fig. 2, Supplementary file 5). Based on the amplified region, we could not determine if the AS3_DAA3_s1 and AS8_DAA8_s2 sequences were DAA*01:01 or DAA*01:02. We thus PCR amplified more of the coding region, cloned and Sanger sequenced fragments from the two samples and found them to be DAA*01:02 in both animals (data not shown).

Table 3 MHC alleles identified in Atlantic salmon animals
Fig. 2

DAA tree. Evolutionary relationships of MHC class II alpha DAA amino acid sequences. Sequences originating from our dataset are shown using red font. The number of Illumina reads per sequence is shown in parentheses. The tree with the highest log likelihood (−444.69) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 0.0515)). The analysis involved 38 amino acid sequences. There were a total of 70 positions in the final dataset

A few of the animals show more than two MHC sequences with considerable support in number of collapsed reads. For instance, for AS2_DAA, the two top sequences are supported by more than 28,000 reads, but the next three sequences are supported by more than 3700 reads (Supplementary files 2-3). Aligning these five sequences shows a pattern of jumping PCR i.e. one MHC allele is partly elongated during one PCR cycle, then denatured and then the sequence re-associates with another allele for further elongation. This means that all the variable sites have amino acids from one of the two alleles but in different combinations. We used this approach to exclude the third highest supported sequence for all classes of genes, thus supporting our expectation of only two alleles per animal.
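The criterion described above, that every variable site of a suspected artefact carries the residue of one of the two genuine alleles, can be expressed as a simple check. The Python sketch below is a minimal illustration under the assumption that the three sequences are already aligned and of equal length; the function name and the toy sequences are hypothetical.

```python
def is_possible_chimera(candidate, parent_a, parent_b):
    """Return True if the candidate could be a mosaic of the two parents.

    The candidate must differ from both parents, and at every position
    it must match at least one of them.
    """
    if not (len(candidate) == len(parent_a) == len(parent_b)):
        raise ValueError("sequences must be aligned to the same length")
    if candidate in (parent_a, parent_b):
        return False
    return all(c == a or c == b
               for c, a, b in zip(candidate, parent_a, parent_b))


# Toy aligned sequences, for illustration only.
allele_1 = "AAAAAAAA"
allele_2 = "ACAAGAAT"
mosaic = is_possible_chimera("ACAAAAAA", allele_1, allele_2)
novel = is_possible_chimera("ACAATAAT", allele_1, allele_2)
```

A sequence flagged in this way is only a candidate artefact: a genuine recombinant allele would pass the same test, so read support and, where needed, independent PCR remain necessary.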

Eight of our nine animals were heterozygous for MHC class II beta (Table 3). We identified eight DAB alleles in our study material where seven were already included in the IPD-MHC database (Fig. 3, Supplementary file 5). Our DAB*09:01 allele sequence is an extension of the IPD-MHC allele sequence. As the DAA and DAB genes are closely linked on chromosome 12, only separated by 3 kb, we expect these alleles to segregate as haplotypes with specific combinations of alpha and beta alleles. Two of the seven haplotypes we identified in a previous study (Grimholt et al. 2003) were also found in this material, i.e. DAA*02:01-DAB*02:01 and DAA*06:01-DAB*06:01. Four new haplotypes were supported by more than one animal, i.e. DAA*01:02-DAB*08:01 present in animals AS3 and AS8, DAA*03:02-DAB*20:01 present in animals AS2 and AS9, DAA*04:01-DAB*09:01 found in animals AS1 and AS10 and DAA*09:01-DAB*07:01 found in animals AS3 and AS9. The DAA*01:02-DAB*08:01 haplotype found in animals AS3 and AS8 differs from the DAA*01:01-DAB*08:01 haplotype found in our previous study by only one amino acid in the DAA allele. Previously, we found the DAA*04:01-DAB*07:01 haplotype, but in this material, DAA*04:01 seems linked to either DAB*09:01 or DAB*09:02, where only one amino acid separates DAB*09:01 from DAB*09:02. Judging by the two additional haplotypes DAA*09:01-DAB*09:01 and DAA*04:01-DAB*09:02, this may suggest that there has been a crossing over between the DAA*04:01-DAB*07:01 and DAA*09:01-DAB*09:01 haplotypes providing the new DAA*09:01-DAB*07:01 and DAA*04:01-DAB*09:01 haplotypes found in this study. A problematic issue is the fact that AS5 has only one DAA allele, but two DAB alleles where one of the DAB alleles is new (Fig. 3, AS5_DAB2_s2_9933). One explanation is that the DAA*06:01 allele segregates with two different DAB alleles, where the DAB sequences differ by four amino acids and are each supported by a similar number of collapsed reads.
Comparing our Vosso population with our previously MHC typed farmed population, there does not seem to be a stable link between MHC class II alpha and MHC class II beta alleles making it necessary to genotype both the alpha and the beta alleles.

Fig. 3

DAB tree. Evolutionary relationships of MHC class II beta DAB amino acid sequences. Sequences originating from our dataset are shown using red font. The number of Illumina reads per sequence is shown in parentheses. The tree with the highest log likelihood (−1093.62) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 0.2153)). There were a total of 85 positions in the final dataset

Seven of our nine animals were also heterozygous for MHC class I (Table 3) representing a total of twelve MHC class I UBA alleles (Fig. 4, Supplementary file 5). The five alleles UBA*02:01, UBA*06:01, UBA*13:01, UBA*20:01 and UBA*34:01 were already present in the IPD-MHC database. Two of these alleles, i.e. AS6_UBA1_s2_3956 (UBA*02:01) and AS7_UBA1_s2_6454 (UBA*20:01), had lower collapsed read support than the remaining sequences defined as alleles. We chose to explain this by efficiency differences in PCR primers rather than by contamination. One allele differed only slightly from an IPD-MHC allele, i.e. AS7_UBA_s1 differs in two amino acids from UBA*35:01 and thus qualifies for being named UBA*35:02. The six remaining allele sequences differed by more than four amino acids from existing alleles and thus represent new IPD-MHC alleles (Fig. 4, Supplementary files 3 and 5; AS2_UBA1_s1_46537, AS2_UBA1_s2_33647, AS5_UBA2_s1_22247, AS5_UBA2_s2_20661, AS7_UBA1_s1_25193, AS9_UBA1_s2_14112).
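The grouping rule used when comparing a candidate with its closest database allele can be sketched as follows. This Python example assumes pre-aligned amino acid sequences of equal length and uses the thresholds stated in the nomenclature section (four differences for class I, three for class II, interpreted here as "at least"); the function names and toy sequences are hypothetical, and synonymous nucleotide differences are not considered.

```python
# Minimum number of amino acid differences that defines a new allele group.
GROUP_THRESHOLD = {"I": 4, "II": 3}


def count_aa_differences(seq_a, seq_b):
    """Count mismatching positions between two aligned amino acid sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def classify_against_reference(query, reference, mhc_class):
    """Classify a query relative to its closest reference allele."""
    differences = count_aa_differences(query, reference)
    if differences == 0:
        return "same protein"
    if differences >= GROUP_THRESHOLD[mhc_class]:
        return "new allele group"
    return "new protein in same group"


# Toy sequences: three differences qualify as a new group for class II only.
class_i_call = classify_against_reference("MADSLK", "MGESLR", "I")
class_ii_call = classify_against_reference("MADSLK", "MGESLR", "II")
```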

Fig. 4

U lineage tree. Evolutionary relationships of MHC class I U lineage amino acid sequences. Sequences originating from our dataset are shown using red font. The number of Illumina reads per sequence is shown in parentheses. The tree with the highest log likelihood (−4391.31) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 0.6296)). The analysis involved 66 amino acid sequences. There were a total of 160 positions in the final dataset

UBA alleles have highly diverse alpha 1 domain sequences where different lineages are shared between distantly related species (Aoyagi et al. 2002; Grimholt et al. 2015; Kiryu et al. 2005). These alpha 1 domain lineage sequences are then combined with different lineages of alpha 2 domain and downstream sequences potentially due to recombination in the large intron between the alpha 1 and alpha 2 domains. This is clearly visible for the new allele AS7_UBA1_s1 which has an alpha 1 domain identical to the UBA*27:01 allele, but then the alpha 2 domain is similar to e.g. UBA*35:01. Another example is AS9_UBA1_s2 which shares an alpha 1 domain with for instance UBA*22:01 while the alpha 2 domain sequence is similar to for instance UBA*21:01. This combination of different alpha 1 and alpha 2 domain sequences is a very good argument for amplifying as much from both regions as possible to enable correct allele identification.

Requirements for including new allelic MHC sequences into the IPD-MHC fish database

We identified seven new MHC alleles in this study, i.e. one DAB sequence (AS5_DAB2_s2_9933) and six UBA sequences (AS2_UBA1_s1_46537, AS2_UBA1_s2_33647, AS5_UBA2_s1_22247, AS5_UBA2_s2_20661, AS7_UBA1_s1_25193, AS9_UBA1_s2_14112). These new sequences need to be verified using new PCR and Sanger sequencing prior to submission to the IPD-MHC Database for official name assignment. Preferably, most of the coding region should be amplified. At least for UBA, submitted sequences are required to include the three extracellular domains as well as the transmembrane domain, as the allele needs to be verified as UBA and not another U lineage sequence.

MHC diversity in the Vosso population

The selected material showed unexpected diversity. This may have been caused by genetic introgression from farmed escapees (Glover et al. 2012) and by straying, i.e. salmon failing to return to their native river. About 3–6% of wild salmon and 15% of hatchery-reared salmon may stray to other rivers during homeward spawning migration (Jonsson et al. 1991; Jonsson et al. 2003; Stabell 1984). Studies show that most of the straying salmon will enter nearby rivers (Jonsson et al. 2003). The question of which alleles belong to the original Vosso population and which originate from farmed fish could be resolved by genotyping scales sampled prior to the 1980s, i.e. before aquaculture appeared in these fjords. However, that would require a different strategy for Illumina typing than the one presented here.

Conclusion

We have established a library preparation and bioinformatics analysis pipeline using Illumina MiSeq v3 paired end sequencing for MHC cDNA. This pipeline enables studies into salmonid MHC diversity among different strains in local rivers as well as breeding populations, thus expanding our knowledge on salmonid MHC diversity. To accommodate this IPD-MHC sequence expansion, we updated the IPD-MHC fish nomenclature for both Atlantic salmon and for rainbow trout following the MHC Nomenclature Committee guidelines, allowing the unambiguous naming and comparison of genomic data. Furthermore, the obtained haplotype data will be included into the next release of the IPD-MHC Database (December 2019) to enrich the information available in the IPD-MHC fish section.

We established the pipeline on Atlantic salmon animals from an endangered river strain and found a surprisingly high number of different alleles, with seven newly identified alleles (one Sasa-DAB and six Sasa-UBA). Most likely, this diversity reflects interference from farmed Atlantic salmon in addition to potential straying from nearby river populations. To test allele changes over time, additional typing strategies need to be developed, enabling genotyping using genomic DNA from historically preserved fish scales.