The Japanese anchovy Engraulis japonicus is a small pelagic schooling fish and plankton feeder distributed in the northwestern Pacific Ocean. This species is of high economic importance and is heavily fished in China, Japan, Korea and Taiwan [1, 2], where E. japonicus larvae are consumed as a food resource known as shirasu (whitebait in Japanese). Market demand for this resource is increasing and fishing effort is very high [3]. Because of the risks posed by unsustainable fishing, some governments have taken measures to reduce larval exploitation; for example, the Taiwan Government announced a seasonal closure of the harvest in 1999 [4]. Studying the population genetic structure of this species provides essential information for sustainable fishery management [5] and could therefore help to avoid its overexploitation.

While the closely related European anchovy Engraulis encrasicolus has been the target of many population genetics studies [6,7,8,9,10,11,12,13,14,15,16,17,18,19], the genetic resources available for E. japonicus consist only of a mitochondrial genome sequence of 16,675 bp [20, 21] and 20 nuclear microsatellite markers [22, 23], which has limited the number of genetic studies on this species [4, 24,25,26,27]. Developing additional genetic markers for E. japonicus will help to detail its population genetic structure.

Cross-species amplification uses primers and probes developed for one species to sequence and discover genetic markers in a closely related species [28]. This approach is a good choice for developing genetic markers in species that lack a reference genome but have closely related species with abundant genomic resources. This is the case for E. japonicus, since hundreds of genetic markers are available for the closely related E. encrasicolus, including mitochondrial DNA (mtDNA) sequences [29], microsatellites [30], and nuclear (nSNPs) and mitochondrial (mtSNPs) single nucleotide polymorphisms discovered by cloning, PCR and subsequent comparative sequencing [31], and by transcriptome sequencing [32]. Any of these classes of markers could be cross-amplified in E. japonicus to develop new genetic markers for this species. However, the increasing throughput of sequencing and SNP genotyping [28, 33, 34] has confined microsatellites to specific applications, making SNPs the workhorse of molecular ecology and one of the most promising classes of genetic markers.

The aim of this study was to develop SNP markers for E. japonicus through cross-species amplification of hundreds of previously discovered SNPs in the nuclear and mitochondrial genome of E. encrasicolus. The developed SNP panel could be used to investigate population genetic structure and processes of local adaptation in this species.

Materials and methods

A total of 482 SNPs previously developed for E. encrasicolus [31, 32], comprising 467 nSNPs and 15 mtSNPs, were selected for designing primers and probes. SNPs were genotyped with the TaqMan OpenArray Genotyping System (Life Technologies, CA) (see below) at the Sequencing and Genotyping Service (SGIker) of the University of the Basque Country (UPV/EHU). Information on all the primers/probes is included in Table S1.

The mtSNPs and nSNPs were genotyped for 35 and 22 E. japonicus individuals, respectively. Additionally, all the SNPs were genotyped for 22 E. encrasicolus individuals as control samples. E. japonicus individuals were obtained from a bag of dried anchovy snack purchased at a Japanese supermarket by collaborators. As indicated in the product label, individuals comprising the snack were fished in the Hiuchi-nada area of the Seto Inland Sea, Japan. The 22 E. encrasicolus individuals were taken from one haul fished in 2012 in the Bay of Biscay (see sample 14 in [35]) during scientific surveys. These specimens were stored in pure ethanol until DNA extraction.

A piece of muscle tissue (from dried E. japonicus and ethanol-fixed E. encrasicolus specimens) was used for genomic DNA extraction by a phenol/chloroform method [36]. DNA quantity and purity were measured using a Nanodrop ND-8000 spectrophotometer and a Qubit fluorometer (Thermo Fisher Scientific, MA). Reactions for SNP amplification/detection were performed at a constant DNA concentration (25 ng/μl) following the instructions of the TaqMan OpenArray Genotyping System. Individual genotypes were scored using TaqMan Genotyper software version 2.1 (Life Technologies). After default clustering, genotype calls in the scatter plot were reviewed and manually adjusted to produce the final cluster assignments.

Based on genotyping calls, SNPs were classified into five categories previously defined by Montes et al. [32]: no signal (no amplification), disperse (< 80% of individuals assigned to a cluster), paralogous sequence variants or multi-site variants (PSV/MSV) (≥ 99% of heterozygous individuals), monomorphic [minor allele frequency (MAF) < 0.01], and polymorphic (MAF ≥ 0.01) (Fig. 1). Conversion and validation rates were then calculated. SNPs with a reliable genotyping signal for more than 80% of genotyped individuals were defined as converted SNPs; the conversion rate was therefore estimated as the ratio of the number of monomorphic, PSV/MSV and polymorphic SNPs (Fig. 1c–e) to the number of genotyped SNPs. Polymorphic SNPs were considered validated (transferred or cross-amplified) SNPs; the validation rate was therefore estimated as the ratio of the number of polymorphic SNPs to the number of genotyped SNPs. Subsequent analyses were conducted on polymorphic nSNPs. Expected heterozygosity (He) was estimated using GeneClass2 [37]. The Hardy–Weinberg equilibrium probability test for each nSNP and tests of genotypic linkage disequilibrium between all nSNP pairs were run in Genepop version 4.0 [38] with 10,000 dememorization steps, 1000 batches and 5000 iterations per batch. The false discovery rate method [39] was used to correct p-values for multiple comparisons, using the spreadsheet available as supplementary material in Pike [40].
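The classification rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the study, where clusters were called in TaqMan Genotyper and reviewed manually. The function name, the "AA"/"AB"/"BB" genotype coding, the use of None for an individual not assigned to a cluster, and the treatment of a SNP with no called individuals as "no signal" are all assumptions made for the example; only the 80%, 99% and 0.01 thresholds come from the text.

```python
from collections import Counter


def classify_snp(calls, min_call_rate=0.80, psv_het_fraction=0.99, maf_threshold=0.01):
    """Assign one SNP to one of the five categories described in the text.

    calls: one entry per genotyped individual, either "AA", "AB", "BB",
    or None when the individual could not be assigned to a cluster
    (hypothetical coding chosen for this sketch).
    """
    called = [c for c in calls if c is not None]
    if not called:
        # Assumption: no individual called at all is treated as "no signal".
        return "no signal"
    if len(called) / len(calls) < min_call_rate:
        # Fewer than 80% of individuals assigned to a cluster.
        return "disperse"
    counts = Counter(called)
    if counts["AB"] / len(called) >= psv_het_fraction:
        # (Almost) every called individual is heterozygous.
        return "PSV/MSV"
    # Frequency of allele A among called individuals, then the minor allele frequency.
    freq_a = (2 * counts["AA"] + counts["AB"]) / (2 * len(called))
    maf = min(freq_a, 1 - freq_a)
    return "polymorphic" if maf >= maf_threshold else "monomorphic"


if __name__ == "__main__":
    # 22 individuals, two heterozygotes: MAF = 2/44, above the 0.01 threshold.
    print(classify_snp(["AA"] * 20 + ["AB"] * 2))
```

Note that with 22 diploid individuals the smallest non-zero MAF is 1/44 (about 0.023), so under these rules any SNP with at least one copy of the minor allele is classified as polymorphic.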

Fig. 1

Genotyping cluster examples for each of the five single nucleotide polymorphism (SNP) categories considered: a no signal, b disperse, c paralogous sequence variants or multi-site variants (PSV/MSV), d monomorphic, e polymorphic. Each circle corresponds to a genotyped individual. Filled circles correspond to Engraulis japonicus individuals, and open circles represent Engraulis encrasicolus reference individuals. Light-blue squares are negative controls (water), yellow circles represent non-amplified samples, black circles are undetermined individuals for which genotyping was not possible due to poor amplification, red circles are samples homozygous for allele 1, dark-blue circles are samples homozygous for allele 2, and the green circle cluster represents heterozygous samples.

Taking into account the commercial origin of the E. japonicus samples, a test to confirm species assignment (no mislabeling) was carried out. For this purpose, a total of 157 sequences of the mitochondrial cytochrome b gene (cytB) (Table S2) from five Engraulis species, including E. encrasicolus and E. japonicus, were downloaded from the National Center for Biotechnology Information (NCBI) nucleotide database. Sequences were aligned with the ClustalW application [41] bundled in BioEdit software [42]. Then, the genotyped mtSNPs in cytB were assessed for species identification.


According to nucleotide variations in cytB, the five Engraulis species had different haplotypes from each other (Table 1).

Table 1 Nucleotide bases at each single nucleotide polymorphism (SNP) site in the mitochondrial cytochrome b gene (cytB) for five anchovy species

Results showed high conversion and validation rates for E. japonicus. A conversion of 93.6% was obtained with 451 converted SNPs. Of the non-converted SNPs, 13 gave no signal and 18 were disperse (Table 2; Table S1). A similar conversion rate was obtained for mtSNPs (93.3%) and nSNPs (93.6%).

Table 2 Summary of statistics of SNP genotyping

With respect to the validation rate, the 451 converted SNPs included three false positives (PSV/MSV), 272 monomorphic SNPs, and 176 polymorphic SNPs, yielding a validation rate of 36.5% (Table 2; Table S1). The validation rate was higher for nSNPs (37.0%) than for mtSNPs (20.0%). No linkage disequilibrium was found among the 173 polymorphic nSNPs, indicating that they were all independent. Mean He was 0.102, and no nSNP deviated significantly from Hardy–Weinberg proportions (Table S1).
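The reported rates follow directly from the category counts given in the text, as the short check below illustrates. The function and variable names are chosen for this example only. The figure of three polymorphic mtSNPs is not stated explicitly in the text; it is inferred here as the difference between the 176 polymorphic SNPs and the 173 polymorphic nSNPs.

```python
def rate(numerator, denominator):
    """Return a ratio expressed as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts taken from the text.
genotyped = 482                       # 467 nSNPs + 15 mtSNPs
no_signal, disperse = 13, 18          # non-converted SNPs
psv_msv, monomorphic, polymorphic = 3, 272, 176

# Converted SNPs are those in the PSV/MSV, monomorphic and polymorphic categories.
converted = psv_msv + monomorphic + polymorphic
assert converted + no_signal + disperse == genotyped

conversion_rate = rate(converted, genotyped)      # converted / genotyped
validation_rate = rate(polymorphic, genotyped)    # polymorphic / genotyped

# Validation rate by marker class; 3 polymorphic mtSNPs is inferred as 176 - 173.
nsnp_validation = rate(173, 467)
mtsnp_validation = rate(polymorphic - 173, 15)

if __name__ == "__main__":
    print(f"conversion: {conversion_rate}%  validation: {validation_rate}%")
    print(f"nSNP validation: {nsnp_validation}%  mtSNP validation: {mtsnp_validation}%")
```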

Regarding species identification, combinations of alleles at five mtSNPs differentiated the Engraulis species from one another, and a combination of only two mtSNPs was specific to E. japonicus (Table 1). Specifically, the combination of SNPs ss1583998210 and ss1583998454 formed a GG haplotype in E. japonicus that did not appear in any of the other four Engraulis species (Table 1). Therefore, the two diagnostic mtSNPs suggested that the genotyped individuals were genuine E. japonicus.
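The search for a minimal diagnostic SNP combination can be sketched as follows. This is a hypothetical illustration rather than the procedure used in the study: it assumes a single haplotype per species, and the species labels, SNP names and bases in the toy data are invented for the example and do not reproduce Table 1.

```python
from itertools import combinations


def diagnostic_combinations(haplotypes, target, max_size=2):
    """Return the smallest SNP combinations whose alleles are unique to `target`.

    haplotypes: dict mapping species name -> dict mapping SNP id -> base.
    Assumes one haplotype per species and the same SNP ids for every species.
    """
    snps = sorted(haplotypes[target])
    others = [h for species, h in haplotypes.items() if species != target]
    for size in range(1, max_size + 1):
        hits = []
        for combo in combinations(snps, size):
            target_alleles = tuple(haplotypes[target][s] for s in combo)
            # The combination is diagnostic if no other species shares it.
            if all(tuple(h[s] for s in combo) != target_alleles for h in others):
                hits.append(combo)
        if hits:
            return hits  # stop at the smallest size that works
    return []


# Toy data (hypothetical, NOT the values in Table 1).
toy = {
    "target_species": {"snp1": "G", "snp2": "G", "snp3": "A"},
    "species_b": {"snp1": "G", "snp2": "A", "snp3": "A"},
    "species_c": {"snp1": "A", "snp2": "G", "snp3": "A"},
}

if __name__ == "__main__":
    print(diagnostic_combinations(toy, "target_species"))
```

In the toy data no single SNP separates the target from both other species, but the pair snp1 and snp2 does, which mirrors the situation described above, where two mtSNPs together are specific to E. japonicus.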


Conversion rate is generally expected to be low in cross-species amplification studies because of inter-specific mutations in the genomic regions flanking the target SNP, where the primers/probes for SNP genotyping are designed [43]. This drawback is especially marked in fish species, which are typically characterized by high mutation rates [44]. Consequently, only a few examples of cross-species SNP amplification are available in fish species (see Table 3). The present study showed one of the highest conversion rates (Table 3), likely because of two factors. First, the amplified SNPs were transcriptome-derived SNPs that are embedded in coding regions. Their flanking sequences are therefore highly conserved between species and suitable for designing primers/probes that successfully cross-amplify in closely related species [45]. Most previous studies of non-model species (Table 3) discovered SNPs through cross-species amplification based on the exon-primed intron-crossing polymerase chain reaction (EPIC-PCR) method [46,47,48], in which primers are designed in exon regions to amplify the intervening intron [49]. However, EPIC-PCR has been superseded by transcriptome sequencing through high-throughput sequencing, the strategy chosen in the present study, which allows the discovery of thousands of potential SNPs that can be used for cross-species amplification [50, 51]. Second, the smaller the genetic distance between two taxa, the fewer the mutations in the genomic regions used for designing primers/probes. The two anchovy species analyzed in the present study are phylogenetically close [52, 53], whereas the reference and target species in most previous cross-amplification studies were distantly related.

Table 3 Summary of SNP discovery through cross-species amplification in fish species

The validation rate obtained in the present study was also high (36.5%), and may even have been underestimated because of the limited number of individuals. The small sample size (n = 22 for nSNPs), together with the single geographical origin of the samples, is likely to have biased the proportion of polymorphic SNPs in E. japonicus downward. Thus, the validation rate of this set of markers should increase when more individuals from multiple locations are analyzed [54]. Nevertheless, the validation rate obtained in the present study was high compared with those of previous studies (Table 3). Albaina et al. [47] conducted a cross-species amplification of SNPs from albacore to Atlantic bluefin tuna and obtained a validation rate of 18%; the reciprocal cross resulted in a validation rate of 24.2% [51]. Cross-amplification success between Atlantic and Pacific herring, Clupea harengus and Clupea pallasii respectively, was even lower (11.9%) [50]. The high success obtained in the present study is likely due to the high level of polymorphism in anchovy species and to the design of an appropriate SNP set, with SNPs from coding regions. Overall, the present study highlights the importance of exploiting the exponentially increasing amount of gene sequence data in public databases, which provide a very useful basis for identifying new polymorphic markers in non-model organisms.

In conclusion, a set of 176 polymorphic SNPs was developed for E. japonicus, which represents an important and valuable genomic resource for this species. This set can be useful for further research on population genomics, which can provide valuable information on ecological factors and for conservation. In addition, two diagnostic SNPs were developed to identify E. japonicus, which could serve as molecular tools for food traceability.