Background

The discrimination of plant varieties and cultivars is one of the most important tasks in agricultural systems. Traditionally, a variety is identified by a set of phenotypic characteristics in official tests of distinctness, uniformity and stability (DUS). However, because these morphological descriptors are affected by environmental and climatic conditions, they cannot quantify differences between varieties precisely, and they are poorly suited to situations where results are required rapidly for large collections or breeding lines [1].

Molecular markers offer numerous advantages: they are stable and detectable in all tissues regardless of the growth, differentiation, development, or defense status of the cell, and they are not confounded by environmental, pleiotropic or epistatic effects [2]. Molecular markers have been widely used in genetic studies, marker-assisted selection, comparative mapping, and exploration of the functional genetic diversity in germplasm adapted to different environments. Widely used molecular markers include RAPD (random amplified polymorphic DNA), SSR (simple sequence repeat), SNP (single nucleotide polymorphism), InDel (insertion-deletion) and so forth. Most of these markers have been used for cultivar identification. RAPD markers allowed efficient identification of tomato, peach and Ribes cultivars [3,4,5]. EST-SSR markers were used to rapidly identify red-fleshed loquat cultivars [6]. SNP markers were used to genotype 260 accessions of pummelo [7].

However, the markers described above have several disadvantages. Because their alleles differ only slightly in size, SSR markers must be resolved by polyacrylamide gel electrophoresis or capillary sequencers, and SNP identification depends on sequencing or microarray analysis. In contrast with SSRs and SNPs, InDel markers based on insertion-deletion polymorphisms with moderate size differences are user-friendly, PCR-based with minimal equipment requirements, and co-dominant, and they offer more genomic information than SNPs [8,9,10,11]. They have been widely used in population genetics, taxon diagnosis, genetic map construction and association mapping in different crop plants, such as rice [8], tomato [12], soybean [13], chickpea [14], capsicum [15], citrus [16] and so forth. Insertion-deletion polymorphisms in 3′ regions were used as highly informative genetic markers for positioning the corresponding expressed genes [17]. InDel markers were also developed for species identification [18]. Using InDel markers specific to dense variation blocks, a barcode system was constructed for soybean identification [10]. However, insertion-deletion variants are often multi-allelic, which hampers genetic analysis: the segregation patterns of multiple alleles are more complex and are not appropriate for genome-wide analyses requiring large numbers of markers. Because the molecular weights of their alleles are uncertain, multi-allelic markers cannot be used in standardized operations. Multi-allele analysis is also prohibitively time-consuming computationally for most large, genome-wide data sets. Genotypes of bi-allelic markers, by contrast, can be called automatically by modern genotyping assays, which makes them suitable for massive data analysis [19]. In addition, because bi-allelic markers produce simple differences, they can easily be followed across laboratories for both genetic research and plant breeding without molecular size calibration.

Traditionally, it was difficult to identify and genotype bi-allelic InDels automatically because sequencing technologies were less efficient. Using an Affymetrix® Axiom® array, InDels were genotyped at high throughput in maize [20]. The development of next-generation sequencing (NGS) technology has paved the way for InDel identification. The massive amount of data and the short-read nature of NGS created a hurdle for effective mining of InDel variation, and software tools such as SAMtools [21], GATK [22] and Atlas2 [23] have been developed for variant discovery. A high-throughput and efficient pipeline was produced for genome-wide InDel marker development [11]. Based on whole-genome re-sequencing, InDel markers were identified in capsicum [24], soybean [13], quinoa [25], chickpea [14] and other species. At the DNA sequence level, maize has a higher level of diversity than humans, Drosophila and many wild plants [26, 27]. A total of 30,178 InDels were detected among elite maize inbred lines [28], facilitating the identification of InDel markers. Using next-generation sequencing data, genome-wide InDel markers were developed in maize [9].

In this study, bi-allelic InDel variations across the maize genome were screened using 205 re-sequenced genotypes, and 8188 bi-allelic loci were identified. A barcode system was constructed for genetic discrimination of maize inbred lines, consisting of 37 bi-allelic InDel markers with high PIC values and size differences larger than 20 bp, which are suitable for agarose gel electrophoresis. Using these markers, different maize hybrids and inbreds were clearly and efficiently discriminated, and the corresponding parents of the hybrids were accurately determined.

Methods

Plant materials

To select suitable InDel markers for the barcode system, a total of 241 maize inbred lines (Additional file 1: Table S1) were used to test InDel primers, of which 227 lines were also analyzed by microarray. A total of 177 intermated recombinant inbred lines (RILs) derived from B73 and Mo17 (the IBM population), together with the parental lines, were also employed to assess primers. Thirty-five hybrid lines derived from 25 inbreds were used to evaluate the barcode system for pedigree analysis.

These materials were grown in the field at Nanjing, China in 2018, with 20 plants per row. All plants were sampled at the V4 stage, and three biological replicates per sample were harvested and pooled for DNA extraction.

Genome sequence data and InDel marker development

The next-generation sequencing data of 205 maize inbred lines were downloaded from NCBI (GenBank accession numbers PRJNA82843, SRP011907 and PRJNA260788) [29,30,31]. After low-quality nucleotides were removed with SolexaQA, retaining bases with a Phred score greater than 20 [32], the sequences of these materials were compared, and variants with a missing rate below 10% and a minor allele frequency (MAF) greater than 0.05 were selected. In addition, a linkage disequilibrium (LD) threshold (r²) of 0.20 was applied, with a window size of 100 and a shift of 2 InDels per step [33]. LD was measured from the re-sequencing data with PLINK [33], and the correlation coefficient (r²) between alleles was calculated with PopLDdecay [34].
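
The missing-rate and MAF filters described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name `passes_filters`, the representation of a locus as one allele label per inbred line (with `None` for a missing call), and the treatment of each line as homozygous are assumptions made for illustration.

```python
from collections import Counter

def passes_filters(calls, max_missing=0.10, min_maf=0.05):
    """Return True if a locus passes the missing-rate and MAF filters.

    calls: one allele label per inbred line, None for a missing call
    (hypothetical representation; each line is treated as homozygous).
    """
    if not calls:
        return False
    observed = [c for c in calls if c is not None]
    missing_rate = 1 - len(observed) / len(calls)
    if missing_rate >= max_missing:
        return False
    counts = sorted(Counter(observed).values(), reverse=True)
    if len(counts) < 2:
        return False  # monomorphic locus: there is no minor allele
    maf = counts[1] / len(observed)  # frequency of the second most common allele
    return maf > min_maf

# Toy loci for illustration only, not data from the study
print(passes_filters(["ins"] * 18 + ["del"] * 2))               # MAF 0.10, no missing calls
print(passes_filters(["ins"] * 39 + ["del"]))                   # MAF 0.025
print(passes_filters(["ins"] * 8 + ["del"] * 8 + [None] * 4))   # 20% missing
```

Only the first toy locus satisfies both thresholds; the second fails on MAF and the third on missing rate.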

The software mInDel was used for InDel marker development. InDel polymorphisms were identified from assembled contigs using a sliding-window alignment with a 300 bp window and a 150 bp step, and dimorphic markers with large polymorphisms were preferred [11]. The InDel loci were annotated with SnpEff (version 4.3a) [35] based on the maize B73 genome (version 4.32). InDels with polymorphism information content (PIC) greater than 0.4 were selected and analyzed with deep sequencing data (30×) of six maize lines (GenBank accession number SRA010130) [28], and the 200 InDels with the highest PIC were selected for further analysis.
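
The text does not state which PIC formula was used, so the sketch below shows two common definitions side by side. For a two-allele locus, the full Botstein form cannot exceed 0.375, whereas the simplified form 1 − Σp² (gene diversity) can reach 0.5. Because the selection threshold here is 0.4, the simplified form is assumed to be the one intended. The function names are hypothetical.

```python
def pic_simplified(freqs):
    """Simplified PIC (gene diversity): 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def pic_botstein(freqs):
    """Full Botstein form: the simplified value minus sum over i<j of 2 * p_i^2 * p_j^2."""
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return pic_simplified(freqs) - cross

# A bi-allelic locus with equal allele frequencies gives the maximum of each form
print(pic_simplified([0.5, 0.5]))  # 0.5
print(pic_botstein([0.5, 0.5]))    # 0.375
```

Under the assumed simplified form, a bi-allelic locus passes the 0.4 threshold only when its minor allele frequency is roughly 0.28 or higher.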

PCR amplification and gel electrophoresis

Genomic DNA was extracted from young leaves following the protocol of a Qiagen plant DNA extraction kit. PCR was performed in 10 µL reaction mixtures containing 20 ng of genomic DNA, 2 pM of primer, and 5 µL of 2 × Taq Master Mix (Vazyme Biotech Co., Ltd, China). The cycling conditions were 95 °C for 3 min, followed by 35 cycles of 94 °C for 30 s, 55 °C for 30 s and 72 °C for 30 s, and a final step at 72 °C for 5 min. The PCR products were separated by electrophoresis in 2% agarose gel followed by ethidium bromide staining. The material costs are listed in Additional file 2: Table S2; the total cost of PCR was 2.64 cents per sample. The whole process usually took about 4 h for one sample.

Barcoding process and identification of maize cultivars

Based on the selected InDels, the genetic distances between the maize lines were calculated as a distance matrix and clustered using the UPGMA algorithm [36] in TASSEL 5.0 [37]. The phylogenetic tree was constructed with MEGA 6.0 [38].

Genetic similarity analysis

A total of 227 maize inbred lines were genotyped using the MaizeSNP50 (50 K) BeadChip on the Illumina platform as described by the manufacturer (Illumina, Inc., San Diego, CA). Genetic distances were then calculated from the microarray data in TASSEL 5.0 [37]. Using the InDel markers, the number of polymorphic loci between every pair of lines was counted and, together with the genetic distances, plotted as a boxplot using PASW Statistics 18 (IBM SPSS).

Results

Insertion-deletion identification

To identify insertion-deletion variations in the maize genome, the next-generation sequencing data of 205 maize inbred lines were used. Based on genomic sequence comparison, 9,622,805 InDel variants were detected throughout the whole genome. Linkage disequilibrium throughout the maize genome was measured, and 11,741 LD blocks were obtained using PLINK. After variants with missing rate > 10%, MAF < 0.05, or PIC < 0.4 were removed, 25,412 insertion-deletion variants located in the LD blocks were identified. Only one insertion-deletion polymorphism was kept for each block, and 11,741 InDels were used for further analysis.

To facilitate screening by gel electrophoresis, only bi-allelic InDels larger than 20 bp in length were selected, leaving 8188 InDels for further analysis. These InDels were distributed across the whole maize genome. The largest number of InDels (1252) was identified on chromosome 1, while the fewest (548) were detected on chromosome 10. Annotation of these variants showed 11 genomic locations, including UTR_3_prime, UTR_5_prime, downstream, intergenic, intron, exon, upstream and so forth (Table 1). The largest group, 2622 InDels (32.02%), was located in intergenic regions, and the second largest, 2207 InDels (26.95%), in upstream regions. Two other locations, downstream and intron, each contained more than 1000 InDels (Table 1).

Table 1 Annotation of 8188 bi-allelic insertion-deletion polymorphism loci detected across the maize genome

Validation of InDel markers for cultivar discrimination

Primers for 200 InDels were designed and tested using 177 lines of the IBM population. Based on electrophoresis in 2% agarose gel, primer sets that gave clear bands and distinct differences among the 177 lines were subjected to segregation analysis in the IBM population. Thirty-seven primer sets were selected for the barcode system (Table 2). They are distributed on all ten chromosomes of maize, with at least two markers per chromosome (Fig. 1). To evaluate the discriminating ability of the InDel markers, 26 inbred lines were genotyped with them. According to the electrophoresis results of the 37 markers, all the maize lines were discriminated by at least one locus (Fig. 2a), indicating that the markers are suitable for maize cultivar discrimination. Because only bi-allelic InDel loci were selected, each primer pair produces one of two types of amplicon, insertion or deletion, relative to the reference genome. Based on the amplification results, the allele identical to that of B73 was denoted “A” and depicted as a white bar, and the alternative allele was denoted “B” and depicted as a black bar (Fig. 2a). A phylogenetic tree drawn from the genotypes showed that these maize lines separated into two major groups (Fig. 2b), corresponding to the Reid and Iodent heterotic groups, respectively.
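
The A/B coding scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' software: the function names, the allele labels `"ins"` and `"del"`, the `"-"` code for a missing call, and the text symbols standing in for white and black bars are all hypothetical.

```python
def encode_barcode(sample_alleles, b73_alleles):
    """Code each locus as 'A' (same allele as B73), 'B' (alternative) or '-' (missing)."""
    codes = []
    for allele, reference in zip(sample_alleles, b73_alleles):
        if allele is None:
            codes.append("-")
        elif allele == reference:
            codes.append("A")
        else:
            codes.append("B")
    return "".join(codes)

def render_barcode(barcode):
    """Draw 'A' as a white square, 'B' as a black square, missing as '?'."""
    symbols = {"A": "\u25a1", "B": "\u25a0", "-": "?"}
    return "".join(symbols[code] for code in barcode)

# Toy example with four loci (not data from the study)
b73 = ["ins", "ins", "del", "del"]
line = ["ins", "del", "ins", None]
barcode = encode_barcode(line, b73)
print(barcode)
print(render_barcode(barcode))
```

Storing each inbred line as a fixed-length string of this kind makes later comparisons between lines a simple position-by-position check.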

Table 2 The 37 InDel markers selected for the barcode system
Fig. 1

Chromosomal locations of the 37 InDel markers developed for the barcode system

Fig. 2

Barcode system based on 37 InDel markers, evaluated in 26 inbred maize lines. a Barcode representation of the polymorphisms revealed by the InDel markers among 26 maize lines; the allele identical to that of B73 is denoted “A” and depicted as a white bar, and the alternative allele is denoted “B” and depicted as a black bar. b Phylogenetic tree constructed with 37 InDels in 26 maize lines

Evaluation of the barcode system for pedigree tracing

The maize barcode system was evaluated with 35 hybrids derived from 25 inbreds. In theory, 25 inbred lines can produce 300 hybrids when reciprocal crosses are not distinguished. From the genotypes of the parents at the 37 InDel markers, the genotypes of all 300 possible hybrids were predicted. Thirty-five randomly selected hybrid lines were genotyped with the InDel markers, and their barcodes were compared with the 300 predicted genotypes. Among the 37 loci investigated, the loci at which the experimental result equaled the predicted genotype were counted, and the parental combination with the largest number of matched loci was taken as the most likely correct prediction. Table 3 shows the two highest numbers of matched loci for each hybrid. In the top-one column, the number of matched loci ranged from 27 to 37, in every case higher than that in the second column (Table 3). Compared with the known parental combinations of the hybrids, all the top-one predictions were correct, confirming that the barcode system is suitable for pedigree tracing.
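
The matching procedure can be sketched as follows. This is a simplified illustration, not the analysis code used in the study: it assumes that every inbred parent is homozygous at every locus, that a hybrid is heterozygous (coded `"H"`, a hypothetical symbol) wherever its parents differ, and that there are no missing calls. The function and line names are hypothetical.

```python
from itertools import combinations

def predict_hybrid(parent1, parent2):
    """Predict a hybrid barcode: the shared allele where parents agree, 'H' where they differ."""
    return "".join(a if a == b else "H" for a, b in zip(parent1, parent2))

def rank_parent_pairs(observed, inbreds):
    """Rank all unordered parent pairs by the number of loci matching the observed hybrid."""
    ranked = []
    for (name1, code1), (name2, code2) in combinations(sorted(inbreds.items()), 2):
        predicted = predict_hybrid(code1, code2)
        matched = sum(o == p for o, p in zip(observed, predicted))
        ranked.append((matched, name1, name2))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked

# Toy example with three inbreds and four loci (not data from the study)
toy_inbreds = {"P1": "AABB", "P2": "ABAB", "P3": "BBBB"}
ranking = rank_parent_pairs("AHHB", toy_inbreds)
print(ranking[0])  # best-matching parent pair
print(ranking[1])  # runner-up, analogous to the second column of Table 3
```

With 25 inbreds, `combinations` yields 25 × 24 / 2 = 300 candidate pairs, which matches the number of predicted hybrids in the text. The gap between the best and second-best match counts gives a rough sense of how confident the assignment is.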

Table 3 Pedigree analysis with experimental and predicted results of 35 hybrids based on the 37 InDel barcode system

Database construction with the barcode system in a maize population

A population of 227 lines was used for database construction with the barcode InDel markers (Fig. 3a, Additional file 3: Table S3). A total of 8399 genotype data points were recorded in the database with only 75 missing, corresponding to a data integrity of 99.1%. Among the 227 inbred lines, 56 heterozygous loci (0.66%) were detected, implying that most of these materials were highly homozygous. In the population, the PIC of the 37 InDels ranged from 0.2910 to 0.4998; only two values were below 0.35, and 30 were above 0.40.
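
The database totals follow from simple counting: 227 lines × 37 markers gives 8399 genotype calls, and (8399 − 75) / 8399 ≈ 99.1% of them are non-missing. The sketch below shows how such summary figures could be derived from a table of barcodes. The function name, the `"-"` code for missing calls and the `"H"` code for heterozygous calls are assumptions for illustration.

```python
def summarize_database(barcodes):
    """Count total, missing and heterozygous calls in a dict of name -> barcode string."""
    total = sum(len(code) for code in barcodes.values())
    missing = sum(code.count("-") for code in barcodes.values())
    heterozygous = sum(code.count("H") for code in barcodes.values())
    return {
        "total": total,
        "missing": missing,
        "completeness": (total - missing) / total,
        "heterozygous_fraction": heterozygous / total,
    }

# Toy database with two lines and four loci (not data from the study)
toy_db = {"L1": "AAB-", "L2": "ABHB"}
summary = summarize_database(toy_db)
print(summary)

# Arithmetic check of the figures quoted in the text
print(227 * 37)                             # 8399
print(round(100 * (8399 - 75) / 8399, 1))   # 99.1
```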

Fig. 3

Database built with the barcode system in a population of 227 maize lines. a Barcodes of the 227 maize lines; the allele identical to that of B73 is denoted “A” and depicted as a white bar, and the alternative allele is denoted “B” and depicted as a black bar. b Phylogenetic tree drawn with 37 InDels in 227 maize lines

Based on the barcode, more than 99.98% of the material pairs were discriminated with at least two InDel markers. The number of polymorphic loci detected by the InDel markers between each pair of cultivars ranged from zero to 34, with an average of 17 (Additional file 4: Table S4). Among all 25,651 pairs of maize lines, five pairs showed no difference at the 37 InDel markers and were placed at the same location on the phylogenetic tree (Fig. 3b, Additional file 4: Table S4). The population was also genotyped by microarray analysis, covering 55,187 loci. On the phylogenetic tree drawn from the microarray data, the five pairs of maize lines were also located at the same places (Additional file 5: Fig. S1), with 1027 (A17-2 vs Si287), 164 (C05 vs F68), 525 (PH207 vs Q381), 1014 (ZY9 vs Chang7-2), and 1386 (Feng16 vs Shen137) polymorphic SNPs, respectively, showing the close relationship within each pair. Based on the microarray data, the average number of different loci between the five pairs of lines was 823, accounting for 1.57% of the total loci. With the barcode system, all the other materials differed at one or more loci, and the minimum average number of different loci was 3056 according to the microarray data, significantly different (p = 0.013) from that of the five pairs of lines with no difference in the barcode system. This result suggests that the discrimination threshold of our barcode system is 1.57% of different loci.
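
The pairwise comparison can be sketched as below. With 227 lines there are 227 × 226 / 2 = 25,651 unordered pairs, which agrees with the count given in the text. The code is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and loci with a missing call (coded `"-"`) in either line are assumed to be skipped.

```python
from itertools import combinations

def count_differences(code1, code2, missing="-"):
    """Count loci at which two barcodes differ, ignoring loci missing in either line."""
    return sum(
        1
        for a, b in zip(code1, code2)
        if a != b and missing not in (a, b)
    )

def undiscriminated_pairs(barcodes):
    """Return the pairs of lines whose barcodes show no difference at any scored locus."""
    return [
        (name1, name2)
        for (name1, code1), (name2, code2) in combinations(sorted(barcodes.items()), 2)
        if count_differences(code1, code2) == 0
    ]

# Toy database with three lines and four loci (not data from the study)
toy_barcodes = {"L1": "AABB", "L2": "AABB", "L3": "ABBA"}
print(undiscriminated_pairs(toy_barcodes))
print(count_differences("AABB", "ABBA"))
print(227 * 226 // 2)  # number of unordered pairs among 227 lines
```

Pairs returned by `undiscriminated_pairs` are the ones that would need a denser marker set, such as the SNP microarray used here, to be told apart.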

Using the microarray data, genetic distances between every two lines were calculated. The lowest genetic distances based on the SNP data were found between the five pairs of maize lines that showed no difference at the InDel markers. The genetic distance increased rapidly with the number of polymorphic InDel loci and became steady once that number reached ten (Fig. 4), implying that ten InDel markers can reveal most of the genetic variance in the population, with an average genetic distance of 0.33.

Fig. 4

The genetic distances according to the number of loci different between each two maize lines. The genetic distances were calculated by using the microarray data of each two maize lines

Discussion

Maize is one of the most important cereal crops in the world, with the highest yield. More and more cultivars are being produced for the market, so the identification of maize varieties and cultivars has become more important than ever for ensuring seed quality and food safety [39]. Alongside phenotypic identification such as DUS testing, molecular markers have been widely used for cultivar identification because of their numerous advantages. In this study, we produced a barcode system for maize cultivar identification with bi-allelic InDel markers based on next-generation sequencing data, which offers several advantages such as high discrimination ability, standardization, low cost, and easy and quick operation.

Molecular markers such as SNP markers [40] and microsatellite markers [41] have been used for accurate and precise discrimination of cultivars. In China, an SSR-based standard fingerprint database was constructed for corn variety authorization [42]. Insertion-deletions are structural variations distributed throughout the genome that sometimes lead to the gain or loss of function in the organism [43]. In contrast with SSR and SNP markers, InDel markers have the merit that polymorphisms are easily detected by PCR and direct gel electrophoresis. In rice, InDel markers were developed to discriminate genome types rapidly [44].

Progress in sequencing technologies has paved the way for understanding plant genomes, and more and more lines have been sequenced. These massive data have helped researchers to characterize genomes genetically and to screen InDel loci. By aligning the B73 and Mo17 genomes, 1,422,446 small insertions/deletions (shorter than 100 bp) were identified in maize [45]. Based on next-generation sequencing data, genome-wide InDel markers have been discovered in chickpea [14], quinoa [25], soybean [13], Brassica [46], capsicum [15] and so forth. In maize, genome-wide insertion and deletion markers were also developed [9]. In this study, 9,622,805 InDel variants were detected throughout the whole genome from the next-generation sequencing data of 205 maize inbred lines.

From the abundant insertion-deletion variants, markers for the barcode system were selected according to several criteria, including convenience of detection and analysis and high discrimination ability. To separate amplicons adequately on agarose gels, InDels longer than 20 bp were selected. Research increasingly has to deal with massive data, and multi-allelic loci hamper automatic computational analysis. Bi-allelic loci overcome this problem, as genotypes of bi-allelic markers are more suitable for automatic analysis [19]. With pan-genome sequence data, InDels at bi-allelic loci were developed in this study. PIC (polymorphism information content) was another factor used for marker selection. Based on the next-generation sequencing data, the InDel markers for the barcode system in this study are dimorphic polymorphisms with high PIC that can be resolved adequately by electrophoresis and are user-friendly in a standardized operation. Another strength of our barcode system is its low cost, less than 1 dollar per sample for analysis with all barcode markers, which makes it suitable for large-scale screening in plant breeding. Typically, the whole process took about 4 h for one sample, while much less time per sample was needed in batch operation; for example, a 2 h electrophoresis run could process 200 samples at one time.

Authentication of plant species is important in a variety of areas, such as the trade of illegal and endangered species and food authentication [47]. DNA barcoding is a technique for characterizing species of organisms using a short DNA sequence from a standard and agreed-upon position in the genome that has sufficient sequence variation to discriminate among species [48]. The main purpose of a barcode system is to discriminate different cultivars efficiently. Several leading candidate barcodes are usually used for plant DNA barcoding, including rbcL, rpoB, rpoC1, matK, atpF-atpH, psbK-psbI and trnH-psbA [49]. However, because of differences in their efficiency, it was concluded that no single-locus plant barcode exists [50]. In this study, a barcode system of InDel markers was constructed in maize based on a genome-wide screen. With high-throughput sequence data, barcode candidates at the conserved regions among different lines were detected from the genomic variation of 205 maize lines. Their ability to identify cultivars was measured experimentally. Both inbred lines and hybrids were used to test these InDel markers. In 26 inbred lines, the 37 InDel markers discriminated all lines by at least one locus (Fig. 2). Given this discrimination ability, the barcode system was evaluated for its use in pedigree tracing. Thirty-five hybrids derived from 25 inbreds were tested with the barcode system, and for each hybrid 27 to 37 markers matched the correctly predicted pedigree, confirming the system's ability for identification.

To test the accuracy of discrimination, a population of 227 lines was genotyped with both the barcode and a DNA microarray. Among all 227 lines, only five pairs of lines showed no difference at the 37 InDel markers. Compared against genetic distances from microarray data covering more than 50 k loci, the InDels reveal genetic variance effectively, with a threshold of 1.57% of different loci. Although using all 37 markers for identification was more efficient, the genetic distance remained steady once the number of differing InDels exceeded ten (Fig. 4).

However, the barcode system still has several limitations. Materials that differ by less than the threshold, for instance sib-lines, backcross-improved lines and mutants, cannot be discriminated by the system. Moreover, these markers were selected according to several criteria and, because of their low number, are not suitable for genomic prediction. We are working to develop efficient and low-cost markers for genomic prediction.

Conclusions

This study constructed a barcode system for genetic discrimination of maize lines using re-sequencing data of 205 lines. The system comprises 37 user-friendly bi-allelic InDel markers with high PIC values. The barcode system was evaluated experimentally: different maize hybrids and inbreds were clearly and efficiently discriminated with these markers, and the corresponding parental lines of the hybrids were accurately determined. The barcode system supports standardized, easy and quick operation at very low cost and with minimal equipment requirements.