The first molecular markers for species delimitation and taxonomy were isozymes and allozymes. Isozymes are different molecular forms of an enzyme that are encoded by different loci. In contrast, allozymes are different molecular forms of an enzyme that are produced by different alleles at the same locus (Duminil and Di Michele 2009). The term locus refers to the specific position of a gene on a chromosome, while the term gene refers to a DNA section that contains the information to produce an RNA molecule. The principal approach when using allozymes or isozymes is to identify the variation of an enzyme among individuals using electrophoresis. Nowadays, however, speciation studies rely almost exclusively on DNA markers rather than protein markers. Protein markers offer low resolution because synonymous mutations do not change the amino acid sequence and therefore remain undetected.
3.4.1 PCR-Based Molecular Markers
DNA markers can be codominant or dominant. Widely used examples are amplified fragment length polymorphisms (AFLP), restriction fragment length polymorphisms (RFLP), and random amplified polymorphic DNA (RAPD) (Duminil and Di Michele 2009). AFLP and RAPD are scored as dominant markers, whereas RFLP is codominant. AFLP studies use restriction enzymes to digest genomic DNA, followed by the ligation of adapters to the sticky ends of the restriction fragments. A subset of the restriction fragments is then amplified by polymerase chain reaction (PCR) with primers that match the adapter and restriction-site sequences. Afterward, the amplicons are separated by gel electrophoresis and visualized. RFLP starts with the cutting of DNA by restriction enzymes, followed by gel electrophoresis to sort the resulting fragments by length. RAPD is a variant of PCR that uses short primers of arbitrary sequence, so that random segments of the genome are amplified. In each case, gel electrophoresis reveals banding patterns that differ among individuals.
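The restriction step shared by these methods can be illustrated with a minimal sketch. It assumes a linear, single-stranded representation of the DNA and uses the EcoRI recognition site (G^AATTC) purely as an example; the function names and the two allele sequences are hypothetical and chosen only to show how the loss of one restriction site changes the fragment lengths seen on a gel.

```python
def digest(sequence, site, cut_offset):
    """Cut a linear DNA sequence at every occurrence of a recognition site.

    cut_offset is the number of bases of the site that stay on the left
    fragment (1 for EcoRI, which cuts G^AATTC).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


def banding_pattern(sequence, site="GAATTC", cut_offset=1):
    # Fragment lengths sorted longest first, as they would be ordered on a gel.
    return sorted((len(f) for f in digest(sequence, site, cut_offset)), reverse=True)


# Hypothetical alleles: allele_b has lost the second site (GAATTC -> GAGTTC).
allele_a = "ATATGAATTCCGCGCGGAATTCTTAA"
allele_b = "ATATGAATTCCGCGCGGAGTTCTTAA"

print(banding_pattern(allele_a))
print(banding_pattern(allele_b))
```

With these toy sequences, the allele with two sites yields three fragments and the allele with one site yields two, which is the kind of length polymorphism that RFLP detects.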
There are also codominant molecular markers that can be used for species delimitation and taxonomy. For most of these markers, PCR is used to amplify a specific DNA sequence of a sample. The method starts with the denaturation of the double-stranded DNA into single strands, called templates. Short DNA sequences, which are generally 18–20 bp long and are known as primers, then bind to the templates. This step is called annealing. The next step is elongation, in which the enzyme DNA polymerase synthesizes a new DNA strand that is complementary to the template by adding free nucleotides to the end of the primer. Denaturation, annealing, and elongation are then repeated for a defined number of cycles, until enough copies of the target DNA sequence are available (Semagn et al. 2006). The amplified DNA can subsequently be sequenced and analyzed to address a variety of scientific questions.
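The logic of primer binding and cyclic amplification can be sketched in a few lines. This is a simplified illustration, not a model of real reaction kinetics: it assumes exact primer matches, uses 8 bp primers for readability (real primers are longer, as noted above), and treats the per-cycle efficiency as a free parameter. All names and sequences are hypothetical.

```python
def reverse_complement(seq):
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def in_silico_pcr(template, forward_primer, reverse_primer):
    """Return the amplicon bounded by the two primers, or None if one fails to bind.

    Both primers are written 5'->3'. The forward primer matches the given
    strand; the reverse primer matches the complementary strand.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


def copies_after(cycles, initial_copies=1, efficiency=1.0):
    # Each cycle multiplies the target by (1 + efficiency); 1.0 means perfect doubling.
    return initial_copies * (1 + efficiency) ** cycles


template = "GGCC" + "ATGCGTAC" + "TTTTGGGGAAAA" + "CCGATTGA" + "AATT"
amplicon = in_silico_pcr(template, "ATGCGTAC", "TCAATCGG")
print(amplicon, len(amplicon))
print(copies_after(30))
```

Under the idealized assumption of perfect doubling, a single template molecule gives rise to 2^30 copies after 30 cycles, which is why a small amount of starting DNA is sufficient.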
3.4.1.1 Ribosomal Genes
The nuclear rDNA encodes rRNA, and both contain highly conserved as well as variable domains, which makes them well suited for analyzing phylogenetic relationships (Hwang and Kim 1999; Patwardhan et al. 2014).
The nuclear small subunit (SSU) rDNA is a highly conserved region of the DNA, which has been used for the reconstruction of phylogenetic relationships among kingdoms, phyla, classes, and orders. The nuclear large subunit (LSU) rDNA contains more variation than the SSU rDNA, and the size of its genes varies among phyla. The LSU rDNA is used for studying genetic relationships among orders and families (Hwang and Kim 1999). Further conserved regions, comparable to the nuclear SSU rDNA, are the mitochondrial 12S and 16S rDNA. They encode the ribosomal RNAs of the small (12S) and large (16S) subunits of the mitochondrial ribosome. The 12S rDNA has been used to study the phylogeny of phyla and subphyla, while the 16S rDNA has been used for analyzing the phylogenetic relationships within families and genera, because the 16S rDNA is more variable than the 12S rDNA (Hwang and Kim 1999).
3.4.1.2 Mitochondrial DNA Markers
Because the mitochondrial DNA evolves faster than the nuclear genome, mitochondrial protein-coding regions have been used for analyzing the phylogenetic relationships within families, genera, and species (Hwang and Kim 1999). The first mitochondrial marker used was the control region, which lies in the noncoding part of the mitochondrial genome and is involved in the initiation and regulation of mitochondrial DNA replication and transcription (Patwardhan et al. 2014). The mitochondrial control region is variable in size and varies even between individuals of the same species. Thus, it is used for studying genetic relationships among species, subspecies, and populations (Hwang and Kim 1999).
The second mitochondrial marker was cytochrome c oxidase subunit I/II (COI/II), a well-known component of the electron transport chain. The COI and COII genes code for two polypeptide subunits of the cytochrome c oxidase complex. Both have been used to resolve phylogenetic relationships among orders, families, subfamilies, genera, and species. The sequence of the COI gene is one of the sequences that can be used as a barcode for the identification of species (Patwardhan et al. 2014). DNA barcoding is a method to identify species by comparing a short, standardized sequence with reference sequences of known species.
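The idea behind barcode-based identification can be sketched as a nearest-reference search. The example below assumes that the query and all reference sequences are already aligned and of equal length, and it uses the simple proportion of differing sites (p-distance) as the similarity measure. The 20 bp sequences and species names are made up for illustration; real barcodes are considerably longer and real pipelines use dedicated reference databases.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def identify(query, reference_barcodes):
    """Return (species, distance) for the reference barcode closest to the query."""
    return min(
        ((name, p_distance(query, seq)) for name, seq in reference_barcodes.items()),
        key=lambda pair: pair[1],
    )


# Made-up fragments standing in for COI reference barcodes.
references = {
    "species_A": "ATGGCACTATTCGGAGCATG",
    "species_B": "ATGGCTCTGTTTGGAGCTTG",
    "species_C": "ATGACACTCTTCGGTGCATG",
}
query = "ATGGCACTATTCGGAGCTTG"

best_species, best_distance = identify(query, references)
print(best_species, best_distance)
```

In practice, a match is only accepted if the distance to the closest reference is clearly smaller than the distance to all other species; the threshold for this is a study-specific choice and is not shown here.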
Further widely used mitochondrial markers to reconstruct the phylogeny among genera and species are the cytochrome b (cytb) and NADH dehydrogenase 2 (nd2) genes.
3.4.1.3 Microsatellites
A microsatellite is a tandem repeat of a short DNA motif with a length of two to six base pairs (Fig. 3.3). The number of repeats at a microsatellite locus varies among individuals, so that repeat counts can be used to identify an individual. Minisatellites are similar to microsatellites, but their repeat motifs are longer. Microsatellites can be amplified by PCR with labeled primers, followed by an analysis of the length of the amplified fragment, which reflects the number of repeats. A major advantage is the small amount of DNA needed for the PCR. Microsatellites are locus-specific, codominant, and highly polymorphic. A disadvantage of microsatellites is their taxon-specificity: microsatellite libraries need to be generated for each species or group of closely related sister species (Delaney 2014). Microsatellites are currently used mainly for paternity tests and population genetics, but they hold large potential for speciation studies because they can distinguish lineages within a species. It is necessary to work with more than one microsatellite locus to obtain reliable results.
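Counting repeats at a locus can be sketched as follows. The example works directly on sequences, whereas in the laboratory the repeat number is usually inferred from fragment length; the flanking sequences, the CA motif, and the two individuals are hypothetical.

```python
import re


def longest_repeat_run(sequence, motif):
    """Return the largest number of uninterrupted tandem copies of motif in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


def genotype(allele_sequences, motif):
    # A codominant marker reveals both alleles of a diploid individual.
    return sorted(longest_repeat_run(seq, motif) for seq in allele_sequences)


# Hypothetical alleles at one locus: a CA repeat between identical flanking regions.
individual_1 = ["GGTT" + "CA" * 12 + "TTGA", "GGTT" + "CA" * 15 + "TTGA"]
individual_2 = ["GGTT" + "CA" * 12 + "TTGA", "GGTT" + "CA" * 12 + "TTGA"]

print(genotype(individual_1, "CA"))
print(genotype(individual_2, "CA"))
```

Because both alleles are visible, the first individual can be recognized as heterozygous and the second as homozygous, which illustrates what codominance means for this marker type.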
3.4.2 Expressed Sequence Tags
Expressed genes are transcribed into mRNA, but RNA is unstable outside the cell. Hence, mRNA is converted into complementary DNA (cDNA) by the enzyme reverse transcriptase. The production of cDNA is the reverse of transcription, because mRNA rather than DNA is used as the template. cDNA is more stable than mRNA and generally contains only exons, because introns have been removed by splicing of the pre-mRNA. This means that cDNA represents an expressed gene or a part of it. Once the cDNA has been isolated, part of it can be sequenced to create expressed sequence tags (ESTs) with a length of 100–800 bp. ESTs allow the discovery of unknown genes and a comparison between different species due to the high conservation of coding regions (Semagn et al. 2006). From ESTs it is possible to develop primer pairs for sequencing genes in other species and to detect single nucleotide polymorphisms (SNPs) (Schlötterer 2004; Semagn et al. 2006).
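The relationship between mRNA, cDNA, and an EST can be sketched with base pairing alone. This is a deliberately simplified illustration: it ignores priming, the poly(A) tail, and second-strand synthesis, and the mRNA sequence and read length are hypothetical.

```python
def reverse_transcribe(mrna):
    """Return the first-strand cDNA (5'->3') complementary to an mRNA (5'->3')."""
    pairing = {"A": "T", "U": "A", "G": "C", "C": "G"}
    return "".join(pairing[base] for base in reversed(mrna))


def est_from_cdna(cdna, read_length):
    # An EST is a single sequencing read from one end of the cDNA (here, the 5' end).
    return cdna[:read_length]


mrna = "AUGGCCAUUGUAAUGGGCCGCUGA"
cdna = reverse_transcribe(mrna)
est = est_from_cdna(cdna, 10)
print(cdna)
print(est)
```

The resulting cDNA contains thymine instead of uracil and reads as the reverse complement of the mRNA; the toy read length of 10 stands in for the 100–800 bp of a real EST.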
3.4.3 Single Nucleotide Polymorphisms
A single nucleotide polymorphism (SNP) is a difference in a single base at a given position of the DNA sequence (Fig. 3.3) (Semagn et al. 2006). Generally, two different nucleotides can be found per position, and SNPs mostly occur in noncoding regions (Grover and Sharma 2016). The simplest method to identify SNPs is to screen high-quality DNA sequences or ESTs. The most common methods, restriction-site-associated DNA sequencing (RAD-seq) and genotyping by sequencing (GBS), are explained in the following two sections. A comprehensive strategy for detecting SNPs in a genome is the generation of shotgun genome sequences. For this method, a pool of DNA from different individuals should be sequenced. A more efficient approach is shotgun sequencing of a reduced section of the genome, which allows the DNA of many different individuals to be sequenced (Schlötterer 2004). Most of these methods are cost- and time-intensive, and the information content of a single SNP is very low. However, SNPs have a low mutation rate (high stability) and occur at high frequency in the genome, and new analytical methods are being developed that open up new opportunities.
SNPs can be used to address different research questions, e.g., to investigate natural selection across species (Künstner et al. 2010), examine recent divergence (McCormack et al. 2012), explore the genetic structure of different morphological features in different species (Silva et al. 2017), and investigate hybridization (Manthey et al. 2016).
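Screening aligned sequences for SNPs amounts to looking for alignment columns with more than one nucleotide. The sketch below assumes that the sequences are already aligned and error-free, which real data are not; production SNP callers additionally weigh base quality and read depth. Sample names and sequences are hypothetical.

```python
def find_snps(aligned_sequences):
    """Return {position: sorted alleles} for the variable columns of an alignment.

    aligned_sequences maps a sample name to its sequence; all sequences must
    have the same length. Positions are 0-based. Gaps ('-') and 'N' are ignored.
    """
    sequences = list(aligned_sequences.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    snps = {}
    for position in range(length):
        alleles = {seq[position] for seq in sequences} - {"-", "N"}
        if len(alleles) > 1:
            snps[position] = sorted(alleles)
    return snps


samples = {
    "ind1": "ACGTTGCAAC",
    "ind2": "ACGTTGCAAC",
    "ind3": "ACATTGCGAC",
    "ind4": "ACATTGCAAC",
}
print(find_snps(samples))
```

In this toy alignment, two positions are variable and each carries two alleles, in line with the observation that SNPs are generally biallelic.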
3.4.4 Restriction-Site-Associated DNA Sequencing
Restriction-site-associated DNA sequencing (RAD-seq) is the genotyping of short DNA fragments that are adjacent to the cut site of a restriction enzyme (RE). The first step of RAD-seq is the digestion of the genomic DNA with a chosen RE, followed by the ligation of an adapter (P1) to the overhang left by the RE (Baird et al. 2008; Davey and Blaxter 2011). This adapter contains a binding site for the forward primer and a barcode for sample identification. After ligation, the fragments of the different samples are pooled, randomly sheared, and size selected (Baird et al. 2008). The DNA fragments are then ligated to a second adapter (P2), which carries the reverse primer site and is a Y adapter with divergent ends (Coyne et al. 2004; Baird et al. 2008). The Y adapter ensures that only fragments containing the P1 adapter are amplified: the reverse primer cannot bind to the P2 adapter until the complementary strand has been synthesized by extension from the forward primer at the P1 adapter (Baird et al. 2008; Davey and Blaxter 2011). After ligation of the second adapter, a PCR is performed. The PCR products are used for next-generation sequencing (Sect. 3.4.7) (Baird et al. 2008). The resulting reads are trimmed, grouped by barcode, and mapped to a reference genome or, if no reference genome is available, reads from the same locus are aligned to each other to identify SNPs (Baird et al. 2008; Davey and Blaxter 2011). The challenges of RAD-seq are the high costs of sequencing and the diversity of RAD-seq protocols, which differ in their technical details. On the other hand, this diversity allows one to choose the protocol most suitable for one's own study system or research question (Andrews et al. 2016). RAD-seq can identify and generate thousands of genetic markers, reduces the complexity of the genome, and can be used for species with no or limited existing sequence data (Davey and Blaxter 2011). Furthermore, RAD-seq was extended to use two REs instead of one, which removes the random shearing step and allows fragments to be recovered by precise size selection. This method is called double digest RAD-seq (Peterson et al. 2012).
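Why RAD-seq samples only a small, reproducible part of the genome can be illustrated by collecting the sequence flanking each restriction site. The sketch below is an in silico simplification: it ignores adapters, shearing, and the site remnant at the start of real reads, and it uses the EcoRI site and a tag length of 5 bp purely as placeholders. The toy genome is hypothetical.

```python
def reverse_complement(seq):
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def rad_tags(genome, site, tag_length):
    """Return the sequences flanking each restriction site, read outward from the site."""
    tags = []
    pos = genome.find(site)
    while pos != -1:
        upstream = genome[max(0, pos - tag_length):pos]
        downstream = genome[pos + len(site):pos + len(site) + tag_length]
        # Reads start at the cut site, so the upstream flank is read on the opposite strand.
        tags.append(reverse_complement(upstream))
        tags.append(downstream)
        pos = genome.find(site, pos + 1)
    return tags


genome = ("AACCGGTTAC" + "GAATTC" + "TGCATGCAAG"
          + "CCTTAGGACA" + "GAATTC" + "GGATCCATTA")
print(rad_tags(genome, "GAATTC", 5))
```

Each restriction site contributes two tags, one per direction, so the number of markers scales with the number of cut sites rather than with genome size.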
3.4.5 Genotyping by Sequencing
Genotyping by sequencing (GBS) is a highly multiplexed approach for constructing reduced-representation libraries for the Illumina next-generation sequencing platform in order to discover a large number of SNPs. This approach can be used for any species at a low per-sample cost and also relies on restriction enzymes (REs) to reduce genome complexity (Elshire et al. 2011; Chung et al. 2017). Like RAD-seq, the GBS procedure starts with the digestion of DNA by an RE. The selected RE should suit the investigated species: it should leave an overhang of two to three base pairs and should not cut frequently in the major repetitive fraction of the genome. After the digestion, two adapters are ligated to the ends of the digested DNA. The adapters must be complementary to the overhang of the chosen RE, and one adapter contains a barcode for multiplex sequencing. The adapters also contain binding sites for appropriate primers, which are added to perform a PCR that increases the amount of DNA fragments. The PCR products are cleaned up, and the DNA fragments of a specific size form the library. The libraries are sequenced, and reads are retained only if they match one of the barcodes followed by the corresponding cut site of the RE and are not adapter dimers. The retained reads are sorted by barcode, after which the barcode is removed. The filtered reads are mapped to the reference genome, and reads that map to the same position are aligned and used to identify SNPs (Elshire et al. 2011). GBS is a cost-effective method to discover SNPs, genotype individuals within a population, and detect molecular markers. The disadvantages are the management of big datasets and the fact that the data do not represent the whole genome, which could have a negative effect on constructing genetic maps (Chung et al. 2017).
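The read-filtering step described above can be sketched as a simple demultiplexer. The example assumes exact barcode matches, barcodes that are not prefixes of one another, and a single fixed cut-site remnant; the barcodes, the remnant `CAGC`, the sample names, and the reads are placeholders, and the removal of adapter dimers is not shown.

```python
def demultiplex(reads, barcodes, cut_site_remnant):
    """Sort reads by barcode, keep those with the cut-site remnant, strip the barcode.

    barcodes maps a barcode sequence to a sample name. Reads that match no
    barcode, or lack the remnant directly after the barcode, are discarded.
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode + cut_site_remnant):
                by_sample[sample].append(read[len(barcode):])
                break
    return by_sample


barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = [
    "ACGTCAGCTTGACC",  # sample_1
    "TGCACAGCGGATTA",  # sample_2
    "ACGTCAGCAATTGG",  # sample_1
    "GGGGCAGCTTTTAA",  # unknown barcode, discarded
    "ACGTTTTTCAGCAA",  # barcode present but no cut site after it, discarded
]
print(demultiplex(reads, barcodes, "CAGC"))
```

Requiring the cut-site remnant directly after the barcode is a cheap check that a read really originates from a restriction fragment of the intended enzyme.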
3.4.6 Transcriptomics
Transcriptomics is the study of an organism’s transcriptome, which is the total of all its RNA transcripts. The transcriptome is a snapshot of all transcripts present in one cell or tissue at a specific time or developmental stage. Comparing the expressed genes of one organism across different cells, tissues, conditions, or time points gives details about the function of uncharacterized genes and the biology of the organism. Furthermore, the comparison of transcriptomes allows the identification of genes that are differentially expressed among cells; hence, it gives information about gene regulation. There are two techniques to characterize a transcriptome: microarrays and RNA-Seq. The microarray approach quantifies a set of predefined sequences, while the RNA-Seq technique uses next-generation sequencing to target “all” expressed genes (Wang et al. 2009).
3.4.7 “Whole” Genome Sequencing
Next-generation sequencing (NGS) produces a large number of short DNA sequence reads, typically between 50 and 150 bp long. NGS reads are short and have a comparatively high error rate, but this is compensated by high coverage, because each position of the consensus sequence is covered by many reads (Scanes 2015). These reads can be combined into continuous sequences (contigs), and contigs can in turn be linked into scaffolds. The quality of genome assemblies (contigs and scaffolds) can be summarized by the N50 value: when the sequences are sorted from longest to shortest, N50 is the length of the sequence at which the cumulative length reaches half of the total assembly length (Kapusta and Suh 2017). Contigs and scaffolds can be used to identify genes, but some sequences cannot be assigned to a particular chromosome; these are grouped in chromosome Unknown (chrUn). Annotation is the process of linking the assembled DNA sequences to information available from previous work (on other taxa) (Scanes 2015).
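The N50 value can be computed directly from a list of contig or scaffold lengths. The sketch below uses the common definition: sort the sequences from longest to shortest and report the length of the sequence at which the running total first reaches half of the total assembly length. The example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

For the seven lengths above, the total is 300, and the two longest contigs together reach 150, so the N50 is 70. A larger N50 indicates a more contiguous assembly but says nothing about its correctness.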
3.4.7.1 Different Strategies for Sequencing Genomes
The traditional Sanger sequencing with 1-kb-long sequence reads and the Roche 454 sequencing with up to 800 bp sequence reads have been largely replaced by short-read technologies such as Illumina HiSeq with 150 bp sequence reads. Even newer technologies are available, such as Pacific Biosciences with up to 5 kb sequence reads or Ion Torrent with about 500 bp sequence reads (Ekblom and Wolf 2014). The technology of 10x Genomics uses short reads from Illumina sequencing and links reads that originate from the same long DNA molecule. Variation detected within these long molecules can be used to identify which reads derive from the paternal or the maternal chromosome of the examined individual. Another approach detects and sequences individual DNA molecules directly. This is called single-molecule genomics.
One of the most common strategies for genome sequencing is shotgun sequencing. First, DNA is cut into small random fragments, whereby the size of the fragments depends on the technology used. The fragments are sequenced, and the resulting reads are assembled into longer contigs. This process is known as de novo assembly. A correct assembly requires sufficient overlap between the sequence reads, which in turn requires high coverage. If the fragments are longer, for example several hundred base pairs, both ends of each fragment are sequenced, which is called paired-end sequencing. Afterward, the resulting contigs are connected into longer sequences (scaffolds) (Ekblom and Wolf 2014).
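The role of read overlap in de novo assembly can be illustrated with a toy greedy assembler. This is a teaching sketch only: it assumes error-free reads from a single strand and no repeats, and it repeatedly merges the pair of sequences with the longest exact overlap. Real assemblers use graph-based methods and handle sequencing errors; the function names, reads, and minimum overlap are hypothetical.

```python
def overlap(left, right, min_length):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the two sequences with the longest overlap into one contig."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # no remaining pair overlaps sufficiently
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))
```

If the overlaps between reads are shorter than `min_overlap`, the reads remain as separate contigs, which mirrors why low coverage leads to fragmented assemblies.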
Annotation describes the process of using data from other genomes or transcriptomes to detect genes or transcripts in the newly assembled genome. Genome annotation combines the whole genome sequence with relevant information from gene models, functional information, microRNA, or epigenetic modifications. Consequently, a lack of genomic information will result in low annotation rates (Ekblom and Wolf 2014).
3.4.7.2 Limitations of Analyzing Genomes
Usually, a genome draft is meant to represent the complete nucleotide sequence of all chromosomes of one species. Nevertheless, there is not just one sequence for a species, because genomes vary among individuals, among cells within an individual, and, in diploid organisms, between the two copies of each chromosome. Thus, the assembled reference genome sequence of one individual will only comprise a subset of the total variation present within a species. Typically, one individual is sequenced, but sometimes a genome is based on a consensus of a few individuals (Ekblom and Wolf 2014). Furthermore, it is not possible to sequence and assemble all nucleotides in the genome due to sequencing errors (Scanes 2015), and most genome assembly methods fail on repetitive elements, which are therefore typically not included in reference genomes (Hoban et al. 2016). However, repetitive regions may be characterized by annotating a comprehensive dataset, composed of a high-coverage single-molecule real-time sequencing assembly, an assembled optical map, and a high-coverage short-read sequence assembly, against a repeat library (Weissensteiner et al. 2017).
3.4.8 Epigenome
Almost all cells of an individual contain the same DNA sequence, but cells may nevertheless differ because the information encoded in the DNA is used differently. Such differences may arise from chemical modifications of the DNA or of histone proteins that do not change the DNA sequence. The resulting epigenome comprises the chemical compounds that have been added to the DNA to regulate gene activity. These chemical compounds are not part of the DNA sequence but are attached to the DNA. Epigenomic changes occur during individual development and tissue differentiation, may persist through cell division, and, in some circumstances, can be transferred to the next generation. However, the epigenome can also be influenced by environmental conditions, such that it may vary between individuals. Through epigenetic changes, the expression of genes can be turned off or on, thus determining which proteins are produced in specific cells. For example, cells of the eye produce light-sensitive proteins, and red blood cells produce proteins for carrying oxygen. Furthermore, epigenetic changes in DNA and histones play a role in regulatory pathways of eukaryotes (Marshall Graves 2015).