Introduction

DNA sequencing is significant in many fields, such as forensic science, biology, genetics, molecular biology, and archeology. Nucleic acids are essential for the continuity of life because they carry the genetic information of living organisms. Sequencing has attracted considerable attention because the arrangement of nucleotides in polynucleotide chains encodes the hereditary and biochemical traits of living species. In fact, the ability to sequence the human genome has drastically outpaced our ability to interpret genetic variation. Since the discovery of the 3D structure of DNA by Watson and Crick in 1953 (Watson and Crick 1953; Zallen 2003), sequencing technology has passed through three generations of evolution, which are discussed in this review.

Nucleic acid sequencing is categorized into three generations (Fig. 1). During the first generation, short DNA fragments were sequenced. In the second generation, throughput increased while turnaround time and costs decreased. Hence, by the end of the second generation, whole-genome and transcriptome sequencing had become more convenient. The third generation continues to push technological boundaries, with the previously inconceivable capability of sequencing single molecules without prior amplification. The second and third generations are often referred to as “next-generation sequencing” (NGS).

Fig. 1

A glance at DNA sequencing generations and some features of each generation

The growth of commercial sequencing platforms and the optimization of experimental protocols have greatly expanded the applications of DNA sequencing. Special emphasis is placed on integrating sequencing-related technologies such as genomics, transcriptomics, proteomics, epigenomics, and metabolomics (Graw et al. 2021; Khella et al. 2021; Li et al. 2020; Reiter et al. 2021). Combining these technologies with morphological and physiological techniques makes a general approach to unraveling biological systems possible (Philpott et al. 2020; Zhu et al. 2020).

First generation

The first protein sequence, that of insulin, was determined in the early 1950s by Fred Sanger, who devoted his scientific life to the determination of primary sequence (Heather and Chain 2016). Early nucleic acid sequencing efforts focused on pure RNA species such as tRNA. At that time, researchers borrowed techniques from analytical chemistry that could measure nucleotide composition but not nucleotide order (Holley et al. 1961). In 1965, Robert Holley and colleagues introduced a method that used ribonuclease treatment to generate RNA fragments and produced the first complete nucleic acid sequence (Holley et al. 1965). This first sequencing of alanine tRNA took five people 3 years, working with 1 g of pure material isolated from 140 kg of yeast, to determine 76 nucleotides (Holley et al. 1965). Molecular cloning protocols were so time-consuming that they could take several years; they were eventually replaced by in vitro amplification, which was more efficient, taking months instead of years (Lario et al. 1997). At the same time, Fred Sanger and colleagues developed a method based on the partial digestion of radiolabeled DNA fragments (Adams et al. 1969). This method labeled DNA strands with radioactive nucleotides to deduce their sequence (Padmanabhan et al. 1974; Wu 1970). However, it again limited sequencing to short stretches of DNA. Despite these challenges, remarkable methodological advances in analytical chemistry and fractionation procedures were adapted to nucleic acid sequencing.

First-generation platforms include the Maxam-Gilbert (chemical degradation) and Sanger (dideoxy terminator) methods. As noted, they were capable of sequencing only short stretches of DNA. At the time of their development, first-generation techniques were monumental, allowing researchers to begin to sequence DNA. The later use of fluorescent labels in place of radioactive labels led to the optimized form of Sanger’s method recognized today (Lario et al. 1997).

The emergence of 2D fractionation methodology, which combines electrophoresis and chromatography, had a significant influence on sequencing. This approach gave researchers significantly higher resolving power and was originally employed by Coulson and Sanger in the “plus and minus” protocol, which used Escherichia coli DNA polymerase I and DNA polymerase from bacteriophage T4 with different limiting nucleoside triphosphates. The products generated by the polymerases were resolved by ionophoresis on acrylamide gels (França et al. 2002). Maxam and Gilbert also used it in their chemical cleavage technique (Maxam and Gilbert 1977; Sanger and Coulson 1975). The first DNA genome was sequenced by Sanger and colleagues with the aid of the plus and minus technique (Sanger et al. 1977). The Maxam and Gilbert technique, in contrast, was quite different; it was widely adopted and could be considered the true arrival of “first-generation” DNA sequencing. The chief advantages of the Maxam-Gilbert technique compared with Sanger’s method are as follows: (1) sequencing can be done from the original DNA fragment instead of from enzymatic copies, (2) no PCR (polymerase chain reaction) is required, and (3) the method is less susceptible to errors arising from secondary structures or from enzymatic mistakes (França et al. 2002).

Sanger sequencing provided the foundation for the development of automated DNA sequencing machines (Kambara et al. 1988; Luckey et al. 1990). These machines were capable of reading no more than thousands of bases. Eventually, newer sequencers such as the ABI PRISM, which grew out of Leroy Hood’s research and was manufactured by Applied Biosystems (Smith et al. 1986), were capable of sequencing hundreds of samples simultaneously (Ansorge 2009). This latter technology was employed in the now famous Human Genome Project (HGP).

A glance at the second generation

While efforts were being made to develop large-scale sequencing, the next generation of DNA sequencers was gradually emerging. A new technique appeared that differed strikingly from existing methods, since it did not identify nucleotides with the aid of radiolabeled or fluorescently labeled deoxyribonucleotides (dNTPs). The new method consisted of a two-enzyme process in which adenosine triphosphate (ATP) sulfurylase converts pyrophosphate into ATP, which is then used as the substrate for luciferase, thus producing light proportional to the amount of pyrophosphate (Nyrén and Lundin 1985). Despite their differences, both Sanger’s method and this new technique (pyrosequencing) are known as “sequence-by-synthesis” (SBS) techniques, since both require DNA polymerase to produce the observable output. This breakthrough allowed second-generation technology to sequence genomes at an affordable cost in time and money. Second-generation sequencers overcame the limitations of the first generation with approaches such as (1) emulsion polymerase chain reaction (PCR), (2) reversible terminators, and (3) sequencing by oligonucleotide ligation and detection (Dorado et al. 2021). Although revolutionary compared with the first generation, these methods retained limitations, such as the need to amplify DNA, which intrinsically introduces errors into the read sequence (Ozsolak et al. 2009).

The disadvantage of the improved Sanger sequencing equipment was its cost and time consumption. The Human Genome Project is a prime example, costing 3 billion dollars and taking 13 years (Lander et al. 2001). Pyrosequencing, in contrast, had properties that were considered beneficial: it used natural nucleotides (instead of heavily modified dNTPs), and incorporation could be observed in real time (Ronaghi et al. 1996). The major drawback of the method was that noise made the readout non-linear for runs of more than four or five identical nucleotides (Ronaghi 1998). Pyrosequencing was then licensed to 454 Life Sciences, where it developed into the first major “next-generation sequencing” (NGS) technology. These sequencing devices boosted read output by orders of magnitude and allowed researchers to sequence a single human genome completely in 2 months at approximately one-hundredth of the cost of traditional capillary electrophoresis methods (Wheeler et al. 2008). This tremendous shift greatly increased the quantity of DNA that could be sequenced in a single run. In a typical run, over 25 million bases could be sequenced (Margulies et al. 2005).
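The homopolymer limitation of pyrosequencing can be illustrated with a toy model. The sketch below is not the base-calling algorithm of any real instrument: it assumes an idealized, noise-free signal in which each nucleotide flow emits light in proportion to the number of bases incorporated, a hypothetical cyclic dispensation order (TACG), and a hypothetical saturation parameter (`max_linear`) that stands in for the loss of linearity beyond four or five identical nucleotides. For simplicity, the read is written as the strand being synthesized.

```python
from itertools import cycle


def flowgram(read, flow_order="TACG"):
    """Ideal light signal per nucleotide flow for the strand being synthesized.

    Each entry is (dispensed base, number of bases incorporated in that flow).
    """
    if not set(read) <= set(flow_order):
        raise ValueError("read contains bases that are never dispensed")
    signals, pos = [], 0
    for base in cycle(flow_order):
        if pos >= len(read):
            break
        run = 0
        # A homopolymer run is incorporated in a single flow, so the
        # light emitted is proportional to the run length.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals


def call_bases(signals, max_linear=5):
    """Decode a flowgram; runs longer than max_linear are truncated
    (an assumed, simplified model of signal saturation)."""
    return "".join(base * min(run, max_linear) for base, run in signals)


short_read = call_bases(flowgram("TTAGC"))        # decoded exactly
long_run = call_bases(flowgram("A" * 8 + "C"))    # homopolymer is under-called
```

In this model a read without long homopolymers is recovered exactly, while a run of eight identical bases is under-called, which mirrors the qualitative behavior described above.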

In principle, Sanger sequencing and NGS rest on similar concepts: DNA polymerase adds fluorescent nucleotides one by one to a growing strand complementary to the DNA template, and each incorporated nucleotide is identified by its fluorescent tag. The critical difference between Sanger sequencing and NGS is sequencing volume. While the Sanger method sequences only a single DNA fragment at a time, NGS is massively parallel, sequencing millions of fragments simultaneously per run. This high-throughput process translates into sequencing hundreds to thousands of genes at one time. With deep sequencing, NGS also offers greater discovery power to detect novel or rare variants.

After the success of NGS, several parallel sequencing techniques emerged. Among them, the Solexa method is the most widely recognized and is described in detail elsewhere (Bentley et al. 2008; Fedurco et al. 2006). Throughout the second generation, technologies and techniques improved substantially, achieving longer reads, higher accuracy, and faster runs.

From 2004 until 2010, DNA sequencing capacity doubled every 5 months, much faster than the pace of the computing revolution embodied by Moore’s law, under which capacity doubles every 2 years (Stein 2010). From 2007 until 2012, the overall cost of DNA sequencing per base fell by four orders of magnitude (Wetterstrand 2017). In addition, several companies appeared and disappeared during this period, each with its own influence. Some produced machines offering greater speed or read length, while others offered higher accuracy or cheaper sequencing per base (Glenn 2011).
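As a rough worked comparison, the two doubling times quoted above can be turned into growth factors over the six years from 2004 to 2010. The function name is illustrative, and the calculation assumes perfectly steady exponential growth, which real data only approximate.

```python
def growth_factor(months_elapsed, doubling_time_months):
    """Fold increase after months_elapsed, given a fixed doubling time."""
    return 2 ** (months_elapsed / doubling_time_months)


span = (2010 - 2004) * 12             # 72 months
sequencing = growth_factor(span, 5)   # doubling every 5 months
moore = growth_factor(span, 24)       # doubling every 2 years

print(f"sequencing: ~{sequencing:,.0f}-fold; Moore's law: {moore:.0f}-fold")
```

Under these assumptions, sequencing capacity grows by a factor on the order of 2 × 10^4 over the period, against a factor of 8 for Moore's law.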

Third generation

Although there is no distinct boundary between DNA sequencing generations, especially between the second and third (Pareek et al. 2011), real-time sequencing, single-molecule sequencing (SMS), and a simple divergence from prior technologies can be considered the defining characteristics of the third generation. The key feature of third-generation technologies is that they can sequence long strands of nucleic acid directly, without an intermediary and without prior reverse transcription or amplification (Ozsolak et al. 2009). Several platforms have become commercially available, such as Helicos BioSciences, Pacific Biosciences, BGI Group Complete Genomics, and Oxford Nanopore Technologies. Each platform has its own advantages and disadvantages (Blom 2021; Broseus et al. 2020). Thus, a combination of several of them may be required for an in-depth analysis of gene expression (Ilgisonis et al. 2021). In addition, computational approaches such as machine learning have been applied to these analyses (Bobrovskikh et al. 2021; Liu et al. 2021). For example, Pacific Biosciences produces long reads on the order of 20 kb and can reach 300 kb (Hestand and Ameur 2019); nanopore sequencing produces reads of 30 kb, extending to 2.3 Mb (Amarasinghe et al. 2020). To reach the full potential of the third generation, some disadvantages, such as the demand for higher nucleic acid concentrations on some platforms, should be addressed to remove the need for amplification (Amarasinghe et al. 2020; Bleidorn 2016; Feng et al. 2021; Jain et al. 2018; Wang et al. 2020). In 2015, the single-molecule real-time (SMRT) sequencing platform was perhaps the most widely used third-generation technology (Van Dijk et al. 2014).

Nanopore sequencing might be the most promising platform for the development of third-generation DNA sequencing. It is a branch of the broad field of using nanopores to identify biological and chemical molecules (Haque et al. 2013). In fact, the potential of nanopores for sequencing was established well before the emergence of the second generation but was not widely recognized in mainstream science until recently. Researchers showed that single-stranded DNA (ssDNA) or RNA could be driven across a lipid bilayer through α-hemolysin ion channels, temporarily blocking the flow of ionic current (blockade current) for a time proportional to the length of the nucleic acid (Kasianowicz et al. 1996). The possibility of using solid-state nanopores to sequence double-stranded DNA was raised more recently in the literature (Dekker 2010). The sections below describe solid-state nanopores and their application in DNA sequencing.
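The blockade principle can be sketched in a few lines of code. The example below is a toy illustration rather than the analysis used in the cited work: it assumes a uniformly sampled current trace, a known open-pore baseline, and a hypothetical fixed threshold fraction, and it reports each blockade as a (start, duration) pair, where the duration stands in for the quantity that scales with strand length. All numbers are made up and the units are arbitrary.

```python
def find_blockades(current, baseline, threshold=0.7, dt=1.0):
    """Return (start_time, duration) for each run of samples whose
    current falls below threshold * baseline.

    threshold and dt are illustrative parameters, not published values.
    """
    events, start = [], None
    for i, value in enumerate(current):
        blocked = value < threshold * baseline
        if blocked and start is None:
            start = i                                   # blockade begins
        elif not blocked and start is not None:
            events.append((start * dt, (i - start) * dt))
            start = None                                # blockade ends
    if start is not None:                               # trace ended mid-blockade
        events.append((start * dt, (len(current) - start) * dt))
    return events


# Synthetic trace: open-pore current of 100 with two blockades
trace = [100, 100, 20, 20, 20, 100, 100, 15, 15, 15, 15, 15, 15, 100]
events = find_blockades(trace, baseline=100)
```

Here the second blockade lasts twice as long as the first, as would be expected for a strand roughly twice as long under this simplified picture.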

The combination of genetic engineering and computer-aided technology may form the foundation of a fourth generation of sequencing platforms. For example, Oxford Nanopore Technologies (ONT) developed nanopore technology to identify bases from the distortions produced as DNA passes through the nanopore (Mikheyev and Tin 2014). Nanopores and sequencing through them are addressed later. A comparison of some features of platforms from different generations is shown in Table 1.

Table 1 A comparison between some features of different generations platforms (Solieri et al. 2013)

Importance of sequencing generated data

The tremendous amount of data generated and collected, mostly during the second and third generations of sequencing, requires new software and hardware for analysis. To handle these big data, many fields such as mathematics, statistics, and bioinformatics are involved, and artificial intelligence, machine learning, and related approaches have been developed (Chachar et al. 2021; Jovčevska 2020). The importance of nucleic acid sequence data is illustrated by several examples: sequencing technologies support vaccine design against COVID-19 (Wang et al. 2021); the Human Genome Project (HGP) paved the way for whole-genome sequencing (Wang et al. 2021); nucleic acids can also be used to store any kind of data in a dense and efficient manner, to be recovered and decoded by sequencing (Wang et al. 2021); and functional genomics is used in diverse areas such as medicine and agronomy, for example to investigate disease resistance or abiotic and biotic stresses in animals and plants, with significant consequences for health programs (Jha et al. 2021). Such data are vital in disease diagnostics and clinical treatment (Caspar et al. 2021).

In the last 2 decades, the number of totally drug-resistant bacteria, which are resistant to all known antibiotics, has increased, principally because of the misuse of antibiotics (Gaultney et al. 2020). This highlights the need for new prevention and treatment strategies against pathogenic bacteria, including the discovery of alternatives to antibiotics. Recently developed sequencing technologies are being used to pursue this objective (Allue Guardia et al. 2021). In this context, highly conserved DNA methyltransferases (MTases) are potential targets for epigenetic inhibitors to treat infections (Oliveira and Fang 2021). A simple comparison of various platforms from different generations is provided in Table 2.

Table 2 A comparison between different platforms from different generations (Lin et al. 2021)

Nanopore sequencing

Nanopore sequencing represents a new era of sequencing that has grown rapidly to meet the demand for longer reads, faster sequencing, and lower costs. In some publications, nanopore sequencing is considered the fourth generation of DNA sequencing technology (Lin et al. 2021). Nanopore technology is a promising platform that uses highly sensitive single-molecule detectors for DNA or RNA (Garalde et al. 2018; Kasianowicz et al. 1996). In addition, nanopore sensors are easily miniaturized and integrated into portable “lab-on-a-chip” devices (Roman et al. 2017). Despite these benefits, complicated sample preparation and data processing algorithms remain challenges to be overcome (Bayley 2015; Kasianowicz et al. 1996; Deamer et al. 2016).

A nanopore is a nanometer-sized hole that can be formed either by proteins or in synthetic materials. All nanopore types are used to detect and sequence biological and chemical molecules at the nanoscale (Deamer and Branton 2002). Nanopore sequencing offers inexpensive and fast DNA sequencing without the use of labels (Rhee and Burns 2006). Some types of nanopores and their materials are shown in Fig. 2.

Fig. 2

Graphical representation of biological and 2D solid-state nanopores

Biological nanopores

Biological nanopores are also called transmembrane protein channels (Zeng et al. 2021). They are protein molecules, either natural or produced by genetic engineering (Mohammad et al. 2012). Examples include the α-hemolysin pore protein (Bayley and Cremer 2001), MspA (Mycobacterium smegmatis porin A) (Zeng et al. 2021; Derrington et al. 2010), and the motor of bacteriophage Phi-29 (Phi 29), a phage of Bacillus subtilis (Manrao et al. 2012). Thus, α-hemolysin, the MspA porin, and the Phi 29 connector are some of the proteins that form pores. These biological nanopores are commonly used for smart drug delivery (Martinac et al. 2017; Martinac et al. 2020), disease diagnosis (Brown et al. 2021), protein sequencing (Hu et al. 2021), and gene sequencing (Quick et al. 2016). In the laboratory, nanopores are inserted into a lipid bilayer film so that manipulations and measurements can be carried out (Briggs et al. 2018). Although there are abundant molecular channels, such as receptors and ligand-gated channels, that could be employed in sensing applications, most attention is paid to well-controlled pores that can be used as a single sensing element (Shen et al. 2020). Among the pores mentioned above, α-hemolysin was the first to be commonly used (Song et al. 1996).

Biological applications have inspired researchers to use technologies based on synthetic and biological nanopores to detect gene sequences. These technologies have been used extensively in DNA sequencing (Heng et al. 2004; Manrao et al. 2012; Venkatesan and Bashir 2011; Wanunu 2012; Woodside et al. 2006) and also in RNA and protein sequencing (Depledge et al. 2019; Smith et al. 2019; Soneson et al. 2019; Xie et al. 1991). In addition, such technologies can be employed to determine the sequence of nucleic acids (Kono and Arakawa 2019; Lockhart and Winzeler 2000; Soneson et al. 2019; Xie et al. 1991).

Solid-state nanopores

Solid-state nanopores are principally produced in thin films of materials such as graphene (a sheet of carbon one atom thick), silicon nitride (SiN), phosphorene, Al2O3 (Venkatesan et al. 2009), and HfO2 (Larkin et al. 2013). SiN, graphene, and phosphorene nanopores offer advantages over their biological counterparts, such as chemical and thermal stability, although this stability depends on how the pore is formed. Numerous techniques exist for producing solid-state nanopores, such as “deploying and sculpting with ion beam” and “fabrication by electron beam” (Briggs et al. 2018). These pores can also be constructed using procedures such as electrochemical reactions, controlled breakdown, laser etching, and laser-assisted controlled breakdown (Feng et al. 2015b). However, controlled chemical modification of these nanopores is feasible though challenging (Brilmayer et al. 2020; Yusko et al. 2011). Solid-state nanopores are subject to fewer restrictions than biological ones; for example, they can operate over wider temperature and voltage ranges. In addition, solid-state nanopores are more compatible with, and more stable under, varying solvent conditions, and their diameter can be adjusted with sub-nanometer accuracy (Yuan et al. 2020). Si3N4 and SiO2 nanopores are among the most widely employed, and their manufacture is compatible with industrial complementary metal oxide semiconductor (CMOS) integrated circuit processes. These nanopores are produced by ion etching of free-standing Si3N4 and SiO2 films using argon (Tang et al. 2016).

Graphene has unique properties: it is electrically conductive and much stronger than steel (Thompson and Milos 2011). Although graphene, as a single-layer material, provides the optimal thickness (0.34 nm) for single-base resolution (Novoselov et al. 2016), MoS2 is the 2D material most frequently investigated for sequencing applications because MoS2 devices are simple to fabricate (Butler et al. 2013; Graf et al. 2019b; Tsutsui et al. 2011). It should be noted that the structures of single-layer sheets and pores are not static; rather, they are affected and distorted by electrostatic and hydrodynamic forces (Hernández-Ainsa et al. 2014; Plesa et al. 2014). Moreover, it has recently been shown that graphene nanopores are not suitable candidates for sequencing DNA using ionic current. Because graphene and DNA nucleotides have strong hydrophobic interactions, DNA may stick to graphene, which severely affects translocation speed (Schneider et al. 2013). The major drawback of graphene is therefore its hydrophobic nature. Another issue is the orientational fluctuation of nucleobases during DNA translocation through a graphene nanopore. From a sequencing point of view, MoS2 can perform better than graphene; for example, its signal-to-noise ratio and the fact that DNA does not stick to the MoS2 surface make it suitable (Graf et al. 2019a; Henry et al. 2021). Phosphorene nanopores and silicene (graphene-like two-dimensional silicon) nanopores seem much more suitable still (Henry et al. 2021). One of the main problems in detecting bases with solid-state nanopores is their low spatial resolution, since dozens of bases can be inside the pore at any given moment (Yanagi et al. 2015).

In general, the thickness of a single-layer 2D material is approximately 3.0–11.0 Angstroms, which is comparable to the gap between two successive nucleotides of an ssDNA, approximately 3.5–5.2 Angstroms (Liu et al. 2014).

As a theoretical example of nanopore applications, specifically with graphene, an ssDNA is pulled through a nanopore whose diameter is comparable to the size of a single DNA base. With the aid of molecular dynamics (MD) simulation, various parameters, such as the pulling force or the orientation of the bases relative to the graphene plane or its normal axis, can be tracked during translocation of the ssDNA. In an unpublished work by the authors, the phosphorus atom of the DNA backbone is pulled through the nanopore at constant velocity, and in addition to pulling force and base orientation, van der Waals and electrostatic energies and forces are tracked to see whether these parameters can yield a clear distinction between DNA bases (Fig. 3).

Fig. 3

A schematic of a steered MD (SMD) force vs. time curve (sample output of an MD simulation), which is studied to investigate the distinction between bases
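A force vs. time curve of this kind is commonly obtained from a virtual spring. The sketch below assumes the widely used constant-velocity steered MD setup, in which the pulled atom is tethered by a harmonic spring to an anchor moving at constant speed along the pulling axis, so that the force is F = k(vt − (x − x0)). The spring constant, velocity, and trajectory are hypothetical values chosen for illustration and are not taken from the simulations described here.

```python
def smd_force(k, v, t, x, x0=0.0):
    """Spring force on the pulled atom in constant-velocity SMD.

    k  : spring constant (assumed value)
    v  : anchor velocity along the pulling axis (assumed value)
    t  : elapsed time
    x  : current position of the pulled atom along the pulling axis
    x0 : initial position of the pulled atom
    """
    return k * (v * t - (x - x0))


def force_curve(k, v, times, positions):
    """Force at each recorded frame of a (hypothetical) 1-D trajectory."""
    x0 = positions[0]
    return [smd_force(k, v, t, x, x0) for t, x in zip(times, positions)]


# Hypothetical trajectory: the atom lags behind the anchor, then slips forward
times = [0.0, 1.0, 2.0, 3.0]
positions = [0.0, 0.5, 2.0, 2.5]
forces = force_curve(k=2.0, v=1.0, times=times, positions=positions)
```

In this picture, the force rises while the pulled atom lags behind the moving anchor and drops when it slips forward, which is what produces peaks in a force vs. time curve.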

In addition, hexagonal boron nitride (hBN) is less hydrophobic than graphene. The thickness of hBN is comparable to the spacing between nucleotides (0.32–0.52 nm) in single-stranded DNA (ssDNA) (Zhao et al. 2014). It also has other advantages over graphene, namely its insulating behavior in high-ionic-strength solution and the smaller number of defects introduced during manufacturing (Liu et al. 2013).

Several theoretical and experimental studies have demonstrated the capabilities of MoS2 as a monolayer material in the form of a nanopore or nanoribbon (Feng et al. 2015a; Graf et al. 2019a; Liu et al. 2014). Moreover, graphene (Traversi et al. 2013), WS2 (Danda et al. 2017), and BN (Liu et al. 2013) have been shown to detect DNA translocation. So far, no solid-state nanopore has shown single-base resolution. It is therefore crucial to continue studies that identify new materials, and two prime candidates are phosphorene and silicene, as mentioned earlier (Jose and Datta 2014; Zereshki et al. 2018). Both materials have properties that are well suited to base identification (Chen et al. 2017). Moreover, the biocompatibility and hydrophilicity of phosphorene make it appropriate for biosensing applications (Cortés-Arriagada 2018; Kumawat et al. 2018).

DNA translocates through solid-state nanopores very fast, at 0.01–1 μs per base (Heerema et al. 2018). Ideally, the DNA translocation velocity in a nanopore should be 1–100 bases per microsecond to provide an acceptable signal from each nucleotide (Akahori et al. 2017). Thus, it is essential to slow DNA down during translocation. Different parameters have been examined for controlling translocation speed, such as temperature (Wanunu et al. 2008), electrolyte viscosity (Feng et al. 2015a), driving voltage (Liang et al. 2013), and ion concentration (Luan and Aksimentiev 2010). Alternative methods such as a two-nanopore system (Langecker et al. 2011), optical tweezers (Keyser et al. 2006), optical trapping of a single DNA molecule (Kim and Lee 2014), and magnetic tweezers (Peng and Ling 2009) have also been used. A schematic of various nanopore sequencing approaches is depicted in Fig. 4.

Fig. 4

Schematic of various approaches for sequencing, specifically by exerting force on the strand to pull it. a Optical trapping, b magnetic trapping, and c molecular dynamics simulation
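To see why translocation speed matters for detection, the per-base dwell times quoted above for solid-state nanopores (0.01–1 μs per base) can be compared with the rate at which the ionic current is sampled. The sketch below is a back-of-the-envelope calculation; the 1 MHz sampling rate is a hypothetical value chosen for illustration, not a figure from the cited studies.

```python
def samples_per_base(dwell_time_us, sampling_rate_mhz):
    """Number of current samples recorded while one base occupies the pore.

    dwell_time_us     : time each base spends in the pore, in microseconds
    sampling_rate_mhz : sampling rate in MHz (samples per microsecond)
    """
    return dwell_time_us * sampling_rate_mhz


RATE_MHZ = 1.0  # hypothetical sampling rate, an assumption for illustration

for dwell in (0.01, 1.0):  # dwell times per base quoted in the text, in microseconds
    n = samples_per_base(dwell, RATE_MHZ)
    print(f"{dwell} us/base -> {n:g} samples per base")
```

Under this assumption, at most about one sample is recorded per base, which leaves no room for averaging out noise and illustrates the motivation for slowing the strand down.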

Conclusion

The first-generation methods, though revolutionary, suffered from disadvantages such as high cost and the ability to sequence only short strands. The second-generation techniques brought genome sequencing to a reasonable scale of time and cost and enhanced throughput, but they still required DNA amplification, which introduces errors. The third generation went several steps further and attained traits such as direct sequencing, longer reads, real-time sequencing, and single-molecule operation. There is good reason to be optimistic about the future of DNA sequencing based on new technologies, but there are still obstacles that researchers must overcome. Quantum simulations seem especially promising: they are more reliable but also more cumbersome, since they carry much higher computational costs.