Introduction

The blueprint of a living organism lies in its DNA, which contains the instructions for the organism to grow and develop. In the last two decades, genome sequencing has greatly advanced. Currently, the NCBI database (https://www.ncbi.nlm.nih.gov/) holds information on 30,530 eukaryotic genomes (representing 12,205 species), of which 5119 are complete or at chromosome level (accessed on 5 March 2024; Fig. 1). From these sequencing efforts, it became clear that the complexity of an organism is not necessarily reflected in the number of its genes. For instance, the numbers of genes in humans (International Human Genome Sequencing Consortium 2001; Venter et al. 2001) and in a roundworm (C. elegans Sequencing Consortium 1998) are not that far apart. A large part of the complexity lies in how gene expression is regulated and, ultimately, in how many proteins this can produce. Genome information drives the discovery of biological insights into how organisms function and into their evolutionary history, as well as biotechnological innovation. In agriculture, genome information supports modern breeding, climate adaptation, and food security, among others. It does not stop here: genome sequencing efforts continue around the world. One large effort is the Earth BioGenome Project, which aims to sequence every named eukaryotic species on our planet, around 2 million species (Lewin et al. 2018; Ebenezer et al. 2022). A genomic tree of life is intended to aid our understanding of how species change, adapt, and rely on one another across an ecosystem. These discoveries are expected to help resolve long-standing problems in phylogenetics, evolution, ecology, conservation, agriculture, the bioindustry, and medicine (Blaxter et al. 2022).

Fig. 1

Sequenced genomes of plant species. a The plant kingdom stands as the third-most sequenced domain of life, as evidenced by the cumulative number of sequenced species. b Boxplot of sequenced species across the main clades of the plant kingdom. c Graphical representation of the progression in plant genome sequencing since 2000. The bars illustrate the distribution of plant genomes at both chromosomal and non-chromosomal levels. The green line tracks the annual sequencing rate of species, while the salmon shaded area represents the cumulative count of sequences through March 2024. For the latter two, use the values on the right y-axis. d Chronology of the sequencing of key agriculturally and scientifically important plant species. In a, the data for animals, fungi, protists, and other domains of life were acquired from the NCBI database (https://www.ncbi.nlm.nih.gov/). Sequenced plant species counts were obtained from https://www.plabipd.de/, with the information updated on 19 February 2024. In b, species count data and genome sequencing details at both chromosomal and non-chromosomal levels were obtained from the NCBI database. Species counts were verified and updated using information from https://www.plabipd.de/. In d, the chronology was constructed using data obtained from the NCBI database, and some of the images were generated using BioRender.com

In this review, we give an overview of the status of (nuclear) plant genome sequencing efforts and of how these efforts have advanced studies on plant functional genomics.

The status of sequenced plant genomes

Information on plant genome sequences enormously facilitates studies on plant biology, genetics, development, evolution, and molecular biology, among many other fields. The first sequenced plant genome, that of Arabidopsis thaliana, was published in the year 2000 (Arabidopsis Genome Initiative 2000). This model plant is used worldwide, and its genome sequence brought the plant field into the genomics era. For a historical overview of Arabidopsis, we refer to other reviews (Meyerowitz 2001; Provart et al. 2016, 2021; Somssich 2019). Arabidopsis has a genome size of around 135 Mb and, based on the latest Araport11 re-annotation, has 27,655 protein-coding loci with 48,359 transcripts (Cheng et al. 2017). Various dedicated websites house data for the community, such as The Arabidopsis Information Resource (TAIR; Rhee et al. 2003), Araport (Cheng et al. 2017; Pasha et al. 2020), ThaleMine (Krishnakumar et al. 2017; Pasha et al. 2020), and the Bio-Analytic Resource (BAR; Toufighi et al. 2005).

Nowadays, plant genome sequencing is a very active field (Michael and Jackson 2013; Chen et al. 2018; Kersey 2019; Marks et al. 2021; Kress et al. 2022; Sun et al. 2022). Since the publication of the Arabidopsis genome in December 2000 (Arabidopsis Genome Initiative 2000), 4604 nuclear plant genomes have been sequenced, corresponding to 1482 plant species, most of them (90%) angiosperms (Figs. 1 and 2). These genome data are based on information from the NCBI database (accessed on 5 March 2024; https://www.ncbi.nlm.nih.gov/genome/browse#!/overview/plants), and from the website Published Plant Genomes, which visualizes sequenced plant genomes over time (https://www.plabipd.de/; R. Schwacke, personal communication, 19 February 2024). The second plant species to have its genome sequenced was rice, with two subspecies (Oryza sativa subsp. japonica and subsp. indica; Goff et al. 2002; Yu et al. 2002). The first genome of a tree, poplar, followed in 2006 (Populus trichocarpa; Tuskan et al. 2006), and the genome of grape, the first of a fruit-producing species, in 2007 (Vitis vinifera; Velasco et al. 2007). In the second decade of sequencing, the number of genome reports per year went up exponentially (Fig. 1).

Fig. 2

Genome size and species count across plant clades. a Range of genome size within each clade of plant classification, with data points denoting the minimum and maximum genome sizes for each clade. b Bars illustrating the distribution of the number of species within each clade of plant classification. The plant classification used is based on the taxonomy provided by https://www.plabipd.de/, which was updated on 19 February 2024

In the last five years alone, the number of sequenced nuclear plant genomes increased impressively, from around 576 (reflecting 383 species) (Kersey 2019), 798 (reflecting 798 species) (Marks et al. 2021), 1031 (reflecting 788 species) (Sun et al. 2022), and 1139 (reflecting 812 species) (Kress et al. 2022), to the 4604 genome sequences (reflecting 1482 species) that have been reported to date (5 March 2024; Table S1). This increase is due to improvements in sequencing technologies and lower costs (Shendure et al. 2017; Michael and VanBuren 2020; Henry 2022). One measure of the quality of a genome assembly is the contig N50: the length of the shortest contig in the smallest set of longest contigs that together contain at least 50% of the assembly length. This value has greatly improved over the years (Fig. 3a). It is low (< 1 kb or < 10 kb) when a short-read sequencing approach is used (e.g., Illumina), whereas with long-read sequencing approaches such as those from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), the contig N50 is hundreds of kb to several Mb, resulting in much higher quality genome assemblies (Michael and Jackson 2013; Belser et al. 2018; Kersey 2019; Michael and VanBuren 2020; Marks et al. 2021; Sharma et al. 2021; Sun et al. 2022).
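To make the contig N50 definition concrete, the following minimal Python sketch computes it from a list of contig lengths. The function name and the toy lengths are illustrative assumptions and are not taken from any of the assemblies discussed here.

```python
def contig_n50(contig_lengths):
    """Return the contig N50 for a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig downwards until half of the
    # assembly length is covered; the contig that crosses the
    # threshold defines the N50.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly: seven contigs, 300 bp in total.
toy_assembly = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_assembly))
```

In this toy case the two longest contigs (80 + 70 bp) already cover half of the 300 bp assembly, so the N50 is 70. A fragmented short-read assembly consisting of many small contigs would give a correspondingly small value.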

Fig. 3

Comparative analysis of genome size and protein-coding genes in annotated plant genomes, and assembly statistics of contig N50 over time for sequenced plant species. a Distribution of assembly statistics: Contig N50 over time for the 1482 sequenced plant species; data obtained from the NCBI Database (https://www.ncbi.nlm.nih.gov/). The green points represent assemblies based on long-read sequencing methods, while the purple points represent assemblies based on short-read sequencing methods. b The graph illustrates the distribution of the genome size and the number of protein-coding genes (the pink dashed line indicates the mean number of genes per genome: 34,071) in the 685 available annotated plant genomes, utilizing taxonomic classifications from the NCBI database (https://www.ncbi.nlm.nih.gov/). Points are colored by assembly level, and the figure represents a clade of the Plant Kingdom

The estimated number of extant green plant species is around 450,000–500,000 (Corlett 2016; Lughadha et al. 2016). The number of green plant species with sequenced genomes (1482) represents roughly 0.3% of plant species, so only a small fraction has been sequenced so far. Despite an uneven distribution, the reported genomes span around 500 million years of evolution and comprise the major clades of green plants (Viridiplantae) (Fig. 2). Nuclear plant genome size varies greatly among the sequenced species, from 9 Mb to 31 Gb (Fig. 2). In contrast to this more than 3000-fold difference in genome size, the number of protein-coding genes per genome varies much less, only in the range of a few-fold difference (Fig. 3b). Based on the 685 available annotated plant genomes depicted in Fig. 3b, the mean number of protein-coding genes is 34,071 (Table S1). Large genome sizes are attributed in part to polyploidy events, which are common in plants, but mainly to the activity of transposable elements (Michael and Jackson 2013; Michael 2014; Kersey 2019; Kress et al. 2022; Marks et al. 2021).

Furthermore, the model species and many agriculturally and economically important plant species have been sequenced (Figs. 1 and 2). Without doubt, the number of sequenced genomes will soon increase and their phylogenetic distribution will expand, because of the many current genome initiatives. One project affiliated with the Earth BioGenome Project (Lewin et al. 2018, 2022) is the Darwin Tree of Life Project, which aims to sequence all 70,000 species in Britain and Ireland (Darwin Tree of Life Project Consortium 2022). Another example is the 10KP (10,000 Plants) Initiative, which aims to sequence genomes of 10,000 species representing every major clade of embryophytes (land plants), green algae (chlorophytes and streptophytes), and protists (photosynthetic and heterotrophic) (Cheng et al. 2018). Other initiatives are the African BioGenome Project (AfricaBP), aiming to sequence genomes of 105,000 endemic species, including plants (Ebenezer et al. 2022); the African Orphan Crops Consortium (AOCC), aiming to sequence 101 African orphan crops/trees (Hendre et al. 2019); and the Genomics for Australian Plants (GAP) consortium, aiming to sequence representative Australian plant genomes across the plant tree of life (Genomics for Australian Plants Initiative 2018; McLay et al. 2022).

In most cases, the genome of one individual of a species is sequenced and then used as the reference genome. However, this is unlikely to give the complete picture, because genetic differences may exist among individuals of a species. To address this, the term pan-genome was coined. It refers to the 'whole' genome within a species (Golicz et al. 2020; Bayer et al. 2020). The first report was based on the sequencing of eight bacterial strains and the observation that not every gene was present in each strain (Tettelin et al. 2005). A pan-genome can be made by sequencing different individuals, accessions, cultivars, or populations and then 'joining' the information so that, in principle, the whole genetic diversity is captured (Lei et al. 2021; Li et al. 2022). In plants, the first pan-genome was made for wild soybean (Glycine soja), by sequencing and de novo assembly of seven phylogenetically and geographically representative accessions (Li et al. 2014). To date, around 30 plant pan-genomes, mostly of crops, have been published (Li et al. 2022). To create pan-genomes, long-read sequencing is generally used. For re-sequencing efforts, short-read sequencing is normally used, which allows the detection of single nucleotide polymorphisms (SNPs), but makes structural variants (SVs) more difficult to identify (Golicz et al. 2020).
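The idea of 'joining' gene content across accessions can be illustrated with a small Python sketch. It assumes that gene presence has already been determined per accession and uses the common pan-genome terms "core" (genes present in every accession) and "dispensable" (genes missing from at least one accession); the accession and gene names are hypothetical.

```python
def partition_pan_genome(gene_sets):
    """Split gene content into pan, core, and dispensable gene sets.

    gene_sets maps an accession name to the set of genes found in it.
    """
    all_sets = [set(genes) for genes in gene_sets.values()]
    if not all_sets:
        raise ValueError("need at least one accession")
    pan = set().union(*all_sets)          # every gene seen anywhere
    core = set.intersection(*all_sets)    # genes shared by all accessions
    dispensable = pan - core              # genes absent from some accession
    return pan, core, dispensable


# Hypothetical presence data for three accessions.
toy_accessions = {
    "accession_A": {"g1", "g2", "g3"},
    "accession_B": {"g1", "g2", "g4"},
    "accession_C": {"g1", "g3", "g5"},
}
pan, core, dispensable = partition_pan_genome(toy_accessions)
print(len(pan), sorted(core), sorted(dispensable))
```

Here a single reference (for example accession_A alone) would miss g4 and g5, which is the limitation that a pan-genome is meant to overcome. Real pan-genome construction additionally requires assembly, annotation, and orthology assignment, none of which is shown here.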

For comparative plant genomics, we refer readers to the useful website Phytozome (Goodstein et al. 2012).

How plant genomes facilitate plant functional genomics

Gene function discovery using mutant collections

With the availability of genome sequences, the identification of gene functions via mutant screens became much easier. In classical forward genetic screens, going from a phenotype to the probable causal mutation induced by ethyl methanesulfonate (EMS) mutagenesis involved long and laborious mapping strategies. Nowadays, mapping can be performed by sequencing the genomes of a population of backcrossed homozygous plants with the phenotype of interest, which allows the rapid identification of the causal mutation (Hartwig et al. 2012; Garcia et al. 2016).
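The logic of mapping by sequencing can be sketched in a few lines of Python. The sketch assumes that variants have already been called from the pooled backcrossed mutants, that EMS mainly induces G-to-A and C-to-T transitions, and that the causal mutation should be (nearly) fixed in the pool; the record layout, thresholds, and toy values are hypothetical and do not reproduce the pipelines of the cited studies.

```python
from dataclasses import dataclass

# Assumption: EMS predominantly causes G>A and C>T transitions.
EMS_TRANSITIONS = {("G", "A"), ("C", "T")}


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    ref_reads: int
    alt_reads: int

    @property
    def alt_frequency(self):
        depth = self.ref_reads + self.alt_reads
        return self.alt_reads / depth if depth else 0.0


def candidate_causal_variants(variants, min_frequency=0.9, min_depth=10):
    """Keep EMS-type variants that are close to fixation in the pool."""
    return [
        v for v in variants
        if (v.ref, v.alt) in EMS_TRANSITIONS
        and v.ref_reads + v.alt_reads >= min_depth
        and v.alt_frequency >= min_frequency
    ]


# Hypothetical variant calls from a pool of mutant plants.
toy_variants = [
    Variant("Chr1", 1000, "G", "A", 12, 13),  # segregating, unlinked
    Variant("Chr3", 5000, "G", "A", 0, 30),   # fixed, EMS-type
    Variant("Chr3", 5200, "C", "T", 1, 24),   # nearly fixed, linked
    Variant("Chr3", 9000, "A", "G", 0, 28),   # fixed, but not EMS-type
    Variant("Chr5", 200, "C", "T", 0, 4),     # too little coverage
]
for v in candidate_causal_variants(toy_variants):
    print(v.chrom, v.pos, round(v.alt_frequency, 2))
```

The remaining candidates cluster on one chromosome, and they would then be prioritized by their predicted effect on annotated genes, which is where the reference genome annotation becomes essential.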

In reverse genetic screens, one starts with a gene of interest and determines its phenotype/function (Alonso and Ecker 2006). For 20 years, the Arabidopsis community has used insertional T-DNA mutant collections for which sequence information is available for most of the random T-DNA insertions in the genome; arguably, the most widely used is the SALK T-DNA collection (Alonso et al. 2003). Various other valuable sequenced collections of T-DNA insertions, transposon insertions, or variations thereof are available for Arabidopsis (Samson et al. 2002; Sessions et al. 2002; Rosso et al. 2003; Woody et al. 2007), and for other model species such as rice (Wang et al. 2013; Wei et al. 2013), maize (Lu et al. 2018), and petunia (Vandenbussche et al. 2008, 2016).

Various other techniques for gene function discovery also benefit from genome information. An example of a reverse genetics approach to find mutations is TILLING (Targeting Induced Local Lesions IN Genomes), a random chemical mutagenesis approach followed by high-throughput screening for point mutations in targeted genomic regions. The screening part can be combined with high-throughput sequencing (McCallum et al. 2000; Henikoff et al. 2004; Tadele 2016). Another frequently used approach is activation tagging to identify gain-of-function mutants. For this, a mutant population is made by random genome insertions of T-DNAs or transposons carrying an activation sequence, leading to the activation of nearby genes. Recovering the flanking sequence and identifying the corresponding genome region leads to the discovery of the gene in question (Weigel et al. 2000; Marsch-Martinez et al. 2002; Tani et al. 2004).

Other reverse genetics approaches for gene function discovery involve making dedicated constructs that target one or more genes of interest. RNA interference (RNAi) (Saurabh et al. 2014; Muhammad et al. 2019) or the fusion of a transcriptional repression domain (EAR domain) (Hiratsu et al. 2003; Mitsuda et al. 2011) can be used to obtain loss-of-function mutants. Another approach is the use of artificial miRNAs (amiRNAs) to silence genes. An amiRNA can be designed to silence one gene or a family of redundant genes (Schwab et al. 2006; Ossowski et al. 2008). A last example, still relatively new but already very actively used, is the CRISPR-Cas system (Wada et al. 2020; Zhu et al. 2020; Gaillochet et al. 2021). The guide RNAs (gRNAs) used are typically directed towards coding regions, but can also be directed towards promoters or non-coding regions. Furthermore, multiple gRNAs can be cloned in the same vector to target different genes (Najera et al. 2019) or promoters (Rodríguez-Leal et al. 2017). With genome information available, genome-wide screens can be made using pooled CRISPR libraries (Huang et al. 2022; Liu et al. 2023; Pan et al. 2023), and various reports have already been published, such as in rice (Lu et al. 2017; Meng et al. 2017), tomato (Jacobs et al. 2017), soybean (Bai et al. 2020), maize (Liu et al. 2020), and canola (He et al. 2023).
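Why a genome sequence is a prerequisite for gRNA design can be shown with a simple scan for candidate target sites. The sketch below assumes the commonly used Cas9 from Streptococcus pyogenes, which requires an NGG protospacer-adjacent motif (PAM) directly 3' of a 20-nt target; the function name and the toy sequence are hypothetical, only the forward strand is scanned, and no off-target scoring is attempted.

```python
import re


def find_protospacers(sequence, spacer_length=20):
    """Return (start, protospacer, pam) tuples for NGG PAM sites.

    Only the forward strand is scanned; a real design tool would also
    scan the reverse complement and score off-targets genome-wide.
    """
    sequence = sequence.upper()
    # A lookahead is used so that overlapping candidate sites are found.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % spacer_length)
    return [(m.start(), m.group(1), m.group(2))
            for m in pattern.finditer(sequence)]


# Hypothetical 25-nt stretch of coding sequence.
toy_sequence = "ATGC" * 5 + "AGG" + "TT"
for start, spacer, pam in find_protospacers(toy_sequence):
    print(start, spacer, pam)
```

For a pooled, genome-wide library, the same kind of scan would be run over every annotated gene, and each candidate would be checked against the rest of the genome to avoid unintended targets.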

The use of CRISPR systems, for 'traditional' genome editing or for gene activation/repression, may fill the gap in functional genomics for plant species beyond the model species currently used (Huang et al. 2022; Liu et al. 2023; Pan et al. 2023). With the use of pooled CRISPR libraries, massive plant transformation could be applied in different species. Sharing whole-genome gRNA library data, pooled libraries, and even complete transformed CRISPR mutant populations in the form of seeds could boost functional studies. As mentioned above, 4604 nuclear plant genomes have been sequenced, corresponding to 1482 plant species (Fig. 1); yet, by a rough estimate, most functional genomics research so far is performed in only 1–2% of the plant species with genome information. The future holds interesting opportunities for the use of genome information.

OMICS technologies

In addition to genomics, there are now many other omics technologies available. All these technologies benefit greatly from genome information. Many efforts are generating plant transcriptomes from model species but also from non-model species, even from species with no genome information yet. For the latter, sequence reads are mapped against the genome of the evolutionarily closest species, or reads are mapped (and gene expression quantified) against a de novo assembled transcriptome from the target organism. In general, transcriptome information also helps to improve genome annotations. Many databases exist to explore transcriptome data, such as BAR (Winter et al. 2007), Genevestigator (Zimmermann et al. 2004), and the Plant Public RNA-seq Database (Yu et al. 2022). Other databases contain data from large initiatives like the 1KP (1000 Plants), where transcriptomes of 1124 species were sequenced to infer phylogenomic relationships (Matasci et al. 2014; Leebens-Mack et al. 2019). Another initiative is the JGI Plant Gene Atlas, which contains almost 2100 RNA-Seq data sets collected from 18 plant species, with the aim of improving functional gene descriptions across the plant kingdom (Sreedasyam et al. 2023). Recently, a great number of specialized single cell and single nuclei transcriptome data sets have been emerging (reviewed in: Seyfferth et al. 2021; Cervantes-Perez et al. 2022; Denyer and Timmermans 2022; Nolan and Shahan 2023; Zheng et al. 2023), as have databases holding single cell transcriptome data (e.g., Ma et al. 2020; Wendrich et al. 2020; Chen et al. 2021a; He et al. 2023).

Plant proteomics is also a large field that benefits from genome and transcriptome information, first of all to be able to predict all proteins and isoforms (Chen et al. 2021a, b; Mergner and Kuster 2022). Many proteomic studies, from small to very large, and even pan-plant proteomes have been reported in the literature (e.g., McWhite et al. 2020; Mergner et al. 2020; van Wijk et al. 2021, 2024).

Proteogenomics is an omics area of growing significance that can improve draft plant genomes, correct gene annotations, discover new translation initiation sites, ORFs, and alternative splicing events, and verify novel genes at the peptide/protein level (Nesvizhskii 2014; Song et al. 2023). The usefulness of proteogenomics has been illustrated, for instance, for the model organism Arabidopsis (e.g., Castellana et al. 2008; Zhu et al. 2017; Willems et al. 2017, 2022). Recent examples of proteogenomics in other species are for sweet cherry and pear (Xanthopoulou et al. 2021; Wang et al. 2023).

Another major omics technology is metabolomics, which is a good tool for functional genomics (Schauer and Fernie 2006). It is a powerful technique to analyze the metabolite content of plants and is less dependent on genome information or restricted to model species. A limitation for metabolomics in some (non-model) plants, however, is the lack of high-quality metabolite databases, such that some molecules cannot easily be unambiguously identified. On the other hand, combining different types of omics data can lead to the discovery of gene functions and help in future plant improvement (Kumar et al. 2017; Patel et al. 2021; Shen et al. 2023).

Evolution and domestication

Genome information facilitates the study of phylogenetic relationships among species. Furthermore, the importance of genes or gene families in the evolution of land plants can be studied (Yu et al. 2018; Leebens-Mack et al. 2019; Soltis and Soltis 2021; Guo et al. 2023). Another example facilitated by genome information is the study of domestication. Hundreds of plant species have been domesticated by humans by selecting for beneficial traits (Gepts 2004; Meyer and Purugganan 2013). Through candidate gene studies, quantitative trait locus (QTL) mapping and cloning, genome-wide association studies (GWASs), and whole-genome resequencing studies, a significant number of domestication or domestication-related genes have been discovered and isolated (Meyer and Purugganan 2013; Kantar et al. 2017). More recently, reports on pan-genomes also facilitate the study of evolution and domestication, and the identification of key genes associated with important agronomic traits (Li et al. 2022).

Interestingly, de novo domestication by genome editing has also been achieved (Bartlett et al. 2023). For instance, using CRISPR-Cas9, this has been done in a wild tomato species (Li et al. 2018; Zsögön et al. 2018), in the Solanaceae species 'groundcherry' (Lemmon et al. 2018), and in wild rice (Yu et al. 2021). Knowledge of domestication genes was used to edit several of these genes at once, resulting directly in a 'crop' with desirable agricultural traits.

Conclusion and perspective

In recent years, the number of sequenced plant genomes has increased at an incredible speed. It is clear that this will only continue, and in the near future we will have tens of thousands of sequenced plant genomes. This wealth of information will accelerate studies on plant biology, functional genomics, evolution of genomes and genes, domestication processes, phylogenetic relationships, among many others. In parallel, new and improved bioinformatics analysis methods will have to be developed.

The field of single cell genomics will also expand, and this will come with technical challenges such as capturing more cells, capturing low-abundance cells, cell-type annotation, and new sequencing and analysis methods (Efroni and Birnbaum 2016; Conde and Kirst 2022; Cuperus 2022). Moreover, this will not only apply to transcriptomics: in all omics fields we are going to see a rapid expansion, from single cell omics, single cell multi-omics, spatial genomics and other omics, and new omics analysis methods, to the inference of gene regulatory networks using single cell omics data, among others (Thibivilliers and Libault 2021; Clark et al. 2022; Yu et al. 2023; Baysoy et al. 2023).

The genome evolution and phylogenomic research field will have an ever-growing amount of data available for analyses. Furthermore, there is great potential for the use of functional genomics data for genome editing of crops and for the de novo domestication of future crops using this same technology (Fernie and Yan 2019; Zhou et al. 2020; Zaidi et al. 2020; Gao 2021; Kumar et al. 2022; Yu and Li 2022; Bartlett et al. 2023). Importantly, when it comes to crop yield, knowledge of how to properly evaluate it is required (Khaipho-Burch et al. 2023).

Lastly, artificial intelligence (AI) is certainly going to play a role in the plant science fields discussed here. Predictive models and analysis methods are being developed based on machine learning (ML) and deep learning (DL) (Wang et al. 2020; van Dijk et al. 2021; Xu et al. 2021; Holzinger et al. 2023). Besides ChatGPT, a tool for asking questions or writing texts, among other tasks (OpenAI; https://chat.openai.com/chat), probably one of the best-known tools in the life sciences is AlphaFold and its successor AlphaFold2, a model that can predict almost all protein tertiary structures (Senior et al. 2020; Jumper et al. 2021). Other examples are the use of AI in image analysis and image-based phenotyping, and in autonomous robots and/or drones for plant phenotyping, pest management, fertilizer management, or harvesting (Harfouche et al. 2023; Holzinger et al. 2023; Murphy et al. 2024). Furthermore, AI can be applied in bioinformatic analysis to improve genome annotations, to predict specific motifs in regulatory regions with high accuracy, to predict gene function, or to predict the important nucleotide region or gene(s) in EMS screens or QTL analysis. These are just a few examples of the many possibilities for the use of AI now and in the near future.

In conclusion, plant genomics will undoubtedly remain a cornerstone, actively contributing to the ongoing advancement of plant science and its practical applications.