Background

Sorghum ranks fifth among cereals in production and acreage, behind maize, rice, wheat and barley (http://www.fao.org). It is cultivated across vast geographic areas in the Americas, Africa, Asia, and Oceania. Sorghum’s excellent agronomic and biological properties, such as heat and drought tolerance, make it a vital grain crop on marginal land, where it can be produced without competing against other major food crops [1]. As the world population grows and water resources decline, sorghum is expected to become an increasingly preferred food crop worldwide. Furthermore, sorghum is not only harvested for grain but is also used for syrup production, grazing and biomass production [2].

As a model organism that carries out C4 photosynthesis, sorghum was the second cereal crop to be sequenced, after the C3 organism rice [3, 4]. Its comparatively small genome makes sorghum a potential genetic model for the design of bioenergy crops, in contrast to the larger and more repetitive genomes of other major C4 crops such as maize and sugarcane. With the improvement of the reference genome (BTx623) [4, 5] and the development of sequencing technologies, studies on sorghum domestication and on the genetic mechanisms underlying distinct phenotypes have been greatly accelerated [2, 6–17].

During the past decade, diverse web resources have been constructed to present numerous omics data, which benefit the sorghum research community (Table 1). Plant-specific genome databases such as Phytozome [18] and Gramene [19], as well as the comprehensive Genomes OnLine Database (GOLD) [20], are widely used as data sources and analysis platforms for sorghum research. In addition, secondary plant databases that include sorghum, such as PIGD [21], PlantTFDB [22], DNApod [23], PceRBase [24], PtRFdb [25] and GreenPhylDB [26], contain important modules on sorghum resources. Finally, the sorghum-specific secondary databases, including MOROKOSHI [27], PGSB [28], SorghumFDB [29], Sorghum QTL Atlas [30], and Sorghum Genomics, are websites dedicated to sorghum research. Among them, SorghumFDB is the most comprehensive sorghum-specific database; it contains extensive public genomic and functional annotation data, as well as useful analysis tools. Using published genome re-sequencing data from 48 sorghum accessions, we developed a sorghum SNP database (SorGSD) in 2016, providing the sorghum user community with abundant SNPs and other resources related to sorghum genetics and genomics [31].

Table 1 Online databases for sorghum genome

Here, we announce and describe the second major release of the sorghum genome science database (SorGSD). The goal of the redesign is to construct a comprehensive database of sorghum genomic variations and phenotypes. Compared with the first version of SorGSD, which contains SNPs from 48 sorghum accessions, the second version provides a more extensive set of genomic variation data, covering both SNPs and small INDELs from 289 sorghum accessions, as well as characteristic phenotypic information and panicle pictures of critical sorghum lines. The new release also provides three useful tools: ID Conversion, Homologue Search and Genome Browser. The back-end database framework and the web interface were redesigned as a part of the Genome Variation Map at the National Genomics Data Center (NGDC) of the China National Center for Bioinformation (CNCB). We hope that these data and tools will be beneficial for exploring genetic variation and for evolutionary studies of sorghum and other species. The new version of SorGSD is freely accessible at http://ngdc.cncb.ac.cn/sorgsd/.

Results and discussion

New data contents

The new version of SorGSD was mainly built on the sorghum reference genome BTx623 (v3.1), which has an improved assembly and improved gene annotations [5]. Currently, SorGSD contains 33,825,236 SNPs and 5,722,385 small INDELs identified from the re-sequencing data of 289 sorghum lines [6, 32, 33], including three accessions of Sorghum propinquum, 50 wild/weedy sorghums and 236 cultivated sorghums (Additional file 1: Table S1). After annotation and calculation, we obtained detailed information about the distribution of variations in different genomic regions, including genic, intergenic, and intronic regions (Table 2). In addition, we collected about 70 kinds of phenotypic data for 183 accessions with a plant ID (PI) from the U.S. National Plant Germplasm System (GRIN-Global), together with panicle pictures of 174 critical accessions taken in our laboratory. We also updated the introduction to the sorghum genome; the list of sorghum resource websites, which covers general information, genome and transcriptome databases, research institutions and sorghum producers around the world; and the critical references on sorghum genetics and genomics.

Table 2 Distribution of variations in different genomic regions

New features of the database

SorGSD is free and open to the public and offers comprehensive functions (Fig. 1; Additional file 2: Table S2). In this update, we placed the main page under the National Genomics Data Center of the China National Center for Bioinformation (CNCB-NGDC) (Fig. 1a, h) [34]. Links to each page are shown in the menu bar (Fig. 1b), and a simple welcome message is displayed under the menu bar (Fig. 1c). Four shortcuts to core functions and a citation prompt can be found on the home page (Fig. 1d, e). Our laboratory’s major publications and the website browsing history can be accessed easily on the right side (Fig. 1f, g).

Fig. 1

Schematic diagram of the SorGSD home page. The background of CNCB-NGDC is shown in a and h. The menu bar (b), welcome message (c), shortcuts to core functions (d) and citation prompt (e) are placed from top to bottom. Our laboratory’s major publications (f) and the website browsing history (g) can be accessed on the right side

Notably, the original version is still up and running: users can browse it by clicking the “V1.0” button on the menu bar and switch back to the new version by clicking the “V2.0” button in the old version. We optimized the presentation interface to make it easier for users to search for variations. Phenotypic details of each accession can be searched directly. The browsing interface for critical references was redesigned for a better user experience. We also provide three new tools: ID Conversion, Homologue Search and Genome Browser. Online documentation is provided to help users become familiar with the database. More details are given below.

Improved variation search function

Users may search for variations by specifying the variation type, genome position or gene ID. It is also possible to filter variations by consequence type and minor allele frequency (MAF). In our previous work, we found that the Dry gene encodes a plant-specific NAC transcription factor and carries a few loss-of-function mutations in sweet sorghum [33]. An inframe deletion (Chr06:50898132) within the conserved functional NAC domain could turn a pithy stem into a juicy stem, which is one reason for the origin of sweet sorghum. Here we take the Dry gene as an example and search for this inframe deletion (Chr06:50898132). First, enter the “Variation Search” page and choose “INDELs” as the variation type; second, type the version 3.1 gene ID (Sobic.006g147400) in the “Gene ID” edit box; third, tick “inframe deletion” in “MODERATE” under “Consequence Type”; finally, click “Submit” to obtain the list of target small INDELs in the Dry region on the right-hand side of the page (Fig. 2a).
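The filtering logic behind such a search can be sketched in a few lines of Python. This is only an illustration of combining the filters described above (variation type, gene ID, consequence type and a MAF threshold); it is not the SorGSD implementation. The `Variant` record, the `search_variants` function, the variant IDs, the consequence labels and all MAF values are hypothetical placeholders. Only the gene ID and the position Chr06:50898132 are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; field names are assumptions."""
    variant_id: str
    var_type: str       # "SNP" or "INDEL"
    chrom: str
    pos: int
    gene_id: str
    consequence: str
    maf: float          # placeholder values below, not real data


def search_variants(
    variants: Iterable[Variant],
    var_type: Optional[str] = None,
    gene_id: Optional[str] = None,
    consequence: Optional[str] = None,
    min_maf: float = 0.0,
) -> List[Variant]:
    """Return the variants that pass every filter that was supplied."""
    hits = []
    for v in variants:
        if var_type is not None and v.var_type != var_type:
            continue
        # Gene IDs are compared case-insensitively.
        if gene_id is not None and v.gene_id.lower() != gene_id.lower():
            continue
        if consequence is not None and v.consequence != consequence:
            continue
        if v.maf < min_maf:
            continue
        hits.append(v)
    return hits


# Toy records: only the first position comes from the text.
RECORDS = [
    Variant("toy_indel_1", "INDEL", "Chr06", 50898132,
            "Sobic.006g147400", "inframe_deletion", 0.20),
    Variant("toy_indel_2", "INDEL", "Chr06", 50897000,
            "Sobic.006g147400", "frameshift_variant", 0.05),
    Variant("toy_snp_1", "SNP", "Chr06", 50896500,
            "Sobic.006g147400", "missense_variant", 0.30),
]

hits = search_variants(
    RECORDS,
    var_type="INDEL",
    gene_id="Sobic.006G147400",
    consequence="inframe_deletion",
)
for hit in hits:
    print(hit.variant_id, f"{hit.chrom}:{hit.pos}")
```

Leaving a filter as `None` skips it, which mirrors a search form in which unused fields are left empty.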

Fig. 2

Steps and results of variation search. a. The search page of variations. Numbers in a show the steps of the search. b. Detail page of the target variation. c. Detail page of the gene with target variations

In the list, the first entry is the target small INDEL (Fig. 2a). The details of the variation can be obtained by clicking the variation ID. Users may browse the non-redundant and individual variations as text in three tables, in an allele distribution diagram and in the chromosome-based graphical Genome Browser interface (Fig. 2b). The text tables give the variation details (e.g., chromosome location, reference allele and the 5′ and 3′ flanking sequences), the individual alleles and the details of the gene in which the variation is annotated. The allele distribution diagram can be used to infer the evolutionary scenario of each variation during sorghum domestication and improvement. More importantly, the individual alleles of the target variation can be downloaded for subsequent analyses, such as phylogenetic tree construction and association analysis. Users can enter the gene page by clicking the gene ID with a blue background in the “Gene Annotation” table. The gene details, the gene annotation and all the variations located in the gene, including unfiltered SNPs and small INDELs, are listed in three separate tables (Fig. 2c).

Alternatively, all the SNPs located in Dry can be retrieved on the “Variation Search” page (Fig. 2a) through the following steps: (1) choose “SNP” as the variation type; (2) choose “Chr06” as the chromosome; (3) input the physical location (Chr06:50896169.50898604) and submit. The result lists all the SNPs within the Dry locus.
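A position-based query of this kind amounts to parsing a region string and keeping the variants whose coordinates fall inside it. The Python sketch below illustrates the idea under stated assumptions: `Region`, `parse_region` and `in_region` are hypothetical names, and accepting "..", "." or "-" as the separator between start and end is an assumption made for convenience, not documented SorGSD behaviour. Coordinates are treated as 1-based and inclusive.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# Chromosome name, a colon, then start and end separated by "..", "." or "-".
_REGION_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)(?:\.\.|\.|-)(?P<end>\d+)$"
)


def parse_region(text: str) -> Region:
    """Parse a string such as 'Chr06:50896169.50898604' into a Region."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse region: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError("region start must not exceed region end")
    return Region(match["chrom"], start, end)


def in_region(region: Region, chrom: str, pos: int) -> bool:
    """True if a 1-based position lies inside the region (inclusive)."""
    return chrom == region.chrom and region.start <= pos <= region.end


dry_locus = parse_region("Chr06:50896169.50898604")
print(dry_locus)
print(in_region(dry_locus, "Chr06", 50898132))  # the inframe deletion site
```

The same parsed region could drive both a table query and the coordinates passed to a genome browser view.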

New phenotype search function

A user-friendly web interface is provided for browsing and retrieving phenotypic information (Fig. 3). On this page, users can search for sample information using several keywords, including sample ID, plant ID, plant name, origin, taxonomy and usage. For instance, entering “sweet sorghum” in the search box returns all accessions whose individual information contains this keyword (Fig. 3a). Clicking a sample’s picture displays a high-resolution image showing the details of panicle and seed appearance. For example, sample “101” is an improved sweet sorghum from Zimbabwe. Clicking the “Sample ID: 101” tab opens a result page listing the values of all its agronomic traits (Fig. 3b). Users can also reach the phenotype page of an accession from the variation detail page by clicking its “Sample ID” in the “Individual Alleles” table (Fig. 2b).

Fig. 3

Search page (a) for accessions and result page (b) for the target accession

New online tools

SorGSD provides three online tools (ID Conversion, Homologue Search and Genome Browser) for users to analyze their data. ID Conversion converts sorghum gene IDs among the v1.4, v2.1 and v3.1 ID systems, as well as to the IDs of the UniProt and PANTHER databases. When the v3.1 gene ID of the Dry gene (Sobic.006g147400) is typed in the search box and “Convert” is pressed, the corresponding IDs in the other versions and systems are listed in the result table. Users can go directly to the corresponding UniProt and PANTHER pages through the hyperlinks.

To better understand the evolution of sorghum genes, Homologue Search was built to identify homologous genes among sorghum, maize, rice and Arabidopsis. When the gene ID of the Dry gene (Sobic.006g147400) is entered in the “Gene Name” box and “Submit” is clicked, the list of homologues in the other species is displayed. We also provide a Genome Browser to visualize the locus of a variation in the genome. Users only need to type in the genome position (e.g., Dry gene, Chr06:50896169.50898604); the corresponding transcript information of the gene and the positions of SNPs and INDELs in that range then appear on the results page. In addition, we provide a link to the BLAST tool hosted at CNCB-NGDC for comparing nucleotide or protein sequences against the sorghum reference sequence database.

Revised resource page

The resource page is divided into three sections: “Genome”, “Website” and “Reference”. The “Genome” section gives general information about the sorghum genome. From the “Website” page, users can go directly to the homepages of the listed web resources. In “Reference”, we updated 162 vital sorghum publications and classified them into six broad categories. Clicking a category title in the directory on the left of the page lists all papers in that category on the right. Users can read the abstract or download the article from the links by clicking the “Abstract” button.

Conclusions and future directions

SorGSD is committed to providing a wide range of sorghum genome data, including genomic information, detailed phenotypic data, sorghum resources and analysis tools, for sorghum scientists and breeders. The interface of the new version of SorGSD is hosted under CNCB-NGDC and is also an essential part of the Genome Variation Map (GVM), a data repository of genome variations in humans as well as cultivated plants and domesticated animals [35]. In this upgrade, we added whole-genome variation data (including SNPs and small INDELs) for 241 accessions, based on the latest sorghum reference annotation (version 3.1). The total numbers of accessions (289) and variations (39.5 million) are 6 times and 1.4 times those of the first version, respectively. We also added about 70 kinds of trait information for 183 accessions, which provides breeders with detailed reference data for each line. The ID Conversion, Homologue Search and Genome Browser tools provide visual, convenient and quick queries for researchers working on sorghum. In addition, we redesigned the pages to optimize the user experience and make interaction friendlier. The simple and straightforward user guide helps users become familiar with the overall design of the website and learn its various functions quickly.
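The summary figures above follow from counts reported in this article, and the arithmetic can be checked with a short Python sketch. All four inputs are taken from the text; the variable names are illustrative. The 1.4-fold increase in variations is not recomputed here because the first-version variation count is not stated in this article.

```python
# Counts reported in this article.
SNPS = 33_825_236
SMALL_INDELS = 5_722_385
ACCESSIONS_V2 = 289
ACCESSIONS_V1 = 48

total_variations = SNPS + SMALL_INDELS
new_accessions = ACCESSIONS_V2 - ACCESSIONS_V1
accession_fold = ACCESSIONS_V2 / ACCESSIONS_V1

print(f"total variations: {total_variations:,}")            # 39,547,621
print(f"in millions:      {total_variations / 1e6:.1f}")    # 39.5
print(f"new accessions:   {new_accessions}")                # 241
print(f"accession fold:   {accession_fold:.2f}")            # 6.02
```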

In the future, we will update SorGSD regularly and add variations from newly available re-sequenced sorghum accessions. As a next step, we anticipate integrating phenotypic, genomic variation, transcriptome, proteome and epigenomic data, as well as metabolomics and metabolic interaction networks, to build a comprehensive sorghum research and analysis database. At the same time, we welcome comments and suggestions, with the aim of making SorGSD a one-stop sorghum research platform with multi-faceted omics data and analysis tools.

Methods and materials

Data resources

Currently, we have collected re-sequencing data with a unique average depth of 4.02–48.55× coverage from three sets of sorghum germplasm comprising a total of 289 accessions of wild and cultivated sorghum. The most extensive set is a diverse panel of 241 sorghum lines, which we published in a study exploring the origin of sweet sorghum through selection on the Dry gene [33]. The second dataset comprises 44 sorghum lines, which were used by Jordan’s Lab in 2013 to reveal untapped genetic potential in Africa’s indigenous cereal crop sorghum [6]. The last dataset, also from our group, contains three accessions of cultivated sorghum [32]. The entire set of original sequence data can be obtained from the Genome Sequence Archive [36]. Phenotypic data cover the breed and agronomic-trait information collected from GRIN-Global (npgsweb.ars-grin.gov/). Finally, panicle pictures were taken when the sorghum plants reached maturity in the experimental fields of the Institute of Botany, Chinese Academy of Sciences (Beijing, China) in 2019.

Data processing

The reads of the second [6] and third [32] datasets were adapter-trimmed and filtered for low quality in the same way as those of the first dataset [33], and the remaining clean reads were mapped to the reference genome BTx623 (v3.1) with BWA (v0.7.8) [37]. The mapping results were converted to BAM format, and duplicated and multi-aligned reads were removed with the SAMtools package (v1.3) [38]. GVCF files for these lines were generated with HaplotypeCaller in GATK (v3.1) [39]. All the GVCF files of the three datasets were then used to call SNPs and INDELs with GenotypeGVCFs in GATK (v3.1) [39]. In total, 33,825,236 SNPs and 5,722,385 small INDELs were identified across the 289 sorghum lines. Finally, we predicted and annotated the effects of the variations using the VEP program (v84) [40], and calculated the MAF of each variant using vcftools (v0.1.13) [41].
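For readers unfamiliar with the MAF, the following Python sketch shows how it can be derived from VCF-style genotype strings at a biallelic site. It is an illustration of the definition only, not the vcftools implementation; the function name `minor_allele_frequency` and the example genotypes are hypothetical. Missing calls (".") are excluded from the allele count, and sites with more than two alleles are rejected rather than handled.

```python
from typing import Iterable, Optional


def minor_allele_frequency(genotypes: Iterable[str]) -> Optional[float]:
    """Compute the MAF at a biallelic site from genotypes like '0/1'.

    '0' is the reference allele and '1' the alternate allele. Missing
    alleles ('.') are ignored. Returns None if no allele was called.
    """
    ref_count = 0
    alt_count = 0
    for genotype in genotypes:
        # Accept both unphased ('/') and phased ('|') separators.
        for allele in genotype.replace("|", "/").split("/"):
            if allele == "0":
                ref_count += 1
            elif allele == "1":
                alt_count += 1
            elif allele != ".":
                raise ValueError("only biallelic sites are supported")
    called = ref_count + alt_count
    if called == 0:
        return None
    return min(ref_count, alt_count) / called


# Five hypothetical accessions, one with a missing call:
# 5 reference alleles and 3 alternate alleles among 8 called alleles.
example = ["0/0", "0/1", "1/1", "./.", "0/0"]
print(minor_allele_frequency(example))  # 0.375
```

A monomorphic site yields 0.0, so a MAF threshold in a search filter would exclude it for any positive cutoff.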

Database design and implementation

SorGSD was designed based on the framework of the iDog database [42], which was implemented using Spring Boot (http://spring.io), a free and widely used Model-View-Controller (MVC) framework, and MyBatis (https://mybatis.org/mybatis-3/), a first-class persistence framework with support for custom SQL, stored procedures and advanced mappings. In the back end, metadata and reference data were stored in MySQL (https://www.mysql.com). Web user interfaces were developed using JSP, jQuery and Bootstrap. The Biodalliance genome browser (http://www.biodalliance.org/) was used for genome synteny visualization.