Background

Leaf development from primordium initiation to organ senescence is an intricate process controlled by interconnected regulatory pathways [1, 2]. Many of the key genes have been thoroughly characterized, while the role of numerous other factors with clear leaf phenotypes has not been studied in the context of leaf organogenesis. The shoot apical meristem (SAM) gives rise to the aboveground differentiated organs. The position of leaf initiation is determined by polarized auxin accumulation generated by the YUCCA auxin biosynthesis genes [3] and the PIN-FORMED1 (PIN1) hormone transporter [4]. Leaf identity is established by suppression of meristem identity genes at this marked region by the MYB-family transcription factor ASYMMETRIC LEAVES1 (AS1) and AS2, a LOB domain protein coding gene [5, 6]. A defined boundary region separates the meristem from the organ primordium and provides a border between neighboring organs. Organization of this domain depends on factors including CUP-SHAPED COTYLEDON (CUC) genes, LATERAL ORGAN BOUNDARIES (LOB), LATERAL ORGAN FUSION (LOF1), and JAGGED LATERAL ORGAN (JLO) genes [7]. The early leaf primordium emerges as a radially symmetrical, cylindrical structure that soon differentiates along the proximodistal, mediolateral and dorsoventral axes. For the formation of a flattened leaf structure, mutually antagonistic developmental programs define the dorsal and ventral organ identity [8]. AS1, AS2 and the HD-ZIPIII genes act as dorsal determinants, while the KANADI (KAN) genes, the YABBY genes and several AUXIN RESPONSE FACTORs (ARFs) promote ventral fate. Leaf growth is a coordinated process of cell division and cell expansion. Cell divisions are driven by a great number of cell cycle regulators such as cyclins, cyclin-dependent protein kinases, and inhibitors of cyclin-dependent kinases [9]. Many of these factors are also key players in DNA endoreduplication and are hence crucial for controlling cell size.
Cell proliferation drives early stages of leaf development, while cell expansion dominates the later phases of leaf growth. During this process, pluripotent initial cells differentiate into the abaxial and adaxial epidermis, the palisade and spongy mesophyll cell layers and the vascular system. Specific genetic and molecular pathways drive the formation of guard cells [10] and trichomes [11]. Furthermore, analysis of mutant phenotypes revealed that genes involved in chromatin remodeling, pre-mRNA splicing and processing, protein translation, post-transcriptional regulation via small RNA pathways, proteasome-dependent protein degradation, hormonal signaling, metabolite biosynthesis and numerous other processes are essential for leaf organogenesis [1, 12–16].

During the past years, several public resources have been assembled focusing on Arabidopsis leaf development. Plant morphology depends on the combination of genetic determinants and environmental factors. Regular imaging and objective measurements are crucial to monitor quantitative traits. The PHENOPSIS DB [17] is built for data storage, sharing and analysis of the precise recordings of phenotypic variables and growth conditions from automated phenotyping platforms [18, 19]. Additional measurements and offline microscopic analyses are manually added to each experiment. The database contains more than 93,000 plant images and 57,832 phenotypic details about 1057 Arabidopsis genotypes and offers data visualization and image analysis tools. The results of a systematic reverse genetic screen are summarized in PhenoLeaf [20, 21]. Approximately 24,000 SALK mutant alleles were monitored for visible leaf defects. The 706 identified leaf mutants have been cataloged and can be queried by keywords for phenotype or genes. Another collection, accommodating cleared leaves with visible vascular architecture and including 412 Arabidopsis pictures, is available in the ClearedLeavesDB [22, 23]. The leaf senescence database (LSD) focuses on the last phase of leaf development that leads to organ death [24–26]. Manual and computational data were collected about senescence-associated genes (SAGs) from various plant species. The updated LSD 2.0 now contains 5356 genes and 322 mutants from 44 plant species, as well as QTLs, seed information, sequence search functions and information about subcellular localization. Finally, the AGRON-OMICS consortium (Arabidopsis GROwth Network integrating OMICS technologies) was initiated to understand molecular mechanisms behind leaf growth using high-throughput experimental approaches. The effect of mild drought stress was studied in several stages of leaf development using transcript profiling and quantitative proteomics experiments [27].
These datasets, along with metabolite measurements, photosynthesis and respiration rates, enzyme activities, ribosome numbers and lipid content, are accessible at the project’s data integration and data sharing portal [28]. In the framework of this project, a novel literature curation method was developed using the Leaf Knowtator tool, and 283 key publications were processed as a community effort [29]. It was demonstrated that the collected information could be integrated with other public resources, and a relational database, KnownLeaf, was created. Furthermore, a graphical network was built to facilitate knowledge mining. However, access to the curated data is hindered by the lack of a web interface. Therefore, our main aim was to establish a convenient resource with reliable query functions for easy access to this curated library.

Here, we present LEAFDATA, a high-quality and freely available literature database for Arabidopsis leaf development. By searching and manually curating 380 primary research publications, we collected 13,553 statements about genes that were experimentally linked to leaf organogenesis. We have created LEAFDATA to support fundamental research and provide a solid information resource for our users.

Construction and content

Data collection

LEAFDATA records were collected by employing the customized Leaf Knowtator annotation tool [29]. This interface runs in Protégé software version 3.3.1 using the Knowtator plug-in version 1.9 beta [30]. Result sections of full-text primary research papers are processed. Entries are collected into ten major categories: phenotype, gene expression, feature, DNA–protein interaction, protein–protein interaction, genetic interaction, process, regulation of gene expression, regulation of process, and regulation of phenotype (Table 1). All categories have predefined structures and information slots attached to them that can be filled with ontology terms already uploaded into Leaf Knowtator (Table 2). The main controlled ontology collections that are included in this project are Plant Ontology (PO) [31], BRENDA Tissue Ontology (BTO) [32], Phenotype, Attribute and Trait Ontology (PATO) [33], Plant Trait Ontology (TO) [34], Molecular Interaction (MI) [35], Plant Environment Ontology (EO) [36] and Gene Ontology (GO) [37]. Genes were associated with the specific AGI identifiers derived from the TAIR10 genome annotation [38]. In addition, the Knowtator plug-in automatically saves further details such as the annotated file, annotator and annotated text. The curation system is flexible and can be easily adapted to other annotation projects. Required slots are filled with terms closely following the original text. In addition to the community curations from 283 publications from the AGRON-OMICS project, 97 new papers were processed.
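
The record structure described above can be illustrated with a minimal Python sketch. It is not the Leaf Knowtator schema: the class name `Statement` and the fields `slots`, `source_text` and `pubmed_id` are hypothetical and chosen only for illustration. The ten category names come from the text, and the identifier pattern follows the usual AGI locus format (for example AT5G60690).

```python
import re
from dataclasses import dataclass, field

# The ten annotation categories listed in the text.
CATEGORIES = frozenset({
    "phenotype", "gene expression", "feature",
    "DNA–protein interaction", "protein–protein interaction",
    "genetic interaction", "process",
    "regulation of gene expression", "regulation of process",
    "regulation of phenotype",
})

# AGI locus identifier: "AT", chromosome (1-5, C or M), "G", five digits.
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}$")


@dataclass
class Statement:
    """One curated statement (hypothetical layout, for illustration only)."""
    category: str
    agi: str
    pubmed_id: str
    source_text: str
    slots: dict = field(default_factory=dict)  # slot name -> ontology term

    def __post_init__(self):
        # Reject records that fall outside the predefined structure.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not AGI_PATTERN.match(self.agi):
            raise ValueError(f"malformed AGI identifier: {self.agi!r}")


example = Statement(
    category="phenotype",
    agi="AT5G60690",                 # the REV locus used as an example in the text
    pubmed_id="00000000",            # placeholder, not a real PubMed ID
    source_text="placeholder text fragment",
    slots={"entity": "leaf", "quality": "increased size"},
)
```

Validating the category and the identifier at construction time mirrors the idea of predefined structures: a malformed record is rejected before it reaches the database.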

Table 1 Information types annotated in LEAFDATA
Table 2 Phenotype annotation exported from Leaf Knowtator

Database construction

Annotations were exported as XML files from Leaf Knowtator. These files are small and easy to share. The XML files were transformed into a single table with a custom-made Perl script [24] and loaded in bulk using Structured Query Language (SQL) queries. The LEAFDATA database resides on the MS SQL Server 2008 platform. The website design is fully responsive, in line with current industry standards, and is based on the Bootstrap framework. Bootstrap utilizes HTML, CSS, and JS frameworks for developing responsive projects on the web. For database integration, the server-side engine Adobe ColdFusion 9 running over MS IIS was chosen for its relatively inexpensive hosting costs, its rapid development credentials and powerful data collation functions. Together, these technologies offer a modern, literature-curated website that can be used on any device and provides fast access to our data in any research environment.
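
The export-and-load step can be sketched as follows. This is an illustrative Python stand-in, not the Perl script used for LEAFDATA. The XML element and attribute names (`annotations`, `annotation`, `slot`, `text`) and the table layout are assumptions made for the example and do not reproduce the Knowtator export format. An in-memory SQLite database is substituted for MS SQL Server so that the sketch is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical export; element and attribute names are invented for this sketch.
SAMPLE_XML = """
<annotations source="paper_0001.txt">
  <annotation category="phenotype" annotator="curator_a">
    <text>placeholder text fragment</text>
    <slot name="gene" value="AT5G60690"/>
    <slot name="quality" value="increased size"/>
  </annotation>
  <annotation category="gene expression" annotator="curator_a">
    <text>another placeholder fragment</text>
    <slot name="gene" value="AT5G60690"/>
  </annotation>
</annotations>
"""


def flatten(xml_text):
    """Turn nested annotations into flat rows, one row per filled slot."""
    root = ET.fromstring(xml_text.strip())
    source = root.get("source")
    rows = []
    for ann in root.findall("annotation"):
        text = ann.findtext("text", default="")
        for slot in ann.findall("slot"):
            rows.append((source, ann.get("category"), ann.get("annotator"),
                         text, slot.get("name"), slot.get("value")))
    return rows


def bulk_load(conn, rows):
    """Create the single table if needed and insert all rows in one batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS statements ("
        "source TEXT, category TEXT, annotator TEXT, "
        "annotated_text TEXT, slot TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO statements VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


rows = flatten(SAMPLE_XML)
conn = sqlite3.connect(":memory:")   # stand-in for the production SQL server
bulk_load(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM statements").fetchone()[0]
```

Keeping the provenance fields (source file, annotator, annotated text) on every row trades some redundancy for a single flat table that is simple to query.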

Utility

LEAFDATA home

The main site (Fig. 1) provides direct access to the search functions. There is a visual representation of the database content, including the number of curated publications and individual statements, and details of the different categories. Upon selecting a category, all of its annotations can be retrieved. At the bottom of the page, a news section is directly connected to an active Twitter account with announcements of relevant publications and database updates. Moreover, a direct contact form is available for any enquiries.

Fig. 1

LEAFDATA home. Four key search tools, a summary of the database content, a news section connected to an active Twitter account and a contact form can be reached from the LEAFDATA main page

LEAFDATA search tools

LEAFDATA provides four convenient search functions. Genes of interest can be queried by using unique AGI identifiers based on the TAIR10 genome release. All annotations can be retrieved from a selected publication using the PubMed ID. In addition to an author query, we also offer a keyword search. Results are arranged according to distinct categories and individual publications. For illustration, records from an AGI search for the HD-ZIPIII transcription factor REVOLUTA (REV) are shown (Fig. 2; Table 3). This query resulted in 78 statements from 17 different papers. The keyword tool is particularly helpful for targeted retrieval. It allows combining multiple keywords and limits the search results to only those documents that contain all the terms. This function can be used effectively to find plant lines that share a certain phenotype, genes with the same biological function or similar expression domains. Recent publications revealed that genetic combinations of plant lines with increased leaf size can further enhance growth [39]. In order to find all the large-leaf Arabidopsis lines curated in LEAFDATA, we performed a search for the terms size_PATO:0000117 and increased size_PATO:0000586 and retrieved a preliminary list of 173 statements (Additional file 1). Ontology terms were used to minimize the recovery of false positive records, and ‘plant part’ was not specified to maximize the number of genuine hits. Terms with similar meanings can also be used for this query. For example, ‘large leaves’, ‘big leaves’ and ‘increased leaf size’ gave 162, 12, and 373 results, respectively (Additional file 2: Table S1). Ten statements were randomly selected for additional data mining (Table 4). First, AGI codes were collected from the LEAFDATA gene list available under the SEARCH LEAFDATA tab (see also Additional File 3: Table S2), then AGI searches were performed for the individual genes.
Further analysis was focused on gene expression data in wild-type background and reported biological functions. For eight genes, both gene expression and functional records were recovered. In one case, only gene expression data were found, while for one gene none of the required additional information was available in LEAFDATA. Importantly, half of these records were gathered from multiple (2–4) papers.
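
The all-terms behaviour of the keyword tool can be sketched in Python as below. This is an assumption-laden illustration, not the LEAFDATA implementation: the function names are hypothetical, the records are invented placeholders, and case-insensitive substring matching is assumed because the text does not state how terms are compared.

```python
def matches_all(record_text, keywords):
    """Return True if every keyword occurs in the record (case-insensitive)."""
    haystack = record_text.lower()
    return all(keyword.lower() in haystack for keyword in keywords)


def keyword_search(records, keywords):
    """Keep only the records that contain all of the given keywords."""
    return [record for record in records if matches_all(record, keywords)]


# Invented placeholder records, not LEAFDATA content.
records = [
    "leaf increased size in overexpression line",
    "leaf decreased size in mutant",
    "root increased length",
]

hits = keyword_search(records, ["increased", "size"])
```

Only the first record is returned, because the second lacks "increased" and the third lacks "size". Requiring every term narrows the result set as keywords are added, which is why combining ontology terms reduces false positives.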

Fig. 2

LEAFDATA result page. AGI search for AT5G60690 was performed. Records are organized according to information types and publications with direct links to the PubMed collection

Table 3 Results using AGI search for AT5G60690
Table 4 Mining LEAFDATA for increased leaf size phenotype

All the query tools can be accessed from the main site as well as from dedicated search pages where queries can be restricted to different categories. Finally, to show the full content of LEAFDATA, there is a current list of all annotated papers under the SEARCH LEAFDATA tab (Additional File 4: Table S3).

Discussion

Leaves are essential organs for plant life and the location of multiple biological processes. Organogenesis, from emergence of the leaf primordium through pattern formation, maturation and maintenance until senescence, is regulated by diverse regulatory pathways. Genetic and molecular roles of numerous genes were described in great detail. These genes are classified as key players in leaf morphogenesis. However, numerous additional genes causing altered leaf morphology have been isolated. In many cases, characterization of the observed leaf phenotypes is not the main scope of these studies. Furthermore, this information is scattered throughout the existing scientific literature. Our aim was to create a convenient public collection of relevant leaf literature that provides simple query functions and easy access to a large library at the same time. Here, we demonstrate that our published annotation method and the Leaf Knowtator interface [29] can be used effectively for establishing high-quality literature resources. Employing this system guaranteed several unique database features. With a quick workflow, we are able to retain a large amount of information. In LEAFDATA, not only are the curated text fragments from the original publications kept and displayed, but ontology terms from established structured vocabularies are simultaneously attached to these statements. Using these standardized terms helps to build complex queries and can facilitate data sharing and integration [40]. We adhere to further community standards by employing the entity–attribute–value (EAV) model for phenotype annotations [41]. On average, more than 35 annotations per publication are generated, adding up to a total of 13,553 independent statements about nearly 1300 genes. A major advantage of our database is that our curations are not restricted to single genotypes or information types.
For instance, phenotype annotations can cover descriptions of single and multiple mutants (Table 3) as well as constitutive or inducible overexpressors, transgenic plants expressing chimeric constructs or modified versions of the gene of interest. Also, gene expression records provide an exceptional range of information, including quantification of expression levels and spatial distribution in wild type or various mutant backgrounds (Table 3). Most of our annotations belong to the phenotype and gene expression classes; however, numerous protein–protein interaction, genetic interaction and DNA–protein interaction records can also be accessed (Table 1). The original publication details (author, title, PubMed ID) are clearly displayed with each statement and a direct link is provided to the dedicated PubMed page. The search functions were designed to give quick access to records from a chosen gene, paper or author. The keyword query allows more detailed data mining, e.g. for a specific genotype using multiple terms. In summary, the combination of the LEAFDATA tools can be used effectively to collect a wide range of information (Table 4).
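
The entity–attribute–value model mentioned above can be illustrated with a short Python sketch. The plant line names and phenotype triples are invented placeholders, and the helper `lines_with` is hypothetical; the sketch only shows why decomposing a phenotype into three controlled parts makes it easy to find lines that share a trait.

```python
from collections import namedtuple

# Entity = what was observed, attribute = which property,
# value = how the property differed.
EAV = namedtuple("EAV", ["entity", "attribute", "value"])

# Invented placeholder data, not LEAFDATA content.
phenotypes = {
    "line_A": [EAV("leaf", "size", "increased"),
               EAV("petiole", "length", "decreased")],
    "line_B": [EAV("leaf", "size", "increased")],
    "line_C": [EAV("leaf", "shape", "curled")],
}


def lines_with(phenotypes, entity, attribute, value):
    """Return, in sorted order, the lines annotated with the exact triple."""
    wanted = EAV(entity, attribute, value)
    return sorted(line for line, triples in phenotypes.items()
                  if wanted in triples)


large_leaf_lines = lines_with(phenotypes, "leaf", "size", "increased")
```

Here `large_leaf_lines` contains `line_A` and `line_B`. In a real setting each part of the triple would be an ontology term rather than a free-text string, which is what allows statements from different papers to be compared.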

LEAFDATA is a useful platform not only for researchers interested in leaf development but also for scientists working with other traits, plant species or model organisms. There are possible applications for our dataset in large-scale projects, mutagenesis screens and the development of text-mining tools. University students, interested professionals and the general public can benefit from free and easy access to the LEAFDATA library offering processed scientific records.

We envision future improvements for LEAFDATA. The current database contains approximately 15–20 % of the published Arabidopsis leaf literature and is constantly being updated. However, it will take significant effort to annotate every existing leaf development paper and at the same time keep up with the steady flow of new research. We plan to develop advanced search functions, for instance queries for specific phenotypic characteristics, combinations of features or exclusion of certain traits. Similarly, gene expression statements can be further explored by genotypes, changes in certain target genes or expression in special subcellular compartments, cell types and organs. Lastly, we are interested in data visualization and integration with other datasets.

Conclusions

The sheer amount of scientific literature calls for carefully curated databases summarizing experimental results. We employed the Leaf Knowtator curation system and constructed a unique, comprehensive database focusing on Arabidopsis leaf development. In addition to previously described regulators, genes with clear leaf phenotypes are included. The LEAFDATA collection gives access to 380 publications organized according to papers and information types. Four query functions provide easy access to high-quality annotations and direct links to the original papers. LEAFDATA serves as a valuable resource and reference point for the research community. Finally, our annotation approach, data organization and database structure can serve as a prototype for other literature curation projects.

Availability and requirements

LEAFDATA is an open access database at www.leafdata.org. The collection is updated on a regular basis. Questions, comments and requests regarding this database should be sent to Dóra Szakonyi at info@leafdata.org.

Details of LEAFDATA content and screenshots were recorded on 08/11/2015.