Accessing Cryptosporidium Omic and Isolate Data via CryptoDB.org.

Cryptosporidium has historically been a difficult organism to work with, and molecular genomic data for this important pathogen have typically lagged behind other prominent protist pathogens. CryptoDB ( http://cryptodb.org/ ) was launched in 2004 following the appearance of draft genome sequences for both C. parvum and C. hominis. CryptoDB merged with the EuPathDB Bioinformatics Resource Center family of databases ( https://eupathdb.org ) and has been maintained and updated regularly since its establishment. These resources are freely available, are web-based, and permit users to analyze their own sequence data in the context of reference genome sequences in our user workspaces. Advances in technology have greatly facilitated Cryptosporidium research in the last several years greatly enhancing and extending the data and types of data available for this genus. Currently, 13 genome sequences are available for 9 species of Cryptosporidium as well as the distantly related Gregarina niphandrodes and two free-living alveolate outgroups of the Apicomplexa, Chromera velia and Vitrella brassicaformis. Recent years have seen several new genome sequences for both existing and new Cryptosporidium species as well as transcriptomics, proteomics, SNP, and isolate population surveys. This chapter introduces the extensive data mining and visualization capabilities of the EuPathDB software platform and introduces the data types and tools that are currently available for Cryptosporidium. Key features are demonstrated with Cryptosporidium-relevant examples and explanations.

Cryptosporidium and related species. CryptoDB is a free, online resource for accessing and exploring genome sequence and annotation, functional genomics data, isolate sequences, and orthology profiles across organisms. Data mining via CryptoDB requires no prior bioinformatics or computational experience since all integrated data are pre-analyzed according to standard workflows and accessed via an easy-to-use interactive web interface. CryptoDB brings a genomewide perspective to the exploration of Cryptosporidium and related species by integrating the pre-analyzed data with advanced search capabilities, convenient data visualization, analysis tools, and a comprehensive record system describing genomic features and pathways.
Data mining in CryptoDB proceeds along four general paths: 1. Direct examination of record pages which compile all database information about a feature (genes, SNPs, pathways, EST, isolates, etc.) into tables and graphs.
2. An advanced search strategy system which facilitates genomewide data mining within and across data sets, data types, and organisms (e.g., retrieve all Cryptosporidium orthologs that contain signal peptides or are regulated at 48 h).
3. Data visualization tools, such as a genome browser, which facilitate visualization of sequence-based data to explore gene models, single-nucleotide polymorphisms, or synteny of genomic regions across any or all integrated genome sequences. 4. Data analysis tools for functional enrichment or primary analysis via a private EuPathDB Galaxy [3,4] instance (e.g., analysis of your RNA-seq data or variant calling).
This chapter starts with a brief overview of the database content and then describes the routine data mining methodologies outlined above. For maximum benefit, this chapter is best utilized at a computer with CryptoDB.org open in a browser. Screenshots of all relevant actions are provided as a guide for completing examples and in case offline reading is preferred.

Using CryptoDB
CryptoDB integrates a wide range of data types including genome sequence and annotation, whole-genome sequencing of clinical or environmental isolates, GenBank Popset [5] sequences from surveys, RNA sequence, microarray, proteomics, genome-wide RT-PCR, ESTs, and metabolic pathways. Versions of the database are released about every 2 months and incorporate new data, update existing data, and launch new features. Central to the database is the genome sequence and annotation. CryptoDB enhances these data with a series of in-house bioinformatic analyses for domain prediction, functional label assignments, and genome feature identification. In addition, functional genomics data are

Database Content and Download
Fig. 1 Homepage layout. (a) Header section. Gray menu bar (arrow 2) for quick access to all CryptoDB searches and tools. When scrolling CryptoDB pages, this menu bar remains available at the top of every page. Gene ID search (arrow 3) navigates to the gene record page of the ID entered. Gene Text Search (arrow 4) returns a list of genes whose records contain the desired text. About menu (arrow 5) with links for Help, Login, Registration, and to Contact Us via email with questions and suggestions. (b) Side panel with expandable sections containing a Data Summary (arrow 1) table listing data types available for each genome in CryptoDB, news which documents changes per release, tweets which inform on items of interest to the Crypto research community, links to other community resources, and educational materials. (c) Searches and tools section used to access all searches that return genes (arrow 6), other record types (Popset Isolates, ESTs, etc.) (arrow 7) or tools (arrow 8) for BLAST, analysis or annotation, among others. (d) Footer section with links to all EuPathDB home pages mapped to the genome sequence for interrogation and visualization of sequence-based data in a genomic context.
The data summary table provides an overview of available data as of the submission of this chapter. Open the Data Summary section in the home page left panel (Fig. 1, arrow 1) and click on the table to retrieve a table that provides a quick view of the data available for each organism (Fig. 2a) including the data provider and version.
1. CryptoDB contains 13 annotated genome sequences including sequences for 9 Cryptosporidium species (including several strains), the closely related Gregarina niphandrodes [6] and the proto-apicomplexans Chromera velia and Vitrella brassicaformis [7]. None of the genome sequences is fully complete, telomere-to-telomere without gaps. Contigs have been assigned to scaffolds, super-scaffolds, and chromosomes by the data providers as possible. Currently, C. hominis UdeA01, C. parvum  Table, accessed from the Data Summary tab on the left side of the home page, for a quick overview of the data types available for each organism. (b) Genome scale data files are accessed from Downloads, Data Files section in the gray menu bar. Previous releases are archived, and organism folders contain data that cannot be unambiguously assigned to species, such as ESTs Iowa II, and C. tyzzeri UGA55 are represented as 8 chromosomes each, with gaps and at times some left over, unplaced contigs. It is important to bear in mind the relative completeness of each of the sequences when searching the database. The observation that a particular gene is not found in a species does not necessarily indicate that it is not present, as the genome sequences do contain gaps, in some instances hundreds of gaps. The same caution applies to gene or sequence copy number. Paralogous gene families are abundant in the Apicomplexa and parasitic protists in general. Genome assembly algorithms often overassemble multiple copies of highly similar genes into a single gene or are unable to assemble the additional family members if the similarity is lower. In such cases, additional gene family members are often found in the unassembled contigs for the species. This happened with the Toxoplasma gondii rhoptry proteins [8].
As there has been very little functional genomic data to assist with genome annotation until very recently, Cryptosporidium annotations are also underinformed. The lack of strand-specific RNA-seq or cDNA experimental data has led to under annotation of intron sequences and misannotation of transcripts with overlapping untranslated regions (UTRs). As a user of this and other genomic resources, please bear in mind these rather universal caveats regarding the completeness the genome and annotation.
2. Downloading genome-scale data: The downloads section is accessed from the gray menu bar located near the top of the browser page (see Fig. 1, arrow 2 for the bar and Fig. 2b). This section contains sequence data from the current release as well as archived files from previous releases. Contents are arranged by organism. Folders without a species designation contain data, such as expressed sequence tags, that cannot be assigned unambiguously to a strain. Species folders for unannotated genome sequences contain the genome sequence and predicted open reading frames (ORFs) in fasta format. Genome sequences that are annotated have additional fasta files that contain the predicted protein sequences, transcripts, coding sequence and ORF amino acid translations, as well as GAFand GFF-formatted annotation files and text files of codon usage, and protein domain associations.
3. Genome analyses/predictions performed at CryptoDB: Sequence and annotation files are integrated into CryptoDB from repositories such as GenBank [9,10] or, in rare cases, obtained directly from researchers. CryptoDB runs a series of analyses that supplement the existing information and annotation with data such as protein domain prediction, synteny information, and protein structure and function prediction (Table 1). The OrthoMCL [11] analysis creates ortholog profiles across all genome sequences in CryptoDB and beyond. An initial analysis of 150 genome sequences gathered from across the tree of life included several Cryptosporidium genome sequences and consisted of an all-vs-all BLASTP of quality screened protein sequences followed by clustering of reciprocal best BLAST hits. The analysis produces groups of similar proteins that include proteins that span the tree of life. The ortholog groups created in this analysis facilitate comparative genomic analysis.
1. PopSet Isolates: CryptoDB integrates isolate sequences from the NCBI PopSet database [5]. These sequences originate from phylogenetic, population, environmental, or mutational studies and represent one facet of the known variation for a specific gene, often well-characterized genes commonly used for genotyping like SSU 18S rRNA, gp60, and cowp.
2. Whole-genome sequencing: Decreasing sequencing costs and improved technologies enable whole-genome sequencing (WGS) on a larger scale, and CryptoDB has integrated several WGS projects as isolates of C. parvum and C. hominis [12][13][14][15]. The isolate data are aligned to the reference genome and analyzed for single-nucleotide polymorphisms (SNPs) and copy number variation (CNV).
1. Gene Ontology assignments: GO terms and their associated identifiers are part of the Gene Ontology controlled vocabulary that describe the biological process, molecular function, and cellular component of a gene product. GO terms in CryptoDB are either manually curated or electronically inferred from orthologs.
2. EC numbers: Enzyme commission numbers are assigned to genes and classify the enzymatic activity of the gene product. CryptoDB receives EC numbers with genome annotation and supplements this with assignments retrieved from other sources [16,17].
3. InterPro domains: Protein domains and functional insights associated with them are assigned to predicted proteins using InterProScan [18]. The presence of annotated protein domains facilitates sequence-and feature-based searching of identifiable features in known proteins possible. Several protein motif signatures are available in CryptoDB, including Pfam [19], InterPro [20], and Smart [21].
Metabolic pathways are downloaded from KEGG [35][36][37] and MetaCyc [38][39][40] as networks of enzymes (EC numbers) and compounds (Compound IDs from PubChem Compounds [41]). Genes that are annotated with EC numbers contained in the metabolic pathway are mapped to the pathways, creating a network of gene, pathway, and compound records. Using the strategy system described below, any list of genes can be transformed into the pathways or compounds associated with these genes.
The CryptoDB home page ( Fig. 1) provides quick access to the four data mining resources-record pages, genome-wide searches and strategies, analysis tools, and visualization tools. The page is organized into four sections. The header (Fig. 1a) provides easy access to gene ID (Fig. 1a, arrow 3) and text searches (Fig. 1a, arrow 4) that return genes for further exploration. The header also contains links (Fig. 1a, arrow 5) for registration, login, and a "Contact Us" form for emailing EuPathDB's active support hotline with questions or suggestions. The gray menu bar (Fig. 1a, arrow 2) provides access to all CryptoDB functionality and remains visible on all CryptoDB pages. Registered users can bookmark record pages in My Favorites, save searches and strategies in My Strategies, add comments on record pages (Fig. 7), and use the EuPathDB Galaxy site [42] for private analysis of large-scale data accessed through Analyze My Experiment. Hover your computer cursor over the New Search tab to reveal a dropdown menu for

Layout of Home Page
accessing all searches. The My Basket tool provides a mechanism to save records for later viewing or strategy integration. The side panel (Fig. 1b, red headers) contains links to useful information including a summary of data in CryptoDB (Fig. 2a), community resources, educational materials including exercises and video tutorials, news documenting updates implemented with each release, and the EuPathDB twitter feed. The center panels (Fig. 1c, blue headers) contain expandable categories for accessing all searches that return genes (Fig. 1c, arrow 6), searches that return other records (Fig. 1c, arrow 7), and links to useful tools (Fig. 1c, arrow 8) such as BLAST for retrieving records based on sequence similarity, the Companion annotation tool [43] for easy, first-pass, reference-based genome annotation, and the Genome Browser for visualization of sequence-based data aligned to a reference genome. The footer (Fig. 1d) is also available from any page and provides links to the home pages of all EuPathDB sites.
Genome sequence annotation specifies a unique identifier for each gene in the genome sequence, and CryptoDB offers several ways to retrieve genes based on their ID. The Gene ID search in the header (Fig. 3a (Fig. 3d, brown box). The Gene Results section tabulates and displays data for the active result (highlighted with yellow). All CryptoDB data associated with genes, such as GO terms or gene expression values, are available for display in the Gene Results tab via the Add Columns tool (Fig. 3d, arrow 4). The download tool (Fig. 3d, arrow 5) retrieves either custom (one row per ID) or preconfigured tab-delimited TXT files of chosen data, sequences in FASTA format or annotations in GFF3 format for the active result. The Genome View tab (Fig. 3d, red box) displays genomic sequences (contigs or chromosomes) "painted" with the genes from the active result.
CryptoDB text searches are available for gene, Popset isolate, and compound record retrieval. These searches query the textual content of each record to assemble a set of genes that contain a userdefined text term or phrase. For example, genome annotation, as well as other data in the gene record (e.g., UniProt [17,44] data, protein domain predictions), contains text that can be used to retrieve genes with similar characteristics. CryptoDB offers two searches that return genes based on the text in their records: the Gene Text Search is available in the header (Fig. 3a, arrow 6) and the dedicated search Identify Genes based on Text (product name, notes, etc.) is available in the Search for Genes, Text (product name, notes, etc.) panel (Fig. 3b, arrow 7). A text term or phrase (within quotations) entered in the header Gene Text Search initiates a preconfigured search that queries a broad range of data fields in all gene records of all annotated genome sequences in CryptoDB. The dedicated gene text search (Fig. 3b, arrow 7) offers control over the organisms and fields that will be searched. For example, the dedicated gene text search can be configured to search only the annotated gene product name of genes in C. parvum Iowa II. Both searches support a wild card (wild card = * and means any character) used for broadening the search to include partial text. For example, a text search of the term fructo * will

Record Retrieval Based on Text
retrieve genes whose records contain words that start with the prefix "fructo." Surrounding a multiword phrase in quotation marks configures the search to find the exact phrase. For example, searching for "protein kinase" will return genes with the exact term "protein kinase"; whereas searching for protein kinase (without quotations) will return genes with the term "protein," genes with the term "kinase," and some may have the phrase "protein kinase." 1. Use the Gene Text Search to retrieve genes that are likely kinases and download the GeneIDs and download the IDs, product description, and coding sequence length for all results. Navigate to the dedicated gene text search from the home page ( Fig. 3b, arrow 6) and enter the term kinase in the box for the Text term parameter (Fig. 4a). Notice the default settings for the Organism parameter (all organisms selected) and Fields (exclude similar proteins (BLAST hits v. NRDB/PDB)). Click Get Answer to initiate a search of all annotated genome sequences for genes whose records contain the term kinase.
2. Examine the search results. The result (over 3500 genes) appears in the My Strategies section and genes from every organism are returned by the search as seen in the Organism   18S sequence to the community curated Cryptosporidiidae SSU_18S rRNA Reference Isolate set to determine the putative identity of the input sequence relative to reference sequences that are considered standards by the Cryptosporidium community.

Find the genotype of an unclassified Cryptosporidium isolate.
Consider that the following sequence originated from a laboratory study of the 18S small subunit ribosomal RNA gene from an isolate retrieved during a Cryptosporidium outbreak at a swimming pool. The sequence (below and Electronic Supplemental File 3) was identical from all isolates. Use the CryptoDB BLAST search to find a similar reference sequence.

A A G C T C G T A G T T G G A T T T C T G T T A A T A A TTTATATAAAATATTTTGATGAATATTTATATAATA T T A A C A T A A T T C A T A T T A C T A T A T A T T T T A G T A T A T G A A A T T T T A C T T T G A G A A A A T T A G A G T G C T T A A A G C A G G C A T A T G C C T T G A A T A C T C C A G C A T G G A A T A A T A T T A A A G A T T T T T A T C T T T C T T A T T G G T T C T A A G A T A A G A A T A A T G A T T A A T A G G G A C A GTTGGGGGCATTTGTATTTAACAGTCAGAGGTGAA ATTCTTAGATTTGTTAAAGACAAACTAATGCGAAAG CATTTGCCAAGGATGTTTTCATTAATCAAGAACG AAAGTTAGGGGATCGAAGACGATCAGATACCGT C G T A G T C T T A A C C A T A A A C T A T G C C A A C T A G A G A T T G G A G G T T G T T C C T T A C TCCTTCAGCACCTTA
2. Navigate to the BLAST search from the tools menu and set the Target Data Type to Popset (Fig. 5a, arrow 1) which will configure the query to return Popset Isolate records and to use the BLASTN algorithm. For Target Organism, choose Cryptosporidiidae SSU_18S rRNA Reference Isolates (Fig. 5a, arrow 2) to BLAST your input sequence against the isolate reference database. Paste the sequence into Input Sequence and click Get Answer to initiate the BLAST. (Fig. 5b) with a graphic summary of the strategy at the top followed by three tabbed pages for the results: BLAST (Fig. 5b, arrow 3), Popset Isolate Sequences (Fig. 5c), and Popset Isolate Sequences Geographic Location (Fig. 5c, inset). The BLAST result (BLAST tab) reveals two hits with 100% identity, to C. parvum over the length of the sequence provided. One hit was to a C. parvum CryptoDB's extensive record system is a rich data mining resource. Records compile all database information into tables and graphs for genes and nongene entities including genomic sequences, Popset Isolate Sequences, Genomic Sequences, Genomic Segments, SNPs, ESTs, ORFs, Metabolic Pathways, and Compounds. All record pages open with a summary section at the top (Fig. 6a) and employ a floating, collapsible Contents panel (Fig. 6b) on the left for navigating the data section (Fig. 6c) on the right. This section of the chapter will explore the structure of the record page and highlight important data content using record pages for the gene that encodes GP60 (cgd6_1080), the isolate record page for Cryptosporidium parvum isolate 830 surface glycoprotein 900 gene, and the metabolic pathway for Glycolysis/Gluconeogenesis (KEGG).

The results appear in My Strategies
1. Use the Gene ID search in the header to open the record page for the gene encoding GP60 (cgd6_1080) (Fig. 6a, inset). The gene page opens with summary information at the top (Fig. 6a) including gene ID, gene product name, genomic location, and species/strain information. In this case, the official gene name, Glycoprotein_GP40 differs from the gene and protein name commonly used by the research community. See Step 2 below for more information about gene aliases and other names. Links to bookmark (Add to favorites) or save the gene for later use (Add to basket) as well as to download sequences or gene data appear at the top of the page (Fig. 6a, arrow 1). The Shortcuts ( Fig. 6a arrow 2) provide quick visualizations of the gene page data via the magnifying glass icon and navigate to the corresponding image in the data when the image or the label above is clicked. The collapsible Contents section (Fig. 6b) directs the data section ( Fig. 6c) of the page to the desired content. Uncheck a box to hide that data from the page. Data (Fig. 6, arrow 3) are represented in images, graphs, and downloadable tables (Fig. 6, arrow 4). The Data sets link next to the section title reference the data sets from which the data were taken (Fig. 6, arrow 4). View in genome browser links offer easy access to the genome browser in the genomic region containing the gene (Fig. 6d). All mapped sequence-based data are available for interactive display via the Select Tracks page (Fig. 6d, arrow 5).
2. Annotation, curation, and identifiers section. Use the Contents navigation tool to view this section (Fig. 7a), which offers alternative gene product descriptions, aliases, notes from the annotator, and a User Comments   CryptoDB maps the aliases within table Names, Previous Identifiers, and Aliases (Fig. 7a, arrow 1) to the current ID so that an ID search performed with an alias (CPW_00001274) will return the record page of the current ID, cgd6_1080. CryptoDB encourages the community to enhance the gene page with information added via the User Comments platform. Comments can offer valuable additions to the official annotation such as alternative gene models, unpublished data, and associated publications or references. The comment form (Fig. 7b) used for entering comments is retrieved from the Add a comment link (Fig. 7a, down arrow). Users must be registered and logged in to enter user comments. Once added, the comment immediately becomes part of the record page and the CryptoDB searchable text. For example, the comments entered here (Fig. 7a, arrow 2) provide the alternative gene names "gp40/15" and "GP60." Although "GP60" does not appear in the gene product name, it does appear in the user comment which is queried during the Gene Text Search, so searching for the text term "GP60" will return the gene: cgd6_1080, Glycoprotein_GP40.
3. Orthology and synteny section. Explore orthology and synteny for cgd6_1080, Glycoprotein_GP40 and compare the results to the cgd2_60, ABC_transporter. Navigate to the Ontology and Synteny section with the Contents navigation tool and explore orthology for cgd6_1080, Glycoprotein_GP40 (Fig. 8a). The ID of the ortholog group to which this gene belongs appears as a link to the OrthoMCL group record (Fig. 8a). Use this link to go to OrthoMCL and explore the group features (scroll down in OrthoMCL.org to the List of Sequences table). Only two sequences are assigned to this group, supporting the species-specific and single-copy nature of cgd6_1080, Glycoprotein_GP40. Out of the 150 proteomes contained in OrthoMCLDB, which included C. hominis and C. parvum representing the Cryptosporidiiae, only one sequence each from C. hominis and C. parvum is represented in the group. (Fig. 8b). This table expands the local list of orthologs compared to the Ortholog group at OrthoMCL. As CryptoDB integrates newly available annotated genome sequences, orthology is determined by assigning their proteomes to existing Ortholog groups without updating the original 150 proteome analysis that forms the underlying data set for OrthoMCLDB. Orthologs of cgd6_1080 do not appear in Gregarina, Chromera, or Vitrella. Only one gene from each Cryptosporidium species is orthologous and all are syntenic, in the same genomic location, reflecting the conserved nature of this gene. Also notice the product name does not always reflect the activity of the gene, reflecting the maturity of the annotation. Use the Gene Text search in the header to search for gp40. The search returns less than 10 genes, one for each organism, again confirming the conserved single-copy nature of the gene.

Visit the Orthologs and Paralogs within CryptoDB table which lists all orthologs from organisms within EuPathDB
Contrast the orthology information for cgd6_1080, Glycoprotein_GP40 with cgd2_60, ABC_transporter. Navigate to the gene record page for ABC_transporter by entering the ID (cgd2_60) in the Gene ID search in the header. Use the  Return to the cgd6_1080, Glycoprotein_GP40 gene page by clicking the browser back button or using the Gene ID search in the header. The Retrieve multiple sequence alignment or multi-FASTA tool part of the Orthology and Synteny section can be used to produce a ClustalW multiple sequence alignment in the region of the gene. Choose all organisms (Fig. 9a, arrow 1) and click Run alignment. The alignment appears in a new tab and is an alignment between 9 sequences, the gene itself, and 8 additional genomic sequences found to have similarity. It is interesting to note that the C. andersoni sequence in the alignment is missing from the Ortholog table. A closer look at the Synteny graph (Fig. 9b), which shows the chromosomal organization of genes across organisms with orthology indicated by shadows connecting the genes, shows that no gene is annotated for the region in C. andersoni or C. muris (Fig. 9b, red box). Since the MSA returns an alignment for C. andersoni, this suggests that a gene should be called for the region and this information would make a good comment for the genomic sequence record page. Also notice that Gregarina, Chromera, and Vitrella do not have syntenic sequences or orthologs to GP60, indicative of the distant evolutionary relationship between the organisms. 4. Genetic Variation section: Compare variation in the gp60 gene among isolates from China, Egypt, and Uganda. From the Synteny graph, scroll down to the Genetic Variation section which contains the Isolate Alignments in this Gene Region tool and a summary of DNA Polymorphism determined from an analysis of whole-genome sequencing of isolates. The isolate alignment tool (Fig. 10a) produces an alignment of the wholegenome sequence data within the region of the gene. The tool opens with all isolates selected for the alignment and includes the option to reduce the number of isolates included in the alignment based on isolate characteristics such as geographic location of isolate collection or gp60 genotype. The left panel of the tool shows the isolate characteristics, while the right panel shows the data distribution for the characteristic chosen on the left. Navigate the filter tool (Fig. 10a, arrow 1) to choose only isolates from China, Egypt, and Uganda and retrieve the alignment with the View Results button. The results appear in a new tab with variation highlighted in pink. Figure 10b shows an area of the alignment with a SNP in the Ugandan isolate (TU114) (Fig. 10a, arrow 2) and another area where all but one Egyptian isolate (35090) diverge from  retrieving alignments of isolate whole-genome sequence data with options for choosing isolates based on isolate characteristics (metadata). As shown, the tool is configured to align isolates from China, Egypt, and the reference C. parvum sequence (CM000434) (Fig. 10a,  arrow 3).
Close the Isolate Alignment tool by clicking the triangle next to the title (Fig. 10a, arrow 4) and notice the DNA polymorphism section which includes a summary of the coding potential, the incidence of all SNPs across all isolates, and an image of SNPs mapped to the gene model. In your browser, hover your mouse over the SNPs in the image to reveal a panel of detailed information including a link to the SNP record page (Fig. 10c).

5.
Transcriptomics Section: Scroll down from the DNA polymorphism section to view the Transcriptomics table (Fig. 11), which includes all transcript expression data sets with data that map to this gene. Data types include genome-wide RT-PCR, microarray, and RNA-seq data. Table rows expand to reveal the results of CryptoDB transcriptomic analysis presented in graphs and tables. Click the triangle in the header next to the data set "Expression profiling of life cycle stages post-infection" to view the data graphs and tables (Fig. 11, arrow 1). For this experiment, transcript abundance was determined using Real Time-PCR for >3200 genes over a 72-h infection of HCT8 cells [30]. The data graph reveals that the gene is maximally expressed at 12 h postinfection.
Compare the oocyst and intracellular stage expression for this gene. A comparison of oocyst and intracellular stage transcriptomes is also available, and the graphs and tables associated with this RNA-sequence analysis can be viewed by expanding "Transcriptome of oocyst and intracellular stages" (Fig. 11, arrow 2). The data indicate that the gene is highly expressed in the intracellular stages with very little expression in oocysts. Open Coverage (Fig. 11, arrow 3) to reveal a snapshot of the RNA sequence coverage plots and these can be further interrogated using the View in genome browser link. In your browser, using your mouse to hover over the rectangle peptide glyphs retrieves details concerning the mapped peptide sequence and its spectral count and the experiment in which it was detected (Fig. 12).  search was run on only one ID, the result navigates to the record page instead of My Strategies. The layout of the isolate record page is similar to the gene page with a summary section, a floating Contents section (which can be closed using the double arrowhead on the right top), and a data section. The summary section contains pertinent information such as isolation source, location, organism, and product description information (Fig. 13b). Links to bookmark, save, and download the record appear above the ID and product description. The sequence section provides the nucleic acid sequence associated with the isolate, while the Blast Similarity Alignments and Overlapping Genes table shows the results of a BLAST of the isolate sequence against all genome sequences in CryptoDB (Fig. 13c). Overlapping genomic sequences and genes are indicated for each BLAST hit.
The Cryptosporidiiae are known to be missing several metabolic pathways [45,46]. While basic pathways such as glycolysis are present, surprisingly, both purine and pyrimidine biosynthesis pathways are missing in Cryptosporidiiae, although salvage pathways are present [47]. Amino acid biosynthetic pathways are also missing for lysine, tyrosine, alanine, serine, selenocysteine, threonine, and thiamine. [48] Another pathway, the tricarboxylic acid (TCA) cycle is present in C. muris but absent from C. parvum and C. hominis [48]. These qualities can be investigated in the CryptoDB representation of metabolic pathways.
1. Retrieve the Glycolysis/Gluconeogenesis (KEGG) and explore the basic functions of the page. Navigate to the Pathway Name/ID search under Metabolic Pathways in the Search for Other Data Types category on the home page (Fig. 14a). Set the Pathway Source to KEGG and begin typing "glycolysis" in the Pathway Name or ID parameter and then choose the pathway from the list. Click Get Answer to retrieve the pathway. The organization of the page is similar to other record pages with a Summary at the top, a collapsible Contents section, and the data presented in images and searchable tables. An interactive Cytoscape [49] image (Fig. 14b) shows the pathway as a series of enzymatic reactions with square enzyme nodes and circular compounds. The image can be repositioned by dragging or with the navigation tool (Fig. 14b, arrow 1). Resizing the image is accomplished with the zoom tool (Fig. 14b, arrow  2) or by scrolling a mouse in or out while hovering over any portion of the image. At low zoom levels, pathway nodes appear as glyphs. At mid zoom levels, the main compound nodes appear as molecular formulas. At high zoom levels, enzymes and side nodes appear as formulas, terms, or Enzyme Commission numbers. Clicking on a node retrieves details   Glycolysis/Gluconeogenesis pathway record page depicting the Cytoscape representation. Navigation (arrow 1) and zoom (arrow 2) tools are available and will reveal enzyme EC numbers and compound structures at sufficient zoom. Node details appear once an enzyme (square) or compound (circle) has been clicked. Metabolic Pathway Reactions (arrow 4) tabulate all reactions in the pathway about the node, regardless of zoom level (Fig. 14b, arrow 3 and expansion). Enzyme nodes highlighted in orange represent enzymes for which at least one gene in CryptoDB is annotated with the EC number of the enzyme. Pathway reactions are tabulated in Metabolic Pathway Reactions (Fig. 14, arrow  4) below the pathway image, and Pathways with Shared Reactions offer links to related pathways.

Metabolic Pathway Record Organization and Content
2. Annotate the Glycolysis/Gluconeogenesis (KEGG) pathway using the Paint Enzymes tool. Click on the yellow part of the glyph representing Glyceraldehyde-3-phosphate dehydrogenase (phosphorylating) (Fig. 14b, arrow 3) to retrieve the node details. Open the Paint Enzymes tool in the Cytoscape Drawing menu bar and choose By Genera (Fig. 15, arrow 1). Use the Genera Selector tool to choose Cryptosporidium, Gregarina, Chromera, and Vitrella. Click Paint to retrieve a graph depicting which organisms have genes with this EC number. The graph (Fig. 15, arrow 2) indicates that all organisms have at least one gene with this enzymatic activity. To retrieve a list of the genes, click the Show genes which match this EC Number link (Fig. 15, arrow 3) to initiate a search for genes ("Identify Genes based on EC Number") that are annotated with the EC number. The search returns 25 genes. Scroll to the organism table in the search result and notice that the search returned one gene for each Cryptosporidium organism in CryptoDB, while multiple genes annotated with the EC number are found in Gregarina, Chromera, and Vitrella. These are likely paralogs/orthologs since all genes belong to the same ortholog group. Pathways may also be annotated with expression data from Microarray and RNA sequence data to give a quick overview of pathway expression. Using a similar workflow as above, explore the expression of enzymes within the glycolysis pathway (Fig. 15, inset). Paint the pathway with the RNA sequence data set called "C. parvum Iowa II Transcriptome of oocyst and intracellular stages (Widmer et al.) [48,50]. The CryptoDB strategy system is a unique and powerful tool for exploring relationships across data sets, data types, and organisms.
The system offers over 100 preconfigured searches that query individual data sets ranging from genome-wide analyses for predicted signal peptides to GO annotation, transcriptomic analyses, or SNP analyses. Each search provides evidence for a specific biological property, returning a list of records that meet the search criteria and therefore share the biological property intrinsic to the data set. For a comparative genomics perspective, searches can be configured to query any or all genome sequences integrated into CryptoDB. The system can easily answer questions concerning stage-specific expression, expression timing, biological function, and more. Strategies are created by adding, subtracting, intersecting, joining, or collocating the results of individual searches. An orthology transform tool can convert a gene result in one organism to their orthologs in another, making it possible to take advantage of a data-rich organism such as C. parvum to garner information about less studied organisms such as Gregarina. A colocation tool is used to explore relationships based on relative genomic location, for example, to find SNPs located within 500 bp of annotated genes. A nesting tool can be used to control the strategy logic while combining search results. The "Analyze Results" tools offer GO, metabolic pathway, and gene product word enrichment analyses for any gene search or strategy result. Isolate search results can be analyzed by MSA. The following examples illustrate how to create and analyze a gene strategy and an isolate strategy in CryptoDB.
This example assembles a list of G. niphandrodes genes that are likely kinases expressed in oocysts by creating a four-step strategy. The first step is a text search that retrieves genes from all organisms in CryptoDB whose records contain the word kinase (Fig. 16a,  arrow 1). The second search (Fig. 16a, arrow 2) retrieves genes annotated with GO terms associated with kinase activity which provides a second line of evidence for kinase activity. The first two  (Fig. 16a, arrow 3), which is a more comprehensive set of kinases across all organisms. Next, we want to restrict the list of kinases to only those with evidence of expression during the oocyst stage. Since no transcriptomic data are publicly available for G. niphandrodes, we will query C. parvum data and then transform those results into G. niphandrodes, thus inferring expression is conserved between organisms (potentially an invalid assumption, but it serves for illustrative purposes). The third search in the strategy (Fig. 16a, arrow 4) queries the expression values determined in an RNA sequence analysis comparing the oocyst and intracellular stages of C. parvum. The oocyst expressed genes are intersected with the kinases to produce a Step 3 result (Fig. 16a, arrow 5), including kinases that are also expressed in the oocyst stage. An intersection returns genes in common between the two search results. Since Step 2 is a multi-organism set of kinases, while

Gene Strategy Example
Step 3 contains only C. parvum oocyst genes; the Step 3 result (intersection) contains only C. parvum genes. The fourth step (Fig. 16a, arrow 6) transforms the C. parvum oocysts-expressed kinases into their G. niphandrodes orthologs creating a set of G. niphandrodes genes that are likely oocyst-expressed kinases. The full strategy can be found here: http://cryptodb.org/cryptodb/ im.do?s=88749e65a931cae9 1. Use the text search to find genes whose records contain the work kinase in their records. As in Subheading 2.2.3, Step 1 above, initiate a text search for the term "kinase" from the dedicated Gene text search (Fig. 4a). The search returns over 3500 genes across all organisms in CryptoDB. Future versions of CryptoDB may return different results as new genome sequences become available and are integrated, and the annotation of existing genome sequences is updated.

2.
Expand the set of kinases by adding a search for genes that are annotated with the GO term GO:0016301: kinase activity. GO Terms are part of a controlled vocabulary describing the biological function, molecular process, or location of gene products. Terms are assigned to genes during the annotation process. CryptoDB enhances GO annotations by running InterproScan on all genome sequences and integrating the analysis results. A search for genes annotated with the kinase activity GO term may identify genes with kinase activity that do not contain the word kinase in their records. To extend the strategy with this search, click the Add Step button (Fig. 16b) and navigate the Add Step panel through Run a new Search for, Genes, Function Prediction, Go Term. When the GO Term search page appears, begin typing the GO term in the GO Term or GOID parameter and then choose the correct item (GO:0016301: kinase activity) from the list.
Since this search is the second step of a strategy, the method of combining the GO Term search results with the Text term search results must be indicated before running the search ( Table 2). Choose 1 Union 2 to produce a Step 2 result that contains all genes from both searches. The GO term search returns over 1200 genes annotated with the GO term kinase activity, and the combined result is a set of over 3500 genes with gene products that likely have kinase activity (Fig. 16c).

Reduce the
Step 2 result to only those kinases that are highly expressed in oocysts. To ensure that the kinases in Step 2 are expressed during the oocyst stage, intersect the Step 2 result (kinases) with a search for genes that are highly expressed in the oocyst. Click Add Step (Fig. 16c, arrow 7) and navigate the Add Step panel through Run a new Search for, Genes, Transcriptomics, RNA Sequence (similar to Fig. 16b). Notice that G. niphandrodes is not represented in the Organism column, indicating that CryptoDB does not contain RNA sequence data sets associated with G. niphandrodes. However, CryptoDB does offer a data set that contains expression values from the C. parvum oocyst and intracellular transcriptomes determined by RNA sequence analysis [27]. Searching these data will return genes expressed in oocyst, and the intersection of the oocyst genes with the Step 2 result will find genes in common between the two lists: kinases that are expressed during the oocyst stage. Choose the Percentile search for the data set "Transcriptome of oocyst and intracellular stages (Widmer et al.)" (Fig. 17a).  This search returns genes based on the percentile rank of expression values within a sample. CryptoDB analyzed the raw RNA sequence data to produce expression values and determined the percentile rank of each gene within samples. To configure the search to find genes that are highly expressed in oocysts, choose Oocysts for the Samples parameter (Fig. 17b, arrow 1) and set the Minimum expression percentile to 70 (Fig. 17b, arrow 2). Choose to intersect the results of this new search with Step 2 using 2 Intersect 3 (Fig. 17b, arrow 3). The Step 3 result will have genes that are present in the Step 2 result as well as the expression search and, therefore, have both biological properties, kinase activity, and high expression in oocysts. Click Run Step to initiate the expression search and combine the results with Step 2. The expression search returns over 1100 genes that are highly expressed in oocysts and over 40 of these are likely to have kinase activity (Fig. 17c).
4. Use the Transform by Orthology tool to convert the C. parvum oocyst kinases to their G. niphandrodes orthologs. This step of the strategy takes advantage of expression data available in C. parvum to make inferences about possible expression of G. niphandrodes genes. Click Add Step (Fig. 17c, arrow 4) and choose Transform by Orthology in the first column of the Add Step panel (Fig. 18a). Set the transform to Gregarina niphandrodes using the Filter list below tool. Begin typing the organism name and select the correct organism from the list (Fig. 18a, arrow  1). Click Run Step to initiate the transform. The transform returns over 50 G. niphandrodes genes. This final strategy result is a set of G. niphandrodes genes that likely proteases expressed in oocysts. The full strategy can be found here: http://cryptodb.org/cryptodb/im.do?s=88749e65a931cae9 5. Explore your results. A critical review of the results is advisable to ensure credibility. For example, review the Product Description column of the Gene Result to see if they seem like kinase (Fig. 18b, arrow 2). Use the links in the Gene ID column (Fig. 18b, arrow 3) to visit a few gene pages which may contain information about domains associated with kinase activity.
6. Analyze your results: Determine functional and metabolic pathway enrichment for the G. niphandrodes gene set. CryptoDB offers enrichment analyses that can be accessed from the Analyze Results tab (Fig. 18c, arrow 4) (Fig. 19a). Click Submit to run the default analysis (Fig. 19b, arrow 1) which will apply a Fischer's exact test to compare the biological process GO terms assigned to genes in the gene set to a background consisting of all genes in the genome. The analysis returns GO terms that are statistically enriched, appearing more often in the gene result than in all genes in the genome. The default GO enrichment of the G. niphandrodes oocyst expressed kinases returns over 120 enriched GO terms with p-values as low as 1 × 10 −28 . The Open in Revigo button above the enrichment result table (Fig. 19c, arrow 2) opens the Revigo search page and populates the search with the GO enrichment results. Revigo is a clustering and visualization tool for enrichment analyses (Fig. 19c).
Find isolates collected from the feces of humans in Europe that were genotyped with GP60. While Strategy Example 1 returned genes with shared biological properties, this strategy retrieves isolate sequences that share characteristics such as geographic location and isolation source.  (Fig. 20b). From the Add Step search form, set the Isolation Source to feces (Fig. 20c) (Fig. 21b, inset). Pin color represents the number of isolates from that location. Return to the Popset Isolate Sequences tab and sort the Geographic location column to place the isolates from Croatia and Czech Republic at the top (Fig. 21b, arrow 2). Click the box next to the Popset Sequence IDs for the two Croatian isolates and scroll to the bottom of the list to click Run ClustalW on Checked Strains. The alignment appears in a new window (Fig. 21c, (Fig. 21c, right side) differences between the groups of isolates. The Croatian isolates are similar to each other as are the Czech Republic isolates. However, many differences are noted between the Croatian and Czech Republic isolates.
Sequence alignment and enrichment analysis tools are an integral part of CryptoDB record pages and strategy results. Comparative genomics often relies on sequence alignments to reveal differences between strains or isolates. For easy access, tools for aligning gene sequence from annotated genome sequences (Fig. 9a) or isolate sequence in the region of the gene (Fig. 10a, b) are integrated into the gene record page. Popset Isolate sequence alignments can be performed from any isolate search or strategy result page (Fig. 21b, c). While the strategy system facilitates reducing large volumes of data to meaningful gene sets, enrichment analyses support interpretation by identifying over-represented functional annotations such as GO Terms or metabolic pathways in a gene set. The CryptoDB enrichment analyses apply a Fischer's exact test to compare annotations assigned to the genes in a search result to the entire genome. Tools to determine gene ontology term, metabolic pathway, and gene product description term enrichment are readily accessible from a result page under the blue Analyze Results tab on any gene search or strategy result page (Fig. 18b, arrow 3 and Fig. 19).
CryptoDB also offers an environment for analyzing raw largescale data such as for differential expression from RNA sequence or variant calling from whole-genome sequence of isolates. The following section describes how to use the EuPathDB Galaxy data analysis service developed in partnership with Globus Genomics to analyze your own high-throughput next-generation sequence data. Galaxy is an analysis tool which is available on all EuPathDB sites.
Galaxy is an open-source web-based platform for data intensive research that reduces the barrier of access to bioinformatic analysis since command-line computing is not required. A variety of bioinformatics tools are available and can be used one at a time or dropped into a graphic workspace and chained together to create workflows. Workflows and their results can be shared between users or published to the community. The CryptoDB Galaxy is preloaded with reference genome sequences and offers preassembled workflows for reference-based RNA sequence analysis as well as variant calling (SNP detection). Analysis results are easily ported to CryptoDB for private viewing in the genome browser side-byside with data already integrated into CryptoDB.

CryptoDB/ EuPathDB Galaxy Data Analysis Service
The following example employs a shared EuPathDB workflow to call variants between a diarrheal isolate and the reference genome. A recent study of genetic diversity in Bangladesh performed whole-genome sequencing on 63 Cryptosporidium hominis isolates [13]. The published analysis calls single-nucleotide polymorphisms between the C. hominis isolates and C. parvum IowaII. We will use EuPathDB Galaxy to repeat a portion of this analysis. The data are available in the sequence read archive repositories (Project ID = PRJEB24168 at ENA and SRA) [51,52] and can easily be imported to a Galaxy account. The workflow performs FastQC [53] to check the isolate's read quality and uses Sickle [54] for adapter trimming, Bowtie2 [55] for aligning the isolate sequences to the reference genome sequence, FreeBayes [56] for calling variants, SNPEff [57] for annotating and predicting the effects of genetic variants, and SNPsift [58] for manipulating the variant call files produced by the workflow.

Visit and explore the CryptoDB Galaxy page by clicking Analyze
My Experiment in the header menu bar. Registration with both Globus Genomics and CryptoDB are required. First-time users will be prompted with a series of screens to create and log into their accounts. Once registered, Galaxy opens to a welcome page in the center panel (Fig. 22a, arrow 1) which contains links to tutorials as well as the EuPathDB preconfigured workflows and is controlled by the center top menu bar (Fig. 22a, red border). The center panel also serves as a workspace for configuring tools, starting and editing workflows, and file management. The tools panel (Fig. 22a, arrow 2) offers bioinformatics tools for a variety of purposes, including importing data into Galaxy, next-generation sequencing applications, data manipulation, and statistical analysis. Choosing a tool from the left panel (Fig. 22a, arrow 3) opens the tool in the center panel (Fig. 22b) where one can change default parameter values (Fig. 22b, left) and initiate the analysis. The History panel presents a running history of files imported or created during the analyses (Fig. 22a, arrow 4 and 22b, right). The Tools and History panels can be collapsed (Fig. 22a, arrow 5) creating a larger workspace in the center.
2. Import isolate sequence data into Galaxy. To import the isolate sequence data directly from EBI, open the Get Data via Globus from the EBI server tool from the tools panel (Fig. 22a, arrow 3) and enter SAMEA104459070 for the parameter Enter your ENA Sample ID. In this case, the sample ID is provided but retrieving Sample IDs is easily accomplished by searching the ENA or SRA repositories with a keyword (such as author name) the project ID (PRJEB24168, which is usually documented in the publication) to retrieve the data set record. Enter parameter values for the Data type to be transferred as fastq and the Single or Paired-Ended as paired (Fig. 22b, left). Click execute to initiate the upload. Tiles representing files appear in the history panel and will turn from gray to yellow when the job starts and then to green when the job is finished. The files are named ERR2240057_1 and ERR2240057_2 which represent forward and reverse RNA sequence runs associated with the sample ID.

Run the preconfigured variant calling workflow shared by
EuPathDB. Return to the welcome page by clicking Analyze Data in the center panel (Fig. 22a, red border) and choose the EuPathDB Workflow for Variant Calling, paired-end sequencing (Fig. 22a, arrow 6). Indicate the Input data sets (Fig. 22c ERR2240057_1 for forward reads; ERR2240057_2 for reverse reads). Three tools in the workflow (Bowtie2, FreeBayes and SNPEff) require the reference genome as input. Since all EuPathDB genomes are preloaded into EuPathDB Galaxy, defining the reference genome is as easy as choosing it from a pick list. Scroll down carefully to change the reference genome to CryptoDB-37_CparvumIowaII_Genome in three places (Fig. 22c, arrow 7). Click Run workflow (Fig. 22c, arrow 8) to start the workflow. History files appear and will indicate successful completion of a step when green or failure when red. 4. Explore the results. The history panel contains a tile for each output file created by the workflow (Fig. 23a). Some files are hidden from the history panel but can be shown by clicking the hidden link (Fig. 23a, arrow 1). Clicking the name of the file reveals file attributes within the history tile, whereas clicking the eye icon populates the center panel with file contents. Files can also be downloaded for use in local programs such as excel (Fig. 23a, red border).

View variants in CryptoDB
GBrowse. The FreeBayes output vcf file (variant call file) can easily be converted to a graphic file (bigwig) that can be exported to CryptoDB for viewing. Click the edit icon next and open the Convert Format tab that appears in the center panel (Fig. 23b). Choose Convert BED, GFF, or VCF to BigWig and click Convert. A new history tile representing the new file will appear. Once the file conversion is complete, use the EuPathDB Export Tool, Bigwig Files to EuPathDB, to transfer the file to the My Data Sets section of CryptoDB (Fig. 23c). This tool requires an entry for all five   Bigwig files to EuPathDB tool. Tool is shown configured to create a data set called C hominis isolate which will contain one Bigwig file that is associated with the C. parvum reference genome. All five parameters need entries, or the tool will return a request for parameter values parameters before executing. Open the export tool and make an entry for all parameters, choose the file to transfer, and make an entry for all parameters, My Data Set name, Bigwig files, Reference genome, My Data Set summary, and My Data Set description (Fig. 24). A history tile representing the export will appear. Once the export is finished and your history tile is green, visit CryptoDB My Data Sets (Fig. 25a) page to install the file into GBrowse. The My Data Sets page tabulates all private data sets that are shared with or uploaded by you. Open the record for the new data set by clicking the data set name in the My Data Sets table (Fig. 25a, arrow 1). The GBrowse Tracks table contains a Send to GBrowse button that installs the file in GBrowse (Fig. 24b) and, upon file installation, becomes a link (View in GBrowse) to directly view the file in GBrowse (Fig. 24c). GBrowse is itself a complex analysis tool. Consulting a GBrowse user guide [59] is recommended for assistance with configuring visualizations, especially navigation and adjusting scales when viewing quantitative results.
This exciting time for Cryptosporidium omics-based research supports biological discovery and a changing view of Cryptosporidium as a pathogen. CryptoDB and its associated analysis tools reduce

Summary
the threshold of access to valuable omics data and provide a powerful data mining and analysis platform for Cryptosporidium researchers and related research communities. CryptoDB is updated roughly every 2 months with new data and new features. Changes made to the database will affect data mining outcomes; thus, strategies should be saved for easy execution when new data are available. As laboratory breakthroughs render Cryptosporidium more amenable to advanced molecular and phenotypic analyses such as genetic crosses, high-throughput phenotypic screens, and CRISPR-Cas, additional tools developed for other organisms represented in the EuPathDB family of databases will be provisioned for CryptoDB.