MouseMine: a new data warehouse for MGI
MouseMine (www.mousemine.org) is a new data warehouse for accessing mouse data from Mouse Genome Informatics (MGI). Based on the InterMine software framework, MouseMine supports powerful query, reporting, and analysis capabilities, the ability to save and combine results from different queries, easy integration into larger workflows, and a comprehensive Web Services layer. Through MouseMine, users can access a significant portion of MGI data in new and useful ways. Importantly, MouseMine is also a member of a growing community of online data resources based on InterMine, including those established by other model organism databases. Adopting common interfaces and collaborating on data representation standards are critical to fostering cross-species data analysis. This paper presents a general introduction to MouseMine, presents examples of its use, and discusses the potential for further integration into the MGI interface.
Keywords: Data Warehouse; Mouse Genome Informatics; Template Query; Model Organism Database; Data Warehouse System
The Mouse Genome Informatics consortium (MGI) has a long history of delivering comprehensive, high-quality online information about the genetics, genomics, and biology of the laboratory mouse (Eppig et al. 2015; Bult et al. 2015; Smith et al. 2014). To maximize the use of these data, MGI has always provided multiple means to access the information. The main web interface (www.informatics.jax.org) supports interactive database querying, viewing, and downloading. A “Batch Query” tool (www.informatics.jax.org/batch) supports uploading a list of gene IDs/symbols and getting back certain information about those genes. Two MGI BioMart databases (available at biomart.informatics.jax.org) support access to basic information about mouse genes (IDs, symbols, alleles, coordinates, GO and MP terms, and orthologs) and to gene expression annotations. Finally, MGI provides many database reports and a public read-only copy of the database to support direct SQL querying (contact: firstname.lastname@example.org).
MouseMine (www.mousemine.org) is the latest step in the evolution of MGI online services. Based on the InterMine (Smith et al. 2012; www.intermine.org) data warehouse system, MouseMine supports powerful query, reporting, and analysis capabilities over a significant portion of the MGI database. MouseMine is a member of a growing community of mines and, in particular, of InterMOD (Sullivan et al. 2013), a consortium of mines developed by several model organism databases (MODs). The combination of InterMine’s capabilities and its growing adoption among the MODs significantly enhances a user’s ability to do cross-species data mining and analysis.
InterMine and InterMOD
InterMine is an open source software framework originally developed to support a Drosophila data warehouse called FlyMine (Lyne et al. 2007). Following this initial success, a new project generalized the software, calling it InterMine (Smith et al. 2012), and established mines for three other model organisms: rat (RGD/RatMine), yeast (SGD/YeastMine), and zebrafish (ZFIN/ZebrafishMine). Starting in 2012, mines were established for mouse (MGI/MouseMine) and worm (WormBase/WormMine). This consortium of MODs, called InterMOD, works together and with the InterMine team, communicating regularly on common data issues, representation standards, and interfaces. Mines for additional species (e.g., human, Xenopus, and Arabidopsis) are also being established through other funding. Users can now access data for multiple species using common interfaces and tools; for example, a user familiar with FlyMine can immediately start using MouseMine.
The main source of MouseMine data is MGI, which includes a wealth of information about the structure and function of the mouse genome, developmental gene expression patterns, phenotypic effects of mutations, and annotations of human disease models. These data also include a rich set of cross-references (e.g., EntrezGene, UniProt, OMIM, etc.) and cross-species associations (e.g., orthologies to human, rat, zebrafish, etc.), allowing the user to make critical connections to other data resources.
The main software development component in building MouseMine is the code that extracts the data from MGI, restructures it to match the InterMine data model (or sometimes extends the model to match the MGI data), and outputs it as a set of XML files in a specific format defined by InterMine. This component, called “the dumper”, is also the main source of maintenance costs for MouseMine, as it needs to keep up with the regular changes in MGI. Fortunately, the InterMine data model is both remarkably close to MGI’s in essential ways and easily extended when needed. This makes the restructuring parts of the dumper relatively straightforward and is a significant technical advantage of InterMine over BioMart.
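The restructuring step can be illustrated with a minimal sketch that serializes gene records as XML items. The element and attribute names used here (`items`, `item`, `attribute`, `reference`, `ref_id`) and the item identifiers are assumptions made for illustration; the exact file format is defined by InterMine, and this is not the MouseMine dumper itself. The sample record is made up.

```python
import xml.etree.ElementTree as ET


def build_items_xml(genes, taxon_id="10090"):
    """Serialize gene records as a small XML document of items.

    Each gene becomes one <item> that carries its own attributes and a
    reference to a single shared Organism item.
    """
    root = ET.Element("items")

    # One Organism item, shared by every gene (10090 is the NCBI taxon ID for mouse).
    organism_id = "1_1"
    organism = ET.SubElement(root, "item", {"id": organism_id, "class": "Organism"})
    ET.SubElement(organism, "attribute", {"name": "taxonId", "value": taxon_id})

    for n, gene in enumerate(genes, start=1):
        item = ET.SubElement(root, "item", {"id": f"2_{n}", "class": "Gene"})
        for name in ("primaryIdentifier", "symbol", "name"):
            ET.SubElement(item, "attribute", {"name": name, "value": gene[name]})
        # Link the gene to the organism by item id rather than by copying data.
        ET.SubElement(item, "reference", {"name": "organism", "ref_id": organism_id})

    return ET.tostring(root, encoding="unicode")


# Hypothetical record, for illustration only.
sample_genes = [
    {"primaryIdentifier": "MGI:0000001", "symbol": "Gene1", "name": "example gene 1"},
]
print(build_items_xml(sample_genes))
```

Referring to the organism by identifier keeps each record small and lets the loading step resolve the links when the files are integrated.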
MouseMine also loads data from several other sources in addition to MGI. In most cases, we exploit source loaders already included with InterMine. For example, the NCBI Taxonomy database supplies basic nomenclature information for organisms, and ontologies are loaded from OBO files downloaded from several sources (e.g., the OboFoundry). A more interesting example is Publications. Most InterMine data loaders only create publication “stubs”, i.e., objects having only a PubMed id. InterMine supplies a loader, usually one of the last to run when building a mine, which accesses PubMed and fills in all the details (title, authors, journal, date, etc.) for every publication with a PMID. (Details for the handful of publications without PMIDs come from MGI.) MouseMine also contains a small but growing segment of data not found in MGI such as interactions from BioGrid and IntAct, and homology data from Panther.
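The publication "stub" pattern described above can be sketched as follows. This is only an illustration of the idea, not the InterMine loader: the `Publication` class, the `fill_stubs` function, and the in-memory `FAKE_PUBMED` dictionary (a stand-in for a real PubMed lookup) are all hypothetical names, and the records are made up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Publication:
    pubmed_id: Optional[str]          # None for publications without a PMID
    title: Optional[str] = None
    journal: Optional[str] = None
    year: Optional[int] = None


def fill_stubs(pubs: Iterable[Publication],
               fetch_details: Callable[[str], Optional[dict]]) -> int:
    """Fill in details for every publication that has a PMID.

    Returns the number of publications that were filled in.
    """
    filled = 0
    for pub in pubs:
        if pub.pubmed_id is None:
            continue  # no PMID: details have to come from another source
        details = fetch_details(pub.pubmed_id)
        if details is None:
            continue  # lookup found nothing; leave the stub untouched
        pub.title = details.get("title")
        pub.journal = details.get("journal")
        pub.year = details.get("year")
        filled += 1
    return filled


# Stand-in for PubMed, with a made-up record.
FAKE_PUBMED = {
    "00000001": {"title": "An example title", "journal": "Example Journal", "year": 2015},
}

pubs = [
    Publication("00000001"),                          # stub: PMID only
    Publication(None, title="Title supplied by MGI"),  # no PMID
]
print(fill_stubs(pubs, FAKE_PUBMED.get), pubs[0].title)
```

Running the fill as a late, separate pass means individual source loaders only need to record a PMID, and each publication is looked up once regardless of how many sources cite it.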
MouseMine is rebuilt each week (or whenever there is a data refresh at MGI). The MouseMine build process is completely automated and is controlled by Jenkins, a widely used job management system. A build proceeds in several phases. The first phase prepares all the data files needed to load MouseMine (including running the MGI dumper), the second phase loads/integrates those files into the mine, and a third runs a series of acceptance tests to ensure that the result is consistent with MGI. If (and only if) all tests pass, the results are then “pushed out” to the publicly accessible server.
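The gating logic of the build (prepare, load, test, and publish only if every test passes) can be sketched in a few lines. This is a simplified illustration, not the actual Jenkins configuration; the function names and the example checks are hypothetical.

```python
from typing import Callable, List


def run_build(prepare: Callable[[], None],
              load: Callable[[], None],
              tests: List[Callable[[], bool]],
              publish: Callable[[], None]) -> bool:
    """Run the build phases in order; publish only if all tests pass."""
    prepare()   # phase 1: prepare the data files
    load()      # phase 2: load and integrate them into the mine

    # Phase 3: run every acceptance test and collect the names of failures.
    failures = [test.__name__ for test in tests if not test()]
    if failures:
        print("build rejected; failing tests:", ", ".join(failures))
        return False

    publish()   # reached only when the failure list is empty
    return True


# Hypothetical phases that simply record that they ran.
log: List[str] = []


def gene_counts_match() -> bool:
    return True


def allele_counts_match() -> bool:
    return True


ok = run_build(
    prepare=lambda: log.append("prepare"),
    load=lambda: log.append("load"),
    tests=[gene_counts_match, allele_counts_match],
    publish=lambda: log.append("publish"),
)
print(ok, log)
```

Because all tests run before the decision is made, a rejected build reports every failing check at once rather than only the first.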
Agencies that fund MODs are concerned with their long-term sustainability. While this is a complicated issue with no “magic” solutions, the adoption of InterMine by the MODs is a step in the right direction. The tool is powerful, flexible, and free; a mine can be established and maintained with relatively modest effort; users can, for the first time, access all the MODs using a common interface; and contributions to the tool made by one benefit all.
For MGI, MouseMine represents the latest step in its ongoing efforts to disseminate high-quality comprehensive mouse data to the widest audience, to provide powerful programmatic access, to cooperate with other MODs to foster cross-species data analysis, and to embrace strategically important new technologies. Plans for MouseMine include loading additional MGI data, such as miRNA-target interactions and gene models, as well as data from other MGI resources such as cancer models from the Mouse Tumor Database and metabolic pathway data from MouseCyc.
While MouseMine provides a complete web interface, it is also possible to take components of that interface and embed them in other web pages. In particular, it is easy to embed a table showing the results of any desired query and providing all of the interactive functionality available through MouseMine. We can use this capability to augment current MGI web pages. For example, MGI currently provides a page that displays all the phenotype annotations for one or more genotypes, formatted for reading. With relative ease, we could augment this page with the option to see the underlying annotation records, with the ability to sort, filter, and download them. We can also leverage this functionality to embed other visual components included with InterMine, such as map displays, protein interaction displays, and a generic graph widget. Finally, the comprehensive web services API provides an open-ended interface for building new interactive and embeddable displays.
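As a rough sketch of programmatic access through the web services layer, the following builds a request URL for a simple query without sending it. The service base URL, the `query/results` endpoint, the `query` and `format` parameters, the query XML layout, and the `Gene.symbol` and `Gene.name` paths are assumptions based on common InterMine conventions and should be checked against the current service documentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed base URL of the MouseMine web services; verify before use.
BASE_URL = "https://www.mousemine.org/mousemine/service"


def build_query_url(views, constraints, base_url=BASE_URL, fmt="json"):
    """Build a results URL for a query given output columns and constraints.

    views:       list of paths to return, e.g. ["Gene.symbol", "Gene.name"]
    constraints: list of (path, operator, value) tuples
    """
    query = ET.Element("query", {"model": "genomic", "view": " ".join(views)})
    for path, op, value in constraints:
        ET.SubElement(query, "constraint", {"path": path, "op": op, "value": value})

    # The query XML travels as a URL-encoded request parameter.
    params = {"query": ET.tostring(query, encoding="unicode"), "format": fmt}
    return f"{base_url}/query/results?{urlencode(params)}"


url = build_query_url(
    views=["Gene.symbol", "Gene.name"],
    constraints=[("Gene.symbol", "=", "Pax6")],
)
print(url)
```

The resulting URL could then be fetched with any HTTP client; the request itself is left out here so that the sketch runs without network access.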
MouseMine was developed under a subcontract to NHGRI Grant HG004834, with additional support from NCI Grant CA089713 and NICHD Grant HD062499. Development of an initial MouseMine prototype was supported by an internship with The Jackson Laboratory Summer Student Program. Many thanks to Drs. Carol Bult and Jim Kadin for advice and feedback on this paper.
- Lyne R, Smith R, Rutherford K, Wakeling M, Varley A, Guillier F, Janssens H, Ji W, Mclaren P, North P, Rana D, Riley T, Sullivan J, Watkins X, Woodbridge M, Lilley K, Russell S, Ashburner M, Mizuguchi K, Micklem G (2007) FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biol 8(7):R129
- Smith RN, Aleksic J, Butano D, Carr A, Contrino S, Hu F, Lyne M, Lyne R, Kalderimis A, Rutherford K, Stepan R, Sullivan J, Wakeling M, Watkins X, Micklem G (2012) InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data. Bioinformatics 28(23):3163–3165
- Smith CM, Finger JH, Hayamizu TF, McCright IJ, Xu J, Berghout J, Campbell J, Corbani LE, Forthofer KL, Frost PJ, Miers D, Shaw DR, Stone KR, Eppig JT, Kadin JA, Richardson JE, Ringwald M (2014) The mouse Gene Expression Database (GXD): 2014 update. Nucleic Acids Res 42:D818–D824
- Sullivan J, Karra K, Moxon SA, Vallejos A, Motenko H, Wong JD, Aleksic J, Balakrishnan R, Binkley G, Harris T, Hitz B, Jayaraman P, Lyne R, Neuhauser S, Pich C, Smith RN, Trinh Q, Cherry JM, Richardson J, Stein L, Twigger S, Westerfield M, Worthey E, Micklem G (2013) InterMOD: integrated data and tools for the unification of model organism research. Sci Rep 3:1802
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.