Background

The laboratory mouse is the premier mammalian model organism for the study of human disease, and it has played a vital role in both the annotation of the human genome and the study of gene function and regulation. Similar to humans, mice naturally develop diverse diseases that affect the hematologic, nervous, cardiovascular, endocrine, musculoskeletal, renal and other systems, providing excellent experimental paradigms for studying the pathogenesis of cancer, autoimmune disease, diabetes, obesity, atherosclerosis, hypertension, gastrointestinal disorders and diverse neurodegenerative states. Mouse models are currently available for hundreds of human disorders [1-4], spanning diverse quantitative and behavioral phenotypes and physiological systems. These comprise both inbred strains and genetically engineered mutants, many of which have been extensively characterized. For these reasons, the mouse has emerged as a premier system for translating basic human genetic, genomic and physiologic research into paradigms for therapeutic development.

The mouse genome has been uniquely useful in annotating the human genome and advancing the understanding of human gene functions. At 2.7 Gb, the mouse genome is comparable in size and structure to the human genome, and 99% of mouse genes have human orthologs. Because inbred strains are available and mouse breeding is easy and rapid, the mouse has played a vital role in decoding fundamental features of gene function and regulation during developmental and differentiation intervals that are either difficult or impossible to study systematically in humans. An ideal evolutionary distance for human comparative genomics (circa 75 million years) has made the mouse genome a standard for comparative genomic analyses seeking to illuminate human functional DNA [5-7].

Less than 2% of the mouse genome is currently believed to comprise protein-coding regions. Among the vast non-coding sequences lie numerous yet-to-be-identified functional DNA elements that regulate diverse genomic processes, including transcriptional regulation, meiotic recombination, and DNA replication and repair. A major focus of the Mouse ENCODE project is to identify comprehensively transcriptional regulatory elements in the mouse genome, providing a valuable resource for understanding the genetic circuitry that controls animal development and lineage specification. It is expected that millions of cis-regulatory elements lie within mouse non-coding regions, many of which are conserved in human DNA. As such, comprehensive illumination of mouse elements should greatly facilitate the functional annotation of the human genome.

Hundreds of human-to-mouse transgenic studies demonstrate the potential of the mouse genome to inform studies of human gene regulation; indeed, transgenic mice have become a routine part of the repertoire of modern molecular and developmental biology. Many fundamental aspects of transgenic gene regulation that are routinely taken for granted emphasize the great utility of the mouse system. Many human genes integrated into the mouse germline recapitulate features of human gene regulation with striking precision, indicating that the trans-acting regulatory environment has remained largely stable during an evolutionary interval that witnessed marked divergence in the non-coding DNA sequences that regulate most genes [7-12]. The apparent stability of the trans-acting regulatory environment renders the mouse uniquely useful for studies of transcriptional regulation by mutagenesis of human DNA that is then transferred into the mouse. Engineered mutations in transgenic mice frequently show phenotypes analogous to those of naturally occurring mutations in humans.

The Mouse ENCODE Project Consortium

By undertaking a parallel Mouse ENCODE Project that utilizes the same technologies and pipelines developed for the human ENCODE Project [13-15], the Mouse ENCODE Consortium aims to (i) enhance the value of the human ENCODE Project through relevant comparative studies; (ii) access cell types, tissues, and developmental time points that are not addressable by the human project; and (iii) provide a general resource to inform and accelerate ongoing efforts in mouse genomics and disease modeling with human translational potential.

The Mouse ENCODE Consortium comprises data production centers and a data coordination center (DCC). Production centers generally focus on different data types, including transcription factor and polymerase occupancy, DNaseI hypersensitivity, histone modification and RNA transcription. The DCC is co-located with the human ENCODE Project DCC [15] at the University of California Santa Cruz (UCSC), USA.

A web-based portal site (MOUSE ENCODE [16]) has been established to consolidate and distribute information on Mouse ENCODE consortium goals, data, protocols and publications.

Mouse ENCODE data types

The Mouse ENCODE Project is analyzing primary mouse cells and tissues spanning a range of tissue types and developmental time points, as well as model cell lines. To ensure consistency, the project is focusing on C57BL/6-derived cells and tissues, except in the case of certain widely used model cell lines. Primary tissues are harvested from age-matched mice using standardized protocols on mice either bred locally or obtained from standard sources (The Jackson Laboratory, Bar Harbor, Maine, USA; Charles River Laboratories, Wilmington, Massachusetts, USA). Following the practice of the human ENCODE Project [14], model cell lines are cultured using standard operating procedures that are reviewed for consistency and clarity. Among the cell lines in use are those selected as analogs to several human ENCODE common cell lines [14], including K562 (mouse erythroleukemia cell line MEL (ATCC)), GM12878 (mouse lymphoid cell line CH12 (ATCC)), and H1 embryonic stem cells (E14 mouse embryonic stem cells).

Accessing Mouse ENCODE data

The Mouse ENCODE Project has already generated and released hundreds of data sets through the UCSC browser [17, 18] (Table S1 in Additional file 1). All data sets are also deposited with the Gene Expression Omnibus (GEO) repository after public release through the UCSC browser. The data sets shown in Table S1 in Additional file 1 span many high-utility data types generated using state-of-the-art approaches, including DNaseI hypersensitive sites by DNase-seq [19], DNaseI footprints by digital genomic footprinting [20], RNA-seq [21], histone modifications by ChIP-seq [22], transcription factor and polymerase occupancy sites by ChIP-seq [23], and DNA replication timing by Repli-chip [24]. In addition, selected chromosomal regions will be interrogated for chromatin interactions by 5C [25], including the entirety of mouse chromosome 12. All data are collected from at least two biological replicates, and all replicate data are also available through the Mouse ENCODE repository at UCSC. An up-to-date log of Mouse ENCODE data releases can be found [16], and it is also linked through the home page of the ENCODE project [26]. Submissions are ongoing, and an updated summary timeline for major data types is available [18].

To ensure the quality and consistency of experimental procedures used at each data production center, the Consortium has selected a single reference cell type (MEL) on which all experimental approaches are being applied. For other cell and tissue types, the data types vary, with DNaseI sensitivity, histone modifications and RNA-seq focused mainly on primary tissues, and transcription factor binding generally focused on model cell lines (Table S1 in Additional file 1). A comprehensive collection of cell culture and tissue sample preparation protocols utilized by the Consortium is available online [27].

Data production standards and assessment of data quality

The Mouse ENCODE Consortium is applying the same data generation, quality control and analysis pipelines and the same data standards developed for the human ENCODE Project. Working copies of data standards documents are available as an appendix to the recently published User's Guide to ENCODE Data [14] and at the home page of the ENCODE project [26]. Consortium data undergo quality review first at the level of the production centers, to ensure experimental success and generation of high-quality data, and subsequently at the DCC (see below), to ensure accurate visualization and correct links to primary data files and metadata.

Data availability

Mouse ENCODE data are available online through the UCSC browser mm9 mouse genome sequence build [17] and through a dedicated Mouse ENCODE mirror browser linked to the portal site [18]. Data in the UCSC browser can be viewed readily in the context of other genome annotations available for the mouse genome. An online tutorial developed for facilitating the viewing of human ENCODE data is also directly applicable to the Mouse ENCODE data [28]. Detailed instructions are also provided for the data download and analysis functions available in the browser. DNA sequence reads from Mouse ENCODE ChIP-seq, DNase-seq and RNA-seq are available for direct retrieval from the UCSC browser archive [29] and the GEO repository [30].

Data release and use policy

The Mouse ENCODE data are released soon after they are verified (that is, shown to be reproducible) to facilitate their immediate utility to the broader community. A log of data releases is available at the Mouse ENCODE portal site [18] and through the main UCSC browser [17]. The terms of data use are described under the ENCODE Data Release and Use Policy [31]. As with human ENCODE, data are made available following quality review and standardization of formatting. While Mouse ENCODE data are made freely available for viewing and pre-publication analysis upon release, data use for genome-wide analysis in papers, abstracts or public presentations is restricted during the first 9 months following public release. The expiration of this 'embargo' period for genome-wide analyses is clearly marked in the track titles of Mouse ENCODE data in the UCSC browser. Mouse ENCODE data are immediately available for analysis of individual gene loci.
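As an illustration of the 9-month restriction described above, the following minimal sketch computes when the embargo on genome-wide analyses would lapse for a given public release date. It is not part of the ENCODE policy or of any UCSC tool. The function names are hypothetical, and the counting convention (whole calendar months, with the day clamped to the end of shorter months and the expiry day itself treated as unrestricted) is an assumption; the policy document [31] and the date marked in the track title remain authoritative.

```python
import calendar
from datetime import date

# Length of the restriction on genome-wide analyses, as stated in the text.
EMBARGO_MONTHS = 9


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    Assumption: if the target month is shorter than the start day
    (for example 31 May + 9 months), the day is clamped to the last
    day of the target month.
    """
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def embargo_expiry(release: date) -> date:
    """Hypothetical helper: date on which the embargo would lapse."""
    return add_months(release, EMBARGO_MONTHS)


def genome_wide_use_allowed(release: date, today: date) -> bool:
    """True once the embargo has lapsed (expiry day assumed inclusive).

    Analysis of individual gene loci is not restricted and is not
    modeled here.
    """
    return today >= embargo_expiry(release)
```

For example, under these assumptions a data set released on 15 January 2012 would become available for genome-wide analyses in publications on 15 October 2012.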

Data analysis plans

Production groups are engaged in analysis of the individual data types generated by each group. In addition, the Mouse ENCODE Consortium is currently planning an integrated analysis. Integration of multiple Mouse ENCODE data types will be performed to assess the extent of annotation of the mouse genome, and to illuminate general features of mouse gene and chromosomal regulation. Mouse ENCODE data will also be extensively integrated with human ENCODE data in order to study the evolution of gene regulatory mechanisms, and to cross-validate findings within both the human and mouse projects. Integration with data from invertebrates (Drosophila melanogaster and Caenorhabditis elegans) generated under the modENCODE project may also yield insights into common gene regulatory mechanisms and conserved pathways. While it is expected that broad features of regulatory mechanisms will be conserved across animal phyla, the integrative and comparative analyses enabled by the Mouse ENCODE project will provide a unique opportunity for systematic study of both conservation of function and biochemical activity relative to conservation of sequence per se. The Consortium expects to conduct global analyses with an emphasis on integration with the human ENCODE Project, and not to focus on specific genes, genomic regions, tissues/cell states or pathways.

Joining the Mouse ENCODE Consortium

Following the model of the human ENCODE Consortium, which currently counts hundreds of members worldwide, the Mouse ENCODE Consortium is an open scientific venture that welcomes scientists at all levels and with all types of relevant expertise. More information on joining the human or mouse ENCODE Consortia is available [26].

Perspective

In summary, the laboratory mouse is a powerful tool for the investigation of human gene function and for dissecting the genetic and transcriptional regulatory circuits controlling development and homeostasis of mammals. The Mouse ENCODE Project aims to potentiate both the utility of the mouse as a model for regulatory genomics and the human ENCODE project effort to advance annotation of the human genome.

Mouse ENCODE Consortium Authors

Writing Group:

John A Stamatoyannopoulos1, Michael Snyder3, Ross Hardison7, Bing Ren12

University of Washington-Fred Hutchinson Cancer Research Center Group:

John A Stamatoyannopoulos1, Mark Groudine2, Michael Bender2, Rajinder Kaul1, Theresa Canfield1, Erica Giste1, Audra Johnson1, Mia Zhang2, Gayathri Balasundaram2, Rachel Byron2, Vaughan Roach1, Peter Sabo1, Richard Sandstrom1, A Sandra Stehling1, Bob Thurman1

Stanford-Yale Group:

Michael Snyder3, Sherman M Weissman4, Philip Cayting4,5,6, Manoj Hariharan3, Jin Lian5, Yong Cheng3, Stephen G Landt3, Zhihai Ma3

Penn State/University of Massachusetts/Duke University/Emory University/California Institute of Technology/University of California, Irvine/Children's Hospital of Philadelphia Group:

Ross Hardison7, Barbara J Wold16, Job Dekker8, Gregory Crawford9,10, Cheryl A Keller7, Weisheng Wu7, Christopher Morrissey7, Swathi A Kumar7, Tejaswini Mishra7, Deepti Jain7, Marta Byrska-Bishop7, Daniel Blankenberg7, Bryan R Lajoie8, Gaurav Jain8, Amartya Sanyal8, Kaun-Bei Chen9, Olgert Denas9, James Taylor11, Gerd A Blobel15, Mitchell J Weiss15, Max Pimkin15, Wulan Deng15, Georgi K Marinov16, Brian A Williams16, Katherine I Fisher-Aylor16, Gilberto Desalvo16, Anthony Kiralusha16, Diane Trout16, Henry Amrhein16, Ali Mortazavi17

University of California San Diego Group:

Bing Ren12, Lee Edsall12, David McCleary12, Samantha Kuan12, Yin Shen12, Feng Yue12, Zhen Ye12

Cold Spring Harbor Laboratory/CRG Group:

Thomas R Gingeras18, Carrie A Davis18, Chris Zaleski18, Sonali Jha18, Chenghai Xue18, Alex Dobin18, Wei Lin18, Meagan Fastuca18, Huaien Wang18, Roderic Guigo19, Sarah Djebali19, Julien Lagarde19

Florida State University Group:

David M Gilbert20, Tyrone Ryba20, Takayo Sasaki20

Data Coordination Center at University of California Santa Cruz:

Venkat S Malladi13, Melissa S Cline13, Vanessa M Kirkup13, Katrina Learned13, Kate R Rosenbloom13 and W James Kent13

NHGRI Project Management Group:

Elise A Feingold14, Peter J Good14, Michael Pazin14, Rebecca F Lowdon14, Leslie B Adams14