Comprehensively identifying and characterizing the missing gene sequences in human reference genome with integrated analytic approaches

  • Original Investigation
  • Human Genetics

Abstract

The human reference genome is still incomplete, and a number of gene sequences are missing from it. The approaches for uncovering these sequences, the reasons for their absence and their functions remain largely unexplored. Here, we comprehensively identified and characterized the missing genes of the human reference genome using RNA-Seq data from 16 different human tissues. Using genome-guided transcriptome reconstruction coupled with genome-wide comparison, we uncovered 3.78 and 2.37 Mb of transcribed regions in the Celera and HuRef human genome assemblies, respectively, that are either missing from their homologous chromosomes in NCBI human reference genome build 37.2 or partially or entirely absent from the reference. From de novo transcriptome assembly, we further identified a significant number of novel transcript contigs in each tissue that cannot be aligned to NCBI build 37.2 but can be aligned to at least one of the Celera, HuRef, chimpanzee, macaque or mouse genomes. Our analyses indicate that the missing genes could result from genome misassembly, transposition, copy number variation, translocation and other structural variations. Moreover, our results suggest that a large portion of these missing genes are conserved between human and other mammals, implying that they have important biological functions. In total, 1,233 functional protein domains were detected in these missing genes. Collectively, our study not only provides approaches for uncovering the missing genes of a genome, but also proposes potential reasons why genes are missing from the genome and highlights the importance of uncovering the missing genes of incomplete genomes.
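
The selection rule described for the de novo contigs (unalignable to the reference, yet alignable to at least one other genome) can be illustrated with a minimal sketch. It is not the authors' pipeline: the genome labels, the `hits` input structure and the function name `candidate_missing_contigs` are assumptions made for illustration, and the alignment step itself is taken as already done.

```python
from typing import Dict, Set

# Hypothetical labels; the abstract names the genomes but not any identifiers.
REFERENCE = "NCBI_build_37.2"
OTHER_GENOMES = {"Celera", "HuRef", "chimpanzee", "macaque", "mouse"}


def candidate_missing_contigs(hits: Dict[str, Set[str]]) -> Set[str]:
    """Return contigs that do not align to the reference but align elsewhere.

    `hits` maps a contig ID to the set of genome labels it aligned to
    (assumed to come from an upstream aligner, not shown here).
    """
    return {
        contig
        for contig, genomes in hits.items()
        # Keep a contig only if the reference is absent from its hits
        # and at least one of the other genomes is present.
        if REFERENCE not in genomes and genomes & OTHER_GENOMES
    }


# Toy input: made-up contig IDs, not data from the study.
hits = {
    "contig_1": {"NCBI_build_37.2", "Celera"},  # aligns to reference: excluded
    "contig_2": {"HuRef", "chimpanzee"},        # candidate
    "contig_3": set(),                          # aligns nowhere: excluded
    "contig_4": {"mouse"},                      # candidate
}
print(sorted(candidate_missing_contigs(hits)))  # ['contig_2', 'contig_4']
```

Contigs with no hit in any genome are excluded in this sketch because the stated criterion requires support from at least one other assembly or species.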

Acknowledgments

We thank Danielle and Jean Thierry-Mieg from NCBI, Kangping Yin, Hui Liu and Jiang Li for helpful discussions. This work was supported by the National 973 Key Basic Research Program (Grant Nos. 2010CB945401 and 2012CB910400), the National Natural Science Foundation of China (Grant Nos. 31171264, 31071162 and 31240038), the Science and Technology Commission of Shanghai Municipality (Grant No. 11DZ2260300) and the Graduate School of East China Normal University.

Conflict of interest

None.

Author information

Corresponding author

Correspondence to Tieliu Shi.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 50 kb)

Supplementary material 2 (BED 683 kb)

Supplementary material 3 (BED 558 kb)

Supplementary material 4 (FA 3914 kb)

Supplementary material 5 (FA 2499 kb)

Supplementary material 6 (FA 203 kb)

Supplementary material 7 (FA 302 kb)

Supplementary material 8 (FA 303 kb)

Supplementary material 9 (FA 246 kb)

Supplementary material 10 (FA 203 kb)

Supplementary material 11 (FA 179 kb)

Supplementary material 12 (FA 209 kb)

Supplementary material 13 (FA 175 kb)

Supplementary material 14 (FA 218 kb)

Supplementary material 15 (FA 247 kb)

Supplementary material 16 (FA 272 kb)

Supplementary material 17 (FA 193 kb)

Supplementary material 18 (FA 137 kb)

Supplementary material 19 (FA 267 kb)

Supplementary material 20 (FA 246 kb)

Supplementary material 21 (FA 145 kb)

About this article

Cite this article

Chen, G., Wang, C., Shi, L. et al. Comprehensively identifying and characterizing the missing gene sequences in human reference genome with integrated analytic approaches. Hum Genet 132, 899–911 (2013). https://doi.org/10.1007/s00439-013-1300-9

  • DOI: https://doi.org/10.1007/s00439-013-1300-9
