Data Quality Problems When Integrating Genomic Information

  • Ana León
  • José Reyes
  • Verónica Burriel
  • Francisco Valverde
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9975)


Due to the complexity of genomic information and the large volume of data produced every day, the genomic information accessible on the web has become very difficult to integrate, which hinders the research process. Drawing on knowledge from the Data Quality field, and after a dedicated study of a set of genomic databases, we have found problems related to six Data Quality dimensions. The aim of this paper is to highlight the problems that bioinformaticians face when they integrate information from different genomic databases. Its contribution is to identify and characterize those problems in order to understand which ones hinder the research process and increase the time that researchers lose on this task.


Data quality, Data integration, Genomic databases



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Ana León (1)
  • José Reyes (1)
  • Verónica Burriel (1)
  • Francisco Valverde (1)
  1. Research Center on Software Production Methods (PROS), Universitat Politècnica de València, Valencia, Spain
