Measuring Population-Based Completeness for Single Nucleotide Polymorphism (SNP) Databases

  • Nurul A. Emran
  • Suzanne Embury
  • Paolo Missier
Part of the Studies in Computational Intelligence book series (SCI, volume 551)


Completeness is an important aspect of data quality, as observed in biological domains such as Single Nucleotide Polymorphism (SNP) research. To decide whether the data sets concerned are acceptable, biologists need to measure their completeness. One type of completeness measure, population-based completeness (PBC), has been identified as relevant to the data completeness problem in this domain. This paper presents an implementation of PBC measurement as a system prototype that uses real SNP data sets, together with an analysis of the practical problems encountered during the implementation.


population-based completeness (PBC); Single Nucleotide Polymorphism (SNP); data completeness measurement





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Nurul A. Emran (1)
  • Suzanne Embury (2)
  • Paolo Missier (3)
  1. Computing Intelligence Technologies (CIT) Lab, Centre of Advanced Computing Technology (C-ACT), Universiti Teknikal Malaysia Melaka (UTeM), Hang Tuah Jaya, Malaysia
  2. The University of Manchester, Manchester, UK
  3. The University of Newcastle, Newcastle, Australia
