Uncertain Groupings: Probabilistic Combination of Grouping Data

  • Brend Wanders
  • Maurice van Keulen
  • Paul van der Vet
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9261)


Abstract

Probabilistic approaches for data integration have much potential [7]. We view data integration as an iterative process in which data understanding gradually increases as the data scientist continuously refines their view on how to deal with learned intricacies such as data conflicts. This paper presents a probabilistic approach for integrating data on groupings. We focus on a bio-informatics use case concerning homology. A bio-informatician has a large number of homology data sources to choose from. To enable querying of the combined knowledge contained in these sources, the sources need to be integrated. We validate our approach by integrating three real-world biological databases on homology in three iterations.



Acknowledgments

We would like to thank the late Tjeerd Boerman for his work on the use case and his initial concept of groupings. We would also like to thank Arnold Kuzniar for his insights and feedback on our use of biological databases and Ivor Wanders for his reviewing and editing assistance.


References

  1. Altenhoff, A., Dessimoz, C.: Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput. Biol. 5, e1000262 (2009)
  2. Antova, L., Koch, C., Olteanu, D.: \({10^{(10^{6})}}\) worlds and beyond: efficient representation and processing of incomplete information. VLDB J. 18(5), 1021–1040 (2009)
  3. Koonin, E.: Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 39, 309–338 (2005)
  4. Kuzniar, A., Lin, K., He, Y., Nijveen, H., Pongor, S., Leunissen, J.A.M.: ProGMap: an integrated annotation resource for protein orthology. Nucleic Acids Res. 37(suppl. 2), W428–W434 (2009)
  5. Kuzniar, A., van Ham, R., Pongor, S., Leunissen, J.: The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 24, 539–551 (2008)
  6. Louie, B., Detwiler, L., Dalvi, N., Shaker, R., Tarczy-Hornoch, P., Suciu, D.: Incorporating uncertainty metrics into a general-purpose data integration system. In: 19th International Conference on Scientific and Statistical Database Management, SSDBM 2007, p. 19 (2007)
  7. Magnani, M., Montesi, D.: A survey on uncertainty management in data integration. J. Data Inf. Qual. 2(1), 5:1–5:33 (2010)
  8. NCBI Resource Coordinators: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 41(D1), D8–D20 (2013)
  9. Powell, S., Szklarczyk, D., Trachana, K., Roth, A., Kuhn, M., Muller, J., Arnold, R., Rattei, T., Letunic, I., Doerks, T., et al.: eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 40, D284–D289 (2011)
  10. van Keulen, M.: Managing uncertainty: the road towards better data interoperability. IT - Inf. Technol. 54(3), 138–146 (2012)
  11. Wanders, B., van Keulen, M., van der Vet, P.E.: Uncertain groupings: probabilistic combination of grouping data. Technical report TR-CTIT-14-12, Centre for Telematics and Information Technology, University of Twente, Enschede (2014)
  12. Widom, J.: Trio: a system for integrated management of data, accuracy, and lineage. Technical report 2004–40, Stanford InfoLab (2004)
  13. Wu, C.H., Nikolskaya, A., Huang, H., Yeh, L.-S.L., Natale, D.A., Vinayaka, C.R., Hu, Z.-Z., Mazumder, R., Kumar, S., Kourtesis, P., Ledley, R.S., Suzek, B.E., Arminski, L., Chen, Y., Zhang, J., Cardenas, J.L., Chung, S., Castro-Alvear, J., Dinkov, G., Barker, W.C.: PIRSF: family classification system at the protein information resource. Nucleic Acids Res. 32(suppl. 1), D112–D114 (2004)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Brend Wanders (1)
  • Maurice van Keulen (1)
  • Paul van der Vet (1)
  1. Faculty EEMCS, University of Twente, Enschede, The Netherlands
