Uncertain Groupings: Probabilistic Combination of Grouping Data
Probabilistic approaches for data integration have much potential. We view data integration as an iterative process in which data understanding gradually increases as the data scientist continuously refines his view on how to deal with learned intricacies such as data conflicts. This paper presents a probabilistic approach for integrating data on groupings. We focus on a bio-informatics use case concerning homology. A bio-informatician has a large number of homology data sources to choose from. To enable querying of the combined knowledge contained in these sources, the sources need to be integrated. We validate our approach by integrating three real-world biological databases on homology in three iterations.
We would like to thank the late Tjeerd Boerman for his work on the use case and his initial concept of groupings. We would also like to thank Arnold Kuzniar for his insights and feedback on our use of biological databases and Ivor Wanders for his reviewing and editing assistance.