What’s New? What’s Certain? – Scoring Search Results in the Presence of Overlapping Data Sources

Hussels, Philipp; Trißl, Silke; Leser, Ulf

doi:10.1007/978-3-540-73255-6_19

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4544)

Included in the following conference series:

International Conference on Data Integration in the Life Sciences

Abstract

Data integration projects in the life sciences often gather data on a particular subject from multiple sources, some of which overlap to a certain degree. Integrated search results may therefore be supported by one, a few, or all data sources. To reflect these differences, results should be ranked according to the number of data sources that support them. What such a ranking should look like is not clear per se: either results supported by only a few sources are ranked high because this information is potentially new, or such results are ranked low because the strength of evidence supporting them is limited.

We present two scoring schemes to rank search results in the integrated protein annotation database Columba. We define a surprisingness score, preferring results supported by few sources, and a confidence score, preferring frequently encountered information. Unlike many other scoring schemes, our proposal is purely data-driven and does not require users to specify preferences among sources. Both scores take the concrete overlaps of data sources into account and do not presume statistical independence. We show how our schemes have been implemented efficiently using SQL.
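
The abstract does not state the score definitions, so the following is only an illustrative sketch of the two ranking directions it describes, not the authors' formulas. It orders results purely by the number of supporting sources and deliberately omits the overlap-aware weighting mentioned above. The source names `A`, `B`, `C`, the result ids, and the function `rank_by_support` are hypothetical, and Python is used here only for illustration (the paper reports an SQL implementation).

```python
from typing import Dict, FrozenSet, List

# Hypothetical toy data: result id -> set of sources that support it.
SUPPORT: Dict[str, FrozenSet[str]] = {
    "r1": frozenset({"A"}),
    "r2": frozenset({"A", "B", "C"}),
    "r3": frozenset({"B", "C"}),
}


def rank_by_support(support: Dict[str, FrozenSet[str]], prefer_few: bool) -> List[str]:
    """Order result ids by their number of supporting sources.

    prefer_few=True  puts rarely supported results first (surprisingness-like).
    prefer_few=False puts widely supported results first (confidence-like).
    Ties are broken by result id so the ordering is deterministic.
    """
    sign = 1 if prefer_few else -1
    return sorted(support, key=lambda rid: (sign * len(support[rid]), rid))


# Assumption: a plain support count stands in for both scores.
surprising_first = rank_by_support(SUPPORT, prefer_few=True)
confident_first = rank_by_support(SUPPORT, prefer_few=False)
```

With this toy data the two orderings are exact reverses of each other. A score that accounts for source overlap, as the abstract describes, would in general not reduce to a simple count.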

Author information

Authors and Affiliations

Humboldt-Universität zu Berlin, Institute of Computer Sciences, D-10099 Berlin, Germany
Philipp Hussels, Silke Trißl & Ulf Leser

Editor information

Sarah Cohen-Boulakia, Val Tannen

Cite this paper

Hussels, P., Trißl, S., Leser, U. (2007). What’s New? What’s Certain? – Scoring Search Results in the Presence of Overlapping Data Sources. In: Cohen-Boulakia, S., Tannen, V. (eds) Data Integration in the Life Sciences. DILS 2007. Lecture Notes in Computer Science, vol 4544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73255-6_19

DOI: https://doi.org/10.1007/978-3-540-73255-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73254-9
Online ISBN: 978-3-540-73255-6
eBook Packages: Computer Science, Computer Science (R0)
