Abstract
Protein interaction networks comprise thousands of individual binary links between distinct proteins. Whilst these data have attracted considerable attention and been the focus of many different studies, the networks, their structure, function, and how they change over time are still not fully understood. More importantly, there is still considerable uncertainty regarding their size, and the quality of the available data continues to be questioned. Here, we employ statistical models of the experimental sampling process, in particular capture–recapture methods, in order to assess the false discovery rate and size of protein interaction networks. We use these methods to gauge the ability of different experimental systems to find the true binary interactome. Our model allows us to obtain estimates for the size and false discovery rate from simple considerations regarding the number of repeatedly observed interactions, and provides suggestions as to how we can exploit this information in order to reduce the effects of noise in such data. In particular, our approach does not require a reference dataset. We estimate that more than half of the true physical interactome has now been sampled in yeast.
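To illustrate the general idea behind capture–recapture estimation, the sketch below applies the classical two-sample Lincoln–Petersen estimator, and Chapman's bias-corrected variant, to two sets of observed interactions. This is not the model developed in the article: it assumes independent screens, equal detection probability for every interaction, and no false positives, whereas the article's model also estimates the false discovery rate. The function names and the toy interaction lists are hypothetical and serve only as an example.

```python
from typing import Iterable, Set, FrozenSet, Tuple

Pair = Tuple[str, str]


def normalise(pairs: Iterable[Pair]) -> Set[FrozenSet[str]]:
    """Treat interactions as undirected, so (A, B) and (B, A) are the same link."""
    return {frozenset(p) for p in pairs}


def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Classical estimate N = n1 * n2 / m, where m is the number of
    interactions seen in both screens."""
    if m <= 0:
        raise ValueError("need at least one repeatedly observed interaction")
    if m > min(n1, n2):
        raise ValueError("overlap cannot exceed either sample size")
    return n1 * n2 / m


def chapman_estimate(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-corrected variant; defined even when m == 0."""
    if m < 0 or m > min(n1, n2):
        raise ValueError("overlap must lie between 0 and min(n1, n2)")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def estimate_from_screens(screen_a: Iterable[Pair], screen_b: Iterable[Pair]) -> float:
    """Estimate the total number of interactions from two independent screens."""
    a, b = normalise(screen_a), normalise(screen_b)
    return lincoln_petersen(len(a), len(b), len(a & b))


# Hypothetical toy data, not taken from any real experiment.
toy_screen_a = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
toy_screen_b = [("B", "A"), ("C", "D"), ("E", "F")]

if __name__ == "__main__":
    print(estimate_from_screens(toy_screen_a, toy_screen_b))
```

The estimate grows as the overlap between screens shrinks: few repeatedly observed interactions suggest that each screen has sampled only a small part of the network. False positives inflate each sample without adding to the overlap, which biases this simple estimator upwards, so real data need a model that accounts for error rates.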
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Kelly, W.P., Stumpf, M.P.H. Assessing Coverage of Protein Interaction Data Using Capture–Recapture Models. Bull Math Biol 74, 356–374 (2012). https://doi.org/10.1007/s11538-011-9680-2
DOI: https://doi.org/10.1007/s11538-011-9680-2