Abstract
Terms corresponding to biological entities can now be identified within passages of biomedical text corpora; the critical next step is to detect the potential relationships among them. These relationships are typically detected by co-occurrence analysis, which reveals associations between bioentities from their joint appearance in single sentences and/or entire abstracts. These associations implicitly define networks whose nodes represent terms, bioentities, or concepts and whose edges represent relationships; edge weights may represent the confidence of these semantic connections.
This chapter reviews current methods for co-occurrence analysis, focusing on data storage, analysis, and representation. We highlight usage scenarios in which these approaches are implemented by tools for information extraction and knowledge inference in systems biology. We illustrate the practical utility of two online resources providing services of this type, STRING and BioTextQuest, and conclude with a discussion of current challenges and future perspectives in the field.
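The idea of sentence-level co-occurrence defining a weighted network can be illustrated with a minimal Python sketch. It is not the method used by STRING or BioTextQuest; the function name `cooccurrence_network`, the toy entity names (`geneA`, `geneB`, `geneC`), the naive sentence splitter, and the use of raw counts as edge weights are all assumptions made for illustration only.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_network(text, terms):
    """Count how often each pair of terms appears in the same sentence.

    Returns a dict mapping (term_a, term_b) -> number of shared sentences.
    Hypothetical helper: real systems use proper entity recognition and
    statistically normalized scores rather than raw counts.
    """
    lowered = {t.lower(): t for t in terms}
    edges = Counter()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence.lower()))
        # Terms found in this sentence, sorted so each pair has one canonical key.
        found = sorted(lowered[t] for t in lowered if t in tokens)
        for a, b in combinations(found, 2):
            edges[(a, b)] += 1
    return dict(edges)


# Toy text with made-up entity names; it carries no biological meaning.
toy_text = (
    "geneA binds geneB. "
    "geneA, geneB and geneC form a complex. "
    "geneC is expressed."
)
network = cooccurrence_network(toy_text, ["geneA", "geneB", "geneC"])
for (a, b), weight in sorted(network.items()):
    print(a, b, weight)
```

Here each key of `network` is an edge between two nodes and the count serves as a crude edge weight: pairs mentioned together in more sentences receive a higher value. Abstract-level co-occurrence could be sketched the same way by treating each abstract, rather than each sentence, as the counting unit.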
Acknowledgments
The work was supported in part by the European Commission FP7 programme “Translational Potential” (TransPOT; EC contract number 285948).
Copyright information
© 2014 Springer Science+Business Media New York
About this protocol
Cite this protocol
Pavlopoulos, G.A., Promponas, V.J., Ouzounis, C.A., Iliopoulos, I. (2014). Biological Information Extraction and Co-occurrence Analysis. In: Kumar, V., Tipney, H. (eds) Biomedical Literature Mining. Methods in Molecular Biology, vol 1159. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-0709-0_5
DOI: https://doi.org/10.1007/978-1-4939-0709-0_5
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-0708-3
Online ISBN: 978-1-4939-0709-0
eBook Packages: Springer Protocols