Text Mining to Support Gene Ontology Curation and Vice Versa
In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the development of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performance of an automatic text categorizer over this decade and show a large improvement of +225 % in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA, which uses machine learning methods trained on curated content, is thus emerging. Finally, future advances in text mining instruments depend directly on the availability of high-quality annotated content at every curation step. Database workflows must start explicitly recording all the data they curate and ideally also some of the data they do not curate.
Key words: Automatic text categorization, Gene Ontology, Data curation, Databases, Data stewardship, Information storage and retrieval
This chapter attempts to concisely describe the role played by text mining in literature-based curation tasks concerned with the description of protein functions. More specifically, the chapter explores the relationships between the Gene Ontology (GO) and text mining.
Subheading 3 presents the text mining methods developed to support the assignment of GO descriptors to a gene or gene product based on the content of published articles. The section also introduces the methodological framework needed to assess the performance of these systems, called automatic text categorizers.
Subheading 4 presents the evolution of the results obtained by GOCat, a GO categorizer that has participated in several BioCreative campaigns.
Finally, Subheading 5 discusses an inverted perspective and shows how GO categorization systems are foundational to a new type of text mining application, so-called Deep Question-Answering (QA). Given a question, Deep QA engines are able to find answers that are literally found in no corpus.
Subheading 6 concludes and emphasizes the responsibility of national and international research infrastructures in establishing virtuous relationships between text mining services and curated databases.
2 State of the Art
This section presents the state of the art in text mining from the point of view of a biocurator, i.e., a person who is maintaining the knowledge stored in gene and protein databases.
2.1 Curation Tasks
In modern molecular biology databases, such as UniProt, the content is authored by biologists called biocurators. The work performed by these biologists when they curate a gene or a gene product encompasses a relatively complex set of individual and collaborative tasks. We can separate these tasks into two subsets: sequence annotation—any information added to the sequence, such as the existence of isoforms—and functional annotation—any information about the role of the gene or gene product in a given pathway or phenotype. Such a separation is partially artificial, because a functional annotation can also establish a relationship between the role of a protein and some sequence positions, but it is didactically convenient to adopt such a view.
The primary source of knowledge for genomics and proteomics is the research literature. In the context of biocuration, text mining can be defined as a process aimed at supporting biocurators when they search, read, identify entities, and store the resulting structured knowledge. The development of benchmarks and metrics to evaluate how automatic text mining systems can help perform these tasks is thus crucial.
BioCreative is a community initiative that periodically evaluates the advances of text mining in biology and biocuration.1 The forum has explored a wide span of tasks, with emphasis on named-entity recognition. Named-entity recognition covers a large set of methods that seek to locate and classify textual elements into predefined categories, such as the names of persons, organizations, locations, genes, diseases, chemical compounds, etc. Thus, querying PubMed with the keywords “biocreative” and “information retrieval” returns 8 PMIDs, whereas 32 PMIDs are returned for the keywords “biocreative” and “named entity” [18th of November 2015].
Table 1 Comparative curation steps supported by text mining
Search is often the first step of a text mining pipeline, although information retrieval has received little attention from bioinformaticians active in text mining. Fortunately, information retrieval has been explored by other scientific communities, and in particular by information scientists via the TREC (Text REtrieval Conference) evaluation campaigns; see ref. 7 for a general introduction. From 2002 to 2015, molecular biology, clinical decision support, and chemistry-related information retrieval challenges have been explored by TREC. Interestingly, large-scale information retrieval studies have consistently shown that named-entity recognition has little or no impact on search effectiveness [11, 12].
2.2 From Basic Search to More Advanced Textual Mining
Beyond information retrieval, more elaborate mining instruments can then be derived. Thus, search engines, which return documents or pointers to documents, are often augmented with passage retrieval capabilities, i.e., the ability to highlight a particular sentence, a few phrases, or even a few keywords in a given context. The enriched representation can help the end user decide upon the relevance of the document. While such passage retrieval functionalities are not crucial for MEDLINE records, because an abstract is short enough to be rapidly read by a human, they become necessary when the search is performed on a collection of full-text articles, as for instance in PubMed Central. Within a full-text article, the ability to identify the section where a given set of keywords matches can be very useful, as matching the relevant keywords in a “background” section has a different value than matching them in a “results” section. The latter is likely to be a new statement while the former is likely to be regarded as well-established knowledge.
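This section-aware weighting can be sketched in a few lines. The section names and weights below are invented for illustration; a real passage retrieval engine would tune them empirically.

```python
# Minimal sketch of section-aware passage scoring; weights are invented
# for illustration, not taken from any published system.
SECTION_WEIGHT = {"results": 2.0, "background": 0.5}

def passage_score(keywords, sentence, section):
    """Count keyword matches in a sentence, weighted by the enclosing section."""
    hits = len(set(keywords) & set(sentence.lower().split()))
    return hits * SECTION_WEIGHT.get(section, 1.0)

# The same sentence scores higher when matched in a "results" section
novel = passage_score(["copper", "binding"], "copper binding was observed", "results")
known = passage_score(["copper", "binding"], "copper binding was observed", "background")
```

In this toy setting, the same matched sentence receives a higher score in a “results” section than in a “background” section, mirroring the intuition above.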
2.3 Named-Entity Recognition
Unlike in other scientific or technical fields (finance, high-energy physics, etc.), in the biomedical domain named-entity recognition covers a very large set of entities. Such richness is well expressed by the content of modern biological databases. Text mining studies have been published for many of those curation needs, including sequence curation and identification of polymorphisms, posttranslational modifications, interactions with gene products or metabolites, etc. In this context, most studies attempted to develop instruments likely to address a particular set of annotation dimensions, serving the needs of a particular molecular biology database. The focus in such studies is often to design a graphical user interface and to simplify the curation work by highlighting specific concepts in a dedicated tool. While most of these systems seem to be exploratory studies, some are deeply integrated in the curation workflow, as shown by the OntoMate tool designed by the Rat Genome Database, the STRING DB for protein–protein interactions, or the BioEditor of neXtProt.
From an evaluation perspective, the idea is to detect the beginning and the end of an entity and to assign a semantic type to this string. Thus, in named-entity recognition, we assume that entity components are textually contiguous. Inherited from early corpus work on information extraction and computational linguistics, the goal is to assign a unique semantic category—e.g., Time, Location, and Person—to a string in a text.
Semantic categories are virtually infinite, but some entities have received more attention. Genes, gene products, proteins, species [21, 22], and more recently chemical compounds have been significantly more studied than, for instance, organs, tissues, cell types, cell anatomy, molecular functions, symptoms, or phenotypes.
The initial works dealing with the recognition of GO entities were disappointing (Subheading 3.2), which may explain part of the reluctance to address these challenges. We see here one important limitation of named entities: it is easy to detect a one- or two-word term in a document, while the recognition of a protein function does require a “deeper” understanding or combination of biological concepts. Indeed, a complex GO concept is likely to combine subconcepts belonging to various semantic types, including small molecules, atoms, protein families, as well as biological processes, molecular functions, and cell locations.
2.4 Normalization and Relationship Extraction
To compensate for the limitations of named-entity recognition frameworks, two complementary approaches have been proposed: entity normalization and information (or relationship) extraction.
Normalization can be defined as the process by which a unique semantic identifier is assigned to the recognized entities. The identifiers are available in different resources, such as onto-terminologies or knowledge bases. The assignment of unique identifiers can be relatively difficult in practice due to a linguistic phenomenon called lexical ambiguity. Many strings are lexically ambiguous and can therefore receive more than one identifier depending on the context (e.g., HIV could be a disease or a virus). The difficulty is amplified in cascaded lexical ambiguities: many entities require the extraction of other entities to receive an unambiguous identifier. For instance, the assignment of an accession number to a protein may depend on the recognition of an organism or a cell line somewhere else in the text.
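Context-based disambiguation of this kind can be sketched as follows. The two-sense dictionary, the identifiers, and the context cue words are illustrative stand-ins; a real normalizer would draw senses from curated onto-terminologies and learn the cues from annotated data.

```python
# Minimal sketch of context-based normalization. The dictionary, identifiers,
# and cue words below are illustrative, not a real terminological resource.
AMBIGUOUS = {
    # surface string -> list of (identifier, semantic type, context cue words)
    "HIV": [
        ("DOID:526", "disease", {"infection", "patients", "therapy"}),
        ("NCBITaxon:11676", "virus", {"genome", "strain", "replication"}),
    ],
}

def normalize(mention, context):
    """Pick the identifier whose context cues best match the surrounding text."""
    candidates = AMBIGUOUS.get(mention, [])
    if not candidates:
        return "UNKNOWN"
    words = set(context.lower().split())
    # Score each candidate sense by the number of cue words found in context
    best = max(candidates, key=lambda c: len(c[2] & words))
    return best[0]
```

For example, `normalize("HIV", "the HIV genome encodes a replication protein")` would select the virus identifier, while a context mentioning patients and therapy would select the disease identifier.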
Further, the extraction of relationships requires the recognition of specific entities, which can be as various as a location, an interaction (binding, coexpression, etc.), an etiology, or a temporal marker (cause, trigger, simultaneity, etc.). For some information extraction tasks, such as protein–protein interactions, normalization and relationship extraction may first require the proper identification of other entities, such as the experimental methods (e.g., yeast two-hybrid) used to generate the prediction. Furthermore, additional information items may be provided, such as the scale of the interaction or the confidence in the interaction.
To identify GO terms, named-entity recognition and information extraction are insufficient due to two main difficulties: first, the difficulty of defining all (or most) strings describing a given concept; second, the difficulty of defining the string boundaries of a given concept. The parsing of texts to identify GO functions, and how they are linked with a given protein, demands the development of specific methods.
2.5 Automatic Text Categorization
Automatic text categorization (ATC) can be defined as the assignment of any class or category to any text content. The interested reader can refer to the literature for a comprehensive introduction to ATC, with a focus on machine learning methods.
In both ATC and information retrieval, documents are regarded as “bags of words.” Such a representation is an approximation, but it is a powerful and productive simplification. From this bag, where all entities and relationships are treated as flat and independent data, ATC attempts to assign a set of unambiguous descriptors. The set of descriptors can be binary, as in triage tasks, where documents are classified as either relevant or irrelevant for curation, or it can be multiclass. The scale of the problem is one parameter of the model. In some situations, ATC systems do not need to provide a clear split between relevant and irrelevant categories. In particular, when a human is in the loop to control the final descriptor assignment step, ATC systems can provide a ranked list of descriptors, where each rank expresses the confidence score of the ATC system. ATC systems and search engines share here a second common point: compared to named-entity recognition, which is normally not interactive, ATC and information retrieval are well suited for human–computer interactions.
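The bag-of-words representation and the ranked, confidence-scored output can be sketched as follows. The two category profiles are invented stand-ins for weights a real system would learn from training data.

```python
from collections import Counter

def bag_of_words(text):
    # Flat, order-free representation: token counts only
    return Counter(text.lower().split())

# Hypothetical category profiles (word -> weight), standing in for weights
# learned from curated training data.
PROFILES = {
    "relevant":   Counter({"binding": 2, "kinase": 2, "phosphorylation": 1}),
    "irrelevant": Counter({"review": 2, "history": 1}),
}

def rank_categories(text):
    """Return categories ranked by a confidence score (dot product with profile)."""
    bag = bag_of_words(text)
    scores = {cat: sum(bag[w] * wgt for w, wgt in prof.items())
              for cat, prof in PROFILES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_categories("kinase binding assay shows phosphorylation of p53")
```

The ranked output supports both usage modes described above: a threshold on the score yields a fully automatic triage, while the full ranked list suits an interactive, human-in-the-loop setting.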
With over 40,000 terms—and many more if we account for synonyms—assigning a GO descriptor to a protein based on some published document is formally known as a large multiclass classification problem.
3.1 Automatic Text Categorization
The two basic approaches to the GO assignment problem are the following: (1) exploit the lexical similarity between a text and a GO term and its synonyms; (2) use an existing database to train a classifier likely to infer associations beyond string matching. The second approach can use any scalable machine learning technique to generate a model trained on the Gene Ontology Annotation (GOA) database. Several machine learning strategies have been used, but the trade-off between effectiveness, efficiency, and scalability often converges toward an approach called k-Nearest Neighbors (k-NN); see also ref. 30.
3.2 Lexical Approaches
Lexical approaches for ATC exploit the similarities between the content of a text and the content of a GO term and its related synonyms. Additional information, such as the definitions of the GO terms, can be taken into account to augment the categorization power. The ranking functions take into account the frequency of words, their specificity (measured by the “inverse document frequency,” the inverse of how many documents contain the word), as well as various positional information (e.g., word order); see ref. 32 for a detailed description.
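A minimal sketch of such a lexical categorizer is given below, with a three-term toy lexicon (term names pre-lowercased) and a simple idf weighting. Real systems also exploit synonyms, definitions, and positional information.

```python
import math
from collections import Counter

# Toy GO lexicon (term names only, pre-lowercased; real systems also
# index synonyms and definitions)
GO_TERMS = {
    "GO:0005507": "copper ion binding",
    "GO:0003677": "dna binding",
    "GO:0004672": "protein kinase activity",
}

# Inverse document frequency over the term collection: rare words are specific
DF = Counter(w for name in GO_TERMS.values() for w in set(name.split()))
N = len(GO_TERMS)

def idf(w):
    return math.log((N + 1) / (DF[w] + 1)) + 1

def rank_terms(text):
    """Rank GO terms by idf-weighted word overlap with the input text."""
    words = set(text.lower().split())
    scores = {}
    for go_id, name in GO_TERMS.items():
        term_words = set(name.split())
        overlap = words & term_words
        # Sum idf of shared words, normalized by term length
        scores[go_id] = sum(idf(w) for w in overlap) / len(term_words)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_terms("Modulation by copper of p53 conformation and dna binding")
```

Note that in this toy example the shorter, more generic “dna binding” outranks “copper ion binding” despite the mention of copper, which hints at the length-matching difficulties discussed next.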
The task is extremely challenging if we consider that some GO terms contain a dozen words, which makes those terms virtually unmatchable in any textual repository. The results of the first BioCreative competition, which addressed this challenge, were therefore disappointing. The best “high-precision” system achieved 80 % precision, but it covered less than 20 % of the test sample. In contrast, with a recall close to 80 %, the best “high-recall” systems obtained an average precision of 20–30 %. At that time, over 10 years ago, such a complex task was consequently regarded as practically out of reach for machines.
3.3 k-Nearest Neighbors
The principle of a k-NN is the following: for an instance X to be classified, the system computes a similarity measure between X and a set of annotated instances. In a GO categorizer, an instance is typically a PMID annotated with some GO descriptors. Instances at the top of the list are assumed to be “similar” to X. The value of k, i.e., the number of similar instances (or neighbors) taken into account to assign one or several categories to X, must be determined experimentally.
When considering a full-text article, a particular section in this article, or even a MEDLINE record, it is possible to compute a distance between this text and similar articles in the GOA database, because in the curated section of GOA many GO descriptors are associated with a PMID—those marked up with an EXP evidence code. The computation of the distance between two arbitrary texts can be more or less complex—starting with counting how many words they share—and the determination of the k parameter can also depend on different empirical features (number of documents in the collection, average size of a document, etc.), but the approach is both effective and computationally simple. Moreover, the ability to index all the curated instances a priori makes it possible to compute distances efficiently.
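The k-NN scheme can be sketched as follows, with an invented three-document stand-in for the GOA-curated collection and the simplest distance surrogate mentioned above, word overlap:

```python
from collections import Counter

# Toy curated collection: PMID -> (abstract words, GO descriptors).
# Documents and annotations are invented stand-ins for GOA entries.
CURATED = {
    "PMID:1": ("copper redox p53 dna binding conformation",
               ["GO:0005507", "GO:0003677"]),
    "PMID:2": ("p53 dna binding transcription", ["GO:0003677"]),
    "PMID:3": ("kinase phosphorylation signaling", ["GO:0004672"]),
}

def similarity(a, b):
    # Simplest possible similarity: number of shared words
    return len(set(a.split()) & set(b.split()))

def knn_categorize(text, k=2):
    """Rank GO descriptors by votes from the k most similar curated abstracts."""
    neighbors = sorted(CURATED.values(),
                       key=lambda doc: similarity(text, doc[0]),
                       reverse=True)[:k]
    votes = Counter(go for _, gos in neighbors for go in gos)
    return [go for go, _ in votes.most_common()]

ranked = knn_categorize("copper p53 dna binding conformation")
```

Because the categories come from the neighbors' curated annotations rather than from string matching against term names, the system can propose descriptors whose wording never appears in the input text.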
The effectiveness of such machine learning algorithms is directly dependent on the volume of curated data. Surprisingly, GO categorizers seem unaffected by concept drift, which affects databases and data-driven approaches in general. Even old data, i.e., proteins annotated with an early version of the GO, seem useful for k-NN approaches. To give a concrete example, consider proteins curated in 2005 with the version of the Gene Ontology and the MEDLINE reports available at that time: it is difficult to understand why a model containing mainly annotations from 2010 to 2014 would outperform a model containing data from 2003 to 2007, exactly centered on 2005. While the GO itself has expanded by at least a factor of 4 in the past decade, the consistency of the curation model has remained remarkably stable.
3.4 Properties of Lexical and k-NN Categorizers
In this example, the title of an article (“Modulation by copper of p53 conformation and sequence-specific DNA binding: role for Cu(II)/Cu(I) redox mechanism”) is used as input to contrast the behavior of the two approaches. This reference is used in UniProt to support the assignment of the “copper ion binding” descriptor to p53. We see that the lexical system (left panel) assigns the descriptor at rank #12, while the k-NN system (right panel) provides the descriptor in position #1.
Finally, we see that both categorizers are also flexible instruments, as they basically learn to rank a set of a priori categories. Such systems can easily be used as fully automatic systems—taking into account only the top N returned descriptors by setting an empirical threshold score—or as interactive systems able to display dozens of descriptors, including many irrelevant ones, which can then be discarded by the curator.
Today, GO k-NN categorizers do outperform lexical categorizers; however, the behaviors of the two systems are complementary. While the latter are potentially able to assign a GO descriptor that has rarely or never been used to generate an annotation, the former are directly dependent on the quantity of [GO; PMID] pairs available in GOA.
3.5 Inter-annotator Agreement
An important parameter when assessing text mining tools is the development of a ground truth, or gold standard. Typically, for GO annotation, we assume that the content of curated databases is the absolute reference. This assumption is acceptable from a methodological perspective, as text mining systems need such benchmarks. However, it is worth observing that two curators do not fully agree when they assign descriptors, which means that a 100 % precision is purely theoretical. Thus, Camon et al. report that two GO annotators would have an agreement score of about 39–43 %. The upper score is achieved when the assignment of a generic concept instead of a more specific one (a child) is counted as an agreement.
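This lenient agreement counting can be sketched over a tiny invented is-a fragment: a match is counted whenever one curator's term is an ancestor (or descendant) of the other's.

```python
# Sketch of a lenient agreement measure over a tiny invented is-a fragment
# of the GO graph (child -> parent).
PARENTS = {
    "GO:0005507": "GO:0043169",   # copper ion binding -> cation binding
    "GO:0043169": "GO:0005488",   # cation binding -> binding
}

def ancestors(term):
    """Return the term plus all its ancestors along is-a edges."""
    out = {term}
    while term in PARENTS:
        term = PARENTS[term]
        out.add(term)
    return out

def lenient_agreement(set_a, set_b):
    """Fraction of A's terms matching one of B's exactly or via the hierarchy."""
    if not set_a:
        return 0.0
    hits = sum(1 for a in set_a
               if any(a in ancestors(b) or b in ancestors(a) for b in set_b))
    return hits / len(set_a)
```

Under this measure, a curator assigning the generic “binding” agrees with one assigning the specific “copper ion binding,” whereas a strict exact-match measure would score the pair as a disagreement.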
4 Today’s Performances
Today, GOCat is able to assign a correct descriptor to a given MEDLINE record two times out of three using the BioCreative I benchmark, which makes it useful to support functional annotation. Other types of systems can be used to support complementary tasks of literature exploration (GoPubMed) or named-entity recognition. While GOCat attempts to assign GO descriptors to any input with the objective of helping curate the content of the input, GoPubMed provides a set of facets (Gene Ontology or Medical Subject Headings) to navigate the results of a query submitted to PubMed.
It is worth observing that GO categorizers work best when they assume that the curator is involved in selecting the input papers (performing a triage or selection task, as described in Table 1). Such a setting, inherited from the BioCreative competitions [33, 40], is questionable for at least two reasons: (1) curators read full-text articles and not only the abstracts—captions and legends seem especially important; (2) the triage task, i.e., the ability to select an article as relevant for curation, could mostly be performed by a machine, provided that fair training data are available. In 2013, the BioCreative campaign, under the responsibility of the NCBI, revisited the task. The competitors were provided with full-text articles and were asked not only to return GO descriptors but also to select a subset of sentences. The evaluation was thus more transparent. A small but high-quality annotated sample of full-text papers was provided.
Together with GO and GOA, which were used by most participants in the competition, some online databases seem particularly valuable to help assign GO descriptors. Thus, Luu et al. use the cross-product databases with some effectiveness.
The results reported in the previous section are obtained by using only 10–20 % of the content of an article, a fraction that is likely to be sufficient to obtain the top-ranked GO descriptors. This suggests that 80–90 % of what is published is unnecessary from an information-theoretic perspective.
5.1 Information Redundancy and Curation-Driven Data Stewardship
New and informative statements are rare in general. They are moreover buried in a mass of relatively redundant and poorly content-bearing claims. It has been shown that the density and precision of information in abstracts are higher [5, 46] than in full-text reports, while the level of redundancy across papers and abstracts is probably relatively high as well.
We understand that the separation of valuable scientific statements is labor intensive for curators. This filtering effort is complicated within an article, but also between articles at retrieval time. We argue that such a task could be performed by machines, provided that high-quality training data are available. The training data needed by text mining systems are unfortunately lost during the curation process. Indeed, the separation between useful and useless materials (e.g., PMIDs and sentences) is performed by the curator during the annotation process, but it is not recorded in databases.
In some cases, the separation is explicit; in other cases, it is implicit; but the key point is that a mass of information is definitely lost, with no possible recovery. The capture of the output of the selection process—at least for the positive content, but ideally also for a fraction of the negative content—is a minimal requirement to improve text mining methods. The expected impact of implementing such a simple data stewardship recommendation is likely a game changer for text mining, far beyond any hypothetical technological advances.
5.2 Assigning Unmatchable GO Descriptors: Toward Deep QA
Some GO concepts describe entities that are so specific that they can hardly be found anywhere. This has several consequences. Traditional QA systems were recently made popular by answering Jeopardy!-like questions with entities as various as politicians, towns, plants, countries, songs, etc.; see ref. 47. In the biomedical field, Bauer and Berleant compare four systems, looking at their ergonomics. With a precision in the range of 70–80 %, these systems perform relatively well. However, none of these systems is able to answer questions about functional proteomics. Indeed, how can a text mining system find an answer if such an answer is not likely to be found on Earth in any corpus of books, articles, or patents? The ability to accurately process questions such as “What molecular functions are associated with TP53?” requires supplying answers such as “RNA polymerase II transcription regulatory region sequence-specific DNA binding transcription factor activity involved in positive regulation of transcription,” and only GO categorizers are likely to automatically generate such an answer.
We may think that such complex concepts could be made simpler by splitting the concept into subconcepts, using clinical terminological resources such as SNOMED CT [50, 51] or ICD-10; see also Chap. 20. That might be correct in some rare cases, but in general, complex systems tend to be more accurately described using complex concepts. The post-coordination methods explored elsewhere remain effective for analytical tasks, but they make generative tasks very challenging. Post-coordination is useful to search a database or a digital library, because search tasks assume that documents are “bags of words” and ignore the relationships between these words. However, other tasks, such as QA or curation, do require the ability to meaningfully combine concepts. In this context, the availability of a precomputed list of concepts, or controlled vocabulary, is extremely useful to avoid generating ill-formed entities.
Answering functional omics questions is thus truly original: it requires the elaboration of a new type of QA engine, such as the DeepQA4GO engine. For GO-type answers, DeepQA4GO is able to generate the expected GO descriptors about two times out of three, compared to one time out of three for traditional systems. We propose to call these new emerging systems Deep QA engines. Deep QA engines, like traditional QA engines, are able to screen through millions of documents, but since no corpus contains the expected answers, Deep QA needs to exploit curated biological databases in order to generate useful candidate answers for curators.
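The two-stage architecture described above can be sketched as a pipeline: classical retrieval first, then a GO categorizer that generates candidate answers which need not appear verbatim in any corpus. Both components below are simplistic stand-ins (the categorizer stub keys on a single invented cue word); a real engine would plug in a k-NN categorizer trained on GOA.

```python
# Sketch of a Deep QA pipeline for functional questions; both stages
# are stand-ins for real retrieval and GO categorization components.

def retrieve(question, corpus):
    # Stage 1: classical retrieval — keep documents sharing words with the question
    q = set(question.lower().split())
    return [doc for doc in corpus if q & set(doc.lower().split())]

def go_categorize(text):
    # Stage 2 stand-in: a GO categorizer (e.g., k-NN over GOA) would rank
    # descriptors here; this stub keys on a single invented cue word.
    return ["GO:0003677"] if "dna" in text.lower().split() else []

def deep_qa(question, corpus):
    """Answer a functional question with GO descriptors generated, not extracted."""
    answers = []
    for doc in retrieve(question, corpus):
        answers.extend(go_categorize(doc))
    # Deduplicate while preserving rank order
    return list(dict.fromkeys(answers))

corpus = ["tp53 binds dna in the nucleus", "unrelated kinase paper"]
answers = deep_qa("what molecular functions are associated with tp53", corpus)
```

The key design point is that the answer space is the controlled vocabulary itself, not spans of retrieved text, which is what allows the engine to return descriptors found in no document.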
While the chapter started by introducing the reader to how text mining can support database annotation, the conclusion is that next-generation text mining systems will be supported by curated databases. The key challenge has moved from the design of text mining systems to the design of text mining systems able to capitalize on the availability of curated databases. Future advances in text mining to support biocuration and biomedical knowledge discovery are largely in the hands of database providers. Database workflows must start explicitly recording all the data they curate and ideally also some of the data they do not curate.
In parallel, the accuracy of text mining systems to support GO annotation has improved massively, from 20 to 65 % (+225 %), between 2005 and 2015. With almost 10,000 queries a month, a tool like GOCat is useful to provide a basic functional annotation of proteins with unknown and/or uncurated functions, as exemplified by the large-scale usage of GOCat by the COMBREX database [56, 57]. However, the integration of text mining support systems into curation workflows remains challenging. As often stated, curation is accurate but does not scale, while text mining is not accurate but scales. National and international research infrastructures should play a central role in promoting optimal data stewardship practices across the databases they support. Similarly, innovative curation models should emerge by combining the quality and richness of curation workflows, more cost-effective crowd-based triage, and the scalability of text mining instruments.
- 2. Dimmer EC, Huntley RP, Alam-Faruque Y, Sawford T, O’Donovan C, Martin MJ, Bely B, Browne P, Mun Chan W, Eberhardt R, Gardner M, Laiho K, Legge D, Magrane M, Pichler K, Poggioli D, Sehra H, Auchincloss A, Axelsen K, Blatter MC, Boutet E, Braconi-Quintaje S, Breuza L, Bridge A, Coudert E, Estreicher A, Famiglietti L, Ferro-Rojas S, Feuermann M, Gos A, Gruaz-Gumowski N, Hinz U, Hulo C, James J, Jimenez S, Jungo F, Keller G, Lemercier P, Lieberherr D, Masson P, Moinat M, Pedruzzi I, Poux S, Rivoire C, Roechert B, Schneider M, Stutz A, Sundaram S, Tognolli M, Bougueleret L, Argoud-Puy G, Cusin I, Duek-Roggli P, Xenarios I, Apweiler R (2012) The UniProt-GO Annotation database in 2011. Nucleic Acids Res 40(Database issue):D565–D570. doi: 10.1093/nar/gkr1048
- 3. Poux S, Magrane M, Arighi CN, Bridge A, O’Donovan C, Laiho K; UniProt Consortium (2014) Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. Database (Oxford):bau016. doi: 10.1093/database/bau016
- 4. Vishnyakova D, Pasche E, Ruch P (2012) Using binary classification to prioritize and curate articles for the Comparative Toxicogenomics Database. Database (Oxford) 2012
- 6. Lu Z, Hirschman L (2012) Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II. Database (Oxford) 2012
- 7. Singhal A (2001) Modern information retrieval: a brief overview. IEEE Data Eng Bull 24:35–43
- 9. Simpson MS, Voorhees EM, Hersh W (2014) Overview of the TREC 2014 Clinical Decision Support Track. TREC 2014
- 15. Xu S, An X, Zhu L, Zhang Y, Zhang H (2015) A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature. J Cheminform 7(Suppl 1):S11. doi: 10.1186/1758-2946-7-S1-S11
- 16. Dowell KG, McAndrews-Hill MS, Hill DP, Drabkin HJ, Blake JA (2009) Integrating text mining into the MGI biocuration workflow. Database (Oxford):bap019
- 17. Liu W, Laulederkind SJ, Hayman GT, Wang SJ, Nigam R, Smith JR, De Pons J, Dwinell MR, Shimoyama M (2015) OntoMate: a text-mining tool aiding curation at the Rat Genome Database. Database (Oxford):bau129
- 19. Black WJ, Gilardoni L, Dressel R, Rinaldi F (1997) Integrated text categorisation and information extraction using pattern matching and linguistic processing. RIAO 1997
- 20. Chinchor N (1997) Overview of MUC-7. Message Understanding Conference (MUC)
- 22. Smith L, Tanabe LK, Ando RJ, Kuo CJ, Chung IF, Hsu CN, Lin YS, Klinger R, Friedrich CM, Ganchev K, Torii M, Liu H, Haddow B, Struble CA, Povinelli RJ, Vlachos A, Baumgartner WA Jr, Hunter L, Carpenter B, Tsai RT, Dai HJ, Liu F, Chen Y, Sun C, Katrenko S, Adriaans P, Blaschke C, Torres R, Neves M, Nakov P, Divoli A, Maña-López M, Mata J, Wilbur WJ (2008) Overview of BioCreative II gene mention recognition. Genome Biol 9(Suppl 2):S2
- 23. Tran LT, Divita G, Carter ME, Judd J, Samore MH, Gundlapalli AV (2015) Exploiting the UMLS Metathesaurus for extracting and categorizing concepts representing signs and symptoms to anatomically related organ systems. J Biomed Inform. doi: 10.1016/j.jbi.2015.08.024
- 24. Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, Sun C, Liu HH, Torres R, Krauthammer M, Lau WW, Liu H, Hsu CN, Schuemie M, Cohen KB, Hirschman L (2008) Overview of BioCreative II gene normalization. Genome Biol 9(Suppl 2):S3. doi: 10.1186/gb-2008-9-s2-s3
- 26. Perfetto L, Briganti L, Calderone A, Perpetuini AC, Iannuccelli M, Langone F, Licata L, Marinkovic M, Mattioni A, Pavlidou T, Peluso D, Petrilli LL, Pirrò S, Posca D, Santonico E, Silvestri A, Spada F, Castagnoli L, Cesareni G (2015) SIGNOR: a database of causal relationships between biological entities. Nucleic Acids Res 44:D548–D554
- 27. Bastian FB, Chibucos MC, Gaudet P, Giglio M, Holliday GL, Huang H, Lewis SE, Niknejad A, Orchard S, Poux S, Skunca N, Robinson-Rechavi M (2015) The Confidence Information Ontology: a step towards a standard for asserting confidence in annotations. Database (Oxford):bav043. doi: 10.1093/database/bav043
- 31.Couto F, Silva M, Coutinho P (2005) FiGO: finding GO terms in unstructured text. BioCreative Workshop ProceedingsGoogle Scholar
- 34.Gaudet et al. Primer on gene ontology. GO handbookGoogle Scholar
- 35.Gobeill J, Pasche E, Vishnyakova D, Ruch P. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. Database 2013Google Scholar
- 41.Mao Y, Van Auken K, Li D, Arighi CN, McQuilton P, G Hayman T, Tweedie S, Schaeffer ML, Laulederkind SJF, Wang S-J, Gobeill J, Ruch P, Luu AT, Kim J-J, Chiang J-H, De Chen Y, Yang C-J, Liu H, Zhu D, Li Y, Yu H, Emadzadeh E, Gonzalez G, Chen J-M, Dai H-J, Lu Z (2014). Overview of the gene ontology task at BioCreative IV. Database (Oxford) 2014Google Scholar
- 42.Van Auken K, Schaeffer ML, McQuilton P, Laulederkind SJ, Li D, Wang SJ, Hayman GT, Tweedie S, Arighi CN, Done J, Müller HM, Sternberg PW, Mao Y, Wei CH, Lu Z (2014) BC4GO: a full-text corpus for the BioCreative IV GO task. Database (Oxford). pii: bau074. doi: 10.1093/database/bau074
- 43.Gobeill J, Pasche E, Vishnyakova D, Ruch P (2014) Closing the loop: from paper to protein annotation using supervised Gene Ontology classification. Database (Oxford):bau088
- 44.Luu AT, Kim JJ, Ng SK (2013) Gene ontology concept recognition using cross-products and statistical methods. In: The Fourth BioCreative Challenge Evaluation Workshop, vol. 1, Bethesda, MD, USA, pp 174–181
- 47.Ferrucci D (2012) Introduction to "This is Watson". IBM J Res Dev 56(3.4):1–15
- 49.Gobeill J, Pasche E, Teodoro D, Veuthey AL, Lovis C, Ruch P (2009) Question answering for biology and medicine. In: Information Technology and Applications in Biomedicine (ITAB 2009)
- 51.Dolin RH, Spackman KA, Markwell D (2002) Selective retrieval of pre- and post-coordinated SNOMED concepts. Proc AMIA Symp:210–214
- 52.Baud RH, Rassinoux AM, Ruch P, Lovis C, Scherrer JR (1999) The power and limits of a rule-based morpho-semantic parser. Proc AMIA Symp:22–26
- 53.Denaxas SC (2016) Integrating bio-ontologies and controlled clinical terminologies: from base pairs to bedside phenotypes. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 20
- 54.Gobeill J, Gaudinat A, Pasche E, Vishnyakova D, Gaudet P, Bairoch A, Ruch P (2015) Deep question answering for protein annotation. Database (Oxford):bav081
- 56.Anton BP, Chang YC, Brown P, Choi HP, Faller LL, Guleria J, Hu Z, Klitgord N, Levy-Moonshine A, Maksad A, Mazumdar V, McGettrick M, Osmani L, Pokrzywa R, Rachlin J, Swaminathan R, Allen B, Housman G, Monahan C, Rochussen K, Tao K, Bhagwat AS, Brenner SE, Columbus L, de Crécy-Lagard V, Ferguson D, Fomenkov A, Gadda G, Morgan RD, Osterman AL, Rodionov DA, Rodionova IA, Rudd KE, Söll D, Spain J, Xu SY, Bateman A, Blumenthal RM, Bollinger JM, Chang WS, Ferrer M, Friedberg I, Galperin MY, Gobeill J, Haft D, Hunt J, Karp P, Klimke W, Krebs C, Macelis D, Madupu R, Martin MJ, Miller JH, O’Donovan C, Palsson B, Ruch P, Setterdahl A, Sutton G, Tate J, Yakunin A, Tchigvintsev D, Plata G, Hu J, Greiner R, Horn D, Sjölander K, Salzberg SL, Vitkup D, Letovsky S, Segrè D, DeLisi C, Roberts RJ, Steffen M, Kasif S (2013) The COMBREX Project: design, methodology, and initial results. PLoS Biol 11(8):e1001638
- 57.Škunca N, Roberts RJ, Steffen M (2016) Evaluating computational gene ontology annotations. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 8
- 58.Burger J, Doughty E, Khare R, Wei CH, Mishra R, Aberdeen J, Tresner-Kirsch D, Wellner B, Kann M, Lu Z, Hirschman L (2014) Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing. Database (Oxford) 2014
This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this chapter are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the intended use is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.