A Gene Ontology Tutorial in Python
This chapter is a tutorial on using Gene Ontology resources in the Python programming language. This entails querying the Gene Ontology graph, retrieving Gene Ontology annotations, performing gene enrichment analyses, and computing basic semantic similarity between GO terms. An interactive version of the tutorial, including solutions, is available at http://gohandbook.org.
Key words: Gene Ontology, Tutorial, Python
One of the main goals of developing a formal ontology is to facilitate computational analysis. The purpose of this chapter is to provide a hands-on introduction to handling GO terms and GO annotations in Python. This tutorial also shows how Python can be used to perform GO term enrichment analyses, as well as how to compute the similarity between GO terms.
This tutorial uses Python, but other popular languages commonly used to perform GO analyses include Java, R, Perl, and Matlab. The Gene Ontology consortium website maintains a list of software libraries, accessible from
An interactive version of this tutorial, with model solutions to all the questions, is available from the book homepage at http://gohandbook.org.
2 Querying the Gene Ontology
One convenient Python package available to query the GO is GOATOOLS. This package can read the GO structure stored in OBO format, which is available from the GO website (see Chap. 11). After loading this file, it is possible to traverse the GO structure, search for particular GO terms, and find out which other terms they are related to and how.
This package is available on the Python Package Index (PyPI), a standard repository of Python libraries. As such, it is possible to install it locally using the following command (the exact version used for this tutorial is noted at the end of the chapter):
pip install goatools
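Once the ontology is loaded, each term is represented as a GOTerm object, whose attributes include:
GOTerm.name: textual name of the term;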
GOTerm.namespace: the ontology the term belongs to (i.e., Molecular Function [MF], Biological Process [BP], or Cellular Component [CC]);
GOTerm.parents: list of parent terms;
GOTerm.children: list of children terms;
GOTerm.level: shortest distance to the root node;
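As a minimal sketch of how these attributes might be inspected, the helper below formats a one-line summary for any object that exposes them. The function name describe_term is hypothetical, and the stand-in terms built with SimpleNamespace only mimic a GOTerm so that the example runs without GOATOOLS. The commented lines at the end indicate how a real term would typically be obtained, assuming a local OBO file named go-basic.obo.

```python
from types import SimpleNamespace


def describe_term(term):
    """Return a one-line summary of a GOTerm-like object.

    Only the attributes listed in the text are used: name, namespace,
    parents, children and level.
    """
    return "{} [{}] level={} parents={} children={}".format(
        term.name, term.namespace, term.level,
        len(term.parents), len(term.children))


# Stand-in objects with made-up values, so the sketch runs without GOATOOLS.
root = SimpleNamespace(name="toy root", namespace="biological_process",
                       level=0, parents=[], children=[])
child = SimpleNamespace(name="toy child", namespace="biological_process",
                        level=1, parents=[root], children=[])
root.children.append(child)

summary = describe_term(child)
print(summary)

# With GOATOOLS installed and a local OBO file (file name is an assumption):
#     from goatools.obo_parser import GODag
#     go = GODag("go-basic.obo")
#     print(describe_term(go["GO:0048527"]))
```

Writing the helper against plain attributes rather than a specific class keeps it usable with whatever term objects the installed GOATOOLS version returns.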
What is the name of the GO term GO:0048527?
What are the immediate parent(s) of the term GO:0048527?
What are the immediate children of the term GO:0048527?
Recursively find all the parent and child terms of the term GO:0048527. Hint: use your solutions to the previous two questions, with a recursive loop.
How many GO terms have the word “growth” in their name?
What is the deepest common ancestor term of GO:0048527 and GO:0097178?
Which GO terms regulate GO:0007124 (pseudohyphal growth)? Hint: load the relationship tags and look for terms which define regulation.
Produce a figure similar to that in Fig. 1, for the GO term GO:0097190. From the visualisation, what is the name of this term?
Using this figure, what is the most specific term that is in the parent terms of both GO:0097191 (extrinsic apoptotic signalling pathway) and GO:0038034 (signal transduction in absence of ligand)? This is also referred to as the lowest common ancestor (see Chap. 12).
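One possible approach to the recursive traversal and common-ancestor questions is sketched here on a small hand-made graph rather than on the real ontology. The PARENTS dictionary, its term names and the helper names are all hypothetical. The depth of a term is measured with the same "shortest distance to the root" notion used for GOTerm.level.

```python
# Toy ontology: child -> list of direct parents (hypothetical terms).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C", "D"],
}


def all_ancestors(term, parents=PARENTS):
    """Recursively collect every ancestor of a term."""
    found = set()
    for parent in parents[term]:
        found.add(parent)
        found |= all_ancestors(parent, parents)
    return found


def level(term, parents=PARENTS):
    """Shortest distance from the term to a root node."""
    if not parents[term]:
        return 0
    return 1 + min(level(p, parents) for p in parents[term])


def deepest_common_ancestor(term_a, term_b, parents=PARENTS):
    """Return the common ancestor with the greatest level.

    A term counts as its own ancestor here, so the result for a term and
    one of its descendants is the term itself. Ties are broken
    alphabetically to keep the result deterministic.
    """
    anc_a = all_ancestors(term_a, parents) | {term_a}
    anc_b = all_ancestors(term_b, parents) | {term_b}
    return max(sorted(anc_a & anc_b), key=lambda t: level(t, parents))


print(sorted(all_ancestors("E")))
print(deepest_common_ancestor("C", "D"))
```

The same recursion applies to real GOTerm objects by following their parents attribute instead of looking terms up in a dictionary.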
As an alternative to GOATOOLS and OBO files, it is possible to retrieve information relating to a specific term from a web service. One such service is the EMBL-EBI QuickGO resource (see Chap. 11; [3, 5]), which can provide descriptive information about GO terms in OBO-XML format. It is possible to request this OBO-XML file over HTTP, using a URL of the form
where <GO_ID> is replaced with the GO identifier for the term of interest. Source Code 2.1 lists an example function that automates this in Python; it uses the urllib library to request the OBO-XML and the xmltodict library to parse the XML into an easy-to-use dictionary structure. The xmltodict and future libraries can be installed using pip, if required, whereas urllib is part of the Python standard library. The future library is used to ensure that the function is compatible with both Python 2 and 3.
Source Code 2.1. get_oboxml() function for Python 2 and 3.
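from future.standard_library import install_aliases
install_aliases()
from urllib.request import urlopen
import xmltodict

# Placeholder (hypothetical name): set this to the QuickGO URL form described
# in the text, with "{go_id}" in place of <GO_ID>.
QUICKGO_URL_TEMPLATE = None

def get_oboxml(go_id):
    """
    This function retrieves the OBO-XML for a given Gene Ontology term, using EMBL-EBI's QuickGO browser.
    Input: go_id - a valid Gene Ontology ID, e.g. GO:0048527.
    """
    quickgo_url = QUICKGO_URL_TEMPLATE.format(go_id=go_id)
    oboxml = urlopen(quickgo_url)
    # Check the response
    if oboxml.getcode() == 200:
        return xmltodict.parse(oboxml.read())
    raise ValueError("Couldn't receive OBOXML from QuickGO. Check URL and try again.")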
The main advantage of using a web service, such as QuickGO, is that there is no requirement to download and parse the entire Gene Ontology structure; only the information required is retrieved. This is therefore more efficient if only a few particular terms are involved in an analysis. By contrast, for analyses involving many terms, the file-based approach described above is more suitable.
Find the name and description of the GO term GO:0048527 (lateral root development). Hint: print out the dictionary returned by the function and study its structure, or use the visualisation in Fig. 2.
Look at the difference in the OBO-XML output for the GO terms GO:0048527 (lateral root development) and GO:0097178 (ruffle assembly), then generate a table of the synonymous relationships of the term GO:0097178.
3 Retrieving GO Annotations
This section looks at manipulating the Gene Association File (GAF) standard, using a parser from the BioPython package.
Firstly, a GAF file, which contains GO annotations, shall be downloaded from the UniProt-GOA database. Their website (https://www.ebi.ac.uk/GOA/downloads) lists a number of variants. For this tutorial, the reduced GAF file containing only the gene association data for Arabidopsis thaliana is going to be used.
Annotations from GAF files can be loaded into a Python dictionary using an iterator from the BioPython package (Bio.UniProt.GOA.gafiterator). Source Code 3.1 shows a simple example of this being used, in order to print out the protein ID for each annotation.
Source Code 3.1
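import gzip
from Bio.UniProt.GOA import gafiterator
# filename = <LOCATION OF GAF FILE>
filename = 'gene_association.goa_arabidopsis.gz'
# Open the gzipped GAF file in text mode and iterate over its annotations
with gzip.open(filename, 'rt') as fp:
    for annotation in gafiterator(fp):
        # Output annotated protein ID
        print(annotation['DB_Object_ID'])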
Each annotation returned by the iterator is a dictionary, whose keys include:
'DB': the protein database;
'DB_Object_ID': protein ID;
'Qualifier': annotation qualifier (such as NOT);
'GO_ID': GO term;
'Evidence': evidence code.
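Summaries over such annotation dictionaries can be sketched with a few small helpers. The records below are made up and merely mimic the keys listed above; real records would come from gafiterator. The sketch assumes that 'Qualifier' holds a list of strings, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical records mimicking the keys listed above.
# 'Qualifier' is assumed to be a list of strings.
annotations = [
    {"DB": "UniProtKB", "DB_Object_ID": "P00001", "Qualifier": [""],
     "GO_ID": "GO:0048527", "Evidence": "IEA"},
    {"DB": "UniProtKB", "DB_Object_ID": "P00002", "Qualifier": ["NOT"],
     "GO_ID": "GO:0048527", "Evidence": "IMP"},
    {"DB": "UniProtKB", "DB_Object_ID": "P00001", "Qualifier": [""],
     "GO_ID": "GO:0097178", "Evidence": "IEA"},
    {"DB": "UniProtKB", "DB_Object_ID": "P00003", "Qualifier": [""],
     "GO_ID": "GO:0048527", "Evidence": "IDA"},
]


def percent_not(records):
    """Percentage of annotations carrying a NOT qualifier."""
    records = list(records)
    negated = sum(1 for r in records if "NOT" in r["Qualifier"])
    return 100.0 * negated / len(records)


def proteins_with_term(records, go_id):
    """IDs of proteins annotated with go_id, ignoring NOT annotations."""
    return {r["DB_Object_ID"] for r in records
            if r["GO_ID"] == go_id and "NOT" not in r["Qualifier"]}


def evidence_counts(records):
    """Number of annotations per evidence code."""
    return Counter(r["Evidence"] for r in records)


print(percent_not(annotations))
print(sorted(proteins_with_term(annotations, "GO:0048527")))
print(evidence_counts(annotations))
```

Excluding NOT annotations when collecting proteins for a term is a choice made in this sketch; whether it is appropriate depends on the question being asked.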
Find the total number of annotations for Arabidopsis thaliana with NOT qualifiers. What is this as a percentage of the total number of annotations for this species?
How many genes (of Arabidopsis thaliana) have the annotation GO:0048527 (lateral root development)?
Generate a list of annotated proteins which have the word “growth” in their name.
There are 21 evidence codes used in the Gene Ontology project. As discussed in Chap. 3, many of these are inferred, either by curators or automatically. Find the counts of each evidence code in the Arabidopsis thaliana annotation file.
4 GO Enrichment or Depletion Analysis
As discussed in detail in Chap. 13, one of the most common analyses performed on GO data is an enrichment (or depletion) analysis. In this tutorial, the GOEnrichmentStudy() function available in the GOATOOLS library (introduced in Sect. 2) will be used.
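The GOEnrichmentStudy() function takes the following parameters:
the background set (also known as the “population set”), passed as a list of protein IDs matching the keys of the association dictionary;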
associations between proteins IDs and GO term IDs, passed as a dictionary with protein IDs as the keys and sets of associated GO terms as the values;
the Gene Ontology structure, i.e., the output of the obo_parser() function from GOATOOLS;
whether annotations should be propagated to all parent terms (defined in terms of is_a tags only), indicated by setting the optional Boolean parameter propagate_counts to True (default) or False;
the significance level, indicated by setting the optional parameter alpha to the desired cut-off (default: 0.05);
the foreground set (also known as the “study set”), indicated by setting the parameter study to a list of protein IDs;
the list of method(s) to be used to assess significance, indicated by setting the parameter methods to a list containing one or several of these elements:
"bonferroni": Fisher’s exact test with Bonferroni correction for multiple testing;
"sidak": Fisher’s exact test with Šidák correction for multiple testing;
"holm": Fisher’s exact test with Holm–Bonferroni correction for multiple testing;
The function returns the list of over-represented and under-represented GO terms in the study set, compared to the background (population) set.
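To connect the annotation dictionaries from Sect. 3 with the parameters listed above, the sketch below builds the association dictionary (protein ID to set of GO term IDs). The helper build_associations and the sample records are hypothetical, and the population and study sets are assumed to be given as protein IDs matching the dictionary keys. The commented call at the end only indicates how the pieces might be passed to GOATOOLS; it is not executed here.

```python
from collections import defaultdict


def build_associations(records):
    """Map each protein ID to the set of GO term IDs it is annotated with.

    Annotations with a NOT qualifier are skipped in this sketch, since they
    state that the protein does not have the function.
    """
    assoc = defaultdict(set)
    for record in records:
        if "NOT" in record["Qualifier"]:
            continue
        assoc[record["DB_Object_ID"]].add(record["GO_ID"])
    return dict(assoc)


# Made-up records using the GAF keys described in Sect. 3.
records = [
    {"DB_Object_ID": "P00001", "Qualifier": [""], "GO_ID": "GO:0048527"},
    {"DB_Object_ID": "P00002", "Qualifier": ["NOT"], "GO_ID": "GO:0048527"},
    {"DB_Object_ID": "P00001", "Qualifier": [""], "GO_ID": "GO:0097178"},
    {"DB_Object_ID": "P00003", "Qualifier": [""], "GO_ID": "GO:0048527"},
]

assoc = build_associations(records)
population = sorted(assoc)   # background: every annotated protein
study = ["P00001"]           # foreground: proteins of interest

# With GOATOOLS installed and an ontology loaded as `go`, a call along these
# lines would be expected (not run here):
#     from goatools.go_enrichment import GOEnrichmentStudy
#     g = GOEnrichmentStudy(population, assoc, go,
#                           propagate_counts=True, alpha=0.05,
#                           methods=["bonferroni", "sidak", "holm"])
#     results = g.run_study(study)
```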
Which GO term is most significantly enriched or depleted? Does this make sense?
How many terms are enriched, when using the Bonferroni corrected p-value ≤ 0.01?
How many terms are enriched, when using the false discovery rate (a.k.a. q-value) ≤ 0.01?
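The three multiple-testing corrections named in the method list can be illustrated on a plain list of p-values. This is a textbook sketch written for this purpose, not the GOATOOLS implementation; the function names and the example p-values are made up.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def sidak(pvalues):
    """Sidak adjustment: 1 - (1 - p)^m for m tests."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def holm(pvalues):
    """Holm-Bonferroni step-down adjustment, returned in input order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Enforce that adjusted values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


pvals = [0.01, 0.04, 0.03]
print(bonferroni(pvals))
print(sidak(pvals))
print(holm(pvals))
```

A term would then be reported as significant when its adjusted p-value falls below the chosen cut-off, such as the 0.01 threshold used in the questions above.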
5 Computing Basic Semantic Similarities Between GO Terms
In this section, the focus is on computing semantic similarity between GO terms, based on ideas presented in detail in Chap. 12. Semantic similarity measures enable us to quantify the functional similarity of genes annotated with GO terms.
Recall that semantic similarity measures are broadly separated into two categories: graph-based and information-theoretic measures. The former rely only on the structure of the Gene Ontology graph, whilst the latter also account for the information content of the terms.
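As an illustration of the information-theoretic idea, the sketch below computes the information content of a term as the negative logarithm of its relative annotation frequency, and the Resnik similarity as the information content of the most informative common ancestor. The toy graph, the counts and the function names are hypothetical; the counts are assumed to be already propagated to ancestors, so that the root count equals the total.

```python
import math

# Toy graph: child -> direct parents (hypothetical terms).
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "C": ["A"], "D": ["A"]}

# Hypothetical annotation counts, already propagated up to ancestors.
COUNTS = {"root": 100, "A": 40, "B": 60, "C": 10, "D": 30}


def ancestors(term):
    """All ancestors of a term, including the term itself."""
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def information_content(term):
    """IC(t) = -log(p(t)), with p(t) the relative annotation frequency."""
    return -math.log(float(COUNTS[term]) / COUNTS["root"])


def resnik(term_a, term_b):
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)


print(information_content("C"))
print(resnik("C", "D"))
```

Two terms whose only shared ancestor is the root obtain a similarity of zero under this measure, because the root carries no information.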
GO:0048364 (root development) and GO:0044707 (single-multicellular organism process) are two GO terms taken from Fig. 1. Calculate the semantic similarity between them based on the inverse of the semantic distance (number of branches separating them).
Calculate the information content (IC) of the GO term GO:0048364 (root development), based on the frequency of observation in Arabidopsis thaliana.
Calculate the Resnik similarity measure between the same two terms as in part a.
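For the graph-based measure, one simple reading of "inverse of the semantic distance" is the reciprocal of the number of edges on the shortest path between two terms. The sketch below finds that path by breadth-first search over an undirected view of a toy graph. The graph and function names are hypothetical, and treating parent and child links as equally traversable is an assumption of this sketch.

```python
from collections import deque

# Toy graph: child -> direct parents (hypothetical terms).
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "C": ["A"], "D": ["A"]}


def neighbours(term):
    """Direct parents and children of a term (undirected view)."""
    children = [c for c, ps in PARENTS.items() if term in ps]
    return PARENTS[term] + children


def semantic_distance(start, goal):
    """Number of edges on the shortest path between two terms."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        term, dist = queue.popleft()
        if term == goal:
            return dist
        for nxt in neighbours(term):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("terms are not connected")


def inverse_distance_similarity(term_a, term_b):
    """Reciprocal of the shortest-path distance between distinct terms."""
    distance = semantic_distance(term_a, term_b)
    if distance == 0:
        raise ValueError("similarity is undefined for identical terms")
    return 1.0 / distance


print(semantic_distance("C", "B"))
print(inverse_distance_similarity("C", "D"))
```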
GOATOOLS version 0.6.4 was used to write this tutorial and the exercises. To install this exact version, use pip install goatools==0.6.4
We thank Adrian Altenhoff, Debra Klopfenstein, and Haibao Tang for helpful feedback on the tutorial. CD acknowledges Swiss National Science Foundation grant 150654 and UK BBSRC grant BB/M015009/1. Open Access charges were funded by the University College London Library, the Swiss Institute of Bioinformatics, the Agassiz Foundation, and the Foundation for the University of Lausanne.
1. Gaudet P, Škunca N, Hu JC, Dessimoz C (2016) Primer on the gene ontology. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 3
2. Tang H, Klopfenstein D, Pedersen B et al (2015) GOATOOLS: tools for gene ontology. Zenodo
3. Munoz-Torres M, Carbon S (2016) Get GO! retrieving GO data using AmiGO, QuickGO, API, files, and tools. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 11
4. Pesquita C (2016) Semantic similarity in the gene ontology. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 12
8. Bauer S (2016) Gene-category analysis. In: Dessimoz C, Škunca N (eds) The gene ontology handbook. Methods in molecular biology, vol 1446. Humana Press. Chapter 13
This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.