Caffeine Example

As our main example, let us consider that we need to retrieve more data and literature about caffeine. If we really do not know anything about caffeine, we may start by opening our favorite internet browser and searching for caffeine in WikipediaFootnote 1 to find out what it really is (see Fig. 3.1). Among all the information available, we can check in the infobox that there are multiple links to external sources. The infobox is normally a table added to the top right-hand part of a web page with structured data about the entity described on that page.

Fig. 3.1 Wikipedia page about caffeine

From the list of identifiers (see Fig. 3.2), let us select the link to a resource hosted by the European Bioinformatics Institute (EBI), the link to CHEBI:27732Footnote 2.

Fig. 3.2 Identifiers section of the Wikipedia page about caffeine

CHEBI is the acronym of the resource Chemical Entities of Biological Interest (ChEBI)Footnote 3, and 27732 is the identifier of the entry in ChEBI describing caffeine (see Fig. 3.3). ChEBI is a freely available database of molecular entities with a focus on “small” chemical compounds. More than a simple database, ChEBI also includes an ontology that classifies the entities according to their structural and biological properties.

Fig. 3.3 ChEBI entry describing caffeine

By analyzing the CHEBI:27732 web page we can check that ChEBI provides a comprehensive set of information about this chemical compound. But let us focus on the Automatic Xrefs tabFootnote 4. This tab provides a set of external links to other resources describing entities somehow related to caffeine (see Fig. 3.4).

Fig. 3.4 External references related to caffeine

In the Protein Sequences section, we have 77 proteins (in September of 2018) related to caffeine. If we click on show all, we will get the complete listFootnote 5 (see Fig. 3.5). These links point to another resource hosted by the EBI, UniProt, a database of protein sequences and annotation data.

Fig. 3.5 Proteins related to caffeine

The list includes the identifier of each protein with a direct link to the respective entry in UniProt, the name of the protein, and some topics about the description of the protein. For example, DISRUPTION PHENOTYPE means that some effects caused by the disruption of the gene coding for the protein are knownFootnote 6.

We should note that at the bottom right of the page there are Export options that enable us to download the full list of protein references in a single file. These options include:

  1. CSV:

    Comma Separated Values, an open file format that enables us to store data as a single table (columns and rows).

  2. Excel:

    a proprietary format designed to store and access the data using the software Microsoft Excel.

  3. XML:

    eXtensible Markup Language, an open file format that enables us to store data using a hierarchy of markup tags.

We start by downloading the CSV, Excel and XML files. We can now open the files and check their contents in a regular text editorFootnote 7 installed on our computer, such as Notepad (Windows), TextEdit (Mac) or gedit (Linux).

The first lines of the chebi_27732_xrefs_UniProt.csv file should look like this:

The first lines of the chebi_27732_xrefs_UniProt.xls file should look like this:

As we can see, this is not the proprietary XLS format but instead a TSV format. Thus, the file can still be opened directly in Microsoft Excel.

The first lines of the chebi_27732_xrefs_UniProt.xml file should look like this:

We should note that all the files contain the same data; they only use a different format.

If, for any reason, we are not able to download the previous files from UniProt, we can get them from the book file archiveFootnote 8.

In the following sections we will use these files to automatize this process, but for now let us continue our manual exercise using the internet browser. Let us select the Ryanodine receptor 1 with the identifier P21817 and click on the linkFootnote 9 (see Fig. 3.6). We can now see that UniProt is much more than just a sequence database. The sequence is just a tiny fraction of all the information describing the protein. All this information can also be downloaded as a single file by clicking on Format and then on XML. Then, we save the result as an XML file to our computer.

Fig. 3.6 UniProt entry describing the Ryanodine receptor 1

Again, we can use our text editor to open the downloaded file named P21817.xml, whose first lines should look like this:

We can check that this entry represents a Homo sapiens (Human) protein, so if we are interested only in human proteins, we will have to filter them. For example, the entry E9PZQ0Footnote 10 in the ChEBI list also represents a Ryanodine receptor 1 protein, but for Mus musculus (Mouse).

Going back to the browser, on the top-left side of the UniProt entry we have a link to publicationsFootnote 11. If we click on it, we will see a list of publications somehow related to the protein (see Fig. 3.7).

Fig. 3.7 Publications related to Ryanodine receptor 1

Let us assume that we are interested in finding phenotypic information. The first title that may attract our attention is: Polymorphisms and deduced amino acid substitutions in the coding sequence of the ryanodine receptor (RYR1) gene in individuals with malignant hyperthermia. To know more about the publication, we can use the UniProt citations service by clicking on the Abstract linkFootnote 12 (see Fig. 3.8).

Fig. 3.8 Abstract of the publication entitled Polymorphisms and deduced amino acid substitutions in the coding sequence of the ryanodine receptor (RYR1) gene in individuals with malignant hyperthermia

To check if the abstract mentions any disease, we can use an online text mining tool, for example the Minimal Named-Entity Recognizer (MER)Footnote 13. We can copy and paste the abstract of the publication into MER and select DO – Human Disease Ontology as the lexicon (see Fig. 3.9).

Fig. 3.9 Diseases recognized by the online tool MER in an abstract

We will see that MER detects three mentions of malignant hyperthermia, giving us another linkFootnote 14 about the disease found (see Fig. 3.10).

Fig. 3.10 Ontobee entry for the class malignant hyperthermia

Thus, in summary, we started from a generic definition of caffeine and ended with an abstract about hyperthermia by following the links in different databases. Of course, this does not mean that by taking caffeine we will get hyperthermia, or that we will treat hyperthermia by taking caffeine (maybe as a cold drink ☺Footnote 15). However, this relation has a context, a protein and a publication, that needs to be further analyzed before drawing any conclusions.

We should note that we only analyzed one protein and one publication; we now need to repeat all the steps for all the proteins and for all the publications related to each protein. And this could be even more complicated if we were interested in other central nervous system stimulants, for example by looking in the ChEBI ontologyFootnote 16. This is of course the motivation to automatize the process, since it is not humanly feasible to deal with such a large amount of data, which keeps evolving every day.

However, if the goal was to find a relation between caffeine and hyperthermia, we could simply have searched for these two terms in PubMed. We did not do that because some relations are not explicitly mentioned in the text, so we have to navigate through database links. The second reason is that we needed an example using different resources and multiple entries to explain how we can automate most of these steps using shell scripting. The automation of the example will introduce a comprehensive set of techniques and commands, which, with some adaptation, Life and Health specialists can use to address many of their text and data processing challenges.

Unix Shell

The first step is to open a shell in our personal computer. A shell is a software program that interprets and executes command lines given by the user in consecutive lines of text. A shell script is a list of such command lines. The command line usually starts by invoking a command line tool. This manuscript will introduce a few command line tools, which will allow us to automatize the previous example. Unix shells were developed to manage Unix-like operating systems, but due to their usefulness they are nowadays available in most personal computers using Linux, macOS or Windows operating systems. There are many types of Unix shells with minor differences between them (e.g. sh, ksh, csh, tcsh and bash), but the most widely available is the Bourne-Again shell (bashFootnote 17). The examples in this manuscript were tested using bash.

So, let us open a shell in our personal computer using a terminal application (see Fig. 3.11). If we are using Linux or macOS, then this is usually not new for us, since most probably we already have a terminal application installed that opens a shell for us. In case we are using a Microsoft Windows operating system, then we have several options to consider. If we are using Windows 10, then we can install the Windows Subsystem for LinuxFootnote 18 or just install a third-party application, such as MobaXtermFootnote 19. No matter which terminal application we end up using, the shell will always have a common look: a text window with a blinking cursor waiting for our first command line. We should note that most terminal applications allow the usage of the up and down cursor keys to select, edit, and execute previous commands, and the usage of the tab key to complete the name of a command or a file.

Fig. 3.11 Screenshot of a Terminal application (Source: https://en.wikipedia.org/wiki/Unix)

Current Directory

As our first command line, we can type:
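$ pwd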

After hitting enter, the command will show the full path of the directory (folder) of our computer in which the shell is working. The dollar sign on the left only indicates that this is a command to be executed directly in the shell.

To understand a command line tool, such as pwd, we can type man followed by the name of the tool. For example, we can type man pwd to learn more about pwd (do not forget to hit enter, and press q to quit). We can also learn more about man by typing man man. A shorter alternative to man is to add the --help option after any command tool. For example, we can type pwd --help to get a more concise description of pwd.

As our second command line, we can type ls and hit enter. It will show the list of files in the current directory. For example, we can type ls --help to get a concise description of ls. Since we will work with files that we need to open with a text editor or a spreadsheet applicationFootnote 20, such as LibreOffice Calc or Microsoft Excel, we should select a current directory that we can easily open in our file explorer application. A good idea is to open our favorite file explorer application, select a directory, and then check its full pathFootnote 21.

Windows Directories

Notice that in Windows each name in the full path to a directory is separated by a backslash (\), while in a Unix shell a forward slash (/) is used.

For example, a Windows path to the Documents folder may look like:
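C:\Users\alice\Documents

We should note that alice is merely an illustrative username.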

If we are using the Windows Subsystem for LinuxFootnote 22, the previous folder must be accessed using the path:
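/mnt/c/Users/alice/Documents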

If we are using MobaXtermFootnote 23, the following path should be used instead:
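/drives/c/Users/alice/Documents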

Change Directory

To change the directory, we can use another command line tool, the cd (change directory) followed by the new path. In a Linux system we may want to use the Documents directory. If the Documents directory is inside our current directory (shown using ls), we only need to type:
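$ cd Documents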

Now we can type pwd to see what changed.

And if we want to return to the parent directory, we only need to use the two dots ..:
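$ cd ..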

And if we want to return to the home directory, we only need to use the tilde character (~):
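$ cd ~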

Again, we should type pwd to double check if we are in the directory we really want.

In Windows we may need to use the full path, for example:
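$ cd /mnt/c/Users/alice/Documents

(assuming the Windows Subsystem for Linux path shown above)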

We should note that we need to enclose the path within single (or double) quotes in case it contains spaces:
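$ cd '/mnt/c/Users/alice/My Documents'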

Later on, we will know more about the difference between using single or double quotes. For now, we may assume that they are equivalent. To know more about cd, we can type cd --help.

Useful Key Combinations

Every time the terminal is blocked for any reason, we can press both the control and C keys at the same timeFootnote 24. This usually cancels the current tool being executed. For example, try using the cd command with only one single quote:
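$ cd 'Documents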

This will block the terminal, because it is still waiting for a second single quote that closes the argument. Now press control-C, and the command will be aborted.

Now we can type the previous command again, but instead of pressing control-C we may also press control-DFootnote 25. The combination control-D indicates to the terminal that it is the end of input. So, in this case, the cd command will not be canceled; instead, it is executed without the second single quote and therefore a syntax error will be shown on our display.

Other useful key combinations are control-L, which when pressed cleans the terminal display, and control-insert and shift-insert, which when pressed copy and paste the selected text, respectively.

Shell Version

The following examples will probably work in any Unix shell, but if we want to be certain that we are using bash we can type the following command, and check if the output says bash.
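$ ps -p $$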

ps is a command line tool that shows information about the active processes running in our computer. The -p option selects a given process, and in this case $$ represents the process running in our terminal application. In most terminal applications bash is the default shell. If this is not our case, we may need to type bash, hit enter, and now we are using bash.

Now that we know how to use a shell, we can start writing and running a very simple script that reverses the order of the lines in a text file.

Data File

We start by creating a file named myfile.txt using any text editor, and adding the following lines:
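line one
line two
line three

We should note that these contents are merely illustrative; any lines of text will do for this exercise.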

We cannot forget to save it in our working directory, and check if it has the proper filename extension.

File Contents

To check if the file is really on our working directory, we can type:
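$ cat myfile.txt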

The contents of the file should appear in our terminal. cat is a simple command line tool that receives a filename as argument and displays its contents on the screen. We can type man cat or cat --help to know more about this command line tool.

Reverse File Contents

An alternative to the cat tool is the tac tool. To try it, we only need to type:
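$ tac myfile.txt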

The contents of the file should also appear in our terminal, but now in the reverse order. We can type man tac or tac --help to know more about this command line tool.

My First Script

Now we can create a script file named reversemyfile.sh by using the text editor, and add the following lines:
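1 tac $1

This one-line sketch simply applies the tac tool, presented above, to the file given as the first argument.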

We cannot forget to save the file in our working directory. $1 represents the first argument after the script filename when invoking it. Each script file presented in this manuscript will include the line numbers on the left. This will help us not only to identify how many lines the script contains, but also to distinguish a script file from the commands to be executed directly in the shell.

Line Breaks

A Unix file represents a single line break by a line feed character, instead of two characters (carriage return and line feed) used by WindowsFootnote 26. So, if we are using a text editor in Windows, we must be careful to use one that lets us save it as Unix file, for example the open source Notepad++Footnote 27.

In case we do not have such a text editor, we can also remove the extra carriage returns by using the command line tool tr, which replaces and deletes characters:
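$ tr -d '\r' < reversemyfile.sh > newreversemyfile.sh

We should note that the name of the new file (newreversemyfile.sh) is merely illustrative.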

The -d option of tr is used to remove a given character from the input; in this case tr will delete all carriage returns (\r). Many command line options can be used in a short form using a single dash (-), or in a long form using two dashes (--). In this tool, using the --delete option is equivalent to the -d option. Long forms are more self-explanatory, but they take longer to type and occupy more space. We can type man tr or tr --help to know more about this command line tool.

Redirection Operator

The > character represents a redirection operatorFootnote 28 that moves the results being displayed at the standard output (our terminal) to a given file. The < character represents a redirection operator that works in the opposite direction, i.e. it opens a given file and uses it as the standard input.

We should note that cat received the filename as an input argument, while tr can only receive the contents of the file through the standard input. Instead of providing the filename as argument, the cat command can also receive the contents of a file through the standard input, and produce the same output:
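$ cat < myfile.txt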

The previous tr command used a new file for the standard output, because we cannot use the same file to read and write at the same time. To keep the same filename, we have to move the new file by using the mv command:
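$ mv newreversemyfile.sh reversemyfile.sh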

We can type man mv or mv --help to know more about this command line tool.

Installing Tools

These two last commands could be replaced by the dos2unix tool:
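$ dos2unix reversemyfile.sh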

If not available, we have to install the dos2unix tool. For example, in the Ubuntu Windows Subsystem we need to execute:
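$ sudo apt install dos2unix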

The apt (Advanced Package Tool) command is used to install packages in many Linux systemsFootnote 29. Another popular alternative is the yum (Yellowdog Updater, Modified) commandFootnote 30.

To avoid fixing line breaks each time we update our file when using Windows, a clearly better solution is to use a Unix-friendly text editor.

When we are not using Windows, or we are using a Unix-friendly text editor, the previous commands will execute but nothing will happen to the contents of reversemyfile.sh, since the tr command will not remove any character. To see the command working, replace '\r' by '$' and check what happens.

Permissions

A script also needs permission to be executed, so every time we create a new script file we need to type:
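$ chmod u+x reversemyfile.sh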

The command line tool chmod just gave the user (u) permissions to execute (+x). We can type man chmod or chmod --help to know more about this command line tool.

Finally, we can execute the script by providing the myfile.txt as argument:
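$ ./reversemyfile.sh myfile.txt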

The contents of the file should appear in our terminal in the reverse order:
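line three
line two
line one

(assuming the illustrative contents of myfile.txt given above)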

Congratulations, we made our first script work! ☺

If we give more arguments, they will be ignored:
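$ ./reversemyfile.sh myfile.txt myotherfile.txt 'my other file.txt'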

The output will be exactly the same because our script does not use $2 and $3, which in this case represent myotherfile.txt and my other file.txt, respectively. We should note that an argument containing spaces must be enclosed in single quotes.

Debug

If something is not working well, we can debug the entire script by typing:
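$ bash -x reversemyfile.sh myfile.txt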

Our terminal will not only display the resulting text, but also the command lines executed, each preceded by the plus character (+):

Alternatively, we can add the set -x command line in our script to start the debugging mode, and set +x to stop it.

Save Output

We can now save the output into another file named mynewfile.txt by typing:
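$ ./reversemyfile.sh myfile.txt > mynewfile.txt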

Again, to check if the file was really created, we can use the cat tool:
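$ cat mynewfile.txt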

Or, we can reverse it again by typing:
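$ ./reversemyfile.sh mynewfile.txt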

Of course, the result should exactly be the original contents of myfile.txt.

Web Identifiers

The input arguments of our retrieval task are the chemical compound(s) about which we want to retrieve more information. For the sake of simplicity, we will start by assuming that the user knows the ChEBI identifier(s), i.e. the script does not have to search by the name of the compounds. Nevertheless, finding the identifier of a compound by its name is also possible, and this manuscript will describe how to do it later on.

So, the first step is to automatically retrieve all proteins associated with the given input chemical compound, which in our example was caffeine (CHEBI:27732). In the manual process, we downloaded the files by manually clicking on the links shown as Export options, namely URLs similar to the following sketch (the exact parameter names are the ones used by the ChEBI website and may have changed):
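https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=27732&dbName=UniProt
https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=2&6578706f7274=1&chebiId=27732&dbName=UniProt
https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=3&6578706f7274=1&chebiId=27732&dbName=UniProt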

for downloading a CSV, Excel, or XML file, respectively.

We should note that the only difference between the three URLs is a single numerical digit (1, 2, and 3) after the first equals character (=), which means that this digit can be used as an argument to select the type of file. Another parameter that is easily observable is the ChEBI identifier (27732). Try to replace 27732 by 17245 in any of those URLs by using a text editor, for example:
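https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=17245&dbName=UniProt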

Now we can use this new URL in the internet browser and check what happens. If we did it correctly, our browser downloaded a file with more than seven hundred proteins, since 17245 is the ChEBI identifier of a popular chemical compound in life systems: carbon monoxide.

In this case, we are not using a fully RESTful web service, but the data path is pretty modular and self-explanatory. The path is clearly composed of:

  • the name of the database (chebi);

  • the method (viewDbAutoXrefs.do);

  • and a list of parameters and their value (arguments) after the question mark character (?).

The order of the parameters in the URL is normally not relevant. They are separated by the ampersand character (&) and the equals character (=) is used to assign a value to each parameter (argument). This modular structure of these URLs allows us to use them as data pipelines to fill our local files with data, like pipelines that transport oil or gas from one container to another.

Single and Double Quotes

To construct the URL for a given ChEBI identifier, let us first understand the difference between single quotes and double quotes in a string (sequence of characters). We can create a script file named getproteins.sh by using a text editor to add the following lines:
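1 echo 'The first argument is: $1'
2 echo "The first argument is: $1"

We should note that the message text is merely illustrative; what matters is the contrast between the single and the double quotes.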

The command line tool echo displays the string received as argument. Do not forget to save it in our working directory and add the right permissions with chmod as we did previously with our first script.

Now to execute the script we will only need to type:
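$ ./getproteins.sh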

The output on the terminal should be:
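The first argument is: $1
The first argument is: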

This means that when using single quotes the string is interpreted literally as it is, whereas the string within double quotes is analyzed, and if there is a special character, such as the dollar sign ($), the script translates it to what it represents. In this case, $1 represents the first input argument. Since no argument was given, the double-quoted string displays nothing in its place.

To execute the script with an argument, we can type:
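$ ./getproteins.sh 27732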

The output on our terminal should be:
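The first argument is: $1
The first argument is: 27732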

We can now check that, when using double quotes, $1 is translated to the string given as argument.

Now we can update our script file named getproteins.sh to contain only the following line:
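1 echo "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt"

(again using the URL sketched above, now with the ChEBI identifier replaced by $1)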

Comments

Instead of removing the previous lines, we can transform them into comments by adding the hash character (#) to the beginning of each line:
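1 # echo 'The first argument is: $1'
2 # echo "The first argument is: $1"
3 echo "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt"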

Commented lines are ignored by the computer when executing the script.

Now, we can execute the script giving the ChEBI identifier as argument:
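$ ./getproteins.sh 27732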

The output on our terminal should be the link that returns the CSV file containing the proteins associated with caffeine.

Data Retrieval

After having the link, we need a web retrieval tool that works like our internet browser, i.e. receives as input a URL for programmatic access and retrieves its contents from the internet. We will use Client Uniform Resource Locator (cURL), which is available as a command line tool, and allows us to download the result of opening a URL directly into a file (man curl or curl --help for more information).

For example, to display in our screen the list of proteins related to caffeine, we just need to add the respective URL as input argument:
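$ curl "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=27732&dbName=UniProt"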

In some systems the curl command needs to be installedFootnote 31. Since we are using a secure connection https, we may also need to install the ca-certificates packageFootnote 32.

An alternative to curl is the command wget, which also receives a URL as argument, but by default wget writes the contents to a file instead of displaying them on the screen (man wget or wget --help for more information). So, for the equivalent command, we add the -O- option to select where the contents are placed:
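$ wget -O- "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=27732&dbName=UniProt"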

We should note that the dash (-) character after -O represents the standard output. The equivalent long form to the -O option is --output-document=file.

The output on our terminal should be the long list of proteins:

Instead of using a fixed URL, we can update the script named getproteins.sh to contain only the following line:
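1 curl "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt"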

We should note that now we are using double quotes, since we replaced the caffeine identifier by $1.

Now to execute the script we only need to provide a ChEBI identifier as input argument:
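$ ./getproteins.sh 27732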

The output on our terminal should be the long list of proteins:

Or, if we want the proteins related to carbon monoxide, we only need to replace the argument:
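$ ./getproteins.sh 17245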

And the output on our terminal should be an even longer list of proteins:

If we want to analyze all the lines, we can redirect the output to the command line tool less, which allows us to navigate through the output using the arrow keys. To do that, we can add the bar character (|) between the two commands, which will transfer the output of the first command as input of the second:
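$ ./getproteins.sh 17245 | less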

To exit from less just press q.

However, what we really want is to save the output as a file, not just printing some characters on the screen. Thus, what we should do is redirect the output to a CSV file. This can be done by adding the redirect operator > and the filename, as described previously:
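$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt.csv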

We should note that curl still prints some progress information into the terminal.

Standard Error Output

This happens because curl displays that information on the standard error output, which was not redirected to the fileFootnote 33. The > character without any preceding number redirects the standard output by default. The same happens if we precede it with the number 1. If we do not want to see that information, we can also redirect the standard error output (2), but in this case to the null device (/dev/null):
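$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt.csv 2>/dev/null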

We can also use the -s option of curl in order to suppress the progress information, by adding it to our script file named getproteins.sh:
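1 curl -s "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt"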

The equivalent long form to the -s option is --silent.

Now when executing the script, no progress information is shown:
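$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt.csv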

To check if the file was really created and to analyze its contents, we can use the less command:
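$ less chebi_27732_xrefs_UniProt.csv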

We can also open the file in our spreadsheet application, such as LibreOffice Calc or Microsoft Excel.

As an exercise execute the script to get the CSV file with the associated proteins of waterFootnote 34 and goldFootnote 35.

Data Extraction

Some data in the CSV file may not be relevant to our information need, i.e. we may need to identify and extract the relevant data. In our case, we will first select the relevant proteins (lines) using the command line tool grep, and second, we will select the column we need using the command line tool gawk, which is the GNU implementation of awkFootnote 36. We should note that if we are using MobaXterm we may need to install the gawk packageFootnote 37. We can also replace gawk by awk in case another implementation is availableFootnote 38.

Since our information need is about diseases related to caffeine, we may assume that we are only interested in proteins that have one of these topics in the third column:

Extracting lines from a text file is the main function of grep. The selection is performed by giving as input a pattern that grep tries to find in each line, presenting only the ones where it was able to find a match. The pattern is the same as the one we normally use when searching for a word in our text editor. The grep command also works with more complex patterns such as regular expressions, that we will describe later on.

Single and Multiple Patterns

We can execute the following command that selects the proteins with the topic CC - MISCELLANEOUS, our pattern, in our CSV file:
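$ grep 'CC - MISCELLANEOUS' chebi_27732_xrefs_UniProt.csv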

The output will be a shorter list of proteins, all with CC - MISCELLANEOUS as topic:

To use multiple patterns, we must precede each pattern with the -e option:
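$ grep -e 'CC - MISCELLANEOUS' -e 'DISRUPTION PHENOTYPE' chebi_27732_xrefs_UniProt.csv

We should note that the second pattern is merely an example; any other topic of interest can be added with another -e option.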

The equivalent long form to the -e option is --regexp=PATTERN.

The output on our terminal should be a longer list of proteins:

We should note that as previously, we can add | less to check all of them more carefully. The less command also gives the opportunity to find lines based on a pattern. We only need to type / and then a pattern.

We can now update our script file named getproteins.sh to contain the following lines:
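1 curl -s "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt" | \
2 grep -e 'CC - MISCELLANEOUS' -e 'DISRUPTION PHENOTYPE'

(a sketch using the URL and the example patterns assumed above)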

We should note that we added the -s option to suppress the progress information of curl, and the characters | \ at the end of the line to redirect the output of that line as input of the next line, in this case the grep command. We need to be careful to ensure that \ is the last character in the line, i.e. spaces at the end of the line may cause problems.

We can now execute the script again:
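$ ./getproteins.sh 27732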

The output should be similar of what we got previously, but the script downloads the data and filters immediately.

To save the file with the relevant proteins, we only need to add the redirection operator:
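$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt_relevant.csv

We should note that the filename chebi_27732_xrefs_UniProt_relevant.csv is merely illustrative.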

Data Elements Selection

Now we need to select just the first column, the one that contains the protein identifiers. Selecting columns from a tabular file is an easy task for gawk, which besides performing pattern scanning also provides a complex processing language (AWKFootnote 39). This processing language can be highly complexFootnote 40 and is out of the scope of this introductory manuscript. The gawk command can receive as arguments the character that divides each data element (column) in a line, using the -F option, and an instruction of what to do with each line, enclosed by single quotes and curly brackets. The equivalent long form to the -F option is --field-separator=fs.

For example, we can get the first column of our CSV file:
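$ gawk -F',' '{print $1}' chebi_27732_xrefs_UniProt_relevant.csv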

We should note that the comma (,) is the character that separates data elements in a CSV file, that print is equivalent to echo, and that $1 represents the first data element.

The command will display only the first column of the file, i.e. the protein identifiers:

For example, we can get the first and third columns separated by a comma:
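$ gawk -F',' '{print $1","$3}' chebi_27732_xrefs_UniProt_relevant.csv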

Now, the output contains both the first and third column of the file:

We can update our script file named getproteins.sh to contain the following lines:
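1 curl -s "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt" | \
2 grep -e 'CC - MISCELLANEOUS' -e 'DISRUPTION PHENOTYPE' | \
3 gawk -F',' '{print $1}'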

The last line is the only one that changes, apart from the | \ added to the previous line to redirect the output.

To execute the script, we can type again:
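$ ./getproteins.sh 27732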

The output should be similar of what we got previously, but now only the protein identifiers are displayed.

To save the output as a file with the relevant proteins’ identifiers, we only need to add the redirection operator:
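$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt_relevant_identifiers.csv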

Task Repetition

Given a protein identifier, we can construct the URL that will enable us to download its information from UniProt. We can use the RESTful web services provided by UniProtFootnote 41, more specifically the one that allows us to retrieve a specific entryFootnote 42. The construction of the URL is simple: it always starts with https://www.uniprot.org/uniprot/, followed by the protein identifier, and ends with a dot and the data format. For example, the link for protein P21817 using the XML format is: https://www.uniprot.org/uniprot/P21817.xml

Assembly Line

However, we need to construct one URL for each protein in the list we previously retrieved. The size of the list can be large (hundreds of proteins), varies for different compounds, and evolves with time. Thus, we need an assembly line in which a list of protein identifiers, independently of its size, is given as input to commands that construct one URL for each protein and retrieve the respective file.

The xargs command line tool works as an assembly line, it executes a command per each line given as input. We should note that if we are using MobaXterm we may need to install the findutils packageFootnote 43, since the default xargs only has minimal optionsFootnote 44.

We can start by experimenting with the xargs command by giving as input the list of protein identifiers in the file chebi_27732_xrefs_UniProt_relevant_identifiers.csv, displaying each identifier on the screen in the middle of a text message by providing the echo command as argument:
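$ cat chebi_27732_xrefs_UniProt_relevant_identifiers.csv | xargs -I {} echo "Processing the protein {}"

We should note that the text of the message is merely illustrative.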

The xargs command received as input the contents of our CSV file, and for each line displayed a message including the identifier in that line. The -I option tells xargs to replace {} in the command line given as argument by the value of the line being processed. The equivalent long form to the -I option is --replace=R.

The output should be something like this:

Instead of creating inconsequential text messages, we can use xargs to create the URLs:
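$ cat chebi_27732_xrefs_UniProt_relevant_identifiers.csv | xargs -I {} echo "https://www.uniprot.org/uniprot/{}.xml"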

The output should be something like this:

We can try to use these links in our internet browser to check if those displayed URLs are working correctly.

Now that we have the URLs, we can automatically download the files using the curl command instead of echo:
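$ cat chebi_27732_xrefs_UniProt_relevant_identifiers.csv | xargs -I {} curl "https://www.uniprot.org/uniprot/{}.xml" -o chebi_27732_{}.xml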

We should note that we now use the -o option to save the output to a given file, named after each protein identifier. The equivalent long form to the -o option is --output <file>.

To check if everything worked as expected we can use the ls command to view which files were created:
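$ ls chebi_27732_*.xml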

The asterisk character (*) is used here to represent any file whose name starts with chebi_27732_ and ends with .xml.

To check the contents of any of them, we can use the less command:
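$ less chebi_27732_P21817.xml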

File Header

We should note that the content of every file has to start with <?xml; otherwise there was a download error, and we have to run curl again for those entries. To check the header of each file, we can use the head command together with less:
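$ head -n 1 chebi_27732_*.xml | less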

The -n option specifies how many lines to print, in the previous command just one.

If for any reason, we are not able to download the files from UniProt, we can get them from the book file archiveFootnote 45.

Variable

We can now update our script file named getproteins.sh to contain the following lines:
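1 ID=$1 # the ChEBI identifier given as input
2 rm -f chebi_$ID\_*.xml # remove files from a previous execution
3 curl -s "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$ID&dbName=UniProt" | \
4 grep -e 'CC - MISCELLANEOUS' -e 'DISRUPTION PHENOTYPE' | \
5 gawk -F',' '{print $1}' | \
6 xargs -I {} curl -s "https://www.uniprot.org/uniprot/{}.xml" -o chebi_$ID\_{}.xml

We should note that this is one possible sketch assembling the previous commands; the URL parameters and the grep patterns are the ones assumed above.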

We should note that the last line now includes the xargs and curl commands, and the $ID variable. This new variable is created in the first line to contain the first value given as argument ($1). So, every time we mention $ID in the script we are mentioning the first value given as argument. This avoids ambiguity in cases where $1 is used for other purposes, like in the gawk command. Since the character that follows $ID is an underscore (_), we have to add a backslash (\) before the underscore so that it is not interpreted as part of the variable name. The second line uses the rm command to remove any files that were downloaded in a previous execution. We also added two comments after the hash character, so we humans do not forget what these commands are needed for.

To execute the script once more:
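$ ./getproteins.sh 27732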

And again, to check the results:
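$ ls chebi_27732_*.xml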

XML Processing

Assuming that our information need only concerns human diseases, we have to process the XML file of each protein to check if it represents a Homo sapiens (Human) protein.

Human Proteins

To perform this filtering, we can again use the grep command to select only the lines of any XML file that specify the organism as Homo sapiens:
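$ grep '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml

We should note that the exact pattern is an assumption based on how the UniProt XML files declare the scientific name of the organism.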

We should get in our display the filenames that represent a human protein, i.e. something like this:

We should note that since the asterisk character (*) provides multiple files as arguments to grep (the ones whose names start with chebi_27732_ and end with .xml), the output now includes the filename (followed by a colon) where each line was matched.

We can use the gawk command to extract only the filename, but grep has the -l option to just print the filename:
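$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml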

The equivalent long form to the -l option is --files-with-matches.

The output will now show only the filenames:

These four files represent the four Human proteins related to caffeine.

PubMed Identifiers

Now we need to extract the PubMed identifiers from these files to retrieve the related publications. For example, if we execute the following command:
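$ grep '<dbReference type="PubMed"' chebi_27732_P21817.xml

(the pattern assumes the way PubMed cross-references are declared in the UniProt XML files)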

The output is a long list of publications related to protein P21817:

To extract just the identifier, we can again use the gawk command:
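$ grep '<dbReference type="PubMed"' chebi_27732_P21817.xml | gawk -F'"' '{print $4}'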

We should note that " is used as the separation character and, since the PubMed identifier appears after the third ", the \$4 represents the identifier.

Now the output should be something like this:

PubMed Identifiers Extraction

Now to apply to every protein we may again use the xargs command:
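$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | \
xargs -I {} grep '<dbReference type="PubMed"' {} | \
gawk -F'"' '{print $4}'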

This may provide a long list of PubMed identifiers, including repetitions since the same publication can be cited in different entries.

Duplicate Removal

To help us identify the repetitions, we can add the sort command (man sort or sort --help for more information), which will display the repeated identifiers in consecutive lines (by sorting all identifiers):
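$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | \
xargs -I {} grep '<dbReference type="PubMed"' {} | \
gawk -F'"' '{print $4}' | sort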

For example some repeated PubMed identifiers that we should easily be able to see:

Fortunately, we also have the -u option that removes all these duplicates:
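$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | \
xargs -I {} grep '<dbReference type="PubMed"' {} | \
gawk -F'"' '{print $4}' | sort -u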

To easily check how many duplicates were removed, we can use the word count wc command with and without the usage of the -u option:
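$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | \
xargs -I {} grep '<dbReference type="PubMed"' {} | \
gawk -F'"' '{print $4}' | sort | wc
$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | \
xargs -I {} grep '<dbReference type="PubMed"' {} | \
gawk -F'"' '{print $4}' | sort -u | wc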

In case we have in our folder any auxiliary file, such as chebi_27732_P21817_entry.xml, we should add the option --exclude *entry.xml to the first grep command.

The output should be something like:

wc prints the number of lines, words, and bytes; thus, in our case, we are interested in the first number (man wc or wc --help for more information). We can see that we have removed 255 − 129 = 126 duplicates.

Just for curiosity, we can also use the shell to perform simple mathematical calculations using the expr command:
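$ expr 255 - 129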

Now let us create a script file named getpublications.sh by using a text editor to add the following lines:
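1 ID=$1 # the ChEBI identifier given as input
2 grep -l '<name type="scientific">Homo sapiens</name>' chebi_$ID\_*.xml | \
3 xargs -I {} grep '<dbReference type="PubMed"' {} | \
4 gawk -F'"' '{print $4}' | \
5 sort -u

(a sketch assembling the previous commands, with the same assumed patterns)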

Again, do not forget to save it in our working directory, and add the right permissions with chmod as we did previously with the other scripts.

To execute the script again:
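$ ./getpublications.sh 27732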

We can verify how many unique publications were obtained by using the -l option of wc, that provides only the number of lines:
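$ ./getpublications.sh 27732 | wc -l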

The output will be 129 as expected.

Complex Elements

The XML elements are not always in the same line, as fortunately was the case for the PubMed identifiers. In those cases, we may have to use the xmllint command, a parser that is able to extract data through the specification of an XPath query, instead of using a single-line pattern as in grep.

XPath

XPath (XML Path Language) is a powerful tool to extract information from XML and HTML documents by following their hierarchical structure. Check W3C for more about the XPath syntaxFootnote 46. We should note that xmllint may not be installed by default depending on our operating system, but it should be very easy to do itFootnote 47. If we are using MobaXterm, then we need to install the xmllint pluginFootnote 48.

Namespace Problems

In the case of our protein XML files, we can see that their second line defines a specific namespace using the xmlns attributeFootnote 49:
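<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

(a sketch of the namespace declaration used by the UniProt XML files; the actual attribute list may be longer)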

This complicates our XPath queries, since we need to explicitly specify that we are using the local name for every element in an XPath query. For example, to get the data in each reference element:
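$ xmllint --xpath "//*[local-name()='reference']" chebi_27732_P21817.xml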

We should note that // means any path in the XML file until reaching a reference element. The square brackets in XPath queries normally represent conditions that need to be verified.

Only Local Names

If we are only interested in using local names, there is a way to avoid the usage of local-name() for every element in an XPath query. We can identify the top-level element, in our case entry, and extract all the data that it encloses using an XPath query. For example, we can create the auxiliary file chebi_27732_P21817_entry.xml by adding the redirection operator:
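$ xmllint --xpath "//*[local-name()='entry']" chebi_27732_P21817.xml > chebi_27732_P21817_entry.xml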

The new XML file now starts and ends with the entry element without any namespace definition:

Now we can apply any XPath query, for example //reference, on the auxiliary file without the need to explicitly say that it represents a local name:
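$ xmllint --xpath "//reference" chebi_27732_P21817_entry.xml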

The output should contain only the data inside of each reference element:

Queries

The XPath syntax allows us to create many useful queries, such as:

  • //dbReference – elements of type dbReference that are descendants of something; Result:

  • /entry//dbReference – equivalent to the previous query but specifying that the dbReference elements are descendants of the entry element;

  • /entry/reference/citation/dbReference – equivalent to the previous query but specifying the full path in the XML file;

  • //dbReference/* – any child elements of a dbReference element; Result:

  • //dbReference/property[1] – first property element of each dbReference element; Result:

  • //dbReference/property[2] – second property element of each dbReference element; Result:

  • //dbReference/property[3] – third property element of each dbReference element; Result:

  • //dbReference/property/@type – all type attributes of the property elements; Result:

  • //dbReference/property[@type="protein sequence ID"] – the previous property elements that have an attribute type equal to protein sequence ID; Result:

  • //dbReference/property[@type="protein sequence ID"]/@value – the string assigned to each attribute value of the previous property elements; Result:

  • //sequence/text() – the contents inside the sequence elements; Result:

We should note that to try the previous queries we only need to replace the string after the --xpath option of the previous xmllint command, such as:
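$ xmllint --xpath "//dbReference/property[1]" chebi_27732_P21817_entry.xml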

Thus, an alternative way to extract the PubMed identifiers using xmllint instead of grep, would be something like this:
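$ xmllint --xpath "//dbReference[@type='PubMed']/@id" chebi_27732_P21817_entry.xml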

However, the output contains all identifiers in the same line and with the id label:

Extracting XPath Results

To extract the identifiers, we need to apply the tr command to split the output in multiple lines (one line per identifier), and then the gawk command:
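$ xmllint --xpath "//dbReference[@type='PubMed']/@id" chebi_27732_P21817_entry.xml | tr ' ' '\n' | gawk -F'"' 'NF>0{print $2}'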

The tr command replaces each space by a newline character, and the gawk command extracts the value inside the double quotes. We should note that NF > 0 is used to select only the non-empty lines, i.e. in our case it ignores the empty lines produced by the splitting.

Text Retrieval

Now that we have all the PubMed identifiers, we need to download the text included in the titles and abstracts of each publication.

Publication URL

To retrieve from the UniProt citations service the publication entry of a given identifier, we can again use the curl command and a link to the publication entry. For example, if we click on the Format button of a UniProt citations service entryFootnote 50, we can get the link to the RDF/XML version. RDFFootnote 51 is a standard data model that can be serialized in an XML format. Thus, in our case, we can deal with this format like we did with XML.

We can retrieve the publication entry by executing the following command:
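$ curl "https://www.uniprot.org/citations/12345678.rdf"

We should note that this URL pattern is an assumption based on the UniProt citations service, and that 12345678 is merely an illustrative PubMed identifier.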

Thus, we can now update the script getpublications.sh to have the following commands:
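1 ID=$1 # the ChEBI identifier given as input
2 rm -f chebi_$ID\_*.rdf # remove files from a previous execution
3 grep -l '<name type="scientific">Homo sapiens</name>' chebi_$ID\_*.xml | \
4 xargs -I {} grep '<dbReference type="PubMed"' {} | \
5 gawk -F'"' '{print $4}' | \
6 sort -u | \
7 xargs -I {} curl -s "https://www.uniprot.org/citations/{}.rdf" -o chebi_$ID\_{}.rdf

(again a sketch, reusing the assumed patterns and the assumed citations URL)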

We should note that only the second and last lines were updated to remove and retrieve the files, respectively.

Now let us execute the script:
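$ ./getpublications.sh 27732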

It may take a while to download all the entries, but probably no more than one minute with a standard internet connection.

To check if everything worked as expected we can use the ls command to view which files were created:
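$ ls chebi_27732_*.rdf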

If for any reason, we are not able to download the abstracts from UniProt, we can get them from the book file archiveFootnote 52.

Title and Abstract

Each file has the title and abstract of the publication as values of the title and rdfs:comment elements, respectively. To extract them we can again use the grep command:
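$ grep -e '<title>' -e '<rdfs:comment>' chebi_27732_12345678.rdf

(again using the illustrative PubMed identifier 12345678)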

The output should be something like these two lines:

To remove the XML elements, we can again use gawk:
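$ grep -e '<title>' -e '<rdfs:comment>' chebi_27732_12345678.rdf | gawk -F'[<>]' '{print $3}'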

We should note that we now use two characters as field separators, < and >, to get the text between the first > and the second <. The first field separator is <, so $2 contains the string title or rdfs:comment while $1 is empty. The second field separator is >, so $3 contains the string we want to keep.

The output should now be free of XML elements:

Thus, let us create the script gettext.sh to have the following commands:
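1 ID=$1 # the ChEBI identifier given as input
2 grep -e '<title>' -e '<rdfs:comment>' chebi_$ID\_*.rdf | \
3 gawk -F'[<>]' '{print $3}'

(a sketch; with multiple files grep prefixes each line with the filename, which does not affect the gawk extraction since the filenames contain no < or > characters)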

Again do not forget to save it in our working directory, and add the right permissions.

Now to execute the script and see the retrieved text:
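$ ./gettext.sh 27732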

We can save the resulting text in a file named chebi_27732.txt that we may share or read using our favorite text editor, by adding the redirection operator:
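$ ./gettext.sh 27732 > chebi_27732.txt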

Disease Recognition

Instead of reading all that text to find any disease related with caffeine, we can try to find sentences about a given disease by using grep:
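$ grep 'hyperthermia' chebi_27732.txt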

To save the filtered text in a file named chebi_27732_hyperthermia.txt, we only need to add the redirection operator:
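$ grep 'hyperthermia' chebi_27732.txt > chebi_27732_hyperthermia.txt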

This is a very simple way of recognizing a disease in text. The next chapters will describe how to perform more complex text processing tasks.

Further Reading

If we really want to become an expert in shell scripting we may be interested in reading a book specialized in the subject, such as the book entitled The Linux command line: a complete introduction (Shotts Jr 2012).

A more pragmatic approach is to explore the vast number of online tutorials about shell scripting and web technologies, such as the ones provided by W3SchoolsFootnote 53.