In the previous chapter we were able to automatically process text by recognizing a limited set of entities. This chapter introduces the world of semantics and presents step-by-step examples of how to retrieve semantic resources and use them to enhance text and data processing. The goal is to equip the reader with a basic set of skills to explore the semantic resources that are available nowadays, using simple shell commands.
- OWL: Web Ontology Language
- Semantic resources
- DO: disease ontology
- ChEBI: chemical entities of biological interest
- Entity linking
- Semantic similarity
In the previous chapters we searched for mentions of caffeine and malignant hyperthermia in text. However, we may miss related entities that may also be of interest to us. These related entities can be found in semantic resources, such as ontologies. The semantics of caffeine and malignant hyperthermia are represented in the ChEBI and DO ontologies, respectively.
Thus, we can start by retrieving both ontologies, i.e. their OWL files.
The -O option saves the content to a local file named according to the name of the remote file, usually the last part of the URL. The equivalent long form to the -O option is --remote-name.
The previous commands will create the files chebi_lite.owl and doid.owl, respectively. We should note that these links are for the specific releases used in this book. Using another release may change the output of the examples presented in this chapter.
The links may also change in the future, so we may need to check them on the BioPortal (footnote 1) or OBO Foundry (footnote 2) webpages. Alternatively, we can also get the OWL files from the book file archive (footnote 3).
Both OWL files use the XML format syntax. Thus, to check if our entities are represented in the ontology, we can search for ontology elements that contain them using a simple grep command:
For each grep the output will be the line that describes the property label (rdfs:label), which is inside the definition of the class that represents the entity:
To retrieve the full class definition, a more efficient approach is to use the xmllint command, which we already used in previous chapters:
The XPath query starts by finding the label that contains malignant hyperthermia and then .. gives the parent element, in this case the Class element.
From the output we can see that the semantics of malignant hyperthermia is much more than its label:
A graphical visualization of this class is depicted in Fig. 5.1.
For example, we can check that malignant hyperthermia is a subclass (specialization) of the entries 0050736 and 66. We can open the link (footnote 4) directly in our browser to know more about one of these parent diseases. We will see that it represents a muscle tissue disease. This means that malignant hyperthermia is a special case of a muscle tissue disease.
We can do the same to retrieve the full class definition of caffeine:
From the output we can see that the types of semantics available for caffeine differ from those of malignant hyperthermia, but they still share many important properties, such as the definition of subClassOf:
A graphical visualization of this class is depicted in Fig. 5.2.
The class caffeine is a specialization of two other entries: 26385 (purine alkaloid, footnote 5) and 27134 (trimethylxanthine, footnote 6). However, it contains additional subclass relationships that do not represent subsumption (is-a).
Figures 5.3 and 5.4 show other related classes of malignant hyperthermia and caffeine, respectively.
For example, the relationship between caffeine and the entry 25435 (mutagen, footnote 7) is defined by the entry 0000087 (has role, footnote 8) of the Relations Ontology. In other words, this relationship states that caffeine has role mutagen.
We can also search in the OWL file for the definition of the type of relation has role:
The XPath query starts by finding the elements ObjectProperty and then selects the ones containing the about attribute with the relation URI as value.
We can check that the relation is neither transitive nor cyclic:
A graphical visualization of this property is depicted in Fig. 5.5.
URIs and Labels
In the previous examples, we searched the OWL file using labels and URIs. To standardize the process, we will create two scripts that will convert a label into a URI and vice-versa. The idea is to perform all the internal ontology processing using the URIs and in the end convert them to labels, so we can use them in text processing.
URI of a Label
To get the URI of malignant hyperthermia, we can use the following query:
We added the @*[local-name()='about'] to extract the URI specified as an attribute of that class.
The output will be the name of the attribute and its value:
To extract only the value, we can add the string function to the XPath query:
Unfortunately, the string function returns only one attribute value, even if many are matched. Nonetheless, we use the string function because we assume that malignant hyperthermia is an unambiguous label, i.e. only one class will match.
The output will now be only the attribute value:
Getting the URI of caffeine requires almost the same command:
We can now write a script that receives multiple labels given as standard input and the OWL file where to find the URIs as argument. Thus, we can create the script named geturi.sh with the following lines:
Again, we must not forget to save the file in our working directory and add the right permissions using chmod, as we did with our scripts in the previous chapters. The xargs command is used to process each line of the standard input. The tr command was added because xmllint displays all the matches on the same line, so we split the output using the character that delimits the URI, i.e. the double quote ("). Then we use the grep command to keep only the lines with a URI, i.e. the ones that contain the term http.
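The tr and grep post-processing described above can be tried in isolation. The following is a minimal sketch, not the geturi.sh script itself: the sample line is typed by hand to imitate what xmllint prints when two attributes match, and the function name extract_uris is an assumption made only for illustration.

```shell
# Hand-typed imitation of xmllint output for two matched attributes
# (illustrative only, not copied from a real run).
sample=' rdf:about="http://purl.obolibrary.org/obo/DOID_8545" rdf:about="http://purl.obolibrary.org/obo/DOID_66"'

# extract_uris (hypothetical helper): break the line at every double quote,
# then keep only the pieces that contain a URI, recognized by the term http.
extract_uris() {
  tr '"' '\n' | grep 'http'
}

printf '%s\n' "$sample" | extract_uris
```

Each URI ends up on its own line, which is the form that the next script in a pipeline can read from its standard input.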
Now to execute the script we only need to provide the labels as standard input:
The output should be the URIs of those classes:
We can also execute the script using multiple labels, one per line:
The output will be a URI for each label:
Label of a URI
To get the label of the disease entry with the identifier 8545, we can also use the xmllint command:
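We added *[local-name()='label'] to select the element within the class that describes the label. Unlike the URI, which is stored in an attribute and therefore selected with @*, the label is stored in a child element.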
The output should be the label we were expecting:
We can do the same to get the label of the compound entry with the identifier 27732:
Again, the output should be the label we were expecting:
We can now write a script that receives multiple URIs given as standard input and the OWL file where to find the labels. We can create a script named getlabels.sh with the following lines:
The xargs command is used to process each line of the standard input. The text function does not add a newline character after each match, so if we have multiple matches it is almost impossible to separate them. This explains why we removed the text function from the XPath. Then we have to split the result into multiple lines using the tr command, and filter out the lines that contain the :label keyword or are empty.
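The splitting and filtering step can be sketched on its own. This is only an illustration under stated assumptions: the sample string is typed by hand to imitate two label elements printed by xmllint on a single line, and extract_labels is a hypothetical helper name, not part of getlabels.sh.

```shell
# Hand-typed imitation of two label elements printed on one line
# (illustrative only, not copied from a real run).
sample='<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">malignant hyperthermia</rdfs:label><rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">caffeine</rdfs:label>'

# extract_labels (hypothetical helper): put every tag and every text piece
# on its own line, then drop the tag lines (they contain :label) and the
# empty lines left between adjacent tags.
extract_labels() {
  tr '<>' '\n' | grep -v -e ':label' -e '^$'
}

printf '%s\n' "$sample" | extract_labels
```

Only the text between the opening and closing tags survives the filter, one label per line.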
Now to execute the script we only need to provide the URIs as standard input:
The output should be the labels of those classes:
We can also execute the script with multiple URIs:
The output will be a label for each URI:
To test both scripts, we can feed the output of one as the input of the other, for example:
The output will be the original input, i.e. the labels given as arguments to the echo command:
Now we can use the URIs as input:
Again the output will be the original input, i.e. the URIs given as arguments to the echo command:
Concepts are not always mentioned using the same official label. Frequently, we find alternative labels in text. This is why some of the classes also specify alternative labels, such as the ones represented by the element hasExactSynonym.
For example, to find all the synonyms of a disease, we can use the same XPath as used before but replacing the keyword label by hasExactSynonym:
The output will be the two synonyms of malignant hyperthermia:
We can also get both the primary label and the synonyms. We only need to add an alternative match to the keyword label:
The output will now include the two synonyms plus the official label:
Thus, we can now update the script getlabels.sh to include synonyms:
We should note that the XPath query and the grep command were modified by adding the hasExactSynonym keyword. We also added the hasRelatedSynonym which is available for some classes.
We can test the script exactly in the same way as before:
But now the output will display multiple labels for this class:
URI of Synonyms
Since the script now returns alternative labels, we may encounter some problems if we send the output to the geturi.sh script:
The previous command will display XPath warnings for the two synonyms:
If we do not want to know about these mismatches, we can always redirect them to the null device:
However, we can update the script geturi.sh to also include synonyms:
Now we can execute the same command:
Every label should now be matched to exactly the same class:
If we want to avoid repetitions, we can add the sort command with the -u option to the end of each command, as we did in previous chapters:
The output should now be only one URI:
Parent classes represent generalizations that may also be relevant to recognize in text. To extract all the parent classes of malignant hyperthermia, we can use the following XPath query:
The first part of the XPath is the same as the above to get the class element, then [local-name()= 'subClassOf'] is used to get the subclass element, and finally @*[local-name()= 'resource'] is used to get the attribute containing its URI.
The output should be the URIs representing the parents of class 8545:
We can also execute the same command for caffeine:
The output will now include two parents:
We should note that we can no longer use the string function, because ontologies are organized as directed acyclic graphs (DAGs) using multiple inheritance, i.e. each class can have multiple parents, and the string function only returns the first match. To get only the URIs, we can apply the previous technique of using the tr and grep commands:
Now the output only contains the URIs:
We can now create a script that receives multiple URIs given as standard input and the OWL file where to find all the parents as argument. The script named getparents.sh should contain the following lines:
To get the parents of malignant hyperthermia, we will only need to give the URI as input and the OWL file as argument:
The output will include the URIs of the two parents:
Labels of Parents
But if we need the labels we can redirect the output to the getlabels.sh script:
The output will now be the label of the parents of malignant hyperthermia:
Again, the same can be done with caffeine:
And now the output contains the labels of the parents of caffeine:
If we are interested in using all the related classes besides the ones that represent a generalization (subClassOf), we have to change our XPath to:
We should note that these related classes are in the attribute resource of the someValuesFrom element inside a subClassOf element.
The URIs of the 18 related classes of caffeine are now displayed:
Labels of Related Classes
To get the labels of these related classes, we only need to add the getlabels.sh script:
The output is now 18 terms that we could use to expand our text processing:
Finding all the ancestors of a class requires many chained invocations of getparents.sh until we get no matches. We should also avoid relations that are cyclic, otherwise we will enter an infinite loop. Thus, for identifying the ancestors of a class, we will only consider parent relations, i.e. subsumption relations.
In the previous section we were able to extract the direct parents of a class, but the parents of these parents also represent generalizations of the original class. For example, to get the parents of the parents (grandparents) of malignant hyperthermia we need to invoke getparents.sh twice:
And we will find the URIs of the grandparents of malignant hyperthermia:
Or to get their labels we can add the getlabels.sh script:
And we find the labels of the grandparents of malignant hyperthermia:
However, there are classes that do not have any parent, which are called root classes. In Figs. 5.1 and 5.2, we can see that disease and chemical entity are the root classes of the DO and ChEBI ontologies, respectively. As we can see, these are highly generic terms.
To check if they are root classes, we can ask for their parents:
In both cases, we will get the warning that no matches were found, confirming that they are root classes.
We can now build a script that receives a list of URIs as standard input, and invokes getparents.sh recursively until it reaches the root class.
The script named getancestors.sh should contain the following lines:
The second line of the script saves the standard input in a variable named CLASSES, because we need to use it twice: (i) to check if the input has any classes or is empty (third line) and (ii) to get the parents of the classes given as input (fourth line). If the input is empty then the script ends; this is the base case of the recursion (footnote 9). This is required so that the recursion stops at some point. Otherwise, the script would run indefinitely until the user stopped it manually.
The fourth line of the script stores the output in a variable named PARENTS, because we also need to use it twice: (i) to output these direct parents (fifth line), and (ii) to get the ancestors of these parents (sixth line). We should note that we are invoking the getancestors.sh script inside getancestors.sh itself, which defines the recursion step. Since the subsumption relation is acyclic, we expect that at some point we will reach classes without parents (root classes) and then the script will end.
We should note that the variables CLASSES and PARENTS need to be inside double quotes in the echo commands, so the newline characters are preserved.
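The base case and the recursion step can be tried without any OWL file. The following is a minimal sketch under stated assumptions: a small tab-separated table of invented class names (c, b, d, a) replaces the ontology, and the functions get_parents and get_ancestors are hypothetical stand-ins for the scripts, not their actual contents. Unlike the description above, the sketch also stops as soon as no parents are found, so it prints no empty line at the end.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy child<TAB>parent table (invented): c has parents b and d,
# and both b and d have the root a as their parent.
printf 'c\tb\nc\td\nb\ta\nd\ta\n' > "$tmp/parents.tsv"

# Stand-in for getparents.sh: classes on standard input, table as argument.
get_parents() {
  while IFS= read -r class; do
    awk -F '\t' -v c="$class" '$1 == c { print $2 }' "$1"
  done
}

# Stand-in for getancestors.sh.
get_ancestors() {
  classes=$(cat)
  [ -z "$classes" ] && return 0          # base case: empty input
  parents=$(echo "$classes" | get_parents "$1")
  [ -z "$parents" ] && return 0          # root reached: nothing to print
  echo "$parents"                        # quotes keep one class per line
  echo "$parents" | get_ancestors "$1"   # recursion step
}

echo c | get_ancestors "$tmp/parents.tsv"
```

With this toy table the root a is printed twice, because it is reached through both b and d; appending sort -u to the pipeline would remove the repetition.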
Recursion is often computationally expensive, but it is usually possible to replace recursion with iteration to develop a more efficient algorithm. Explaining iteration and how to refactor a recursive script is out of the scope of this book; nevertheless, the following script represents an equivalent way to get all the ancestors without using recursion:
The script uses the while command that basically implements iteration by repeating a set of commands (lines 6–8) while a given condition is satisfied (line 4).
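The same idea can be sketched on toy data. This is only an illustration, assuming an invented child/parent table in place of the OWL file; get_parents and get_ancestors_iter are hypothetical helper names, and the line numbers mentioned above refer to the book's script, not to this sketch.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy child<TAB>parent table (invented class names).
printf 'c\tb\nc\td\nb\ta\nd\ta\n' > "$tmp/parents.tsv"

# Stand-in for getparents.sh: classes on standard input, table as argument.
get_parents() {
  while IFS= read -r class; do
    awk -F '\t' -v c="$class" '$1 == c { print $2 }' "$1"
  done
}

# Iterative version: keep replacing the current classes by their parents
# while there is still something to process.
get_ancestors_iter() {
  classes=$(cat)
  while [ -n "$classes" ]; do
    classes=$(echo "$classes" | get_parents "$1")
    if [ -n "$classes" ]; then
      echo "$classes"
    fi
  done
}

echo c | get_ancestors_iter "$tmp/parents.tsv"
```

The loop condition plays the role of the base case: when a level has no parents, the variable becomes empty and the loop ends.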
To test the recursive script, we can provide as standard input the label malignant hyperthermia:
The output will be the URIs of all its ancestors:
We should note that we will still receive the XPath warning when the script reaches the root class and no parents are found:
To remove this warning and just get the labels of the ancestors of malignant hyperthermia, we can redirect the warnings to the null device:
The output will now include the name of all ancestors of malignant hyperthermia:
We should note that the first two ancestors are the direct parents of malignant hyperthermia, and the last one is the root class. This happens because the recursive script prints the parents before invoking itself to find the ancestors of the direct parents.
We can do the same with caffeine, but be advised that given the higher number of ancestors in ChEBI we may now have to wait a little longer for the script to end.
The results include repeated classes that were found through different branches, which is why we need to add the sort command with the -u option to eliminate the duplicates.
The output will be the ancestors found by the script:
Now that we know how to extract all the labels and related classes from an ontology, we can construct our own lexicon with the list of terms that we want to recognize in text.
Let us start by creating the file do_8545_lexicon.txt representing our lexicon for malignant hyperthermia with all its labels:
Now we can add to the lexicon all the labels of the ancestors of malignant hyperthermia by adding the redirection operator:
We should note that now we use >> and not >, so that more lines are appended to the file instead of a new file being created from scratch.
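The difference between the two redirection operators can be checked with a few echo commands. This is a minimal sketch: the file is created in a temporary folder, and the three labels are typed by hand instead of being produced by the scripts.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
lexicon="$tmp/lexicon.txt"

# > creates the file, or empties it first if it already exists.
echo 'malignant hyperthermia' > "$lexicon"

# >> keeps the current contents and adds the new lines at the end.
echo 'muscle tissue disease' >> "$lexicon"
echo 'disease' >> "$lexicon"

# All three labels are in the file; sort -u would also drop duplicates.
sort -u "$lexicon"
```

Using > in the second or third command would have left only the last label in the file.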
Now we can check the contents of the file do_8545_lexicon.txt to see the terms we got:
We should note that we use the sort command with the -u option to eliminate any duplicates that may exist.
We should be able to see the following labels:
We can also apply the same commands for caffeine to produce its lexicon in the file chebi_27732_lexicon.txt by adding the redirection operator:
We should note that it may take a while until it gets all labels.
Now let us check the contents of this new lexicon:
Now we should be able to see that this lexicon is much larger:
If we are interested in finding everything related to caffeine or malignant hyperthermia, we may be interested in merging the two lexicons in a file named lexicon.txt:
Using this new lexicon, we can recognize any mention in our previous file named chebi_27732_sentences.txt:
We added the -F option because our lexicon is a list of fixed strings, i.e. does not include regular expressions. The equivalent long form to the -F option is --fixed-strings.
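As a minimal sketch of this kind of recognition, the commands below match a four-entry lexicon against three sentences. The sentences are invented for the example and the file names are placeholders; the -w option, which restricts matches to whole words, is an assumption of this sketch.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy lexicon, one fixed string per line.
printf 'caffeine\nmalignant hyperthermia\nmolecule\ndisease\n' > "$tmp/lexicon.txt"

# Toy sentences (invented for this sketch).
printf '%s\n' \
  'The molecule was detected in all samples.' \
  'The disease was diagnosed early.' \
  'Nothing relevant is written here.' > "$tmp/sentences.txt"

# -f reads the patterns from the lexicon, -F treats them as fixed strings.
grep -w -F -f "$tmp/lexicon.txt" "$tmp/sentences.txt"

# -o prints only the matched terms, so we can list which entries were found.
grep -o -w -F -f "$tmp/lexicon.txt" "$tmp/sentences.txt" | sort -u
```

The first command prints the two matching sentences, and the second one lists the lexicon entries responsible for those matches.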
We now get more sentences, including some that do not include a direct mention of caffeine or malignant hyperthermia. For example, the following sentence was selected because it mentions molecule, which is an ancestor of caffeine:
Another example is the following sentence, which was selected because it mentions disease, which is an ancestor of malignant hyperthermia:
We can also use our script getentities.sh giving this lexicon as argument. However, since we are not using any regular expressions, it would be better to add the -F option to the grep command in the script, so the lexicon is interpreted as a list of fixed strings to be matched. Only then can we execute the script safely:
Besides these two previous examples, we can check if there are other ancestors being matched by using the grep command with the -o option:
We can see that besides the terms caffeine and malignant hyperthermia, only one ancestor of each one of them was matched, molecule and disease, respectively:
This can be explained by the fact that our text is somewhat limited, and that we are using the official labels, so we may be missing acronyms and simple variations such as the plural of a term. To cope with this issue, we may use a stemmer (footnote 10), or also use the related classes that do not represent subsumption. However, if our lexicon is small, it is better to do it manually and maybe add some regular expressions to deal with some of the variations.
Instead of using a customized and limited lexicon, we may be interested in recognizing any of the diseases represented in the ontology. By recognizing all the diseases in our caffeine-related text, we will be able to find all the diseases that may be related to caffeine.
To extract all the labels from the disease ontology we can use the same XPath query used before, but now without restricting it to any URI:
We can create a script named getalllabels.sh, containing the following lines, that receives as argument the OWL file in which to find all the labels:
We should note that this script is similar to the getlabels.sh script without the xargs, since it does not receive a list of URIs as standard input.
Now we can execute the script to extract all labels from the OWL file:
The output will contain the full list of diseases:
To create the generic lexicon, we can redirect the output to the file diseases.txt:
We can check how many labels we got by using the wc command:
The lexicon contains more than 29 thousand labels.
We can now recognize the lexicon entries in the sentences of the file chebi_27732_sentences.txt by using the grep command:
However, we will get the following error:
This error happens because our lexicon contains some special characters also used by regular expressions, such as the parentheses.
One way to address this issue is to replace the -E option by the -F option, that treats each lexicon entry as a fixed string to be recognized:
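The difference between the two options can also be seen on a toy example. This is a minimal sketch with an invented lexicon entry and an invented sentence; it shows that even when the parentheses are balanced and no error is raised, the -E option silently changes what is matched.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy lexicon entry containing parentheses, and a toy sentence using it.
printf 'syndrome (ATR)\n' > "$tmp/lexicon.txt"
printf 'The syndrome (ATR) was mentioned once.\n' > "$tmp/text.txt"

# With -E the parentheses define a group, so the pattern means the text
# "syndrome ATR" without parentheses: no line matches (count is 0).
grep -c -E -f "$tmp/lexicon.txt" "$tmp/text.txt" || true

# With -F the parentheses are ordinary characters: one line matches.
grep -c -F -f "$tmp/lexicon.txt" "$tmp/text.txt"
```

The || true only keeps the first command from signalling a failure when it finds nothing, which matters if the commands are run in a script that stops on errors.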
The output will show the large list of sentences mentioning diseases:
Despite using the -F option, the lexicon contains some problematic entries. Some entries have expressions enclosed by parentheses or brackets, that represent alternatives or a category:
Other entries have separation characters, such as commas or colons, to represent a specialization. For example:
A problem is that these characters do not always have the same meaning. A comma may also be part of the term. For example:
Another case is the use of &amp; to represent an ampersand. For example:
However, most of the time the alternatives are already included in the lexicon in different lines. For example:
As we can see by these examples, it is not trivial to devise rules that fully solve these issues. Very likely there will be exceptions to any rule we devise and that we are not aware of.
Special Characters Frequency
To check the impact of each of these issues, we can count the number of times they appear in the lexicon:
We will be able to see that parentheses and commas are the most frequent, with more than one thousand entries.
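Counts of this kind could be obtained along the following lines. This is only a sketch: the four-entry lexicon is a toy file (two of its entries are invented), count_entries is a hypothetical helper, and the real numbers have to be computed on diseases.txt.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy lexicon; the real one would be diseases.txt.
printf '%s\n' \
  'Post measles encephalitis (disorder)' \
  '46,XY DSD due to LHB deficiency' \
  'ATR syndrome, deletion type' \
  'dyskinesia' > "$tmp/lexicon.txt"

# count_entries (hypothetical helper): number of lexicon lines that contain
# the given character; -F avoids interpreting it as a regular expression.
count_entries() {
  grep -c -F -e "$1" "$2"
}

for ch in '(' '[' ',' ':'; do
  printf '%s\t%s\n' "$ch" "$(count_entries "$ch" "$tmp/lexicon.txt")"
done
```

The -c option counts lines, not occurrences, so an entry with two commas is still counted once.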
Now let us check if the ATR acronym representing the alpha thalassemia-X-linked intellectual disability syndrome is in the lexicon:
All the entries include more terms than only the acronym:
Thus, a single ATR mention will not be recognized.
This is problematic if we need to match sentences mentioning that acronym, such as:
We will now try to mitigate these issues as simply as we can. We will not try to solve them completely, but at least address the most obvious cases.
Removing Special Characters
The first fix is to remove all the parentheses and brackets by using the tr command, since they will not be found in the text:
Of course, we may lose the shorter labels, such as Post measles encephalitis, but at least now, the disease Post measles encephalitis disorder will be recognized:
If we really need these alternatives, we would have to create multiple entries in the lexicon or transform the labels into regular expressions.
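The effect of this first fix can be sketched on a single label. The helper name remove_brackets is an assumption of this sketch, and the exact tr invocation used to build the lexicon may differ.

```shell
# remove_brackets (hypothetical helper): delete every parenthesis and
# bracket character, keeping the text that was enclosed by them.
remove_brackets() {
  tr -d '[]()'
}

echo 'Post measles encephalitis (disorder)' | remove_brackets
```

The label becomes Post measles encephalitis disorder, which is the longer form discussed above; the shorter form is no longer an entry of its own.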
Removing Extra Terms
The second fix is to remove all the text after a separation character, by using the sed command:
We should note that the regular expression enforces a space after the separation character to avoid separation characters that are not really separating two expressions, such as: 46,XY DSD due to LHB deficiency
We can see that now we are able to recognize both ATR and ATR syndrome:
Removing Extra Spaces
The third fix is to remove any leading or trailing spaces of a label:
We should note that we added two more replacement expressions to the sed command by separating them with a semicolon.
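The three replacement expressions can be sketched together as follows. This is an illustration under stated assumptions: only commas and colons are treated as separation characters, clean_labels is a hypothetical helper name, and the first and third input labels are invented for the example.

```shell
# clean_labels (hypothetical helper), three expressions separated by ';':
#  1. drop everything from a comma or colon that is followed by a space
#  2. drop leading spaces
#  3. drop trailing spaces
clean_labels() {
  sed -E 's/[,:] .*$//; s/^ *//; s/ *$//'
}

printf '%s\n' \
  'ATR syndrome, deletion type' \
  '46,XY DSD due to LHB deficiency' \
  '  dyskinesia  ' | clean_labels
```

The second label is left untouched because its comma is not followed by a space, which is exactly the case the regular expression is meant to protect.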
We can now update the script getalllabels.sh to include the previous tr and sed commands:
And we can now generate a fixed lexicon:
We can check again the number of entries:
We now have a lexicon with about 28 thousand labels. We have fewer entries because our fixes made some entries equal to others already in the lexicon, and thus the -u option filtered them.
We can now try to recognize lexicon entries in the sentences of the file chebi_27732_sentences.txt:
To obtain the list of labels that were recognized, we can use the grep command:
We will get a list of 43 unique labels representing diseases that may be related to caffeine:
The grep command is quite efficient, but when using large lexicons and texts we may start to notice performance issues. Its execution time is proportional to the size of the lexicon, since each term of the lexicon corresponds to an independent pattern to match. For large lexicons these issues may become serious.
A solution for dealing with large lexicons is to use the inverted recognition technique (Couto et al. 2017; Couto and Lamurias 2018). The inverted recognition uses the words of the input text as patterns to be matched against the lexicon file. When the number of words in the input text is much smaller than the number of terms in the lexicon, grep has far fewer patterns to match. For example, the inverted recognition technique applied to ChEBI has been shown to be more than 100 times faster than the standard technique.
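The idea can be sketched for the simplest case, lexicon entries made of a single word. This is an illustrative reduction, not the implementation described in the cited papers: the toy lexicon, the toy sentence and the file names are assumptions, and multi-word entries such as malignant hyperthermia are deliberately left unmatched.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy lexicon and toy text (invented for this sketch).
printf 'dyskinesia\nfever\nmalignant hyperthermia\n' > "$tmp/lexicon.txt"
printf 'the patient shows dyskinesia and fever.\n' > "$tmp/text.txt"

# Step 1: turn the text into a list of unique words, one per line.
tr -cs '[:alnum:]' '\n' < "$tmp/text.txt" | sort -u > "$tmp/words.txt"

# Step 2: use those few words as fixed patterns against the lexicon,
# keeping only the entries that are exactly equal to one of them (-x).
grep -x -F -f "$tmp/words.txt" "$tmp/lexicon.txt"
```

Here grep receives six patterns, one per distinct word of the sentence, regardless of how many thousands of entries the lexicon file has.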
Another performance issue arises when we use the -i option to perform case insensitive matching. For instance, on most computers, if we execute the following command we will have to wait much longer than when not using the -i option:
One solution is to convert both the lexicon and text to lowercase (or uppercase), but this may result in more incorrect matches, such as incorrectly matching acronyms in lowercase.
The low performance of case insensitive matching is normally due to the usage of the UTF-8 character encoding (footnote 11), instead of the ASCII character encoding (footnote 12). UTF-8 allows us to use special characters, such as the euro symbol, in a standard way, so they are interpreted by every computer around the world in the same way. However, for normal text without special characters ASCII works fine and more efficiently. In Unix shells we can normally specify the usage of ASCII encoding by adding the expression LC_ALL=C before the command (man locale for more information).
So, another solution is to execute the following command:
We will be able to watch the significant increase in performance.
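As a minimal sketch of how the prefix is used, the command below sets LC_ALL only for a single grep invocation. The two-entry lexicon and the two sentences are invented, and on such a tiny input no difference in speed will be noticeable; the sketch only shows the syntax and the effect of -i.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Toy lexicon with capitalized labels and toy text in lowercase.
printf 'Dyskinesia\nArthrogryposis\n' > "$tmp/lexicon.txt"
printf 'dyskinesia is mentioned here\narthrogryposis is mentioned there\n' > "$tmp/text.txt"

# The variable assignment before the command affects this command only;
# the locale of the shell session is left unchanged.
LC_ALL=C grep -c -i -F -f "$tmp/lexicon.txt" "$tmp/text.txt"

# Without -i the capitalized labels do not match the lowercase text.
LC_ALL=C grep -c -F -f "$tmp/lexicon.txt" "$tmp/text.txt" || true
```

The first command counts two matching lines and the second counts none.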
To check how many labels are now being recognized we can execute:
We now have 60 labels being recognized.
To check which new labels were recognized, we can compare the results with and without the -i option:
We are now able to see that the new labels are:
Some important diseases could only be recognized by performing a case insensitive match, such as arthrogryposis. This disease was missing because in the lexicon we had the uppercase version of the label, but not the lowercase version. We can check it by using the grep command:
The output does not include the lowercase version:
We can also check in the text which versions are used:
We can see that only the lowercase version is used:
Another example is dyskinesia:
The lexicon has only the disease name with the first character in uppercase:
However, using a case insensitive match may also create other problems, such as the acronym CAN for the disease Crouzon syndrome-acanthosis nigricans syndrome:
By using a case insensitive grep we will recognize the common word can as a mention of the disease acronym CAN. For example, we can check how many times CAN is recognized:
It is recognized 18 times.
And to see which type of matches they are, we can execute the following command:
We can verify that the matches are incorrect mentions of the disease acronym:
This means we created at least 18 mismatches by performing a case insensitive match.
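The comparison between the two matching modes, including this kind of mismatch, can be sketched on toy data. The sentence is invented, the lexicon holds only three labels taken from the examples above, and the use of comm to compute the difference is a choice of this sketch, not necessarily the command used to obtain the results reported here.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf 'Dyskinesia\nCAN\nAD2\n' > "$tmp/lexicon.txt"
printf 'AD2 and dyskinesia can both appear in this toy sentence\n' > "$tmp/text.txt"

# Labels recognized with a case sensitive match.
LC_ALL=C grep -o -w -F -f "$tmp/lexicon.txt" "$tmp/text.txt" |
  LC_ALL=C sort -u > "$tmp/exact.txt"

# Labels recognized with a case insensitive match.
LC_ALL=C grep -o -w -i -F -f "$tmp/lexicon.txt" "$tmp/text.txt" |
  LC_ALL=C sort -u > "$tmp/nocase.txt"

# comm -13 prints the lines found only in the second (sorted) file,
# i.e. the matches gained by adding the -i option.
LC_ALL=C comm -13 "$tmp/exact.txt" "$tmp/nocase.txt"
```

The two new matches are dyskinesia, which is a gain, and the common word can, which is a mismatch for the acronym CAN.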
When we are using a generic lexicon, we may be interested in identifying what the recognized labels represent. For example, we may not be aware of what the matched label AD2 represents.
To solve this issue, we can use our script geturi.sh to perform linking (aka entity disambiguation, entity mapping, normalization), i.e. find the classes in the disease ontology that may be represented by the recognized label. For example, to find what AD2 represents, we can execute the following command:
In this case, the result clearly shows that AD2 represents the Alzheimer disease:
However, we may not be so lucky with the labels that were modified by our previous fixes in the lexicon. For example, we can test the case of ATR:
As expected, we received the warning that no URI was found:
An approach to address this issue may involve keeping track of the original label of each lexicon entry in another file.
We may also have to deal with ambiguity problems where a label may represent multiple terms. For example, if we check how many classes the acronym ATS may represent:
We can see that it may represent two classes:
These two classes represent two distinct diseases, namely Andersen-Tawil syndrome and X-linked Alport syndrome, respectively.
We can also obtain their alternative labels by providing the two URIs as standard input to the getlabels.sh script:
We will get the following two lists, both containing ATS as expected:
If we find an ATS mention in the text, the challenge is to identify which of the syndromes the mention refers to. To address this challenge, we may have to use advanced entity linking techniques that analyze the context of the text.
An intuitive solution is to select the class closest in meaning to the other classes mentioned in the surrounding text. This assumes that entities present in a piece of text are somehow semantically related to each other, which is normally the case. At least the author assumed some type of relation between them, otherwise the entities would not be in the same sentence.
Let us consider the following sentence about genes and related syndromes from our text file chebi_27732_sentences.txt (on line 436):
Now assume that the label Andersen-Tawil syndrome has been replaced by the acronym ATS:
Then, to identify the diseases in the previous sentence, we can execute the following command:
We have a list of labels that can help us decide which is the right class representing ATS:
To find their URIs we can use the geturi.sh script:
The only ambiguity is for ATS that returns two URIs, one representing the Andersen-Tawil syndrome (DOID:0050434) and the other representing the X-linked Alport syndrome (DOID:0110034):
To decide which of the two URIs we should select, we can measure how close in meaning they are to the other diseases also found in the text.
Semantic similarity measures have been successfully applied to solve these ambiguity problems (Grego and Couto 2013). Semantic similarity quantifies how close two classes are in terms of semantics encoded in a given ontology (Couto and Lamurias 2019). Using the web tool Semantic Similarity Measures using Disjunctive Shared Information (DiShIn) (footnote 13), we can calculate the semantic similarity between our recognized classes. For example, we can calculate the similarity between LQT1 (DOID:0110644) and Andersen-Tawil syndrome (DOID:0050434) (see Fig. 5.6), and the similarity between LQT1 and X-linked Alport syndrome (DOID:0110034) (see Fig. 5.7).
DiShIn provides the similarity values for three measures, namely Resnik, Lin and Jiang-Conrath (Resnik 1995; Lin et al. 1998; Jiang and Conrath 1997). The last two measures provide values between 0 and 1, and Jiang-Conrath is a distance measure that is converted to similarity.
We can see that for all measures LQT1 is much more similar to Andersen-Tawil syndrome than to X-linked Alport syndrome. Moreover, Jiang-Conrath’s measure gives the only similarity value larger than zero for X-linked Alport syndrome, since it is a converted distance measure. We obtain similar results if we replace LQT1 by LQT2, LQT3, LQT5, or LQT6. This means that by using semantic similarity we can identify Andersen-Tawil syndrome as the correct linked entity for the mention ATS in this text.
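For reference, the three measures are commonly written as below. These are the usual textbook formulations based on the information content (IC) of a class and on the most informative common ancestor of the two classes; the symbols are introduced here for illustration, and both the conversion of the Jiang-Conrath distance into a similarity and the way DiShIn selects the shared ancestors are assumptions that may differ from the tool.

```latex
\begin{align*}
\mathrm{IC}(c) &= -\log p(c) \\
\mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) &= \mathrm{IC}(c_{\mathrm{MICA}}) \\
\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2) &= \frac{2\,\mathrm{IC}(c_{\mathrm{MICA}})}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)} \\
\mathrm{dist}_{\mathrm{JC}}(c_1, c_2) &= \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\,\mathrm{IC}(c_{\mathrm{MICA}}) \\
\mathrm{sim}_{\mathrm{JC}}(c_1, c_2) &= \frac{1}{\mathrm{dist}_{\mathrm{JC}}(c_1, c_2) + 1}
\end{align*}
```

Here p(c) is the probability of the class c and c_MICA is the common ancestor of c_1 and c_2 with the highest IC. Under these formulations, when the only shared ancestor is the root class, whose IC is zero, the Resnik and Lin values are zero, whereas the converted Jiang-Conrath value stays above zero because the distance is finite.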
To automate this process, we can also execute DiShIn from the command line (footnote 14); however, we may need to install python (or python3) and SQLite (footnote 15).
First, we need to install it locally using the git command line:
The git command automatically retrieves a tool from the GitHub (footnote 16) software repository.
If everything works fine, we should be able to see something like this in our display:
If the git command is not available, we can alternatively download the compressed file (zip), extract its contents and then move to the DiShIn folder:
The option -L enables the curl command to follow a URL redirection (footnote 17). The equivalent long form to the -L option is --location.
We now have to copy the Human Disease Ontology into the DiShIn folder using the cp command, and then enter that folder:
To execute DiShIn, we need first to convert the ontology file named doid.owl into a database (SQLite) file named doid.db:
If the module rdflib is not installed, the following error will be displayed:
We can try to install it (footnote 18), but the conversion will still take a few minutes to run.
Alternatively, we can download the latest database version:
Once the database file is in place, we can execute DiShIn by providing the database and the identifiers of two classes:
The output of the first command will be the semantic similarity values between LQT1 (DOID:0110644) and Andersen-Tawil syndrome (DOID:0050434):
The output of the second command will be the semantic similarity values between LQT1 (DOID:0110644) and X-linked Alport syndrome (DOID:0110034):
In the end, we should not forget to return to our parent folder:
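From inside the tool's folder, a plain cd .. returns to the parent folder. As an alternative that avoids having to remember this step, the commands can be run in a subshell, so that the change of folder never affects our current shell. The helper name run_in below is hypothetical, and the root folder is used only as a stand-in for the tool's folder.

```shell
# run_in DIR COMMAND [ARGS...]
# Runs COMMAND inside DIR using a subshell; the calling shell keeps its
# own working folder, so no "cd .." is needed afterwards.
run_in() {
  dir=$1
  shift
  ( cd "$dir" && "$@" )
}

run_in / pwd   # prints /
pwd            # still prints the folder we started from
```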
Learning Python (footnote 19) and SQL (footnote 20) is outside the scope of this book, but if we do not intend to make any modifications, the above steps should be quite simple to execute.
The online tool MER is based on a shell script (footnote 21), so it can be easily executed from the command line to efficiently recognize and link entities using large lexicons.
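Once similarity values are available for each candidate class, choosing the linked entity amounts to keeping the candidate with the highest value. The sketch below assumes a simplified, hypothetical input of one "identifier, tab, score" line per candidate, which is not DiShIn's output format. The function name best_candidate and the score 0.5 are illustrative; the zero score mirrors the Lin value discussed earlier for X-linked Alport syndrome.

```shell
# best_candidate
# Reads "identifier<TAB>score" lines from standard input and prints the
# identifier with the highest score (the first one wins in case of a tie).
best_candidate() {
  awk -F '\t' '
    NR == 1 || $2 > best { best = $2; id = $1 }
    END { if (NR > 0) print id }
  '
}

# Hypothetical scores for the two candidates for the mention ATS
printf 'DOID_0050434\t0.5\nDOID_0110034\t0.0\n' | best_candidate
```

With these illustrative scores the function prints DOID_0050434, the identifier of Andersen-Tawil syndrome.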
First, we need to install it locally using the git command line:
If everything works fine, we should be able to see something like this in our display:
If the git command is not available, we can alternatively download the compressed file (zip), and extract its contents:
We now have to copy the Human Disease Ontology into the data folder of MER, and then enter into the MER folder:
To execute MER, we first need to create the lexicon files:
This may take a few minutes to run. However, we only need to execute it once for each new version of the ontology that we want to use. If we wait, the output will include the last patterns of each of the lexicon files.
Alternatively, we can download the lexicon files, and extract them into the data folder:
We can check the contents of the created lexicons by using the tail command:
These patterns are created according to the number of words of each term.
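As an aside, the idea of grouping terms by their number of words can be sketched with awk. This is only an illustration under stated assumptions, not MER's own script: the function name split_by_words, the output file names, and the three sample labels are hypothetical, and MER's lexicon files hold patterns rather than raw labels.

```shell
# split_by_words LABELS_FILE PREFIX
# Writes one-word, two-word and longer labels to three separate files
# named PREFIX_word1.txt, PREFIX_word2.txt and PREFIX_words.txt.
split_by_words() {
  awk -v p="$2" '
    NF == 1 { print > (p "_word1.txt") }
    NF == 2 { print > (p "_word2.txt") }
    NF >  2 { print > (p "_words.txt") }
  ' "$1"
}

# Small demonstration with three sample labels in a temporary folder
demo_dir=$(mktemp -d)
printf 'asthma\nmalignant hyperthermia\nlong QT syndrome 1\n' > "$demo_dir/labels.txt"
split_by_words "$demo_dir/labels.txt" "$demo_dir/demo"
cat "$demo_dir/demo_word2.txt"
```

Keeping terms of different lengths apart allows each group to be matched with a strategy suited to its number of words.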
The output should be something like this:
Now we are ready to execute MER, by providing each sentence from the file chebi_27732_sentences.txt as an argument to its get_entities.sh script.
We removed single quotes from the text, since they are special characters for the xargs command. We should note that this is the get_entities.sh script inside the MER folder, not the one we created before.
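The effect of removing single quotes can be tried on a small example. In the sketch below, echo is only a stand-in for the get_entities.sh script, and the two sentences and the helper name strip_single_quotes are hypothetical. Without the tr step, xargs would typically stop with an error about an unmatched single quote when it reaches the first sentence.

```shell
# strip_single_quotes FILE
# Prints FILE with every single quote removed, so that each line can be
# passed safely to xargs.
strip_single_quotes() {
  tr -d "'" < "$1"
}

sentences=$(mktemp)
printf "Alzheimer's disease was ruled out.\nThe patient had asthma.\n" > "$sentences"

# -I {} makes xargs pass each whole line as a single argument
strip_single_quotes "$sentences" | xargs -I {} echo "sentence: {}"
```

Each input line reaches the command as one argument, with the apostrophe dropped from the first sentence.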
Now we will be able to obtain a large number of matches:
The first two numbers represent the start and end position of the match in the sentence. They are followed by the name of the disease and its URI in the ontology.
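A quick way to check a match is to cut the reported span out of the sentence and compare it with the name. The sketch below assumes that the start position is counted from zero and that the end position is exclusive; this convention, the helper name check_match, the sentence, and the offsets are all assumptions made for illustration. The URI combines the identifier given earlier for Andersen-Tawil syndrome with the usual OBO prefix.

```shell
# check_match SENTENCE
# Reads "start<TAB>end<TAB>name<TAB>uri" lines from standard input and
# prints the text found between start and end in SENTENCE.
# Assumes a zero-based start and an exclusive end position.
check_match() {
  awk -F '\t' -v s="$1" '{ print substr(s, $1 + 1, $2 - $1) }'
}

sentence='Patients with Andersen-Tawil syndrome were studied.'
printf '14\t37\tAndersen-Tawil syndrome\thttp://purl.obolibrary.org/obo/DOID_0050434\n' |
  check_match "$sentence"
```

If the printed text differs from the name in the third column, the offsets probably follow a different convention than the one assumed here.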
We can also redirect the output to a TSV file named diseases_recognized.tsv:
We can now open the file in our spreadsheet application, such as LibreOffice Calc or Microsoft Excel (see Fig. 5.8).
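Before opening a spreadsheet, the TSV file can also be summarized on the command line. The sketch below counts how many times each disease name appears in the third column. The function name count_diseases and the three rows are hypothetical, created here only so that the example runs on its own.

```shell
# count_diseases TSV_FILE
# Prints "count<TAB>name" for each distinct name in the third column,
# with the most frequent name first.
count_diseases() {
  awk -F '\t' '
    { n[$3]++ }
    END { for (d in n) printf "%d\t%s\n", n[d], d }
  ' "$1" | sort -rn
}

# Hypothetical rows in the layout: start, end, name, URI
demo_tsv=$(mktemp)
printf '14\t37\tAndersen-Tawil syndrome\thttp://purl.obolibrary.org/obo/DOID_0050434\n' > "$demo_tsv"
printf '0\t23\tAndersen-Tawil syndrome\thttp://purl.obolibrary.org/obo/DOID_0050434\n' >> "$demo_tsv"
printf '5\t29\tX-linked Alport syndrome\thttp://purl.obolibrary.org/obo/DOID_0110034\n' >> "$demo_tsv"

count_diseases "$demo_tsv"
```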
Again, we should not forget to return to our parent folder in the end:
To know more about biomedical ontologies, the book entitled Introduction to bio-ontologies is an excellent option, covering most of the ontologies and computational techniques exploring them (Robinson and Bauer 2011).
Another approach is to read and watch the materials of the training course given by Barry Smith (footnote 22).
Footnote 15: apt install python sqlite3 or apt install python3 sqlite3
Couto F, Lamurias A (2018) MER: a shell script and annotation server for minimal named entity recognition and linking. J Cheminform 10(1):58
Couto F, Lamurias A (2019) Semantic similarity definition. In: Ranganathan S, Nakai K, Schönbach C, Gribskov M (eds) Encyclopedia of bioinformatics and computational biology, vol 1. Oxford: Elsevier
Couto FM, Campos LF, Lamurias A (2017) MER: a minimal named-entity recognition tagger and annotation server. Proc BioCreative 5:130–7
Grego T, Couto FM (2013) Enhancement of chemical entity identification in text using semantic similarity validation. PLoS ONE 8(5):e62984
Jiang JJ, Conrath DW (1997) Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the 10th research on computational linguistics international conference, pp 19–33
Lin D et al (1998) An information-theoretic definition of similarity. In: ICML, vol 98, pp 296–304. Citeseer
Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th international joint conference on artificial intelligence, vol 1, pp 448–453. Morgan Kaufmann Publishers Inc.
Robinson PN, Bauer S (2011) Introduction to bio-ontologies. Chapman and Hall/CRC, Boca Raton
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
© 2019 The Author(s)
Couto, F.M. (2019). Semantic Processing. In: Data and Text Processing for Health and Life Sciences. Advances in Experimental Medicine and Biology, vol 1137. Springer, Cham. https://doi.org/10.1007/978-3-030-13845-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-13844-8
Online ISBN: 978-3-030-13845-5
eBook Packages: Biomedical and Life Sciences (R0)