The DrugDDI Corpus
Most biomedical corpora (BioInfer [8], BioCreAtIvE-PPI [9] or AIMed [10]) focus on describing genetic or protein interactions, but none contains DDI. While NLP techniques are relatively domain-portable, corpora are not [11]. For this reason, we have created the first annotated corpus that studies the phenomenon of interactions among drugs.
The DrugDDI corpus consists of 579 documents describing DDI. These documents were randomly selected from the DrugBank database [12] and analyzed by the UMLS MetaMap Transfer (MMTx) tool [7], which performs sentence splitting, tokenization, POS-tagging, shallow syntactic parsing, and linking of phrases with Unified Medical Language System (UMLS) Metathesaurus concepts. Thus, MMTx can recognize a variety of biomedical entities, including drugs. The DrugDDI corpus contains 66,021 phrases, of which 22.6% (14,930) are drugs. It contains 3,775 sentences with two or more drugs, although only 2,044 sentences have at least one interaction. A total of 3,160 DDI were annotated at sentence level with the assistance of a pharmacist. The average number of interactions per document is 5.46 and per sentence 0.54.
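As a quick consistency check, the two ratios that can be derived directly from the counts reported above can be recomputed. The variable names are ours; the per-sentence average is not recomputed because the total number of sentences is not stated in this section.

```python
# Counts as reported in the corpus description.
documents = 579
phrases = 66_021
drug_phrases = 14_930
interactions = 3_160

# Share of phrases that MMTx recognised as drugs, as a percentage.
drug_share = 100 * drug_phrases / phrases
# Mean number of annotated DDI per document.
ddi_per_document = interactions / documents

print(f"{drug_share:.1f}% of phrases are drugs")
print(f"{ddi_per_document:.2f} DDI per document")
```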
Detecting coordinate structures
Coordination is an extremely common grammatical phenomenon in biomedical texts. Since coordinate constituents are semantically close and usually play the same syntactic and grammatical roles in a sentence, it is necessary to group them together [6]. For example, the following sentence contains three DDI:
In order to extract them, it is necessary to interpret the coordinate structure in it: probenecid, sulfinpyrazone, and phenylbutazone, in which the conjunction and coordinates the conjunct probenecid with sulfinpyrazone and with phenylbutazone.
Although a wide variety of structures can be conjoined, not all coordinations are acceptable. The Coordination of Likes Constraint (CLC) [13] (also called the Law of Coordination of Likes) asserts that syntactically different categories cannot be conjoined. However, our observation of the corpus shows that this constraint is too restrictive for the kind of parsing provided by MMTx. For example, the above sentence demonstrates that being of the same syntactic category is too strong a requirement for conjuncts in a coordinate construction, since a prepositional phrase, of probenecid, can be conjoined with two noun phrases: sulfinpyrazone and phenylbutazone. In fact, we have observed in the corpus that coordinate structures involving constituents with different syntactic categories are very common. Sometimes this happens because MMTx is not able to determine the syntactic type of a phrase and classifies it as an unknown phrase (that is, with the tag UNK).
Table 2 presents a set of syntactic patterns to detect coordinate structures. The first row shows a pattern in which different syntactic types can be combined to detect coordination at the phrase level. An exception is made for verb phrases, since the coordination between a verb phrase and another type of syntactic phrase is a coordination between clauses. Thus, the second pattern only allows verb phrases to be connected with other verb phrases. Since this section focuses on coordination between phrases, we have only considered "and", "or", "nor", "and/or" and "as well as" as possible coordinators to link phrases. Table 2 also includes a syntactic pattern to detect correlative expressions such as "both midazolam and triazolam" (third row).
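The phrase-level behaviour described above can be illustrated with a minimal Python sketch. It is not the Table 2 patterns themselves, which are not reproduced here: the `(type, text)` phrase representation, the function names and the greedy left-to-right scan are all assumptions made for illustration. The only rules taken from the text are the list of coordinators and the restriction that verb phrases coordinate only with verb phrases, while other phrase types (NP, PP, UNK) may be mixed.

```python
# Coordinators considered for phrase-level coordination (from the text).
COORDINATORS = {"and", "or", "nor", "and/or", "as well as"}

def is_coordinator(phrase):
    return phrase[1].lower() in COORDINATORS

def is_separator(phrase):
    # Commas separate non-final conjuncts; a coordinator closes the list.
    return phrase[1] == "," or is_coordinator(phrase)

def compatible(type_a, type_b):
    # VP may only be conjoined with VP; all other types may be mixed.
    return (type_a == "VP") == (type_b == "VP")

def find_coordinations(phrases):
    """phrases: list of (type, text) pairs (hypothetical format).
    Returns one list of conjunct texts per coordinate structure found."""
    results, i, n = [], 0, len(phrases)
    while i < n:
        if is_separator(phrases[i]):
            i += 1
            continue
        first_type = phrases[i][0]
        conjuncts, j, closed = [phrases[i][1]], i + 1, False
        while j < n and not closed:
            k, has_coord = j, False
            while k < n and is_separator(phrases[k]):  # consume ", and" runs
                has_coord = has_coord or is_coordinator(phrases[k])
                k += 1
            if k == j or k >= n or not compatible(first_type, phrases[k][0]):
                break
            conjuncts.append(phrases[k][1])
            closed, j = has_coord, k + 1
        if closed:              # a coordinator was seen: accept the structure
            results.append(conjuncts)
            i = j
        else:
            i += 1
    return results

fragment = [("PP", "of probenecid"), ("PUNCT", ","), ("NP", "sulfinpyrazone"),
            ("PUNCT", ","), ("CONJ", "and"), ("NP", "phenylbutazone")]
print(find_coordinations(fragment))
```

On the fragment discussed above, the prepositional phrase and the two noun phrases are grouped into a single coordinate structure, whereas a verb phrase followed by a coordinator and a noun phrase is left ungrouped.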
Table 2 Patterns to detect coordinate, correlative and appositive structures.
Identifying appositions
There are divergent views within Linguistics with regard to what is or is not an apposition (also called appositional or appositive structure). [14] and [15] restrict the category of apposition to coreferential noun phrases (called appositives) that are juxtaposed and refer to the same extralinguistic entity. [16] and [17] expand this definition with the inclusion of constructions such as clauses and sentences as possible elements of an apposition. [18] admits as apposition only those constructions which can be linked by a marker of apposition.
Although the above approaches provide insights into the category of apposition, they provide either an inadequate or an incomplete description of it. The objective of this work is not to provide a formal and complete description of apposition, but rather to identify appositions, in particular those that contain drugs. Thus, we only deal with appositions that are linked by a marker of apposition, since this kind of apposition appears frequently in sentences that contain DDI. Markers are helpful clues for detecting these structures. The markers of apposition that we have used in this approach are: such as, like, including, for example, e.g., and i.e. Appositions that are not linked by any marker are also frequent in scientific texts; however, the lack of markers makes this kind of apposition extremely difficult to detect. Moreover, we have observed that they hardly ever occur in expressions describing DDI.
We have defined a set of syntactic patterns to identify appositions (see Table 2). Appositions comprise at least two contiguous phrases, the second of which is marked by clues such as parentheses or markers. This second phrase may be a coordinate structure. The APPOSITIVE pattern recognizes the intervening elements in an apposition, that is, its appositives. This pattern matches a phrase type (provided by MMTx) or another apposition. In this way, the pattern is able to recognize nested appositions. Regarding phrase types, the pattern does not consider types such as VP, CONJ, ADV, or ADJ, since our main focus is to recognize appositions containing drugs (drugs only appear in noun, prepositional and unknown phrases). The APPOSITION pattern is used to recognize appositions. This pattern matches an intervening element APPOSITIVE followed by a marker and by one or more intervening elements expressed by coordinate phrases. Parentheses are also included in the pattern. Two different DDI can be extracted from the sentence:
- (1) Catecholamine-depleting drugs with beta-blocking agents, and (2) Reserpine with beta-blocking agents.
Thus, it is essential to detect and resolve the appositions occurring in sentences before applying the lexical patterns responsible for DDI extraction. The appositions are first encapsulated as single units and then unfolded once a relation has been obtained by a lexical pattern.
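The encapsulate-then-unfold step can be sketched as follows. This is only an illustration under our own assumptions: the `Apposition` class, the `unfold` and `unfold_relation` helpers, and the representation of a relation as a pair of arguments are hypothetical and are not taken from the system described here.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Apposition:
    """An encapsulated apposition: a head phrase plus its appositives.
    Appositives may themselves be Apposition objects (nested appositions)."""
    head: str
    appositives: tuple

def unfold(argument):
    """Expand an argument into the flat list of phrases it stands for."""
    if isinstance(argument, Apposition):
        expanded = [argument.head]
        for item in argument.appositives:
            expanded.extend(unfold(item))
        return expanded
    return [argument]

def unfold_relation(left, right):
    """Turn one relation between encapsulated arguments into one
    relation per combination of the phrases they contain."""
    return list(product(unfold(left), unfold(right)))

# The apposition is treated as one unit while the lexical pattern is applied.
unit = Apposition("Catecholamine-depleting drugs", ("reserpine",))
for pair in unfold_relation(unit, "beta-blocking agents"):
    print(pair)
```

Treating the apposition as a single argument keeps the lexical patterns simple; the expansion into individual drug pairs happens only after a pattern has matched.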
Clause splitting
Biomedical texts usually consist of extremely long sentences. Long sentences are usually complex or compound-complex sentences, that is, they contain two or more clauses. For example, the following sentence contains two independent clauses (marked with clause1 and clause2).
Both clauses have the same subject: Coadministration of CRIXIVAN and other drugs that inhibit CYP3A4. This subject includes a relative clause (marked with rel) whose subject is other drugs.
Parsing-based and pattern-based approaches are poorly suited to complex and compound sentences. Parsers are usually trained on general English text corpora and are difficult to extend to new domains. For this reason, they often fail on complex biomedical sentences. Regarding pattern-based methods, relations may be extracted incorrectly when patterns are matched beyond the scope of one clause or other kinds of grammatical units [6]. For example, the previous example contains a relative clause (that inhibit CYP3A4), which hinders the matching between the sentence and the P8 pattern (see Table 3). This section proposes an algorithm for clause splitting that aims to reduce the complexity of sentences in biomedical texts, in order to improve the performance of our pattern-based method for DDI extraction. Clause splitting is the task of dividing a complex or compound sentence into several clauses. The algorithm exploits syntactic and lexical information provided by MMTx. Once sentences have been split into clauses, a set of simplification rules is used to generate new independent sentences from the clauses. Finally, the lexical patterns defined by the pharmacist can be applied to the generated sentences in order to extract DDI.
Table 3 Lexical patterns to extract DDIs.
We now explain how sentences are broken into clauses. First of all, it is necessary to ensure that the sentence is actually a compound or a complex sentence. It is not enough to check that there is some coordinator or subordinator in the sentence, since sometimes they function not as connectors between clauses but as prepositions, adverbs, etc. A possible heuristic is to count the number of verb phrases included in the sentence. Defining the verb phrase is not an easy task. In fact, linguists have not even reached an agreement on what the verb phrase should include: only the words that are verbs, or also the complements of the verb. While generative grammarians propose that a verb phrase consists of various combinations of the main verb and any auxiliary verbs, plus optional specifiers, complements, and adjuncts (for example, Anagrelide [may interact with any of these compounds]_VP), for functionalist linguists verb phrases consist only of main verbs, auxiliary verbs, and other infinitive or participle constructions [19] (for example, Anagrelide [may interact]_VP [with any of these compounds]_PP). We have decided to adopt the latter definition, that is, we define a verb phrase as a syntactic structure that is composed of a main verb and, optionally, of auxiliary and modal verbs, with the complements excluded from this structure. Unfortunately, MMTx uses an even simpler definition of verb phrase: it labels each verb as a VP. Forms of to be are labeled as V/be. In order to group the main verb, its auxiliary or modal verbs, as well as its adverbial complements in the same verb phrase, we define the VP-pattern as: [VP|V/be|VPG] (V/be)? (NOT)? (ADV)? (VP|V/be|VPG)? (TO VP)?. The VP-pattern is applied to sentences in order to merge their adjacent verb phrases into an extended verb phrase. If a sentence contains two or more extended VPs, then we can conclude that it is a complex or compound sentence. However, if a sentence contains only one extended VP, it is a simple sentence even if it contains a conjunction. The first column in Table 4 shows some sentences parsed by MMTx, while the second column shows the result of applying our VP-pattern to them.
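The VP-pattern can be read as a regular expression over the sequence of MMTx phrase tags. The sketch below is one possible encoding, given only as an illustration: representing a sentence as a list of tag strings, the helper names, and the two example tag sequences are our assumptions, not actual MMTx output.

```python
import re

# First and optional second verbal element of the VP-pattern.
HEAD = r"(?:VPG|VP|V/be)"

# [VP|V/be|VPG] (V/be)? (NOT)? (ADV)? (VP|V/be|VPG)? (TO VP)?
# Every tag is followed by one space, so tags are matched whole.
VP_PATTERN = re.compile(
    rf"(?<!\S){HEAD} (?:V/be )?(?:NOT )?(?:ADV )?(?:{HEAD} )?(?:TO VP )?"
)

def count_extended_vps(tags):
    """Count the extended verb phrases in a list of phrase tags."""
    sequence = "".join(tag + " " for tag in tags)
    return sum(1 for _ in VP_PATTERN.finditer(sequence))

def is_complex_or_compound(tags):
    """Two or more extended VPs indicate more than one clause."""
    return count_extended_vps(tags) >= 2

# Hypothetical tag sequences, for illustration only.
simple = ["NP", "VP", "VP", "PP"]                     # modal + main verb
multi = ["NP", "V/be", "NOT", "VP", "CONJ", "NP", "VP", "NP"]

print(count_extended_vps(simple), is_complex_or_compound(simple))
print(count_extended_vps(multi), is_complex_or_compound(multi))
```

Because matches are greedy and non-overlapping, two adjacent verb tags are merged into a single extended VP rather than being counted twice.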
Table 4 How does MMTx label the verb phrases?
Once it has been determined that the sentence contains two or more clauses, the following step is to determine the type of sentence. Such information will be very useful in detecting the clause boundaries. In the English language, a compound sentence is composed of two or more independent clauses joined by a conjunction that can be a coordinator (coordinating conjunction: for, and, nor, but, or, yet, so), a correlative conjunction (both, either, whether... or; not only... but also) or an independent marker word (however, moreover, furthermore, consequently, nevertheless, therefore). Semicolons and commas can also function as conjunctions. If an independent marker occurs at the beginning of the sentence, then a semicolon or a comma should separate the clauses. If the second independent clause starts with an independent marker, then a semicolon or a comma is needed before the marker [20]. The independent markers can also occur in simple sentences, as in the following sentence: However, initial dose modification is generally not necessary.
A complex sentence has an independent clause joined with one or more subordinate clauses. Subordinate clauses contain both a subject and a verb, but do not express a complete thought. A complex sentence always has a relative pronoun (who, that, which, whoever, whom, whomever, whose, whichever, whatever) or a subordinator (after, although, as, as if, because, before, even if, even though, if, in order to, since, though, unless, until, whatever, whether, when, whenever, while) that links the clauses. If the complex sentence begins with a subordinator, that is, the subordinate clause is at the beginning of the sentence, then the subordinate clause should end with a comma. On the other hand, if the independent clause is at the beginning of the sentence and the subordinator is in the middle, then no comma is required [20].
Taking into account the above clues, we initially defined a set of lexical patterns for detecting clause boundaries in compound and complex sentences (see Table 5). Relative clauses are a special case, since they often appear in the middle of a main clause, splitting it into two parts. If a sentence matches any of these patterns, then its clauses can be easily extracted from the match.
Table 5 Initial patterns for clause splitting.
However, these patterns are not always enough. Determining where a clause ends is not always a trivial task, since there might be commas or conjunctions internal to the clause. Moreover, some conjunctions can also function as prepositions (for example, for) or as adverbs (for example, yet, so). The problem regarding adverbs is easily resolved (at least in most cases) because MMTx labels them as CONJ phrases when they function as coordinators (though sometimes MMTx mislabels the phrases or is not able to determine their types). The previous identification of appositions and coordinate structures reduces the number of commas and conjunctions internal to a clause. However, for each comma or coordinator not included in any apposition or coordinate structure, it is necessary to know whether or not the clause ends there. Therefore, the above patterns have been replaced with a set of heuristics based on the observation of fifty compound and complex sentences. These heuristics are encoded in Algorithm 1.
In brief, the algorithm works as follows. The input of the algorithm is the sentence in which the verb phrases have been joined by the VP-pattern. First of all, the algorithm must check that the sentence contains two or more clauses. Then, the sentence is scanned for as long as it contains a separator marker. A separator marker can be a coordinator, an independent marker, a dependent marker, a semicolon or a comma. The coordinators and subordinators must be labeled by MMTx as CONJ phrases; otherwise, they are not considered conjunctions. Then, the algorithm iteratively finds candidate clauses, that is, substrings of the sentence between markers. If a candidate clause contains a verb phrase, then it is considered a clause. The algorithm is able to decide the kind of clause, that is, independent or subordinate.
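The description above can be approximated by the following simplified sketch. It is not Algorithm 1, which is not reproduced in this section: the `(tag, text)` input format, the abbreviated word lists, and the decisions to keep a separator inside a fragment that has no verb phrase yet and to attach a trailing verbless fragment to the preceding clause are our own assumptions.

```python
SUBORDINATORS = {"because", "although", "while", "if", "since", "unless", "when"}
COORDINATORS = {"and", "or", "nor", "but", "for", "yet", "so"}
INDEPENDENT_MARKERS = {"however", "moreover", "furthermore", "therefore"}

def is_subordinator(tag, text):
    return tag == "CONJ" and text.lower() in SUBORDINATORS

def is_separator(tag, text):
    # Coordinators and subordinators only count when tagged CONJ.
    if text in {",", ";"} or text.lower() in INDEPENDENT_MARKERS:
        return True
    return tag == "CONJ" and text.lower() in COORDINATORS | SUBORDINATORS

def has_vp(segment):
    return any(tag == "VP" for tag, _ in segment)

def join(segment):
    return " ".join(text for _, text in segment)

def split_clauses(phrases):
    """phrases: (tag, text) pairs with extended VPs already merged as 'VP'.
    Returns a list of (kind, clause_text) pairs."""
    if sum(tag == "VP" for tag, _ in phrases) < 2:
        return [("independent", join(phrases))]       # simple sentence
    clauses, current, kind = [], [], "independent"
    for tag, text in phrases:
        if not is_separator(tag, text):
            current.append((tag, text))
        elif has_vp(current):                         # candidate is a clause
            clauses.append((kind, join(current)))
            current = []
            kind = "subordinate" if is_subordinator(tag, text) else "independent"
        elif not current:                             # marker opens a clause
            if is_subordinator(tag, text):
                kind = "subordinate"
        else:                                         # separator is clause-internal
            current.append((tag, text))
    if has_vp(current) or not clauses:
        clauses.append((kind, join(current)))
    elif current:                                     # verbless tail: attach it
        last_kind, last_text = clauses[-1]
        clauses[-1] = (last_kind, last_text + " " + join(current))
    return clauses

sentence = [("NP", "Trimeprazine"), ("VP", "also decreases"), ("NP", "the effect"),
            ("PP", "of heparin and oral anticoagulants"), ("PUNCT", ","),
            ("CONJ", "while"), ("NP", "MAOIs"), ("VP", "can increase"),
            ("NP", "the effect"), ("PP", "of trimeprazine")]
for clause in split_clauses(sentence):
    print(clause)
```

The coordinate structure "heparin and oral anticoagulants" is passed in as a single phrase, mirroring the earlier step in which coordinate structures are identified before clause splitting.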
Rules for sentence simplification
Once appositions and coordinate structures have been recognized, and compound and complex sentences have been split into clauses, it is possible to apply a set of rules for sentence simplification. These rules turn complex and compound sentences into simple sentences. Then, the pattern-based approach for DDI extraction is applied to these simpler sentences.
We have adapted some of the simplification rules presented in [4]. That work also recognized relative clauses, apposition, coordination and subordination; however, its goal was not relation extraction, but to provide syntactic simplification of sentences for improving the performance of NLP applications such as text summarization or machine translation. [4] proposes seven simplification rules to generate new simplified sentences from the clauses of complex and compound sentences. Table 6 presents the rules adapted in our approach and some sentences broken up into simpler sentences by these rules. The following list shows examples of how the simplification rules split complex and compound sentences:
Table 6 Rules to generate new simplified sentences from the clauses. The clause CLAUSE_REL(NP) means that it is attached to the noun phrase NP.
- [Because]_MARKER [busulfan is eliminated from the body via conjugation with glutathione]_CLAUSE1 [use of acetaminophen prior to (72 hours) or concurrent with BUSULFEX may result in reduced busulfan clearance based upon the known property of acetaminophen to decrease glutathione levels in the blood and tissues]_CLAUSE2.
- [Although]_MARKER [the interactions observed in these studies do not appear to be of major clinical importance]_CLAUSE1, [BREVIBLOC should be titrated with caution in patients being treated concurrently with digoxin, morphine, succinylcholine or warfarin.]_CLAUSE2
- [Trimeprazine also decreases the effect of heparin and oral anticoagulants,]_CLAUSE1 [while]_MARKER [MAOIs can increase the effect of trimeprazine.]_CLAUSE2
The following sentence (containing a relative clause) is transformed into the two simpler sentences (1) and (2):
- Since the excretion of oxipurinol is similar to that of urate, uricosuric agents, which increase the excretion of urate, are also likely to increase the excretion of oxipurinol and thus lower the degree of inhibition of xanthine oxidase.
- (1) Since the excretion of oxipurinol is similar to that of urate, uricosuric agents are also likely to increase the excretion of oxipurinol and thus lower the degree of inhibition of xanthine oxidase.
- (2) Uricosuric agents (which) increase the excretion of urate.
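The relative-clause transformation shown in this example can be sketched as a small function. This is an illustration under our own assumptions, not the rule as formalised in Table 6: the function name and the idea that the sentence arrives already segmented into the text before the noun phrase, the noun phrase itself, the relative clause and the remainder are hypothetical.

```python
def simplify_relative(before, noun_phrase, relative_clause, after):
    """Split a sentence containing a relative clause attached to a noun phrase.

    Returns (main_sentence, new_sentence): the main sentence with the
    relative clause removed, and a new sentence whose subject is the
    noun phrase and whose predicate is the body of the relative clause."""
    main_sentence = f"{before}{noun_phrase} {after}"
    # Drop the relative pronoun (first word) and reuse the NP as subject.
    _pronoun, _, body = relative_clause.partition(" ")
    subject = noun_phrase[0].upper() + noun_phrase[1:]
    new_sentence = f"{subject} {body}."
    return main_sentence, new_sentence

main, extra = simplify_relative(
    before="Since the excretion of oxipurinol is similar to that of urate, ",
    noun_phrase="uricosuric agents",
    relative_clause="which increase the excretion of urate",
    after=("are also likely to increase the excretion of oxipurinol and thus "
           "lower the degree of inhibition of xanthine oxidase."),
)
print(main)
print(extra)
```

Both resulting sentences keep the drug mention as their subject, so the lexical patterns can be applied to each of them independently.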
Lexical patterns for DDI extraction
Despite the richness of natural language expressions, in practice DDI are often expressed by a limited number of constructions. This makes patterns a well-suited method for their extraction. Based on her professional experience and on observation of the corpus, our pharmacist defined a set of lexical patterns (see Table 3) to capture the various language constructions used to express DDI in pharmacological texts. Moreover, the pharmacist provided a set of synonyms for the verbs that can indicate a possible DDI (see Table 7).
Table 7 Auxiliary patterns.