1 Introduction

Medication errors involving confusable drug names arise from weaknesses in the medication system and from human factors [1,2,3]. Several human factors contribute to the Look-Alike and Sound-Alike (LASA) drug name problem, including errors of visual perception, auditory perception, short-term memory, and motor control. However, the similarity between confusable drug names is a detectable root cause. Drug names such as cycloserine and cyclosporine are involved in LASA errors. LASA pairs normally sound similar and have similar spellings [4]. Sometimes the confusion arises when the names are communicated in handwritten prescriptions, as with the drugs Avandia and Coumadin [5]. In other cases, the confusion occurs in verbal communication when the names are pronounced similarly, as with Zantac and Xanax [6].

The Institute for Safe Medication Practices (ISMP) publishes a list of previously reported LASA pairs [7,8,9,10]. Regulatory agencies, including the Food and Drug Administration (FDA), the World Health Organization (WHO), and the Joint Commission, are implementing strategies to identify and prevent LASA errors.

String-matching algorithms are used to measure the distance or similarity between two drug names and to identify potentially confusable drug names a priori. For example, Edit Distance (ED) measures the minimum number of insertion, deletion and substitution operations needed to transform one string into another [11]. The LASA pair cycloserine and cyclosporine has a distance of two because at least two edit operations (substituting e with p and inserting the letter o) are needed to transform cycloserine into cyclosporine.

Longest Common Subsequence (LCS) measures the length of the longest common subsequence of two drug names. Normalized LCS (NLCS) is obtained by dividing the LCS length by the length of the longer drug name. In the previous example, /cyclos-rine/ is the LCS (10 letters) and the NLCS is 10/12 = 0.833. A weakness of NLCS is that it also counts subsequences that do not reflect a real similarity between drug names. For example, the non-LASA pair Benadryl and Cardura has the LCS /adr/ [6].

N-gram similarity represents a drug name as the set of all its contiguous subsequences (grams) of size n [12, 13]. For example, the bigrams for the LASA pair cycloserine and cyclosporine are {cy, yc, cl, lo, os, se, er, ri, in, ne} and {cy, yc, cl, lo, os, sp, po, or, ri, in, ne}, respectively. In this case, eight bigrams are shared, and the names have 10 and 11 bigrams, respectively. Therefore, the similarity is \( (2 \times 8)/(10 + 11) = 0.762 \). However, the bigram similarity of the LASA pair Verelan and Virilon is zero because the two names share no bigram [6].

Nsim similarity [14] extends NLCS by comparing the n-grams of the drug names on a graded similarity scale. The similarity between two n-grams is computed by counting the positions at which their letters are identical and normalizing by n. Bisim is the specific case of Nsim that uses bigrams (n = 2) with this predefined scale. For example, the bigrams cy and cy have a similarity of 2/2 = 1, and the bigrams se and sp a similarity of 1/2 = 0.5. This scale has a weakness: for bigrams such as sp and ps, or sp and es, it returns zero because it completely ignores common letters that appear in the previous or next position. Such displaced letters are a common root cause of visual or auditory perception errors [15,16,17]. Moreover, although the bigram pairs (aa, aa) and (ac, ac) receive the same similarity, in the first pair the letter a matches every letter of both bigrams, which suggests a higher similarity. Common characteristics of LASA pairs [18] therefore need to be considered when adjusting a softened similarity scale.

In this paper, we propose a new softened similarity measure based on Bisim that increases the accuracy of identifying LASA pairs. To this end, we identify the different cases that form the bigram similarity scale and describe a methodology based on an evolutionary algorithm to soften that scale. This paper is thus based on the hypothesis that an evolutionary approach can better adjust the weights of the similarity scale between n-grams.

2 Definitions

String-matching algorithms recover the common correspondences between drug names, which are used to determine a similarity or distance measure. Measures are classified as distances (the closer the value is to zero, the more related the names) or similarities (the greater the value, the more related the names). A normalized similarity or distance measure keeps the values obtained for different pairs on a common scale.

Similarity and distance measures target either the look-alike (orthographic) or the sound-alike (phonetic) cause of confusion. Accordingly, measures are classified as orthographic or phonetic depending on the approach used to detect the confusion.

2.1 Orthographic Distance Measures

Edit distance (ED). Given the drug names X and Y as sequences of size n and m, respectively, ED (also called Levenshtein distance) is the minimum cost of the editing operations (insertion, deletion and substitution) needed to convert the sequence X into Y [11, 19,20,21]. In this paper, all editing operations have a cost of 1, so the substitution cost \( cs \) in Eq. 1 is also 1. The edit distance between X and Y is then given by \( edit(n, m) \), computed by the following recurrence:

$$ edit(i,j) = \begin{cases} \max(i,j), & i = 0 \vee j = 0 \\ edit(i-1,j-1), & x_{i} = y_{j} \\ \min \left\{ \begin{array}{l} edit(i-1,j) + 1 \\ edit(i,j-1) + 1 \\ edit(i-1,j-1) + cs(x_{i}, y_{j}) \end{array} \right. & x_{i} \ne y_{j} \end{cases} $$
(1)

Normalized ED (NED).

NED is computed by dividing the ED by the length of the longer sequence [6, 21,22,23,24,25].
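The recurrence in Eq. 1 and its normalization can be illustrated with a minimal Python sketch. It assumes unit costs for all three operations, as stated above; the function names `edit_distance` and `normalized_edit_distance` are illustrative and do not come from the cited works.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance with unit costs, filled bottom-up as in Eq. 1."""
    n, m = len(x), len(y)
    # table[i][j] holds edit(i, j) for the prefixes x[:i] and y[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][0] = i  # base case: max(i, 0)
    for j in range(m + 1):
        table[0][j] = j  # base case: max(0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1]
            else:
                table[i][j] = 1 + min(
                    table[i - 1][j],      # deletion
                    table[i][j - 1],      # insertion
                    table[i - 1][j - 1],  # substitution (cost assumed to be 1)
                )
    return table[n][m]


def normalized_edit_distance(x: str, y: str) -> float:
    """ED divided by the length of the longer name (0.0 for two empty names)."""
    longest = max(len(x), len(y))
    return edit_distance(x, y) / longest if longest else 0.0
```

For the pair cycloserine and cyclosporine, `edit_distance` returns 2, so the NED is 2/12.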

2.2 Orthographic Similarity Measures

Prefix Similarity.

Given the drug names \( X \) and \( Y \) as sequences of size m and n, respectively, Prefix is the ratio of the length of their longest common prefix (the contiguous common initial letters) to the length of the longer name [6], see Eq. 2. The common prefix of the drug names Accutane and Accolate is Acc (|Acc| = 3), so the normalized prefix similarity is 3/8 = 0.375.

$$ Prefix\left( {X,Y} \right) = \frac{{\left| {x_{1} = y_{1} ,x_{2} = y_{2} , \ldots ,x_{i} = y_{i} } \right|}}{{\max \left( {\left| X \right|,\left| Y \right|} \right)}} $$
(2)
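A short Python sketch of Eq. 2 follows. It assumes case-insensitive comparison, which the text does not specify, and the function name `prefix_similarity` is illustrative.

```python
def prefix_similarity(x: str, y: str) -> float:
    """Length of the common prefix divided by the length of the longer name."""
    x, y = x.lower(), y.lower()  # assumption: letter case is ignored
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0
    common = 0
    for a, b in zip(x, y):
        if a != b:
            break
        common += 1
    return common / longest
```

For Accutane and Accolate the function returns 3/8 = 0.375, matching the example above.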

N-gram Similarity.

N-gram similarity represents a sequence as the set of all its contiguous subsequences (grams) of size n [12]. For example, if \( \left| X \right| = m \) and n = 2 (bigrams), then \( X' = \left\{ {x_{1} x_{2} , x_{2} x_{3} , \ldots , x_{m - 1} x_{m} } \right\} \) [6, 14, 26]. Given the sequences \( X \) and \( Y \), the n-gram similarity is defined as the Dice similarity [27] between the sets X′ and Y′ as follows:

$$ Dice\left( {X^{ '} ,Y^{ '} } \right) = \frac{{2\left| {X^{ '} \; \cap \;Y^{ '} } \right|}}{{\left| {X^{ '} } \right| + \left| {Y^{ '} } \right|}} $$
(3)
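As an illustration of Eq. 3, the following Python sketch builds the n-gram sets and computes their Dice similarity. The names `ngrams` and `ngram_similarity` are illustrative, and case-insensitive comparison is an assumption.

```python
def ngrams(name: str, n: int = 2) -> set:
    """Set of all contiguous subsequences of size n in the name."""
    name = name.lower()  # assumption: letter case is ignored
    return {name[i:i + n] for i in range(len(name) - n + 1)}


def ngram_similarity(x: str, y: str, n: int = 2) -> float:
    """Dice similarity (Eq. 3) between the n-gram sets of two names."""
    gx, gy = ngrams(x, n), ngrams(y, n)
    if not gx and not gy:
        return 0.0
    return 2 * len(gx & gy) / (len(gx) + len(gy))
```

With bigrams, cycloserine and cyclosporine share eight grams out of 10 and 11, giving 16/21 ≈ 0.762, whereas Verelan and Virilon share none and score zero.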

N-gram similarity has a weakness, because it is well known that the prefixes and suffixes of drug names are involved in their confusion [6, 18]. To increase the sensitivity of the N-gram similarity, some variations that give more weight to the initial and final letters are applied. Lambert [14] proposes adding spaces (or a letter not included in the names) (B)efore and (A)fter both drug names so that the initial or final letters appear in one or more additional n-grams. Lambert experimented with the variants Bigram (1B, 1A and 1B1A) and Trigram (1B, 1A, 1B1A, 2B, 2A, 2B2A, 1B2A and 2B1A) [14, 17, 28].
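The padding scheme can be sketched in Python as follows. The helper name `padded_ngrams` and its parameters `before`, `after` and `pad` are hypothetical; they only mirror the B/A notation of the variants, and the pad symbol is assumed not to occur in any drug name.

```python
def padded_ngrams(name: str, n: int, before: int = 0, after: int = 0,
                  pad: str = " ") -> list:
    """N-grams of a name after adding `before` pad symbols in front of it
    and `after` pad symbols behind it (for example, Trigram-2B uses
    n=3, before=2, after=0)."""
    padded = pad * before + name.lower() + pad * after
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For instance, `padded_ngrams("abc", 3, before=2, pad="#")` yields `["##a", "#ab", "abc"]`, so the first letter now takes part in three trigrams instead of one.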

Normalized LCS (NLCS).

NLCS similarity preserves the order of the common matching letters. Given the sequences \( X \) and \( Y \) of size n and m, respectively, the NLCS similarity is defined as the ratio of the length of the longest common subsequence of \( X \) and \( Y \) to the length of the longer sequence, \( NLCS = lcs\left( {n, m} \right)/max\left( {n,m} \right) \), where \( lcs\left( {n, m} \right) \) can be calculated by the recurrence in Eq. (4) [6, 14, 23,24,25, 29].

$$ lcs\left( {i,j} \right) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {i = 0 \vee {\text{j}} = 0} \hfill \\ {lcs\left( {i - 1,j - 1} \right) + 1, } \hfill & {x_{i} = y_{j} } \hfill \\ {max\left( {lcs\left( {i,j - 1} \right),lcs\left( {i - 1,j} \right)} \right)} \hfill & {x_{i} \ne y_{j} } \hfill \\ \end{array} } \right. $$
(4)
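The recurrence in Eq. 4 can be sketched in Python as a bottom-up table. The function names are illustrative, and case-insensitive comparison is an assumption.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence, following Eq. 4."""
    n, m = len(x), len(y)
    # table[i][j] holds lcs(i, j) for the prefixes x[:i] and y[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i][j - 1], table[i - 1][j])
    return table[n][m]


def nlcs(x: str, y: str) -> float:
    """LCS length divided by the length of the longer name."""
    x, y = x.lower(), y.lower()  # assumption: letter case is ignored
    longest = max(len(x), len(y))
    return lcs_length(x, y) / longest if longest else 0.0
```

For cycloserine and cyclosporine the LCS has 10 letters, so the NLCS is 10/12 ≈ 0.833; for Benadryl and Cardura the LCS has length 3.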

Nsim Similarity.

Nsim was proposed by Kondrak [6, 23, 24]. It combines three features: grams of size \( \beta \), the non-crossing-links constraint, and the repetition of the first letter at the beginning of the drug name. A specific case of Nsim is the measure Bisim [6]. Given the sequences \( X \) and \( Y \) (each with its first letter repeated) representing the drug names of size n and m, respectively, Bisim similarity is defined as:

$$ Bisim\left( {X, Y} \right) = \frac{nsim(n,m)}{{max\left( {n,m} \right)}} $$
$$ nsim(i,j) = \begin{cases} 0, & i = 0 \vee j = 0 \\ \max \left\{ \begin{array}{l} nsim(i,j-1), \\ nsim(i-1,j), \\ nsim(i-1,j-1) + s(x_{i} x_{i+1}, y_{j} y_{j+1}) \end{array} \right. & \text{otherwise} \end{cases} $$
(5)
$$ s\left( {x_{i} x_{i + 1} , y_{j} y_{j + 1} } \right) = \frac{1}{2}\sum\nolimits_{k = 0}^{1} {id(x_{i + k} , y_{j + k} )} $$
(6)
$$ id\left( {a,b} \right) = \left\{ {\begin{array}{*{20}l} {1, a = b} \hfill \\ {0, a \ne b} \hfill \\ \end{array} } \right. $$
(7)
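A minimal Python sketch of Bisim under one reading of Eqs. 5 to 7 is given below. It assumes that n and m are the lengths of the original names (so that, after repeating the first letter, each name yields exactly as many bigrams as it has letters), that the diagonal term of the recurrence is taken at (i − 1, j − 1), and that comparison is case-insensitive. The function name `bisim` is illustrative.

```python
def bisim(x: str, y: str) -> float:
    """Bisim similarity: bigram alignment with positional letter matching."""
    x, y = x.lower(), y.lower()  # assumption: letter case is ignored
    if not x or not y:
        return 0.0
    # Repeat the first letter so that it takes part in its own bigram.
    px, py = x[0] + x, y[0] + y
    n, m = len(x), len(y)
    # table[i][j] holds nsim(i, j); bigram i of px is px[i-1] px[i].
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Eq. 6 and Eq. 7: fraction of positions with identical letters.
            s = ((px[i - 1] == py[j - 1]) + (px[i] == py[j])) / 2
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1] + s,
            )
    return table[n][m] / max(n, m)
```

Under these assumptions, Verelan and Virilon, which have zero bigram Dice similarity, obtain a Bisim of 4/7 because six of their aligned bigrams match in one position and the repeated first letter matches in both.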

2.3 Related Work

Using a list of 1,127 LASA pairs and 1,127 non-LASA pairs, Lambert [14] evaluates 22 measures with ten-fold cross-validation and concludes that Trigram-2B, NED and Editex [20] are the best measures for identifying LASA pairs.

Kondrak [6, 23,24,25] proposes the orthographic Nsim similarity and the phonetic Aline similarity [30, 31], using the recall metric to evaluate the results of 12 measures on the USP LASA list [32] of 360 unique drug names. Kondrak [6] concludes that Bisim is the best orthographic measure. Bisim is used to build automated warning systems that identify potential LASA errors in electronic prescription systems [4, 33] and in the POCA software of the FDA [6]. Furthermore, the average of the Bisim, Aline, Prefix, and NED measures outperforms Bisim alone [6].

3 Proposed Method

In this paper, a Softened Bigram Similarity measure (Soft-Bisim) is proposed. First, the bigram cases that make up the similarity scale of Soft-Bisim are described. Then, the fitness function used by a genetic algorithm to find the weights of the similarity scale is described. Our hypothesis is that an evolutionary approach defines the levels of the similarity scale better than the original scale proposed by Kondrak for Bisim (cf. Eqs. 6 and 10). In other words, we treat the choice of the internal parameters of the similarity scale as an optimization problem and solve it with an evolutionary approach.

3.1 Definition of Soft-Bisim Similarity

Given the drug names \( X \) and \( Y \) as sequences of size \( n \) and \( m \), respectively, Soft-Bisim is defined as:

$$ \text{Soft-Bisim}\left( {X, Y} \right) = \frac{Bisim(n,m)}{{max\left( {n,m} \right)}} $$
(8)
$$ Bisim(i,j) = \begin{cases} 0, & i = 0 \vee j = 0 \\ \max \left\{ \begin{array}{l} Bisim(i,j-1), \\ Bisim(i-1,j), \\ Bisim(i-1,j-1) + s(x_{i} x_{i+1}, y_{j} y_{j+1}) \end{array} \right. & \text{otherwise} \end{cases} $$
(9)

Where the proposed scale of similarity for Soft-Bisim is defined as:

$$ s(a_{i} a_{i+1}, b_{j} b_{j+1}) = \begin{cases} w_{1}, & a_{i+1} = b_{j+1} \ne a_{i} \ne b_{j} \\ w_{2}, & a_{i} \ne a_{i+1} \ne b_{j} \ne b_{j+1} \\ w_{3}, & a_{i} = a_{i+1} = b_{j} \ne b_{j+1} \vee a_{i} = b_{j} = b_{j+1} \ne a_{i+1} \\ w_{4}, & a_{i} = b_{j+1} \wedge a_{i+1} = b_{j} \\ w_{5}, & a_{i} = b_{j+1} \ne a_{i+1} \ne b_{j} \vee a_{i+1} = b_{j} \ne a_{i} \ne b_{j+1} \\ w_{6}, & a_{i} = a_{i+1} = b_{j+1} \ne b_{j} \vee a_{i+1} = b_{j} = b_{j+1} \ne a_{i} \\ w_{7}, & a_{i} = b_{j} \ne a_{i+1} \ne b_{j+1} \\ w_{8}, & a_{i} = a_{i+1} = b_{j} = b_{j+1} \\ w_{9}, & a_{i} = b_{j} \wedge a_{i+1} = b_{j+1} \end{cases} $$
(10)
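The case analysis of Eq. 10 and the recurrence of Eq. 9 can be sketched in Python as follows. This is one possible reading and not the authors' implementation: because several cases of Eq. 10 overlap (for example, \( w_{8} \) is a special case of \( w_{9} \)), the sketch assumes that the cases are tested from the most specific to the least specific and that any bigram pair not matched earlier falls through to \( w_{2} \). It also assumes that the first letter is repeated as in Bisim. The default weights are the Soft-Bisim values reported in the caption of Table 1; the names `bigram_case` and `soft_bisim` are illustrative.

```python
# Weights w1..w9 as reported for Soft-Bisim in the caption of Table 1.
SOFT_BISIM_WEIGHTS = {1: 0.0, 2: 0.1, 3: 0.4, 4: 0.0, 5: 0.0,
                      6: 0.2, 7: 0.4, 8: 0.6, 9: 0.8}


def bigram_case(a: str, b: str) -> int:
    """Index k of the weight w_k that applies to the bigrams a and b.
    The order of the tests is an assumption made to resolve overlaps."""
    a1, a2 = a
    b1, b2 = b
    if a1 == a2 == b1 == b2:
        return 8                      # same doubled letter, e.g. aa / aa
    if a1 == b1 and a2 == b2:
        return 9                      # identical bigrams, e.g. ac / ac
    if a1 == b2 and a2 == b1:
        return 4                      # transposed letters, e.g. sp / ps
    if a1 == b1:                      # only the first letters match
        return 3 if (a1 == a2 or b1 == b2) else 7
    if a2 == b2:                      # only the second letters match
        return 6 if (a1 == a2 or b1 == b2) else 1
    if a1 == b2 or a2 == b1:
        return 5                      # one displaced letter, e.g. sp / es
    return 2                          # fallback: no letter in common


def soft_bisim(x: str, y: str, weights: dict = SOFT_BISIM_WEIGHTS) -> float:
    """Soft-Bisim (Eqs. 8 and 9) with the weighted scale of Eq. 10."""
    x, y = x.lower(), y.lower()       # assumption: letter case is ignored
    if not x or not y:
        return 0.0
    px, py = x[0] + x, y[0] + y       # assumption: first letter repeated
    n, m = len(x), len(y)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = weights[bigram_case(px[i - 1:i + 1], py[j - 1:j + 1])]
            table[i][j] = max(table[i][j - 1], table[i - 1][j],
                              table[i - 1][j - 1] + s)
    return table[n][m] / max(n, m)
```

Passing a different `weights` dictionary makes it possible to evaluate any candidate scale, which is what a search over \( W \) requires.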

To increase the accuracy of identifying confusable drug names, the set of weights \( W = \left\{ {w_{1} , w_{2} , \ldots ,w_{9} } \right\} \) of the Soft-Bisim similarity scale must be found. For this, a Genetic Algorithm is used [34,35,36].
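To make the search over \( W \) concrete, a generic genetic algorithm over nine real-valued weights can be sketched in Python as below. The configuration of the algorithm is not given in the text, so every design choice here is an assumption made for illustration only: the population size, the number of generations, binary tournament selection, uniform crossover, Gaussian mutation clipped to [0, 1], and elitism. The function name `evolve_weights` is hypothetical.

```python
import random
from typing import Callable, List


def evolve_weights(fitness: Callable[[List[float]], float],
                   n_weights: int = 9, pop_size: int = 20,
                   generations: int = 30, mutation_rate: float = 0.2,
                   seed: int = 0) -> List[float]:
    """Return the best weight vector found by a simple elitist GA.
    All operators and parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_weights)]
                  for _ in range(pop_size)]

    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0], reverse=True)

        def tournament() -> List[float]:
            # Binary tournament: the fitter of two random individuals wins.
            (score_a, a), (score_b, b) = rng.sample(scored, 2)
            return a if score_a >= score_b else b

        children = [scored[0][1][:]]  # elitism: keep the best individual
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            # Uniform crossover: each gene comes from either parent.
            child = [g1 if rng.random() < 0.5 else g2
                     for g1, g2 in zip(p1, p2)]
            # Gaussian mutation, clipped so weights stay in [0, 1].
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.1)))
                     if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children

    return max(population, key=fitness)
```

Because the best individual is copied unchanged into each new generation, the best fitness in the population never decreases from one generation to the next.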

3.2 Finding the Scale of Similarity for Soft-Bisim

The fitness function of the Genetic Algorithm is designed to evaluate each individual with respect to the objective being optimized.

The FDA reviews the similarity of a new drug name against all previously registered drug names. Therefore, the f-measure, which is widely used in information retrieval, is used as the fitness function [37]. Given a LASA pair \( \left( {d_{i} ,d_{j} } \right) \in List\,of\,LASA\,pairs \), the f-measure for the query \( d_{i} \) considers the size of the set of drug names retrieved at ranking one (the drug names most similar to the query \( d_{i} \)); if \( d_{j} \) does not appear in that set, the f-measure adds the size of the set of drug names retrieved at the next ranking, and so on until \( d_{j} \) appears. In this way, the f-measure evaluates the ability to find a relevant drug name for a query. The f-measure can be obtained at every ranking (\( r \)). Here, we aim to improve the f-measure in the top four rankings. Therefore, the fitness function computes a macro-averaged f-measure over the queries of all distinct drug names (set \( D \)) and sums it over the first four rankings, see Eq. 11. In other words, the fitness function favors the combination of weights in \( W \) (Eq. 10) that, after running the queries for all distinct drug names with the Soft-Bisim measure, produces the highest sum of the first four f-measure values.

$$ fitness\left( D \right) = \sum\nolimits_{r = 1}^{4} {\text{f-measure}(D,r)} $$
(11)
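One possible reading of Eq. 11 is sketched below in Python. It assumes that drug names with equal similarity to the query share a ranking, that the set retrieved at ranking r contains every name in the top r rankings, that the f-measure of a query is the harmonic mean of precision and recall over that set, and that only names taking part in at least one LASA pair are used as queries. These details are not fixed by the description above, and the function names are illustrative.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def f_measure_at_rank(names: List[str],
                      lasa_pairs: Iterable[Tuple[str, str]],
                      similarity: Callable[[str, str], float],
                      r: int) -> float:
    """Macro-averaged f-measure over all queries, using the top r rankings."""
    partners: Dict[str, Set[str]] = {}
    for a, b in lasa_pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    queries = [q for q in names if q in partners]
    total = 0.0
    for q in queries:
        scores = {c: similarity(q, c) for c in names if c != q}
        # Names with the same score share a ranking (assumption).
        top_values = sorted(set(scores.values()), reverse=True)[:r]
        retrieved = {c for c, v in scores.items() if v in top_values}
        hits = len(retrieved & partners[q])
        if hits:
            precision = hits / len(retrieved)
            recall = hits / len(partners[q])
            total += 2 * precision * recall / (precision + recall)
    return total / len(queries) if queries else 0.0


def fitness(names: List[str],
            lasa_pairs: Iterable[Tuple[str, str]],
            similarity: Callable[[str, str], float]) -> float:
    """Sum of the macro-averaged f-measure over rankings 1 to 4 (Eq. 11)."""
    pairs = list(lasa_pairs)
    return sum(f_measure_at_rank(names, pairs, similarity, r)
               for r in range(1, 5))
```

In a search over \( W \), `similarity` would be the Soft-Bisim measure evaluated with the candidate weights, so that each individual receives one fitness value.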

4 Results and Discussion

In all the experiments, the ground-truth USP-858 collection with 858 LASA pairs is used. USP-858 contains 630 unique drug names, from which 36,900 pairs of drug names can be generated. This means that only 0.3% of the pairs are LASA pairs that must be recovered.

4.1 Calculating the Scale of Similarity for Soft-Bisim

Although the genetic algorithm only optimizes the macro-averaging f-measure for the top four positions, the comparison in Table 1 shows an improvement over Bisim at all ranking positions when retrieving LASA pairs. As can be observed, the weight \( w_{9} \) for Soft-Bisim is higher than \( w_{8} \), whereas in Bisim \( w_{8} \) and \( w_{9} \) are equal. Also unlike Bisim, the weight for the case in which all the letters are different is not zero.

Table 1. Comparison of the macro-averaging f-measure evaluation for Bisim and Soft-Bisim with the USP-858 collection, where the resulting weights for Soft-Bisim are: \( w_{1} = 0 \), \( w_{2} = 0.1 \), \( w_{3} = 0.4 \), \( w_{4} = 0 \), \( w_{5} = 0 \), \( w_{6} = 0.2 \), \( w_{7} = 0.4 \), \( w_{8} = 0.6 \) and \( w_{9} = 0.8 \); and the implicit weights for Bisim are: \( w_{1 \ldots 3} = 0 \), \( w_{4 \ldots 7} = 0.5 \), \( w_{7} = 0 \), and \( w_{8} = w_{9} = 1 \). *The last row shows the ten-fold cross-validation results.

4.2 Evaluation of Orthographic Measures

In Fig. 1, Soft-Bisim is compared with all the orthographic measures presented in Sects. 2.1 and 2.2. In this case, Trigram-2B maintains the relevance indicated by Lambert, but Bisim is more relevant than Trigram-2B. It is worth mentioning that Trigram-2B2A and Trigram-2B1A are more relevant than Bisim. However, Soft-Bisim obtains the best performance with the adjusted similarity scale.

Fig. 1. Ranking obtained for each orthographic measure according to the sum of the top four positions of the macro-averaging f-measure.

4.3 A Combined Measure with Soft-Bisim

Using the average of Prefix, NED, Bisim and Aline, Kondrak [6] proposes a combined measure that outperforms Bisim: AvgBisim(Prefix, NED, Bisim, Aline). In this paper, we propose two combined measures. In the first one, Soft-Bisim is added to the average: AvgAll(Prefix, NED, Bisim, Aline, Soft-Bisim). In the second one, Bisim is replaced by Soft-Bisim in the average: AvgSoftBisim(Prefix, NED, Aline, Soft-Bisim). Table 2 compares the original combined measure proposed by Kondrak with our proposed combined measures.
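The combined measures can be sketched in Python as a plain average of component measures. The helper `average_measure` is hypothetical, and the sketch assumes that every component is a callable that already returns a similarity in [0, 1]; in particular, it assumes that NED, which is a distance, is converted to a similarity (for example, 1 − NED) before averaging. The phonetic Aline measure is not reproduced here.

```python
from typing import Callable, Sequence

Similarity = Callable[[str, str], float]


def average_measure(measures: Sequence[Similarity]) -> Similarity:
    """Build a combined measure that averages the given similarity measures."""
    if not measures:
        raise ValueError("at least one measure is required")

    def combined(x: str, y: str) -> float:
        return sum(measure(x, y) for measure in measures) / len(measures)

    return combined
```

Under these assumptions, a combination such as AvgSoftBisim corresponds to calling `average_measure` with four callables, one for each of Prefix, NED (as a similarity), Aline and Soft-Bisim.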

Table 2. Macro-averaging f-measure evaluation for AvgBisim, AvgAll and AvgSoftBisim.

Table 3 shows the best measures for identifying confusable drug names, using Bisim as the baseline measure. In this case, all the combined measures outperform the individual measures. The best individual measure is Soft-Bisim, which is part of the first two combined measures. Moreover, the best performance is achieved when Bisim is replaced by Soft-Bisim.

Table 3. Comparison of Soft-Bisim with the best previous measures.

5 Conclusion

The problem of drug name confusion needs attention because it is still growing. All the measures presented in this paper, except Nsim, were designed for or adapted from other applications or domains. Nsim, in contrast, takes into account characteristics of confusable drug names, such as the fact that the initial letters are frequently involved in the confusion. In this paper we propose Soft-Bisim, a new orthographic measure for identifying LASA pairs that is based on Nsim similarity and extends it by softening the similarity scale between the bigrams that make up a drug name. Specifically, the weights for nine bigram cases were calculated. For this, the sum of the first four macro-averaging f-measure values of the retrieved pairs is proposed as the fitness function of a genetic algorithm.

According to the experiments, Soft-Bisim increases the accuracy with respect to Bisim at all ranking positions of a retrieved list of potential LASA pairs. Furthermore, Soft-Bisim significantly outperforms the other 17 orthographic measures used for this problem. We also found that Trigram-2B2A and Trigram-2B1A are good measures, since they outperform the Bisim measure.

In addition, a new average combination of four measures using Soft-Bisim is proposed. This new combination outperforms the previous average that uses the Bisim measure. Even though we only used a list of drug names, Soft-Bisim can also be used to retrieve other kinds of confusion, such as those between proper names or brand names.