Soft Bigram Similarity to Identify Confusable Drug Names

Millán-Hernández, Christian Eduardo; García-Hernández, René Arnulfo; Ledeneva, Yulia; Hernández-Castañeda, Ángel

doi:10.1007/978-3-030-21077-9_40

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11524)

Included in the following conference series:

Mexican Conference on Pattern Recognition

Abstract

Look-alike and sound-alike (LASA) drug names are related to medication errors in which doctors, nurses, and pharmacists prescribe or administer the wrong medication. Bisim similarity is reported as the best orthographic measure for identifying confusable drug names, but it lacks a graded similarity scale between the bigrams of a drug name. In this paper, we propose Soft-Bisim, a similarity measure that extends Bisim by softening the comparison scale between the bigrams of a drug name in order to improve the detection of confusable drug names. In the experiments, Soft-Bisim outperforms 17 other similarity measures on 396,900 pairs of drug names. In addition, the previously proposed average of four measures is outperformed when Bisim is replaced by Soft-Bisim.

1 Introduction

A medication error that involves confusable drug names occurs as a result of weak medication systems and human-error factors [1,2,3]. Many human factors are related to the Look-Alike and Sound-Alike (LASA) drug name problem, such as visual perception errors, auditory perception errors, short-term memory errors, and motor control errors. However, the similarity between confusable drug names is a detectable root cause. Drug names such as cycloserine and cyclosporine are involved in LASA errors. LASA pairs normally sound similar and have a similar spelling [4]. Sometimes the confusion happens when the names are communicated in handwritten prescriptions, for example, the drugs Avandia and Coumadin [5]. In other cases, the confusion occurs in verbal communication when the pronunciation sounds similar, for example, Zantac and Xanax [6].

The Institute for Safe Medication Practices (ISMP) publishes a list of previously reported LASA pairs [7,8,9,10]. Regulatory agencies, including the Food and Drug Administration (FDA), the World Health Organization (WHO), and the Joint Commission, are implementing strategies to identify and prevent LASA errors.

String-matching algorithms are used to measure the distance or the similarity between two drug names and to identify, a priori, potentially confusable drug names. For example, Edit Distance (ED) measures the minimum number of insertion, deletion, and substitution operations needed to transform one string into another [11]. The LASA pair cycloserine and cyclosporine has a distance of two because at least two edit operations (a substitution e→p and an insertion of the letter o) are needed to transform cycloserine into cyclosporine.

Longest Common Subsequence (LCS) measures the maximum possible length of a common subsequence of two drug names. NLCS is the normalization of LCS, obtained by dividing the LCS length by the length of the longer drug name. In the previous example, /cyclos-rine/ is the LCS (10 letters) and the NLCS is 10/12 = 0.833. A weakness of NLCS is that it also counts subsequences that do not represent a real similarity between drug names. For example, the non-LASA pair Benadryl and Cardura has the LCS /adr/ [6].

N-gram similarity represents a drug name as the set of all its contiguous subsequences (grams) of size n [12, 13]. For example, the bigrams for the LASA pair cycloserine and cyclosporine are {cy, yc, cl, lo, os, se, er, ri, in, ne} and {cy, yc, cl, lo, os, sp, po, or, ri, in, ne}, respectively. In this case, eight bigrams are shared, and the names have 10 and 11 bigrams, respectively. Therefore, the similarity is $ (2 \times 8)/(10 + 11) = 0.762 $. However, the N-gram similarity of the LASA pair Verelan and Virilon is zero because the two names share no bigram [6].

Nsim similarity [14] extends NLCS, but it handles the n-grams of a drug name with a scale of similarity. The predefined scale of similarity between a pair of n-grams is computed by counting the identical letters in each position and normalizing by n. Bisim is the specific case of Nsim for bigrams (n = 2), with a predefined scale of similarity. For example, the bigrams cy and cy have a similarity of 2/2 = 1, and the bigrams se and sp have a similarity of 1/2 = 0.5. This similarity scale is weak when it computes values for bigrams such as sp and ps, or sp and es, because it completely ignores common letters that appear in the previous or next position. This issue is a common root cause of visual or auditory perception errors [15,16,17]. Moreover, although the pairs of bigrams {aa}{aa} and {ac}{ac} receive the same similarity, in the first pair the letter a matches all the letters of both bigrams, which indicates a higher similarity. Therefore, the common characteristics of LASA pairs [18] need to be considered in order to adjust a softened similarity scale.

In this paper, we propose a new softened similarity measure based on Bisim that increases the accuracy of identifying LASA pairs. To this end, the different cases that form the scale of similarity between bigrams are identified, and a methodology based on an evolutionary algorithm to soften the scale of similarity is described. This paper is therefore based on the hypothesis that an evolutionary approach can better adjust the weights of the scale of similarity between n-grams.

2 Definitions

String-matching algorithms recover the common correspondences between drug names, which are then used to determine a similarity or a distance measure. Measures are classified as distance measures (the closer the value is to zero, the more related the names are) or similarity measures (the greater the value, the more related the names are). A normalized similarity or distance measure keeps a common scale across different values.

Similarity and distance measures detect either the look-alike (orthographic cause) or the sound-alike (phonetic cause) issue. In this sense, the measures are classified as orthographic or phonetic according to the approach used to detect the confusion.

2.1 Orthographic Distance Measures

Edit distance (ED). Given the drug names X and Y as sequences of size n and m, respectively, ED (also called Levenshtein distance) is the minimum cost of the editing operations (insertion, deletion, and substitution) needed to convert the sequence X into Y [11, 19,20,21]. In this paper, all editing operations have a cost of 1, so the substitution cost is $ cs(x_{i}, y_{j}) = 1 $. The edit distance between X and Y is given by $ edit(n, m) $, computed by the following recurrence:

$$ edit\left( i,j \right) = \begin{cases} \max\left( i,j \right), & i = 0 \vee j = 0 \\ edit\left( i - 1,j - 1 \right), & x_{i} = y_{j} \\ \min\left\{ edit\left( i - 1,j \right) + 1,\; edit\left( i,j - 1 \right) + 1,\; edit\left( i - 1,j - 1 \right) + cs\left( x_{i}, y_{j} \right) \right\}, & x_{i} \ne y_{j} \end{cases} $$

(1)

Normalized ED (NED).

NED is computed by dividing the ED by the length of the longer sequence [6, 21,22,23,24,25].
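The recurrence in Eq. 1 and the NED normalization can be sketched in Python as follows. This is a minimal illustration with unit costs, not the authors' implementation; the function names `edit_distance` and `normalized_edit_distance` are chosen here for illustration only.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance with unit costs, following the recurrence in Eq. 1."""
    n, m = len(x), len(y)
    # table[i][j] holds edit(i, j) for the prefixes x[:i] and y[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                table[i][j] = max(i, j)
            elif x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1]
            else:
                table[i][j] = 1 + min(
                    table[i - 1][j],      # deletion
                    table[i][j - 1],      # insertion
                    table[i - 1][j - 1],  # substitution with cost cs = 1
                )
    return table[n][m]


def normalized_edit_distance(x: str, y: str) -> float:
    """ED divided by the length of the longer name (0.0 for two empty names)."""
    longest = max(len(x), len(y))
    return edit_distance(x, y) / longest if longest else 0.0


print(edit_distance("cycloserine", "cyclosporine"))  # the LASA pair from Sect. 1
```

For the pair cycloserine and cyclosporine the sketch returns a distance of 2, which agrees with the example given in the Introduction.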

2.2 Orthographic Similarity Measures

Prefix Similarity.

Given the drug names $ X $ and $ Y $ as sequences of size m and n, respectively, Prefix is the ratio of the number of contiguous common initial letters to the length of the longer name [6], see Eq. 2. The common prefix of the drug names Accutane and Accolate is Acc (|Acc| = 3), and the normalized prefix similarity is 3/8 = 0.375.

$$ Prefix\left( X,Y \right) = \frac{\left| \left\{ x_{1} = y_{1}, x_{2} = y_{2}, \ldots, x_{i} = y_{i} \right\} \right|}{\max\left( m,n \right)} $$

(2)
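As a small illustration of Eq. 2, the following Python sketch counts the matching initial letters and divides by the length of the longer name. The function name `prefix_similarity` is an assumption made for this example.

```python
def prefix_similarity(x: str, y: str) -> float:
    """Length of the common prefix divided by the length of the longer name."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0
    common = 0
    for a, b in zip(x, y):
        if a != b:
            break  # the common prefix ends at the first mismatch
        common += 1
    return common / longest


print(prefix_similarity("Accutane", "Accolate"))
```

For Accutane and Accolate the common prefix has three letters and both names have eight, so the sketch reproduces the value 0.375 from the text.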

N-gram Similarity.

N-gram similarity represents a sequence as the set of all its contiguous subsequences (grams) of size n [12]. For example, if $ \left| X \right| = m $ and n = 2 (bigrams), then $ X' = \left\{ x_{1} x_{2}, x_{2} x_{3}, \ldots, x_{m - 1} x_{m} \right\} $ [6, 14, 26]. Given the sequences $ X $ and $ Y $, the n-gram similarity is defined as the Dice similarity [27] between the sets X′ and Y′ as follows:

$$ Dice\left( X', Y' \right) = \frac{2\left| X' \cap Y' \right|}{\left| X' \right| + \left| Y' \right|} $$

(3)

N-gram similarity has a weakness: it is well known that the prefixes and suffixes of drug names are involved in their confusion [6, 18], yet the initial and final letters appear in fewer n-grams than the inner letters. To increase the sensitivity of the N-gram similarity, some variations with respect to the initial and final letters are applied. Lambert [14] proposes to add spaces (or a letter not included in the names) (B)efore and (A)fter both drug names so that the initial or final letters appear in one or more additional n-grams. Lambert experimented with the variants Bigram-(1B, 1A and 1B1A) and Trigram-(1B, 1A, 1B1A, 2B, 2A, 2B2A, 1B2A and 2B1A) [14, 17, 28].
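The Dice-based n-gram similarity of Eq. 3 and the padding variants can be sketched as below. The parameter names `before`, `after`, and `pad` are assumptions introduced for this example; they model the (B)efore and (A)fter padding with a space character and are not taken from the cited implementations.

```python
def ngrams(name: str, n: int = 2, before: int = 0, after: int = 0, pad: str = " ") -> set:
    """Set of contiguous n-grams of the name, optionally padded before/after."""
    padded = pad * before + name + pad * after
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(x: str, y: str, n: int = 2, before: int = 0, after: int = 0) -> float:
    """Dice coefficient (Eq. 3) between the n-gram sets of two names."""
    gx = ngrams(x, n, before, after)
    gy = ngrams(y, n, before, after)
    if not gx and not gy:
        return 0.0
    return 2 * len(gx & gy) / (len(gx) + len(gy))


print(round(ngram_similarity("cycloserine", "cyclosporine"), 3))  # plain bigrams
print(ngram_similarity("verelan", "virilon"))                     # no shared bigram
print(ngram_similarity("verelan", "virilon", before=1))           # Bigram-1B variant
```

With plain bigrams, Verelan and Virilon share nothing, whereas padding one space before each name makes the shared initial letter contribute a common bigram. This is the effect that the padded variants are meant to capture.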

Normalized LCS (NLCS).

NLCS similarity preserves the order of the common matching letters. Given the sequences $ X $ and $ Y $ of size n and m, respectively, the NLCS similarity is defined as the ratio of the length of the longest common subsequence of $ X $ and $ Y $ to the length of the longer sequence, $ NLCS = lcs\left( n, m \right)/\max\left( m,n \right) $, where $ lcs\left( n, m \right) $ can be calculated by the recurrence in Eq. (4) [6, 14, 23,24,25, 29].

$$ lcs\left( i,j \right) = \begin{cases} 0, & i = 0 \vee j = 0 \\ lcs\left( i - 1,j - 1 \right) + 1, & x_{i} = y_{j} \\ \max\left( lcs\left( i,j - 1 \right), lcs\left( i - 1,j \right) \right), & x_{i} \ne y_{j} \end{cases} $$

(4)
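A direct dynamic-programming reading of Eq. 4 is sketched below. The function names `lcs_length` and `nlcs` are illustrative choices, not identifiers from the cited work.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence, following Eq. 4."""
    n, m = len(x), len(y)
    # table[i][j] holds lcs(i, j) for the prefixes x[:i] and y[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i][j - 1], table[i - 1][j])
    return table[n][m]


def nlcs(x: str, y: str) -> float:
    """LCS length divided by the length of the longer name."""
    longest = max(len(x), len(y))
    return lcs_length(x, y) / longest if longest else 0.0


print(round(nlcs("cycloserine", "cyclosporine"), 3))
print(lcs_length("benadryl", "cardura"))
```

The first call corresponds to the NLCS example from the Introduction (10 common letters out of 12), and the second to the non-LASA pair Benadryl and Cardura, whose LCS /adr/ has three letters.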

Nsim Similarity.

Nsim was proposed by Kondrak [6, 23, 24]. It combines three features: grams of size $ \beta $, the non-crossing-links constraint, and the repetition of the first letter at the beginning of the drug name. A specific case of Nsim is the Bisim measure [6]. Given the sequences $ X $ and $ Y $ (with the first letter repeated) representing the drug names of size n and m, respectively, Bisim similarity is defined as:

$$ Bisim\left( X, Y \right) = \frac{nsim\left( n,m \right)}{\max\left( n,m \right)} $$

$$ nsim\left( i,j \right) = \begin{cases} 0, & i = 0 \vee j = 0 \\ \max\left\{ nsim\left( i,j - 1 \right),\; nsim\left( i - 1,j \right),\; nsim\left( i - 1,j - 1 \right) + s\left( x_{i} x_{i + 1}, y_{j} y_{j + 1} \right) \right\}, & \text{otherwise} \end{cases} $$

(5)

$$ s\left( x_{i} x_{i + 1}, y_{j} y_{j + 1} \right) = \frac{1}{2}\sum\nolimits_{k = 0}^{1} id\left( x_{i + k}, y_{j + k} \right) $$

(6)

$$ id\left( a,b \right) = \begin{cases} 1, & a = b \\ 0, & a \ne b \end{cases} $$

(7)
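Equations 5 to 7 can be read as the following Python sketch. It assumes that each name is prefixed with a copy of its first letter, so that a name of length n yields exactly n bigrams, and that the result is normalized by the longer original length. The helper names `padded_bigrams`, `bigram_scale`, and `bisim` are introduced here for illustration.

```python
def padded_bigrams(name: str) -> list:
    """Bigrams of the name after repeating its first letter at the beginning."""
    padded = name[0] + name
    return [padded[i:i + 2] for i in range(len(name))]


def bigram_scale(a: str, b: str) -> float:
    """Eq. 6: fraction of positions in which the two bigrams carry the same letter."""
    return sum(1 for k in range(2) if a[k] == b[k]) / 2


def bisim(x: str, y: str) -> float:
    """Eq. 5 evaluated by dynamic programming over the padded bigram sequences."""
    if not x or not y:
        return 0.0
    bx, by = padded_bigrams(x), padded_bigrams(y)
    n, m = len(bx), len(by)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i][j - 1],
                table[i - 1][j],
                table[i - 1][j - 1] + bigram_scale(bx[i - 1], by[j - 1]),
            )
    return table[n][m] / max(n, m)


print(bigram_scale("se", "sp"), bigram_scale("sp", "ps"))
print(bisim("verelan", "virilon"))
```

Unlike the plain bigram Dice similarity, this sketch gives Verelan and Virilon a non-zero score, because partially matching bigrams still contribute 0.5 each. The transposed bigrams sp and ps, however, score 0, which is the limitation discussed in the Introduction.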

2.3 Related Work

Using a list of 1,127 LASA pairs and 1,127 non-LASA pairs, Lambert [14] evaluates 22 measures with ten-fold cross-validation and concludes that Trigram-2B, NED and Editex [20] are the best measures to identify LASA pairs.

Kondrak [6, 23,24,25] proposes the orthographic Nsim similarity and the phonetic Aline similarity [30, 31], and uses the recall metric to evaluate the results of 12 measures with the USP LASA list [32] of 360 unique drug names. Kondrak [6] concludes that Bisim is the best orthographic measure. Bisim is used to create automated warning systems that identify potential LASA errors in electronic prescription systems [4, 33] and in the POCA software of the FDA [6]. Furthermore, the average of the Bisim, Aline, Prefix, and NED measures outperforms Bisim alone [6].

3 Proposed Method

In this paper, a Soft Bigram Similarity measure (Soft-Bisim) is proposed. First, the cases of bigrams involved in the scale of similarity of Soft-Bisim are described. After that, the fitness function used by a genetic algorithm to find the weights of the scale of similarity is described. Our hypothesis is that an evolutionary approach defines the levels of the scale of similarity better than the original similarity scale proposed by Kondrak in Bisim (cf. Eqs. 6 and 10). In other words, we treat this problem as an evolutionary optimization of the internal parameters of the similarity scale.

3.1 Definition of Soft-Bisim Similarity

Given the drug names $ X $ and $ Y $ as sequences of size $ n $ and $ m $, respectively, Soft-Bisim is defined as:

$$ Soft\text{-}Bisim\left( X, Y \right) = \frac{Bisim\left( n,m \right)}{\max\left( n,m \right)} $$

(8)

$$ Bisim\left( i,j \right) = \begin{cases} 0, & i = 0 \vee j = 0 \\ \max\left\{ Bisim\left( i,j - 1 \right),\; Bisim\left( i - 1,j \right),\; Bisim\left( i - 1,j - 1 \right) + s\left( x_{i} x_{i + 1}, y_{j} y_{j + 1} \right) \right\}, & \text{otherwise} \end{cases} $$

(9)

Where the proposed scale of similarity for Soft-Bisim is defined as:

$$ s\left( a_{i} a_{i + 1}, b_{j} b_{j + 1} \right) = \begin{cases} w_{1}, & a_{i + 1} = b_{j + 1} \ne a_{i} \ne b_{j} \\ w_{2}, & a_{i} \ne a_{i + 1} \ne b_{j} \ne b_{j + 1} \\ w_{3}, & a_{i} = a_{i + 1} = b_{j} \ne b_{j + 1} \vee a_{i} = b_{j} = b_{j + 1} \ne a_{i + 1} \\ w_{4}, & a_{i} = b_{j + 1} \wedge a_{i + 1} = b_{j} \\ w_{5}, & a_{i} = b_{j + 1} \ne a_{i + 1} \ne b_{j} \vee a_{i + 1} = b_{j} \ne a_{i} \ne b_{j + 1} \\ w_{6}, & a_{i} = a_{i + 1} = b_{j + 1} \ne b_{j} \vee a_{i + 1} = b_{j} = b_{j + 1} \ne a_{i} \\ w_{7}, & a_{i} = b_{j} \ne a_{i + 1} \ne b_{j + 1} \\ w_{8}, & a_{i} = a_{i + 1} = b_{j} = b_{j + 1} \\ w_{9}, & a_{i} = b_{j} \wedge a_{i + 1} = b_{j + 1} \end{cases} $$

(10)

To increase the accuracy of identifying confusable drug names, the set of weights $ W = \left\{ w_{1}, w_{2}, \ldots, w_{9} \right\} $ of the scale of similarity of Soft-Bisim must be found. For this, a Genetic Algorithm is used [34,35,36].
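One possible reading of Eqs. 8 to 10 is sketched below. The cases of Eq. 10 overlap as written (for example, the $ w_{8} $ case also satisfies the $ w_{9} $ condition), so the order in which `soft_case` tests them is an assumption of this sketch: exact matches first, then transpositions, then same-position matches, then cross-position matches. The default weights are the ones reported in the caption of Table 1 (Sect. 4.1). As in Bisim, the first letter is assumed to be repeated, and all function names are illustrative.

```python
# Weights w1..w9 as reported in the caption of Table 1 (Sect. 4.1).
REPORTED_WEIGHTS = {1: 0.0, 2: 0.1, 3: 0.4, 4: 0.0, 5: 0.0, 6: 0.2, 7: 0.4, 8: 0.6, 9: 0.8}


def soft_case(a: str, b: str) -> int:
    """Index k of the case w_k in Eq. 10 (precedence order is an assumption)."""
    a0, a1 = a
    b0, b1 = b
    if a0 == b0 and a1 == b1:
        return 8 if a0 == a1 else 9       # identical bigrams
    if a0 == b1 and a1 == b0:
        return 4                          # transposed letters, e.g. sp / ps
    if a0 == b0:                          # only the first position matches
        return 3 if (a0 == a1 or b0 == b1) else 7
    if a1 == b1:                          # only the second position matches
        return 6 if (a0 == a1 or b0 == b1) else 1
    if a0 == b1 or a1 == b0:
        return 5                          # one letter shared across positions
    return 2                              # no letter in common


def soft_bisim(x: str, y: str, weights: dict = REPORTED_WEIGHTS) -> float:
    """Eqs. 8 and 9 with the softened scale of Eq. 10."""
    if not x or not y:
        return 0.0
    bx = [(x[0] + x)[i:i + 2] for i in range(len(x))]
    by = [(y[0] + y)[i:i + 2] for i in range(len(y))]
    n, m = len(bx), len(by)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = weights[soft_case(bx[i - 1], by[j - 1])]
            table[i][j] = max(table[i][j - 1], table[i - 1][j],
                              table[i - 1][j - 1] + score)
    return table[n][m] / max(n, m)


print(soft_case("sp", "ps"), soft_case("sp", "es"), soft_case("se", "sp"))
print(soft_bisim("cycloserine", "cyclosporine"))
```

Because the reported $ w_{9} $ is 0.8 rather than 1, identical names do not score exactly 1 under this sketch. That does not affect its use for ranking candidate names against a query.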

3.2 Finding the Scale of Similarity for Soft-Bisim

The fitness function of the Genetic Algorithm is designed to evaluate each individual with respect to the objective to be optimized.

The FDA reviews the similarity of a new drug name against all previously registered drug names. Therefore, the f-measure, which is widely used in information retrieval, is used as the fitness function [37]. Given a LASA pair $ \left( {d_{i} ,d_{j} } \right) \in List\,of\,LASA\,pairs $, the f-measure for the query $ d_{i} $ considers the size of the set of drug names retrieved in ranking one (the drug names most similar to the query $ d_{i} $). If $ d_{j} $ does not appear in that set, the drug names retrieved in the next ranking are added, and so on until $ d_{j} $ appears. In this way, the f-measure evaluates the ability to find a relevant drug name for a query. The f-measure can be obtained at every ranking ($ r $). Since we want to improve the f-measure in the top four rankings, the fitness function computes a macro-averaged f-measure over the queries of all different drug names (set $ D $) and sums it over the first four rankings, see Eq. 11. In other words, the fitness function favors the combination of weights in $ W $ (Eq. 10) that, after all drug names have been used as queries with the Soft-Bisim measure, produces the best sum of the first four f-measure evaluations.

$$ fitness\left( D \right) = \sum\nolimits_{r = 1}^{4} {f - measure(D,r)} $$

(11)
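The fitness in Eq. 11 can be sketched as follows. The text does not fully specify how rankings are formed, so this sketch makes several assumptions: a ranking groups all candidate names that share the same similarity value, the set retrieved at ranking r is the union of the top r groups, the relevant names for a query are its LASA partners, and the f-measure is the balanced harmonic mean of precision and recall. The names `f_measure_at_rank` and `fitness` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


def f_measure_at_rank(query: str, names: List[str], relevant: Set[str],
                      similarity: Callable[[str, str], float], r: int) -> float:
    """F-measure of the names retrieved within the top r similarity levels."""
    scored = [(similarity(query, other), other) for other in names if other != query]
    levels = sorted({score for score, _ in scored}, reverse=True)[:r]
    if not levels:
        return 0.0
    cutoff = levels[-1]
    retrieved = {other for score, other in scored if score >= cutoff}
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def fitness(lasa_pairs: List[Tuple[str, str]],
            similarity: Callable[[str, str], float], max_rank: int = 4) -> float:
    """Eq. 11: sum over r = 1..max_rank of the macro-averaged f-measure."""
    partners: Dict[str, Set[str]] = {}
    for a, b in lasa_pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    names = sorted(partners)
    if not names:
        return 0.0
    total = 0.0
    for r in range(1, max_rank + 1):
        per_query = [f_measure_at_rank(q, names, partners[q], similarity, r)
                     for q in names]
        total += sum(per_query) / len(names)  # macro-average over all queries
    return total
```

In a genetic algorithm, `similarity` would be the Soft-Bisim measure configured with the candidate weights encoded by an individual, so that each individual is scored by a single call to `fitness`.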

4 Results and Discussion

In all the experiments, the ground-truth USP-858 collection with 858 LASA pairs is used. USP-858 contains 630 unique drug names, which generate 396,900 pairs of drug names ($ 630 \times 630 $). This means that the LASA pairs to be recovered represent only about 0.3% of all pairs.

4.1 Calculating the Scale of Similarity for Soft-Bisim

Although the genetic algorithm only optimizes the macro-averaged f-measure for the top four positions, the comparison in Table 1 shows an improvement with respect to Bisim in all ranking positions for retrieving LASA pairs. As can be observed, the weight $ w_{9} $ of Soft-Bisim is higher than $ w_{8} $, whereas in Bisim $ w_{8} $ and $ w_{9} $ are the same. In contrast to Bisim, the weight for the case in which all the letters are different ($ w_{2} $) is not zero.

Table 1. Comparison of the macro-averaging f-measure evaluation for Bisim and Soft-Bisim with the USP-858 collection where the resulting weights for Soft-Bisim are: $ w_{1} = 0 $, $ w_{2} = 0.1 $, $ w_{3} = 0.4 $, $ w_{4} = 0 $, $ w_{5} = 0 $, $ w_{6} = 0.2 $, $ w_{7} = 0.4 $, $ w_{8} = 0.6 $ and $ w_{9} = 0.8 $; and the implicit weights for Bisim are: $ w_{1 \ldots 3} = 0 $, $ w_{4 \ldots 7} = 0.5 $, $ w_{7} = 0 $, and $ w_{8} = w_{9} = 1 $. *In the last row the ten-fold cross-validation results are shown.

4.2 Evaluation of Orthographic Measures

In Fig. 1, Soft-Bisim is compared to all the orthographic measures presented in Sects. 2.1 and 2.2. In this case, Trigram-2B maintains the relevance indicated by Lambert, but Bisim is more relevant than Trigram-2B. It is worth mentioning that Trigram-2B2A and Trigram-2B1A are more relevant than Bisim. However, Soft-Bisim obtains the best performance with the adjusted similarity scale.

4.3 A Combined Measure with Soft-Bisim

Using the average of Prefix, NED, Bisim and Aline, Kondrak [6] proposes a combined measure that outperforms Bisim: Avg_Bisim(Prefix, NED, Bisim, Aline). In this paper, we propose two combined measures. In the first one, Soft-Bisim is added to the average: Avg_All(Prefix, NED, Bisim, Aline, Soft-Bisim). In the second one, Bisim is replaced by Soft-Bisim in the average: Avg_SoftBisim(Prefix, NED, Aline, Soft-Bisim). Table 2 compares the original combined measure proposed by Kondrak with our proposed combined measures.
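The combined measures above are plain averages of individual scores, which can be sketched generically as below. The function `average_measure` is a hypothetical helper; the sketch assumes that every component is already expressed as a similarity in [0, 1] (for instance, that a distance such as NED has been converted beforehand), and the two lambda functions are stand-ins rather than real measures.

```python
from typing import Callable, Sequence


def average_measure(measures: Sequence[Callable[[str, str], float]],
                    x: str, y: str) -> float:
    """Arithmetic mean of several similarity measures for one pair of names."""
    return sum(measure(x, y) for measure in measures) / len(measures)


# Stand-in measures for illustration only; real use would pass, for example,
# implementations of Prefix, a similarity form of NED, Aline and Soft-Bisim.
same_first_letter = lambda x, y: 1.0 if x[:1] == y[:1] else 0.0
same_length = lambda x, y: 1.0 if len(x) == len(y) else 0.0

print(average_measure([same_first_letter, same_length], "zantac", "xanax"))
```

Replacing one callable in the list by another is all that distinguishes the Avg_Bisim, Avg_All and Avg_SoftBisim variants in this formulation.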

Table 2. Macro-averaging f-measure evaluation for Avg_Bisim, Avg_All and Avg_SoftBisim.

Table 3 shows the best measures for identifying confusable drug names, using Bisim as the baseline measure. In this case, all the combined measures outperform the individual measures. However, the best individual measure is Soft-Bisim, which is part of the first two combined measures. Moreover, the best performance is achieved when Bisim is replaced by Soft-Bisim.

Table 3. Comparison of Soft-Bisim with the best previous measures.

5 Conclusion

The problem of drug name confusion needs attention because it is still growing. All the measures presented in this paper (except Nsim) were designed for, or adjusted from, a different application or domain. In contrast, Nsim takes into consideration characteristics of confusable drug names, such as the fact that the initial letters are frequently involved in the confusion. In this paper, we propose the Soft-Bisim measure, a new orthographic measure for identifying LASA pairs that is based on Nsim similarity and extends it by softening the scale of similarity between the bigrams that make up a drug name. Specifically, the weights for nine bigram cases were calculated. For this, the sum of the first four macro-averaged f-measures of the retrieved pairs is proposed as the fitness function of a genetic algorithm.

According to the experiments, Soft-Bisim increases the accuracy with respect to Bisim in a retrieved list of potential LASA pairs in all the ranking positions. Furthermore, Soft-Bisim outperforms the other 17 orthographic measures used in this problem. We also found that Trigram-2B2A and Trigram-2B1A are good measures, since they outperform the Bisim measure.

In addition, a new average combination of four measures using Soft-Bisim is proposed. This new average combination outperforms the previous average that uses the Bisim measure. Even though we only used a list of drug names, Soft-Bisim could be used to detect other kinds of confusion, such as those involving proper names or brand names.

References

1. Billstein-Leber, M., Carrillo, C.J.D., Cassano, A.T., Moline, K., Robertson, J.J.: ASHP guidelines on preventing medication errors in hospitals (2018). https://www.ashp.org/Pharmacy-Practice/Policy
2. Cohen, M.R., Domizio, G.D., Lee, R.E.: The role of drug names in medication errors. In: Medication Errors, pp. 87–110. American Pharmacists Association, Washington, DC (2007)
3. Medication Without Harm. World Health Organization, Geneva (2017)
4. Rash-Foanio, C., et al.: Automated detection of look-alike/sound-alike medication errors. Am. J. Heal. Pharm. 74, 521–527 (2017)
5. Tittemore, L.M.: The name game (2017). https://sunsteinlaw.com/l-tittemore/
6. Kondrak, G., Dorr, B.: Automatic identification of confusable drug names. Artif. Intell. Med. 36, 29–42 (2006)
7. FDA: FDA and ISMP Work to Prevent Medication Errors 2017 (2012)
8. Craigle, V.: MedWatch: the FDA safety information and adverse event reporting program. J. Med. Libr. Assoc. 95, 224–225 (2007)
9. Gershman, J.A., Fass, A.D.: Medication safety and pharmacovigilance resources for the ambulatory care setting: enhancing patient safety. Hosp. Pharm. 49, 363–368 (2014)
10. Getz, K.A., Stergiopoulos, S., Kaitin, K.I.: Evaluating the completeness and accuracy of MedWatch data. Am. J. Ther. 21, 442–446 (2014)
11. Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. J. ACM 21, 168–173 (1974)
12. Pfeifer, U., Poersch, T., Fuhr, N., Vi, L.I.: Searching proper names in databases. In: HIM, pp. 259–275. Citeseer (1995)
13. Pfeifer, U., Vi, L.I.: Searching proper names in databases, vol. 20, pp. 1–13, October 1994
14. Lambert, B.L., Lin, S.J., Chang, K.Y., Gandhi, S.K.: Similarity as a risk factor in drug-name confusion errors: The look-alike (orthographic) and sound-alike (phonetic) model. Med. Care 37, 1214–1225 (1999)
15. Schroeder, S.R., et al.: Cognitive tests predict real-world errors: the relationship between drug name confusion rates in laboratory-based memory and perception tests and corresponding error rates in large pharmacy chains. BMJ Qual. Saf. 26, 395–407 (2017)
16. Lambert, B.L., et al.: Listen carefully: the risk of error in spoken medication orders. Soc. Sci. Med. 70, 1599–1608 (2010)
17. Lambert, B.L.: Predicting look-alike and sound-alike medication errors. Am. J. Heal. Pharm. 54, 1161–1171 (1997)
18. Shah, M.B., Merchant, L., Chan, I.Z., Taylor, K.: Characteristics that may help in the identification of potentially confusing proprietary drug names. Ther. Innov. Regul. Sci. 51, 232–236 (2017)
19. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady, pp. 707–710 (1966)
20. Zobel, J., Box, G.P.O., Dart, P.: Phonetic string matching: lessons from information retrieval. In: Proceedings of 19th Annual International ACM SIGIR Conference Research and Development in Information Retrieval, pp. 166–172 (1996)
21. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19, 1–16 (2007)
22. Chen, S., Liu, Y., Wei, L., Guan, B.: PS-FW: a hybrid algorithm based on particle swarm and fireworks for global optimization. Comput. Intell. Neurosci. 2018, 1–27 (2018)
23. Kondrak, G., Dorr, B.: Identification of confusable drug names: a new approach and evaluation methodology (2004)
24. Kondrak, G., Dorr, B.J.: A similarity-based approach and evaluation methodology for reduction of drug name confusion. Alberta University, Edmonton (2003)
25. Kondrak, G.: N-Gram similarity and distance. In: Consens, M., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 115–126. Springer, Heidelberg (2005). https://doi.org/10.1007/11575832_13
26. Chen, L.-C., Chen, C.-H., Chen, H.-M., Tseng, V.S.: Hybrid data mining approaches for prevention of drug dispensing errors. J. Intell. Inf. Syst. 36, 305–327 (2011)
27. Adamson, G.W., Boreham, J.: The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Inf. Storage Retr. 10, 253–260 (1974)
28. Lambert, B.L., Chang, K.-Y., Lin, S.-J.: Effect of orthographic and phonological similarity on false recognition of drug names. Soc. Sci. Med. 52, 1843–1857 (2001)
29. Lambert, B.L., Yu, C., Thirumalai, M.: A system for multiattribute drug product comparison. J. Med. Syst. 28, 31–56 (2004)
30. Kondrak, G.: Phonetic alignment and similarity. Comput. Hum. 37, 273–291 (2003)
31. Kondrak, G.: Algorithms for language reconstruction (2002)
32. USP: USP quality review (76). US Pharmacopeia (2001)
33. Or, C.K.L., Wang, H.H.L.: Examining text enhancement methods to improve look-alike drug name differentiation accuracy. In: Proceedings of the Human Factors and Ergonomics Society, pp. 645–649 (2013)
34. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley, Reading (1989)
35. Holland, J.H.: Adaptation in Natural and Artificial Systems: an Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT Press, Cambridge (1992)
36. Mitchell, M.: An Introduction to Genetic Algorithms. Cambridge, Massachusetts, London, England, Fifth Printing (1999)
37. Croft, B., Metzler, D., Strohman, T.: Search Engines: Information Retrieval in Practice. Addison-Wesley Publishing Company, Boston (2009)

Author information

Authors and Affiliations

Autonomous University of State of Mexico, 50000, Toluca, Mexico
Christian Eduardo Millán-Hernández, René Arnulfo García-Hernández, Yulia Ledeneva & Ángel Hernández-Castañeda

Corresponding author

Correspondence to René Arnulfo García-Hernández.

Editor information

Editors and Affiliations

National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico
Jesús Ariel Carrasco-Ochoa
National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico
José Francisco Martínez-Trinidad
Autonomous University of Puebla, Puebla, Mexico
José Arturo Olvera-López
National Polytechnic Institute of Mexico, Querétaro, Mexico
Joaquín Salas

Cite this paper

Millán-Hernández, C.E., García-Hernández, R.A., Ledeneva, Y., Hernández-Castañeda, Á. (2019). Soft Bigram Similarity to Identify Confusable Drug Names. In: Carrasco-Ochoa, J., Martínez-Trinidad, J., Olvera-López, J., Salas, J. (eds) Pattern Recognition. MCPR 2019. Lecture Notes in Computer Science, vol 11524. Springer, Cham. https://doi.org/10.1007/978-3-030-21077-9_40

DOI: https://doi.org/10.1007/978-3-030-21077-9_40
Published: 18 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-21076-2
Online ISBN: 978-3-030-21077-9
eBook Packages: Computer Science, Computer Science (R0)
