# Finding an appropriate equation to measure similarity between binary vectors: case studies on Indonesian and Japanese herbal medicines

## Abstract

### Background

Binary similarity and dissimilarity measures have critical roles in the processing of data consisting of binary vectors in various fields, including bioinformatics and chemometrics. These metrics express the similarity and dissimilarity values between two binary vectors in terms of positive matches, absence mismatches, or negative matches. To our knowledge, there is no published work presenting a systematic way of finding an appropriate equation to measure binary similarity that performs well for a certain data type or application. A proper method to select a suitable binary similarity or dissimilarity measure is needed to obtain better classification results.

### Results

In this study, we propose a novel approach to select binary similarity and dissimilarity measures. We collected 79 binary similarity and dissimilarity equations by an extensive literature search and implemented those equations as an R package called bmeasures. We applied these metrics to quantify the similarity and dissimilarity between herbal medicine formulas belonging to the Indonesian Jamu and Japanese Kampo separately. We assessed the capability of the binary equations to classify herbal medicine pairs into match and mismatch efficacies based on their similarity or dissimilarity coefficients using Receiver Operating Characteristic (ROC) curve analysis. According to the areas under the ROC curves, the Indonesian Jamu and Japanese Kampo datasets yielded different rankings of the binary similarity and dissimilarity measures. Out of all the equations, the Forbes-2 similarity and the Variant of Correlation similarity measures are recommended for studying the relationship between Jamu formulas and Kampo formulas, respectively.

### Conclusions

The selection of binary similarity and dissimilarity measures for multivariate analysis is data dependent. The proposed method can be used to find the most suitable binary similarity or dissimilarity equation for a particular dataset. Our findings suggest that all four types of matching quantities in the Operational Taxonomic Unit (OTU) table are important for calculating the similarity and dissimilarity coefficients between herbal medicine formulas. Also, the binary similarity and dissimilarity measures that include the negative match quantity *d* separate herbal medicine pairs better than equations that exclude *d*.

## Keywords

Binary data, Similarity measures, Distance metric, Jamu, Kampo, ROC curve, Hierarchical clustering

## Abbreviations

- AUC: The Area Under the ROC Curve
- D: Dissimilarity
- FN: False Negative
- FP: False Positive
- FPR: False Positive Rate
- NA-DFC: The National Agency of Drug and Food Control
- NCBI: The National Center for Biotechnology Information
- OTU: The Operational Taxonomic Unit
- PR: Precision-Recall
- ROC: The Receiver Operating Characteristic
- S: Similarity
- TCM: Traditional Chinese Medicine
- TN: True Negative
- TP: True Positive
- TPR: True Positive Rate

## Background

Binary features have been commonly used to represent a great variety of data [1, 2, 3], expressing the binary status of samples as presence/absence, yes/no, or true/false. They have many applications in the bioinformatics, chemometrics, and medical fields [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], as well as in pattern recognition, information retrieval, statistical analysis, and data mining [20, 21]. The choice of an appropriate coefficient of similarity or dissimilarity is necessary to evaluate multivariate data represented by binary feature vectors because different similarity measures may yield conflicting results [22]. Choi et al. [23] collected binary similarity and dissimilarity measures used over the last century and revealed their correlation through the hierarchical clustering technique. They also classified the equations into two groups based on the inclusion or exclusion of negative matches. Consonni & Todeschini [1] proposed five new similarity coefficients and compared them with some well-known similarity coefficients. Three of the five are less correlated with the other common similarity coefficients and need further investigation to understand their potential. Meanwhile, Todeschini et al. [24] reported an analysis of 44 different similarity coefficients for computing the similarities between binary fingerprints by using simple descriptive statistics, correlation analysis, multidimensional scaling, Hasse diagrams, and their proposed ‘atemporal target diffusion model’.

Nowadays, the use of herbal medicines, e.g. Indonesian Jamu, Japanese Kampo, and traditional Chinese medicine (TCM) [25], is becoming popular for disease treatment and for maintaining good health. In the case of Indonesian Jamu, each Jamu medicine is prepared from a single plant or a mixture of several plants as its ingredients. The National Agency of Drug and Food Control (NA-DFC) of Indonesia supervises the production of Jamu medicines before their release for public use. Up to 2014, there were 1247 Jamu factories in Indonesia [26]. They have concocted many Jamu formulas with various efficacies. Consequently, the study of Jamu formulas has become an interesting research topic in the last few years. It may be related to the problems of the Jamu philosophy, the systematization of Jamu, or phytochemistry. In the Jamu studies, the relationships between plants, Jamu, and efficacies have been used to determine the important plants for every disease class using global and local approaches [4, 5, 27]. In addition, Kampo formulas are traditional medicines from Japan. These are generally prepared by combining crude drugs. In total, 294 Kampo formulas are listed in the Japanese Pharmacopoeia of 2012 and they can be used for self-medication [28]. Currently, many researchers are conducting Kampo studies to unveil the complex systems of Kampo medication and to reveal the scientific aspect of its relevance to modern healthcare. In Jamu and Kampo studies, the relations between herbal medicine formulas and plants/crude drugs are represented as binary feature vectors, denoting whether a particular plant is used or not as an ingredient.

The relationships between Jamu formulas, as well as Kampo formulas and other herbal medicines, are reflected not only by the efficacy similarity but also by the ingredient similarity. One Jamu formula can be suggested as an alternative to another if they have relatively similar ingredients. For mathematical analysis, each Jamu formula is represented as a binary vector using 1 to indicate the presence of a plant and 0 otherwise. However, each Jamu formula usually uses only a few plants. Thus, most of the Jamu vectors contain a few 1s and many 0s. Consequently, the number of plants that are used simultaneously in a Jamu pair is much smaller than the number of plants that are simultaneously absent from both formulas. Therefore, in order to find relatively similar Jamu formulas, the high number of negative matches might influence the calculation of binary similarity or dissimilarity between Jamu pairs. On the other hand, there is no guarantee that negative co-occurrence between two entities is identical [29]. Hence, it is necessary to examine the binary similarity and dissimilarity coefficients of Jamu formulas to determine the appropriate measurement for finding a suitable mixing alternative of a target crude drug.

Currently, there are several methods to measure the quality of classifiers [30, 31] such as the Receiver Operating Characteristic (ROC) curves [32, 33], Precision-Recall (PR) curves [33, 34], Cohen’s Kappa scores [35, 36], and so on. An ROC curve is a very powerful tool for measuring classifiers’ performance in many fields, especially in the machine learning and binary-class problems [37]. The purpose of ROC analysis is similar to that of the Cohen’s Kappa, which is mainly used for ranking classifiers. The ROC curve conveys more information than Cohen’s Kappa in a sense that it can also visualize the performance of a classifier by a curve instead of generating just a scalar value. In this study, we propose a method to select the most suitable similarity measures in the context of classification based on False Positive Rates (FPRs) and True Positive Rates (TPRs) by using ROC curve analysis. We discuss the step-by-step development of this method by applying it to assess the similarity of herbal medicines in the context of their efficacies. Initially, we gathered 79 binary similarity and dissimilarity equations. Some identical equations were eliminated in the preliminary step. Subsequently, the capability of binary measures to separate herbal medicine pairs into match and mismatch efficacy groups was assessed by using the ROC analysis.

## Methods

### Datasets

We used 3131 Jamu formulas collected from the NA-DFC of Indonesia [4, 5, 27], which comprise 465 plants. The Jamu vs. plant relations were organized as a 3131 × 465 matrix (Fig. 1a). Jamu formulas were represented by binary vectors, which express the binary status of plants as ingredients, 1 (presence) and 0 (absence). Each Jamu formula consists of 1 to 26 plants, with an average of 4.904 and a standard deviation of 2.969, and the set union of all formulas consists of 465 plants. Each Jamu formula corresponds to one or more efficacy/disease classes. A total of 14 disease classes are used in this Jamu study, of which 12 classes are from the National Center for Biotechnology Information (NCBI) [38]. The disease classes are as follows: blood and lymph diseases (E1), cancers (E2), the digestive system (E3), female-specific diseases (E4), the heart and blood vessels (E5), diseases of the immune system (E6), male-specific diseases (E7), muscle and bone (E8), the nervous system (E9), nutritional and metabolic diseases (E10), respiratory diseases (E11), skin and connective tissue (E12), the urinary system (E13), and mental and behavioral disorders (E14). Corresponding to the 3131 Jamu formulas, there are (3131 × 3130)/2 = 4,900,015 Jamu pairs.

For the purpose of comparison, we created four random matrices of the same size as the Jamu-plant relation matrix by randomly inserting 1s and 0s. In three of the random datasets, the numbers of 1s are 1, 5, and 10% of the 465 plants (called random 1%, random 5%, and random 10%). In the fourth dataset, we randomly inserted the same number of 1s in every row as in the original Jamu formulas (called random Jamu). We also applied our proposed method to the Kampo dataset [28]. This dataset is presented as a two-dimensional binary matrix with rows and columns representing Kampo formulas and crude drug ingredients, respectively. The Kampo dataset is composed of 274 Kampo formulas and each formula consists of 3 to 19 crude drugs, with an average of 8.923 and a standard deviation of 3.885, and the set union of all formulas consists of 227 crude drugs. Each Kampo formula is classified into the deficiency or excess class, according to the Kampo-specific diagnosis of the patient’s constitution.

### Flow of the experiment

The binary similarity (S) and dissimilarity (D) measures between a pair of herbal medicines are expressed in terms of the Operational Taxonomic Units (OTUs, as shown in Fig. 1a) [39, 40]. Concretely, let two Jamu formulas be described by two row vectors \( J_i \) and \( J_{i'} \), each comprising *M* variables with value 1 (presence) or 0 (absence). The four quantities *a*, *b*, *c*, *d* in the OTU table are defined as follows: *a* is the number of features where the values for both \( J_i \) and \( J_{i'} \) are 1 (positive matches); *b* and *c* are the numbers of features where the value for \( J_i \) is 0 and \( J_{i'} \) is 1 and vice versa, respectively (absence mismatches); and *d* is the number of features where the values for both \( J_i \) and \( J_{i'} \) are 0 (negative matches). The sum of *a* and *d* represents the total number of matches between \( J_i \) and \( J_{i'} \), and the sum of *b* and *c* represents the total number of mismatches between them. The total sum of the quantities in the OTU table, *a* + *b* + *c* + *d*, is equal to *M* (the same total is written *n* in the equations of Table 1).
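The OTU quantities can be illustrated with a minimal Python sketch. This is not part of the bmeasures package; the function name `otu_counts` and the two example vectors are hypothetical and chosen only for illustration.

```python
def otu_counts(x, y):
    """Return the OTU quantities (a, b, c, d) for two equal-length 0/1 vectors.

    a: both 1 (positive matches)
    b: x is 0 and y is 1 (absence mismatch)
    c: x is 1 and y is 0 (absence mismatch)
    d: both 0 (negative matches)
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi == 1 and yi == 1:
            a += 1
        elif xi == 0 and yi == 1:
            b += 1
        elif xi == 1 and yi == 0:
            c += 1
        else:  # inputs are assumed to be strictly 0/1, so this is the 0/0 case
            d += 1
    return a, b, c, d


# Two hypothetical formulas over M = 8 plants
j1 = [1, 1, 0, 0, 1, 0, 0, 0]
j2 = [1, 0, 1, 0, 1, 0, 0, 0]
a, b, c, d = otu_counts(j1, j2)
print(a, b, c, d)  # 2 1 1 4
```

The four counts always sum to the vector length *M*, which is a convenient sanity check when building the OTU table for a large matrix.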

The 79 binary similarity and dissimilarity measures listed in Table 1 are defined in terms of the quantities *a*, *b*, *c* and *d*. We also implemented these 79 equations as an R package, called bmeasures. The bmeasures package is available on GitHub and can be installed by invoking these commands: `install.packages("devtools")`, `library("devtools")`, `install_github("shwijaya/bmeasures")`, `library("bmeasures")`. The installation of the bmeasures package was tested on R release 3.2.4 with devtools version 1.11.0. Initially, we measure the similarity and dissimilarity coefficients between herbal medicine pairs by using the 79 equations. Then, the resulting similarity/dissimilarity coefficients are used for further analysis. Our experimental procedure can be divided into two major steps, which we discuss in the following segments:

Table 1. List of 79 binary similarity and dissimilarity measures

Eq. IDs | Equations | References | Note |
---|---|---|---|
1 | \( {S}_{Jaccard}=\frac{a}{a+b+c} \) | [1, 20, 21, 23, 24, 29, 40, 41, 42, 43, 45, 46, 47, 48, 49, 50, 55] | |
2 | \( {S}_{Dice-2}=\frac{a}{2a+b+c} \) | | |
3 | \( {S}_{Dice-1/ Czekanowski}=\frac{2a}{2a+b+c} \) | | *** |
4 | \( {S}_{3W- Jaccard}=\frac{3a}{3a+b+c} \) | | |
5 | \( {S}_{Nei\&Li}=\frac{2a}{\left(a+b\right)+\left(a+c\right)} \) | | * |
6 | \( {S}_{Sokal\& Sneath-1}=\frac{a}{a+2b+2c} \) | | |
7 | \( {S}_{Sokal\& Michener}=\frac{a+d}{a+b+c+d} \) | | |
8 | \( {S}_{Sokal\& Sneath-2}=\frac{2\left(a+d\right)}{2a+b+c+2d} \) | | |
9 | \( {S}_{Roger\& Tanimoto}=\frac{a+d}{a+2\left(b+c\right)+d} \) | | |
10 | \( {S}_{Faith}=\frac{a+0.5d}{a+b+c+d} \) | | |
11 | \( {S}_{Gower\& Legendre}=\frac{a+d}{a+0.5\left(b+c\right)+d} \) | | * |
12 | \( {S}_{Intersection}=a \) | | |
13 | \( {S}_{Innerproduct}=a+d \) | [23] | *** |
14 | \( {S}_{Russell\&Rao}=\frac{a}{a+b+c+d} \) | [1, 3, 20, 21, 23, 24, 29, 40, 41, 45, 47, 48, 49, 50, 55, 56] | *** |
15 | \( {D}_{Hamming}=b+c \) | | |
16 | \( {D}_{Euclid}=\sqrt{b+c} \) | [23] | |
17 | \( {D}_{Squared- euclid}=\sqrt{{\left(b+c\right)}^2} \) | | * |
18 | \( {D}_{Canberra}={\left(b+c\right)}^{\frac{2}{2}} \) | [23] | * |
19 | | [23] | * |
20 | \( {D}_{Mean- Manhattan}=\frac{b+c}{a+b+c+d} \) | | *** |
21 | | [23] | * |
22 | \( {D}_{Minkowski}={\left(b+c\right)}^{\frac{1}{1}} \) | [23] | * |
23 | \( {D}_{Vari}=\frac{b+c}{4\left(a+b+c+d\right)} \) | | *** |
24 | \( {D}_{SizeDifference}=\frac{{\left(b+c\right)}^2}{{\left(a+b+c+d\right)}^2} \) | [23] | |
25 | \( {D}_{ShapeDifference}=\frac{n\left(b+c\right)-{\left(b-c\right)}^2}{{\left(a+b+c+d\right)}^2} \) | [23] | |
26 | \( {D}_{PatternDifference}=\frac{4bc}{{\left(a+b+c+d\right)}^2} \) | [23] | |
27 | \( {D}_{Lance\& Williams}=\frac{b+c}{2a+b+c} \) | | |
28 | \( {D}_{Bray\& Curtis}=\frac{b+c}{2a+b+c} \) | [23] | * |
29 | \( {D}_{Hellinger}=2\sqrt{\left(1-\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}}\right)} \) | [23] | |
30 | \( {D}_{Chord}=\sqrt{2\left(1-\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}}\right)} \) | [23] | *** |
31 | \( {S}_{Cosine}=\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}} \) | | |
32 | \( {S}_{Gilbert\& Wells}= \log a- \log n- \log \left(\frac{a+b}{n}\right)- \log \left(\frac{a+c}{n}\right) \) | | ** |
33 | \( {S}_{Ochiai-1}=\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}} \) | | * |
34 | \( {S}_{Forbes-1}=\frac{na}{\left(a+b\right)\left(a+c\right)} \) | | |
35 | \( {S}_{Fossum}=\frac{n{\left(a-0.5\right)}^2}{\left(a+b\right)\left(a+c\right)} \) | | |
36 | \( {S}_{Sorgenfrei}=\frac{a^2}{\left(a+b\right)\left(a+c\right)} \) | | |
37 | \( {S}_{Mountford}=\frac{a}{0.5\left( ab+ac\right)+bc} \) | | ** |
38 | \( {S}_{Otsuka}=\frac{a}{{\left(\left(a+b\right)\left(a+c\right)\right)}^{0.5}} \) | | * |
39 | \( {S}_{McConnaughey}=\frac{a^2-bc}{\left(a+b\right)\left(a+c\right)} \) | | |
40 | \( {S}_{Tarwid}=\frac{na-\left(a+b\right)\left(a+c\right)}{na+\left(a+b\right)\left(a+c\right)} \) | | |
41 | \( {S}_{Kulczynski-2}=\frac{\frac{a}{2}\left(2a+b+c\right)}{\left(a+b\right)\left(a+c\right)} \) | | *** |
42 | \( {S}_{Driver\& Kroeber}=\frac{a}{2}\left(\frac{1}{a+b}+\frac{1}{a+c}\right) \) | | *** |
43 | \( {S}_{Johnson}=\frac{a}{a+b}+\frac{a}{a+c} \) | | *** |
44 | \( {S}_{Dennis}=\frac{ad-bc}{\sqrt{n\left(a+b\right)\left(a+c\right)}} \) | | |
45 | \( {S}_{Simpson}=\frac{a}{ \min \left(a+b,a+c\right)} \) | | |
46 | \( {S}_{Braun\& Banquet}=\frac{a}{ \max \left(a+b,a+c\right)} \) | | |
47 | \( {S}_{Fager\& McGowan}=\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}}-\frac{ \max \left(a+b,a+c\right)}{2} \) | | |
48 | \( {S}_{Forbes-2}=\frac{na-\left(a+b\right)\left(a+c\right)}{n \min \left(a+b,a+c\right)-\left(a+b\right)\left(a+c\right)} \) | | |
49 | \( {S}_{Sokal\& Sneath-4}=\frac{\frac{a}{\left(a+b\right)}+\frac{a}{\left(a+c\right)}+\frac{d}{\left(b+d\right)}+\frac{d}{\left(c+d\right)}}{4} \) | | |
50 | \( {S}_{Gower}=\frac{a+d}{\sqrt{\left(a+b\right)\left(a+c\right)\left(b+d\right)\left(c+d\right)}} \) | [23] | |
51 | \( {S}_{Pearson-1}={\chi}^2=\frac{n{\left( ad-bc\right)}^2}{\left(a+b\right)\left(a+c\right)\left(c+d\right)\left(b+d\right)} \) | | |
52 | \( {S}_{Pearson-2}={\left(\frac{\chi^2}{n+{\chi}^2}\right)}^{\frac{1}{2}} \) | | |
53 | \( {S}_{Pearson-3}={\left(\frac{\rho }{n+\rho}\right)}^{\frac{1}{2}} \) \( \mathrm{where}\kern0.75em \rho =\frac{ad-bc}{\sqrt{\left(a+b\right)\left(a+c\right)\left(b+d\right)\left(c+d\right)}} \) | [23] | ** |
54 | \( {S}_{Pearson\& Heron-1}=\frac{ad-bc}{\sqrt{\left(a+b\right)\left(a+c\right)\left(b+d\right)\left(c+d\right)}} \) | | |
55 | \( {S}_{Pearson\& Heron-2}= \cos \left(\frac{\pi \sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\right) \) | | |
56 | \( {S}_{Sokal\& Sneath-3}=\frac{a+d}{b+c} \) | | ** |
57 | \( {S}_{Sokal\& Sneath-5}=\frac{ad}{\left(a+b\right)\left(a+c\right)\left(b+d\right){\left(c+d\right)}^{0.5}} \) | | |
58 | \( {S}_{Cole}=\frac{\sqrt{2}\left( ad-bc\right)}{\sqrt{{\left( ad-bc\right)}^2-\left(a+b\right)\left(a+c\right)\left(b+d\right)\left(c+d\right)}} \) | | ** |
59 | \( {S}_{Stiles}={ \log}_{10}\frac{n{\left(\left| ad-bc\right|-\frac{n}{2}\right)}^2}{\left(a+b\right)\left(a+c\right)\left(b+d\right)\left(c+d\right)} \) | | |
60 | \( {S}_{Ochiai-2}=\frac{ad}{\sqrt{\left(a+b\right)\left(a+c\right)\left(b+d\right)\left(c+d\right)}} \) | | * |
61 | \( {S}_{Yuleq}=\frac{ad-bc}{ad+bc} \) | | |
62 | \( {D}_{Yuleq}=\frac{2bc}{ad+bc} \) | [23] | |
63 | \( {S}_{Yulew}=\frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}} \) | | |
64 | \( {S}_{Kulczynski-1}=\frac{a}{b+c} \) | | ** |
65 | \( {S}_{Tanimoto}=\frac{a}{\left(a+b\right)+\left(a+c\right)-a} \) | | * |
66 | \( {S}_{Dispersion}=\frac{ad-bc}{{\left(a+b+c+d\right)}^2} \) | | |
67 | \( {S}_{Hamann}=\frac{\left(a+d\right)-\left(b+c\right)}{a+b+c+d} \) | | *** |
68 | \( {S}_{Michael}=\frac{4\left( ad-bc\right)}{{\left(a+d\right)}^2+{\left(b+c\right)}^2} \) | | |
69 | \( {S}_{Goodman\& Kruskal}=\frac{\sigma -{\sigma}^{\hbox{'}}}{2n-{\sigma}^{\hbox{'}}} \) \( \begin{array}{l}\mathrm{where}\;\sigma = \max \left(a,b\right)+ \max \left(c,d\right)+ \max \left(a,c\right)+ \max \left(b,d\right)\\ {}\kern1.56em {\sigma}^{\hbox{'}}= \max \left(a+c,b+d\right)+ \max \left(a+b,c+d\right)\end{array} \) | [23] | ** |
70 | \( {S}_{Anderberg}=\frac{\sigma -{\sigma}^{\hbox{'}}}{2n} \) | [23] | ** |
71 | \( {S}_{Baroni- Urbani\& Buser-1}=\frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c} \) | | |
72 | \( {S}_{Baroni- Urbani\& Buser-2}=\frac{\sqrt{ad}+a-\left(b+c\right)}{\sqrt{ad}+a+b+c} \) | | *** |
73 | \( {S}_{Peirce}=\frac{ab+bc}{ab+2bc+ cd} \) | | ** |
74 | \( {S}_{Eyraud}=\frac{n^2\left(na-\left(a+b\right)\left(a+c\right)\right)}{\left(a+b\right)\left(a+c\right)\left(b+d\right)\left(c+d\right)} \) | [23] | |
75 | \( {S}_{Tarantula}=\frac{\frac{a}{\left(a+b\right)}}{\frac{c}{\left(c+d\right)}}=\frac{a\left(c+d\right)}{c\left(a+b\right)} \) | [23] | ** |
76 | \( {S}_{Ample}=\left|\frac{\frac{a}{\left(a+b\right)}}{\frac{c}{\left(c+d\right)}}\right|=\left|\frac{a\left(c+d\right)}{c\left(a+b\right)}\right| \) | [23] | ** |
77 | \( {S}_{Derived\_ Russell-Rao}=\frac{ \log \left(1+a\right)}{ \log \left(1+n\right)} \) | | |
78 | \( {S}_{Derived\_ Jaccard}=\frac{ \log \left(1+a\right)}{ \log \left(1+a+b+c\right)} \) | | |
79 | \( {S}_{Var\_ of\_ Correlation}=\frac{ \log \left(1+ ad\right)- \log \left(1+bc\right)}{ \log \left(1+{n}^2/4\right)} \) | | |
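To make the table concrete, the following Python sketch evaluates four of the listed equations (Eqs. 1, 7, 48, and 79) directly from the OTU quantities. It is an illustrative re-implementation under the assumptions that *n* = *a* + *b* + *c* + *d* and that the logarithms are natural; it is not the bmeasures code, and the function names are hypothetical.

```python
import math


def jaccard(a, b, c, d):
    """Eq. 1: ignores the negative matches d."""
    return a / (a + b + c)


def sokal_michener(a, b, c, d):
    """Eq. 7: counts both positive and negative matches."""
    return (a + d) / (a + b + c + d)


def forbes_2(a, b, c, d):
    """Eq. 48, with n = a + b + c + d."""
    n = a + b + c + d
    numerator = n * a - (a + b) * (a + c)
    denominator = n * min(a + b, a + c) - (a + b) * (a + c)
    return numerator / denominator


def variant_of_correlation(a, b, c, d):
    """Eq. 79, with n = a + b + c + d (natural logarithm assumed)."""
    n = a + b + c + d
    return (math.log(1 + a * d) - math.log(1 + b * c)) / math.log(1 + n ** 2 / 4)


# Hypothetical OTU counts for one formula pair
a, b, c, d = 2, 1, 1, 4
print(jaccard(a, b, c, d))         # 0.5
print(sokal_michener(a, b, c, d))  # 0.75
print(forbes_2(a, b, c, d))        # 7/15
print(variant_of_correlation(a, b, c, d))
```

Increasing *d* while keeping *a*, *b*, and *c* fixed leaves the Jaccard value unchanged but changes the other three, which is the distinction between equations that exclude and include negative matches.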

**Step 1.** Reducing the candidate equations

The binary similarity and dissimilarity equations were evaluated to eliminate duplications. When two or more equations can be transformed into the same form by algebraic manipulation, only one of them is kept for further analysis. We also removed from our analysis the equations that produce infinite/NaN values or indeterminate forms when applied to measure similarity and dissimilarity on the datasets.

The distance between two equations *k* and *l* is defined by Eq. 80, where \( s_{mn}(k) \) and \( s_{mn}(l) \) are the similarity/dissimilarity values between the corresponding herbal medicine pair using equations *k* and *l*, respectively, *N* is the total number of herbal medicine formulas, and \( d_{k,l} \) is the distance between equations *k* and *l*. The cluster centroid is the average of the variables over the observations (in the present case, equations) in that cluster. Let \( {\overline{X}}_G,{\overline{X}}_H \) denote the group averages for clusters *G* and *H*. Then, the distance between cluster centroids is calculated using Eq. 81, where \( {\overline{X}}_G \) is the centroid of *G* given by the arithmetic mean \( {\overline{X}}_G=\frac{1}{n_G}{\displaystyle {\sum}_{i=1}^{n_G}}{X}_{Gi} \) [2, 65, 66]. We implemented the clustering process using the hclust function in R. At each step, the cluster centroid was calculated to represent the group of equations in each cluster, and the two equations or clusters with the minimum distance between their centroids were merged, until all equations were merged into one cluster.

We performed the hierarchical clustering process twice: first to reduce the candidate equations for which the distance between equations measured by Eq. 80 is zero or nearly zero, and second to evaluate the combined characteristics of a group of equations. Mean centering and unit variance scaling were applied to the similarity/dissimilarity coefficients before the clustering process.
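The effect of mean centering and unit variance scaling can be sketched as follows: two equations that differ only by a positive multiplicative or additive constant collapse to the same standardized vector, so their distance becomes zero. The sketch assumes a plain Euclidean distance between standardized coefficient vectors and uses the population standard deviation; the OTU counts and helper names are hypothetical, and this is not the hclust-based procedure itself.

```python
import math
import statistics


def standardize(values):
    """Mean-center and scale to unit variance (population SD is an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


# Hypothetical OTU counts (a, b, c, d) for five formula pairs
pairs = [(2, 1, 1, 4), (1, 3, 2, 2), (3, 0, 1, 4), (0, 2, 2, 4), (4, 1, 0, 3)]

dice_2 = [a / (2 * a + b + c) for a, b, c, d in pairs]      # Eq. 2
dice_1 = [2 * a / (2 * a + b + c) for a, b, c, d in pairs]  # Eq. 3 = 2 x Eq. 2
jaccard = [a / (a + b + c) for a, b, c, d in pairs]         # Eq. 1

dist_dice = euclidean(standardize(dice_1), standardize(dice_2))
dist_jaccard = euclidean(standardize(dice_1), standardize(jaccard))
print(dist_dice)     # numerically zero: the two equations are redundant
print(dist_jaccard)  # positive: Jaccard is not an affine function of Dice-1
```

A near-zero distance after standardization is therefore a practical signal that only one of the two equations needs to be kept.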

**Step 2.** ROC Analysis of selected equations

An ROC curve is built from two quantities, *FPR* and *TPR*. *FPR* is the proportion of false positive predictions out of all the false data and *TPR* is the proportion of true positive predictions out of all the true data, defined by Eq. 82 [67, 68, 69]:

\( FPR=\frac{FP}{FP+TN},\kern1em TPR=\frac{TP}{TP+FN} \)

where true positive (*TP*) is the number of herbal medicine pairs correctly classified as positive, true negative (*TN*) is the number of pairs correctly classified as negative, false positive (*FP*) is the number of pairs incorrectly classified as positive, and false negative (*FN*) is the number of pairs incorrectly classified as negative. We defined and compared the performance of good equations by using the minimum distance of the ROC curve to the theoretical optimum point and by using the Area Under the ROC Curve (AUC) analysis [70]. The minimum distance between the ROC curve and the optimum point was measured as the Euclidean distance. The minimum distance can also be computed from the *TP*, *TN*, *FP*, and *FN* values corresponding to the selected similarity thresholds *i* using the following formulation:

\( \underset{i}{ \min}\sqrt{{\left(\frac{FP_i}{FP_i+{TN}_i}\right)}^2+{\left(\frac{FN_i}{TP_i+{FN}_i}\right)}^2} \)
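As a minimal sketch of these quantities, the Python code below computes *FPR* and *TPR* from confusion counts and the minimum Euclidean distance from a set of (*FPR*, *TPR*) points to the optimum point (0, 1). The function names and the confusion counts are hypothetical and serve only as an illustration.

```python
import math


def rates(tp, fn, fp, tn):
    """Return (FPR, TPR) for one threshold."""
    fpr = fp / (fp + tn)  # false positives out of all actual negatives
    tpr = tp / (tp + fn)  # true positives out of all actual positives
    return fpr, tpr


def min_distance_to_optimum(confusions):
    """Smallest Euclidean distance from any (FPR, TPR) point to (0, 1).

    `confusions` is an iterable of (tp, fn, fp, tn) tuples, one per threshold.
    """
    best = math.inf
    for tp, fn, fp, tn in confusions:
        fpr, tpr = rates(tp, fn, fp, tn)
        best = min(best, math.hypot(fpr - 0.0, tpr - 1.0))
    return best


# Hypothetical confusion counts at four thresholds (100 positives, 100 negatives)
confusions = [(100, 0, 100, 0), (80, 20, 30, 70), (50, 50, 10, 90), (0, 100, 0, 100)]
print(min_distance_to_optimum(confusions))  # sqrt(0.3**2 + 0.2**2), from the second threshold
```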

## Results and discussion

### Preliminary verification of the equations

Among the groups of identical equations (Table 2), we kept \( {S}_{Jaccard} \), \( {S}_{Dice-1/ Czekanowski} \), \( {S}_{Sokal\& Sneath-2} \), \( {D}_{Hamming} \), \( {D}_{Lance\& Williams} \), \( {S}_{Cosine} \) and \( {S}_{Sokal\& Sneath-5} \) in our analysis and, therefore, we were left with 67 equations at this stage. Next, we clustered the 67 equations using the Jamu and Kampo datasets to reduce the number of equations further. During the clustering process, we eliminated 11 equations, indicated by ‘**’ in Table 1, that produced infinite/NaN values or indeterminate forms when applied to all datasets. Such conditions arise when the denominator of an equation becomes equal to 0; e.g. the denominators of the Mountford and Peirce similarities (Eq. 37 and Eq. 73) vanish because the values of *b* and *c* are 0 if two formulas use exactly the same ingredients.
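The failure mode can be reproduced with a short Python sketch. In R, a division by zero yields Inf or NaN; in Python it raises `ZeroDivisionError`, so the hypothetical helper `safe` below maps that error to NaN. The OTU counts are made up for illustration.

```python
import math


def mountford(a, b, c, d):
    """Eq. 37."""
    return a / (0.5 * (a * b + a * c) + b * c)


def peirce(a, b, c, d):
    """Eq. 73."""
    return (a * b + b * c) / (a * b + 2 * b * c + c * d)


def safe(measure, a, b, c, d):
    """Evaluate a measure, returning NaN when its denominator is zero."""
    try:
        return measure(a, b, c, d)
    except ZeroDivisionError:
        return math.nan


# Two formulas with exactly the same 5 ingredients out of 465 plants: b = c = 0
identical = (5, 0, 0, 460)
print(safe(mountford, *identical))  # nan
print(safe(peirce, *identical))     # nan

# A pair with some mismatches evaluates normally
print(safe(mountford, 2, 1, 1, 4))  # 2 / 3
```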

Table 2. Groups of identical equations

Groups | Eliminated Equations | Selected Equations |
---|---|---|
1 | \( {S}_{Nei\&Li}=\frac{2a}{\left(a+b\right)+\left(a+c\right)} \) (Eq.5) | \( {S}_{Dice-1/ Czekanowski}=\frac{2a}{2a+b+c} \) (Eq.3) |
2 | \( {S}_{Gower\& Legendre}=\frac{a+d}{a+0.5\left(b+c\right)+d} \) (Eq.11) | \( {S}_{Sokal\& Sneath-2}=\frac{2\left(a+d\right)}{2a+b+c+2d} \) (Eq.8) |
3 | \( {D}_{Squared- euclid}=\sqrt{{\left(b+c\right)}^2} \) (Eq.17) | \( {D}_{Hamming}=b+c \) (Eq.15) |
3 | \( {D}_{Canberra}={\left(b+c\right)}^{\frac{2}{2}} \) (Eq.18) | \( {D}_{Hamming}=b+c \) (Eq.15) |
3 | (Eq.19) | \( {D}_{Hamming}=b+c \) (Eq.15) |
3 | (Eq.21) | \( {D}_{Hamming}=b+c \) (Eq.15) |
3 | \( {D}_{Minkowski}={\left(b+c\right)}^{\frac{1}{1}} \) (Eq.22) | \( {D}_{Hamming}=b+c \) (Eq.15) |
4 | \( {D}_{Bray\& Curtis}=\frac{b+c}{2a+b+c} \) (Eq.28) | \( {D}_{Lance\& Williams}=\frac{b+c}{2a+b+c} \) (Eq.27) |
5 | \( {S}_{Ochiai-1}=\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}} \) (Eq.33) | \( {S}_{Cosine}=\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}} \) (Eq.31) |
5 | \( {S}_{Otsuka}=\frac{a}{{\left(\left(a+b\right)\left(a+c\right)\right)}^{0.5}} \) (Eq.38) | \( {S}_{Cosine}=\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}} \) (Eq.31) |
6 | \( {S}_{Ochiai-2}=\frac{ad}{\sqrt{\left(a+b\right)\left(a+c\right)\left(b+d\right)\left(c+d\right)}} \) (Eq.60) | \( {S}_{Sokal\& Sneath-5}=\frac{ad}{\left(a+b\right)\left(a+c\right)\left(b+d\right){\left(c+d\right)}^{0.5}} \) (Eq.57) |
7 | \( {S}_{Tanimoto}=\frac{a}{\left(a+b\right)+\left(a+c\right)-a} \) (Eq.65) | \( {S}_{Jaccard}=\frac{a}{a+b+c} \) (Eq.1) |

We excluded \( {S}_{Baroni- Urbani\& Buser-2} \) (Eq. 72) because it is similar to \( {S}_{Baroni- Urbani\& Buser-1} \) (Eq. 71). A careful observation of the equations belonging to the same cluster in group IDs 1 to 7 in Fig. 2 shows that one equation can be transformed into another just by adding or multiplying by constants (Table 3). For example, we can represent \( {S}_{Baroni- Urbani\& Buser-2} \) as \( \left[2 \times {S}_{Baroni- Urbani\& Buser-1}\right]-1 \). The equations excluded based on the clustering process are as follows: \( {S}_{Dice-1/ Czekanowski} \) (Eq. 3), \( {S}_{Innerproduct} \) (Eq. 13), \( {S}_{Russell\&Rao} \) (Eq. 14), \( {D}_{Mean- Manhattan} \) (Eq. 20), \( {D}_{Vari} \) (Eq. 23), \( {D}_{Chord} \) (Eq. 30), \( {S}_{Kulczynski-2} \) (Eq. 41), \( {S}_{Driver\& Kroeber} \) (Eq. 42), \( {S}_{Johnson} \) (Eq. 43), \( {S}_{Hamann} \) (Eq. 67), and \( {S}_{Baroni- Urbani\& Buser-2} \) (Eq. 72). In the case of the Kampo dataset, the clustering results identified the same equations as belonging to the same clusters with zero or nearly zero distance. Therefore, the same equations, indicated by ‘***’ in Table 1, were eliminated for both datasets, leaving the same number of selected equations (45 binary similarity and dissimilarity measures) for further analysis. Hence, among the 79 binary similarity and dissimilarity measures used over the last century, there are only 45 unique equations that produce different coefficients by capturing different information. Additionally, these binary measures satisfy the symmetry property [71], i.e. for such equations \( d\left(x,y\right)=d\left(y,x\right) \) or \( S\left(x,y\right)=S\left(y,x\right) \).
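The symmetry property can be checked numerically: exchanging the two vectors leaves *a* and *d* unchanged and swaps *b* and *c*. The sketch below applies that swap to the Jaccard similarity (Eq. 1) and, as a contrast, to the Tarantula similarity (Eq. 75), one of the equations already eliminated at the earlier stage. The helper name and the OTU counts are hypothetical.

```python
import math


def jaccard(a, b, c, d):
    """Eq. 1."""
    return a / (a + b + c)


def tarantula(a, b, c, d):
    """Eq. 75 (marked '**' in Table 1); used here only as an asymmetric contrast."""
    return (a * (c + d)) / (c * (a + b))


def is_symmetric(measure, a, b, c, d):
    """Swapping the two vectors swaps the mismatch counts b and c."""
    return math.isclose(measure(a, b, c, d), measure(a, c, b, d))


counts = (2, 1, 3, 4)  # hypothetical OTU counts with b != c
print(is_symmetric(jaccard, *counts))    # True
print(is_symmetric(tarantula, *counts))  # False
```

A single pair of counts cannot prove symmetry in general, but one failing case is enough to show that a measure is not symmetric.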

Table 3. Transformation of an equation into another by adding or multiplying by constants (Group IDs correspond to clusters in Fig. 2)

Group IDs | Eliminated Equations | Selected Equations |
---|---|---|
1 | \( {D}_{Chord}=\sqrt{2\left(1-\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}}\right)} \) (Eq.30) | \( =\frac{1}{\sqrt{2}}2\sqrt{\left(1-\frac{a}{\sqrt{\left(a+b\right)\left(a+c\right)}}\right)}=\frac{1}{\sqrt{2}}{D}_{Hellinger} \) (Eq.29) |
2 | \( {D}_{Mean- Manhattan}=\frac{b+c}{a+b+c+d} \) (Eq.20) | \( =\frac{1}{M}\left(b+c\right)=\frac{1}{M}{D}_{Hamming} \) (Eq.15) |
2 | \( {D}_{Vari}=\frac{b+c}{4\left(a+b+c+d\right)} \) (Eq.23) | \( =\frac{1}{4M}\left(b+c\right)=\frac{1}{4M}{D}_{Hamming} \) (Eq.15) |
3 | \( {S}_{Russell\&Rao}=\frac{a}{a+b+c+d} \) (Eq.14) | \( =\frac{1}{M}a=\frac{1}{M}{S}_{Intersection} \) (Eq.12) |
4 | \( {S}_{Baroni- Urbani\& Buser-2}=\frac{\sqrt{ad}+a-\left(b+c\right)}{\sqrt{ad}+a+b+c} \) (Eq.72) | \( =2\frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}-1=\left[2 \times {S}_{Baroni- Urbani\& Buser-1}\right]-1 \) (Eq.71) |
5 | \( {S}_{Kulczynski-2}=\frac{\frac{a}{2}\left(2a+b+c\right)}{\left(a+b\right)\left(a+c\right)} \) (Eq.41) | \( =\frac{1}{2}\left(\frac{a}{a+b}+\frac{a}{a+c}\right)=\frac{1}{2}{S}_{Johnson} \) (Eq.43) |
5 | \( {S}_{Driver\& Kroeber}=\frac{a}{2}\left(\frac{1}{a+b}+\frac{1}{a+c}\right) \) (Eq.42) | \( =\frac{1}{2}\left(\frac{a}{a+b}+\frac{a}{a+c}\right)=\frac{1}{2}{S}_{Johnson} \) (Eq.43) |
5 | \( {S}_{Johnson}=\frac{a}{a+b}+\frac{a}{a+c} \) (Eq.43) | \( =1+\left(\frac{a^2-bc}{\left(a+b\right)\left(a+c\right)}\right)=1+{S}_{McConnaughey} \) (Eq.39) |
6 | \( {S}_{Dice-1/ Czekanowski}=\frac{2a}{2a+b+c} \) (Eq.3) | \( =2\frac{a}{2a+b+c}=2 \times {S}_{Dice-2} \) (Eq.2) |
7 | \( {S}_{Innerproduct}=a+d \) (Eq.13) | \( =M\frac{a+d}{a+b+c+d}=M \times {S}_{Sokal\& Michener} \) (Eq.7) |
7 | \( {S}_{Hamann}=\frac{\left(a+d\right)-\left(b+c\right)}{a+b+c+d} \) (Eq.67) | \( =2\left(\frac{a+d}{a+b+c+d}\right)-1=\left[2 \times {S}_{Sokal\& Michener}\right]-1 \) (Eq.7) |

The equations that include the negative match quantity *d* and those that exclude it are spread throughout all the clusters. This result indicates that the equations cannot be grouped based on the presence of the negative match quantity *d*.

### ROC analysis of selected equations

The ROC curves were created for each binary similarity/dissimilarity equation to compare their performance. Initially, we normalized the similarity and dissimilarity coefficients, such that their minimum becomes 0 and their maximum becomes 1, before using them to create the ROC curves. For equations that measure dissimilarity, we transformed the normalized dissimilarity coefficient *D* into a similarity coefficient *S* for the sake of comparison by using the equation \( S=1-{D}^2 \) [40, 41].
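The normalization and the dissimilarity-to-similarity transformation can be sketched in a few lines of Python. The function names and the example Hamming distances are hypothetical.

```python
def min_max(values):
    """Rescale values so that the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all coefficients are equal; cannot normalize")
    return [(v - lo) / (hi - lo) for v in values]


def dissimilarity_to_similarity(d_normalized):
    """Apply S = 1 - D^2 to normalized dissimilarity coefficients."""
    return [1 - d ** 2 for d in d_normalized]


# Hypothetical Hamming distances (b + c) for four formula pairs
hamming = [0, 2, 5, 10]
d_norm = min_max(hamming)                   # approximately [0.0, 0.2, 0.5, 1.0]
s = dissimilarity_to_similarity(d_norm)     # approximately [1.0, 0.96, 0.75, 0.0]
print(d_norm, s)
```

After this step every equation, whether it measures similarity or dissimilarity, yields values in [0, 1] where larger means more similar, so the same thresholds can be applied to all of them.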

For the Jamu data, we started the ROC analysis of the selected equations by classifying the Jamu pairs into match and mismatch classes based on their efficacies. A Jamu pair belongs to the match class if both Jamu formulas of the pair have the same efficacy, and to the mismatch class if their efficacies are different. The numbers of Jamu pairs in the match and mismatch classes are 646,728 and 4,253,287, respectively. The mismatch class is therefore much larger than the match class. This imbalance is a challenge when assessing the capability of the equations to separate Jamu pairs into match and mismatch classes. To handle this condition, we created 20 mismatch classes, each equal in size to the match class, by random sampling of the mismatch class Jamu pairs according to the bootstrap method [67]. Every equation was then iteratively evaluated by using those datasets as mismatch class data.
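A minimal sketch of the balancing step is given below. It assumes that the bootstrap sampling draws with replacement, which is the usual meaning of the term; the function name, the seed, and the toy data are hypothetical.

```python
import random


def balanced_mismatch_samples(mismatch_pairs, match_size, n_samples=20, seed=0):
    """Draw n_samples bootstrap samples from the mismatch class.

    Each sample has the same size as the match class and is drawn with
    replacement (an assumption about the bootstrap procedure).
    """
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    return [rng.choices(mismatch_pairs, k=match_size) for _ in range(n_samples)]


# Toy data: 1000 mismatch-class similarity values and a match class of size 50
mismatch = [i / 1000 for i in range(1000)]
samples = balanced_mismatch_samples(mismatch, match_size=50)
print(len(samples), len(samples[0]))  # 20 50
```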

Our objective is to assess the capability of the equations to separate the Jamu pairs into match and mismatch efficacy classes based on their similarity coefficients using ROC analysis. To create the ROC curve corresponding to an equation, we need the distributions of the match class and mismatch class Jamu pairs with respect to the similarity values calculated by the equation. We divided the range of the similarity coefficient into 100 equal intervals, and the lower limit of each interval was considered as a threshold. For every threshold, *TP* and *FN* were determined from the distribution of the match class, and *FP* and *TN* were determined from the distribution of the mismatch class. In our case, *TP* and *FP* are the numbers of Jamu pairs with a similarity value larger than or equal to the threshold, and *FN* and *TN* are the numbers of Jamu pairs with a similarity value smaller than the threshold. *FPR* and *TPR* were then calculated for every threshold using Eq. 82. We produced the ROC curve by plotting the resulting *FPR* on the *x*-axis and *TPR* on the *y*-axis. In a perfect or ideal classification, the ROC curve follows the vertical line from (0,0) to (0,1) and then the horizontal line up to (1,1). In the case of random data, the ROC curve follows the diagonal line from (0,0) to (1,1). In the case of real data, the ROC curve usually lies above the diagonal line. The point (0,1) is the optimum classification point, where *FPR* is zero and *TPR* is one, and hence it will be referred to as the ‘optimum point’. The performance of a classifier was assessed either by measuring the minimum distance from the optimum point to the curve or by measuring the AUC. The lower the minimum distance, the better the performance of the classifier; the larger the AUC, the better the performance of the classifier.
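The threshold sweep described above can be sketched in Python as follows. This is an illustration only and not the ROCR-based computation; the trapezoidal rule for the AUC and the explicit (0, 0) anchor point are assumptions of this sketch, and the function names and scores are hypothetical.

```python
def roc_points(match_scores, mismatch_scores, n_intervals=100):
    """Return (FPR, TPR) points using the lower limit of each interval as a threshold."""
    all_scores = list(match_scores) + list(mismatch_scores)
    lo, hi = min(all_scores), max(all_scores)
    step = (hi - lo) / n_intervals
    points = [(0.0, 0.0)]  # anchor so that the curve starts at the origin
    for i in range(n_intervals):
        threshold = lo + i * step
        tp = sum(s >= threshold for s in match_scores)     # match pairs called positive
        fp = sum(s >= threshold for s in mismatch_scores)  # mismatch pairs called positive
        points.append((fp / len(mismatch_scores), tp / len(match_scores)))
    return sorted(points)


def auc(points):
    """Area under the curve by the trapezoidal rule over sorted (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Perfectly separated classes give an AUC of 1; identical classes give 0.5
perfect = auc(roc_points([0.8, 0.9, 1.0], [0.0, 0.1, 0.2]))
chance = auc(roc_points([0.1, 0.5, 0.9], [0.1, 0.5, 0.9]))
print(perfect, chance)
```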

Both measures were computed from the (*FPR*, *TPR*) points for all 45 selected equations. In addition, we created 20 ROC curves for each equation, each built from the match-class Jamu pairs and one of the 20 mismatch-class samples. We thus obtained 20 AUC values for each equation and averaged them to determine its overall AUC. The ROCR package [72] was used to calculate the AUC values. Table 4 shows the results of the ROC analysis, together with the Kappa scores, for the Jamu data. The scatter plot of the minimum distances and mean AUCs of the 45 equations is shown for both datasets in Fig. 4. Based on the scatter plot for the Jamu data in Fig. 4a, the 45 equations are empirically divided into four groups (C1, C2, C3, and C4). The equations that perform well under both approaches fall in C1, which consists of Eqs. 48, 49, 54, 68, and 79. The Michael similarity (Eq. 68) produces the lowest minimum distance, and the Forbes-2 similarity (Eq. 48) produces the highest AUC. The ROC curves generated using the Michael and Forbes-2 similarities for all datasets are shown in Fig. 5. As expected, the ROC curves for all random datasets follow the diagonal, whereas the curve for the Jamu data lies above it. Most of the equations with the highest AUC values are similarity-measuring equations, and these equations belong to cluster III in Fig. 3. Among the dissimilarity-measuring equations, the Lance & Williams distance (Eq. 27) produces the highest AUC value.
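Averaging the AUC over the 20 balanced datasets can be sketched as below. The paper reports using the ROCR package in R; this Python sketch instead computes the AUC through its rank (Mann-Whitney) interpretation, which is a swapped-in technique chosen for brevity. The function names, the seed, and sampling with replacement are assumptions.

```python
import numpy as np


def auc_rank(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.sort(np.asarray(neg_scores, dtype=float))
    below = np.searchsorted(neg, pos, side="left")      # negatives strictly below
    below_or_equal = np.searchsorted(neg, pos, side="right")
    ties = below_or_equal - below
    return float((below.sum() + 0.5 * ties.sum()) / (len(pos) * len(neg)))


def mean_auc_over_samples(match_scores, mismatch_scores, n_samples=20, seed=0):
    """Mean and standard deviation of the AUC over balanced mismatch samples."""
    rng = np.random.default_rng(seed)
    match_scores = np.asarray(match_scores, dtype=float)
    mismatch_scores = np.asarray(mismatch_scores, dtype=float)
    aucs = []
    for _ in range(n_samples):
        # One balanced mismatch sample, the same size as the match class.
        sample = rng.choice(mismatch_scores, size=len(match_scores), replace=True)
        aucs.append(auc_rank(match_scores, sample))
    return float(np.mean(aucs)), float(np.std(aucs))


# Toy scores: the match class is shifted slightly towards higher similarity.
rng = np.random.default_rng(2)
match = rng.normal(0.55, 0.2, size=300)
mismatch = rng.normal(0.45, 0.2, size=3000)
print(mean_auc_over_samples(match, mismatch))
```

The standard deviation returned alongside the mean corresponds to the kind of spread reported for the mean AUCs in the tables.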

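Table 4. ROC analysis and Cohen’s Kappa scores for the Jamu data. S/D indicates a similarity (S) or dissimilarity (D) measure, and Y in the Incl. column marks an equation that includes the negative match quantity *d*. The value in brackets in the minimum distance and mean Kappa columns is the rank of the equation when the table is ordered by that column. The standard deviations of both metrics are small and similar: between $2 \times 10^{-4}$ and $4 \times 10^{-4}$ for the mean AUCs and between $0$ and $6 \times 10^{-4}$ for the mean Kappa scores.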

No | Equation | S/D | Incl. | ROC analysis: mean AUC | ROC analysis: min. distance | Cohen’s Kappa: mean |
---|---|---|---|---|---|---|
1 | Eq. 48 | S | Y | 0.616 | 0.587 (3) | 0.088 (13) |
2 | Eq. 74 | S | Y | 0.613 | 0.599 (29) | 0.024 (28) |
3 | Eq. 49 | S | Y | 0.613 | 0.588 (4) | 0.076 (15) |
4 | Eq. 54 | S | Y | 0.611 | 0.590 (5) | 0.074 (19) |
5 | Eq. 44 | S | Y | 0.611 | 0.599 (19) | 0.073 (21) |
6 | Eq. 66 | S | Y | 0.611 | 0.599 (26) | 0.023 (31) |
7 | Eq. 68 | S | Y | 0.610 | 0.583 (1) | 0.024 (29) |
8 | Eq. 79 | S | Y | 0.610 | 0.583 (2) | 0.090 (11) |
9 | Eq. 78 | S | | 0.609 | 0.599 (28) | 0.092 (8) |
10 | Eq. 46 | S | | 0.609 | 0.599 (20) | 0.065 (23) |
11 | Eq. 01 | S | | 0.609 | 0.599 (10) | 0.052 (24) |
12 | Eq. 04 | S | | 0.609 | 0.599 (11) | 0.089 (12) |
13 | Eq. 06 | S | | 0.609 | 0.599 (12) | 0.036 (27) |
14 | Eq. 27 | D | | 0.609 | 0.599 (14) | 0.109 (7) |
15 | Eq. 02 | S | | 0.609 | 0.599 (8) | 0.074 (20) |
16 | Eq. 36 | S | | 0.608 | 0.600 (31) | 0.040 (25) |
17 | Eq. 29 | D | | 0.608 | 0.599 (15) | 0.076 (16) |
18 | Eq. 31 | S | | 0.608 | 0.599 (16) | 0.076 (17) |
19 | Eq. 57 | S | Y | 0.608 | 0.599 (22) | 0.076 (18) |
20 | Eq. 71 | S | Y | 0.608 | 0.599 (9) | 0.152 (6) |
21 | Eq. 39 | S | | 0.607 | 0.599 (17) | 0.078 (14) |
22 | Eq. 62 | D | Y | 0.606 | 0.599 (24) | 0.185 (1) |
23 | Eq. 63 | S | Y | 0.606 | 0.599 (25) | 0.167 (5) |
24 | Eq. 55 | S | Y | 0.606 | 0.599 (21) | 0.180 (3) |
25 | Eq. 61 | S | Y | 0.606 | 0.599 (23) | 0.183 (2) |
26 | Eq. 40 | S | Y | 0.605 | 0.599 (18) | 0.180 (4) |
27 | Eq. 34 | S | | 0.605 | 0.600 (30) | 0.024 (30) |
28 | Eq. 45 | S | | 0.605 | 0.599 (7) | 0.091 (10) |
29 | Eq. 52 | S | Y | 0.604 | 0.597 (6) | 0.092 (9) |
30 | Eq. 77 | S | Y | 0.604 | 0.599 (27) | 0.067 (22) |
31 | Eq. 51 | S | Y | 0.604 | 0.602 (32) | 0.039 (26) |
32 | Eq. 12 | S | | 0.604 | 0.599 (13) | 0.022 (32) |
33 | Eq. 10 | S | Y | 0.556 | 0.656 (33) | 0.014 (34) |
34 | Eq. 35 | S | Y | 0.546 | 0.671 (34) | 0.018 (33) |
35 | Eq. 59 | S | Y | 0.545 | 0.671 (35) | 0.013 (35) |
36 | Eq. 24 | D | Y | 0.529 | 0.860 (44) | 0.000 (43) |
37 | Eq. 15 | D | | 0.529 | 0.680 (39) | 0.004 (42) |
38 | Eq. 08 | S | Y | 0.529 | 0.680 (37) | 0.010 (39) |
39 | Eq. 09 | S | Y | 0.529 | 0.680 (38) | 0.010 (36) |
40 | Eq. 16 | D | | 0.529 | 0.680 (40) | 0.010 (38) |
41 | Eq. 07 | S | Y | 0.529 | 0.680 (36) | 0.010 (37) |
42 | Eq. 25 | D | Y | 0.526 | 0.680 (41) | 0.004 (41) |
43 | Eq. 26 | D | Y | 0.517 | 0.895 (45) | 0.000 (44) |
44 | Eq. 47 | S | | 0.515 | 0.684 (42) | 0.005 (40) |
45 | Eq. 50 | S | Y | 0.466 | 0.754 (43) | -0.008 (45) |

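Table 5. ROC analysis and Cohen’s Kappa scores for the Kampo data. S/D indicates a similarity (S) or dissimilarity (D) measure, and Y in the Incl. column marks an equation that includes the negative match quantity *d*. The value in brackets in the minimum distance and mean Kappa columns is the rank of the equation when the table is ordered by that column.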

No | Equation | S/D | Incl. | ROC analysis: mean AUC | ROC analysis: SD of mean AUC | ROC analysis: min. distance | Cohen’s Kappa: mean | Cohen’s Kappa: SD of mean |
---|---|---|---|---|---|---|---|---|
1 | Eq. 79 | S | Y | 0.610 | 0.001 | 0.607 (9) | 0.069 (14) | 0.001 |
2 | Eq. 55 | S | Y | 0.609 | 0.001 | 0.604 (2) | 0.106 (1) | 0.001 |
3 | Eq. 61 | S | Y | 0.609 | 0.001 | 0.606 (5) | 0.106 (2) | 0.001 |
4 | Eq. 63 | S | Y | 0.609 | 0.001 | 0.606 (6) | 0.099 (5) | 0.001 |
5 | Eq. 62 | D | Y | 0.609 | 0.001 | 0.610 (16) | 0.101 (4) | 0.001 |
6 | Eq. 48 | S | Y | 0.608 | 0.001 | 0.608 (12) | 0.084 (9) | 0.001 |
7 | Eq. 49 | S | Y | 0.608 | 0.001 | 0.607 (11) | 0.069 (15) | 0.001 |
8 | Eq. 44 | S | Y | 0.608 | 0.001 | 0.610 (15) | 0.065 (21) | 0.001 |
9 | Eq. 54 | S | Y | 0.607 | 0.001 | 0.607 (8) | 0.066 (20) | 0.001 |
10 | Eq. 39 | S | | 0.607 | 0.002 | 0.607 (10) | 0.070 (13) | 0.001 |
11 | Eq. 57 | S | Y | 0.606 | 0.001 | 0.611 (17) | 0.067 (18) | 0.000 |
12 | Eq. 71 | S | Y | 0.606 | 0.001 | 0.608 (14) | 0.092 (6) | 0.001 |
13 | Eq. 51 | S | Y | 0.606 | 0.001 | 0.612 (18) | 0.040 (27) | 0.001 |
14 | Eq. 31 | S | | 0.606 | 0.001 | 0.612 (20) | 0.068 (17) | 0.001 |
15 | Eq. 29 | D | | 0.606 | 0.001 | 0.612 (19) | 0.068 (16) | 0.001 |
16 | Eq. 52 | S | Y | 0.606 | 0.001 | 0.608 (13) | 0.078 (10) | 0.001 |
17 | Eq. 36 | S | | 0.606 | 0.001 | 0.612 (21) | 0.042 (26) | 0.001 |
18 | Eq. 74 | S | Y | 0.605 | 0.002 | 0.606 (4) | 0.037 (29) | 0.001 |
19 | Eq. 45 | S | | 0.605 | 0.001 | 0.606 (7) | 0.086 (8) | 0.001 |
20 | Eq. 04 | S | | 0.605 | 0.001 | 0.615 (29) | 0.075 (12) | 0.001 |
21 | Eq. 27 | D | | 0.605 | 0.001 | 0.615 (30) | 0.091 (7) | 0.001 |
22 | Eq. 06 | S | | 0.605 | 0.001 | 0.618 (32) | 0.032 (40) | 0.001 |
23 | Eq. 02 | S | | 0.604 | 0.001 | 0.615 (28) | 0.065 (22) | 0.001 |
24 | Eq. 34 | S | | 0.604 | 0.001 | 0.605 (3) | 0.035 (36) | 0.001 |
25 | Eq. 01 | S | | 0.604 | 0.001 | 0.616 (31) | 0.047 (24) | 0.001 |
26 | Eq. 40 | S | Y | 0.604 | 0.001 | 0.604 (1) | 0.102 (3) | 0.002 |
27 | Eq. 78 | S | | 0.602 | 0.001 | 0.614 (25) | 0.075 (11) | 0.001 |
28 | Eq. 46 | S | | 0.600 | 0.001 | 0.613 (23) | 0.055 (23) | 0.001 |
29 | Eq. 68 | S | Y | 0.597 | 0.001 | 0.612 (22) | 0.036 (32) | 0.001 |
30 | Eq. 66 | S | Y | 0.597 | 0.001 | 0.614 (24) | 0.035 (37) | 0.001 |
31 | Eq. 59 | S | Y | 0.591 | 0.001 | 0.614 (26) | 0.043 (25) | 0.001 |
32 | Eq. 35 | S | Y | 0.590 | 0.001 | 0.615 (27) | 0.036 (35) | 0.001 |
33 | Eq. 12 | S | | 0.590 | 0.001 | 0.621 (33) | 0.034 (38) | 0.000 |
34 | Eq. 77 | S | Y | 0.589 | 0.001 | 0.621 (34) | 0.066 (19) | 0.000 |
35 | Eq. 10 | S | Y | 0.584 | 0.001 | 0.630 (35) | 0.036 (31) | 0.001 |
36 | Eq. 26 | D | Y | 0.568 | 0.001 | 0.653 (43) | 0.015 (43) | 0.001 |
37 | Eq. 24 | D | Y | 0.564 | 0.001 | 0.651 (42) | 0.017 (42) | 0.001 |
38 | Eq. 25 | D | Y | 0.564 | 0.001 | 0.650 (36) | 0.032 (41) | 0.001 |
39 | Eq. 08 | S | Y | 0.564 | 0.001 | 0.651 (38) | 0.036 (33) | 0.001 |
40 | Eq. 16 | D | | 0.564 | 0.001 | 0.651 (41) | 0.037 (30) | 0.001 |
41 | Eq. 15 | D | | 0.563 | 0.001 | 0.651 (40) | 0.032 (39) | 0.001 |
42 | Eq. 07 | S | Y | 0.563 | 0.001 | 0.651 (37) | 0.036 (34) | 0.001 |
43 | Eq. 09 | S | Y | 0.563 | 0.001 | 0.651 (39) | 0.037 (28) | 0.001 |
44 | Eq. 47 | S | | 0.518 | 0.001 | 0.683 (44) | 0.010 (44) | 0.001 |
45 | Eq. 50 | S | Y | 0.501 | 0.001 | 0.702 (45) | -0.004 (45) | 0.000 |

For both Jamu and Kampo pairs, the negative match quantity *d* is much larger than the positive match *a* and the absence mismatches *b* and *c*. One of our objectives is to understand the effect of *d* on the similarity/dissimilarity coefficients between herbal medicines. Among the equations that do not include *d*, the Simpson similarity (Eq. 45) and the Forbes-1 similarity (Eq. 34) produce the lowest minimum distance for the Jamu and Kampo data, respectively, while the Derived Jaccard similarity (Eq. 78) and the McConnaughey similarity (Eq. 39) produce the highest AUC for the Jamu and Kampo data, respectively. Out of the 79 equations in Table 1, 46 use *d* in their expressions. Interestingly, the equations that include *d* perform better in measuring similarity/dissimilarity in both datasets. The best-performing equations for the Jamu data in terms of minimum distance and mean AUC are Eqs. 68 and 48, respectively, both of which include *d*. Likewise, the best equations for the Kampo data (Eq. 79 for mean AUC and Eq. 40 for minimum distance) include *d*. Moreover, the five top-ranked equations by mean AUC for each dataset all include *d*. Cohen’s Kappa, used as an additional measure of classifier performance, gives a consistent result: the five equations with the largest Kappa scores also include *d* (Tables 4 and 5). This implies that the similarity between Jamu pairs and between Kampo pairs is influenced by the negative matches. This result supports the finding of Zhang et al. [20] that all possible matches $S_{ij}$, where $i, j \in \{0, 1\}$, should be considered for better classification results. Finally, for measuring the performance of binary similarity/dissimilarity equations, the AUC of the ROC curve is preferable to the minimum distance because it takes all (*FPR*, *TPR*) points into account, not only the single point closest to the optimum point.
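To make the role of *d* concrete, the sketch below counts the four matching quantities for two binary vectors and compares one coefficient that ignores *d* with one that uses it. It is only an illustration: the Jaccard and simple matching coefficients are chosen here as familiar examples and are not tied to specific equation numbers in this paper, the convention that *b* counts features present only in the first vector and *c* those present only in the second is assumed, and the vectors are made up.

```python
import numpy as np


def otu_counts(x, y):
    """Return (a, b, c, d) for two binary vectors of equal length.

    a: present in both, b: present in x only,
    c: present in y only, d: absent from both.
    """
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    a = int(np.sum(x & y))
    b = int(np.sum(x & ~y))
    c = int(np.sum(~x & y))
    d = int(np.sum(~x & ~y))
    return a, b, c, d


def jaccard(a, b, c, d):
    """Jaccard similarity; ignores the negative matches d."""
    return a / (a + b + c) if (a + b + c) > 0 else float("nan")


def simple_matching(a, b, c, d):
    """Simple matching similarity; counts negative matches d as agreement."""
    return (a + d) / (a + b + c + d)


# Two hypothetical formulas described by the presence of 8 ingredients.
formula_1 = [1, 1, 0, 0, 0, 0, 0, 0]
formula_2 = [1, 0, 1, 0, 0, 0, 0, 0]
a, b, c, d = otu_counts(formula_1, formula_2)
print((a, b, c, d), jaccard(a, b, c, d), simple_matching(a, b, c, d))
```

Because herbal formulas use only a few of the many possible ingredients, *d* dominates the counts, so coefficients that include it can behave very differently from those that do not.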

## Conclusions

Different binary similarity and dissimilarity measures yield different similarity/dissimilarity coefficients, which in turn cause differences in downstream analyses such as clustering. Hence, determining appropriate binary similarity and dissimilarity coefficients is an essential aspect of big data analysis in diverse areas of scientific research, including chemometrics and bioinformatics. In this study, we presented an organized way to select a suitable equation for studying the relationship between herbal medicine formulas in Indonesian Jamu and Japanese Kampo. We started by collecting 79 binary similarity and dissimilarity equations from the literature. In the early stages, we removed algebraically redundant equations and equations that produced invalid values or relatively similar coefficients when applied to our datasets. In addition, we eliminated some equations based on agglomerative hierarchical clustering because they were very closely related to other equations in the same cluster. Finally, we selected 45 unique equations that produced different coefficients for our analysis. ROC curve analysis was then performed to assess the capability of these equations to separate herbal medicine pairs having the same efficacy from pairs having different efficacies. The experimental results show that the binary similarity and dissimilarity measures that include the negative match quantity *d* in their expressions separate herbal medicine pairs better than equations that exclude *d*. Moreover, we obtained different rankings of the binary equations for the different datasets, i.e. the Jamu and Kampo data. This result indicates that the selection of binary similarity and dissimilarity measures is data dependent and that the measures should be chosen according to the data to be processed. For the Jamu data, the largest AUC value is obtained by the Forbes-2 similarity, whereas the Variant of Correlation similarity is recommended for classifying Kampo pairs into match and mismatch classes. The procedure followed in this work can also be used to find suitable binary similarity and dissimilarity measures under similar situations in other applications.

## Notes

### Acknowledgements

Not applicable.

## Funding

This work was supported by the National Bioscience Database Center in Japan; the Ministry of Education, Culture, Sports, Science, and Technology of Japan; the US National Science Foundation and Japan Science and Technology Agency [Strategic International Collaborative Research Program ‘Metabolomics for a Low Carbon Society’]; the National Bioscience Database Center in Japan and NAIST Big Data Project.

## Availability of data and materials

The simulated dataset(s) supporting the conclusions of this article are available in KNApSAcK Family Databases (http://kanaya.naist.jp/KNApSAcK_Family/).

## Authors’ contributions

SW conducted the primary investigation, carried out the experiments, developed bmeasures package, and drafted the manuscript; SW, MA and SK designed the proposed method; FA provided Jamu-Species relations; MA and IB aided in the manuscript development; LD and SK supervised the study and participated in the manuscript. All authors read and approved the manuscript.

## Competing interests

The authors declare that they have no competing interests.

## Consent for publication

Not applicable.

## Ethics approval and consent to participate

Not applicable.

## References

- 1. Consonni V, Todeschini R. New similarity coefficients for binary data. Match-Communications Math Comput Chem. 2012;68:581–92.
- 2. Legendre P, Legendre L. Numerical ecology. 2nd ed. Amsterdam: Elsevier Science; 1998.
- 3. Batagelj V, Bren M. Comparing resemblance measures. J Classif. 1995;12:73–90.
- 4. Afendi FM, Darusman LK, Hirai A, Altaf-Ul-Amin M, Takahashi H, Nakamura K, Kanaya S. System biology approach for elucidating the relationship between Indonesian herbal plants and the efficacy of Jamu. In: Proceedings - IEEE International Conference on Data Mining, ICDM. IEEE; 2010. p. 661–8.
- 5. Afendi FM, Okada T, Yamazaki M, Hirai-Morita A, Nakamura Y, Nakamura K, Ikeda S, Takahashi H, Altaf-Ul-Amin M, Darusman LK, Saito K, Kanaya S. KNApSAcK family databases: Integrated metabolite-plant species databases for multifaceted plant research. Plant Cell Physiol. 2012;53:e1(1–12).
- 6. Auer J, Bajorath J. Molecular similarity concepts and search calculations. In: Keith JM, editor. Bioinformatics volume II: Structure, function and applications (Methods in molecular biology), vol. 453. Totowa: Humana Press; 2008. p. 327–47.
- 7. Kedarisetti P, Mizianty MJ, Kaas Q, Craik DJ, Kurgan L. Prediction and characterization of cyclic proteins from sequences in three domains of life. Biochim Biophys Acta - Proteins Proteomics. 2014;1844(1 PART B):181–90.
- 8. Zhou T, Shen N, Yang L, Abe N, Horton J, Mann RS, Bussemaker HJ, Gordân R, Rohs R. Quantitative modeling of transcription factor binding specificities using DNA shape. Proc Natl Acad Sci. 2015;112:4654–9.
- 9. Tibshirani R, Hastie T, Narasimhan B, Soltys S, Shi G, Koong A, Le QT. Sample classification from protein mass spectrometry, by “peak probability contrasts”. Bioinformatics. 2004;20:3034–44.
- 10. Pinoli P, Chicco D, Masseroli M. Computational algorithms to predict Gene Ontology annotations. BMC Bioinformatics. 2015;16 Suppl 6:1–15.
- 11. Kangas JD, Naik AW, Murphy RF. Efficient discovery of responses of proteins to compounds using active learning. BMC Bioinformatics. 2014;15:1–11.
- 12. Ohtana Y, Abdullah AA, Altaf-Ul-Amin M, Huang M, Ono N, Sato T, Sugiura T, Horai H, Nakamura Y, Morita Hirai A, Lange KW, Kibinge NK, Katsuragi T, Shirai T, Kanaya S. Clustering of 3D-structure similarity based network of secondary metabolites reveals their relationships with biological activities. Mol Inform. 2014;33:790–801.
- 13. Abe H, Kanaya S, Komukai T, Takahashi Y, Sasaki SI. Systemization of semantic descriptions of odors. Anal Chim Acta. 1990;239:73–85.
- 14. Willett P, Barnard JM, Downs GM. Chemical similarity searching. J Chem Inf Model. 1998;38:983–96.
- 15. Flower DR. On the properties of bit string-based measures of chemical similarity. J Chem Inf Model. 1998;38:379–86.
- 16. Godden JW, Xue L, Bajorath J. Combinatorial preferences affect molecular similarity/diversity calculations using binary fingerprints and Tanimoto coefficients. J Chem Inf Model. 2000;40:163–6.
- 17. Agrafiotis DK, Rassokhin DN, Lobanov VS. Multidimensional scaling and visualization of large molecular similarity tables. J Comput Chem. 2001;22:488–500.
- 18. Rojas-Cherto M, Peironcely JE, Kasper PT, van der Hooft JJJ, De Vos RCH, Vreeken RJ, Hankemeier T, Reijmers T. Metabolite identification using automated comparison of high-resolution multistage mass spectral trees. Anal Chem. 2012;84:5524–34.
- 19. Fligner MA, Verducci JS, Blower PE. A modification of the Jaccard–Tanimoto similarity index for diverse selection of chemical compounds using binary strings. Technometrics. 2002;44:110–9.
- 20. Zhang B, Srihari SN. Binary vector dissimilarity measures for handwriting identification. In: Proceedings of SPIE-IS&T Electronic Imaging, vol. 5010. 2003. p. 28–38.
- 21. Zhang B, Srihari SN. Properties of binary vector dissimilarity measures. In: Proc. JCIS Int’l Conf. Computer Vision, Pattern Recognition, and Image Processing. 2003. p. 1–4.
- 22. Kosman E, Leonard KJ. Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Mol Ecol. 2005;14(2):415–24.
- 23. Choi S-S, Cha S-H, Tappert CC. A survey of binary similarity and distance measures. J Syst Cybern Informatics. 2010;8:43–8.
- 24. Todeschini R, Consonni V, Xiang H, Holliday J, Buscema M, Willett P. Similarity coefficients for binary chemoinformatics data: Overview and extended comparison using simulated and real data sets. J Chem Inf Model. 2012;52:2884–901.
- 25. Wijaya SH, Tanaka Y, Hirai A, Afendi FM, Batubara I, Ono N, Darusman LK, Kanaya S. Utilization of KNApSAcK Family Databases for Developing Herbal Medicine Systems. J Comput Aided Chem. 2016;17:1–7.
- 26. Seminar nasional dan pameran industri Jamu [http://seminar.ift.or.id/seminar-jamu-brand-indonesia/]. Accessed 19 Aug 2014.
- 27. Wijaya SH, Husnawati H, Afendi FM, Batubara I, Darusman LK, Altaf-Ul-Amin M, Sato T, Ono N, Sugiura T, Kanaya S. Supervised clustering based on DPClusO: Prediction of plant-disease relations using Jamu formulas of KNApSAcK database. Biomed Res Int. 2014;2014:1–15.
- 28. Okada T, Afendi FM, Yamazaki M, Chida KN, Suzuki M, Kawai R, Kim M, Namiki T, Kanaya S, Saito K. Informatics framework of traditional Sino-Japanese medicine (Kampo) unveiled by factor analysis. J Nat Med. 2016;70:107–14.
- 29. da Silva MA, Garcia AAF, Pereira de Souza A, Lopes de Souza C. Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L). Genet Mol Biol. 2004;27:83–91.
- 30. Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30.
- 31. Lim T, Loh W, Shih Y. A comparison of prediction accuracy, complexity, and training time of thirty three old and new classification algorithms. Mach Learn. 2000;40:203–29.
- 32. Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978;8:283–98.
- 33. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proc 23rd Int Conf Mach Learn (ICML’06). 2006. p. 233–40.
- 34. Manning CD, Schütze H. Foundations of statistical natural language processing. Cambridge: MIT Press; 1999.
- 35. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
- 36. Ben-David A. A lot of randomness is hiding in accuracy. Eng Appl Artif Intell. 2007;20:875–85.
- 37. Ben-David A. About the relationship between ROC curves and Cohen’s kappa. Eng Appl Artif Intell. 2008;21:874–82.
- 38. Genes and diseases [http://www.ncbi.nlm.nih.gov/books/NBK22185/]. Accessed 20 May 2016.
- 39. Clifford HT, Stephenson W. An Introduction to Numerical Classification. New York: Academic; 1975.
- 40. Warrens MJ. Similarity coefficients for binary data: properties of coefficients, coefficient matrices, multi-way metrics and multivariate coefficients. Psychometrics and Research Methodology Group, Leiden University Institute for Psychological Research, Faculty of Social Sciences, Leiden University; 2008.
- 41. Jackson DA, Somers KM, Harvey HH. Similarity coefficients: Measures of co-occurrence and association or simply measures of occurrence? Am Nat. 1989;133:436–53.
- 42. Dalirsefat SB, da Silva MA, Mirhoseini SZ. Comparison of similarity coefficients used for cluster analysis with amplified fragment length polymorphism markers in the silkworm, Bombyx mori. J Insect Sci. 2009;9:1–8.
- 43. Jaccard P. The distribution of the flora in the alpine zone. New Phytol. 1912;11:37–50.
- 44. Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945;26:297–302.
- 45. Hubalek Z. Coefficients of association and similarity, based on binary (presence-absence) data: An evaluation. Biol Rev. 1982;57:669–89.
- 46. Cheetham AH, Hazel JE. Binary (presence-absence) similarity coefficients. J Paleontol. 1969;43:1130–6.
- 47. Cha S, Choi S, Tappert C. Anomaly between Jaccard and Tanimoto coefficients. In: Proceedings of Student-Faculty Research Day, CSIS, Pace University. 2009. p. 1–8.
- 48. Cha S-H, Tappert CC, Yoon S. Enhancing Binary Feature Vector Similarity Measures. 2005.
- 49. Lourenco F, Lobo V, Bacao F. Binary-Based Similarity Measures for Categorical Data and Their Application in Self-Organizing Maps. 2004.
- 50. Ojurongbe TA. Comparison of different proximity measures and classification methods for binary data. Faculty of Agricultural Sciences, Nutritional Sciences and Environmental Management, Justus Liebig University Gießen; 2012.
- 51. Johnson SC. Hierarchical clustering schemes. Psychometrika. 1967;32:241–54.
- 52. Michael EL. Marine ecology and the coefficient of association: A plea in behalf of quantitative biology. J Ecol. 1920;8:54–9.
- 53. Stiles HE. The association factor in information retrieval. J ACM. 1961;8(2):271–9.
- 54. Nei M, Li W-H. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci U S A. 1979;76:5269–73.
- 55. Holliday JD, Hu C-Y, Willett P. Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings. Comb Chem High Throughput Screen. 2002;5:155–66.
- 56. Boyce RL, Ellison PC. Choosing the best similarity index when performing fuzzy set ordination on binary data. J Veg Sci. 2001;12:711–20.
- 57. Faith DP. Asymmetric binary similarity measures. Oecologia. 1983;57:287–90.
- 58. Gower JC, Legendre P. Metric and Euclidean properties of dissimilarity coefficients. J Classif. 1986;3:5–48.
- 59. Chang J, Chen R, Tsai S. Distance-preserving mappings from binary vectors to permutations. IEEE Trans Inf Theory. 2003;49:1054–9.
- 60. Lance GN, Williams WT. Computer Programs for Hierarchical Polythetic Classification (“Similarity Analyses”). Comput J. 1966;9:60–4.
- 61. Avcibaş I, Kharrazi M, Memon N, Sankur B. Image steganalysis with binary similarity measures. EURASIP J Appl Signal Processing. 2005;17:2749–57.
- 62. Baroni-Urbani C, Buser MW. Similarity of binary data. Syst Biol. 1976;25:251–9.
- 63. Frigui H, Krishnapuram R. Clustering by competitive agglomeration. Pattern Recognit. 1997;30:1109–19.
- 64. Cimiano P, Hotho A, Staab S. Comparing conceptual, divisive and agglomerative clustering for learning taxonomies from text. In: ECAI 2004: Proceedings of the 16th European Conference on Artificial Intelligence, vol. 110. 2004. p. 435–9.
- 65. Bolshakova N, Azuaje F. Cluster validation techniques for genome expression data. Signal Process. 2003;83:825–33.
- 66. Bien J, Tibshirani R. Hierarchical clustering with prototypes via minimax linkage. J Am Stat Assoc. 2011;106(495):1075–84.
- 67. Sonego P, Kocsor A, Pongor S. ROC analysis: Applications to the classification of biological sequences and 3D structures. Brief Bioinform. 2008;9:198–209.
- 68. Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett. 2006;27:861–74.
- 69. Li M, Chen J, Wang J, Hu B, Chen G. Modifying the DPClus algorithm for identifying protein complexes based on new topological structures. BMC Bioinformatics. 2008;9:1–16.
- 70. Gorunescu F. Data Mining: Concepts, models and techniques. Berlin Heidelberg: Springer Science & Business Media; 2011.
- 71. Carey VJ, Huber W, Irizarry RA, Dudoit S. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer; 2005.
- 72. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: Visualizing classifier performance in R. Bioinformatics. 2005;21:3940–1.
- 73. Gelbard R, Goldman O, Spiegler I. Investigating diversity of clustering methods: an empirical comparison. Data Knowl Eng. 2007;63:155–66.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.