
Information Systems Frontiers, Volume 15, Issue 1, pp 99–110

“Padding” bitmaps to support similarity and mining

  • Roy Gelbard
Article

Abstract

This paper presents a novel approach to bitmap-indexing for data mining purposes. Bitmap-indexing enables efficient data storage and retrieval, but is limited in terms of similarity measurement, and hence in terms of classification, clustering and data mining. Bitmap-indexes mainly fit nominal discrete attributes and are therefore unattractive for widespread use, which requires the ability to handle continuous data in a raw format. The current research describes a scheme for representing ordinal and continuous data by applying the concept of “padding”, where each discrete nominal data value is transformed into a range of nominal-discrete values. This “padding” is done by adding adjacent bits “around” the original value (bin). The padding factor, i.e., the number of adjacent bits added, is calculated from the first and second derivatives of each attribute’s domain distribution. The padded representation better supports similarity measures, and therefore improves the accuracy of clustering and mining. The advantages of padded bitmaps are demonstrated on Fisher’s Iris dataset.

Keywords

Bitmap-index · Data representation · Similarity index · Cluster analysis · Classification · Data mining

1 Introduction

Classification and clustering, as well as other data mining techniques, usually set up hypotheses to assign different objects to groups and classes based on the similarity/distance between them (Estivill-Castro and Yang 2004; Jain and Dubes 1988; Jain et al. 1999; Lim et al. 2000; Zhang and Srihari 2004). These techniques are widely used in numerous fields such as medical diagnosis, technical support, direct marketing, customer segmentation, fraud detection, bioinformatics, etc.

Bitmap-indexing enables efficient data storage and retrieval (Spiegler and Maayan 1985; O’Neil 1987; Chan and Ioannidis 1998; Johnson 1999), as well as clustering and mining (Erlich et al. 2002; Gelbard et al. 2007). However, bitmap-indexing by means of a binary data representation (assigning a ‘1’ or ‘0’ to each possible value of each attribute) is restricted to nominal discrete attributes, which severely limits its use, since widespread use requires the ability to handle continuous data in a raw format (Zhang and Srihari 2003). A bitmap-index representation does not preserve the natural numeric capability to “bind” close numerical values, which is fundamental to similarity-distance calculations as well as to classification and data mining techniques. A similar problem exists for high-dimensional data attributes such as identifiers (IDs) or product numbers; Perlich and Provost (2006) suggest an interesting solution for such attributes.

The current research describes and verifies a scheme for representing ordinal and continuous data by applying the concept of “padding” bins: each data value is first transformed into one of a range of nominal-discrete values (binning), and then adjacent bits are added “around” the original value (bin). The padding factor, i.e., the number of adjacent bits added, is based on the first and second derivatives of each attribute’s domain distribution. The main benefit of the padded bitmap format is the improved accuracy of similarity measures, which in turn leads to better clustering and mining.

The impact of the padded bitmap format on the accuracy of clustering and data mining is tested on Fisher’s Iris dataset (Fisher 1936), which is often referenced as a baseline in the field of classification and data mining. In the experiment, nine clustering algorithms were executed on the regular representation of the Iris dataset. Then the dataset was transformed into a padded bitmap format, and re-clustered using the same algorithms. Comparison of the results demonstrates the advantages of the new approach for handling ordinal and continuous data attributes in classification, clustering and mining.

2 Theoretical background

2.1 Bitmap-index

Bitmap-indexing is well known and widely used in database technologies such as DB2 and Oracle (O’Neil 1987; Oracle 1993), as well as in data warehouse technologies such as Sybase IQ and others (Chan and Ioannidis 1998; Oracle 2001). A bitmap-index creates a storage scheme in which data appear in binary form rather than the common numeric and alphanumeric formats. The dataset is viewed as a two-dimensional matrix that relates entities to all attribute values these entities may assume. The rows represent entities and the columns represent possible values such that entries in the matrix are either ‘1’ or ‘0’, indicating that a given entity (e.g., record, object) has or does not have a given value, respectively (Spiegler and Maayan 1985).

A formal definition

Suppose we have n entities. For each entity, we construct a binary vector that represents the values of its attributes in binary form, as follows. Suppose that for each entity i (i = 1, 2, ..., n) we have m attributes, \(a_1, a_2, \ldots, a_m\). The domain of each attribute \(a_j\) is the set of all its possible values, where \(p_j\) is the domain size. We assume that for each attribute \(a_j\) (j = 1, 2, ..., m), its domain consists of \(p_j\) mutually exclusive possible values; i.e., for each attribute \(a_j\), an entity can attain exactly one of its \(p_j\) domain values. Denoting the k-th value of attribute \(a_j\) (j = 1, 2, ..., m; k = 1, 2, ..., \(p_j\)) by \(a_{jk}\), we can represent the domain attributes vector of all possible values of all m attributes as: \((a_{11}, a_{12}, \ldots, a_{1p_1}, a_{21}, a_{22}, \ldots, a_{2p_2}, \ldots, a_{m1}, a_{m2}, \ldots, a_{mp_m})\)

Denoting the length of the domain attributes vector by p, we have: \( p = \sum\limits_{j = 1}^m {p_j} \)

We define the binary vector, of length p, for each entity i (i = 1, 2, ..., n) in the following way:
$$ x_{ijk} = \begin{cases} 1 & \text{if for entity } i \text{ the value of attribute } j \text{ is } a_{jk} \\ 0 & \text{otherwise} \end{cases} \quad (i = 1, 2, \ldots, n;\; j = 1, 2, \ldots, m;\; k = 1, 2, \ldots, p_j) $$
Thus \(x_{ijk}\) is the entry corresponding to the k-th value of attribute j (\(a_{jk}\)) for entity i, where \(x_{ijk}\) is either ‘1’ or ‘0’, indicating that a given entity has or does not have the value \(a_{jk}\) for attribute j, respectively.

The binary vector, of length p, for entity i, is given by: \((x_{i11}, x_{i12}, \ldots, x_{imp_m})\)

We can express the mutual exclusivity assumption for each entity and for each attribute over its domain, for each i and j, as: \( \sum\limits_{k = 1}^{p_j} {x_{ijk}} = 1 \) (i = 1, 2, ..., n; j = 1, 2, ..., m)

This yields the sum of all the 1’s in each binary vector as the number of attributes, m, i.e., for each i, \( \sum\limits_{j = 1}^m {\sum\limits_{k = 1}^{p_j} {x_{ijk}}} = m \) (i = 1, 2, ..., n)

For example

Suppose we have entities with three (m = 3) attributes:
  • Attribute 1: Gender, with two (\(p_1 = 2\)) mutually exclusive values: M (male), F (female).

  • Attribute 2: Marital status, with four (\(p_2 = 4\)) mutually exclusive values: S (single), M (married), D (divorced), W (widowed).

  • Attribute 3: Education, with five (\(p_3 = 5\)) mutually exclusive values: 1 (elementary), 2 (high school), 3 (college), 4 (undergraduate), and 5 (graduate).

We have the domain attributes vector of length p = \(p_1 + p_2 + p_3\) = 2 + 4 + 5 = 11:
$$ \left( {a_{11}, a_{12}, a_{21}, a_{22}, a_{23}, a_{24}, a_{31}, a_{32}, a_{33}, a_{34}, a_{35}} \right) = \left( {\text{M}, \text{F}, \text{S}, \text{M}, \text{D}, \text{W}, 1, 2, 3, 4, 5} \right) $$
Now, suppose that the first entity (person), i = 1, is a married graduate man; its binary vector is then (1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1), in which all attributes are nominal discrete (Table 1).
Table 1

Bitmap representation

         Gender    Marital status     Education
         M   F     S   M   D   W      1   2   3   4   5
i = 1    1   0     0   1   0   0      0   0   0   0   1
i = 2    1   0     0   0   1   0      0   0   1   0   0
i = 3    0   1     1   0   0   0      0   0   0   0   1
i = 4    1   0     0   0   0   1      0   0   0   0   1
i = 5    1   0     1   0   0   0      1   0   0   0   0
i = 6    1   0     0   0   1   0      0   0   1   0   0
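For readers who prefer code, the construction above can be sketched in a few lines of Python. This is an illustration only (the dictionary layout and the helper bitmap_vector are ours, not part of the paper's tooling); it reproduces row i = 1 of Table 1.

```python
# Illustrative sketch only: bitmap-indexing of nominal attributes.
# The domains reproduce the Gender / Marital status / Education example.

domains = {
    "gender": ["M", "F"],                    # p1 = 2
    "marital_status": ["S", "M", "D", "W"],  # p2 = 4
    "education": [1, 2, 3, 4, 5],            # p3 = 5
}

def bitmap_vector(entity, domains):
    """Concatenate one-of-p_j indicator bits per attribute (total length p = sum of p_j)."""
    bits = []
    for attr, values in domains.items():
        bits.extend(1 if entity[attr] == v else 0 for v in values)
    return bits

# Entity i = 1: a married graduate man (row i = 1 of Table 1).
person = {"gender": "M", "marital_status": "M", "education": 5}
print(bitmap_vector(person, domains))
# [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
```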

2.2 Similarity measures

Calculating similarity among data records is a fundamental function in data mining and particularly in clustering (Gelbard and Spiegler 2000).

Hierarchical clustering algorithms commonly use the (squared) Euclidean distance as the likelihood-similarity measure, i.e., the distance between two samples is computed from the sum of the squared differences between their properties. Generally speaking, it is possible to differentiate these algorithms by means of the values assigned to the variables A, B, and C in the general formula used to calculate the likelihood-similarity between object z and two unified objects (xy), producing a distance-similarity index:
$$ D_{(xy)z} = A_x \cdot D_{xz} + A_y \cdot D_{yz} + B \cdot D_{xy} + C \cdot \left| {D_{xz} - D_{yz}} \right| $$
In each algorithm, the variables A, B and C attain different values as illustrated in Table 2.
Table 2

Values assigned to distance-similarity variables

Algorithm            Ax                           Ay                           B                       C
Group average        Nx / (Nx + Ny)               Ny / (Nx + Ny)               0                       0
Nearest neighbor     0.5                          0.5                          0                       −0.5
Furthest neighbor    0.5                          0.5                          0                       0.5
Median               0.5                          0.5                          −0.25                   0
Centroid             Nx / (Nx + Ny)               Ny / (Nx + Ny)               −Ax · Ay                0
Ward’s method        (Nz + Nx) / (Nx + Ny + Nz)   (Nz + Ny) / (Nx + Ny + Nz)   −Nz / (Nx + Ny + Nz)    0
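The general formula and the coefficients in Table 2 follow the classical Lance–Williams update scheme. The Python sketch below (ours, for illustration; the function names are assumptions) shows how the distance between a newly merged cluster (xy) and another cluster z would be updated from the table's coefficients.

```python
# Lance-Williams style update of D((x u y), z) from D(x,z), D(y,z) and D(x,y),
# using the coefficients of Table 2. Illustrative sketch only.

def lw_coefficients(method, nx, ny, nz):
    """Return (Ax, Ay, B, C) for the given linkage method and cluster sizes."""
    if method == "group_average":
        return nx / (nx + ny), ny / (nx + ny), 0.0, 0.0
    if method == "nearest_neighbor":
        return 0.5, 0.5, 0.0, -0.5
    if method == "furthest_neighbor":
        return 0.5, 0.5, 0.0, 0.5
    if method == "median":
        return 0.5, 0.5, -0.25, 0.0
    if method == "centroid":
        ax, ay = nx / (nx + ny), ny / (nx + ny)
        return ax, ay, -ax * ay, 0.0
    if method == "ward":
        n = nx + ny + nz
        return (nz + nx) / n, (nz + ny) / n, -nz / n, 0.0
    raise ValueError(method)

def updated_distance(method, dxz, dyz, dxy, nx, ny, nz):
    ax, ay, b, c = lw_coefficients(method, nx, ny, nz)
    return ax * dxz + ay * dyz + b * dxy + c * abs(dxz - dyz)

# Example: merging singletons x and y, measured against a singleton z.
print(updated_distance("nearest_neighbor", dxz=2.0, dyz=5.0, dxy=3.0, nx=1, ny=1, nz=1))
# -> 2.0 (single linkage keeps the smaller of Dxz and Dyz)
```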

However, these likelihood-similarity measures are applicable only to ordinal/continuous attributes and cannot be used to classify nominal, discrete, or categorical attributes.

For nominal attributes, similarity measures such as Dice (Dice 1945) are used. Additional nominal-similarity measures are presented and evaluated in Erlich et al. (2002) and Zhang and Srihari (2003); all of them take into account positive values alone, i.e., the ‘1’ bits. According to the Dice index, the similarity between two binary sequences is as follows:
$$ 0 \leqslant \frac{{2N_{ab}}}{{N_a + N_b}} \leqslant 1 $$
where:
  • \(N_a\) is the number of ‘1’s in sequence a.

  • \(N_b\) is the number of ‘1’s in sequence b.

  • \(N_{ab}\) is the number of ‘1’s common to both a and b.
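As a minimal illustration (our code, not part of the original study), the Dice index of two equal-length binary sequences can be computed directly from these three counts; two mutually exclusive one-bit vectors, such as the un-padded bins for 0.2 and 0.7 used later in Section 3, give a similarity of 0.

```python
def dice(a, b):
    """Dice similarity of two binary sequences: 2*Nab / (Na + Nb)."""
    na = sum(a)                                # number of '1's in a
    nb = sum(b)                                # number of '1's in b
    nab = sum(x & y for x, y in zip(a, b))     # '1's common to both
    return 2 * nab / (na + nb) if (na + nb) else 0.0

# Two mutually exclusive bins (e.g., the values 0.2 and 0.7) have similarity 0.
print(dice([0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]))   # -> 0.0
```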

3 The padding model

As seen, the bitmap representation is limited to nominal discrete attributes. To enable handling ordinal and continuous data in a binary format, we introduce a new concept called padding, used to create “padded” bitmaps. Adding padded bits does not add “noise” to the binary data, since the bits are not randomly generated; rather, padding recovers some of the implicit similarity inherent in the numerical scale that “got lost” in the bitmap-indexing. A similar way of representing numeric features is temperature coding, as discussed in Perlich and Provost (2006). The conversion of ordinal/continuous data into a padded bitmap format, i.e., into a range of nominal-discrete values, has three stages: (a) the binning stage, (b) the padding stage, and (c) determining the padding factor.

3.1 The binning stage

In this stage, each ordinal/continuous attribute is converted into a binary vector (as defined in Section 2.1). The resultant vector contains a bit for every possible value in the attribute’s value range, starting from the minimal value and ending at the maximal value. In other words, each data attribute (variable range) is partitioned into several bins, and a value of 1 is assigned to the corresponding bin and 0 to all others. For example, assume an attribute with the following values: {0.0, 0.2, 0.3, 0.7, 0.9}. The minimal value = 0.0, the maximal value = 0.9, and the minimal interval between values = 0.3 − 0.2 = 0.1. The binning stage is illustrated in Table 3.
Table 3

Illustration of the binning stage

Mutually exclusive representation

        0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
0.0      1    0    0    0    0    0    0    0    0    0
0.2      0    0    1    0    0    0    0    0    0    0
0.3      0    0    0    1    0    0    0    0    0    0
0.7      0    0    0    0    0    0    0    1    0    0
0.9      0    0    0    0    0    0    0    0    0    1

Since the described binning process always produces a finite number of bins, it can also be regarded as a special case suitable for situations where the candidate variable assumes a finite number of relevant possible categories. Such a representation is precise and preserves data accuracy; i.e., there is neither loss of information nor rounding off of any value. However, the mutual exclusivity of the binning representation causes the “isolation” of each value. Normally and intuitively, we assume that 0.2 is closer (more similar) to 0.3 than to 0.7; but in converting such values into a bitmap representation we lose those basic numerical relations. Under the Dice similarity measure, the similarity between any pair of such values is always 0; the same goes for the HD similarity measure, for which the similarity is always 0.8 for {0.2, 0.3}, as it is for any other pair.

Losing these basic similarity “intuitions” is the main drawback of the bitmap representation.
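A minimal sketch of the binning stage in Python (illustration only; the helpers make_bins and bin_vector are ours, not the paper's MatLab application), assuming the bin interval is taken as the minimal gap between observed values, as in the example above:

```python
# Binning stage: map each value to a mutually exclusive bin (Table 3 example).

def make_bins(values, interval):
    """Bin centres from the minimal to the maximal observed value at the given interval."""
    lo, hi = min(values), max(values)
    n_bins = int(round((hi - lo) / interval)) + 1
    return [round(lo + i * interval, 10) for i in range(n_bins)]

def bin_vector(value, bins, interval):
    """One '1' in the bin matching the value, '0' elsewhere (mutually exclusive)."""
    idx = int(round((value - bins[0]) / interval))
    return [1 if i == idx else 0 for i in range(len(bins))]

values = [0.0, 0.2, 0.3, 0.7, 0.9]
bins = make_bins(values, interval=0.1)      # 0.0, 0.1, ..., 0.9 -> 10 bins
print(bin_vector(0.2, bins, 0.1))           # -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```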

3.2 The padding stage

The intuition that 0.2 is closer (more similar) to 0.3 than to 0.7 is recovered by padding each bin according to the first and second derivatives of the attribute’s domain distribution, as it appears in the dataset or, if known, in the entire population.

To illustrate the concept, observe Table 4, which presents eight pairs, each containing two binary vectors. The first binary vector in line No. 1 (the upper row) represents the value 0.2 and the second represents the value 0.7. Lines No. 2 through No. 8 present seven different padding rates of the two binary vectors presented in line No. 1, as a function of the padding factor.
Table 4

Similarity values between pairs of padded vectors

#   Padding factor   The sequences                                      Dice
                     0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
1   P.F. = |0|        0    0    1    0    0    0    0    0    0    0    0.00
                      0    0    0    0    0    0    0    1    0    0
2   P.F. = |1|        0    1    1    1    0    0    0    0    0    0    0.00
                      0    0    0    0    0    0    1    1    1    0
3   P.F. = |2|        1    1    1    1    1    0    0    0    0    0    0.00
                      0    0    0    0    0    1    1    1    1    1
4   P.F. = |3|        1    1    1    1    1    1    0    0    0    0    0.33
                      0    0    0    0    1    1    1    1    1    1
5   P.F. = |4|        1    1    1    1    1    1    1    0    0    0    0.57
                      0    0    0    1    1    1    1    1    1    1
6   P.F. = |5|        1    1    1    1    1    1    1    1    0    0    0.75
                      0    0    1    1    1    1    1    1    1    1
7   P.F. = |6|        1    1    1    1    1    1    1    1    1    0    0.89
                      0    1    1    1    1    1    1    1    1    1
8   P.F. = |7|        1    1    1    1    1    1    1    1    1    1    1.00
                      1    1    1    1    1    1    1    1    1    1

Each line illustrates a different padding factor. The padding factor in line No. 2 is noted |1|, i.e., one additional ‘1’ bit on each of the right and left sides of the original bin. Similarly, the padding factor in line No. 3 is |2|, i.e., two additional ‘1’s on the right side and on the left side of the original bin; this continues up to line No. 8, in which the padding factor is |7|. The original binary vectors (presented in line No. 1) have a padding factor of |0|.

The right column in Table 4 presents the calculated similarity between the two vectors in each pair according to the Dice similarity index. The obvious issue is to determine the right padding factor to yield the best information for clustering.
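The padding stage itself amounts to dilating the single ‘1’ bit of each bin. The sketch below (our illustration, not the paper's MatLab code) pads the vectors for 0.2 and 0.7 and reproduces the Dice values in the right column of Table 4.

```python
def pad(vector, factor):
    """Set to '1' every bit within `factor` positions of an original '1' bit."""
    n = len(vector)
    padded = [0] * n
    for i, bit in enumerate(vector):
        if bit:
            for j in range(max(0, i - factor), min(n, i + factor + 1)):
                padded[j] = 1
    return padded

def dice(a, b):
    nab = sum(x & y for x, y in zip(a, b))
    return 2 * nab / (sum(a) + sum(b))

v02 = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # value 0.2
v07 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]   # value 0.7
for f in range(8):
    print(f, round(dice(pad(v02, f), pad(v07, f)), 2))
# -> 0.00, 0.00, 0.00, 0.33, 0.57, 0.75, 0.89, 1.00 (as in Table 4)
```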

3.3 Illustration of the model

The padding model is illustrated on Fisher’s Iris dataset (Fisher 1936), which is often referenced as a baseline in the field of cluster analysis and data mining. Table 5 presents, for each of the Iris attributes Petal Width (PW), Petal Length (PL), Sepal Width (SW), and Sepal Length (SL), the minimal value, the maximal value, the minimal interval between values, and the total number of bins.
Table 5

Attributes of Fisher’s Iris dataset

                  PW    PL    SW    SL
Min. value        0.1   1.0   2.0   4.3
Max. value        2.5   6.9   4.4   7.9
Min. interval     0.1   0.1   0.1   0.1
Number of bins    25    60    25    37

Figures 1, 2, 3 and 4 present probability functions (histograms) for each Iris attribute: Fig. 1 presents the Sepal Length attribute (SL), Fig. 2 the SW attribute, Fig. 3 the PW attribute, and Fig. 4 the PL attribute. Each figure contains 9 histograms (lines) where the lines present the probability function at different padding factors, starting with a padding factor of |0|, up to a padding factor of |8|. The X-axis presents the attribute’s values (bin's value), starting with the Minimal value and increasing up to the Maximal value. The Y-axis presents the number of records having the current bin value, starting from 0 and ending at 150 (the total number of specimen records in Fisher’s Iris dataset). Therefore, each line-histogram represents the probability function of a specific attribute’s values at a specific padding rate.
Fig. 1

Probability functions of the SL attribute (histogram for each padding factor)

Fig. 2

Probability functions of the SW attribute (histogram for each padding factor)

Fig. 3

Probability functions of the PW attribute (histogram for each padding factor)

Fig. 4

Probability functions of the PL attribute (histogram for each padding factor)

The histograms in Fig. 1 show that at a padding factor of |8| the probability function of the Sepal Length (SL) attribute can be regarded as a normal distribution, whereas at a padding factor of |3| we can observe several midpoints. Therefore, there is a need for a mechanism or rule to determine the right padding factor to yield the best clustering.

Figure 2 shows that, for almost all padding factors, the probability function of the Sepal Width (SW) attribute resembles a normal distribution. The definition of the right padding factor will not affect the probability function of the SW attribute, but may affect mutual relations with the other three attributes.

The histograms in Fig. 3 show that at padding factor of |2| the probability function of the Petal Width (PW) attribute indicates three noticeable values, whereas at higher factors the distinction becomes vague.

Figure 4 illustrates a different phenomenon: for all padding factors, the probability function of the Petal Length (PL) attribute cannot be regarded as a normal distribution. However, we can identify at least two noticeable values, which probably indicate different Iris species. Such a phenomenon probably suggests two clusters; however, at this point, given one attribute (PL) alone, we have no way to determine whether there are two or more clusters.

The PL attribute is also presented in an additional graphic form in Appendix B, where the entire binary strings of all entities are shown at a padding factor of |5|. In contrast to the visual inspection of Fig. 4, it portrays very clearly that the PL values of the 150 Iris specimens are distributed around three main value blocks that match the three Iris species:
  • The block of low-range values designated by the letter “A” matches Iris Setosa.

  • The block of mid-range values designated by the letter “B” matches Iris Versicolor.

  • The block of high-range values designated with the letter “C” matches Iris Verginica.

4 Determining the padding factor

As shown in Figs. 1, 2, 3 and 4, the probability function of each attribute changes with the padding factor. Hence, the probability function is the key to setting the padding factor that yields maximal ordinality and likelihood for each attribute. The derivation of the padding factor is based on the following calculations:
  • The first derivative of the attribute domain, i.e., the growth rate of the probability function as the padding factor changes. The term “Growth Rate” refers to the maximal percentage of records with a common bin value; the value of 1.00 is achieved when all 150 entities share a common bin value. For example, the PW attribute, which has 24 possible padding factors (since there are 25 possible values in this attribute), reaches the value of 1.00 at a padding factor of 12. The term “Growth Derivation” refers to the first derivative of the Growth Rate.

  • The second derivative of the attribute domain, i.e., the inflection points of the first derivative of the probability function.

  • The required padding factor is the first local minimum, i.e., the first inflection point.

As stated, the required padding factor is the first local minimum of the growth derivative of the probability function. Since the Iris dataset contains 150 samples (records), the relevant padding factors are only those for which the probability function is less than 150. Figure 5 presents the Growth Rate for all four Iris attributes.
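The following is a hedged sketch of this selection procedure as we read it (the function names and the exact local-minimum test are our assumptions, not the paper's MatLab implementation): for every candidate padding factor compute the growth rate, difference it to obtain the growth derivative, and return the first local minimum among the relevant factors.

```python
# Sketch of padding-factor selection (our interpretation of Section 4).
# bin_indices: the bin index of each record for one attribute (e.g., 0..24 for PW).

def growth_rate(bin_indices, n_bins, factor):
    """Maximal fraction of records whose padded bin range covers a common bin."""
    counts = [0] * n_bins
    for idx in bin_indices:
        for j in range(max(0, idx - factor), min(n_bins, idx + factor + 1)):
            counts[j] += 1
    return max(counts) / len(bin_indices)

def padding_factor(bin_indices, n_bins):
    rates = [growth_rate(bin_indices, n_bins, f) for f in range(n_bins)]
    # Only factors where fewer than all records share a common bin are relevant.
    limit = next((f for f, r in enumerate(rates) if r >= 1.0), len(rates))
    deriv = [rates[f + 1] - rates[f] for f in range(limit - 1)]   # "Growth Derivation"
    # Required padding factor: the first local minimum of the growth derivative.
    for f in range(1, len(deriv) - 1):
        if deriv[f] <= deriv[f - 1] and deriv[f] <= deriv[f + 1]:
            return f
    return 0
```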
Fig. 5

Percentage of records with a common value at all padding factors

Figure 6 presents the Growth Derivation, indicating the first inflection point of each attribute, i.e., the second derivative, which determines the padding factor for each attribute.
Fig. 6

Growth derivation of all four Iris attributes (SL, SW, PL, PW)

In light of the above, the padding factors of the Iris attributes are as follows:
  • PW: padding factor of |2|, i.e., each value is represented by 5 bins.

  • PL: padding factor of |5|, i.e., each value is represented by 11 bins.

  • SW: padding factor of |6|, i.e., each value is represented by 13 bins.

  • SL: padding factor of |3|, i.e., each value is represented by 7 bins.

5 Clustering continuous data vs. padded bitmap data

The proposed model was evaluated using Fisher’s Iris dataset (Fisher 1936). Nine clustering algorithms were executed on the regular Iris dataset (continuous data). Then the dataset was transformed into a padded bitmap format, applying the relevant padding factor to each attribute (as discussed in the previous section), and the padded bitmap was re-clustered using the same nine algorithms. It is worth mentioning that the objective of the evaluation is not to search for the best classification algorithm, but to show the contribution of the proposed representation to the accuracy of the clustering-classification produced by diverse algorithms.

5.1 Tools and research process

  1. Fisher’s Iris dataset is available in all SPSS versions (see also Appendix A). It consists of 3 clusters of 50 samples (entities) each. Each cluster corresponds to one species of the Iris flower: Iris Setosa (cluster C1), Iris Versicolor (cluster C2), and Iris Verginica (cluster C3). Each sample (entity) has four features (attributes), representing Petal Width (PW), Petal Length (PL), Sepal Width (SW), and Sepal Length (SL); all attributes are expressed in centimeters (i.e., continuous attributes). Appendix A presents Fisher’s entire Iris dataset.

  2. Converting the Iris dataset from continuous values into a bitmap representation was done using a MatLab application developed for this purpose.

  3. Extracting the padding factor, and then padding the bitmap accordingly, was done using another MatLab application.

  4. The nine clustering algorithms (discussed in Section 5.2) were executed using SPSS version 13.0.

  5. Results and graphs were edited using Microsoft Excel for display purposes.

5.2 Clustering algorithms

This section provides a brief description of the nine clustering algorithms used in this study.

Two-step

This algorithm is applicable to both ordinal (continuous) and nominal discrete (categorical) attributes. It is based, as its name implies, on two passes of the dataset. The first pass divides the dataset into a coarse set of sub-clusters, and the second pass groups the sub-clusters into the desired number of clusters. This algorithm is dependent on the order of the samples and may produce different results based on the initial order. The desired number of clusters can be determined automatically or it can be a predetermined fixed number of clusters. We used the fixed number of clusters option in our analysis, so as to be able to use this algorithm in conjunction with the other algorithms chosen for this study.

K-means

This algorithm is applicable to both ordinal (continuous) and nominal discrete (categorical) attributes. One of its requirements is that the number of clusters used to classify the dataset is predetermined. It is based on determining arbitrary centers for the desired clusters, associating the samples with the clusters by using a predetermined distance measurement, iteratively changing the center of the clusters and then re-associating the samples. The length of the process is highly dependent on the initial setting of the cluster centers and can be improved when there is knowledge as to the location of these cluster centers.

Hierarchical methods

This set of algorithms works in a similar manner. These algorithms take the dataset properties that need to be clustered and start by classifying the dataset such that each sample represents a cluster. Next, they merge the clusters in steps: each step merges two clusters into a single cluster, until only one cluster (the entire dataset) remains. The algorithms differ in the way in which the distance between clusters is measured, mainly by means of two parameters: the distance, or likelihood, measure (e.g., Euclidean, Dice) and the cluster method (e.g., between-groups linkage, nearest neighbor).

In this study, we used the following well-known hierarchical methods to classify the datasets:
  • Within Groups Average—This method calculates the distance between two clusters by applying the likelihood measure to all the samples in the two clusters. The clusters with the highest average likelihood measure are then united.

  • Between Groups Average—This method calculates the distance between two clusters by applying the likelihood measure to all the samples of one cluster and then comparing it with all the samples of the other cluster. Again, the two clusters with the highest likelihood measure are then united.

  • Nearest Neighbor—This method, as in the Average Linkage (Between Groups) method, calculates the distance between two clusters by applying the likelihood measure to all the samples of one cluster and then comparing it with all the samples of the other cluster. The two clusters with the highest likelihood measure, from a pair of samples, are then united.

  • Furthest Neighbor—This method, like the previous methods, calculates the distance between two clusters by applying the likelihood measure to all the samples of one cluster and then comparing it with all the samples of another cluster. For each pair of clusters, the pair with the lowest likelihood measure is taken. The two clusters with the highest likelihood measure of those pairs are then united.

  • Centroid—This method calculates the centroid of each cluster by calculating the mean average of all the properties for all the samples of each cluster. The likelihood measure is then applied to the means of the clusters and the clusters with the highest likelihood measure between their centroids are then united.

  • Median—This method calculates the median of each cluster. The likelihood measure is applied to the medians of the clusters and the clusters with the highest median likelihood are then united.

  • Ward’s Method—This method calculates the centroid of each cluster and the square of the likelihood measure between each sample in the cluster and the centroid. The two clusters which, when united, have the smallest (negative) effect on the sum of likelihood measures are the clusters that are united.
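The study ran these algorithms in SPSS. Purely as an illustration of how a comparable run could be set up on a padded bitmap, the sketch below uses SciPy with the Dice dissimilarity; padded_bitmap is assumed to be the n × p 0/1 matrix produced by the padding stage (here filled with placeholder data, not the actual Iris bitmaps).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# padded_bitmap: assumed n x p 0/1 matrix from the padding stage.
# For the Iris data there are 25 + 60 + 25 + 37 = 147 bin columns (Table 5);
# padding only turns on additional bits within them. Placeholder data here.
padded_bitmap = np.random.randint(0, 2, size=(150, 147))

# Dice dissimilarity (1 - Dice similarity) between all pairs of binary rows.
d = pdist(padded_bitmap.astype(bool), metric="dice")

# Average linkage over the precomputed distances; 'single' and 'complete'
# correspond to nearest/furthest neighbor. (Centroid, median and Ward formally
# assume Euclidean geometry, so results would not carry over directly.)
Z = linkage(d, method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels[:10])
```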

5.3 Impact on clustering-classification accuracy

As aforesaid, the impact of the proposed representation on the accuracy of the clustering-classification produced by diverse algorithms was tested and verified using Fisher’s widely accepted Iris dataset and nine clustering algorithms. The algorithms were executed on the numerical representation of Fisher’s Iris dataset, and then on the padded bitmap format of the same data. The resultant classifications provided by these algorithms were compared to the common classification of the 150 Iris specimens (entities), as follows:
  • Cluster C1 contains specimens 1–50, which relate to the Iris Setosa species.

  • Cluster C2 contains specimens 51–100, which relate to the Iris Versicolor species.

  • Cluster C3 contains specimens 101–150, which relate to the Iris Verginica species.

5.3.1 Matching evaluation

The accuracy of the classification arrived at using each of the clustering algorithms was evaluated according to a matching score. The matching score of each algorithm is calculated based on a cross-tab table presenting matched and mismatched specimens with respect to the three Iris species. Table 6 presents the cross-tab table related to the K-Means algorithm. Each row corresponds to one species of Iris flower: Iris Setosa (C1), Iris Versicolor (C2), and Iris Verginica (C3). The columns Group A, Group B and Group C present the resultant cluster groups generated by the algorithm. The table shows, for example, that in the K-Means algorithm all members of C1 are grouped in Group A, but members of C2 are divided between Group B (47 members) and Group C (3 members), and members of C3 are also divided into two groups: Group B (14 members) and Group C (36 members). Thus, the K-Means algorithm yields three clusters/groups of 50, 61, and 39 members, respectively.
Table 6

Cross-tab table for results of the K-Means algorithm

Common clusters   Group A   Group B   Group C   Total members
C1                50        0         0         50
C2                0         47        3         50
C3                0         14        36        50
Total members     50        61        39        150
Score I           1         0.94      0.72      0.89
Score II          1         0.66      0.66      0.77

Each cross-tab table is then translated into a score by calculating matching percentages for each column. The matching factor is calculated so that each resultant cluster is compared to the corresponding original cluster, and the percentage of matching is identified. Finally, the average percentage of matching is calculated. For instance, if we want to calculate the score for the cross-tab table presented in Table 6:
  • Matching of Group A and C1 = 100% identity (50/50).

  • Matching of Group B and C2 = 94% identity (47/50).

  • Matching of Group C and C3 = 72% identity (36/50).

The average matching factor, presented in row Score I, is therefore: (1 + 0.94 + 0.72)/3 = 0.89

In addition, row Score II presents matching formula that takes into account not only the matching between resulting groups and common clusters, but also the mismatching factor:
  • Matching of Group A and C1 = 100% (50/50).

  • Irrelevant members in Group A = 0%

  • Total score of Group A = 100% − 0% = 100%

  • Matching of Group B and C2 = 94% (47/50).

  • Irrelevant members in Group B = 28% (14/50).

  • Total score of Group B = 94% − 28% = 66%

  • Matching of Group C and C3 = 72% (36/50).

  • Irrelevant members in Group C = 6% (3/50).

  • Total score of Group C = 72% − 6% = 66%

The average matching factor presented in row Score II is: (1 + 0.66 + 0.66)/3 = 0.77
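A compact sketch of both scores (our code, for illustration) that reproduces the Table 6 values from the K-Means cross-tab, assuming resultant group j corresponds to common cluster Cj:

```python
def matching_scores(cross_tab):
    """cross_tab[i][j]: members of common cluster Ci assigned to resultant group j.
    Assumes group j corresponds to cluster Cj (the diagonal), as in Table 6."""
    k = len(cross_tab)
    cluster_size = [sum(row) for row in cross_tab]                 # 50, 50, 50
    score1, score2 = [], []
    for j in range(k):
        match = cross_tab[j][j] / cluster_size[j]                  # e.g. 47/50
        mismatch = sum(cross_tab[i][j] for i in range(k) if i != j) / cluster_size[j]
        score1.append(match)                                       # Score I term
        score2.append(match - mismatch)                            # Score II term
    return sum(score1) / k, sum(score2) / k

kmeans = [[50,  0,  0],
          [ 0, 47,  3],
          [ 0, 14, 36]]
print(matching_scores(kmeans))   # -> (approx. 0.89, 0.77)
```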

5.3.2 Scoring comparison

Table 7 presents the matching evaluation for all nine algorithms used in the study. Since Score I is more “tolerant”, the scoring is reported according to Score I only. Each algorithm was tested twice: using the regular Iris dataset (continuous data), and then using the padded bitmap representation format.
Table 7

Scoring the classification results

Algorithm                Standard Iris dataset   Padded bitmap format & binary similarity measure
Nearest neighbor         0.68                    0.72
Between groups average   0.75                    0.79
Median                   0.75                    0.80
Within groups average    0.84                    0.90
Furthest neighbor        0.84                    0.91
Two-step                 0.86                    0.92
K-means                  0.89                    0.94
Centroid                 0.89                    0.95
Ward’s method            0.91                    0.97
The results show strong evidence for the efficiency of combining a padded bitmap representation with a binary similarity measure. The score of each algorithm improved by about 6–8% simply by using this combination of representation and similarity measure. Table 7 lists the results of all nine algorithms in ascending order.

6 Summary and conclusions

Data representation is an essential part of similarity calculations, as well as of classification and data mining techniques. Various representation forms, measures and algorithms have been developed, one of which is the bitmap index. A bitmap format representation requires a binary similarity measure that focuses on the positive aspect of the data (‘1’ values), such as the Dice measure. Earlier research suggested that coupling a binary similarity measure with a bitmap format representation could yield better clustering results than similarity indexes based upon the regular data representation. However, the bitmap format representation is currently limited to nominal and discrete attributes, which makes this approach unattractive for widespread use.

This paper describes a new approach for representing and handling ordinal and continuous data in the form of a padded bitmap. Each ordinal or continuous value is converted into a range of bins, and then the ‘1’ values (which represent the original values) are “padded” with adjacent ‘1’s around the original value, according to the first and second derivatives of the attribute domain. The main benefit of the padded representation is the improved accuracy of classification and data mining results.

The suggested methodology, which involves converting ordinal and continuous data into a padded bitmap, is general and wide ranging, thus enabling the conversion of any numerical dataset into a padded bitmap representation.

The impact of the proposed model on the accuracy of classification was tested and verified using Fisher’s Iris dataset and nine common clustering algorithms. The accuracy of the resultant clusters was evaluated based upon cross-tab tables, which indicate the percentage of correct assignments for each cluster. Strong evidence was found for the efficiency of the combination of the padded bitmap format and the binary similarity measure: each algorithm improved by about 7% simply as a result of using this combination. Although there are domains in which such improvements are insignificant, in other domains, such as medical diagnostics, such improvements are crucial.

6.1 Limitations and future research

There are diverse classification techniques: Bayes nets, neural nets, regressions, clustering, decision trees, decision rules, etc. The current study focuses on supervised problems (i.e., problems in which the classification is known and agreed upon) using clustering techniques.

Further research should test: (a) additional classification techniques; (b) unsupervised clustering problems (where the classification is unsettled), which are quite common in the business world, hence the importance of their being formalized and evaluated; (c) extending the padding model by assigning probabilities to the padded bins, rather than ‘1’s, to better represent the probability function of the entire population; and (d) an adjusted, Dice-like similarity index to support the extended padding model.

References

  1. Chan, C. Y., & Ioannidis, Y. E. (1998). Bitmap index design and evaluation. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, pp. 355–366.
  2. Dice, L. R. (1945). Measures of the amount of ecological association between species. Ecology, 26, 297–302.
  3. Erlich, Z., Gelbard, R., & Spiegler, I. (2002). Data mining by means of binary representation: a model for similarity and clustering. Information Systems Frontiers, 4(2), 187–197.
  4. Estivill-Castro, V., & Yang, J. (2004). Fast and robust general purpose clustering algorithms. Data Mining and Knowledge Discovery, 8, 127–150.
  5. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
  6. Gelbard, R., & Spiegler, I. (2000). Hempel’s raven paradox: a positive approach to cluster analysis. Computers and Operations Research, 27(4), 305–320.
  7. Gelbard, R., Goldman, O., & Spiegler, I. (2007). Investigating diversity of clustering methods: an empirical comparison. Data and Knowledge Engineering, 63, 155–166.
  8. Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.
  9. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31, 264–323.
  10. Johnson, T. (1999). Performance measurements of compressed bitmap indices. VLDB 1999, 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, pp. 278–289.
  11. Lim, T. S., Loh, W. Y., & Shih, Y. S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3), 203–228.
  12. O’Neil, P. E. (1987). Model 204 architecture and performance. Lecture Notes in Computer Science, Vol. 359, Proceedings of the 2nd International Workshop on High Performance Transaction Systems, pp. 40–59.
  13. Oracle Corp. (1993). Database concept—overview of indexes—bitmap index. Retrieved July 2010, from the Oracle site: http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref1008
  14. Oracle Corp. (2001). Data warehousing guide—using bitmap index in data warehousing. Retrieved July 2010, from the Oracle site: http://download.oracle.com/docs/cd/B19306_01/server.102/b14223/indexes.htm#sthref349
  15. Perlich, C., & Provost, F. (2006). Distribution-based aggregation for relational learning with identifier attributes. Machine Learning, 62, 65–105.
  16. Spiegler, I., & Maayan, R. (1985). Storage and retrieval considerations of binary data bases. Information Processing and Management, 21(3), 233–254.
  17. Zhang, B., & Srihari, S. N. (2003). Properties of binary vector dissimilarity measures. In JCIS CVPRIP 2003, Cary, North Carolina, pp. 26–30.
  18. Zhang, B., & Srihari, S. N. (2004). Fast k-nearest neighbor classification using cluster-based trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4), 525–528.

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Information System Program, Graduate School of Business Administration, Bar-Ilan University, Ramat-Gan, Israel
