“Padding” bitmaps to support similarity and mining

Information Systems Frontiers

Abstract

This paper presents a novel approach to bitmap indexing for data mining. Bitmap indexing currently enables efficient data storage and retrieval, but it offers limited support for similarity measurement and hence for classification, clustering, and data mining. Bitmap indexes mainly fit nominal discrete attributes and are thus unattractive for widespread use, which requires the ability to handle continuous data in raw format. The research describes a scheme for representing ordinal and continuous data by applying the concept of “padding”, in which each discrete nominal data value is transformed into a range of nominal-discrete values. Padding is done by adding adjacent bits “around” the original value (bin). The padding factor, i.e., the number of adjacent bits added, is calculated from the first and second derivative degrees of each attribute’s domain distribution. The padded representation better supports similarity measures and therefore improves the accuracy of clustering and mining. The advantages of padding bitmaps are demonstrated on Fisher’s Iris dataset.
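The padding idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the attribute range [4.0, 8.0), the eight equal-width bins, and the fixed padding factor of 1 are all assumptions chosen for illustration, and the derivative-based calculation of the padding factor is not reproduced here. Similarity is measured with the Dice coefficient on the resulting bit vectors.

```python
def one_hot_bin(value, lo, hi, n_bins):
    """Encode a continuous value as a bitmap with a single 1 in its bin.

    Equal-width binning over [lo, hi) is an assumption of this sketch.
    """
    width = (hi - lo) / n_bins
    idx = min(int((value - lo) / width), n_bins - 1)
    bits = [0] * n_bins
    bits[idx] = 1
    return bits


def pad(bits, factor):
    """Set `factor` adjacent bits on each side of every set bit."""
    n = len(bits)
    padded = [0] * n
    for i, bit in enumerate(bits):
        if bit:
            for j in range(max(0, i - factor), min(n, i + factor + 1)):
                padded[j] = 1
    return padded


def dice(a, b):
    """Dice coefficient of two equal-length bit vectors."""
    both = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * both / total if total else 0.0


# Hypothetical attribute range and values, chosen only for illustration.
LO, HI, N_BINS, FACTOR = 4.0, 8.0, 8, 1

a = one_hot_bin(5.0, LO, HI, N_BINS)   # falls in bin 2
b = one_hot_bin(5.5, LO, HI, N_BINS)   # falls in bin 3 (adjacent to a)
c = one_hot_bin(7.5, LO, HI, N_BINS)   # falls in bin 7 (far from a)

print(dice(a, b))                            # unpadded neighbours share no bit
print(dice(pad(a, FACTOR), pad(b, FACTOR)))  # padded neighbours overlap
print(dice(pad(a, FACTOR), pad(c, FACTOR)))  # distant values still do not
```

Without padding, two values in adjacent bins look exactly as dissimilar as two values at opposite ends of the range. With padding, nearby values share bits while distant values do not, which is the property the abstract attributes to the padded representation.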

Author information

Corresponding author

Correspondence to Roy Gelbard.

Appendices

Appendix A

Table 8 Fisher’s Iris dataset

Appendix B

Table 9 Padded bitmap format

About this article

Cite this article

Gelbard, R. “Padding” bitmaps to support similarity and mining. Inf Syst Front 15, 99–110 (2013). https://doi.org/10.1007/s10796-011-9318-9
