Abstract
This paper presents a novel approach to bitmap indexing for data mining. Bitmap indexing currently enables efficient data storage and retrieval, but it offers limited support for similarity measurement and hence for classification, clustering, and data mining. Bitmap indexes mainly fit nominal discrete attributes and are thus unattractive for widespread use, which requires the ability to handle continuous data in raw format. The research describes a scheme for representing ordinal and continuous data by applying the concept of "padding", in which each discrete nominal data value is transformed into a range of nominal-discrete values. Padding is done by adding adjacent bits "around" the original value (bin). The padding factor, i.e., the number of adjacent bits added, is calculated from the first and second derivatives of each attribute's domain distribution. The padded representation better supports similarity measures and therefore improves the accuracy of clustering and mining. The advantages of padding bitmaps are demonstrated on Fisher's Iris dataset.
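The effect of padding on similarity can be illustrated with a minimal sketch. It is not the paper's implementation: the helper names `bitmap` and `dice`, the choice of 8 bins, and the fixed padding factor of 1 are all assumptions made for illustration (the abstract derives the padding factor from the attribute's distribution, which is not reproduced here). The Dice coefficient is used only as one common binary similarity measure.

```python
def bitmap(bin_index, n_bins, pad=0):
    """Bit vector with the original bin set, plus `pad` adjacent bits on each side.

    `pad` is a hypothetical fixed padding factor; bits falling outside the
    domain are simply clipped.
    """
    lo = max(0, bin_index - pad)
    hi = min(n_bins - 1, bin_index + pad)
    return [1 if lo <= i <= hi else 0 for i in range(n_bins)]


def dice(a, b):
    """Dice coefficient of two equal-length bit vectors: 2|A and B| / (|A| + |B|)."""
    both = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * both / total if total else 0.0


n_bins = 8  # assumed number of bins for one discretized attribute

# Without padding, neighbouring bins 3 and 4 share no bits at all.
plain_sim = dice(bitmap(3, n_bins), bitmap(4, n_bins))

# With one padding bit on each side, the same neighbouring values overlap.
padded_sim = dice(bitmap(3, n_bins, pad=1), bitmap(4, n_bins, pad=1))

# Distant values (bins 1 and 6) still share no bits after padding.
far_sim = dice(bitmap(1, n_bins, pad=1), bitmap(6, n_bins, pad=1))

print(plain_sim, round(padded_sim, 3), far_sim)
```

In this sketch the unpadded representation treats adjacent and distant values as equally dissimilar, whereas the padded one gives adjacent values a nonzero similarity while leaving distant values at zero.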
References
Chan, C. Y., & Ioannidis, Y. E. (1998). Bitmap index design and evaluation. Proceedings of the 1998 ACM SIGMOD international conference on Management of data, Seattle, Washington, pp. 355–366.
Dice, L. R. (1945). Measures of the amount of ecological association between species. Ecology, 26, 297–302.
Erlich, Z., Gelbard, R., & Spiegler, I. (2002). Data mining by means of binary representation: a model for similarity and clustering. Information Systems Frontiers, 4(2), 187–197.
Estivill-Castro, V., & Yang, J. (2004). Fast and robust general purpose clustering algorithms. Data Mining and Knowledge Discovery, 8, 127–150.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Gelbard, R., & Spiegler, I. (2000). Hempel’s raven paradox: a positive approach to cluster analysis. Computers and Operations Research, 27(4), 305–320.
Gelbard, R., Goldman, O., & Spiegler, I. (2007). Investigating diversity of clustering methods: an empirical comparison. Data and Knowledge Engineering, 63, 155–166.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice Hall.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31, 264–323.
Johnson, T. (1999). Performance Measurements of Compressed Bitmap Indices. VLDB-1999, 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, pp. 278–289.
Lim, T. S., Loh, W. Y., & Shih, Y. S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3), 203–228.
O’Neil, P. E. (1987). Model 204 Architecture and Performance. Lecture Notes in Computer Science, Vol. 359, Proceedings of the 2nd International Workshop on High Performance Transaction Systems, pp. 40–59.
Oracle corp. (1993). Database concept—overview of indexes—bitmap index. Retrieved July 2010, from Oracle site: http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref1008
Oracle corp. (2001). Data warehousing guide—using bitmap index in data warehousing. Retrieved July 2010, from Oracle site: http://download.oracle.com/docs/cd/B19306_01/server.102/b14223/indexes.htm#sthref349
Perlich, C., & Provost, F. (2006). Distribution-based aggregation for relational learning with identifier attributes. Machine Learning, 62, 65–105.
Spiegler, I., & Maayan, R. (1985). Storage and retrieval considerations of binary data bases. Information Processing and Management, 21(3), 233–254.
Zhang, B., & Srihari, S. N. (2003). Properties of binary vector dissimilarity measures. In JCIS CVPRIP 2003, Cary, North Carolina, pp. 26–30.
Zhang, B., & Srihari, S. N. (2004). Fast k-nearest neighbor classification using cluster-based trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4), 525–528.
About this article
Cite this article
Gelbard, R. “Padding” bitmaps to support similarity and mining. Inf Syst Front 15, 99–110 (2013). https://doi.org/10.1007/s10796-011-9318-9
DOI: https://doi.org/10.1007/s10796-011-9318-9