Information Systems Frontiers, Volume 15, Issue 1, pp 99–110

“Padding” bitmaps to support similarity and mining

Article

Abstract

This paper presents a novel approach to bitmap indexing for data mining purposes. Bitmap indexing currently enables efficient data storage and retrieval, but it is limited in terms of similarity measurement, and hence in classification, clustering and data mining. Bitmap indexes mainly fit nominal discrete attributes and are thus unattractive for widespread use, which requires the ability to handle continuous data in a raw format. The research describes a scheme for representing ordinal and continuous data by applying the concept of “padding”, where each discrete nominal data value is transformed into a range of nominal discrete values. This “padding” is done by adding adjacent bits “around” the original value (bin). The padding factor, i.e., the number of adjacent bits added, is calculated from the first- and second-degree derivatives of each attribute’s domain distribution. The padded representation better supports similarity measures, and therefore improves the accuracy of clustering and mining. The advantages of padding bitmaps are demonstrated on Fisher’s Iris dataset.
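The padding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the equal-width binning, the fixed `pad` value, the helper names (`bin_index`, `padded_bitmap`, `dice`) and the choice of the Dice coefficient as the similarity measure are all assumptions made for illustration. In particular, the paper derives the padding factor from the attribute's domain distribution, which is not reproduced here.

```python
def bin_index(value, lo, hi, n_bins):
    """Map a continuous value in [lo, hi] to one of n_bins equal-width bins.

    Equal-width binning is an assumption of this sketch.
    """
    if not lo <= value <= hi:
        raise ValueError("value outside attribute domain")
    width = (hi - lo) / n_bins
    # Clamp so that value == hi falls into the last bin.
    return min(int((value - lo) / width), n_bins - 1)


def padded_bitmap(value, lo, hi, n_bins, pad):
    """Set the value's own bin plus `pad` adjacent bins on each side.

    `pad` is a fixed, hypothetical padding factor; pad=0 gives the
    ordinary one-bit-per-value bitmap.
    """
    center = bin_index(value, lo, hi, n_bins)
    start = max(0, center - pad)
    stop = min(n_bins - 1, center + pad)
    return [1 if start <= i <= stop else 0 for i in range(n_bins)]


def dice(a, b):
    """Dice coefficient between two equal-length bit vectors."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0


# Two nearby measurements and one distant one, on a hypothetical domain [4, 8].
near_a, near_b, far = 5.0, 5.5, 7.5
for pad in (0, 1):
    bm_a = padded_bitmap(near_a, 4.0, 8.0, 8, pad)
    bm_b = padded_bitmap(near_b, 4.0, 8.0, 8, pad)
    bm_far = padded_bitmap(far, 4.0, 8.0, 8, pad)
    print(pad, dice(bm_a, bm_b), dice(bm_a, bm_far))
```

Without padding, the two nearby values fall into neighbouring bins and share no bits, so a bitwise similarity measure treats them as no more alike than the distant pair. With one padded bit on each side their bitmaps overlap, while the distant value still shares no bits with them.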

Keywords

Bitmap-index; Data representation; Similarity index; Cluster analysis; Classification; Data mining

References

  1. Chan, C. Y., & Ioannidis, Y. E. (1998). Bitmap index design and evaluation. Proceedings of the 1998 ACM SIGMOD international conference on Management of data, Seattle, Washington, pp. 355–366.
  2. Dice, L. R. (1945). Measures of the amount of ecological association between species. Ecology, 26, 297–302.
  3. Erlich, Z., Gelbard, R., & Spiegler, I. (2002). Data mining by means of binary representation: a model for similarity and clustering. Information Systems Frontiers, 4(2), 187–197.
  4. Estivill-Castro, V., & Yang, J. (2004). Fast and robust general purpose clustering algorithms. Data Mining and Knowledge Discovery, 8, 127–150.
  5. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
  6. Gelbard, R., & Spiegler, I. (2000). Hempel’s raven paradox: a positive approach to cluster analysis. Computers and Operations Research, 27(4), 305–320.
  7. Gelbard, R., Goldman, O., & Spiegler, I. (2007). Investigating diversity of clustering methods: an empirical comparison. Data and Knowledge Engineering, 63, 155–166.
  8. Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice Hall.
  9. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31, 264–323.
  10. Johnson, T. (1999). Performance Measurements of Compressed Bitmap Indices. VLDB-1999, 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, pp. 278–289.
  11. Lim, T. S., Loh, W. Y., & Shih, Y. S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3), 203–228.
  12. O’Neil, P. E. (1987). Model 204 Architecture and Performance. Lecture Notes in Computer Science, Vol. 359, Proceedings of the 2nd International Workshop on High Performance Transaction Systems, pp. 40–59.
  13. Oracle corp. (1993). Database concept—overview of indexes—bitmap index. Retrieved July 2010, from Oracle site: http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref1008
  14. Oracle corp. (2001). Data warehousing guide—using bitmap index in data warehousing. Retrieved July 2010, from Oracle site: http://download.oracle.com/docs/cd/B19306_01/server.102/b14223/indexes.htm#sthref349
  15. Perlich, C., & Provost, F. (2006). Distribution-based aggregation for relational learning with identifier attributes. Machine Learning, 62, 65–105.
  16. Spiegler, I., & Maayan, R. (1985). Storage and retrieval considerations of binary data bases. Information Processing and Management, 21(3), 233–254.
  17. Zhang, B., & Srihari, S. N. (2003). Properties of binary vector dissimilarity measures. In JCIS CVPRIP 2003, Cary, North Carolina, pp. 26–30.
  18. Zhang, B., & Srihari, S. N. (2004). Fast k-nearest neighbor classification using cluster-based trees. IEEE Trans Pattern Analysis and Machine Intelligence, 26(4), 525–528.

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Information System Program, Graduate School of Business Administration, Bar-Ilan University, Ramat-Gan, Israel
