
Data Mining by Means of Binary Representation: A Model for Similarity and Clustering

Information Systems Frontiers

Abstract

In this paper we outline a new clustering method based on a binary representation of data records. The binary database relates each entity to all possible values (the domain) that each of its attributes may assume. The resulting binary matrix allows similarity and clustering to be calculated using only the positive (‘1’) bits of each entity vector. We formulate two indexes: the Pair Similarity Index (PSI), which measures similarity between two entities, and the Group Similarity Index (GSI), which measures similarity within a group of entities. A threshold factor is defined for each attribute domain; it depends on the domain but is independent of the number of entities in the group. The similarity measure is simple to store and efficient to calculate. We compare our similarity index with other indexes. Experiments with sample data indicate a 48% improvement in group similarity over standard methods, pointing to the potential and merit of the binary approach to clustering and data mining.
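As a rough illustration of the binary representation described in the abstract, the following Python sketch expands each record into one bit per possible attribute value and compares two entities using only their shared '1' bits. The abstract does not give the PSI or GSI formulas, so the similarity function here is an assumed stand-in (a Dice-style coefficient on '1' bits), not the paper's index; the function names and the sample records are likewise hypothetical.

```python
from typing import Dict, List


def build_domains(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Collect, per attribute, the sorted list of values observed in the data."""
    domains: Dict[str, set] = {}
    for rec in records:
        for attr, value in rec.items():
            domains.setdefault(attr, set()).add(value)
    return {attr: sorted(values) for attr, values in domains.items()}


def encode(record: Dict[str, str], domains: Dict[str, List[str]]) -> List[int]:
    """Expand a record into one bit per (attribute, value) pair of the domains."""
    bits: List[int] = []
    for attr in sorted(domains):
        for value in domains[attr]:
            bits.append(1 if record.get(attr) == value else 0)
    return bits


def shared_ones_similarity(a: List[int], b: List[int]) -> float:
    """Assumed stand-in measure (NOT the paper's PSI): Dice coefficient on '1' bits."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0


# Hypothetical sample data, for illustration only.
records = [
    {"color": "red", "size": "small"},
    {"color": "red", "size": "large"},
    {"color": "blue", "size": "large"},
]
domains = build_domains(records)
matrix = [encode(rec, domains) for rec in records]

for row in matrix:
    print(row)
print(shared_ones_similarity(matrix[0], matrix[1]))
```

Because only the positive bits contribute, two entities that share no attribute value score zero regardless of how many '0' positions they have in common, which is the property the abstract attributes to the binary approach.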




Erlich, Z., Gelbard, R. & Spiegler, I. Data Mining by Means of Binary Representation: A Model for Similarity and Clustering. Information Systems Frontiers 4, 187–197 (2002). https://doi.org/10.1023/A:1016002919937
