Abstract
In this paper we outline a new clustering method based on a binary representation of data records. The binary database relates each entity to every value in the domain of each attribute that the entity may assume. The resulting binary matrix allows similarity and clustering to be calculated using only the positive ('1') bits of each entity vector. We formulate two indexes: the Pair Similarity Index (PSI), which measures similarity between two entities, and the Group Similarity Index (GSI), which measures similarity within a group of entities. A threshold factor is defined for each attribute domain; it depends on the domain but is independent of the number of entities in the group. The similarity measure is simple to store and efficient to compute. We compare our similarity index with other indexes. Experiments with sample data indicate a 48% improvement in group similarity over standard methods, pointing to the potential and merit of the binary approach to clustering and data mining.
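The binary representation described above can be illustrated with a minimal sketch. The abstract does not give the PSI or GSI formulas, so the similarity function below is a stand-in (a Jaccard-style ratio over '1' bits), not the authors' index. The names `binary_encode`, `positive_match_similarity`, and the sample domains are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def binary_encode(record: Dict[str, str], domains: Dict[str, Sequence[str]]) -> List[int]:
    """Map a record to a bit vector with one bit per (attribute, value) pair.

    A bit is 1 when the record holds that value for that attribute.
    """
    bits: List[int] = []
    for attr, values in domains.items():
        bits.extend(1 if record.get(attr) == v else 0 for v in values)
    return bits


def positive_match_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Stand-in measure (NOT the paper's PSI): shared 1 bits over 1 bits set in either vector."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


# Hypothetical attribute domains and records, for illustration only.
domains = {"color": ["red", "green", "blue"], "size": ["small", "large"]}
r1 = {"color": "red", "size": "small"}
r2 = {"color": "red", "size": "large"}

v1 = binary_encode(r1, domains)  # [1, 0, 0, 1, 0]
v2 = binary_encode(r2, domains)  # [1, 0, 0, 0, 1]
sim = positive_match_similarity(v1, v2)  # one shared 1 bit out of three set bits
```

Because only the '1' bits contribute, two records that both lack a value do not count as similar on that value, which is the sense in which the measure relies on positive bits.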
Cite this article
Erlich, Z., Gelbard, R. & Spiegler, I. Data Mining by Means of Binary Representation: A Model for Similarity and Clustering. Information Systems Frontiers 4, 187–197 (2002). https://doi.org/10.1023/A:1016002919937