Information Systems Frontiers, Volume 4, Issue 2, pp 187–197

Data Mining by Means of Binary Representation: A Model for Similarity and Clustering

  • Zippy Erlich
  • Roy Gelbard
  • Israel Spiegler

Abstract

In this paper we outline a new clustering method based on a binary representation of data records. The binary database relates each entity to all possible attribute values (the domain) that the entity may assume. The resulting binary matrix allows similarity and clustering to be calculated from the positive ('1') bits of the entity vectors. We formulate two indexes: the Pair Similarity Index (PSI), which measures similarity between two entities, and the Group Similarity Index (GSI), which measures similarity within a group of entities. For each attribute domain we define a threshold factor that depends on the domain but not on the number of entities in the group. The similarity measure is simple to store and efficient to calculate. We compare our similarity index with other indexes. Experiments with sample data indicate a 48% improvement in group similarity over standard methods, pointing to the potential and merit of the binary approach to clustering and data mining.
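As a rough illustration of the binary representation described above, the following minimal sketch encodes each record as a vector with one bit per (attribute, value) pair and compares two vectors using only their '1' bits. The abstract does not give the PSI or GSI formulas, so the Jaccard-style ratio used here is a stand-in and not the authors' index; the function names, attribute domains, and sample records are all hypothetical.

```python
def binary_encode(records, domains):
    """Map each record to a 0/1 vector with one bit per (attribute, value) pair."""
    # Column order follows the order of attributes and values in `domains`.
    columns = [(attr, value) for attr, values in domains.items() for value in values]
    return [
        [1 if rec.get(attr) == value else 0 for attr, value in columns]
        for rec in records
    ]


def positive_match_similarity(u, v):
    """Jaccard-style ratio on '1' bits only (a stand-in, NOT the paper's PSI)."""
    both = sum(a & b for a, b in zip(u, v))    # bits set in both vectors
    either = sum(a | b for a, b in zip(u, v))  # bits set in at least one vector
    return both / either if either else 0.0


# Hypothetical attribute domains and records, for illustration only.
domains = {"color": ["red", "green", "blue"], "size": ["S", "M", "L"]}
records = [
    {"color": "red", "size": "M"},
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "S"},
]

vectors = binary_encode(records, domains)
sim_01 = positive_match_similarity(vectors[0], vectors[1])
sim_02 = positive_match_similarity(vectors[0], vectors[2])
```

In this sketch the first two records share one positive bit (color = red) out of three set in either vector, while the first and third share none. Shared '0' bits never contribute to the score, which reflects the abstract's emphasis on positive bits.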

Keywords: binary representation, similarity, clustering, data mining

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Zippy Erlich (1)
  • Roy Gelbard (2)
  • Israel Spiegler (2)

  1. Computer Science Department, The Open University, Tel Aviv, Israel
  2. Technology and Information Systems Department, The Leon Recanati Graduate School of Business Administration, Tel Aviv University, Tel Aviv, Israel
