Text Comparison Using Soft Cardinality

  • Sergio Jimenez
  • Fabio Gonzalez
  • Alexander Gelbukh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6393)


Classical set theory provides a method for comparing objects using cardinality and intersection, in combination with well-known resemblance coefficients such as Dice, Jaccard, and cosine. However, set operations are intrinsically crisp: they do not take into account similarities between elements. We propose a new general-purpose method for comparing objects using a soft cardinality function that takes into account the similarities between elements via an auxiliary affinity (similarity) measure. Our experiments with 12 text matching datasets suggest that the soft cardinality method is superior to known approximate string comparison methods in text comparison tasks.
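To make the contrast between crisp and soft comparison concrete, here is a minimal Python sketch. The abstract does not state the soft cardinality formula, so the version below is an assumption: each element contributes the reciprocal of its summed affinity to every element of the set, with a softness exponent `p`. The function names, the use of `difflib.SequenceMatcher` as the auxiliary affinity measure, and the default `p = 1.0` are all illustrative choices, not the authors' configuration.

```python
from difflib import SequenceMatcher
from math import sqrt


def crisp_coefficients(a, b):
    """Classical resemblance coefficients on crisp sets (both must be non-empty)."""
    a, b = set(a), set(b)
    inter = len(a & b)
    return {
        "jaccard": inter / len(a | b),
        "dice": 2 * inter / (len(a) + len(b)),
        "cosine": inter / sqrt(len(a) * len(b)),
    }


def affinity(x, y):
    """Hypothetical auxiliary affinity: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, x, y).ratio()


def soft_cardinality(elements, sim=affinity, p=1.0):
    """Assumed formulation: sum over elements of 1 / sum_j sim(x, y_j) ** p.

    With a crisp sim (1 if equal, else 0) this reduces to the ordinary
    cardinality; with sim == 1 everywhere the whole set counts as one element.
    """
    items = list(set(elements))
    return sum(1.0 / sum(sim(x, y) ** p for y in items) for x in items)


def soft_jaccard(a, b, sim=affinity, p=1.0):
    """Jaccard coefficient with soft cardinalities (inputs must be non-empty).

    The soft intersection is inferred as |A| + |B| - |A u B|.
    """
    card_a = soft_cardinality(a, sim, p)
    card_b = soft_cardinality(b, sim, p)
    card_union = soft_cardinality(set(a) | set(b), sim, p)
    return (card_a + card_b - card_union) / card_union


if __name__ == "__main__":
    left = ["sergio", "jimenez"]
    right = ["sergio", "jimenes"]
    print(crisp_coefficients(left, right))
    print(soft_jaccard(left, right))
```

On the word sets in the usage example, the crisp coefficients treat "jimenez" and "jimenes" as unrelated elements, while the soft variant lets their affinity reduce the cardinality of the union.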


Vector Space Model, Cardinality Measure, Cardinality Function, Approximate String Match, String Pair
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Sergio Jimenez (1)
  • Fabio Gonzalez (1)
  • Alexander Gelbukh (2)

  1. National University of Colombia, Colombia
  2. CIC-IPN, Mexico
