An Order-Based Taxonomy for Text Similarity

Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 107)

Abstract

Text similarity is a basic problem in many fields. This paper proposes a new order-based taxonomy for text similarity. According to how the order of text comparison units is treated, we classify text similarity measures into three categories: order-sensitive similarity, order-insensitive similarity, and order-semi-sensitive similarity. In order-sensitive similarity, each text is treated as a string of items, and matching proceeds item by item as a pairwise alignment process. In order-insensitive similarity, each text is treated as a set of distinct items, and only item co-occurrence is considered during comparison. In order-semi-sensitive similarity, the comparison unit is a block of items whose length is determined dynamically, and only local order (the order of items within each block) is preserved during matching. The taxonomy offers an order-based perspective on text similarity, which may help in understanding and developing this basic element across many disciplines.
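The three categories can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not give concrete formulas, so each function below substitutes a well-known stand-in, and all function names, the whitespace tokenization, and the `min_block` parameter are assumptions made for illustration. Order-sensitive similarity is sketched with a normalized token-level edit distance, order-insensitive similarity with the Jaccard coefficient, and order-semi-sensitive similarity with a greedy search for the longest common blocks, which may match at any position in either text.

```python
def tokens(text):
    # Assumption: an "item" is a lowercase whitespace-separated word.
    return text.lower().split()


def order_sensitive(a, b):
    """Stand-in: 1 - normalized token-level Levenshtein distance."""
    x, y = tokens(a), tokens(b)
    if not x and not y:
        return 1.0
    prev = list(range(len(y) + 1))
    for i, tx in enumerate(x, 1):
        cur = [i]
        for j, ty in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (tx != ty)))  # substitution or match
        prev = cur
    return 1.0 - prev[-1] / max(len(x), len(y))


def order_insensitive(a, b):
    """Stand-in: Jaccard coefficient over sets of distinct items."""
    x, y = set(tokens(a)), set(tokens(b))
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def order_semi_sensitive(a, b, min_block=1):
    """Stand-in: greedily match the longest common contiguous blocks.

    Order is preserved inside each block, but blocks may match anywhere.
    `min_block` is a hypothetical parameter: shorter runs are ignored.
    """
    x, y = tokens(a), tokens(b)
    if not x and not y:
        return 1.0
    used_x, used_y = [False] * len(x), [False] * len(y)
    covered = 0
    while True:
        best_len, best_i, best_j = 0, 0, 0
        for i in range(len(x)):
            for j in range(len(y)):
                k = 0
                while (i + k < len(x) and j + k < len(y)
                       and not used_x[i + k] and not used_y[j + k]
                       and x[i + k] == y[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len < max(min_block, 1):
            break
        for k in range(best_len):
            used_x[best_i + k] = used_y[best_j + k] = True
        covered += best_len
    return 2 * covered / (len(x) + len(y))
```

Under these assumptions, swapping the two halves of a sentence leaves the order-insensitive and order-semi-sensitive scores at 1.0 while lowering the order-sensitive score. Scrambling words inside every block lowers the order-semi-sensitive score once `min_block` exceeds 1. The greedy block search is quadratic or worse and is meant only to make the distinction concrete.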

Keywords

Text similarity; Text similarity taxonomy; Order-sensitive similarity; Order-insensitive similarity; Order-semi-sensitive similarity; Text matching; Information retrieval

Notes

Acknowledgments

This work is supported by the National Natural Science Fund of China (Nos. 60905026, 71071141), the Specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20093326120004), the Natural Science Fund of Zhejiang Province (Nos. Y1091164, Z1091224), and the Zhejiang Science and Technology Plan Project (No. 2010C33016).

Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  1. College of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, China
