Finding Similar RSS News Articles Using Correlation-Based Phrase Matching

  • Maria Soledad Pera
  • Yiu-Kai Ng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4798)

Abstract

Traditional phrase matching approaches, which discover documents containing exactly the same phrases, fail to detect documents that include phrases which are semantically relevant but not exact matches. We propose a correlation-based phrase matching (CPM) model that detects RSS news articles containing phrases that are either exactly the same or semantically relevant; these matched phrases dictate the degree of similarity of any two articles. As the number of RSS news feeds on the Internet continues to increase, our CPM approach becomes more significant, since it minimizes the workload of the user, who would otherwise have to scan through a huge number of news articles to find related articles of interest, a tedious and often impossible task. Experimental results show that our CPM model, applied to matching bigrams and trigrams, outperforms other phrase matching approaches, including keyword matching.
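
The abstract does not spell out how phrase correlations are computed, so the following is only a minimal sketch of the general idea of scoring two articles by n-gram matches that may be exact or merely related. The `WORD_CORR` table, its values, and the functions `word_corr`, `phrase_corr`, and `article_similarity` are all hypothetical names and assumptions made for illustration; they are not the CPM model's actual formulas or data.

```python
# Hypothetical word-correlation factors in [0, 1]. These values are invented
# for illustration; the abstract does not give the paper's correlation data.
WORD_CORR = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("crash", "accident")): 0.8,
}


def word_corr(a, b):
    """Return 1.0 for identical words, else a looked-up factor (default 0.0)."""
    if a == b:
        return 1.0
    return WORD_CORR.get(frozenset((a, b)), 0.0)


def ngrams(text, n):
    """Split text on whitespace and return its lowercase n-grams as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_corr(p, q):
    """Assumed phrase score: product of position-wise word correlations."""
    score = 1.0
    for a, b in zip(p, q):
        score *= word_corr(a, b)
    return score


def article_similarity(doc1, doc2, n=2):
    """Average, over doc1's n-grams, of the best-matching n-gram in doc2."""
    g1, g2 = ngrams(doc1, n), ngrams(doc2, n)
    if not g1 or not g2:
        return 0.0
    return sum(max(phrase_corr(p, q) for q in g2) for p in g1) / len(g1)
```

Under these assumptions, two articles sharing an identical bigram score 1.0 for that bigram, while a pair such as "car crash" and "automobile accident" still receives a nonzero score that exact phrase matching would miss. Passing `n=3` applies the same scoring to trigrams.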

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Maria Soledad Pera (1)
  • Yiu-Kai Ng (1)

  1. Computer Science Dept., Brigham Young University, Provo, Utah 84602, U.S.A.
