Detecting Near-Duplicate Relations in User Generated Forum Content

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6428)


A web forum is a large database of community knowledge, with information on the most recent events and developments. Unfortunately, this knowledge is presented in a format that humans understand easily but machines cannot process automatically. However, long-term observation of several forums suggests that there are several distinct types of postings and of relations between them.

One frequent and particularly annoying relation between two contributions is the near-duplicate relation. In this paper we propose work to detect and utilize contribution relations, concentrating on near-duplication. We propose ideas on how to calculate similarity, build groups of similar threads, and thus make near-duplicates in forums evident. One of the core theses is that information from forum and thread structure can be applied to improve existing near-duplicate detection approaches. In addition, the proposed work shows the qualitative and quantitative results of applying such principles, thereby determining which features are actually useful in the near-duplicate detection process. We also propose several sample applications that benefit from forum near-duplicate detection.
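The two steps named above, calculating similarity and building groups of similar threads, can be illustrated with a minimal sketch. It is not the authors' method: it assumes word-shingle Jaccard resemblance as the similarity measure and a simple union-find grouping, and it ignores the forum and thread structure features the paper proposes to add. The function names, the shingle size `k`, and the `threshold` value are hypothetical choices made for illustration only.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield one short shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard similarity of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def group_threads(threads, threshold=0.5, k=3):
    """Group thread ids whose resemblance reaches the threshold.

    threads: dict mapping a thread id to its text (assumed input shape).
    Returns a sorted list of sorted id lists (connected components).
    """
    parent = {tid: tid for tid in threads}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Naive all-pairs comparison: quadratic in the number of threads.
    for a, b in combinations(threads, 2):
        if resemblance(threads[a], threads[b], k) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for tid in threads:
        groups.setdefault(find(tid), []).append(tid)
    return sorted(sorted(g) for g in groups.values())


threads = {
    "t1": "how do I reset my router password",
    "t2": "how do I reset my router password please",
    "t3": "best pizza places in Dresden",
}
print(group_threads(threads))  # [['t1', 't2'], ['t3']]
```

The all-pairs loop is only practical for small collections; a structural signal such as shared subforum or thread position would, under the paper's thesis, enter as an additional feature alongside the text resemblance.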


Keywords (added by machine, not by the authors): Name Entity Recognition; Spam Detector; Semantical Data Structure; Name Entity Recognition System; Information Piece





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  1. Chair for Computer Networks, Technical University Dresden, Germany
  2. DIMA Group, Technical University Berlin, Germany
