Detecting Near-Duplicate Relations in User Generated Forum Content
A web forum is a large database of community knowledge, containing information on the most recent events and developments. Unfortunately, this knowledge is presented in a format that humans understand easily but machines cannot process automatically. However, long-term observation of several forums suggests that there are several distinct types of postings and relations between them.
One frequent and particularly annoying relation between two contributions is the near-duplicate relation. In this paper we propose work to detect and utilize contribution relations, concentrating on near-duplication. We propose ideas on how to calculate similarity, build groups of similar threads, and thus make near-duplicates in forums evident. One of the core theses is that information from forum and thread structure can be applied to improve existing near-duplicate detection approaches. In addition, the proposed work shows the qualitative and quantitative results of applying such principles, thereby determining which features are actually useful in the near-duplicate detection process. We also propose several sample applications that benefit from forum near-duplicate detection.
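The abstract does not say which similarity measure or grouping method is used, so the following is only a minimal sketch of the general idea under stated assumptions: thread texts are compared by Jaccard resemblance over word shingles (a common near-duplicate baseline), and threads whose similarity reaches a threshold are merged into groups with a simple union-find. The function names, the shingle size `k`, and the `threshold` value are illustrative choices, not taken from the paper, and the sketch ignores the forum and thread structure features that the paper proposes to exploit.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield one short shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard resemblance of two shingle sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_near_duplicates(threads, threshold=0.5, k=3):
    """Group thread ids whose texts are pairwise-linked near-duplicates.

    threads: dict mapping a thread id to its text (hypothetical input shape).
    Returns a list of sets, one per group with at least two members.
    """
    parent = {tid: tid for tid in threads}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = {tid: shingles(text, k) for tid, text in threads.items()}
    # Quadratic all-pairs comparison; fine for a sketch, not for a large forum.
    for a, b in combinations(threads, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for tid in threads:
        groups.setdefault(find(tid), set()).add(tid)
    return [g for g in groups.values() if len(g) > 1]
```

Calling `group_near_duplicates` on a dictionary of thread texts returns the sets of thread ids that were linked by the threshold. Because groups are formed transitively, two threads can end up in the same group without being directly similar to each other, which is one reason a real system would likely need additional features or a stricter grouping rule.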
Keywords: Name Entity Recognition, Spam Detector, Semantical Data Structure, Name Entity Recognition System, Information Piece