Detecting Near-Duplicates in Large-Scale Short Text Databases

Gong, Caichun; Huang, Yulan; Cheng, Xueqi; Bai, Shuo

doi:10.1007/978-3-540-68125-0_87

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5012)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Near-duplicates are abundant in short text databases, and detecting and eliminating them is of great importance. This paper proposes SimFinder, a fast algorithm that identifies all near-duplicates in large-scale short text databases. An ad hoc term weighting scheme measures each term’s discriminative ability, and a certain number of the highest-weighted terms are selected as features for each short text. SimFinder generates several fingerprints for each text, and only texts that share at least one fingerprint are compared with each other. An optimization procedure makes SimFinder more efficient. Experiments indicate that SimFinder is an effective solution for short text duplicate detection, with almost linear time and storage complexity. Both its precision and its recall are promising.
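
The abstract outlines the pipeline only at a high level: weight terms, keep the top-weighted terms as features, derive several fingerprints per text, and compare only texts that share a fingerprint. The following Python sketch illustrates that outline under stated assumptions and is not the authors' implementation. IDF stands in for the paper's ad hoc weighting scheme, fingerprints are formed by leaving one feature out at a time, and the final comparison uses Jaccard similarity with an arbitrary threshold. All function names and parameters (`k`, `drop`, `threshold`) are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def idf_weights(docs):
    """Assumption: plain IDF stands in for the paper's ad hoc term weighting."""
    df = defaultdict(int)
    for tokens in docs:
        for term in set(tokens):
            df[term] += 1
    n = len(docs)
    return {term: math.log(n / count) for term, count in df.items()}


def top_features(tokens, weights, k):
    """Keep the k highest-weighted distinct terms (ties broken alphabetically)."""
    return sorted(set(tokens), key=lambda t: (-weights[t], t))[:k]


def fingerprints(features, drop=1):
    """Assumption: one fingerprint per feature subset with `drop` terms left out."""
    feats = sorted(features)
    size = max(len(feats) - drop, 1)
    return {" ".join(subset) for subset in combinations(feats, size)}


def jaccard(a, b):
    """Token-set overlap, used here as the final similarity check."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def near_duplicates(texts, k=4, drop=1, threshold=0.6):
    docs = [text.lower().split() for text in texts]
    weights = idf_weights(docs)

    # Bucket texts by fingerprint; only texts in the same bucket are compared.
    buckets = defaultdict(list)
    for i, tokens in enumerate(docs):
        for fp in fingerprints(top_features(tokens, weights, k), drop):
            buckets[fp].append(i)

    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(ids, 2))

    # Verify each candidate pair with the full similarity measure.
    return sorted(
        (i, j) for i, j in candidates if jaccard(docs[i], docs[j]) >= threshold
    )


messages = [
    "win a free phone call 12345 now",
    "win a free phone call 12345 today",
    "see you at lunch tomorrow",
]
pairs = near_duplicates(messages)
```

In this sketch the first two messages share a fingerprint and are then confirmed as near-duplicates, while the third never enters a shared bucket and is therefore never compared. Restricting comparisons to shared buckets is what avoids the quadratic all-pairs comparison; how close the cost gets to linear depends on how evenly the fingerprints spread texts across buckets.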

Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080, P.R.C.
Caichun Gong, Yulan Huang, Xueqi Cheng & Shuo Bai
Graduate School of Chinese Academy of Sciences, Beijing, 100049, P.R.C.
Caichun Gong & Yulan Huang

Editor information

Takashi Washio, Einoshin Suzuki, Kai Ming Ting, Akihiro Inokuchi

Cite this paper

Gong, C., Huang, Y., Cheng, X., Bai, S. (2008). Detecting Near-Duplicates in Large-Scale Short Text Databases. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2008. Lecture Notes in Computer Science, vol 5012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68125-0_87

DOI: https://doi.org/10.1007/978-3-540-68125-0_87
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68124-3
Online ISBN: 978-3-540-68125-0
eBook Packages: Computer Science, Computer Science (R0)