Asia-Pacific Web Conference

APWeb 2011: Web Technologies and Applications pp 412-423

Batch Text Similarity Search with MapReduce

  • Rui Li
  • Li Ju
  • Zhuo Peng
  • Zhiwei Yu
  • Chaokun Wang
Conference paper

DOI: 10.1007/978-3-642-20291-9_46

Volume 6612 of the book series Lecture Notes in Computer Science (LNCS)
Cite this paper as:
Li R., Ju L., Peng Z., Yu Z., Wang C. (2011) Batch Text Similarity Search with MapReduce. In: Du X., Fan W., Wang J., Peng Z., Sharaf M.A. (eds) Web Technologies and Applications. APWeb 2011. Lecture Notes in Computer Science, vol 6612. Springer, Berlin, Heidelberg

Abstract

Batch text similarity search finds the texts similar to each query in a batch of user-submitted text queries. It is widely used in practice, for example in plagiarism checking, and is attracting growing attention as the volume of text on the web increases. Existing approaches, such as FuzzyJoin, support neither varying similarity thresholds nor online batch text similarity search. This paper proposes a two-stage algorithm, built on inverted index structures, that effectively solves the batch text similarity search problem. Experimental results on real datasets demonstrate the efficiency and scalability of the method.
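The abstract does not spell out the two stages, so the following is only an illustrative sketch of one plausible reading, not the authors' algorithm. It assumes stage one builds an inverted index over the corpus and stage two probes that index with a batch of queries and filters candidate pairs by Jaccard similarity. The helper `run_mapreduce` is a hypothetical in-memory stand-in for a MapReduce job, and all function names, the whitespace tokenizer, and the sample data are assumptions made for the example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def tokens(text):
    """Assumed tokenizer: lowercase, whitespace-separated, duplicates removed."""
    return set(text.lower().split())


# Stage 1 (assumed): build an inverted index, token -> [(doc_id, doc_token_count)].
def index_mapper(record):
    doc_id, text = record
    toks = tokens(text)
    for tok in toks:
        yield tok, (doc_id, len(toks))


def index_reducer(token, postings):
    yield token, sorted(postings)


def build_index(corpus):
    return dict(run_mapreduce(corpus.items(), index_mapper, index_reducer))


# Stage 2 (assumed): probe the index with a batch of queries and verify candidates.
def batch_search(index, queries, threshold):
    def probe_mapper(record):
        query_id, text = record
        toks = tokens(text)
        for tok in toks:
            for doc_id, doc_size in index.get(tok, []):
                # One emitted value per token shared by the query and the document.
                yield (query_id, doc_id), (len(toks), doc_size)

    def verify_reducer(pair, values):
        overlap = len(values)  # number of shared tokens
        query_size, doc_size = values[0]
        jaccard = overlap / (query_size + doc_size - overlap)
        if jaccard >= threshold:
            yield pair[0], pair[1], round(jaccard, 3)

    return run_mapreduce(queries.items(), probe_mapper, verify_reducer)


corpus = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "quick brown fox jumps",
}
queries = {"q1": "quick brown fox", "q2": "lazy dog sleeps"}

index = build_index(corpus)                # built once
print(batch_search(index, queries, 0.6))   # stricter threshold
print(batch_search(index, queries, 0.5))   # looser threshold, same index
```

Because the index in this sketch stores no threshold-specific information, it can be reused as the threshold changes, which is one way to picture the threshold flexibility the abstract says existing join-based approaches lack.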

Keywords

MapReduce, Batch Text Similarity Search

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Rui Li (1, 2, 3)
  • Li Ju (4)
  • Zhuo Peng (1)
  • Zhiwei Yu (5)
  • Chaokun Wang (1, 2, 3)
  1. School of Software, Tsinghua University, Beijing, China
  2. Tsinghua National Laboratory for Information Science and Technology, China
  3. Key Laboratory for Information System Security, Ministry of Education, China
  4. Department of Information Engineering, Henan College of Finance and Taxation, Zhengzhou, China
  5. Department of Computer Science and Technology, Tsinghua University, China