Batch Text Similarity Search with MapReduce

  • Rui Li
  • Li Ju
  • Zhuo Peng
  • Zhiwei Yu
  • Chaokun Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6612)


Batch text similarity search aims to find the texts similar to each query in a batch of text queries submitted by a user. It is widely used in real-world applications such as plagiarism detection, and it is attracting growing attention as the volume of text on the web increases. Existing approaches, such as FuzzyJoin, support neither varying similarity thresholds nor online batch text similarity search. This paper proposes a two-stage algorithm, built on inverted index structures, that effectively addresses the batch text similarity search problem. Experimental results on real datasets demonstrate the efficiency and scalability of the method.
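The abstract does not spell out what the two stages are. Purely as an illustration, and not as the authors' algorithm, the following single-machine Python sketch shows one way an inverted index can serve a batch of queries under a threshold chosen at query time. The function names (`tokenize`, `build_index`, `batch_search`), the whitespace tokenizer, and the choice of Jaccard similarity are all assumptions made for this example; a real MapReduce job would distribute the map and reduce steps marked in the comments.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens, treated as a set.
    return set(text.lower().split())


def build_index(docs):
    """Offline stage: build token -> {doc_id} postings plus per-document set sizes."""
    index = defaultdict(set)
    sizes = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        sizes[doc_id] = len(tokens)
        for tok in tokens:
            # "map" would emit (tok, doc_id); "reduce" would group by tok.
            index[tok].add(doc_id)
    return index, sizes


def batch_search(index, sizes, queries, threshold):
    """Online stage: answer every query in the batch against the prebuilt index."""
    results = {}
    for query_id, text in queries.items():
        q_tokens = tokenize(text)
        # Count shared tokens per candidate document via the postings lists.
        shared = defaultdict(int)
        for tok in q_tokens:
            for doc_id in index.get(tok, ()):
                shared[doc_id] += 1
        hits = []
        for doc_id, overlap in shared.items():
            union = len(q_tokens) + sizes[doc_id] - overlap
            similarity = overlap / union  # Jaccard similarity of the token sets
            if similarity >= threshold:
                hits.append((doc_id, similarity))
        # Most similar first; ties broken by document id for determinism.
        results[query_id] = sorted(hits, key=lambda h: (-h[1], h[0]))
    return results


# Hypothetical toy data, not taken from the paper.
docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "quick brown fox jumps",
}
queries = {"q1": "quick brown fox", "q2": "the lazy dog"}

index, sizes = build_index(docs)
loose = batch_search(index, sizes, queries, threshold=0.7)
strict = batch_search(index, sizes, queries, threshold=0.8)
```

Because the threshold is applied only in the online stage, the same index answers both the `loose` and the `strict` call without being rebuilt, which is the kind of threshold variation the abstract says existing approaches do not support.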


Keywords: MapReduce, Batch Text Similarity Search





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Rui Li (1, 2, 3)
  • Li Ju (4)
  • Zhuo Peng (1)
  • Zhiwei Yu (5)
  • Chaokun Wang (1, 2, 3)

  1. School of Software, Tsinghua University, Beijing, China
  2. Tsinghua National Laboratory for Information Science and Technology, China
  3. Key Laboratory for Information System Security, Ministry of Education, China
  4. Department of Information Engineering, Henan College of Finance and Taxation, Zhengzhou, China
  5. Department of Computer Science and Technology, Tsinghua University, China
