International Conference on Similarity Search and Applications

Similarity Search and Applications, pp. 332-338

On the Use of Similarity Search to Detect Fake Scientific Papers

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9371)


Fake scientific papers have recently attracted interest within the academic community following the identification of fake papers in the digital libraries of major academic publishers [8]. Detecting and removing these papers is important for many reasons. We investigate the use of similarity search for detecting fake scientific papers, comparing several methods for signature construction and similarity scoring, and describe a pseudo-relevance feedback technique that can improve the effectiveness of these methods. Experiments on a dataset of 40,000 computer science papers show that precision, recall and MAP scores of 0.96, 0.99 and 0.99, respectively, can be achieved, demonstrating that similarity search is useful both for detecting fake scientific papers and for ranking them highly.
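As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds a signature for each document, scores candidates against a known fake paper, and evaluates the resulting ranking. It is only a minimal sketch under stated assumptions: the abstract does not say which signature or scoring methods were compared, so word shingles with Jaccard similarity are used here purely as a familiar stand-in, and all function names (`shingles`, `jaccard`, `rank_by_similarity`, `average_precision`) are hypothetical. The pseudo-relevance feedback step is not shown because its details are not given here.

```python
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles of a text (assumed signature type)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Short texts yield a single shingle, or nothing if the text is empty.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity between two signature sets (assumed scoring function)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_similarity(
    query: str, collection: Dict[str, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Rank documents by similarity to a query document, highest score first."""
    q = shingles(query, k)
    scored = [
        (doc_id, jaccard(q, shingles(text, k)))
        for doc_id, text in collection.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def average_precision(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranking; MAP is its mean over several queries."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document's rank
    return total / len(relevant) if relevant else 0.0
```

In this sketch a known fake paper would serve as the query, and documents scoring above some threshold would be flagged as likely fakes; the threshold, like the shingle size `k`, is a free parameter that would need tuning on labelled data.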


Keywords: Similarity search, Fake papers, SciGen




  1. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the Web. Computer Networks and ISDN Systems 29(8–13), 1157–1166 (1997)
  2. Butler, D.: Investigating journals: The dark side of publishing. Nature 495(7442), 433–435 (2013)
  3. Gad-el-Hak, M.: Publish or perish - an ailing enterprise? Physics Today 57(3), 61–62 (2004)
  4. Labbé, C., Labbé, D.: Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics 94(1), 379–396 (2012)
  5. Manku, G., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: WWW, pp. 141–149 (2007)
  6. Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic keyphrase extraction. In: EMNLP, vol. 3, pp. 1318–1327 (2009)
  7. Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.: Overview of the 6th international competition on plagiarism detection. In: CLEF (2014)
  8. Van Noorden, R.: Publishers withdraw more than 120 gibberish papers. Nature, February 2014
  9. Williams, K., Giles, C.L.: Near duplicate detection in an academic digital library. In: DocEng, pp. 91–94 (2013)
  10. Xiong, J., Huang, T.: An effective method to identify machine automatically generated paper. In: KESE, pp. 101–102. IEEE (2009)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Information Sciences and Technology, The Pennsylvania State University, State College, USA
  2. Computer Science and Engineering, The Pennsylvania State University, State College, USA
