Volume 94, Issue 1, pp 379–396

Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?

  • Cyril Labbé
  • Dominique Labbé


Two kinds of bibliographic tools are used to retrieve scientific publications and make them available online. For one kind, access is free as they store information made publicly available online. For the other kind, access fees are required as they are compiled from information provided by the major publishers of scientific literature. The former can easily be interfered with, but it is generally assumed that the latter guarantee the integrity of the data they sell. Unfortunately, duplicate and fake publications are appearing in scientific conferences and, as a result, in the bibliographic services. We demonstrate a software method of detecting these duplicate and fake publications. Both the free services (such as Google Scholar and DBLP) and the charged-for services (such as IEEE Xplore) accept and index these publications.


Bibliographic tools; Scientific conferences; Fake publications; Text-mining; Inter-textual distance; Google Scholar; Scopus; WoK



The authors would like to thank Tom Merriam, Jacques Savoy, and Edward Arnold for their careful readings of previous versions of this paper, and the anonymous reviewers and members of the LIG laboratory for their valuable comments.



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2012

Authors and Affiliations

  1. Laboratoire d’Informatique de Grenoble, Université Joseph Fourier, Grenoble, France
  2. PACTE, Institut d’Etudes Politiques de Grenoble, Grenoble, France
