
Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?


Two kinds of bibliographic tools are used to retrieve scientific publications and make them available online. For one kind, access is free, as they store information that is publicly available online. For the other kind, access fees are required, as they are compiled from information provided by the major publishers of scientific literature. The former can easily be interfered with, but it is generally assumed that the latter guarantee the integrity of the data they sell. Unfortunately, duplicate and fake publications are appearing in scientific conferences and, as a result, in the bibliographic services. We demonstrate a software method of detecting these duplicate and fake publications. Both the free services (such as Google Scholar and DBLP) and the charged-for services (such as IEEE Xplore) accept and index these publications.





  2. Bibliographic information and corpora are available upon request to the authors.



  5. Blog post:; SCIgen-Physics Sources:




  9. February and March 2012.


References

  • Ball, P. (2005). Computer conference welcomes gobbledegook paper. Nature, 434, 946.

  • Beel, J., & Gipp, B. (2010). Academic search engine spam and Google Scholar’s resilience against it. Journal of Electronic Publishing, 13(3).

  • Benzecri, J. P. (1980). L’analyse des données. Paris: Dunod.

  • Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.

  • Dalkilic, M. M., Clark, W. T., Costello, J. C., & Radivojac, P. (2006). Using compression to identify classes of inauthentic texts. In Proceedings of the 2006 SIAM Conference on Data Mining.

  • Elmacioglu, E., & Lee, D. (2009). Oracle, where shall I submit my papers? Communications of the ACM, 52(2), 115–118.

  • Falagas, M. E., Pitsouni, E. I., Malietzis, G. A., & Pappas, G. (2008). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. The FASEB Journal, 22(2), 338–342.

  • Hockey, S., & Martin, J. (1988). OCP users’ manual. Oxford: Oxford University Computing Service.

  • Jacso, P. (2008). Testing the calculation of a realistic h-index in Google Scholar, Scopus, and Web of Science for F. W. Lancaster. Library Trends, 56(4).

  • Jacso, P. (2008). The pros and cons of computing the h-index using Google Scholar. Online Information Review, 32(3), 437–452. doi:10.1108/14684520810889718.

  • Kato, J. (2005). ISI Web of Knowledge: proven track record of high quality and value. KnowledgeLink newsletter from Thomson Scientific.

  • Labbé, C. (2010). Ike Antkare, one of the great stars in the scientific firmament. International Society for Scientometrics and Informetrics Newsletter, 6(2), 48–52.

  • Labbé, C., & Labbé, D. (2001). Inter-textual distance and authorship attribution Corneille and Molière. Journal of Quantitative Linguistics, 8(3), 213–231.

  • Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English. Journal of Quantitative Linguistics, 14(1), 33–80.

  • Lavoie, A., & Krishnamoorthy, M. (2010). Algorithmic detection of computer generated text. ArXiv e-prints.

  • Lee, L. (1999). Measures of distributional similarity. In 37th Annual Meeting of the Association for Computational Linguistics, pp. 25–32.

  • Li, M., Chen, X., Li, X., Ma, B., & Vitanyi, P. (2004). The similarity metric. IEEE Transactions on Information Theory, 50(12), 3250–3264.

  • Meyer, D., Hornik, K., & Feinerer, I. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 569–576.

  • Parnas, D. L. (2007). Stop the numbers game. Communications of the ACM, 50(11), 19–21.

  • Roux, M. (1985). Algorithmes de classification. Paris: Masson.

  • Roux, M. (1994). Classification des données d’enquête. Paris: Dunod.

  • Savoy, J. (2006). Les résultats de Google sont-ils biaisés? Genève: Le Temps.

  • Sneath, P., & Sokal, R. (1973). Numerical taxonomy. San Francisco: Freeman.

  • Xiong, J., & Huang, T. (2009). An effective method to identify machine automatically generated paper. In Pacific-Asia Conference on Knowledge Engineering and Software Engineering, 2009, KESE ’09, pp. 101–102.

  • Yang, K., & Meho, L. I. (2006). Citation analysis: a comparison of Google Scholar, Scopus, and Web of Science. American Society for Information Science and Technology, 43(1), 1–15.


The authors would like to thank Tom Merriam, Jacques Savoy, and Edward Arnold for their careful readings of previous versions of this paper, and the anonymous reviewers and members of the LIG laboratory for their valuable comments.

Author information



Corresponding author

Correspondence to Cyril Labbé.


Appendix 1: Examples of SCIgen papers

Figure 5 shows an example of a SCIgen-Physics paper. Formula generation has been improved compared to that of SCIgen-Origin (cf. Fig. 6).

Fig. 5 Generated text, graph and formula: SCIgen Physics

Fig. 6 Generated text: SCIgen Computer Science

Appendix 2: Comparison between inter-textual distance and other similarity indices

Figures 7, 8 and 9 show the dendrograms obtained using the cosine, Euclidean and Jaccard metrics, respectively. They are computed using the R text mining package (Meyer et al. 2008). These dendrograms are to be compared to the one in Fig. 4. The dendrograms for the cosine and Euclidean metrics do not group the Ike Antkare corpus together.
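For readers unfamiliar with these three baseline metrics, a minimal sketch follows. It is not the authors' code: the dendrograms in this appendix were computed with the R text mining package, whereas this is a plain Python stand-in. The helper names (`word_counts`, `cosine_distance`, `jaccard_distance`, `euclidean_distance`), the tokenization rule, and the toy sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count occurrences of alphabetic words.

    The tokenization rule here is an assumption, not the paper's."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty texts
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """1 minus shared vocabulary over combined vocabulary (presence only)."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


def euclidean_distance(a, b):
    """Euclidean distance between raw (unnormalized) word-count vectors."""
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in a.keys() | b.keys()))


# Toy usage: pairwise distances between three made-up sentences.
texts = {
    "t1": "we present a novel methodology for hash tables",
    "t2": "we present a novel framework for hash tables",
    "t3": "patients were followed for two years after surgery",
}
counts = {name: word_counts(body) for name, body in texts.items()}
pairwise = {
    (x, y): (
        cosine_distance(counts[x], counts[y]),
        jaccard_distance(counts[x], counts[y]),
        euclidean_distance(counts[x], counts[y]),
    )
    for x in counts
    for y in counts
    if x < y
}
```

One design point worth noting: the Euclidean distance on raw counts grows with text length, while the cosine and Jaccard distances do not, so the three metrics can rank the same pairs of texts differently.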

Fig. 7 Cosine: dendrogram for analysis of corpora Antkare (black), Z (blue), MLT (red) (color figure online)

Fig. 8 Euclidean: dendrogram for analysis of corpora Antkare (black), Z (blue), MLT (red) (color figure online)

Fig. 9 Jaccard: dendrogram for analysis of corpora Antkare (black), Z (blue), MLT (red) (color figure online)

Table 4 gives the results of classifying each text of the MLT corpus by assigning it to the class of its nearest neighbor. The arXiv data set was not tested because its size makes the use of the R text mining package problematic.

Table 4 Classification of the MLT Corpus (122 papers) using Inter-textual distance, Cosine, Euclidean and Jaccard metrics
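The nearest-neighbor rule used for this classification can be illustrated with a short sketch. It assumes only what the text states, namely that a text receives the class of its closest labeled text under some distance. The function names, the "generated" and "genuine" labels, the toy sentences, and the choice of a Jaccard distance on word sets are hypothetical and are not taken from the paper.

```python
def tokens(text):
    """Split a text into a set of lower-cased words (assumed tokenization)."""
    return set(text.lower().split())


def jaccard_distance(words_a, words_b):
    """1 minus shared vocabulary over combined vocabulary."""
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def nearest_neighbor_label(query, labeled_texts, distance):
    """Return the label of the labeled text closest to `query`.

    `labeled_texts` is a list of (text_representation, label) pairs and
    `distance` is any function taking two representations."""
    best_label, best_dist = None, float("inf")
    for text, label in labeled_texts:
        d = distance(query, text)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Made-up reference texts with hypothetical class labels.
reference = [
    (tokens("we present a novel methodology for the emulation of hash tables"),
     "generated"),
    (tokens("the patients were followed for two years after surgery"),
     "genuine"),
]

query = tokens("a novel methodology for the refinement of hash tables")
label = nearest_neighbor_label(query, reference, jaccard_distance)
# label is "generated": the query shares 8 of 12 distinct words with the
# first reference text and only 2 of 16 with the second.
```

Because `distance` is passed in as a parameter, the same routine could be run once per metric to produce a comparison of the kind summarized in Table 4.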


About this article

Cite this article

Labbé, C., Labbé, D. Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?. Scientometrics 94, 379–396 (2013).



  • Bibliographic tools
  • Scientific conferences
  • Fake publications
  • Text-mining
  • Inter-textual distance
  • Google Scholar
  • Scopus
  • WoK