Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?

Abstract

Two kinds of bibliographic tools are used to retrieve scientific publications and make them available online. Tools of the first kind are free to access because they store information that is publicly available online. Tools of the second kind charge access fees because they are compiled from information provided by the major publishers of scientific literature. The former can easily be interfered with, but it is generally assumed that the latter guarantee the integrity of the data they sell. Unfortunately, duplicate and fake publications are appearing in scientific conferences and, as a result, in the bibliographic services. We demonstrate a software method for detecting these duplicate and fake publications. Both the free services (such as Google Scholar and DBLP) and the charged-for services (such as IEEE Xplore) accept and index these publications.

Notes

  1. http://ip-science.thomsonreuters.com/news/2005-04/8272986/.

  2. Bibliographic information and corpora are available upon request to the authors.

  3. http://arxiv.org/help/endorsement.

  4. http://pdos.csail.mit.edu/scigen/.

  5. Blog post: http://pythonic.pocoo.org/2009/1/28/fun-with-scigen; SCIgen-Physics Sources: http://bitbucket.org/birkenfeld/scigen-physics/overview.

  6. http://paperdetection.blogspot.com/.

  7. http://montana.informatics.indiana.edu/cgi-bin/fsi/fsi.cgi.

  8. http://sigma.imag.fr/labbe/main.php.

  9. February and March 2012.

References

  1. Ball, P. (2005). Computer conference welcomes gobbledegook paper. Nature, 434, 946.

  2. Beel, J., & Gipp, B. (2010). Academic search engine spam and Google Scholar’s resilience against it. Journal of Electronic Publishing, 13(3). http://hdl.handle.net/2027/spo.3336451.0013.305.

  3. Benzecri, J. P. (1980). L’analyse des données. Paris: Dunod.

  4. Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.

  5. Dalkilic, M. M., Clark, W. T., Costello, J. C., & Radivojac, P. (2006). Using compression to identify classes of inauthentic texts. In Proceedings of the 2006 SIAM Conference on Data Mining.

  6. Elmacioglu, E., & Lee, D. (2009). Oracle, where shall I submit my papers? Communications of the ACM (CACM), 52(2), 115–118.

  7. Falagas, M. E., Pitsouni, E. I., Malietzis, G. A., & Pappas, G. (2008). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. The FASEB Journal, 22(2), 338–342.

  8. Hockey, S., & Martin, J. (1988). OCP users’ manual. Oxford: Oxford University Computing Service.

  9. Jacso, P. (2008). Testing the calculation of a realistic h-index in Google Scholar, Scopus, and Web of Science for F. W. Lancaster. Library Trends, 56(4).

  10. Jacso, P. (2008). The pros and cons of computing the h-index using Google Scholar. Online Information Review, 32(3), 437–452. doi:10.1108/14684520810889718.

  11. Kato, J. (2005). ISI Web of Knowledge: proven track record of high quality and value. KnowledgeLink newsletter from Thomson Scientific.

  12. Labbé, C. (2010). Ike Antkare, one of the great stars in the scientific firmament. International Society for Scientometrics and Informetrics Newsletter, 6(2), 48–52.

  13. Labbé, C., & Labbé, D. (2001). Inter-textual distance and authorship attribution Corneille and Molière. Journal of Quantitative Linguistics, 8(3), 213–231.

  14. Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English. Journal of Quantitative Linguistics, 14(1), 33–80.

  15. Lavoie, A., & Krishnamoorthy, M. (2010). Algorithmic detection of computer generated text. ArXiv e-prints.

  16. Lee, L. (1999). Measures of distributional similarity. In 37th Annual Meeting of the Association for Computational Linguistics, pp. 25–32.

  17. Li, M., Chen, X., Li, X., Ma, B., & Vitanyi, P. (2004). The similarity metric. IEEE Transactions on Information Theory, 50(12), 3250–3264.

  18. Meyer, D., Hornik, K., & Feinerer, I. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 569–576.

  19. Parnas, D. L. (2007). Stop the numbers game. Communications of the ACM, 50(11), 19–21.

  20. Roux, M. (1985). Algorithmes de classification. Paris: Masson.

  21. Roux, M. (1994). Classification des données d’enquête. Paris: Dunod.

  22. Savoy, J. (2006). Les résultats de Google sont-ils biaisés? Genève: Le Temps.

  23. Sneath, P., & Sokal, R. (1973). Numerical Taxonomy. San Francisco: Freeman.

  24. Xiong, J., & Huang, T. (2009). An effective method to identify machine automatically generated paper. In Pacific-Asia Conference on Knowledge Engineering and Software Engineering, 2009, KESE ’09, pp. 101–102.

  25. Yang, K., & Meho, L. I. (2006). Citation analysis: a comparison of Google Scholar, Scopus, and Web of Science. American Society for Information Science and Technology, 43(1), 1–15.

Acknowledgments

The authors would like to thank Tom Merriam, Jacques Savoy, and Edward Arnold for their careful reading of previous versions of this paper, and the anonymous reviewers and members of the LIG laboratory for their valuable comments.

Author information

Corresponding author

Correspondence to Cyril Labbé.

Appendices

Appendix 1: Examples of SCIgen papers

Figure 5 is an example of a SCIgen-Physics paper. Its formula generation has been improved compared with that of SCIgen-Origin (cf. Fig. 6).

Fig. 5

Generated text, graph and formula: SCIgen Physics

Fig. 6

Generated text: SCIgen Computer Science

Appendix 2: Comparison between inter-textual distance and other similarity indices

Figures 7, 8 and 9 show the dendrograms obtained using the cosine, Euclidean and Jaccard metrics, respectively. They were computed using the R text mining package (Meyer et al. 2008). These dendrograms are to be compared with the one in Fig. 4. The cosine and Euclidean dendrograms do not group the Ike Antkare corpus together.
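To make concrete what the three comparison metrics measure, here is a minimal Python sketch that computes cosine, Euclidean and Jaccard distances between two texts. It is not the R code behind the figures. The whitespace tokeniser, the use of raw term counts, the set-based form of the Jaccard index, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def term_counts(text):
    """Count lower-cased, whitespace-separated tokens (a crude, assumed tokeniser)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of the two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty texts
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between raw term-count vectors over the joint vocabulary."""
    vocabulary = a.keys() | b.keys()
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in vocabulary))


def jaccard_distance(a, b):
    """1 - |shared vocabulary| / |joint vocabulary| (set-based variant, an assumption)."""
    words_a, words_b = set(a), set(b)
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


if __name__ == "__main__":
    first = term_counts("the cat sat")
    second = term_counts("the cat ran")
    print(cosine_distance(first, second))
    print(euclidean_distance(first, second))
    print(jaccard_distance(first, second))
```

A matrix of such pairwise distances is the kind of input a hierarchical clustering routine takes to draw a dendrogram. Note that the Euclidean distance on raw counts grows with text length, whereas the cosine distance does not.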

Fig. 7

Cosine: dendrogram for analysis of corpora Antkare (black), Z (blue), MLT (red) (color figure online)

Fig. 8

Euclidean: dendrogram for analysis of corpora Antkare (black), Z (blue), MLT (red) (color figure online)

Fig. 9

Jaccard: dendrogram for analysis of corpora Antkare (black), Z (blue), MLT (red) (color figure online)

Table 4 gives the results of classifying each text of the MLT corpus by assigning it to the class of its nearest neighbor. The arXiv data set was not tested because its size makes the use of the R text mining package problematic.

Table 4 Classification of the MLT Corpus (122 papers) using Inter-textual distance, Cosine, Euclidean and Jaccard metrics
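The nearest-neighbor assignment described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it assumes a precomputed symmetric distance matrix, assumes a leave-one-out protocol (each text is compared with every other text, never with itself), and uses hypothetical function names and toy data.

```python
def nearest_neighbor_label(index, labels, distances):
    """Return the label of the text closest to text `index`, excluding itself."""
    candidates = (j for j in range(len(labels)) if j != index)
    nearest = min(candidates, key=lambda j: distances[index][j])
    return labels[nearest]


def leave_one_out_accuracy(labels, distances):
    """Fraction of texts whose nearest neighbor carries the same label."""
    correct = sum(
        1
        for i in range(len(labels))
        if nearest_neighbor_label(i, labels, distances) == labels[i]
    )
    return correct / len(labels)


if __name__ == "__main__":
    # Toy data: four texts, two per class; distances[i][j] is the
    # distance between text i and text j under any chosen metric.
    toy_labels = ["generated", "generated", "genuine", "genuine"]
    toy_distances = [
        [0, 1, 5, 6],
        [1, 0, 5, 7],
        [5, 5, 0, 2],
        [6, 7, 2, 0],
    ]
    print(leave_one_out_accuracy(toy_labels, toy_distances))
```

Because the function only reads the distance matrix, the same routine can be run once per metric (inter-textual distance, cosine, Euclidean, Jaccard) to produce comparable scores.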

About this article

Cite this article

Labbé, C., Labbé, D. Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics 94, 379–396 (2013). https://doi.org/10.1007/s11192-012-0781-y

Keywords

  • Bibliographic tools
  • Scientific conferences
  • Fake publications
  • Text-mining
  • Inter-textual distance
  • Google Scholar
  • Scopus
  • WoK