Two kinds of bibliographic tools are used to retrieve scientific publications and make them available online. The first kind is free to access, since these tools store information that is publicly available online. The second kind charges access fees, since these tools are compiled from information provided by the major publishers of scientific literature. The former can easily be interfered with, but it is generally assumed that the latter guarantee the integrity of the data they sell. Unfortunately, duplicate and fake publications are appearing in scientific conferences and, as a result, in the bibliographic services. We demonstrate a software method for detecting these duplicate and fake publications. Both the free services (such as Google Scholar and DBLP) and the charged-for services (such as IEEE Xplore) accept and index these publications.
Bibliographic information and corpora are available from the authors upon request.
Blog post: http://pythonic.pocoo.org/2009/1/28/fun-with-scigen; SCIgen-Physics Sources: http://bitbucket.org/birkenfeld/scigen-physics/overview.
February and March 2012.
Ball, P. (2005). Computer conference welcomes gobbledegook paper. Nature, 434, 946.
Beel, J., & Gipp, B. (2010). Academic search engine spam and Google Scholar’s resilience against it. Journal of Electronic Publishing, 13(3). http://hdl.handle.net/2027/spo.3336451.0013.305.
Benzecri, J. P. (1980). L’analyse des données. Paris: Dunod.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.
Dalkilic, M. M., Clark, W. T., Costello, J. C., & Radivojac, P. (2006). Using compression to identify classes of inauthentic texts. In Proceedings of the 2006 SIAM Conference on Data Mining.
Elmacioglu, E., & Lee, D. (2009). Oracle, where shall I submit my papers? Communications of the ACM (CACM), 52(2), 115–118.
Falagas, M. E., Pitsouni, E. I., Malietzis, G. A., & Pappas, G. (2008). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. The FASEB Journal, 22(2), 338–342.
Hockey, S., & Martin, J. (1988). OCP users’ manual. Oxford: Oxford University Computing Service.
Jacso, P. (2008). Testing the calculation of a realistic h-index in Google Scholar, Scopus, and Web of Science for F. W. Lancaster. Library Trends, 56(4).
Jacso, P. (2008). The pros and cons of computing the h-index using Google Scholar. Online Information Review, 32(3), 437–452. doi:10.1108/14684520810889718.
Kato, J. (2005). ISI Web of Knowledge: proven track record of high quality and value. KnowledgeLink newsletter from Thomson Scientific.
Labbé, C. (2010). Ike Antkare, one of the great stars in the scientific firmament. International Society for Scientometrics and Informetrics Newsletter, 6(2), 48–52.
Labbé, C., & Labbé, D. (2001). Inter-textual distance and authorship attribution Corneille and Molière. Journal of Quantitative Linguistics, 8(3), 213–231.
Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English. Journal of Quantitative Linguistics, 14(1), 33–80.
Lavoie, A., & Krishnamoorthy, M. (2010). Algorithmic detection of computer generated text. ArXiv e-prints.
Lee, L. (1999). Measures of distributional similarity. In 37th Annual Meeting of the Association for Computational Linguistics, pp. 25–32.
Li, M., Chen, X., Li, X., Ma, B., & Vitanyi, P. (2004). The similarity metric. IEEE Transactions on Information Theory, 50(12), 3250–3264.
Meyer, D., Hornik, K., & Feinerer, I. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 569–576.
Parnas, D. L. (2007). Stop the numbers game. Communications of the ACM, 50(11), 19–21.
Roux, M. (1985). Algorithmes de classification. Paris: Masson.
Roux, M. (1994). Classification des données d’enquête. Paris: Dunod.
Savoy, J. (2006). Les résultats de Google sont-ils biaisés? Genève: Le Temps.
Sneath, P., & Sokal, R. (1973). Numerical Taxonomy. San Francisco: Freeman.
Xiong, J., & Huang, T. (2009). An effective method to identify machine automatically generated paper. In Pacific-Asia Conference on Knowledge Engineering and Software Engineering, 2009, KESE ’09, pp. 101–102.
Yang, K., & Meho, L. I. (2006). Citation analysis: a comparison of Google Scholar, Scopus, and Web of Science. American Society for Information Science and Technology, 43(1), 1–15.
The authors would like to thank Tom Merriam, Jacques Savoy, and Edward Arnold for their careful readings of previous versions of this paper, and the anonymous reviewers and members of the LIG laboratory for their valuable comments.
Appendix 1: Examples of SCIgen papers
Appendix 2: Comparison between inter-textual distance and other similarity indices
Figures 7, 8 and 9 show the dendrograms obtained using the cosine, Jaccard and Euclidean metrics. They are computed using the R text mining package (Meyer et al. 2008). These dendrograms are to be compared with the one in Fig. 4. The dendrograms for the cosine and Euclidean metrics do not group the Ike Antkare corpus together.
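To make the three metrics concrete, here is a minimal sketch in Python of how cosine, Jaccard and Euclidean distances can be computed between two texts represented as word-frequency vectors. It is an illustration only: the tokenization (lowercasing and splitting on whitespace), the function names, and the choice to compute Jaccard on vocabularies rather than on weighted counts are all assumptions made here, and the R text mining package used for the figures may define or normalize these measures differently.

```python
import math
from collections import Counter


def word_counts(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, count words."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty texts
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """1 - |shared vocabulary| / |combined vocabulary| (frequencies ignored)."""
    vocab_a, vocab_b = set(a), set(b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 1.0 - len(vocab_a & vocab_b) / len(union)


def euclidean_distance(a, b):
    """Euclidean distance between raw word-frequency vectors."""
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in a.keys() | b.keys()))


def distance_matrix(texts, distance):
    """Pairwise distances between all texts, as a list of lists."""
    counts = [word_counts(t) for t in texts]
    return [[distance(x, y) for y in counts] for x in counts]
```

A matrix produced by `distance_matrix` is the kind of input that a hierarchical clustering routine would turn into a dendrogram. Note that the Euclidean distance on raw counts is sensitive to text length, which is one reason the metrics can group the same corpus differently.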
Results for the classification that assigns each text of the MLT corpus to the class of its nearest neighbor are given in Table 4. The arXiv data set was not tested because its size makes the use of the R text mining package problematic.
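The nearest-neighbor rule described above can be sketched as follows. This is a hedged illustration, not the procedure behind Table 4: the function names, the class labels `"generated"` and `"genuine"`, the toy reference texts, and the use of a Jaccard distance on vocabularies are all assumptions introduced for the example. Any distance function with the same signature could be passed in instead.

```python
def jaccard_text_distance(text_a, text_b):
    """Assumed distance for this sketch: Jaccard distance on word sets."""
    vocab_a = set(text_a.lower().split())
    vocab_b = set(text_b.lower().split())
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 1.0 - len(vocab_a & vocab_b) / len(union)


def nearest_neighbor_label(query, labeled_texts, distance=jaccard_text_distance):
    """Return the label of the reference text closest to `query`.

    `labeled_texts` is a non-empty list of (text, label) pairs.
    Ties are resolved in favor of the earliest reference text.
    """
    best_label = None
    best_distance = float("inf")
    for text, label in labeled_texts:
        d = distance(query, text)
        if d < best_distance:
            best_distance = d
            best_label = label
    return best_label


# Toy reference set with made-up sentences and hypothetical labels.
reference = [
    ("we propose a novel methodology for the analysis of model checking", "generated"),
    ("the results confirm earlier measurements of citation counts", "genuine"),
]

predicted = nearest_neighbor_label("a novel methodology for model checking", reference)
```

Because every query is compared against every reference text, the cost grows with the size of the reference set, which is consistent with large collections being harder to process this way.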
Cite this article
Labbé, C., Labbé, D. Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics 94, 379–396 (2013). https://doi.org/10.1007/s11192-012-0781-y
- Bibliographic tools
- Scientific conferences
- Fake publications
- Inter-textual distance
- Google Scholar