Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?
Two kinds of bibliographic tools are used to retrieve scientific publications and make them available online. The first kind is free to access because it stores information that is already publicly available online. The second kind charges access fees because it is compiled from information provided by the major publishers of scientific literature. The former can easily be interfered with, but it is generally assumed that the latter guarantees the integrity of the data it sells. Unfortunately, duplicate and fake publications are appearing in scientific conferences and, as a result, in the bibliographic services. We demonstrate a software method of detecting these duplicate and fake publications. Both the free services (such as Google Scholar and DBLP) and the charged-for services (such as IEEE Xplore) accept and index these publications.
Keywords: Bibliographic tools; Scientific conferences; Fake publications; Text-mining; Inter-textual distance; Google Scholar; Scopus; WoK
The authors would like to thank Tom Merriam, Jacques Savoy, and Edward Arnold for their careful readings of previous versions of this paper, and the anonymous reviewers and members of the LIG laboratory for their valuable comments.