Abstract
Recent years have witnessed increasing competition in science. While it promotes the quality of research in many cases, intense competition among scientists can also trigger unethical scientific behavior. To increase the total number of published papers, some authors even resort to software tools that produce grammatical but meaningless scientific manuscripts. Because automatically generated papers can be mistaken for real papers, it becomes of paramount importance to develop means to identify these scientific frauds. In this paper, I devise a methodology to distinguish real manuscripts from those generated with SCIGen, an automatic paper generator. By modeling texts as complex networks (CN), it was possible to discriminate real from fake papers with an accuracy of at least 89 %. A systematic analysis of feature relevance revealed that accessibility and betweenness were useful in particular cases, even though the relevance depended upon the dataset. The successful application of the methods described here shows, as a proof of principle, that network features can be used to identify gibberish scientific papers. In addition, the CN-based approach can be combined in a straightforward fashion with traditional statistical language processing methods to improve the performance in identifying artificially generated papers.
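As an illustration of what modeling a text as a complex network can look like, the following minimal sketch links words that appear side by side and computes betweenness, one of the features named above. It rests on assumptions that the abstract does not confirm: an undirected word adjacency network, a simple lowercase tokenizer, and the hypothetical function names `build_adjacency_network`, `betweenness` and `network_features`. Accessibility and the classification step are not shown.

```python
import re
from collections import defaultdict, deque


def build_adjacency_network(text):
    """Link every pair of words that appear side by side (assumed construction)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops from immediate repetitions
            graph[left].add(right)
            graph[right].add(left)
    return dict(graph)


def betweenness(graph):
    """Unnormalised betweenness (Brandes' algorithm, unweighted, undirected)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies in reverse breadth-first order.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected path was counted once from each endpoint.
    return {node: value / 2.0 for node, value in score.items()}


def network_features(text):
    """Summarise a text by a few averaged network measurements."""
    graph = build_adjacency_network(text)
    n = len(graph)
    degrees = [len(neighbours) for neighbours in graph.values()]
    centrality = betweenness(graph)
    return {
        "nodes": n,
        "mean_degree": sum(degrees) / n if n else 0.0,
        "mean_betweenness": sum(centrality.values()) / n if n else 0.0,
    }
```

Feature vectors of this kind, computed for real and for generated manuscripts, could then be passed to any supervised classifier. Which measurements and which classifier to use are left open here.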
Acknowledgments
I am thankful to the São Paulo Research Foundation (FAPESP) (grant number 14/20830-0) for financial support.
Cite this article
Amancio, D.R. Comparing the topological properties of real and artificially generated scientific manuscripts. Scientometrics 105, 1763–1779 (2015). https://doi.org/10.1007/s11192-015-1637-z
DOI: https://doi.org/10.1007/s11192-015-1637-z