Comparing the topological properties of real and artificially generated scientific manuscripts

Scientometrics

Abstract

Recent years have witnessed increasing competition in science. While it promotes the quality of research in many cases, intense competition among scientists can also trigger unethical scientific behavior. To increase their total number of published papers, some authors even resort to software tools that produce grammatical but meaningless scientific manuscripts. Because automatically generated papers can be mistaken for real papers, it is of paramount importance to develop means of identifying these scientific frauds. In this paper, I devise a methodology to distinguish real manuscripts from those generated with SCIGen, an automatic paper generator. By modeling texts as complex networks (CN), it was possible to discriminate real from fake papers with an accuracy of at least 89%. A systematic analysis of feature relevance revealed that accessibility and betweenness were useful in particular cases, even though their relevance depended on the dataset. The successful application of the methods described here shows, as a proof of principle, that network features can be used to identify gibberish scientific papers. In addition, the CN-based approach can be combined in a straightforward fashion with traditional statistical language processing methods to improve performance in identifying artificially generated papers.
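To make the idea of modeling a text as a complex network more concrete, the following is a minimal sketch, not the method used in the paper. It assumes a simple construction in which each distinct word is a node and adjacent words are linked by an undirected, unweighted edge; the abstract does not specify the construction, so this is an illustrative assumption. The function names `build_cooccurrence_network` and `betweenness` are hypothetical. Betweenness is computed with Brandes' algorithm for unweighted graphs; accessibility is not covered here.

```python
from collections import defaultdict, deque
import re


def build_cooccurrence_network(text):
    """Build an undirected word network (assumed construction):
    two words are linked if they appear next to each other."""
    words = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:  # ignore self-loops from repeated words
            graph[left].add(right)
            graph[right].add(left)
    return dict(graph)


def betweenness(graph):
    """Unnormalized betweenness centrality of every node,
    using Brandes' algorithm on an unweighted, undirected graph."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}   # number of shortest paths
        dist = {node: -1 for node in graph}   # -1 means "not visited yet"
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = {node: 0.0 for node in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted twice.
    return {node: value / 2 for node, value in score.items()}


# Usage: words that bridge many shortest paths receive higher scores.
network = build_cooccurrence_network("the cat sat on the mat")
for word, value in sorted(betweenness(network).items()):
    print(word, value)
```

In a classification setting, per-word values like these would typically be summarized (for example, by their mean over the text) and passed to a supervised classifier as features; that summarization step is also an assumption rather than a detail stated in the abstract.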

Acknowledgments

I am thankful to the São Paulo Research Foundation (FAPESP) (grant number 14/20830-0) for financial support.

Author information

Corresponding author

Correspondence to Diego Raphael Amancio.

About this article

Cite this article

Amancio, D.R. Comparing the topological properties of real and artificially generated scientific manuscripts. Scientometrics 105, 1763–1779 (2015). https://doi.org/10.1007/s11192-015-1637-z

  • DOI: https://doi.org/10.1007/s11192-015-1637-z
