Comparing the topological properties of real and artificially generated scientific manuscripts

Amancio, Diego Raphael

doi:10.1007/s11192-015-1637-z

Published: 15 July 2015

Scientometrics, Volume 105, pages 1763–1779 (2015)

Diego Raphael Amancio¹

Abstract

Competition in science has intensified in recent years. While it often promotes the quality of research, intense competition among scientists can also trigger unethical scientific behavior. To increase their total number of published papers, some authors even resort to software tools that produce grammatical but meaningless scientific manuscripts. Because automatically generated papers can be mistaken for real ones, it is of paramount importance to develop means of identifying these scientific frauds. In this paper, I devise a methodology to distinguish real manuscripts from those generated with SCIGen, an automatic paper generator. By modeling texts as complex networks (CN), it was possible to discriminate real from fake papers with at least 89% accuracy. A systematic analysis of feature relevance revealed that accessibility and betweenness were useful in particular cases, even though their relevance depended on the dataset. The successful application of the methods described here shows, as a proof of principle, that network features can be used to identify gibberish scientific papers. In addition, the CN-based approach can be combined in a straightforward fashion with traditional statistical language processing methods to improve performance in identifying artificially generated papers.
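To make the idea of modeling a text as a complex network concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not state how the networks are built, so the sketch assumes a simple word adjacency model in which consecutive words are linked by an undirected, unweighted edge. It computes betweenness with Brandes' algorithm; accessibility and the classification step are left out. The function names `build_word_network`, `betweenness`, and `network_features` are hypothetical.

```python
import re
from collections import defaultdict, deque


def build_word_network(text):
    """Link every pair of adjacent words (assumed model: undirected, unweighted)."""
    words = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:  # skip self-loops from immediate repetitions
            graph[a].add(b)
            graph[b].add(a)
    for w in words:
        graph[w]  # make sure words without neighbors still appear as nodes
    return dict(graph)


def betweenness(graph):
    """Unnormalized betweenness via Brandes' algorithm (BFS, unweighted graph)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                          # nodes in order of discovery
        preds = {v: [] for v in graph}      # predecessors on shortest paths
        sigma = {v: 0 for v in graph}       # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}     # dependency of s on each node
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair is visited from both endpoints.
    return {v: b / 2 for v, b in score.items()}


def network_features(text):
    """Summarize a text by two illustrative network measurements."""
    graph = build_word_network(text)
    n = len(graph)
    if n == 0:
        return {"mean_degree": 0.0, "mean_betweenness": 0.0}
    bt = betweenness(graph)
    return {
        "mean_degree": sum(len(nb) for nb in graph.values()) / n,
        "mean_betweenness": sum(bt.values()) / n,
    }
```

Feature vectors of this kind, computed for a set of real papers and a set of generated ones, could then be passed to any supervised classifier; which measurements and classifiers work best is an empirical question that this sketch does not address.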

Acknowledgments

I am thankful to the São Paulo Research Foundation (FAPESP) (grant number 14/20830-0) for financial support.

Author information

Authors and Affiliations

Institute of Mathematics and Computer Science, University of São Paulo, P. O. Box 369, São Carlos, SP, 13560-970, Brazil
Diego Raphael Amancio

Corresponding author

Correspondence to Diego Raphael Amancio.

About this article

Cite this article

Amancio, D.R. Comparing the topological properties of real and artificially generated scientific manuscripts. Scientometrics 105, 1763–1779 (2015). https://doi.org/10.1007/s11192-015-1637-z

Received: 15 February 2015
Published: 15 July 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s11192-015-1637-z
