Comparative Analysis of Scientific Papers Collections via Topic Modeling and Co-authorship Networks

Part of the Communications in Computer and Information Science book series (CCIS, volume 1119)

Abstract

In this paper, the authors present an approach to benchmarking collections of scientific journals based on the analysis of co-authorship graphs and a text model. The main methodological result is the Comparative Topic Modeling (CTM) technique. Applying time series to the metrics of co-authorship graphs made it possible to analyze trends in the development of author collaborations in scientific journals. A text model was created using machine learning methods. The content of the journals was classified to determine the degree of authenticity both across journals and across their issues. Experiments were conducted on the archives of two journals in the field of rheumatology. The authors used public data sets from the SNAP research laboratory at Stanford University to benchmark the co-authorship network metrics. The research results can be applied to improve editorial strategies for developing co-authorship collaborations and the excellence of scientific content.

Keywords

  • Comparative text mining
  • Additive regularization of topic models
  • Social network analysis
  • Comparative graphs metrics
  • Text benchmarking

Notes

  1. https://snap.stanford.edu/data/ca-GrQc.html

  2. https://snap.stanford.edu/data/ca-HepTh.html

Author information

Corresponding author

Correspondence to Fedor Krasnov.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Krasnov, F., Dimentov, A., Shvartsman, M. (2019). Comparative Analysis of Scientific Papers Collections via Topic Modeling and Co-authorship Networks. In: Ustalov, D., Filchenkov, A., Pivovarova, L. (eds) Artificial Intelligence and Natural Language. AINL 2019. Communications in Computer and Information Science, vol 1119. Springer, Cham. https://doi.org/10.1007/978-3-030-34518-1_6

  • DOI: https://doi.org/10.1007/978-3-030-34518-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34517-4

  • Online ISBN: 978-3-030-34518-1

  • eBook Packages: Computer Science (R0)