On Convergence of Controlled Snowball Sampling for Scientific Abstracts Collection

  • Hennadii Dobrovolskyi
  • Nataliya Keberle
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1007)


This paper presents evidence for the convergence of controlled snowball sampling iterations applied to collecting seminal papers in a selected research domain. The iterations start with seed paper selection, plain snowball sampling and probabilistic topic modelling; then greedy controlled snowball sampling and analysis of the collected citation network alternate until the list of seminal papers becomes stable. The topic model is built from word-word co-occurrence probabilities using a combination of sparse symmetric nonnegative matrix factorization and principal component approximation. Experiments show that the number of topics in the model is determined in a natural way and that the Kullback-Leibler (KL) divergence provides an upper bound on the cosine similarity calculated from the keywords assigned by the publication authors. Several citation networks are collected and analysed. The analysis shows that all of the networks are “small worlds”, and therefore the observed saturation of controlled snowball sampling can provide the complete set of publications in the domains of interest. Experiments with KL divergence, symmetric KL divergence and Jensen-Shannon divergence show that KL divergence produces a less connected citation network but provides better convergence of the snowball iterations. Multiple runs of the sampling confirm the hypothesis that the set of seminal publications is stable with respect to variations in the seed papers. The modified main path analysis makes it possible to identify the seminal papers, including new publications that follow the mainstream of research. A comparison of different ranking criteria shows that Search Path Count provides better lists of seminal papers than citation index, PageRank and indegree.
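The three divergences compared in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example topic distributions and the `eps` guard against zero probabilities are assumptions made for illustration, and the symmetric KL divergence is taken here as the plain sum of the two directed divergences (some texts use half of that sum).

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Directed KL(p || q) in nats for two discrete distributions.

    `eps` is an illustrative guard against division by zero,
    not something specified in the paper.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


def symmetric_kl(p, q):
    """Symmetric KL, assumed here to be KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions of a seed paper and a candidate paper.
seed_topics = [0.7, 0.2, 0.1]
candidate_topics = [0.1, 0.3, 0.6]

forward = kl_divergence(seed_topics, candidate_topics)
backward = kl_divergence(candidate_topics, seed_topics)
sym = symmetric_kl(seed_topics, candidate_topics)
js = jensen_shannon(seed_topics, candidate_topics)
```

The directed KL divergence depends on the order of its arguments, while the other two measures do not, and the Jensen-Shannon divergence never exceeds ln 2 when natural logarithms are used. How a divergence value is turned into an accept or reject decision during controlled snowball sampling is not stated in the abstract, so no threshold is shown here.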


Text mining; Short text document; Topic modelling; Principal component analysis; Sparse symmetric nonnegative matrix factorization; Citation network; Main path analysis; Convergence; Saturation



The authors would like to express their gratitude to anonymous reviewers whose comments and suggestions helped improve the paper.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, Zaporizhzhya National University, Zaporizhzhya, Ukraine
