
Scientometrics, Volume 117, Issue 1, pp 61–84

Overlapping thematic structures extraction with mixed-membership stochastic blockmodel

  • Shuo Xu
  • Junwan Liu
  • Dongsheng Zhai
  • Xin An
  • Zheng Wang
  • Hongshen Pang

Abstract

It is increasingly important to identify thematic structures automatically from the massive scientific literature. Because of interdisciplinarity, thematic structures have no natural boundaries. In this work, the identification of thematic structures is treated as an overlapping community detection problem on a large-scale citation-link network. A mixed-membership stochastic blockmodel, equipped with a stochastic variational inference algorithm, is used to detect the overlapping thematic structures. To enhance readability, each theme is also labeled with several topical terms by a soft mutual information based method. Extensive experimental results on the astro dataset indicate that the mixed-membership stochastic blockmodel primarily uses local information and allows for pervasive overlaps, but it favors similar-sized themes, which disqualifies this approach from being used to extract thematic structures from scientific literature. In addition, the thematic structures from the bibliographic coupling network are similar to those from the co-citation network.

Keywords

Overlapping thematic structure; Mixed-membership stochastic blockmodel; Stochastic variational inference; Soft mutual information; Cluster labeling

Notes

Acknowledgements

The present study is an extended version of an article (Xu et al. 2017) presented at the 16th International Conference on Scientometrics and Informetrics, Wuhan (China), 16–20 October 2017. The clustering results from this work have been deposited with the other astro-dataset results. Our gratitude also goes to the anonymous reviewers and the editor for their valuable comments. This work was supported partially by the Social Science Foundation of Beijing (Grant No. 17GLB074), Science and Technology Project of Guangdong Province (Grant No. 2017A030303065), and National Natural Science Foundation of China (Grant Nos. 71403255 and 71473237).

References

  1. Abbe, E., & Sandon, C. (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Proceedings of the 56th IEEE annual symposium on foundations of computer science (pp. 670–688). Washington, DC: IEEE Computer Society. https://doi.org/10.1109/FOCS.2015.47.
  2. Ahlgren, P., & Colliander, C. (2009). Document–document similarity approaches and science mapping: Experimental comparison of five approaches. Journal of Informetrics, 3(1), 49–63. https://doi.org/10.1016/j.joi.2008.11.003.
  3. Airoldi, E. M., Blei, D. M., Fienberg, S. E., & Xing, E. P. (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9(Sep), 1981–2014.
  4. Amelio, A., & Pizzuti, C. (2014). Overlapping community discovery methods: A survey (pp. 105–125). Vienna: Springer. https://doi.org/10.1007/978-3-7091-1797-2_6.
  5. An, X., Xu, S., Wen, Y., & Hu, M. (2014). A shared interest discovery model for co-author relationship in SNS. International Journal of Distributed Sensor Networks, 2014, 1–9. https://doi.org/10.1155/2014/820715.
  6. Ananiadou, S. (1994). A methodology for automatic term recognition. In Proceedings of the 15th international conference on computational linguistics (pp. 1034–1038). Stroudsburg, PA: Association for Computational Linguistics. https://doi.org/10.3115/991250.991317.
  7. Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50(1–2), 5–43. https://doi.org/10.1023/A:1020281327116.
  8. Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An open source software for exploring and manipulating networks. In Proceedings of the 3rd international AAAI conference on weblogs and social media (pp. 361–362).
  9. Bennett, C. L., Halpern, M., Hinshaw, G., Jarosik, N., Kogut, A., Limon, M., et al. (2003). First-year Wilkinson Microwave Anisotropy Probe (WMAP) observations: Preliminary maps and basic results. The Astrophysical Journal Supplement Series, 148(1), 1–27.
  10. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
  11. Boyack, K. W. (2017). Thesaurus-based methods for mapping contents of publication sets. Scientometrics, 111(2), 1141–1155. https://doi.org/10.1007/s11192-017-2304-3.
  12. Chen, P.-Y., & Hero, A. O., III. (2015). Universal phase transition in community detectability under a stochastic block model. Physical Review E: Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics, 91(3), 032804. https://doi.org/10.1103/PhysRevE.91.032804.
  13. Conroy, C., & Gunn, J. E. (2010). The propagation of uncertainties in stellar population synthesis modeling. III. Model calibration, comparison, and evaluation. The Astrophysical Journal, 712(2), 833–857. https://doi.org/10.1088/0004-637X/712/2/833.
  14. Dave, R. N. (1996). Validating fuzzy partitions obtained through \(c\)-shells clustering. Pattern Recognition Letters, 17(6), 613–623. https://doi.org/10.1016/0167-8655(96)00026-8.
  15. Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 269–274). New York, NY: ACM. https://doi.org/10.1145/502512.502550.
  16. Frantzi, K., Ananiadou, S., & Mima, H. (2000). Automatic recognition of multi-word terms: The C-value/NC-value method. International Journal on Digital Libraries, 3(2), 115–130. https://doi.org/10.1007/s007999900023.
  17. Ginsparg, P. (2011). ArXiv at 20. Nature, 476, 145–147. https://doi.org/10.1038/476145a.
  18. Glänzel, W., & Thijs, B. (2011). Using ‘core documents’ for the representation of clusters and topics. Scientometrics, 88(1), 297–309. https://doi.org/10.1007/s11192-011-0347-4.
  19. Glänzel, W., & Thijs, B. (2017). Using hybrid methods and ‘core documents’ for the representation of clusters and topics: The astronomy dataset. Scientometrics, 111(2), 1071–1087. https://doi.org/10.1007/s11192-017-2301-6.
  20. Gläser, J., Glänzel, W., & Scharnhorst, A. (2017). Same data-different results? Towards a comparative approach to the identification of thematic structures in science. Scientometrics, 111(2), 981–998. https://doi.org/10.1007/s11192-017-2296-z.
  21. Gopalan, P. K., & Blei, D. M. (2013). Efficient discovery of overlapping communities in massive networks. Proceedings of the National Academy of Sciences of the United States of America, 110(36), 14534–14539. https://doi.org/10.1073/pnas.1221839110.
  22. Goswami, S., Murthy, C. A., & Das, A. K. (2016). Sparsity measure of a network graph: Gini index. eprint arXiv:1612.07074.
  23. Havemann, F., Gläser, J., & Heinz, M. (2017). Memetic search for overlapping topics based on a local evaluation of link communities. Scientometrics, 111(2), 1089–1118. https://doi.org/10.1007/s11192-017-2302-5.
  24. Havemann, F., Gläser, J., Heinz, M., & Struck, A. (2012). Identifying overlapping and hierarchical thematic structures in networks of scholarly papers: A comparison of three approaches. PLoS ONE, 7(3), e33255. https://doi.org/10.1371/journal.pone.0033255.
  25. Healey, P., Rothman, H., & Hoch, P. K. (1986). An experiment in science mapping for research planning. Research Policy, 15(5), 233–251. https://doi.org/10.1016/0048-7333(86)90024-7.
  26. Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(May), 1303–1347.
  27. Hurley, N., & Rickard, S. (2009). Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10), 4723–4741. https://doi.org/10.1109/TIT.2009.2027527.
  28. Janssens, F., Glänzel, W., & de Moor, B. (2008). A hybrid mapping of information science. Scientometrics, 75(3), 607–631. https://doi.org/10.1007/s11192-007-2002-7.
  29. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233. https://doi.org/10.1023/A:1007665907178.
  30. Klavans, R., & Boyack, K. W. (2011). Using global mapping to create more accurate document-level maps of research fields. Journal of the Association for Information Science and Technology, 62(1), 1–18. https://doi.org/10.1002/asi.21444.
  31. Koopman, R., & Wang, S. (2017). Mutual information based labelling and comparing clusters. Scientometrics, 111(2), 1157–1167. https://doi.org/10.1007/s11192-017-2305-2.
  32. Leydesdorff, L., & Welbers, K. (2011). The semantic mapping of words and co-words in contexts. Journal of Informetrics, 5(3), 469–475. https://doi.org/10.1016/j.joi.2011.01.008.
  33. Lorenz, M. O. (1905). Methods of measuring the concentration of wealth. Publications of the American Statistical Association, 9(70), 209–219.
  34. Manning, C. D., Raghavan, P., & Schütze, H. (Eds.). (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.
  35. Matsuo, Y., & Ishizuka, M. (2004). Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(01), 157–169. https://doi.org/10.1142/S0218213004001466.
  36. Mei, Q., Shen, X., & Zhai, C. (2007). Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 490–499). https://doi.org/10.1145/1281192.1281246.
  37. Nepusz, T., Petróczi, A., Négyessy, L., & Bazsó, F. (2008). Fuzzy communities and the concept of bridgeness in complex networks. Physical Review E, 77(1), 016107. https://doi.org/10.1103/PhysRevE.77.016107.
  38. Park, Y., Byrd, R. J., & Boguraev, B. K. (2002). Automatic glossary extraction: Beyond terminology identification. In Proceedings of the 19th international conference on computational linguistics, Taipei, Taiwan (pp. 1–7).
  39. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
  40. Role, F., & Nadif, M. (2014). Beyond cluster labeling: Semantic interpretation of clusters’ contents using a graph representation. Knowledge-Based Systems, 56, 141–155. https://doi.org/10.1016/j.knosys.2013.11.005.
  41. Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). In M. W. Berry & J. Kogan (Eds.), Text mining: Application and theory (pp. 1–20). Hoboken: Wiley.
  42. Sclano, F., & Velardi, P. (2007). Termextractor: A web application to learn the common terminology of interest groups and research communities. In Proceedings of the 3rd international conference on interoperability for enterprise software and applications.
  43. Shi, Q., Qiao, X., Xu, S., & Nong, G. (2013). Author-topic evolution model and its application in analysis of research interests evolution. Journal of the China Society for Scientific and Technical Information, 32(9), 912–919.
  44. Shibata, N., Kajikawa, Y., Takeda, Y., & Matsushima, K. (2009). Comparative study on methods of detecting research fronts using different types of citation. Journal of the Association for Information Science and Technology, 60(3), 571–580. https://doi.org/10.1002/asi.20994.
  45. Skrutskie, M. F., Cutri, R. M., Stiening, R., Weinberg, M. D., Schneider, S., Carpenter, J. M., et al. (2006). The two micron all sky survey (2MASS). The Astronomical Journal, 131(2), 1163–1183.
  46. van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the Association for Information Science and Technology, 60(8), 1635–1651. https://doi.org/10.1002/asi.21075.
  47. van Eck, N. J., & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523–538. https://doi.org/10.1007/s11192-009-0146-3.
  48. van Eck, N. J., & Waltman, L. (2017). Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics, 111(2), 1053–1070. https://doi.org/10.1007/s11192-017-2300-7.
  49. van Raan, A. F. J. (1996). Advanced bibliometric methods as quantitative core of peer review based evaluation and foresight exercises. Scientometrics, 36(3), 397–420. https://doi.org/10.1007/BF02129602.
  50. Velden, T., Boyack, K. W., Gläser, J., Koopman, R., Scharnhorst, A., & Wang, S. (2017). Comparison of topic extraction approaches and their results. Scientometrics, 111(2), 1169–1221. https://doi.org/10.1007/s11192-017-2306-1.
  51. Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clustering comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct), 2837–2854.
  52. Waltman, L., & van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the Association for Information Science and Technology, 63(12), 2378–2392. https://doi.org/10.1002/asi.22748.
  53. Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55(1), 1–17. https://doi.org/10.1093/biomet/55.1.1.
  54. Xie, J., Kelley, S., & Szymanski, B. K. (2013). Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys, 45(4), 43:1–43:35. https://doi.org/10.1145/2501654.2501657.
  55. Xu, S., Liu, J., & Wang, Z. (2017). Overlapping thematic structures extraction with mixed-membership stochastic blockmodel. In Proceedings of ISSI 2017—the 16th international conference on scientometrics & informetrics (pp. 1007–1012).
  56. Xu, S., Qiao, X., Zhu, L., Zhang, Y., Xue, C., & Li, L. (2016). Reviews on determining the number of clusters. Applied Mathematics & Information Sciences, 10(4), 1493–1512.
  57. Xu, S., Shi, Q., Qiao, X., Zhu, L., Zhang, H., Jung, H., et al. (2014). A dynamic users’ interest discovery model with distributed inference algorithm. International Journal of Distributed Sensor Networks, 2014, 1–11. https://doi.org/10.1155/2014/280892.
  58. Yau, C.-K., Porter, A., Newman, N., & Suominen, A. (2014). Clustering scientific documents with topic modeling. Scientometrics, 100(3), 767–786. https://doi.org/10.1007/s11192-014-1321-8.
  59. Zhang, Z., Gao, J., & Ciravegna, F. (2016). JATE 2.0: Java automatic term extraction with Apache Solr. In Proceedings of the 10th language resources and evaluation conference (pp. 2262–2269).
  60. Zhang, Z., Iria, J., Brewster, C., & Ciravegna, F. (2008). A comparative evaluation of term recognition algorithms. In Proceedings of the 6th international conference on language resources and evaluation, Marrakech, Morocco (pp. 2108–2113).
  61. Zhu, G., Blanton, M. R., & Moustakas, J. (2010). Stellar populations of elliptical galaxies in the local universe. The Astrophysical Journal, 722(1), 491–519. https://doi.org/10.1088/0004-637X/722/1/491.
  62. Zitt, M., Ramanana-Rahary, S., & Bassecoulard, E. (2005). Relativity of citation performance and excellence measures: From cross-field to cross-scale effects of field-normalisation. Scientometrics, 63(2), 373–401. https://doi.org/10.1007/s11192-005-0218-y.

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2018

Authors and Affiliations

  1. School of Economics and Management, Beijing University of Technology, Beijing, People’s Republic of China
  2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
  3. School of Economics and Management, Beijing Forestry University, Beijing, People’s Republic of China
  4. Institute of Scientific and Technical Information of China, Beijing, People’s Republic of China
  5. Library, Shenzhen University, Shenzhen, People’s Republic of China
