Discovering cross-topic collaborations among researchers by exploiting weighted association rules
Identifying the most relevant scientific publications on a given topic is a well-known research problem. The Author-Topic Model (ATM) is a generative model that represents the relationships between research topics and publication authors, which makes it possible to identify the most influential authors on a particular topic. However, since most research works are co-authored by several researchers, the information provided by ATM can be complemented by studying the most fruitful collaborations among multiple authors. This paper addresses the discovery of research collaborations among multiple authors on single or multiple topics. Specifically, it applies an exploratory data mining technique, weighted association rule mining, to publication data in order to discover correlations between ATM topics and combinations of authors. The mined rules characterize groups of researchers with fairly high scientific productivity by indicating (1) the research topics covered by their most cited publications and the relevance of their scientific production for each topic separately, (2) the nature of the collaboration (topic-specific or cross-topic), (3) the names of the external authors who have (occasionally) collaborated with the group, either on a specific topic or on multiple topics, and (4) the underlying correlations between the addressed topics. The applicability of the proposed approach was validated on real data acquired from the Online Mendelian Inheritance in Man catalog of genetic disorders and from the PubMed digital library. The results confirm the effectiveness of the proposed strategy.
Keywords: Author Topic Model; Weighted association rule mining; Data mining; Knowledge discovery
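To make the idea of weighted association rules over authors and topics more concrete, the following minimal Python sketch computes a weighted support and a confidence for rules of the form {authors} → {topics}. It is an illustration under stated assumptions, not the method used in the paper: the toy publications, the `author:`/`topic:` item labels, the choice of a single per-publication weight (for example a citation-based score), and the function names `weighted_support` and `rule_confidence` are all hypothetical.

```python
# Hypothetical toy data: each publication is (set of items, weight).
# Items mix author labels and topic labels; the weight stands in for a
# per-publication relevance score (an assumption, e.g. citation-based).
publications = [
    ({"author:A", "author:B", "topic:T1"}, 10.0),
    ({"author:A", "author:B", "topic:T1", "topic:T2"}, 30.0),
    ({"author:A", "author:C", "topic:T2"}, 5.0),
    ({"author:B", "topic:T1"}, 5.0),
]


def weighted_support(itemset, data):
    """Share of the total weight carried by publications containing itemset."""
    total = sum(weight for _, weight in data)
    covered = sum(weight for items, weight in data if itemset <= items)
    return covered / total if total else 0.0


def rule_confidence(antecedent, consequent, data):
    """Weighted confidence of the rule antecedent -> consequent."""
    base = weighted_support(antecedent, data)
    if base == 0.0:
        return 0.0
    return weighted_support(antecedent | consequent, data) / base


team = frozenset({"author:A", "author:B"})

# Topic-specific rule: {A, B} -> {T1}
single_topic = frozenset({"topic:T1"})
ws_single = weighted_support(team | single_topic, publications)
conf_single = rule_confidence(team, single_topic, publications)

# Cross-topic rule: {A, B} -> {T1, T2}
cross_topic = frozenset({"topic:T1", "topic:T2"})
ws_cross = weighted_support(team | cross_topic, publications)
conf_cross = rule_confidence(team, cross_topic, publications)

print(f"{{A,B}} -> {{T1}}:    wsup={ws_single:.2f} conf={conf_single:.2f}")
print(f"{{A,B}} -> {{T1,T2}}: wsup={ws_cross:.2f} conf={conf_cross:.2f}")
```

In this toy setting, a rule whose consequent holds one topic would correspond to a topic-specific collaboration, while a consequent with several topics would correspond to a cross-topic one. A practical miner would enumerate candidate itemsets above a minimum weighted support threshold rather than scoring hand-picked rules.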