Abstract
We propose a new approach to recommending scientific literature, a domain in which the efficient organization and search of information is crucial. The proposed system relies on the hypothesis that two scientific articles are semantically related if they are co-cited more frequently than would be expected by pure chance. This relationship can be quantified by the probability of co-citation, obtained from a null model that statistically defines what we consider pure chance. By looking for the article pairs that minimize this probability, the system is able to recommend a ranked list of articles in response to a given article. The system belongs to the co-occurrence paradigm of the field. More specifically, it is based on co-citations, so it can produce recommendations that focus more on relatedness than on similarity. The system has been evaluated on the ACL Anthology collection and on the DBLP dataset, and a new corpus has been compiled to evaluate the capacity of the proposal to find relationships beyond similarity. Results show that the system is able to provide not only articles similar to the submitted one, but also articles presenting other kinds of relations, thus providing diversity, i.e. connections to new topics.
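The ranking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the null model, so this example assumes a hypergeometric one: the papers citing one article are treated as a random draw from all citing papers, and the score is the probability of observing at least the actual number of co-citations. The function names `cocitation_p_value` and `recommend`, and the toy citation data, are hypothetical and are not taken from the article.

```python
from collections import defaultdict
from math import comb


def cocitation_p_value(n_total, n_a, n_b, k):
    """Probability of at least k co-citations under an ASSUMED hypergeometric null.

    n_total: number of citing papers in the collection
    n_a, n_b: number of papers citing article a and article b
    k: observed number of papers citing both
    """
    upper = min(n_a, n_b)
    # Ways to place the n_b citations of b so that exactly j of them
    # fall on papers that also cite a, summed over j = k..upper.
    tail = sum(comb(n_a, j) * comb(n_total - n_a, n_b - j)
               for j in range(k, upper + 1))
    return tail / comb(n_total, n_b)


def recommend(citations, query, top_n=5):
    """Rank articles co-cited with `query` by increasing null-model probability.

    citations: dict mapping each citing paper to the set of articles it cites.
    """
    # Invert the mapping: article -> set of papers that cite it.
    citers = defaultdict(set)
    for citing, cited_set in citations.items():
        for cited in cited_set:
            citers[cited].add(citing)

    query_citers = citers.get(query, set())
    n_total = len(citations)
    scored = []
    for other, other_citers in citers.items():
        if other == query:
            continue
        k = len(query_citers & other_citers)
        if k == 0:
            continue  # never co-cited, so not a candidate
        p = cocitation_p_value(n_total, len(query_citers), len(other_citers), k)
        scored.append((p, other))
    scored.sort()  # lowest probability first
    return [(other, p) for p, other in scored[:top_n]]


# Toy data (illustrative only): six citing papers and the articles they cite.
toy_citations = {
    "p1": {"A", "B"},
    "p2": {"A", "B"},
    "p3": {"A", "C"},
    "p4": {"C", "D"},
    "p5": {"D"},
    "p6": {"C", "D"},
}
ranking = recommend(toy_citations, "A")
```

In the toy data, article B is co-cited with A by both papers that cite B, so it receives a lower probability than C, which shares only one citing paper with A, and it is ranked first.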
Notes
These proposals use the relationships to construct a graph and then apply graph algorithms, such as clustering or PageRank.
The corpus has been annotated by the authors, who have long experience working with research papers.
Acknowledgements
This work has been partially supported by the Spanish Ministry of Science and Innovation within the projects PROSA-MED (TIN2016-77820-C3-2-R) and EXTRAE (IMIENS 2017).
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Rodriguez-Prieto, O., Araujo, L. & Martinez-Romo, J. Discovering related scientific literature beyond semantic similarity: a new co-citation approach. Scientometrics 120, 105–127 (2019). https://doi.org/10.1007/s11192-019-03125-9
DOI: https://doi.org/10.1007/s11192-019-03125-9