Volume 87, Issue 3, pp 597–620

Accuracy of inter-researcher similarity measures based on topical and social clues

  • Guillaume Cabanac


Scientific literature recommender systems (SLRSs) provide papers to researchers according to their scientific interests. These systems rely on inter-researcher similarity measures that are usually computed from publication contents (i.e., by extracting paper topics and citations). We highlight two major issues with this design. The required full-text access and processing are expensive and hardly feasible. Moreover, clues about meetings, encounters, and informal exchanges between researchers (which relate to a social dimension) have not been exploited to date. To tackle these issues, we propose an original SLRS based on a threefold contribution. First, we argue the case for defining inter-researcher similarity measures that build on publicly available metadata. Second, we define topical and social measures that we combine to issue socio-topical recommendations. Third, we conduct an evaluation with 71 volunteer researchers to check researchers’ perception against socio-topical similarities. Experimental results show a significant 11.21% accuracy improvement of socio-topical recommendations compared to baseline topical recommendations.


Keywords: Similarity among researchers; Topical clues; Social clues; Literature review; Recommendation; Experiment; Human perception; Measurement



The constructive criticisms and suggestions of the referees are warmly acknowledged. I am also grateful to the 71 volunteer researchers who took part in the experiment reported in this paper. Their feedback, comments, and insightful advice have been a source of stimulating thinking. Finally, I am indebted to Anaïs Lefeuvre for her involvement in this work as a research assistant.



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2011

Authors and Affiliations

  1. Computer Science Department, IRIT UMR 5505 CNRS, University of Toulouse, Toulouse Cedex 9, France
