GraphDBLP: a system for analysing networks of computer scientists through graph databases

Abstract

This paper presents GraphDBLP, a system that models the DBLP bibliography as a graph database for performing graph-based queries and social network analyses. GraphDBLP also enriches the DBLP data with semantic keyword similarities computed via word embeddings. In this paper, we discuss how the system was formalized as a multi-graph and how similarity relations were identified through word2vec. We also provide three meaningful queries for exploring the DBLP community to (i) investigate author profiles by analysing their publication records; (ii) identify the most prolific authors on a given topic; and (iii) perform social network analyses over the whole community. To date, GraphDBLP contains 5+ million nodes and 24+ million relationships, enabling users to explore the DBLP data by referencing more than 3.3 million publications, 1.7 million authors, and more than 5 thousand publication venues. Through the use of word embeddings, more than 7.5 thousand keywords and related similarity values were collected. GraphDBLP was implemented on top of the Neo4j graph database. The whole dataset and the source code are publicly available to foster the improvement of GraphDBLP across the whole computer science community.
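
The keyword similarities mentioned in the abstract rest on comparing embedding vectors. As a minimal sketch, assuming cosine similarity as the measure (the abstract does not say which measure GraphDBLP uses) and using invented three-dimensional vectors in place of learned word2vec vectors, the comparison might look like this. The names `TOY_VECTORS`, `cosine_similarity` and `most_similar` are hypothetical.

```python
import math

# Toy vectors, invented for illustration; they are NOT taken from GraphDBLP.
TOY_VECTORS = {
    "machine_learning": [0.9, 0.1, 0.3],
    "deep_learning": [0.8, 0.2, 0.4],
    "graph_database": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(keyword, vectors):
    """Rank every other keyword by cosine similarity to `keyword`."""
    target = vectors[keyword]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != keyword
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `most_similar("machine_learning", TOY_VECTORS)` ranks `deep_learning` ahead of `graph_database` for these toy values. In a graph setting, each such pair and its score could be stored as a weighted similarity edge between two keyword nodes.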

Notes

  1. In this work, venues include conferences and journals.

  2. A multi-graph is a graph where multiple edges between two nodes are permitted and might be specified through labels. Our notation was inspired by [17].

  3. In addition to single words, n-grams can also be mapped to vectors. An n-gram is a sequence of n consecutive words. As outlined in Section 3.2, frequent co-occurrences of n consecutive words are identified and replaced by a single token, e.g., machine learning is replaced by machine_learning.

  4. A similar (but reversed) problem is the Skip-n-gram model, i.e., training a neural network to predict the representation of n context words from the representation of w. The Skip-n-gram approach can be summarised as “predicting the context given a word” while the CBOW, in a nutshell, is “predicting the word given a context”.

  5. Py2neo Python library. Available: http://py2neo.org/.

  6. Although the same result could be achieved by adding a property to the node, using multiple labels allows one to immediately access the nodes with the desired label.

  7. Performed using the stop-word dictionary provided by the NLTK framework [10].

  8. These are the edges selected using the Similarity label.

  9. The idea is inspired by [7], although they compute the weight of triples through arithmetic functions.

  10. The lower quartile is the 25th percentile, while the upper quartile is the 75th percentile.

  11. https://fabiomercorio.github.io/GraphDBLP/.
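
The replacement described in note 3 can be sketched as follows. This is only an illustration under stated assumptions: the function name `merge_frequent_bigrams` and the `min_count` threshold are hypothetical, a plain frequency count stands in for whatever criterion Section 3.2 of the paper applies, and only pairs (n = 2) are handled.

```python
from collections import Counter


def merge_frequent_bigrams(sentences, min_count=2):
    """Join adjacent word pairs that occur at least `min_count` times.

    `sentences` is a list of token lists. The count threshold is an
    assumption made for this sketch, not the criterion used in the paper.
    """
    # Count every pair of consecutive tokens across all sentences.
    pair_counts = Counter(
        (left, right)
        for tokens in sentences
        for left, right in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    merged_sentences = []
    for tokens in sentences:
        merged, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and pair in frequent:
                merged.append("_".join(pair))  # e.g. machine_learning
                i += 2  # skip both words of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_sentences.append(merged)
    return merged_sentences
```

For example, given the two token lists `["machine", "learning", "for", "graphs"]` and `["we", "use", "machine", "learning"]`, the pair (machine, learning) occurs twice and is rewritten as the single token `machine_learning`, while every other word is left untouched.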

References

  1. Adomavicius G, Sankaranarayanan R, Sen S, Tuzhilin A (2005) Incorporating contextual information in recommender systems using a multidimensional approach. ACM Trans Inf Syst (TOIS) 23(1):103–145

  2. Aggarwal C C (2011) An introduction to social network data analytics. Socl Netw Data Anal 1–15

  3. Albanese M, d’Acierno A, Moscato V, Persia F, Picariello A (2013) A multimedia recommender system. ACM Trans Internet Technol (TOIT) 13(1):3

  4. Amato F, Moscato V, Picariello A, Piccialli F (2017) Sos: a multimedia recommender system for online social networks. Fut Gen Comput Syst

  5. Angles R, Gutierrez C (2008) Survey of graph database models. ACM Comput Surv (CSUR) 40(1):1

  6. Bao J, Zheng Y, Wilkie D, Mokbel M (2015) Recommendations in location-based social networks: a survey. GeoInformatica 19(3):525–565

  7. Barrat A, Barthelemy M, Pastor-Satorras R, Vespignani A (2004) The architecture of complex weighted networks. Proc Natl Acad Sci USA 101(11):3747–3752

  8. Belák V, Lam S, Hayes C (2012) Cross-community influence in discussion fora. ICWSM 12:34–41

  9. Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155

  10. Bird S, Klein E, Loper E (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media Inc.

  11. Boselli R, Cesarini M, Marrara S, Mercorio F, Mezzanzanica M, Pasi G, Viviani M (2017) Wolmis: a labor market intelligence system for classifying web job vacancies. J Intell Inf Syst. https://doi.org/10.1007/s10844-017-0488-x

  12. Boselli R, Cesarini M, Mercorio F, Mezzanzanica M (2017) Using machine learning for labour market intelligence. In: Altun Y, Das K, Mielikäinen T, Malerba D, Stefanowski J, Read J, Zitnik M, Ceci M, Dzeroski S (eds) Machine learning and knowledge discovery in databases - European conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part III, Lecture Notes in Computer Science, vol 10536. Springer, pp 330–342. https://doi.org/10.1007/978-3-319-71273-4_27 (to appear in print)

  13. Boselli R, Cesarini M, Mercorio F, Mezzanzanica M, Vaccarino A (2017) A pipeline for multimedia twitter analysis through graph databases: preliminary results. In: DATA 2017 - the international conference on data technologies and applications. https://doi.org/10.5220/0006490703430349

  14. Cattell R (2011) Scalable sql and nosql data stores. ACM Sigmod Record 39(4):12–27

  15. Chikhaoui B, Chiazzaro M, Wang S (2015) A new granger causal model for influence evolution in dynamic social networks: the case of dblp. In: AAAI, pp 51–57

  16. Colace F, De Santo M, Greco L, Moscato V, Picariello A (2015) A collaborative user-centered framework for recommending items in online social networks. Comput Hum Behav 51:694–704

  17. Consens M P, Mendelzon A O (1990) Graphlog: a visual formalism for real life recursion. In: Proceedings of the ninth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems. ACM, pp 404–416

  18. Deng H, King I, Lyu M R (2008) Formal models for expert finding on dblp bibliography data. In: Eighth IEEE international conference on data mining, 2008. ICDM’08. IEEE, pp 163–172

  19. Diederich J, Balke W T, Thaden U (2007) Demonstrating the semantic growbag: automatically creating topic facets for faceteddblp. In: Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries. ACM, pp 505–505

  20. Distributed graph database (2017) http://titan.thinkaurelius.com/

  21. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

  22. Elmacioglu E, Lee D (2005) On six degrees of separation in dblp-db and more. ACM SIGMOD Record 34(2):33–40

  23. Girvan M, Newman M E (2002) Community structure in social and biological networks. Proc Nat Acad Sci 99(12):7821–7826

  24. Han J, Haihong E, Le G, Du J (2011) Survey on nosql database. In: 2011 6th international conference on pervasive computing and applications (ICPCA). IEEE, pp 363–366

  25. Jiang M, Cui P, Chen X, Wang F, Zhu W, Yang S (2015) Social recommendation with cross-domain transferable knowledge. IEEE Trans Knowl Data Eng 27(11):3084–3097

  26. Le T, Zhang D (2015) Dblpminer: a tool for exploring bibliographic data. In: 2015 IEEE international conference on information reuse and integration (IRI). IEEE, pp 435–442

  27. Lee S, Song SI, Kahng M, Lee D, Lee SG (2011) Random walk based entity ranking on graph for multidimensional recommendation. In: Proceedings of the fifth ACM conference on recommender systems. ACM, pp 93–100

  28. Ley M (2009) Dblp: some lessons learned. Proc VLDB Endow 2(2):1493–1500

  29. Li X, Chen H (2013) Recommendation as link prediction in bipartite graphs: a graph kernel-based machine learning approach. Decis Support Syst 54(2):880–890

  30. Liu L, Tang J, Han J, Jiang M, Yang S (2010) Mining topic-level influence in heterogeneous networks. In: Proceedings of the 19th ACM international conference on information and knowledge management. ACM, pp 199–208

  31. Marrara S, Pasi G, Viviani M, Cesarini M, Mercorio F, Mezzanzanica M, Pappagallo M A language modelling approach for discovering novel labour market occupations from the web. In: Sheth AP, Ngonga A, Wang Y, Chang E, Slezak D, Franczyk B, Alt R, Tao X, Unland R (eds) Proceedings of the international conference on web intelligence. ACM, Leipzig, pp 1026–1034. https://doi.org/10.1145/3106426.3109035

  32. Mehmood Y, Barbieri N, Bonchi F, Ukkonen A (2013) Csi: community-level social influence analysis. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, pp 48–63

  33. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781

  34. Mikolov T, Sutskever I, Chen K, Corrado G S, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119

  35. Mikolov T, Yih WT, Zweig G (2013) Linguistic regularities in continuous space word representations. In: HLT-NAACL, vol 13, pp 746–751

  36. Moreira C, Calado P, Martins B (2015) Learning to rank academic experts in the dblp dataset. Expert Syst 32(4):477–493

  37. Nascimento M A, Sander J, Pound J (2003) Analysis of sigmod’s co-authorship graph. ACM Sigmod Record 32(3):8–10

  38. Newman M E (2003) The structure and function of complex networks. SIAM Rev 45(2):167–256

  39. Newman M E (2004) Who is the best connected scientist? A study of scientific coauthorship networks. In: Complex networks. Springer, pp 337–370

  40. Papadopoulos S, Kompatsiaris Y, Vakali A, Spyridonos P (2012) Community detection in social media. Data Min Knowl Disc 24(3):515–554

  41. Pham T A N, Li X, Cong G, Zhang Z (2015) A general graph-based model for recommendation in event-based social networks. In: 2015 IEEE 31st international conference on data engineering (ICDE). IEEE, pp 567–578

  42. Ricci F, Rokach L, Shapira B, Kantor P B (2015) Recommender systems handbook. Springer

  43. Scott J (2017) Social network analysis. Sage

  44. Stonebraker M (2010) Sql databases v. nosql databases. Commun ACM 53(4):10–11

  45. Tagarelli A, Interdonato R (2013) Ranking vicarious learners in research collaboration networks. In: International conference on Asian digital libraries. Springer, pp 93–102

  46. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

  47. Tesoriero C (2013) Getting started with orientDB. Packt Publishing Ltd

  48. Watts D J, Strogatz S H (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440–442

  49. Webber J (2012) A programmatic introduction to neo4j. In: Proceedings of the 3rd annual conference on systems, programming, and applications: software for humanity. ACM, pp 217–218

  50. Wu Y, Cao N, Gotz D, Tan Y P, Keim D A (2016) A survey on visual analytics of social media data. IEEE Trans Multimed 18(11):2135–2148

  51. Zaiane O R, Chen J, Goebel R (2007) Dbconnect: mining research community on dblp data. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 74–81

Author information

Corresponding author

Correspondence to Fabio Mercorio.

About this article

Cite this article

Mezzanzanica, M., Mercorio, F., Cesarini, M. et al. GraphDBLP: a system for analysing networks of computer scientists through graph databases. Multimed Tools Appl 77, 18657–18688 (2018). https://doi.org/10.1007/s11042-017-5503-2

Keywords

  • Graph database
  • Word embedding
  • Knowledge extraction
  • Semantic analytics
  • Social network analysis