Learning from the Past: An Analysis of Person Name Corrections in the DBLP Collection and Social Network Properties of Affected Entities

  • Florian Reitz
  • Oliver Hoffmann
Chapter
Part of the Lecture Notes in Social Networks book series (LNSN, volume 6)

Abstract

Many projects, such as the DBLP bibliography, must use names as identifiers for persons. Names, however, are not unique, and a person is not guaranteed to be referred to by only one name. This causes inconsistencies that reduce the data quality of a collection. Although a large number of algorithmic approaches address this problem, little is known about the properties of the inconsistent entities. We show how to extract a large number of past name inconsistencies from the DBLP data set. We analyze the social network properties of these names and of the communities they belong to. We evaluate how useful different properties are for differentiating defective from non-defective names and present an approach that predicts the probability that a name will need correction in the future.

Keywords

Negative Condition · Relational Network · Test Collection · Data Entity · Correction Density

Notes

Acknowledgements

We thank Manh Cuong Pham and Ralf Klamma for providing us with the thematic clustering data.

Copyright information

© Springer-Verlag Wien 2013

Authors and Affiliations

  1. University of Trier, Trier, Germany
  2. Schloss Dagstuhl – Leibniz-Zentrum für Informatik GmbH, Wadern, Germany
