Author Disambiguation in the YADDA2 Software Platform

  • Piotr Jan Dendek
  • Mariusz Wojewódzki
  • Łukasz Bolikowski
Part of the Studies in Computational Intelligence book series (SCI, volume 467)


The SYNAT platform, powered by the YADDA2 architecture, has been extended with the Author Disambiguation Framework and the Query Framework. The former clusters occurrences of contributor names into author identities; the latter answers queries about authors and the documents they have written. This paper outlines the disambiguation algorithms, describes the implementation of the query framework and its integration into the platform, and presents a performance evaluation of the solution.
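To make the notion of clustering contributor-name occurrences into author identities concrete, the following is a minimal sketch and not the algorithm used in the platform. It assumes a hypothetical blocking key (surname plus first initial) and a deliberately simple merge rule (two occurrences in the same block are joined when they share at least one coauthor); the names `blocking_key`, `disambiguate`, and `threshold` are illustrative only.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name):
    """Hypothetical blocking key: lowercase surname plus first initial."""
    parts = name.lower().replace(".", " ").split()
    return (parts[-1], parts[0][0])


def disambiguate(occurrences, threshold=1):
    """Cluster contributor occurrences into author identities.

    occurrences: list of (doc_id, name, set_of_coauthor_names).
    Assumed rule: occurrences in the same block are merged when they
    share at least `threshold` coauthors.
    Returns a sorted list of sorted doc_id lists, one per identity.
    """
    parent = list(range(len(occurrences)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Group occurrence indices by blocking key to limit comparisons.
    blocks = defaultdict(list)
    for idx, (_, name, _) in enumerate(occurrences):
        blocks[blocking_key(name)].append(idx)

    # Compare pairs only within a block and merge similar ones.
    for members in blocks.values():
        for a, b in combinations(members, 2):
            shared = occurrences[a][2] & occurrences[b][2]
            if len(shared) >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for idx, (doc_id, _, _) in enumerate(occurrences):
        clusters[find(idx)].append(doc_id)
    return sorted(sorted(docs) for docs in clusters.values())


occurrences = [
    ("d1", "J. Smith", {"A. Kowalski"}),
    ("d2", "John Smith", {"A. Kowalski", "B. Nowak"}),
    ("d3", "J. Smith", {"C. Lee"}),
    ("d4", "M. Jones", {"A. Kowalski"}),
]
identities = disambiguate(occurrences)
```

In this toy input, `d1` and `d2` fall into one identity because they share a coauthor, while `d3` stays separate despite the matching name. A query such as "documents written by this author" then reduces to looking up the cluster that contains a given occurrence.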


author disambiguation, record deduplication, software architecture, YADDA2 software platform





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Piotr Jan Dendek
    • 1
  • Mariusz Wojewódzki
    • 1
  • Łukasz Bolikowski
    • 1
  1. Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland
