Named Entity Matching in Publication Databases

A Case Study of PubMed in SONCA
  • Marcin Szczuka
  • Paweł Betliński
  • Kamil Herba
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7413)


We present a case study in approximate data matching for a database system that contains information about scientific publications. The approximate matching process is meant to identify whether several records in the database are in fact repeated instances of the same real-world object. In our case study we are concerned with matching instances of objects such as XML documents, persons’ names, affiliations, journal names, and so on. The data we are dealing with is a representation of the PubMed Central document corpus within the data warehouse that is part of the SONCA system. The SONCA system is being developed as one of the components of SYNAT, a general scientific information platform.


Keywords: text mining, approximate matching, document grouping, data cleaning, data matching, similarity function, record linkage, record matching, duplicate detection, object matching, entity resolution, data warehousing, granulation





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Marcin Szczuka (1)
  • Paweł Betliński (1)
  • Kamil Herba (1)
  1. Institute of Mathematics, The University of Warsaw, Warsaw, Poland
