Towards a Stepwise Method for Unifying and Reconciling Corporate Names in Public Contracts Metadata: The CORFU Technique

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 390)


This paper introduces a technique for dealing with heterogeneity in corporate names in the context of public procurement metadata. Public bodies currently face a major challenge in trying to improve both the performance and the transparency of administrative processes. The e-Government and Open Linked Data initiatives have emerged as efforts to tackle existing interoperability and integration issues among ICT-based systems, but creating a truly transparent environment requires much more than simply publishing data and information in specific open formats; data and information quality is the next major step in the public sector. More specifically, in the e-Procurement domain a vast amount of valuable metadata is already available via Internet protocols and formats and can be used to create new added-value services. Nevertheless, even the simple extraction of statistics or creation of reports can require extra work to clean, prepare, and reconcile data. Moreover, transparency has become a major objective in public administrations and, in the case of public procurement, one of the most interesting services lies in tracking awarded contracts (mainly by type, location, and supplier). Although this seems a basic kind of reporting service, generating it can become a complex task because supplier names are not standardized and different descriptors are used for the type of contract. In this paper, a stepwise method based on natural language processing and semantics is defined and implemented to address the unification of corporate names. A research study is also presented that evaluates the precision and recall of the proposed technique, using as a use case the public dataset of public contracts awarded in Australia during the period 2004-2012. Finally, some discussion, conclusions, and future work are outlined.
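To make the problem concrete, the following is a minimal sketch of what unifying supplier names and scoring the result with precision and recall can look like. It is not the CORFU method, whose steps are not listed in this abstract. The legal-form list, the function names, and the sample supplier names are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form tokens to drop; not taken from the paper.
LEGAL_FORMS = {
    "pty", "ltd", "limited", "inc", "incorporated",
    "corp", "corporation", "co", "company", "plc", "llc",
}


def normalize_name(raw):
    """Return a crude canonical key for a supplier name (illustrative only)."""
    text = raw.lower().replace("&", " and ")
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Drop legal-form tokens and collapse repeated whitespace.
    tokens = [tok for tok in text.split() if tok not in LEGAL_FORMS]
    return " ".join(tokens)


def group_suppliers(names):
    """Group raw supplier names that share the same canonical key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)


def precision_recall(predicted_pairs, gold_pairs):
    """Compare predicted (raw name, canonical name) pairs with a gold standard."""
    predicted, gold = set(predicted_pairs), set(gold_pairs)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up supplier names, not drawn from the Australian dataset.
    sample = ["Acme Widgets Pty. Ltd.", "ACME WIDGETS PTY LIMITED", "Globex Corporation"]
    for key, variants in group_suppliers(sample).items():
        print(key, "<-", variants)
```

A purely string-based key like this merges only spelling and legal-form variants; it cannot handle misspellings, acronyms, or subsidiaries, which is why a stepwise method that also draws on natural language processing and semantics is of interest.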





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. South East European Research Center, Thessaloniki, Greece
  2. WESO Research Group, Department of Computer Science, University of Oviedo, Oviedo, Spain
  3. Multimedia Technology Laboratory, National Technical University of Athens, Athens, Greece
