Standardizing formats of corporate source data

Scientometrics

Abstract

This paper describes an approach for improving the data quality of corporate sources when databases are used for bibliometric purposes. Research management relies on bibliographic databases and citation index systems as analytical tools, yet the raw resources for bibliometric studies are plagued by a lack of consistency in field formatting for institution data. The present contribution puts forth a Natural Language Processing (NLP)-oriented method for identifying the structures that guide corporate data and mapping them into a standardized format. The proposed unification process is based on the definition of address patterns and the subsequent application of Enhanced Finite-State Transducers (E-FST). Our procedure was tested on address formats downloaded from the INSPEC, MEDLINE and CAB Abstracts databases. The results demonstrate the usefulness of the method, provided that errors in the formats to be unified are closely controlled. The computational efficacy of the model is noteworthy because it is firmly guided by the definition of data in the application domain.
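To make the idea of pattern-driven unification concrete, the following is a minimal sketch of a toy finite-state transducer over comma-delimited address segments. It is not the E-FST formalism of the paper: the lexicon, the states, the transition table, the output fields, and the sample addresses are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical keyword lexicon: maps a trigger word to a segment category.
LEXICON = {
    "univ": "INST", "university": "INST", "inst": "INST", "institute": "INST",
    "dept": "DEPT", "department": "DEPT", "fac": "DEPT", "faculty": "DEPT",
}

# Transducer transitions: (state, input category) -> (next state, output field).
# These hand-written address patterns are assumptions, not the paper's.
TRANSITIONS = {
    ("START", "DEPT"): ("AFTER_DEPT", "department"),
    ("START", "INST"): ("AFTER_INST", "institution"),
    ("AFTER_DEPT", "INST"): ("AFTER_INST", "institution"),
    ("AFTER_INST", "DEPT"): ("AFTER_INST", "department"),
    ("AFTER_INST", "OTHER"): ("AFTER_PLACE", "place"),
    ("AFTER_PLACE", "OTHER"): ("AFTER_PLACE", "place"),
}
ACCEPTING = {"AFTER_INST", "AFTER_PLACE"}


def categorize(segment):
    """Label a comma-delimited segment by its first lexicon hit."""
    for word in re.findall(r"[a-z]+", segment.lower()):
        if word in LEXICON:
            return LEXICON[word]
    return "OTHER"


def standardize(address):
    """Run the transducer; return a normalized record, or None if rejected."""
    state = "START"
    record = {"institution": None, "department": None, "place": []}
    for segment in (s.strip() for s in address.split(",") if s.strip()):
        key = (state, categorize(segment))
        if key not in TRANSITIONS:
            return None  # pattern not covered: leave for manual review
        state, field = TRANSITIONS[key]
        if field == "place":
            record["place"].append(segment)
        else:
            record[field] = segment
    return record if state in ACCEPTING else None


# Two invented addresses with different segment orders map to one layout.
print(standardize("Dept. Chem., Univ. Leiden, Leiden, Netherlands"))
print(standardize("Univ. Leiden, Fac. Med., Leiden, Netherlands"))
```

Returning None for any address that matches no pattern mirrors the abstract's caveat about error control: in this sketch, uncovered formats are rejected rather than forced into the standard layout.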

Author information

Corresponding author

Correspondence to Carmen Galvez.

About this article

Cite this article

Galvez, C., Moya-Anegón, F. Standardizing formats of corporate source data. Scientometrics 70, 3–26 (2007). https://doi.org/10.1007/s11192-007-0101-0

  • DOI: https://doi.org/10.1007/s11192-007-0101-0
