Document type assignment accuracy in the journal citation index data of Web of Science

Donner, Paul

doi:10.1007/s11192-017-2483-y

Published: 04 August 2017

Scientometrics, Volume 113, pages 219–236 (2017)

Paul Donner ORCID: orcid.org/0000-0001-5737-8483¹

Abstract

This article reports the results of a study of the correctness of document type assignments in the commercial citation index database Web of Science (SCIE, SSCI, AHCI collections). The document type assignments for publication records are compared to those given on the official journal websites or in the publication full-texts for a random sample of 791 Web of Science records across the four document type categories defined by WoS: articles, letters, reviews and others. The proportion of incorrect assignments across document types and its influence on document type specific normalized citation scores are analysed. It is found that document type data is correct in 94% of records. Further analyses show that, among records assigned to the same document type in the data source, correctly and incorrectly assigned records differ in their average page counts and reference counts.

Notes

For Web of Science: Web of Science® Help. Searching the Document Type Field [accessed 2016/10/07] http://images.webofknowledge.com/WOKRS59B4/help/WOS/hs_document_type.html. For Scopus: Scopus Content Coverage Guide Jan. 2016; page 10 [accessed 2016/10/07] https://www.elsevier.com/_data/assets/pdf_file/0007/69451/scopus_content_coverage_guide.pdf.
The department of Thomson Reuters producing the citation indexes recently changed ownership and now runs under the company name Clarivate Analytics.
For our WoS data (publications from 1980 to 2014) the list of multiply assigned DOIs contains 19,096 entries, and the list for Scopus data (publications from 1996 to 2014) contains 81,715 entries.
Documented for the 2015/2016 edition https://www.timeshighereducation.com/news/ranking-methodology-2016 but not the 2016/2017 edition https://www.timeshighereducation.com/world-university-rankings/methodology-world-university-rankings-2016-2017.
https://www.nsf.gov/statistics/2016/nsb20161/#/.
It should be kept in mind that all the expected values are themselves affected by DT inaccuracies. No corrected values for these reference scores are available, so this issue must be put aside for this study although it is relevant in general.
http://www.tandfonline.com/doi/abs/10.1080/14767050701832833 accessed Sept. 21, 2016.

Acknowledgements

This study was supported by the German Federal Ministry of Education and Research (BMBF) Grant 01PQ13001, project “Kompetenzzentrum Bibliometrie”. I want to thank Anastasiia Tcypina for help with data collection and Nees Jan van Eck for discussion of the manuscript.

Author information

Authors and Affiliations

Deutsches Zentrum für Wissenschafts- und Hochschulforschung, Schützenstraße 6a, 10117, Berlin, Germany
Paul Donner

Corresponding author

Correspondence to Paul Donner.

Appendices

Appendix 1: Matching of WoS and Scopus records

An initial basic matching table was created by FIZ Karlsruhe for general use in projects by the database users. It is based on combinations of exactly corresponding values in 10 metadata fields. The fields were assigned weights according to their discriminative power: the publication year field, for example, has low discriminative power, while article title and DOI have very high discriminative power. Field values were normalized for differences in character sets, capitalization, special characters and some structural aspects (e.g. removing separating dashes in ISSNs) to make them more uniform between the two data sources. All journal records were then mutually compared across all fields, and a score was calculated from the number and weights of exactly matching fields. Only pairs reaching a predetermined threshold score were kept. Because all reasonable field combinations are used, a pair of records may be included multiple times in the resulting table. Furthermore, a record from one data source can occur as a plausible match with multiple records from the other data source.

The matching quality of this method was assessed before the study was conducted and found to be satisfactory. For this, a random sample of 2450 matching pairs was selected from the table and manually checked for equivalence using full bibliographical data from both data sources. The match was incorrect in 9 cases, unclear in 15 cases and correct in 2426 cases, so that more than 99% of the matches were correct.
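The scoring procedure described above can be illustrated with a minimal sketch. It is not the FIZ Karlsruhe implementation: the source does not list the 10 fields, their weights or the threshold, so the field names, the values in `FIELD_WEIGHTS`, the value of `THRESHOLD` and the normalization rules are all hypothetical and chosen only to show the idea.

```python
import re
import unicodedata

# Hypothetical field weights: a higher weight means more discriminative power.
# The source uses 10 fields; neither the fields nor the weights are documented.
FIELD_WEIGHTS = {"doi": 5.0, "title": 5.0, "issn": 2.0, "volume": 1.0, "year": 0.5}
THRESHOLD = 6.0  # hypothetical cut-off score


def normalize(field, value):
    """Make a field value comparable across the two data sources."""
    if value is None:
        return None
    # Fold accented characters to ASCII and ignore capitalization.
    text = unicodedata.normalize("NFKD", str(value))
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    if field == "issn":
        text = text.replace("-", "")  # remove the separating dash
    if field != "doi":
        # Replace runs of special characters with a single space.
        text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split()) or None


def match_score(rec_a, rec_b):
    """Sum the weights of all fields whose normalized values match exactly."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        a = normalize(field, rec_a.get(field))
        b = normalize(field, rec_b.get(field))
        if a is not None and a == b:
            score += weight
    return score


def candidate_pairs(wos_records, scopus_records):
    """Keep every WoS/Scopus pair whose score reaches the threshold."""
    pairs = []
    for w in wos_records:
        for s in scopus_records:
            score = match_score(w, s)
            if score >= THRESHOLD:
                pairs.append((w["id"], s["id"], score))
    return pairs


# Made-up records for illustration only.
wos = [{"id": "W1", "doi": "10.1000/ABC", "title": "Café Studies: A Review",
        "issn": "1234-5678", "year": 2010}]
scopus = [{"id": "S1", "doi": "10.1000/abc", "title": "Cafe studies - a review",
           "issn": "12345678", "year": 2010},
          {"id": "S2", "doi": None, "title": "Another paper",
           "issn": "12345678", "year": 2010}]
print(candidate_pairs(wos, scopus))
```

In this toy input, only the first Scopus record agrees with the WoS record on the highly weighted DOI and title fields, so only that pair reaches the threshold. Agreement on ISSN and publication year alone is not enough.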

This basic data was slightly modified in order to match each sampled WoS record to at most one Scopus record. For WoS records with only one possible Scopus record, the entry rows were copied directly to the final matching table. For WoS records with more than one plausible match, only the row with the highest matching score was copied. This concerns 2.6% of the entries in the base table. Using this look-up table, matches in Scopus were identified for 711 of the 793 records that were initially assessed. The remaining records were then matched on the DOI alone, and these matches were assessed manually. Finally, for all records still unmatched, the title was searched in the Scopus online platform and the results were checked for a correct match. These two steps produced another eleven verified matches.
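The reduction of the base table to one Scopus match per WoS record might look like the following sketch. The row layout `(wos_id, scopus_id, score)` and the function name `best_matches` are assumptions made for illustration. The source does not say how ties between equal scores were resolved; here the first row encountered is kept.

```python
def best_matches(pairs):
    """Reduce (wos_id, scopus_id, score) rows to one Scopus match per WoS record."""
    best = {}
    for wos_id, scopus_id, score in pairs:
        # Keep the first row seen, and replace it only when a strictly higher score appears.
        if wos_id not in best or score > best[wos_id][1]:
            best[wos_id] = (scopus_id, score)
    return best


# Made-up rows: W1 has two plausible Scopus matches, W2 appears twice with the same pair.
rows = [("W1", "S1", 12.5), ("W1", "S7", 7.0), ("W2", "S3", 9.0), ("W2", "S3", 6.5)]
lookup = best_matches(rows)
print(lookup)
```

Records that are absent from the resulting look-up table are the ones that would go on to the DOI-only and manual title search steps.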

The remaining unmatched records are mostly meeting abstract records, which are deliberately not included in Scopus. According to the DT assigned for this study, and after exclusion of publications which could not be found, these unmatched records comprise four articles, one letter, five reviews and 59 others. Among these, there were five misclassifications by WoS.

Appendix 2: Distribution of publication years in the WoS sample

See Table 10.

Table 10 Distribution of publication years in the WoS sample

Appendix 3: Equality of population estimates of Precision and Recall

In this study the overall population estimates of Precision and Recall for each data source, as opposed to the estimates for particular DTs, are equal, unlike in typical information retrieval evaluation. This is so because each case is “relevant” for exactly one of the four DT categories. To illustrate, a simplified example is worked through.

Consider a dataset with only two document types, A and B, for which we have the data source’s assignments and the true DTs, as assigned manually. There is no sampling and no stratification. The counts are cross-tabulated by independently assigned DT (rows) and data source DT (columns) as follows:

Independently assigned DT	Data source DT: A	Data source DT: B
A	20	6
B	5	40

There are 26 true A records and 45 true B records among the 71 records in total. The A records therefore have a population proportion (weight w_A) of 0.37, while the B records have a proportion (weight w_B) of 0.63. Calculating first the DT Recall (R) and Precision (P) values for A and B gives:

$$R_{\text{A}} = \frac{20}{20 + 6} = 0.77\quad R_{\text{B}} = \frac{40}{40 + 5} = 0.89$$

$$P_{\text{A}} = \frac{20}{20 + 5} = 0.80\quad P_{\text{B}} = \frac{40}{40 + 6} = 0.87$$

Then the overall Recall and Precision are:

$$R = \left( {R_{\text{A}} \times w_{\text{A}} } \right) + \left( {R_{\text{B}} \times w_{\text{B}} } \right) = \left( {0.77 \times 0.37} \right) + \left( {0.89 \times 0.63} \right) = 0.85$$

$$P = \left( {P_{\text{A}} \times w_{\text{A}} } \right) + \left( {P_{\text{B}} \times w_{\text{B}} } \right) = \left( {0.80 \times 0.37} \right) + \left( {0.87 \times 0.63} \right) = 0.85$$
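The equality can also be checked directly from the cross-tabulated counts. The following is a minimal sketch using exact fractions so that rounding does not obscure the identity. One assumption of this sketch should be stated clearly: the per-type Recall values are weighted by the share of each DT among the independently assigned DTs (26/71 and 45/71), while the per-type Precision values are weighted by the share of each DT among the data source assignments (25/71 and 46/71). Under these weights both sums reduce to the number of correctly assigned records divided by the total.

```python
from fractions import Fraction

# Keys are (independently assigned DT, data source DT), as in the table above.
counts = {("A", "A"): 20, ("A", "B"): 6, ("B", "A"): 5, ("B", "B"): 40}
types = ["A", "B"]
total = sum(counts.values())


def recall(dt):
    """Correct records of type dt divided by all records that truly are dt."""
    true_total = sum(counts[(dt, c)] for c in types)
    return Fraction(counts[(dt, dt)], true_total)


def precision(dt):
    """Correct records of type dt divided by all records the data source labels dt."""
    assigned_total = sum(counts[(r, dt)] for r in types)
    return Fraction(counts[(dt, dt)], assigned_total)


# Share of each DT among the true labels and among the data source labels.
true_share = {dt: Fraction(sum(counts[(dt, c)] for c in types), total) for dt in types}
source_share = {dt: Fraction(sum(counts[(r, dt)] for r in types), total) for dt in types}

overall_recall = sum(recall(dt) * true_share[dt] for dt in types)
overall_precision = sum(precision(dt) * source_share[dt] for dt in types)
correct_share = Fraction(sum(counts[(dt, dt)] for dt in types), total)

print(overall_recall, overall_precision, correct_share)  # 60/71 60/71 60/71
print(round(float(correct_share), 2))                    # 0.85
```

Each weighted term cancels its own denominator, for example (20/26) × (26/71) = 20/71, which is why both totals equal the share of correctly assigned records, 60/71.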

About this article

Cite this article

Donner, P. Document type assignment accuracy in the journal citation index data of Web of Science. Scientometrics 113, 219–236 (2017). https://doi.org/10.1007/s11192-017-2483-y

Received: 23 February 2017
Published: 04 August 2017
Issue Date: October 2017
DOI: https://doi.org/10.1007/s11192-017-2483-y
