This article reports the results of a study of the correctness of document type assignments in the commercial citation index database Web of Science (SCIE, SSCI and AHCI collections). The document type assignments of publication records are compared with those given on the official journal websites or in the publication full texts for a random sample of 791 Web of Science records across the four document type categories of articles, letters, reviews and others, according to the WoS definitions. The proportion of incorrect assignments across document types and its influence on document-type-specific normalized citation scores are analysed. Document type data are found to be correct in 94% of records. Further analyses show that, within the records of one document type as assigned in the data source, the correctly and incorrectly assigned records differ in average page count and reference count.
For Web of Science: Web of Science® Help. Searching the Document Type Field [accessed 2016/10/07] http://images.webofknowledge.com/WOKRS59B4/help/WOS/hs_document_type.html. For Scopus: Scopus Content Coverage Guide Jan. 2016; page 10 [accessed 2016/10/07] https://www.elsevier.com/_data/assets/pdf_file/0007/69451/scopus_content_coverage_guide.pdf.
The department of Thomson Reuters producing the citation indexes recently changed ownership and now runs under the company name Clarivate Analytics.
For our WoS data (publications from 1980 to 2014) this list of multiply assigned DOI contains 19,096 entries and the list for Scopus data contains 81,715 entries (publications from 1996 to 2014).
Documented for the 2015/2016 edition https://www.timeshighereducation.com/news/ranking-methodology-2016 but not the 2016/2017 edition https://www.timeshighereducation.com/world-university-rankings/methodology-world-university-rankings-2016-2017.
It should be kept in mind that all the expected values are themselves affected by DT inaccuracies. No corrected values for these reference scores are available, so this issue, although relevant in general, must be set aside for this study.
http://www.tandfonline.com/doi/abs/10.1080/14767050701832833 accessed Sept. 21, 2016.
This study was supported by the German Federal Ministry of Education and Research (BMBF) Grant 01PQ13001, project “Kompetenzzentrum Bibliometrie”. I want to thank Anastasiia Tcypina for help with data collection and Nees Jan van Eck for discussion of the manuscript.
Appendix 1: Matching of WoS and Scopus records
An initial basic matching table was created by FIZ Karlsruhe for general use in projects by database users, based on combinations of exactly corresponding values in 10 metadata fields. The fields were assigned weights according to their discriminative power; for example, the publication year field has low discriminative power, while that of the article title and DOI is very high. Field values were normalized for differences in character sets, capitalization, special characters and some structural aspects (e.g. removing separating dashes in ISSNs) to make them more uniform between the two data sources. All journal records were then mutually compared across all fields, and a score was calculated from the number and weights of exactly matching fields. Only pairs reaching a predetermined threshold score were kept. Through this procedure, a pair of records may be included multiple times in the resulting table because all reasonable field combinations are used. Furthermore, a record from one data source can occur as a plausible match pair with multiple records from the other data source. The matching quality of this method was assessed before the study was conducted and found to be satisfactory: a random sample of 2450 matching pairs was selected from the table and manually checked for equivalence using full bibliographical data from both data sources. In 9 cases the match was incorrect, in 15 cases the decision was unclear, and in 2426 cases the match was found to be correct, a correct-match percentage greater than 99%.
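The weighted field-matching procedure described above can be sketched as follows. This is a minimal illustration only: the field names, weights, threshold value and normalization rules are assumptions for the sketch, not the actual values used by FIZ Karlsruhe.

```python
# Hypothetical field weights: low for weakly discriminative fields
# (publication year), high for strongly discriminative ones (title, DOI).
FIELD_WEIGHTS = {
    "doi": 10, "title": 9, "first_author": 5, "journal": 3,
    "volume": 2, "issue": 2, "first_page": 3, "pub_year": 1,
    "issn": 2, "last_page": 2,
}
THRESHOLD = 12  # hypothetical minimum score for a plausible match

def normalize(value):
    """Make field values uniform: lowercase and keep only
    alphanumeric characters (this drops separating dashes in ISSNs)."""
    if value is None:
        return ""
    return "".join(ch for ch in str(value).lower() if ch.isalnum())

def match_score(rec_a, rec_b):
    """Sum the weights of all fields whose normalized values match exactly."""
    score = 0
    for field, weight in FIELD_WEIGHTS.items():
        a, b = normalize(rec_a.get(field)), normalize(rec_b.get(field))
        if a and a == b:
            score += weight
    return score

# Example: the same publication as represented in the two data sources.
wos = {"doi": "10.1007/s11192-017-2483-y", "pub_year": "2017", "issn": "0138-9130"}
scopus = {"doi": "10.1007/S11192-017-2483-Y", "pub_year": 2017, "issn": "01389130"}
assert match_score(wos, scopus) == 13  # doi (10) + pub_year (1) + issn (2)
assert match_score(wos, scopus) >= THRESHOLD
```

A real implementation would compare all record pairs (or use blocking to avoid the quadratic comparison) and retain every pair whose score reaches the threshold, which is why one record can appear in several plausible pairs.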
This basic data was slightly modified to match the sampled WoS records to Scopus records uniquely. For WoS records assigned only one possible Scopus record, the entry rows were copied directly to the final matching table. For WoS records with more than one plausible match, only the single row with the highest matching score was copied; this concerns 2.6% of the entries in the base table. Using this look-up table, matches in Scopus could be identified for 711 of the 793 records that were initially assessed. This was extended by matching the remaining records on DOI alone and manually assessing the matches. Finally, for all records still unmatched, the title was searched on the Scopus online platform and the results checked for a correct match. These two steps produced another eleven verified matches.
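The deduplication step, keeping only the highest-scoring Scopus match per WoS record, amounts to a simple reduction over the candidate pairs. A minimal sketch, with an assumed tuple layout and invented identifiers:

```python
# Candidate pairs from the base matching table: (wos_id, scopus_id, score).
# Identifiers and scores are hypothetical.
pairs = [
    ("wos:1", "scp:a", 18),
    ("wos:2", "scp:b", 14),
    ("wos:2", "scp:c", 21),  # wos:2 has two plausible Scopus matches
]

# Keep, for each WoS record, only the single highest-scoring match.
best = {}
for wos_id, scopus_id, score in pairs:
    if wos_id not in best or score > best[wos_id][1]:
        best[wos_id] = (scopus_id, score)

assert best == {"wos:1": ("scp:a", 18), "wos:2": ("scp:c", 21)}
```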
The remaining unmatched records are mostly meeting abstract records, which are deliberately not included in Scopus. According to the DTs assigned for this study, these unmatched records comprise, after exclusion of the publications which could not be found, four articles, one letter, five reviews and 59 others. Among these, there were five misclassifications by WoS.
Appendix 2: Distribution of publication years in the WoS sample
See Table 10.
Appendix 3: Equality of population estimates of Precision and Recall
In this study, the overall population estimates of Precision and Recall for each data source, as opposed to the estimates for particular DTs, are equal, unlike in typical information retrieval evaluation. This is because each case is “relevant” for exactly one of the four DT categories. To illustrate, a simplified example is worked through.
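One way to see this formally (notation ours, not from the original): write $n_{ct}$ for the number of records with true DT $c$ and data-source DT $t$, and $N = \sum_{c,t} n_{ct}$ for the total. Because every record has exactly one true and one assigned DT, weighting per-type Recall by the true-class proportions and per-type Precision by the assigned-class proportions makes both collapse to the overall proportion of correct assignments:

```latex
R_{\mathrm{overall}}
  = \sum_{c} \frac{\sum_{t} n_{ct}}{N} \cdot \frac{n_{cc}}{\sum_{t} n_{ct}}
  = \frac{\sum_{c} n_{cc}}{N},
\qquad
P_{\mathrm{overall}}
  = \sum_{c} \frac{\sum_{t} n_{tc}}{N} \cdot \frac{n_{cc}}{\sum_{t} n_{tc}}
  = \frac{\sum_{c} n_{cc}}{N}.
```

The class-size weights cancel the denominators of the per-type scores, so both overall estimates equal the accuracy $\sum_c n_{cc} / N$.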
Consider a dataset with only two document types, A and B, for which we have both the data source's assignments and the true DTs, as assigned manually. There is no sampling or stratification. The data are cross-tabulated as counts of data source DT by independently assigned DT, like this:
Table: counts of records cross-tabulated by independently assigned DT (rows) and data source DT (columns).
There are 26 true A records, which have a population proportion of 0.37, while the 45 true B records have a proportion of 0.63. The per-type Recall (R) and Precision (P) values for A and B are calculated first from the cross-tabulation.
The overall Recall and Precision are then obtained by weighting the per-type values; both reduce to the same value, the overall proportion of correctly assigned records.
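This equality can be checked numerically. In the sketch below, only the true-class totals (26 A records, 45 B records) come from the example above; the individual cell counts of the confusion matrix are assumptions for illustration. Exact rational arithmetic is used so the equality holds without floating-point tolerance.

```python
from fractions import Fraction

# Hypothetical confusion matrix; keys are (true DT, data source DT).
# Row totals 26 (A) and 45 (B) match the worked example; cells are assumed.
confusion = {
    ("A", "A"): Fraction(24), ("A", "B"): Fraction(2),   # 26 true A records
    ("B", "A"): Fraction(5),  ("B", "B"): Fraction(40),  # 45 true B records
}
types = ["A", "B"]
N = sum(confusion.values())  # 71 records in total

# Per-type Recall (correct / true count) and Precision (correct / assigned count).
recall = {t: confusion[(t, t)] / sum(confusion[(t, s)] for s in types) for t in types}
precision = {t: confusion[(t, t)] / sum(confusion[(s, t)] for s in types) for t in types}

# Overall Recall, weighted by true-class proportions, and overall Precision,
# weighted by assigned-class proportions, both collapse to overall accuracy.
overall_recall = sum(sum(confusion[(t, s)] for s in types) / N * recall[t] for t in types)
overall_precision = sum(sum(confusion[(s, t)] for s in types) / N * precision[t] for t in types)

assert overall_recall == overall_precision == Fraction(64, 71)
```

Whatever cell counts are chosen, the two overall values coincide, because each record contributes exactly once to the denominator of each weighted sum.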
Donner, P. Document type assignment accuracy in the journal citation index data of Web of Science. Scientometrics 113, 219–236 (2017). https://doi.org/10.1007/s11192-017-2483-y
- Citation normalization
- Document type
- Data accuracy
- Bibliometric data
- Citation impact
- Web of Science
- Data quality