Document type assignment accuracy in the journal citation index data of Web of Science

Abstract

This article reports the results of a study of the correctness of document type assignments in the commercial citation index database Web of Science (SCIE, SSCI and AHCI collections). For a random sample of 791 Web of Science records, the document type assignments are compared to those given on the official journal websites or in the publication full texts, across the four document type categories articles, letters, reviews and others, as defined by WoS. The proportion of incorrect assignments across document types and its influence on document-type-specific normalized citation scores are analysed. Document type data are found to be correct in 94% of records. Further analyses show that, within the records assigned to one document type in the data source, correctly and incorrectly assigned records differ in their average page counts and reference counts.

Notes

  1. For Web of Science: Web of Science® Help. Searching the Document Type Field [accessed 2016/10/07] http://images.webofknowledge.com/WOKRS59B4/help/WOS/hs_document_type.html. For Scopus: Scopus Content Coverage Guide Jan. 2016; page 10 [accessed 2016/10/07] https://www.elsevier.com/_data/assets/pdf_file/0007/69451/scopus_content_coverage_guide.pdf.

  2. The department of Thomson Reuters producing the citation indexes recently changed ownership and now runs under the company name Clarivate Analytics.

  3. For our WoS data (publications from 1980 to 2014) this list of multiply assigned DOI contains 19,096 entries and the list for Scopus data contains 81,715 entries (publications from 1996 to 2014).

  4. Documented for the 2015/2016 edition https://www.timeshighereducation.com/news/ranking-methodology-2016 but not the 2016/2017 edition https://www.timeshighereducation.com/world-university-rankings/methodology-world-university-rankings-2016-2017.

  5. https://www.nsf.gov/statistics/2016/nsb20161/#/.

  6. It should be kept in mind that all the expected values are themselves affected by DT inaccuracies. No corrected values for these reference scores are available, so this issue must be put aside for this study although it is relevant in general.

  7. http://www.tandfonline.com/doi/abs/10.1080/14767050701832833 accessed Sept. 21, 2016.

Acknowledgements

This study was supported by the German Federal Ministry of Education and Research (BMBF) Grant 01PQ13001, project “Kompetenzzentrum Bibliometrie”. I want to thank Anastasiia Tcypina for help with data collection and Nees Jan van Eck for discussion of the manuscript.

Author information

Corresponding author

Correspondence to Paul Donner.

Appendices

Appendix 1: Matching of WoS and Scopus records

An initial basic matching table was created by FIZ Karlsruhe for general use in projects by the database users. It is based on combinations of exactly corresponding values in 10 metadata fields. The fields were assigned weights according to their discriminative power. For example, the publication year field has low discriminative power, while that of article title and DOI is very high. Field values were normalized for differences in character sets, capitalization, special characters and some structural aspects (e.g. removing separating dashes in ISSN) to make them more uniform between the two data sources. All journal records were then mutually compared across all fields, and a score was calculated from the number and weights of exactly matching fields. Only pairs reaching a predetermined threshold score were kept. Through this procedure, a pair of records may be included multiple times in the resulting table because all reasonable field combinations are used. Furthermore, a record from one data source can occur as a plausible match pair with multiple records from the other data source.

The matching quality of this method was assessed before the study was conducted and found to be satisfactory. For this assessment, a random sample of 2450 matching pairs was selected from the table and manually checked for equivalence using full bibliographical data from both data sources. In 9 cases the match was incorrect, in 15 cases the decision was unclear and in 2426 cases the match was correct, so that more than 99% of the matches were correct.
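
The scoring step described in this appendix can be illustrated with a minimal sketch. It is not the FIZ Karlsruhe implementation: the field names, the weights, the threshold and the two example records are all hypothetical, chosen only to reflect the statement that publication year discriminates weakly while title and DOI discriminate strongly.

```python
import re

# Hypothetical weights and threshold; the appendix does not report the real values.
WEIGHTS = {"doi": 5.0, "title": 5.0, "issn": 2.0, "volume": 1.0, "pub_year": 0.5}
THRESHOLD = 7.0


def normalize(field, value):
    """Make a field value comparable across the two data sources."""
    if value is None:
        return None
    text = str(value).strip().lower()  # unify capitalization
    if field == "issn":
        text = text.replace("-", "")  # remove separating dashes
    elif field == "title":
        text = re.sub(r"[^a-z0-9 ]", "", text)  # drop special characters
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text or None


def match_score(wos, scopus):
    """Sum the weights of all fields whose normalized values match exactly."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        left = normalize(field, wos.get(field))
        right = normalize(field, scopus.get(field))
        if left is not None and left == right:
            score += weight
    return score


# Invented records that differ only in formatting.
wos_record = {"doi": "10.1000/ABC.1", "title": "Example Title: A Study",
              "issn": "1234-5678", "volume": "1", "pub_year": 2007}
scopus_record = {"doi": "10.1000/abc.1", "title": "Example title a study",
                 "issn": "12345678", "volume": "1", "pub_year": 2007}

score = match_score(wos_record, scopus_record)
is_candidate_pair = score >= THRESHOLD
print(score, is_candidate_pair)  # 13.5 True
```

In this sketch a pair that agrees only on the publication year scores 0.5 and is discarded, which mirrors the role of the low weight for weakly discriminative fields.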

This basic data was slightly modified to match the sampled WoS records uniquely to Scopus records. For WoS records with only one candidate Scopus record, the rows were copied directly to the final matching table. For WoS records with more than one plausible match, only the row with the highest matching score was copied. This concerns 2.6% of the entries in the base table. Using this look-up table, matches in Scopus were identified for 711 of the 793 records that were initially assessed. The remaining records were then matched on the DOI alone, and these matches were assessed manually. Finally, for all records still unmatched, the title was searched on the Scopus online platform and the results were checked for a correct match. These two steps produced another eleven verified matches.
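
The reduction of the base table to one Scopus record per WoS record can be sketched as follows. The record identifiers and scores are invented for illustration, and the handling of tied scores (the first row encountered is kept) is an assumption of this sketch, since the appendix does not say how ties were resolved.

```python
def unique_matches(candidate_pairs):
    """Keep, for each WoS record, the candidate row with the highest score.

    candidate_pairs: iterable of (wos_id, scopus_id, score) tuples.
    Returns a dict mapping each wos_id to its best-scoring scopus_id.
    """
    best = {}
    for wos_id, scopus_id, score in candidate_pairs:
        # Strict comparison: on a tie the earlier row is kept (assumption).
        if wos_id not in best or score > best[wos_id][1]:
            best[wos_id] = (scopus_id, score)
    return {wos_id: scopus_id for wos_id, (scopus_id, _) in best.items()}


# Hypothetical base table: W1 has one candidate, W2 has several rows.
pairs = [
    ("W1", "S1", 13.5),
    ("W2", "S7", 9.0),
    ("W2", "S8", 11.0),
    ("W2", "S8", 8.0),  # the same pair can occur more than once
]

lookup = unique_matches(pairs)
print(lookup)  # {'W1': 'S1', 'W2': 'S8'}
```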

The remaining unmatched records are mostly meeting abstract records, which are deliberately not included in Scopus. According to the DT assigned for this study, and after exclusion of publications that could not be found, these unmatched records comprise four articles, one letter, five reviews and 59 others. Among them, there were five misclassifications by WoS.

Appendix 2: Distribution of publication years in the WoS sample

See Table 10.

Table 10 Distribution of publication years in the WoS sample

Appendix 3: Equality of population estimates of Precision and Recall

In this study the overall population estimates of Precision and Recall for each data source, as opposed to the estimates for particular DTs, are equal, unlike in typical information retrieval evaluation. This is so because each case is “relevant” for exactly one of the four DT categories. To illustrate, a simplified example is worked through.

Consider a dataset with only two document types, A and B, for which we have the data source’s assignments and the true DTs, as assigned manually. There is no sampling and stratification. Cross-tabulating the independently assigned DT (rows) against the data source DT (columns) gives the following counts:

                             Data source DT
Independently assigned DT    A     B
A                            20    6
B                            5     40

There are 26 true A records, corresponding to a population proportion of 0.37, while the 45 true B records have a proportion of 0.63. Calculating first the DT Recall (R) and Precision (P) values for A and B gives:

$$R_{\text{A}} = \frac{20}{20 + 6} = 0.77\quad R_{\text{B}} = \frac{40}{40 + 5} = 0.89$$
$$P_{\text{A}} = \frac{20}{20 + 5} = 0.80\quad P_{\text{B}} = \frac{40}{40 + 6} = 0.87$$

Then the overall Recall and Precision are:

$$R = \left( {R_{\text{A}} \times w_{\text{A}} } \right) + \left( {R_{\text{B}} \times w_{\text{B}} } \right) = \left( {0.77 \times 0.37} \right) + \left( {0.89 \times 0.63} \right) = 0.85$$
$$P = \left( {P_{\text{A}} \times w_{\text{A}} } \right) + \left( {P_{\text{B}} \times w_{\text{B}} } \right) = \left( {0.80 \times 0.37} \right) + \left( {0.87 \times 0.63} \right) = 0.85$$
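
The equality can be checked with exact arithmetic in a short sketch using the counts of the worked example. One assumption is made here that goes beyond the text: Recall is weighted by the proportions of the true DTs, while Precision is weighted by the proportions of the data source DTs. Under that weighting both overall values reduce to the number of correctly assigned records divided by the total, so they coincide exactly rather than only after rounding.

```python
from fractions import Fraction

# counts[true_dt][source_dt], taken from the worked example
counts = {"A": {"A": 20, "B": 6}, "B": {"A": 5, "B": 40}}
types = list(counts)
total = sum(sum(row.values()) for row in counts.values())  # 71 records


def true_total(dt):
    """Number of records whose independently assigned DT is dt (row sum)."""
    return sum(counts[dt].values())


def source_total(dt):
    """Number of records the data source assigned to dt (column sum)."""
    return sum(counts[t][dt] for t in types)


def recall(dt):
    return Fraction(counts[dt][dt], true_total(dt))


def precision(dt):
    return Fraction(counts[dt][dt], source_total(dt))


# Recall weighted by true-DT shares, Precision by data-source shares (assumption).
overall_recall = sum(recall(dt) * Fraction(true_total(dt), total) for dt in types)
overall_precision = sum(precision(dt) * Fraction(source_total(dt), total) for dt in types)

print(overall_recall, overall_precision)  # 60/71 60/71
print(round(float(overall_recall), 2))    # 0.85
```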

About this article

Cite this article

Donner, P. Document type assignment accuracy in the journal citation index data of Web of Science. Scientometrics 113, 219–236 (2017). https://doi.org/10.1007/s11192-017-2483-y

Keywords

  • Citation normalization
  • Document type
  • Data accuracy
  • Bibliometric data
  • Citation impact
  • Web of Science
  • Scopus
  • Data quality