Benford’s Law and articles of scientific journals: comparison of JCR® and Scopus data

Benford’s Law is a logarithmic probability distribution function used to predict the distribution of the first significant digits in numerical data. This paper presents the results of a study of the distribution of the first significant digits of the number of articles published by journals indexed in the JCR® Science and Social Sciences Editions from 2007 to 2011. The data of these journals were also analyzed by country of origin and by journal category. Results based on the number of published articles reported by Scopus are also presented. Comparing the results, we observe that there is a significant difference between the data reported by the two databases.


Introduction
Benford's Law (Benford 1938), also known as the first digit law or law of the leading digits, is a logarithmic probability distribution function for the first significant digits, which can be written as $P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$, with $d = 1, 2, \ldots, 9$. [...] variables for each year separately, and compared this to the number predicted by Benford's Law. Citation data followed Benford's Law in all years studied. However, the data on the number of articles did not comply with Benford's Law in any of the years considered. The same occurred with the data for impact factors in almost all years studied. Recently, Egghe and Guns (2012) applied a generalization of Benford's Law, related to the general law of Zipf with exponent $b > 0$, to the data of Campanario and Coslado (2011). They used nonlinear least squares to determine the optimal $b$ and showed that this generalized law fits the data better than the classical Benford's Law.
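For reference, the classical law $P(d) = \log_{10}(1 + 1/d)$ can be tabulated in a few lines. This is only an illustrative sketch: the names `benford_probability` and `BENFORD` are ours and do not come from the study.

```python
import math


def benford_probability(d: int) -> float:
    """Benford probability that the first significant digit equals d (1..9)."""
    if not 1 <= d <= 9:
        raise ValueError("d must be an integer between 1 and 9")
    return math.log10(1 + 1 / d)


# Probabilities for digits 1..9 (illustrative name, not from the paper).
BENFORD = {d: benford_probability(d) for d in range(1, 10)}

for digit, prob in BENFORD.items():
    print(f"digit {digit}: {prob:.4f}")
```

Digit 1 is expected in about 30.1 % of cases and digit 9 in about 4.6 %, and the nine probabilities sum to 1.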
The present paper extends the work of Campanario and Coslado (2011). We analyzed the compliance with Benford's Law of the number of articles published by journals indexed in the JCR® Science and Social Sciences Editions from 2007 to 2011. We also investigated this compliance when the number of articles published is grouped by the journal's country of origin and by the journal's category. In addition, we make a comparison with the Scopus data.

Materials and methods
In this study we used data available in the JCR® database on the web from 2007 to 2011, with separate editions for Science and Social Sciences. All journals indexed in the JCR® with at least one published article were included. We also took into consideration each journal's country of origin and category.
First, we identified the first significant digit of the number of articles published in each journal indexed in the JCR®, separately for each year and edition, calculated the frequency of each digit, and compared it with the frequency predicted by Benford's Law.
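The counting step described above can be sketched as follows. The article counts in `sample` are hypothetical values chosen for illustration, not JCR® data, and the helper names are our own.

```python
import math
from collections import Counter


def first_digit(n: int) -> int:
    """Return the first significant digit of a positive integer."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return int(str(n)[0])


def digit_counts(article_counts):
    """Observed number of journals for each leading digit 1..9."""
    counts = Counter(first_digit(n) for n in article_counts)
    return [counts.get(d, 0) for d in range(1, 10)]


def expected_counts(total: int):
    """Number of journals per leading digit predicted by Benford's Law."""
    return [total * math.log10(1 + 1 / d) for d in range(1, 10)]


# Hypothetical numbers of articles per journal (illustration only).
sample = [12, 135, 48, 9, 210, 17, 1033, 76, 5, 154]
observed = digit_counts(sample)
expected = expected_counts(len(sample))
print(observed)
print([round(e, 2) for e in expected])
```

Journals with zero articles are excluded beforehand, which matches the inclusion rule of at least one published article.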
Then we carried out the $\chi^2$ test of the null hypothesis $H_0$ that the observed distribution of the first significant digit, in each case we consider, is the same as the distribution expected under Benford's Law. For n = 9 digits we have n - 1 = 8 degrees of freedom, and $\chi^2(8) = 15.507$ at the 5 % significance level (95 % confidence). This is the critical value for the acceptance or rejection of the null hypothesis: if the calculated value of $\chi^2$ is less than the critical value, we do not reject $H_0$ and conclude that the data are in compliance with Benford's Law; otherwise, we reject $H_0$.
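A minimal sketch of this test, assuming the usual Pearson statistic $\sum_d (O_d - E_d)^2 / E_d$ with expected counts $E_d = N \log_{10}(1 + 1/d)$. The critical value 15.507 is taken from the text; the function names and the example counts are hypothetical.

```python
import math

# Critical value for 8 degrees of freedom at the 5 % level, as quoted in the text.
CRITICAL_VALUE = 15.507


def chi_square_benford(observed):
    """Pearson chi-square statistic of nine digit counts against Benford's Law."""
    if len(observed) != 9:
        raise ValueError("expected nine counts, one per digit 1..9")
    total = sum(observed)
    statistic = 0.0
    for d, obs in enumerate(observed, start=1):
        exp = total * math.log10(1 + 1 / d)
        statistic += (obs - exp) ** 2 / exp
    return statistic


def complies_with_benford(observed) -> bool:
    """True when the statistic is below the critical value (H0 not rejected)."""
    return chi_square_benford(observed) < CRITICAL_VALUE


# Hypothetical counts for 1000 journals, close to the Benford proportions.
near_benford = [301, 176, 125, 97, 79, 67, 58, 51, 46]
# Hypothetical counts with all digits equally frequent.
uniform = [111] * 9

print(complies_with_benford(near_benford), complies_with_benford(uniform))
```

The first set of counts yields a statistic far below the critical value, while the uniform counts yield one far above it.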
Alternatively, we can test each of the nine proportions separately. The Z-statistic tests whether the observed proportion for a digit differs significantly from the expected value based on Benford's Law (Nigrini 2012). It takes into account the absolute magnitude (the numeric distance) of the difference between the observed and the expected values, the size of the data set, and the expected proportion, and is given by the following equation:
$$Z = \frac{|P_o - P_e| - \frac{1}{2N}}{\sqrt{\frac{P_e (1 - P_e)}{N}}}$$
where $P_o$ denotes the observed proportion, $P_e$ the expected proportion, and $N$ the total number of observations. The term $1/(2N)$ in the numerator is a continuity correction and is applied only when it is smaller than the other term in the numerator, $|P_o - P_e|$. At a significance level of 5 %, the cutoff is 1.96. A Z-statistic above 1.96 indicates that the difference between the observed and the expected proportions is significant at the 0.05 level, which means that a difference this large would arise by chance with a probability below 5 % if the data followed Benford's Law.
Scientometrics (2014) 98:173-184
Data available in the Scopus database were also used. Using this database, we tested the number of articles published in journals of some countries and categories of the JCR®. Similarly, all journals indexed in Scopus with at least one published article were considered. Furthermore, only journals present in both databases were considered.
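The Z-statistic described above can be sketched as follows, assuming the form attributed to Nigrini (2012): the absolute difference between proportions, reduced by the continuity correction $1/(2N)$ only when that correction is the smaller term, divided by $\sqrt{P_e (1 - P_e) / N}$. The function name and the example proportions are hypothetical.

```python
import math


def z_statistic(p_obs: float, p_exp: float, n: int) -> float:
    """Z-statistic for one digit, with the continuity correction 1/(2N)."""
    diff = abs(p_obs - p_exp)
    correction = 1 / (2 * n)
    # The correction is applied only when it is smaller than |p_obs - p_exp|.
    if correction < diff:
        diff -= correction
    return diff / math.sqrt(p_exp * (1 - p_exp) / n)


# Hypothetical example: digit 1 observed in 34 % of 1000 journals.
p_expected = math.log10(2)  # Benford probability of digit 1
z = z_statistic(0.34, p_expected, 1000)
print(f"Z = {z:.2f}, significant at 5 %: {z > 1.96}")
```

In this invented example the observed proportion of digit 1 is about 3.9 percentage points above the Benford value, which is enough to exceed the 1.96 cutoff for 1000 observations.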
Using the binomial distribution (Ni et al. 2009), the expected root-mean-square error was also calculated as $\Delta N = \sqrt{N P(d) \left(1 - P(d)\right)}$, where $N$ is the total number of points considered and $P(d)$ is the prediction of Benford's Law.
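As a rough illustration, the sketch below assumes that the expected root-mean-square error is the standard deviation of a binomial count, $\sqrt{N P(d) (1 - P(d))}$. That form is our assumption based on the reference to the binomial distribution, and the function name is hypothetical.

```python
import math


def expected_rms_error(n: int, d: int) -> float:
    """Binomial standard deviation of the count of leading digit d among n points.

    Assumed form: sqrt(n * P(d) * (1 - P(d))), with P(d) from Benford's Law.
    """
    if not 1 <= d <= 9:
        raise ValueError("d must be an integer between 1 and 9")
    p = math.log10(1 + 1 / d)
    return math.sqrt(n * p * (1 - p))


# Hypothetical example: 1000 journals.
for digit in range(1, 10):
    print(digit, round(expected_rms_error(1000, digit), 2))
```

Under this assumption, a set of 1000 journals would show digit 1 about 301 times, give or take roughly 14.5.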

Results and discussions
Campanario and Coslado (2011) [...] We decided to extend their analysis and investigated the data of the following years. We analyzed the number of articles published in journals indexed in the JCR® Science Edition from 2007 to 2011; the results are shown in Table 3. Although they had already calculated the $\chi^2$ value for 2007, we recalculated it to verify that our results are compatible with theirs. We observed a small difference, probably because the update of the JCR® database let us consider a larger number of journals.
The $\chi^2$ values in all years are significantly greater than the critical value. Furthermore, the Z values for digit 1 are greater than the cutoff level (1.96) in all years. The same occurred with digit 5, except in 2007.
Campanario and Coslado (2011) considered only journals of the Science Edition; we extended the calculation to the JCR® Social Sciences Edition. The result is presented in Table 4 and, as can be seen, it is even worse. No year is in compliance with Benford's Law, and the Z values are greater than the cutoff level for almost all digits. Campanario and Coslado (2011) mentioned in their paper that they had no explanation for these differences.
Mir (2012) observed that the data of three major Christian denominations follow Benford's Law. However, when Christianity is considered as a single religious group, the distribution of the significant digits of the adherent data deviates from the predictions of Benford's Law. Inspired by this observation, we analyzed the journals according to their country of origin and their JCR® category.
Table 5 [...] The great majority of the countries are in compliance with Benford's Law. "Poland" and "Turkey" are the countries that appeared most often in the list of countries that are not in compliance with Benford's Law.
In the case of "Turkey" it is interesting to note that the number of journals indexed in the JCR® greatly increased from one year to another. Furthermore, the $\chi^2$ values decrease as the number of journals and articles increases. It is worth observing that the number of journals indexed in the JCR® is very small for some countries, too small to use the $\chi^2$ test for the adherence of the data to Benford's Law. According to Nigrini (2012), the rule for the $\chi^2$ test on the first non-zero significant digit is that the expected number of observations in each cell should be at least 5; hence, the number of observations should be at least 100 (100 times 0.0458, the Benford probability of digit 9, gives 4.58, which is close enough to 5).
The result is very similar for the journals indexed in the JCR® Social Sciences Edition. Only a few countries are not in compliance with Benford's Law, as shown in Table 6. Nevertheless, the $\chi^2$ values are much smaller than the values obtained when the journals were considered as a single group. It is interesting to observe that "United States" and "England" are not in compliance with Benford's Law in any year.
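The sample-size rule quoted from Nigrini (2012) can be checked with a short calculation. The sketch below assumes that the least likely digit, 9, determines the smallest expected cell count; the function names are hypothetical.

```python
import math

# Benford probability of the least likely first digit, 9 (about 0.0458).
P_NINE = math.log10(1 + 1 / 9)


def smallest_expected_count(n: int) -> float:
    """Expected number of observations in the digit-9 cell for n observations."""
    return n * P_NINE


def min_observations(threshold: float = 5.0) -> int:
    """Smallest n whose digit-9 cell has an expected count of at least threshold."""
    return math.ceil(threshold / P_NINE)


print(round(smallest_expected_count(100), 2))
print(min_observations())
```

With 100 observations the smallest expected count is about 4.58, and a strict reading of the rule would require 110 observations, so the threshold of 100 used in the text is a close approximation.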
Another analysis took into consideration the journal's category in the JCR® Science Edition from 2007 to 2011. The result is presented in Table 7. The percentage of categories in compliance with Benford's Law is larger than the percentage of countries in compliance in every year except 2009. "Mathematics" and "Nursing" appeared most often in the list of categories that are not in compliance with Benford's Law.
For the journals indexed in the JCR® Social Sciences Edition, the result is significantly worse than the result by country of origin, as shown in Table 8. In some cases the number of journals in compliance with Benford's Law was lower than the number of journals not in compliance. "Sociology" is a category that is not in compliance with Benford's Law in any year.
It is interesting to observe that the $\chi^2$ values for the journals indexed in the JCR® Social Sciences Edition are always greater than those for journals indexed in the JCR® Science Edition.
Next, we compare the number of published articles reported in the JCR® and Scopus databases. We limited the comparison to journals of some countries and of some categories. To make the comparison, we considered only journals that were present in both databases.
The analysis showed that there are some cases where the Scopus data are in compliance with Benford's Law but the JCR® data are not. The opposite was also observed, that is, cases where the JCR® data are in compliance with Benford's Law but the Scopus data are not. In Table 9 we summarize these findings with 8 examples. The examples were chosen so that the total number of journals in each is more than 100. For each example, the number of journals indexed in both databases is presented, followed in parentheses by the number of journals indexed in the JCR® database. The columns "Min" and "Max" indicate the minimum and maximum number of articles published in journals indexed in the JCR® and Scopus databases, respectively, according to the country of origin or category considered. The $\chi^2$ values are also presented, and the values that are not in compliance with Benford's Law are highlighted. The digits (d) with significant differences according to the Z-statistic test are also listed. We observed two examples that are in compliance with Benford's Law according to the $\chi^2$ test but have one digit with a significant difference according to its Z value.

Conclusions
In this paper we applied Benford's Law to the data of the JCR® Science and Social Sciences Editions and of Scopus, taking into consideration the number of articles published by journals indexed in the two databases. The data of these journals were analyzed by country of origin and by journal category. In the country-of-origin analyses, the majority of countries are in compliance with Benford's Law. In the case of the journal's category, the majority also follows Benford's Law, except in two recent years (2010 and 2011) for journals indexed in the JCR® Social Sciences Edition.
The nonconformity with Benford's Law identified in this work could indicate incomplete data; data errors, inconsistencies, or anomalies; and/or conformity to a large exponential power law, occurring in the JCR® and/or Scopus data, in view of the significant differences observed. As examples of incomplete data, Karamourzov (2012) observed that only a small fraction (<8 %) of the journals of Russia were indexed by JCR® in 2010, and Michels and Schmoch (2012) noted the steady increase in recent years of publications indexed in both Web of Science and Scopus. These indications were already mentioned in previous works where nonconformities were observed (see, for instance, Nigrini (2012)).
We believe that the main contribution of this study is to draw attention to these differences and, perhaps, to provide an exploratory instrument for identifying where data anomalies may be occurring, regardless of which database is correct.