Field- and time-normalization of data with many zeros: an empirical analysis using citation and Twitter data

Thelwall (J Informetr 11(1):128–151, 2017a. 10.1016/j.joi.2016.12.002; Web indicators for research evaluation: a practical guide. Morgan and Claypool, London, 2017b) proposed a new family of field- and time-normalized indicators, which is intended for sparse data. These indicators are based on units of analysis (e.g., institutions) rather than on the paper level. They compare the proportion of mentioned papers (e.g., on Twitter) of a unit with the proportion of mentioned papers in the corresponding fields and publication years. We propose a new indicator (Mantel–Haenszel quotient, MHq) for the indicator family. The MHq is rooted in the Mantel–Haenszel (MH) analysis. This analysis is an established method, which can be used to pool the data from several 2 × 2 cross tables based on different subgroups. We investigate using citations and assessments by peers whether the indicator family can distinguish between quality levels defined by the assessments of peers. Thus, we test the convergent validity. We find that the MHq is able to distinguish between quality levels in most cases while other indicators of the family are not. Since our study approves the MHq as a convergent valid indicator, we apply the MHq to four different Twitter groups as defined by the company Altmetric. Our results show that there is a weak relationship between the Twitter counts of all four Twitter groups and scientific quality, much weaker than between citations and scientific quality. Therefore, our results discourage the use of Twitter counts in research evaluation.


Introduction
Alternative metrics (altmetrics) is a new, fast-moving area in scientometrics (Galloway, Pease, & Rauh, 2013). Altmetrics, a collection of many web-based indicators, were initially proposed as a supplement to traditional bibliometric indicators. They measure attention related to research papers on internet platforms. The core of altmetrics is gathered from social media platforms, but mentions in mainstream media or in policy documents also belong to the umbrella term altmetrics (National Information Standards Organization, 2016; Work, Haustein, Bowman, & Larivière, 2015). According to Haustein (2016), sources of altmetrics can be grouped into (i) social networks, (ii) social bookmarks and online reference management, (iii) social data (e.g., data sets, software, presentations), (iv) blogs, (v) microblogs, (vi) wikis, and (vii) recommendations, ratings, and reviews.
Recently, some indicators based on altmetrics have been proposed which are normalized with respect to the scientific field and publication year. These indicators were developed because studies have shown that altmetrics are, similar to bibliometric data, field- and time-dependent (see, e.g., Bornmann, 2014). Some fields are more relevant to the general public or a broader audience than other fields (Haustein, Larivière, Thelwall, Amyot, & Peters, 2014). The Mean Normalized Reader Score (MNRS) was introduced by Haunschild and Bornmann (2016) for normalization of data from social bookmarks and online reference management platforms (with a special emphasis on Mendeley readers) (see also Fairclough & Thelwall, 2015). The Mean Discipline Normalized Reader Score (MDNRS) was tailored specifically to Mendeley by Bornmann and Haunschild (2016b). The MDNRS uses Mendeley disciplines for field normalization. The employed normalization procedures rely on average value calculations across scientific fields and publication years as expected values. However, normalization procedures based on averages (and percentiles) of individual papers are problematic for zero-inflated data sets (Haunschild, Schier, & Bornmann, 2016). The overview of Work et al. (2015) on studies investigating the coverage of papers on social media platforms shows that many platforms have coverages of less than 5% (e.g., blogs or Wikipedia). Erdt, Nagarajan, Sin, and Theng (2016) reported similar findings in their meta-analysis. They found that former empirical studies dealing with the coverage of altmetrics show that about half of the platforms are at or below 5%; except for three (out of eleven), the coverage is below 10%.
Bornmann and Haunschild (2016a) propose the Twitter Percentile (TP), a field- and time-normalized indicator for Twitter data. They circumvent the problem of zero-inflated Twitter data by including in the TP calculation only journals in which at least 80% of the papers have at least one tweet each. However, this procedure leads to the exclusion of many journals from the TP procedure. Very recently, Thelwall (2017a, 2017b) proposed a new family of field- and time-normalized indicators. These indicators are based on units of analysis (e.g., a researcher or institution) rather than on single papers. They compare the proportion of mentioned papers (e.g., on Twitter) of a unit with the proportion of mentioned papers in the corresponding fields and publication years (the expected values). The family consists of the Equalized Mean-based Normalized Proportion Cited (EMNPC) and the Mean-based Normalized Proportion Cited (MNPC). Hitherto, this new family of indicators has only been studied on rather small samples.
In this study, we investigate the new indicator family empirically on a large scale (multiple complete publication years) and add another member to this family. In statistics, the Mantel-Haenszel (MH) analysis is frequently used for pooling the data from multiple 2×2 cross tables based on different subgroups. In this study, the mentioned and not-mentioned papers of a unit, which have been published in different subject categories and publication years, are compared with the corresponding reference sets. We name the new indicator Mantel-Haenszel quotient (MHq). In the empirical part of this study, we compare the indicator scores with assessments by peers. We are interested in whether the indicators can discriminate between the different quality levels that peers assigned to publications. In other words, we investigate the convergent validity of the indicators. The convergent validity can only be tested by using citations, since we can assume that they are related to quality (Diekmann, Naf, & Schubiger, 2012). Thus, the first empirical part is based on citations. In the second part (after confirmation of convergent validity), MHq values are presented for Twitter data as an example.

Indicators for zero-inflated count data
The next sections present the formulas not only for the calculation of the EMNPC, MNPC, and MHq, but also for the corresponding 95% confidence intervals (CIs). The CI shows the range of possible indicator values: we can be 95% confident that the interval includes the "true" indicator value in the population. Thus, we treat the analyzed data as a sample and use CIs to infer to a larger, inaccessible population (Williams & Bornmann, 2016). Claveau (2016) argues for using inferential statistics with scientometric data as follows: "these observations are realizations of an underlying data generating process … The goal is to learn properties of the data generating process. The set of observations to which we have access, although they are all the actual realizations of the process, do not constitute the set of all possible realizations. In consequence, we face the standard situation of having to infer from an accessible set of observations – what is normally called the sample – to a larger, inaccessible one – the population. Inferential statistics are thus pertinent" (p. 1233).
Equalized Mean-based Normalized Proportion Cited (EMNPC)
Thelwall (2017a, 2017b) proposed the EMNPC as an alternative indicator for zero-inflated count data. Here, the proportion of mentioned publications is calculated: suppose that the publication set of a group g has $n_{gf}$ papers in the publication year and subject category combination f, and that $s_{gf}$ denotes the number of these papers that are mentioned (e.g., on Twitter). F denotes all publication year and subject category combinations of the publications in the set. The overall proportion of g's mentioned publications is the number of mentioned publications divided by the total number of publications, each summed over all combinations in F:

$$p_g = \frac{\sum_{f \in F} s_{gf}}{\sum_{f \in F} n_{gf}} \tag{1}$$

However, $p_g$ could lead to misleading results if the publication set g includes many publications which appeared in fields with many mentioned papers. Thus, Thelwall (2017a, 2017b) proposes to artificially treat g as having the same number of publications in each year and subject category combination. He fixes it to the arithmetic mean of the numbers in each combination. However, he recommends not including combinations of g with only a few publications in the analysis. Thus, the equalized sample proportion $\hat{p}_g$ is the average of the proportions in each combination:

$$\hat{p}_g = \frac{1}{[F]} \sum_{f \in F} \frac{s_{gf}}{n_{gf}} \tag{2}$$

The corresponding equalized sample proportion of the world (w) is:

$$\hat{p}_w = \frac{1}{[F]} \sum_{f \in F} \frac{s_{wf}}{n_{wf}} \tag{3}$$

In Eqns. (2) and (3), [F] is the number of subject category and publication year combinations in which the group (in case of Eq. (2)) and the world (in case of Eq. (3)) publish. The equalized group sample proportion has the following undesirable property: it treats g as if the average number of mentioned publications does not vary between the subject categories. The ratio of both equalized sample proportions (group and world) is the EMNPC:

$$\mathrm{EMNPC} = \frac{\hat{p}_g}{\hat{p}_w} \tag{4}$$

According to Thelwall (2017a), CIs for the EMNPC can be calculated as follows:

$$\mathrm{EMNPC}_L = \exp\left(\ln\left(\frac{\hat{p}_g}{\hat{p}_w}\right) - 1.96\sqrt{\frac{(n_g - \hat{p}_g n_g)/(\hat{p}_g n_g)}{n_g} + \frac{(n_w - \hat{p}_w n_w)/(\hat{p}_w n_w)}{n_w}}\right) \tag{5}$$

$$\mathrm{EMNPC}_U = \exp\left(\ln\left(\frac{\hat{p}_g}{\hat{p}_w}\right) + 1.96\sqrt{\frac{(n_g - \hat{p}_g n_g)/(\hat{p}_g n_g)}{n_g} + \frac{(n_w - \hat{p}_w n_w)/(\hat{p}_w n_w)}{n_w}}\right) \tag{6}$$

In Eqns. (5) and (6), $n_w$ is the total sample size of the world, and $n_g$ is the total sample size of group g.
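The EMNPC calculation can be illustrated with a minimal Python sketch. The function name `emnpc`, the dictionary layout (combination f mapped to a pair of mentioned and total papers), and the toy counts are assumptions made for illustration only. The sketch averages the group and world proportions over the combinations in which the group publishes, and it uses the usual large-sample interval on the log scale of a ratio of two proportions; it does not handle a group or world proportion of zero.

```python
import math

def emnpc(group, world, z=1.96):
    """Return (EMNPC, lower, upper).

    group, world: dicts mapping a year/subject-category combination f
    to a tuple (mentioned papers, total papers). Hypothetical layout.
    """
    combos = sorted(group)  # assumption: use the group's combinations
    k = len(combos)
    # Equalized proportions: plain average of per-combination proportions
    p_g = sum(group[f][0] / group[f][1] for f in combos) / k
    p_w = sum(world[f][0] / world[f][1] for f in combos) / k
    # Total sample sizes of group and world
    n_g = sum(group[f][1] for f in combos)
    n_w = sum(world[f][1] for f in combos)
    ratio = p_g / p_w
    # Standard error of ln(ratio) for a ratio of two proportions
    se = math.sqrt((1 - p_g) / (p_g * n_g) + (1 - p_w) / (p_w * n_w))
    return ratio, ratio * math.exp(-z * se), ratio * math.exp(z * se)

# Toy data (invented): two combinations of subject category and year
group = {("bio", 2012): (30, 100), ("chem", 2012): (5, 50)}
world = {("bio", 2012): (2000, 10000), ("chem", 2012): (500, 10000)}

value, lower, upper = emnpc(group, world)
print(f"EMNPC = {value:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

In this toy example the equalized group proportion is 0.2 and the equalized world proportion is 0.125, so the group's proportion of mentioned papers is 1.6 times the world's.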

Mean-based Normalized Proportion Cited (MNPC)
The other indicator proposed by Thelwall (2017a) is the Mean-based Normalized Proportion Cited (MNPC), which is calculated as follows: for each publication which is mentioned at least once (e.g., on Twitter), the reciprocal of the world proportion mentioned for the corresponding subject category and publication year replaces the number of mentions. All unmentioned publications remain at zero. If $p_{gf} = s_{gf}/n_{gf}$ is the proportion of mentioned publications in set g in the subject category and publication year combination f and $p_{wf} = s_{wf}/n_{wf}$ is the proportion of the world's mentioned publications in the same combination f, then:

$$r_i = \begin{cases} 0, & \text{if } c_i = 0 \\ 1/p_{wf}, & \text{if } c_i > 0 \end{cases} \tag{7}$$

where $c_i$ is the number of mentions of paper i, and paper i is from year and subject category combination f. The MNPC calculation follows the calculation of the MNCS (Waltman, van Eck, van Leeuwen, Visser, & van Raan, 2011) and is defined as:

$$\mathrm{MNPC} = \frac{r_1 + r_2 + \dots + r_{n_g}}{n_g} \tag{8}$$

where $n_g$ is the number of publications of group g. Thelwall (2016, 2017a) proposes an approximate CI for the MNPC. The lower limit L ($\mathrm{MNPC}_{gfL}$) and upper limit U ($\mathrm{MNPC}_{gfU}$) for group g in year and subject category combination f are calculated in the first step:

$$\mathrm{MNPC}_{gfL} = \exp\left(\ln\left(\frac{p_{gf}}{p_{wf}}\right) - 1.96\sqrt{\frac{(n_{gf} - p_{gf} n_{gf})/(p_{gf} n_{gf})}{n_{gf}} + \frac{(n_{wf} - p_{wf} n_{wf})/(p_{wf} n_{wf})}{n_{wf}}}\right) \tag{9}$$

$$\mathrm{MNPC}_{gfU} = \exp\left(\ln\left(\frac{p_{gf}}{p_{wf}}\right) + 1.96\sqrt{\frac{(n_{gf} - p_{gf} n_{gf})/(p_{gf} n_{gf})}{n_{gf}} + \frac{(n_{wf} - p_{wf} n_{wf})/(p_{wf} n_{wf})}{n_{wf}}}\right) \tag{10}$$

In a second step, the group-specific lower and upper limits are combined to obtain the CI of the MNPC (see Thelwall, 2017a). If any of the world proportions are equal to zero, the MNPC cannot be calculated. If any of the group proportions are equal to zero, CIs cannot be calculated. As solutions, either the corresponding subject category and publication year combination can be removed from the data or a continuity correction of 0.5 can be added to the number of mentioned and not mentioned publications (Thelwall, 2017a). We prefer to use the continuity correction. Plackett (1974) recommends this approach for the calculation of odds ratios.
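The point estimate of the MNPC can be sketched in a few lines of Python. The function name `mnpc`, the dictionary layout, and the toy counts are assumptions for illustration. Because every mentioned paper in combination f receives the same score (the reciprocal of the world proportion), the sketch sums the scores per combination instead of looping over single papers. It only computes the indicator value, not the CI, and it raises an error for a world proportion of zero rather than applying the continuity correction.

```python
def mnpc(group, world):
    """Return the MNPC point estimate.

    group, world: dicts mapping a year/subject-category combination f
    to a tuple (mentioned papers, total papers). Hypothetical layout.
    """
    total_score = 0.0
    n_g = 0
    for f, (s_gf, n_gf) in group.items():
        s_wf, n_wf = world[f]
        if s_wf == 0:
            # The MNPC is undefined for a world proportion of zero
            raise ValueError(f"world proportion is zero for combination {f}")
        p_wf = s_wf / n_wf
        # Each mentioned paper scores 1/p_wf; unmentioned papers score 0
        total_score += s_gf / p_wf
        n_g += n_gf
    return total_score / n_g

# Toy data (invented): two combinations of subject category and year
group = {("bio", 2012): (30, 100), ("chem", 2012): (5, 50)}
world = {("bio", 2012): (2000, 10000), ("chem", 2012): (500, 10000)}

value = mnpc(group, world)
print(f"MNPC = {value:.3f}")
```

With these counts, the 30 mentioned papers in the first combination each score 1/0.2 = 5 and the 5 mentioned papers in the second each score 1/0.05 = 20, which gives 250 score points spread over 150 papers.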

Mantel-Haenszel quotient (MHq)
The recommended method for pooling the data from multiple 2×2 cross tables, based on different subgroups (which are part of a larger population), is the Mantel-Haenszel (MH) analysis (Hollander & Wolfe, 1999; Mantel & Haenszel, 1959; Sheskin, 2007). According to Fleiss, Levin, and Paik (2003), the method "permits one to estimate the assumed common odds ratio and to test whether the overall degree of association is significant. Curiously, it is not the odds ratio itself but another measure of association that directly underlies the test for overall association … The fact that the methods use simple, closed-form formulas has much to recommend it" (p. 250). The results of Radhakrishna (1965) point out that the MH approach is empirically and formally valid against the background of clinical trials.
The MH analysis yields a summary odds ratio for multiple 2×2 cross tables. We call this summary odds ratio MHq. If the impact of units in science is compared with reference sets (the world), the 2×2 cross tables (which are pooled) consist of the number of publications mentioned and not mentioned in subject category and publication year combinations f. In the 2×2 subject- and year-specific cross table with the cells $a_f$, $b_f$, $c_f$, and $d_f$ (see Table 1), $a_f$ is the number of mentioned publications published by group g in subject category and year f, $b_f$ is the number of unmentioned publications published by group g in subject category and year f, $c_f$ is the number of mentioned publications of the world w in subject category and year f, and $d_f$ is the number of unmentioned publications of the world w in subject category and year f. Publications of group g are part of the publications in the world. The MHq calculation starts by defining some auxiliary variables:

$$R_f = \frac{a_f d_f}{n_f}, \quad R = \sum_{f} R_f, \quad S_f = \frac{b_f c_f}{n_f}, \quad S = \sum_{f} S_f, \quad P_f = \frac{a_f + d_f}{n_f}, \quad Q_f = \frac{b_f + c_f}{n_f}$$

Where:

$$n_f = a_f + b_f + c_f + d_f$$

The MHq is defined as:

$$\mathrm{MHq} = \frac{R}{S}$$

The MHq CIs are calculated as recommended by Fleiss et al. (2003). The variance of ln(MHq) is estimated by:

$$\widehat{\mathrm{Var}}[\ln(\mathrm{MHq})] = \frac{1}{2}\left(\frac{\sum_{f} P_f R_f}{R^2} + \frac{\sum_{f} (P_f S_f + Q_f R_f)}{R S} + \frac{\sum_{f} Q_f S_f}{S^2}\right)$$

Calculation of the MHq CIs is performed as follows:

$$\mathrm{MHq}_L = \exp\left(\ln(\mathrm{MHq}) - 1.96\sqrt{\widehat{\mathrm{Var}}[\ln(\mathrm{MHq})]}\right), \quad \mathrm{MHq}_U = \exp\left(\ln(\mathrm{MHq}) + 1.96\sqrt{\widehat{\mathrm{Var}}[\ln(\mathrm{MHq})]}\right)$$

It is an advantage of the MHq that the world average has a value of 1. This is similar to the EMNPC and MNPC and simplifies the interpretation. A further advantage of the MHq is that the result can be interpreted as a percentage relative to the world average. For example, MHq = 1.20 means that the publication set under study has an impact which is 20% above average.
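The pooling step can be made concrete with a short Python sketch of the Mantel-Haenszel summary odds ratio and a 95% CI on the log scale. The function name `mhq`, the tuple layout (a, b, c, d) per combination, and the toy counts are assumptions for illustration. The variance term is the standard closed-form estimator for the log of the MH odds ratio, as it is commonly given in the statistical literature; the sketch assumes that neither pooled sum is zero.

```python
import math

def mhq(tables, z=1.96):
    """Return (MHq, lower, upper) for a list of 2x2 tables.

    tables: list of tuples (a, b, c, d), one per year/subject-category
    combination f: group mentioned, group unmentioned, world mentioned,
    world unmentioned. Hypothetical layout.
    """
    R = S = 0.0
    sum_PR = sum_PS_QR = sum_QS = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        R_f = a * d / n
        S_f = b * c / n
        P_f = (a + d) / n
        Q_f = (b + c) / n
        R += R_f
        S += S_f
        sum_PR += P_f * R_f
        sum_PS_QR += P_f * S_f + Q_f * R_f
        sum_QS += Q_f * S_f
    q = R / S
    # Closed-form variance estimate of ln(MHq)
    var = 0.5 * (sum_PR / R**2 + sum_PS_QR / (R * S) + sum_QS / S**2)
    half_width = z * math.sqrt(var)
    return q, math.exp(math.log(q) - half_width), math.exp(math.log(q) + half_width)

# Toy data (invented): two combinations of subject category and year
tables = [(30, 70, 2000, 8000), (5, 45, 500, 9500)]

value, lower, upper = mhq(tables)
print(f"MHq = {value:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

For a single table the pooled estimate reduces to the ordinary odds ratio (a·d)/(b·c), which is a convenient sanity check for any implementation.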

Empirical analysis using citations
Comparing indicator values with peer evaluations is an established way of analyzing the convergent validity of indicators (Garfield, 1979; Kreiman & Maunsell, 2011). Convergent validity is defined as the degree to which two measurements of a construct (here: two indicators of scientific quality) with a theoretical relationship are also empirically related. This approach has been justified by Thelwall (2017b) as follows: "if indicators tend to give scores that agree to a large extent with human judgements then it would be reasonable to replace human judgements with them when a decision is not important enough to justify the time necessary for experts to read the articles in question. Indicators can be useful when the value of an assessment is not great enough to justify the time needed by experts to make human judgements" (p. 4). Several publications studying the relationship between Research Excellence Framework (REF) outcomes and citations reveal considerable relationships in different fields, such as psychology and biological science (Butler & McAllister, 2011; Mahdi, d'Este, & Neely, 2008; McKay, 2012; Smith & Eysenck, 2002; Traag & Waltman, 2017; Wouters et al., 2015). Similar results have been reported for the Italian research assessment exercise: "The correlation strength between peer assessment and bibliometric indicators is statistically significant, although not perfect. Moreover, the strength of the association varies across disciplines, and it also depends on the discipline internal coverage of the used bibliometric database" (Franceschet & Costantini, 2011, p. 284). Bornmann (2011) shows in an overview of studies on journal peer review that better recommendations from peers are related to higher citation impact of the corresponding papers. The correlation between citation impact scores and recommendation scores (RS) from F1000Prime has already been investigated in other studies. The results of Bornmann (2015) reveal that about 40% of papers with RS=1 are highly cited papers; for publications with RS=2 and RS=3 the percentages are 60% and 73%, respectively. Waltman and Costas (2014) report "a clear correlation between F1000 recommendations and citations" (p. 433). The previous results on F1000Prime therefore suggest that citation-based indicators differentiate between the three quality levels. Conversely, the validity of a new indicator does not seem to be given if it does not differentiate between these levels.
Against this backdrop, we analyze in this study the ability of MHq, EMNPC, and MNPC to differentiate between the F1000Prime quality groups. Figure 1 shows the MHqs with CIs for Q0, Q1, and Q2 across four publication years. The average MHq for all years is close to (but below) 1 for Q0. The mean MHq for Q1 is about eight times and that for Q2 is about 15 times higher than the mean MHq for Q0. Thus, the MHq indicator seems to separate significantly between Q0, Q1, and Q2; the MHq values seem to be convergent valid with respect to F1000Prime scores.
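One simple way to read such results is to check whether the CIs of neighbouring quality groups overlap. The following Python sketch shows this check; the helper name `intervals_overlap` and the interval bounds are hypothetical and are not values reported in this study. Non-overlapping 95% CIs are a conservative sign of a difference, so the check is a reading aid rather than a formal test.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical CIs for three quality groups in one publication year
ci_q0 = (0.90, 0.98)
ci_q1 = (7.1, 8.6)
ci_q2 = (12.4, 16.9)

separated = (not intervals_overlap(ci_q0, ci_q1)
             and not intervals_overlap(ci_q1, ci_q2))
print("Quality groups separated by non-overlapping CIs:", separated)
```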

Empirical analysis using Twitter data
In the previous section, we demonstrated that the MHq is convergent valid using citation data compared with post-publication recommendation scores from F1000Prime. The MHq is able to distinguish between different scientific quality levels as defined by F1000Prime scores. In this section, we determine whether and to what extent the Twitter counts of different Twitter groups (as defined by the company Altmetric) can distinguish between the same quality levels. Figure 4 shows the MHq results for researchers, science communicators, practitioners, and members of the public. For all four Twitter groups, the MHq for the quality group Q0 is close to but below 1. The MHq values for the quality groups Q1 and Q2 are between 2 and 4 and between 4 and 6, respectively. In Figure 1, which shows the results for citation data, MHq(Q1) is between 7 and 9, and MHq(Q2) is between 11 and 18. Thus, all four Twitter groups show a much weaker association with scientific quality than citations, by a factor of about three. The results in Figure 4 further reveal that the association with scientific quality is on a similar level for all four Twitter groups. The association is somewhat stronger for researchers than for the other groups, which is in line with the expectation that researchers should be able to assess scientific quality better than the other groups.

Discussion
Much of the altmetrics data is sparse (Neylon, 2014). A metric inflated by zero values is not informative for research evaluation purposes in the first place (Thelwall, Kousha, Dinsmore, & Dolby, 2016). Thelwall (2017a, 2017b) introduced a new family of field- and time-normalized indicators for sparse data, including the EMNPC and MNPC. Here, the proportion of mentioned publications of a unit (e.g., a researcher) is compared with the expected values (the proportion of mentioned publications in the corresponding publication years and fields). EMNPC and MNPC differ from most of the indicators used in bibliometrics and altmetrics. Usually, an indicator value is calculated for each publication. The publication-based indicator values can then be aggregated by the user, for example, by averaging or summing. Instead, the indicators of the new family are based on the calculation of the indicator values for publication sets of groups (e.g., universities). This property implies that the new indicators cannot be used as versatilely as the usual bibliometric (and altmetric) indicators. However, they are able (by construction) to handle zero-inflated data properly, which the usual indicators are not.
In this study, we added a further variant to the family, the MHq, and analyzed all three variants empirically. We started by analyzing the convergent validity of the indicators based on citation data. Citation data can be used to formulate predictions, which can be empirically validated. Thus, we studied whether EMNPC, MNPC, and MHq are able to validly differentiate between three different quality levels, as defined by RS from F1000 ($\overline{\mathrm{FFa}}$). By comparing the indicator values with peer recommendations, we were able to test whether the indicators discriminate between different quality levels.
The results point out that EMNPC and MNPC cannot validly discriminate between different quality levels as defined by peers. The EMNPC and MNPC values are close to the worldwide average, independent of the quality levels. The CIs substantially overlap in many comparisons. Thus, the convergent validity of the EMNPC and MNPC does not seem to be given. In contrast to these indicators, the MHq is able to discriminate between the quality levels.
Because of the positive results for the MHq, we applied the MHq to Twitter counts of four different groups as defined by the company Altmetric: researchers, science communicators, practitioners, and members of the public. If Twitter counts are intended to be used for research evaluation, a substantial relationship to scientific quality should be given. Otherwise, Twitter counts should not be employed in research evaluation. Our investigation of MHq values based on Twitter data reveals a weak relationship between Twitter counts and scientific quality. This relationship is much weaker than that between citation counts and scientific quality.
Our study of the relationship of different Twitter groups' data to scientific quality is directed at specific societal groups. Earlier, we studied the directed societal impact measurement for different status groups of Mendeley data (Bornmann & Haunschild, 2017). Researchers on Twitter show a slightly stronger relationship to scientific quality than other societal groups, but the differences are only minor. This study follows the important initiative of Thelwall (2017a, 2017b) to develop new indicators for zero-inflated data. The current study is the first independent attempt to investigate the new indicator family empirically. This family is important for altmetrics data. Thus, we need further studies focusing on various sources with sparse data (in addition to Twitter). Since F1000 concentrates on biomedicine, future empirical studies should analyze the new family in other disciplines.

Figure 1. MHqs with CIs for Q0, Q1, and Q2 across four publication years. The horizontal line with MHq=1 is the worldwide average.

Figure 2. MNPC with CIs for Q0, Q1, and Q2 across four publication years. The horizontal line with MNPC=1 is the worldwide average.

Figure 3. EMNPC with CIs for Q0, Q1, and Q2 across four publication years. The horizontal line with EMNPC=1 is the worldwide average.

Figure 4. MHq values for Q0, Q1, and Q2 with CIs differentiated by Twitter groups and publication years. The horizontal line with MHq=1 is the worldwide average.

show overviews of the data included in this study. It is clearly visible in Table 2 that Twitter data are affected by zero-inflation, but citation data are not.

Table 1. Number of papers included in this study broken down by different sources (citations and Twitter groups), publication year, and $\overline{\mathrm{FFa}}$

Table 2. Proportion of uncited and untweeted papers broken down by different sources (citations and Twitter groups), publication year, and $\overline{\mathrm{FFa}}$