Introduction

Over the last two decades, journals in different fields have increasingly required authors to disclose their contributions as part of the research paper (Larivière et al., 2021). Because authorship serves as a proxy for scientific productivity (Cronin, 2001), and because academia is increasingly governed through metric-guided evaluative mechanisms (e.g., Abbott et al., 2010; Hicks, 2012; Wilsdon et al., 2015), there is a need to validate different models for allocating author credit against the information presented in such author contribution statements. Although Hagen (2010, 2013) has evaluated several models for allocating author credit against perceived author credit scores, no study to date has validated such models against the author contribution statements of scientific articles. This paper aims to fill that gap.

It is evident from the literature (Egghe et al., 2000; Gauffriau & Larsen, 2005; Gauffriau et al., 2008; Huang et al., 2011; Vavryčuk, 2018) that the choice of counting procedure affects the number of publications and citations assigned to authors. For example, fractional counting disregards the information contained in the structure of the byline and thereby introduces an equalization bias (Hagen, 2013, 2014a, 2015). This bias has a strong distortional effect that is “inevitably compounded in bibliometric indices and performance rankings” (Hagen, 2014b, p. 626). The routine practice of dividing one credit equally among all coauthors of a research paper is inappropriate unless the byline is purposely ordered alphabetically (Hagen, 2013). Most alternative models for determining author credit are, in their basic form, rank-dependent: they share the assumption that byline position reflects the weight of each author's contribution, in descending order. In cases where the least important authors are positioned somewhere in the middle, such models become problematic, overestimating the credit of some authors and underestimating that of others.

Allocation models vary in their assumptions about the logic underpinning author order. For example, fractional counting assumes that each author has contributed equally and apportions credit accordingly, while harmonic counting assumes that authors are ordered by descending contribution. Several bibliometric papers drawing on the author contribution statements provided by journals report that first and last authors typically contribute to more tasks than middle authors do (Baerlocher et al., 2007; Larivière et al., 2016, 2021; Sauermann & Haeussler, 2017; Sundling, 2017; Yang et al., 2017). Several studies based on interviews and surveys have likewise found that middle authors contribute less than first and last authors (e.g., Louis et al., 2008; Shapiro et al., 1994; Wren et al., 2007). These results indicate that an allocation model should give more credit to the first and last authors and less to the authors between these positions.

However, there are several reasons to question grouping together all authors between the first and last positions. First, second authors contribute more than authors closer to the middle and end of the byline (Baerlocher et al., 2007; Wren et al., 2007). Second, the growth in team size and coauthorship in science (Larivière et al., 2015; Wuchty et al., 2007) has brought an increase in the number of primary and supervisory authors on the same paper (Mongeon et al., 2017), some of whom are bound to end up between the first and last positions. Third, and likely partly as a consequence of “team science,” a growing number of scientific articles list two or more authors who claim “equal co-first authorship” (Akhabue & Lautenbach, 2010; Hosseini & Bruton, 2020) or who appear as corresponding authors (Hu, 2009). Treating middle authors as one homogeneous group therefore conceals significant differences within this diverse group of authors. It also means that how a model for allocating authorship credit should apportion credit to the authors between first and last remains an open question.

Scientific fields differ in their degree of collaborative activity, authorship practices, research organization, and publication practices. In high energy physics, bylines are governed by a labor approach in which everyone working on a project ends up on an alphabetically ordered author list (Birnholtz, 2006; Knorr-Cetina, 1999). Mathematics and economics also exhibit a high degree of byline alphabetization (Frandsen & Nicolaisen, 2010; Waltman, 2012), while in biomedicine and clinical research, bylines tend to follow a principle by which primary authors are placed first and supervisory authors last (Mongeon et al., 2017). Because author inclusion and byline order differ by field, so will the most appropriate author credit allocation model. This paper should therefore be read as a case study of the chemical biology field.

This paper explores the relationship between an author's position in the bylines of a research article and the research contributions they have made in order to analyze the validity of five bibliometric counting methods in Chemical Biology. The research questions explored in this paper are:

Research question 1:

What logic structures the byline ordering in Chemical Biology with regard to the authors' contributions?

Research question 2:

How well do models for allocating author credit reflect the number of contributions made by authors in Chemical Biology?

Research question 3:

How well do models for allocating author credit predict core authors in Chemical Biology?

The field of chemical biology can be considered a part of the lab-oriented life sciences. Scientists working within this field often have a background in chemistry or biology (or, on some occasions, are specially trained chemical biologists) and bring together a diverse array of experimental techniques and theoretical knowledge (Ostler, 2007).

Theory

Ordering the bylines

Tscharntke et al. (2007) present four types of name ordering:

  1. The sequence-determines-credit approach (SDC), wherein the sequence of authors reflects the usually declining importance of their contributions.

  2. The equal contribution approach (EC), wherein authors use alphabetical sequence to signal that they contributed equally to the research. According to Waltman (2012, p. 704), this is more common in fields where the average author team is either small or large, while in fields such as the Medical and Life Sciences, “intentional alphabetical authorship is a virtually non-existent phenomenon.” However, in biomedicine and clinical medicine, the practice of alphabetically ordering the middle of the byline is becoming more common (Mongeon et al., 2017). Sometimes alphabetical ordering is merely a custom and is not meant to convey any information about the authors' contributions (Egghe et al., 2000).

  3. The first-last-author-emphasis approach (FLAE), wherein the first and last authors have made the most significant contributions. This practice is well established in many labs and scientific fields. Statements of equal contribution, or the designation of one or more corresponding authors, are also versions of this approach.

  4. The percent-contribution-indicated approach (PCI), wherein a percentage score details each author's contribution. I argue that expressing the authors' contributions in an article's author contribution statement can be regarded as a qualitative version of the PCI approach. Even though there is no percentage score, it is usually quite clear who has contributed in a major or a minor way.

The approaches mentioned above are sometimes combined, for example, by partially structuring the bylines by alphabet to indicate equal contributions by some, but not all, of the authors (Mongeon et al., 2017; Waltman, 2012).

Models for allocating authorship credit

Four models for allocating authorship credit are especially prominent in the bibliometric literature (see Waltman (2016, Section 7) and Xu et al. (2016) for an overview): (1) fractional allocation (Price, 1981); (2) harmonic allocation (Hodge & Greenberg, 1981); (3) geometric allocation (Egghe et al., 2000); and (4) arithmetic allocation (Van Hooydonk, 1997). Hagen (2010, p. 792) tests these four empirically, concluding that harmonic allocation “provides unrivalled accuracy, fairness and flexibility”. These four are not the only allocation models, however (e.g., Assimakis & Adam, 2010; Lukovits & Vinkler, 1995; Stallings et al., 2013; Trueba & Guerrero, 2004). Many papers also deal specifically with author credit allocation in the construction of the Hirsch index and its variants (e.g., Jian & Xiaoli, 2013), or include it as an essential part of weighting schemes for ranking scientific articles (e.g., Zhang et al., 2019) or researchers (e.g., Vavryčuk, 2018). Several allocation models proposed in the papers cited above are tested empirically in Hagen (2013), where the harmonic allocation model again proves the most accurate.

This paper analyzes five allocation models. The first is harmonic allocation, included because of its performance relative to the other models mentioned above. The second is fractional allocation, included because it is the most widely implemented approach apart from giving full credit to every author of a publication. Geometric and arithmetic allocation are also included so that the results can be compared with earlier studies of allocation models. The formulas for these first four allocation models are:

  1. \(\text{Fractional } i\text{th author credit} = \frac{1}{n}\)

  2. \(\text{Harmonic } i\text{th author credit} = \frac{\frac{1}{i}}{1 + \frac{1}{2} + \ldots + \frac{1}{n}}\)

  3. \(\text{Geometric } i\text{th author credit} = \frac{2^{n-i}}{2^{n} - 1}\)

  4. \(\text{Arithmetic } i\text{th author credit} = \frac{n + 1 - i}{1 + 2 + \ldots + n}\)

Here, i denotes the author's position in the byline and n the total number of authors on the publication; together they determine the credit score of each author of a paper.
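As a minimal sketch, the four formulas can be written as Python functions of the position i and team size n. The function names are illustrative, and the use of exact rational arithmetic (`fractions.Fraction`) is an implementation choice made here rather than something the source specifies; it allows the scores to be checked for additivity without rounding error.

```python
from fractions import Fraction


def fractional(i: int, n: int) -> Fraction:
    """Every author receives 1/n, regardless of position i."""
    return Fraction(1, n)


def harmonic(i: int, n: int) -> Fraction:
    """(1/i) divided by the harmonic number 1 + 1/2 + ... + 1/n."""
    harmonic_number = sum(Fraction(1, k) for k in range(1, n + 1))
    return Fraction(1, i) / harmonic_number


def geometric(i: int, n: int) -> Fraction:
    """2^(n - i) divided by 2^n - 1."""
    return Fraction(2 ** (n - i), 2 ** n - 1)


def arithmetic(i: int, n: int) -> Fraction:
    """(n + 1 - i) divided by 1 + 2 + ... + n = n(n + 1)/2."""
    return Fraction(n + 1 - i, n * (n + 1) // 2)


def byline_credits(model, n: int) -> list:
    """Credit scores for positions 1..n of an n-author byline."""
    return [model(i, n) for i in range(1, n + 1)]


if __name__ == "__main__":
    for model in (fractional, harmonic, geometric, arithmetic):
        print(model.__name__, [str(c) for c in byline_credits(model, 4)])
```

For a four-author paper, harmonic allocation gives 12/25, 6/25, 4/25 and 3/25, while geometric allocation gives 8/15, 4/15, 2/15 and 1/15; in both cases the scores sum to 1.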

All of the formulas above produce additive weights: they sum to 1 and therefore do not inflate the total publication count. Fractional allocation is rank-independent, while the remaining three are rank-dependent; under fractional allocation all authors receive an equal share, whereas the other models give the first author more than the second, the second more than the third, and so on. Fractional allocation is thus the model best suited for bylines structured according to the EC approach, while the other three models are best suited for the SDC approach. If the bylines follow a combination of the SDC and FLAE approaches, the calculations can be modified to give equal credit to the first and last authors, as described in Hagen (2010) and Liu & Fang (2012). The same modification procedure can be extended to cover all corresponding authors and all authors stated to have contributed equally.

The fifth allocation model is one proposed by Aziz & Rosing (2013), who did not name their weighting algorithm. For simplicity, this paper calls it the harmonic parabolic model for allocating authorship credit. There are two reasons for including it. First, it interprets the byline hierarchy in a radically different way from traditional allocation models: most credit is given to the first and last authors, and credit decreases from those positions so that the median position receives the least. Second, it has not been tested empirically before. The formula for calculating it is:

  • \({\text{Harmonic parabolic }} i\text{th author credit}=\frac{1+\left|n+1-2i\right|}{\frac{1}{2}{n}^{2}+n(1-D)}\)

Here, i denotes the author's position in the byline and n the total number of authors on the publication, with \(D = 0\) if n is even and \(D = \frac{1}{2n}\) if n is odd. The harmonic parabolic model is additive and rank-dependent, and it is theoretically best suited to bylines structured by a combination of the SDC and FLAE approaches, that is, to scientific fields in which the distribution of key contributors over byline positions is u-shaped.
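A minimal sketch of the harmonic parabolic formula in the same style (the function name is illustrative, and exact fractions are again an implementation choice). It makes the symmetric, u-shaped credit profile visible: for a five-author paper, the scores are 5/17, 3/17, 1/17, 3/17 and 5/17.

```python
from fractions import Fraction


def harmonic_parabolic(i: int, n: int) -> Fraction:
    """Aziz & Rosing (2013) credit for the author in position i of n.

    The numerator 1 + |n + 1 - 2i| is largest at both ends of the byline and
    smallest at the median; the denominator normalises the scores to sum to 1.
    """
    d = Fraction(0) if n % 2 == 0 else Fraction(1, 2 * n)
    numerator = 1 + abs(n + 1 - 2 * i)
    denominator = Fraction(n * n, 2) + n * (1 - d)
    return numerator / denominator


if __name__ == "__main__":
    print([str(harmonic_parabolic(i, 5)) for i in range(1, 6)])
```

The correction term D is what keeps the denominator equal to the sum of the numerators for odd n, where the median position contributes a numerator of exactly 1.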

Method

The point of departure for this study is the set of research papers published in Nature Chemical Biology in 2013 and 2014. Each such paper (either an article or a brief communication) comes with an author contribution statement that “specifies the contribution of every author” (“Authorship: authors & referees @ npg,” 2015). The journal requires authors to include such a statement but leaves them free to structure it and make it as detailed as they see fit. See the Appendix for an example of an author contribution statement.

Creating a data set

Bibliographic data for all documents published in Nature Chemical Biology in 2013 and 2014 were downloaded from Web of Science (WOS). Using the Digital Object Identifiers (DOIs) in the WOS data, the full-text XHTML of each document was harvested from the journal home page. A script written in Tool Command Language (Tcl) parsed the full-text XHTML of each document for an author contribution statement and, if one was found, added it to the bibliographic data for that document. All documents without an author contribution statement were discarded from the data set, which left only articles and brief communications. Three further documents were discarded because of errors in their author contribution statements. All documents (n = 14) exhibiting hyperauthorship, defined by Morris & Goldstein (2007, p. 1766) as “any article with 20 or more authors”, were excluded from the data set. Author teams of 18 (n = 2) and 19 (n = 1) authors were too rare to warrant inclusion in the final data set, leaving 208 research papers for the final analysis. Every remaining author team size was represented by at least six papers. Every paper was the result of collaborative activity (i.e., it was coauthored).
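The exclusion rules can be summarized as a small filter. This is a hypothetical Python sketch, not the Tcl script used in the study: the `Paper` record, its field names and the `min_papers_per_size` cut-off are assumptions introduced for illustration (the text only reports that team sizes 18 and 19 were dropped for having too few papers, while every retained size had at least six).

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

HYPERAUTHORSHIP_THRESHOLD = 20  # "any article with 20 or more authors"


@dataclass
class Paper:
    doi: str
    n_authors: int
    contribution_statement: Optional[str]  # None if no statement was found
    statement_usable: bool = True  # hypothetical flag: False if the statement had errors


def filter_papers(papers: List[Paper], min_papers_per_size: int = 3) -> List[Paper]:
    """Apply the exclusion rules in order (the size cut-off is hypothetical)."""
    kept = [
        p for p in papers
        if p.contribution_statement is not None  # leaves articles and brief communications
        and p.statement_usable
        and p.n_authors < HYPERAUTHORSHIP_THRESHOLD  # drop hyperauthorship
    ]
    # Drop team sizes represented by too few papers to analyse.
    papers_per_size = Counter(p.n_authors for p in kept)
    return [p for p in kept if papers_per_size[p.n_authors] >= min_papers_per_size]
```

Counting team sizes only after the other exclusions mirrors the order in which the steps are reported above.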

Each author contribution statement was manually scanned for descriptions of authors performing specific work tasks. For each task performed by an author, an item was created in an authorship-task database (if the author contribution statement of a paper registered several authors as performing the same task, an item was created for each authorship-task combination). To classify the tasks in the authorship-task database, we followed the classification procedure used in Baerlocher et al. (2007): classification, pilot-testing, subject expert review, modification, and reclassification. A preliminary analysis was made after all tasks had been classified according to a three-layered taxonomy originally developed by Davenport & Cronin (2001) and further operationalized for the classification of author contribution statements by Danell (2014). After discussions with a chemical biologist, both the taxonomy and the classification of specific tasks were modified. The data were then reclassified using the modified taxonomy shown in Table 1. A total of 4955 entries in the authorship-task database were classified according to this taxonomy. The modification and testing of the taxonomy are further detailed in Sundling (2017); the intercoder reliability rate (see Footnote 1) was 91.6%.

Table 1 A three-tiered taxonomy for the classification of tasks found in the author contribution statement (ACS)

From the entries in the authorship-task database, an authorship database was created in which each authorship was represented by a single item, classified according to a three-step procedure. First, all authors performing at least one core-layer task were classified as core-layer authors. Second, authors performing at least one middle-layer task but no core-layer tasks were classified as middle-layer authors. Third, all authors performing no core- or middle-layer tasks were classified as outer-layer authors. A total of 1743 entries (authorships) in the authorship database were classified according to this procedure.
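The three-step procedure amounts to taking the highest layer among an authorship's tasks. The sketch below assumes, hypothetically, that each authorship-task entry is a `(paper_id, author, layer)` tuple whose layer has already been assigned from the Table 1 taxonomy; the names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CORE, MIDDLE, OUTER = "core", "middle", "outer"


def classify_authorship(task_layers: List[str]) -> str:
    """Steps 1-3: any core task -> core; else any middle task -> middle; else outer."""
    if CORE in task_layers:
        return CORE
    if MIDDLE in task_layers:
        return MIDDLE
    return OUTER


def build_authorship_db(entries: Iterable[Tuple[str, str, str]]) -> Dict[Tuple[str, str], str]:
    """Collapse authorship-task entries into one layer label per (paper, author)."""
    layers_by_authorship = defaultdict(list)
    for paper_id, author, layer in entries:
        layers_by_authorship[(paper_id, author)].append(layer)
    return {key: classify_authorship(layers) for key, layers in layers_by_authorship.items()}
```

Keying on the (paper, author) pair means that an author who appears on several papers yields several authorship items, as the text describes.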

While the terms author and authorship have different meanings in bibliometrics, they are both used in this text (for readability) to indicate authorship. Specifically, if one author performs tasks on several papers, they are represented by several items in the authorship database.

Calculating credit scores: the standard version and the special version

Credit scores were calculated for each author in the authorship database according to the following models for allocating authorship credit: fractional allocation, harmonic allocation, geometric allocation, arithmetic allocation, and harmonic parabolic allocation (see Section “Models for allocating authorship credit” for the specific formulas). Two versions of the credit scores were calculated for the harmonic, geometric, and arithmetic allocation model:

  • The standard version: the formulas for each model listed in Section “Models for allocating authorship credit” were used to calculate each author's credit score as usual.

  • The special version: the formulas for each model listed in Section “Models for allocating authorship credit” were first used to calculate each author's credit score. The credit was then redistributed among each paper's authors so that the first and last authors share equally the credit for the first and second positions, and each intermediate author receives the credit of the position immediately after their own (similar to what Hagen (2008, 2014a, 2014b) suggests). This version accommodates the FLAE approach, which is considered common practice in biomedicine and the lab-based life sciences (Larivière et al., 2016; Tscharntke et al., 2007; Yank & Rennie, 1999).

As the fractional and harmonic parabolic allocation models both give equal credit to the first and last author, there was no need to calculate a special version for them.
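One reading of the special version, sketched in Python under the assumption that the standard scores for positions 1..n are already available as a list: the first and last authors split the credit of positions 1 and 2 equally, and the author in position i (1 < i < n) takes the standard credit of position i + 1. The harmonic helper is repeated so that the block runs on its own; all names are illustrative.

```python
from fractions import Fraction
from typing import List


def harmonic(i: int, n: int) -> Fraction:
    """Standard harmonic credit for position i of n."""
    return Fraction(1, i) / sum(Fraction(1, k) for k in range(1, n + 1))


def special_version(standard: List[Fraction]) -> List[Fraction]:
    """Redistribute standard credit scores to accommodate the FLAE approach."""
    n = len(standard)
    if n < 2:
        return list(standard)  # a single author keeps all the credit
    shared = (standard[0] + standard[1]) / 2  # first and last split positions 1 and 2
    shifted = list(standard[2:])  # credits of positions 3..n go to positions 2..n-1
    return [shared] + shifted + [shared]


if __name__ == "__main__":
    standard = [harmonic(i, 4) for i in range(1, 5)]
    print([str(c) for c in special_version(standard)])
```

For a four-author paper, the standard harmonic scores 12/25, 6/25, 4/25 and 3/25 become 9/25, 4/25, 3/25 and 9/25. The redistribution only moves credit between positions, so the scores still sum to 1.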

Relationship between empirical observations and author credit scores

To answer the second research question, I first constructed three definitions of an empirical observation:

  • Definition A: author i's share of the total number of core-layer tasks specified in the author contribution statement of the paper.

  • Definition B: author i's share of the total number of core- and middle-layer tasks specified in the author contribution statement of the paper.

  • Definition C: author i's share of the total number of all tasks specified in the author contribution statement of the paper.
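The three definitions differ only in which task layers are counted. A minimal sketch, assuming (hypothetically) that each paper's tasks are stored as a mapping from author to a list of layer labels:

```python
from fractions import Fraction
from typing import Dict, List

DEFINITIONS = {
    "A": {"core"},
    "B": {"core", "middle"},
    "C": {"core", "middle", "outer"},
}


def empirical_shares(tasks_by_author: Dict[str, List[str]], definition: str) -> Dict[str, Fraction]:
    """Each author's share of the paper's tasks counted under definition A, B or C."""
    counted = DEFINITIONS[definition]
    counts = {a: sum(layer in counted for layer in layers) for a, layers in tasks_by_author.items()}
    total = sum(counts.values())
    if total == 0:
        # Not expected in this data set: every paper has at least two core authors.
        raise ValueError("paper has no tasks of the counted layers")
    return {a: Fraction(c, total) for a, c in counts.items()}
```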

Standardized lack of fit was calculated for the standard and special versions of each model under all three definitions in order to measure how well the models distribute credit among authors. I also plotted the relationship between the empirical observations (using definition C) and the authors' credit scores as calculated by the special version of the allocation models. Coefficients of determination (\({R}^{2}\)) were calculated to measure how well each plotted allocation model replicates the empirical observations.
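The section does not state exactly how \(R^2\) was computed (for instance, from a fitted regression line or by treating each model's credit score directly as the prediction). The sketch below assumes the latter and uses the standard definition \(R^2 = 1 - SS_{res}/SS_{tot}\); under this definition \(R^2\) can be negative when a model fits worse than the mean of the observations.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, treating credit scores as predictions of observed shares."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have no variance")
    return 1 - ss_res / ss_tot
```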

Receiver operating characteristic (ROC) analysis

The Receiver Operating Characteristic (ROC) framework is useful for evaluating the predictive value of bibliometric information (Lindahl & Danell, 2016; Zhang et al., 2019). Thus, to evaluate how well the different allocation models predict core authors (the third research question), an ROC analysis was performed in SPSS. The foundation of an ROC analysis is the confusion matrix illustrated in Table 2.

Table 2 The confusion matrix

This matrix is a cross-tabulation of the predicted class versus the correct class of the instances in a data set (Fawcett, 2006; Provost & Fawcett, 2001). In this study, the correct class of each instance (that is, each authorship) was derived from the analysis of author contribution statements, as described in Section “Creating a data set”. Each of the 1743 authorships in the data set thus has a correct class: either a core-layer author (true) or not a core-layer author (false).

The predicted class of each instance derives from the values of the different models for allocating authorship credit described in Section “Models for allocating authorship credit”. The credit scores produced for each author were considered the continuous output of a classifier that estimates membership in the core author group. A good classifier should give higher credit scores to core authors than to those given to middle- and outer-layer authors. Thus a higher score should indicate a higher probability of being a core author. As Fawcett (2006) describes, given a threshold parameter T, a predicted classification for an instance—positive or negative—can be extracted from such a classifier. A positive classification means the prediction is a core-layer author, and a negative means the prediction is a non-core-layer author (i.e., middle- or outer-layer author).

The two-by-two confusion matrix (exemplified in Table 2) classifies the authorships into four categories. Authorships predicted as positive can be classified as true positives (TP) if they belong to the core layer, or false positives (FP) if they do not belong to the core layer. Similarly, authorships predicted as negative can be classified as false negatives (FN) if they belong to the core layer, or true negatives (TN) if they do not belong to the core layer.

Several common metrics can be calculated from the confusion matrix. Two of them are essential for understanding ROC analysis:

$$\text{False positive rate (fp rate)} = \frac{FP}{FP + TN}$$
$$\text{True positive rate (tp rate)} = \frac{TP}{TP + FN}$$
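To make the threshold mechanism concrete, here is a minimal sketch that turns credit scores into a confusion matrix at threshold T and sweeps T downward to trace ROC points. The study itself used SPSS; the function names are illustrative, and the convention that a score equal to T counts as a positive prediction is an assumption (conventions differ between tools).

```python
from typing import List, Sequence, Tuple


def confusion_matrix(scores: Sequence[float], is_core: Sequence[bool], threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN), predicting 'core' when score >= threshold."""
    tp = fp = fn = tn = 0
    for score, core in zip(scores, is_core):
        predicted_core = score >= threshold
        if predicted_core and core:
            tp += 1
        elif predicted_core:
            fp += 1
        elif core:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Return (fp rate, tp rate); assumes at least one negative and one positive."""
    return fp / (fp + tn), tp / (tp + fn)


def roc_points(scores: Sequence[float], is_core: Sequence[bool]) -> List[Tuple[float, float]]:
    """ROC points, starting from T = +infinity (no positive predictions)."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        points.append(rates(*confusion_matrix(scores, is_core, t)))
    return points
```

Only the distinct score values need to be tried as thresholds, because the confusion matrix changes only when T passes one of them.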

The tp and fp rates are strict columnar ratios and produce scores in the range [0, 1]. In ROC-space, which is defined by the tp rate on the y-axis and the fp rate on the x-axis, a single point is given by the values for the confusion matrix produced under the threshold T. For each classifier, a curve can be drawn in ROC-space by plotting the tp rate versus the fp rate for every possible value of T (starting with \(+\infty\) and reducing step by step). Such a curve, called an ROC graph, depicts the relative tradeoffs between the benefits (true positives) and costs (false positives) of each classifier and has the attractive property of being “insensitive to changes in class distribution” (Fawcett, 2006, p. 864). The area under the ROC curve (AUC) is used to quantify each classifier’s performance in predicting the core authors. As stated in Fawcett (2006, p. 868), “the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance,” and it is used to compare the classifiers. In this study, the AUC for each author credit model can be restated as equivalent to the probability that a randomly chosen core-layer author receives a higher author credit score than a randomly chosen non-core-layer author.
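The probabilistic reading of the AUC quoted above can be computed directly by comparing every core-layer authorship with every non-core authorship. This is a minimal sketch, not SPSS's procedure; counting ties as one half is an assumption, but ties matter here because fractional allocation, for example, gives identical scores to all authors of the same paper.

```python
from typing import Sequence


def auc_probability(scores: Sequence[float], is_core: Sequence[bool]) -> float:
    """P(random core author scores higher than random non-core author); ties count 1/2."""
    core_scores = [s for s, c in zip(scores, is_core) if c]
    other_scores = [s for s, c in zip(scores, is_core) if not c]
    if not core_scores or not other_scores:
        raise ValueError("need at least one core and one non-core authorship")
    wins = 0.0
    for p in core_scores:
        for q in other_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(core_scores) * len(other_scores))
```

With ties counted as one half, this value equals the area under the ROC curve obtained by joining the ROC points with straight lines. The pairwise loop is quadratic, which is unproblematic for a data set of 1743 authorships.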

Results

This section is structured into two parts. In the first part, I explore the relationship between an author's position in the bylines of a research article and the contributions they have made. This is done to uncover what ordering logic structures articles in Chemical Biology (research question 1). In the second part, I test the validity of the five bibliometric counting methods (research questions 2 and 3).

The relationship between authorship order and research contributions

Table 3 shows that as the number of authors per paper rises, the share of core authors positioned between the first and last authors increases. This is because only two positions, first and last, are unique, whereas every author team in the data set, without exception, has two or more core authors. Once a paper has ten authors, the percentage of core authors in middle positions stops rising and appears to level out.

Table 3 The percentages of first authors, last authors, and middle authors for core- and middle-layer authors

Regardless of author team size, almost all middle-layer authors are positioned between the first and last positions. Outer-layer authors have been left out of Table 3 as none of them figure as first or last authors. All authors performing only outer-layer tasks are found somewhere in the middle of the bylines (for more details, see Table 7).

As evidenced in Table 3, core authors do not only populate the first and last author positions; they are also commonly found in middle positions. This prompts the question of where the core authors who are not first or last in the byline are positioned. Before answering this question, it is instructive to examine how the average number of tasks performed is distributed across author positions. Table 4 shows the average number of tasks for each author position as the author team (the total number of authors per publication) increases from 2 to 17. A striking feature of Table 4 is that first authors, regardless of team size, always perform more tasks on average than authors in any other position. Depending on team size, last authors and second authors are most often the second- or third-highest performers of tasks on average. There are, however, two exceptions. First, when the author team numbers 14, the second-highest performer is found in the position just before the last author. Second, when the author team numbers 15, the second- and third-highest performers are found in the second and third author positions.

Table 4 The average number of tasks performed by an author on a specific author position

The general trend for each team size is that the average number of tasks performed follows a roughly u-shaped distribution: it decreases from the first author position until somewhere in the middle of the byline, where it begins to increase, and it keeps increasing until the last author position. The coloring in Table 4 shows that the reds (high values) lie at the beginning and end of the byline and the blues (low values) toward the middle. However, the last author position never reaches the average number of tasks performed by first authors, which again points to the importance of the first author in terms of workload and involvement in the research. The position at which the decrease bottoms out and turns into an increase differs with team size. For three authors, the second position naturally has the lowest value, but for many other team sizes, several positions share the lowest value. Sometimes the lowest average number of tasks coincides with the median position, but more often than not, the low point lies slightly to the right of the median. This result indicates that the u-shaped distribution is not symmetric.
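The averages in Table 4 are per-position means within each team size. A minimal sketch, assuming (hypothetically) one `(team_size, position, n_tasks)` tuple per authorship; the helper `lowest_positions` returns every position sharing the minimum, since several positions can tie.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_tasks_by_position(rows: Iterable[Tuple[int, int, int]]) -> Dict[int, List[float]]:
    """Map team size n to [mean tasks at position 1, ..., mean tasks at position n]."""
    totals = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(lambda: defaultdict(int))
    for n, position, n_tasks in rows:
        totals[n][position] += n_tasks
        counts[n][position] += 1
    # Every paper of size n fills positions 1..n, so no count should be zero.
    return {
        n: [totals[n][i] / counts[n][i] for i in range(1, n + 1)]
        for n in sorted(totals)
    }


def lowest_positions(means: List[float]) -> List[int]:
    """1-based positions sharing the lowest mean (the bottom of the u-shape)."""
    low = min(means)
    return [i + 1 for i, m in enumerate(means) if m == low]
```

Comparing the positions returned by `lowest_positions` with the median position (n + 1)/2 is one way to check the asymmetry described above.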

Table 5 gives more detail on the specific positions of core-layer authors. Using the same coloring as Table 4, Table 5 reports a similar pattern: red (high values) at both ends, followed by whiter shades (medium values) toward the middle, and finally blue shades (low values) around the median position. This result shows that the distribution of core-layer authors over byline positions is u-shaped and that this shape recurs regardless of team size.

Table 5 The percentage of core-layer authors in a specific author position

In other words, core-layer authors are in the majority in the positions that begin and end the bylines, while they are in the minority in the positions surrounding the median. The few positions that do not report any core authors are all median positions or positions close to the median.

In almost all team sizes, the last author position consists entirely of core-layer authors; the table indicates 100 percent for all author teams except those numbering 4 and 6. Comparing the first and last author positions, the percentage of core-layer authors is higher in the latter. This holds for all team sizes except those below four, where the percentages are equal. Furthermore, the percentages of core authors in each position are often higher to the right of the median than to the left. The conclusion is that the u-shape is asymmetric and that core-layer authors are more prevalent at the end of the byline than at the beginning.
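Tables 5 to 7 report, for each team size and position, the percentage of authorships belonging to a given layer. A minimal sketch, assuming (hypothetically) one `(team_size, position, layer)` tuple per authorship:

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def layer_percentages(rows: Iterable[Tuple[int, int, str]], layer: str) -> Dict[int, List[float]]:
    """Map team size n to the percentage of authors in `layer` at positions 1..n."""
    in_layer = defaultdict(Counter)
    totals = defaultdict(Counter)
    for n, position, label in rows:
        totals[n][position] += 1
        if label == layer:
            in_layer[n][position] += 1
    # Assumes every position 1..n is filled for each team size n.
    return {
        n: [100 * in_layer[n][i] / totals[n][i] for i in range(1, n + 1)]
        for n in sorted(totals)
    }
```

Comparing the first and last entries of each list corresponds to the first-versus-last comparison made in this paragraph.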

Table 6 gives more detail on the specific positions of middle-layer authors. Its coloring stands in stark contrast to Table 5, which reports the percentages of core authors: instead of red (high values) at the edges, we have blue (low values), and instead of blue (low values) in the middle, we have red (high values). Values increase gradually from the edges toward the middle, where the highest value is, more often than not, found in the median position. Only for two team sizes is the highest value neither in nor directly next to the median position. Instead of a u-shaped distribution, the distribution in Table 6 roughly resembles a bell shape. No middle-layer author is positioned last in the byline (as Table 5 shows, core-layer authors populate the last author position), although a small number occupy the first author position. As the red in the table shows, middle-layer authors are more often positioned before the median than after it, pointing to a positive skew in the distribution.

Table 6 The percentages of middle-layer authors in a specific author position

In Table 7, which gives more detail to the specific positions of outer-layer authors, there is a distinct lack of red (high values) and a prevalence of blue (low values). The values in Table 7 should be interpreted cautiously as the number of outer-layer authors is low; they only make up 11.0% of the whole data set. With that in mind, no outer-layer author is to be found first or last in the bylines. They are all spread out in the middle positions. For author team sizes up to 7, the median position reports the highest value, but it is hard to discern a pattern for the placement of outer-layer authors after that.

Table 7 The percentages of outer-layer authors on a specific author position

Any allocation model that does not account for the u-shaped distribution of tasks and core authors presented in this section is, in theory, ill-suited for allocating author credit in fields such as chemical biology. Fractional, harmonic, geometric, and arithmetic allocation do not account for it, whereas harmonic parabolic allocation divides credit in this way and should, in theory, be the best candidate. In the next section, the allocation models are tested on how well they award higher credit to (1) the authors who make the greatest contributions in terms of the number of tasks they perform, and (2) the authors who are part of the core layer relative to authors in the middle and outer layers (i.e., how well the models predict core authors).

The validity of five bibliometric counting methods

Table 8 shows the standardized lack of fit between the authorship credit scores predicted by the five models and the share of tasks performed by each author according to the author contribution statement. There are three levels in the table: the first counts only core-layer tasks as the empirical observations; the second counts both core- and middle-layer tasks as the empirical observations; and the third counts all tasks as the empirical observations.

Table 8 Standardized lack of fit between the authorship credit scores predicted by the five models and the proportion of tasks performed by each author according to the author contribution statement

The first thing to notice in the table is that the special versions of the arithmetic, geometric, and harmonic models produce lower values of lack of fit than their standard versions at all levels of analysis. This is a direct result of the u-shaped distribution of tasks discussed earlier (see Table 4): a model that does not take this distribution into account produces an author-credit distribution that underestimates the contributions of the last author. Three models (fractional, harmonic in its special version, and harmonic parabolic) produce low values at all levels of counting, indicating a better fit between model predictions and empirical observations.
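
A minimal sketch of the special version follows. It assumes that equalizing the first and last authors' credit means giving the last position the same raw weight as the first and then renormalizing so the paper still distributes exactly one credit; the paper's exact rule is given in its section on allocation models and may differ. Under this assumption the fractional and harmonic parabolic models are unchanged, because their first and last weights are already equal.

```python
def harmonic(n):
    """Standard harmonic credit: 1/i normalized over i = 1..N."""
    total = sum(1.0 / j for j in range(1, n + 1))
    return [(1.0 / i) / total for i in range(1, n + 1)]


def special_version(credits):
    """Assumed special version: last author gets the first author's weight, then renormalize."""
    raw = list(credits)
    if len(raw) > 1:
        raw[-1] = raw[0]
    total = sum(raw)
    return [c / total for c in raw]


standard = harmonic(4)
special = special_version(standard)
print([round(c, 3) for c in standard])  # [0.48, 0.24, 0.16, 0.12]
print([round(c, 3) for c in special])   # [0.353, 0.176, 0.118, 0.353]
```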

The second thing to notice in the table is the general decrease in lack of fit as we move from a more conservative to a more liberal criterion for defining empirical observations. As we move from counting only core-layer tasks to counting core- and middle-layer tasks and, finally, all types of tasks, the number of empirical observations with a value of zero decreases. Because none of the allocation models ever assigns a credit score of zero, every zero-valued observation adds to the lack of fit, which explains the general decrease.
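
The mechanism can be seen in a toy calculation. The sketch below uses a plain mean squared deviation as a stand-in for the standardized lack of fit (an assumption for illustration only, not the paper's exact measure) and compares the fractional model's credit for a hypothetical four-author paper with observations under the core-only and all-task definitions. Every zero observation is matched by a strictly positive prediction, so it can only add to the error.

```python
def mean_squared_deviation(observed, predicted):
    """Illustrative stand-in for lack of fit: mean of squared differences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Fractional credit for a four-author paper: every author gets 1/4.
predicted = [0.25] * 4
# Hypothetical task shares: core-layer tasks only (two zeros) vs all tasks (no zeros).
core_only = [1 / 3, 0.0, 0.0, 2 / 3]
all_tasks = [0.4, 0.2, 0.1, 0.3]

print(round(mean_squared_deviation(core_only, predicted), 4))  # 0.0764
print(round(mean_squared_deviation(all_tasks, predicted), 4))  # 0.0125
```

This illustrates the direction of the effect on one invented paper; it is not a proof that the decrease must occur in every data set.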

Compared with the other models, the arithmetic allocation model never produces the highest or lowest values for lack of fit. The fractional model produces the lowest values of all models when counting all types of tasks, and it works well when counting both core- and middle-layer tasks. It also produces low values when counting core-layer tasks. The geometric model is consistently the worst in terms of lack of fit. Because it is constructed to assign very low credit scores to authors not positioned at the beginning of the bylines, it severely underestimates the contributions of many authors, even in its special version. At the same time, the model overestimates the relative contributions of authors positioned first. In its special version, the harmonic model performs better than the fractional model when counting core tasks, and it also works well at the other two levels of analysis. The harmonic parabolic model produces the lowest values of all models both when counting core- and middle-layer tasks and when counting only core-layer tasks. It also produces low scores when counting all tasks.

Figure 1 plots the relationship between the empirical observations (i.e., author I's share of the total number of all tasks registered in the author contribution statement of the paper) and the author's credit score, together with the corresponding coefficient of determination (\(R^{2}\)) for each allocation model. The special version of each model is used because, according to Table 8, it performs at least as well as the standard version in all cases. As reported in Fig. 1, the geometric model performs worst in replicating the empirical observations (\(R^{2} = 0.4494\)). The harmonic (\(R^{2} = 0.6168\)), fractional (\(R^{2} = 0.6326\)), and arithmetic (\(R^{2} = 0.6637\)) models give similar values, while the harmonic parabolic model (\(R^{2} = 0.7086\)) does the best job of explaining the variation in the empirical data.

Fig. 1

Relationship between empirical observations and author credit scores produced by five counting models. The empirical observations are defined as author I's share of the total number of all tasks registered in the author contribution statement of the paper. The author credit scores are calculated so that the first and last authors receive equal credit scores.

Figures 2 and 3 show the ROC curves indicating how well the different allocation models predict core authors. Figure 2 uses the author credit scores produced by the standard versions, and Fig. 3 uses those produced by the special versions, which treat the first and last authors' credit as equal. Tables 9 and 10 give the area under the curve (AUC), which measures how well the models differentiate between core and non-core authors based on their credit scores. As with the lack of fit between author credit scores and the percentage of tasks performed by an author (Table 8), the special versions of the arithmetic, geometric, and harmonic models produce ROC curves with a larger area under the curve. For the standard versions (Fig. 2, Table 9), no model other than the harmonic parabolic does a compelling job of predicting which authors are core authors and which are not. The fractional model performs better than the other three but still performs poorly. In the special versions (Fig. 3, Table 10), the geometric and harmonic models do a fair job of separating core and non-core authors. However, the harmonic parabolic model still performs best: with this model, the probability that a randomly chosen core-layer author receives a higher author credit score than a randomly chosen non-core-layer author is 81%.
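
The 81% figure is the probabilistic reading of the AUC. As a minimal sketch (not the paper's own code), the AUC can be computed directly from that definition by comparing every core author's credit score with every non-core author's score and counting ties as one half, which is the Mann-Whitney formulation. The scores below are invented.

```python
def auc(core_scores, noncore_scores):
    """Probability that a random core author outscores a random non-core author.

    Ties count as one half (the Mann-Whitney convention). Brute force, O(n * m).
    """
    if not core_scores or not noncore_scores:
        raise ValueError("need at least one core and one non-core author")
    wins = 0.0
    for c in core_scores:
        for nc in noncore_scores:
            if c > nc:
                wins += 1.0
            elif c == nc:
                wins += 0.5
    return wins / (len(core_scores) * len(noncore_scores))


# Invented credit scores: five of the six core/non-core pairs are ordered correctly.
print(auc([0.4, 0.3, 0.2], [0.25, 0.1]))  # 0.8333333333333334
```

The tie rule matters for the fractional model, which gives every coauthor of a paper the same score.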

Fig. 2

ROC curve for predicting core authors (standard version). The credit scores for each allocation model have been calculated according to the formulas listed in Section “Models for allocating authorship credit”.

Table 9 Area Under the Curve (for ROC curve in Fig. 2)

Fig. 3

ROC curve for predicting core authors (special version). The credit scores for each allocation model have been calculated according to the formulas listed in Section “Models for allocating authorship credit”, but with the additional rule that the first and last authors receive equal credit scores.

Table 10 Area Under the Curve (for ROC curve in Fig. 3)

Discussion and conclusions

Scientific authorship serves as a means of establishing priority, gaining peer acknowledgment (Merton, 1973), and assigning accountability for scientific truth claims (Biagioli & Galison, 2003). Due to the increasing number of authors per article (Larivière et al., 2015), the increasing diversity in the meaning ascribed to authorship (Smith et al., 2019, 2020), and the perceived increase in scientific fraud and irreproducible research (Begley & Ioannidis, 2015; Steen, 2011), alternative ways of attributing scientific authorship have become an ongoing topic of discussion among stakeholders in science (Larivière et al., 2021). One suggestion from these discussions is that authors disclose their scientific contributions in an article's author contribution statement (Rennie et al., 1997). In the present paper, such author contribution statements form the basis for an analysis of the author-ordering logic in the field of chemical biology, as well as an analysis of the validity of five bibliometric counting methods in this field.

The ordering of authors in chemical biology implies differences in relative contributions. When authors are ordered by byline position, there is a distinct u-shaped distribution both for (a) the percentage of authors involved in writing the paper or designing the research (i.e., core authors), and (b) the average number of tasks performed. This holds for all author-team sizes. The two distributions differ in one respect, however: the percentage of core authors is higher at the end of the bylines than at the beginning, whereas the average number of tasks shows the inverse. Borrowing the terminology of Baerlocher et al. (2007), the analyzed data thus seem to reveal two important but different types of authors: primary authors positioned at the beginning of the bylines (e.g., the first and second author), who do more of the research than the authors equivalently positioned at the end (e.g., the last and second-to-last author); and supervisory authors positioned at the end of the bylines, who are more often involved in core tasks such as writing the manuscript and designing the research than the authors equivalently positioned at the beginning. The authors around the median position are seldom core authors, and these positions also report low values for the average number of tasks performed. If one aims to allocate credit by the importance of an author’s contributions in chemical biology, the formula for calculating such credit must consider these u-shaped distributions. It should give less credit to authors in the median position and increasing credit as one approaches the first and last positions, in line with the author-ordering logic of this field.

It is probable that the distributions detailed above are related to demographic variables such as gender, academic age, and rank. Larivière et al. (2016) report that contributions I classify as core-layer are associated with high mean academic age, whereas middle-layer tasks such as analyzing data and performing experiments are associated with a lower mean number of years since first publication. According to Costas & Bordons (2011), there is a strong trend for younger researchers and researchers in lower academic ranks to appear as first authors, while senior researchers and researchers with high academic status are more likely to be positioned last. Larivière et al. (2021) identify a gendered divide between conceptual and empirical work. For example, conceptualizing a study, which I classify as a core-layer task, is more likely to be performed by men, whereas women are more likely to contribute to investigation (i.e., the middle-layer task of performing the experiments). While interesting, such factors were not part of the data for this paper and thus could not be analyzed. However, I aim to explore them using a larger and more diverse dataset in a future study.

Standardized lack of fit was calculated for the standard and special versions of each model to measure how well the author-credit allocation models distribute credit to the authors. As expected, since both models already give the first and last authors equal credit, the standard and special versions of the fractional and harmonic parabolic models report identical values. However, calculating the arithmetic, geometric, and harmonic models in their standard form, as opposed to treating the credit for the first and last authors as equal, produces significantly worse results. This is a direct result of the u-shaped distribution of tasks. When using definition C (i.e., author I's share of the total number of all tasks registered in the author contribution statement of the paper) for the empirical observations, the fractional model does the best job of distributing author credit. However, for both definition B (i.e., author I's share of the total number of core- and middle-layer tasks registered in the author contribution statement of the paper) and definition A (i.e., author I's share of the total number of core-layer tasks registered in the author contribution statement of the paper), the harmonic parabolic model performs best. The harmonic parabolic model also performs better than the special versions of the other models in replicating the empirical observations (definition C), as measured by the coefficient of determination. Theoretically, using only core-layer tasks, or core- and middle-layer tasks, as the empirical observations would seem to do a better job of identifying important authors in terms of contributions, because these tasks resemble those that organizations such as the International Committee of Medical Journal Editors (ICMJE) use to determine which contributors qualify for inclusion in the bylines.
Counting an author’s share of the total number of tasks in the author contribution statement introduces several outer-layer tasks of lesser importance for attributing authorship (Biagioli & Galison, 2003; Danell, 2014; House & Seeman, 2010; Sundling, 2017). It also gives them equal value to the core- and middle-layer tasks. However, it does not rely on classifying tasks into different layers, so it is much less subject to classification bias (e.g., Lambert, 2011).

Calculating the author allocation models according to the special version (i.e., equating the credit of the first and last author) as opposed to the standard version produces a significant increase in the area under the curve for the arithmetic, geometric and harmonic models. Regardless, the arithmetic model (together with the fractional model) performs poorly, while the geometric and harmonic models do a fair job of separating core- and non-core authors. However, the harmonic parabolic model outperforms them all, doing a good job of predicting which authors are core authors and which are not.

I would argue that the harmonic parabolic model performs best overall for the analyzed data set, even though the fractional model produces the lowest lack of fit when all types of tasks are considered. The coefficients of determination reported in Fig. 1 and the ROC curves in Figs. 2 and 3, together with the areas under the curve in Tables 9 and 10, support the conclusion that the harmonic parabolic model is the best overall choice for allocating author credit in bibliometric exercises in the field of chemical biology.

This study has some limitations. The most critical problem is accurately measuring the extent of a contribution, which even a detailed author contribution statement struggles to express. I have classified contributions as core-, middle-, or outer-layer, which partly reflects the extent of involvement in a task. However, even if an author is involved in more tasks than another author, it does not necessarily follow that the former has contributed more than the latter. Furthermore, even when contributors are classified as performing the same type of task, their extent of involvement may differ considerably. Such misclassifications ultimately depend on how specifically the authors have detailed their contributions and whether the classification procedure picks this up. At an aggregate level, these errors should matter less, which motivates further studies using larger datasets, something I aim to take on in the near future. However, in the data for this study, core-layer authors are, on average, involved in almost twice as many tasks as middle-layer authors and dominate the first and last author positions. So even though uncertainties exist concerning the extent of involvement, the u-shaped author distributions align with what we know about the organization of lab work and author ordering (e.g., Larivière et al., 2016; Louis et al., 2008; Sundling, 2017).

Another limitation is that this study is based only on articles in the chemical biology field; as such, the results apply only within the context of this field. As detailed investigations of all byline positions are lacking in the literature, there is no basis for claiming that the harmonic parabolic model performs as well in other parts of science as it does in chemical biology. However, there are indications that papers in biomedicine and clinical medicine also have several primary and supervisory authors positioned at the beginning and the end of the bylines, respectively (Mongeon et al., 2017). Baerlocher et al. (2007) and Wren et al. (2007) reach similar conclusions in these fields: they found that first authors contributed the most, followed by last and second authors, while authors in the middle had low levels of participation in tasks such as conception and drafting the manuscript. Recent studies of author contribution statements from articles in several PLOS journals also suggest that first and last authors contribute to more tasks than middle authors (Larivière et al., 2021; Sauermann & Haeussler, 2017). In fields with such an ordering logic, the harmonic parabolic model would probably perform better than an allocation model that assumes a declining ordering logic, such as the harmonic model.

Author inclusion and byline order vary by field, but it is crucial to acknowledge that these aspects also likely vary between labs, specializations, and countries (Knorr-Cetina, 1999; Pontille, 2003; Smith et al., 2019, 2020; Trimbur & Braun, 1992). For example, some lab leaders might be more inclined to “rank an unlucky post-doc first on a paper on which he or she would normally have been placed only as a second author, in order to keep that person motivated” (Knorr-Cetina, 1999, p. 231). Furthermore, there are indications that transdisciplinary research collaborations have more inclusive authorship practices than research where the collaborators are from the same field (Elliott et al., 2017). Such factors also influence the findings and generalizability of this study.

This study evaluates the validity of different models for allocating authorship credit by comparing them in their standard form, and when considering the first and last positions equal. One might argue that treating the credit for corresponding authors as equal to that of the first and last authors would have produced other results and corrected some models. There are, however, reasons for not including such a version of the models in the analysis. First, while it is true that many articles in the data set contain multiple corresponding authors, only one for each paper is registered in the reprint author field in WOS (Waltman, 2012). As long as this is the case, including all corresponding authors for each paper in an allocation model to be used in more extensive bibliometric studies would be a somewhat unrealistic enterprise. Second, the reprint authors that WOS reports for this data set all appear as either first or last authors, making a different version of the models redundant.

Many studies have shown that not fractionalizing authorship credit produces inflated numbers that fundamentally change rankings (Gauffriau & Larsen, 2005; Gauffriau et al., 2008; Huang et al., 2011; Piro et al., 2013). Hagen (2014a) points out that while the effects of this inflationary bias are well known, the same cannot be said for the effect of distributing credit equally among coauthors who have not contributed equally (i.e., the equalizing bias). Some studies indicate that using a harmonic allocation model instead of a fractional model produces different rankings and metrics at the level of individual researchers (Hagen, 2014a) as well as countries (Hagen, 2015). A future research avenue lies in studying how rankings and output metrics change when the harmonic parabolic model is used instead of the harmonic or fractional model. This line of inquiry could also be combined with data on the gender of the authors, as well as academic rank and organizational prestige, to uncover distributional biases in the author credit allocation models. Another fruitful enterprise would be to use author contribution data from journals adopting the CRediT taxonomy for a large-scale study of allocation models in various fields.
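
The inflationary effect of full counting is simple arithmetic, as the sketch below shows for three hypothetical papers: full counting hands out one credit per author per paper, whereas fractional counting hands out exactly one credit per paper. The author labels and bylines are invented for illustration.

```python
from collections import Counter


def author_credit(bylines, fractional):
    """Sum each author's credit over papers under fractional or full counting."""
    credit = Counter()
    for byline in bylines:
        share = 1.0 / len(byline) if fractional else 1.0
        for author in byline:
            credit[author] += share
    return credit


bylines = [["A", "B", "C"], ["A", "D"], ["B", "C", "D", "E"]]
full = author_credit(bylines, fractional=False)
frac = author_credit(bylines, fractional=True)
print(sum(full.values()), round(sum(frac.values()), 6))  # 9.0 3.0
print(full["A"], round(frac["A"], 3))                    # 2.0 0.833
```

Fractional counting removes the inflation but still gives every coauthor of a paper the same share, which is the equalizing bias described above; position-based models also distribute one credit per paper but vary the shares by byline position.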