Biographical articles in scientific literature: analysis of articles indexed in Web of Science

Biographical articles in scientific journals offer a platform for the commemoration of distinguished individuals from the world of science. Despite their important role in the scientific community, research on biographical articles is scarce. To fill this gap, we have analyzed 190,350 biographical articles indexed in Web of Science, written by 251,908 authors in 1945–2014. We have analyzed the development of this article type over the studied period and across research areas, how women and men are represented as the subjects of articles, and who the authors are. Over time, the number of biographical articles has been increasing, with the highest number in Life Sciences and Biomedicine. Around 20% of the articles were written about women, with the highest share of 24% in Arts and Humanities. Both male and female authors write more often about men than about women, a situation that has been stable for the last 70 years.


Introduction
Article types vary in their roles in the dissemination of knowledge and in their frequency in journals (Sigogneau 2000). Articles, normally considered a source of original research, are the most basic, and the most important, document type; for instance, they accounted for 57.4% of all documents indexed by WoS for 2014. Less frequent in WoS, but still important, are proceedings papers (13.4%), meeting abstracts (11%), editorial materials (4.4%), book chapters (4.3%), reviews (3.0%), book reviews (2.6%), and letters (1.7%). All other document types accounted for 2.2% of all the documents in 2014.
respected and inspiring figure.'' The authors used NVivo software for textual analysis of obituaries to find the scholarly and personal characteristics that were most common in the obituaries. Based on these characteristics, the authors created a tree map to visualize what it means to be an intellectual leader.
These studies focused on obituaries, but biographical articles constitute a wider category. Here, we analyze it as a whole. To do so, we analyze a collection of documents classified by WoS as either a biographical item or an item about an individual, published from 1945 to 2014 in scholarly journals indexed in Web of Science. Since this is the first analysis of a large collection of biographical articles, we have only indirect hints about what we might look for in these data, hints that result from scientometrics research on other document types as well as on obituaries, whether related to academia or not. We will thus conduct our analyses around the following questions: How did the number of biographical articles change over the years? Do journals representing different scientific disciplines differ in the number of biographical articles they publish? Are women and men equally represented in biographical articles? Do women (men) write more articles about women or men? We will not, however, limit the analysis to these questions, but we will explore any phenomena in the data that catch our eye. In addition, we will analyze the variety of biographical articles in terms of their contents. WoS defines biographical articles rather generally, so we will investigate what can actually be found in articles that are classified as biographical items or items about an individual.

Data sources
We searched the Web of Science (WoS) database (Web of Science 2016) for two types of biographical articles, namely, Biographical-Item and Item-About-an-Individual, published from 1945 to 2014. This way, we collected the following data about 190,350 unique biographical articles:
• WoS accession number (the unique identifier of an item),
• title,
• year of publication,
• language of article,
• authors' names (surnames with first names or initials),
• one or more WoS categories of the item, and
• the number of citations.
The first WoS category defined for an item was used to assign a higher-level WoS Research Area to this item.
To classify the authors and the people mentioned in titles of articles by gender, we used the genderizeR package (Wais 2016a) of R (R Core Team 2017). The package guesses the gender of a person based on the first name and the data gathered in the genderize.io database (Strømgren 2016). Created in August 2013, the database has been regularly updated since then by the continuous scanning of public profiles of social network users. In April 2014, the genderize.io database contained information on about 120,000 first names based on about half a million social network profiles of men and women. About 3 years later, in June 2017, the database had information on over 200,000 first names from social network profiles from 79 countries and in 89 languages (Strømgren 2016).

Glossary
Authorship: the unique combination of the title of an article and the name of one of the authors (note that the same author can publish more than one article, so the number of authorships will be greater than the number of authors).
Biographical article: an article assigned to one of the two categories in the WoS database: Biographical-Item and Item-About-an-Individual.
Unisex first name: a first name that can be used both by men and women.
Gender database: a database used for gender classification; in our study, we used the genderize.io database, which contains information about relationships between first names and gender obtained from public profiles from social networks.
Probability: given a first name, the probability that a person with this first name is a man (or a woman, depending on the context). If the probability is 0.5, half of the people in the gender database who share this first name are men while the other half are women.
Count: the number of people in the gender database with the same first name.

Gender classification
We used the methodology suggested in Wais (2016b) to guess the gender of (i) people mentioned in titles of biographical articles and (ii) authors of these articles. The algorithm, available in the genderizeR package (Wais 2016a),
1. automatically parses all title words,
2. checks in the genderize.io database whether these words were used as first names in social network profiles, and
3. estimates the probability that a person with this first name is a man or a woman.
In the third step above, the algorithm takes into account that some first names are valid for both men and women, and so classifying such names is always imprecise. Using the gender data from the database, we can estimate this uncertainty: given a first name, the probability of being a woman is estimated as the share of people with this first name who declared themselves as women.
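The estimate described above is a simple proportion. The following minimal sketch illustrates it in Python; the function name and the counts are hypothetical and are not taken from genderize.io or genderizeR.

```python
def probability_female(n_female: int, n_male: int) -> float:
    """Share of profiles with a given first name that are declared as women."""
    total = n_female + n_male
    if total == 0:
        raise ValueError("first name not present in the gender database")
    return n_female / total


# Hypothetical counts for one unisex first name (illustrative only).
p_woman = probability_female(n_female=30, n_male=70)
p_man = 1 - p_woman
```

With these illustrative counts, 30 of 100 profiles are declared as women, so the name would be treated as male with a probability of 0.7.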

Validation of gender classifications
Validation datasets
We validated the algorithm with a random sample of 1000 unique biographical articles. The gender of persons in the titles was manually coded as
• ''male'' or ''female'', if all people mentioned in the title had the same gender,
• ''both'', if more than one person was mentioned in the title and their genders were different,
• ''unknown'', if it was impossible to assign a gender based on the name given in the title, or
• ''noname'', if no person was mentioned in a title.
This way, we coded the gender of persons in the titles as
Similarly, to validate how precisely the algorithm classified the gender of authors, we randomly sampled 2000 biographical articles and extracted 2641 author names. If the first name of an author was given, the author's gender was manually coded as ''female'' or ''male,'' based on Internet queries that used the author's affiliation, contact information, and the title of the biographical article. We coded the gender of authorships as
• ''noname'' in 2039 (77.2%) cases,
• ''male'' in 346 (13.1%) cases,
• ''unknown'' in 165 (6.2%) cases, and
• ''female'' in 91 (3.5%) cases.
Training the algorithm
From the genderize.io database, for each first name we have the probability that a person with this name is a man or a woman. We have to decide whether we wish to work only with names for which this probability is close to 1 or whether we also accept names for which this probability is closer to 0.5; a name with a probability close to 0.5 is given to both men and women, and so classification of the gender for such a name will be the most uncertain.
Thus, to train the algorithm for classifying gender, we should check different threshold values of this probability and choose the best one. The algorithm will not use first names with probabilities below this threshold; this way, we can decrease the uncertainty of our classifications at the cost of ignoring unisex first names.
We should also be cautious when using rare unisex first names. To decide which names should be included in the algorithm and which ignored, we should test different threshold values for counts of how many times a first name was recorded in the gender database; the algorithm will use only those first names which occurred more often than the threshold.
So, we looked for the optimum values of these two parameters: probability (that a first name represents a particular gender) and count (of how many times a first name was recorded in the database with gender data) (Wais 2016b). Based on a preliminary, exploratory analysis, we have decided that the optimum probability should be between 0.5 and 0.8 while the optimum count, between 1 and 13. Note that the algorithm should be independently trained for the two datasets: titles and authorships. For both datasets, we checked all 403 combinations of (i) probability between 0.50 and 0.80 with a step 0.01 (so, 0.50, 0.51, … , 0.80) and (ii) count between 1 and 13. The best combination is that which leads to the highest accuracy of gender classification, that is, for which the algorithm would match the manually coded data in the highest number of cases.
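The search over the 403 parameter combinations can be sketched as a plain grid search. The example below is a hedged illustration in Python rather than the genderizeR implementation: the tiny gender database, the validation items, and the rule that a name is used when it meets both thresholds are assumptions made for the sketch.

```python
from itertools import product

# Hypothetical gender database: first name -> (likelier gender, probability, count).
GENDER_DB = {
    "anna": ("female", 0.99, 500),
    "john": ("male", 0.99, 800),
    "kim": ("female", 0.62, 40),
    "andrea": ("female", 0.56, 20),
}

# Hypothetical manually coded validation items; None stands for ''unknown''.
VALIDATION = [("anna", "female"), ("john", "male"), ("kim", "female"), ("andrea", None)]


def classify(name, min_probability, min_count):
    """Return the predicted gender, or None if the name is missing or below a threshold."""
    entry = GENDER_DB.get(name)
    if entry is None:
        return None
    gender, probability, count = entry
    # Assumption: a name is used only when it meets both thresholds.
    if probability < min_probability or count < min_count:
        return None
    return gender


def accuracy(min_probability, min_count):
    """Share of validation items where the prediction matches the manual code."""
    hits = sum(
        classify(name, min_probability, min_count) == coded
        for name, coded in VALIDATION
    )
    return hits / len(VALIDATION)


probabilities = [round(0.50 + step / 100, 2) for step in range(31)]  # 0.50, ..., 0.80
counts = range(1, 14)                                                # 1, ..., 13
grid = list(product(probabilities, counts))                          # 31 * 13 = 403 pairs

# max() returns the first best pair, so ties go to the lowest thresholds.
best = max(grid, key=lambda pair: accuracy(*pair))
best_accuracy = accuracy(*best)
```

In this toy setting, raising the probability threshold just above 0.56 drops the ambiguous name and so matches the manual ''unknown'' code, which is the kind of trade-off the two parameters control.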
For the validation dataset of titles, the algorithm worked best with the probability parameter set to 0.67 and the count parameter set to 1. Using these values, we obtained a relatively small overall classification error rate (8.3% of items with incorrectly classified gender) and a small proportion of items with an unclassified gender (1.9%). The gender bias error rate in automatic gender classification was also low (4.1%) and had a positive sign, which suggests that more men were incorrectly classified as women than vice versa, indicating a slight overestimation of the proportion of women in the population studied. Since we estimated the overall classification error rate (8.3%) on the training dataset, the error was underestimated. Thus, to get a more realistic indicator of the classification error rate, we also estimated a more robust bootstrapped error rate (8.5%) (Wais 2016b).
For the validation dataset of authorships, the algorithm worked best with the probability parameter set to 0.54 and the count parameter set to 1. Using these values, we obtained a small overall classification error rate (6.9%; bootstrapped error rate 7.1%), a small proportion of items with unclassified gender (2.7%), and a small gender bias error (1.4%).
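To make the three indicators reported above concrete, here is a minimal Python sketch. The formulas are assumptions chosen to agree with the sign convention described in the text (a positive bias means more men classified as women than the reverse); the exact definitions used in Wais (2016b) are not reproduced here, and the data are invented.

```python
def classification_errors(pairs):
    """Compute illustrative error indicators.

    pairs: iterable of (manual_code, predicted); predicted is None when the
    algorithm left the item unclassified. Manual codes are "male" or "female".
    """
    pairs = list(pairs)
    total = len(pairs)
    classified = [(manual, pred) for manual, pred in pairs if pred is not None]
    wrong = sum(manual != pred for manual, pred in classified)
    male_as_female = sum(m == "male" and p == "female" for m, p in classified)
    female_as_male = sum(m == "female" and p == "male" for m, p in classified)
    return {
        # Assumed definition: misclassified items among all items.
        "error_rate": wrong / total,
        # Items for which no gender was predicted.
        "unclassified": (total - len(classified)) / total,
        # Assumed definition: signed net bias among classified items.
        "gender_bias": (male_as_female - female_as_male) / len(classified),
    }


# Hypothetical validation outcome (10 items).
toy_pairs = (
    [("male", "male")] * 5
    + [("male", "female")] * 2
    + [("female", "male")]
    + [("female", "female")]
    + [("male", None)]
)
errors = classification_errors(toy_pairs)
```

A bootstrapped error rate, as mentioned in the text, would repeat such a computation over resampled validation sets; that step is omitted from this sketch.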

Terminology
Web of Science defines biographical items and items about an individual (which we combine into one document type of biographical articles) as, generally put, articles focused on the life of individuals, obituaries, and commemorations as well as tributes to such people. The latter group represents articles that are not considered biographical in the traditional meaning; these can be, for example, transcripts of lectures or review articles on a given topic, whose only relation to an individual is the dedication of the article.
Individual biographical articles can thus differ quite a lot. We therefore conducted an in-depth analysis of a sample of 750 biographical articles to find out whether they can be classified into distinct categories. After a preliminary analysis, we divided the articles into those about living people and those about deceased people. We then divided these categories into subcategories based on the purpose of an article (Table 1).
We decided to create a special category for atypical biographical articles, which we called ''Other.'' It would include, for instance, articles that are not about any individual but are dedicated to a person. Examples are the tributes described above. Another example could be an article that is focused on scientific knowledge, with an additional account of the people who developed this knowledge; although such an article includes the biographies of these people, this topic is secondary to the main topic. We decided that such articles of marginal biographical character should fall into a different category than those which are biographical in their essence.

Data collection and sources
We analyzed a sample of 750 biographical articles. To obtain it, we drew three independent random subsamples of 250 articles each, without replacement, from the years 1945-1984, 1985-1999, and 2000-2014. We chose these three periods based on the trend of the number of biographical articles (Fig. 1). The years 1985 and 2000 showed changes in the trend, so we decided to break the whole period into the three corresponding sub-periods and check whether the categorization of biographical articles differed between the periods.
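The sampling design described above can be sketched in a few lines of Python. The article records, identifiers, and random seed below are hypothetical stand-ins for the WoS data; only the three periods and the subsample size of 250 come from the text.

```python
import random

PERIODS = [(1945, 1984), (1985, 1999), (2000, 2014)]
SUBSAMPLE_SIZE = 250

# Hypothetical stand-in for the WoS records: (accession id, publication year).
articles = [
    (f"ID{year}-{i:02d}", year)
    for year in range(1945, 2015)
    for i in range(20)
]

rng = random.Random(2015)  # arbitrary seed, used only for reproducibility

sample = []
for start, end in PERIODS:
    stratum = [article for article in articles if start <= article[1] <= end]
    # random.sample draws without replacement within the stratum.
    sample.extend(rng.sample(stratum, SUBSAMPLE_SIZE))
```

Because each period contributes the same number of articles regardless of how many were published in it, comparisons between periods are not dominated by the years with the most articles.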
We analyzed each article from the sample in the following manner. First, we looked for particular words in the titles. For instance, articles whose titles included the words ''obituary'' or ''in memoriam'' were classified as obituaries while articles whose titles contained information about awards given to a person were classified as ''award for individual.'' Then, we searched for all other articles in WoS, Google Scholar, and/or archives of the journal they were published in. When we succeeded in finding their full texts, we read them and assigned a category and subcategory. In some cases, we failed to access full texts but succeeded in classifying the articles based on their abstracts or first pages. Sometimes, to support our classification of an article, we additionally used information about the individual that we found in news articles and press releases.
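The first, title-based step of this procedure could look like the following Python sketch. The obituary markers are the ones named in the text; the award markers, the function name, and the example titles are assumptions added for illustration, and articles that match no rule would go on to the manual full-text step.

```python
OBITUARY_MARKERS = ("obituary", "in memoriam")  # named in the text
AWARD_MARKERS = ("award", "prize", "medal")     # assumed, for illustration only


def classify_by_title(title):
    """Return a category suggested by the title, or None if manual review is needed."""
    lowered = title.lower()
    if any(marker in lowered for marker in OBITUARY_MARKERS):
        return "obituary"
    if any(marker in lowered for marker in AWARD_MARKERS):
        return "award for individual"
    return None


# Hypothetical titles.
examples = {
    "In Memoriam: Jane Doe": classify_by_title("In Memoriam: Jane Doe"),
    "Example Medal presented to John Doe": classify_by_title("Example Medal presented to John Doe"),
    "Jane Doe at seventy": classify_by_title("Jane Doe at seventy"),
}
```

Simple substring rules like these can produce false matches, which is one reason a manual check of full texts, abstracts, or first pages remains necessary.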

Results
As of January 2015, Web of Science indexed information about 190,350 unique biographical articles written by 251,908 authors in the period studied. Below, we analyze those articles and their authors. (Table 2). This corresponds to the higher number of journals and articles published in Life Sciences and Biomedicine. In Arts & Humanities, on the other hand, there is a prevalent tradition of introducing new talents in music, ballet, and theater, which increases the number of biographical articles.

Citations
Although WoS does not consider biographical articles as citable items when calculating the impact factor, citations to them contribute to the overall number of citations for a journal (Garfield 2006; McVeigh and Mann 2009). Thus, citations of these articles are worth studying. Most of the biographical articles have been infrequently cited or not cited at all; a few, however, have had many citations. The mean number of citations per article for all studied years was below 1 (Fig. 2).
Scientific articles from Social Sciences are usually less often cited than those from Life Sciences and Physical Sciences, a phenomenon we have not observed for the biographical articles: those from Multidisciplinary and Social Sciences were most often cited
[Figure: x-axis, year (1945-2015); y-axis, mean number of citations per article]
We have analyzed the top ten biographical articles with the highest number of citations in the Web of Science database (Table 3). All of them were in English, and all but one were assigned to the Life Sciences & Biomedicine, Technology, and Physical Sciences research areas. The most cited biographical article was Murdoch (1994) with 385 citations; this article was previously presented at a conference in 1991 by population ecologist William W. Murdoch. His lecture was awarded the Robert H. MacArthur Award, one of the most prestigious prizes given by the Ecological Society of America (ESA Historical Records Committee 2014). Like Westphal (1975), this article is not a typical biographical article as it is not about an individual; we thus classified it as ''Other.''

Gender of articles' subject
According to the classification algorithm, out of 190,350 biographical articles, 148,509 (78.0%) were written about men, 30,152 (15.8%) were written about women, and 11,689 (6.1%) were unidentified.
At the beginning of the studied period, most articles (over 90%) were about men. The share of articles about women increased slightly until around the 1970s, when it stabilized at around 20% (Fig. 3). Such an increase in the share of articles about women was likely related to the movement toward gender equality in the workforce in the 1960s and 1970s. However, the stable share of articles published about women from the 1970s up to the present time suggests that little progress has been made in the appreciation of the contribution of women.
The highest share of articles about women was in Arts & Humanities (almost 24%), Social Sciences (over 18%), and Multidisciplinary Sciences (over 17%). The lowest share was in Life Sciences & Biomedicine (14%) and Technology and Physical Sciences (both over 12%) (Fig. 4).
The articles about women were slightly less often cited (with a mean number of citations of 0.19) than those about men (0.21). Interestingly, the ''unidentified'' articles were more often cited (0.24). In Multidisciplinary Sciences and Technology, the articles about women were cited more often than those about men, even though in Technology the share of articles about women was the lowest. In Social Sciences and Physical Sciences, the opposite was true. In Arts & Humanities and Life Sciences & Biomedicine, the mean number of citations per article was similar for articles about men and about women (Fig. 5).

What's in there? Classification of biographical articles
A quarter of the sample of 750 biographical articles were articles in honor of a living individual while 70% were in honor of a deceased person. We failed to collect sufficient information about nearly 5% of the articles, so we did not classify them. Unlike in the citation analysis, none of the 750 articles was wrongfully classified by WoS as a biographical article. The sample was stratified, with 250 biographical articles drawn from each of three strata: the periods up to 1984, 1985-1999, and 2000-2014. The periods did not differ in the distribution of the subcategories (results not shown), which suggests that biographical articles have not changed, as a document type, throughout this period, even if their number has been changing (Fig. 6).
The popularity of the categories of biographical articles differed across Web of Science research areas. Table 4 presents the shares of different categories of articles in research areas.
Obituaries represented the majority of biographical articles (458 out of 750, 61%). Over 45% of them were in Life Sciences & Biomedicine. Celebration of work was more common for living people; most such articles were in Arts & Humanities (37%) and Life Sciences & Biomedicine.
Seventy-seven percent of obituaries were about men and 15% about women; others were not classified (8%). A similar gender distribution was found for anniversary of birthday (alive), award for individual, and celebration of work for both living and deceased individuals. All anniversaries of birthday of deceased people as well as all anniversaries of death were only about men. Of the 7 autobiographical articles, two were about women.

Authors
We have also analyzed the authors of the biographical articles. The 190,350 articles were written by 251,908 authors. The mean number of authors of biographical articles ranged between 1.04 and 1.6 and has been increasing since 1984 (Fig. 7). The highest number of authors (145) was for a tribute article to the Nobel Prize laureate Robert Geoffrey Edwards. A supplement to the main issue of Reproductive Biomedicine Online, this article was atypical, being a collection of 130 tributes (Kamal et al. 2011). Fourteen of the top 20 articles with the highest number of authors were published in the Russian language and in Russian journals; the remaining six were written in English. Two Russian journals had seven and six articles with over 50 authors. Among the top 20, fifteen articles mentioned men in their titles, and four mentioned women; the algorithm failed to identify one person's gender, and our manual check showed that this person was a man. The highest mean number of authors was for Physical Sciences while the smallest was for Arts & Humanities (Fig. 8). This result confirms what we know about Physical Sciences: that scientific articles in physics often have many co-authors (Iglič et al. 2017; Ioannidis, Klavans and Boyack 2016).
In Multidisciplinary Sciences, Social Sciences, Life Sciences & Biomedicine, and Technology, biographical articles written by more than one author were more often cited than those written by one author (Fig. 9). In Physical Sciences and, to a smaller degree, in Arts & Humanities, the opposite was true.
[Figure: Mean number of authors of the biographical articles across the studied period; x-axis, year (1945-2015); y-axis, mean number of authors per article]
Based on the first names of the authors, we classified the authors' gender for 60.1% of the authorships (see the glossary to recall what the term means). For 39.9% of the authorships, the classification was impossible for two reasons: (i) authors in over 11% of authorships were anonymous, and (ii) initials instead of first names were given for many non-anonymous authors. Note that when an author wrote k articles, we counted this author k times. Table 5 shows the authors' gender.
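The difference between authors and authorships, and the reasons why some authorships cannot be classified, can be illustrated with a short Python sketch. The records and names are invented, and the rule for detecting initials is a simplifying assumption rather than the procedure used in the study.

```python
# Hypothetical records: (article title, list of author names); None marks an anonymous author.
records = [
    ("Title A", ["Ann Lee"]),
    ("Title B", ["Ann Lee", "J. Smith"]),
    ("Title C", [None]),
]

# One authorship per (title, author) pair: an author of k articles is counted k times.
authorships = [(title, author) for title, authors in records for author in authors]

# Unique named authors.
authors = {author for _, author in authorships if author is not None}


def has_first_name(author):
    """Assumed rule: the first token, without periods, is longer than one letter."""
    if author is None:
        return False
    first_token = author.split()[0].replace(".", "")
    return len(first_token) > 1


classifiable = [pair for pair in authorships if has_first_name(pair[1])]
share_classifiable = len(classifiable) / len(authorships)
```

Here the author with two articles yields two authorships, while the anonymous author and the author listed with an initial yield authorships whose gender cannot be guessed from a first name.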
Among the authorships for which we classified the gender, women constituted 24% of authors and men 76%. The female authors had the highest share (around 30%) in Multidisciplinary Sciences and the lowest (around 20%) in Technology (Fig. 10).

Authors and articles
We have classified the biographical articles into those written by men, women, and both men and women (Table 6). For over 40% of articles, we were unable to identify the authors' gender. Of the classified articles, 72% were written by men, 20% by women, and only 7.5% by a team of men and women. This classification was stable over the studied period (Fig. 11). Above, we have analyzed the biographical articles in two contexts: who wrote them and about whom they were written. Now, let us join the two contexts and analyze who wrote about whom (Fig. 12). As we already know, most articles were about men. This phenomenon did not depend on who wrote the articles (Fig. 12). After 2005, there was a peak in the articles about women written by women, but it lasted only for a few years. The trend for the male authors has been stable since the 1980s, with the highest share of the articles written about men. A similar situation was observed for articles written by both women and men, though in a few earlier years, such teams published a similar number of biographical articles about men and women.
The articles written by men were least often cited (with a mean citation count of 0.25) while those published by men-women teams were most often cited (0.29); the articles published by women had, on average, 0.27 citations per article. The WoS Research Areas differed in citation patterns (Fig. 13). For Multidisciplinary Sciences, Physical Sciences, Life Sciences, and Arts & Humanities, the articles written by women were more often cited
Fig. 12 Share of articles written by women, men, and men-women teams about men or women, during 1945-2014. Included are only those articles for which we were able to identify gender of the authors and of the subject. Panels: Author is woman; Author is woman & man; Author is man. Series: Person in title is woman; Person in title is man

Conclusions and implications
Journals often use different names for article types than WoS does, hindering scientometrics research. Much more damaging, though, can be misclassification of articles into document types by WoS: such misclassification can affect the evaluation of journals and scholars (Harzing 2013). We found that biographical articles are seldom misclassified in WoS: although we did find such a misclassification among the most-cited articles, we found none among the 750 articles in the sample. As mentioned in the Introduction, previous studies on obituaries analyzed articles published in popular media; the majority of them presented content analysis of samples of obituaries. Till now, however, no one has attempted to analyze in detail biographical articles published in scientific journals, which is surprising, given the number of such articles. This gap led us to conduct the present research, which aimed to analyze biographical articles, including obituaries, published in scholarly journals indexed in Web of Science.
Most biographical articles do not directly contribute to the development of science. We believe, however, that they deserve attention because they deal with one of the most valuable aspects of science development: the excellence of the human mind. These more than 190,000 biographical articles celebrating distinguished individuals constitute a rich source of information about the science world.
By analyzing biographical articles over the last 70 years, we were able to study various aspects of the development of this type of article. These aspects included trends over time and across science disciplines related to article number, citation impact, variety in contents, and gender equality in article topics and authors.
Over time, the number of biographical articles in WoS has been increasing, including not only obituaries, but also job anniversaries, birthday celebrations, and commemorations of individuals. Most biographical articles were published in Life Sciences and Biomedicine, but the highest mean number of citations was in Social Sciences (although the between-area differences were rather small). This result is astonishing because regular scientific articles in Life Sciences and Biomedicine are much more frequently cited than those in Social Sciences. Dealing with people, not with scientific observations, however, biographical articles are governed by different rules than scientific articles are. Social scientists are likely more apt to write and read about people, and so they might be more apt to cite such articles. Among the top ten most often cited articles, however, only one was in Social Sciences; seven were from Life Sciences and Biomedicine, one from Technology, and one from Physical Sciences. We believe that some of these articles gained such high popularity (represented by many citations) because of their contents: they were more like review articles than biographies in the traditional meaning (e.g. Murdoch 1994; Westphal 1975).
The number of co-authored biographical articles (that is, written by more than one author), in particular of those written by man-woman teams, has been increasing since the 1980s. This observation is consistent with the shift toward collaboration in science (e.g. Adams et al. 2005; Persson et al. 2004; Glänzel 2002).
Over the studied period, the shares of biographical articles commemorating men and women were stable, the overall representation of women in titles of biographical articles being 20%. The only exceptional period was several years in the 1970s, when this number increased by about 10 percentage points. Arts and Humanities had the highest share (24%) of articles about women; Life Sciences & Biomedicine, Physical Sciences, and Technology were at the opposite pole, with around 13% of biographical articles about women. The analysis of the articles' subjects and authors gives a clear picture: the gender of an author is not related to the subject's gender, as both men and women wrote more about men than about women. Our research shows that more has to be done to commemorate women. The New York Times' input is worth noting, with their weekly postings of obituaries of individuals overlooked in the past, mostly women and representatives of minority groups (The New York Times, 2018). Figure 14 summarizes the relationships between the authors' and the subjects' genders in the WoS research areas. It clearly shows that, irrespective of the area, women are underrepresented in biographical articles in both roles: as their authors and as their subjects.
We based our analysis on an external name-gender reference dataset (the genderize.io database, Strømgren 2016). One limitation of using this database is that it was collected from social media networks. The self-declared data that such networks collect from their users do not have to conform to reality. Thanks to the huge number of observations in the genderize.io database, however, the uncertainty related to the self-declared nature of the data should be negligible. Our validation tests of the database supported this expectation.
Web of Science assigns each article it indexes to one of many article types, two of which are ''biographical items'' and ''items about an individual.'' This assignment is not error-free, however, and mistakes happen (Harzing 2013). Our analysis showed that misclassification of articles as biographical ones happens, but infrequently. It might be interesting to study possible reasons for such mistakes. This paper opens new avenues for future research. What are the reasons behind the drop in the number of biographical articles over time? What is their importance to the scientific community? Do scientists read such articles? Why do they, or why don't they? We hope this paper will trigger research on biographical articles, a still understudied topic in scientometrics.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.