The Gender Gap in Brazilian Entomology: an Analysis of the Academic Scenario

Although women are about half of the world’s population, they are underrepresented in many sectors, including academia and research in general. The gender gap in Entomology has been pointed out in other publications; however, data for Brazil have never been presented. Here we provide a diagnosis of the Brazilian Entomology scenario in order to contribute to propositions towards disentangling the gender gap in general. We analyzed scientometric data for Brazilian Entomology focusing on gender disparity, and assessed personal perceptions related to the gender gap through an online questionnaire. We detected a pervasive gender bias, of which the scissor-shaped curve is the most representative effect: women were the majority in lower degree stages but the minority in higher degree stages (permanent positions and positions of prestige and power). We also observed mentorship bias and discussed these results in light of intersectionality and the COVID-19 pandemic. Gender differences were perceived differently by the questionnaire respondents depending on age, gender, and parenting. With these data and analyses, we have provided elements to stimulate and support change towards a healthier and more equitable academic space. Supplementary Information The online version contains supplementary material available at 10.1007/s13744-021-00918-7.

In sum, our decisions were to: 1) search for all insect orders and some genera (cf. Decision 2), but not other taxonomic levels, like family; 2) include genera of widely used and well-known model organisms, disease vectors, and pests, like Apis, Aedes, Spodoptera; 3) include common names; 4) use radicals in specific cases like entom* for Entomology, entomofauna, etc.; 5) use scientific names in Latin in their full spelling; and 6) use Portuguese words in their correct spelling, including accentuation, hyphenation, or other Latin orthography like cedilla (ç). Below we detail each of these decisions, also reporting the steps taken for data curation.
We searched for keywords in the Title column; only 61 titles were blank among the total of 1,235,795 T&Ds. We considered master's and professional master's degrees to be the same level (MSc). The SEARCH formula in Excel is case insensitive and finds the search term in any part of the given cell. This formula returns a value, the position of the first letter of the keyword in the sentence (i.e., the title), which is meaningless for our purposes. We replaced this number with 1 to use it in calculations such as the number of exclusive T&Ds of a given keyword. As we based many of our decisions on this number, we calculated it by filtering out the T&Ds caught by insect orders and generic keywords (entom*, inset*, insect*) and summing the search results (substituted by 1) in the column of each keyword.
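The flagging and counting steps described above can be sketched outside Excel. The following is a minimal Python sketch under stated assumptions, not the workflow actually used (which relied on Excel's SEARCH formula); the function names flag and exclusive_count and the toy titles are hypothetical.

```python
def flag(title, keyword):
    """Return 1 if keyword occurs anywhere in title, ignoring case, else 0.

    This mirrors replacing the position returned by Excel's SEARCH with 1.
    """
    return 1 if keyword.lower() in title.lower() else 0


def exclusive_count(titles, keyword, base_keywords):
    """Count titles caught by keyword but by none of the base keywords."""
    total = 0
    for title in titles:
        caught_by_base = any(flag(title, base) for base in base_keywords)
        if not caught_by_base:
            total += flag(title, keyword)
    return total


# Hypothetical titles, for illustration only.
titles = [
    "Biologia de Apis mellifera (Hymenoptera: Apidae)",
    "Comportamento de APIS MELLIFERA em pomares",
    "Controle de insetos em milho",
    "",  # blank titles simply yield 0
]
# A shortened stand-in for the insect orders plus the generic radicals.
base = ["hymenoptera", "entom", "inset", "insect"]

print(exclusive_count(titles, "apis", base))
```

Only the second title counts as exclusive to "apis" here, because the first is already caught by an insect order. Summing the 0/1 flags per keyword column is what makes the exclusive totals comparable across keywords.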
Decision 1: search for all insect orders and some genera (cf. Decision 2), but not other taxonomic levels, like family
We searched for some insect families as keywords and asked whether they are also captured by keywords of insect orders. For example, Pentatomidae had 298 results and, among these, 274 (92%) were also caught by "Hemiptera" or "Heteroptera." Among the remaining 24 titles, only 5 (2% of the 298) would not be recovered by insect orders and generic keywords. This pattern repeated itself enough times for us to decide that the taxonomic level of family does not bring enough exclusive results, since families are already covered by insect orders, and does not justify the effort of exploring which insect families would be the best keywords. If insect orders are already so numerous, exploring insect families as keywords would be a much deeper maze.
The decision not to include insect families is, however, mitigated by the decision to search for every insect order, even those that yield 0 results (i.e., were not in the title of any T&Ds in the Sucupira platform between 1987 and 2019), because this is relevant information for understanding which orders are unstudied or understudied at graduate levels.
The list of insect orders prioritized natural groups based on recent phylogenetic hypotheses, but also included groups that were once considered monophyletic or once ranked as an order, because they were valid and, as such, widely used in the past. For example, fleas were once order Siphonaptera, today an infraorder, and now they belong to Mecoptera (Tihelka et al. 2020): we included both names as insect "order." We evaluated some phylogenetic hypotheses (Kristensen 1981, Kjer et al. 2006, Trautwein et al. 2012, Misof et al. 2014, Beutel et al. 2017, Chesters 2019) and selected two based on the representativeness of monophyletic names valid in the recent (Beutel et al. 2017) and past (Kristensen 1981) state-of-the-art evaluations.
Decision 2: include genera of widely used and well-known model organisms, disease vectors, and pests
Three genera appeared with a substantial number of T&Ds exclusively caught by these keywords (Fig. S1): Aedes (644), Apis (569), and Drosophila (455). Particularly for Apis and Drosophila, the percentage of T&Ds that only used the genus name in the title was quite high: 80% and 86%, respectively (Fig. S1). That is, we would only catch 20% and 14% of bee and fruit fly research if we did not include the genus. This result convinced us of the need to include insect genera as keywords, despite some data cleaning being necessary (detailed in the next section).
We are aware this decision can overrepresent studies of hand-picked genera. For example, the pest genera we chose sum to a maximum of 20 species, which represent 0.018-0.025% of the Brazilian insect diversity (estimated 80,750-109,250 species, Lewinsohn & Prado 2005; sometimes estimated at 400k species, e.g., Rafael et al. 2012) or 0.002% of the world diversity (estimated 950,000 species, op. cit.). Even though most people know insects as harmful or as pests (Barua et al. 2012), and despite the fact that this bias is reflected in the science we make, the percentage of harmful and pest species is negligible compared with the real diversity and importance of insects. However, not including the chosen genera would be a worse trade-off. Nonetheless, we hold some confidence that most entomological studies still explicitly write the order in the title, or use generic words like "entomological" or "insect," but diversifying the type of keywords is another important strategy, which led us to the common name keyword (Decision 3).
Figure S1: Number of T&Ds caught by additional keywords to insect orders and generic keywords. The number is shown as exclusively caught (i.e., did not appear within results of orders and generic keywords, in dark pink) or not (light pink). Keywords are in Portuguese but were translated here into English (in brackets). * indicates radicals, underline indicates a space and, in blue, keywords related to mites (see last section).
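The percentages quoted for the hand-picked pest genera can be reproduced with a few lines of arithmetic. This is only a worked check of the figures given in the text (20 species against the cited diversity estimates); the helper name percent is hypothetical.

```python
def percent(part, whole):
    """Share of `part` in `whole`, expressed as a percentage."""
    return 100 * part / whole


pest_species = 20                          # maximum species in the chosen pest genera
brazil_low, brazil_high = 80_750, 109_250  # estimated range of Brazilian insect species
world = 950_000                            # estimated world insect species

# A larger diversity estimate gives a smaller share, and vice versa.
print(round(percent(pest_species, brazil_high), 3))
print(round(percent(pest_species, brazil_low), 3))
print(round(percent(pest_species, world), 3))
```

The three printed values correspond to the lower and upper bounds of the Brazilian range and to the world figure quoted above.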

Decision 3: include common names
The main challenge of using common names as keywords is that they can be used in various contexts with different meanings, or they derive from a common word. For example, esperança is the common name of bush crickets and is also the word for hope; these orthopterans are named esperança because finding them symbolizes good luck, or hope. Searching for this word yielded 570 results (+ 33 when replacing the ç with c), but the first 350 were all "mistakes," that is, used as hope and not the insect; indeed, a single title was within the T&Ds caught with insect orders and generic keywords. The same applied to efêmera (23 results, all mistakes) [common name of Ephemeroptera, and the word for ephemeral]; efemérid* (11 results, all mistakes) [common name of Ephemeroptera, and the word for ephemerality]; and traça (489 results, 462 mistakes) [common name of Zygentoma and Lepidoptera, and the word or radical for words like traçar, to draw or traçado, line or extração, extraction]. In all these cases, we found basically only mistakes, so these words were not worth adding to the keyword list.
Another challenge is that common names are regional, and Brazil is large and diverse enough to have sub-regional common names within regions. We tried to circumvent that by exhausting the common names we could find on Google for each order. Except for the cases above (esperança, efêmera, efemérid*, traça) and the ones listed in the next paragraph, the common names found in the titles of T&Ds are presented in Figure S1. From this figure, we also recognized the need to add common names, even with the potential hassle of curating the "wrong" titles that are also caught by this kind of keyword.
As in Decision 1, we excluded keywords with really low numbers from further consideration: bicho-pau (0)
As a side remark, in a follow-up paper, we will analyze publications of Brazilian entomologists. There, we shall explore with appropriate data the gender gap in the context of insect taxa since, for example, we found that the Entomology keyword with most T&Ds was bee, as a common name (Fig. S1). We recovered more female (649) than male (469) students among these T&Ds, with numbers growing markedly after 2003, showing that maybe the gender gap can be reversed in subfields of Entomology.
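To see why a common name used as a search term produces "mistakes," it helps to list the word in each title that actually triggered the match. The sketch below is illustrative only: the function name hits and the titles are hypothetical, and plain Python substring matching stands in for the Excel search.

```python
def hits(title, keyword):
    """Return the words of `title` that contain `keyword` as a substring."""
    return [w for w in title.lower().split() if keyword.lower() in w]


# Hypothetical titles, for illustration only.
titles = [
    "Traça-do-tomateiro em cultivo protegido",  # the insect
    "Traçado urbano de Salvador",               # "line", not the insect
    "A esperança na educação popular",          # "hope", not the bush cricket
]

print(hits(titles[0], "traça"))
print(hits(titles[1], "traça"))
print(hits(titles[2], "esperança"))
```

The second title is caught because traça is a radical of traçado, which a stricter whole-word match could filter out. The third is a true homonym that no string comparison can tell apart from the insect, which is why reading the titles remained necessary.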
Decision 4: use radicals in specific cases like entom* for Entomology, entomofauna, etc.
Choosing between part of a word (radical) and the full spelling as the keyword is a balance between precision and specificity versus being inclusive and catching "wrong" titles, that is, titles unrelated to Entomology. In some cases, like Entomology, entomopathogenic, and entomofauna, using entomo* instead of the three keywords is better, since this radical is not a common word used in other contexts to mean different things. In our case, we used entom* since, for instance, entomólogo [entomologist] carries an accent and is thus not caught in Excel if we use entomo*.
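The effect of the accent on the choice between entomo* and entom* can be illustrated with a short sketch. It assumes an accent-sensitive, case-insensitive substring match, which is how plain Python string comparison behaves and how the text describes the Excel search; the word list and the function name caught are illustrative.

```python
words = ["entomologia", "entomofauna", "entomopatogênico", "entomólogo"]


def caught(radical, word):
    """Case-insensitive but accent-sensitive substring match."""
    return radical.lower() in word.lower()


for radical in ("entomo", "entom"):
    missed = [w for w in words if not caught(radical, w)]
    print(radical, "misses", missed)
```

Because ó and o are different characters, the longer radical misses entomólogo, while the shorter one catches all four words.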
We chose to use radicals in the following cases:
In this case, the word insetívoro/a [insectivore] is also caught and must be manually removed.
c) insect*, for Insecta, insect (
We checked, for only a couple of cases, whether the radical indeed catches all variants using the online tool Venny (Oliveros 2015), and it does. We also used Venn diagrams to explore potential errors that would be introduced (Fig. S2). It is worth noting that, in Excel, the * should not be written.
Figure S2: Venn diagrams used to demonstrate that a keyword as a radical (here, inset) correctly catches variants (here, inseticida [insecticide] and inseto [insect]). We also used Venn diagrams to understand whether errors were introduced. In this case, the 34 results of "inset" relate to insetívoro [insectivore] and insetário [insectary].
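The Venn-diagram check described above amounts to two set operations. The sketch below uses Python sets in place of the Venny tool; the title identifiers are hypothetical and far fewer than in the real data.

```python
# Hypothetical title IDs caught by each keyword, for illustration only.
caught_by = {
    "inset": {1, 2, 3, 4, 5, 6},
    "inseto": {1, 2, 3},
    "inseticida": {3, 4},
}

variants = caught_by["inseto"] | caught_by["inseticida"]

# The radical catches every variant if the variants are a subset of its hits.
print(variants <= caught_by["inset"])

# Titles caught only by the radical deserve a manual look; in the text,
# these were the insetívoro and insetário titles.
print(sorted(caught_by["inset"] - variants))
```

The subset test corresponds to the variant circles lying entirely inside the radical's circle, and the set difference corresponds to the region where unintended matches can hide.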

Decision 5: use scientific names in Latin in their full spelling
We noticed some cases where the insect order was referred to by a vernacular word, like dipterofauna or coleópteros. So, we explored for Coleoptera, Diptera, and Lepidoptera whether removing the last vowel (with and without accentuation, e.g., coleóptero) would catch new exclusive results.
By removing the last vowel, we caught an additional 0.1-3% of T&Ds, except for Dipter*, which caught 6% more, but 75% of these were due to Dipteryx, a plant genus, and thus a "mistake." Based on this brief exploration, removing the last vowel did not yield better results.
Another orthographic variant that appeared was the abbreviated form of some orders, like Col., Hym., and Lep. We searched these three cases in the abbreviated form, but the number of mistakes was on the order of 80% for a subsample of the first 50 T&Ds, and we dropped this possibility without further quantification.
Therefore, in the case of insect orders, the correct spelling is the best strategy. For genus names, we did not see any reason to investigate variants to the correct spelling, so we used any scientific name in Latin in the correct spelling.
Decision 6: use Portuguese words in their correct spelling, including accentuation, hyphenation, or other Latin orthography like cedilla (ç)
The Sucupira database does show some level of punctuation, accentuation, orthography, and other issues (though not so many extra or missing spaces), as does any database with more than a million lines, filled in over more than 30 years. Additionally, Portuguese writing changed, most significantly with the Orthographic Agreement of 1990 (the mandatory transition in Brazil occurred in 2016), so some spelling forms had to be checked. That was more relevant for the common name keywords, on which we focus here, although we also explored variants in cases like coleoptero(s) and coleóptero(s) in Decision 5.
For most cases, writing in the correct spelling yielded the best results. For instance, libélula [dragonfly] versus libelula had respectively 19 and 1 T&Ds, and the 1 caught with libelula (the wrong spelling) was also caught by other keywords. Another example is afídeo [aphid], with 35 results, whereas afídio caught 0, afíd caught 37 (the extra 2 being mistakes), and afid caught 12 (all mistakes or caught by other keywords, except for 1 new exclusive result).
After these considerations, we compiled the keyword list including every insect order and the three generic keywords (entom*, inset*, insect*), which yields 9,993 T&Ds. Then we plotted the number of T&Ds gained with additional keywords (Fig. S1) and how these increments added to the 9,993 above in a cumulative plot (Fig. S3). These figures help to see where, according to a very inclusive criterion, we stop gaining enough T&Ds. Objectively, the threshold was a contribution of at least 1% (rounded number) of the total exclusive T&Ds found with these additional keywords (5,249). In both Figures S1 and S3

Data curation
In all databases, we focused especially on columns with people's names, and accentuation mistakes were the most prevalent source of error. We also standardized information for analytical purposes, such as splitting dates given as "month year" or "day/month/year" into two columns, "month" and "year," or merging "abandonment" and "abandoned." We used two strategies for data cleaning. First, we curated T&Ds caught by the keyword list above using filters in the column Main area for Human and Exact sciences and then each of its Disciplines, and read every title.
If a consistent mistake appeared, the second strategy was to filter the column Title by writing the word in the search engine within the Filter tool in Excel (for example, Capistrano, caught because of the keyword Apis) and remove all T&Ds found with that filter. We cleaned the database using this strategy for:
• Insetívor* [insectivore(s)] because of keyword inset*: removed all but those that regarded the diet of insectivore species.
After data curation, the final database of T&Ds caught by keywords related to Entomology sums to 14,448 T&Ds (1% of the total of 1,235,795), with 10,049 of them being theses and 4,399 dissertations, in 1,224 graduate courses (25% of the total of 4,918). We made an exploratory wordcloud (Fig. S4) of words in the title of these 14,448 T&Ds (excluding punctuation, words with wrong accentuation, and some prepositions, conjunctions, etc.) using the online tool
Only 54 graduate courses are responsible for half of these 14,448 T&Ds (Fig. S5) and, among these, 9 of the 11 graduate courses that trained the most master's and doctoral graduates are Entomology graduate courses (EGC, Fig. S5 inset). These top 11 graduate courses contribute 25% of the 14,448 T&Ds, and the 9 EGCs are responsible for 22%. The number of T&Ds correlates with the number of professors in each EGC (see Pearson coefficients in the main text), but also with the age of the EGC (Fig. S6), with the earliest EGCs appearing in the 1970s.
Due to the considerable relevance of EGCs in the realm of T&Ds caught by keywords related to Entomology, we decided that only EGCs would be considered for detailed analyses of particular aspects of graduate courses, such as disciplines, gender bias by supervisor/advisor and by student/post-doc, gender bias of coordinators, etc.
In Brazil, there are 12 EGCs. Nine of them are clearly relevant to academic human resources (Fig. S5 inset). The remaining three contribute 51 T&Ds in total, and they are either inactive (UFMS) or too recent (Public Health USP, since 2015). Although they do not contribute as much as the other EGCs, we decided to include them, so the detailed analyses of graduate courses in the main text cover all EGCs in Brazil.

Acarology in EGCs
One way to check the adequacy of chosen keywords would be to see the percentage of T&Ds caught in the total universe of T&Ds of these 9 most productive EGCs (Fig. S5), with the expectation that nearly 90-100% would be caught with our keywords. The average percentage among these 9 EGCs was 79%, ranging from 68 to 89%, which led us to study the T&Ds our keywords did not catch.
Part of the explanation comes from T&Ds done with mites. Most of the 12 Brazilian EGCs are related to Agronomy, and thus agricultural pest research is a strong research line. Despite mites being arachnids, and not insects, we briefly explored the contribution of mite keywords, looking at six orders (Holothyrida, Ixodida, Mesostigmata, Opilioacarida, Sarcoptiformes, Trombidiformes), two generic keywords (ácar*, acari*), and one genus (Tetranychus). We noticed that Acari was a frequently used word in T&D titles and included it as a generic keyword (for it also brings, e.g., acaricida), despite it not being an order. This keyword in particular led to too many "mistakes" due to, e.g., sacarina [saccharin], polissacarideo [polysaccharide], tabacaria [tobacconist], Camacari [proper name of a town], Peracarida [Crustacea], etc. Using "_acari_" (underline being spaces) caught only 18 results, which is not the real number of Acari in T&D titles; a better search would be, for example, "(Acari_" but this exploration is beyond the scope of our study. In total, we caught 1,725 titles but almost half (736) were mistakes, mostly due to Acari, so we excluded it.
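The trade-off between the plain radical, the space-padded form, and a parenthesis-anchored form can be sketched as follows. The titles are hypothetical, and the third pattern is a simplification of the "(Acari_" suggestion (it drops the trailing space), so it should be read as an assumption rather than as a tested search strategy.

```python
# Hypothetical titles, for illustration only.
titles = [
    "Tetranychus urticae (Acari: Tetranychidae) em morangueiro",
    "Efeito da sacarina em ratos",
    "Polo industrial de Camacari",
    "Diversidade de Acari em solo",
]


def count(pattern):
    """Number of titles containing `pattern`, ignoring case."""
    return sum(pattern.lower() in t.lower() for t in titles)


print(count("acari"))    # plain radical: also catches sacarina and Camacari
print(count(" acari "))  # padded with spaces: misses "(Acari: ...)"
print(count("(acari"))   # anchored on the opening parenthesis
```

In this toy set, the plain radical matches all four titles, two of them wrongly, while each stricter pattern recovers only one of the two genuine mite titles. This is the same tension between inclusiveness and precision reported above.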