Background

Hemiptera is the fifth most species-rich order of insects [1,2,3] and the most species-rich order of hemimetabolous insects, with more than 100,000 described species [4]. Many hemipteran species are agricultural pests such as aphids [5,6,7] and planthoppers [8,9,10,11]; some are vectors of human diseases including kissing bugs (Rhodnius prolixus, a Chagas disease vector) [12, 13] and blood-feeding bed bugs (Cimex lectularius) [14,15,16]. The origin of Hemiptera has been traced back to the Carboniferous period; an extinct family, Aviorrhynchidae, was dated to the Moscovian stage and was believed to be the most basal lineage of Hemiptera [17, 18]. Since then, major lineages of Hemiptera diversified during different geological periods: the Late Permian saw the emergence of cicadomorphans [19]; the Triassic witnessed the rise of water bugs (Nepomorpha) [19]; the Jurassic saw the emergence of other heteropteran infraorders like Cimicomorpha and Pentatomomorpha, as well as planthoppers [19, 20]; and the Cretaceous period marked the appearance of most recent sternorrhynchans [21] and enicocephalids [19]. Early hemipterans likely fed on gymnosperm plants, but the mid-Cretaceous biotic reorganization, involving gymnosperm extinctions and angiosperm diversification, accelerated their evolution [21]. Food supply may have been a biotic factor influencing Hemiptera evolution [21], with diversity linked to angiosperm radiations [2, 18, 22].

Traditionally, Hemiptera has been classified into four major suborders: Sternorrhyncha (aphids, scale bugs, whiteflies, and psyllids), Auchenorrhyncha (planthoppers, cicadas, leafhoppers, spittlebugs, and froghoppers), Coleorrhyncha (moss bugs), and Heteroptera (true bugs) [18, 23]. Some morphological analyses indicated that Sternorrhyncha was sister to Auchenorrhyncha, forming either a separate order “Homoptera” or a suborder in the order Hemiptera [24, 25]. However, other morphological studies indicated that “Homoptera” was not monophyletic [26]. Molecular analyses of 18S rDNA sequences identified Sternorrhyncha as the sister group of all other Hemiptera [27,28,29]. This relationship has been supported by sequence analyses of multiple genes [30], mitochondrial genomes [22, 31], and transcriptomes [18, 32,33,34].

The suborder Sternorrhyncha comprises four major groups, typically recognized as superfamilies: Psylloidea (jumping plant lice), Aleyrodoidea (whiteflies), Aphidoidea (aphids), and Coccoidea (scale insects) [23]. The relationships among these four superfamilies have been contentious. Morphological analyses by Schlee [35, 36] supported a sister-group relationship between the superfamilies Aleyrodoidea and Psylloidea. Molecular studies based on 18S rDNA sequences supported Psylloidea as the sister group of all other Sternorrhyncha superfamilies [27, 37], whereas transcriptome-based analyses supported Aleyrodoidea as the sister group of all other superfamilies [32, 34, 38]. The superfamily Aphidoidea has three families: Phylloxeridae, Adelgidae, and Aphididae. Phylloxeridae and Adelgidae contain exclusively oviparous aphids and are recognized as a separate superfamily Phylloxeroidea in some studies; in such cases, the superfamily Aphidoidea contains only the viviparous aphids [39,40,41].

The suborder Auchenorrhyncha comprises two infraorders: Fulgoromorpha and Cicadomorpha [30, 42]. Fulgoromorpha consists of a single superfamily Fulgoroidea, while Cicadomorpha encompasses three superfamilies: Cicadoidea, Cercopoidea, and Membracoidea [30, 42,43,44]. The monophyly of Auchenorrhyncha has been a subject of debate, with conflicting evidence from both morphological and molecular studies (as reviewed in [30]). Earlier investigations using 18S rDNA gene sequences [27,28,29] and mitochondrial genome sequences [22, 31, 42, 45, 46] indicated that Auchenorrhyncha is not monophyletic. However, more recent studies utilizing multiple gene sequences [30, 47], transcriptome sequences [32,33,34], and ultraconserved elements [4] lent support to the monophyly of Auchenorrhyncha. Within the infraorder Cicadomorpha, the relationships among the three superfamilies are also controversial. Two conflicting hypotheses have both gained substantial support: some analyses propose Cicadoidea and Cercopoidea together as sister to Membracoidea, while others indicate Cercopoidea and Membracoidea together as sister to Cicadoidea [32, 34, 38]. Recent transcriptome-based analyses did not resolve the relationships among these superfamilies [32, 34, 38].

The suborder Coleorrhyncha has only 36 species in one family, Peloridiidae [48,49,50], in contrast to the suborder Heteroptera, which has more than 40,000 species in about 89 families [51]. Several early studies grouped Coleorrhyncha and Heteroptera together as Heteropterodea [26, 52,53,54]. This grouping was supported by both morphological and molecular analyses [22, 28, 30, 46, 55,56,57,58,59,60]. Recent transcriptome-based studies, however, supported a sister-group relationship between Coleorrhyncha and Auchenorrhyncha [32,33,34].

In this study, we sequenced: (1) the genomes of 63 species of hemipteran insects from 25 families, 15 superfamilies, and 3 suborders; (2) the transcriptomes of 2 species of Coccoidea insects; and (3) the genome of one species from the order Thysanoptera. By combining these data with existing genomic and transcriptomic data, we constructed comprehensive supermatrices comprising 315 terminal species. With the substantially expanded genomic and transcriptomic data, we aimed to resolve the high-level relationships in Hemiptera and its suborders, and understand the divergence time and evolution of major hemipteran groups.

Results

Orthology assignment and matrix statistics

We identified and extracted on average 1625 single-copy genes from the genome and transcriptome assemblies across 315 terminal taxa. Individual nucleotide sequence alignments were concatenated to create a full-site supermatrix comprising 2,096,643 aligned nucleotide sites. A concatenated amino acid sequence supermatrix contains 719,819 aligned amino acid positions. In addition to these two supermatrices, we used various data treatment methods and loci filtering strategies to generate an additional eight nucleotide supermatrices and six amino acid supermatrices. All of these supermatrices were utilized in phylogenetic analyses under the maximum likelihood (ML) optimality criterion. The detailed statistics for these supermatrices are presented in Table 1.
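The concatenation step described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name `concatenate_alignments`, the gap character used for taxa missing from a locus, and the `geneN` partition labels are all assumptions made for illustration. Each per-gene alignment is represented as a dictionary mapping a taxon name to its aligned sequence.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts {taxon: aligned_sequence}.
    Taxa absent from a gene are padded with `missing` characters
    (an assumption of this sketch).
    Returns (supermatrix, partitions), where partitions holds
    1-based inclusive (name, start, end) coordinates per gene.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for index, aln in enumerate(gene_alignments, start=1):
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene {index} is empty or not aligned")
        length = lengths.pop()
        for taxon in taxa:
            # Pad taxa that lack this locus so every row keeps equal length.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((f"gene{index}", start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Toy example with two loci and three hypothetical taxa.
genes = [
    {"taxonA": "ACGT", "taxonB": "ACGA"},
    {"taxonA": "GG", "taxonC": "GT"},
]
matrix, parts = concatenate_alignments(genes)
```

Keeping the partition coordinates alongside the matrix allows the same loci to be filtered or re-partitioned later without re-aligning.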

Table 1 Summary of the sequence alignments used in phylogenetic analyses in this study

The suborder-level relationships in Hemiptera

Our analyses of all the concatenated datasets consistently supported the monophyly of three suborders: Sternorrhyncha, Coleorrhyncha, and Heteroptera (Fig. 1 and Additional file 1: Figs. S1–S16). The suborder Auchenorrhyncha was recovered as monophyletic in our analyses of nucleotide dataset NT_12 (Fig. 1 and Additional file 1: Fig. S1) and all amino acid datasets (Additional file 1: Figs. S2 and S11–S16). Sternorrhyncha was recovered consistently in all analyses as the sister group of the remaining Hemiptera with strong support (bootstrap support value of 100%). Coleorrhyncha was identified as the sister group of the monophyletic Auchenorrhyncha in the analyses of nucleotide dataset NT_12 and all amino acid datasets. In the analyses of other nucleotide datasets, Coleorrhyncha was nested within Auchenorrhyncha, being the sister group of either Cicadomorpha or Membracoidea (Additional file 1: Figs. S3–S10). Coleorrhyncha and Auchenorrhyncha together were found to be the sister group of Heteroptera (Figs. 1 and 2 and Additional file 1: Figs. S1–S17).

Fig. 1

Phylogenetic relationships among major groups of Hemiptera. (Left) Maximum likelihood tree inferred from nucleotide dataset NT_12 (1,370,056 nucleotide positions). Here we show a schematic version of the ML tree with superfamilies collapsed for clarity. Only bootstrap values < 100 are shown at branches. Different suborders are indicated by colored backgrounds. The full tree with branch lengths is presented in supplementary Fig. S1. (Right) Maximum likelihood tree inferred from amino acid dataset AA_RCFV (131,479 amino acid positions). The full tree with branch lengths is available in supplementary Fig. S2

Fig. 2

The species tree relationships in Hemiptera reconstructed using gene tree summarization in ASTRAL. Results are shown from analyses of gene trees derived for each locus included in the dataset AA. Numbers at nodes show the local posterior probabilities. Different suborders are indicated by colored backgrounds. The blue regions in the simplified tree highlight the placement of the partial tree within the entire tree topology

The coalescent-based species tree inferred from the amino acid dataset AA_all is consistent with the concatenated phylogenetic analyses of both the amino acid datasets and nucleotide dataset NT_12 on the suborder-level relationship in Hemiptera (Fig. 2). The coalescent-based species tree inferred from the nucleotide dataset NT_all placed Coleorrhyncha within Cicadomorpha, forming a sister-group relationship with Membracoidea (Additional file 1: Fig. S17).
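The difference between Coleorrhyncha being sister to Auchenorrhyncha and being nested within it can be made concrete with a small monophyly check. The Python sketch below is only an illustration: the two nested-tuple trees are heavily simplified, rooted schematics of the alternative placements described in this section, and the names `leaf_set`, `is_monophyletic`, `SISTER_TREE`, and `NESTED_TREE` are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some subtree contains exactly the taxa in `group`."""
    group = frozenset(group)
    if leaf_set(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


AUCHENORRHYNCHA = {"Fulgoromorpha", "Cicadomorpha"}

# Schematic of the NT_12 and amino acid results:
# Coleorrhyncha sister to a monophyletic Auchenorrhyncha.
SISTER_TREE = ("Sternorrhyncha",
               (("Coleorrhyncha", ("Fulgoromorpha", "Cicadomorpha")),
                "Heteroptera"))

# Schematic of one alternative nucleotide result:
# Coleorrhyncha sister to Cicadomorpha, hence nested within Auchenorrhyncha.
NESTED_TREE = ("Sternorrhyncha",
               (("Fulgoromorpha", ("Coleorrhyncha", "Cicadomorpha")),
                "Heteroptera"))

sister_ok = is_monophyletic(SISTER_TREE, AUCHENORRHYNCHA)
nested_ok = is_monophyletic(NESTED_TREE, AUCHENORRHYNCHA)
```

In both schematics, Coleorrhyncha and Auchenorrhyncha together still form a clade that is sister to Heteroptera, which is the relationship recovered across all analyses.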

The superfamily-level relationships in Cicadomorpha

The infraorder Cicadomorpha was supported as a monophyletic group in our phylogenetic analyses of 10 of the 16 concatenated datasets but not the other six nucleotide datasets when the suborder Coleorrhyncha was included. Furthermore, 9 out of 16 four-cluster likelihood mapping (FcLM) analyses provided strong support for a monophyletic Cicadomorpha (Additional file 1: Fig. S18). Within Cicadomorpha, Cicadoidea and Cercopoidea were sister groups, forming a clade that was sister to Membracoidea in all concatenated amino acid datasets and the nucleotide datasets NT_80, NT_12, and NT_GC (Table 2), with strong bootstrap support (BS = 100). Furthermore, the superfamily relationships within Cicadomorpha as (Membracoidea, (Cicadoidea + Cercopoidea)) were supported by coalescent-based species tree estimation using the amino acid dataset AA_all (Fig. 2). All our FcLM analyses confirmed the sister-group relationship between Cicadoidea and Cercopoidea (Fig. 3) except for the FcLM analysis using the NT_parsimony dataset, which indicated a sister-group relationship either between Cicadoidea and Cercopoidea (37.1%) or between Cercopoidea and Membracoidea (37.4%).
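For readers unfamiliar with likelihood mapping, the way quartet percentages such as those above arise can be sketched in Python. This is a simplified illustration rather than the software used in this study. It assumes the common formulation in which the three log-likelihoods of a quartet's possible topologies are converted into posterior weights, and each quartet is then assigned to the nearest of seven attractors in the resulting triangle: three corners (one topology favoured), three edge midpoints (two topologies indistinguishable), and the centre (star-like, unresolved). The labels `T1`, `T2`, `T3` and all function names are hypothetical.

```python
import math
from collections import Counter

# Attractor points in the space of posterior weights (w1, w2, w3).
ATTRACTORS = {
    "T1": (1.0, 0.0, 0.0),
    "T2": (0.0, 1.0, 0.0),
    "T3": (0.0, 0.0, 1.0),
    "T1|T2": (0.5, 0.5, 0.0),
    "T1|T3": (0.5, 0.0, 0.5),
    "T2|T3": (0.0, 0.5, 0.5),
    "star": (1 / 3, 1 / 3, 1 / 3),
}


def posterior_weights(log_likelihoods):
    """Normalise three log-likelihoods into weights that sum to 1."""
    best = max(log_likelihoods)
    # Subtract the maximum before exponentiating to avoid underflow.
    rel = [math.exp(x - best) for x in log_likelihoods]
    total = sum(rel)
    return tuple(r / total for r in rel)


def classify_quartet(log_likelihoods):
    """Assign a quartet to the nearest attractor (assumption of this sketch)."""
    weights = posterior_weights(log_likelihoods)
    return min(ATTRACTORS, key=lambda name: math.dist(weights, ATTRACTORS[name]))


def region_percentages(quartet_log_likelihoods):
    """Percentage of quartets falling in each region."""
    counts = Counter(classify_quartet(ll) for ll in quartet_log_likelihoods)
    n = sum(counts.values())
    return {region: 100.0 * c / n for region, c in counts.items()}


# Toy input: one clearly resolved quartet and one fully unresolved quartet.
toy = [[-1000.0, -1010.0, -1010.5], [-500.0, -500.0, -500.0]]
summary = region_percentages(toy)
```

Under this scheme, near-equal percentages for two competing topologies, as seen for the NT_parsimony dataset, indicate that the quartets do not discriminate between them.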

Table 2 Support for different phylogenetic relationships across datasets
Fig. 3

Results of FcLM analyses on various datasets for Cicadomorpha superfamily phylogenetic relationships

The superfamily-level relationships in Sternorrhyncha

Within the suborder Sternorrhyncha, Aphidoidea and Phylloxeroidea consistently formed a monophyletic group, the Aphidomorpha clade (Figs. 1 and 2 and Additional file 1: Figs. S1–S17). In addition, the relationship between the superfamilies Coccoidea and Aphidoidea + Phylloxeroidea was well supported in all of our phylogenetic analyses (BS = 100). Most of our analyses identified the superfamily Aleyrodoidea as the sister to the rest of Sternorrhyncha. Psylloidea formed a sister relationship with a clade consisting of Coccoidea and Aphidoidea + Phylloxeroidea. The major relationships within Sternorrhyncha as (Aleyrodoidea, (Psylloidea, (Coccoidea, Aphidoidea + Phylloxeroidea))) were supported by all nucleotide datasets (Additional file 1: Figs. S1 and S3–S10) and amino acid datasets of AA_RCFV (Additional file 1: Fig. S2), AA_50 (Additional file 1: Fig. S12), and AA_SHI (Additional file 1: Fig. S16). In ML analyses of amino acid datasets such as AA_all, AA_80, AA_parsimony, and AA_RCV, a sister-group relationship between Aleyrodoidea and Psylloidea was retrieved (Additional file 1: Figs. S11 and S13–S15). This grouping was supported by less than 29% of quartets in FcLM analyses (Additional file 1: Fig. S19). In contrast, a sister-group relationship between Psylloidea and the clade (Coccoidea, Aphidoidea + Phylloxeroidea) received higher quartet support (40.2 ~ 63.8%).

Feeding modes of Hemiptera

Our analyses supported phytophagy as the ancestral feeding mode of Hemiptera (Additional file 1: Fig. S20). From the reconstructed character states, it is evident that most hemipteran lineages are phytophagous. Specifically, members of the homopteran groups, namely Sternorrhyncha and Auchenorrhyncha, are exclusively phytophagous. Moss feeding, a distinctive form of phytophagy, is observed mostly within Coleorrhyncha (Additional file 1: Fig. S20). The common ancestor of Dipsocoromorpha, Enicocephalomorpha, and Gerromorpha was inferred to be carnivorous (i.e., feeding on flesh), as was the common ancestor of Nepomorpha. Carnivory may also represent an ancestral state for Heteroptera as a whole (Additional file 1: Fig. S20). Blood-feeding was predominantly observed in the family Reduviidae in Reduvioidea, and also in the genus Cimex in Cimicoidea. The common ancestor of Reduviidae, however, was inferred to be predominantly carnivorous. The ancestral feeding mode of Cimicomorpha remained uncertain, with either carnivory or phytophagy possible. Predatory behaviors were also observed in species from Lygaeoidea and Aradoidea (Additional file 1: Fig. S20). The majority of Pentatomomorpha were phytophagous, and their common ancestor was inferred to be phytophagous as well.

Divergence time estimates

Our divergence time analysis supports an early Carboniferous origin of Hemiptera (Additional file 1: Fig. S21). The divergence of Hemiptera from other paraneopteran insects was dated back to ~ 354 million years ago (95% HPD 321–379 Mya), and the initial divergence between Sternorrhyncha and the rest of Hemiptera occurred ~ 347 million years ago (Additional file 1: Fig. S21, 95% HPD 311–376 Mya). This timing corresponds well with the oldest known fossils of the group, dating back to the Carboniferous period [17]. Sternorrhyncha diversified into Aleyrodoidea and other sternorrhynchan lineages ~ 339 million years ago (95% HPD: 298–374 Mya), during the early Carboniferous. Members of Coleorrhyncha shared a common ancestor ~ 288 million years ago (95% HPD: 230–353 Mya), in the early Permian.

Discussion

We conducted a comprehensive phylogenomic analysis of Hemiptera in the present study, with an expanded taxon sampling of 298 species representing all major lineages of Hemiptera, nearly double the taxon size of previous studies [32]. We did additional FcLM analyses to further explore the superfamily relationships in Cicadomorpha that remained unresolved in previous studies. Overall, our results provided a well-resolved phylogeny of Hemiptera. The consistently high branch support values along internal nodes reflected the stability of the trees given the analyzed data. Additionally, the arrangement of major clades in the concatenation-based analyses exhibited substantial congruence. These findings demonstrate the robustness and reliability of our phylogenetic reconstruction and suggest a strong consensus among different datasets and methods. This updated understanding of Hemiptera phylogeny will serve as a valuable resource for future research on this diverse and ecologically important group of insects.

Effect of data coding and filtering on phylogenetic reconstructions

In assessing the impact of data coding on phylogenetic reconstructions, it was observed that three amino acid datasets (AA_50, AA_RCFV, and AA_SHI) and the nucleotide dataset NT_12 consistently produced identical tree topology for the superfamily relationships within Sternorrhyncha and Auchenorrhyncha, as well as the placement of Coleorrhyncha (Additional file 1: Figs. S1, S2, S12, and S16). However, there were several discrepancies between amino acid and nucleotide datasets (Table 2) with the following two major ones observed:

(1) Placement of Coleorrhyncha relative to auchenorrhynchan lineages.

Amino acid datasets consistently supported Coleorrhyncha as the sister group of Auchenorrhyncha, with robust support for the monophyly of Auchenorrhyncha. Among the series of nucleotide datasets examined, only the NT_12 dataset supported the sister-group relationship between Coleorrhyncha and Auchenorrhyncha (Additional file 1: Fig. S1). Conversely, the remaining nucleotide datasets positioned Coleorrhyncha within Auchenorrhyncha. The nucleotide datasets NT_80 and NT_GC recovered Coleorrhyncha as the sister group of Cicadomorpha (Additional file 1: Figs. S5 and S7). However, in the remaining nucleotide datasets, Coleorrhyncha was consistently placed deep within Cicadomorpha as the sister group of Membracoidea (Additional file 1: Figs. S3, S4, S6, and S8–S10). As we progressed from analyses based on NT_12 to NT_80 and NT_GC, and to all other nucleotide datasets, the position of Coleorrhyncha appeared to be increasingly nested within Auchenorrhyncha.

(2) Interrelationships among Sternorrhyncha superfamilies.

Nucleotide datasets uniformly recovered a branching pattern of (Aleyrodoidea, (Psylloidea, (Coccoidea, Aphidoidea + Phylloxeroidea))), with significant support for all nodes. Amino acid datasets also supported a sister-group relationship between Coccoidea and Aphidoidea + Phylloxeroidea. However, four amino acid datasets (AA_all, AA_80, AA_parsimony, and AA_RCV) indicated a sister-group relationship between Aleyrodoidea and Psylloidea. The nodal support for Aleyrodoidea + Psylloidea is lower (BS = 63 ~ 87, Additional file 1: Figs. S11 and S13–S15) than that for (Aleyrodoidea, (Psylloidea, (Coccoidea, Aphidoidea + Phylloxeroidea))) (BS = 100, Additional file 1: Figs. S1–S10, S12, and S16).

We also applied alternative data filtering methods under the same nucleotide or amino acid coding strategy to investigate the effect on phylogenetic reconstructions (Table 2). Summary of the sequence alignments showed that datasets filtered based on the number of parsimony-informative sites (i.e., datasets NT_parsimony and AA_parsimony) consistently retained more parsimony-informative sites than those filtered by other methods (Table 1). FcLM analyses also revealed that NT_parsimony and AA_parsimony datasets had a higher percentage of informative sites compared to other methods (Fig. 3 and Additional file 1: Figs. S18–S19). However, for nucleotide datasets, none of the data filtering strategies significantly affected phylogenetic relationships. Regarding the support values of major nodes and superfamily relationships, the amino acid datasets AA_RCFV and AA_SHI produced trees congruent with those from the nucleotide datasets. Thus, the RCFV and SHI filtering methods may be more suitable for the phylogenomic analysis of amino acid data in Hemiptera.
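The quantity underlying the parsimony-based filtering can be illustrated with a short Python sketch. It uses the standard definition of a parsimony-informative site: a column in which at least two character states each occur in at least two sequences. The function name `count_parsimony_informative` and the choice to ignore only gap and `?` characters are assumptions for illustration; the thresholds actually used to retain or discard loci in this study are not reproduced here.

```python
from collections import Counter


def count_parsimony_informative(sequences, ignore="-?"):
    """Count parsimony-informative columns in an alignment.

    sequences: list of equal-length aligned strings.
    A column is informative if at least two states each appear
    in at least two sequences; characters in `ignore` are skipped.
    """
    informative = 0
    for column in zip(*sequences):
        states = Counter(c for c in column if c not in ignore)
        shared_states = sum(1 for n in states.values() if n >= 2)
        if shared_states >= 2:
            informative += 1
    return informative


# Toy alignment: columns 2 and 3 are informative, columns 1 and 4 are not.
toy_alignment = ["AAGT", "AAGC", "ACTT", "ACTA"]
n_informative = count_parsimony_informative(toy_alignment)
```

Ranking loci by this count (or by its proportion of the alignment length) is one simple way to build a dataset enriched in phylogenetic signal.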

High-level relationships of Hemiptera

Our expanded genomic and transcriptomic data confirm the most basal position of Sternorrhyncha within Hemiptera, consistent with recent transcriptome-based phylogenomic analyses [18, 32,33,34], as well as earlier morphological [61] and molecular studies [27,28,29,30]. Stål was the first to suggest that Sternorrhyncha was sister to all other Hemiptera based on morphological comparison [62]. Cobben identified two stylet structure characteristics that distinguished Sternorrhyncha from Heteroptera and homopteran lineages: the relative orientation of the mandibles and the location of the salivary canal [63]. Subsequent fossil evidence [64] and additional morphological comparison [61, 65] further supported that the first divergence in Hemiptera occurred between Sternorrhyncha and the clade Euhemiptera, which comprises Auchenorrhyncha, Coleorrhyncha, and Heteroptera [66].

All our analyses based on amino acid datasets and the nucleotide dataset NT_12 strongly support the monophyly of Auchenorrhyncha. This is consistent with the morphological evidence summarized by Cryan and Urban [30] for the monophyly of Auchenorrhyncha including (1) a complex tymbal acoustic system on abdominal segment I [53]; (2) an aristoid antennal flagellum [53]; (3) a labium originating from the posterior region of the ventral head surface, close to the occiput and lacking an intervening sclerotic gula [67]; (4) the relative location of the hindgut junction [68]; (5) a pronotal coverage of the mesonotal fore margin, stridulatory and auditory organs, jumping ability, antennal structure, and various wing features [26]; (6) a reduced or entirely membranous proximal median plate [69]; and (7) internal reproductive structures such as male lateral ejaculatory ducts [70]. Earlier molecular studies, based on 18S rDNA [60] and mitochondrial genome sequences [22, 31, 42, 44, 45, 71], raised doubts about the monophyly of Auchenorrhyncha due to the separate position of Fulgoromorpha. However, recent studies utilizing transcriptome sequences [32,33,34] and ultraconserved element data [4] provided support for Auchenorrhyncha as a monophyletic group, which was confirmed in the present study.

Schlee originally proposed that Peloridiidae and Heteroptera were sister taxa, together forming the clade Heteropteroidea (= Heteropterodea) [72], which was supported by Forero [23]. Cobben, however, criticized the synapomorphies proposed by Schlee as superficial and argued against the monophyly of Heteropterodea [63]. Evans regarded Peloridiidae as a separate suborder of Hemiptera alongside “Homoptera” and Heteroptera [49]. Based on genital morphology, Bourgoin suggested a close relationship between Peloridiidae and Fulgoromorpha [73]. Coleorrhyncha has often been considered the sister group of Heteroptera based on 18S rDNA sequence data [27, 58,59,60] and various morphological characters [28, 74,75,76]. However, more recent phylogenomic analyses supported Coleorrhyncha as the sister taxon to Auchenorrhyncha [32,33,34]. In this study, our analyses of the amino acid datasets and of the nucleotide dataset with third codon positions removed strongly support Coleorrhyncha as the sister group of Auchenorrhyncha. This grouping is consistent with transcriptome-based phylogenomic analyses [32,33,34] and morphological analyses of wing base structure [77].

We showed that the three infraorders of Heteroptera (Gerromorpha, Dipsocoromorpha + Enicocephalomorpha) formed a clade, sister to all other heteropterans. These relationships were consistent with previous studies that analyzed transcriptome sequence data [32, 34, 78]. The clade (Gerromorpha, Dipsocoromorpha + Enicocephalomorpha) was also supported by a previous study that included combined analyses of morphological and molecular data [79]. Moreover, Weirauch et al. identified three morphological synapomorphies supporting the monophyly of the clade consisting of Gerromorpha, Dipsocoromorpha, and Enicocephalomorpha [79]. All concatenated datasets, except for AA_RCFV and AA_SHI, and the species tree analyses indicated that Nepomorpha was the sister group of a large clade comprising Cimicomorpha and Pentatomomorpha. This result is consistent with previous transcriptome-based phylogenomic studies [32, 34, 78], but contrary to a study using a combined analysis of multiple-gene sequence data and morphological characters [79]. The latter study identified Nepomorpha as the sister group of all other Heteroptera. However, the most-basal position of Nepomorpha within Heteroptera was not supported by recent phylogenomic analyses using genomic and/or transcriptomic data [18, 32, 34, 78]. The current study strongly supported the sister-group relationship between Cimicomorpha and Pentatomomorpha, which was consistent with several earlier studies [18, 32, 34, 78, 79].

The phylogenetic relationships among superfamilies

Compared to recent phylogenomic studies that relied solely on transcriptomic data [32, 34, 38], the present study combined genomic and transcriptomic data and provided robust support for stable superfamily relationships within Cicadomorpha, specifically as (Membracoidea, Cicadoidea + Cercopoidea). The competing hypothesis of (Cicadoidea, Cercopoidea + Membracoidea), commonly found in previous phylogenomic studies using transcriptomic data [32, 34, 38], was not supported in any of our analyses. Hamilton proposed that Cicadoidea is the sister group of Membracoidea + Cercopoidea based on the analysis of head morphological traits [25]. However, Emel’yanov supported a sister-group relationship between Cercopoidea and Cicadoidea based on more comprehensive morphological character analyses [65]. In addition, analyses based on antennal characters [80] and structural features of brochosomes, proteinaceous particles secreted by glandular regions of the Malpighian tubules [81], supported Membracoidea as a basal group within Cicadomorpha.

Most of our analyses supported Aleyrodoidea as the sister group of the remaining Sternorrhyncha, which was consistent with previous transcriptome-based studies [32,33,34]. Our analyses also supported the reciprocal monophyly of Coccoidea and Aphidoidea + Phylloxeroidea, consistent with these previous studies as well [32, 34]. The sister-group relationship between Coccoidea and Aphidoidea + Phylloxeroidea also received wide support in previous studies based on morphological characters [35, 36, 65].

In the current study, we included 17 superfamilies of Heteroptera. The monophyly of the superfamily Notonectoidea was not supported in our ML analyses of NT_12, AA_80, and AA_parsimony datasets (Additional file 1: Figs. S1, S13, and S14) but was supported in other ML analyses and coalescent-based species tree estimations (Fig. 1, Additional file 1: Figs. S2–S12 and S15–S17). Previous transcriptome-based analyses also provided support for the monophyly of Notonectoidea [32, 34, 78]. Currently, genome sequences are available for only two species of Notonectoidea; both were included in the present study. Further investigation with additional taxon sampling is required to confirm the monophyly of this superfamily. Based on phylogenomic analyses, we propose that Tingidae and Miridae should be elevated to the superfamily rank as Tingoidea and Miroidea, respectively. This suggestion is consistent with the results of a previous phylogenomic study by Song and Zhang [34]. Tingoidea and Miroidea formed a sister group, a relationship also recovered in previous analyses that combined molecular and morphological data [79].

Evolution of feeding modes and divergence time estimation

Members of the suborders Sternorrhyncha, Auchenorrhyncha, and Coleorrhyncha are exclusively phytophagous, and many species within these groups (e.g., aphids, scale insects, whiteflies, and planthoppers) are considered pests of cultivated crops. In contrast to the exclusively phytophagous insects of these three suborders, Heteroptera exhibit a wide variety of feeding behaviors. Within Heteroptera, species in the two largest infraorders, Cimicomorpha (> 20,000 described species) and Pentatomomorpha (> 16,000 species), are mostly phytophagous. However, many species in these and other heteropteran infraorders exhibit diverse feeding modes, including phytophagy, predation, hematophagy, mixed feeding strategies, and scavenging. Understanding the evolution of feeding modes in Heteroptera will help explain this group’s ecological success and its biodiversity.

Our ancestral state reconstructions revealed diversified feeding modes in Heteroptera. The observation that the predatory Gerromorpha, Dipsocoromorpha, and Enicocephalomorpha form a clade sister to the remaining heteropteran lineages indicates that predation is likely an ancestral state relative to other feeding modes. Our reconstruction of feeding modes supports Cobben’s hypothesis that the most recent common ancestor of Heteroptera was likely predatory [63, 82]. However, this contrasts with Sweet’s hypothesis, which inferred a phytophagous heteropteran ancestor [83]. Previous studies utilizing complete mitochondrial genomes [22] and combined multi-gene sequence data and morphological characters [79] also supported a predatory ancestor of Heteroptera.

Our analysis indicated phytophagous heteropteran lineages likely evolved from predatory ancestors. Schuh and Slater suggested that phytophagy likely evolved independently at least twice within Heteroptera [3]. In contrast, Li et al. proposed that phytophagy evolved only once in Heteroptera, specifically in the most recent common ancestor of Pentatomomorpha and Miroidea, which formed a clade in their analysis [22]. More recent analyses combining morphological and molecular data indicated multiple transitions from a predatory lifestyle to phytophagy and other feeding types across different infraorders of Heteroptera [79]. Our ancestral state reconstructions also supported multiple independent evolutions of phytophagy in the infraorders Cimicomorpha and Pentatomomorpha, consistent with Weirauch et al. [79]. The most recent common ancestor of Cimicomorpha likely exhibited a carnivorous feeding mode. However, phytophagy has independently evolved from predatory ancestors in Pachynomidae, Nabidae, Tingoidea, and Miroidea. Additionally, there were at least two independent transitions from predation to hematophagy (blood-feeding) in Reduviidae and Cimicidae.

The most recent common ancestor of Pentatomomorpha was reconstructed as phytophagous, with subsequent transitions to mixed feeding strategies (phytophagous or feeding on fungi, algae, and decaying plant material in Aradidae; phytophagous and/or carnivorous in Berytidae and Lygaeidae), and exclusively carnivorous feeding in Pachygronthidae, Colobathristidae, and Geocoridae. In certain clades of Lygaeoidea, a reverse transition from phytophagy to predation was observed, consistent with the findings of a previous study by Li et al. [22]. The divergence time estimates indicate that phytophagous species within Lygaeoidea emerged 100 million years ago, coinciding with the rise of angiosperms around the same time. Furthermore, it is worth noting that most phytophagous insects in the suborder Heteroptera appeared after 100 Mya. Hence, the transition of heteropteran insects from predation to herbivory appears to be closely linked with the flourishing of angiosperms during the Cretaceous period. Within the Reduviidae clade, blood-feeding may have evolved from a predatory mode. In another lineage of Heteroptera, Cimicoidea, which includes the notorious Cimex species, the ancestral feeding mode remains ambiguous; both phytophagy and predation are possible based on the evidence currently available.

The estimated divergence time for Sternorrhyncha from all other Hemiptera (~ 347 Mya) is comparable to previous analyses, which estimated divergence at 309 Mya [22], 373 Mya [33], and 386 Mya [32]. Our findings were consistent with a previous study by Shcherbakov [84], indicating that Hemiptera diversified into superfamilies primarily during the Permian period. The diversification of hemipteran species was estimated to have occurred between 60 and 150 Mya, largely in line with the flourishing period of angiosperms during the Cretaceous period.

Conclusions

We demonstrated that phylogenomic analyses with expanded taxonomic sampling could enhance the accuracy of hemipteran phylogeny. Our combined analysis of genome and transcriptome sequences confirmed several high-level relationships proposed by previous transcriptome-based studies. These relationships include the basal position of Sternorrhyncha and a close relationship between Coleorrhyncha and Auchenorrhyncha. Notably, in most analyses, the superfamily relationships within Cicadomorpha were robustly resolved, supporting Membracoidea as the sister group of Cicadoidea + Cercopoidea. Our ancestral state reconstruction supports phytophagy as the primary feeding strategy for the entire Hemiptera lineage, while predation likely represents an ancestral state within Heteroptera. Our divergence time estimation highlights the close association of hemipteran diversification with the evolutionary success and radiation of angiosperms.

Methods

Taxon sampling

We collected specimens of 64 species for genome sequencing, including (1) 13 species of Sternorrhyncha (5 species from Aleyrodoidea, 4 from Aphidoidea, 2 from Coccoidea, and 2 from Psylloidea); (2) 30 species of Auchenorrhyncha (10 species from Fulgoromorpha and 20 from Cicadomorpha, including 2 from Cicadoidea, 7 from Cercopoidea, and 11 from Membracoidea); (3) 20 species of Heteroptera; and (4) one species of Thysanoptera (Additional file 2: Table S1). We also collected specimens of two species of Coccoidea, Dysmicoccus neobrevipes Beardsley, 1959 and Phenacoccus solani Ferris, 1918, for transcriptome sequencing (Additional file 2: Table S1).

Genome sequencing and assembly

Genomic DNA was extracted from thoracic muscle tissues using the TIANamp Genomic DNA Kit (TIANGEN BIOTECH CO., LTD) according to the manufacturer’s protocols. DNA was quantified with the Quant-iT PicoGreen dsDNA Assay Kit; DNA integrity was assessed by electrophoresis on a 1% agarose gel. The extracted DNA samples were stored at −20 °C before sequencing. Genomic DNA was sonicated to a fragment size of 400 base pairs (bp) using a Covaris E220 Focused-ultrasonicator with Covaris microTUBES. Library preparation was carried out using the TruSeq DNA Sample Prep Kit (Illumina). Genome sequencing was conducted on an Illumina NovaSeq platform by Shanghai Personalbio Technology Co., Ltd. A minimum of 30 Gb of raw reads (paired-end, 150 bp each) was produced from each DNA sample. The quality of the raw data was assessed using FastQC [85]. Adapters were trimmed and low-quality reads were removed using Trimmomatic v0.32 [86]. Only high-quality reads with Q30 > 90% were retained for subsequent genome assembly using Minia v3.2.4 [87] with a multiple k-mer strategy (k-mer values of 21, 41, 61, 81, 101, and 121).
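As a rough illustration of the Q30 criterion mentioned above, the share of bases with Phred quality of at least 30 can be computed directly from FASTQ quality strings. The sketch below assumes Phred+33 encoding and a hypothetical function name; it is not the filtering code used in this study, and how the Q30 threshold was applied (per read or per sample) is not specified in the text.

```python
def q30_fraction(quality_strings, offset=33):
    """Return the fraction of bases with Phred quality >= 30.

    Assumption: qualities are Phred+33 ASCII encoded. `quality_strings`
    is any iterable of FASTQ quality lines (hypothetical input).
    """
    total = 0
    high = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= 30:
                high += 1
    return high / total if total else 0.0


# Toy example: 'I' encodes Q40 and '#' encodes Q2 under Phred+33,
# so 18 of the 20 bases below reach Q30.
reads = ["IIIIIIIIII", "IIIIIIII##"]
print(q30_fraction(reads))
```

Under this reading, a sample would pass the genome-assembly criterion when the returned value exceeds 0.90.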

Transcriptome sequencing and assembly

RNA was extracted using Trizol reagent (Invitrogen, CA, USA) following the manufacturer’s guidelines. Libraries were constructed with 375-bp fragments of mRNA using the TruSeq RNA Sample Preparation Kit (Illumina Inc., San Diego, USA) according to the Illumina protocol. Libraries were quantified with the Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen, Life Technologies) and assessed with the Agilent 2100 Bioanalyzer. Sequencing was carried out on an Illumina NextSeq 500 platform by Shanghai Personalbio Technology Co., Ltd. A minimum of 20 Gb of raw reads (paired-end, 150 bp each) was produced from each RNA sample. Raw data quality was checked with FastQC [85]; adapter removal and read filtering were performed with Trimmomatic v0.32 [86]. Only high-quality reads with Q30 > 89% were used for subsequent transcriptome assembly. Transcriptomes were assembled de novo using the RNA-Seq assembler Trinity [88].

Orthology assignment and sequence alignment

We used the pipeline developed by Zhang et al. [89] to extract single-copy genes from both whole-genome sequence data and transcriptome data. Single-copy orthologs were extracted using BUSCO v3.0.2 with the Hemiptera_odb10 database. This resulted in an average of 1625 near-universal single-copy orthologs extracted for each species. Nucleotide and amino acid sequences were aligned using MAFFT v7.407 [90] with the L-INS-i algorithm. The alignments were trimmed using trimAl-1.4.1 [91] and then concatenated into supermatrices using FASconCAT-G-1.04 [92]. Summary statistics of the matrices were calculated using AMAS [93]. In addition to datasets containing all positions, we generated 50% and 80% completeness matrices to evaluate the effect of missing data, ensuring that each included locus was represented by at least 50% or 80% of all terminal species, respectively. We used BMGE [94] to further trim the multiple alignments of nucleotide and amino acid sequences and remove hypervariable positions. We then used BaCoCa [95] to calculate the relative composition frequency variability (RCFV), a measure of compositional heterogeneity and bias, and removed compositionally heterogeneous loci.
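The completeness criterion described above amounts to a taxon-occupancy filter on loci. A minimal sketch follows, assuming each locus is summarized by the set of taxa that have a sequence for it; the function name, the data layout, and the toy taxon labels are hypothetical and are not taken from the pipeline of Zhang et al. [89].

```python
def filter_by_completeness(loci, all_taxa, threshold):
    """Return the sorted names of loci present in >= `threshold` of taxa.

    `loci` maps a locus name to the set of taxa with a sequence for it;
    `threshold` is a proportion between 0 and 1 (e.g. 0.5 or 0.8).
    """
    n_taxa = len(all_taxa)
    kept = []
    for name, taxa in loci.items():
        occupancy = len(taxa & all_taxa) / n_taxa
        if occupancy >= threshold:
            kept.append(name)
    return sorted(kept)


# Toy data set with five taxa and three loci.
taxa = {"sp1", "sp2", "sp3", "sp4", "sp5"}
loci = {
    "locusA": {"sp1", "sp2", "sp3", "sp4", "sp5"},  # occupancy 100%
    "locusB": {"sp1", "sp2", "sp3"},                # occupancy 60%
    "locusC": {"sp1", "sp2"},                       # occupancy 40%
}
print(filter_by_completeness(loci, taxa, 0.5))  # 50% completeness matrix
print(filter_by_completeness(loci, taxa, 0.8))  # 80% completeness matrix
```

Raising the threshold trades alignment length for lower missing data, which is consistent with the 80% matrices being much shorter than the 50% matrices.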

We used PhyKIT [96] to implement three types of alignment-based locus filtering based on (1) the number of parsimony-informative sites [97], with a minimum value of 500 for nucleotide alignments and 200 for amino acid alignments; (2) GC content [97], with a GC content of < 0.8; and (3) relative composition variability (RCV) values [98], with RCV < 0.26 for nucleotide alignments and < 0.3 for amino acid alignments. We also used a tree-based locus filtering approach implemented in PhyKIT [96] to remove potentially spurious sequences (SHI: spurious homolog identification) [99]. The ML trees for each locus were constructed from alignments of nucleotide and amino acid datasets, following the parameters described below for coalescent-based species tree estimation.
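To make the three alignment-based statistics concrete, the sketch below computes them for a toy nucleotide alignment. It is an independent illustration, not PhyKIT code: the function names are hypothetical, only unambiguous bases (A, C, G, T) are counted, and RCV is computed as the summed absolute deviation of each taxon's base counts from the across-taxon mean, divided by the number of taxa times the number of sites.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def parsimony_informative_sites(alignment):
    """Count columns with >= 2 states that each occur in >= 2 sequences."""
    count = 0
    for column in zip(*alignment):
        states = Counter(ch for ch in column if ch in NUCLEOTIDES)
        if sum(1 for n in states.values() if n >= 2) >= 2:
            count += 1
    return count


def gc_content(alignment):
    """Proportion of G and C among all unambiguous bases."""
    bases = [ch for seq in alignment for ch in seq if ch in NUCLEOTIDES]
    return sum(ch in "GC" for ch in bases) / len(bases)


def rcv(alignment):
    """Relative composition variability of an alignment."""
    n_taxa = len(alignment)
    n_sites = len(alignment[0])
    counts = [Counter(seq) for seq in alignment]
    total = 0.0
    for base in NUCLEOTIDES:
        mean = sum(c[base] for c in counts) / n_taxa
        total += sum(abs(c[base] - mean) for c in counts)
    return total / (n_taxa * n_sites)


# Toy alignment: four taxa, four sites (columns 2 and 3 are informative).
aln = ["AAGC", "AAGC", "ATCC", "ATCG"]
print(parsimony_informative_sites(aln), gc_content(aln), rcv(aln))
```

A locus would then be retained only if its values satisfy the thresholds stated above (for nucleotides, at least 500 parsimony-informative sites, GC content below 0.8, and RCV below 0.26).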

Phylogenetic analyses

We constructed multiple datasets comprising 298 Hemiptera species, covering all four suborders, 13 infraorders, 27 superfamilies, and 85 families (Additional file 2: Table S1). Eight species of Thysanoptera and six species of Psocodea were included as close outgroups, and three Orthoptera species were selected as distant outgroups. In addition to the 64 genomes and 2 transcriptomes newly sequenced in the present study, we incorporated 61 publicly available whole genome sequences (WGS) from NCBI, along with 188 transcriptome assemblies (TSA) of Hemiptera and outgroups.

Sixteen datasets were used in our phylogenetic analyses: (1) NT_all: nucleotide sequences with all positions included; (2) NT_12: nucleotide sequences with third codon positions excluded; (3) NT_50: nucleotide sequences with 50% completeness; (4) NT_80: nucleotide sequences with 80% completeness; (5) NT_RCFV: nucleotide sequences with compositionally heterogeneous loci excluded; (6) NT_GC: nucleotide sequences filtered by GC content; (7) NT_parsimony: nucleotide sequences filtered by the number of parsimony-informative sites; (8) NT_RCV: nucleotide sequences filtered by average variability in sequence composition; (9) NT_SHI: nucleotide sequences filtered by removing potentially spurious sequences; (10) AA_all: amino acid sequences with all positions included; (11) AA_50: amino acid sequences with 50% completeness; (12) AA_80: amino acid sequences with 80% completeness; (13) AA_RCFV: amino acid sequences with compositionally heterogeneous loci excluded; (14) AA_parsimony: amino acid sequences filtered by the number of parsimony-informative sites; (15) AA_RCV: amino acid sequences filtered by average variability in sequence composition; (16) AA_SHI: amino acid sequences filtered by removing potentially spurious sequences (Table 1).
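The NT_12 treatment can be pictured as dropping every third position from an in-frame codon alignment. The sketch below is only an illustration with a hypothetical function name; it assumes that each sequence starts at codon position 1 and that alignment gaps preserve the reading frame.

```python
def drop_third_codon_positions(seq):
    """Return `seq` without codon position 3 (0-based indices 2, 5, 8, ...).

    Assumption: the sequence is in frame and starts at codon position 1.
    """
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


# Codons ATG | GCC | TTA keep only their first two positions.
print(drop_third_codon_positions("ATGGCCTTA"))
```

Third codon positions tend to saturate fastest, so removing them is a common way to reduce compositional and substitution-rate noise in deep-level nucleotide analyses.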

Phylogenetic relationships were inferred using the ML optimality criterion in IQ-TREE version 2.2.15 [100]. Nucleotide and amino acid datasets were analyzed using the GTR + I + G and LG + C20 models, respectively. Node support values were estimated with ultrafast bootstrapping for 1000 replicates [101]. We used the coalescent-based method implemented in ASTRAL v5.7.1 [102, 103] to estimate species trees from the alignment of each locus in the nucleotide and amino acid datasets (NT_all and AA_all). Individual gene trees were reconstructed using the GTR model for nucleotide alignments and the LG model for amino acid alignments in IQ-TREE version 2.2.15 [100], and ASTRAL was then used to summarize all ML gene trees. To evaluate statistical support for alternative topologies of the superfamily relationships within Sternorrhyncha and Cicadomorpha, as well as the phylogenetic placement of Coleorrhyncha, we performed FcLM [33, 104] analysis using IQ-TREE version 2.2.15 [100]. We also checked for potentially incongruent signal in each of the datasets mentioned above.

Character state reconstructions of feeding modes

Data on feeding modes were collected from literature sources for 315 extant taxa included in the present study and were used to reconstruct the evolution of feeding modes of hemipteran insects. The feeding mode character matrix was prepared using Mesquite v.3.81 [105]. The adult feeding modes were coded as eight states: (1) phytophagous; (2) feeding on mosses; (3) feeding on fungi; (4) phytophagous or feeding on fungi, algae, and decaying plants; (5) carnivorous; (6) phytophagous and/or carnivorous; (7) feeding on the blood of mammals; and (8) feeding on the blood of birds. Subsequently, we mapped the feeding modes onto the ML tree derived from the AA_50 dataset. The “Trace Character History” analysis was conducted using Mesquite v.3.81 [105] with the maximum likelihood method under the Mk1 model.
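For orientation, the Mk1 model assumes a single rate for every change between any two of the k character states. Under the common parameterization in which q is the instantaneous rate of change from one state to each particular other state, the transition probabilities over a branch of length t have a simple closed form, sketched below for the eight feeding-mode states. The function name and the example rate are hypothetical, and Mesquite may scale its rate parameter differently.

```python
import math


def mk1_transition_prob(same_state, rate, branch_length, k=8):
    """Transition probability under an equal-rates (Mk1) model.

    Assumption: `rate` is the instantaneous rate q from one state to each
    particular other state, so the total rate of leaving a state is (k-1)*q.
    """
    decay = math.exp(-k * rate * branch_length)
    if same_state:
        return 1.0 / k + (k - 1) / k * decay
    return 1.0 / k - decay / k


# Probability of keeping vs. changing state along a branch (toy values).
p_stay = mk1_transition_prob(True, rate=0.01, branch_length=50.0)
p_change = mk1_transition_prob(False, rate=0.01, branch_length=50.0)
print(p_stay, p_change)
```

Each row of the transition matrix sums to one (the probability of staying plus seven times the probability of a specific change), and both probabilities approach 1/8 on very long branches, which is why deep nodes can yield ambiguous reconstructions.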

Divergence dating analysis

We estimated the divergence times of major hemipteran lineages using the ML tree inferred from dataset AA_50 and MCMCTree from the PAML package v4.9 [106]. Fossil ages taken from previous studies and from the Paleobiology Database (PBDB; https://paleobiodb.org) were used as minimum age calibration points for some lineages. These fossil calibration points were (1) Liposcelididae: 53 million years ago (Mya) [107]; (2) Aphidomorpha: 245 Mya [108]; (3) Enicocephalidae: 120 Mya [109]; (4) Gerromorpha: 150 Mya [110]; (5) Reduvioidea: 180 Mya [111]; (6) Cimicoidea: 150 Mya [112]; (7) Pentatomomorpha: 99 Mya [113]; (8) Coleorrhyncha: 252 Mya [19]; (9) Fulgoroidea: 260 Mya [19]; (10) Cixiidae: 150 Mya [114]; (11) Fulgoridae: 44 Mya [115]; (12) Cicadoidea: 237 Mya [116]; (13) Cercopoidea: 201 Mya [117]; and (14) Membracoidea: 250 Mya [118].
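The calibration list above can be kept as a small lookup table and converted into minimum-bound node labels. The sketch below assumes a time unit of 100 million years and the '>' lower-bound notation accepted in MCMCTree tree files; the time unit actually used in this study is not stated, and the helper name is hypothetical.

```python
# Minimum ages (Mya) as listed in the text.
CALIBRATIONS_MYA = {
    "Liposcelididae": 53, "Aphidomorpha": 245, "Enicocephalidae": 120,
    "Gerromorpha": 150, "Reduvioidea": 180, "Cimicoidea": 150,
    "Pentatomomorpha": 99, "Coleorrhyncha": 252, "Fulgoroidea": 260,
    "Cixiidae": 150, "Fulgoridae": 44, "Cicadoidea": 237,
    "Cercopoidea": 201, "Membracoidea": 250,
}


def min_bound_label(age_mya, unit_my=100.0):
    """Format a minimum-age bound as a quoted node label.

    Assumption: ages are rescaled to units of `unit_my` million years.
    """
    return "'>{:.2f}'".format(age_mya / unit_my)


for clade, age in sorted(CALIBRATIONS_MYA.items()):
    print(clade, min_bound_label(age))
```

Keeping the ages in one table makes it easier to check that every calibrated node in the tree file matches the value reported in the text.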

Supplementary information

Additional file 1: Fig. S1 The full tree for the ML analysis based on the nucleotide dataset NT_12 consisting of 1,370,056 aligned nucleotide positions. The values displayed at the nodes indicate the bootstrap values. Additionally, the colored background highlights the four suborders of Hemiptera. Fig. S2 The full tree for the ML analysis based on the amino acid dataset AA_RCFV consisting of 131,479 aligned amino acid positions. Fig. S3 The full tree for the ML analysis based on the nucleotide dataset NT_all consisting of 2,096,643 aligned nucleotide positions. Fig. S4 The full tree for the ML analysis based on the nucleotide dataset NT_50 consisting of 1,615,692 aligned nucleotide positions. Fig. S5 The full tree for the ML analysis based on the nucleotide dataset NT_80 consisting of 240,771 aligned nucleotide positions. Fig. S6 The full tree for the ML analysis based on the nucleotide dataset NT_RCFV consisting of 661,644 aligned nucleotide positions. Fig. S7 The full tree for the ML analysis based on the nucleotide dataset NT_GC consisting of 624,621 aligned nucleotide positions. Fig. S8 The full tree for the ML analysis based on the nucleotide dataset NT_parsimony consisting of 1,694,553 aligned nucleotide positions. Fig. S9 The full tree for the ML analysis based on the nucleotide dataset NT_RCV consisting of 930,285 aligned nucleotide positions. Fig. S10 The full tree for the ML analysis based on the nucleotide dataset NT_SHI consisting of 1,334,997 aligned nucleotide positions. Fig. S11 The full tree for the ML analysis based on the amino acid dataset AA_all consisting of 719,819 aligned amino acid positions. Fig. S12 The full tree for the ML analysis based on the amino acid dataset AA_50 consisting of 565,612 aligned amino acid positions. Fig. S13 The full tree for the ML analysis based on the amino acid dataset AA_80 consisting of 102,484 aligned amino acid positions. Fig. S14 The full tree for the ML analysis based on the amino acid dataset AA_parsimony consisting of 578,064 aligned amino acid positions. Fig. S15 The full tree for the ML analysis based on the amino acid dataset AA_RCV consisting of 345,843 aligned amino acid positions. Fig. S16 The full tree for the ML analysis based on the amino acid dataset AA_SHI consisting of 28,499 aligned amino acid positions. Fig. S17 The species tree relationships in Hemiptera were reconstructed using gene trees derived for each locus included in the NT_all dataset with ASTRAL. Numbers at nodes show the local posterior probabilities. Fig. S18 Results of the FcLM analyses on various datasets for the phylogenetic placement of Coleorrhyncha in Hemiptera. Fig. S19 Results of the FcLM analyses on various datasets for the phylogenetic hypotheses on the superfamily relationships of Sternorrhyncha. Fig. S20 Results of the ancestral state reconstruction analysis for feeding modes using the maximum likelihood method under the Mk1 model. The internal pie charts on the tree illustrate the relative likelihoods of different character states. Fig. S21 Estimated divergence times among lineages of Hemiptera. The red circle indicates the node corresponding to Hemiptera. Numbers at nodes show the divergence times (Mya). The bars display the 95% highest posterior probability density of each estimate. The colored background highlights the four suborders of Hemiptera. Red-circled numbers on the tree correspond to the minimum age calibration points.

Additional file 2: Table S1 Taxa included in this study, along with corresponding accession numbers and data types.