Background

Long-branch attraction (LBA) is a bias that results in spurious support for relationships between two (or more) long branches in an estimated phylogenetic tree when the assumed model of evolution is too simplistic [1, 2]. Biases associated with LBA have been identified in many phylogenetic studies, including analyses of mammals [3, 4], birds [5], arthropods [6–8], and seed plants [9, 10]. The most common problem occurs when distantly related ingroup taxa are poorly sampled and one or a few distant outgroup taxa are included to root the tree. Under these conditions, a simplistic model of evolution is unlikely to sufficiently account for homoplasy, and long branches will be connected (or attracted to one another) in the inferred tree based on homoplastic similarities [11]. One method for detecting this problem involves conducting phylogenetic analyses with and without outgroups [12]. If the inclusion of a distant outgroup changes the inferred relationships of the ingroup, it may be better to infer ingroup relationships separately and consider other methods for rooting the resulting tree, or to use more closely related outgroups [13]. In addition, several strategies have been suggested to reduce the effects of LBA, including: (1) excluding long-branch taxa from the analysis, (2) replacing the long-branch taxa with slow-evolving close relatives, (3) removing fast-evolving proteins or sites, (4) improving the models of character evolution assumed in the analysis, and (5) sampling more taxa to break up long branches in the tree [14–16]. Among these methods, adding taxa to break up long branches is one of the most widely suggested strategies to reduce the effects of LBA bias [17, 18]. Appropriate and thorough taxon sampling is thus one of the most important considerations for accurate phylogenetic estimation [16–19].
Phylogenetic analyses based on relatively few distantly related taxa (but with each taxon represented by many characters, such as from a mitochondrial genome) are particularly prone to problems with LBA; such analyses are likely to produce high support values for incorrect phylogenetic relationships [16, 20].

The relationships of the true water bugs (Hemiptera: Nepomorpha) within heteropteran insects [21] have been the subject of many studies of molecular and morphological data. The monophyly of Nepomorpha has been consistently and strongly supported by studies based on morphological characters [22–25], molecular data (partial sequences of 16S rDNA and 28S rDNA [26], and four Hox genes [27]), and by combined data analyses [26]. In contrast, the monophyly of Nepomorpha has only been disputed in the study of Hua et al. [28], who based their analysis on nine nepomorphan mitochondrial genomes (mt-genomes). In the study by Hua et al. [28], Pleoidea was not supported as part of Nepomorpha, but instead was resolved as the sister-group of a clade that included the remaining species of Nepomorpha plus Leptopodomorpha, Cimicomorpha, and Pentatomomorpha (Figure 1). As a result of these analyses, Hua et al. [28] suggested that Pleoidea should be raised from a superfamily within Nepomorpha to the infraorder Plemorpha, outside of Nepomorpha. Their conclusions were supported by high Bayesian posterior probabilities (BPP) and maximum likelihood (ML) bootstrap proportions in five of eight phylogenetic analyses.

Figure 1

The consensus phylogeny based on the data sets analyzed by Hua et al. [28]. Five of the eight phylogenetic analyses they conducted supported this tree. Numbers at the nodes indicate the BPP and ML support values for each data matrix analyzed by Hua et al. [28] in the following order: PP and BP for PCG123RT, PP and BP for PCG12RT, and PP for PCG12. Branch lengths are similar across analyses; these branch lengths represent the analysis of the PCG123RT data set. The scale bar represents the number of expected substitutions per site.

The study by Hua et al. [28] has both strengths and weaknesses when compared with previous studies of the phylogenetic relationships of Nepomorpha. Each taxon sampled by Hua et al. [28] was sampled for complete mitochondrial genomes, so the number of characters available for phylogenetic inference was large. In contrast, previous studies [22–27] examined fewer characters per taxon, but included more taxa in the analyses. Thorough taxon sampling can often lead to more accurate phylogenetic inference, even if the total number of characters in the analysis is decreased [29–32]. In particular, the position of Pleoidea in the study of Hua et al. [28] may have been affected by the inclusion of just one of two families in Pleoidea (Helotrephidae, without any representation of Pleidae; see Results and discussion). This made it more likely for the tree to be rooted by connection of the distantly related outgroup taxa to the long branch leading to Helotrephes sp. (Figure 1).

A second consideration is the selection of outgroups used by Hua et al. [28]. Fulgoromorpha is very distantly related to the ingroup Nepomorpha, making problems associated with LBA more likely [30, 33]. Furthermore, in groups more closely related to Nepomorpha, Hua et al. [28] sampled only one representative for three different infraorders (Cimicomorpha, Leptopodomorpha and Pentatomomorpha). Thus, we examined the possibility that the findings of Hua et al. [28] resulted from biases associated with inadequate taxon sampling. Because the model-based methods used by Hua et al. [28] are less sensitive to the problems of LBA [34–36], these authors did not consider LBA to be a likely explanation of their results. However, models of evolution are never perfect, and poor taxon sampling exacerbates the problems of model insufficiency, so the use of model-based inference methods is not, by itself, a panacea for dealing with biases associated with LBA [11, 16].

We undertook the current study to explore the conclusion of Hua et al. [28] that the Pleoidea evolved their fully aquatic lifestyle independently of the remaining true water bugs in Nepomorpha. Our hypothesis was that this conclusion was a result of LBA between the single sampled representative of Pleoidea and the distantly related outgroup, Fulgoromorpha. We tested this hypothesis by: (1) removing the outgroups and re-estimating the phylogeny of Nepomorpha only, to detect whether the ingroup topology is affected by the long-branch outgroup taxa [12, 13]; (2) increasing taxon sampling of groups related to Nepomorpha, including Leptopodomorpha, Cimicomorpha, and Pentatomomorpha [37]; and (3) adding new mt-genome data for a representative of the second family within Pleoidea, namely Pleidae (the presumed sister-group of Helotrephidae).

Results and discussion

Misidentification of previously sampled taxa

To test our hypothesis that the conclusion of Hua et al. [28] (Pleidae outside of the remaining Nepomorpha) was an artifact of limited taxon sampling, we sampled a member of the family Helotrephidae. Helotrephidae is generally accepted as the sister-group of Pleidae [22, 23, 25, 26], so we reasoned that including the sister-group of Pleidae was the best way to break up the long terminal branch leading to this taxon. We sequenced the mt-genome of Helotrephes semiglobosus semiglobosus Stål, 1860 (Nepomorpha: Helotrephidae). However, after we obtained a partial mt-genome sequence of Helotrephes semiglobosus semiglobosus (GenBank accession number: KJ027513), 8,876 bp in length and including 29 genes (two rRNAs, ten protein-coding genes [PCGs] and 17 tRNAs) as well as the control region, we found extreme similarity (97.4%) between this species and the specimen previously identified by Hua et al. [28] as Paraplea frontalis (Fieber, 1844). As this level of sequence similarity was unexpected between species in these two families, we checked the specimens identified previously as Paraplea frontalis by Hua et al. [28]. We found that those specimens are properly identified as Helotrephes sp., and so represent a species in Helotrephidae rather than Pleidae. As the mt-genome of a species in Helotrephidae was already represented in the data set, we then sequenced a new mt-genome of Paraplea frontalis, as a true representative of Pleidae. Henceforth, we label the sample sequenced by Hua et al. [28] correctly as Helotrephes sp.
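The text does not state how the 97.4% similarity was calculated. As an illustration only, the sketch below computes one common measure, percent identity over the ungapped columns of a pairwise alignment. The function name and the toy sequences are hypothetical and are not taken from the study.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: both sequences come from the same alignment, so they have
    equal length and '-' marks a gap.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, not real mt-genome data)
toy_a = "ATGCATGCAT-GC"
toy_b = "ATGCTTGCATAGC"
print(round(percent_identity(toy_a, toy_b), 1))
```

Other definitions (for example, counting gaps as mismatches) would give slightly different values, so the measure should be stated whenever a similarity figure is used to question an identification.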

Removal of outgroups from the analysis

The most common problem of LBA is that distantly related outgroups have a biased attraction to long branches within the ingroup [3, 4, 38]. For this reason, a common suggestion is to conduct phylogenetic analyses both with and without the outgroups to compare whether the distantly related outgroup alters the ingroup topology [16]. To test if outgroup selection affected the topology of our ingroup, we ran analyses using only the ingroup taxa of Hua et al. [28]. Using Bayesian and ML analyses, all data matrices of Hua et al. [28] generated phylogenetic trees with the same topology (Figure 2). When the outgroups are removed, the ingroup topology is distinct from that obtained by Hua et al. [28] (Figure 1). In all of these analyses, Helotrephes sp. was connected to Enithares tibialis Liu et Zheng, 1991 (Nepomorpha: Notonectoidea).
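The outgroup-removal step can be sketched as follows. This is a minimal illustration, assuming the alignment is held as a mapping from taxon name to aligned sequence; the function name, the toy taxon labels, and the choice to drop columns that become gap-only after pruning are assumptions, not details reported for these analyses.

```python
def prune_taxa(alignment: dict, taxa_to_remove) -> dict:
    """Remove the named taxa, then drop columns left with gaps only.

    alignment: {taxon_name: aligned_sequence}, all sequences of equal length.
    """
    remove = set(taxa_to_remove)
    kept = {taxon: seq for taxon, seq in alignment.items() if taxon not in remove}
    if not kept:
        raise ValueError("no taxa left after pruning")
    length = len(next(iter(kept.values())))
    # Keep a column only if at least one remaining taxon has a residue there
    columns = [i for i in range(length)
               if any(seq[i] != "-" for seq in kept.values())]
    return {taxon: "".join(seq[i] for i in columns) for taxon, seq in kept.items()}


# Toy alignment (hypothetical labels and sequences)
toy = {"ingroup_1": "AT-GC", "ingroup_2": "AT-GA", "outgroup_1": "ATTGC"}
ingroup_only = prune_taxa(toy, ["outgroup_1"])
print(ingroup_only)
```

The pruned matrix can then be analyzed with the same settings as the full matrix, so that any change in ingroup topology can be attributed to the removed taxa rather than to a change in method.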

Figure 2

Phylogenetic results based on analyses of ingroup taxa only. Numbers at the nodes are BPP and ML support values in the following order: PP and BP for PCG12, PP and BP for PCG123, PP and BP for PCG12RT, and PP and BP for PCG123RT. The red dot on the tree indicates the clade of Notonectoidea + Pleoidea. The scale bar represents the number of expected substitutions per site based on analysis of the PCG12 data set.

Addition of outgroups

Outgroup selection is an important factor for reconstructing phylogenetic trees, because the choice of outgroup taxa can affect the ingroup topology [39]. However, outgroup selection is often not adequately considered [40, 41]. Moreover, several authors have pointed out that adding more outgroup taxa from the sister-group to a phylogenetic analysis can improve the accuracy of phylogenetic estimation, and should also help break up the LBA between any long-branch members of the ingroup and the outgroup [38, 42, 43]. Therefore, we added three more taxa (selected from the sister-group of Nepomorpha) to the dataset of Hua et al. [28].

Both Bayesian inference and ML analyses resulted in the same topology (Figure 3A); the position of the long branch of Helotrephes sp. (Nepomorpha: Pleoidea) was supported within Nepomorpha rather than outside of Nepomorpha, in contrast to the findings of Hua et al. [28]. The monophyly of Nepomorpha (including both Helotrephidae and Pleidae) received strong support in Bayesian analyses (based on posterior probabilities: PP) but with relatively weak support in ML analyses (based on bootstrap proportions: BP). The monophyletic Nepoidea, Ochteroidea, and Naucoroidea were strongly supported by both PP and BP, similar to the results of Hua et al. [28]. Additionally, the topology of the infraordinal relationships of Heteroptera is similar to previous work [44] also based on mt-genomes, namely (Gerromorpha + (Pentatomomorpha + (Leptopodomorpha + (Cimicomorpha + Nepomorpha)))).

Figure 3

Phylogenetic trees based on the inclusion of additional closely related outgroups. (A) Analysis including the distant outgroup Lycorma delicatula (Hemiptera: Auchenorrhyncha: Fulgoromorpha). (B) Analysis excluding the distant outgroup Lycorma delicatula. Numbers at the nodes are BPP (left) and ML support values (right). Yellow dots on each phylogram indicate the clades of Nepomorpha, and red dots indicate the clades of Notonectoidea + Pleoidea. Asterisks indicate the additional closely related outgroups. The scale bar represents the number of expected substitutions per site.

We also estimated phylogenetic trees without the long-branched outgroup of Lycorma delicatula (White, 1845) (Hemiptera: Auchenorrhyncha: Fulgoromorpha). The major changes that resulted from deletion of this taxon were the position of Helotrephes sp. and Naucoroidea (Figure 3B). In both Bayesian and ML analyses, Helotrephes sp. (Nepomorpha: Pleoidea) was supported as the sister group of Enithares tibialis (Nepomorpha: Notonectoidea). The close relationship between the Notonectoidea and Pleoidea also has been supported in most previous studies [22–26]. Although the relationships among families of Nepomorpha varied among trees, all the analyses that excluded Fulgoromorpha supported the monophyly of Nepomorpha (including Helotrephidae as well as Pleidae, when the latter was added to the analyses). These analyses demonstrate that the conclusions of Hua et al. [28] were at least partly a result of their use of a very distant outgroup.

Addition of a new mitochondrial genome of Pleidae

We sequenced and assembled a new mt-genome for Paraplea frontalis (Fieber, 1844), except for small portions of the 12S rRNA gene and the control region (polynucleotide sequences in these two regions proved difficult to resolve with certainty). This mt-genome was 14,143 bp in length and has been deposited in GenBank (accession number: KJ027516). The mt-genome of Paraplea frontalis contained the typical 37 genes (two rRNAs, 13 PCGs and 22 tRNAs), with the same gene order as observed in most other true bugs [44, 45] (Table 1). Gene overlaps were found at 11 gene junctions and involved a total of 32 bp, which may make the genome relatively compact. Twelve of the 13 PCGs initiated with ATN as the start codon, whereas the COI gene started with TTG. Eight PCGs ended with the termination codon TAA and one with TAG, whereas the remaining four were terminated with T. All of the 22 typical animal tRNA genes were observed in the Paraplea frontalis mt-genome, ranging from 63 to 74 bp. Most of the tRNAs could be folded into typical cloverleaf secondary structures, except that the stem of the dihydrouridine (DHU) arm simply formed a loop in tRNA-Ser (GCT) (see Additional file 1). There are 22 unmatched base pairs in the Paraplea frontalis mitochondrial tRNA secondary structures.

Table 1 Organization of the Paraplea frontalis mitochondrial genome

Increased taxon sampling, especially when it breaks up long branches in a tree, is the most effective strategy for reducing the effects of LBA [16, 31, 32]. We added the representative of Pleidae, which is thought to be the sister-group of Helotrephidae, to help reduce the length of the branch that led to the single species of Helotrephidae sampled by Hua et al. [28]. We therefore added our mt-genome of Paraplea frontalis to the four data matrices of Hua et al. [28] and conducted new phylogenetic analyses (Figure 4).

Figure 4

Phylogenetic trees based on the addition of a new mitochondrial genome of Paraplea frontalis (Nepomorpha: Pleoidea). By adding the new mt-genome of Paraplea frontalis (Fieber, 1844) to the data matrices of Hua et al. [28], we assembled four new data matrices: 16(PCG12), 16(PCG123), 16(PCG12RT), and 16(PCG123RT). (A) Numbers at the nodes are BPP for the data matrix of 16(PCG12) (left) and 16(PCG123) (right). (B) Numbers at the nodes are ML support values for the data matrix of 16(PCG12) (left), 16(PCG123) (middle), and 16(PCG123RT) (right). (C) Numbers at the nodes are BPP for 16(PCG12RT) (left), ML support values for 16(PCG12RT) (middle), and BPP for 16(PCG123RT) (right). Yellow dots on each phylogram indicate the clades of Nepomorpha, and red dots indicate the clades of Notonectoidea + Pleoidea. The scale bar represents the number of expected substitutions per site.

As with our analyses that replaced the distant outgroup with more appropriate outgroups, the analyses that included a member of Pleidae supported monophyly of Nepomorpha (with strong PP support but weak BP support). Moreover, these analyses strongly supported Paraplea frontalis (Pleidae) as the sister group of Helotrephes sp. (Helotrephidae). Together, Pleidae and Helotrephidae were supported as the sister-group of Notonectidae. The monophyletic groups of Nepoidea, Ochteroidea, Naucoroidea, Pleoidea, and Notonectoidea + Pleoidea were strongly supported by both PP and BP in all analyses that included Pleidae.

Likelihood-ratio tests

We compared the likelihood ratios of the best solutions for each of our two alternative hypotheses (Pleoidea inside versus outside of Nepomorpha; see Additional file 2) for eight different combinations of taxa (Table 2). The monophyly of Nepomorpha (including Pleoidea) was strongly supported if we added Paraplea frontalis and/or three more outgroup taxa to the original data matrix of Hua et al. [28], as well as when we analyzed the data set without the distant outgroup consisting of Lycorma delicatula. The original conclusion of Hua et al. [28] (the polyphyly of true water bugs) was only supported with the specific combination of taxa analyzed in the original study. Even then, the likelihood-ratio support for this result over the alternative is weak (Table 2).

Table 2 Likelihood-ratio tests for monophyly of Nepomorpha with eight different combinations of taxa

Phylogeny of Nepomorpha

Given that the monophyly of Nepomorpha is consistently supported in all of our new analyses, we find no support for the new infraorder Plemorpha. Therefore, we recommend retaining Pleoidea as part of Nepomorpha. The superfamilies of Nepoidea (Belostomatidae + Nepidae), Ochteroidea (Gelastocoridae + Ochteridae), Naucoroidea (Aphelocheiridae + Naucoridae), and Pleoidea (Pleidae + Helotrephidae) are monophyletic groups in all our analyses with high support from both PP and BP. We also found strong support for the close relationship between Notonectoidea and Pleoidea. Several synapomorphies of biological and ecological traits also support some of these monophyletic groups [24–26, 46]:

Nepomorpha: the short antennae are concealed below the eyes; all have an aquatic lifestyle, although Ochteroidea (including Ochteridae and Gelastocoridae) live along freshwater shores rather than underwater;

Nepoidea (including Nepidae and Belostomatidae): air-breathing through a siphon;

Naucoroidea: all Aphelocheiridae and some Naucoridae use plastron respiration;

Pleoidea (including Pleidae and Helotrephidae): also have plastron respiration, which allows them to stay permanently submerged;

Notonectoidea and Pleoidea (including Notonectidae, Pleidae, and Helotrephidae): swim on their backs in an inverted position.

Our principal goal in this study was to discuss the monophyly of Nepomorpha and the effects of adequate taxon sampling on this phylogenetic problem. As we did not sample all the families of Nepomorpha, a more thorough sampling of taxa is needed to adequately resolve the family relationships within Nepomorpha. In particular, more sampling of Potamocoridae, Micronectidae and Diaprepocoridae (Hemiptera: Nepomorpha) mt-genome sequences will be needed for a thorough analysis of the major groups within Nepomorpha.

Conclusions

This study provides a clear example of the importance of adequate taxon sampling. We support the conclusion that investigators should be cautious about making major taxonomic rearrangements on the basis of limited taxon sampling, even (or especially) when the number of characters sampled per taxon is large [16, 17, 31, 32]. Phylogenetic analyses that are based on even complete genomes of relatively few taxa are likely to result in strongly supported, but incorrect, evolutionary reconstructions [16, 17, 47]. In the study by Hua et al. [28], limited sampling of mt-genomes, coupled with the use of a distant outgroup, resulted in a conclusion that was at odds with a traditionally supported group (true water bugs, or Nepomorpha). But even minimal additional sampling to break up long branches in the tree, or the use of more closely related outgroups, results in trees in which the traditional group Nepomorpha is supported.

In the phylogenomic era [48], many papers are reporting surprising phylogenetic results that conflict with traditional hypotheses of relationships. Many (or even most) of these surprising results are based on analyses of many characters (even whole genomes) from very few taxa [16, 47, 49]. Strong “statistical support” for a given conclusion may come from strong underlying phylogenetic signal, but also from systematic bias that stems from assuming inadequate or inappropriate models of evolution [50]. Using large numbers of characters in a phylogenetic analysis means that even small systematic biases associated with overly simplistic methodological assumptions are likely to be mistaken as strong phylogenetic signal. Thorough taxon sampling allows the use of more simplistic models of evolution, because multiple changes at each nucleotide site can be appropriately reconstructed through the increased sampling of the tree [18]. If the sampling in a phylogenomic study is sparse, investigators should use appropriate caution before overturning analyses that are based on more thorough sampling of taxa.

Methods

Ethics statement

No specific permits were required for the insects collected for this study in Yunnan and Hubei Provinces, China. The insect specimens were collected from ponds with a sturdy aquatic net. The field studies did not involve endangered or protected species. Species in the genera Paraplea and Helotrephes are common small insects and are not included in the “List of Protected Animals in China”.

Specimen collection

Adult specimens of Paraplea frontalis were collected from Tongbiguan Village (24°36.411′N, 97°39.349′E), Yingjiang County, Dehong City, Yunnan Province, China, on May 18th, 2009. Adult specimens of Helotrephes semiglobosus semiglobosus were collected from Jin Ji Valley (29°22.339′N, 114°34.301′E), Jiu Gong Shan, Tong Shan County, Hubei Province, China, on July 30th, 2010. Voucher specimens are deposited in the Insect Molecular Systematics Lab, Institute of Entomology, College of Life Sciences, Nankai University, Tianjin, China. All specimens were initially preserved in 95% ethanol in the field. After being transferred to the laboratory, they were stored at -20°C until used for DNA extraction.

PCR amplification and sequencing

Whole genomic DNA was extracted from thoracic muscle tissue using a CTAB-based method [51]. The mt-genome of Paraplea frontalis was amplified in four overlapping PCR fragments (see Additional file 3). The partial mt-genome of Helotrephes semiglobosus semiglobosus was sequenced from two fragments (see Additional file 4). Primer pairs were either modified from previous work [28] or designed from sequenced fragments.

PCR reactions were performed with TaKaRa LA Taq under the following conditions: 1 min initial denaturation at 94°C, followed by 30 cycles of 20 s at 94°C, 1 min at 50°C, and 2–8 min at 68°C, and a final elongation for 10 min at 72°C. PCR products were electrophoresed in 1% agarose gel, purified, and then sequenced using an ABI 3730XL capillary sequencer with the BigDye Terminator Sequencing Kit (Applied Biosystems). All fragments were sequenced by primer walking on both strands.

Sequence analysis and annotation

Sequence files were assembled into contigs using BioEdit version 7.0.5.2 [52]. Protein-coding regions were determined via ORF Finder implemented at the NCBI website (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) with the invertebrate mitochondrial genetic code. Transfer RNA analysis was performed by tRNAscan-SE version 1.21 [53] with the invertebrate mitochondrial codon predictors and a cove score cut-off of 5. The few tRNA genes that could not be identified by tRNAscan-SE were determined by comparison with other heteropterans. Analyses of sequences were performed with MEGA version 5.0 [54].

Taxon sampling

In total, 19 taxa were sampled. These taxa included representatives of 10 out of 11 extant families of Nepomorpha [46, 55] and 9 outgroups (Table 3). Among them, the mt-genome data of Paraplea frontalis is reported here for the first time. To make the results more directly comparable to the study of Hua et al. [28], we retrieved all mt-genomes of 15 taxa (including nine ingroups and six outgroups) from their work. According to the analysis of the heteropteran infraorders of Wheeler et al. [37], the phylogenetic relationships of Heteroptera are as follows: (Enicocephalomorpha + (Dipsocoromorpha + (Gerromorpha + (Nepomorpha + (Leptopodomorpha + (Cimicomorpha + Pentatomomorpha)))))). Therefore, we sampled another three taxa within the sister group to Nepomorpha as outgroups, with one representative from each of Leptopodomorpha, Cimicomorpha and Pentatomomorpha.

Table 3 Taxonomy and GenBank accession numbers of mitochondrial genomes for species sampled in this study

Phylogenetic analyses

All PCGs were aligned based on their amino acid sequences using MUSCLE as implemented in MEGA version 5.0 [54]. The rRNAs and tRNAs were aligned with CLUSTAL_X version 1.83 [56] under the default settings. The alignments of tRNA genes were corrected according to the secondary structures, especially the stem regions. The aligned nucleotide sequences, excluding stop codons, were then concatenated and used to reconstruct the phylogeny. All phylogenetic trees were built using only first and second codon positions of the 13 PCGs, except in our analyses in which we removed or added taxa to the data matrices of Hua et al. [28], so that we could make a direct comparison using the methods of the original paper. Our analyses with added and deleted taxa used the same data sampling methods as Hua et al. [28]; these analyses contained four kinds of data matrices: (1) the PCG123RT matrix, including all three codon positions of PCGs, rRNA genes, and tRNA genes; (2) the PCG12RT matrix, including the first and the second codon positions of PCGs, rRNA genes, and tRNA genes; (3) the PCG123 matrix, including all three codon positions of PCGs; and (4) the PCG12 matrix, including the first and the second codon positions of PCGs.
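The four matrix types differ only in which codon positions are kept and whether the RNA genes are appended. The sketch below illustrates that construction under stated assumptions: each PCG alignment is in frame (its length is a multiple of three, as expected after a codon-preserving alignment), genes are concatenated in alphabetical order, and the function names, gene labels, and toy sequences are hypothetical.

```python
def codon_positions(aligned_cds: str, keep=(1, 2)) -> str:
    """Return only the requested codon positions (1-based) of an in-frame alignment."""
    if len(aligned_cds) % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")
    offsets = sorted(p - 1 for p in keep)
    return "".join(aligned_cds[i + o]
                   for i in range(0, len(aligned_cds), 3)
                   for o in offsets)


def build_matrix(pcg_alignments, rna_alignments, keep=(1, 2), include_rna=False):
    """Concatenate per-gene alignments into one sequence per taxon.

    pcg_alignments, rna_alignments: {gene: {taxon: aligned_sequence}}
    keep=(1, 2) gives PCG12; keep=(1, 2, 3) gives PCG123.
    include_rna=True appends the rRNA and tRNA alignments (the "RT" matrices).
    """
    taxa = sorted(next(iter(pcg_alignments.values())))
    matrix = {taxon: "" for taxon in taxa}
    for gene in sorted(pcg_alignments):
        for taxon in taxa:
            matrix[taxon] += codon_positions(pcg_alignments[gene][taxon], keep)
    if include_rna:
        for gene in sorted(rna_alignments):
            for taxon in taxa:
                matrix[taxon] += rna_alignments[gene][taxon]
    return matrix


# Toy input (hypothetical): one PCG and one RNA gene for two taxa
pcgs = {"COI": {"taxonA": "ATGGCC", "taxonB": "ATAGCT"}}
rnas = {"16S": {"taxonA": "TTA", "taxonB": "TTG"}}
pcg12 = build_matrix(pcgs, rnas, keep=(1, 2))
pcg123rt = build_matrix(pcgs, rnas, keep=(1, 2, 3), include_rna=True)
print(pcg12, pcg123rt)
```

Dropping third codon positions removes the sites that usually change fastest, which is why the PCG12 matrices are a common hedge against saturation in deep comparisons.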

We used GPU MrBayes [57] for Bayesian inference and raxmlGUI 1.2 [58] for ML analyses to reconstruct phylogenetic trees. We used the GTR + I + Γ model, based on results from Modeltest version 3.7 [59]. In Bayesian inference, two simultaneous runs of 10,000,000 generations were conducted for each matrix. Each run was sampled every 100 generations. Trees that were sampled prior to stationarity (at 25% of the run) were discarded as burn-in, and the remaining trees were used to construct a 50% majority-rule consensus tree. For the ML analysis, we conducted 1,000 bootstrap replicates with a thorough ML search.
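The burn-in and consensus steps can be illustrated with a small sketch. It is not the MrBayes implementation: each sampled tree is reduced here to a set of clades (each clade a frozenset of taxon names), and the function names and toy samples are hypothetical. With 10,000,000 generations sampled every 100 generations, each run yields on the order of 100,000 sampled trees, of which the first 25% are discarded.

```python
from collections import Counter


def post_burnin(samples, burnin_fraction=0.25):
    """Discard the first burnin_fraction of the sampled trees."""
    n_discard = int(len(samples) * burnin_fraction)
    return samples[n_discard:]


def majority_rule_clades(tree_samples, threshold=0.5):
    """Return {clade: frequency} for clades present in more than `threshold` of trees.

    The frequency of a clade among the retained samples is its estimated
    posterior probability.
    """
    counts = Counter(clade for tree in tree_samples for clade in tree)
    n = len(tree_samples)
    return {clade: c / n for clade, c in counts.items() if c / n > threshold}


# Toy chain of eight sampled trees over taxa A-D (hypothetical)
AB, CD, AC = frozenset("AB"), frozenset("CD"), frozenset("AC")
chain = [{AC}, {AC}, {AB, CD}, {AB, CD}, {AB}, {AB, CD}, {AC}, {AB, CD}]
retained = post_burnin(chain)           # drops the first two samples
support = majority_rule_clades(retained)
for clade, freq in support.items():
    print(sorted(clade), round(freq, 2))
```

In practice the samples from both runs would be pooled after burn-in, and convergence between runs should be checked before the clade frequencies are interpreted.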

Tests of monophyly

Traditionally recognized taxonomic groups are usually challenged when there is strong statistical support for an alternative phylogeny [16, 60]. Likelihood-ratio tests [61] can provide a powerful means of examining alternatives. We applied likelihood-ratio tests to compare the support of various data sets for two different hypotheses (see Additional file 2):

Hypothesis 1: Helotrephidae is nested within Nepomorpha (i.e., the true water bugs are monophyletic, and Helotrephidae is nested within the group).

Hypothesis 2: Helotrephidae is outside of the remaining species of Nepomorpha (i.e., true water bugs are only monophyletic if Helotrephidae is excluded from the group).

We conducted likelihood-ratio tests [61] of these two hypotheses for the original data set of Hua et al. [28], as well as with various additions and deletions of taxa, including both ingroups and outgroups. The likelihood-ratio tests were conducted using PAUP* 4 [62]. Heuristic searches were performed using the GTR + I + Γ model with 100 random addition replicates.
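For readers unfamiliar with the statistic, a minimal sketch follows. It assumes that each hypothesis is represented by the log-likelihood of the best tree found under the corresponding topological constraint; the function name and the numerical values are hypothetical and are not results from this study. The sketch stops at the statistic itself and makes no claim about how significance was assessed.

```python
def likelihood_ratio_statistic(lnl_h1: float, lnl_h2: float) -> float:
    """Return 2 * (lnL_H1 - lnL_H2).

    lnl_h1: log-likelihood of the best tree under hypothesis 1
            (Helotrephidae nested within Nepomorpha).
    lnl_h2: log-likelihood of the best tree under hypothesis 2
            (Helotrephidae outside the remaining Nepomorpha).
    Positive values favor hypothesis 1; negative values favor hypothesis 2.
    """
    return 2.0 * (lnl_h1 - lnl_h2)


# Hypothetical log-likelihoods for one taxon combination (not real results)
delta = likelihood_ratio_statistic(lnl_h1=-100.0, lnl_h2=-105.5)
print(delta)
```

Because alternative tree topologies are not nested hypotheses, the statistic is not assumed here to follow a standard chi-square distribution; a null distribution would normally need to be obtained by other means, such as simulation.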

Availability of supporting data

The data sets supporting the results of this article are available in the Dryad repository, http://dx.doi.org/10.5061/dryad.tf25c [63].