Background

Cetaceans (whales, porpoises, and dolphins) compose a lineage exclusively of aquatic mammals, classified into two groups: the odontocetes, animals with teeth, and the mysticetes, animals with baleen plates that allow for food filtration [1]. Cetaceans evolved from small-sized terrestrial ancestors approximately 50 Myr ago, during the Eocene [2]. From then on, cetaceans recolonized the aquatic environment, a process accompanied by extensive morphological and physiological modifications such as reduction of the olfactory and gustatory systems, loss of hind limbs, and modifications toward a hydrodynamic body [3]. Some cetacean species have become gigantic, reaching sizes not achieved by any other living animals. Gigantism results from species evolving enormous body sizes compared with their small-sized ancestors. This feature affects critical life-history traits, such as fecundity, due to the consequent lower reproductive rate, and leads to an overall reduction in effective population size (Ne) due to lower population densities [4, 5]. Despite this, giant cetaceans range from the gray whale (Eschrichtius robustus) at 15 m to the colossal blue whale (Balaenoptera musculus), which reaches up to 30 m [6, 7].

Several ecological hypotheses have been proposed to explain the large body proportions in cetaceans, such as thermoregulation [8], a wider space available in the aquatic environment to explore new niches [9], and food acquisition, which in mysticetes is associated with filtration of small prey [10], and in the sperm whale, the largest odontocete at 20 m in length [11], with the ability to dive to extraordinary depths to capture its prey [12]. In addition to these ecological causes, the genetics behind body size has recently been investigated, taking advantage of the sequenced cetacean genomes. For example, evolutionary analyses have shown signatures of positive selection on size-related genes in cetaceans. Sun et al. 2019 [13] found evidence of selection in genes related to small size in cetaceans, such as ACAN, OBSL1, and GRB10; whereas, in giant cetaceans, genes possibly evolving under positive selection were those with known roles in promoting growth and large sizes, such as CBS, EIF2AK3, and PLOD1. Still, these studies focused only on coding regions, and information on the influence of regulatory regions on gigantism in this group is scarce.

Non-coding sequences with regulatory functions (e.g., promoters and enhancers) coordinate the spatial-temporal expression of genes [14]. Although regulatory regions are not under the same constraints as coding sequences, sequence blocks that are highly conserved across different species indicate evolutionarily conserved functions [15, 16]. On the other hand, modifications of gene regulatory elements have been associated with phenotypic changes in animal evolution, such as pigmentation changes in dogs [17], bristle patterns in flies [18], and skeletal differences in fish [19]. The study of regulatory elements is currently facilitated by computational methods that can identify candidate gene regulatory elements by detecting regions of the genome that exhibit evolutionary conservation or acceleration [16].

Comparative genome-wide regulatory sequence approaches can provide insights into the evolutionary history of large body size in cetaceans. Specifically, our study focuses on species that are at least 10 m long and classified as giants. In Fig. 1 we present the cetacean species included in our investigation, highlighted in blue for giants and red for non-giants, based on their average size. Accordingly, we investigated the molecular evolution of non-coding regulatory regions of genes previously described in the literature as being associated with size in mammals, such as EGF, GHSR, IGF2, IGFBP2, IGFBP7, LCORL, NCAPG, PLAG1, and ZFAT, focusing on cetaceans. Our analyses were performed within a phylogenetic framework, where each promoter contained a consistent set of 52 species, including 39 from different orders of mammals and 13 species of cetaceans, of which eight were classified as giants with a minimum length of 10 m. The objective was to investigate differences in the enrichment of Transcription Factor Binding Sites (TFBS) between giant and non-giant cetaceans, as well as to identify potential evolutionary acceleration in these animals with large body sizes.

Results

Phylogenetic reconstruction

To identify potential candidate genes that may contribute to gigantism in cetaceans, we examined the evolution of their regulatory regions. To this end, we generated phylogenetic trees for each gene selected in this study using a maximum likelihood approach. These trees were constructed to visually explore the evolutionary relationships among the promoter sequences of the species included in our dataset, focusing on potential convergence among giant species. Specifically, we generated phylogenetic trees for the promoter region (−1500 bp to +500 bp relative to the TSS) of each gene. The promoter regions were defined as the sequences upstream of the transcription start site, as this is where regulatory elements, such as transcription factor binding sites, are typically located. The correct grouping of mysticetes and odontocetes was observed in most of the phylogenetic trees, along with other groups of mammals, such as Artiodactyla, Carnivora, Primates, Cingulata, and Chiroptera (Additional file 1: Supplementary Figures S1-S8 show the phylogenetic trees for each promoter).

One exception was observed in the NCAPG promoter, in which the odontocete sperm whale (Physeter catodon) was grouped within the mysticete clade, and the mysticete minke whale (Balaenoptera acutorostrata) was grouped with the odontocetes (Fig. 2). Thus, the NCAPG tree had a clade formed by the gigantic animals included in this dataset: Balaenoptera musculus, Physeter catodon, Eschrichtius robustus, Megaptera novaeangliae, Balaenoptera physalus, Eubalaena australis, Eubalaena glacialis, and Eubalaena japonica. To confirm this scenario, we applied a Bayesian approach, which returned the same grouping by size. This grouping may be due to factors such as evolutionary convergence or rapid evolution of regulatory elements in this particular gene. To explore this further, we performed additional analyses to investigate the rate of evolution and the dynamics of changes in the TFBS of the promoters.

Fig. 1

The adult average size, in meters, of all cetacean species included in this study. Blue values indicate giant cetaceans, and red values non-giant cetaceans. Gigantism in this group is defined as an average body length of at least 10 m. Physeter catodon, Eschrichtius robustus, Eubalaena japonica, Eubalaena australis, Eubalaena glacialis, Megaptera novaeangliae, Balaenoptera physalus, and Balaenoptera musculus are classified as giants. Size values are from the “Encyclopedia of Marine Mammals” and the phylogeny from McGowen et al., 2020 [20]

Fig. 2

Formation of a clade containing only giant cetaceans (the odontocete sperm whale and the large mysticetes) and another containing smaller cetaceans (the mysticete minke whale [Balaenoptera acutorostrata] alongside the remaining odontocetes) in both the maximum likelihood tree generated by IQ-TREE and the Bayesian tree generated by MrBayes v3.2.6, constructed from the promoter region of the NCAPG gene. Numbers under nodes represent bootstrap support (right) and Bayesian posterior probability (left)

Regulatory regions analyses

To gain insights into the molecular evolution of non-coding regulatory regions of genes associated with body size in cetaceans, we employed a scanning approach using Ciiider to identify transcription factor binding sites (TFBS) for the nine promoters of interest across all species in our dataset. The identification of TFBS is crucial to understanding the regulatory mechanisms that control gene expression and that may contribute to the evolution of morphological traits such as body size. By analyzing the presence and distribution of TFBS in these promoters, we aimed to identify potential regulatory modifications that may have contributed to gigantism in cetaceans and to gain a better understanding of the molecular basis of body size evolution in this group of mammals. The scanning approach performed in Ciiider identified TFBS for all nine promoters in all species. The results of the enrichment analyses showed conservation patterns across different mammalian groups, suggesting that some transcription factors are evolutionarily conserved, as all mammals have the same patterns of transcription factor enrichment in the same approximate location of the promoter (Table 1; Additional file 1: Supplementary Figures S9-S14 show GHSR, IGF2, IGFBP2, LCORL, PLAG1, and ZFAT). For example, in the EGF promoter, TCF7 and CDX1 were found to be spatially conserved across the phylogeny at the −500 bp position within the promoter region, as shown in Fig. 3, which includes all species analysed to highlight this conservation. On the other hand, TFBS exclusive to certain groups were observed in the NCAPG and IGFBP7 promoters, as demonstrated in Figs. 4 and 5, respectively. Figures 4 and 5 show only cetaceans to highlight the specificity of the TFBS in the promoters related to giant and non-giant cetaceans. The complete figures, including other species, can be found in the supplementary material.

Table 1 The top ten over-represented transcription factors for each promoter of interest. These results are obtained when comparing the studied promoters against a background of human genes
Fig. 3

Enrichment pattern for the EGF promoter implemented in the Ciiider program. The result shows transcription factors in bars, such as TCF7 and CDX1, conserved in the phylogeny. Mammals are cetaceans, artiodactyls, carnivores, primates, bats, and cingulate (marked with an asterisk). The giant cetaceans are: Balaenoptera musculus, Balaenoptera physalus, Eschrichtius robustus, Eubalaena australis, Eubalaena glacialis, Eubalaena japonica, Megaptera novaeangliae, and Physeter catodon

Fig. 4

Closer look at the enrichment pattern for the NCAPG promoter implemented in the Ciiider program. The result shows transcription factors in bars. TEF and PBX1, highlighted in blue, are present only in Physeter catodon, Eschrichtius robustus, Eubalaena japonica, Eubalaena australis, Eubalaena glacialis, Megaptera novaeangliae, Balaenoptera physalus, and Balaenoptera musculus, which are giant cetaceans, and FOXP3 and ZBTB33, highlighted in red, only in non-giant cetaceans such as Tursiops truncatus, Orcinus orca, Lipotes vexillifer, Phocoena sinus, and Balaenoptera acutorostrata scammoni

Fig. 5

Closer look at the enrichment pattern for the IGFBP7 promoter implemented in the Ciiider program. The result shows transcription factors in bars. PAX2, highlighted in blue, is present in a triple pattern only in Physeter catodon, Eschrichtius robustus, Eubalaena japonica, Eubalaena australis, Eubalaena glacialis, Megaptera novaeangliae, Balaenoptera physalus, and Balaenoptera musculus, which are classified as giants

In the NCAPG promoter, we identified a pattern of enrichment that split cetaceans into two groups: giants and non-giants. The giant mysticetes had an enrichment pattern with the transcription factors TEF and PBX1 in the region between the −1300 and −1200 bp positions, shared only with the sperm whale (Physeter catodon), a species of odontocete that can exceed 20 m in length. In contrast, all cetaceans not classified as giants showed enrichment of the transcription factor FOXP3 at the −800 bp position and of the transcription factor ZBTB33 at the −200 bp position.

Similar patterns were also found in the IGFBP7 promoter, with giant mysticetes presenting a unique triple enrichment pattern of the transcription factor PAX2 at the −1100 bp position, shared only with the sperm whale.
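The idea of a transcription factor being exclusive to one size class can be made explicit with a small set-based check. The following is a minimal sketch, not the procedure used in Ciiider: the function name `group_exclusive`, the dictionary `TOY_SITES`, and the placeholder `SHARED_TF` are hypothetical, and the toy input only mirrors the NCAPG pattern described in the text for four of the species rather than reproducing the actual scan output.

```python
from typing import Dict, Set

def group_exclusive(sites: Dict[str, Set[str]], group: Set[str]) -> Set[str]:
    """Return factors present in every species of `group` and in none outside it."""
    inside = [tfs for species, tfs in sites.items() if species in group]
    outside = [tfs for species, tfs in sites.items() if species not in group]
    in_all_members = set.intersection(*inside) if inside else set()
    in_any_outsider = set.union(*outside) if outside else set()
    return in_all_members - in_any_outsider

# Toy presence table (hypothetical); SHARED_TF is a placeholder, not a real factor.
TOY_SITES = {
    "Physeter catodon": {"TEF", "PBX1", "SHARED_TF"},
    "Balaenoptera musculus": {"TEF", "PBX1", "SHARED_TF"},
    "Tursiops truncatus": {"FOXP3", "ZBTB33", "SHARED_TF"},
    "Balaenoptera acutorostrata scammoni": {"FOXP3", "ZBTB33", "SHARED_TF"},
}
GIANTS = {"Physeter catodon", "Balaenoptera musculus"}

giant_only = group_exclusive(TOY_SITES, GIANTS)
non_giant_only = group_exclusive(TOY_SITES, set(TOY_SITES) - GIANTS)
```

Requiring presence in every member of the group is a strict criterion; a real analysis might relax it to tolerate missing data or assembly gaps.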

Additionally, we used phyloP from the PHAST package to estimate the molecular evolution rate of the promoters and identify signals of evolutionary acceleration in specific branches. Specifically, we aimed to identify whether promoters of giant cetaceans underwent accelerated evolution compared to non-giants. To achieve this, phyloP calculated the conservation and acceleration scores in a tree partitioned into two sets of branches: a set of named branches (the giant cetaceans) and all remaining species. Thus, the tests for conservation/acceleration are performed on the set of named branches relative to the others. Positive scores indicate conservation and negative scores indicate acceleration. A substitution model, against which all subtrees were compared, was derived with the phyloFit program from the same PHAST package. Our analysis revealed possible evidence of accelerated evolution in the promoters of gigantic cetaceans, as shown by negative scores in the IGF2, IGFBP2, IGFBP7, and ZFAT promoters (Table 2).

Table 2 Conservation or acceleration in promoter sequences of the nine genes studied in this work, estimated based on the likelihood ratio test of phyloP for the subtree comparing an alternative model (alt_subscale) with a free scale parameter (alt_scale) within the given REV substitution model (null_scale). Positive scores indicate evolutionary conservation, and negative scores denote evolutionary acceleration, as observed in the IGF2, IGFBP2, IGFBP7, and ZFAT promoters

Discussion

This study investigates the molecular evolution of regulatory regions of genes potentially linked to cetacean gigantism, focusing on the promoter region. We found evidence of enrichment of transcription factor binding sites potentially related to large body size, with distinct patterns between giant and non-giant cetaceans in the IGFBP7 and NCAPG promoters. We also found evidence of acceleration in the IGF2, IGFBP2, IGFBP7, and ZFAT promoters. We will focus our discussion on these five promoters, as the other four (EGF, GHSR, LCORL, and PLAG1) did not yield relevant results for our research question.

Although non-coding regions are often difficult to align and contain many neutrally evolving sites and potentially only a few constrained ones, we obtained high-quality alignments for our promoters, and the resulting phylogenetic trees were consistent with known relationships among species, except for NCAPG. In this case, the phylogenetic signal was strong enough for the odontocete sperm whale (Physeter catodon) to be grouped with the mysticetes, whereas the minke whale, a mysticete, was grouped with the other odontocetes. In this way, two clades of cetaceans were recovered: one that contains only those classified as giants and the other with non-giant cetaceans. The Bayesian approach also resulted in the same clades divided by size. Phylogenetic incongruence between the highly reliable species tree and the promoter tree is a common phenomenon across the Tree of Life, as different regions can have different evolutionary histories [21] due to mechanisms such as incomplete lineage sorting (ILS), introgression, or convergent evolution [22,23,24]. The last one, convergent evolution, could fit the scenario of this work since we have species from two evolutionarily distinct groups (odontocetes and mysticetes) with similar gigantism-related mechanisms. Moreover, as discussed in the following paragraph, the enrichment analysis provides evidence that convergent evolution of this region is a plausible explanation for this case. Regarding the other eight promoters, the recovery of trees consistent with the most accepted phylogenetic hypotheses for the groups included in the study gives us more confidence that we are indeed using a fundamental regulatory region of the genes in our dataset. Additionally, as discussed further, the identification of evolutionarily conserved TFBS across different mammalian groups supports the functional importance and conservation of these regulatory regions.

The analyses implemented in the Ciiider program identified the transcription factor binding sites in the promoters. Subsequently, the enrichment test revealed some patterns in the promoters of our dataset. First, the same transcription factors were found in the same approximate position in different mammalian lineages, demonstrating evolutionary conservation, as is the case for the EGF promoter (Fig. 3). Regulatory elements spatially conserved among different lineages suggest an important biological role, as observed between humans and mice for Cd247, a gene with functional consequences in systemic autoimmunity [25], and in transcription factors related to growth and development in monocot and dicot lineages [26]. Second, the NCAPG and IGFBP7 promoters presented different patterns for giant and non-giant cetaceans. In the NCAPG promoter, this distribution pattern of transcription factors is likely responsible for the phylogenetic signal in the promoter tree discussed before. The sperm whale (Physeter catodon) has the transcription factors TEF and PBX1 in the region between the −1300 and −1200 bp positions, like the giant mysticetes. In contrast, the minke whale has more similarities with smaller odontocetes than with its giant mysticete relatives. Our results suggest that these regions have undergone different selective pressures and that some of the TFBS may have evolved more rapidly in certain lineages. These findings provide further evidence that the NCAPG promoter has experienced unique evolutionary processes that could contribute to the observed incongruence in the phylogenetic tree.

TEF (Thyrotroph embryonic factor) is a protein that belongs to the proline- and acidic amino acid-rich (PAR) bZIP family and is expressed initially in the embryonic anterior pituitary, whereas in adults, it is involved in controlling the cell cycle and the death of hematopoietic cells [27, 28]. These features make TEF a possible tumor suppressor, as demonstrated in bladder cancer (BC), in which its upregulation significantly slowed BC cell growth by inhibiting the G1/S transition via regulation of AKT/FOXOs signaling [28]. Similarly, PBX1 (Pre-B-cell leukemia homeobox 1) is a member of the Three Amino acid Loop Extension (TALE)-class homeodomain family. It is responsible for diverse developmental processes, including skeleton patterning, hematopoiesis, and organogenesis of the pancreas and urogenital system [29,30,31,32,33]. It is also involved in fetal growth through its activity in decidual natural killer (dNK) cells, where it drives transcription of pleiotrophin and osteoglycin. Conversely, PBX1 inactivation in mouse dNK cells impairs fetal development by decreasing growth-promoting factors, resulting in fetal growth restriction [34].

Both TEF and PBX1 are therefore related to general growth processes: TEF through the control of cell proliferation and PBX1 through a direct link to embryonic growth. This highlights the biological relevance of their enrichment only in giant cetaceans, particularly because this enrichment occurs in the promoter of a gene strongly associated with increased body size, such as NCAPG.

The NCAPG (Non-SMC Condensin I Complex Subunit G) gene was previously associated with increased body size and weight gain in horses, donkeys, pigs, humans, and chickens [35,36,37,38,39,40,41]. In bovine species, evolutionarily close to the cetaceans, NCAPG is associated with many essential features such as birth weight, wither height, feeding efficiency, and pubertal growth [42,43,44]. In previous work from our group, which focused on coding regions, evolutionary analyses showed that the NCAPG gene has evidence of positive selection in giant cetaceans [45]. Taken together, our promoter and coding-region results suggest an important role for this gene in cetacean gigantism.

The IGFBP7 promoter also showed a specific triple pattern of a transcription factor shared only by giant cetaceans: PAX2 (Paired Box Gene 2), which is critical during the embryonic development of systems such as the central nervous system (brain and spinal cord), kidney, eye, ear, and urogenital tract [46, 47]. PAX2 deficiency has been associated with various growth defects, such as kidney hypoplasia, optic coloboma, and vesicoureteral reflux [48]. Furthermore, the role of PAX2 in embryo development and oncogenesis suggests that it works as a regulatory factor in cell growth [49, 50]. This feature is similar to that of the IGFBP7 gene, a member of the IGFBP superfamily responsible for the viability of insulin-like growth factors (IGFs), which are molecules involved in promoting cell growth and division [51]. This gene also acts as an oncosuppressor in prostate, breast, lung, and colorectal cancer due to its regulatory action related to cell proliferation, cell adhesion, cell senescence, and angiogenesis [52,53,54]. One of the main challenges of gigantism is the suppression of tumors given the large number of cells. Therefore, mechanisms that mitigate cancerous processes were likely crucial during the evolutionary history of the giants.

The cetaceans not classified as giants in this work comprise Tursiops truncatus, Orcinus orca, Lipotes vexillifer, Phocoena sinus, and Balaenoptera acutorostrata scammoni. In the enrichment analyses performed in Ciiider, only these cetacean species share the transcription factors FOXP3 and ZBTB33 in the NCAPG promoter. The first, FOXP3 (Forkhead box protein P3), is a transcription factor belonging to the forkhead box protein family and may act as a transcriptional activator or repressor [55]. It is also associated with the differentiation and function of regulatory T (Treg) cells, which are responsible for suppressing the activation of other leukocytes and thus contribute to immune homeostasis [56,57,58].

ZBTB33 (Zinc finger and BTB domain-containing 33, also known as Kaiso) exhibits bimodal DNA recognition and acts as a transcriptional repressor or activator depending on the sequence context and cellular phenotype [59]. As a repressor, it recruits other repressors, forming further complexes and helping to dampen the transcription of the target gene by blocking the binding of transcriptional activators [60]. One of the targets of the transcriptional repressor action of ZBTB33 is the Wnt signaling pathway, associated with critical physiological activities such as growth, differentiation, and migration during development [61]. Focusing on growth, Wnt signaling shapes growing tissues while inducing cells to proliferate, acting as growth factors, and directly affecting cellular organization through the cytoskeleton and mitotic spindle [62]. In summary, the presence, only in small cetaceans, of transcription factors that can act as repressors in the promoter of the growth-related NCAPG gene may help explain why these animals did not develop giant sizes.

We found evidence of accelerated evolution in the IGF2, IGFBP2, IGFBP7, and ZFAT promoters. The first three (IGF2, IGFBP2, and IGFBP7) are a group of genes that work together to promote growth. The insulin-like growth factors (IGFs), such as IGF2, are important in somatic growth and cell proliferation and are responsible for fetal and post-natal growth [63]. This action depends on modulation by the insulin-like growth factor binding proteins (IGFBPs), a group that serves as transport proteins for insulin-like growth factors, regulating the bioavailability and function of IGFs [64]. Given this direct growth-promoting action, the evidence of evolutionary acceleration found by phyloP in the promoters of giant cetaceans is consistent with current knowledge about these genes and reinforces their coordinated action. Furthermore, the IGFBP7 coding sequence was also associated with positive selection in an investigation of gigantism in cetaceans [45]. Likewise, the ZFAT gene has been associated with height in multiple human populations and with body size in horses, and has been reported to have crucial roles in the maintenance and differentiation of adipocytes, the number of T cells, and embryonic development [65,66,67,68]. Therefore, these genes are likely associated with growth because they control various aspects of body enlargement and act as tumor suppressors. The remaining promoters (EGF, GHSR, LCORL, NCAPG, and PLAG1) exhibit conservation, as identified by CiiiDER, which found highly conserved patterns in most of the genes of interest, with NCAPG and IGFBP7 showing conservation specifically in cetacean groups. Notably, IGFBP7, which also underwent evolutionary acceleration as detected by phyloP, may be associated with multiple gene functions, including body growth and tumor suppression.

Recent studies in other lineages have also highlighted the importance of regulatory regions in controlling body size. For instance, a deletion in the promoter region of IGF2BP1 has been associated with larger body sizes in chickens [69], and variation in the STAT3 promoter has been shown to contribute to larger body size traits in cattle [70]. Additionally, the control of growth hormone IGF1 protein levels by long non-coding RNA has been implicated in the size of large dogs [71]. These findings, along with our own, underscore the critical role that regulatory regions play in determining size characteristics across diverse taxa. Further studies on the molecular evolution of these regions are needed, and future experimental testing will provide further insights into the regulatory mechanisms underlying body size variation.

Although our study has some limitations, such as the number of genes used, it provides a first step on which other works can build, especially those involving experimental validation. It is far from the definitive answer to a complex question. Still, it could be useful for future research by indicating which genes are possibly related to gigantism in cetaceans and by showing that this phenomenon must be understood in an integrated way.

Conclusions

We investigated the promoter regions of genes possibly associated with increased body size in giant cetaceans. In summary, we found evolutionary conservation and evidence of differential transcription factor enrichment, with distinct patterns between giant and non-giant cetaceans for the IGFBP7 and NCAPG promoters. In NCAPG, we also observed repressive transcription factors only in cetaceans of small body size. Furthermore, evolutionary acceleration was detected in the promoters of the IGF2, IGFBP2, IGFBP7, and ZFAT genes. In conclusion, our study provides evidence on the evolution of cetacean gigantism from a regulatory perspective.

Materials and methods

Sample data

The promoters of nine genes were chosen because they have been described in the scientific literature as associated with changes in body size. EGF (Epidermal Growth Factor), GHSR (Growth Hormone Secretagogue Receptor), IGF2 (Insulin-Like Growth Factor 2), IGFBP2 (Insulin-Like Growth Factor Binding Protein 2), and IGFBP7 (Insulin-Like Growth Factor Binding Protein 7) are part of the growth hormone/insulin-like growth factor (GH-IGF) axis, which plays a central role in regulating growth in vertebrates [72, 73]. LCORL (Ligand Dependent Nuclear Receptor Corepressor Like), NCAPG (Non-SMC Condensin I Complex Subunit G), PLAG1 (Pleomorphic Adenoma Gene 1), and ZFAT (Zinc Finger And AT-Hook Domain Containing) are associated with the body enlargement of species such as cows, pigs, sheep, and goats, which are artiodactyls, evolutionarily close to cetaceans [74, 75]. The sequences of these promoters were retrieved from the Eukaryotic Promoter Database (EPD) of the Swiss Institute of Bioinformatics. First, we located the transcription start site (TSS) for the human species and selected a 1500 bp region upstream of the TSS. Then, the promoter sequences of cetacean and other mammalian species were searched in public databases, such as Ensembl and GenBank (NCBI), using BLAST (Basic Local Alignment Search Tool), which compares nucleotide or protein sequences and calculates the statistical significance of matches, finding regions of similarity among sequences of interest.
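As an illustration of how a promoter window can be defined relative to a TSS, the sketch below converts a TSS position and strand into genomic coordinates. It is not part of the EPD or BLAST workflow: the function `promoter_window` and the 0-based, half-open coordinate convention are assumptions made for the example. The default window (1500 bp upstream, 500 bp downstream) follows the range given in the Results; setting `downstream=0` yields the upstream-only region mentioned above.

```python
from typing import Tuple

def promoter_window(tss: int, strand: str,
                    upstream: int = 1500, downstream: int = 500) -> Tuple[int, int]:
    """Return (start, end) of the promoter in 0-based, half-open coordinates.

    `tss` is the 0-based index of the +1 base. On the minus strand,
    "upstream" lies at higher genomic coordinates, so the window is mirrored.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at the beginning of the sequence; the window is then shorter.
    return max(0, start), end

plus_window = promoter_window(10_000, "+")   # (8500, 10500)
minus_window = promoter_window(10_000, "-")  # (9501, 11501)
```

Both windows span 2000 bp; a sequence extracted from the minus strand would still need to be reverse-complemented before alignment.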

For cetaceans, we used sequences from 13 species: five odontocetes (Tursiops truncatus, Orcinus orca, Lipotes vexillifer, Physeter catodon, and Phocoena sinus) and eight mysticetes (Balaenoptera acutorostrata scammoni, Eschrichtius robustus, Megaptera novaeangliae, Balaenoptera physalus, Balaenoptera musculus, Eubalaena australis, Eubalaena glacialis, and Eubalaena japonica). The sequences for Eubalaena australis and Eubalaena glacialis were retrieved from genomes available on the public platform DNA Zoo. All other cetacean sequences were retrieved from GenBank; accession numbers are listed in Additional file 1: Supplementary Table 1.

Following Lambert et al. 2010 [76], gigantism is attributed to species larger than 10 m. In our dataset, the following species fit this definition: blue whale (Balaenoptera musculus), sperm whale (Physeter catodon), gray whale (Eschrichtius robustus), humpback whale (Megaptera novaeangliae), fin whale (Balaenoptera physalus), South Atlantic right whale (Eubalaena australis), North Atlantic right whale (Eubalaena glacialis), and Pacific right whale (Eubalaena japonica).

In addition to cetaceans, we included 39 other species to represent the major mammalian groups, such as the orders Artiodactyla (Bos taurus, Capra hircus, Bison bison, Odocoileus virginianus, Ovis aries, Sus scrofa, Camelus dromedarius, Camelus ferus, Camelus bactrianus), Carnivora (Panthera leo, Panthera onca, Panthera tigris altaica, Panthera pardus, Felis catus, Prionailurus bengalensis, Canis lupus familiaris, Canis lupus dingo, Vulpes lagopus, Vulpes vulpes, Ursus arctos horribilis, Ursus thibetanus), Primates (Homo sapiens, Pan paniscus, Pan troglodytes, Gorilla gorilla gorilla, Pongo abelii, Callithrix jacchus, Rhinopithecus roxellana, Papio anubis, Nomascus leucogenys, Macaca fascicularis, Macaca mulatta, and Chlorocebus sabaeus), Cingulata (Dasypus novemcinctus), and Chiroptera (Artibeus jamaicensis, Hipposideros armiger, Phyllostomus discolor, Pipistrellus pipistrellus, Rhinolophus ferrumequinum). Thus, the same 52 species were included for each promoter studied.

Phylogenetic reconstructions

The sequences were aligned using the MUSCLE program [77] and visualized in AliView [78]. Phylogenetic trees were then constructed for each promoter using the maximum likelihood approach implemented in IQ-TREE, with the following settings: 1,000 bootstrap replicates to estimate branch confidence, a maximum of 1,000 iterations, 1,000 bootstrap alignments, a perturbation strength of 0.5, an IQ-TREE stopping rule of 100, a minimum correlation coefficient of 0.99, and the substitution model set to “auto”. This entire process was done directly on the IQ-TREE Web Server portal [79]. For the Bayesian analysis, we determined the optimal number of partitions and evolutionary models for each promoter using PARTITION FINDER v2.1.1 [80] under the Bayesian Information Criterion (BIC). Subsequently, Bayesian phylogenetic trees were constructed using MrBayes v3.2.6 [81]. The Markov chain Monte Carlo (MCMC) algorithm was run for 5,000,000 generations with four chains, and trees were sampled every 100 generations, using the molecular evolution model selected by PARTITION FINDER v2.1.1. The resulting trees were visualized using FigTree v1.3.1.

Regulatory regions analyses

Promoter analyses were performed using Ciiider and phyloP tools. Ciiider was used to predict and to analyze transcription factor binding sites within a sequence and identify significantly enriched ones [82]. This is important since over-represented transcription factors are more likely to regulate gene expression that ultimately alters the phenotype [83]. We used scanning and enrichment approaches in Ciiider.

Given a sequence, the scanning step predicts potential transcription factor binding sites in the region of interest. The MATCH algorithm searches for transcription factor binding sites in DNA sequences [84] using a Position Frequency Matrix (PFM). A set of non-redundant profiles derived from experimentally defined transcription factor binding sites for eukaryotes, obtained from the JASPAR database of position matrices for these elements, is used in this work [85]. Since PFMs generally have a highly conserved core-binding region flanked by areas of higher variability, a core PFM is created for the five most conserved consecutive bases. To search for transcription factor binding sites, sequences are divided into overlapping windows of five bases, each of which is compared to the core PFM. If the similarity score between a five-base window and the core PFM meets a defined threshold, the window is extended to the full length of the binding site model, and the similarity score to the full PFM is calculated. The default deficit is 0.15, meaning the scan will accept any predicted binding site with a MATCH score of 0.85 or above [84].
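The two-stage scan described above (core check first, full matrix second) can be sketched as follows. This is a simplified illustration rather than the MATCH implementation used by Ciiider: the similarity score here is a plain min-max normalized sum of base frequencies without the information-content weighting of MATCH, the core threshold of 0.85 is an assumption because the text does not state its value, and `TOY_PFM` is a hypothetical matrix, not a JASPAR profile.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"
Column = Dict[str, float]

def to_frequencies(counts: List[Column]) -> List[Column]:
    """Convert per-position base counts into per-position frequencies."""
    freqs = []
    for col in counts:
        total = sum(col.get(b, 0) for b in BASES)
        freqs.append({b: col.get(b, 0) / total for b in BASES})
    return freqs

def similarity(freqs: List[Column], window: str) -> float:
    """Min-max normalized score in [0, 1] (simplified, unweighted)."""
    current = sum(col[base] for col, base in zip(freqs, window))
    lowest = sum(min(col.values()) for col in freqs)
    highest = sum(max(col.values()) for col in freqs)
    return (current - lowest) / (highest - lowest) if highest > lowest else 1.0

def most_conserved_core(freqs: List[Column], core_len: int = 5) -> Tuple[int, int]:
    """Start and length of the most conserved run of consecutive positions."""
    core_len = min(core_len, len(freqs))
    starts = range(len(freqs) - core_len + 1)
    best = max(starts, key=lambda i: sum(max(c.values()) for c in freqs[i:i + core_len]))
    return best, core_len

def scan(sequence: str, counts: List[Column],
         core_threshold: float = 0.85, deficit: float = 0.15) -> List[Tuple[int, float]]:
    """Return (start, score) for windows passing the core and full-matrix checks."""
    freqs = to_frequencies(counts)
    core_start, core_len = most_conserved_core(freqs)
    core = freqs[core_start:core_start + core_len]
    seq, width, hits = sequence.upper(), len(freqs), []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases
        if similarity(core, window[core_start:core_start + core_len]) < core_threshold:
            continue  # the core must match before the full matrix is scored
        score = similarity(freqs, window)
        if score >= 1.0 - deficit:
            hits.append((start, round(score, 3)))
    return hits

# Hypothetical 6-bp matrix with consensus TGACGT (10 counts per position).
TOY_PFM = [{"T": 10}, {"G": 10}, {"A": 10}, {"C": 10}, {"G": 10}, {"T": 10}]
hits = scan("AAATGACGTAAA", TOY_PFM)  # [(3, 1.0)]
```

With this toy matrix a single mismatch gives a score of 5/6, about 0.83, which falls below the 0.85 cutoff implied by the default deficit, so only exact matches are reported.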

The enrichment approach allows us to identify those transcription factor binding sites that are significantly over- or under-represented in the regions of interest when compared to the background regions used in the analysis. To reduce the possibility of chance findings, we used a comparative background consisting of several other genes provided by the Ciiider program. In short, Ciiider scans the background sequences using the same criteria applied to the sequences of interest. The program determines the over- and under-representation of transcription factors by comparing the number of sequences containing these factors to the number without them, followed by a statistical test such as Fisher’s exact test [84].
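To make the comparison of sequences with and without a given site concrete, the following sketch computes a one-sided Fisher’s exact test for over-representation from a 2 × 2 table. It is an illustration under stated assumptions, not the Ciiider code: the function name is hypothetical, only the over-representation tail is computed, and the counts in the example are invented for demonstration.

```python
from math import comb

def fisher_over_representation(fg_with: int, fg_total: int,
                               bg_with: int, bg_total: int) -> float:
    """One-sided Fisher's exact p-value for over-representation.

    fg_with / fg_total: sequences of interest containing the site / all of them.
    bg_with / bg_total: background sequences containing the site / all of them.
    The p-value is the hypergeometric probability of seeing at least `fg_with`
    site-containing sequences among the sequences of interest.
    """
    total = fg_total + bg_total
    with_site = fg_with + bg_with
    upper = min(fg_total, with_site)
    numerator = sum(
        comb(with_site, k) * comb(total - with_site, fg_total - k)
        for k in range(fg_with, upper + 1)
    )
    return numerator / comb(total, fg_total)

# Invented counts: 3 of 4 sequences of interest and 1 of 6 background sequences.
p_value = fisher_over_representation(3, 4, 1, 6)  # 25/210
```

A two-sided test or a correction for the number of transcription factors tested would be natural extensions, but they are outside the scope of this sketch.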

We used the phyloP tool from the PHAST (Phylogenetic Analysis with Space/Time) package to estimate the molecular evolution rate of the promoters and detect signals of evolutionary acceleration in specific branches [86, 87]. First, we generated a substitution model using the phyloFit program, which fits one or more tree models to multiple alignments of DNA sequences using maximum likelihood. The substitution model used was REV (Reversible Evolutionary Model), the default of phyloFit, which is more realistic and flexible than simpler neutral models and can capture variations in nucleotide substitution rates at different positions in the alignment [88, 89]. Using REV, we calculated conservation and acceleration scores with the “branch” option, which partitions the tree into named branches and tests for conservation/acceleration in the named branches relative to the others. We compared the set of named branches containing giant cetaceans against the remaining species. We selected the LRT option, which compares an alternative model having a free scale parameter with the substitution model, and the CONACC mode, which allows for acceleration as well as conservation, assigning positive values (scores) to indicate conservation and negative values to indicate acceleration.
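The options named above map onto PHAST command-line flags roughly as sketched below. The snippet only assembles argument lists and does not run anything; the helper names, file names, the FASTA alignment format, and the branch labels are assumptions for illustration, and the exact invocation used in this study is not stated in the text. Branch labels must match the node names in the tree file.

```python
from typing import List, Sequence

def phylofit_command(alignment: str, tree: str, out_root: str) -> List[str]:
    """Arguments to fit a REV model; phyloFit writes `<out_root>.mod`."""
    return [
        "phyloFit",
        "--tree", tree,
        "--subst-mod", "REV",
        "--msa-format", "FASTA",  # assumed alignment format
        "--out-root", out_root,
        alignment,
    ]

def phylop_command(model: str, alignment: str, branches: Sequence[str]) -> List[str]:
    """Arguments for an LRT test in CONACC mode on a named set of branches."""
    return [
        "phyloP",
        "--method", "LRT",
        "--mode", "CONACC",
        "--branch", ",".join(branches),
        "--msa-format", "FASTA",  # assumed alignment format
        model,
        alignment,
    ]

# Hypothetical file names and tip labels.
fit_cmd = phylofit_command("NCAPG_promoter.fa", "species_tree.nwk", "NCAPG_rev")
test_cmd = phylop_command("NCAPG_rev.mod", "NCAPG_promoter.fa",
                          ["Balaenoptera_musculus", "Physeter_catodon"])
# With PHAST installed, each list could be passed to subprocess.run(cmd, check=True).
```

Building the arguments as lists rather than shell strings avoids quoting problems with file names and makes the chosen options easy to record alongside the results.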