1 Introduction

The exploration of microbial communities in unique and unexplored environments has become increasingly important in microbial biotechnology for discovering bioactive compounds [1, 2]. Cave ecosystems, which cover 15–20% of the global land area, provide ideal habitats for exploring novel microbes, functional genes, and bioactive compounds [3, 4]. Their extreme conditions include darkness, nutrient limitation, low oxygen levels, high humidity, low temperature, and high mineral concentrations [3,4,5,6].

These harsh conditions have shaped diverse and largely undiscovered microbial communities, making caves a unique resource for exploring novel microbial genomes and functional genes [7,8,9]. However, caves have not received the attention they deserve, and studies on cave microbes often rely on traditional cultivation methods, which can identify only approximately 1% of the microorganisms in caves [10]. Many cave microorganisms cannot be cultured or isolated because their specific living conditions cannot be replicated in a laboratory [11]. Recent studies have shown that metagenomics, enabled by next-generation sequencing technology, offers valuable insights into whole microbial genomes in cave habitats, as well as novel genes and their functions [12,13,14,15].

Various novel microorganisms with functional genes have been discovered in caves worldwide [5, 16]. For instance, through 16S rRNA gene amplicon analysis, Zada et al. [17] identified a bacterial community dominated by Actinobacteria and Proteobacteria in Kashmir Cave, Pakistan. Another study by Belyagoubi et al. [18] reported the presence of Proteobacteria, Firmicutes, Bacteroidetes, and Actinobacteria in Chaabe Cave in western Algeria. A metagenomic analysis by Demirci and Emel [9] revealed 65% Actinobacteria and 31% Proteobacteria in Çal Cave in Trabzon, Turkey.

Ethiopia has numerous hotspot caves, such as Mechara Cave [19], Porc Epic Cave in Dire Dawa [20], and Sof Umer Cave, located east of Ginnir in the Oromia Region of southeastern Ethiopia [21]. Because of their unexplored microbial life and nutrient scarcity, caves are a promising natural source of novel bioactive compounds and offer a new platform for drug, enzyme, and metabolite development.

However, the microbial ecology of Sof Umer Cave, including its microbial diversity and potential genetic resources, remains unexplored. High-throughput shotgun sequencing is an advanced approach for addressing this gap, as it allows microbial diversity, functional genes, and metabolic pathways to be explored. Therefore, this study aimed to explore the microbial diversity of Sof Umer Cave and to model functional gene dynamics using the FBA-built metabolic model, ModelSEED, and MS2-Prokaryotic metabolic model platforms. The findings indicate that Sof Umer Cave is a reservoir for novel microbes and potential genetic resources for the discovery of natural bioactive compounds.

2 Materials and methods

2.1 Description of the study area and sample collection

Sof Umer Cave is located in the Bale district of the Oromia region at an altitude of 1269 m (6°51′0'' to 6°54′ 0''N latitude and 6°51′0'' to 40°51′0''E longitude) (Fig. 1A). Inside the cave, the passages are estimated to reach 60 m in height and 100 m in width in some parts. With an estimated length of 15.1 km, Sof Umer Cave is considered one of the largest caves in East Africa. The Weyib River flows underground through the cave [22]. The minimum temperature of Sof Umer Cave ranges from 19 to 21 °C, while the maximum temperature ranges from 33 to 35 °C throughout the year [22]. The cave is dark, and the river flow keeps humidity high. The soil in the area has a pH of 8.00 to 8.30 and a sandy texture, and the surrounding rock is sedimentary.

Fig. 1

(A) Geographic map of Sof Umer Cave generated with ArcGIS; (B) the specific site where the samples were collected

For this study, sediment and rock samples were collected from Sof Umer Cave under aseptic conditions using sterile polyethylene bags. A total of 1800 g of sample was collected 600 to 1000 m from the cave entrance, at a depth of 0 to 0.5 cm, from the ground, dark zone, and surface of the cave (Fig. 1B and Supplementary File Fig. 1). The collected samples were homogenized, placed in sterile polyethylene bags, kept in an icebox during transportation, and stored at 4 °C prior to laboratory analysis.

2.2 DNA extraction

Total metagenomic DNA was extracted directly from the homogenized sample using the modified 1% CTAB-SDS method adapted from Zhou et al. and Verma et al. [23, 24] and the GeneAll DNA Soil Mini Kit. The integrity and quantity of the extracted DNA were checked via gel electrophoresis (Supplementary File Fig. 2), and the quality was checked with a NanoDrop 3300 Fluorospectrophotometer (Supplementary File Table 1). Extracts with a high nucleic acid concentration (ng/µl) and a standard absorbance ratio (260/280) were pooled prior to shotgun sequencing.

2.2.1 Library generation and sequencing

The DNA was randomly fragmented into pieces of 500 base pairs, end-repaired, and ligated to Illumina adapters. The fragmented DNA underwent size selection, PCR amplification, and purification. A metagenomic library was prepared based on the effective library concentration and the required data volume. The library was assessed using Thermo Fisher Qubit fluorometry, real-time PCR, and a bioanalyzer to determine the size distribution. It was then barcoded, pooled, and shotgun sequenced on one lane of a flow cell using a 150-bp paired-end run on a NovaSeq PE150 instrument (Illumina, Tsim Sha Tsui, Hong Kong).

2.3 Sequence analysis and interpretation

2.3.1 Raw data quality assessment

The quality of the raw data was checked using FastQC (version 0.23.1) to identify low-quality regions within the raw sequencing data [25]. Quality control, host-read removal, and adapter filtering were then performed using Trimmomatic (version 2.2.4) [26], and low-quality reads were removed before subsequent analyses (Supplementary File Fig. 3).

2.3.2 Metagenome assembly

The clean data were subjected to metagenome assembly using MEGAHIT (version 1.2.9) [27]. The assembled metagenomic scaffolds were then subjected to binning using MaxBin2 (version 1.1.1), and scaftigs measuring 500 base pairs (bp) were selected for open reading frame (ORF) prediction using GeneMark.hmm (version 2.1) [28]. These ORFs were dereplicated using CD-HIT (version 4.5.8) [29] to create nonredundant gene catalogs (Supplementary File Fig. 5), with a focus on continuous gene sequences encoding nucleic acids [30]. Clean data were then mapped to the gene catalog using Bowtie2 (version 2.2.4) [31, 32] to quantify gene abundance.
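
The text does not state how mapped reads were converted into gene abundance. One common normalization, shown here as a minimal sketch rather than the authors' procedure, divides the number of reads mapped to each gene by the gene length and then rescales so that the abundances sum to 1. The function name and the toy counts are hypothetical.

```python
def relative_abundance(read_counts, gene_lengths):
    """Length-normalized relative abundance per gene.

    Assumption (not stated in the source): abundance is
    (reads / length) rescaled so that all genes sum to 1.
    """
    coverage = {g: read_counts[g] / gene_lengths[g] for g in read_counts}
    total = sum(coverage.values())
    if total == 0:
        raise ValueError("no reads mapped to any gene")
    return {g: c / total for g, c in coverage.items()}


# Toy example: equal read counts, but gene_b is half as long,
# so its length-normalized abundance is twice that of gene_a.
toy_counts = {"gene_a": 100, "gene_b": 100}
toy_lengths = {"gene_a": 1000, "gene_b": 500}
abundance = relative_abundance(toy_counts, toy_lengths)
```

Normalizing by length keeps long genes from appearing more abundant simply because they recruit more reads.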

2.3.3 Microbial genome and functional gene prediction

Assembled metagenomic sequences were compared against the reference Micro-NR database using DIAMOND (version 2.1.6) [33, 34] to determine the microbial community composition of the sample. Functional annotation of coding sequences was carried out against the KEGG, eggNOG, and CAZy databases to characterize the functional profile encoded within the microbial community [35, 36].
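
As an illustration of how alignment results of this kind are typically reduced to one annotation per gene, the sketch below keeps the highest-scoring hit for each query from BLAST-style 12-column tabular output, which DIAMOND can produce. The e-value cutoff of 1e-5 and the sample records are assumptions for illustration; the source does not report the thresholds that were used.

```python
import csv
import io

# Standard BLAST-style tabular columns (12 fields per hit).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tsv_text, max_evalue=1e-5):
    """Return {query: subject} keeping the highest-bitscore hit per query.

    max_evalue is a hypothetical cutoff, not a value taken from the study.
    """
    best = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) > max_evalue:
            continue
        score = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (hit["sseqid"], score)
    return {query: subject for query, (subject, _) in best.items()}


# Toy records: gene1 has two hits, gene2 only a weak hit that is filtered out.
toy_hits = (
    "gene1\trefA\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-10\t50\n"
    "gene1\trefB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "gene2\trefC\t30.0\t50\t35\t0\t1\t50\t1\t50\t1.0\t20\n"
)
annotations = best_hits(toy_hits)
```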

2.3.4 Functional gene modeling and characterization

MEGAHIT-assembled genes from the metagenomic data, in FASTA format, were uploaded to the FBA-built metabolic model (version 2.2.0) and the MS2-Prokaryotic metabolic model (version 1.0.0). Functional genes associated with the biosynthesis of metabolites and other metabolic pathways were then elucidated using the FBA, ModelSEED, and MS2 platforms. The predicted functional genes were validated by comparison against known gene sequences and biochemical pathways available in the KEGG, eggNOG, and CAZy databases.
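
For readers unfamiliar with flux balance analysis (FBA), its standard textbook formulation is a linear program. The symbols below are notation introduced here for illustration and do not appear in the source: S is the stoichiometric matrix, v the vector of reaction fluxes, c a vector that selects the objective (usually the biomass reaction), and v^min and v^max the flux bounds set by the growth medium and by reaction reversibility.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S\,v = 0, \\
& v^{\min} \le v \le v^{\max}
\end{aligned}
```

The constraint S v = 0 expresses the steady-state assumption that every internal metabolite is produced and consumed at the same rate.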

2.4 Statistical data analysis

Alpha diversity and clustering analyses of the sequence data were performed to assess the microbial community composition and diversity within Sof Umer Cave. Visualization techniques such as heatmaps, alpha diversity plots, and hierarchical clustering were employed to explore patterns of microbial community structure and functional gene distribution.
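
The alpha diversity indices reported later (Shannon, Simpson, and Chao1) can be computed from a vector of per-taxon or per-gene counts. The sketch below uses the standard definitions; it assumes the natural logarithm for Shannon and the Gini-Simpson form (1 minus the sum of squared proportions) for Simpson, because the source does not say which variants were used. ACE is omitted for brevity.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Gini-Simpson index 1 - sum(p^2) (assumed variant)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Chao1 richness estimate from singletons (f1) and doubletons (f2)."""
    observed = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        return observed + f1 * f1 / (2 * f2)
    # Bias-corrected form for the case with no doubletons.
    return observed + f1 * (f1 - 1) / 2


# Toy count vectors, not data from the study.
even_community = [10, 10, 10, 10]
rare_taxa = [1, 1, 2, 5]
```

For a perfectly even community of four taxa, the Shannon index equals ln 4 and the Gini-Simpson index equals 0.75, which gives a quick sanity check on any implementation.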

3 Results and discussion

3.1 Metagenomic data analysis

Illumina sequencing using the NovaSeq PE150 platform (Novogene, Hong Kong) generated a total of 94,834,808 raw reads, each with an average length of 969 base pairs. This number included adapter-related reads (278,048; 0.29%), nonreading reads (218; 0.00%), and low-quality reads (0.00%). After quality control, 94,554,542 (99.70%) sequence reads were found to be effective and suitable for bioinformatics analysis (Supplementary File Table 2).
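
The percentages above follow directly from the reported read counts. The short check below recomputes them; the variable names are illustrative, and the low-quality read count is inferred as the remainder on the assumption that the three filtered categories do not overlap.

```python
# Read counts as reported in the text.
raw_reads = 94_834_808
adapter_reads = 278_048
nonreading_reads = 218
clean_reads = 94_554_542


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


# Assumption: filtered categories are mutually exclusive, so the
# low-quality count is whatever remains after the other two.
low_quality_reads = raw_reads - clean_reads - adapter_reads - nonreading_reads

clean_pct = percent(clean_reads, raw_reads)
adapter_pct = percent(adapter_reads, raw_reads)
low_quality_pct = percent(low_quality_reads, raw_reads)
```

Under that assumption the remainder is small enough to round to the 0.00% given in the text.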

3.2 Microbial diversity and distribution analysis

After host sequence reads were screened and removed, the remaining high-quality reads (97.9%) were subjected to microbial genome analysis using Kraken [37] and the Micro-NR database. As a result, 98% of the total reads were classified into specific kingdoms, while only 80% of the total reads were classified into the bacterial domain (Fig. 2A). The Kraken output was used to generate interactive plots with Krona [38] for intuitive exploration of relative abundance. The analysis indicated that bacteria dominated the sedimentary rocks of Sof Umer Cave, accounting for 96% of the microbial community (Supplementary File Fig. 4).
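
Domain-level fractions such as those quoted above can be read from a Kraken-style report. The sketch below assumes the six-column, tab-separated layout used by Kraken reports (percentage, clade reads, directly assigned reads, rank code, taxonomy ID, indented name) and keeps only the unclassified row (rank code U) and domain rows (rank code D). The sample report is a toy input, not output from this study.

```python
def domain_fractions(report_text):
    """Return {name: percent of reads} for unclassified and domain rows."""
    fractions = {}
    for line in report_text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip malformed lines
        percent, _clade_reads, _direct_reads, rank, _taxid, name = fields[:6]
        if rank in ("U", "D"):
            fractions[name.strip()] = float(percent)
    return fractions


# Toy report with made-up read counts.
toy_report = (
    "  2.00\t20\t20\tU\t0\tunclassified\n"
    " 98.00\t980\t5\tR\t1\troot\n"
    " 80.00\t800\t10\tD\t2\t    Bacteria\n"
    " 64.00\t640\t3\tP\t1224\t      Proteobacteria\n"
    "  1.00\t10\t1\tD\t2157\t    Archaea\n"
)
fractions = domain_fractions(toy_report)
```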

Fig. 2

Taxonomic composition and distribution of microbial genomes at the bacterial domain and actinobacterial levels, visualized using Krona against the Micro-NR database. Notes: Circles from inside to outside represent different taxa, and the area of each sector represents the proportion of the corresponding taxon

The dominance of bacteria in Sof Umer Cave is consistent with the findings of Zada et al. [17], who showed the dominance of bacterial populations in Kashmir Cave in Pakistan. This result is also supported by Demirci and Emel [9], who reported the dominance of Proteobacteria and Actinobacteria in Çal Cave in Turkey. Bacteria in caves are fundamental for nutrient cycling, decomposition, and mineral degradation, as confirmed by De Sena Brandine and Smith [25], and are expected to have the same function in Sof Umer Cave microbiomes.

Shotgun sequence analysis confirmed that 17 distinct bacterial phyla were present in the sedimentary rocks of Sof Umer Cave. Among these, Proteobacteria were the most dominant, making up 64% of the bacterial domain. This phylum is known for its adaptability and metabolic versatility in cave environments, as reported by Zhu et al. [39]. Its high abundance is consistent with its frequent presence in cave microbial communities and indicates key roles in ecological processes such as organic matter degradation, nitrogen cycling, and biosynthesis of secondary metabolites [11, 40, 41].

Actinobacteria were the second most abundant phylum, representing 24% of the overall bacterial domain, with 13% classified as Actinobacteria, 7% remaining unassigned Actinobacteria, and 4% classified as unknown Actinobacteria (Fig. 2B). This suggests that novel, as yet undescribed taxa of Actinobacteria inhabit Sof Umer Cave ecosystems. Actinobacteria, characterized by their ability to produce bioactive compounds [4] and tolerate extreme conditions, commonly inhabit cave ecosystems [42], contributing to nutrient cycling and organic matter decomposition [43].

Furthermore, the analysis revealed that other taxa were present at lower abundances within the Sof Umer Cave environment, including Bacteroidota, Verrucomicrobiota, Chloroflexota, Mucoromycota, Acidobacteriota, Cyanobacteria, Planctomycetota, Rhodothermota, Gemmatimonadota, Myxococcia, Thermomicrobia, and Balneolales (Fig. 3). While these taxa may represent minor components of microbial diversity, their ecological roles and contributions to cave ecosystems merit further investigation. For instance, Cyanobacteria are known for their photosynthetic capabilities [44] and play a part in primary production and nutrient cycling [45], particularly in cave environments with access to sunlight.

Fig. 3

Relative abundance of the bacterial domain in Sof Umer Cave, determined from shotgun sequencing data using Kaiju against the Micro-NR database

In addition to bacteria, the analysis revealed the presence of archaea, which accounted for only 1% of the total microbial diversity in the Sof Umer Cave microbiota (Supplementary File Fig. 4). Archaea are known to inhabit extreme environments, which explains their presence in caves and suggests adaptations to the unique physicochemical conditions of subterranean habitats [46]. Eukaryotes and viruses, although present in lesser numbers (0.2% and 0.03%, respectively), should not be disregarded in the cave microbiome. Eukaryotic microorganisms, such as fungi, protists, and other microscopic organisms, may have unique ecological roles and interactions within cave ecosystems [47]. Similarly, although viruses make up only a small fraction of the community, they play crucial roles in regulating bacterial populations through viral predation and in shaping microbial community dynamics through virus-mediated gene transfer [48].

3.3 Functional gene annotation and predictions

3.3.1 KEGG database-based functional annotation and analysis

The KEGG pathway database was used to analyze functional gene determinants and revealed 45 distinctive metabolic pathway genes in the Sof Umer Cave microbiome (Supplementary File Fig. 6). This finding emphasizes the metabolic diversity and complexity of microbial communities that inhabit Sof Umer Cave. Carbohydrate metabolism genes represented 20.6% of the dataset, showing the importance of these pathways in energy production and biosynthesis in the microbiome of Sof Umer Cave. This finding is consistent with studies emphasizing the importance of carbohydrate metabolism in microbial communities, where energy supplies are often restricted [49].

Amino acid metabolism genes represented 20.03% of the dataset, revealing the importance of these pathways in protein synthesis and cell function. According to a previous study, amino acid metabolism plays an essential role in microbial adaptability to fluctuating environments [6]. Cofactor and vitamin metabolism genes, which account for 11.53% of the dataset, highlight the role of these substances in enzyme catalysis and metabolic regulation. This finding emphasizes the importance of cofactors and vitamins as nutrients necessary for microbial growth and survival and is supported by the results of Shen et al. [50].

The Sof Umer Cave microbiome contains specialized metabolic pathways represented by genes for glycan biosynthesis and metabolism (4.46%), biosynthesis of various secondary metabolites (3.82%), and metabolism of terpenoids and polyketides (3.32%) (Fig. 4). The presence of these pathways among microbial species suggests evolutionary conservation and functional significance under diverse conditions. These findings align with the results reported by Ai et al. [51]. Furthermore, 4.71% of the genes were involved in xenobiotic biodegradation and metabolism. This indicates that Sof Umer Cave microbes are able to degrade environmental contaminants, emphasizing their ecological involvement in detoxification processes. A previous study by Drzewiecka [52] confirmed that xenobiotic biodegradation and metabolism are involved in detoxification processes.

Fig. 4

Relative abundance of putative genes associated with various metabolic pathways, as annotated in the KEGG database (Level 1)

3.3.2 eggNOG database-based functional annotation and analysis

The functional gene annotation against the eggNOG database confirmed the presence of 24 putative genes in the Sof Umer Cave microbiomes (Fig. 5). Approximately 16.89% of the genes examined are involved in energy production and conversion. Energy production and conversion genes play critical roles in microbial survival and adaptability in cave environments [6]. These results are consistent with earlier research reported by Samanta et al. [11], emphasizing the relevance of energy metabolism in microbial communities living in harsh conditions.

Fig. 5

Relative distribution and number of matched genes associated with various metabolic pathways, as annotated in the eggNOG database (Level 1)

Approximately 14.2% of the genes were assigned to inorganic ion transport and metabolism, illustrating the key role of these activities in cellular homeostasis and environmental adaptability. Similarly, 10% of the genes were involved in lipid transport and metabolism, suggesting that the Sof Umer Cave microbiome depends on lipids for energy and carbon. However, 39.13% of the genes remain uncharacterized, implying that the Sof Umer Cave microbiome may harbor unique or unknown genetic functions. These uncharacterized genes may encode unique metabolic and functional adaptations.

3.3.3 CAZy database-based functional annotation and analysis

In the CAZy database analysis, glycoside hydrolases constituted the largest group of identified functional genes, comprising 37.47% of the total. As reported by Veloso et al. [53], these genes play a crucial role in breaking down glycosidic bonds in complex carbohydrates, thereby facilitating essential biological processes such as digestion, cellular signaling, and the remodeling of cell walls. The abundance of these genes in Sof Umer Cave is consistent with that role. Glycosyl transferases accounted for a similarly large portion of the total genes (34.51%). Carbohydrate-binding genes were another significant group, comprising 11.79% of the total genes. These genes are responsible for binding to carbohydrates and other molecules, thereby facilitating a range of cellular processes. In contrast, auxiliary activities constituted only 2.24% of the total. These genes are involved in a variety of noncarbohydrate-related functions, such as protein folding and fatty acid synthesis (Figs. 6 and 7).
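
Class-level percentages like those above are typically obtained by grouping CAZy family identifiers by their prefix. The following sketch, which is not the authors' pipeline, maps family IDs such as GH13 or CBM50 to their class and reports each class as a percentage of all classified genes. The polysaccharide lyase (PL) and carbohydrate esterase (CE) classes are included because they are part of the CAZy classification, although the text does not discuss them. The family list is a toy input.

```python
import re
from collections import Counter

CAZY_CLASSES = {
    "GH": "Glycoside hydrolases",
    "GT": "Glycosyl transferases",
    "PL": "Polysaccharide lyases",
    "CE": "Carbohydrate esterases",
    "AA": "Auxiliary activities",
    "CBM": "Carbohydrate-binding modules",
}

# Family IDs are a class prefix followed by a number, e.g. GH13 or GH5_2.
FAMILY_PATTERN = re.compile(r"(CBM|GH|GT|PL|CE|AA)\d+")


def cazy_class(family_id):
    """Map a family ID to its class name, or None if unrecognized."""
    match = FAMILY_PATTERN.match(family_id)
    return CAZY_CLASSES[match.group(1)] if match else None


def class_percentages(family_ids):
    """Percentage of classified genes that fall in each CAZy class."""
    counts = Counter(cazy_class(f) for f in family_ids)
    counts.pop(None, None)  # drop unrecognized identifiers
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Toy annotations, one family ID per gene.
toy_families = ["GH13", "GH5_2", "GT2", "CBM50", "AA3"]
percentages = class_percentages(toy_families)
```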

Fig. 6

Relative distribution and number of matched genes associated with essential metabolic pathways, as annotated in the CAZy database (Level 1)

Fig. 7

Relative abundance of genes associated with various metabolic pathways, as annotated in the CAZy database (Level 1)

3.4 Functional gene modeling and characterization

A new draft genome-scale metabolic model was constructed based on annotations in the 171692/34/1 genome. The model was gap-filled in 171692/53/1 media to achieve a minimum flux of 0.1 through the biomass reaction. The functional gene model built with the FBA-built metabolic model (version 2.2.0), ModelSEED (version 2.2.0), and MS2-Prokaryotic metabolic model (version 1.0.0) comprised 1742 reactions and 1542 compounds, and 302 new reactions were added during gap filling (Supplementary File Table 3). These 302 gap-filled reactions maintain the metabolic completeness of the model and reflect the need for supplemental pathways to sustain growth. Only one instance of auxotrophy was detected, suggesting a low reliance on external nutrients for growth and potential adaptability to changing environmental conditions. This interpretation agrees with Veloso et al. [53], who related auxotrophy to dependence on external nutrients for growth. These findings are also consistent with those of Zada et al. [40] on the composition and functional profiles of microbial communities in Kashmir and Tiser Caves.
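
To make the idea of gap filling concrete, the toy sketch below adds reactions from a candidate pool until a target compound becomes producible from the medium. It is a deliberately simplified greedy heuristic based on compound reachability, not the optimization procedure used by the modeling platforms named above, and all reaction and compound names are hypothetical.

```python
def producible(reactions, seeds):
    """Compounds reachable from `seeds`; reactions map id -> (substrates, products)."""
    reachable = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if set(substrates) <= reachable and not set(products) <= reachable:
                reachable |= set(products)
                changed = True
    return reachable


def greedy_gapfill(model, universal, seeds, targets):
    """Add candidate reactions until all targets are producible (toy heuristic)."""
    current = dict(model)
    added = []
    while not set(targets) <= producible(current, seeds):
        base = producible(current, seeds)
        best_id, best_gain = None, 0
        for rxn_id, rxn in universal.items():
            if rxn_id in current:
                continue
            # Gain = number of newly reachable compounds if this reaction is added.
            gain = len(producible({**current, rxn_id: rxn}, seeds) - base)
            if gain > best_gain:
                best_id, best_gain = rxn_id, gain
        if best_id is None:
            break  # no candidate helps; the gap cannot be filled from this pool
        current[best_id] = universal[best_id]
        added.append(best_id)
    return added


# Toy network: the draft model lacks the step from B to C.
draft_model = {"R1": (["A"], ["B"]), "R3": (["C"], ["biomass"])}
candidate_pool = {"R2": (["B"], ["C"]), "R9": (["X"], ["Y"])}
added = greedy_gapfill(draft_model, candidate_pool, seeds={"A"}, targets={"biomass"})
```

In this toy network only R2 is added, because it alone connects the medium compound A to biomass. Real gap-filling algorithms also weigh reaction costs and flux feasibility, which this sketch ignores.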

The use of two biomass formulations with different biomass values enables more refined modeling and offers insights into cellular growth and proliferation under various conditions. The FBA-built metabolic model and ModelSEED outlined the metabolic landscape, identifying important routes and areas for improvement. These statistics not only describe the metabolic capacities and requirements of the system but also establish a framework for future optimization and refinement, which is essential for applications ranging from nanotechnology to drug discovery. The output of the FBA-built metabolic model, MS2-Prokaryotic metabolic model, and ModelSEED includes several metabolic pathways and the number of compounds related to each route from various databases, mainly KEGG, eggNOG, and CAZy.

Notably, pathways such as "biosynthesis of secondary metabolites" (map01110) had many more compound matches (1391) than other pathways, illustrating their complexity and importance in cellular operations. Furthermore, pathways such as "arginine and proline metabolism" (map00330) and "lysine degradation" (map00310) had a high number of compound matches, indicating their participation in a variety of metabolic activities within organisms. Surprisingly, routes such as "novobiocin biosynthesis" (map00401) and "anthocyanin biosynthesis" (map00942) are related to the CAZy database, suggesting that they are specific to particular types of substances or biological processes. These data emphasize the diversity of metabolic pathways found in biological systems as well as the value of databases such as KEGG and CAZy in understanding cellular metabolism and biochemical processes (Table 1 and Supplementary File Table 3).

Table 1 Statistics of selected metabolic pathways identified by the FBA-built metabolic model, MS2-Prokaryotic metabolic model, and ModelSEED

3.5 Functional diversity analysis

Among the four databases, Micro-NR exhibited the highest ACE and Chao1 indices (Table 2). This finding suggested that Sof Umer Cave was primarily composed of bacterial communities. The Shannon and Simpson indices, which are frequently used to assess microbial community diversity [54, 55], showed that among the four datasets, the bacterial population in KEGG was the most diverse. While eggNOG displayed less bacterial diversity, the Micro-NR and KEGG databases revealed more diverse bacterial communities than did the CAZy database.

Table 2 Alpha diversity indices of the Sof Umer Cave microbiomes from the four databases

4 Conclusion

The findings provide insight into the microbial diversity and functional gene dynamics of Sof Umer Cave. High-throughput shotgun sequencing and extensive bioinformatic analysis revealed a diverse microbial community consisting of Proteobacteria, Actinobacteria, Bacteroidota, Verrucomicrobiota, Acidobacteriota, and Cyanobacteria. Moreover, functional gene analysis revealed a variety of metabolic pathways, including those for the production of bioactive compounds, and potential genetic resources, with 44,780 genes discovered and a substantial proportion remaining uncharacterized, suggesting the presence of potentially novel genes. The functional gene modeling studies provided essential knowledge of the metabolic pathways present in Sof Umer Cave, resulting in the identification of numerous reactions and compounds, as well as the addition of new reactions via gap filling. Overall, the findings support the importance of Sof Umer Cave as a reservoir for novel microbes and potential genetic resources, revealing new pathways for the discovery of natural bioactive substances. Further study of this exceptional ecosystem has the potential to advance the understanding of microbial ecology while also uncovering significant resources for drug discovery and bioprospecting.