Background

An understanding of the temporal and spatial structures, functions, interactions, and population dynamics of microbial communities is critical for many aspects of life, including scientific discovery, biotechnological development, sustainable agriculture, energy security, environmental protection, and human health (Bucci et al. 2017). Accordingly, several methods (cultivation-dependent and molecular approaches) have been employed to reveal microbial community composition and responses to environmental changes in various environments and in different contexts (Malik et al. 2008; Ligi et al. 2014; Vanwonterghem et al. 2014; Bucci et al. 2014, 2015; Warden et al. 2016; Crescenzo et al. 2017; Monaco et al. 2020).

The cryosphere refers to the portion of the Earth where the water is in solid form as snow or ice, including mountain glaciers, ice sheets, ice shelves, sea, lake or river ice, snow cover, permafrost, and seasonal frozen ground (Xiao et al. 2015). Snow and ice environments cover up to 21% of the Earth’s surface (Maccario et al. 2015): in winter, up to 12% of the planet’s land can be covered by snow (Marshall 2011) while glacial ice extends over approximately 10% of surface, storing 75% of the world’s fresh water (Maccario et al. 2015).

For a long period, frozen environments have been considered to be limiting for the development of life due to their extremely harsh climatic conditions such as low temperatures, low atmospheric humidity, low liquid water availability, and high levels of radiation (Cowan and Tow 2004; Lopatina et al. 2016), and they have received much less attention compared to hot habitats. Nevertheless, over the past 20 years, microorganisms inhabiting the cryosphere have been increasingly studied especially for the potential discovery of enzymes with biotechnological interest and to expand knowledge on the ecology of “extreme” environments (Margesin and Miteva 2011; Arrigo 2014). Snow and ice have unexpectedly high microbial abundance and diversity. Arctic and alpine snow have been intensively investigated (Bachy et al. 2011; Varin et al. 2012; Hell et al. 2013; Maccario et al. 2014; Lopatina et al. 2016) and bacteria belonging to several phylogenetic groups such as Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Firmicutes, Bacteroidetes, and Actinobacteria have been detected (Segawa et al. 2005; Amato et al. 2007; Møller et al. 2013; Maccario et al. 2014; Cameron et al. 2015). On the other hand, studies were also carried out on the surface snow in Antarctica (Carpenter et al. 2000; Brinkmeyer et al. 2003; Christner et al. 2003; Fujii et al. 2010; Lopatina et al. 2013) and revealed the presence of representatives of Proteobacteria, Bacteroidetes, Cyanobacteria, and Verrucomicrobia (Brinkmeyer et al. 2003; Lopatina et al. 2013, 2016).

Unlike Bacteria, Archaea were only rarely observed (Maccario et al. 2015). In Arctic spring snow samples, sequences associated with Thaumarchaeota and Euryarchaeota were detected with a relative abundance below 1% over soil, sea ice, and ice sheets (Møller et al. 2013; Maccario et al. 2014; Cameron et al. 2015). In Antarctic sea ice, Archaea were estimated at up to 6.6% of the microbial community (Cowie et al. 2011).

The main purpose of this research was to analyze and compare the microbial communities of the snow collected at two different locations in Capracotta, a municipality in the Molise region (Southern Italy; Additional file 1), after a record snowfall that occurred on March 5–6, 2015 (256 cm of snow during an 18-hour period) (Teague and Gallicchio 2017). This village, approximately 220 km east of Rome, is known for its dramatic weather, although whether this event was record-breaking is debated (Teague and Gallicchio 2017). Bacterial communities were analyzed by Next-Generation Sequencing (NGS), a molecular tool widely used to investigate many microbial ecosystems (Yang et al. 2016; Bucci et al. 2017), which returns a large amount of data to be further processed and analyzed. In addition, the effectiveness of a global approach to assess diversities and similarities was evaluated in order to summarize the large amount of information derived from NGS, both for evaluation of the biodiversity within microbial communities and for comparison between different microbiotas.

Results

Biomolecular investigations

MiSeq runs produced a total of 297,564 raw reads for Monte Civetta (MC) sample and a total of 81,403 raw reads for Santa Lucia (SL) sample, including V3 and V4 regions of the 16S rRNA gene. The total number of reads that passed quality filtering was 284,844 for MC sample and 75,980 for SL sample. Analyses focused on the Bacteria domain.

A different percentage of unclassified microorganisms was retrieved at each taxonomic level in the two investigated snow microbial communities. Already at phylum level, unclassified sequences amounted to 1% in MC sample and 42% in SL sample, reaching values of ca. 45% (MC) and 73% (SL) at the taxonomic level of species.

Classified OTUs belonged to 23 phyla, 51 classes, 103 orders, 238 families, 656 genera, and 1725 species in MC sample, while in SL sample, there were 21 phyla, 41 classes, 83 orders, 174 families, 418 genera, and 744 species.

As shown in Fig. 1, the five dominant phyla in MC sample were Proteobacteria (37.7%), Firmicutes (25.3%), Bacteroidetes (18.1%), Actinobacteria (12.6%), and Acidobacteria (4.2%), whereas SL sample was mainly composed of Proteobacteria (27.7%), Firmicutes (20.2%), Actinobacteria (4.2%), Tenericutes (3.2%), and Bacteroidetes (2.1%).

Fig. 1

Phylum level microbial community composition. Relative abundance of bacterial phyla in MC and SL snow samples. Phyla with relative abundance < 1% are grouped in the category “Others”

The relative abundance of Acidobacteria in SL sample was 0.2%. The phylum Tenericutes had a much higher relative abundance in SL sample than in MC (0.1%).

At class taxonomic level (Fig. 2a), most of classified reads in MC sample belonged to Alphaproteobacteria (26.0%), followed by Sphingobacteriia (17.2%), Bacilli (16.7%), Actinobacteria (12.1%), and Clostridia (8.4%). On the other hand, SL sample was characterized by Betaproteobacteria (19.5%) and Bacilli (19.1%), which together comprised about 39% of total reads, and Actinobacteria, Gammaproteobacteria, and Alphaproteobacteria with a relative abundance of ca. 4% each. Mollicutes ranked sixth among classified classes in SL sample with a percentage of 3.2%, while they were poorly represented in MC sample (0.1%).

Fig. 2

Class and genus level microbial community composition. a Relative abundance of bacterial classes in MC and SL snow samples. Classes with relative abundance < 1% are grouped in the category “Others”. b Relative abundance of bacterial genera in MC and SL snow samples. Genera with relative abundance < 1% are grouped in the category “Others”

At the taxonomic level of genus (Fig. 2b), Pedobacter, Bacillus, Sphingomonas, and Methylosinus represented the four dominant taxa in MC sample, with a relative abundance of 9.0%, 7.9%, 5.5%, and 3.9%, respectively. Clostridium, Acidisoma, Paenibacillus, and Hymenobacter were fairly represented.

In SL sample, Janthinobacterium and Bacillus predominated with percentages of 14.8% and 10.2%, respectively. Staphylococcus, Candidatus Blochmannia, Paenibacillus, Herminiimonas, Sphingomonas, and Hymenobacter were also found, but with lower relative abundance values (ranging from ca. 1% to ca. 3%).

Furthermore, with a relative abundance of about 3%, Acidisoma tundrae and Pedobacter kwangyangensis were the main known species found in MC sample, followed by Beijerinckia mobilis (1.5%), Sphingomonas oligophenolica (1.2%), Pedobacter cryoconitis (1.2%), Sphingomonas wittichii (1.2%), and Edaphobacter modestus (1.1%).

In SL sample, the most representative species was Janthinobacterium agaricidamnosum with a percentage of 6.0%, followed by Bacillus badius, Candidatus Blochmannia rufipes, and Bacillus smithii (relative abundance > 1%).

However, a high number of classified OTUs (ca. 43% of total reads in MC sample and ca. 16% in SL sample) identified bacterial species that, taken individually, had a relative abundance of less than 1%, suggesting an extreme fragmentation of microbial communities, probably composed of numerous distinct bacterial populations.

Microbial biodiversity assessment

In order to obtain a coherent assessment of microbial diversity and similarity, and to overcome the potential inconsistency that arises when indices refer to different taxonomic levels, the global approach presented in the “A global approach to the analysis of taxonomic data” section was adopted. The assessment followed a compositional data approach (Aitchison 1986) with the counts transformed into proportions: for each taxonomic level j, each category count was divided by the total number of classified individuals.

Two indices were considered at each taxonomic level j. In particular, to evaluate the diversity within each sample, the Shannon entropy (Shannon and Weaver 1963) was computed in the relative form (h) introduced by Pielou (1969):

$$ {h}_j=\frac{-1}{\log \left({k}_j\right)}{\sum}_{i=1}^{k_j}{f}_{ij}\log \left({f}_{ij}\right), $$

where kj denotes the number of recognized taxa and fij the sampling proportion of microbial individuals classified in the category i of the taxonomic level j.
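As a minimal sketch of the relative entropy defined above, the following Python function takes the classified counts of one taxonomic level in one sample. The function name and the input layout are illustrative assumptions, as is the convention of returning 0 when fewer than two taxa are present (the formula is undefined when log(kj) = 0).

```python
import math

def relative_shannon_entropy(counts):
    """Relative (Pielou) Shannon entropy for one taxonomic level.

    counts: classified record counts, one entry per recognized taxon.
    """
    # Only taxa actually observed in the sample contribute to k_j.
    positive = [c for c in counts if c > 0]
    k = len(positive)
    if k < 2:
        # Assumption: with a single taxon the diversity is taken as 0,
        # because log(k) = 0 makes the ratio undefined.
        return 0.0
    total = sum(positive)
    # Compositional step: counts are turned into proportions f_ij.
    proportions = [c / total for c in positive]
    return -sum(f * math.log(f) for f in proportions) / math.log(k)
```

A perfectly even community gives a value of 1, while a community dominated by one taxon gives a value close to 0, which is what makes the index comparable across levels with different numbers of taxa.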

In addition, to assess the similarity between the two samples, the percentage model affinity (pma) index was computed, as introduced by Novak and Bode (1992) and investigated in terms of statistical properties by Ärje et al. (2016):

$$ {pma}_j=1-\frac{1}{2}{\sum}_{i=1}^{k_j}\mid {p}_{ij}-{q}_{ij}\mid, $$

where pij and qij denote the sampling proportions of microbial individuals classified in the category i of the taxonomic level j, referred to the two samples, respectively.
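The pma index can be sketched in a few lines of Python. The function name and the dictionary input (taxon label mapped to classified count) are assumptions made for illustration; taxa missing from one sample are treated as having proportion 0 there.

```python
def percentage_model_affinity(counts_a, counts_b):
    """pma between two samples at one taxonomic level.

    counts_a, counts_b: dicts mapping taxon label -> classified count.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    # The sum runs over every taxon recognized in either sample.
    taxa = set(counts_a) | set(counts_b)
    distance = sum(
        abs(counts_a.get(t, 0) / total_a - counts_b.get(t, 0) / total_b)
        for t in taxa
    )
    return 1.0 - 0.5 * distance
```

Two samples with identical proportions give pma = 1, and two samples sharing no taxa give pma = 0.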

Diversity

In Additional file 2, the weights wj normalized to unity (first and third columns), and the values of the relative entropy hj (second and fourth columns) are reported for each sample, while in Fig. 3, the respective plots of the relative Shannon entropy are shown.

Fig. 3

Shannon entropy. Plots of the relative Shannon entropy (hj) for MC (on the left) and SL (on the right) samples

The global entropy (hg) was computed from the data in Additional file 2 for both the samples:

$$ {h}_g={\sum}_{j=1}^J{w}_j{h}_j. $$

The obtained values, hg = 0.520 for MC and hg = 0.365 for SL, showed that MC sample presented a higher level of biodiversity than SL sample. Furthermore, from the data in Additional file 2, it is interesting to notice that in MC sample, the species level presented the highest weight, with a value equal to 0.301, while in SL sample, the most informative level was the genus level, with the respective weight equal to 0.707.

Similarity

With respect to the assessment of the microbial similarity between the two samples, the construction of the respective weights is different because it is necessary to combine the scaled abundance and the scaled richness of both samples. Therefore, in this case, the mean and standard deviation were computed with respect to the values νj and κj of both samples. In Additional files 3 and 4, all the quantities necessary for the construction of the weights are reported: classified records nj, unclassified records uj, recognized taxa kj, scaled abundance νj, scaled richness κj, mean μj, standard deviation σj, and weight wj.

In Additional file 5, the weights wj and the values of the percentage model affinity (pmaj) index are reported, with the respective plots shown in Fig. 4. The global similarity index (pmag) is then given by

$$ {pma}_g={\sum}_{j=1}^J{w}_j{pma}_j, $$

with value pmag = 0.472, which denotes a relevant degree of microbial dissimilarity between the two samples.

Fig. 4

Taxonomic weights and trend of the percentage model affinity (pma) index. Plots of the taxonomic weights for combined samples MC and SL (on the left) and of the similarity between the two samples (on the right)

Discussion

This study represents one of the few works concerning the analysis of the snow microbiota in the Apennines, and its relevance is also related to the extraordinary snowfall event that took place in the municipality of Capracotta in 2015. The exceptionally adverse weather conditions severely hampered movement within the study area and prevented the experimental plan from being carried out optimally, particularly with regard to the small sample size. Nonetheless, the research constitutes a preliminary step toward further and more accurate investigations of an environmental context that has so far been poorly characterized.

Snow microbial communities were analyzed and compared by using NGS and a specific statistical approach for data interpretation, here described for the very first time. The assessment of biological diversity and similarity is relevant in many environmental studies. Indeed, biodiversity trends in time and space may often suggest environmental dynamics or changes in the status of ecosystems (Spellerberg 2005). In particular, in microbial studies, biodiversity assessment is important to describe communities and formulate hypotheses about the potential relations with the environments where the communities are observed. When information is obtained from genetic sequencing technologies, microbial data usually consist of taxonomic counts observed from one or more samples of the community of interest, and referred to every level of the taxonomic hierarchy, from kingdom to species. In general, when shifting from kingdom to species along the taxonomic hierarchy, the data are typically characterized by slightly decreasing levels of abundance (nj) and significantly increasing levels of richness (kj). Therefore, diversity and similarity indices are generally more informative when computed at the lowest level of the hierarchy, the species level for instance.

Unfortunately, this is not the case with microbial data, for which the possibility to detect a large number of unclassified records (uj) at every level of the hierarchy (even at the top) is realistic, due to the partial knowledge about the cultivable fraction of microbial communities and to the limitations of NGS technologies, caused chiefly by shorter read length and impacting on precision of species identification (Bukin et al. 2019). Therefore, in microbial studies, the species level is not necessarily the most informative. In these situations, to obtain a more consistent assessment of the diversity and similarity coherently with the taxonomy, a global approach is required: it is necessary to consider information from the whole taxonomic hierarchy and not from only one specific level (the species level for instance).

NGS results were in agreement with the data available in the scientific literature concerning the composition of snow microbial communities at phylum level, with Proteobacteria, Actinobacteria, Bacteroidetes, and Firmicutes representing the main groups found in these ecosystems (Liu et al. 2006; Zhang et al. 2010; Michaud et al. 2014; Mortazavi et al. 2015). These phyla include different bacterial genera and species with an extreme metabolic diversity and the ability to adapt to snow and ice environments, such as Polaribacter, Psychroflexus, and Pedobacter. Some have the ability to form endospores, which confers an important selective advantage in terms of adaptation to harsh environmental conditions.

Nevertheless, biomolecular investigations and subsequent statistical data analysis showed relevant differences in terms of biodiversity, composition, and distribution of bacterial species between the studied snow samples. Indeed, the value of the global similarity index (pmag) between MC and SL samples was 0.472. The standard threshold of 0.7 discussed in Ärje et al. (2016) is regarded as the critical level below which two communities are not considered similar in composition. Therefore, the observed value pmag = 0.472 denotes a clear level of dissimilarity.

After family (pma = 0.323), genus was the taxonomic level in which the main differences between MC and SL snow samples were concentrated (pma = 0.336). These differences must be sought in the percentage at which the different genera occur.

Furthermore, in accord with the global approach based on the weighted averaging, the data analysis showed that species and genus were the most informative taxonomic levels (in MC and SL samples, respectively).

In general, NGS analysis showed the presence of genera comprising bacterial species adapted to thrive at low temperatures and typical of snow/ice ecosystems, thus, probably constituting a resident microbiota (Sphingomonas, Methylobacterium, Acidisoma, Janthinobacterium, Paenibacillus, Hymenobacter), and genera including microorganisms whose presence could be justified by wind-transport (transient species) such as Beijerinckia, Deinococcus, Geodermatophilus, Maricaulis, Marinibacillus, Marinitoga, Marinobacter, Marinobacterium, Marinococcus, and Marinomonas.

Sphingomonas species can tolerate intense radiation, drying, and low concentrations of nutrients (Liu et al. 2006). Several studies highlighted the presence of bacteria belonging to this genus in different cold environments such as the Arctic snow, the Tibetan snow and glaciers, and the Antarctic snow and soils (Christner et al. 2002; Segawa et al. 2005; Liu et al. 2006; Miteva 2008; An et al. 2010; Xiang et al. 2010; Lopatina et al. 2013; Michaud et al. 2014; Mortazavi et al. 2015). Methanotrophic bacteria belonging to the genus Methylobacterium are often found in cold environments in a metabolically active form (Lopatina et al. 2013), as are species belonging to the genus Acidisoma (Acidisoma tundrae and Acidisoma sibiricum), which are psychrotolerant and moderately acidophilic bacteria capable of growth at 2–30 °C and pH 3.0–7.6 (Belova et al. 2009).

With reference to other genera commonly found in cold environments and in the snow (Rainey et al. 2005; Zhang et al. 2010; Chuvochina et al. 2011; Ivy et al. 2012; Lee et al. 2014; Mortazavi et al. 2015) and retrieved in samples collected in Capracotta, Paenibacillus includes widely distributed psychrotolerant spore-forming bacteria, previously isolated from Arctic snow samples (Ivy et al. 2012; Mortazavi et al. 2015), whereas the ability of Hymenobacter to withstand high doses of ionizing radiation represents a selective advantage that increases cell survival chances, not only during wind-mediated transport but also during exposure of the snow to intense UV radiation (Rainey et al. 2005; Zhang et al. 2010; Chuvochina et al. 2011; Lee et al. 2014).

Concerning transient microorganisms present in snow environments and probably transported by winds, one of the main genera retrieved in MC sample was Beijerinckia (with Beijerinckia mobilis in third place among classified species). It includes mainly soil bacteria widely distributed in the acid tropical soils of equatorial Africa, Southeast Asia, and South America, but it has been found only sporadically in temperate and subtropical areas. Beijerinckia spp. are able to grow in a temperature range between 10 and 35 °C, with optimal values between 20 and 30 °C. However, Beijerinckia cells resist freezing (Becking 1959, 1961, 1981). Therefore, considering the geographical distribution of this genus, its presence in MC and SL samples could find an explanation in a long distance transport and deposition during snowfalls.

In addition, the genera Deinococcus and Geodermatophilus were also found, consistent with results reported in several scientific works (Carpenter et al. 2000; Chuvochina et al. 2011; Michaud et al. 2014; Mortazavi et al. 2015). Geodermatophilus obscurus, the type species of the genus Geodermatophilus, grows in the temperature range 18–37 °C, with an optimum of 24–28 °C (Ivanova et al. 2010). These temperatures are much higher than the average temperatures recorded in Capracotta, especially in wintertime. Therefore, here too, it is possible to assume that these microorganisms, which are not cold-adapted species, could be part of a transient microbiota in the snow.

It is generally recognized that microorganisms normally associated with other habitats (e.g., marine species, mesophiles, and even thermophiles) have often been found in snow and ice ecosystems (Liu et al. 2006; Lopatina et al. 2013). This statement is further supported by the presence of bacteria typically associated with marine habitats such as Maricaulis, Marinibacillus, Marinitoga, Marinobacter, Marinobacterium, Marinococcus, and Marinomonas. Their recovery could be related to the well-known orographic lift: the cold air coming from the North-East, while crossing the Adriatic Sea, picks up humidity and microorganisms before reaching the internal mountainous areas of the Molise Apennines, where it gives rise to heavy snowfall (Stocchi and Davolio 2017) and influences the microbiota composition. These findings are in line with the results of Harding et al. (2011), who applied molecular, microscopic, and culture techniques to characterize the microbial communities in snow and air at remote sites in the Canadian High Arctic. Microorganisms retrieved in cold environments such as Antarctica, the Tibetan Plateau, and alpine regions of Japan, Europe, and North America were found together with microbes coming from diverse biomes in the coastal Arctic, contributing to local inocula in the snow. These results revealed the importance of aerial transport as a major route enabling microbes to colonize different habitats.

Conclusions

In conclusion, the results obtained have shown that the snow microbial communities retrieved in SL and MC samples differ relevantly from each other, although they are represented by bacterial phyla normally associated with cold environments. The reasons could be found in the different location, altitude, and also tourist use of the sampling sites. In fact, unlike the Santa Lucia sampling site, an unfrequented area, Monte Civetta was a place where thousands of people streamed in the winter season to reach the alpine ski facilities of Monte Capraro, which host some refreshment points, and it is known that microbial communities are generally the first responders to environmental chemical parameters/environmental perturbation (Bergk Pinto et al. 2019). An extreme fragmentation was suggested by the presence of several bacterial populations, each representing a small fraction of the whole microbial community. In addition, the presence of DNA from various microorganisms typically associated with other habitats demonstrates the strong potential of wind-mediated transport in shaping the microbiota composition, enriching the normal resident bacterial communities with transient microorganisms. Although much of the biology and ecology of snow is still unknown, it is common knowledge that seasonal snowpack can influence the local climate, underlying soil, and adjacent ecosystems. For example, by regulating freeze-thaw events, the extent and duration of snow cover can affect soil microbial community composition, microbial-mediated soil nitrogen (N) cycling, and greenhouse gas exchange with the atmosphere. Accordingly, our exploratory approach and results can be used as a starting point to develop further investigations in the Apennines, useful to address several research questions on the microbial ecology of these peculiar environments.

Methods

Study area

Capracotta is a small mountain village of about 900 inhabitants in the Molise region (Southern Italy; Additional file 1). It is the second highest municipality in the Apennines at 1421 m above sea level (a.s.l.).

In winter, snowfalls are frequent and abundant with temperatures that can drop down to several degrees below 0 °C whereas summers are mild.

Snow samples for molecular analyses were collected at Monte Civetta (MC), an area located at 1650 m a.s.l. close to the alpine ski facilities of Monte Capraro, and at Santa Lucia (SL; 1550 m a.s.l.), located at the base of Monte Campo, the highest mountain in the territory of Capracotta.

Sampling

Samples from each study site were obtained after the record snowfall event (March 7, 2015) by mixing snow collected at 6 points about 1 m apart, within an area of 15 m² (5 m × 3 m). The surface snow layer was removed using a sterile spoon to eliminate deposited coarser particles such as dust and plant material (Amato et al. 2007). Collection was performed with a sterile plastic shovel into 6 sterile polyethylene containers (2 l) at each site (Harding et al. 2011). Sterilized suits and gloves were worn during sample collection to minimize contamination (Lopatina et al. 2013). Subsequently, the 6 subsamples from Monte Civetta and Santa Lucia were transported to the laboratory, where the snow was melted over a period of ca. 12 h and occasionally mixed in order to maintain a homogeneous water temperature of ≤ 4 °C (Harding et al. 2011; Lopatina et al. 2013). After melting, 1 l of water from each subsample was measured in a sterile graduated cylinder and added to a single sterile polyethylene container (10 l), so that a single 6 l water sample was generated for each sampling site (Monte Civetta and Santa Lucia). The two samples were filtered by using a membrane filtration system.

Biomolecular investigations

DNA extraction

For each sample, 6 l of melted snow were filtered through mixed cellulose ester membrane filters (S-Pak™ Membrane Filters, 47-mm diameter, 0.22-μm pore size, Millipore Corporation, Billerica, MA, USA). Filters were stored at −80 °C until nucleic acid extraction. Total genomic DNA was extracted using the PowerWater® DNA Isolation Kit (MO BIO Laboratories, Inc., Carlsbad, CA, USA) and used for the Next-Generation Sequencing analysis.

Dual index 16S rRNA gene amplicon library preparation and bioinformatics analysis

NGS sequencing protocol was performed at BMR Genomics srl (Padova, Italy). For each snow sample, the V3–V4 regions of the 16S rRNA gene were amplified using the primers 331F and 797R (Nadkarni et al. 2002).

The primers were modified with a forward overhang (5′-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[locus-specific sequence]-3′) and a reverse overhang (5′-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-[locus-specific sequence]-3′), which were necessary for dual index library preparation.

The library was run on the Illumina MiSeq (San Diego, California, USA) using the 2 × 300 bp paired-end approach.

The classification step used ClassifyReads, a high-performance implementation of the Ribosomal Database Project (RDP) Classifier described in Wang et al. (2007). This process involved matching short subsequences of the reads (called words) to a set of 16S rRNA gene reference sequences (the taxonomic database used was an Illumina-curated version of the May 2013 release of the Greengenes Consortium Database). The accumulated word matches for each read were used to assign reads to a particular taxonomic classification. Forward and reverse strands were aligned independently in paired-end runs. Read stitching was not performed, but a cluster was retained only if both of its reads were classified to the same taxonomy. Analyses focused on the Bacteria domain. Therefore, sequences assigned to viruses and to microorganisms belonging to other domains, such as Archaea, were excluded from the subsequent investigation. 16S rRNA gene sequences generated in this study are deposited in the NCBI Sequence Read Archive under the accession number PRJNA563617.
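The requirement that both reads of a cluster agree on the taxonomy can be illustrated with a short Python sketch. This is not the ClassifyReads implementation: the function name, the dictionary inputs (cluster identifier mapped to an assigned taxonomy string, or None when unassigned), and the example labels are all hypothetical.

```python
def concordant_assignments(forward, reverse):
    """Keep clusters whose forward and reverse reads share one taxonomy.

    forward, reverse: dicts mapping cluster id -> taxonomy label (or None).
    """
    kept = {}
    for cluster_id, label in forward.items():
        # A cluster is retained only when both reads were classified
        # and the two classifications are identical.
        if label is not None and reverse.get(cluster_id) == label:
            kept[cluster_id] = label
    return kept

# Hypothetical example: only cluster "c1" has matching assignments.
forward_calls = {"c1": "Bacillus", "c2": "Pedobacter", "c3": None}
reverse_calls = {"c1": "Bacillus", "c2": "Sphingomonas", "c3": None}
retained = concordant_assignments(forward_calls, reverse_calls)
```

In this example, "c2" is dropped because the two reads disagree and "c3" is dropped because neither read was classified.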

A global approach to the analysis of taxonomic data

Introducing some mathematical notation, the taxonomic levels are here denoted by the index j, with j = 1,…,J, corresponding to the standard hierarchy: kingdom, phylum, class, order, family, genus, and species (hence J = 7). Then, for each level j, nj and uj indicate the numbers of classified and unclassified records, respectively, while kj denotes the number of recognized taxa. For data collected from m distinct samples of the community of interest, nj and uj are obviously the sum over the m samples of the respective quantities, while kj is the total number of the recognized taxa in all the m samples, for each taxonomic level j = 1,…,J. Therefore, from the calculation point of view, distinct samples are considered as pooled.

In this section, a specific procedure based on the averaging method is presented. In particular, considering a generic biodiversity indicator of interest I, with the respective values Ij computed from the data at each level j of the taxonomic hierarchy, the procedure consists of averaging those values Ij with specific weights wj determined by the amount of information present in the respective taxonomic level j.

Indeed, at each level j of the hierarchy, two relevant quantities are always available: the abundance, in terms of classified records nj, and the richness, in terms of recognized taxa kj. These parameters represent the degree of statistical consistency and the degree of microbial variety of the data related to each taxonomic level. Therefore, it would be natural to define each weight by combining these two types of information, for example through the arithmetic mean of the respective scaled values (denoted here by νj and κj, respectively). The scaling is necessary because abundance and richness generally have different magnitudes.

Furthermore, the more similar the scaled values, the more coherent the respective weight. Therefore, in addition to the average of νj and κj, each weight should also account for their precision, such as the inverse of the standard deviation for instance. Hence, in order to account for both average and precision, for each level j, the respective weight wj was defined in terms of the ratio between the arithmetic mean and the standard deviation of the scaled values νj and κj.

To show the construction of our proposal, an example with a generic indicator I is presented in the following scheme.

For each taxonomic level j of the hierarchy:

  • Compute the scaled values νj and κj as

$$ {\nu}_j=\frac{n_j}{n_M}\quad \mathrm{and}\quad {\kappa}_j=\frac{k_j}{k_M}, $$

where nM = max{nj; j = 1, ..., J} and kM = max{kj; j = 1, ..., J} are the respective maxima across the hierarchy;

  • Compute the mean μj and the standard deviation σj of the scaled values νj and κj;

  • Define each weight as the ratio \( {w}_j=\frac{\mu_j}{\sigma_j} \);

  • Normalize the weights to unity;

  • Compute the global value Ig as the weighted average

$$ {I}_g={\sum}_{j=1}^J{w}_j{I}_j. $$

The system of weights obtained through this procedure defines each weight as the relative precision (the inverse of the coefficient of variation) of the information included in each taxonomic level j.
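The scheme above can be sketched in Python for a single sample. The function names are illustrative, and the use of the sample standard deviation (statistics.stdev) is an assumption: with two scaled values per level, the sample and population forms differ only by a constant factor, which cancels when the weights are normalized. The sketch raises an error when νj = κj, because the ratio μj/σj is then undefined.

```python
import statistics

def taxonomic_weights(abundance, richness):
    """Normalized weights w_j from classified records n_j and taxa k_j.

    abundance, richness: lists with one entry per taxonomic level.
    """
    n_max, k_max = max(abundance), max(richness)
    raw = []
    for n_j, k_j in zip(abundance, richness):
        scaled = [n_j / n_max, k_j / k_max]  # nu_j and kappa_j
        mu = statistics.mean(scaled)
        sigma = statistics.stdev(scaled)  # assumption: sample form
        if sigma == 0:
            raise ValueError("nu_j equals kappa_j: weight is undefined")
        raw.append(mu / sigma)  # inverse coefficient of variation
    total = sum(raw)
    return [w / total for w in raw]  # weights normalized to unity

def global_index(weights, values):
    """Weighted average I_g of the level-wise indicator values I_j."""
    return sum(w * v for w, v in zip(weights, values))
```

Given the per-level values of any indicator (for instance the relative entropies hj), `global_index(taxonomic_weights(n, k), h)` returns the corresponding global value.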

This approach can be easily generalized with biodiversity indices, which consider more than one sample of data, such as the similarity indices. Indeed, in this case, at every level j of the hierarchy, it is simply necessary to extend the computation of the mean and standard deviation to the values obtained from all the samples.
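One possible reading of this generalization is sketched below in Python: the abundance and richness of each sample are scaled by that sample's own maxima, and the mean and standard deviation at each level are taken over all the scaled values from all samples (four values per level for two samples). The per-sample scaling, the function name, and the input layout are assumptions made for illustration, not details confirmed by the text.

```python
import statistics

def combined_weights(samples):
    """Normalized weights for indices that compare several samples.

    samples: list of (abundance, richness) pairs; each element of a pair
    is a list with one entry per taxonomic level.
    """
    n_levels = len(samples[0][0])
    pooled = [[] for _ in range(n_levels)]
    for abundance, richness in samples:
        # Assumption: each sample is scaled by its own maxima.
        n_max, k_max = max(abundance), max(richness)
        for j in range(n_levels):
            pooled[j].append(abundance[j] / n_max)
            pooled[j].append(richness[j] / k_max)
    # Ratio mean / standard deviation over all scaled values per level;
    # this fails with ZeroDivisionError if all values at a level coincide.
    raw = [statistics.mean(v) / statistics.stdev(v) for v in pooled]
    total = sum(raw)
    return [w / total for w in raw]
```

The resulting weights can then be combined with the level-wise similarity values (for instance pmaj) through the same weighted sum used for single-sample indices.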

In Fig. 5, the plots of the scaled abundance and the scaled richness are reported for each taxonomic level. These scaled values are necessary to calculate the taxonomic weights, through their mean and standard deviation, for each taxonomic level and for each sample. In Fig. 6, the system of taxonomic weights with values normalized to unity is shown for both MC and SL. Each weight represents the importance of the corresponding taxonomic level in terms of relative abundance and richness.

Fig. 5

Scaled abundance and richness. Plots of the scaled abundance (νj) and richness (κj) for both the sampling areas: MC (on the left) and SL (on the right)

Fig. 6

Weights normalized to unity. Plots of the weights (wj) normalized to unity for both the sampling areas: MC (on the left) and SL (on the right)

Currently, the assessment of diversity and similarity in biological communities with DNA data is often based on the use of clustering methods that directly process the DNA barcodes (Hebert et al. 2003; Jin et al. 2013). After the clusters are obtained, diversity and similarity indices are computed with respect to this clustering structure of the data. This approach is useful to avoid the problem of unclassified DNA sequences, since the clustering of DNA barcodes is independent of any taxonomic classification. Although the DNA barcoding approach is very useful to identify new species groups and assign unknown individuals to clusters (Savolainen et al. 2005), it strongly depends on the choice of the distance or metric used in the clustering algorithm, with potentially relevant effects on the numerical values of the diversity and similarity indices. The approach introduced in this work produces a global assessment of diversity and similarity that considers the information in the taxonomic structure of the data, that is, how the data are distributed along the different levels of the taxonomic hierarchy, combined with the respective relevance of each taxonomic level in terms of abundance and richness. This aspect is missing in the barcoding approach, as the clustering computation is based only on the DNA sequences, without the taxonomic labels. Therefore, the method proposed in this article allows the calculation of any diversity or similarity indicator while preserving the taxonomic structure of the data. It represents an additional tool that can be usefully combined with other existing approaches in order to obtain a more complete evaluation of diversity and similarity in biological communities, offering another perspective on biodiversity for 16S rDNA NGS data.