Measuring diversity from space: a global view of the free and open source rasterdiv R package under a coding perspective

The variation of species diversity over space and time has been widely recognised as a key challenge in ecology. However, measuring species diversity over large areas might be difficult for logistic reasons related to both time and cost savings for sampling, as well as accessibility of remote ecosystems. In this paper, we present a new R package - rasterdiv - to calculate diversity indices based on remotely sensed data, by discussing the theory behind the developed algorithms. Obviously, measures of diversity from space should not be viewed as a replacement of in situ data on biological diversity, but they are rather complementary to existing data and approaches. In practice, they integrate available information of Earth surface properties, including aspects of functional (structural, biophysical and biochemical), taxonomic, phylogenetic and genetic diversity. Making use of the rasterdiv package can result useful in making multiple calculations based on reproducible open source algorithms, robustly rooted in Information Theory.


Introduction
Back in 1872, Ludwig Eduard Boltzmann (Boltzmann 1872) introduced the first measure of entropy, later called marginal entropy and restructured by Claude Elwood Shannon under a mathematical theory umbrella (Shannon 1948). As such, it became one of the cornerstones of ecological theory and was adopted widely in ecological practice for measuring biodiversity and its change. Concerning biological entropy, the variation of species diversity over space and time has been widely recognised as a key challenge in ecology and was associated with analytic geometric models focusing either on the spatial component of species dispersal (Palmer 2007;Gorelick 2008) or on environmental drivers (Kreft and Jetz 2007).
To address this issue, many spatio-statistical models have been proposed to model biological entropy using data from Elisa Thouverai and Matteo Marcantonio are equally contributed to the manuscript. ecological surveys (Bachl et al. 2019). However, the statistical clarity (sensu Dushoff et al. 2019) of such models strictly depends on a high in situ data uncertainty, which propagates through all inferential steps (Meyer et al. 2016;. Furthermore, measuring species diversity over wide areas might be difficult for logistic reasons related both to time and sampling costs (Chiarucci et al. 2011;Hernandez-Stefanoni et al. 2012) and to theoretical and practical constraints, which are mainly related to two sources of uncertainty. The first is the uncertainty associated with the detectability and the determination of individual plants or animals species. The second is the one linked to different sampling strategies (McGlinn and Palmer 2009) or efforts  per area, or, in the worst case, to the impossibility of getting information about the real grain (sensu Scheiner et al. 2000) sampled (Hobohm 2003). In the absence of such information, it becomes excessively challenging to properly address the modifiable areal unit problem (MAUP), which in this case is the sensitivity of biodiversity to scale (Jelinski and Wu 1996). This is true, even though evidence exists for a chance to rely, in some instances, on expert knowledge to build straightforward and robust diversity maps worldwide (Hobohm et al. 2019).
Accordingly, algorithms based on remote sensing and spatial ecology might help estimating the variation of biodiversity over space and time (Skidmore et al. 2011;Schimel and Scheiner 2019) and represent a powerful first exploratory tool to detect the spatial variability across the landscape. The relationship between ecological processes (and functions) and the remotely sensed diversity can rely on the definition of niche proposed by Kroes (1977), and according to which a niche is the biotic structural and functional part of the ecosystem. Strictly speaking, such definition can be profitably used to measure spatial heterogeneity in ecosystems in order to convey information on their potential functions (Schneider et al. 2017).
From this point of view, the development of Free and Open Source algorithms to measure diversity from space would be beneficial to allow high robustness and reproducibility of the proposed approaches (Rocchini and Neteler 2012). Furthermore, their intrinsic transparency, community-vetoing options, sharing and rapid availability are also valuable additions and reasons to move to open source options. Among the different open source packages, the R software environment is certainly one of the most widespread worldwide and different packages have been devoted to remote sensing for: (i) raster data management (raster package, Hijmans and van Etten 2020), (ii) remote sensing data analysis (RStoolbox package, Leutner et al. 2019), (iii) spectral species diversity (biodivMapR package, Féret and Boissieu 2020), (iv) sparse generalised dissimilarity modelling based on remote sensing data (sgdm package, Leitao et al. 2012), (v) entropy-based local spatial association (ELSA package, Naimi et al. 2019), (vi) landscape metrics calculation (landscapemetrics package, Hesselbarth et al. 2019), to name just a few. Reader can also refer to https:// cran.r-proje ct. org/ web/ views/ Spati al. html for the CRAN Task View on analysis of spatial data.
However, currently no package provides a flow of functions grounded on Information Theory related to abundance based measures, by further introducing distances and going back to Information Theory again by generalised entropy. In this paper, we introduce a new R package which provides such a functions' throughput workflow. The aim of this manuscript is to encompass the theory behind the algorithms developed in the rasterdiv package (https:// CRAN.Rproje ct. org/ packa ge= raste rdiv), relying on the definition given by Gorelick (2011b): Theory is neither mathematical nor abstract. Theory is the creative, inductive, and synthetic discipline of forming hypothesis [...]

Information theory
One of the mostly used metrics for measuring remotely sensed diversity is related to the entropy measurement firstly introduced by Shannon (Shannon 1948).
Given a sample area with N pixel values and p i relative abundances for every i ∈ {1, … , N} , in decreasing order, the Shannon index is calculated as: Taking into account only the most abundant pixel value, the Berger-Parker (Berger and Parker 1970) index is given by: In remote sensing applications, the derivation of synthetic indices of any sort (i.e. diversity) is often performed by sequentially considering only small chunks of the whole image. These chunks are commonly defined as 'kernel', 'windows' or 'moving windows'. From now on, we will use this terminology to indicate the local space of analysis.
Given an x integer matrix (or RasterLayer) which in R could be defined as: Both indices can be calculated using rasterdiv by a moving window and applying the commands Shannon and BergerParker.
where window represents the side of the moving window. Additional arguments common to all functions in the package are: na.tolerance that determines the proportion of NA values allowed in a moving window (default equals 1), np which sets the number of processes in the parallel computing environment defined with cluster. type (default is "SOCK").
Both indices obey to the relative abundance of values. The Berger-Parker index is equal to the relative proportion of the most abundant class in a moving window (Fig. 1). Hence, low values of Berger-Parker are expected for continuous satellite data, given the high variability of reflectance values. In contrast, in the Shannon index, the abundance of every single numerical category (pixel value) is taken into account. This might lead to taking into account the turnover among values, since the higher the turnover the lower the dominance of a single class (Fig. 2). However, Shannon's H is unable to discern situations where there is a high richness (number of numerical categories) and a low evenness from those where there is a low richness but a high evenness.
To better account for evenness, the Pielou index (Pielou 1966) can be calculated by simply standardising the Shannon index on the maximum possible Shannon index attainable given the same richness value. The latter is attained when the maximum potential evenness of pixel values/numerical categories is reached, i.e. when they are equally distributed over space.
H max corresponds to the natural logarithm of the number of pixel values.
Using rasterdiv, the Pielou index can simply be calculated as: Pielou(x, window=3) Figure 3 reports an example with a moving window of 9x9 pixels. ( Berger-Parker index measuring the most abundant spectral value (Eq. 2). All the indices in this paper are calculated starting from a Copernicus Proba-V NDVI (Normalised Difference Vegetation Index, resampled at 8-bit radiometric resolution) long term average image (June 21st 1999-2017) at 5km grain, also provided into the rasterdiv package as a free default set which can be loaded by the function data(). A generally low value of the index (based on the most abundant spectral value) is found, since spectral input values are generally different from each other in a moving window. This figure has been generated by the command BergerParker(ndvi17_r, window=9,np=8,cluster.type="SOCK")
In remotely sensed imagery, this is a crucial point since it might happen that contiguous zones could have similar (but not equal) reflectance values. For instance, the diversity of a homogeneous surface like water could be overestimated if spectral distances are not considered. To overcome this issue, the Rao's Quadratic diversity (hereafter Rao's Q, Rao 1982) could be applied by not only taking into account relative abundance but also the spectral distance among different pixel values.
Given the values of different pixels i and j, the Rao's Q consider their pairwise distance d ij as: Hence, an array with different but spectrally close values will lead to a high Shannon's H but a low Rao's Q. On the contrary, an array with different and distant values in the spectral space will lead to both a high Shannon's H and a high Raos' Q (Rao 1982).
Moving towards a 2D spatial extent, let M be a 2D matrix Thus, according to Eq. 4, Rao's Q is related to the sum of all the pixel values pairwise distances, each of which is multiplied by the relative abundance of each pair of pixels in the analysed image d × (1∕N 2 ) . In other words, Rao's Q is the expected difference in reflectance values between two pixels drawn randomly with replacement from the evaluated set of pixels. The distance matrix can be built in several dimensions (layers), thus allowing to consider more than one band at a time. As a consequence, Rao's Q can be calculated in a multidimensional (multi-layers) system. In rasterdiv package Rao's Q is calculated as: Rao(x, dist m="euclidean", window=3, mode="classic") The dist_m argument refers to the type of distance calculated among pixels and can be any distance available in the R package proxy, such as Euclidean, Manhattan or Canberra distance. The Euclidean distance is the only possible with unidimensional datasets (mode="classic") ( Fig. 4) as it is demonstrated that in one dimension where D M and D E are the Manhattan and the Euclidean distances, respectively. In a similar way, the Canberra distance is derived from the Manhattan distance by standardising separately the absolute differences of each band with the sum of both values, and thus will also equal D E in one d i m e n s i o n , s u ch t h a t : Solving the intrinsic continuity of spectral data: cumulative residual entropy As previously stated, spectral data are continuous variables that are approximate to discrete (the so called "digital number") for practical reasons. As such, the fact that two different pixels should be counted or not in a category depends from the whim of the normalisation of the signal when Digital Numbers (DNs) are generated. Shannon index is built strictly for a non-ordered finite set of categories.
For continuous variables, a derivative version of Shannon index was proposed, but soon it was clear that it had very different properties than categorical formulation (Jumarie 1990;Michalowicz et al. 2013). Rao et al. (2004) proposed a Cumulative Residual Entropy (CRE) to build a consistent Shannon-like index for continuous variables. It is based on residual cumulative probability ( P(X >= x i ) ), which can be estimated in a robust manner from empirical mono-dimensional distributions by counting for each value the number of observations with equal or larger values and then dividing by the total. CRE is defined as follows: and to estimate it from an empirical distribution, the following approach is advised: where X is the sorted vector of N observations. In practice, the approach is similar to the Rao's Q, given that a coefficient d, representing the disparity of the observations, is used to weight the diversity estimate based on probability.
The difference resides in that the disparity in this continuous measure is absolute, while in Rao'Q it is relative between two observations. This difference makes more complex the generalisation to a multi-layer, where this time the unidimensional cumulative residual probability is substituted with a multivariate one. For instance, here is an example making use of three layers / bands: The calculation of the cumulative residual probability P cr (X, Y, Z) in an efficient way is based on: (i) calculating a contingency array with a certain dimension for each band, and then (ii) performing a reverse cumulative sum along each dimension as follows: In rasterdiv Cumulative Residual Entropy can be calculated as: producing a map such as that achieved in Fig. 5.

Solving point descriptors of diversity: the Rényi and Hill generalised entropy
The metrics described above represent point descriptors of diversity, each of which is able to represent only a part of the whole diversity spectrum that can be attained. There is actually no single measure that could be adopted to represent all the different aspects of diversity with an intrinsic fallacy in considering a 'true' diversity (Gorelick 2011a).

Fig. 4 Rao's Q index calculated on a Copernicus
Proba-V NDVI image at 5 km. Differently from the original Shannon's formula, Rao's Q also considers the distance among different values by better discriminating the queues of the diversity distribution from very low diversity (e.g. deserts and ice fields) to very high diver-sity (e.g. upper highly complex mountain ranges). This figure has been generated by the command Rao(ndvi17_r,dist_ m="euclidean",window=9,np=8, cluster. type="SOCK",na.tolerance=0.5) Rényi (1970) firstly proposed a measure which is able to represent several diversity metrics in just one formula, by only changing one parameter ( in the original version of his manuscript). Given a sample area with N pixel values and p i relative abundances for every i ∈ {1, … , N} , the Rényi index is: Changing the parameter will lead to different indices starting from the same formula (Hill 1973). As an example, when =0, H 0 = ln (N) where N=richness, namely the maximum possible Shannon index ( H max ). In practice, with = 0 , all the spectral values equally contribute to the index, without making use of their relative abundance. For → 1 , the Rényi will equal Shannon H, according to the l'Hospital's rule, while for =2 the Rényi index will equal the ln(1/D) where D is the Simpson's dominance (Simpson 1949). The theoretical curve relating the Rényi index and is a negative exponential, i.e. it decays until flattening for higher values of , where the weight of the most abundant spectral values is higher with small differences among the attained diversity maps (Ricotta et al. 2003a).
In rasterdiv the Rényi index is calculated as: Renyi(x, window=3, alpha=1) where x and window are the input dataset and the moving window size, as in previous functions (Fig. 6). The argument alpha ( value in Eq. 10, default equals 1) can be a single integer, a vector (e.g. alpha=c(1,3)) or a sequence of integers (e.g. alpha=1:3). Hill (1973) was the first ecologist applying the generalised entropy concept initially developed by Rényi (1970). In particular, since no particular formula would have a preeminent advantage over the others (Hill 1973), the Hill's generalised entropy N was based on the effective number of species of H , namely the number of species that would lead to H if they were equally abundant. In our case, the "species concept" is translated to the "spectral values" concept. Hence, N is the effective number of spectral values that would give H as an output. N can thus be calculated as: As for the Rényi generalised entropy, changing will let the index transform in many other widely used indices, which are point descriptions of diversity, i.e. peculiar cases of the Hill's generalised theory. Hence, for = 0 , N 0 = N , where N is the total number of spectral values in the window of analysis; for = 1 , N 1 = exp H; for = 2 , N 2 = 1∕S , where S is the Simpson's index, and for = ∞ , N ∞ = 1 I BP , where I BP is the Berger-Parker index (Fig. 7). We refer to Ricotta et al. (2003a) and Ricotta et al. (2003b) for a concise review on the theoretical properties of the Rényi and the Hill's generalised entropy, respectively. In rasterdiv, the Hill's generalised entropy can be calculated as:

Fig. 5
Cumulative residual entropy calculated on a Copernicus Proba-V NDVI image at 5 km. This figure has been generated by the command CRE(ndvi17_r,window=9,np=8,cluster.type="SOCK",na.tolerance=0.5) Hill(x,window=3,alpha=1) We refer to Chao et al. (2016) for a complete overview of the Hill's numbers application in ecology.

Discussion
In this paper, we provided a full description of the main functionalities of the new R package rasterdiv. The rasterdiv package provides an unprecedented suite of functions to calculate different indices for estimating diversity from space and to perform a first exploration of potential biodiversity hotspots worldwide at a glance. Of course, measures of diversity from space should not be viewed as a replacement of in situ data on biological diversity, but they are rather complementary to existing data and approaches. In practice, they integrate available information of Earth surface properties, including aspects of functional (structural, biophysical and biochemical), taxonomic, phylogenetic and genetic diversity (Laliberté et al. 2019).
Obviously, in most of the Information Theory based metrics, only one layer can be used, considering those indices related to relative abundance, apart from the Rao's Q and the Cumulative Residual Entropy (CRE). In the Rao's Q index, multidimensional systems can be used to calculate spectral distance (see also Nakamura et al. (2020) on the dimensionality of diversity), while in the CRE it is possible to calculate a multidimensional cumulative distribution to be used in the estimates (Drissi et al. 2008). In general, remotely sensed data are actually the approximation of more complex systems, which depends on the original radiometric and spectral resolution. In ecological terms, such original spectral space formed by many bands is analogous to the Hutchinson's hypervolume, in which a geometrical order is given to those variables shaping species' niches (Hutchinson 1959;Blonder 2018). In this case, the spectral space is expected to be related to both species niches and their relative diversity. The use of such spaces is an efficient approach to figure out the diversity of an area and potentially guide field sampling and monitoring schemes (Rocchini et al. 2008. Concerning the data being used, spectral diversity measures computed from satellite images represent a valid alternative to class-based land cover maps for investigating landscapes heterogeneity . For instance, a highly fragmented landscape characterised by a mosaic of crops and seminatural forests suffers from oversimplification when investigated through land cover classes (Amici et al. 2018), while it should present higher spectral diversity values compared to more homogeneous landscapes within the same study area (Rocchini and Ricotta 2007). Several studies have already acknowledged the importance of computing continuous spectral diversity measures from spectral bands in order to better understand and discriminate the various landscape components (Karlson et al. 2015;Godinho Fig. 6 Rényi index calculated on a Copernicus Proba-V NDVI image at 5 km, considering different values, from 0 to 2. With → 1 , the diversity map is equal to the Shannon's map of Fig. 2. Increasing will create a flattening of the index with a lower ability to discern dif-ferences among different maps (Ricotta et al. 2003a). This figure has been generated by the command Renyi(ndvi17_r,window=9, np=8,cluster.type="SOCK",alpha=c(0,0.5,1,2)) et al. 2018; Ribeiro et al. 2019;Doxa et al. 2020). This said, caution is warranted when making use of continuous data, by seriously considering the radiometry of pixel values. As an example, relying on continuous NDVI (Normalised Difference Vegetation Index) values, ranging from −1 to 1 with float (decimal) precision data, will lead to a high neighbouring diversity which could actually be the effect of data binning rather than of a biological underlying pattern. In general, an 8-bit image with a range of integer values/ classes from 0 to 255 would be preferable. In this paper, we made use of an 8-bit NDVI layer rescaled from Copernicus data. However, a multispectral system reduced to one single layer through the first component of a Principal Component Analysis, or similar multidimensionality reduction techniques, would also be useful (Féret and Boissieu 2020). In fact, NDVI assumes a biomass-grounded reflectance model, while the direct use of the original spectral data (digital numbers) does not generally require any assumptions about the biology of objects being sensed.
As remotely sensed estimates of diversity are currently based on relatively long time series, they might allow a more general forecasting framework of future shifts in rates of diversity change. This is particularly important when aiming at finding potential indicators of diversity change in time (Schmeller et al. 2018). On this point, it has been widely demonstrated that remotely sensed diversity might be in line with most of the spatially constrained Essential Biodiversity Variables proposed by Skidmore et al. (2015).
The rasterdiv package might also be particularly useful when aiming at calculating diversity directly from climate data, derived from remote sensing (Metz et al. 2014). This could allow analysing diversity based on the main drivers of biological diversity in the field, rather than on the patterns resulting from pure spectral response. This is true when considering both wide climatic variations at global scale and microclimate variations at the scale of individuals (Zellweger et al. 2019). Due to unprecedented rates of climatic changes, the adaptation of species to climate change is a benchmark in ecology. Hence, estimating diversity from climate gridded data could improve our understanding of the variability of species ranges at different spatial and temporal scales (Senner et al. 2018).

Conclusion
Measuring diversity from above and delivering rapid and robust knowledge about diversity over wide regions could be of crucial importance for guiding management practices. From this point of view, the spatial variation of the spectral signal has an intrinsic cumbersome relation with the spatial autocorrelation (sensu Laliberté (2008)) of pixel values over space (and time, e.g. ), which renders the proposed rasterdiv package a powerful tool to monitor the variation of ecosystems properties over space and time, and thus their change .
As previously stated, no single measure provides a full description of all the different aspects of diversity. That is why, the rasterdiv package can result useful in making multiple calculations based on reproducible open source Fig. 7 Another generalised entropy measure of diversity: the Hill index, for which the same reasoning of the Rényi index holds true. The maps are derived from a Copernicus Proba-V NDVI image at 5 km. This figure has been generated by the command Hill(ndv i17_r,window=9,np=8,cluster.type="SOCK",alph a=c(0,0.5,1,2)) 1 3 algorithms, robustly rooted on Information Theory from which the different indices are extracted.
Funding Open access funding provided by Alma Mater Studiorum -Universitá di Bologna within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.