Introduction

New technologies have significantly advanced various scientific disciplines, and biodiversity is no exception1. Citizen science platforms have emerged as prolific sources of valuable data contributed by amateur observers, facilitating increased collaboration and data sharing among scientists2. Notably, Observation.org and eBird stand out as the most well-established platforms among users in Western Europe. Furthermore, GBIF compiles data from these platforms and various other sources, including additional platforms, natural science museums, government data, etc.

This information, or citizen science (CS) data, is growing rapidly and now forms an extensive dataset. For the Iberian Peninsula alone, the combined bird records from these three platforms exceed 40 million, excluding duplicates, as of June 2024. Each record typically represents the observation of a confidently identified species at specific coordinates on a precise date. Although this information currently serves as the primary source of biodiversity data3, its full potential has yet to be realized owing to limitations in data quality, data integration and management, the need for new analytical techniques, and societal factors such as collaboration between citizen scientists and professional researchers. In addition, Callaghan et al.2 identify three areas in which CS data can be leveraged: increasing sampling efforts in undersampled areas and taxa, estimating species abundances over space and time, and utilizing secondary data in CS observations.

The significance of having up-to-date distribution data over large spatiotemporal scales cannot be overstated. However, no single institution, governmental, academic or NGO, possesses the means and funds to monitor all biodiversity taxa across all territories. This knowledge proves especially valuable for biodiversity hotspots or megadiverse countries, typically in areas where data remains incomplete4,5. Species distribution models, crucial for understanding extinction processes, heavily rely on sample size. CS data, by providing information from diverse locations, contributes to the creation of accurate models.

The utility of CS data in defining and characterizing biodiversity patterns hinges on the completeness of species inventories at specific locations and the even distribution of locations across temporal, spatial, and environmental dimensions6,7. Both the completeness (defined as the ratio between observed and expected richness4,5) and suitability for various macro-scale biodiversity studies are pivotal before comparing species assemblages, analysing distribution patterns, or conducting other basic ecological assessments. Unfortunately, these aspects are often insufficiently analysed in current studies.

While well-covered CS data is apt for representing distributional patterns and the ecological dynamics of species, encompassing patterns of extinction and colonization, data that are incomplete in their taxonomic, spatial, and/or temporal dimensions may yield misleading results, thus fostering false conclusions at different scales8,9. In general, biases in CS data can arise from several factors that are not mutually exclusive. These include users’ place of origin (e.g., urban areas) and their preferences (spatial: e.g., wetlands; taxonomic: e.g., raptors; temporal: e.g., summer and/or good weather conditions)10. Additionally, an inadequate or incomplete review of uploaded records by platform experts, coupled with users’ preference for searching rare species, could further amplify taxonomic biases11. Incomplete, biased, and/or outdated taxonomic coverage may impact the accurate inference of species richness and the definition of multispecies assemblages within a specific range12. On the other hand, uneven geographical coverage may influence the performance of species distribution models based on species occurrence records, potentially underestimating the actual or potential distribution patterns of focal species13,14. Inadequate coverage of biogeographical regions or ecoregions may adversely affect inferences of species' regional ranges, complicating the accurate determination or revision of the division of zoogeographical regions or ecoregions15. Simultaneously, incomplete coverage by land uses may introduce biases in taxonomic coverage. Finally, insufficient temporal coverage of data may impede the accurate assessment of temporal dynamics, especially those associated with extinction processes16.

Completeness analyses have been conducted for select taxonomic groups in a few countries worldwide, such as Brazil (plants)6, Mexico (birds)4, the United States (vertebrates and some invertebrates)7, India (birds)5, and West Africa (birds)17. However, the Iberian Peninsula has seen coverage assessments only for Scarabaeidae and bees18, as well as terrestrial mammals19, leaving birds without comprehensive evaluation so far.

The Iberian Peninsula stands as a territory of remarkable biodiversity, situated within the Mediterranean basin hotspot20. Its geographical placement has given rise to three distinct zoological components: (1) Iberian endemic elements, fostered by glacial isolation during the Pleistocene21; (2) European biotas; and (3) African elements, owing to the proximity of North Africa and historical connections with this continent during the Messinian Crisis six million years ago22, or attributed to the volcanic arc in the Alborán Sea, as described by Booth-Rea et al.23 during the same epoch. These factors collectively render this territory highly intriguing for biodiversity studies, thereby elevating the relevance of CS data. This region hosts 456 bird species24,25,26, some of which, especially those prevalent in agricultural landscapes, are undergoing a rapid and pronounced decline due to habitat transformation, climate change, or pollution27. Consequently, the scrutiny of these spatiotemporal processes through extensive CS data offers an opportunity to unravel the intricacies surrounding these issues28,29.

The primary goal of this study is to evaluate the extent of coverage and completeness within the CS datasets pertaining to birds of the Iberian Peninsula, sourced from eBird, Observation.org, and GBIF. Initially, we assessed completeness in 10 × 10 km cells, aiming to gauge the efficacy of data compilation efforts in analysing spatial variation in species richness. Subsequently, we delineated the coverage level of bird censuses derived from these databases from five distinct perspectives: (1) geographic (assessing bias in the total distribution of records), (2) temporal (examining well-sampled cells over time), (3) ecoregional (evaluating coverage across terrestrial ecoregions in the Iberian Peninsula as defined by the World Wildlife Fund), (4) environmental (considering annual mean temperature, annual precipitation, and altitude), and (5) land use (exploring forested areas, agricultural lands, wetlands, urban spaces, water bodies, and mixed-use regions). The scrutiny of coverage along environmental gradients represents a crucial yet often overlooked aspect of biodiversity data planning. Ultimately, by ascertaining completeness and coverage within extensive bird CS databases and across ecologically relevant gradients, our objective is to illuminate the factors contributing to data gaps.

Materials and methods

Study area

The Iberian Peninsula (583,254 km2) (Fig. 1), including Andorra, Portugal, Spain, and Gibraltar, is located in south-western Europe and is separated from Africa only by the 14 km wide Strait of Gibraltar. Serving as the western link between the two vast continents of Eurasia and Africa, it occupies the space between the Atlantic Ocean and the Mediterranean Sea. Furthermore, endowed with geological, physiographic, climatic, and ecological diversity, the Iberian Peninsula functions akin to a small continent. This distinctive character, coupled with its strategic location, historical human activities, and its role as a glacial refuge during the Pleistocene, bestows upon it extraordinary biodiversity values. These combined factors have fostered the coexistence of a rich blend of European, African, and endemic biotas30,31,32. In the realm of avifauna, the Iberian Peninsula serves as a crossroads for the Western European migratory flyway, giving rise to a significant number of species in transit33.

Fig. 1

Map depicting the location of the Iberian Peninsula (in green) in the European continent. The primary migratory flyway for birds in Western Europe is indicated as a yellow band34.

Biogeographically, the Iberian Peninsula predominantly falls within the Mediterranean ecoregion, with a narrow extension into the Euro-Siberian domain in the north. The northern region is characterized by Cantabrian mixed forests, encompassing the Cantabrian Mountain Range, Pyrenees, and Galician Massif. The Mediterranean ecoregion itself is further subdivided into seven major ecological units (Iberian conifer forests, Iberian sclerophyllous and semi-deciduous forests, Northeastern Spain and Southern France Mediterranean forests, Northwest Iberian montane forests, Pyrenees conifer and mixed forests, Southeastern Iberian shrubs and woodlands, Southwest Iberian Mediterranean sclerophyllous and mixed forests)35. These subdivisions underscore the significant environmental heterogeneity of the Iberian Peninsula. The habitats of interest within this region comprise Mediterranean forests, wetlands, hardwood forests, steppes (whether natural or agricultural pseudo-steppes), mountain ranges, and river systems, among others36. The Peninsula makes the largest contribution to the European Natura 2000 network, spanning over 222,000 km2, and hosts 2,130 protected spaces designated by national or regional authorities37.

Biodiversity data

Data for this study is sourced from three distinct platforms: two CS platforms, eBird (http://www.ebird.org)38 and Observation.org (https://observation.org/)39, along with GBIF (https://www.gbif.org/)40, a biodiversity data repository that aggregates information from various sources, including the aforementioned platforms. GBIF updates data from original sources with variable time delays, usually from 3 to 18 months, influenced by factors such as data validation, standardization, technical capabilities, and formal agreements between institutions. The dataset was cleaned and filtered in several phases, following Escribano et al.19. Two types of records were initially eliminated: (1) records from islands and North African territories; and (2) records lacking a collection date or coordinates.

The combination of these three datasets could result in the creation of duplicate or triplicate records (the same observation uploaded more than once) for two main reasons: (1) an observer uploading the same observation to two platforms (eBird and Observation.org) and/or (2) both platforms (eBird and Observation.org) routinely uploading their records to GBIF. To mitigate potential duplications and reduce the effects of pseudoreplication in subsequent analyses, we applied a stringent criterion. Specifically, we retained only one record among those that shared identical dates and geographical coordinates with the same first two decimal places in decimal degrees (DD) notation. These decimals correspond to a circle of approximately 800 m radius, ensuring a conservative approach to data consolidation. Additionally, records lacking specific taxonomic assignment and hybrids were excluded. From the initially downloaded 42,453,700 records, a final set of 23,372,637 records was retained for use in this study. Each record was then assigned to its corresponding 10 × 10 km UTM grid cell.
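The duplicate filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline actually used: the record fields (`species`, `date`, `lat`, `lon`, `source`) are hypothetical names, and including the species in the duplicate key is an assumption, since the text mentions only dates and coordinates.

```python
from decimal import Decimal, ROUND_DOWN


def truncate_2dp(value):
    """Keep the first two decimal places of a coordinate (truncation, not rounding)."""
    return Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_DOWN)


def deduplicate(records):
    """Keep the first record found for each duplicate key.

    Key = (species, date, truncated latitude, truncated longitude).
    Including the species in the key is an assumption of this sketch.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (
            rec["species"],
            rec["date"],
            truncate_2dp(rec["lat"]),
            truncate_2dp(rec["lon"]),
        )
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


# Hypothetical records: the first two share species, date and truncated coordinates.
records = [
    {"species": "Turdus merula", "date": "2020-05-01", "lat": 40.4168, "lon": -3.7038, "source": "eBird"},
    {"species": "Turdus merula", "date": "2020-05-01", "lat": 40.4199, "lon": -3.7001, "source": "GBIF"},
    {"species": "Turdus merula", "date": "2020-05-02", "lat": 40.4168, "lon": -3.7038, "source": "eBird"},
    {"species": "Erithacus rubecula", "date": "2020-05-01", "lat": 40.4168, "lon": -3.7038, "source": "Observation.org"},
]
unique_records = deduplicate(records)
print(len(unique_records))
```

Truncating through `Decimal` rather than multiplying floats avoids binary rounding surprises when extracting the first two decimals, and it behaves symmetrically for the negative longitudes that cover most of the Peninsula.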

We compiled a comprehensive list of continental birds of the Iberian Peninsula by combining the species lists from Spain (SEO/BirdLife “Lista de las aves de España. Edición de 2019”24) and Portugal (Matías et al.25,26), encompassing a total of 595 species (Suppl. Materials 1, 2). We followed the taxonomic criteria outlined in the HBW & BirdLife International Illustrated Checklist of the Birds of the World41.

The dataset used for this study is publicly available. No living organism has been manipulated during the execution of this study, which has been entirely based on data obtained from the internet.

Taxonomic coverage

To assess taxonomic coverage, we conducted a comparison between the species documented in the CS data and our comprehensive list of species for the Iberian Peninsula. We exclusively considered native and exotic species present in this consolidated list.

Given that the platforms involved may exhibit slight variations in bird taxonomy and nomenclature, we created a list of synonyms to standardize the taxonomy across the entire dataset. This process ensured uniformity by aligning the taxonomy with our unique combined list for the Iberian Peninsula (Suppl. Material 2).
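As a rough illustration of this comparison, the following Python sketch maps platform names onto checklist names through a synonym table and then computes the share of the checklist that is covered. The function name, the synonym entry and the three-species checklist are illustrative assumptions, not the content of the supplementary materials.

```python
def taxonomic_coverage(observed_names, checklist, synonyms=None):
    """Return (number of checklist species observed, coverage in percent).

    synonyms maps a platform-specific name to the name used in the checklist.
    """
    synonyms = synonyms or {}
    standardized = {synonyms.get(name, name) for name in observed_names}
    covered = standardized & set(checklist)
    return len(covered), 100.0 * len(covered) / len(checklist)


# Illustrative checklist for a single order (names chosen for the example only).
checklist = {"Cuculus canorus", "Clamator glandarius", "Coccyzus americanus"}

# Hypothetical synonym table: one platform is assumed to use an older name.
synonyms = {"Cuculus glandarius": "Clamator glandarius"}

observed = ["Cuculus canorus", "Cuculus glandarius", "Cuculus canorus"]
n_covered, coverage_pct = taxonomic_coverage(observed, checklist, synonyms)
print(n_covered, round(coverage_pct, 1))
```

Names that are absent from both the synonym table and the checklist are simply ignored by the set intersection, which mirrors the decision to consider only species present in the consolidated list.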

Inventory completeness

The assessment of inventory completeness was conducted across 10 × 10 km grid cells. The expected species richness for each cell was determined using the Chao2 estimator (SChao2)4,5:

$${S}_{chao2} = {S}_{obs} +\left(\frac{N-1}{N}\right)\left(\frac{a\left(a-1\right)}{2\left(b+1\right)}\right),$$

where Sobs is the number of species recorded in each cell; N indicates the number of unique combinations of records (species × date) available for each cell; a and b are the number of species recorded on exactly one day and on exactly two days in each cell, respectively. Subsequently, the bird inventory completeness (C) of each cell is calculated as follows:

$$C=\frac{{S}_{obs}}{{S}_{chao2}}.$$
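For a single grid cell, the calculation can be sketched as follows. The sketch assumes the bias-corrected form of Chao2, S_obs + ((N − 1)/N) · a(a − 1)/(2(b + 1)), and takes the records of one cell as (species, date) pairs; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def chao2_completeness(records):
    """Compute S_obs, S_chao2 and completeness C for one grid cell.

    records: iterable of (species, date) tuples; repeated pairs count once.
    Assumes the bias-corrected Chao2 form given in the lead-in.
    """
    unique = set(records)          # unique species x date combinations
    n = len(unique)                # N in the text
    if n == 0:
        raise ValueError("the cell has no records")

    days_per_species = defaultdict(int)
    for species, _date in unique:
        days_per_species[species] += 1

    s_obs = len(days_per_species)
    a = sum(1 for days in days_per_species.values() if days == 1)  # seen on one day
    b = sum(1 for days in days_per_species.values() if days == 2)  # seen on two days

    s_chao2 = s_obs + ((n - 1) / n) * (a * (a - 1)) / (2 * (b + 1))
    return s_obs, s_chao2, s_obs / s_chao2


# Toy cell: species A on three days, B on one, C on two, D on one.
cell_records = [
    ("A", "2020-01-01"), ("A", "2020-01-02"), ("A", "2020-01-03"),
    ("A", "2020-01-01"),  # repeated species x date pair, counted once
    ("B", "2020-01-01"),
    ("C", "2020-01-01"), ("C", "2020-01-02"),
    ("D", "2020-01-05"),
]
s_obs, s_chao2, completeness = chao2_completeness(cell_records)
print(s_obs, round(s_chao2, 3), round(completeness, 3))
```

In the toy cell N = 7, a = 2 and b = 1, so the estimator adds (6/7) × 0.5 expected species to the four observed ones.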

To determine the threshold for considering a cell as well-sampled, various alternative sources were reviewed (Table 1). Cells were considered well inventoried under lax criteria when N ≥ 200 and C ≥ 0.8 and under strict criteria when N ≥ 1000 and C ≥ 0.9. Grid cells with fewer than 200 records were considered undersampled.
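The thresholds above translate directly into a small decision rule. The sketch below is one possible encoding; the function name and the category labels are illustrative choices.

```python
def classify_cell(n_records, completeness):
    """Classify a grid cell using the lax and strict thresholds stated in the text."""
    if n_records < 200:
        return "undersampled"
    if n_records >= 1000 and completeness >= 0.9:
        return "well-sampled (strict)"
    if completeness >= 0.8:
        return "well-sampled (lax)"
    return "not well-sampled"


labels = [
    classify_cell(150, 0.95),   # too few records, regardless of C
    classify_cell(250, 0.85),   # meets the lax criteria only
    classify_cell(5000, 0.93),  # meets the strict criteria
    classify_cell(5000, 0.70),  # enough records, but low completeness
]
print(labels)
```

Because the strict thresholds are nested within the lax ones, every cell labelled strict here would also count as well-sampled under the lax criteria.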

Table 1 Criteria used in different studies based on citizen science data to define well-inventoried cells, along with a proposal made in this study for Iberian birds.

Geographical coverage

To evaluate geographical coverage, we counted the unique coordinates (locations) per species and per grid cell. Moran’s I was calculated using esda.Moran in Python (https://pysal.org/esda/generated/esda.Moran.html) to assess the geographical bias in the overall distribution of records, providing a quantitative measure of spatial autocorrelation42. A positive Moran’s I value, along with significant spatial autocorrelation, indicates a biased geographical coverage. To compute this index, the spatial weight matrix was first calculated using the K nearest neighbours (KNN) method, with K = 8 (representing the eight nearest cells).
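To make the index concrete, the sketch below computes Moran's I from scratch with K-nearest-neighbour weights, using only the Python standard library. It is not the esda implementation used in the study: row-standardised weights, the one-sided permutation pseudo p-value, the function names and the toy 6 × 6 grid are all assumptions made for illustration.

```python
import math
import random


def knn_neighbours(coords, k):
    """Indices of the k nearest neighbours of each point (Euclidean distance)."""
    neighbours = []
    for i, (xi, yi) in enumerate(coords):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def morans_i(values, neighbours):
    """Moran's I with row-standardised weights (each neighbour weighs 1/k)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    denom = sum(zi * zi for zi in z)
    if denom == 0:
        raise ValueError("values are constant; Moran's I is undefined")
    # With row-standardised weights the sum of all weights equals n,
    # so the usual n / S0 factor is 1.
    num = sum(
        z[i] * sum(z[j] for j in nb) / len(nb)
        for i, nb in enumerate(neighbours)
    )
    return num / denom


def permutation_p_value(values, neighbours, permutations=199, seed=0):
    """One-sided pseudo p-value obtained by shuffling values across cells."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbours)
    shuffled = list(values)
    as_extreme = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbours) >= observed:
            as_extreme += 1
    return (as_extreme + 1) / (permutations + 1)


# Toy 6 x 6 grid whose values increase from west to east (spatially clustered).
coords = [(x, y) for x in range(6) for y in range(6)]
gradient = [float(x) for x, _y in coords]
nb8 = knn_neighbours(coords, 8)   # K = 8, as in the text
print(round(morans_i(gradient, nb8), 2), permutation_p_value(gradient, nb8))
```

A smooth gradient yields a clearly positive index, whereas an alternating pattern in which each cell differs from its immediate neighbours yields a negative one.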

Temporal coverage

A time-based index (T) was employed to assess the temporal coverage of each species or grid cell42, aiding in the identification of well-surveyed cells over time. Within the dataset, the earliest available record was in January 1817, and the latest was on 31 December 2021, resulting in a total of 2172 months (Mo). Ideally, each species or grid cell should have at least one record per month. The variable M was computed as the number of months for which data was recorded for each species or grid cell in the dataset. Subsequently, the temporal coverage index (T) for each species or grid cell was calculated as follows:

$$T=\left(\frac{M}{{M}_{o}}\right)\times 100.$$

Additionally, considering the increase in data since the last decades of the 20th century, this index was computed for the period 1980–2021, encompassing a total of 492 months.
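A minimal sketch of the index for one species or grid cell is given below. The total number of months is passed in as a parameter rather than derived, so the values stated in the text can be supplied directly; the function name and the example dates are hypothetical.

```python
from datetime import date


def temporal_coverage(record_dates, total_months):
    """Return (M, T): months with at least one record, and T = M / M_o * 100."""
    months_with_data = {(d.year, d.month) for d in record_dates}
    m = len(months_with_data)
    return m, 100.0 * m / total_months


# Hypothetical record dates for one grid cell; two fall in the same month.
record_dates = [
    date(2019, 5, 3),
    date(2019, 5, 21),
    date(2019, 6, 1),
    date(2020, 1, 15),
]
m, t_index = temporal_coverage(record_dates, total_months=492)
print(m, round(t_index, 2))
```

Restricting the analysis to 1980–2021 amounts to filtering the dates before the call and supplying the shorter total, which raises T for any cell whose records are concentrated in recent decades.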

Moran’s I was also calculated to assess geographical bias in temporal coverage. A positive Moran’s I value with significant spatial autocorrelation indicates a notable aggregation of grid cells with high temporal coverage43. Furthermore, Pearson correlations were calculated to examine the relationships between completeness, geographical coverage (localities per cell), and temporal coverage (months per cell).

Environmental and land use coverage

Environmental and land use factors can influence the way citizens collect bird data, potentially introducing some sampling bias44,45. To evaluate differences (gaps among well-surveyed sites) in the environmental dimension, we obtained the annual mean temperature, annual precipitation, and altitude variables at 2.5′ of arc (~ 5 km) spatial resolution from the WorldClim 2.1 dataset46. Following the approach described by Freeman and Peterson17, this data was utilized to compute nearest neighbour Euclidean distances between cells in the environmental space. The focus was particularly on distances from each grid cell across the study area to well-surveyed grid cells.
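The distance calculation can be illustrated with a short sketch. It assumes that each variable is standardised to z-scores before distances are computed, so that temperature, precipitation and altitude contribute on comparable scales; the text does not state the scaling, and the function names and toy values are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardise(rows):
    """Convert each column to z-scores (assumed scaling; not stated in the text)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [
        [(value - mu) / sd for value, mu, sd in zip(row, means, sds)]
        for row in rows
    ]


def distance_to_well_surveyed(env_rows, well_surveyed):
    """Nearest-neighbour Euclidean distance from every cell to a well-surveyed cell."""
    scaled = standardise(env_rows)
    targets = [row for row, flag in zip(scaled, well_surveyed) if flag]
    if not targets:
        raise ValueError("no well-surveyed cells supplied")
    return [min(math.dist(row, target) for target in targets) for row in scaled]


# Toy cells: (annual mean temperature in degrees C, annual precipitation in mm, altitude in m).
env_rows = [
    (12.0, 600.0, 300.0),    # well-surveyed lowland cell
    (12.5, 650.0, 350.0),    # similar conditions, not well surveyed
    (5.0, 1500.0, 2200.0),   # cold, wet, high-mountain cell
]
well_surveyed = [True, False, False]
distances = distance_to_well_surveyed(env_rows, well_surveyed)
print([round(d, 2) for d in distances])
```

Cells with large distances occupy environmental conditions that no well-surveyed cell represents, which is how gaps along the gradients can be located.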

For land cover categorization, we employed the first level of the Corine Land Cover classification. The assessment involved computing the dominant land use within each grid cell, considering instances where the coverage exceeded 50% of the occupied surface. Following this criterion, our cells were categorized as follows: forest, agricultural, wetlands, artificial surfaces (urban), water bodies (encompassing all marine cells), mixed (indicating the absence of a clear dominance), and data unavailable (NA).
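The dominance rule can be written as a small function. In this sketch the input is assumed to be a mapping from land use class to its share of the cell area; the function name and the example shares are hypothetical.

```python
def dominant_land_use(shares):
    """Return the class covering more than 50% of the cell, 'mixed' or 'NA'.

    shares: dict mapping a land use class to its fraction of the cell area.
    """
    if not shares:
        return "NA"  # no land cover data available for the cell
    top_class, top_share = max(shares.items(), key=lambda item: item[1])
    return top_class if top_share > 0.5 else "mixed"


examples = [
    dominant_land_use({"forest": 0.62, "agricultural": 0.30, "artificial surfaces": 0.08}),
    dominant_land_use({"forest": 0.45, "agricultural": 0.40, "wetlands": 0.15}),
    dominant_land_use({}),
]
print(examples)
```

A class holding exactly half of the cell is treated as not dominant here, which follows the wording that coverage must exceed 50%.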

The Kolmogorov–Smirnov test was applied to assess whether the frequency distributions of environmental and land use variables in well-sampled cells differed from those of the global pool of cells. In cases where differences were identified, Tukey’s post hoc test was employed to pinpoint the specific variables contributing significantly to these differences.
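The comparison of distributions can be illustrated with the two-sample Kolmogorov–Smirnov statistic, the largest gap between two empirical cumulative distribution functions. The sketch below assumes that this two-sample form is the test intended, computes only the statistic D and not its p-value, and uses hypothetical temperature values; in practice a library routine such as scipy.stats.ks_2samp would normally be used.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D (maximum ECDF difference)."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    pooled = sorted(set(a) | set(b))
    # Evaluate both empirical CDFs at every observed value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in pooled
    )


# Hypothetical annual mean temperatures (degrees C) for two sets of cells.
all_cells = [4.0, 7.5, 9.0, 11.0, 12.5, 14.0, 15.5, 17.0, 18.5]
well_sampled_cells = [11.0, 12.5, 14.0, 15.5, 17.0]
d_stat = ks_statistic(well_sampled_cells, all_cells)
print(round(d_stat, 3))
```

A value of D near 0 means the well-sampled cells span the same conditions as the full set of cells, while larger values indicate that they are concentrated in part of the gradient.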

Results

Taxonomic coverage

Between 1817 and December 2021, a total of 553 bird species were recorded in the CS data from the Iberian Peninsula, representing 92.9% of the species in the unified checklist of birds for the region (Table 2). The primary source contributing to the bird species richness in the Iberian Peninsula within the CS dataset was GBIF (553 species), while Observation.org documented the lowest count (511 species), and eBird fell in between with 523 species.

Table 2 Taxonomic coverage of citizen science data for birds in the Iberian Peninsula (1817–2021) at the order level.

The taxonomic coverage of CS data for Iberian birds is notably high, with the majority of orders displaying values ranging between 90 and 100%. Only one order with more than 10 species exhibited a taxonomic coverage below 90%, Procellariiformes, which comprises exclusively pelagic seabirds with 23 species and a 73.9% coverage. Additionally, six other orders recorded values below this threshold, all characterized by a limited number of species, some of which are rare or exotic and present considerable difficulty in being covered by CS data (Table 2). For instance, two out of three species of Cuculiformes are accounted for in the CS dataset, with the sole exception being the rarity Coccyzus americanus.

Inventory completeness

To assess the distribution and completeness of bird observations gathered from the CS platforms eBird, Observation.org, and GBIF, we analysed a total of 23,372,637 records collected between 1817 and 2021, documented across 6185 10 × 10 km cells. The average completeness per cell for the entire Iberian Peninsula was 0.68, with an interquartile range of 0.78–0.97 (Table 3).

Table 3 Summary of downloaded bird citizen science data from eBird, Observation.org and GBIF alongside bird inventory completeness of 10 × 10 km cells sampled across the Iberian Peninsula.

A total of 2411 cells (38.8%) of the 10 × 10 km grid were considered well-sampled under strict criteria (C ≥ 0.9), and 4418 cells (71.2%) met the criteria under lax conditions (C ≥ 0.8). The remaining 1767 cells (28.6%) were considered not well-sampled. Completeness, particularly in well-sampled cells, exhibits a scattered pattern throughout the Iberian Peninsula, with two distinct areas, Portugal and eastern Spain, forming well-defined clusters of cells with C ≥ 0.8 (Fig. 2). Noteworthy concentrations of high completeness values are also observed in central, mid-western and north-eastern Spain and along its south-western coastline, encompassing renowned birdwatching areas such as Doñana National Park or the vicinity of the Strait of Gibraltar (Fig. 2).

Fig. 2

Bird inventory completeness (C) for the Iberian Peninsula, based on citizen science data obtained from eBird, Observation.org, and GBIF. (a) Map summarizing the C value in 10 × 10 km cells. Colours indicate C range values. (b) Cells well-surveyed under lax criteria (C ≥ 0.8 and N ≥ 200). (c) Cells well-surveyed under strict criteria (C ≥ 0.9 and N ≥ 1000).

Geographical coverage

All 6185 10 × 10 km cells within the Iberian Peninsula recorded at least one bird observation (Fig. 3). The maximum number of locations per cell reached 603. The CS dataset for the Iberian Peninsula exhibits a significant geographical bias (Moran’s I = 0.59, p < 0.01). Regions with higher concentrations of locations per cell were primarily found along the coast, in the vicinity of major cities (such as Madrid, Lisbon, Barcelona, Valencia, Seville, Córdoba, Oporto or Bilbao), and within well-known protected areas (e.g. Doñana and Monfragüe National Parks, the Ebro delta, etc.) (Fig. 3). Notably, observations predating 1980 (comprising 56,267 records) still demonstrate a significant bias (Moran’s I = 0.32, p < 0.01).

Fig. 3

Geographical coverage (locations per 10 × 10 km cell) of the Iberian bird citizen science data for the 1817–2021 period. Colours vary based on the number of locations per cell.

The regions with lower geographical coverage are dispersed across the south-eastern, central, and eastern quadrants of the Iberian Peninsula, typically corresponding to areas characterized by low human density, all within Spain.

Temporal coverage

The temporal distribution of bird records from the Iberian CS data reveals an exponential growth curve (Fig. 4a), attributed to the recent increase in the use of CS applications. The total number of months with data for all species is 793, with the majority of species having fewer than 100 months recorded (Fig. 4b). The temporal coverage for the entire period (1817–2021) was notably low (T index median: 2.7% of the months in the entire period) (Table 4). Focusing solely on the most recent forty years, the frequency of months with data is markedly higher (T index median 11.9%) (Table 4).

Fig. 4

Temporal coverage (records per year) of the Iberian bird species’ citizen science data between 1817 and 2021. (a) Temporal distribution of records. (b) Frequency distribution of months with records per number of species.

Table 4 Statistics for the temporal coverage of the bird citizen science dataset from the Iberian Peninsula.

The temporal coverage was generally low or very low for most species (Fig. 4b), with some exceptions such as Carduelis carduelis (742 months), Erithacus rubecula (716 months), Alectoris rufa (715 months), Turdus merula (715 months), or Fringilla coelebs (711 months) (Suppl. Material 3).

The spatial distribution of temporal coverage across the Iberian Peninsula, measured by the T index per 10 × 10 km cell, was generally low, except for specific locations around the largest cities, emblematic protected areas, and most of the coastal strip (Fig. 5). However, during the period 1980–2021, the T index reached significantly higher values than those for the entire period. Moran’s I for temporal coverage indicated significant spatial aggregation of cells with similar temporal coverage, both for the entire dataset (I = 0.46, p < 0.01) and for the period after 1980 (I = 0.19, p < 0.01).

Fig. 5

Map of the temporal coverage Index (T) for the bird species’ citizen science dataset in the Iberian Peninsula between (a) 1817–2021 and (b) 1980–2021. The colours in the legend denote the temporal coverage index T in each grid cell.

Common spatio-temporal patterns

All parameters, including inventory completeness, geographical coverage (locations per 10 × 10 km cell), and temporal coverage (months per 10 × 10 km cell), exhibited strong correlations (Spearman Rank correlations, Rho > 0.52, p-value < 0.001, n = 6185) (see Figs. 2, 3, 5). The cells that were best sampled also had higher counts of locations and more recorded dates throughout the entire recording period (1817–2021).

Environmental and land use coverages

Environmental coverage was extensive for Iberian birds, with well-sampled cells represented across more than 80% of the environmental range of every variable. The majority of well-sampled cells fell within the range of 11 to 17 °C for annual mean temperature, between 500 and 1000 mm for annual precipitation, and below 1000 m of elevation (Fig. 6). The distribution of well-sampled cells was narrower compared to that of incomplete cells, indicating a clear bias towards areas with less extreme environmental conditions, often coinciding with more densely populated areas (Kolmogorov–Smirnov test, p-value < 0.02; Tukey post hoc test identified wetlands and cells without information as the primary contributors to these differences).

Fig. 6

10 × 10 km cell completeness along three environmental variables for the bird species’ citizen science dataset in the Iberian Peninsula. Frequency distribution of total (black bars) and well-sampled cells, considering strict (N ≥ 1000 and C ≥ 0.9) (brown bars) and lax (N ≥ 200 and C ≥ 0.8) (yellow bars) criteria by (a) annual mean temperature, (b) annual rainfall and (c) altitude (ordinates in log-scale). Frequency distribution of cell completeness along the same environmental variables: (d) temperature, (e) rainfall, (f) altitude. In (d–f) brown and yellow bars are well sampled cells according to strict and lax criteria, respectively. Green bars are cells not well-sampled and the red ones, cells with no data.

The main land uses in the Iberian Peninsula, based on their dominance within each cell, are forestry and agriculture, with a small number of cells designated as artificial surfaces (17) and wetlands (3) (Fig. 7). Despite wetlands being the least represented cell type in the study area, they exhibit the highest coverage, with 100% of cells considered as well-sampled, while the rest of the land uses have proportions ranging between 40 and 60% of well-covered cells. The most dominant land uses, namely forest (2871 cells) and agriculture (2781 cells), display similar absolute values of cells represented and a comparable proportion of well-covered cells (40.5% and 44.1%, respectively, using strict criteria).

Fig. 7

Land use coverage of grid cells in the Iberian Peninsula bird’s citizen science dataset. (a) Map illustrating land use distribution per cell. (b) Frequency distribution of total 10 × 10 km cells per land use type (For. Forests, Agr. Agricultural areas, Art. surf. Artificial surfaces, Wet. Wetlands, Wat. bod. Water bodies, Mix. Mixed, and nd no land use data), along with the frequency of well-sampled cells. (c) Frequency of well-sampled grid categorized by lax and strict criteria, and cells not thoroughly surveyed for each land use type. Black bars are total cells, brown and yellow bars are well sampled cells according to strict and lax criteria, respectively. Green bars are cells not well-sampled and the red ones, cells with no data.

Global coverage of bird citizen science data

Open-access biodiversity datasets from CS platforms have become increasingly important in recent years. Currently, GBIF hosts nearly 2 billion records for the class Aves. Remarkably, over 95% of these records originate from just 40 countries (Table 5). The United States is the largest contributor, with almost 1 billion records. Spain and Portugal, the primary focus of this study, rank 9th and 21st, contributing 44.0 million and 12.3 million records, respectively. There is a significant correlation between the number of bird records and the number of avian orders present in a country (Pearson’s correlation on log-transformed data, r = 0.67, p < 0.0001, n = 40). Additionally, the correlation between a country’s area and the number of bird occurrences is also significant (r = 0.32, p < 0.05, n = 40). Table 5 arranges countries by the density of bird occurrences per square kilometre. In this ranking, Portugal and Spain are positioned 10th and 14th globally, with densities of 133 and 87 occurrences per square kilometre, respectively. These rankings indicate that the coverage and quality of CS data from the Iberian Peninsula are relatively high compared to the rest of the world.

Table 5 Statistics for the top 40 countries with the highest total bird occurrences in GBIF40.

Discussion

CS data are of significant interest for biodiversity conservation projects47. They contribute to raising awareness of biodiversity and the challenges encountered in such studies, partly due to the growing public interest in biodiversity3. It is crucial for these inventories to maintain integrity concerning spatial, temporal, and ecological dimensions, thus preventing the transmission of uncertain data to future studies3. Insufficient data can result in an unreliable portrayal of species evolution within a territory, leading to incomplete insights into species’ threats and other crucial knowledge43. It is also imperative to stress the importance of data quality to data collectors, as errors such as incorrect dates may be challenging to detect and could yield erroneous outcomes19. Nevertheless, numerous successful studies based on CS have achieved significant conservation goals, including inventories, trend analyses, assessing biodiversity impacts, and selecting protected areas (e.g., Australian frogs inventory48, fire effects on biodiversity at a large scale49, spatiotemporal monitoring of coral reefs50, and prioritizing conservation areas51).

In this study, we have examined the quality of CS data concerning birds in the Iberian Peninsula, driven by the increasing volume of data and user participation across various sources. Consequently, our research serves as an evaluation of the comprehensiveness of CS occurrence data, pinpointing geographic, temporal, and environmental gaps in coverage. It is essential to emphasize that this study should not be construed as an accurate representation of species diversity and richness in the Iberian Peninsula, nor should it be considered a predictor of bird species distribution concerning climate change52, owing to the many biases and lack of randomness in the records. Instead, for now, it should be used as a baseline for comparative analyses with future inventories4.

With a dataset comprising 23,372,637 records, our study represents the most extensive data collection to date in CS assessments, both for the Iberian Peninsula18,19 and on a broader scale, including regions such as China, Mexico, or Brazil4,5,6,7,17,43. This significant volume of data can be attributed to the heightened level of volunteer surveillance in Europe compared to other regions (Table 5), coupled with the predominant focus on birds in the surveyed organisms (accounting for 54% of total records in GBIF)53.

Taxonomic coverage

The taxonomic coverage across different platforms for the Iberian birds shows notable consistency, with slightly higher coverage observed in GBIF. While minor discrepancies in species representation among platforms may highlight the importance of utilizing all three sources, practical considerations suggest initially prioritizing the most comprehensive database. GBIF aggregates data from eBird, Observation.org, museums, and others, but the lag time associated with data upload to GBIF implies the practicality of leveraging all available platforms. However, adopting this approach introduces additional challenges in data filtering and processing due to the heterogeneity of projects within GBIF, ranging from structured (strict protocols and methodologies, ensuring high data quality and consistency, ideal for precise scientific studies but requiring significant participant training and resource management; e.g., atlases and academic projects) to semi-structured (provide guidelines but allow flexibility, balancing between consistent data collection and participant autonomy, making them adaptable to diverse conditions while potentially introducing some data variability; e.g., eBird) or unstructured (grant complete freedom in data collection, encouraging broad participation and large data volumes, but often at the cost of data standardization and quality; e.g., iNaturalist) formats. As an efficient alternative, using only eBird data may circumvent issues associated with mixed-origin data in GBIF or data scarcity in Observation.org. eBird pools the majority of bird data and offers the advantages of a semi-structured project.

Taxonomic coverage is a critical consideration as it can introduce bias when users prioritize certain taxa in their records (e.g., threatened species, rarities) or focus their surveys on areas where sought-after species are known to occur54. This bias becomes evident when observing the concentration of well-sampled grid cells in popular natural spaces. However, our assessment reveals robust coverage across all taxonomic orders, with values exceeding 90% in most cases. For instance, Passeriformes (208 out of 217 listed species) and Charadriiformes (115 out of 124 species) demonstrate excellent coverage, with half of the total orders achieving 100% coverage of their species. Overall, the checklist for all bird species in the Iberian Peninsula also surpasses 90% coverage. Furthermore, orders with lower coverage typically comprise only a few species, with gaps often representing rarities rather than common species. Otidiformes and Pterocliformes exemplify this pattern: the missing species, Chlamydotis undulata and Syrrhaptes paradoxus, respectively, are notable rarities in the Iberian Peninsula. In other studies, low coverage within an order may also be attributed to the difficulty of observing seabirds, given their tendency to inhabit pelagic areas for extended periods43.

Geographical inventory completeness

Completeness, defined as the ratio between observed and expected species richness4,5, is a crucial metric for assessing the utility of CS data6,7. We have employed stringent criteria to identify cells with complete inventories (requiring values above 0.8 for lax criteria and 0.9 for strict criteria). This approach reveals that many areas initially deemed inadequately covered actually exhibit relatively high completeness (as illustrated by the extensive region with completeness values ranging between 0.6 and 0.8 in Fig. 3). Furthermore, instances of incompleteness may not necessarily stem from a lack of surveys but rather from data that has not yet been digitized, rendering it inaccessible for analysis6.
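The completeness ratio and the two thresholds described above can be illustrated with a minimal sketch. The function and constant names are hypothetical, the richness values are illustrative rather than taken from our dataset, and only the completeness thresholds are represented; how expected richness is estimated is outside the scope of the sketch.

```python
# Hypothetical names; only the 0.8 and 0.9 thresholds come from the text.
LAX_THRESHOLD = 0.8
STRICT_THRESHOLD = 0.9


def completeness(observed_richness: int, expected_richness: float) -> float:
    """Ratio between observed and expected species richness for one cell."""
    if expected_richness <= 0:
        raise ValueError("expected richness must be positive")
    return observed_richness / expected_richness


def classify_cell(value: float) -> str:
    """Label a cell by the strictest completeness criterion it exceeds."""
    if value > STRICT_THRESHOLD:
        return "strict"
    if value > LAX_THRESHOLD:
        return "lax"
    return "incomplete"


# Illustrative cells (observed, expected); not values from the study.
for observed, expected in [(95, 100.0), (85, 100.0), (68, 100.0)]:
    ratio = completeness(observed, expected)
    print(observed, expected, round(ratio, 2), classify_cell(ratio))
```

Under this labelling, a cell with completeness between 0.6 and 0.8 is reported as "incomplete" even though most of its expected species have been recorded, which is the situation discussed for Fig. 3.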

Our findings indicate a high level of completeness (mean completeness: 0.68), with a minimal number of cells categorized as not well surveyed, and all cells containing at least one bird record. We observed superior coverage particularly in Portugal and the eastern Spanish region. The areas with the highest sampling coincide with those having the greatest number of locations per cell. However, the distribution of sampled cells was non-random and exhibited significant bias, as indicated by the geographical Moran’s I coefficient of 0.5942. This indicates an uneven distribution of records, with a tendency to cluster in specific areas such as large metropolitan regions and protected areas. These findings align with similar studies, such as those conducted in India, which also report geographical completeness biases5.
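For readers unfamiliar with Moran’s I, the following sketch computes the global statistic on a small regular grid. It assumes binary rook-contiguity weights (cells sharing an edge are neighbours), which is an illustrative choice and not necessarily the weighting used in our analysis; the function name and the example grids are hypothetical.

```python
def morans_i(grid):
    """Global Moran's I for a rectangular grid of values.

    Assumes binary rook-contiguity weights: w_ij = 1 when cells i and j
    share an edge, 0 otherwise (an illustrative assumption).
    """
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(v for row in grid for v in row) / n
    dev = [[v - mean for v in row] for row in grid]
    denom = sum(d * d for row in dev for d in row)
    if denom == 0:
        raise ValueError("all values are identical; Moran's I is undefined")

    cross = 0.0   # sum over neighbour pairs of z_i * z_j
    s0 = 0        # sum of all weights (number of directed neighbour pairs)
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross += dev[r][c] * dev[rr][cc]
                    s0 += 1
    return (n / s0) * (cross / denom)


# Illustrative grids (not study data): clustered values vs. a checkerboard.
clustered = [[1, 1, 0, 0]] * 4
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
print(morans_i(clustered), morans_i(checkerboard))
```

Positive values arise when similar values sit next to each other, as in the clustered grid, whereas the checkerboard yields a negative value; a coefficient near 0.59 therefore points to marked spatial clustering.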

The only prior studies examining the quality of CS data coverage in the Iberian Peninsula focused on certain invertebrates18 and mammals19, identifying a few well-sampled regions, notably mid-western, north, north-east and eastern Spain. While this pattern is similar to our findings for avifauna, differences exist in the distribution of well-sampled cells. Whereas well-sampled areas for mammals appear to stem from specialized monitoring surveys, our analysis of bird data reveals a clear pattern linked to popular visitor destinations. Notably, well-sampled cells are frequently situated in major urban areas, densely populated coastal regions, and renowned natural habitats such as popular wetlands. Conversely, significant gaps in sampling occur in sparsely populated interior regions of central Spain. It is important to note that these regions are not necessarily devoid of bird biodiversity; rather, the records may not fully capture the true richness of these areas due to lower attractiveness to birdwatchers43.

Other studies focusing on birds typically exhibit a proportionally smaller number of records and more pronounced gaps, as evidenced by research conducted in China, Mexico, and India. This phenomenon can be attributed to the relatively short history of birdwatching in these regions43. Nonetheless, the exponential growth of CS data suggests that these gaps may only be temporary and could diminish in the coming years. In contrast, the study focused on mammals in the Iberian Peninsula revealed a low inventory completeness, with approximately 90% of cells failing to meet lax criteria, compared to around 30% in our avifauna study. This discrepancy may be attributed to differences in survey efforts between mammals and birds. Mammals, with their elusive behaviour and many being nocturnal, present greater challenges for observation compared to birds, which are often easily spotted flying or active during the day19. However, it’s worth noting that the best-surveyed areas identified in the mammal study coincide with those highlighted in our avifauna analysis, although our study also emphasizes well-sampled urban areas and Portugal19.

Temporal coverage

Temporal coverage reveals a remarkable surge in data sharing following the widespread adoption of CS platforms, with exponential growth observed since around 2016, culminating in over 3 million records annually by 2021. When the period from 1817 to 1974 is considered on its own, an exponential trend is already apparent from approximately 1960. Given this growth, the number of records will likely have increased substantially by the time of publication. This pattern prompts us to question whether the knowledge gap prior to the advent of CS platforms reflects a genuine absence of data or rather a problem stemming from limited data sharing.
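One simple way to check whether annual record counts grow exponentially is to fit a straight line to the logarithm of the counts and read the doubling time from the slope. The sketch below illustrates this under stated assumptions: the function names are hypothetical, and the counts are synthetic values constructed to double every year, not the records analysed in this study.

```python
import math


def fit_exponential(years, counts):
    """Least-squares fit of ln(count) = a + b * year; returns (a, b)."""
    if len(years) != len(counts) or len(years) < 2:
        raise ValueError("need at least two (year, count) pairs")
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("years must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def doubling_time(growth_rate):
    """Years needed for the count to double at a constant growth rate b."""
    return math.log(2) / growth_rate


# Synthetic, illustrative counts only (NOT the study's data).
years = list(range(2016, 2022))
counts = [1000 * 2 ** (y - 2016) for y in years]
intercept, rate = fit_exponential(years, counts)
print("growth rate per year:", rate)
print("doubling time (years):", doubling_time(rate))
```

A roughly linear relationship between year and log-count supports an exponential description; systematic curvature would suggest a different growth model.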

The frequency distribution of months with available data exhibits discernible patterns, both natural and influenced by the activity of birdwatchers, leading to specific areas with better temporal coverage than expected43. When considering the temporal index for the entire study period, we observe generally low values across the study area, with notable exceptions in major urban areas and certain coastal regions. However, focusing solely on the period from 1980 to 2021, the temporal index shows a considerable increase. This suggests that gaps in temporal coverage were more pronounced until the 1980s, with improvements in temporal completeness coinciding with the emergence of biodiversity platforms and CS initiatives2. Regarding species-specific temporal coverage, it was generally low for most species, with some exceptions.

Common spatio-temporal patterns

It is logical to expect correlations between inventory completeness, geographical coverage, and temporal coverage, as they are often influenced by the same factors. Indeed, there is a strong correlation among them, as the most active observers tend to collect data from the same birdwatching sites (for amateurs) or study areas (for scientists). Consequently, these areas exhibit better coverage across all analysed aspects, as observed in previous studies4,6,7. However, depending on the scale of the study, this phenomenon can introduce an autocorrelation bias that researchers must carefully consider before utilizing the data.

Environmental and land use coverages

The Iberian Peninsula is predominantly characterized by two main land uses: forestry, which encompasses natural or spontaneous vegetation, not limited to forests, and agricultural land. Well-surveyed agricultural areas under strict criteria slightly outnumber those in forestry use. This can be attributed to the higher human activity in agricultural regions, which provides increased opportunities for data collection. Additionally, forested areas in Spain tend to be sparsely populated and are less frequented by birdwatchers, although they may attract hikers who typically contribute less data. However, when the lax criterion is considered, forested areas show moderate completeness compared to agricultural ones. The most comprehensively surveyed areas are artificial surfaces, primarily urban areas, followed by mixed land use areas with no dominance, and wetlands. The high ranking of wetlands can be attributed to intensive visitation by birdwatchers, drawn by the high bird diversity in these habitats55.

Environmental coverage appears to be high, with a more thorough surveying level observed in areas meeting specific environmental conditions. These conditions include an annual mean temperature either higher than 15 °C or lower than 5 °C, an annual mean precipitation ranging between 500 and 1500 mm, and altitudes either less than 500 m or exceeding 2000 m. This observation aligns with the concentration of human population in temperate zones56. Interestingly, areas at the highest altitudes and with the lowest temperature also demonstrate a well-surveyed status. This phenomenon may be attributed to travellers in mountainous regions who document bird sightings during their journeys, even if these sightings primarily involve species they were actively seeking. However, the lax criterion in these areas tends to be higher compared to more moderate climates or densely populated regions. This difference may be due to the relatively smaller number of surveyors in these extreme environmental conditions.

Future prospects for bird CS data in Iberia

CS is emerging as a crucial tool for biodiversity studies and has the potential to engage local communities in biodiversity conservation efforts28. However, CS data exhibits several gaps and biases that need to be addressed. First and foremost is the issue of data quality57, which is heavily influenced by two main factors58: (1) the preferences and skills of volunteers in identifying and counting birds, and (2) environmental variability that affects bird detectability (e.g., weather conditions, time of day, season, and disturbances such as background noise). Volunteer site preferences, inconsistent attention to different species, and varying time investment in sampling can introduce significant biases into the data. To mitigate these issues, some CS projects, such as eBird, encourage data collection through semi-structured protocols to ensure a minimum level of data quality. eBird allows participants to record the type of observation (e.g., point count, transect, incidental, historical, or other), the start time, and the duration of sampling. Such semi-structured approaches help standardize the data and improve its reliability. Moreover, depending on the taxa or complexities of data collection, CS practitioners may require training and mentoring29. Various strategies exist to address low-quality data in CS projects, including spatiotemporal gaps and biases, both for existing datasets and future sampling efforts. Callaghan et al.59 discuss several statistical techniques used to mitigate biases. These include accounting for differences in observer skills, pooling, filtering, or subsampling data to reduce errors and uneven sampling effort, and employing a ‘big data’ approach to enhance data quality (see ref. 59 for detailed references).
To fully harness the potential of CS platforms in filling gaps in our understanding of global biodiversity and driving research and conservation initiatives, it’s necessary to expand survey efforts in under-sampled regions worldwide, particularly those with high conservation value60,61.
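As an illustration of the subsampling strategies mentioned above for reducing uneven sampling effort, the sketch below caps the number of records retained per grid cell. It is a generic example, not the procedure of any cited study: the record format (dictionaries with "lat" and "lon" keys), the function name, the cell size, and the cap are all hypothetical.

```python
import math
import random
from collections import defaultdict


def subsample_by_cell(records, cell_size_deg, max_per_cell, seed=0):
    """Keep at most `max_per_cell` randomly chosen records per grid cell.

    `records` is a list of dicts with "lat" and "lon" keys (a hypothetical
    format); cells are squares of `cell_size_deg` degrees.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_cell = defaultdict(list)
    for rec in records:
        cell = (math.floor(rec["lat"] / cell_size_deg),
                math.floor(rec["lon"] / cell_size_deg))
        by_cell[cell].append(rec)

    kept = []
    for cell in sorted(by_cell):
        group = by_cell[cell]
        if len(group) > max_per_cell:
            group = rng.sample(group, max_per_cell)
        kept.extend(group)
    return kept


# Illustrative records (not study data): one heavily sampled cell, one sparse.
dense = [{"lat": 40.41, "lon": -3.70} for _ in range(100)]
sparse = [{"lat": 41.65, "lon": -2.47} for _ in range(3)]
thinned = subsample_by_cell(dense + sparse, cell_size_deg=0.1, max_per_cell=10)
print(len(thinned))
```

Capping records per cell reduces the weight of heavily visited sites in downstream analyses, at the cost of discarding data; it does not add information for under-sampled cells, which still require additional survey effort.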

The study conducted in China43 provides valuable strategies to address significant information gaps for future CS projects: (1) Prioritize observation sessions in areas with low data coverage, guided by expert researchers. (2) Implement alternative monitoring techniques utilizing new technologies such as aerial vehicles and cameras. These innovative approaches can complement traditional field surveys and provide additional insights into species distribution and abundance. (3) Utilize indirect information from specimens, such as DNA samples. Incorporating genetic data can offer valuable insight into species diversity and relationships, particularly in cases where direct observation is challenging. And finally, (4) leverage existing literature on bird distribution43. Incorporating information from scientific studies and field guides can enhance our understanding of species occurrence and habitat preferences. Another study underscores the importance of promoting less charismatic species to garner attention and conservation efforts. By highlighting the significance of these overlooked organisms, scientists can encourage society to prioritize learning about them54. Similarly, Callaghan et al.62 summarize strategies to enhance CS sampling for significant gains in data quality through three key points: (1) shifting focus from taxon-specific approaches to spatial and temporal considerations, (2) recommending high-value sampling sites to collaborators, and (3) fostering meaningful participation in sampling via dynamic incentive systems. Implementing these proposals, while acknowledging real-world constraints, could substantially improve data quality and enhance the ability to measure bird trends over space and time, a crucial objective amid the ongoing Sixth Extinction.

In the case of the Iberian Peninsula, our data indicates that CS on birds is approaching completeness, making it suitable for scientific studies. However, it’s crucial to thoroughly review the gaps identified in inventory, geographical, temporal, land use, and environmental coverages to mitigate potential biases in research outcomes. The ongoing exponential growth in the number of records and users on CS platforms in the Iberian Peninsula is expected to partially alleviate some of the identified gaps. Nevertheless, we encourage the Iberian birder community to extend their focus to areas with deficient information. By actively contributing data from these underserved regions, birdwatchers can play a vital role in further enhancing the comprehensiveness and accuracy of biodiversity knowledge in the region.