A metagenomic analysis of the phase 2 Anopheles gambiae 1000 genomes dataset reveals a wide diversity of cobionts associated with field collected mosquitoes

Pastusiak, Andrzej; Reddy, Michael R.; Chen, Xiaoji; Hoyer, Isaiah; Dorman, Jack; Gebhardt, Mary E.; Carpi, Giovanna; Norris, Douglas E.; Pipas, James M.; Jackson, Ethan K.

doi:10.1038/s42003-024-06337-9

Published: 30 May 2024

Volume 7, article number 667, (2024)

Communications Biology


Abstract

The Anopheles gambiae 1000 Genomes (Ag1000G) Consortium previously utilized deep sequencing methods to catalogue genetic diversity across African An. gambiae populations. We analyzed the complete datasets of 1142 individually sequenced mosquitoes through Microsoft Premonition’s Bayesian mixture model-based (BMM) metagenomics pipeline. All specimens were confirmed as either An. gambiae sensu stricto (s.s.) or An. coluzzii with a high degree of confidence (>98% identity to reference). Homo sapiens DNA was identified in all specimens, indicating that contamination may have occurred at the time of specimen collection, preparation, and/or sequencing. We found evidence of vertebrate hosts in 162 specimens, and 59 specimens contained validated Plasmodium falciparum reads. Human hepatitis B and primate erythroparvovirus-1 viral sequences were identified in fifteen and three mosquito specimens, respectively. Of the 1142 specimens, 478 were found to contain bacterial reads, and bacteriophage-related contigs were detected in 27 specimens. This analysis demonstrates the capacity of metagenomic approaches to elucidate important vector-host-pathogen interactions of epidemiological significance.


Introduction

The goal of the Anopheles gambiae 1000 Genomes (Ag1000G) Project was to determine the genetic diversity and population structure of An. gambiae complex mosquitoes, the primary vectors of human malaria parasites throughout sub-Saharan Africa^1,2. Whole genome sequencing of individual field-collected mosquitoes from thirteen African countries was used to identify the presence and distribution of single nucleotide polymorphisms conferring phenotypic traits such as reduced susceptibility to insecticides¹. In addition to mosquito DNA, endogenous and acquired viral, bacterial, fungal, protozoan, and in cases where mosquitoes were blood-fed, vertebrate DNA sequences may also be present in data generated from these field-derived specimens.

Anopheles gambiae mosquitoes interact with a myriad of vertebrates, plants, and their associated microbiota at various stages of development and across a diverse range of habitats³. Shifts in An. gambiae gut microbiome composition are strongly correlated with transitions from aquatic larval habitats to terrestrial settings where adult mosquitoes are actively seeking nectar and vertebrate host blood to support flight and egg production⁴. Partially digested blood meals often contain intact and degraded RNA and/or DNA derived from the vertebrate host(s) as well as pathogens and microorganisms present in the circulatory system of the host at the time of feeding⁵. Methods such as mitochondrial DNA barcoding, amplicon-based and shotgun metagenomic sequencing, and ELISA-based approaches have been successfully used to elucidate host-feeding patterns of hematophagous insects⁶. Shotgun metagenomic sequencing offers significant advantages with respect to greater resolution and accurate identification of microbial genera and reduced primer bias as compared to amplicon sequencing approaches, including 16S rDNA methods^7,8. Metagenomic analyses of hematophagous insects and their blood meals represent a promising approach for rapid detection and identification of previously undescribed and established microbiota, including pathogens and their respective vertebrate hosts^9,10,11,12.

Recent advances in high throughput, deep sequencing, and bioinformatics have given rise to metagenomic approaches for rapid, highly accurate resolution of complex environmental specimens^13,14. Microsoft Premonition has developed a Bayesian mixture model-based (BMM) metagenomics pipeline capable of identifying known taxa at the species level and estimating species present in a single specimen. The pipeline utilizes a ten-tera-base genomic reference database and cloud-scale statistical machine learning to quickly: (1) build probabilistic assignments from reads to species based on sequence similarity, (2) refine species probabilities for ambiguous reads by computing a global statistical model across all reads, and (3) identify previously undescribed, unexpected, and contaminant genetic material by aligning against all taxa with available (partial) genomic references, i.e., without a priori assumptions on which taxa might be present in a specimen and without limiting the analysis to a small subset of genomic references (e.g., to only pathogens for computational reasons)^15,16. To this end, we analyzed the publicly available Ag1000G Phase 1 and 2 datasets using the Microsoft Premonition metagenomics BMM pipeline to determine the constituent species present in each of 1142 field-collected Anopheles gambiae mosquito specimens.

Results and discussion

Summary of BMM analysis for Ag1000G datasets

The BMM pipeline computed genome posterior probabilities (i.e., the output BMM) for over 147 billion sequence reads from 1142 individual specimens that comprise the Ag1000G Phase 1 and Phase 2 datasets (Table 1, Fig. S1)^17,18. These reads were compared against a database of >600,000 reference genomes (at time of analysis) spanning the entire tree of life. The pipeline considered vertebrate, plant, protozoan, chromist, and archaeal references, in addition to bacteria, viruses, and arthropods, to estimate taxon abundance profiles per mosquito¹⁹. An exceedingly small proportion of reads was assigned to plants, fungi, and other taxa. Less than one percent of all reads (0.86%) failed assignment (with edit distance 20 or better) to any sequence present in sequence databases at the time of analysis. In the following summary, we say “a read was assigned to a taxon” to mean that a given read had the highest probability of coming from a given taxon, even though the output BMM presents possible alternatives and their corresponding probabilities.
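
The assignment convention described above amounts to taking the highest-probability entry of each read's posterior. A minimal Python sketch, assuming each read's posterior is available as a dictionary from taxon name to probability; the function names and the toy values are illustrative and are not part of the Premonition pipeline:

```python
from collections import Counter
from typing import Dict, Iterable


def assign_read(posterior: Dict[str, float]) -> str:
    """Return the taxon with the highest posterior probability for one read."""
    return max(posterior, key=posterior.get)


def tally_assignments(posteriors: Iterable[Dict[str, float]]) -> Counter:
    """Count reads per taxon under the highest-probability convention."""
    return Counter(assign_read(p) for p in posteriors)


# Toy posteriors for three reads (illustrative values only).
toy_posteriors = [
    {"Anopheles gambiae": 0.7, "Anopheles coluzzii": 0.3},
    {"Anopheles gambiae": 0.4, "Anopheles coluzzii": 0.6},
    {"Homo sapiens": 1.0},
]
counts = tally_assignments(toy_posteriors)
```

In this sketch ties are broken by dictionary order; the alternative taxa and their probabilities remain available in each posterior.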

Table 1 Total number and percent total of reads recovered by the BMM pipeline


In total, 93.02% of reads were assigned to the phylum Arthropoda, which includes An. gambiae mosquitoes. In addition, we detected a considerable number of reads assigned to chordates, bacteria, and bacteriophages (Table 1). Specimens associated with chordates included reads assigned to hominid, bovid, canid, equid, and phasianid hosts (Table 2). These specimens contained reads covering at least 25% of the chordate reference genomes. All specimens contained reads assigned to human sequences; however, the number of reads varied widely between specimens, suggesting that some of these represented blood meals taken by the mosquitoes prior to capture, while others were likely the result of specimen contamination.

Table 2 Percentage of vertebrate (Phylum: Chordata) host reads identified in mosquito specimens (n = 4,713,450,496)


Anopheles gambiae species complex mosquitoes

We evaluated BMM assignments of specimens that were morphologically identified at the time of collection and genomically verified by the Ag1000G consortium to members of the An. gambiae species complex, which are evolutionarily similar. The An. gambiae species complex represented 93.3% of the probability mass given to Arthropod-assigned reads, with the remaining mass scattered around other anophelines. Next, we evaluated BMM probabilities at the species level. Higher An. gambiae probabilities corresponded to An. gambiae specimens (and similarly for An. coluzzii probabilities and specimens), though probabilities were more evenly distributed across these taxa, expressing greater uncertainty in the estimates (Fig. 1). Straightforward selection of clusters from BMM statistics correctly grouped the Ag1000G specimens at the species level with 96.4% accuracy (Fig. S2). In summary, our model provided useful probabilities suggesting a robust interpretation of noisy alignment data across references with varying size, quality, and homology. Uncertainty increased at the species level which encourages more careful inspection and interpretation of model probabilities. As genomic databases such as The Darwin Tree of Life Project, VectorBase, the i5K initiative, and others expand, we expect the uncertainty of assignments to accordingly decrease^20,21,22.

**Fig. 1: Relationships between confirmed *Anopheles* species and BMM probabilities.**

Vertebrate hosts

Our model assigned vertebrate reads to humans as well as several domesticated animals, including cow, goat, dog, donkey, and fowl, which all comprise common livestock species found in rural settings across Africa (Table 2)²³. Although some human reads may be attributed to blood meals acquired from humans, it is difficult to discern human contamination from field and laboratory handling of individual specimens versus a human-derived blood meal. In contrast, the reads assigned by BMM to other vertebrates likely constitute evidence of blood feeding on non-human hosts. These signals represent over 613 million reads, sharing greater than 99% identity with vertebrate reference genomes deposited in GenBank and RefSeq databases. We suggest this indicates blood feeding by these mosquitoes had occurred on singular and, in some cases, multiple hosts. A total of 162 mosquito specimens contained non-mosquito reads corresponding to at least one chordate genome. 138 specimens contained human reads, and 24 specimens contained non-human vertebrate reads. Blood meal hosts were assigned where vertebrate reads within a mosquito specimen corresponded to at least 25% coverage of a specific reference genome sequence (Table 3). In several cases, the pipeline assigned reads to hosts based on very high coverage rates. For example, specimen ERS224451 contained 125,897,307 assigned reads covering 84% of the Capra aegagrus (wild goat) genome²⁴. Specimen ERS224085 contained 49,266,543 assigned reads covering 84% of the Equus asinus asinus (donkey/ass) genome, and specimen ERS224472 contained 71,332,843 assigned reads covering over 69% of the human genome^25,26.
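
The 25% coverage rule for calling a blood meal host can be illustrated with a short Python sketch. It assumes read alignments are available as half-open, 0-based (start, end) intervals on a reference genome. The function names, the interval representation, and the greater-than-or-equal comparison are assumptions made for illustration, not the pipeline's implementation.

```python
from typing import Dict, Iterable, List, Tuple


def covered_fraction(intervals: Iterable[Tuple[int, int]], genome_length: int) -> float:
    """Fraction of genome positions covered by at least one aligned read.

    Intervals are half-open (start, end) and assumed to lie within the genome.
    """
    covered = 0
    furthest_end = 0  # right-most position already counted
    for start, end in sorted(intervals):
        start = max(start, furthest_end)  # skip the part that was already counted
        if end > start:
            covered += end - start
            furthest_end = end
    return covered / genome_length


def call_hosts(coverage_by_host: Dict[str, float], threshold: float = 0.25) -> List[str]:
    """Return hosts whose reference genome coverage meets the threshold."""
    return sorted(h for h, cov in coverage_by_host.items() if cov >= threshold)


# Toy example: three alignments on a 100-position genome (illustrative only).
toy_fraction = covered_fraction([(0, 10), (5, 15), (20, 30)], 100)
toy_hosts = call_hosts({"Capra aegagrus": 0.84, "Bos taurus": 0.02})
```

Merging overlapping intervals before summing avoids double-counting positions hit by several reads, which matters when per-host read counts are in the tens of millions.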

Table 3 Number of specimens containing vertebrate host reads


Given the high degree of anthropophily generally observed in An. gambiae and An. coluzzii, the abundance of non-human host signals is greater than expected (Fig. 2)^27,28. This finding suggests that An. gambiae and An. coluzzii may be more opportunistic feeders than has previously been appreciated^27,28,29,30. This may also reflect the relative abundance and diversity of hosts available to host-seeking mosquitoes in the sites where specimens were collected, as well as the method of collection and handling of specimens prior to sequencing. All these factors must be considered when interpreting the results and further indicate the need for accurate and extensive metadata at the time of collection³¹.

Plasmodium

The BMM pipeline assigned 5,344,273 reads (0.004%) to seven Plasmodium parasite species (with varying probabilities) distributed among 485 of the 1142 mosquitoes. However, the number of reads per specimen varied widely. To further examine the presence of Plasmodium falciparum, the most lethal and primary human malaria Plasmodium species transmitted by An. gambiae, sequence reads from each specimen were realigned using SNAP aligner against a single P. falciparum reference (GCA_001861075.1)³².

P. falciparum core and apicoplast genome coverage

As the P. falciparum nuclear genome consists of low complexity sequences (80.6% A + T), which can result in ambiguity in sequence assignments and likely was the reason for assignment to seven parasite species, we assessed the coverage of the P. falciparum core genome (hypervariable and subtelomeric regions excluded; 20.8 Mb) relative to that of the apicoplast genome (35 kb)^33,34. The P. falciparum apicoplast genome has a higher copy number compared to its nuclear counterpart (15:1 ratio). Therefore, it is expected to have an increased sensitivity of detection^35,36. A total of 432 and 148 specimens of the 1142 mosquitoes contained sequence reads mapping to the P. falciparum core genome and the apicoplast, respectively (Fig. 3a). As expected, all specimens with apicoplast reads had greater depth and a higher percent coverage than core genome assemblies, reflecting the disparity in sizes and copy numbers between the two genomes.

**Fig. 3: *Plasmodium falciparum* read assignment in the Ag1000G dataset.**

Plasmodium read validation

We discovered that some of these suspected Plasmodium reads originated from six anopheline specimens morphologically identified as male mosquitoes. Because only female mosquitoes feed on blood, the assignment of P. falciparum reads to male mosquitoes should be considered erroneous, arising from contamination or mislabeling. All specimens with reads aligned to the P. falciparum apicoplast also contained reads aligned to the core genome; however, the reverse was not true. Read validation was further accomplished by establishing a threshold of the coverage for the apicoplast and core genomes in relation to the characteristic guanine–cytosine (GC) content of the P. falciparum genome.

The positive threshold for the presence of P. falciparum

We determined that the GC content of P. falciparum consensus sequences was distinctly lower than that of the contigs found in the Ag1000G An. gambiae metagenomes, specifically when compared to bacterial taxa (Fig. 3b). Furthermore, the correlation between the GC content and the percentage of genome coverage denoted a distinct threshold in genome coverage above which the sequences had consistent GC content that fell within the estimated interquartile range (Fig. 3b). Specifically, for the apicoplast and the core genomes the thresholds were estimated to be 3.0% (Fig. 3c) and 0.4% (Fig. 3d), respectively. Based on these thresholds, a total of 59 specimens (5.6%) had validated coverage for both parasite genomes and were considered true positives for P. falciparum, while 339 specimens had no validated coverage and are likely false positives (Supplementary Data 4: “P. falciparum reads”). All male mosquitoes were in the latter category.
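
The validation rule above combines a GC-content check with two coverage thresholds. The following Python sketch shows one way to express it, using the 3.0% (apicoplast) and 0.4% (core) values reported in the text. The function names and the inclusive comparison at the threshold are assumptions for illustration; the GC interquartile-range check is not reproduced because its bounds are not given here.

```python
APICOPLAST_MIN_COVERAGE_PCT = 3.0  # threshold reported for the apicoplast genome
CORE_MIN_COVERAGE_PCT = 0.4        # threshold reported for the core genome


def gc_content(sequence: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return float("nan")  # no unambiguous bases to measure
    return (seq.count("G") + seq.count("C")) / acgt


def is_validated_positive(apicoplast_cov_pct: float, core_cov_pct: float) -> bool:
    """True when both genome coverages reach their thresholds (inclusive, by assumption)."""
    return (
        apicoplast_cov_pct >= APICOPLAST_MIN_COVERAGE_PCT
        and core_cov_pct >= CORE_MIN_COVERAGE_PCT
    )


# Toy values (illustrative only, not taken from the dataset).
toy_gc = gc_content("ATATATGCAT")
toy_call = is_validated_positive(apicoplast_cov_pct=12.5, core_cov_pct=0.9)
```

Requiring both genomes to pass reflects the observation that every specimen with apicoplast reads also had core-genome reads, while the reverse did not hold.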

Viruses

The BMM pipeline assigned 2,039,560 reads (0.001%) to 80 species of viruses and bacteriophage distributed among 223 of the 1142 mosquito specimens (Supplementary Note 1: “Detection of viral and bacteriophage species by the BMM and Integrator pipelines”). Eukaryotic viral sequences were found in 65 specimens and bacteriophage-related sequences in 167 specimens, while both eukaryotic viruses and bacteriophage were detected in 10 specimens (Supplementary Data 4: “All viral species detected” & “Viral species by specimen”). Analysis revealed that many of these detections were false positives due to physical contamination or computational misassignment. The evidence for labeling taxa as a contaminant is summarized in Column G of Supplementary Data 4: “All viral species detected”. Many of these are well-known contaminants present in many metagenomic studies³⁷. In each of these cases, a genome coverage map was generated. In many contaminant taxa, sequences map to a single genome feature that is commonly used in viral vectors. For example, we detected human cytomegalovirus (CMV) in some specimens, but all the sequence reads mapped to the immediate early promoter. SV40 was detected in some specimens with all sequence reads covering the polyadenylation signal. Similarly, we found sequences aligning to human adenovirus and avian leukosis virus, but in each case, the alignments were to vector-associated features. Close examination of viral coverage maps and sequencing flow cell history suggested that the human immunodeficiency virus (HIV) and influenza sequences detected were the consequence of contamination, most likely stemming from previous sequencing runs using the same flow cell. No known mosquito viruses were detected in any of the specimens.
However, nineteen mosquitoes contained authenticated vertebrate viral sequences, including fifteen specimens containing human hepatitis B virus (HBV), a single specimen containing ungulate erythroparvovirus-1, and three specimens containing primate erythroparvovirus-1 (Figs S3–S5; Supplementary Notes 2–4: “Presence of HBV”, “Presence of ungulate erythroparvovirus-1”, “Presence of primate erythroparvovirus-1”). These viruses are not known to replicate in mosquitoes; therefore, HBV reads detected were most likely present in the blood meal. In support of this notion, all fifteen HBV-positive specimens, as well as the three specimens harboring primate erythroparvovirus-1, contained human DNA, while the sole specimen in which ungulate erythroparvovirus-1 was detected, ERS248730, also contained bovine DNA (Fig. S6; Supplementary Note 5: “Comparison of Premonition BMM and Kraken2 pipelines for resolving a complex specimen”).

To further examine viral sequences detected by the BMM pipeline, all reads from each specimen were assembled and examined by Integrator, an extension of the Premonition pipeline for virus and microbe detection that probes amino acid similarities (Supplementary Data 4: “Viral species detected-Pickaxe” & “Virus detection- Pickaxe vs BMM”)³⁸. Integrator confirmed the presence of HBV in twelve specimens, as well as the presence of ungulate erythroparvovirus-1 in a single specimen (Fig. 4; Supplementary Data 4: “Virus detection- Pickaxe vs BMM” & “Confirmed eukaryotic viruses”); plots were generated in Circos³⁹. Furthermore, the assembled contigs contained open reading frame structures consistent with HBV and ungulate parvovirus presence. Some specimens contained near-complete genomes of HBV and ungulate erythroparvovirus-1 (Fig. 5). Integrator also uncovered numerous previously unidentified bacteriophages (Supplementary Note 6: “Novel bacteriophage”; Supplementary Data 4: “Potential novel bacteriophage”). In addition, Integrator found sequences with distant similarities to known mosquito viruses such as Anopheles annulipes orbivirus and Wuhan insect virus 23 (Supplementary Note 7: “Viruses with RNA genomes”; Supplementary Data 4: “Viral species detected-Pickaxe”). However, these are RNA genome viruses and may be present as integrated partial viral genomes.

**Fig. 4: Human versus Hepatitis B virus reads.**

**Fig. 5: Hepatitis B virus and ungulate erythroparvovirus-1 coverage maps.**

Bacteria

We analyzed bacterial taxa present in the Ag1000G Phase 1 and 2 data sets. The BMM analysis assigned approximately 0.6% of all sequence reads to bacteria (Table 1). Reads associated with bacteria were present in all 1142 specimens. However, the number of bacterial reads per specimen varied widely (Fig. 6a). Bacterial sequences may originate from microorganisms associated with the living mosquito specimens, from microorganisms that grow postmortem on a preserved specimen, or from contamination during nucleic acid preparation and sequencing. Therefore, we also examined the data following the removal of taxa commonly associated with contamination⁴⁰. Furthermore, we only analyzed specimens with fewer than one million total bacterial reads and families with at least five thousand reads. This arbitrary cutoff was selected based on (1) the point at which the number of bacterial reads flattened out and (2) examination of low-read specimens, which revealed suspected contaminants. This reduced the number of specimens containing bacterial sequences to 478 of 1142. These reads were distributed among 59 bacterial families (Fig. 6b). The presence of bacteria in Ag1000G mosquitoes was confirmed by analysis with Integrator. Bacterial contigs were only found in specimens that BMM identified as harboring bacterial reads. The bacterial phyla and genera identified by Integrator were similar to those detected by BMM (Supplementary Data 4: “All viral species detected”). As proof of concept, we examined two bacterial species, Elizabethkingia anophelis and Thorsellia anophelis, in detail, since both have been associated with the Anopheles microbiome^41,42. E. anophelis was detected in 35 specimens, and Thorsellia anophelis in 42 of the 1142 specimens (Fig. 6c, d). Greater than 95% coverage of these bacterial genomes was achieved in a subset of the specimens. These bacteria accounted for most of the bacterial species detected in some specimens, with no specimens having signatures of both bacterial taxa.
In addition to bacteria, the BMM/Integrator analysis also identified bacteriophage contig sequences. Most bacteriophages found in nature are unidentified species distantly related to sequences in databases; therefore, most of these unique species will be missed by BMM because of alignment requirements. This is consistent with the results of Integrator analysis, which identified multiple contigs encoding bacteriophage-related proteins. Our approach does not distinguish bacteriophage sequences present because of an ongoing infection from those that are integrated in bacterial genomes. In total, bacteriophage-related contigs were detected in 27 specimens, all of which also contained a high number of bacterial reads (Fig. 6a). In three of the 27 specimens, Integrator detected and assembled bacteriophage contigs despite not having any BMM-assigned bacteriophage reads.

Conclusions

As projects like Ag1000G continue to expand the volume of genomic data available for the An. gambiae species complex across its range of distribution, we expect our mixture model to become increasingly accurate at resolving species-level assignments. Our Bayesian mixture model assigned vertebrate reads to humans as well as several domesticated animals, including goat, cow, dog, and donkey hosts. In addition, evidence of mixed blood meals derived from two host species was detected in several specimens. We were able to ascertain reads assigned to P. falciparum in several vector specimens. The relatively low coverage of the 23 Mb Plasmodium parasite genome demonstrated the challenge of detecting a small amount of parasite DNA against a large background of host genome and mixed DNA templates. However, to the best of our knowledge, requiring detection of both the Plasmodium core and apicoplast genomes is an original method for validating parasite presence. Yet, because whole mosquitoes were typically used for DNA extractions and sequencing, it is not possible to determine whether mosquitoes were infectious.

We were unable to ascertain whether the widespread presence of reads assigned to Homo sapiens in nearly all specimens was a consequence of human feeding, of contamination during field collection, sorting and identification, laboratory manipulation, and nucleic acid extraction, or of residual contamination of NGS flow cells between sequencing runs. Specimens originated from many different collectors and were handled and extracted using multiple approaches. Some specimens were stored in ethanol, while others were desiccated; some were extracted soon after collection, and others after extended storage. Thus, bacteria may be present as part of the mosquito microbiome, phoretic on the external surface of the insects, as contaminants introduced during collection/specimen processing, or even as a result of microbial growth during specimen storage. We therefore term these “mosquito-associated bacteria”. The enormous number of bacterial reads present in some specimens suggests that bacteria were actively growing in some specimens.

The microbiome of individual mosquitoes in this data set is relatively simple. Since bacteriophage cannot grow in insect cells, bacteriophage sequences should only be present in specimens containing bacteria. However, we cannot distinguish bacteriophage infection from integrated phage genomes. All the bacteriophages detected were previously undescribed. The high degree of sensitivity of the NGS method underscores the need to preserve specimen integrity and standardize approaches from collection through analysis to accurately determine sequence identity and the nature of biological associations. Additionally, the results demonstrate the importance of targeting collections and metadata to address specific questions. We continue to investigate unusual genomic assignments for systematic contamination of reference databases and are developing disciplined methods to address reference contamination. This study shows that metagenomic analysis of mosquitoes provides a robust strategy for detecting and monitoring the host species from which mosquitoes obtain a blood meal, as well as protozoa, bacteria, and viruses that are circulating among vertebrate hosts.

Methods

Statistics and reproducibility

Briefly, the Microsoft Premonition BMM pipeline takes as input: (1) a sample $\boldsymbol{X}=\{\boldsymbol{x}_{1},\ldots ,\boldsymbol{x}_{m}\}$, which is a collection of sequencer reads, and (2) a reference genome database $\mathit{Ref}=\{g_{1},\ldots ,g_{n}\}$, which is a set of genomes. It computes the probability distribution $p(\boldsymbol{x},g|\boldsymbol{X},\mathit{Ref})$, which is the probability that read $\boldsymbol{x}$ in the sample $\boldsymbol{X}$ came from genome $g$ in the reference database $\mathit{Ref}$. This distribution is computed without assumptions on the species that may be present in the sample (and so every read is aligned to every genome in the reference). This is well-suited for environmental specimens that have few biological constraints on the species that might be in a specimen and that may contain genomic fragments from many species with low genome coverage. The uncertainty in the resulting probability distribution can indicate: (1) uncertainty of species due to sequence similarity, (2) the presence of undetected species where reads are unlikely to have come from any genomes in the available references, and (3) genome coverage patterns that are consistent with non-biological artifacts, as well as other phenomena. The resulting probability distribution is a Bayesian mixture model (BMM), which is described in detail in the Methods under ‘Generative Model’ and shown figuratively in Fig. S1. In the context of the Ag1000G dataset, this allows the pipeline to suppress the probability that low complexity or highly conserved anopheline reads might have come from other anopheline species based on the overwhelming evidence for An. gambiae s.s. or An. coluzzii coming from other unambiguous reads in the sample. At the same time, some small probability can be assigned to these less likely interpretations, so they are available for consideration. Various quantities can be derived from this statistical model.
For simplicity, we will consider only a few of these quantities here:

For every read $\boldsymbol{x}$, the genome posterior probability $p(g|\boldsymbol{x},\boldsymbol{X},\mathit{Ref})$ gives the probability that genome $g$ contributed read $\boldsymbol{x}$ to the sample. These probabilities sum to one for each read, so they can also be treated as fractionally mapping reads to genomes. For instance, a read $\boldsymbol{x}$ may have $p(\text{An. gambiae}|\boldsymbol{x},\boldsymbol{X},\mathit{Ref})=0.7$ and $p(\text{An. coluzzii}|\boldsymbol{x},\boldsymbol{X},\mathit{Ref})=0.3$, with all other genomes having zero probability, indicating $\boldsymbol{x}$ was more likely to have come from the An. gambiae genome than the An. coluzzii genome under modeling assumptions.
The expected number of genome reads is the expected number of reads a genome $g$ contributed to a sample and is the sum of all fractional reads mapped to that genome, i.e., $E_{\mathit{reads}}\left[g\right]=\sum_{\boldsymbol{x}\in \boldsymbol{X}}p\left(g|\boldsymbol{x},\boldsymbol{X},\mathit{Ref}\right)$. This can be extended to the expected number of reads contributed by an arbitrary taxon. Given a taxon $\mathit{tx}$, let $g(\mathit{tx})=\left\{g_{\mathit{tx}_{1}},\ldots ,g_{\mathit{tx}_{k}}\right\}\subseteq \mathit{Ref}$ be the set of all genomes in the reference database that belong to that taxon. Then, the expected number of reads for that taxon is $E_{\mathit{reads}}\left[\mathit{tx}\right]=\sum_{g\in g(\mathit{tx})}E_{\mathit{reads}}\left[g\right]$. For example, $E_{\mathit{reads}}\left[\mathit{diptera}\right]-E_{\mathit{reads}}\left[\mathit{Anopheles}\right]$ gives the expected number of reads contributed from non-anopheline dipterans.
We extend the standard definitions of genome percent identity (i.e., the average percent identity of all reads assigned to a genome) and of genome coverage (i.e., the total fraction of genome locations for which at least one read is assigned to that location) to the BMM setting, where reads are not assigned with 100% probability.
The $\epsilon$-genome coverage and $\epsilon$-genome percent identity of a genome $g$ are the genome coverage and genome percent nucleotide identity calculated using the set of all reads with genome posterior probability greater than or equal to $\epsilon$, i.e., $\left\{\boldsymbol{x}\in \boldsymbol{X} \mid p\left(g|\boldsymbol{x},\boldsymbol{X},\mathit{Ref}\right)\ge \epsilon \right\}$. Choosing a smaller value for $\epsilon$ yields a higher coverage because more reads are considered, but a lower value for percent identity because more divergent reads are included. Posterior credible intervals could also be defined. For this presentation, “x% of reads were placed in taxon” means the percentage of expected reads contributed by that taxon to the total number of reads in that sample, i.e., $100\times \frac{E_{\mathit{reads}}\left[\mathit{tx}\right]}{|\boldsymbol{X}|}$.
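
The quantities defined above can be computed directly once per-read posteriors are in hand. The following Python sketch assumes each read's posterior is a dictionary from genome name to probability that sums to one. The function names, the data layout, and the toy values are illustrative assumptions, not the pipeline's data structures.

```python
from typing import Dict, Iterable, List

Posterior = Dict[str, float]  # genome name -> p(g | x, X, Ref) for one read


def expected_reads(posteriors: List[Posterior], genomes: Iterable[str]) -> float:
    """E_reads[tx]: sum of fractional read mappings over the genomes in a taxon."""
    taxon = set(genomes)
    return sum(p for post in posteriors for g, p in post.items() if g in taxon)


def reads_at_epsilon(posteriors: List[Posterior], genome: str, eps: float) -> List[int]:
    """Indices of reads whose posterior for `genome` is at least eps."""
    return [i for i, post in enumerate(posteriors) if post.get(genome, 0.0) >= eps]


def percent_of_sample(posteriors: List[Posterior], genomes: Iterable[str]) -> float:
    """100 * E_reads[tx] / |X|."""
    return 100.0 * expected_reads(posteriors, genomes) / len(posteriors)


# Toy sample of three reads (illustrative values only).
toy_sample = [
    {"An. gambiae": 0.7, "An. coluzzii": 0.3},
    {"An. gambiae": 1.0},
    {"An. coluzzii": 0.6, "An. arabiensis": 0.4},
]
e_gambiae = expected_reads(toy_sample, ["An. gambiae"])
pct_pair = percent_of_sample(toy_sample, ["An. gambiae", "An. coluzzii"])
confident = reads_at_epsilon(toy_sample, "An. gambiae", eps=0.9)
```

The reads selected by `reads_at_epsilon` are the set over which $\epsilon$-genome coverage and $\epsilon$-genome percent identity would then be calculated; lowering `eps` admits more reads, as described in the text.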

Generative model

Genomes and mixture probabilities

Let $\boldsymbol{g}=(g_{1},\ldots ,g_{m})$ be a vector of reference genomes for $m$ distinct species, and $\boldsymbol{w}=(w_{1},\ldots ,w_{m})$ be a vector of probabilities. Each $w_{i}$ is the probability of $g_{i}$ contributing a read to a collection of reads $\boldsymbol{X}$ (i.e., a multiset of reads). The mixture probabilities satisfy $\sum_{i=1}^{m}w_{i}=1$. More precisely, each genome has length $\lambda_{i}$ and is a function from nucleotide positions to nucleotide characters, i.e., $g_{i}:\{1,\ldots ,\lambda_{i}\}\to \{\mathrm{A},\mathrm{T},\mathrm{C},\mathrm{G}\}$ where $g_{i}(j)$ is the nucleotide at position $j$. We write $\Omega$ for the alphabet of nucleotide characters.

Partitions and partition probabilities

To model (un-)even sampling of reads across genomes, each genome is partitioned into $\boldsymbol{p}_{\boldsymbol{i}}=(p_{i,1},\ldots ,p_{i,n})$ parts, where each part $p_{i,j}\subseteq \{1,\ldots ,\lambda_{i}\}$ is a non-empty disjoint interval of nucleotide positions. The union of all parts equals $\{1,\ldots ,\lambda_{i}\}$. Let $\boldsymbol{\pi}_{\boldsymbol{i}}=(\pi_{i,1},\ldots ,\pi_{i,n})$ be a vector of probabilities, where $\pi_{i,j}$ is the probability that part $p_{i,j}$ contributes a read to a collection, given that this read was contributed by $g_{i}$. Partition probabilities satisfy $\forall i.\ \sum_{j=1}^{|\boldsymbol{p}_{\boldsymbol{i}}|}\pi_{i,j}=1$.

Reads and read probabilities

A read ${{{{{\boldsymbol{x}}}}}}$ with length $l$ drawn from genome ${g}_{i}$ and partition ${p}_{i,j}$ starting at position $k\in {p}_{i,j}$, is a vector ${{{{{\boldsymbol{x}}}}}}=({c}_{1},\ldots ,{c}_{l})$ where each ${c}_{q}\in \Omega$ is the ${q}^{{th}}$ character of read ${{{{{\boldsymbol{x}}}}}}$. The likelihood of each ${c}_{q}$ is simply:

$$\Pr \left({c}_{q}=y\left|\begin{array}{c}{genome}={g}_{i},\\ {part}={p}_{i,j}\\ {pos}=k,\\ {mismatch}={\gamma }_{i}^{{miss}}\end{array}\right.\right)= \left\{\begin{array}{cc}1-{\gamma }_{i}^{{miss}},&y={g}_{i}(k+q-1)\\ \frac{{\gamma }_{i}^{{miss}}}{\left|\Omega \right|-1},&{otherwise}\end{array}\right.$$

where ${\gamma }_{i}^{{{{{\rm{miss}}}}}}$ is the probability that any sampled nucleotide ${c}_{q}$ disagrees with the reference. We do not treat ${\gamma }_{i}^{{miss}}$ as a raw sequencer error rate, but as an overall parameter that encompasses many possible sources of reference mismatches. For brevity, we will not formalize insertions and deletions here.
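The per-character likelihood above can be written out as a short Python sketch. The function names `char_prob` and `read_log_likelihood` are illustrative assumptions; the sketch ignores insertions and deletions, as the text does, and uses a 1-based start position $k$ to match the notation.

```python
import math

ALPHABET = "ATCG"  # the nucleotide alphabet, written Omega in the text


def char_prob(y, ref_base, gamma_miss):
    """Probability of observing character y when the reference base is ref_base."""
    if y == ref_base:
        return 1.0 - gamma_miss
    # A mismatch is spread evenly over the remaining |Omega| - 1 characters.
    return gamma_miss / (len(ALPHABET) - 1)


def read_log_likelihood(read, genome, k, gamma_miss):
    """Log-likelihood of a read starting at 1-based position k of genome."""
    total = 0.0
    for q, c in enumerate(read):          # q is 0-based here
        ref_base = genome[k - 1 + q]      # position k + q in 1-based terms
        total += math.log(char_prob(c, ref_base, gamma_miss))
    return total
```

Working in log space avoids underflow when read lengths reach hundreds of bases.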

Algorithm: sampling the mixture

Given a specification of the above parameters, the mixture is sampled as follows.

1. Draw a genome according to mixture weights: $z_g \sim \mathrm{Mult}(1;\boldsymbol{w})$.
2. Draw a partition according to partition probabilities of that genome: $z_p \mid z_g \sim \mathrm{Mult}(1;\boldsymbol{\pi}_{z_g})$.
3. Uniformly draw a position $k$ from the chosen partition: $z_k \mid z_g,z_p \sim \mathrm{Uniform}(z_p)$.
4. Draw a read $\boldsymbol{x}$ of length $l$, one element at a time: $c_q \mid z_g,z_p,z_k \sim \mathrm{Mult}(1;p_{\mathrm{A}},p_{\mathrm{T}},p_{\mathrm{C}},p_{\mathrm{G}})$.

In the above, $\mathrm{Mult}(n;\boldsymbol{q})$ denotes the multinomial distribution with $k$ classes, $n$ trials, and class probabilities $\boldsymbol{q}=(q_1,\ldots ,q_k)$. $\mathrm{Mult}(1;\boldsymbol{q})$ is the categorical distribution, which we treat as drawing a single class. Finally, in step 4, $p_{\mathrm{A}}=\Pr (c_q=\mathrm{A}\mid z_g,z_p,z_k,\gamma_g^{\mathrm{miss}})$, $p_{\mathrm{T}}=\Pr (c_q=\mathrm{T}\mid z_g,z_p,z_k,\gamma_g^{\mathrm{miss}})$, and so on for the remaining nucleotides.
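The four sampling steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's implementation: genomes are plain strings, partitions are 1-based inclusive `(start, end)` intervals, reads that run past the end of a genome are simply truncated, and the name `sample_read` is hypothetical.

```python
import random


def sample_read(genomes, w, parts, pi, gamma_miss, length, rng):
    """Draw one read; returns (genome index, part index, 1-based start, read)."""
    # Step 1: genome according to the mixture weights w.
    i = rng.choices(range(len(genomes)), weights=w)[0]
    # Step 2: partition according to that genome's partition probabilities.
    j = rng.choices(range(len(parts[i])), weights=pi[i])[0]
    # Step 3: uniform start position inside the chosen partition.
    start, end = parts[i][j]
    k = rng.randint(start, end)
    # Step 4: one character at a time, mismatching with probability gamma_miss[i].
    chars = []
    for q in range(length):
        pos = k + q                        # 1-based genome position
        if pos > len(genomes[i]):
            break                          # assumption: truncate at the genome end
        ref = genomes[i][pos - 1]
        if rng.random() < gamma_miss[i]:
            chars.append(rng.choice([b for b in "ATCG" if b != ref]))
        else:
            chars.append(ref)
    return i, j, k, "".join(chars)
```

Drawing the reference base with probability $1-\gamma^{\mathrm{miss}}$ and otherwise a uniformly chosen different base is equivalent to the categorical draw in step 4.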

Model fitting

Given an observed vector of reads $\boldsymbol{X}=(\boldsymbol{x}_1,\ldots ,\boldsymbol{x}_n)$, the goal of the BMM algorithm is to construct a maximum likelihood estimate (MLE) of the genome mixture probabilities $\boldsymbol{w}$, i.e., the value of $\boldsymbol{w}$ that maximizes $\Pr \left(\boldsymbol{X}\mid \boldsymbol{w};\boldsymbol{\Phi}\right)$, where $\boldsymbol{\Phi}$ denotes the remaining model parameters, which are treated as fixed values not subject to estimation (e.g., partition probabilities and mismatch probabilities).

Expectation–maximization (EM) algorithm

Computing the MLE is non-trivial because the observed data are incomplete: they do not include the source genomes, partitions, and starting positions of the observed reads (i.e., the variables $z_g,z_p,z_k$ in the generative model are not observed). We use the EM algorithm, which alternates between estimating the unobserved variables given the current estimate of $\boldsymbol{w}$ and adjusting $\boldsymbol{w}$ to better fit the observed data, until convergence. The details of this algorithm, and its formal guarantees, are outside the scope of this description. We provide the key ingredients required by the EM algorithm.
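For orientation, the alternation described above can be sketched as a textbook EM loop for mixture weights. This is a generic illustration, not the BMM implementation: it assumes a precomputed matrix `lik[k][i]` holding the likelihood of read $k$ under genome $i$ (already summed over positions), and the name `em_mixture_weights` is hypothetical.

```python
def em_mixture_weights(lik, n_iter=100, tol=1e-9):
    """Estimate mixture weights w from a read-by-genome likelihood matrix."""
    m = len(lik[0])
    w = [1.0 / m] * m                      # start from uniform weights
    for _ in range(n_iter):
        # E-step: expected number of reads contributed by each genome.
        counts = [0.0] * m
        for row in lik:
            joint = [w[i] * row[i] for i in range(m)]
            total = sum(joint)
            if total == 0.0:
                continue                   # read explained by no genome; skip it
            for i in range(m):
                counts[i] += joint[i] / total
        assigned = sum(counts)
        if assigned == 0.0:
            return w                       # nothing to update from
        # M-step: weights proportional to the expected read counts.
        new_w = [c / assigned for c in counts]
        converged = max(abs(a - b) for a, b in zip(w, new_w)) < tol
        w = new_w
        if converged:
            break
    return w
```

With reads that each match exactly one genome, the loop converges in one step to the observed read fractions.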

Joint distribution of observed and latent variables

To estimate these parameters, EM requires a joint distribution of the observed data (i.e., the reads $\boldsymbol{X}$) and the unobserved data (i.e., latent variables $z_g,z_p,z_k$) given model parameters. Let $\boldsymbol{Z}$ be a zero-one array with dimensions $\left|\boldsymbol{g}\right|\times (\max_i \lambda_i)\times |\boldsymbol{X}|$. The entry $Z_{i,j,k}=1$ if genome $g_i$ at position $j$ contributed read $\boldsymbol{x}_k$; otherwise it is zero. The joint distribution derives from the generative model as follows:

$$\Pr \left(\boldsymbol{X},\boldsymbol{Z}\mid \boldsymbol{w};\boldsymbol{\Phi}\right)=\prod_{i=1}^{|\boldsymbol{g}|}\prod_{j=1}^{\max_i \lambda_i}\prod_{k=1}^{|\boldsymbol{X}|}\left[w_i\cdot \frac{\pi_{i,\mathit{part}(i,j)}}{\left|p_{i,\mathit{part}(i,j)}\right|}\cdot \Pr \left(\boldsymbol{x}_k\mid g_i,p_{i,\mathit{part}(i,j)},j,\gamma_i^{\mathrm{miss}}\right)\right]^{Z_{i,j,k}}$$

where ${part}\left(i,j\right)$ returns the index of the partition containing position $j$ in genome ${g}_{i}$. From the generative model above, the probability of ${{{{{{\boldsymbol{x}}}}}}}_{k}$ having been drawn from ${g}_{i}$ and location $j$ is a function of its edit distance $d$ and length $l$ as follows: $\Pr \left(d,l\right)={\left({\gamma }_{i}^{{{{{\rm{miss}}}}}}\right)}^{d}\cdot {\left(1-{\gamma }_{i}^{{{{{\rm{miss}}}}}}\right)}^{l-d}$ .
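A small worked example shows how strongly this expression separates close from distant placements. The helper below is a hypothetical sketch that evaluates $\Pr(d,l)$ in log space exactly as written above.

```python
import math


def log_read_prob(d, l, gamma_miss):
    """log of gamma^d * (1 - gamma)^(l - d) for edit distance d and read length l."""
    return d * math.log(gamma_miss) + (l - d) * math.log1p(-gamma_miss)


# Each extra mismatch multiplies the probability by gamma / (1 - gamma).
# With gamma = 0.1 that factor is 1/9, so five mismatches cost a factor of 9**5.
log_ratio = log_read_prob(0, 100, 0.1) - log_read_prob(5, 100, 0.1)
```

Here `log_ratio` equals $5\ln 9$, independent of the read length.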

All-by-all read probabilities computation

A non-trivial step performed by the BMM is the all-by-all computation that evaluates the probability of every read ${{{{{{\boldsymbol{x}}}}}}}_{k}$ coming from every location $j$ in every reference genome ${g}_{i}$. This probability matrix can be quite large and is computed once at the beginning of the algorithm. The precise number of locations stored in this matrix depends on the choice of aligner and its settings, which are parameters of BMM.

Unbiased sequencing assumptions

In an unbiased metagenomics setting, we expect reads to be represented from across the genomes of present taxa. This corresponds to setting a genome’s partitions to be of equal weights and sizes. This simplifies the joint distribution to:

$$\Pr \left(\boldsymbol{X},\boldsymbol{Z}\mid \boldsymbol{w};\boldsymbol{\Phi}\right)=\prod_{i=1}^{|\boldsymbol{g}|}\prod_{j=1}^{\max_i \lambda_i}\prod_{k=1}^{|\boldsymbol{X}|}\left[w_i\cdot \frac{1}{\lambda_i}\cdot \Pr \left(\boldsymbol{x}_k\mid g_i,p_{i,\mathit{part}(i,j)},j,\gamma_i^{\mathrm{miss}}\right)\right]^{Z_{i,j,k}}$$

An important impact of this setting is that shorter genomes are preferred over longer genomes, even if a subset of reads align equally well to both. For example, this setting allows BMM to suppress (microbial or viral) contaminants in eukaryotic reference genomes, opting instead to amplify the probabilities that microbial or viral species (with shorter genomes) were truly present in the mixture.
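The preference for shorter genomes follows directly from the $1/\lambda_i$ factor, as the following sketch illustrates. It assumes a single read that aligns at one position in each genome; `position_responsibilities` is a hypothetical name, and the genome lengths in the usage note are round numbers chosen for illustration.

```python
def position_responsibilities(w, lengths, read_probs):
    """Normalised w_i * (1 / lambda_i) * Pr(read | genome i) for one read."""
    scores = [wi / li * p for wi, li, p in zip(w, lengths, read_probs)]
    total = sum(scores)
    return [s / total for s in scores]
```

For example, with equal weights and equal read probabilities, a 1,000 bp genome competing with a 1,000,000 bp genome receives 1000/1001 of the responsibility for the read, i.e., about 99.9%.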

Practical corrections to unbiased sequencing assumptions

In practice, even this assumption does not fully handle interactions between contamination, partial genome references, and low complexity regions to name a few challenges. In this presentation, the solution is to heuristically define an effective genome size ${\lambda }_{i}^{{{{{\rm{effective}}}}}}={f}_{{{{{\rm{effective}}}}}}({g}_{i}{{{{{\boldsymbol{,}}}}}}{{{{{\boldsymbol{X}}}}}}{{{{{\boldsymbol{,}}}}}}{{{{{\boldsymbol{\Phi }}}}}})$ according to a function that can observe the reference genome, input data, and other model parameters. The function ${f}_{{{{{\rm{effective}}}}}}()$ can be designed using many approaches. We build ours using intuitive definitions and testing on many simulated and actual metagenomic datasets⁴³.

Partitioned read dispersion

Given the all-by-all read probabilities, the partitioned read dispersion for genome ${g}_{i}$ is the uniformity of read probabilities across its partitions. It has a value of 1 if read probabilities are equally dispersed across partitions, and a value of 0 if they are all concentrated into one partition. First, the read probability mass in a partition $j$ of genome ${g}_{i}$ is:

$$\mathit{mass}_{i,j}=\sum_{k\in p_{i,j}}\sum_{l=1}^{|\boldsymbol{X}|}\Pr \left(\boldsymbol{x}_l\mid g_i,p_{i,j},k,\gamma_i^{\mathrm{miss}}\right)$$

and the partitioned read dispersion is one minus the Gini coefficient of these masses:

$$\mathit{PRD}_i=1-\frac{\sum_{j=1}^{|\boldsymbol{p}_i|}\sum_{k=1}^{|\boldsymbol{p}_i|}\left|\mathit{mass}_{i,j}-\mathit{mass}_{i,k}\right|}{2\cdot \left|\boldsymbol{p}_i\right|^{2}\cdot \widehat{\mathit{mass}_i}}$$
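The PRD can be computed directly from the per-partition masses. The sketch below assumes that $\widehat{\mathit{mass}_i}$ denotes the mean mass over the partitions of genome $g_i$, as in the standard Gini coefficient; the function name is hypothetical.

```python
def partitioned_read_dispersion(masses):
    """One minus the Gini coefficient of the per-partition read probability masses."""
    n = len(masses)
    mean = sum(masses) / n                 # assumed meaning of the hatted mass
    if mean == 0.0:
        raise ValueError("PRD is undefined when every partition has zero mass")
    abs_diffs = sum(abs(a - b) for a in masses for b in masses)
    return 1.0 - abs_diffs / (2.0 * n * n * mean)
```

Equal masses give a PRD of 1. With this finite-sample form, a genome whose mass sits entirely in one of $n$ partitions gives $1/n$, which approaches 0 as the number of partitions grows.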

Through sweeps of functional forms on simulated and actual metagenomic data, we arrived at the correction for genome size below. The exponent ${p}_{{eff}}$ brings genomes of different sizes closer together, decreasing length penalties, whereas a low PRD increases the penalty by increasing the effective genome length:

$${\lambda }_{i}^{{effective}}=\frac{{\lambda }_{i}^{\frac{1}{{p}_{{eff}}}}}{{{PRD}}_{i}^{{p}_{{eff}}}}$$
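To give a sense of scale, the correction can be evaluated numerically. The sketch below uses the exponent value of 4 reported under the major settings; the function name is hypothetical.

```python
def effective_genome_length(length, prd, p_eff=4.0):
    """lambda^(1 / p_eff) divided by PRD^p_eff."""
    return length ** (1.0 / p_eff) / prd ** p_eff
```

With an exponent of 4 and evenly dispersed reads (PRD of 1), a genome of $10^{8}$ bp has an effective length of about 100 and a genome of $10^{4}$ bp has an effective length of about 10, so a 10,000-fold difference in length shrinks to 10-fold. Halving the PRD to 0.5 multiplies the effective length by 16.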

Major settings

Based on this discussion, the major settings for BMM were:

1. All reference genomes are initially equally likely (uniform priors).
2. Unbiased sequencing settings are assumed, i.e., all partitions were given equal weights. The effective genome lengths ${\lambda }_{i}^{{{{{\rm{effective}}}}}}$ were computed with the exponent ${p}_{{{{{\rm{eff}}}}}}=4$ based on sweeps of simulated and actual metagenomic data.
3. The per-read miss probability was set to ${\gamma }_{i}^{{{{{\rm{miss}}}}}}=0.1$ for all genomes based on sweeps of simulated and actual metagenomic data. Of consideration was ensuring more distant taxa could receive probability mass, given overall sparsity of genome references.

Metagenomic analyses

The Microsoft Premonition BMM pipeline was applied to DNA sequencing datasets of 1142 mosquitoes from the Ag1000G Phase 1 and 2 datasets^17,18. All Phase 1 and Phase 2 reads were processed as follows. First, all reads were deduplicated by searching for exact duplicates and exact reverse-complement duplicates, and the duplicate count for each read was recorded. Adapters were trimmed with Cutadapt v1.13⁴⁴. Reads of low quality or low complexity were removed with PrinSeq v0.20.4⁴⁵. To reduce computational complexity, reads that aligned to mosquito references with an edit distance of five or less were subsampled at a rate of 1.0%. The references for subsampling consisted of An. gambiae (g4 assembly; GCA000150785.1), An. coluzzii (m5 assembly; GCA000150765.1) and An. gambiae str. PEST (AgamP3 assembly; GCF000005575.2)^46,47. All reads were aligned with SNAP-aligner, with an edit distance limit of up to 20, against a selection of RefSeq and GenBank assemblies (615,026 total accessions retrieved June 2018)³². The selection aimed to have at least one high-quality assembly for every species taxonomic identifier. All viral references retrieved from NCBI’s GenBank in June 2018 were also included. A metagenomic assignment of reads to accessions was computed based on a BMM implemented as an EM algorithm and extended with a heuristic that prefers accessions with uniform coverage. An accession was assigned to each read, producing a probabilistic BMM call that reveals the most likely taxonomic assignment.
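The deduplication step can be sketched as follows. This is one possible implementation, not the pipeline's: it treats a read and its reverse complement as the same sequence by keeping the lexicographically smaller of the two as a canonical key, and the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def deduplicate(reads):
    """Map each canonical read to the number of exact or reverse-complement copies."""
    counts = Counter()
    for read in reads:
        canonical = min(read, reverse_complement(read))
        counts[canonical] += 1
    return counts
```

The returned counter holds one entry per distinct read together with its duplicate count, which can be carried forward to later steps.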

Integrator pipeline

For all Ag1000G specimens, we applied these steps: (1) All reads classified by BMM as bacterial or unaligned were assembled with SPAdes v3.14⁴⁸. Contigs with a minimum length of 2000 bp were analyzed, producing probable bacterial contigs. (2) Probable bacterial contigs were aligned with the DIAMOND v0.9.24.125 aligner against the RefSeq non-redundant (nr) protein database⁴⁹. All DIAMOND matches for a given contig were aggregated at the desired taxonomic level. (3) The taxon with the highest integral of percent identity over contig length was assigned to each contig, resulting in an Integrator assignment for a probable bacterial contig. Steps one through three were repeated for the set of BMM viral and unaligned reads, producing an Integrator assignment for each probable viral or bacteriophage contig.
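Step (3) can be read in more than one way. The sketch below shows one possible interpretation, stated as an assumption: for each taxon, take the best percent identity of any of its matches at every contig position and sum these values along the contig. The hit tuple layout and the name `integrator_assign` are hypothetical and do not reflect DIAMOND's output format.

```python
def integrator_assign(contig_length, hits):
    """Pick the taxon with the largest summed per-position percent identity.

    hits: iterable of (taxon, start, end, percent_identity) with 0-based,
    half-open contig coordinates. Returns (best taxon or None, scores).
    """
    profiles = {}
    for taxon, start, end, pident in hits:
        profile = profiles.setdefault(taxon, [0.0] * contig_length)
        for pos in range(max(0, start), min(end, contig_length)):
            profile[pos] = max(profile[pos], pident)   # best match per position
    scores = {taxon: sum(profile) for taxon, profile in profiles.items()}
    if not scores:
        return None, scores
    return max(scores, key=scores.get), scores
```

Under this reading, a taxon with moderately similar matches along most of a contig can outrank a taxon with a single near-perfect but short match.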

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The datasets analyzed in the current study are available via the Anopheles gambiae 1000 Genomes Consortium website. Ag1000G phase 1 AR3.1 data release: https://www.malariagen.net/data/ag1000gphase1-ar3.1. Ag1000G phase 2 AR1 data release: http://www.malariagen.net/data/ag1000gphase2-ar1. Source data for figures and tables are included in this published article and its Supplementary Data files 1–4. All other data are available from the corresponding author on reasonable request.

Code availability

The Microsoft Premonition pipeline is a proprietary cloud service, and APIs to access this service are made available to select partners through the Microsoft Premonition Early Access Program (terms and conditions apply, see http://microsoft.com/premonition for details). If needed to assist reviewers, authors will provide the computed mixture model statistics, aggregated at the specimen level as a data artifact upon request to the corresponding author. Access to read-level data is managed by the Ag1000 Genome Consortium (terms and conditions apply) and should be requested directly from the Ag1000 Genome Consortium.

References

1. Miles, A. et al. Genetic diversity of the African malaria vector Anopheles gambiae. Nature 552, 96–100 (2017).
2. The Anopheles gambiae 1000 Genomes Consortium. Genome variation and population structure among 1142 mosquitoes of the African malaria vector species Anopheles gambiae and Anopheles coluzzii. Genome Res. 30, 1533–1546 (2020).
3. Gillies, M. T. & De Meillon, B. The Anophelinae of Africa South of the Sahara (Ethiopian Zoogeographical Region). 210–220 (The South African Institute for Medical Research, 1968).
4. Wang, Y., Gilbreath, T. M., Kukutla, P., Yan, G. & Xu, J. Dynamic gut microbiome across life history of the malaria mosquito Anopheles gambiae in Kenya. PLoS ONE 6, 24767 (2011).
5. Drummond, C. et al. Stability and detection of nucleic acid from viruses and hosts in controlled mosquito blood feeds. PLoS ONE 15, 0231061 (2020).
6. Borland, E. M. & Kading, R. C. Modernizing the toolkit for arthropod bloodmeal identification. Insects 12, 12010037 (2021).
7. Brumfield, K. D., Huq, A., Colwell, R. R., Olds, J. L. & Leddy, M. B. Microbial resolution of whole genome shotgun and 16S amplicon metagenomic sequencing using publicly available NEON data. PLoS ONE 15, 0228899 (2020).
8. Jovel, J. et al. Characterization of the gut microbiome using 16S or shotgun metagenomics. Front. Microbiol. 7, 00459 (2016).
9. Brinkmann, A., Nitsche, A. & Kohl, C. Viral metagenomics on blood-feeding arthropods as a tool for human disease surveillance. Int. J. Mol. Sci. 17, 1743 (2016).
10. Fauver, J. R. et al. The use of xenosurveillance to detect human bacteria, parasites, and viruses in mosquito bloodmeals. Am. J. Trop. Med. Hyg. 97, 324–329 (2017).
11. Fauver, J. R. et al. Xenosurveillance reflects traditional sampling techniques for the identification of human pathogens: a comparative study in West Africa. PLoS Negl. Trop. Dis. 12, 0006348 (2018).
12. Batson, J. et al. Single mosquito metatranscriptomics identifies vectors, emerging pathogens and reservoirs in one assay. eLife 10, e68353 (2021).
13. Tringe, S. G. & Rubin, E. M. Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. 6, 805–814 (2005).
14. Garlapati, D., Charankumar, B., Ramu, K., Perumal, M. & Murthy, M. V. R. A review on the applications and recent advances in environmental DNA (eDNA) metagenomics. Rev. Environ. Sci. Biotechnol. 18, 389–411 (2019).
15. Déjosez, M. et al. Bat pluripotent stem cells reveal unusual entanglement between host and viruses. Cell 186, 957–974 (2023).
16. Oguzie, J. et al. Metagenomic surveillance uncovers diverse and novel viral taxa in febrile patients from Nigeria. Nat. Commun. 14, 4693 (2023).
17. The Anopheles gambiae 1000 Genomes Consortium. Ag1000G phase 1 AR3.1 data release. MalariaGEN. https://www.malariagen.net/data/ag1000gphase1-ar3.1 (2016).
18. The Anopheles gambiae 1000 Genomes Consortium. Ag1000G phase 2 AR1 data release. MalariaGEN. http://www.malariagen.net/data/ag1000gphase2-ar1 (2017).
19. Reppell, M. & Novembre, J. Using pseudoalignment and base quality to accurately quantify microbial community composition. PLoS Comput Biol. 14, 1006096 (2018).
20. Blaxter, M. et al. Sequence locally, think globally: the Darwin tree of life project. Proc. Natl Acad. Sci. USA 119, e2115642118 (2022).
21. Giraldo-Calderón, G. I. et al. VectorBase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases. Nucleic Acids Res. 43, D707–D713 (2014).
22. Poelchau, M. F., Chen, M. J. M., Lin, Y. Y., Childers, C. P. Navigating the i5k Workspace@NAL: A Resource for Arthropod Genomes. In: Eukaryotic Genomic Databases. Methods in Molecular Biology, (eds. Kollmar, M.) 1757. (Humana Press, New York, NY, 2018).
23. Gifford‐Gonzalez, D. & Hanotte, O. Domesticating animals in Africa: implications of genetic and archaeological findings. J. World Prehist. 24, 1–23 (2011).
24. Dong, Y. et al. Reference genome of wild goat (Capra aegagrus) and sequencing of goat breeds provide insight into genic basis of goat domestication. BMC Genomics 16, 1–11 (2015).
25. Renaud, G. et al. Improved de novo genomic assembly for the domestic donkey. Sci. Adv. 4, aaq0392 (2018).
26. Schneider, V. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
27. Constantini, C., Sagnon, N. F., della Torre, A. & Coluzzi, M. Mosquito behavioural aspects of vector-human interactions in the Anopheles gambiae complex. Parassitologia 41, 209–217 (1999).
28. Takken, W. & Verhulst, N. O. Host preferences of blood-feeding mosquitoes. Annu. Rev. Entomol. 58, 433–453 (2013).
29. White, G. B. Anopheles gambiae complex and disease transmission in Africa. Trans. R. Soc. Trop. Med. Hyg. 68, 278–298 (1974).
30. Gillies, M. T. & Coetzee, M. A Supplement to the Anophelinae of Africa South of the Sahara. 96–119 (The South African Institute for Medical Research, 1987).
31. Besansky, N. J., Hill, C. A. & Costantini, C. No accounting for taste: host preference in malaria vectors. Trends Parasitol. 20, 249–251 (2004).
32. Zaharia, M. et al. Faster and More Accurate Sequence Alignment with SNAP. Preprint at arXiv https://arxiv.org/abs/1111.5572 (2011).
33. Gardner, M. J. et al. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419, 498–511 (2002).
34. Miles, A. et al. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res. 26, 1288–1299 (2016).
35. Matsuzaki, M., Kikuchi, T., Kita, K., Kojima, S. & Kuroiwa, T. Large amounts of apicoplast nucleoid DNA and its segregation in Toxoplasma gondii. Protoplasma 218, 180–191 (2001).
36. Waller, R. F. & McFadden, G. I. The apicoplast: a review of the derived plastid of apicomplexan parasites. Curr. Issues Mol. Biol. 7, 57–80 (2005).
37. Cantalupo, P. G. & Pipas, J. M. Detecting viral sequences in NGS data. Curr. Opin. Virol. 39, 41–48 (2019).
38. Cantalupo, P. G., Katz, J. P. & Pipas, J. M. Viral sequences in human cancer. Virology 513, 208–216 (2018).
39. Krzywinski, M. et al. Circos: An information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
40. Zinter, M. S., Mayday, M. Y., Ryckman, K. K., Jelliffe‐Pawlowski, L. L. & DeRisi, J. L. Towards precision quantification of contamination in metagenomic sequencing experiments. Microbiome 7, 1–5 (2019).
41. Chen, S., Bagdasarian, M. & Walker, E. D. Elizabethkingia anophelis: molecular manipulation and interactions with mosquito hosts. Appl. Environ. Microbiol. 81, 2233–2243 (2015).
42. Kämpfer, P. et al. Proposal of Thorsellia kenyensis sp. nov. and Thorsellia kandunguensis sp. nov., isolated from larvae of Anopheles arabiensis, as members of the family Thorselliaceae fam. nov. Int. J. Syst. Evol. Microbiol. 65, 444–451 (2015).
43. Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
44. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12 (2011).
45. Schmieder, R. & Edwards, R. A. Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863–864 (2011).
46. Lawniczak, M. et al. Widespread divergence between incipient Anopheles gambiae species revealed by whole genome sequences. Science 330, 512–514 (2010).
47. Sharakhova, M. V. et al. Update of the Anopheles gambiae PEST genome assembly. Genome Biol. 8, R5 (2007).
48. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
49. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2014).

Acknowledgements

The authors gratefully acknowledge the Ag1000G Consortium for making the Phase 1 and 2 datasets publicly available. We are indebted to consortium members Mara Lawniczak and Alistair Miles for their valuable comments and guidance. We also thank Christian Gauthier, Renee Ali, and Simon Frost for their insightful comments. Authors A.P., X.C., M.R.R., I.H., and E.K.J. are paid employees of Microsoft Corporation. The funder provided support in the form of salaries for authors [I.H., A.P., X.C., M.R.R., and E.K.J.], but did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the paper. J.D. and G.C. were supported by funding from the Department of Biological Sciences at Purdue University. D.E.N. and M.E.G. were supported in part by the Johns Hopkins Malaria Research Institute and Bloomberg Philanthropies. Commercial funder Microsoft Research provided financial support to J.M.P., M.E.G., and D.E.N.

Author information

Authors and Affiliations

Microsoft Premonition, Microsoft Research, Redmond, WA, 98052, USA
Andrzej Pastusiak, Michael R. Reddy, Xiaoji Chen, Isaiah Hoyer & Ethan K. Jackson
Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA
Jack Dorman & Giovanna Carpi
The W. Harry Feinstone Department of Molecular Microbiology and Immunology, Johns Hopkins Malaria Research Institute, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
Mary E. Gebhardt & Douglas E. Norris
Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, 15260, USA
James M. Pipas

Contributions

A.P.: Analyzed data relating to Anopheles gambiae mosquitoes and vertebrate hosts, contributed to the overall structure of the study, and co-authored the initial draft. M.R.R.: Analyzed data relating to Anopheles gambiae mosquitoes and vertebrate hosts, contributed to the overall structure of the study, and co-authored the initial draft. Corresponding author. X.C.: Contributed to the generation of figures and co-authored the initial draft. I.H.: Contributed to the overall structure of the study and co-authored the initial draft. J.D.: Analyzed data relating to Plasmodium and contributed to the generation of figures. M.E.G.: Contributed to interpretation relating to Anopheles gambiae mosquitoes and vertebrate hosts. G.C.: Analyzed data relating to Plasmodium, contributed to the generation of figures and to the overall structure of the study, and co-authored the initial draft. D.E.N.: Analyzed data relating to Anopheles gambiae mosquitoes and vertebrate hosts, contributed to the overall structure of the study, and co-authored the initial draft. J.M.P.: Analyzed data relating to viruses, bacteriophages, and bacteria, contributed to the overall structure of the study, and co-authored the initial draft. E.K.J.: Contributed to the overall structure of the study and co-authored the initial draft.

Corresponding author

Correspondence to Michael R. Reddy.

Ethics declarations

Competing interests

Authors A.P., X.C., M.R.R., I.H., and E.K.J. are current, salaried employees of Microsoft Corporation. This does not alter our adherence to Communications Biology’s policies on sharing data and materials. The remaining authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors: Tobias Goris. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Supplementary Data 4

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Pastusiak, A., Reddy, M.R., Chen, X. et al. A metagenomic analysis of the phase 2 Anopheles gambiae 1000 genomes dataset reveals a wide diversity of cobionts associated with field collected mosquitoes. Commun Biol 7, 667 (2024). https://doi.org/10.1038/s42003-024-06337-9

Received: 07 March 2023
Accepted: 15 May 2024
Published: 30 May 2024
DOI: https://doi.org/10.1038/s42003-024-06337-9
Springer Nature Limited
