Introduction

Wolbachia is a gram-negative α-proteobacterium and is found as an endosymbiont in many arthropods and nematodes with a diverse range of effects on host phenotypes1,2. Wolbachia are maternally transmitted through host oocytes to the developing embryo1. Some Wolbachia strains manipulate host reproduction to promote their transmission to the next generation of hosts2. Wolbachia strains have strong affinities for host germline tissues, enabling efficient colonization of the oocytes for transmission to the next generation of hosts3. The Wolbachia strain from the neotropical fruit fly Drosophila willistoni, wWil, selectively infects the host germline4,5. This unique tropism could be informative for understanding how Wolbachia localizes to and regulates the host germline, with implications for vectorizing Wolbachia infections for biological control mechanisms.

The strong affinity of wWil for host germline cells is unique in comparison to closely related Wolbachia strains. Phylogenetic comparisons based on amplification of the wsp and ftsZ genes by PCR indicate that wWil is closely related to the wAu strain found in Drosophila simulans4. However, unlike wAu, which infects both germline and somatic tissues in D. simulans, wWil is enriched in the primordial germline cells of D. willistoni embryos4. As with many Wolbachia strains, wWil exhibits strict maternal transmission in laboratory lines.

Given that all Wolbachia strains must navigate to the female host germline for transmission to the next host generation, the wWil strain genome offers insight into the genomic repertoire enabling this phenotype. Despite the availability of numerous Wolbachia genomes in public databases, a complete wWil genome is lacking. Here we present the first high-quality de novo assembly of wWil, which we obtained from Nanopore sequencing of wWil-infected D. melanogaster cell culture cells. We perform comparative genomics analyses of the wWil, wMel, and wAu genomes to identify differences that can provide insights into the mechanisms underlying wWil’s germline-specific distribution. This genomic resource will provide insight into the essential genes required for efficient germline transmission in Wolbachia-based biological control strategies.

Methods

wWil genome assembly

We obtained wWil-infected Drosophila willistoni flies, originally collected on Guadeloupe Island, from the Drosophila Species Stock Center at the University of California, San Diego (stock 14030-0811.24). These are now available from Cornell University (Powell Gd-H4-1). We isolated wWil from wWil-infected D. willistoni embryos6 and introduced wWil to immortalized Drosophila melanogaster JW18 cell culture cells with the shell vial technique7. Briefly, we isolated wWil from infected embryos by filtration through a sterile 5 µm syringe filter, followed by a secondary filtration through a 1.2 µm filter. We pipetted the wWil-containing lysate into a shell vial seeded with a monolayer of JW18 cells 24 h earlier, and gently centrifuged the mixture at 2500 × g for 1 h at 15 °C to force the wWil into the JW18 cells. We allowed the infection to stabilize by maintaining the culture for three months at 23 °C. Confluent cultures were sampled for genomic DNA extraction and library preparation. Our wWil-infected D. melanogaster cell culture system offered multiple advantages over directly sequencing wWil-infected D. willistoni flies for generating a complete de novo genome assembly. wWil replicates to higher titer in vitro than in its insect hosts, as evidenced by the high proportions of Wolbachia-derived reads in our datasets relative to in vivo data obtained previously (GCA_000153585.1). Additionally, it is significantly easier to extract the high-quality, long genomic DNA necessary for long-read sequencing from cell culture cells than from whole fly tissues.

To prepare wWil genomic DNA for Nanopore library preparation and sequencing, 1.2 mL of cells (at ~2 × 10⁶ cells/mL) was pelleted by centrifugation at 16,000 × g for 10 min at 4 °C. Following supernatant removal, DNA was extracted using the Wizard HMW DNA Extraction kit (Promega #A2920, Lot: 0000575812). Libraries were prepared with the Native Barcoding Kit V14 for Nanopore MinION R10 (Oxford Nanopore Technologies Cat #SQK-NBD114-24, Lot: NDP1424.10.0010) and sequenced on the Nanopore MinION Mk1B with a MinION R10 Version flow cell (FLO-MIN-114, Lot: 11003064). We used Oxford Nanopore’s MinKNOW v23.07.8 software to live basecall with Guppy v7.0.8 (Fast model, read splitting ON). We set the minimum read length to 200 bp and stopped sequencing after 36 h. This resulted in 3.65 M reads with an estimated N50 of 1.11 kb and 2.6 Gb called with a minQ of 8.
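The read N50 reported above is the read length at which reads of that length or longer contain at least half of all sequenced bases. The following is a minimal sketch of that calculation, not the MinKNOW implementation; the function name `n50` and the example read lengths are hypothetical and are used only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths.

    The N50 is the length L such that reads of length >= L together
    contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical read lengths (bp), for illustration only.
example_lengths = [200, 500, 800, 1100, 1500, 3000]
print(n50(example_lengths))
```

In practice the lengths would be read from the FASTQ file or the sequencing summary rather than typed in by hand.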

Prior to genome assembly, we preprocessed the raw nanopore reads to remove host-derived sequences. Reads were aligned to the D. melanogaster reference genome (dmel-all-r6.46)8 with bwa-mem9 v0.7.17. We used samtools10 v1.6 to sort and index the alignment and to retain only the reads that did not align to the host genome (samtools view -b -f 4). The remaining reads were output with bedtools11 v2.31.1 bamtofastq. Artifacts in MinKNOW resulted in duplicated Nanopore reads. We removed read duplicates with SeqKit12 rmdup v2.7.0 and performed a de novo assembly of the wWil genome with Flye13 v2.9 (preset --nano-hq). We screened the assembly for foreign genomic and adapter contamination using the NCBI Foreign Contamination Screen (FCS) toolkit version 0.5.0. We ran FCS-GX14 (taxa ID 953) and FCS-adaptor (run with the --prok flag), both of which found no evidence of contamination.

Genome polishing, annotation, and quality assessment

We generated Illumina short-read whole-genome sequence data from JW18 cell culture cells stably infected with wWil to polish the Nanopore assembly. Genomic DNA was obtained as described above for Nanopore sequencing. Illumina libraries were made following the Tn5 protocol15 and sequenced on a NovaSeq X by the Duke Sequencing and Genomic Technologies Core Facility. Illumina reads were aligned to the wWil assembly and D. melanogaster reference8 (dmel6) simultaneously using bwa-mem9 with default settings. Optical and PCR duplicates were marked with sambamba16. The reads aligning to dmel6 were discarded. The remaining reads were converted back to fastq format using samtools10 fastq, and then re-aligned to the wWil genome using minimap217 v2.26 with the settings -ax sr --cs --eqx. Reads with a gap-compressed mismatch ratio exceeding 0.04 were filtered out to remove mismapped reads and excess noise prior to polishing. Pilon18 v1.24 was run on these filtered alignments using default settings, producing the final polished assembly.
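The mismatch filter described above can be sketched as follows. This sketch assumes that the gap-compressed mismatch ratio corresponds to the `de:f:` tag that minimap2 writes to its SAM output (gap-compressed per-base sequence divergence); the text does not state which field was used, so this is an assumption. The function names are hypothetical, and the records are plain SAM text lines.

```python
MAX_DIVERGENCE = 0.04  # threshold stated in the text


def divergence(sam_line):
    """Return the value of the de:f: tag of a SAM record, or None if absent.

    Assumption: the de tag (minimap2's gap-compressed per-base divergence)
    is the quantity being filtered on.
    """
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM record.
    for tag in fields[11:]:
        if tag.startswith("de:f:"):
            return float(tag[len("de:f:"):])
    return None


def keep_alignment(sam_line, max_div=MAX_DIVERGENCE):
    """Keep header lines and records whose divergence is <= max_div."""
    if sam_line.startswith("@"):
        return True
    de = divergence(sam_line)
    # Records without a de tag (e.g. unmapped reads) are dropped.
    return de is not None and de <= max_div


def filter_sam(lines, max_div=MAX_DIVERGENCE):
    """Yield only the SAM lines that pass the divergence filter."""
    for line in lines:
        if keep_alignment(line, max_div):
            yield line
```

A filtered SAM stream produced this way would still need to be sorted and indexed before polishing.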

We assessed the quality of the polished assembly with BUSCO19 and annotated the genome with a standard workflow. BUSCO scores were calculated using BUSCO v5.7.0 with the rickettsiales_odb10 database. Default parameters were used for all software unless otherwise specified. We annotated the wWil genome with Prokka20 v1.1.1 (kingdom: Bacteria) and the NCBI Prokaryotic Genome Annotation Pipeline21 (PGAP) v6.7 to identify coding sequences (CDS), tRNAs, rRNAs, and tmRNA. GC content and GC skew were calculated with Proksee22 v1.1.2. We then aligned the wWil genome against the wMel (CP046925.1) and wAu (CP069055.1) reference genomes with BLASTn, setting the expected value cut-off at 0.0001. We plotted these annotations with Proksee22 v1.1.2 to visualize the annotated genome (Fig. 1).
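For reference, GC content and GC skew are conventionally defined as (G + C) / (A + C + G + T) and (G − C) / (G + C), respectively. The sketch below computes both for a sequence and over sliding windows. It is an illustration of the standard definitions rather than of Proksee's internals, and the function names, window size, and step size are hypothetical.

```python
def gc_stats(seq):
    """Return (GC content, GC skew) for a nucleotide sequence.

    GC content = (G + C) / (A + C + G + T)
    GC skew    = (G - C) / (G + C)
    Ambiguous bases such as N are ignored.
    """
    seq = seq.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    acgt = a + c + g + t
    gc_content = (g + c) / acgt if acgt else 0.0
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0
    return gc_content, gc_skew


def sliding_windows(seq, window, step):
    """Yield (start, (GC content, GC skew)) for each full window."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_stats(seq[start:start + window])


# Toy sequence, for illustration only.
for start, (content, skew) in sliding_windows("GGGCATGCCCAT", window=6, step=6):
    print(start, round(content, 3), round(skew, 3))
```

Plotting each window's value minus the genome-wide average gives the deviation-from-average representation used in circular genome maps.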

Figure 1

Map of the Wolbachia wWil genome prepared using Proksee22. Circles in order from outer to inner show the following features: the position of coding sequences (CDS), open reading frames (ORF), tmRNA, tRNA, and rRNA genes (circle 1); GC content (circle 2); and GC skew plotted as the deviation from the average for the entire sequence (circle 3). The positions of BLAST hits detected through BLASTn comparisons against wMel32 (CP046925.1) and wAu33 (CP069055.1) are shown in transparent blue and green. Sites in the wWil genome that map to multiple positions in the wMel and wAu genomes are indicated by the darker, overlapping colors (circles 4 and 5).

Comparative genomic analyses

To place our wWil genome within the Wolbachia phylogeny, we gathered a set of 27 circular, chromosome-level genome assemblies from diverse Wolbachia supergroups across a broad host range23. Ehrlichia chaffeensis was used as an outgroup. Genes were annotated using the NCBI Prokaryotic Genome Annotation Pipeline21, and groups of orthologous genes (orthogroups) were identified with OrthoFinder224. This produced protein sequence alignments of each orthogroup, and those that were single copy orthologous genes (SCOs) were used to generate a maximum likelihood (ML) phylogeny with IQ-TREE25 (1000 bootstrap replicates), rooted on E. chaffeensis. Additionally, we utilized BUSCO19 analysis to characterize gene presence-absence variation across orthogroups.
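Selecting single-copy orthologs amounts to keeping the orthogroups in which every genome contributes exactly one gene. The sketch below shows this selection on a toy data structure; it is not OrthoFinder's code, and the function name, orthogroup identifiers, and gene identifiers are hypothetical.

```python
def single_copy_orthogroups(orthogroups, genomes):
    """Return the sorted IDs of orthogroups with exactly one gene per genome.

    orthogroups: dict mapping orthogroup ID -> dict mapping genome name
                 -> list of gene IDs assigned to that orthogroup.
    genomes:     list of all genome names that must be represented.
    """
    scos = []
    for og_id, members in orthogroups.items():
        if all(len(members.get(genome, [])) == 1 for genome in genomes):
            scos.append(og_id)
    return sorted(scos)


# Toy example with hypothetical identifiers.
genomes = ["wWil", "wMel", "wAu"]
orthogroups = {
    "OG1": {"wWil": ["w1"], "wMel": ["m1"], "wAu": ["a1"]},        # single copy
    "OG2": {"wWil": ["w2", "w3"], "wMel": ["m2"], "wAu": ["a2"]},  # duplicated in wWil
    "OG3": {"wWil": ["w4"], "wMel": ["m3"]},                       # absent from wAu
}
print(single_copy_orthogroups(orthogroups, genomes))
```

The same per-genome counts also give a presence-absence matrix: an orthogroup is present in a genome when its gene list for that genome is non-empty.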

Whole genome alignments were performed with progressiveMauve26 (snapshot 2015-02-25.1) to identify local collinear blocks of orthologous sequence and identify structural rearrangements. We visualized the breaks in synteny between wWil and both reference genomes by generating dotplots with D-GENIES27.

We also performed a brief assessment of putative secreted and membrane-bound proteins that could play a role in the Wolbachia-host interaction. Proteins containing a signal peptide sequence were identified by SignalP28. Transmembrane protein domains were identified by TMHMM29. The subset of proteins with a signal peptide and a transmembrane domain were classified as membrane-bound proteins, while those with a signal peptide but without a transmembrane domain were classified as secreted proteins. We then characterized presence-absence variation of putative secreted and membrane proteins within groups of orthologous genes across strains. Finally, we identified variable sites in all proteins by calculating the Shannon entropy metric30,31, and compared the number of high-entropy sites in membrane and secreted proteins versus all proteins in general.
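The classification rule and the entropy calculation described above can be sketched as follows. The Shannon entropy of an alignment column is H = −Σ p_i log2 p_i, where p_i is the frequency of residue i in that column. The entropy threshold, the use of base-2 logarithms, and the exclusion of gap characters are assumptions made for illustration, since the text does not specify them; the function names are hypothetical.

```python
import math
from collections import Counter


def classify(has_signal_peptide, has_tm_domain):
    """Classify a protein from signal peptide and transmembrane predictions."""
    if not has_signal_peptide:
        return "other"
    return "membrane" if has_tm_domain else "secreted"


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gaps ('-')."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def count_variable_sites(alignment, threshold=1.0):
    """Count columns whose entropy exceeds a (hypothetical) threshold.

    alignment: list of equal-length aligned protein sequences.
    """
    return sum(1 for col in zip(*alignment) if column_entropy(col) > threshold)


# Toy alignment, for illustration only.
toy_alignment = ["AAAC", "AACG", "AAGT", "AA-T"]
print(classify(True, False), count_variable_sites(toy_alignment))
```

Applying `count_variable_sites` to every orthogroup alignment, and grouping the counts by the `classify` label, yields the per-category distributions compared in the analysis.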

Results and discussion

Genome assessments

We generated long-read nanopore data from wWil-infected D. melanogaster JW18 cells and assembled a 1.268 Mb genome assembly containing a single circular contig. Our wWil assembly had a high BUSCO completeness score of 98.6% before polishing, which was comparable to the other circular, chromosome-level Wolbachia genomes (Fig. 2A and Supplemental Table S1). Polishing produced an improvement in BUSCO score to 99.7%. We annotated the polished wWil genome to identify coding sequences (CDSs), tRNAs, rRNAs, and tmRNA. The polished assembly had a GC content of 35.23% and contained 1302 total genes with 1199 protein coding CDSs, three complete ribosomal RNA genes, 32 tRNAs, and 4 ncRNAs (Table 1).

Figure 2

Comparative phylogenomics of wWil among other Wolbachia strains. (a) ML phylogeny of Wolbachia genomes (bootstrap values of 90 or greater are indicated by black circles) based on 470 single-copy orthologous genes (SCOs), with wWil in supergroup A, along with genome metadata: host species and common name, genome size (Mb), BUSCO completeness score (%), total number of proteins, number of putative transmembrane proteins, and number of putative secreted proteins. (b) The same phylogeny as in (a), with the presence-absence variation of all orthogroups shown. Whitespace indicates the absence of a gene in a particular Wolbachia genome.

Table 1 wWil CP157591.1 annotation summary statistics prepared by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) v6.7.

Genome comparisons

To explore the differences between wWil and other Wolbachia strains, we performed genome alignments and analyses of orthologous gene content across diverse Wolbachia. Our phylogenetic analysis of SCOs from 27 Wolbachia genomes revealed that wWil resides in Wolbachia supergroup A, alongside wMel, wAu, and many other fly-infecting strains (Fig. 2A). Although these strains are closely related, dotplots revealed genomic rearrangements in wWil relative to both wMel (Fig. 3A) and wAu (Fig. 3B), with larger regions of homology to wAu. Alignment of the wWil genome to the wMel32 (CP046925.1) and wAu33 (CP069055.1) reference genomes revealed many breaks in synteny between the genomes (Fig. 3C).

Figure 3

Alignment of reference strains wMel32 (CP046925.1) and wAu33 (CP069055.1) with wWil (CP157591.1). (a, b) Dotplots generated with D-GENIES27. (c) Mauve alignment showing locally collinear blocks (LCBs) identified along the circular genomes and joined with vertical lines.

We then investigated genes that are present in some Wolbachia strains but absent in others, and could thus play a role in Wolbachia’s adaptation to different hosts. In general, our analysis showed a supergroup-specific pattern of gene presence-absence variation (Fig. 2B), with some sets of genes being unique to particular supergroups, or to individual Wolbachia strains. We then looked more specifically at membrane-bound and secreted proteins, which are often implicated in interactions between Wolbachia and its host. Just as for all genes, there was a supergroup-specific pattern in presence-absence variation for both membrane-bound and secreted proteins across Wolbachia strains (Fig. 4 and Supplemental Tables S2 and S3).

Figure 4

Presence-absence variation of putative (a) membrane protein and (b) secreted protein genes across orthogroups in Wolbachia strains. As in Fig. 2, the absence of a tile indicates the absence of a gene in a particular Wolbachia strain.

Additionally, our analysis of sequence entropy in membrane and secreted protein groups showed that they had many variable sites compared to all proteins in general. The median number of variable sites in an orthogroup across all Wolbachia genes was one, while the medians for secreted and membrane proteins were 14 and 13.5 variable sites, respectively (Fig. 5). It is especially of interest to identify secreted proteins, known as effectors, that contain variable sites and may interact with host machinery, resulting in a phenotype of interest. One such phenotype is the rescue of sterile Sex-lethal (Sxl) mutants, caused by the interaction of the effector TomO with host mRNA34. We find that there are 144 variable sites (~10%) in the alignment of TomO orthologs, suggesting rapid evolution of this gene driven by the arms race between host and symbiont35,36,37. Thus, our results support the idea that TomO plays an important role in the interaction with the host across different Wolbachia strains. Other examples include the WalE1 actin-associated protein38 and cytoplasmic incompatibility factors (Cifs)23, which we also find to have many variable sites (216 or ~26% of sites, and 209 or ~22% of sites, respectively). Along with these known effectors, there are likely many more that are as yet uncharacterized. Overall, this analysis revealed proteins with many sites that vary across diverse Wolbachia strains with a wide host range, and thus provides candidates for further interrogating Wolbachia-host interactions at the molecular level.

Figure 5

Variability of membrane proteins and secreted proteins compared to all proteins. Shown is a histogram of the distribution of orthogroups across the number of high-entropy (variable) sites in their protein sequence alignment. Orthogroup counts are plotted separately for all proteins (gray), secreted proteins (pink), and membrane proteins (blue), with median number of variable sites represented by dashed lines of the respective colors. There were 1,003 orthogroups that did not contain any variable sites, which are not included in the plot.

Conclusion

Our assembly of the first high-quality wWil strain genome will enable deeper understanding of Wolbachia supergroup A evolution in Drosophila hosts and the evolution of germline tropisms. The use of a novel D. melanogaster cell line infected with the wWil strain enabled us to obtain high molecular weight wWil gDNA for Nanopore sequencing. The genome assembly and annotation produced from these data will be a vital resource for future investigations of this strain and its germline-specific tropism. Furthermore, the membrane and secreted proteins identified in our analyses point to candidate genes that may be involved in mediating bacterial-host interactions to promote infection and intracellular persistence.