The development of next-generation sequencing techniques has enabled analysis of unculturable viruses from sources as diverse as water, soil, and samples of prokaryotic and eukaryotic organisms. As a result, sequence data for thousands of viruses have been obtained, but an immense amount of “viral dark matter” [14] remains to be investigated. Among the many virus groups that are now better understood because of the increase in sequence data are hepeviruses and related viruses of the order Hepelivirales. The present classification of the higher-order taxa of RNA viruses is based essentially on the similarity of the RNA-dependent RNA polymerase (RdRp), the only universal gene of all RNA viruses. According to Wolf and coworkers, the global RdRp phylogenetic tree consists of five major branches, with the members of the order Hepelivirales belonging to the "alphavirus supergroup" of branch 3 [19]. Except for the similarity of their RdRp, the members of the Hepelivirales share few commonalities regarding genome structure, capsid symmetry, or host range: Mono-, bi-, and multipartite viruses have been described. The lengths of genomic RNAs (gRNAs) vary from 6.7 kb to 9.8 kb for viruses with monopartite genomes and up to 16.8 kb in total for the multipartite benyviruses. gRNAs may be polyadenylated (Benyviridae, Hepeviridae, Matonaviridae) or not (Alphatetraviridae) and encode a large nonstructural polyprotein at their 5' end with an N-terminal N7-methyltransferase, a central helicase, and a C-terminal 'alpha-like' RdRp. A capsid protein (CP) is encoded either by a subgenomic mRNA (sgRNA) or by a second gRNA molecule (Benyviridae and Helicoverpa armigera stunt virus of the family Alphatetraviridae). Particles are either icosahedral, with a T = 3 (Hepeviridae) or T = 4 lattice (Alphatetraviridae, Matonaviridae), or rod-like with helical nucleocapsids (Benyviridae). The CP of icosahedral viruses has a jelly-roll fold. 
Only members of the family Matonaviridae have enveloped virions. The wide host range includes vertebrates (Hepeviridae, Matonaviridae), lepidopteran insects (Alphatetraviridae), and plants (Benyviridae).

In addition to the classified members of the order Hepelivirales, a great number of distantly related, unclassified hepe-like viruses (hepeliviruses) have been described. Among these viruses are the bastroviruses from humans, other mammals, and fish [1, 10, 17] as well as so-called ‘bastro-like’ viruses and many other hepe-like viruses detected in mammals, lower vertebrates, invertebrates, plants, sewage, and environmental water samples, e.g., [4, 5, 12, 13, 15, 16, 17, 18, 20]. The genomes of these viruses vary from 5 to 12.5 kb in size and contain 2-8 open reading frames (ORFs), indicating that the 'hepe-like' viruses likely comprise several new virus families.

Given the diversity of viruses in the order Hepelivirales and their reservoirs, especially sewage and environmental water, the aim of this project was to analyse the virome of two waterbodies in Berlin, Germany, to reveal the spectrum of viruses present, and to characterise the detected viruses by phylogenetic analysis. The present study describes 25 almost complete and 68 partial hepelivirus genomes. These viruses were designated Havel hepe-like virus (HHLV) 1 to 38 and Teltowkanal hepe-like virus (TkHLV) 1 to 46. Further, we describe the partial genomes of six astro-like viruses, named Havel astro-like virus (HALV) 1 to 3 and Teltowkanal astro-like virus (TkALV) 1 to 3. With the exception of TkHLV-14, none of the viruses belongs to a known family of the order Hepelivirales or Stellavirales. Whereas TkHLV-14 has the P70-like CP of omegatetraviruses, the genome sequences of the remaining viruses display similarity to hepe-like and astro-like viruses from crustacea, aquatic arthropods, and unspecified lophotrochozoa, which is compatible with their detection in environmental water samples.

The methods of sample collection, virus enrichment, RNA preparation, and sequencing have been described previously [21]. Briefly, 50-L freshwater samples were collected from the Teltow Canal (Teltowkanal) and the Havel River in Berlin, Germany (for sampling dates and sampling site coordinates, see Table 1). Virus particles were concentrated by glass wool filtration and eluted from the column with buffer containing 3% beef extract and 50 mM glycine, pH 9.5. After adjusting its pH to 7, the eluate (180 ml) was filtered (0.45 µm). Virus particles were sedimented by ultracentrifugation (100,000 × g for 2.5 h at 4°C). Single-end and paired-end sequencing, respectively, was done using a HiSeq 2500 System (Table 1). Sequence data were extracted in FastQ format employing the bcl2fastq tool of Illumina, followed by adapter and quality trimming with Cutadapt (parameters: -q 10 -m 30 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT; -A only with PE data) and removal of amplification duplicons. Assembly was performed with the clc_assembler (parameters: -p fb ss 0850) and metaSPAdes using standard parameters (only for PE data; for parameters and detailed information, see Table 1). A protein database created with all NCBI GenBank entries for the Taxonomy ID 10239 (search term “viruses[organism]”) was screened with DIAMOND [3] and BLAST+ v2.6.0 to identify hepelivirus candidates. The final genome sequences were curated by manual assembly of overlapping contigs if necessary. Sequence alignments were made in MEGA version X [8] and adjusted manually. Codon-adjusted nucleotide sequence alignments were used for maximum-likelihood tree inference with IQ-TREE 2.1.3 for Windows [9]. Optimal substitution models were selected using ModelFinder, implemented in IQ-TREE. Branch support was assessed with 50,000 ultrafast bootstrap replications using UFBoot2, also implemented in IQ-TREE [7].
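The trimming and tree-inference steps described above can be sketched as command lines. The following Python helpers are a minimal illustration, not the authors' scripts: they only assemble argument lists and do not run anything. The Cutadapt quality, length, and adapter settings are taken from the text. The file names, the `iqtree2` executable name, and the use of `-m MFP` to combine ModelFinder with tree inference are assumptions made for this example.

```python
from typing import List, Optional

# Adapter sequences and thresholds as given in the Methods text.
ADAPTER_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_R2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"


def cutadapt_command(reads_r1: str, out_r1: str,
                     reads_r2: Optional[str] = None,
                     out_r2: Optional[str] = None) -> List[str]:
    """Build a Cutadapt call with -q 10 -m 30; -A is added only for paired-end data."""
    cmd = ["cutadapt", "-q", "10", "-m", "30", "-a", ADAPTER_R1]
    if reads_r2 is None:
        # Single-end run: one input file, one output file.
        return cmd + ["-o", out_r1, reads_r1]
    if out_r2 is None:
        raise ValueError("paired-end input needs an output path for read 2")
    # Paired-end run: second adapter plus paired output (-p).
    return cmd + ["-A", ADAPTER_R2, "-o", out_r1, "-p", out_r2,
                  reads_r1, reads_r2]


def iqtree_command(alignment: str, bootstrap_replicates: int = 50000) -> List[str]:
    """Build an IQ-TREE 2 call with model selection and ultrafast bootstrap.

    Assumption: '-m MFP' (ModelFinder followed by tree inference) stands in for
    the model selection step; the text does not state the exact options used.
    """
    return ["iqtree2", "-s", alignment, "-m", "MFP",
            "-B", str(bootstrap_replicates)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(cutadapt_command("sample_R1.fastq.gz", "trimmed_R1.fastq.gz",
                                    "sample_R2.fastq.gz", "trimmed_R2.fastq.gz")))
    print(" ".join(iqtree_command("rdrp_alignment.fasta")))
```

Returning argument lists rather than shell strings keeps the sketch easy to inspect and lets a caller pass the result to `subprocess.run` without quoting issues.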

Table 1 Sample information

Illumina sequencing results varied depending on the yields of virus enrichment and the sequencing method (Table 1). DIAMOND analysis suggested the presence of many scaffolds/contigs belonging to the orders Hepelivirales and Stellavirales. However, the vast majority of scaffolds/contigs with hepe-like sequences were misassigned to the order Stellavirales, and close to 100% of all assignments to lower ranks (family, genus, species) could not be verified by BLAST. Using BLAST, 93 partial or full-length genome sequences of hepe-like viruses were identified, with sizes ranging from 970 nt to 9268 nt, plus six partial astro-like virus genome sequences with lengths from 539 nt to 6815 nt (GenBank accession nos. OP699055 to OP699153; Supplementary Table S1). The genome layout of 21 (almost) complete hepe-like virus genomes is presented in Supplementary Figure S1. Interestingly, many viruses were present in two or more samples with almost identical sequences, indicating that a very similar hepelivirus virome was present in both waterbodies and in three consecutive years (Supplementary Table S1). Most of the contigs smaller than 1 kb were not characterised further.
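The length-based triage of contigs mentioned above can be illustrated with a short Python sketch. It is a simplified example under stated assumptions: the 1000-nt cutoff is applied strictly here, whereas the text says only that most contigs below 1 kb were set aside, and the function names and toy sequences are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA-formatted lines into a mapping of header to sequence."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {header: "".join(chunks) for header, chunks in parts.items()}


def split_by_length(contigs: Dict[str, str],
                    min_length: int = 1000) -> Tuple[List[str], List[str]]:
    """Return (kept, set_aside) contig names using a strict length cutoff."""
    kept = sorted(n for n, seq in contigs.items() if len(seq) >= min_length)
    set_aside = sorted(n for n, seq in contigs.items() if len(seq) < min_length)
    return kept, set_aside


if __name__ == "__main__":
    # Toy input; real contigs would be read from an assembler output file.
    toy = [">contig_a", "ACGT" * 300, ">contig_b", "ACGT" * 100, "ACGT" * 100]
    kept, set_aside = split_by_length(parse_fasta(toy))
    print("kept:", kept)
    print("set aside:", set_aside)
```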

As a large nonstructural polyprotein with methyltransferase, helicase, and RdRp domains is a common feature of hepeliviruses, phylogenetic analysis of these three domains was conducted. The sequences of 57 of the 93 novel hepeliviruses were suitable for these analyses and included. The sequence alignments contained reference sequences of members of the four Hepelivirales families, bastroviruses, and numerous unclassified hepe-like viruses. In addition, the RdRp tree contained representative sequences of mamastroviruses, avastroviruses, and unclassified astro-like viruses as well as the HALV and TkALV sequences (Fig. 1 and Supplementary Fig. S2).

Fig. 1

Phylogenetic analysis of the viral methyltransferase (A), helicase (B), and RdRp domains (C) of hepeliviruses. Blue boxes indicate reference virus families; yellow, brown, and green boxes denote unclassified bastroviruses, 'bastro-like' viruses, and hepe-like viruses with permuted RdRp. Sequences of Havel hepe-like viruses (HHLV; printed in red), Teltowkanal hepe-like viruses (TkHLV; printed in green), unclassified hepe-like viruses (printed in black) and reference strains of the families Alphatetraviridae, Benyviridae, Hepeviridae, Matonaviridae, and Astroviridae (printed in blue) were aligned in MEGA and used for tree inference with IQ-TREE 2. Optimal substitution models: TVM+F+R6 (A) and (B), and TVMe+R7 (C). The trees in panels A and B were arbitrarily rooted with members of the family Matonaviridae, and in panel (C), with members of the family Astroviridae and astro-like viruses. Scale bars indicate substitutions per site. Teltowkanal hepe-like viruses identified in consecutive samples are indicated by a diamond (◆). Triangles (▲) and dots (●) indicate viruses from this study with almost complete and partial genome sequences, respectively. Details of the trees are presented in Supplementary Figure S2

The tree topologies in the three phylogenetic analyses demonstrate several robust clades in addition to the families Alphatetraviridae, Benyviridae, Hepeviridae, Matonaviridae, and Astroviridae. The first clade includes human and animal bastroviruses, which always clustered in two subclades. The second clade, the so-called ‘bastro-like’ viruses, which includes HHLV-6, -10, -15, and -16 and TkHLV-28, -29, -30, -31, and -38 from our water samples, was always located apart from the bastroviruses but close to the alphatetraviruses (Fig. 1). The genomes of bastro-like viruses contain two ORFs. In addition to their phylogenetic position and their different host, which is likely a non-vertebrate, 'bastro-like' viruses are distinguished from the human and animal bastroviruses by the presence of a non-homologous CP. This is of particular interest, as the CP of human and animal bastroviruses exhibits a striking similarity to the astrovirus CP, whereas the CP of ‘bastro-like’ viruses does not (Supplementary Fig. S3). A third clade comprises up to 11 viruses of arthropods, HHLV-8, and TkHLV-22, -32, and -33. These viruses share one characteristic feature: an RdRp with permuted palm motifs. Whereas all positive-stranded RNA viruses with a canonical RdRp have seven conserved structural motifs (consisting of β-strands and α-helices) in the order G-F-A-B-C-D-E [11], the 11 hepe-like viruses of this clade exhibit the permuted order -C-A-B- of their active site motifs (Fig. 2 and Supplementary Fig. S5). The genomes of Beihai barnacle virus 1, Changjiang hepe-like virus 1, and Beijing sediment hepe-like virus 1 encode four hypothetical proteins in addition to the polyprotein and have sizes of 10-11 kb, which is considerably larger than the genomes of the monopartite viruses of the acknowledged Hepelivirales families. This genome layout, however, is not conserved. TkHLV-32 and -33 have only three ORFs, with ORFs 1 and 2 corresponding to ORFs 1 and 4 of the former viruses.
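The difference between the canonical A-B-C and the permuted C-A-B arrangement of the active-site motifs can be illustrated with a toy Python sketch. It locates the conserved residue patterns DxxxxD (motif A), GxxxTxxxN (motif B), and GDD (motif C) with regular expressions and reports their order along the sequence. This is a deliberately naive illustration and not the method used in this study: short patterns such as these produce spurious matches in real proteins, where motif assignment requires curated alignments. The function name and the artificial test sequences are hypothetical.

```python
import re
from typing import Optional

# Conserved residue patterns of the RdRp active site ('.' stands for any residue).
MOTIF_PATTERNS = {
    "A": r"D.{4}D",        # DxxxxD
    "B": r"G.{3}T.{3}N",   # GxxxTxxxN
    "C": r"GDD",           # GDD
}


def motif_order(protein: str) -> Optional[str]:
    """Return the order of the first hits of motifs A, B, and C, e.g. 'A-B-C'.

    Returns None if any of the three patterns is not found.
    """
    positions = {}
    for name, pattern in MOTIF_PATTERNS.items():
        match = re.search(pattern, protein)
        if match is None:
            return None
        positions[name] = match.start()
    return "-".join(sorted(positions, key=positions.get))


if __name__ == "__main__":
    spacer = "LLLLL"
    # Artificial sequences built only to show the two arrangements.
    canonical = "MK" + "DAAAAD" + spacer + "GAAATAAAN" + spacer + "GDD" + "KK"
    permuted = "MK" + "GDD" + spacer + "DAAAAD" + spacer + "GAAATAAAN" + "KK"
    print(motif_order(canonical))
    print(motif_order(permuted))
```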

Fig. 2

Schematic presentation of RNA-dependent RNA polymerases (RdRp) with the canonical and permuted order of the conserved palm motifs. Blue boxes represent motifs A to E of the palm subdomain, and brown boxes represent motifs G and F of the finger subdomain. DxxxxD, GxxxTxxxN, and GDD indicate conserved amino acids of the RdRp active site

Nine hepeliviruses were found to have peptidase-like CPs (Supplementary Fig. S4), and six possessed a third ORF encoding a protein with a Zn-binding RING/Ubox domain (data not shown).

We expected to trace human and animal enteric viruses, as all of the samples were collected in the metropolitan area of Berlin, Germany, and the two waterbodies are linked and receive the discharge of a local wastewater treatment plant and the drain water effluents of the city of Berlin after periods of heavy rainfall. Although such viruses are detectable in Havel River and Teltow Canal samples by specific PCR methods throughout the year [2 and unpublished results], we failed to identify sequence reads of known pathogenic noroviruses (Caliciviridae), enteroviruses (Picornaviridae), astroviruses (Astroviridae), or hepatitis A or hepatitis E viruses (Picornaviridae, Hepeviridae) in our metagenomes. Instead, we were able to demonstrate the presence of many unclassified hepe-like viruses and a few astro-like viruses. As shown recently for the picorna-like viruses of the Havel River [21], this result indicates the presence of a vast number of uncultured viruses compared to the still moderate number of viruses in the known virosphere. One notable group of unclassified hepe-like viruses is the bastroviruses, which can be differentiated in phylogenetic analysis into human bastroviruses with three ORFs and animal bastroviruses with two ORFs (Fig. 1). These viruses likely have vertebrate hosts. Another characteristic feature of bastroviruses is a CP with striking similarity to the astrovirus CP (Supplementary Fig. S3). The significance of this similarity, as well as the prevalence of bastroviruses in vertebrates, remains to be elucidated. One member of this group, the Guangdong fish caecilians hepevirus, however, exhibits more variation in the CP gene than in the domains of the nonstructural protein (Supplementary Figs. S2 and S3). Overall, the four Hepelivirales families, the bastroviruses, and the 'bastro-like' viruses each have unique structural proteins with little or no similarity to one another.

Another clade of hepe-like viruses (Fig. 1) comprises the so-called ‘bastro-like’ viruses. They have rather small genomes (<7 kb) that are significantly different from those of the bastroviruses. Some bastro-like viruses have been detected in insectivorous bats and insects (Culex mosquitoes, wasps), while others have been found in aquatic organisms (sponges, bivalves, fish). The presence of bastro-like viruses in environmental water samples from the Havel River and Teltow Canal is compatible with the assumption that these viruses have aquatic hosts.

A third interesting clade includes viruses with rather large genomes (up to 11 kb). These viruses have been detected in mantis flies, crayfish, and barnacles and in various environmental samples, including our Havel River and Teltow Canal water samples. As shown in Fig. 2, the RdRp of these viruses has a permuted order of the conserved palm subdomain motifs. Previously, a permuted RdRp had been observed only in birnaviruses [11] and two permutotetraviruses (Thosea asigna virus and Euprosterna elaeasa virus) [6]. In addition to the permuted order of the RdRp palm subdomain motifs, the polymerase sequences of the members of the families Birnaviridae and Permutotetraviridae differ significantly from those of the alphavirus supergroup, which includes the hepeliviruses. To our knowledge, this study is the first description of members of the alphavirus supergroup with a permuted RdRp.

One obstacle to metagenomic studies of environmental water samples is the lack of information on the virus hosts. Even virus detection in faecal samples or the gills and guts of fish, arthropods, or molluscs may indicate dietary uptake or accumulation by filtration rather than infection. However, the increasing number of hepeliviruses associated with non-vertebrates suggests that non-vertebrates are the actual hosts. Further, the similarity of sequences from various sources and locations suggests a similar host range and a global distribution of hepeliviruses in terrestrial, marine, and freshwater ecosystems, which clearly deserves more extensive investigation regarding prevalence and host range. Recurrent detection of identical viruses in samples from two linked waterbodies sampled in consecutive years, as shown in the present study, suggests an enzootic prevalence of hepeliviruses in many unknown hosts.