The origin of the genus Cannabis

Chloroplast markers (cp markers) are a useful tool for studying relationships between accessions of cannabis (syn. hemp, Cannabis sativa L.) from different geographic origins. In an alignment of three published plastomes, 38 chloroplast polymorphisms were identified, from which 8 cp markers were used to study the relationships of 53 cannabis accessions by high-resolution melting analysis (HRMA). The marker set distinguished six haplotypes (‘A’ to ‘F’) in the cannabis collection, with haplotypes ‘A’ and ‘F’ dominating at 34% and 50% of the individuals, respectively. The majority of populations (37) were homogeneous with regard to haplotype, 12 accessions consisted of two haplotypes and 4 accessions of three haplotypes. Most of the European fibre cultivars consisted of the ‘F’-type (e.g. ‘Fibrimon’, ‘Fibrimon 21’, ‘Juso 14’, ‘Fasamo’ and ‘Schurig’), and some were mixed ‘A/F’-types (e.g. ‘Fibrimon 56’, ‘Superfibra’, ‘Lovrin 110’, ‘Futura’, ‘Havelländische’). The Italian ‘Carmagnola in Selezione’ was exceptional in being a pure ‘A’-type. In the heterogeneous populations, expected heterozygosity ranged from 0.06 to 0.41. The populations were well differentiated by this marker set, with 79% of the variation located among populations (AMOVA). By comparison with plastomes from the most closely related genus, Humulus, haplotype ‘B’ could be identified as the haplotype of the common ancestor of both genera. Haplotype ‘B’ is rare, with a frequency of only 4% in the populations analysed. Unfortunately, the true geographic origin of most samples was unclear. However, amongst all published plastomes, only two were classified as haplotype ‘B’, both pointing independently back to Yunnan province (China) and indicating Yunnan as the region of origin of the genus Cannabis.


Introduction
Cannabis sativa L. (hemp or cannabis) belongs to the small family Cannabaceae, which comprises ten genera, with Humulus (hop, 3 species) as sister genus (Yang et al. 2013). The genus Cannabis is monotypic (Small and Cronquist 1976) and has been distributed globally as one of the oldest known crop plants. C. sativa is a multi-purpose crop, used for fibre in paper, textiles and construction materials (Karus and Vogt 2004), for seeds as food and feed (Callaway 2004), and for the female inflorescence as medicine (Ben Amar 2006) and psychotropic drug (Szendrei 1998).
Central Asia is regarded as the origin of C. sativa (de Candolle 1885). According to McKim (2003), the first archaeological discoveries in China are from the Neolithic period, around 4000 BC, while Long et al. (2017) date the first utilization to 8,000 BC. In contrast to the prevalent Central-Asia-Origin hypothesis of C. sativa, molecular evidence suggests that this species probably comes from a low-latitude region of India (Zhang et al. 2018a). The most precise localisation of the origin was provided by , who identified the northeastern Tibetan Plateau near Qinghai Lake as the origin by pollen analysis. From there, C. sativa spread to Europe over 6 Ma ago, to eastern China 1.2 Ma ago and to India 32.6 thousand years ago. C. sativa favors a mild climate with sufficient water and sunlight, and early humans spread it into a range of favorable temperate and sub-tropical niches, where it became naturalized throughout Eurasia, in parts of Africa, and more recently in the New World (Clarke and Merlin 2016).
Molecular plant phylogeographic studies have mostly relied on the chloroplast (cp) genome because of the low mutation rate of this single and nonrecombining unit of inheritance (Schaal et al. 1998). In C. sativa, cpDNA has been ascertained as a valuable tool for such analysis with sufficient variability on the interpopulational level (Gilmore et al. 2007; Zhang et al. 2018a).
Genebank collections are composed of genetic materials originating from in situ conditions or from breeding/research programs. Many of these materials can no longer be found in situ for a variety of reasons (Fowler and Hodgkin 2004). Gene banks were created at the beginning of the twentieth century as repositories of genetic material to preserve genetic diversity and to give breeders easy access to genetic materials (Fowler and Hodgkin 2004). The difficulty of maintaining such resources for medicinal plants is their uncountable number of taxa (Lohwasser and Weise 2020).
This study develops molecular markers from cpDNA of C. sativa for analyzing the variability and relationships of cannabis accessions of the genebank Gatersleben and published plastomes.

Development of chloroplast SNP markers
The chloroplast genomes of three C. sativa genotypes (Genbank ID/cultivar: KR363961/'Yoruba Nigeria' (Oh et al. 2016), KP274871/'Carmagnola', KR779995/'Dagestani' (Matielo et al. 2020)) were aligned, SNPs were localised and primers for SNP candidates were developed with Primer3 (Untergasser et al. 2012) as implemented in Geneious Prime 2020.2.2 (Biomatters Ltd.). The SNP candidates were tested by high-resolution melting analysis on a test set of 33 individuals from different accessions. Of the 29 SNP candidates, 8 markers were selected on the basis of curve types that were easy to distinguish. For each marker, two individuals representing the two curve types were selected and subsequently added as references to each run.
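The screening of an alignment for polymorphic sites can be illustrated with a minimal Python sketch. It is not the procedure used in this study (which relied on Geneious Prime); the toy sequences and the function name classify_columns are hypothetical and serve only to show how variable alignment columns can be sorted into indels, transitions and transversions.

```python
# Toy illustration only: the sequences below are invented, not plastome data.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_columns(aligned):
    """Return (position, kind) for every variable column of an alignment.

    `aligned` is a list of equal-length strings; '-' marks a gap.
    Positions are 0-based.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    variants = []
    for pos in range(length):
        states = {seq[pos].upper() for seq in aligned}
        if len(states) == 1:
            continue  # monomorphic column, nothing to report
        if "-" in states:
            kind = "indel"
        elif states <= PURINES or states <= PYRIMIDINES:
            kind = "transition"  # A<->G or C<->T
        else:
            kind = "transversion"  # purine<->pyrimidine
        variants.append((pos, kind))
    return variants


# Hypothetical three-sequence alignment standing in for three plastomes.
toy_alignment = [
    "ACGTAC-TTA",
    "ACATACGTTA",
    "ACGTCCGTTA",
]
variants = classify_columns(toy_alignment)
print(variants)
```

In this sketch every gapped column is reported separately, so a multi-base indel would have to be merged into one event before counting polymorphisms.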

Sample material
Fifty-three accessions of the IPK collection were grown in the greenhouse and leaves from 10 plants per accession were sampled, with the exception of CAN33 and CAN60, where only 9 and 8 plants could be sampled, respectively. The study therefore comprised 527 Cannabis individuals. The leaf samples were dried at 38°C in a drying oven and stored until analysis.
DNA Extraction, PCR and HRM
DNA was extracted with a modified CTAB method (Schmiderer et al. 2013). Concentration and quality of the DNA were determined by 1.5% agarose gel electrophoresis and with a NanoDrop 2000 (Fisher Scientific). HRM with pre-amplification was performed on a Rotor-Gene 6000 (Qiagen). For each 10 µl PCR reaction, 1 µl of genomic DNA (1:100 dilution of the original DNA extract) was added to a master mix containing 1× HOT FIREPol® EvaGreen® HRM Mix (no ROX) (Solis BioDyne) and 100 nM each of forward and reverse primer (ordered from Life Technologies). The PCR cycle profile included a denaturation step at 95°C for 14 min, followed by 45 cycles (95°C for 10 s, 59°C for 20 s and 72°C for 20 s) and a final denaturation step at 95°C for 30 s. For high-resolution melting curve analysis (HRM), the temperature was increased from 69°C to 81°C by 0.1°C/s. All reactions were run in duplicate with non-target controls in each run.

Statistical analysis
The statistical analyses were done with R 3.6.2 (R Core Team 2019) under RStudio 1.2.5033 (RStudio Team 2019) using the packages poppr (Kamvar et al. 2014, 2015) and ggtree (Yu et al. 2017). Distances were calculated according to Prevosti et al. (1975). In addition, the Simpson index (Simpson 1949), Nei's expected heterozygosity (H_exp) (Nei 1978) and the genetic differentiation as G_ST (Hedrick 2005) were calculated. As a measure of linkage disequilibrium, r_d (an adapted form of the index of association I_A (Brown et al. 1980)) was calculated (Agapow and Burt 2001).
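To make two of these statistics concrete, the following Python sketch computes Nei's expected heterozygosity for a single marker and Prevosti's distance between two populations. It is an illustration under stated assumptions, not the poppr implementation used here: the haploid sample-size correction n/(n − 1), the function names and the toy genotypes are all assumptions made for this example.

```python
from collections import Counter


def nei_expected_heterozygosity(alleles):
    """Unbiased gene diversity for one marker scored in haploid individuals.

    Assumes the haploid correction n/(n-1) * (1 - sum(p_i^2)).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two individuals are required")
    freqs = [count / n for count in Counter(alleles).values()]
    return n / (n - 1) * (1.0 - sum(p * p for p in freqs))


def prevosti_distance(pop_a, pop_b):
    """Prevosti's distance: summed absolute allele-frequency differences,
    divided by twice the number of loci.

    Each population is a list of haplotypes; each haplotype is a tuple
    holding one allele call per marker.
    """
    n_loci = len(pop_a[0])
    total = 0.0
    for locus in range(n_loci):
        freq_a = Counter(h[locus] for h in pop_a)
        freq_b = Counter(h[locus] for h in pop_b)
        for allele in set(freq_a) | set(freq_b):
            total += abs(freq_a[allele] / len(pop_a) - freq_b[allele] / len(pop_b))
    return total / (2 * n_loci)


# Hypothetical data: one marker scored in 10 plants, and two small
# populations scored at two markers (alleles coded 1 or 2).
h_exp = nei_expected_heterozygosity([1] * 5 + [2] * 5)
dist = prevosti_distance([(1, 1)] * 4, [(1, 2)] * 4)
print(round(h_exp, 3), dist)
```

With the toy data, the two populations are fixed for different alleles at one of two markers, so the distance is one half.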

Marker development
Fifty-three accessions of C. sativa from the genebank Gatersleben were analysed with 8 chloroplast markers using high-resolution melting analysis (HRM). To detect chloroplast markers, three published chloroplast genomes of C. sativa were aligned and 38 polymorphisms (16 indels and 16 SNPs, of which 7 were transitions) were identified (Supplementary Table 1). Candidates were preselected for their theoretical suitability for high-resolution melting analysis (see Supplementary Fig. 1 for an example). These candidates were evaluated with a small sample set and then narrowed down to a set of one indel (marker P18) and seven SNPs (Tables 1, 2). As in many plant species, cannabis chloroplast DNA contains two inverted repeats (26,011 bp each), which separate a large single copy region (84,059 bp) from a small single copy region (17,829 bp) (Zhang et al. 2018b). All but one of the markers were located in the large single copy region; only marker P10 was in the small single copy region. Five markers were intergenic, one (P21) was in an intron of rps16, and two were in a coding region (P18 in matK-trnK-UUU and P12 in rps11) (Table 1).

Description of the markers
The expected heterozygosity of the markers over all populations averaged 0.37, with most markers in a narrow range between 0.45 and 0.5 (Table 2), while two markers were very low at 0.063 (P26) and 0.077 (P10). The average G_ST of all markers was 0.87, with P26 (0.76) and P10 (0.96) as the two extremes, while all other markers ranged between 0.87 and 0.91.
No geographical pattern of haplotypes could be observed from the accessions' passport data (data not shown). However, the most commonly known European fibre-type cultivars consisted of haplotype 'F' or of 'F' mixed with another haplotype (in most cases the 'A'-type). For 'Fibrimon' and 'Kompolti', three accessions per cultivar from different providers were in our sample set. All 'Fibrimon' accessions were of the pure 'F' haplotype, while two accessions of cv. 'Kompolti' were 'F'-type and one was a mixed 'A/F'-type. Other fibre cultivars in the genebank could also be grouped as either pure 'F'-type or mixed 'A/F'-type. Pure 'F'-types were 'Fibrimon 21', 'Juso 14', 'Fasamo' and 'Schurig'. Mixed 'A/F'-types were 'Fibrimon 56', 'Eletta Campana', 'Superfibra', 'Lovrin 110', 'Futura' and 'Havelländische'. The Italian 'Carmagnola in Selezione' was, as a singular exception amongst the fibre accessions, a pure 'A'-type.
Basic population statistics were calculated for the heterogeneous populations separately and for all populations as an overall mean (Table 3). In the heterogeneous populations, the Shannon-Wiener index of haplotype diversity ranged from 0.33 to 0.944 and the Simpson index from 0.18 to 0.56. A number of mixed populations had only one individual of a different haplotype (evenness = 0.57). In only one population (52) were the numbers of individuals of the different haplotypes in balance (evenness = 1). The expected heterozygosity ranged from 0.06 to 0.41. Linkage disequilibrium was highly significant in all heterogeneous populations and ranged from 0.58 to 1. Overall (i.e. including all individuals), the Shannon-Wiener index was 1.19, Simpson's index 0.63, the evenness 0.73, the expected heterozygosity 0.37 and the index of association 0.53. In the analysis of molecular variance (AMOVA), the populations were well differentiated, with 79% of the variation located among populations (Table 4).
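The diversity indices for a mixed population can be reproduced with a short Python sketch. It assumes that the reported evenness is the E5 statistic, (1/Σp² − 1)/(e^H − 1), and uses a hypothetical population of ten plants in which nine carry one haplotype and one carries another; the haplotype labels and the function name are illustrative only.

```python
import math
from collections import Counter


def diversity_indices(haplotypes):
    """Return (Shannon-Wiener H, Simpson index, evenness E5) for a list of
    haplotype labels, one label per individual."""
    n = len(haplotypes)
    freqs = [count / n for count in Counter(haplotypes).values()]
    shannon = -sum(p * math.log(p) for p in freqs)  # natural logarithm
    sum_sq = sum(p * p for p in freqs)
    simpson = 1.0 - sum_sq
    if len(freqs) > 1:
        # E5 = (G - 1) / (N1 - 1), with G = 1/sum(p^2) and N1 = exp(H)
        evenness = (1.0 / sum_sq - 1.0) / (math.exp(shannon) - 1.0)
    else:
        evenness = float("nan")  # undefined for a single haplotype
    return shannon, simpson, evenness


# Hypothetical populations of 10 plants each.
mixed = ["F"] * 9 + ["A"]          # one deviating individual
balanced = ["F"] * 5 + ["A"] * 5   # two haplotypes in equal numbers

h_mixed, d_mixed, e_mixed = diversity_indices(mixed)
h_bal, d_bal, e_bal = diversity_indices(balanced)
print(round(h_mixed, 2), round(d_mixed, 2), round(e_mixed, 2))
```

Under these assumptions the 9:1 population evaluates to approximately 0.33, 0.18 and 0.57, in line with the lowest Shannon-Wiener and Simpson values and the evenness reported above for populations with a single deviating individual, while the balanced population gives an evenness of 1.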
All three Humulus species with published plastomes (H. lupulus, H. scandens and H. yunnanensis) were characterized as haplotype 'B', as were two C. sativa plastomes: a specimen from the Herbarium of the Kunming Institute of Botany, Yunnan province, China (Zhang et al. 2018b), and the cannabis fibre variety 'Yunma 7' (Deng et al. 2021). Therefore, haplotype 'B' can undoubtedly be regarded as the ancestral haplotype from which the other haplotypes in cannabis evolved (Fig. 2). None of the cannabis polymorphisms was polymorphic in the hop chloroplasts, so all mutations occurred in Cannabis after the separation of the two genera from their common ancestor. Two genebank accessions in our study belonged to haplotype 'B', one from France (no further background information available) and one from Spain designated as

Marker development
Chloroplast markers have several advantages, such as maternal inheritance. As a result, they are usually more useful for exploring genetic structure and gene flow between rather than within populations; fields of application are evolutionary studies, plant migration (biogeography) and the profiling of genotypes and gene pools. The highly significant linkage disequilibrium, which in plants is usually used to identify clonality, was to be expected for nonrecombining maternal lineages such as chloroplasts or mitochondria. In C. sativa, too, nuclear marker variability (e.g. expected heterozygosities of 0.68 (Gilmore and Peakall 2002) or 0.75 (Soler et al. 2017)) was higher than chloroplast variability (expected heterozygosity of 0.37 in our study). However, one study reported expected heterozygosities of nuclear markers below our result (0.22 to 0.32 (Lynch et al. 2016)).
Nuclear markers revealed higher intrapopulational variability in cannabis, as demonstrated with microsatellite markers, which attributed only 32% of the variation to differences between cultivars, while 37% and 31% were intra-cultivar and intra-individual, respectively (Soler et al. 2017). Chloroplast markers shift the focus to the higher level of between-cultivar variability, which accounted for 69% (Zhang et al. 2018a) to 79% (this study) of the variation. The major disadvantage of cp (or mt) markers is their limited absolute number: 38 cp markers in total based on three cannabis chloroplast genomes, in comparison to (fractional) 24,710 ncSNPs (Soorni et al. 2017) or 14,031 ncSNPs (Sawler et al. 2015). However, depending on the type of query, just a few powerful cp markers may deliver sufficient information for an intra-specific classification, e.g. for identifying different genepools (Gilmore and Peakall 2002).
Overlapping accessions in different cpDNA studies
Gilmore et al. (2007) developed 5 cpDNA and 2 mtDNA markers with good discrimination power between accessions and, with this set, identified 6 haplotypes that clustered the samples into three haplotype groups. Comparing some jointly used fibre cultivars showed that the grouping of Gilmore et al. (2007) was not the same as ours. Zhang et al. (2018a) sequenced highly variable cp regions with 5 primer pairs and identified 23 haplotypes that grouped nicely into 3

Is the origin of cannabis in the Chinese province Yunnan?
The haplotype 'B' is common in Cannabis and Humulus and must have been present in the common ancestor of the two genera. The identification of the original haplotype allowed the determination of the sequence of the eight cp mutations used here over time because of the non-recombining maternal lineages.  (Zhang et al. 2018b) and the plastome of cv. 'Yunma 7', a main cultivar in fibre production (Deng et al. 2021), bred in Yunnan province (Amaducci et al. 2015). This province has a long tradition of using cannabis (Clarke and Gu 1998) and is one of the main production areas of hemp fibre in China (Deng et al. 2021). Provided that both samples were originally collected from natural stands in Yunnan, cannabis could have had its origin in this province.
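The reasoning behind polarising the mutations with an outgroup can be sketched in a few lines of Python. The marker states below are hypothetical and do not reproduce the study's genotype table; only the property that haplotype 'B' matches the Humulus outgroup at every marker is taken from the text. The names hapX and hapY and the function count_derived are invented for illustration.

```python
def count_derived(haplotype, ancestral):
    """Number of markers at which a haplotype differs from the ancestral
    (outgroup) state, i.e. the number of derived mutations it carries."""
    if len(haplotype) != len(ancestral):
        raise ValueError("haplotype and ancestral state must cover the same markers")
    return sum(1 for h, a in zip(haplotype, ancestral) if h != a)


# Hypothetical states (1 or 2) at eight markers; NOT the study's data.
outgroup = (1, 1, 1, 1, 1, 1, 1, 1)       # state shared by the Humulus plastomes
toy_haplotypes = {
    "B": (1, 1, 1, 1, 1, 1, 1, 1),        # identical to the outgroup
    "hapX": (1, 2, 1, 1, 1, 1, 1, 1),     # one derived mutation
    "hapY": (1, 2, 2, 1, 1, 2, 1, 1),     # three derived mutations
}

# Haplotypes ordered from ancestral to most derived.
ranked = sorted(toy_haplotypes, key=lambda name: count_derived(toy_haplotypes[name], outgroup))
print(ranked)
```

Because the chloroplast lineage does not recombine, a haplotype carrying a subset of another haplotype's derived states can be placed earlier on the same lineage, which is what allows the mutations to be ordered in time.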
Cp haplotypes in the European breeding history for fibre use
In Europe, domestication of Cannabis occurred in the Copper/Bronze Age, indicating a domestication event independent of the Chinese domestication (McPartland et al. 2017). Most European fibre cultivars were derived from European landraces and consisted, at least partly, of haplotype 'F', corresponding to the 'fibre-type' haplotype '1,122,121' of Gilmore et al. (2007). The monoecious German cultivar 'Fibrimon' (three accessions in the genebank, all 'F'-type) was bred from old German origins, probably landraces, namely 'Schurig' ('F'-type) and 'Havelländer' ('A/F'-type), both originally of Central-Russian origin (Hoffmann 1961).
Since monoecy is a desired trait for fibre use, but rare in C. sativa, most of the French cultivars (e.g. 'Fibrimon 21' ('F'-type) and 'Fibrimon 56' ('A/F'-type)) go back to 'Fibrimon' (de Meijer 1995). The Hungarian variety 'Kompolti' (three genebank accessions, two 'F'-type, one 'A/F'-type) was obtained from 'Fleischmann hemp', which had its origin in Italy (de Meijer 1995). The Romanian cultivar 'Lovrin 110' ('A/F'-type) was derived from Bulgarian landraces, and the Russian 'Juso 14' ('F'-type) from 'JUS-6', a cross between a Southern origin, a Northern Russian dwarf origin and the German 'Odnodomnaya Bernburga' (de Meijer 1995). The Italian 'Eletta Campana' ('A/F'-type) originated from a selection from the Northern Italian landrace 'Carmagnola' and high-fibre strains from Germany (de Meijer 1995). In breeding, selection itself is not restricted to geographically distinct entities, but includes all materials that are useful and accessible. Chinese fibre strains, for example, were the basis for some fibre cultivars in the United States at the beginning of the twentieth century. The cultivar 'Chington' (China-Washington), extensively used by hempseed growers in Kentucky, was developed from seeds obtained from Hankow, China (Dewey 1927). The Hungarian three-way hybrid 'Kompolti Hybrid TC' also had a Chinese component (de Meijer 1995). Many cultivars used today may therefore be based on an early global exchange, which would explain the occurrence of the 'A'-type in fibre cultivars.
Accessions in genebanks are a mixture of donations from many different sources, genebank exchanges and the genebank's own collection trips. Collection trips can usually be regarded as the only reliable sources when it comes to a defined geographic origin, since donations are in most cases selected materials, collected from mostly undefined sources, planted in a field and often pollinated without isolation. This is demonstrated by the variety 'Kompolti', present in the genebank from three different donors. 'Kompolti' is usually a pure 'F'-type. One accession, however, was a mixture of the haplotypes 'F' and 'A'. Since chloroplasts are inherited only maternally, accidental cross-pollination can be ruled out. Such a mixture can only arise from the mixing of seeds. For cannabis, even collection trips were often not reliable sources, owing to the exchange of genetic materials over hundreds of years and long distances and to subsequent subspontaneous naturalization, either unintentionally or through the cultivation of illegitimate strains hidden in a natural environment (Szendrei 1998). Therefore, it is difficult to distinguish natural from naturalized populations in cannabis.
Author contributions EO laboratory analysis, data evaluation, manuscript preparation; UL concept planning, statistical evaluation, seed bank information, manuscript; DJ laboratory analysis, development of markers, manuscript (technical issues); JR planning, marker development, assay optimisation, data evaluation, manuscript; JN concept, planning, statistical evaluation, manuscript preparation.
Funding Open access funding provided by University of Veterinary Medicine Vienna. No funding received.
Data availability Data and DNA available on reasonable request.

Conflicts of interest No conflicts of interests or competing interests.
Ethical approval Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.