Looking at the big picture: worldwide population structure and range expansion of the cosmopolitan pest Ceratitis capitata (Diptera, Tephritidae)

The Mediterranean fruit fly, Ceratitis capitata (Weidemann), is considered one of the most significant tephritid pest species worldwide and is an exotic species in most of its range. Here, we investigated polymorphism at 14 microsatellite loci for a total of 126 populations of C. capitata from six geographical regions, applying network theory and cluster analyses. Analyses revealed nine distinct modules for the Central American region and one in each of the remaining five regions. Bayesian cluster analysis revealed that the highest level of genetic partitioning corresponds with the presence of 3 well-defined genetic clusters. Our results confirm the African origin for Mediterranean populations based on genetic diversity and suggest a direct invasion of C. capitata from the Mediterranean to Central-America. South American populations show links with Central-America, but also exhibit indications of direct admixture with the European cluster. Additionally, the network analysis proposes a South American origin for the Madeiran and Hawaiian flies. Cluster analysis corroborates the hypothesis of a Mediterranean origin for Australian samples. Our work provides novel insights regarding the migration history of Medfly worldwide.


Introduction
International trade has surged tremendously in the last few decades (Hulme 2009;Sardain et al. 2019;WTO 2019). The increase in trade and private travel between continents causes a higher risk of introduction of exotic species that can pose serious threats to agriculture (Paini et al., 2016). Phytophagous insects of the dipteran family Tephritidae (the true fruit flies) are considered one of the most impactful pest groups for fruits and vegetables worldwide (Prokopy et al. 1999). One species in particular, the Mediterranean fruit fly or medfly (Ceratitis capitata) is considered one of the most notorious invaders. It originates from tropical Sub-Saharan Africa (De Meyer et al. 2004), but during the past century it has spread to other tropical and more temperate regions worldwide (Arias et al. 2018;Malacrida et al. 2007;Nikolouli 2020). Medfly is now established in northern Africa, the Mediterranean, Latin America, Australia and the Hawaiian islands (Barr 2009;Bonizzoni et al. 2004;Malacrida et al. 2007) and is annually detected in California (Bonizzoni et al. 2001;Carey 1991;Carey et al. 2017).
The success of the medfly can be largely attributed to an extensive list of hosts that includes numerous fruits and vegetables from more than 50 families (Liquido et al. 1990(Liquido et al. , 2013. The medfly genome is 479 Mb and, in comparison to Drosophila melanogaster and Musca domestica, it has gene expression patterns that suggest adaptation to a phytophagous biology (Papanicolaou et al. 2016). In addition, C. capitata demonstrates a higher phenotypic plasticity towards thermal tolerance in comparison to related Ceratitis rosa, which could also contribute to its invasiveness (Nyamukondiwa and Terblanche 2010).
Eradication and pest management programs for this fly are expensive. Estimates suggest that medfly outbreaks in California have cost taxpayers nearly $500 million (Szyniszewska and Tatam 2014;Ruiz-Arce et al. 2020). In Spain, pest management of medfly in 2017 alone cost €8.5 million ($9.7 million) to the regional government of Valencia (Dalmau, communication with authors). Pest management of medfly primarily relies on Sterile Insect Technique (SIT) and insecticide use (Enkerlin et al. 2015;Klassen and Curtis 2005).
Population genetic studies can help mitigate the crop damage caused by C. capitata by disentangling the genetic structuring of medfly populations. Recently, Ruiz-Arce et al. (2020) have used mitochondrial DNA sequencing methods on 150 C. capitata localities (collection maintained by APHIS, USDA), covering largely the worldwide distribution of medfly. They revealed a total of 202 haplotypes in four different haplogroups that show a strong association to geography. Their study complements earlier mitochondrial analyses on more limited collections (Arias et al. 2018;Barr 2009).
Here we applied integrative graph theoretic methods to a smaller subset of the USDA APHIS collection analyzed by Ruiz-Arce et al. (2020). This network approach combined with modularity analysis can be a particularly helpful tool to reveal pathways of gene exchange (Fortuna et al. 2009) and can help to target specific trade routes in the prevention of further medfly invasion.

Methods
A total of 1214 specimens of C. capitata from 126 collections (with numbers of specimens per sample ranging from 1 to 40) have been collected between 1956 and 2016 across 46 countries distributed across six main geographical regions: Sub Saharan Africa (SSA), Europe/Mediterranean (MED), Central (CAM) and South America (SAM), Australia (AUS) and Hawaii (HAW) (see Table S1 for locality, coordinates and year of collection for each population). The geographic regions used in this study are in congruence with delimitation used in previous studies (Barr 2009;Ruiz-Arce et al. 2020). The fly specimens used for this study are a subset of the USDA APHIS collection genotyped in Ruiz-Arce et al. (2020) comprising 1592 flies of 144 fly collections and is based on the patterns observed in the study of Ruiz-Arce et al. (2020). Flies were collected in traps and were identified by the collectors.

Analysis of genetic diversity
The R (R Core Team 2019) package adegenet 1.4-2 (Jombart 2008) was used to quantify population genetic variability and differentiation, including number of alleles per locus (N a ), observed and expected heterozygosity (H o , H e ). A Kruskal-Wallis test was performed using the stats package (R Core Team 2019) in R to assess differences between the geographic regions in regard of N a , H o and H e . The Benjamini-Hochberg method was applied to correct for the false discovery rate introduced by multiple testing. The absolute number and relative frequency of private alleles at each locus was calculated in R (R Core Team 2019) using the allele.dist (PopGenReport package, Adamack and Gruber 2014) and private_alleles (poppr package, Kamvar et al. 2014) functions respectively. Private alleles were calculated for populations as well as for the main geographical regions. A genotype accumulation curve was plotted using R package poppr 2.8.3 (Kamvar et al. 2014) to determine whether enough microsatellite loci were used to allow discrimination between individuals. Deviations from Hardy-Weinberg equilibrium (HWE) and linkage disequilibrium were tested for each population across each pair of loci using the log likelihood ratio statistic implemented in the genepop package for R (Rousset et al. 2017; R Core Team 2019). Assessment of significance was performed through Markov-chain randomizations based on 10,000 dememorizations, 20 batches, and 5000 iterations per batch. The same software was used to estimate null allele frequencies (per locus and population).

Structuring of genetic variation
In order to visualize the partitioning of genetic variation among the individuals and geographical regions a principal component analysis (PCA) was carried out using the ade4 package (Dray and Dufour 2007). For this, data was first scaled and centered using the dudi.pca function and five axes were retained (dudi.pca function, ade4 package, Dray and Dufour 2007). In association with the PCA, an analysis of molecular variance (AMOVA) was performed using Genalex (Peakall and Smouse 2012) to assess hierarchical partitioning of the observed genetic variation displayed in the PCA to the level of individuals, populations or the six geographical regions. The number of permutations for the AMOVA was set to 1000. STRUCTURE 2.3.4 (Pritchard et al. 2000) was used to calculate individual admixture coefficients (Q) across the entire dataset (n = 1462). Analyses were based on the admixture model (individuals were allowed to have mixed ancestries from different clusters) with correlated allele frequencies (allele frequencies in different clusters were likely to be similar due to migration or shared ancestry) and the parameter of the Dirichlet distribution of allelic frequencies (k) separately inferred for each population. The optimal number of clusters (K) was inferred following Evanno et al. (2005). Since this method only detects the uppermost level of population structure when different hierarchical levels exist, we further investigated the genetic substructuring of our dataset up to K = 6. For each value of K, three iterations were run for 500,000 generations (with 100,000 generations as burn-in) and the posterior estimates of cluster memberships were visualized using the pophelper (Francis 2017) package implemented in R (R Core Team 2019).

Estimation of network structure
A network approach using georeferenced populations generates an accurate visual representation of the genetic similarity between populations and partitioning of genetic variation (Dyer and Nason 2004;Guimerà and Amaral 2005;Magwene 2001) on a global scale. Since the assembly of a genetic network relies on genetic variability for the estimation of the incidence matrix, we only included samples containing a minimum of 5 specimens (Dyer and Nason 2004;Magwene 2001). This resulted in a subset comprising 95 populations and a total of 1122 specimens included for the network analyses. In population networks, populations are indicated by nodes that can be interconnected by edges or links. Some populations are only poorly connected to others and therefore several links are redundant when describing population connectivity. Based on conditional independence, it is possible to detect these redundant edges and simplify the population graph without losing information on the genetic covariance structure among populations (Dyer and Nason 2004;Fortuna et al. 2009;Magwene 2001). Edge exclusion deviance is a theoretic measure to calculate whether an edge should be excluded from a fully saturated population network and follows several steps (Magwene 2001).
First, the multilocus Euclidean genetic distance matrix was computed as in Fortuna et al. (2009). It is a Euclidean genetic distance matrix giving more weight to rare alleles. The covariance matrix was calculated from the matrix of distances as in Dyer and Nason (2004) while the steps followed to compute the partial correlation matrix from the covariance matrix are the same as in Magwene (2001). The threshold value for edge exclusion deviance (EED) was calculated for each edge in order to omit edges that are not conditionally independent. Finally the strength (s ij ) of genetic similarity represented by the retained edges was calculated as: where r ij is the partial covariation coefficient of the edge. The package graph4lg (Savary 2020) implemented in R (R Core Team 2019) was used to build the final matrix used to produce the population graph (Savary 2020). For more in depth information on the procedure to estimate the conditional independence structure of the genetic covariance see Fortuna et al. (2009).

Modularity analysis
Here, we apply a simulated annealing approach to our population graph to explore different possible states of the population structure (Guimerà and Amaral 2005). We ran 100 replicates of the Guimera and Amaral algorithm for Newman's M and then obtained the maximum value of M (Newman 2006). Modularity M can vary between 0 for non-modular networks and 1 for networks with strong modularity, consisting of dense modules with tightly linked nodes and the absence edges between modules (Newman 2006). The final population graphs displaying the different modules were constructed using the igraph package (Csardi and Nepusz 2006) in R (R Core Team 2019). The population graph is presented in two different ways. In a first layout, populations are georeferenced while a second representation of the population graph positions the nodes using the Fruchterman-Reingold force-directed algorithm (Fruchterman and Reingold 1991). While the geographical layout renders a clear view on the genetic similarity among continents, the Fruchterman-Reingold network is unconstrained by geography and can display more tightly linked populations in closer proximity (Fruchterman and Reingold 1991).

Results
The number of scored multilocus genotypes reached a plateau after seven sampled loci, indicating that the genetic variability of C. capitata was adequately sampled by the 14 microsatellites markers used in this study (Fig. S2). After scoring genotypes, our dataset contained 2.91% missing alleles.
The 126 C. capitata populations (n = 1214) yielded a total number of alleles per locus ranging from 10 (medflymic92) to 34 (medflymic43) with an average of 17.6 ± 6.5 alleles per locus. The total number of alleles ranged from 105 (SUD_Sin) to 17 (GUA_San) with an average of 45 (SD = 16) (Table S2). Expected heterozygosity (H e ) ranged from 0.143 (GUA_San) to 0.755 (SUD_Sin) with an average of 0.479 (SD = 0.118), while observed heterozygosity (H o ) ranged from 0.171 (ECU_Tun) to 0.714 (MAL_Lil) with an average of 0.476 (SD = 0.106) (Table S2). Sub-Saharan Africa had a higher N a , H e and H o value compared to any of the other regions (P B 0.05) ( Table 1). Significance of pairwise differences among regions of N a , H e and H o are summarized in Table 2. The average frequency of population-wise private alleles for each region was 0.091, 0.182, 0.107, 0.160 and 0.235 for Hawaii, Australia, the Mediterranean, South-America and Central America respectively, with the Sub-Saharan region scoring the highest with a value of 0.941 (Table S2). Region-wise private alleles were highest for Sub-Saharan Africa (74) and are summarized in Table 1.
Pearson's Chi-squared test showed significant HWE deviations in 170 of 1764 locus-population combinations, corresponding to 9.64% of observations. HW disequilibrium tends to be more present in populations with only a handful of samples, but not restricted to a certain marker only. Linkage disequilibrium was observed in 206 out of 11,466 perpopulation pairwise tests, corresponding to 1.8% of comparisons.

Genetic structuring
Eigenvalues of the first three principal components resulting from the principal component analysis (PCA) were 8.56 (33.10% of the variance), 5.21 (20.04% of the variance) and 4.90 (18.87% of the variance) respectively with a drop to 3.94 for the eigenvalue for the fourth principal component (Fig. 1). Sub-Saharan African populations formed a well-defined cluster showing no overlap with populations belonging to other regions along the first three principal components. Australian populations form a separate cluster along the first and third principal component showing the closest proximity to the Mediterranean cluster (Fig. 1a).
Results from the AMOVA show that most of the genetic variation could be attributed to variation within the populations (68% , Table 3). Genetic differences among populations contributed 15% to the total genetic variation while genetic differences among the six regions comprised 17% of the total genetic variation ( Table 3).
The STRUCTURE analysis of the 1214 specimens showed a clear DK peak at K = 3 (DK = 838.46, Fig. S3) while average DK for K values up to 6 was 197 (SD = 325) indicating that the main hierarchical level of population structure is based on three genotype groups (Fig. 2). Individual Bayesian cluster assignment for K = 3 largely reflected the results of the ordinations in the multivariate space of the PCA, recognizing Sub-Saharan African populations as one genetic entity (Figs. 1 and 2).
Within Sub-Saharan Africa, the genetic structuring is markedly homogeneous with the notable exception of MAU_Wes which exhibits extensive admixture with populations from the Mediterranean (Fig. 2). All three other sampling locations within Mauritius (MAU_Eas, MAU_Nor and MAU_Sou) do not show elevated signs of European admixture and are primarily assigned to the Sub-Saharan region (Fig. 2).
South American and Hawaiian samples are largely assigned to one genotypic cluster, distancing them from Australian, Mediterranean and Central American samples and agreeing with the PCA where South American and Hawaiian populations are located in close proximity and show a large overlap regarding their 95% inertia ellipses (Fig. 1b).

Network of spatial genetic variation
Of the 4465 possible links between nodes, only 325 were retained after applying edge deviance exclusion (Figs. 3 and 4). Node size in Fig. 3a correlates with the participation coefficient of that node, i.e., the number of links the node shares with nodes of a different module. The node size in Fig. 3b is correlated with the number of alleles within the respective population. In Fig. 4, node size correlates with the within-module degree of the node, i.e. the number of edges it shares with nodes within its own module. The total number of intraregional links was 259 (equal to 79.68% of the total number of links) while 66 links (equal to 20.32% of the total number of links) connected populations from different regions. The average number of intraregional links per population was highest for the Sub-Saharan region (SSA: 3.97, SAM: 2.76, CAM: 2.53, EUM: 2.25, AUS: 2.36, HAW: 2.00).
The analysis of modularity revealed 7 distinct modules with a total network modularity of 0.465 (Figs. 3 and 4). One module (dark blue) exclusively Table 1 Average values for number of alleles (N a ), observed heterozygosity (H o ) and expected heterozygosity (H e ) for every region with the number of samples included for every region in the first column The Mediterranean region shares only four links with the Sub-Saharan region through Turkey (TUR_-Bom) which is on its turn is connected with three other populations in the Mediterranean (Figs. 3 and 4). Of the eight links that connect to Madeira, six are shared with the South American continent. Central America shares 12 links with both the Mediterranean region and South America, one link with Africa (SEN_Thi) and two with Australia (AUS_Pad and AUS_Gin) (Figs. 3 and 4).
Australia shares eight links with the Mediterranean region (of which AUS_Jar, AUS_Har and AUS_Car show a link with ITA_Cut), six with South America (AUS_Kat has a link with BOL_San, Per_Lim, ECU_Imb and ECU_Ecu) and two with Central America (Figs. 3 and 4).
Hawaiian populations structure in one tight cluster consisting of 14 intraregional links of which 11 have Fig. 3 Georeferenced C. capitata population networks with node size corresponding to the participation coefficient, i.e. the number of links that node shares with nodes of another module in a and node size corresponding to the number of alleles in b. Line thickness correlates with genetic similarity. The node color represents the module assignment an edge strength in the top 5% quantile of the distribution, indicating high genetic similarity. Only two links are shared with populations outside the Hawaiian archipelago, one with Madeira and another with South America.

Discussion
In this study we have applied an integrative approach, combining network theory and modularity analysis with more conventional population genetic approaches such as principal component analysis (PCA) and Bayesian clustering (Pritchard et al. 2000) to complement earlier work on the worldwide phylogeography of Ceratitis capitata (Bonizzoni et al. 2000;Malacrida et al. 2007  Fruchtermann-Reingold representation of C. capitata the population network with node size relative to the node's within-module degree, i.e. the number of edges the node shares with nodes within its own module. Line thickness correlates with genetic similarity. The node color represents the module assignment C. capitata from Sub-Saharan Africa to other parts of the world. We must stress the fact that our sampling was spread over several decades and included a large number of populations from multiple sources. For this reason, in order to reduce the risk of possible biases, below we focus on the general, large-scale, regional patterns while avoiding inference at finer spatial and temporal scales.

Sub-Saharan Africa
Our results confirm the African origin of C. capitata with Sub-Saharan Africa possessing the largest amount of genetic variation. This is in line with Ruiz-Arce et al. (2020) who found the largest number of mitochondrial haplotypes in this region and results of Karsten et al. (2015) who found the highest allelic richness in the African continent using 11 microsatellite markers. The presence of a large amount of private alleles within Sub-Saharan Africa highlights the high genetic diversity in this region (Fig. S2). Our findings agree with the results of Malacrida et al. (2007) who recognized Kenya is a reservoir of high genetic polymorphism. Only three African countries in our dataset (Sudan: SUD_Sin, 105 variants; Benin: BEN_Kwa, 95 variants and Ghana: GHA_Mam, 93 variants) surpass Kenya in respect of allelic richness.
Sub-Saharan samples form a well-defined cluster in both the PCA (Fig. 1), the structure analysis (Fig. 2) and the analysis of modularity (Figs. 3 and 4), indicating a strong connection between populations from this region.
C. capitata is considered as an introduced species to Madagascar, Comoros, Mauritius and La Réunion (CABI 2021) where it was established in La Réunion by 1939 (White et al. 2000) and in Mauritius by 1942 (Seewooruthun et al. 2000). The source of these populations has not been determined but our network analysis revealed links of significant genetic similarity between Réunion and Turkey, Mozambique, South-Africa, Botswana, Zambia, Cameroon, Benin and Senegal (Figs. 3 and 4). Likely, the genotypes of Medfly populations on Mauritius and Réunion are an assemblage of several genotypes of different origins.
The marked admixture of a Mauritian population (MAU_Wes) with the mainly European haplotype and the links between Réunion, the Mediterranean and South-Africa in the population network (Fig. 3) could be a reflection of the unique position of these islands along the naval routes to the Far East and Cape of Good Hope (source: transportgeography.org). Similarly, the direct link between South-African populations and west Africa (Senegal, Ghana, Togo and Benin) could be attributed to the presence of naval trading routes connecting Cape of Good Hope with western Africa (Rodrigue 2020). Global travel has undoubtably facilitated transport of pest species outside of their native range, leaving genetic signatures that can be traced by population genetic analyses (Hulme 2009;Karsten et al. 2013;Malacrida et al. 2007). It has to be noted that the small sample size for each of the Mauritian populations (\ 5 individuals) might have influenced the admixture patterns we observed.
Genetic differentiation between the African mainland and islands was already observed in tephritid fruit flies such as the melon fly, Zeugodacus cucurbitae (Coquillett 1899), (formerly Bactrocera cucurbitae, see Virgilio et al. 2015), C. rosa (Virgilio et al. 2013) and C. capitata (Bonizzoni et al. 2000;Malacrida et al. 1998) suggesting that geographical isolation can cause considerable genetic divergence between mainland and islands compared to the homogenizing gene flow from the mainland to the islands (Malacrida et al. 2007). More exhaustive sampling on the islands of Réunion and Mauritius is needed to investigate the sources of genetic variation in more detail.

Europe/Mediterranean
Medfly arrived in the Mediterranean as early as 1842 and is considered as an invasive species in large parts of temperate Europe (Malacrida et al. 2007). Europe has a strong link with the African continent through colonialism from the nineteenth century until decolonization after the world war, boosting trade among continents (Macke et al. 1977) and likely facilitating the movement of C. capitata.
After the Sub-Saharan African populations, the European populations exhibit the second highest average allelic richness, although it is significantly lower than in Sub-Saharan Africa (Tables 1, 2 and  Table S2). This supports the migration of C. capitata out of Africa which was associated with loss of genetic variation by drift and changing selective pressures such as seasonality and temperature changes (Bonizzoni et al. 2000;Malacrida et al. 1992). Loss of genetic variability is often associated with biological invasions due to founder effects (Arias et al. 2018;Lockwood et al. 2009). Our findings are in line with findings of Ruiz-Arce et al. (2020) who showed that the Mediterranean region had the second highest number of mitochondrial haplotypes after the African region. These results, however, are not from truly independent data sets because the flies included in our study were also included in the Ruiz-Arce et al. (2020) study.
It has been suggested that C. capitata found its way into Europe through the Iberian peninsula and subsequently spread further into Western Europe (Bonizzoni et al. 2004;Malacrida et al. 2007). Bayesian cluster analysis shows considerable admixture with the African region for all of the Iberian populations, supporting the Iberian point of entry into Europe. Interestingly, a population in Turkey (TUR_Bom) exhibits the same signs of admixture (Fig. 2). Moreover, our network analysis confirms the link between Turkey and the African continent with which it shares five links (REU_Lig, MOZ_Mit, SEN_Thi, SOU_Wol and TOG_Ago). This finding suggests an eastern Mediterranean entry point that might be considered complementary to the Iberian route hypothesized in Bonizzoni et al. (2004) and Malacrida et al. (2007). Additional data and focused hypothesis testing are now needed to provide more details about the intriguing invasion history of the Mediterranean.
Remarkably, the extensive admixture between South American and Madeiran populations points towards a secondary introduction of C. capitata from South America into Madeira. As shown in Ruiz-Arce et al. (2020), haplotype M065 is shared between South American, Hawaiian and Madeiran populations, corroborating our results.

South America
First records of C. capitata in the New World data back to 1905, Brazil (Malacrida et al. 1998). The South American samples form a relatively uniform group in the population graph with the exception of some populations in the north of the continent (ECU_Imb, ECU_Pec, ECU_Tun and COL_Bel) that share the same module with flies from Costa Rica and Panama (Figs. 3 and 4). Especially COL_Bel in Colombia exhibits extensive admixture with Central American genotypes (Fig. 2) and shows links with Costa Rica and Guatemala (Figs. 3 and 4). The population graph (Figs. 3 and 4) shows 12 links between Central-America and northern South America, indicating considerable gene flow between these regions. Similarly, Ruiz-Arce et al. (2020) did not recover the common M001 (81 individuals) and the closely related but less common M040 (two individuals) and M041 (one individual) haplotypes of South America in the Northern part of South America (Ecuador, Venezuela and Colombia), further pointing towards a modest barrier for gene exchange between the north and the rest of the South American continent (See Table S2 in Ruiz-Arce et al. 2020).
Interestingly, a population in Brazil near São Paulo (BRA_Pir) seems to function as a connector hub between Eastern European and South American populations. The direct influx of genetic variants belonging to the European cluster could explain why BRA_Pir contains the largest number of alleles (57) of all South American populations (average = 44, SD = 8.03) ( Table 2).

Central America
Medfly was first discovered in Central America in 1955 in Costa Rica (Gallo et al. 1970). Central American samples are genetically closest to the European/Mediterranean samples which is reflected by the large number of links between the two regions ( Figs. 3 and 4). Additionally, the large proportion of admixture with the Mediterranean genotype when investigating admixture plots points to an association between Central America and the European continent (Fig. 2). Ruiz-Arce et al. (2020) also suggested introduction of Mediterranean haplotypes into Central America. In their study, Central American samples predominantly showed the M007 haplotype which is only shared with Mediterranean populations and one individual from Senegal. Two of the private haplotypes (M005 and M008) for the region only differed by two and three bases respectively from the M007 haplotype and likely arose within the region (Ruiz-Arce et al. 2020). Another study by Nikolouli (2020) corroborates the link between the European continent and Central America and shows clear clustering of four European populations with four Central American populations, based on eight microsatellite markers. Additionally, a study by Karsten et al. (2015) showed that Guatemala was likely colonized from Greece. Although only one Central-American and one European mainland population was used in Karsten et al. (2015), this finding supports the European origin of Central-American flies.
Although Central America is small in size, the presence of two different modules that are almost equally represented suggests the presence of genetic structuring beyond the regional level. This is in contrast with findings of Ruiz-Arce et al. (2020) who found no significant structuring beyond the regional level using mitochondrial DNA. More exhaustive sampling in Central America could reveal more details concerning the partitioning of genetic diversity in that area.

Hawaii
Medfly first colonized Western Hawaii around 1895 (Bonizzoni et al. 2004;Dominiak and Daniels 2012). Bayesian clustering clearly indicates a South American origin for the Hawaiian samples, largely agreeing with the results of Ruiz-Arce et al. (2020) who showed that one haplotype (M065) was primarily shared by Hawaiian and South American populations. Furthermore, results by Nikolouli (2020) suggest a high genetic similarity between populations from Hawaii and South America, applying a similar Bayesian clustering analysis.
Hawaiian populations form a tightly linked cluster with a central hub (HAW_Ala) (Fig. 4). HAW_Ala is situated in close proximity (9 km) to the international airport of O'ahu island which might suggest that HAW_Ala served as the first point of entry of C. capitata into the Hawaiian archipelago from which it has spread to the other Hawaiian Islands. Further investigations in respect of fruit transport across the archipelago is needed to unravel the invasion history and further assess the special status of HAW_Ala.

Australia
Medfly has been accidentally introduced into Australia at several times in history. The first record of medfly in Australia dates back to 1895 in the area of Perth (Sproule et al. 2001;Dominiak and Daniels 2012). Since then, sporadic medfly outbreaks have occurred between 1952 until 1993 after which medfly has been captured continuously (Bonizzoni et al. 2004). Medfly has expanded its range northwards from Perth, where horticultural crops and the subtropical climate provide an ideal environment for medfly expansion (Sproule et al. 2001;Bonizzoni et al. 2004).
Our results suggest a secondary colonization event from Mediterranean origin into Australia, in line with the results of Bonizzoni et al. (2004) and Karsten et al. (2015). Especially connections with Central to Eastern Mediterranean populations (Croatia, Montenegro, Greece and Israel) and Tunisia are well represented in the network graphs (Figs. 3 and 4). Furthermore, Bayesian clustering provides a strong indication for a Mediterranean origin of the Australian samples (Fig. 2). Arias et al. (2018) who did not include South American flies, revealed a similar result with most of the Australian samples used in the study sharing the large Cc_21 haplotype with several countries bordering the Mediterranean basin. Conversely, in the study of Ruiz-Arce et al. (2020), the predominant mitochondrial haplotype (M009) in Australian specimens is shared to a large extent with South-American and only to a lesser extent with Mediterranean flies. Further investigation is needed to assess the impact of direct gene exchange on Australian population dynamics. The links shared by South America and Australia (AUS_Kat) might be either related to the genetic similarity between a subset of the South American countries (Brazil, Peru and Ecuador) and the Mediterranean samples or might indicate direct gene flow between Australia and South America. The significantly lower H e and H o compared to all other regions and relatively low N a (37.09) suggest that introductions of C. capitata in Australia mainly stem from genetically similar regions (Tables 1 and 2).
In summary, presented here is a description of the worldwide population structuring and range expansion of C. capitata. We recovered significant genetic groups using three different approaches, revealing a high association between genetic structuring and geography. We hypothesize an African origin for Mediterranean medfly populations from where flies have been introduced in other regions (Australia and Latin America). We uncovered a clear association between the Mediterranean and Central American populations, suggesting a range expansion into Central America directly from the Palearctic rather than colonization of Central America by South American flies. Additionally, we revealed genetic divergence between northern and southern Latin American populations. We caution the reader to consider that while diversity can support colonization time estimates, it is best interpreted together with other information such as historical reports. This work, together with the contribution by Nikolouli (2020), who used eight microsatellite loci, the work by Karsten et al. (2015) and the more recent study by Ruiz-Arce et al. (2020) who have both used mitochondrial DNA sequencing methods, render the most current view on medfly genetic diversity, geographic genetic structure and invasion pathways. Although we consider the current dataset as an informative instrument for investigating medfly invasion history, we note that it will benefit from resampling of populations and new collections to stay relevant in the future. This is certainly the case for populations that are only represented by a handful of samples in this study. In 2016, a high-quality genome assembly for C. capitata has been achieved which could serve as a critical tool for surveying the medfly genome for the invasion potential of different populations and in developing more informative invasion pathway estimation tools.
Authors' contributions Drafting of the manuscript: all authors. Acquisition of data: D. T. N. Analysis of data: D. P.
Funding Funding for this project was provided by the EU-FFIPM project (Grant ID: 818184).
Code availability R packages and software used are clearly indicated throughout the manuscript.

Declarations
Conflict of interest The authors declare there are no competing interests. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA.

Consent for publication All authors agreed to this publication.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.