Background

Many studies in livestock have exploited variation at the sequence level to understand population-scale diversity and for the genetic improvement of livestock. However, most of these studies were restricted to single nucleotide polymorphisms (SNPs, or single nucleotide variants—SNV), and small INsertions/DELetions—INDEL (< 50 bp) that can be detected confidently with short-read sequencing. Genomic variations that involve a longer segment of DNA, i.e. more than 50 bp, are referred to as structural variants (SV) [1] and have not yet been extensively studied in livestock, and particularly not at the genome-wide and population scales. In general, there are two types of SV, either balanced (such as inversions or translocations), or unbalanced (such as insertions, deletions, or copy number variations [CNV]). Previous studies on the human genome have estimated that structural variations represent a proportion of the total genome that could be equal to or exceed that of SNPs and small INDEL [2, 3]. In the bovine species, ~ 3.1% (94.4 Mb) of the genome was estimated to consist of segmental duplications (≥ 1 kb long and with ≥ 90% sequence identity) [4] and these regions typically harbour many CNV [5]. A later analysis has shown that up to 10% of the bovine genome may contain deletions and tandem duplications [6]. A study that was published in 2021 [7] assembled a pangenome from only six bovine genomes and revealed 70.3 Mb of non-reference SV when compared to the standard bovine reference genome (assembled from a single animal).
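The size thresholds above can be written as a small sketch. It covers only sequence-resolved alleles, where the reference and alternate sequences are both known, so balanced SV such as inversions or translocations fall outside it. The function name and the choice to count a length change of exactly 50 bp as an SV are assumptions made for illustration; the text leaves that boundary open.

```python
SV_MIN_LENGTH_BP = 50  # size threshold taken from the text; treating exactly 50 bp as SV is an assumption


def classify_by_size(ref_allele: str, alt_allele: str) -> str:
    """Classify a sequence-resolved variant by the length change between its alleles."""
    if len(ref_allele) == 1 and len(alt_allele) == 1:
        return "SNV"
    length_change = abs(len(alt_allele) - len(ref_allele))
    if length_change == 0:
        # e.g. a multi-nucleotide substitution; not discussed in the text
        return "other"
    if length_change < SV_MIN_LENGTH_BP:
        return "INDEL"
    return "SV"


print(classify_by_size("A", "G"))             # single-base substitution
print(classify_by_size("A", "ATT"))           # 2 bp insertion
print(classify_by_size("A", "A" + "T" * 50))  # 50 bp insertion
```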

Structural variation in the genome can have a direct or indirect influence on both complex and Mendelian phenotypic variation through multiple mechanisms, such as the alteration of the DNA sequence in regulatory or functional gene regions [8,9,10]. In spite of their importance, SV have received far less attention than their smaller mutational counterparts, mainly because such regions are difficult to characterise with short-read sequencing technology, and they have been described as biological dark matter [11]. Since the advent of high-throughput genomics in the early 2000s, multiple attempts, mainly using short-read technology, have been made to characterize SV that may be causal variants for defects, diseases, or other traits in the major livestock species that have good quality reference genomes (Table 1). Interestingly, some of the CNV detected by analysing short reads have begun to be included on SNP arrays; however, the use of SNP arrays to characterize or discover SV is outside the scope of this study. While short-read technology (also known as 2nd generation sequencing) has provided a cost-effective and accurate means of detecting small variants (< 50 bp), the limited length of the reads has made it technically challenging to accurately detect large SV as well as SV located in tandem repeat-rich regions. The so-called 3rd generation sequencing technologies (or long-read sequencing) are much better suited to directly identifying SV [12]. Recent studies have highlighted that a substantial proportion of previously hidden structural variation can be discovered with long-read sequencing [7, 13], thanks to technological advances that give read lengths many fold greater than those of 2nd generation sequencing (typically longer than 10 kb).
Although in the past, the per base accuracy of 3rd generation long-read sequencers was not comparable with that of Illumina short-read sequencing [11], the ongoing development of cutting-edge chemistry [14] as well as software development [15] are rapidly addressing this issue. In addition, improvements in dry/wet lab methods have been published over recent years to promote the use of long reads that improve the continuity, accuracy, and range of variant calling/processing as well as de novo assemblies [16].

Table 1 Structural variant discoveries using a “focused approach” in livestock, using either short-read (SR) and/or long-read (LR) sequencing

To date, the main focus of the SV investigations in livestock has been the characterization and application of CNV [4, 5, 17,18,19,20,21,22]. In general, there has been strong interest in the discovery of SV in livestock (see Tables 1 and 2). As a direct result of the technological limitations of short-read sequencing as well as the cost of building large reference populations with long-read sequences, currently two key elements for the detection of SV in livestock are missing:

  (1) Genome-wide population scale SV discovery and imputation.

  (2) Studies to determine associations between genome-wide SV and quantitative traits (a previous attempt using short-read information highlighted the difficulties of this approach [23]).

Table 2 Structural variants detected in livestock based on two discovery approaches (RS: resequencing or PG: pangenome approach) with either short-read (SR) or long-read (LR) sequencing technology

Curation of large reference populations with long-read sequences is essential to address both elements (1) and (2). Cataloguing SV and their frequency spectrum in each population using long-read technology is a critical first step towards: understanding the extent of this variation, imputing SV into larger genotyped populations, and undertaking further downstream research (e.g., interpretation of breed diversity, association with a range of phenotypes such as disease susceptibility, environmental adaptation, etc.). It is important to mention that due to differences in the structure of breeding programs from one species to another, the strategies to deploy genetic improvement can be specific to each type of livestock. However, the overarching framework is still most likely to be “Discover + Impute \(\Rightarrow\) Impact”.

Previously, in 2014, the landscape of SV in livestock as well as the challenges in this field of study were reviewed [22]. However, with the rapid advances in long-read sequencing since then, as well as the recent progress in the field of bioinformatics, we consider that it is timely to provide here updated perspectives on:

  (1) The progress of the methods and strategies for genome-wide SV discovery in livestock species where genomic tools are routinely available (cattle, sheep, goats, pigs, and chicken).

  (2) The challenges and prospects for population-scale discovery and application of genome-wide SV for livestock breeding.

In the last decade, the development of technologies for 2nd generation sequencing has been dominated by Illumina. Their sequencing technology is highly cost-effective with high base-calling accuracy and well supported downstream analysis tools and pipelines [24]. Another advantage of 2nd generation sequencers is that the library preparation itself does not require high-quality DNA. Libraries can be prepared with short DNA fragments, even ancient DNA that is highly degraded. However, the key technical feature of 2nd generation short-read sequencers is that they only provide reads with a limited read length: generally, less than 300 bp. These short reads have minimal potential to identify (i) large SV, because the short reads that are derived from them are difficult to accurately map to a reference genome, and (ii) SV within repetitive sequences such as large segmental duplications, which may not be resolved with short-read mapping algorithms. It should be noted that even for the discovery of small variants in chromosomal regions with large segmental duplications, short reads result in much lower accuracy than long reads because of difficulties of their alignment in these regions [25].

In an effort to improve the detection of SV using short reads, several studies have relied on technologies that create “virtual long reads” to increase the effective read length, such as: mate-pair reads [26, 27], linked-read technologies from 10X Genomics [28], MGI single-tube long fragment read (stLFR) [29], or Illumina’s long-read sequencing assay, complete long read (CLR), which had only recently been announced when this manuscript was written (November 2022). These approaches can theoretically extend read length while maintaining the low base call error rate and cost efficiencies. However, many of these technologies are still under development and can be considered as “advanced short-read sequencing” instead of “long-read native DNA sequencing”. In addition, in the last few years, multiple studies have combined short-read sequencing with several other add-on technologies, for example, long-read sequencing, optical mapping (Bionano Genomics) or Hi-C sequencing techniques, to greatly enhance the ability to find and validate SV at the genome level [30,31,32,33].

Evolving from short-read sequencers, the development of 3rd generation sequencers began in the early 2000s with key competitors including Pacific Biosciences (PacBio) with single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT) developing nanopore sequencing. Although they belong to the same wave (3rd generation sequencers), PacBio and ONT differ widely in their principle of action. Nanopore sequencers measure the ionic current fluctuations when single-stranded nucleic acids (DNA/RNA) pass through biological pores (so-called ‘nanopores’) [34]. The read lengths with ONT vary with the input fragment lengths; therefore, the term “N50” is often used to describe the read length at which 50% of the data is contained within reads with lengths greater than or equal to the N50 value. Typically, ONT sequencing achieves an N50 of tens of kb and it is possible to reach maximum lengths of several Mb (the longest recorded is 4 Mb [35]). In contrast to ONT, PacBio sequencers detect the fluorescence emitted as a polymerase tethered to the bottom of a well incorporates labelled nucleotides, and use this signal to infer nucleic acid sequences [36]. Their high fidelity (HiFi) read lengths are typically in the tens of kb (10–25 kb, [37]) with very high accuracy. At the time this article was written, through various optimizations in the workflow, PacBio HiFi reads have achieved a per base quality score accuracy that nearly equals that of Illumina short reads [38]. In the past, several studies in the field of genomics have reviewed long-read sequencing technologies, their opportunities and limitations [11, 12], as well as performed benchmarking across multiple technologies [39]. Undoubtedly, now and in the near future, these technologies will continue to be developed to further increase yield, base call accuracy, and maximum read length while reducing overall sequencing cost [40].
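As an illustration of the N50 statistic described above, the following minimal sketch computes a read-length N50 from a list of read lengths. The function name and the example read lengths are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def read_n50(read_lengths):
    """Return the read-length N50.

    Reads are sorted from longest to shortest and their lengths are
    accumulated; the N50 is the length of the read at which the running
    total first reaches half of all sequenced bases.
    """
    if not read_lengths:
        raise ValueError("read_lengths must not be empty")
    total_bases = sum(read_lengths)
    running_total = 0
    for length in sorted(read_lengths, reverse=True):
        running_total += length
        if 2 * running_total >= total_bases:
            return length


# Hypothetical read lengths in bp (100,000 bases in total).
example_lengths = [2_000, 5_000, 8_000, 20_000, 30_000, 35_000]
print(read_n50(example_lengths))  # 35,000 + 30,000 >= 50,000, so this prints 30000
```

Because the statistic is weighted by bases rather than by read count, a few very long reads can raise the N50 well above the median read length.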

With long-read sequencing, there are currently two major approaches to detect genome-wide SV in multiple individuals, the first uses the “assembly” method to generate a “pangenome”, and the second uses a so-called “resequencing” approach, with the potential to combine both:

  (1) The assembly/pangenome method generally applies a de novo assembly approach to the sequence of each individual (i.e., no prior reference genome is used for alignment) and aims at generating a haplotype-resolved pangenome. The de novo approach enables SV to be identified using a compare and contrast method between multiple assemblies and removes the inherent bias when using a reference genome from a single individual of a particular breed. The aim of a pangenome approach is ultimately to provide a new reference genome that is not limited to a single individual but encompasses a much broader range of the structural variation that exists across a species. Thus, the approach is generally undertaken with a limited number of individuals each from diverse populations (e.g., breeds). In addition to providing a pangenome reference, this expands the knowledge on the extent of unique structural variation across diverse individuals and enables a more complete annotation of genes and transcripts using long-read sequencing [41]. For the bovine species, the Bovine Pangenome Consortium (BPC) has begun important work in creating a pangenome using individual animals from very genetically-diverse breeds, sub-species and species while also cataloguing the extent of SV discovered (https://bovinepangenome.github.io/).

  (2) On the other hand, the resequencing approach uses sequencing reads from an individual and aligns these against a specified reference genome that is generally derived from a single individual. Following alignment, the sites that differ between the new and reference sequence can then be assessed at an individual level as well as at a population level. In general, the key aim of the re-sequencing approach is to detect variation in a significant number of individuals (potentially all from the same population) with a view to then linking the genomic variation to specific phenotypes and evolutionary processes.

  (3) Ideally, in the foreseeable future, the reference genome for the resequencing approach can be assembled from multiple animals and will be either population (breed) specific or a pangenome. Although software tools have been developed to align reads and call variants using a pangenome reference (e.g., PanGenie [42], vg [43], and Giraffe [44]), improved efficiency and compatibility are required for this to become feasible at the population scale with long-read sequences [41].

Due to the exacting sequence quality requirements for de novo haplotype-resolved assembly, the accuracy of SV discovery from the pangenome will outperform the re-sequencing approach [45]. However, the assembly approach will be considerably more costly on a per individual sequence level compared to the re-sequencing approach because: (i) de novo assembly requires high-sequencing depth (50–60× with older long-read technologies, and trending towards 20–30× with latest releases), while the re-sequencing approach may compromise with lesser coverage (Nguyen et al., unpublished); (ii) ideally the parents of the individuals used for the pangenome assembly are also sequenced (often with short-read technology) to enable the required resolution of haplotypes; and (iii) the additional sequencing results in significantly higher computing costs compared to the re-sequencing approach.
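The effect of sequencing depth on the number of animals that can be sequenced can be sketched with simple arithmetic. In the example below, the genome size (3.0 Gb), the total yield budget (9000 Gb) and the 10× re-sequencing depth are hypothetical placeholders, not values from the studies cited; only the 60× and 30× assembly depths reflect the ranges quoted above. The sketch also ignores points (ii) and (iii), i.e. parental sequencing and computing costs.

```python
def gigabases_per_animal(genome_size_gb: float, depth: float) -> float:
    """Sequence yield (Gb) needed to cover one genome at the given depth."""
    return genome_size_gb * depth


def animals_within_yield(total_yield_gb: float, genome_size_gb: float, depth: float) -> int:
    """Number of whole animals that a fixed sequencing yield can cover."""
    return int(total_yield_gb // gigabases_per_animal(genome_size_gb, depth))


GENOME_SIZE_GB = 3.0     # assumption: approximate genome size, for illustration only
TOTAL_YIELD_GB = 9000.0  # assumption: hypothetical project-wide sequencing yield

for label, depth in [
    ("de novo assembly, older long-read technologies", 60),
    ("de novo assembly, latest releases", 30),
    ("re-sequencing (hypothetical depth)", 10),
]:
    n_animals = animals_within_yield(TOTAL_YIELD_GB, GENOME_SIZE_GB, depth)
    print(f"{label}: {depth}x -> {n_animals} animals")
```

Under these placeholder values, halving the depth doubles the number of animals, which is the trade-off that makes re-sequencing attractive for population-scale work.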

The above descriptions demonstrate that these two approaches for the discovery of SV are complementary, such that in the future, as pangenome references and improved bioinformatics tools become available for resequencing studies, this will greatly expand the repertoire of SV detected at the population scale. Thus, in addition to pangenome development, livestock improvement applications will require the discovery of SV across many individuals within specific populations, to catalogue the level of SV diversity within breeds and to build reference populations for downstream analyses. The resequencing approach allows for a more cost-effective sequencing of a larger number of individuals, which then enables studies such as association of SV with specific phenotypes, either directly or through imputation of SV into even larger populations that are already genotyped with dense SNP panels, short-read sequencing or low pass long-read sequencing. The first successful example of a population-scale SV study (discovery, imputation and association) was in a human Icelandic population where SV were found to be associated with complex traits [10].

Recent examples of SV studies in livestock

To date, sequence-based studies of SV in livestock (short and long reads) have implemented two main approaches: one is a “focused approach”, where a priori, a phenotype is tracked and then associated with SV in a genomic region of interest (summarized in Table 1), and the other is a naive “discovery approach” (summarized in Table 2). In the latter, multiple SV can be identified from genome-wide scanning using either (a) a resequencing or (b) a pangenome method. In Tables 1 and 2, we summarize recent studies using these two methodologies in several key livestock species where genomics tools are well developed (cattle, sheep, goat, pig, and chicken), because there have been many developments since the last major review on the SV landscape in livestock [22].

Perspectives on the importance of SV for livestock improvement

Due to their large size, SV are known to influence gene function, as they might cause partial/complete gene knockout or even may alter gene expression of neighbouring genes: this phenomenon is confirmed in humans [46], plants [47] and animals [48]. Currently, the SV that have been identified in livestock as putatively causal are biased towards those that have a large monogenic influence on a phenotypic trait, but some have also been identified as affecting quantitative traits (see examples in Tables 1 and 2). In the past few years since the advent of cheaper sequencing, a range of monogenic traits involving SV have been dissected using the focused approach in multiple livestock species (Table 1). However, in general, causal variants that underpin a physical defect/feature or inherited Mendelian disease including recessive lethal mutations in livestock are often not confirmed at the molecular level. There are numerous reasons for this, including the high investment cost (R&D, sequencing, and turnover time) and difficulties in capturing genetic material (farm to laboratory distance, rarity of cases, short lifespan of the embryo/animal, and producer’s concerns over reputation). For quantitative traits in livestock, it has been even more difficult to unequivocally identify any type of causal variant due to the large numbers of individuals required to detect the generally smaller effects and also due to strong linkage disequilibrium between variants extending over long distances (often several hundred kb) [49]. To date in livestock, there are few published examples of putative causal SV affecting complex traits, although there are two interesting examples in cattle (a CNV and a large deletion) that appear at a moderate frequency and have antagonistic pleiotropic effects on important traits [50, 51].

Clearly, to have adequate power to detect associations with quantitative traits, it is necessary to have large numbers of individuals with real/imputed SV genotypes and phenotypes. This approach has already been applied with some success in plants [52], yeast [53] and humans [10, 54]. The evidence from such studies indicates that there may be high value in developing the catalogue of SV in reference populations of livestock, imputing and testing the effects of these variants in large populations of animals with recorded traits, and applying these findings to breed improved livestock. The challenges that need to be addressed fall into three main areas: (i) developing large long-read sequenced reference populations to enable effective and accurate SV discovery and imputation; (ii) evaluating the molecular mechanisms that underpin SV effects on phenotypic traits; and (iii) applying knowledge of SV location and genotype to improve genomic tools for animal breeding.

Developing large long-read sequenced reference populations to enable effective and accurate SV discovery and imputation

Building long-read reference populations for SV discovery, phasing and imputation

We propose that it is timely to begin large collaborative long-read sequencing projects for livestock species using the cost-effective re-sequencing approach, similar to the existing short-read collaborations (e.g., 1000 Bull Genomes Project and SheepGenomesDB). Ideally, similar to the 1000 Bull Genomes Project, the reference populations would include: (i) at least hundreds of individuals for each of the most numerous breeds, because the rarer the variants, the more individuals are required for discovery and accurate imputation; (ii) small numbers of individuals from rarer breeds and outgroup species; (iii) popular common ancestors of the current population where possible; and (iv) at least 10 trios (offspring and parents) for targeted studies including bioinformatic quality control.

Within each species, we consider that there should be close collaboration between pangenome, long-read and short-read consortiums because this would enable the most effective use of the different levels of genomic information available, for example:

  (1) Deeply sequenced pangenome animals can be used in the short term to augment the size of the sequenced population, and in the medium to long term as a breed-specific or pangenome reference for alignment of long-read re-sequencing data.

  (2) Existing short-read databases with many sequenced individuals would continue to be invaluable for imputation of small sequence variants (e.g., the 1000 Bull Genomes Project now includes over 9000 genomes). Some individuals for which short reads and DNA are still available could be added to the long-read reference to provide individuals (such as trios) for specific studies, such as testing SNP, INDEL and SV discovery/imputation, and testing new bioinformatic tools that use short reads for the discovery of some types of SV, including tools that rely on high confidence SV sets that will become available from the long-read work (e.g., Giraffe, PanGenie).

  (3) In the short to medium term, a high-confidence set of SV in specific populations could be documented through long-read SV discovery (pangenome and/or re-sequencing). This ‘truth set’ could be used for a range of purposes including its use with short-read sequence databases for improved SV detection, although this will necessarily have considerable biases such as tending to exclude SV in segmental duplication regions [10]. However, where population-scale short-read sequence databases already exist, this might enable some limited population-scale SV detection and imputation, while long-read sequence databases are being developed.

One of the main weaknesses of long reads in the past few years was the single base accuracy, and previous studies have suggested that this might lead to incorrect small variant calling [36, 55, 56]. This resulted in the development of approaches such as hybrid base-call correction for long reads using short reads (‘polishing’) to improve the single base accuracy [57, 58]. However, at the time this article was written (November 2022) and looking forward, the likely verdict is that single base errors will become a non-issue. This is because the field is rapidly progressing in many aspects (technologies and bioinformatics), such as the most recent high accuracy PacBio developments (including HiFi) as well as ONT R10.4 flow cells that claim dramatic improvement in per base accuracy, bringing new advances that could result in high-quality small variant calls equivalent to short-read technology [37, 38, 59]. This means that SNPs and small INDEL variants called in long-read re-sequencing could be added to existing short-read variant databases to augment the data available for their imputation. Furthermore, although it is critical to maintain and provide access to these short-read databases, there would be no need to go on increasing the size of the short-read sequence database in populations that have the resources to undertake long-read sequencing. Arguably, for livestock species that do not yet have a short-read sequence database, there would no longer be a need to develop a short-read database if resources could be switched to sequencing adequate numbers of individuals with long reads.

A considerable strength of long-read sequencing is the relative ease for deployment of read-based, long-range haplotyping (instead of the traditional haplotype phasing), where phase information present in the reads can be incorporated into algorithms as true data to calibrate phasing and imputation models. This has been adopted in several recent phasing and imputation algorithms, for example: WhatsHap [60], HapCUT2 [61], QUILT [62], Duet [63] or LongPhase [64]. This should enable improved imputation (which relies on accurate phasing) of SV using long-read data compared to using short-read data and this was confirmed in a human study [10]. In the past, we have demonstrated that imputation accuracy for SNPs and small INDEL is improved by combining short-read sequence from multiple breeds and crosses [65]. However, it is yet to be determined if this will still hold for the imputation of SV using long reads, and should it not be the case, it could necessitate increased numbers of individuals that are sequenced within a breed.

De novo assemblies to build pangenomes

Assuming that the sequencing cost of long-read technologies will continue to significantly decrease in the near future, it would be useful to perform high read depth sequencing and construct pangenome scale assemblies. Recent studies in humans and cattle have identified that hundreds of Mb of the population- and individual-specific sequences are absent from the reference genome [7, 66], and this is likely also the case in other livestock species. Therefore, as discussed above, planning for de novo assemblies with long reads is desirable to create breed-specific or pangenome references, as well as to gain deeper insights into evolutionary modifications and comparative functional genomics between breeds and individuals. However, given the high costs per animal to undertake haplotype-resolved de novo assembly, if resources for long-read sequencing are limited in a given species, then it could be more cost-effective to initially focus only on building a consortium that undertakes a re-sequencing approach with the current reference genome. This will build a long-read sequence population, while waiting for improvements in cost-efficiency before developing breed-specific or pangenome references. At a later stage, it would be possible to redo the re-sequencing alignment to a breed-specific or pangenome reference to improve on the initial SV discovery.

Validation of SV effects and evaluating their role in molecular mechanisms

Biological validation of specific SV

Currently, wet-lab methods can be employed to validate SV post-discovery. Available options include: (i) long-range PCR amplification in combination with gold standard Sanger sequencing; or (ii) Bionano optical mapping, which can be considered a cost-effective method. In addition to these approaches, long-read sequencing of parent–offspring groups can also provide a means to confirm SV inheritance patterns and thereby validate the presence of SV [67]. Once SV from individuals have been confirmed to be accurately predicted and putatively causal, it is of great interest to undertake biological investigations to reveal the molecular mechanisms that underpin the effect of SV on important traits in livestock. For example, a functional approach such as knockout via gene silencing or CRISPR might then be considered for downstream validation. However, it is important to note that these validation methods are often low-throughput, so there is a necessity for the further development of higher throughput validation methods for SV similar to the deployment of massively parallel reporter assays (MPRA) in SNP functional confirmation [68].
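The parent–offspring check mentioned above can be sketched as a Mendelian consistency test on called SV genotypes. This is a minimal illustration that assumes diploid autosomal genotypes encoded as pairs of allele indices (0 for the reference allele, 1 or higher for alternate alleles). The function names and the example genotypes are hypothetical and do not describe the method used in [67].

```python
def is_mendelian_consistent(child, sire, dam):
    """True if one child allele can come from the sire and the other from the dam."""
    first, second = child
    return (first in sire and second in dam) or (second in sire and first in dam)


def mendelian_error_rate(trio_genotypes):
    """Fraction of fully genotyped sites where the child is inconsistent with its parents.

    trio_genotypes: iterable of (child, sire, dam) genotype tuples, one per SV site.
    Sites where any genotype is missing (None) are skipped.
    """
    checked = 0
    errors = 0
    for child, sire, dam in trio_genotypes:
        if child is None or sire is None or dam is None:
            continue
        checked += 1
        if not is_mendelian_consistent(child, sire, dam):
            errors += 1
    return errors / checked if checked else float("nan")


# Hypothetical genotypes at four SV sites: (child, sire, dam).
sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # inconsistent: the sire carries no alternate allele
    (None, (0, 0), (0, 0)),    # missing child genotype, skipped
    ((0, 0), (0, 1), (0, 1)),  # consistent
]
print(mendelian_error_rate(sites))
```

A low error rate supports the genotype calls but does not prove them, since a wrong call can still be Mendelian-consistent, and a genuine de novo SV would be counted as an error.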

Genome-wide validation of SV effects

Similar to SNPs, SV may have the potential to affect promoter/enhancer activity, alter gene expression, and in some cases, cause malfunction or fusion of genes by joining separate genomic regions together or by splitting a genomic region into sub-regions. Therefore, it would be of great interest to test the effect of SV on gene expression through genome-wide expression quantitative trait locus (eQTL) mapping. This is, however, only feasible with a reasonably sized population with gene expression data and with real or imputed SV genotypes. Some recent studies in humans have suggested that SV have larger effect sizes than SNPs and INDEL [69, 70]. In the last decade, multiple studies have predicted that SV have the potential to alter multiple adjacent genes: indeed, a recent estimation showed that SV-eQTL affect an average of 1.82 nearby genes, whereas SNP- and INDEL-eQTL only affect an average of 1.09 genes [46]. Thus, transcriptome changes induced by genomic SV are of strong interest to investigate. It should be noted that the molecular mechanism by which the Celtic and Friesian SV result in the polled cattle phenotype is still unknown, although a long RNA is suspected to be involved [71].

Prediction of the theoretical impact of SV

There are many bioinformatic tools to predict the effect of SNPs, such as SIFT [72] and VEP [73]. Prediction of the effect of SV adds more complexity as there are different types of SV (such as insertions, deletions, and inversions) and they have the potential to influence the linear as well as the three-dimensional genome structure [74]. These different types of SV will need to be accounted for when predicting their effects. Several strategies to predict SV effects in humans have deployed existing tools to predict the biological effects of individual bases spanning the SV [75,76,77]. Theoretically, this strategy can also be applied to livestock species.

Incorporation of SV discoveries to improve gene functional annotation

Multi-omics analyses including ATAC/ChIP/Iso-seq may be beneficial to explain the mechanism that underpins the effect of an SV (for an example, see [40]). Also, as described in previous sections, SV are of interest not only for the purpose of identifying simple Mendelian mutations but also for their role in explaining variation in complex traits. At present, the FAANG (Functional Annotation of Animal Genomes) consortium is building livestock-specific genome-wide ‘OMICS’ resources to improve the functional annotation available for a range of species, tissues and developmental phases [78]. This type of annotation combined with the knowledge of SV could be used in prediction frameworks for the importance of an SV on complex traits similar to the FAETH score method used for SNPs and INDEL [79]. In addition, it is important to note that native methylation capturing is now available with both Oxford Nanopore and PacBio, so we believe that the analysis of multiple methylomes gathered from the large sequencing consortiums could provide a tremendous opportunity to further examine genomic imprinting or epigenetic marks [80]. Another question of interest is to examine if SV from specific genomic regions have very large effects on phenotypes. For example, SV within coding regions or regions enriched for sites that are conserved across vertebrates may result in large-effect SV associated with fitness. Interestingly, a recent study in bovine found evidence that SV were less likely to be located in “core” eukaryote genes [23] suggesting that there may be selective purging of SV in these genes due to highly detrimental effects. Of course, many SV will potentially encompass a range of genomic regions such as coding and non-coding. To assess the validity of predicted SV effects, one could compare the ranking of SV between predicted functional effects and SV genome-wide association studies (GWAS) results on complex traits such as fertility and survival.
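One simple way to compare the two rankings is a Spearman rank correlation between a predicted functional score and a GWAS association statistic for the same set of SV. The sketch below is an illustration rather than a method taken from the cited studies: the input names (a functional score and a -log10 p-value per SV) and the example values are hypothetical, and the correlation is computed in plain Python using average ranks for ties.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (var_x * var_y) ** 0.5


# Hypothetical values for five SV, in the same order in both lists.
predicted_functional_score = [0.9, 0.1, 0.5, 0.7, 0.3]
gwas_minus_log10_p = [6.2, 0.4, 2.1, 1.5, 0.9]
print(spearman(predicted_functional_score, gwas_minus_log10_p))
```

A rank-based measure is used here because functional scores and association statistics are on different, non-linear scales.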

Application of the knowledge of SV location and genotype to improve genomic tools for animal breeding

Undoubtedly, post-validation and further downstream, there is still the ultimate question of how best to apply knowledge of the impact of SV to livestock breeding. For example: how common are functional SV, and how accurately can they be imputed and/or incorporated for genotyping on a platform such as custom SNP panels, genotyping-by-sequencing or low-pass sequencing (using either short reads or long reads)? Adopting SV in combination with SNV from both long/short read libraries to estimate the genomic heritability of quantitative traits is also of interest [23] and requires further investigation since long reads offer a higher resolution for SV, in addition to accurate phasing of long haplotypes and therefore better imputation. Last but not least, SV could be a target for the CRISPR gene editing technology that might provide benefits for specific situations to improve animal productivity, health or welfare outcomes (e.g., editing the poll trait in cattle [81] or other genetically improved livestock [82,83,84]). However, CRISPR-like editing approaches require more active research to confirm their feasibility for application in livestock, because recent studies suggested that unintended off-target SV might be created as an artefact [85].

In the near future, it is within reach to build a collaborative multi-institutional long-read sequencing project (perhaps in conjunction with existing short-read consortiums) to build large-scale reference populations to enable the discovery and imputation of SV into large, genotyped populations of livestock. Either alone, or combined with imputed SNPs and INDEL, this would enable population-scale GWAS with SV to determine the impact of SV on quantitative trait phenotypes as well as Mendelian traits. Furthermore, we can anticipate that the increasing availability of these resources in genomic prediction settings for a range of traits will deliver positive impacts for livestock breeding. In addition, most SNPs are commonly found to be biallelic (two observed alleles), while many SV can be multi-allelic (multiple observed alleles), as well as having slightly different breakpoints between individuals in large cohorts. Undoubtedly, these features create future challenges for analytical approaches [86]. Ideally, we would need thousands of animals in the reference population to accurately discover and impute SV for livestock breeding applications. In the initial phases it would likely be preferable to include parent–offspring trios to determine the accuracy of SV detection and phasing, as well as widely-used recent ancestors from a limited number of the most important breeds, while increasing the number of breeds in the future. The addition of more breeds will not only increase the diversity of the SV catalogue but would be useful to better understand the evolutionary and more recent history of SV, and in particular to understand if there has been some selective advantage for/against specific SV. It is also of interest to include suspected carrier/affected animals with deleterious conditions in an attempt to capture SV that may be responsible for these.
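The breakpoint issue mentioned above means that calls from different animals usually have to be grouped before a single SV can be genotyped across a cohort. The following is a deliberately simple sketch of such grouping, not the algorithm of any published SV merging tool: calls of the same type on the same chromosome are joined to a cluster when both breakpoints lie within a tolerance of the first call in that cluster. The `SVCall` fields, the 50 bp tolerance and the example calls are all assumptions for illustration.

```python
from typing import NamedTuple


class SVCall(NamedTuple):
    chrom: str
    start: int
    end: int
    svtype: str   # e.g. "DEL", "INS"
    sample: str


def cluster_calls(calls, max_breakpoint_shift=50):
    """Greedily group calls whose breakpoints differ only slightly.

    Calls are sorted by chromosome, type and position, then each call is
    compared with the first call (the seed) of the most recent cluster only.
    This is a simplification: a production method would also compare
    against earlier clusters and handle allele sequences.
    """
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start, c.end)):
        seed = clusters[-1][0] if clusters else None
        if (
            seed is not None
            and seed.chrom == call.chrom
            and seed.svtype == call.svtype
            and abs(seed.start - call.start) <= max_breakpoint_shift
            and abs(seed.end - call.end) <= max_breakpoint_shift
        ):
            clusters[-1].append(call)
        else:
            clusters.append([call])
    return clusters


# Hypothetical calls from four animals.
calls = [
    SVCall("chr1", 1000, 1500, "DEL", "animal_A"),
    SVCall("chr1", 1010, 1495, "DEL", "animal_B"),
    SVCall("chr1", 5000, 5600, "DEL", "animal_C"),
    SVCall("chr1", 1000, 1001, "INS", "animal_D"),
]
for cluster in cluster_calls(calls):
    print(cluster[0].svtype, cluster[0].start, [c.sample for c in cluster])
```

The choice of tolerance matters: too small a value splits one SV into several rare variants, while too large a value merges distinct alleles of a multi-allelic site.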

Conclusions

Through this review, we provide a snapshot of the landscape of long-read sequencing in livestock and discuss the exciting developments for the discovery and application of SV. Significant ongoing technological improvements have paved the way to apply genome-wide long-read sequencing to population-scale projects. With this long-read technology, we can now dissect structural variants in unprecedented detail as well as develop approaches to test their significance for key traits in livestock. We believe that although the generation and analysis of population-scale long-read sequencing data will remain challenging over the next few years, now is the right time to start investing in multi-institutional collaborations that can integrate and use the huge volume of data generated from SNP array, short-read, and long-read technologies. We argue that a collaborative approach is a cost-effective proposal to more comprehensively and rapidly advance livestock genomics and that investment now will bring rewards in the near- to medium-term future.