Genome wide association analysis of the 16th QTL- MAS Workshop dataset using the Random Forest machine learning approach

Minozzi, Giulietta; Pedretti, Andrea; Biffani, Stefano; Nicolazzi, Ezequiel Luis; Stella, Alessandra

doi:10.1186/1753-6561-8-S5-S4

Genome wide association analysis of the 16th QTL- MAS Workshop dataset using the Random Forest machine learning approach

Proceedings
Open access
Published: 07 October 2014

Volume 8, article number S4, (2014)
Cite this article

Download PDF

You have full access to this open access article

BMC Proceedings Aims and scope

Genome wide association analysis of the 16th QTL- MAS Workshop dataset using the Random Forest machine learning approach

Download PDF

Giulietta Minozzi^1,3,
Andrea Pedretti¹,
Stefano Biffani²,
Ezequiel Luis Nicolazzi¹ &
…
Alessandra Stella²

2039 Accesses
4 Citations
Explore all metrics

Abstract

Background

Genome wide association studies are now widely used in the livestock sector to estimate the association among single nucleotide polymorphisms (SNPs) distributed across the whole genome and one or more trait. As computational power increases, the use of machine learning techniques to analyze large genome wide datasets becomes possible.

Methods

The objective of this study was to identify SNPs associated with the three traits simulated in the 16th MAS-QTL workshop dataset using the Random Forest (RF) approach. The approach was applied to single and multiple trait estimated breeding values, and on yield deviations and to compare them with the results of the GRAMMAR-CG method.

Results

The two QTL mapping methods used, GRAMMAR-CG and RF, were successful in identifying the main QTLs for trait 1 on chromosomes 1 and 4, for trait 2 on chromosomes 1, 4 and 5 and for trait 3 on chromosomes 1, 2 and 3.

Conclusions

The results of the RF approach were confirmed by the GRAMMAR-CG method and validated by the effective QTL position, even if their approach to unravel cryptic genetic structure is different. Furthermore, both methods showed complementary findings. However, when the variance explained by the QTL is low, they both failed to detect significant associations.

Background

Genome wide association studies (GWAs) are now widely used in the livestock sector to estimate the association among multiple single nucleotide polymorphisms (SNPs) distributed across the whole genome and one or more trait. GWAs are typically carried out on a single-point by performing a marginal chi-square test or regression. However, these methods do not take into account linkage disequilibrium between markers and the genetic structure of the population that may have a large impact on structured populations (e.g. cattle populations). Approaches for genome wide pedigree-based quantitative trait loci (QTL) analysis have been developed (e.g. GRAMMAR-CG), which are based on mixed model and regression, where the genomic kinship matrix estimated through genomic marker data can be used to correct for familiar correlation and cryptic relatedness [1].

As computational power increases, the use of more advanced machine learning techniques to analyze large genome wide datasets becomes possible [2], these techniques include Support Vector Machines [3], Bayesian Networks [4] and Random Forest [5].

The Random Forests (RF) algorithm [6] is a machine-learning method that has been widely applied to classification and regression problems, and is particularly well suited to circumstances in which the number of potential explanatory variables exceeds the number of observations, as is the case for GWAs. The RF algorithm produces a collection of trees (forest), each grown on a different bootstrap sample of observations, and at each split (node) of a tree, a different random subset of predictors (SNP) is evaluated to identify the best split. The final scores are then calculated by aggregating predictions resulting from all the trees grown in the forest.

RF embraces a combination of characteristics that makes it appropriate for genetic applications: it is well suited for very large datasets; it is non-parametric, thus does not require a causal model to be specified, it is highly parallelizable and considers interactions between predictors.

The objective of this study was to identify SNPs associated to the three traits simulated in the 16th MAS-QTL workshop dataset using the Random Forest approach and to compare them with the results obtained by the Grammar-CG method. SNPs identified by both methods were verified with the actual QTL positions.

Methods

Dataset

The dataset used was provided by the organisers of the 16th QTLMAS workshop and consisted of 4080 individuals (G0 to G4). The simulated genome was 499.750 Mb consisting of 5 chromosomes carrying 2,000 equally distributed SNPs. The GWA analysis was conducted on 3000 samples, all females belonging to generations G1 to G3, for which phenotypic information for three traits (yield deviations) was provided. The analysis was performed on: yield deviations (YD1, YD2 and YD3), the estimated breeding values (EBV) obtained from a single trait model (tr1_ST, tr2_ST, tr3_ST) and the EBVs obtained from a multiple trait model (tr1_MT, tr2_MT, tr3_MT).

Analysis

Variance components and EBV estimation

Variance components and EBVs were obtained separately, using REMLF90 and BLUPF90 programs, respectively [7]. The model used to estimate variance components and EBVs was:

y_{k, i, j} = μ_{k} + G E N_{k, i} + A n i m a l_{k, j} + e_{k, i, j}

where μ is a general mean for the k^th trait, GEN is a fixed effect for i generations (i = 1 to 3), Animal is a random animal effect with distribution ~ N(0,σ²_a), where σ²_a is the additive genetic variance, and e is the random residual with distribution ~ N(0, σ²_e), where σ²_e is the residual variance. Covariance between traits was considered only in multiple-trait analysis.

Random Forest

Feature selection (SNPs) analysis was performed with the randomForest package in R [8] using 3000 individuals and the 9042 SNPs that passed quality control checks out of the total 10000 SNP. The minimum size of the terminal nodes was set to 5. The number of trees grown was set to 1000. The subset of samples evaluated at each tree was 70% of the total number of samples (n = 2100). The number of variables evaluated at each node was set to the square root of the number of predictors (p = 94). All SNPs were ordered by Mean decrease Gini index [6] and the most strongly associated SNPs are at the top of the lists shown in Table 2, 3 and 4.

Grammar-CG

Genome-wide association analysis was performed with the GenABEL package in R using a three step GRAMMAR-CG (Genome wide Association using Mixed Model and Regression - Genomic Control) approach [1, 9].

Results and discussion

Variance components and EBV estimation

Mean and standard deviations of the nine phenotypes used are shown in Table 1. The heritability estimates resulting from the single trait model were 0.38, 0.38 and 0.50 for trait 1, 2 and 3, respectively. Large genetic correlations between traits 1 and 2 were observed (0.83), whereas lower genetic correlation was observed for trait 2 and 3 (0.12). Negative correlation was observed between traits 1 and 3 (-0.44).

Table 1 Statistics of the nine phenotypes used in the GWAs.

Full size table

Table 2 Top SNPs identified by the Random Forest and GRAMMAR-CG Approach for Trait 1

Full size table

Table 3 Top SNPs identified by the Random Forest and GRAMMAR-CG Approach for trait 2

Full size table

Table 4 Top SNPs identified by the Random Forest and GRAMMAR-CG Approach for trait 3

Full size table

Association mapping

The two QTL mapping methods used, GRAMMAR-CG and RF, were successful in identified the largest QTLs for trait 1 on chromosomes 4 and 1 in position 24 Mb and 84 Mb, for trait 2 on chromosomes 4, 5 and 1 in position24 Mb, 68 Mb and 14 Mb and for trait 3 on chromosomes 1, 2 and 3 at 84 Mb, 79 Mb and 36 Mb respectively. Positions and names of the significant SNPs are shown in Tables 2, 3 and 4.

Both methods showed good precision in the identification of the QTL in comparison with the "real" QTL position provided by [10]. Interestingly the exact markers flankingthe QTL were identified for all traits.

Differences however were observed depending on i) the phenotype analysed, YD, single trait EBV and Multiple trait EBV and on ii) the method used RF or GRAMMAR-CG.

With regard to Trait 1, the GRAMMAR-CG method identified 8 significant associations for multiple trait EBV, 6 for single trait EBVs and 10 for YD, only 4 of which are common between the three phenotypes. The RF approach identified the same number of markers per phenotype, but only 2 markers were in common between the methods of analysis and phenotype. The two markers identified by both approaches were the QTL which explained the largest variance, however, the other markers are all true associations and indicate that using different types of phenotypes for the same trait and different analysis methods may overlap, but may also show some differences in QTLs and positions.

Traits 2 and 3 share the same pattern as observed for trait 1. Several QTL were identified in common between phenotypes and methods but just a few were in common between analysis methods: 2 markers for trait 2 and 3 markers for trait 3. When the YD phenotype was used, a larger number of significant SNPs were detected. This may be due to the larger variability of the YD compared to the more regressed EBV phenotypes (Table 1).

Interestingly both methods failed to identify the QTLs on chromosomes 4 and 5 for Trait 3. The variance explained by the markers is low, suggesting that both methods are not able to detect QTLs which explain a small amount of variance. The RF approach, however, detect the QTL on chromosomes 5 and 3 for Trait 1.

Overall the results of the RF were confirmed by the results of the GRAMMAR-CG method and were validated by the effective positions given the QTL. Interestingly, even though the RF approach does not directly use family structure information through a relationship matrix (genomic or additive), as is the case in the GRAMMAR-CG approach, correct identification of QTL positions is achieved.

Conclusions

In this study we proposed the use of recursive partitioning approaches such as Random Forest, as an alternative to traditional regression methods to detect the genetic loci. The results of the RF approach were consistent with those of the GRAMMAR-CG method and validated by the effective positions given for the QTL. However, when the variance explained by the QTL was low, both failed to detect a significant association.

References

Amin N, van Duijn CM, Aulchenko YS: A genomic background based method for association analysis in related individuals. PLoS ONE. 2007, 2: e1274-10.1371/journal.pone.0001274.
Article PubMed PubMed Central Google Scholar
Goldstein BA, Polley EC, Briggs FB: Random forests for genetic association studies. Stat Appl Genet Mol Biol. 2011, 10 (1): 32-
PubMed PubMed Central Google Scholar
Yoon Y, Song J, Hong S, Kim J: Analysis of multiple single nucleotide polymorphisms of candidate genes related to coronary heart disease susceptibility by using support vector machines. Clin Chem Lab Med. 2003, 41: 529-534.
Article PubMed CAS Google Scholar
Han B, Chen XW, Talebizadeh Z, Xu H: Genetic studies of complex human diseases: characterizing SNP-disease associations using Bayesian networks. BMC Syst Biol. 2012, 6-
Google Scholar
Mokry FB, Higa RH, de Alv arenga Mudadu M, Oliveira de Lima A, Meirelles SL, Barbosa da Silva MV, Cardoso FF, Morgado de Oliveira M, Urbinati I, Méo Niciura SC, Tullio RR, Mello de Alencar M, Correia de Almeida Regitano L: Genome-wide association study for backfat thickness in Canchim beef cattle using Random Forest approach. BMC Genet. 2013, 14: 47-Jun 5
Article PubMed CAS PubMed Central Google Scholar
Breiman L: 2001 Random Forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
Article Google Scholar
Misztal I, Tsuruta S, Strabel T, Auvray B, Druet T, Lee DH: BLUPF90 and related programs (BGF90). Proc. 7th WCGALP Montpellier 2002, France, Communication No. 28-07
[http://cran.r-project.org/web/packages/randomForest/index.html]
Aulchenko YS, Ripke S, Isaacs A, van Duijn CM: GenABEL: an R package for genome-wide association analysis. Bioinformatics. 2007, 23: 1294-6. 10.1093/bioinformatics/btm108.
Article PubMed CAS Google Scholar
[http://qtl-mas-2012.kassiopeagroup.com/en/dataset.php]

Download references

Acknowledgements

This work was funded by the MASTFIELD project n.1745 (Applicazione di sistemi molecolari innovativi per ilcontrollo in campo delle mastiti bovine) of the Lombardy Region(Agricultural regional research programme2010-2012) and by the PON EPISUD project n° PON01_ 01841 project funded by MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca).

Declarations

This work was funded by the MASTFIELD project n.1745 (Applicazione di sistemi molecolari innovativi per ilcontrollo in campo delle mastiti bovine) of the Lombardy Region (Agricultural regional research programme2010-2012) and by the PON EPISUD project n° PON01_ 01841 project funded by MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca).

This article has been published as part of BMC Proceedings Volume 8 Supplement 5, 2014: Proceedings of the 16th European Workshop on QTL Mapping and Marker Assisted Selection (QTL-MAS). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S5

Author information

Authors and Affiliations

Parco Tecnologico Padano Srl, Via Einstein 26900, Lodi, Italy
Giulietta Minozzi, Andrea Pedretti & Ezequiel Luis Nicolazzi
IBBA-CNR, Via Einstein 26900, 26900, Lodi, Italy
Stefano Biffani & Alessandra Stella
Department of Veterinary Science and Public Health, University of Milan, Via Celoria 10, 20133, Milan, Italy
Giulietta Minozzi

Authors

Giulietta Minozzi
View author publications
You can also search for this author in PubMed Google Scholar
Andrea Pedretti
View author publications
You can also search for this author in PubMed Google Scholar
Stefano Biffani
View author publications
You can also search for this author in PubMed Google Scholar
Ezequiel Luis Nicolazzi
View author publications
You can also search for this author in PubMed Google Scholar
Alessandra Stella
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Giulietta Minozzi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AP, GM, ELN, and SB participated in the design and performed the statistical analysis. GM and AS conceived the study, participated in its design and coordination. GM drafted the manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under an open access license. Please check the 'Copyright Information' section either on this page or in the PDF for details of this license and what re-use is permitted. If your intended use exceeds what is permitted by the license or if you are unable to locate the licence and re-use information, please contact the Rights and Permissions team.

About this article

Cite this article

Minozzi, G., Pedretti, A., Biffani, S. et al. Genome wide association analysis of the 16th QTL- MAS Workshop dataset using the Random Forest machine learning approach. BMC Proc 8 (Suppl 5), S4 (2014). https://doi.org/10.1186/1753-6561-8-S5-S4

Download citation

Published: 07 October 2014
DOI: https://doi.org/10.1186/1753-6561-8-S5-S4

Genome wide association analysis of the 16th QTL- MAS Workshop dataset using the Random Forest machine learning approach

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Dataset

Analysis

Variance components and EBV estimation

Random Forest

Grammar-CG

Results and discussion

Variance components and EBV estimation

Association mapping

Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Genome wide association analysis of the 16th QTL- MAS Workshop dataset using the Random Forest machine learning approach

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Dataset

Analysis

Variance components and EBV estimation

Random Forest

Grammar-CG

Results and discussion

Variance components and EBV estimation

Association mapping

Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation