Machine learning approach to integrated endometrial transcriptomic datasets reveals biomarkers predicting uterine receptivity in cattle at seven days after estrous

Rabaglino, Maria B.; Kadarmideen, Haja N.

doi:10.1038/s41598-020-72988-3

Article
Open access
Published: 12 October 2020

Volume 10, article number 16981, (2020)
Scientific Reports

Maria B. Rabaglino¹ &
Haja N. Kadarmideen¹

Abstract

The main goal was to apply machine learning (ML) methods on integrated multi-transcriptomic data, to identify endometrial genes capable of predicting uterine receptivity according to their expression patterns in the cow. Public data from five studies were re-analyzed. In all of them, endometrial samples were obtained at day 6–7 of the estrous cycle, from cows or heifers of four different European breeds, classified as pregnant (n = 26) or not (n = 26). First, gene selection was performed through supervised and unsupervised ML algorithms. Then, the predictive ability of potential key genes was evaluated through support vector machine as classifier, using the expression levels of the samples from all the breeds but one, to train the model, and the samples from that one breed, to test it. Finally, the biological meaning of the key genes was explored. Fifty genes were identified, and they could predict uterine receptivity with an overall 96.1% accuracy, despite the animal’s breed and category. Genes with higher expression in the pregnant cows were related to circadian rhythm, Wnt receptor signaling pathway, and embryonic development. This novel and robust combination of computational tools allowed the identification of a group of biologically relevant endometrial genes that could support pregnancy in the cattle.

Introduction

Various bioinformatics and systems biology tools in animal production and health sciences¹, and specifically in cattle artificial reproduction², focus on integrating biological data layers (genomics and transcriptomics) and application of statistical-bioinformatics methods (e.g. eQTL mapping) to identify functionally relevant targets and biomarkers. However, the emergence of machine learning (ML), as a big data science tool, is less explored in livestock functional genomics in general, and bovine species in particular. ML refers to the use of self-learning algorithms to make sense of big data and is a branch of artificial intelligence that holds great potential for pattern recognition in complex datasets^3,4, such as the ones derived from the “omics” technologies. In transcriptomic data (captured by either microarrays or RNA-sequencing platforms), expression pattern analysis is central to find functionally relevant groups of genes under different treatment conditions or phenotypic categories. Thus, application of ML tools represents a powerful analytical approach that can be strengthened when it is applied to data integrated from several datasets (i.e., multi-transcriptomic data), which provides a robust overview of the system under study⁵.

Here, we investigated ML methods across multi-transcriptomic data, in the context of characterizing the receptive endometrium at the time of embryo transfer (ET) in European cattle. The endometrial transcriptomic profile should determine a favorable environment for contact and communication with the embryo at around 5–6 days of pregnancy, when the conceptus reaches the uterus⁶. Thus, identification of a receptive endometrium becomes crucial at around 7 days of the estrous cycle, when an embryo is deposited into the uterus after the application of assisted reproductive technologies. Previous works, including ours⁷, have demonstrated that the endometrial transcriptomic profile at day 6 or 7 differs between animals that become pregnant and those that do not. This has been shown in Nellore cows after artificial insemination (AI)⁸, Simmental heifers after in vivo produced ET^9,10, dairy cows after in vitro produced ET⁷, and cross-breed heifers after AI¹¹. These studies applied high-throughput technologies with the goal of determining the differentially expressed genes in the endometria of animals that became pregnant compared to those that did not. Their results helped to shed light on the genes whose expression controls the fate of uterine receptivity. Nevertheless, results from these studies are not entirely consistent, since several other factors, such as breed and category, can influence gene expression. Therefore, the identification of endometrial genes as biomarkers of receptivity is still challenging.

In the present study, we combined data integration with supervised and unsupervised ML tools to provide actionable knowledge from various endometrial transcriptomic datasets. We hypothesized that such a novel computational approach could reveal the main gene expression signature of the receptive endometrium at the time of ET in cattle. The goal of this study was to conduct multi-transcriptomic data integration from five publicly available datasets, including our recent study on endometrial transcriptomics⁷, and apply ML tools to the integrated data, to identify biomarker genes determining uterine receptivity according to their expression patterns. These studies, listed in Table 1, have in common that endometrial samples from European or Taurine cattle (Bos taurus taurus) were obtained at day 6–7 of the estrous cycle and were classified as receptive (R, n = 26) or not (nonR, n = 26) after ET or AI, depending on the study. Selection and validation of the potential biomarkers were done in three steps involving supervised and unsupervised ML methods. Once these key genes were determined, the final aim was to explore their biological characteristics through predictions in external data, to discern the role of estradiol and progesterone in their expression, and through network analysis, to reveal related genes.

Table 1 Characteristics of each dataset selected for data integration and analysis.

Results

The main results are organized into four topics.

Identification of groups of potential biomarker genes through supervised ML

The software BioDiscML¹² was employed for selection of potential biomarker genes. This software automates the main steps in ML by implementing methods for features and model selection, in order to identify the best model for data classification. The software generated 2097 models, from which only five models presented accuracy higher than 90% in the test set and in the evaluation procedures in the train set. These models were:

a. Two Bayes Network models optimized by accuracy of prediction, which selected 100 and 75 genes;
b. Two multinomial logistic regression models, also optimized by accuracy of prediction, which selected the same 100 and 75 genes; and
c. One Bayes Network model optimized by False Discovery Rate, which selected 50 genes.

The groups of 100, 75 and 50 selected genes overlapped: the 50 genes were present in all three groups.

Identification of the best group of biomarker genes through unsupervised ML

The groups of genes identified as potential biomarkers were evaluated for their ability to blindly cluster apart the R and nonR samples, according to their expression levels, in a hierarchical clustering. As expected, the group of 50 genes showed the best performance (92.3% accuracy). Only one nonR sample from the Holstein cows (nonR_Hols_6) was clustered with the R samples, while three R samples were grouped with the nonR samples. These samples were one from Holstein (R_Hols_3), one from the Charolais x Limousine (R_Cont_3) and one from the second study with the Simmental heifers (R_Sim2_1). The corresponding dendrograms and heat maps for the expression signatures are shown in Supplementary Fig. 1, together with the confusion matrix and accuracy for each classification.

In addition, hierarchical clustering of the genes according to their expression showed two main clusters of genes, corresponding to those with increased or decreased expression in the R cows (up-regulated or down-regulated, respectively).

Table 2 lists the 50 genes, with the respective indication if they were more (UP, n = 32) or less (DOWN, n = 18) expressed in the 7-day endometria of the R animals.

Table 2 List of the 50 endometrial genes identified as biomarkers to determine pregnancy status around day 7 of the estrous cycle in the Bos taurus cattle.

Validation of the selected set of biomarker genes through supervised ML

The next step was to verify whether the expression signature of the 50 selected genes was able to predict uterine receptivity. For this, we applied a support vector machine (SVM) as classifier, using all the samples except those from a given breed as the training set, and the samples from that breed as the testing set. In this way, we could discern whether the expression signature of these genes would be able to predict uterine receptivity across all the bovine breeds.

The evaluation metrics associated with the confusion matrix for each of the four train/test sets are shown in Table 3. When the SVM was trained on the expression of the 50 genes in all the samples except those from a particular breed, the accuracy in predicting uterine receptivity in that breed was 100% for the Japanese Black and Simmental breeds, 94.1% for the Holstein cows and 91.7% for the Charolais x Limousine heifers. One nonR sample from the Holstein cows (non_R_6) was misclassified as R, while one R sample from the Charolais x Limousine heifers (R_Cont_1) was misclassified as nonR. Therefore, the overall accuracy was 96.1%.
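The arithmetic behind these figures can be checked with a short sketch. It is written in Python purely for illustration (the analyses in this study were run in R), and it uses only the test-set sizes given in the Methodology and the misclassification counts stated above; the variable names are our own.

```python
# Test-set sizes per held-out breed (from the Methodology) and the number
# of misclassified samples per breed (from the paragraph above).
results = {
    "Holstein": {"n": 17, "errors": 1},
    "Japanese Black": {"n": 11, "errors": 0},
    "Charolais x Limousine": {"n": 12, "errors": 1},
    "Simmental": {"n": 12, "errors": 0},
}


def accuracy(n, errors):
    """Fraction of correctly classified samples."""
    return (n - errors) / n


per_breed = {b: accuracy(r["n"], r["errors"]) for b, r in results.items()}

# Overall accuracy pools all held-out predictions across the four folds.
total_n = sum(r["n"] for r in results.values())
total_errors = sum(r["errors"] for r in results.values())
overall = accuracy(total_n, total_errors)

for breed, acc in per_breed.items():
    print(f"{breed}: {acc:.3f}")
print(f"overall: {total_n - total_errors}/{total_n} = {overall:.4f}")
```

Pooling the four folds gives 50 correct predictions out of 52 (about 0.9615), which the text reports as 96.1%.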

Table 3 Evaluation metrics corresponding to the classifications on pregnancy status based on the expression of the 50 endometrial genes for each breed, using Support Vector Machine as classifier, trained with all the samples except for the samples of the particular breed.

Determination of the biological significance of the selected biomarker genes

As a final step, we investigated the biological meaning of the 50 genes through two methods: predictions in external datasets and functional/network analysis.

Predictions in external datasets: with the aim of understanding the role of estradiol and progesterone in the expression of the biomarker genes, two datasets were selected to generate the test sets based on the endometrial expression of the 50 biomarker genes. The training set consisted of the expression of the 50 genes in the 52 samples described in Table 1. Predictions of ‘receptive’ samples for each test set were as follows:

Test set 1) Five out of five pregnant heifers with normal progesterone levels (PN), but only one out of five pregnant heifers treated with a progesterone device on day 3 of the estrous cycle (PH).

Test set 2) Three out of three ovariectomized cows receiving a progesterone treatment for six days plus estradiol at day 6 (E2 + P4) but none out of three receiving only the progesterone treatment (P4).

Accordingly, samples that were classified as ‘receptive’ (PN and E2 + P4) tended to cluster with the R samples, and vice versa for samples classified as ‘non receptive’ (PH and P4), in a PCA plot (Fig. 1).

Functional/network analysis: functional classification of the protein class for each gene was performed with the Panther database¹³ and network analysis was done with the Cytoscape software (V. 3.7.2)¹⁴.

Of the 50 genes, 26 were classified into protein classes, of which the most abundant was gene-specific transcriptional regulator (six out of the 26). Four of these regulators are transcription factors: Cellular tumor antigen p53 (TP53), Basic helix-loop-helix family member e40 (BHLHE40), Hematopoietically expressed homeobox (HHEX) and Zinc finger and SCAN domain containing 12 (ZSCAN12). The other two are transcription co-factors: Transducin like enhancer of split (TLE4) and C-terminal-binding protein 2 (CTBP2). All of these regulators were more expressed in the animals that became pregnant, except for ZSCAN12 and CTBP2.

The top 100 related genes to the up- and down-regulated biomarker genes in the R animals were inferred and analyzed with Cytoscape (Supplementary Table 1). These genes generated highly connected networks (Supplementary Fig. 2). The significantly enriched biological processes (adj. p < 0.05) related to these networks are listed in Supplementary Table 2. The main significant biological processes determined in the network derived from the up-regulated genes in the endometria of the R cows are: positive regulation of biological/cellular process, regulation of gene expression, circadian rhythm, regulation of apoptosis, Wnt receptor signaling pathway, and embryonic development. For the down-regulated genes, the main non-redundant biological processes are: chromosome segregation, lipid modification, negative regulation of biological/cellular process, M phase of cell cycle, and fatty acid oxidation.

Discussion

So far, most of the studies of the bovine endometrial transcriptome during the early-luteal phase period have utilized bioinformatics methods to detect differentially expressed genes between the groups in that particular study^7,8,9,10,15. The outputs of these investigations are deposited in the public GEO database, which enables the access to a large amount of high-throughput data¹⁶. Integration of several datasets could lead to a better characterization of the system under study⁵, as done, for example, for the human endometrial transcriptome¹⁷. On the other hand, ML algorithms have emerged as useful tools to recognize patterns in data generated by “omics” assays⁴. Here, we demonstrated the power of combining data integration and ML methods to detect endometrial genes whose expression patterns potentially identify a receptive endometrium at around seven days of the estrous cycle.

In the cow, the pre-implantation period is so critical that more than 70% of pregnancy failure associated with embryo death occurs here¹⁸. This represents one of the main causes of economic loss, and thus understanding the early physiological changes occurring in the endometrium, which are determined by variations in the endometrial transcriptome, is of major importance. In the present study, we integrated endometrial transcriptomic data from Bos taurus taurus during this early-luteal phase, with the main aim of identifying a set of genes characterizing a receptive endometrium regardless of breed and category. Data from Bos taurus indicus were not considered, to avoid confounding differences given the bifurcation in the phylogenetic tree. Selected datasets (Table 1) shared similar experimental designs. Except for the dataset GSE29853, the datasets classified the animals retrospectively according to pregnancy status after the biopsy (although GSE107741 was based on the results before and after the biopsy). For the dataset GSE29853, the authors classified the animals based on the results of previous AI. Therefore, all these public datasets have in common that they consist of endometrial samples obtained at around day 7 of the estrous cycle, that the samples were classified according to pregnancy status, and that the transcriptome was measured through a high-throughput technology. The application of bioinformatics procedures allowed the integration of these datasets, in the sense that technical differences given by the platform employed for transcriptomic measurement, or the experiment itself, were eliminated through a data pre-processing step (Supplemental Fig. 3).

As per our main objectives, we applied a series of ML procedures (that included identification of sets of genes and application of unsupervised and supervised ML tools to determine the best one) to determine a group of 50 genes with the capability to predict pregnancy status according to their expression levels (Table 2). Even more, the expressions of these genes in all the samples but a particular breed were able to predict uterine receptivity with an overall 96.1% accuracy, validating the predictive capability of these key genes (Table 3).

Among them, there were six transcriptional regulators, corresponding to four transcription factors and two co-factors. Furthermore, five of these 50 genes have been associated with cow fertility in recent studies using genome-wide association analysis. YWHAQ and PAXIP1 were identified as master regulators (molecules that have indirect relationships to positional candidate genes through upstream regulators), and SUCLG2 as one positional candidate gene, associated with dairy heifer fertility¹⁹. TP53, one of the transcription factors discussed below, was one of the three top upstream regulators of positional candidate genes, while MYH10 was a regulator target gene, associated with beef cattle fertility²⁰.

It is well accepted that progesterone concentrations regulate the endometrial expression of genes determining uterine receptivity, and it plays a key role in pregnancy establishment and conceptus development^21,22. Therefore, to explore the action of this hormone on the expression of these 50 genes, additional external data from two studies were re-analyzed. In one study, pregnant heifers presented normal or high levels of progesterone from day 3 of the estrous cycle²³. Although all heifers were pregnant, only one from the high-progesterone group was classified as such according to the expression of the 50 genes (Fig. 1A). Forde et al.²³ concluded that progesterone supplementation during early pregnancy advances endometrial gene expression in cattle, and so the endometria of those animals probably reflected changes occurring later in the estrous cycle. In the other study, ovariectomized cows were treated with progesterone for six days, receiving an injection of estradiol benzoate at the end of the treatment or not²⁴. Only samples obtained from animals receiving the estradiol treatment were classified as receptive, according to the expression of the 50 genes (Fig. 1B).

Thus, these results suggest that the expression of the 50 genes is temporally regulated and their differences in expression between R and nonR animals would occur at around day 7 of the estrous cycle but not later in the cycle. Also, these genes probably are responding to the increasing estradiol levels of the first wave together with the increased levels of progesterone²⁵, but not to progesterone alone, although this fact should be confirmed by experimental studies in vivo.

In addition, and to explore more deeply the biological significance of the 50 genes, related genes were inferred in a network analysis (Supplementary Fig. 2). These related genes might not behave as crucial genes during the early luteal phase period, but their expression, or their products, or the pathway(s) they shared with the key genes, could be affected later. We cannot know exactly the number of genes that would be regulated by the key genes, and so the arbitrary top 100 genes were explored, together with the significantly enriched biological processes by these genes, respectively for the biomarker genes that showed higher or lower expression in the R cows.

One of the biomarker genes with higher expression in the R cows was TP53, which is well known as a tumor suppressor because its protein (p53) regulates cell division by keeping cells from growing and dividing (proliferating) too fast or in an uncontrolled way. In response to different stress signals, p53 can hold cell division at both the G1/S and G2/M checkpoints, in order to prevent chromosomal replication during the cell cycle if DNA damage is present, and even to induce cell apoptosis^26,27. The actions of p53 are critical to avoid tumor development, but it also regulates many cellular processes, including metabolism, antioxidant response, and DNA repair²⁶. Interestingly, regulation of transcription and cell death/apoptosis were biological processes enriched by the genes related to those with higher expression in the R cows, while M phase of the cell cycle, chromosome segregation and lipid oxidation were enriched by the genes related to those with lower expression in the R cows. Furthermore, many steps involved in implantation in the human, such as apoptosis and angiogenesis, are regulated by p53, and thus this protein could play a broader role in the survival of the species by optimizing embryo implantation²⁸. This study suggests that early expression of TP53 is critical for uterine receptivity in the cow as well. In the report from Ponsuksili et al.⁹, the authors found that activated TP53 was associated with the endometria of highly receptive cows, although on day 3 (not on day 7). Thus, the role of TP53 expression during the early-luteal phase in the bovine endometrium deserves further investigation.

Another biological process enriched by the genes related to those with higher expression in the R cows was the Wnt signaling pathway. The key genes involved in it were TLE4 and ROR2. On the other hand, CTBP2, a transcription co-factor with higher expression in the nonR cows, also participates in the Wnt pathway. TLE4 is a transcriptional co-repressor whose products, and the ones encoded by TLE1-3, inhibit the transcriptional activation mediated by the nuclear β-catenin CTNNB1 and TCF family members in the canonical Wnt signaling pathway. Conversely, HHEX, a transcription factor more expressed in the R cows, acts early in embryo development to enhance canonical Wnt signaling by repressing expression of TLE4²⁹. ROR2 signals through a Wnt-responsive, β-catenin-independent pathway and suppresses the canonical Wnt/β-catenin signal³⁰. Finally, CTBP2 associates with major components of the β-catenin destruction complex and limits the accessibility of β-catenin to core transcription factors in undifferentiated embryonic stem cells, which allows exit from pluripotency³¹. In the cow, maternally derived Wnt are important for the development of the preimplantation embryo³². Therefore, the expression of these biomarker genes identified in the present study could play a crucial role in the regulation of the canonical and non-canonical Wnt pathways in the early-diestrous endometrium.

Lastly, one more biological process that deserves attention is the circadian rhythm, influenced by the transcription factor BHLHE40. The basic helix-loop-helix protein encoded by this gene interacts with the clock genes and modulates the circadian phase of the clock genes, playing a role in the fine regulation and robustness of the molecular clock^33,34. This clock is highly important in reproductive tissues, including the regulation of the uterine function, although more studies are needed to define its role in the endometrial receptivity (reviewed by Sen and Hoffmann³⁵).

Study strengths and limitations: This work embraces the output from five studies employing four breeds with distinct purposes (dairy, beef and dual-purpose) and different techniques for sample collection, which are factors that can influence endometrial gene expression. However, the relative difference of expression between R and nonR animals, for all the biomarker genes (except for five), was similar in all the breeds before correction for the experimental effect (Supplementary Fig. 4). In other words, if samples are taken from a given cattle breed, using the same technique, and at around day 7 of the estrous cycle, these genes are expected to show differences in expression between R and nonR animals. Even when the differences are subtle, the overall behavior of these key genes would help to define those animals with a higher uterine capacity to support pregnancy.

On the other hand, the establishment of pregnancy is a complex process that depends not only on the receptive endometrium but also on embryonic viability and synchrony of actions between both parts (reviewed by Spencer et al.²¹). Therefore, we cannot expect that the sole expression of these 50 genes, identified by mathematical approaches, could determine with such high accuracy which animals would become pregnant. However, we believe that our results could be of great help in understanding the characteristics of a receptive endometrium at the time of ET and provide the basis for further studies.

Conclusion

In summary, the application of supervised and unsupervised ML approaches for multi-transcriptomic data integration and target/gene selection, allowed the identification of a group of 50 endometrial genes with high predictive capability (96.1%) to define uterine receptivity in Taurine cattle at around seven days of the estrous cycle, despite the animal’s breed and category. From a data science perspective, results show the scope and power of ML methods in multi-transcriptomic studies and from a biological perspective, results highlight the concept of the strong influence of the maternal environment for pregnancy establishment, which is determined independently of the presence of the embryo.

Methodology

High-throughput datasets

Five transcriptomic datasets were downloaded from a public functional genomic data repository: Gene Expression Omnibus (GEO) from the National Center for Biotechnology Information^36,37. These studies were selected because they all have in common that endometrial samples from Bos taurus taurus animals were obtained at day 6–7 of the estrous cycle, and they were classified as pregnant (n = 26) or not (n = 26) after ET or AI, depending on the study. The accession number and main characteristics of each dataset are shown in Table 1.

Only our dataset (GSE115756⁷) used RNA-sequencing technology (Illumina HiSeq 2500 platform). The other datasets measured gene expression through microarray technology. The study GSE107741 used the Agilent-023647 B. taurus (Bovine) Oligo Microarray v2, while the other three (GSE29853, GSE36080, GSE20974) employed the Affymetrix Bovine Genome Array platform.

Data integration

The R software platform³⁸ was employed in the following procedures. The raw counts obtained from the RNA-sequencing in our data were transformed through the variance stabilizing transformation method³⁹, using the vst function from the DESeq2 package for R⁴⁰. This transformation removes the dependence of the variance on the mean and produces transformed data on the log2 scale, which has been normalized with respect to library size or other normalization factors. The raw data obtained from samples hybridized to the Affymetrix or Agilent platforms were processed with the gcRMA⁴¹ or limma⁴² packages, respectively. Data were imported into R, background corrected, and then transformed and normalized using the quantile normalization method. Next, rows of each data set were collapsed, in order to retain the microarray probe with the highest mean value from the group of the genes with the same Ensembl ID.
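For readers unfamiliar with the two generic operations mentioned here, the following sketch illustrates quantile normalization and the collapsing of probes to one row per gene. It is a simplified Python illustration under our own assumptions, not the gcRMA or limma implementation used in the study: ties are not treated specially, and the function names, probe names and gene identifiers are hypothetical.

```python
from statistics import mean


def quantile_normalize(columns):
    """Force every sample to share the same empirical distribution.

    columns: dict mapping sample name -> list of values (same probe order).
    """
    names = list(columns)
    n = len(columns[names[0]])
    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(columns[s]) for s in names]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n)]
    normalized = {}
    for s in names:
        order = sorted(range(n), key=lambda i: columns[s][i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace each value by its rank's reference
        normalized[s] = out
    return normalized


def collapse_probes(probe_to_gene, expression):
    """Keep, for each gene, the probe with the highest mean across samples.

    expression: dict mapping probe -> list of values across samples.
    """
    best = {}
    for probe, values in expression.items():
        gene = probe_to_gene[probe]
        if gene not in best or mean(values) > mean(expression[best[gene]]):
            best[gene] = probe
    return {gene: expression[probe] for gene, probe in best.items()}


# Toy data: three probes measured in two samples.
raw = {"s1": [5.0, 2.0, 3.0], "s2": [4.0, 1.0, 6.0]}
norm = quantile_normalize(raw)

# Hypothetical probe-to-gene mapping: p1 and p2 target the same gene.
probe_to_gene = {"p1": "gene_A", "p2": "gene_A", "p3": "gene_B"}
by_probe = {p: [norm["s1"][i], norm["s2"][i]] for i, p in enumerate(["p1", "p2", "p3"])}
collapsed = collapse_probes(probe_to_gene, by_probe)
print(norm)
print(collapsed)
```

After normalization both toy samples contain the same set of values (1.5, 3.5 and 5.5), and gene_A is represented by probe p1, whose mean is higher than that of p2.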

Therefore, a table with transformed and normalized gene expression values for each sample was generated for each of the five studies, using the same identifier for the transcripts (Ensembl ID). These tables were integrated into a single table containing the expression of 9850 annotated transcripts for the 52 samples in total (only transcripts with expression values for all the samples were retained). Next, the batch effects (i.e., the fact that the data were obtained from different studies) were removed with the ComBat function from the sva package⁴³. A multidimensional scaling analysis (MDS) was employed to evaluate between samples similarities before and after the batch removal (Supplemental Fig. 3), with the Glimma package⁴⁴.
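The integration logic can be sketched as follows, in Python and for illustration only. The first function retains only the transcripts present in every study, as described above. The second function is explicitly not ComBat: it is a much simpler stand-in (centering each gene within each study) that we assume here only to convey what removing a study-level shift means; ComBat additionally adjusts scale and uses empirical Bayes shrinkage. All names and values are hypothetical, and sample names are assumed to be unique across studies.

```python
from statistics import mean


def integrate(studies):
    """Merge per-study tables, keeping genes measured in all studies.

    studies: dict study -> dict gene -> dict sample -> value.
    Returns (merged, batch_of): gene -> sample -> value, and sample -> study.
    """
    shared = set.intersection(*(set(table) for table in studies.values()))
    merged = {g: {} for g in sorted(shared)}
    batch_of = {}
    for study, table in studies.items():
        for g in shared:
            for sample, value in table[g].items():
                merged[g][sample] = value
                batch_of[sample] = study
    return merged, batch_of


def center_by_batch(merged, batch_of):
    """Simplified stand-in for batch correction: subtract, for each gene,
    the mean of the study each sample belongs to. This is NOT ComBat."""
    centered = {}
    for g, values in merged.items():
        batch_means = {
            b: mean(v for s, v in values.items() if batch_of[s] == b)
            for b in set(batch_of.values())
        }
        centered[g] = {s: v - batch_means[batch_of[s]] for s, v in values.items()}
    return centered


# Toy data: g3 is missing from study_B, so it is dropped on integration.
studies = {
    "study_A": {
        "g1": {"a1": 8.0, "a2": 10.0},
        "g2": {"a1": 6.0, "a2": 6.5},
        "g3": {"a1": 1.0, "a2": 2.0},
    },
    "study_B": {
        "g1": {"b1": 3.0, "b2": 5.0},
        "g2": {"b1": 2.0, "b2": 2.5},
    },
}
merged, batch_of = integrate(studies)
centered = center_by_batch(merged, batch_of)
print(sorted(merged))
print(centered["g1"])
```

In the toy example, the large offset between the two studies for g1 disappears after centering, while the within-study differences between samples are preserved.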

Selection of biomarker genes through supervised ML

The details about each step followed by the BioDiscML software¹² are specified in the reference and in the GitHub page (https://github.com/mickaelleclercq/BioDiscML). Briefly, a first sampling step separates the data into a train and a test set (2/3 and 1/3, respectively, by default), which are later used to assess the model; alternatively, the user can define these sets. We chose this last option instead of a random separation of the data in order to have samples from all the breeds in each set. The training set consisted of 34 samples (11 from Holstein, 7 from Japanese Black, and 8 each from Charolais x Limousine and Simmental animals). The test set consisted of 18 samples (6 from Holstein, and 4 each from Japanese Black, Charolais x Limousine and Simmental animals).

As a second step, a feature-ranking algorithm sorts the features (or genes) based on their predictive power with respect to the class (R or nonR), retaining only the best 1000 genes. Next, two methods are employed for searching and selecting the potential biomarker genes: top k features and stepwise, for each ML algorithm and each optimization evaluation criterion. At each iteration, the created model is evaluated by tenfold cross validation and the selected genes are retained if the predictive performance is improved. When the signatures of biomarker genes are identified, the models are evaluated again. Finally, it is possible to let the software select the best model (or combine the best ones), or this step can be done manually. For this, one of the output files describes each model with its associated performance metrics and the list of corresponding genes.

For this study, we manually selected those models with a prediction accuracy higher than 90% in the test set and in the following evaluation procedures: tenfold cross validation, leave-one-out cross validation, repeated holdout and bootstrapping in the train set; and repeated holdout in the whole set.
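This manual selection amounts to a simple filter requiring every evaluation to exceed the threshold. The Python sketch below illustrates the rule only; the record layout, field names and accuracy values are hypothetical and do not reproduce the BioDiscML output format or the results of this study.

```python
# Hypothetical model records: field names and numbers are illustrative only.
models = [
    {"id": "model_1", "test": 0.94, "cv10": 0.93, "loocv": 0.92,
     "holdout_train": 0.91, "bootstrap": 0.92, "holdout_all": 0.93},
    {"id": "model_2", "test": 0.95, "cv10": 0.88, "loocv": 0.91,
     "holdout_train": 0.92, "bootstrap": 0.93, "holdout_all": 0.94},
    {"id": "model_3", "test": 0.89, "cv10": 0.95, "loocv": 0.94,
     "holdout_train": 0.93, "bootstrap": 0.94, "holdout_all": 0.95},
]

# One entry per evaluation named in the text above.
METRICS = ("test", "cv10", "loocv", "holdout_train", "bootstrap", "holdout_all")
THRESHOLD = 0.90


def passes(model, threshold=THRESHOLD):
    """A model is kept only if every evaluation exceeds the threshold."""
    return all(model[m] > threshold for m in METRICS)


selected = [m["id"] for m in models if passes(m)]
print(selected)
```

In this toy example only model_1 is retained: model_2 fails on tenfold cross validation and model_3 fails on the test set, even though both score well elsewhere.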

Identification of the best group of biomarker genes through unsupervised ML

The groups of genes identified as potential biomarkers were evaluated for their ability to blindly cluster apart the R and nonR samples according to their expression levels. For this, a hierarchical clustering was employed, using Spearman Rank Correlation as similarity metric and complete linkage as clustering method, implemented with the Cluster 3.0 software⁴⁵. The resulting dendrogram and the heat map were visualized with Java TreeView⁴⁶.
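The two ingredients of this clustering, the Spearman rank correlation and the complete linkage rule, can be illustrated with a short Python sketch. It is not the Cluster 3.0 implementation; we assume here that the similarity is converted to a distance as one minus the correlation, and the toy expression profiles are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_distance(x, y):
    # Assumed conversion from similarity to distance.
    return 1.0 - spearman(x, y)


def complete_linkage(cluster_a, cluster_b, dist=spearman_distance):
    """Distance between two clusters: the largest pairwise distance."""
    return max(dist(a, b) for a in cluster_a for b in cluster_b)


# Toy expression profiles of three samples across four genes.
s1 = [1.0, 2.0, 3.0, 4.0]
s2 = [2.0, 4.0, 5.0, 9.0]   # same ordering as s1
s3 = [4.0, 3.0, 2.0, 1.0]   # reversed ordering
print(spearman(s1, s2), spearman(s1, s3))
print(complete_linkage([s1], [s2, s3]))
```

Because the correlation depends only on ranks, s1 and s2 are treated as identical profiles despite their different magnitudes, whereas complete linkage judges the distance from s1 to the cluster {s2, s3} by its most dissimilar member, s3.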

The correct clustering of the R and nonR samples for each group of genes was evaluated using a confusion matrix, selecting the genes that, according to their expression, presented the highest accuracy to cluster apart the samples from each group.
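As an illustration of this evaluation, the following Python sketch builds the confusion matrix for the group of 50 genes from the counts reported in the Results (three R samples clustered with the nonR samples and one nonR sample clustered with the R samples). The ordering of the labels is arbitrary and the variable names are our own.

```python
from collections import Counter

# True classes: 26 receptive (R) and 26 non-receptive (nonR) samples.
true_labels = ["R"] * 26 + ["nonR"] * 26

# Cluster membership as reported in the Results for the 50-gene group:
# 23 R samples in the R cluster, 3 R samples in the nonR cluster,
# 25 nonR samples in the nonR cluster, 1 nonR sample in the R cluster.
cluster_labels = ["R"] * 23 + ["nonR"] * 3 + ["nonR"] * 25 + ["R"] * 1

# Confusion matrix keyed by (true class, assigned cluster).
confusion = Counter(zip(true_labels, cluster_labels))
correct = confusion[("R", "R")] + confusion[("nonR", "nonR")]
accuracy = correct / len(true_labels)

for key in [("R", "R"), ("R", "nonR"), ("nonR", "R"), ("nonR", "nonR")]:
    print(key, confusion[key])
print(f"accuracy: {correct}/{len(true_labels)} = {accuracy:.3f}")
```

With 48 of the 52 samples on the diagonal, the accuracy is about 0.923, in agreement with the 92.3% reported for the 50-gene group.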

Validation of the selected set of biomarker genes through supervised ML

Once the set of potential biomarker genes was selected according to unsupervised learning, the next step was to verify whether the expression levels of these genes were able to predict pregnancy status. For this, we applied an ML model different from the ones identified by the BioDiscML software, namely support vector machines (support vector classifier) with linear kernels (SVM). This method was chosen because of its ability to learn well with only a very small number of features, its robustness against the error of models, and its computational efficiency compared to other ML methods⁴⁷. In addition, SVM has been shown to successfully classify cancer tissue samples based on gene expression, from microarray technology⁴⁸ or microarray-RNAseq integrated data⁴⁹.

In order to discern whether the expression of these genes would be able to predict pregnancy across all the bovine breeds, each training set consisted of all the samples except those from a given breed, which formed the test set. Therefore, four pairs of training-test sets were used for classification (Simmental heifers from both studies were considered together). In other words, the training sets consisted of all the samples but the ones from Holstein (n = 35), or Japanese Black (n = 41), or Charolais x Limousine (n = 40), or Simmental animals (n = 40). The corresponding test sets consisted of all the samples from the Holstein (n = 17), or Japanese Black (n = 11), or Charolais x Limousine (n = 12), or Simmental animals (n = 12).
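
The leave-one-breed-out design can be sketched as follows. Only the per-breed sample counts are taken from the text; the sample identifiers are hypothetical placeholders, and the sketch builds the splits without fitting any classifier.

```python
# Per-breed sample counts as stated in the text (52 samples in total).
breed_counts = {
    "Holstein": 17,
    "Japanese Black": 11,
    "Charolais x Limousine": 12,
    "Simmental": 12,
}

# Hypothetical sample identifiers paired with their breed.
samples = [
    (f"{breed}_{i}", breed)
    for breed, n in breed_counts.items()
    for i in range(n)
]

def leave_one_breed_out(samples):
    """Yield (held-out breed, training sample IDs, test sample IDs)."""
    breeds = sorted({breed for _, breed in samples})
    for held_out in breeds:
        train = [s for s, breed in samples if breed != held_out]
        test = [s for s, breed in samples if breed == held_out]
        yield held_out, train, test

# Training and test set sizes for each of the four pairs.
split_sizes = {
    breed: (len(train), len(test))
    for breed, train, test in leave_one_breed_out(samples)
}
```

Holding out a whole breed, rather than random samples, tests whether the model generalizes to animals from a breed it has never seen during training.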

The leave-one-out cross validation method was employed as the internal control for the training dataset. The implementation of the SVM with linear kernel was done with the kernlab package⁵⁰, through the caret package⁵¹ for the R software³⁸.
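
The leave-one-out procedure itself is independent of the classifier and can be sketched in a few lines. To keep the example self-contained, a nearest-centroid classifier is used as a stand-in; it is not the linear-kernel SVM fitted with kernlab and caret in this study, and the expression values are hypothetical.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT an SVM): store the mean profile of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def leave_one_out_accuracy(X, y, fit, predict):
    """Train on all samples but one, predict the left-out sample, repeat for all."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)

# Hypothetical training data: six samples, two genes.
X = [[5.0, 1.0], [5.2, 0.8], [4.9, 1.1], [1.0, 4.0], [1.2, 4.2], [0.9, 3.9]]
y = ["R", "R", "R", "nonR", "nonR", "nonR"]
loo_accuracy = leave_one_out_accuracy(
    X, y, nearest_centroid_fit, nearest_centroid_predict
)
```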

Exploring the biological significance of the selected biomarker genes

As a final step, we investigated the biological meaning of the 50 genes through two methods: predictions in external datasets and functional/network analysis.

Predictions in external datasets: Two datasets were selected to generate the test sets based on the endometrial expression of the 50 biomarker genes: GSE33030²³ and GSE16880²⁴. Both studies employed the Affymetrix Bovine Genome Array (GPL2112) as platform. From GSE33030, only samples belonging to pregnant heifers treated or not with a progesterone device from day 3 (n = 5 per group) were analyzed. From GSE16880, only samples obtained from ovariectomized cows treated with progesterone for 6 days and receiving or not an estradiol injection (n = 3 per group) were analyzed. The raw data were processed with the gcRMA package⁴¹: data were imported into R, background corrected, and then transformed and normalized using the quantile normalization method. Next, rows of each dataset were collapsed to retain, for each group of probes sharing the same Ensembl ID, the microarray probe with the highest mean value. The 50 genes were isolated from each dataset to be used as test sets, and an addon batch effect adjustment of these data against the training data was performed with the bapred package⁵². The training data consisted of the batch-corrected expression of the 50 genes for all the 52 samples described in Table 1. SVM with linear kernels was used as classifier, employing the leave-one-out cross validation method as the internal control, applied with the kernlab package⁵⁰, through the caret package⁵¹ for the R software³⁸.
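
The probe-collapsing rule described above (keep, for each Ensembl ID, the probe with the highest mean value) can be sketched as follows. This is an illustration rather than the R code used in the study; the probe names, gene identifiers and expression values are hypothetical.

```python
from statistics import mean

def collapse_probes(probe_expr, probe_to_gene):
    """Keep one probe per gene: the one with the highest mean expression.

    probe_expr    : dict mapping probe ID -> list of expression values
    probe_to_gene : dict mapping probe ID -> gene ID (probes without a
                    mapping are dropped)
    Returns a dict mapping gene ID -> expression values of the retained probe.
    """
    best = {}
    for probe, values in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        if gene not in best or mean(values) > mean(best[gene][1]):
            best[gene] = (probe, values)
    return {gene: values for gene, (_, values) in best.items()}

# Hypothetical probes: two map to the same gene, one has no annotation.
probe_expr = {
    "probe_1": [7.0, 7.4, 7.2],
    "probe_2": [8.1, 8.3, 8.0],
    "probe_3": [5.5, 5.1, 5.3],
    "probe_4": [9.9, 9.8, 9.7],
}
probe_to_gene = {
    "probe_1": "GENE_X",
    "probe_2": "GENE_X",
    "probe_3": "GENE_Y",
}
collapsed = collapse_probes(probe_expr, probe_to_gene)
```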

Functional/network analysis: The protein class of each gene was obtained through a functional classification with the PANTHER database¹³. Next, in order to expand the knowledge about the genes related to the biomarker genes, a network analysis was performed with Cytoscape v3.7.2¹⁴.

For this, the Ensembl IDs were first converted to the corresponding homologous human Entrez IDs using bioDBnet (https://biodbnet-abcc.ncifcrf.gov/db/db2db.php). Then, the GeneMANIA plugin⁵³, which infers network data, was employed to generate two networks: one for the group of genes increasing in expression, and another for the genes decreasing in expression, in the R cows. The set of functional association data between genes was downloaded from the Homo sapiens database. The up-regulated (or down-regulated) biomarker genes were imported into the GeneMANIA plugin to retrieve the corresponding association network, allowing the program to find the top 100 related genes. The association data employed were genetic or physical interactions (i.e., two genes are functionally associated if the effects of perturbing one gene were found to be modified by perturbations to a second gene, or if their products were found to interact in a protein–protein interaction study) and shared pathway membership (i.e., the genes were in the same reaction within a pathway). Finally, the BiNGO plugin⁵⁴ was applied to find the statistically overrepresented biological processes in the resulting networks.

Data availability

All data were obtained from the public NCBI GEO database.

References

1. Suravajhala, P., Kogelman, L. J. & Kadarmideen, H. N. Multi-omic data integration and analysis using systems genomics approaches: methods and applications in animal production, health and welfare. Genet. Sel. Evol. 48, 38 (2016).
2. Kadarmideen, H. N. & Mazzoni, G. Transcriptomics-genomics data integration and expression quantitative trait loci analyses in oocyte donors and embryo recipients for improving in vitro production of dairy cattle embryos. Reprod. Fertil. Dev. 31, 55–67 (2018).
3. Ghaffari, M. H. et al. Metabolomics meets machine learning: Longitudinal metabolite profiling in serum of normal versus overconditioned cows and pathway analysis. J. Dairy Sci. 102, 11561–11585 (2019).
4. Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 16, 321–332 (2015).
5. Lagani, V., Karozou, A. D., Gomez-Cabrero, D., Silberberg, G. & Tsamardinos, I. A comparative evaluation of data-merging and meta-analysis methods for reconstructing gene-gene interactions. BMC Bioinform. 17(Suppl 5), 194 (2016).
6. Spencer, T. E. & Bazer, F. W. Uterine and placental factors regulating conceptus growth in domestic animals. J. Anim. Sci. 82 E-Suppl, E4–E13 (2004).
7. Mazzoni, G. et al. Characterization of the endometrial transcriptome in early diestrus influencing pregnancy status in dairy cattle after transfer of in vitro-produced embryos. Physiol. Genomics 52, 269–279 (2020).
8. Binelli, M. et al. The transcriptome signature of the receptive bovine uterus determined at early gestation. PLoS ONE 10, e0122874 (2015).
9. Ponsuksili, S. et al. Gene expression and DNA-methylation of bovine pretransfer endometrium depending on its receptivity after in vitro-produced embryo transfer. PLoS ONE 7, e42402 (2012).
10. Salilew-Wondim, D. et al. Aberrant placenta gene expression pattern in bovine pregnancies established after transfer of cloned or in vitro produced embryos. Physiol. Genomics 45, 28–46 (2013).
11. Killeen, A. P. et al. Global gene expression in endometrium of high and low fertility heifers during the mid-luteal phase of the estrous cycle. BMC Genomics 15, 234 (2014).
12. Leclercq, M. et al. Large-scale automatic feature selection for biomarker discovery in high-dimensional OMICs data. Front. Genet. 10, 452 (2019).
13. Thomas, P. D. et al. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 13, 2129–2141 (2003).
14. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
15. Moran, B., Butler, S. T., Moore, S. G., MacHugh, D. E. & Creevey, C. J. Differential gene expression in the endometrium reveals cytoskeletal and immunological genes in lactating dairy cows genetically divergent for fertility traits. Reprod. Fertil. Dev. 29, 274–282 (2017).
16. Clough, E. & Barrett, T. The gene expression omnibus database. Methods Mol. Biol. 1418, 93–110 (2016).
17. Rabaglino, M. B. & Conrad, K. P. Evidence for shared molecular pathways of dysregulated decidualization in preeclampsia and endometrial disorders revealed by microarray data integration. FASEB J. 33, 11682–11695 (2019).
18. Diskin, M. G. & Morris, D. G. Embryonic and early foetal losses in cattle and other ruminants. Reprod. Domest. Anim. 43(Suppl 2), 260–267 (2008).
19. Kiser, J. N. et al. Validation of 46 loci associated with female fertility traits in cattle. BMC Genomics 20, 576 (2019).
20. Neupane, M. et al. Loci and pathways associated with uterine capacity for pregnancy and fertility in beef cattle. PLoS ONE 12, e0188997 (2017).
21. Spencer, T. E., Forde, N. & Lonergan, P. Insights into conceptus elongation and establishment of pregnancy in ruminants. Reprod. Fertil. Dev. 29, 84–100 (2016).
22. Spencer, T. E., Forde, N. & Lonergan, P. The role of progesterone and conceptus-derived factors in uterine biology during early pregnancy in ruminants. J. Dairy Sci. 99, 5941–5950 (2016).
23. Forde, N. et al. Progesterone-regulated changes in endometrial gene expression contribute to advanced conceptus development in cattle. Biol. Reprod. 81, 784–794 (2009).
24. Shimizu, T. et al. Actions and interactions of progesterone and estrogen on transcriptome profiles of the bovine endometrium. Physiol. Genomics 42A, 290–300 (2010).
25. Smith, J. F., Fairclough, R. J., Payne, E. & Peterson, A. J. Plasma hormone levels in the cow: I. Changes in progesterone and oestrogen during the normal oestrous cycle. N. Z. J. Agric. Res. 18, 123–129 (1975).
26. Chen, J. The cell-cycle arrest and apoptotic functions of p53 in tumor initiation and progression. Cold Spring Harb. Perspect. Med. 6, a026104 (2016).
27. Mercer, W. E. Checking on the cell cycle. J. Cell Biochem. Suppl. 30–31, 50–54 (1998).
28. Kang, H. J. & Rosenwaks, Z. p53 and reproduction. Fertil. Steril. 109, 39–43 (2018).
29. Zamparini, A. L. et al. Hex acts with beta-catenin to regulate anteroposterior patterning via a Groucho-related co-repressor and Nodal. Development 133, 3709–3722 (2006).
30. Bainbridge, T. W. et al. Evolutionary divergence in the catalytic activity of the CAM-1, ROR1 and ROR2 kinase domains. PLoS ONE 9, e102695 (2014).
31. Kim, T. W. et al. Ctbp2-mediated β-catenin regulation is required for exit from pluripotency. Exp. Mol. Med. 49, e385 (2017).
32. Tribulo, P., Leão, B. C. D. S., Lehloenya, K. C., Mingoti, G. Z. & Hansen, P. J. Consequences of endogenous and exogenous WNT signaling for development of the preimplantation bovine embryo. Biol. Reprod. 96, 1129–1141 (2017).
33. Honma, S. et al. Dec1 and Dec2 are regulators of the mammalian molecular clock. Nature 419, 841–844 (2002).
34. Nakashima, A. et al. DEC1 modulates the circadian phase of clock gene expression. Mol. Cell Biol. 28, 4080–4092 (2008).
35. Sen, A. & Hoffmann, H. M. Role of core circadian clock genes in hormone release and target tissue sensitivity in the reproductive axis. Mol. Cell Endocrinol. 501, 110655 (2020).
36. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets–update. Nucl. Acids Res. 41, D991–D995 (2013).
37. Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucl. Acids Res. 30, 207–210 (2002).
38. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria. (2020). https://www.R-project.org/.
39. Huber, W., von Heydebreck, A., Sueltmann, H., Poustka, A. & Vingron, M. Parameter estimation for the calibration and variance stabilization of microarray data. Stat. Appl. Genet. Mol. Biol. 2, Article 3 (2003).
40. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
41. Wu, J., Irizarry, R. & Gentry, W. C. F. J. M. J. gcrma: Background Adjustment Using Sequence Information. (2017).
42. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucl. Acids Res. 43, e47 (2015).
43. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
44. Su, S. et al. Glimma: interactive graphics for gene expression analysis. Bioinformatics 33, 2050–2052 (2017).
45. de Hoon, M. J., Imoto, S., Nolan, J. & Miyano, S. Open source clustering software. Bioinformatics 20, 1453–1454 (2004).
46. Saldanha, A. J. Java Treeview–extensible visualization of microarray data. Bioinformatics 20, 3246–3248 (2004).
47. Kecman, V. Support Vector Machines: An introduction. In Support Vector Machines: Theory and Applications. Studies in Fuzziness and Soft Computing (ed. Wang, L.) 1–47 (Springer, 2005).
48. Furey, T. S. et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906–914 (2000).
49. Huang, C. et al. Machine learning predicts individual cancer patient responses to therapeutic drugs with high accuracy. Sci. Rep. 8, 16444 (2018).
50. Karatzoglou, A., Smola, A., Hornik, K. & Zeileis, A. kernlab - An S4 Package for Kernel Methods in R. J. Stat. Softw. 11, 1–20 (2004).
51. Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26 (2008).
52. Hornung, R., Causeur, D., Bernau, C. & Boulesteix, A. L. Improving cross-study prediction through addon batch effect adjustment or addon normalization. Bioinformatics 33, 397–404 (2017).
53. Montojo, J. et al. GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop. Bioinformatics 26, 2927–2928 (2010).
54. Maere, S., Heymans, K. & Kuiper, M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21, 3448–3449 (2005).
55. Salilew-Wondim, D. et al. Bovine pretransfer endometrium and embryo transcriptome fingerprints as predictors of pregnancy success after embryo transfer. Physiol. Genomics 42, 201–218 (2010).


Acknowledgements

M.B.R.’s appointment at the Technical University of Denmark was funded by a grant from Innovation Fund Denmark (7045-00013B).

Author information

Authors and Affiliations

Quantitative Genetics, Bioinformatics and Computational Biology Group, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Petersens Plads, Building 324, 2800, Kgs. Lyngby, Denmark
Maria B. Rabaglino & Haja N. Kadarmideen

Contributions

M.B.R. developed the pipeline for this study, integrated and analyzed the transcriptomic datasets using ML methods, interpreted the results and wrote the first draft. H.N.K. conceived the application of ML methods and improved the draft of this manuscript. Both authors read and approved the final version.

Corresponding author

Correspondence to Haja N. Kadarmideen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Rabaglino, M.B., Kadarmideen, H.N. Machine learning approach to integrated endometrial transcriptomic datasets reveals biomarkers predicting uterine receptivity in cattle at seven days after estrous. Sci Rep 10, 16981 (2020). https://doi.org/10.1038/s41598-020-72988-3

Received: 17 April 2020
Accepted: 07 September 2020
Published: 12 October 2020
DOI: https://doi.org/10.1038/s41598-020-72988-3
Springer Nature Limited

