Introduction

Polycystic ovary syndrome (PCOS) is one of the most prevalent endocrine disorders worldwide, affecting an estimated one in 15 women [1]. PCOS exposes patients to a major psychosocial burden and is characterized by hyperandrogenism and chronic anovulation [2]. Diabetes, heart disease, obesity, non-alcoholic fatty liver disease and hypertension are risk factors associated with PCOS [3,4,5,6,7]. Therefore, it is of prime importance to identify the etiological factors, molecular mechanisms and pathways involved in order to discover novel diagnostic markers, prognostic markers and therapeutic targets for PCOS.

Numerous research strategies have recently been used to investigate the molecular mechanisms of PCOS. Among them, high-throughput RNA sequencing technology has received extensive attention and has generated significant advances in the field of endocrine disorders, with marked clinical applications ranging from molecular diagnosis to molecular classification, from patient stratification to prognosis prediction, and from discovery of new drug targets to response prediction [8]. In addition, gene expression profiling investigations of PCOS have been performed using high-throughput RNA sequencing, and several key genes and diagnostic biomarkers have been identified for this syndrome, including many differentially expressed genes (DEGs) involved in different pathways, biological processes or molecular functions [9]. Integrated bioinformatics analyses of expression profiling by high throughput sequencing data derived from different investigations of PCOS could help to identify novel diagnostic and prognostic markers and to further demonstrate their related functions and potential as therapeutic targets in PCOS.

Therefore, in the current investigation, the dataset GSE84958 was retrieved from the publicly available Gene Expression Omnibus database (GEO, http://www.ncbi.nlm.nih.gov/geo/) [10] to identify DEGs and the associated biological processes in PCOS using comprehensive bioinformatics analyses. The DEGs were subjected to functional enrichment and pathway analyses; moreover, a protein-protein interaction (PPI) network, a miRNA - target gene regulatory network and a TF - target gene regulatory network were constructed to screen for key genes, miRNAs and transcription factors (TFs). The aim of this investigation was to identify key genes and pathways in PCOS using bioinformatics analysis, and then to explore the molecular mechanisms of PCOS and identify new potential diagnostic and therapeutic biomarkers of PCOS. We anticipate that these investigations will provide further understanding of PCOS pathogenesis and advancement at the molecular level.

Materials and Methods

RNA sequencing data

The expression profiling by high throughput sequencing dataset GSE84958 was downloaded from NCBI-GEO, a public database of next-generation sequencing data, to filter the DEGs between PCOS and normal control samples. GSE84958 was based on the GPL16791 platform (Illumina HiSeq 2500 (Homo sapiens)) and consisted of 30 PCOS samples and 23 normal control samples.

Identification of DEGs

The limma package [11] in R/Bioconductor was used to identify the DEGs between PCOS samples and normal control samples in the expression profiling by high throughput sequencing data of GSE84958. The adjusted P-value and log fold change (logFC) were calculated. The Benjamini & Hochberg false discovery rate method was used to adjust the P-values in limma [12]. Statistically significant DEGs were identified according to adjusted P<0.05, with logFC > 2.5 for up regulated genes and logFC < -1.5 for down regulated genes. All DEG results were downloaded in text format, and hierarchical clustering analysis was conducted.
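
The correction and thresholding step described above can be illustrated with a minimal Python sketch. It is not the limma workflow itself (limma's moderated statistics are not reproduced here); it only shows, under the assumption that raw P-values and logFC values are already available, how a Benjamini & Hochberg adjustment and the asymmetric logFC cut-offs could be applied. The function names and the gene labels in the example are hypothetical.

```python
from typing import Dict, List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def classify_degs(genes: List[str], log_fc: List[float], p_values: List[float],
                  up_cutoff: float = 2.5, down_cutoff: float = -1.5,
                  alpha: float = 0.05) -> Dict[str, List[str]]:
    """Split genes into up and down regulated sets using the stated cut-offs."""
    adjusted = benjamini_hochberg(p_values)
    result: Dict[str, List[str]] = {"up": [], "down": []}
    for gene, lfc, q in zip(genes, log_fc, adjusted):
        if q < alpha and lfc > up_cutoff:
            result["up"].append(gene)
        elif q < alpha and lfc < down_cutoff:
            result["down"].append(gene)
    return result


# Hypothetical toy input (not taken from GSE84958).
example = classify_degs(["G1", "G2", "G3", "G4"],
                        [3.0, -2.0, 1.0, -1.0],
                        [0.01, 0.04, 0.03, 0.005])
print(example)
```

Because the up and down cut-offs differ (2.5 versus -1.5), the filter is applied to the signed logFC rather than to its absolute value.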

GO and pathway enrichment of DEGs in PCOS

To reflect gene functions, GO (http://geneontology.org/) [13] annotation was used in three categories: biological process (BP), cellular component (CC) and molecular function (MF). ToppGene (ToppFun) (https://toppgene.cchmc.org/enrichment.jsp) [14] is an online database offering a comprehensive collection of resources for functional annotation to recognize the biological significance behind a large list of genes. In the present study, the functional enrichment analyses of DEGs, including GO analysis and REACTOME (https://reactome.org/) [15] pathway enrichment analysis, were performed using ToppGene with the cut-off criteria P-value<0.05 and gene enrichment count>2.
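
As a hedged illustration of what an enrichment P-value measures, the sketch below computes an over-representation probability under a hypergeometric model, which is a common choice for this kind of analysis. Whether it matches the exact statistic and settings used by ToppGene in this study is an assumption; the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe: int, term_size: int,
                           list_size: int, overlap: int) -> float:
    """P(X >= overlap) when list_size genes are drawn without replacement
    from a universe in which term_size genes carry the annotation."""
    upper = min(term_size, list_size)
    total = comb(universe, list_size)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(comb(term_size, k) * comb(universe - term_size, list_size - k)
               for k in range(overlap, upper + 1))
    return tail / total


# Toy example: 10 genes in the universe, 4 annotated to a term,
# 3 genes in the query list, 2 of which carry the annotation.
p = hypergeom_enrichment_p(universe=10, term_size=4, list_size=3, overlap=2)
print(round(p, 4))

# A term would pass the cut-offs stated above if p < 0.05 and overlap > 2.
passes = p < 0.05 and 2 > 2
print(passes)
```

In practice the P-values of all tested terms would also be corrected for multiple testing before interpretation.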

PPI networks construction and module analysis

The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING: http://string-db.org/) is an online biological database and website designed to evaluate PPI information [16]. Proteins associated with DEGs were selected based on information in the STRING database (PPI score >0.7), and PPI networks were then constructed using Cytoscape software (http://cytoscape.org/) [17]. In this investigation, node degree [18], betweenness centrality [19], stress centrality [20] and closeness centrality [21], which are fundamental parameters in network theory, were used to evaluate the nodes in the network. The topological properties of hub genes were calculated using the Cytoscape plugin Network Analyzer. PEWCC1 (http://apps.cytoscape.org/apps/PEWCC1) [22], a plugin for Cytoscape, was used to screen the modules of the PPI network. The criteria were set as follows: degree cutoff=2, node score cutoff=0.2, k-core=2 and maximum depth=100. Moreover, GO and pathway enrichment analyses were performed for the DEGs in these modules.
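
To make two of the topological parameters concrete, the following minimal sketch filters scored interactions at the stated threshold (score > 0.7) and computes node degree and closeness centrality on the resulting undirected graph. It is a plain-Python illustration, not the Network Analyzer implementation; the closeness definition used (reachable nodes divided by the sum of shortest-path lengths) is one common convention and is an assumption here, as are the node labels and scores.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(scored_edges: Iterable[Tuple[str, str, float]],
                min_score: float = 0.7) -> Graph:
    """Keep only interactions whose score exceeds min_score."""
    graph: Graph = {}
    for a, b, score in scored_edges:
        if score > min_score:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def degree(graph: Graph) -> Dict[str, int]:
    """Number of direct interaction partners of each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def closeness(graph: Graph, source: str) -> float:
    """(reachable nodes) / (sum of shortest-path lengths), via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical toy interactions (labels and scores are illustrative only).
edges = [("A", "B", 0.9), ("A", "C", 0.8), ("A", "D", 0.95),
         ("B", "C", 0.5), ("D", "E", 0.75)]
ppi = build_graph(edges)
print(degree(ppi))
print(closeness(ppi, "A"))
```

In this toy network node A has the highest degree and closeness, which is the sense in which a hub gene is central to the network.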

Construction of miRNA - target regulatory network

Furthermore, the miRNAs regulating the significant target genes were predicted using the miRNet database (https://www.mirnet.ca/) [23], where miRNAs shared a common target gene. Finally, the miRNA - target gene regulatory network, depicting interactions between miRNAs and their potential targets in PCOS, was visualized using Cytoscape.

Construction of TF - target regulatory network

Furthermore, the TFs regulating the significant target genes were predicted using the TF database (https://www.mirnet.ca/) [23], where TFs shared common target genes. Finally, the TF - target gene regulatory network, depicting interactions between TFs and their potential targets in PCOS, was visualized using Cytoscape.

Receiver operating characteristic (ROC) curve analysis

ROC curves are used to evaluate classifiers in bioinformatics applications. To further assess the predictive accuracy of the hub genes, ROC analysis was performed to discriminate PCOS from normal control samples. ROC curves for hub genes were generated using the pROC package in R [24], based on the expression profiling by high throughput sequencing data of the hub genes. The area under the ROC curve (AUC) was determined and used to compare the diagnostic value of the hub genes.
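
The AUC has a simple interpretation that a short sketch can make explicit: it equals the probability that a randomly chosen PCOS sample has a higher expression value than a randomly chosen control sample, with ties counted as one half. The code below is a plain-Python illustration of that pairwise definition, not the pROC implementation, and the expression values are hypothetical.

```python
from typing import Sequence


def roc_auc(case_values: Sequence[float], control_values: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs in which the case
    value is higher, counting ties as half a pair."""
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical expression values for one gene (not from GSE84958).
pcos = [5.1, 6.3, 4.8]
controls = [4.0, 5.0, 4.9]
auc = roc_auc(pcos, controls)
print(round(auc, 3))
```

This definition assumes that higher expression indicates PCOS; for a down regulated gene the same calculation yields a value below 0.5, and 1 - AUC would be the comparable figure.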

Validation of the expression levels of candidate genes by RT-PCR

Total RNA was extracted from a PCOS cell line (UWB1.289 (ATCC® CRL-2945™)) and a normal ovarian cell line (MES-OV (ATCC® CRL-3272™)) using TRI Reagent® (Sigma, USA). The Reverse transcription cDNA kit (Thermo Fisher Scientific, Waltham, MA, USA) and the 7 Flex real-time PCR system (Thermo Fisher Scientific, Waltham, MA, USA) were used for reverse transcription and the real-time quantitative reverse transcriptase polymerase chain reaction (qRT-PCR) assay. Polymerase chain reaction primer sequences are listed in Table 1. β-actin was used as an internal control for quantification. The relative expression levels of target transcripts were calculated using the 2^(-∆∆Ct) method [25]. The thermocycling conditions used for RT-PCR were as follows: initial denaturation at 95°C for 15 min, followed by 40 cycles of 95°C for 10 sec, 60°C for 20 sec and 72°C for 20 sec.
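
The relative quantification step can be written out as a short worked calculation. The sketch below assumes the usual form of the method, in which the target Ct is first normalized to the internal control (here β-actin) within each sample and the normalized values are then compared between the PCOS and control cell lines. The function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target_case: float, ct_reference_case: float,
                        ct_target_control: float, ct_reference_control: float) -> float:
    """Fold change of the target transcript in the case sample relative to
    the control sample, normalized to the reference gene."""
    # Normalize the target to the reference gene within each sample.
    delta_ct_case = ct_target_case - ct_reference_case
    delta_ct_control = ct_target_control - ct_reference_control
    # Compare the normalized values between the two samples.
    delta_delta_ct = delta_ct_case - delta_ct_control
    return 2.0 ** -delta_delta_ct


# Hypothetical Ct values: the target amplifies two cycles earlier in the
# case sample after normalization, so the fold change is 2^2.
fold_change = relative_expression(24.0, 18.0, 26.0, 18.0)
print(fold_change)  # 4.0
```

A fold change above 1 indicates higher expression in the PCOS cell line and a value below 1 indicates lower expression; the calculation assumes approximately 100% amplification efficiency for both target and reference.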

Table 1 Primers used for quantitative PCR

Molecular docking studies

Surflex-Dock docking studies of the standard drug molecules used in polycystic ovary syndrome were performed on the proteins of the over expressed genes, whose structures were collected from the Protein Data Bank (PDB), using perpetual SYBYL-X 2.0 software. All the drug molecules were drawn using ChemDraw software, then imported and saved in SDF format using the free Open Babel software. The protein structures of POLR2K, RPS15, RPS15A and SAA1, as co-crystallised proteins with PDB codes 1LE9, 3OW2, 1G1X and 4IP8, respectively, were extracted from the Protein Data Bank [26,27,28]. Gasteiger-Hückel (GH) charges were applied along with the TRIPOS force field to all the drug molecules, as is standard for the structure optimization process. In addition, energy minimization was achieved using the MMFF94s and MMFF94 methods. Protein preparation was carried out after the protein was imported. The co-crystallized ligand and all water molecules were removed from the crystal structure, hydrogens were added and the side chains were set. For energy minimisation, the TRIPOS force field was used. The interaction efficiency of the compounds with the receptor was represented by the Surflex-Dock score in kcal/mol. To examine the interaction between the protein and the ligand, the best pose was incorporated into the molecular area. The ligand interactions with the receptor were visualized using Discovery Studio Visualizer.

Results

Identification of DEGs

The expression profiling by high throughput sequencing dataset GSE84958, containing PCOS samples and normal control samples, was obtained from the National Center for Biotechnology Information GEO database. The R package “limma” was then used for analysis with the thresholds adjusted P < 0.05, and logFC > 2.5 for up regulated genes and logFC < -1.5 for down regulated genes. All DEGs are displayed in a volcano plot (Fig. 1). A total of 739 DEGs, including 360 up regulated and 379 down regulated genes (Table 2), were identified in PCOS samples compared with normal control samples. The results are shown in the heatmap (Fig. 2).

Fig. 1 Volcano plot of differentially expressed genes. Genes with a significant change of more than two-fold were selected. Green dots represent significantly up regulated genes and red dots represent significantly down regulated genes

Table 2 The statistical metrics for key differentially expressed genes (DEGs)

Fig. 2 Heat map of differentially expressed genes. Legend on the top left indicates log fold change of genes. (A1 – A2 = normal control samples; B1 – B30 = PCOS samples)

GO and pathway enrichment of DEGs in PCOS

All 739 DEGs were used to perform GO and REACTOME pathway enrichment analyses. GO analysis showed that, in terms of BP, the DEGs were significantly enriched in the peptide metabolic process, intracellular protein transport, plasma membrane bounded cell projection organization and cell morphogenesis (Table 3). In terms of CC, organelle envelope, catalytic complex, neuron projection and cell junction were the most significantly enriched GO terms (Table 3). In addition, in terms of MF, the DEGs were enriched in RNA binding; transcription factor binding; DNA-binding transcription factor activity, RNA polymerase II-specific; and ATP binding (Table 3). REACTOME pathway enrichment analysis was used to screen the signaling pathways for the DEGs. These DEGs were mainly involved in translation, respiratory electron transport, the generic transcription pathway and transmembrane transport of small molecules (Table 4).

Table 3 The enriched GO terms of the up and down regulated differentially expressed genes
Table 4 The enriched pathway terms of the up and down regulated differentially expressed genes

PPI networks construction and module analysis

Based on the PPI network analysis, 4141 nodes and 14853 edges were identified in Cytoscape (Fig. 3a). Genes with higher node degree, betweenness centrality, stress centrality and closeness centrality were considered hub genes and may be linked with PCOS. The top 10 hub genes were SAA1, ADCY6, POLR2K, RPS15, RPS15A, ESR1, LCK, S1PR5, CCL28 and CTNND1, and are listed in Table 5. Enrichment analysis demonstrated that module 1 (Fig. 3b) and module 2 (Fig. 3c) might be associated with respiratory electron transport, organelle envelope, catalytic complex, gene expression, signaling by NGF and neuron projection.

Fig. 3 PPI network and the most significant modules of DEGs. a The PPI network of DEGs was constructed using Cytoscape. b The most significant module was obtained from PPI network with 26 nodes and 160 edges for up regulated genes. c The most significant module was obtained from PPI network with 26 nodes and 71 edges for up regulated genes. Up regulated genes are marked in green; down regulated genes are marked in red

Table 5 Topology table for up and down regulated genes

Construction of miRNA - target regulatory network

After combining the results of miRNA-target genes with the interactive network of miRNAs, 281 hub genes and 2138 miRNAs were selected. The genes and miRNAs are shown in Fig. 4a. Specifically, 97 miRNAs (e.g., hsa-mir-8067) regulating RPL13A, 95 miRNAs (e.g., hsa-mir-4518) regulating RPS15A, 71 miRNAs (e.g., hsa-mir-3685) regulating RPLP0, 65 miRNAs (e.g., hsa-mir-1202) regulating ADCY6, 48 miRNAs (e.g., hsa-mir-4461) regulating RPS29, 129 miRNAs (e.g., hsa-mir-8082) regulating CTNND1, 98 miRNAs (e.g., hsa-mir-4422) regulating ESR1, 76 miRNAs (e.g., hsa-mir-548am-5p) regulating NEDD4L, 62 miRNAs (e.g., hsa-mir-6886-3p) regulating KNTC1 and 56 miRNAs (e.g., hsa-mir-9500) regulating NGFR were detected (Table 6).

Fig. 4 a Target gene - miRNA regulatory network between target genes and miRNAs. b Target gene - TF regulatory network between target genes and TFs. Up regulated genes are marked in green; down regulated genes are marked in red; the purple diamond nodes represent the key miRNAs; the blue triangle nodes represent the key TFs.

Construction of TF - target regulatory network

After combining the results of TF-target genes with the interactive network of TFs, 455 hub genes and 274 TFs were selected. The genes and TFs are shown in Fig. 4b. Specifically, 15 TFs (e.g., PER3) regulating RBX1, 13 TFs (e.g., CTCF) regulating RPS15, 12 TFs (e.g., E2F7) regulating RPS20, 11 TFs (e.g., LMO2) regulating ADCY6, 9 TFs (e.g., POLR2H) regulating POLR2K, 122 TFs (e.g., NCOA2) regulating ESR1, 21 TFs (e.g., EBF1) regulating LCK, 18 TFs (e.g., SMAD2) regulating GLI3, 17 TFs (e.g., JUND) regulating NEDD4L and 15 TFs (e.g., FOXO3) regulating CALCR were detected (Table 6).

Receiver operating characteristic (ROC) curve analysis

Moreover, ROC curve analysis using the “pROC” package was performed to assess the capacity of ten hub genes to distinguish PCOS from normal control samples. SAA1, ADCY6, POLR2K, RPS15, RPS15A, CTNND1, ESR1, NEDD4L, KNTC1 and NGFR all exhibited excellent diagnostic efficiency (AUC > 0.7) (Fig. 5).

Fig. 5 ROC curves validating the sensitivity and specificity of hub genes as predictive biomarkers for PCOS prognosis. a SAA1, b ADCY6, c POLR2K, d RPS15, e RPS15A, f ESR1, g LCK, h S1PR5, i CCL28, j CTNND1

Validation of the expression levels of hub genes by RT-PCR

Aiming to further verify the expression patterns of selected hub genes, real-time PCR, which allows quantitative analysis of hub gene expression, was applied. The results showed that the relative expression levels of 10 hub genes including SAA1, ADCY6, POLR2K, RPS15, RPS15A, CTNND1, ESR1, NEDD4L, KNTC1 and NGFR were consistent with the expression profiling by high throughput sequencing (Fig. 6).

Fig. 6 Validation of hub genes by RT-PCR. a SAA1, b ADCY6, c POLR2K, d RPS15, e RPS15A, f ESR1, g LCK, h S1PR5, i CCL28, j CTNND1

Molecular docking studies

In the present analysis, docking simulations were performed to identify the active site conformations and the significant interactions with the receptor binding sites that are responsible for complex stability. The genes over expressed in polycystic ovary syndrome were identified, and the X-ray crystallographic structures of their proteins were selected from the PDB for docking studies. Standard drugs containing a steroid nucleus are most commonly used, either alone or in combination with other drugs. The docking studies of standard molecules containing the steroid ring were carried out using Sybyl X 2.1 drug design software. The docking studies were performed to determine the binding interactions of the standard molecules with the proteins of the identified over expressed genes. One X-ray crystallographic structure was selected for each of the four over expressed genes POLR2K, RPS15, RPS15A and SAA1, namely the co-crystallised proteins of PDB codes 1LE9, 3OW2, 1G1X and 4IP8, respectively (Fig. 7). A total of three drug molecules, ethinylestradiol (ETE), levonorgestrel (LNG) and desogestrel (DSG), were docked with the over expressed proteins to assess their binding affinity. Binding scores greater than 6 are considered good; the three drug molecules obtained binding scores greater than 7 in all cases except DSG with RPS15 (5.674). ETE obtained its highest binding score of 9.943 with SAA1 (PDB code 4IP8), and scores of 8.260, 8.223 and 8.019 with 1G1X, 3OW2 and 1LE9, respectively. LNG obtained its highest binding score of 8.535 with SAA1 (PDB code 4IP8), and scores of 8.351, 7.973 and 7.854 with RPS15, POLR2K and RPS15A (PDB codes 3OW2, 1LE9 and 1G1X), respectively. DSG obtained its highest binding score of 8.273 with POLR2K (PDB code 1LE9), followed by 8.158 with SAA1 (PDB code 4IP8) and 7.745 with RPS15A (PDB code 1G1X), and its lowest binding score of 5.674 with RPS15 (PDB code 3OW2) (Table 7).
ETE and LNG obtained their highest binding scores with protein 4IP8; the hydrogen bonding and other bonding interactions with amino acids are depicted in 2D and 3D (Figs. 8 and 9).

Fig. 7 Structures of designed molecules

Table 6 miRNA - target gene and TF - target gene interaction
Table 7 Docking results of standard drugs on overexpressed proteins

Fig. 8 2D binding of molecule ETE with 4IP8

Fig. 9 3D binding of molecule ETE with 4IP8

Discussion

PCOS is a highly prevalent endocrine disorder characterized by hyperandrogenism and chronic anovulation [29]. If not treated promptly and effectively, PCOS can seriously reduce quality of life. Understanding the syndrome at the molecular level will undoubtedly help to improve its diagnosis and treatment [30]. To date, various biomarkers have been linked with PCOS and might be selected as therapeutic targets, but the detailed mechanism of gene regulation leading to advancement of the syndrome remains elusive [31].

In our investigation, we aimed to identify biomarkers of PCOS and uncover their biological functions through bioinformatics analysis. Dataset GSE84958 was selected as the expression profiling by high throughput sequencing dataset for our analysis. As a result, 360 up regulated and 379 down regulated genes with at least a 4-fold change between PCOS and normal control samples were screened out. ABI3BP protein expression in heart tissue was significantly related to cardiovascular disease [32], but this gene might also be responsible for progression of PCOS. Romo-Yáñez et al [33] revealed that the expression of BNIP3 was linked with diabetes in pregnancy, but this gene might also be responsible for progression of PCOS. F13A1 is an essential regulatory factor associated with PCOS development [34]. An investigation has reported that ITIH4 can promote non-alcoholic fatty liver disease [35], but this gene might also be important for progression of PCOS. Da et al [36] suggested that TET3 plays an important role in controlling the progression of type 2 diabetes, but this gene might also play a key role in PCOS.

GO and pathway enrichment analysis is of great importance for interpreting the molecular mechanisms of the key cellular activities in PCOS. RPS5 [37], RBM3 [38], BAK1 [39], NDUFC2 [40], NDUFS4 [41], NDUFS5 [42], UQCRFS1 [43], COX6B1 [44], NDUFA13 [45], PRMT1 [46], RDX (radixin) [47], EPHB4 [48], SYNE2 [49], DNAH5 [50], NEDD4L [51], PDE4B [52] and CTNND1 [53] play critical roles in cardiovascular disease, but these genes might be linked with development of PCOS. Ostergaard et al [54], Zi et al [55], Kunej et al [56], Van der Schueren et al [57], Jin et al [58], Emdad et al [59], Liu et al [60], Scherag et al [61], Shi and Long [62], Sharma et al [63], Parente et al [64], Saint-Laurent et al [65] and Lee [66] demonstrated that over expression of COA3, PHB (prohibitin), UQCRC1, COX4I1, IFI27, MTDH (metadherin), S100A16, SDCCAG8, GLI2, NTN1, NLGN2, FGFR3 and PTPRN2 could cause obesity, but these genes might be involved in progression of PCOS. Alsters et al [67], Lee et al [68], Shiffman et al [69], Yaghootkar et al [70], Rotroff et al [71], Cheng et al [72], Baig et al [73], Zhang et al [74], Lebailly et al [75], Ferris et al [76], Lempainen et al [77] and McCallum et al [78] reported that high expression of CPE (carboxypeptidase E), RPL13A, CERS2, CCND2, PRPF31, SARM1, PLD1, EPHA4, ARNTL2, BATF3, IKZF4 and MEN1 was associated with diabetes, but these genes might be linked with advancement of PCOS. Wang et al [79], Tian et al [80], Zhang et al [81] and Carr et al [82] demonstrated that over expression of ATP6AP2, FIS1, GRK4 and KCNQ4 was substantially related to hypertension, but these genes might be essential for PCOS progression. Atiomo et al [83], Lara et al [84] and Douma et al [85] reported that NQO1, NGFR (nerve growth factor receptor) and ESR1 could be an index for PCOS. Jin et al [86] reported that GLI3 was associated with non-alcoholic fatty liver disease, but this gene might be linked with development of PCOS.

In the present investigation, the PPI network and its modules showed that a significant number of hub genes might be associated with progression of PCOS. Zhang et al [87] proposed that SAA1 was linked with progression of obesity, but this gene might be important for progression of PCOS. Deng et al [88] indicated that ADCY6 was responsible for development of cardiovascular disease, but this gene might be associated with advancement of PCOS. POLR2K, RPS15, RPS15A, ESR1, LCK (LCK proto-oncogene, Src family tyrosine kinase), S1PR5, CCL28, CTNND1, UQCRQ (ubiquinol-cytochrome c reductase complex III subunit VII), UQCRH (ubiquinol-cytochrome c reductase hinge protein), COX7C, COX6C, COX8A, COX5B, COX6A1, COX7A2L, ARHGAP39, OBSCN (obscurin, cytoskeletal calmodulin and titin-interacting RhoGEF) and TIAM2 might be novel biomarkers for PCOS.

The miRNA-target gene and TF-target gene regulatory networks revealed miRNAs, TFs and target genes that might be involved in PCOS. Hsa-mir-6886-3p was responsible for progression of hypertension [89], but this miRNA might be involved in progression of PCOS. Some investigations determined that expression of PER3 [90] and SMAD2 [91] was associated with diabetes, but these genes might be linked with advancement of PCOS. NCOA2 was found to be associated with advancement of obesity [92], but this gene might be involved in progression of PCOS. Recently, increasing evidence has demonstrated that EBF1 is expressed in coronary artery disease [93], but this gene might be responsible for progression of PCOS. FOXO3 was involved in progression of PCOS [94]. RPLP0, RPS29, KNTC1, hsa-mir-8067, hsa-mir-4518, hsa-mir-3685, hsa-mir-1202, hsa-mir-4461, hsa-mir-8082, hsa-mir-4422, hsa-mir-548am-5p, hsa-mir-9500, RBX1, RPS20, CALCR (calcitonin receptor), CTCF (CCCTC-binding factor), E2F7, LMO2, POLR2H and JUND (jun D proto-oncogene) might be novel biomarkers for PCOS.

Among the three molecules ethinylestradiol, levonorgestrel and desogestrel, ethinylestradiol obtained the highest binding score (c-score) of 9.943 with the protein of PDB code 4IP8, and obtained 8.260, 8.223 and 8.019 with the proteins of PDB codes 1G1X, 3OW2 and 1LE9, respectively. The phenolic -OH group in ring A of ethinylestradiol formed favourable bonding interactions with ALA-14 of chain A, and alicyclic ring B formed pi-pi interactions with TRP-18. Rings B, C and D of ethinylestradiol also formed alkyl and pi-alkyl interactions with TRP-18, ARG-62, TYR-21, PHE-69, ILE-65 and ILE-58. Ethinylestradiol also formed Van der Waals interactions with ACA-61, MET-17, MET-24 and GLN-66. It is assumed that the highest binding score (c-score) of ethinylestradiol is due to the presence of the aromatic ring and the phenolic -OH group.

In conclusion, we used a series of bioinformatics analysis methods to find the crucial genes and pathways associated with PCOS initiation and development from expression profiling by high throughput sequencing data containing PCOS samples and normal control samples. Our investigations provide a more specific molecular mechanism for the advancement of PCOS, as well as detailed information on potential biomarkers and therapeutic targets. However, the interacting mechanisms and functions of these genes need to be confirmed in further experiments.