Gene expression profiling at birth characterizing the preterm infant with early onset infection

Abstract Early onset infection (EOI) in preterm infants <32 weeks gestational age (GA) is associated with a high mortality rate and the development of severe acute and long-term complications. The pathophysiology of EOI is not fully understood and clinical and laboratory signs of early onset infections in this patient cohort are often not conclusive. Thus, the aim of this study was to identify signatures characterizing preterm infants with EOI by using genome-wide gene expression (GWGE) analyses from umbilical arterial blood of preterm infants. This prospective cohort study was conducted in preterm infants <32 weeks GA. GWGE analyses using CodeLink human microarrays were performed from umbilical arterial blood of preterm infants with and without EOI. GWGE analyses revealed differential expression of 292 genes in preterm infants with EOI as compared to infants without EOI. Infants with EOI could be further differentiated into two subclasses and were distinguished by the magnitude of the expression of genes involved in both neutrophil and T cell activation. A hallmark activity for both subclasses of EOI was a common suppression of genes involved in natural killer (NK) cell function, which was independent from NK cell numbers. Significant results were recapitulated in an independent validation cohort. Gene expression profiling may enable early and more precise diagnosis of EOI in preterm infants. Key message Gene expression (GE) profiling at birth characterizes preterm infants with EOI. GE analysis indicates dysregulation of NK cell activity. NK cell activity at birth may be a useful marker to improve early diagnosis of EOI. Electronic supplementary material The online version of this article (doi:10.1007/s00109-016-1466-4) contains supplementary material, which is available to authorized users.

flagged each gene value as GOOD, EMPTY, POOR, NEG or MSR, thus defining different quality measures as outlined in the user's manual. Only gene values flagged as GOOD or EMPTY were used in the following analysis workflow:

2) Removal of genes with a high number of missing values or of values flagged as FALSE:
Genes with missing values in ≥ 50% of all arrays in a group were excluded from the dataset. Genes that were flagged as FALSE in > 50% of arrays in each group were also excluded from the dataset.

4) Normalization of imputed dataset:
Imputed dataset was normalized using quantiles normalization in R and logged to base 2 [2].

5) Array outlier detection:
Dissimilarity matrices of the normalized dataset were generated in AVADIS-Pride to determine outlier arrays within the dataset. No outlier arrays were identified in the data set [7].

6) Statistical analysis of microarrays:
For each gene, the mean value of all technical replicates of an infant was calculated in dChip (4). To identify differentially regulated genes between group 1 (NS) and group 2 (EOI), the dataset was subjected to a two-class rank statistics (Rank products, RP) as described below (5,6). For each gene, a false discovery rate (FDR) ≤ 0.1 was defined as the significance level.

7) Annotation of genes:
Significantly regulated genes were annotated using the web-based annotation tools SOURCE [6] and the Database for Annotation, Visualization and Integrated Discovery (DAVID Bioinformatics Resources 2008) [5] as described in the manual.

8) Enriched functional categories:
Enriched functional categories within the differentially regulated genes were determined using DAVID [5] version 6.7. DAVID is a platform that provides statistical methods (reported as an Enrichment Score or EASE score) to facilitate the biological interpretation of gene lists deriving from microarray analysis. Enriched genes describe a class of genes that have similar functions regardless of their expression level. They appear in a list of interest more often than what would normally be predicted by their distribution among all genes assayed. An EASE score is calculated for likelihood of enrichment of biological processes, molecular functions and cellular component categories using the Gene Ontology (GO) public database. It is derived from an adaption of the Fisher's Exact test. The Enrichment Score for a cluster of similar terms is calculated by using the negative logarithm of the geometrical mean of all EASE scores of the cluster. Comparison of functional annotation analysis results was conducted using DAVID overrepresentation analysis of Biological Process Gene Ontology terms, so called GO BP fat terms, and KEGG pathways with an EASE score of at least 10%.

9) Cluster analysis:
Hierarchical cluster analysis of the significantly regulated genes (FDR ≤ 0.1) was performed using the centroid linkage method and the distance matrix 1 -r in dChip [13].

10) Network analysis:
Identification of networks in the differentially regulated genes was achieved using Ingenuity Pathways Analysis (Ingenuity® Systems, www.ingenuity.com)

Network Generation:
These genes were overlaid onto global molecular networks contained in the Ingenuity Pathways Knowledge Base. Networks of the genes were then algorithmically generated based on their connectivity.

Functional Analysis and network generation:
A data set containing gene identifiers and corresponding expression values was uploaded into the application. Each gene identifier was mapped to its corresponding gene object in the Ingenuity Pathways Knowledge Base. An FDR ≤ 0.1 was set to identify genes whose expression was significantly differentially regulated. The Functional Analysis identified the biological processes that were most significant to the data set. Only genes that met the criteria FDR ≤ 0.1 were assigned to biological processes using the Ingenuity Pathways Knowledge Base. The probability for each biological process was given as a p-value determined by the Fisher's exact test.

Surrogate Variable Analysis (SVA)
SVA and limma were used to account for potential confounding effects as well as further hidden structures.
SVA [12] analysis was conducted according to Leek et al.'s R-package description (Version 3.12.0) [11] and estimated two statistical models: The first model accounted for the confounders identified to correlate with structural differences between the study groups (EOI and non-EOI), i.e. GA, birthweight, and WBC (leucocytes (LEU), segmented neutrophils (segNEU), band neutrophils (bandNEU), juvenile neutrophils, lymphocytes (LYM) and monocytes (MON)). The second model accounted for confounders as well as EOI status, thereby protecting the EOI status. Gene expression data were then fitted using Limma (linear models for microarray analysis) implementing the calculated surrogate variable (SVA) together with the confounding variables in the final statistical analysis [14,15]. The number of surrogate variables used for gene expression adjustment was estimated using Leek's asymptotic conditional singular value decomposition [10].
Limma, a linear model is used to explain the variance of every gene or transcript by the specified predictors. As gene expressions in a microarray experiment are not independent from each other, a hierarchical model is defined best describing the variance for the coefficients and expressions across genes. Prior distributions for the coefficient and expression variances are estimated using an empirical Bayes approach taking the inter-dependency of gene expression into account. Posterior values are obtained by adjusting prior values to the actual observed values [14]. Subsequently, adjusted gene expression data were obtained as residuals of the limma-model and used to calculate Rank Products for EOI status.
Transcripts were considered as statistically significantly associated with either WBC or EOI if FDR ≤ 0.1.

Rank products
The Rank Products method [3,4]was used for identifying differentially expressed genes in the expression data. The method is based on the premise that a gene in an experiment examining n genes in k replicates, has a probability of being ranked first (rank 1) of 1/nk if the lists were entirely random. Therefore, it is unlikely for a single gene to be in the top position in all replicates if this gene was not differentially expressed, i.e., if all null hypotheses were true. More generally, for each gene g To know how significant the changes are and how many of the selected genes are likely to be truly differentially expressed, a simple permutation-based estimation procedure provides a very convenient way to determine how likely it is to observe a given RP value in a random experiment by converting the RP value to an E value in analogy to the BLAST results [1]. The RP value distribution can be approximated in each case by calculating the RP values for a number of z random "experiments" with the same number of replicates and "genes" as the real experiment. Each random experiment consists of k random permutations of the numbers 1,…,n and for these the RP values are calculated as Subsequently, for each gene g a conservative estimate of the percentage of false-positives (PFP) is calculated: q g =E(RP g )/rank( g ). Here, rank(g) denotes the position of gene g in a list of all genes sorted by increasing RP value, i.e., it is the number of genes accepted as significantly regulated. This estimates the false discovery rate (FDR) [16] and provides a flexible way to assign a significance level to each gene. The FDR is accepted as a reasonable significance threshold in microarray studies [16].
One can now decide how large a PFP would be acceptable and extend the list of accepted genes up to the gene with this q g value. The rank product method was chosen since it has been shown to outperform classical t-statistic and moderated t-statistics when datasets have low numbers of samples or high levels of noise [3,8].

RT-PCR
Real time RT-PCR was performed for 10 human target genes deriving from the microarray results  EOI: early onset infection; GA: gestational age; IUGR: intrauterine growth restriction; PROM premature rupture of membranes; ANCS: antenatal corticosteroids: complete course including two doses of betamethasone given >24 hours prior to birth, last dose <7 days before birth; * any ANCS before birth; CRIB: critical risk index for babies; RDS: respiratory distress syndrome; IVH: intraventricular haemorrhage; BPD: bronchopulmonary dysplasia; ROP: retinopathy of prematurity Supplemental * Amplification efficiencies for the target genes were calculated as E = 10(-1/S) -1, were S is the slope of the standard curve.