Background

Gene expression analysis in the post-genomic era, through high-throughput genomic studies, has led to the identification of an enormous number of candidate genes related to pathophysiological conditions or altered signal transduction. One freely available high-throughput database is ‘Unigene’ (http://www.ncbi.nlm.nih.gov/Unigene/). Unigene libraries of interest, representing different treatment conditions, can be digitally ‘pooled’ and compared (control vs. treatment) using Digital Differential Display (DDD). DDD identifies numerical differences in transcript frequency between individual or pooled Unigene libraries from the various treatment conditions and multiple cDNA libraries. For each transcript, the difference in frequency between the pooled libraries is tested with the Fisher Exact Test, and the fold change is derived from the frequency values. DDD prioritises differentially expressed candidate genes strictly by the relative change in frequency and the resulting fold change. Apart from DDD, many web tools are freely available to prioritise candidate genes based on the relative change in gene expression profile [1, 2]. The prioritisation differs between tools because of their different computational approaches [3]. However, identifying the most likely tissue specific disease candidate genes from a pool of differentially expressed genes remains difficult [1].

Recent advances in systems biology have shown promising results in the elucidation of potential biomarkers of phenotype and clinical relevance, particularly in cancer research [4–6]. These studies were performed using the predictive integration of gene expression data. Different predictive integration strategies have been developed and used to study the biological information in public repositories [4–8]. One such strategy rests on the premise that gene products that are biologically and functionally related maintain similarity both in their expression profiles and in their Gene Ontology (GO) annotation [9]. The integration of gene expression data with standardised descriptions of the biological function of gene products has been used to search for candidate prognostic biomarkers and therapeutic targets [10–12]. These studies demonstrated that functional similarity, measured from GO annotations between query genes and genes of interest, can be applied as a complementary predictive feature to characterise gene expression profiles. We therefore applied this integrative computational approach to characterise tissue specific biological data from DDD.

We hypothesised that tissue specific differentially expressed genes can be functionally characterised using their GO semantic similarity score with normal tissue specific genes (query genes). The query genes in this study were normal lung tissue specific genes from the Tissue-Specific Genes Database (TiSGeD). The genes of interest were candidate lung cancer genes from DDD [13, 14]. Surprisingly, this approach successfully distinguished 38 signature biomarkers for lung cancer. This suggests that, in principle, the integrated methodology can offer a complementary predictive capability for detecting tissue specific signature biomarkers from tissue specific differentially expressed data. These tissue specific signature biomarkers may be candidate prognostic biomarkers and therapeutic targets for lung cancer.

Methods

Selection of Human Lung Tissue specific query genes

The normal lung tissue specific genes were collected from TiSGeD (Tissue-specific gene database; http://bioinf.xmu.edu.cn/databases/TiSGeD/index.html). Human adult lung tissue related genes with a tissue specificity measure (SPM) score ≥ 0.9 (representing high tissue specificity) were considered. Lung tissue specific genes annotated as “Mouse” or “developmental” were omitted.

Collection of Lung Tissue specific differentially Expressed Candidate Genes using DDD

DDD comparisons were made at various tissue stages to elucidate the selective differential expression of human lung tissue specific genes under normal (Case 1) and cancerous (Case 2) conditions. In Case 1, the normal lung tissues (11 tissue libraries) were the ‘Reference’ samples and the remaining normal human tissues (251 tissue libraries) were the ‘Query’ samples. In Case 2, the normal lung tissues (11 tissue libraries) were the ‘Reference’ samples and the cancerous human lung tissues (8 tissue libraries) were the ‘Query’ samples. These comparisons were designed to identify altered gene expression between the ‘Reference’ and ‘Query’ samples. The pairwise comparisons yielded the relative abundance of ESTs among the contrasting, digitally ‘pooled’ cDNA libraries from the Unigene database.
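As an illustration of the comparison described above, the per-transcript calculation can be sketched in Python. This is a minimal sketch and not the DDD implementation itself: it assumes that each pool is summarised by the EST count for one transcript and the total EST count of the pool, the function names are hypothetical, and the counts in the example are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    a: ESTs of the transcript in the query pool
    b: all other ESTs in the query pool
    c: ESTs of the transcript in the reference pool
    d: all other ESTs in the reference pool
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x transcript ESTs falling in the query pool
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def fold_change(gene_query, total_query, gene_ref, total_ref):
    """Ratio of transcript frequencies (query pool over reference pool).

    Assumes the transcript has at least one EST in the reference pool.
    """
    return (gene_query / total_query) / (gene_ref / total_ref)


# Invented counts: 30 of 10,000 ESTs in the query pool, 5 of 10,000 in the reference pool
p_value = fisher_exact_two_sided(30, 10_000 - 30, 5, 10_000 - 5)
print(f"fold change = {fold_change(30, 10_000, 5, 10_000):.1f}, p = {p_value:.2e}")
```

The fold change is taken here as query over reference; the opposite convention simply inverts the ratio.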

GO-based similarity assessment

The org.Hs.eg.db annotation package in R was used for the computation of the semantic similarity score, and the GO-based similarity score was computed over the three orthogonal gene ontologies: Molecular Function (MF), Cellular Component (CC) and Biological Process (BP). The GOSemSim package in R was used to calculate semantic similarity between GO terms and between gene products. In this study, GO terms derived from human annotations were used for the calculations. The estimation of between-term similarity was based on the Wang semantic similarity measure [12]. Aggregation of between-term similarities was done with the highest between-term similarity approach, which selectively aggregates maximum between-gene similarity values [9]. Given a pair of gene products, gi and gj, each annotated to a set of GO terms, the GO-driven similarity, SIM(gi, gj), is calculated by aggregating the maximum between-term similarity values as follows:

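\[
\mathrm{Sim}(g_i, g_j) = \frac{\displaystyle \sum_{1 \le k \le m} \max_{1 \le l \le n} \mathrm{Sim}(g_{ik}, g_{jl}) + \sum_{1 \le l \le n} \max_{1 \le k \le m} \mathrm{Sim}(g_{jl}, g_{ik})}{m + n}
\]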

where gi = {gi1, gi2, …, gim} and gj = {gj1, gj2, …, gjn} are the two sets of GO terms annotating the query and reference genes, respectively. The max method takes the maximum semantic similarity score over all pairs of GO terms between the two sets, while the average method takes the average semantic similarity score over all such pairs. The hierarchical clustering of tissue specific, differentially expressed genes in relation to normal lung tissue is shown as a dendrogram. In the colour code of the heat map, red represents low semantic similarity (below the median level), whereas green represents high semantic similarity (above the median level).
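The aggregation rule described above can be sketched in Python as follows. This is only an illustration and not the GOSemSim code: the function name is hypothetical, the term labels are placeholders rather than real GO identifiers, and the term-level scores are invented rather than computed with the Wang measure.

```python
def aggregate_similarity(term_sim, terms_i, terms_j):
    """Gene-level similarity from term-level similarities.

    term_sim: function (term_a, term_b) -> similarity in [0, 1]
    terms_i:  GO terms annotating gene product g_i (m terms)
    terms_j:  GO terms annotating gene product g_j (n terms)
    """
    if not terms_i or not terms_j:
        raise ValueError("both gene products need at least one GO term")
    # For every term of g_i, keep its best match among the terms of g_j
    best_i = sum(max(term_sim(a, b) for b in terms_j) for a in terms_i)
    # For every term of g_j, keep its best match among the terms of g_i
    best_j = sum(max(term_sim(a, b) for a in terms_i) for b in terms_j)
    return (best_i + best_j) / (len(terms_i) + len(terms_j))


# Invented term-level scores between placeholder terms (not Wang scores)
toy_scores = {
    ("t1", "u1"): 0.8, ("t1", "u2"): 0.2,
    ("t2", "u1"): 0.4, ("t2", "u2"): 0.6,
    ("t3", "u1"): 0.1, ("t3", "u2"): 0.5,
}


def toy_term_sim(a, b):
    # Look up the pair in either order
    return toy_scores.get((a, b), toy_scores.get((b, a), 0.0))


score = aggregate_similarity(toy_term_sim, ["t1", "t2", "t3"], ["u1", "u2"])
print(round(score, 2))
```

With these invented scores the best matches sum to 1.9 for the first gene and 1.4 for the second, so the aggregate is 3.3 / 5.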

Clustering analysis

Clustering analysis was carried out with pvclust [15], an add-on package for the statistical software R that performs bootstrap analysis of clustering to assess the uncertainty in hierarchical cluster analysis. The package calculates approximately unbiased (AU) and bootstrap probability (BP) p-values for each cluster. The stability of the clustering was assessed at 95% probability (α = 0.95).
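The idea behind the bootstrap probability can be sketched in Python. This is a simplified illustration and not pvclust itself: it computes only the ordinary BP value (the multiscale resampling behind the AU value is not reproduced), it assumes average linkage on correlation distance with the clustered objects in columns, and the function names and toy data are hypothetical.

```python
import random
from math import sqrt


def correlation_distance(x, y):
    """1 - Pearson correlation; constant profiles are treated as uncorrelated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0
    return 1.0 - sxy / sqrt(sxx * syy)


def average_linkage_clusters(columns):
    """Return every cluster (frozenset of column indices) formed while merging."""
    n = len(columns)
    dist = {(i, j): correlation_distance(columns[i], columns[j])
            for i in range(n) for j in range(i + 1, n)}
    active = [frozenset([i]) for i in range(n)]
    formed = set()

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters
        return sum(dist[(min(i, j), max(i, j))] for i in a for j in b) / (len(a) * len(b))

    while len(active) > 1:
        pairs = [(linkage(a, b), ia, ib)
                 for ia, a in enumerate(active)
                 for ib, b in enumerate(active) if ia < ib]
        _, ia, ib = min(pairs)
        merged = active[ia] | active[ib]
        active = [c for k, c in enumerate(active) if k not in (ia, ib)] + [merged]
        formed.add(merged)
    return formed


def bootstrap_probability(rows, n_boot=200, seed=0):
    """Fraction of bootstrap replicates (rows resampled) in which each cluster reappears."""
    rng = random.Random(seed)
    reference = average_linkage_clusters(list(zip(*rows)))
    counts = {cluster: 0 for cluster in reference}
    for _ in range(n_boot):
        sample = [rows[rng.randrange(len(rows))] for _ in range(len(rows))]
        found = average_linkage_clusters(list(zip(*sample)))
        for cluster in reference:
            counts[cluster] += cluster in found
    return {cluster: counts[cluster] / n_boot for cluster in reference}


# Invented data: columns 0-1 rise together, columns 2-3 fall together
toy_rows = [(1.0, 2.1, -1.0, -2.2), (2.0, 3.9, -2.1, -4.0), (3.0, 6.2, -2.9, -6.1),
            (4.0, 8.0, -4.2, -7.9), (5.0, 9.8, -5.0, -10.1), (6.0, 12.1, -6.1, -12.0)]
for cluster, bp in sorted(bootstrap_probability(toy_rows).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(cluster), bp)
```

A cluster would be retained at the 95% level when its support is at least 0.95; pvclust applies that threshold to the AU value rather than to BP.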

Results

DDD based prioritisation of lung cancer genes

To find the lung tissue specific differentially expressed genes, two UniGene pools (A and B) were constructed (See Additional file 1). In DDD1, UniGene pool (A) represented 39 human normal tissues excluding normal lung tissue and UniGene pool (B) represented the 11 counterpart normal lung tissues (Table 1). Similarly, in DDD2, UniGene pool (A) represented 8 human lung tumours and UniGene pool (B) represented the 11 counterpart normal lung tissues (Table 1). The fold changes of the normal lung (DDD1) and lung carcinoma (DDD2) candidate genes were calculated from the transcript frequency values. Candidate genes with at least a 2-fold difference in expression were taken into the analysis. In DDD1, of the total of 519 differentially expressed genes, 268 were up-regulated (≥2-fold) and 234 were down-regulated (≥2-fold). In DDD2, of the total of 203 differentially expressed candidate genes, 147 genes, including 33 unknown, were up-regulated (≥2-fold) and 55 were down-regulated (≥2-fold). Comparison of DDD1 with DDD2 revealed that, in total, 76 genes from DDD1 were also differentially expressed in DDD2 (See Additional file 2). A literature survey showed that 18 of these 76 genes are commonly expressed in all types of cancerous conditions (See Additional file 3) [16]. Excluding these 18, the remaining 58 genes were predicted to be lung tissue specific tumour genes (See Additional file 2). These 58 genes are involved in a broad range of cellular functions, with the majority playing structural, extracellular or intracellular roles. This subtractive approach eliminated most of the commonly expressed genes, for example housekeeping genes. It also helped to eliminate genes expressed in more than 10 cancerous conditions (See Additional file 4).
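The subtractive filtering described above amounts to a threshold followed by set operations, which can be sketched in Python. This is a minimal sketch under stated assumptions: fold change is taken as the query/reference frequency ratio, so a 2-fold down-regulation corresponds to a ratio of 0.5 or less; the function names are hypothetical and the gene labels are placeholders, not genes from the study.

```python
def regulation(fold_change, threshold=2.0):
    """Classify a query/reference fold change with a symmetric cut-off."""
    if fold_change >= threshold:
        return "up"
    if fold_change <= 1.0 / threshold:
        return "down"
    return None  # below the cut-off in both directions


def lung_specific_tumour_genes(ddd1_genes, ddd2_genes, pan_cancer_genes):
    """Genes differentially expressed in both comparisons, minus pan-cancer genes."""
    shared = set(ddd1_genes) & set(ddd2_genes)
    return shared - set(pan_cancer_genes)


# Placeholder gene labels with invented fold changes
ddd2_fold_changes = {"GENE_A": 6.5, "GENE_B": 0.2, "GENE_C": 1.3, "GENE_D": 2.0}
ddd2_selected = {g for g, fc in ddd2_fold_changes.items() if regulation(fc) is not None}

ddd1_selected = {"GENE_A", "GENE_B", "GENE_E"}
pan_cancer = {"GENE_B"}

print(sorted(lung_specific_tumour_genes(ddd1_selected, ddd2_selected, pan_cancer)))
```

In the study the same two steps reduce the DDD2 list first to 76 shared genes and then to 58 after the 18 commonly expressed genes are removed.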

Table 1 Different tissue specific Unigene libraries employed in DDD

Prediction of Lung tissue specific tumour genes by Semantic similarity score based clustering

To identify lung tissue specific clusters, the 202 genes from the DDD2 cancerous condition were first subjected to similarity clustering analysis against the 47 lung tissue specific genes from TiSGeD (See Additional file 5). Before the semantic similarity clustering analysis, the Unigene IDs were converted into Entrez IDs. During this process, the 202 genes of DDD2 were reduced to 145 and the 47 lung tissue specific genes of TiSGeD were reduced to 28 because of gene duplication. Using the GOSemSim package, a similarity correlation matrix was constructed between the 145 predicted lung specific differentially expressed cancer genes from DDD2 and the 28 genes from TiSGeD. The resulting similarity scores for these genes are depicted as a heat map (Figure 1). The similarity correlation matrix produced seven gene clusters at the 95% confidence level using the pvclust program (Figure 2). Clusters 1–4 have 14 genes, and clusters 5, 6 and 7 have 36, 74 and 14 genes respectively.
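The ID conversion and matrix construction steps can be sketched in Python. This is an illustration only: the mapping table, the identifiers and the similarity function are invented placeholders, and dropping identifiers that have no mapping is an assumption of this sketch (the text above attributes the reduction to gene duplication).

```python
def to_unique_entrez(unigene_ids, unigene_to_entrez):
    """Map UniGene IDs to Entrez IDs, keeping each Entrez ID once, in input order."""
    seen = set()
    unique = []
    for unigene_id in unigene_ids:
        entrez_id = unigene_to_entrez.get(unigene_id)
        if entrez_id is None or entrez_id in seen:
            continue  # unmapped (assumption) or duplicate gene
        seen.add(entrez_id)
        unique.append(entrez_id)
    return unique


def similarity_matrix(candidate_genes, reference_genes, gene_sim):
    """Rows are candidate (cancer) genes, columns are reference (normal lung) genes."""
    return [[gene_sim(c, r) for r in reference_genes] for c in candidate_genes]


# Placeholder identifiers: two UniGene clusters point to the same Entrez gene
toy_mapping = {"Hs.A": 101, "Hs.B": 101, "Hs.C": 102}
candidates = to_unique_entrez(["Hs.A", "Hs.B", "Hs.C", "Hs.D"], toy_mapping)
references = [201, 202]


def toy_gene_sim(candidate, reference):
    # Stand-in for a GO semantic similarity score between two genes
    return 1.0 if (candidate + reference) % 2 == 0 else 0.5


matrix = similarity_matrix(candidates, references, toy_gene_sim)
print(candidates, matrix)
```

Applied to the study data, the rows would be the 145 DDD2 genes and the columns the 28 TiSGeD genes.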

Figure 1

GO semantic similarity scores between the set of normal lung tissue specific genes from TiSGeD (28, horizontal, x-axis) and the differentially expressed lung cancer genes from DDD2 (145, vertical, y-axis). The intensity of the colour corresponds to the magnitude of the similarity. Red represents low semantic similarity (below the median level), whereas green represents high semantic similarity (above the median level).

Figure 2

Average correlation distances with hierarchical clustering, based on the GO semantic similarity score matrix calculated between normal lung tissue specific genes from TiSGeD and differentially expressed lung cancer genes from DDD2. Values in red represent the AU (approximately unbiased) p-value and values in green represent the BP (bootstrap probability) value. Clusters with AU larger than 95% are highlighted by red rectangles. The AU p-value, which is computed by multiscale bootstrap resampling, is a better approximation to the unbiased p-value than the BP value computed by normal bootstrap resampling.

During the ID conversion from Unigene to Entrez, the 58 lung tissue specific tumour genes were reduced to 38 genes (Table 2). These 38 genes were matched against the 7 clusters and formed four panels, corresponding to clusters 4, 5, 6 and 7 respectively. Panels 1–4 contained 2, 9, 21 and 6 genes respectively. This led to the identification of clusters of normal lung tissue specific genes that are differentially regulated in lung cancer.
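The matching of signature genes to clusters can be sketched in Python. This is a minimal sketch with a hypothetical function name; only a few genes named in the panel analyses are used, and `GENE_X` is a placeholder for a signature gene that does not fall in any retained cluster.

```python
from collections import defaultdict


def assign_panels(signature_genes, gene_to_cluster):
    """Group signature genes by cluster and number the occupied clusters as panels."""
    by_cluster = defaultdict(list)
    for gene in signature_genes:
        cluster = gene_to_cluster.get(gene)
        if cluster is not None:
            by_cluster[cluster].append(gene)
    # Panels are numbered in increasing cluster order, starting from 1
    return {panel: (cluster, by_cluster[cluster])
            for panel, cluster in enumerate(sorted(by_cluster), start=1)}


# Subset of cluster memberships taken from the panel analyses
gene_to_cluster = {
    "UCHL1": 4, "LTF": 4,
    "RPSA": 5, "TMSB4X": 5,
    "FTL": 6, "ALDOA": 6,
    "SFTPA1": 7, "CRISP3": 7,
}
signature = ["UCHL1", "LTF", "RPSA", "TMSB4X", "FTL", "ALDOA", "SFTPA1", "CRISP3", "GENE_X"]

for panel, (cluster, genes) in assign_panels(signature, gene_to_cluster).items():
    print(f"Panel {panel} (cluster {cluster}): {', '.join(genes)}")
```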

Table 2 Lung cancer signature biomarker clusters

We then analysed the functional significance of each panel as given below.

Analysis of Cluster 4 / Panel 1

Cluster 4 had two lung cancer related genes, ubiquitin thiolesterase (UCHL1) and lactotransferrin (LTF). In the normal lung (DDD1 data), UCHL1 was down-regulated and LTF was up-regulated (Table 2). This pattern was reversed in lung cancer, where UCHL1 was up-regulated and LTF was strongly down-regulated (Table 2). Both proteins are important in cancer progression. UCH-L1 up-regulation promoted prostate cancer metastasis through induction of epithelial-to-mesenchymal transition (EMT), and LTF expression decreased during prostate cancer progression [17, 18]. Both were co-expressed in almost six different lung adenocarcinoma cell lines, as evident from mSigDB. This suggests that UCH-L1 and LTF could be novel metastasis diagnostic markers and therapeutic targets for lung cancer.

Analysis of Cluster 5 / Panel 2

Cluster 5 shared the common functional role of immune response and complement activation. RPSA, RPL9, TMSB4X and TUBA1B, which were down-regulated in normal lung (DDD1), were significantly up-regulated in lung cancer (DDD2) (Table 1). The analysis indicated that all of these up-regulated genes play a role in tumour cell resistance to anti-cancer agents. In gastric cancers, up-regulation of RPSA/LRP contributed to drug resistance via a hypoxia-inducible-factor dependent mechanism [19]. Similarly, TMSB4X and TUBA1B were linked to resistance to the anti-cancer drug paclitaxel (PTX) in cervical and breast/ovarian cancers respectively [20, 21].

In this cluster, NT5C2, API5, CPN, PRKAR1A and COPB1 were fully down-regulated in lung cancer (Table 1). Down-regulation of NT5C3 altered tumour cell sensitivity to cytidine based anti-cancer drugs [22]. Down-regulation of the anti-apoptosis gene API5 was linked to increased survival and resistance of cancer cells to chemotherapy [23]. To our knowledge, a link between down-regulation of the major copper carrying protein CPN (ceruloplasmin) and chemotherapy/drug resistance has not yet been studied. However, increased copper levels in Lewis lung carcinoma cells were related to the development of multidrug resistance [24]. PRKAR1A down-regulation was also linked to multidrug resistance (MDR) in colon carcinoma cells [25]. COPB1 is an essential component of coatomer formation [26]. Coatomers are involved in drug trafficking pathways and endocytic drug delivery [27]. Down-regulation of COPB1 might therefore have a role in the response to chemotherapy, which needs further study. Taken together, these results suggest that cluster 5 functionally represents a panel of chemotherapy/drug resistance related lung cancer biomarkers.

Analysis of Cluster 6 / Panel 3

In cluster 6, the up-regulated FTL (65-fold in our study) and ALDOA (7-fold in our study) are regulated by hypoxia inducible factor (HIF) during lung cancer [28–31]. COL1A1 (23-fold in our study) and GAPDH (11-fold in our study) are regulated by hypoxia [32–34]. IGKC (8-fold in our study) is up-regulated in lung cancer patients, but no literature data were available on its interaction with either HIF or hypoxia [35]. HIF, TGM2, CSNK1A1, CSNK2A1, CTNNA1, NAMPT/Visfatin, TNFRSF1A, ETS1 and SRC-1 were down-regulated and are proposed as biomarkers for lung cancer. We found all of them to interact with HIF in cancerous conditions [36–45]. The down-regulated FN1 and APLP2 showed hypoxia dependent differential regulation [46–48]. The interaction of DMBT1/SAG with HIF-1 forms a kind of feedback loop in response to hypoxia: hypoxia induces HIF-1 to transactivate SAG, and the induced SAG then promotes HIF-1alpha ubiquitination and degradation [49]. FBJ/c-Jun/AP-1 interacts with HIF during hypoxia to control the transcriptional regulation of the Cyr61 gene in retinal vascular endothelial cells [50]. AIB1/SRC-3/NCoA acts during hypoxia by controlling the expression level of the HIF induced erythropoietin (EPO) gene [42].

However, in this cluster, AZIN1 and TICAM2 were down-regulated and lacked direct experimental evidence for their regulation by HIF or hypoxia during cancer. The following literature analysis suggests their possible regulation by either HIF or hypoxia. AZIN1 is an inhibitor of antizyme; both are highly regulated in human cancers, and antizyme induces HIF during increased cellular redox potential [51–53]. TICAM2 physically bridges toll like receptor-4 (TLR4) with TICAM1, and TLR4 is partially regulated by HIF in adenocarcinoma [54, 55].

All these results suggest that cluster 6 represents a panel of either HIF or hypoxia related lung cancer biomarkers.

Analysis of Cluster 7 / Panel 4

Cluster 7 contained seven lung biomarkers, mostly encoding lung tissue specific extracellular matrix proteins. Epigenetic analysis using the MethyCancer database (http://methycancer.genomics.org.cn) revealed that, of the seven, KIAA1324, NET1, NTN3, RPL10 and TFPI2 are epigenetically regulated through DNA methylation. Of the remaining two, SFTPA1 is epigenetically regulated [56–58]. Experimental evidence of epigenetic regulation was lacking for CRISP3. However, GeneCards database analysis showed that the CRISP3 orthologous gene C-type lectin domain family 18 member A (CLEC18A) is epigenetically regulated through DNA methylation (http://www.genecards.org). All these results suggest that cluster 7 represents a panel of epigenetically regulated, lung cancer specific extracellular matrix biomarkers.

Discussion

The UniGene database, through the DDD tool, provides a computational approach to study lung tissue specific gene expression levels in both disease and normal conditions [59]. Studying differential expression in the disease state (lung cancer) provides clues about lung cancer specific candidate genes. However, candidate identification by the DDD method relies on fold change calculated from EST frequencies. In DDD2, ranking the 203 differentially expressed candidate genes (≥2-fold) by fold change alone did not account for the tissue specific variability of the genes in disease conditions (e.g. for biomarker identification). To include tissue specific variability in the DDD2 prioritisation, the normal lung tissue specific genes from DDD1 were compared. This approach eliminated most of the housekeeping genes from the analysis (gene list reduced from 202 to 76). Further, we detected genes whose expression is selectively altered in lung cancer by eliminating genes that are commonly differentially expressed in more than five tumours (gene list reduced from 76 to 58) (See Additional file 2). Almost all of them have a documented role in lung cancer (http://www.megabionet.org/bio/hlung). These subtractive approaches therefore increase the probability of identifying probable lung cancer specific candidate biomarkers.

The semantic similarity scores amongst the GO terms and the subsequent hierarchical clustering were calculated, using freely available R software, for lung tissue specific candidate genes from normal and cancer conditions. Analysis of the individual genes in each cluster revealed the functional significance of each cluster. Of the seven clusters, our approach identified four functionally important ones. These four clusters represent metastasis diagnostic markers, chemotherapy/drug resistance related biomarkers, HIF or hypoxia induced biomarkers, and epigenetically regulated extracellular matrix biomarkers for lung cancer. This suggests that, at least for lung tissue, the semantic similarity score amongst GO terms between normal and disease conditions from the same tissue can prioritise biomarkers. Further study is necessary to extend our hypothesis to other tissues. The subtractive approach, integrated with the semantic similarity score among GO terms, can offer a predictive capability for detecting tissue specific signature biomarkers from tissue specific differentially expressed data. This approach is also complementary to the network based biomarker prediction approach [60, 61]. Our study is one more example demonstrating the utility of the digital differential expression technique.

Our study suggests that, amongst the 4 panels, the HIF or hypoxia induced lung cancer biomarker panel (panel 3) is the most important. In the other clusters, most of the identified lung cancer biomarkers follow the same expression pattern (either up or down) as in other types of cancer, such as breast, ovarian and cervical cancer. In contrast, in our study and in the literature, the expression pattern of the genes down-regulated in cluster 6 / panel 3 is distinct from that in almost all other types of cancer. In panel 3, the expression pattern of HIF and its modulating proteins is completely different from that in most other types of cancer. For example, in most cancerous conditions the HIF level is up-regulated [62]. This up-regulation is expected because of the acute hypoxic condition exhibited during cancer. In lung cancer, however, the HIF level is completely down-regulated (Table 1).

Our study therefore indicates that HIF down-regulation also affects the expression level of the other HIF modulating lung cancer biomarkers. All the down-regulated genes in Panel 3 are significantly up-regulated in many other types of cancer (TGM2 [63, 64], CSNK1A1 [65], CTNNA1 [66], NAMPT/Visfatin [67], TNFRSF1A [68], ETS1 [41], SRC-1 [69], FN1 [70], APLP2 [71], DMBT1/SAG [64], AIB1 [72], AZIN1 [72]). Our study further shows that this down-regulation is more than five-fold relative to normal lung tissue (Table 1). A fold change of this magnitude should be large enough to detect these genes in patient samples. Therefore, this panel of down-regulated HIF / hypoxia regulated lung cancer biomarkers can help to distinguish lung cancer from other types of cancer.

The identified 38 signature lung cancer specific biomarkers can help to increase the sensitivity and selectivity for early diagnosis of lung cancer.

Conclusion

We demonstrated that our approach readily predicted lung tissue specific cancer biomarkers from digital differentially expressed lung cancer tissue specific genes. The procedure can easily be adapted to predict tissue specific biomarkers from other tissue specific differentially expressed genes. It remains necessary to explore the extent to which the proposed approach can be integrated with the prediction of tissue specific biomarkers from tissue specific microarray datasets.