Introduction

One of the major goals of the post-genomic era is to understand the role of proteomics and genomics in human health and disease1. The World Health Organization has estimated that breast cancer accounts for about 11 percent of all cancers and that about one-third of cancer deaths could be avoided through early detection and treatment2. Molecular studies and healthcare research have shown that the detection of more and more genes implicated in breast cancer after BRCA1 and BRCA2 has made research, diagnosis and treatment strategies more ambiguous, as the dangers posed by those genes are still uncertain3. Complexity as well as variation at every stage of the cancer make designing drug targets very difficult4,5. The ample availability of functional genomic and proteomic data and the development of high-throughput data-collection techniques have resulted in a shift from traditional gene-based molecular biology to the systems approach of network biology6,7. In this approach, biological processes are considered as complex networks of interactions between numerous components of the cell rather than as independent interactions involving only a few molecules8,9,10. Previous attempts to understand various diseases under the network biology approach reveal that various types of cancers are linked to each other through pathways which are altered in different diseases11. Further, analysis of centrosome dysfunction under the network theory framework demonstrates the importance of hub proteins as well as of the proteins connected with them12,13. In order to achieve a deeper understanding of the complexity of breast cancer, its interaction patterns and the role and importance of these patterns for the disease, this paper analyzes the protein-protein interaction (PPI) network using a novel mathematical tool, random matrix theory (RMT).
This technique, developed about fifty years ago to explain the interactions of complex nucleons14, has recently shown remarkable success in understanding complex systems arising in diverse fields ranging from quantum chaos to galaxies15. While structural parameters, namely degree, clustering coefficient (CC), degree-clustering correlation and diameter, are similar in the normal and disease networks, structural patterns such as cliques, combined with the proteins revealed through the spectral analysis, indicate differences between the two networks which might be important for the transformation of a cell from the normal to the disease state. The present work not only strengthens the importance of structural patterns in the disease, further demonstrating the success of the network framework, but also analyzes a disease using RMT techniques for the first time. This combined framework helps in detecting proteins which are crucial for the disease beyond their structural significance in the underlying network. A detailed analysis of the top contributing nodes (TCNs) in the localized eigenvectors reveals their importance in the occurrence of the disease state.

Results and discussion

Structural properties of cancer networks

The structural parameters of the largest connected cluster of each network, constructed using the dataset described in the Methods section, are summarized in Table 1.

Table 1 Network parameters for both the normal and disease networks. The total number of proteins N collected from various databases (described in the Methods section), the number of proteins NLCC and connections NCLCC in the largest connected cluster, the average degree 〈k〉, the average clustering coefficient 〈CC〉, the number of nodes having CC = 1 in the whole network (NCC) and the average IPR of both networks.

The degree distribution ρ(k) of both networks follows a power law (Figure 1), indicating the presence of nodes with very high degree. Such nodes are known to keep a network robust against random external perturbations and have been found to be functionally important in many pathways16. The degree and CC of the normal and the disease networks are negatively correlated (Figure 1), as found for other biological systems17. As shown in Table 1, the average CC of both networks is high, as for most biological systems investigated under the network theory framework, indicating the presence of functional modules10. It follows that the disease and normal networks exhibit similar overall statistics for the widely investigated structural properties; the crucial differences between them, which are of potential importance, are revealed through the analysis of clique structures and spectra. As reflected in Table 1, the disease network has fewer nodes with CC = 1 than the normal one. A CC of one for a node means that the node and its neighbors form a complete subgraph, or clique. A higher average CC implies a larger number of clique structures in a network18. Further, cliques are known to be building blocks of a network that make the underlying system more robust19 and stable20. Additionally, nodes forming clique structures are known to be preserved during evolution21. It follows that the disease network, having fewer cliques of order three as well as fewer nodes with CC = 1 than the normal one, has lost building blocks in the disease state, which may lead to a more unstable underlying system and might be one of the reasons behind the occurrence of the disease.

Figure 1

Degree distribution and degree-CC correlations for the normal and disease networks.

The left panels show that, for the normal network, the degree distribution follows a power law and the degree and clustering coefficient are negatively correlated. The right panels show the same results for the disease network.

The importance of clique structures will become clearer after we explain the spectral properties and the local patterns of the top contributing nodes (TCNs) that appear in the spectral analysis performed under the framework of RMT. This analysis not only reveals functionally important proteins but also helps in uncovering the importance of structural patterns in the disease network. In the following, we present results on the global spectral properties, the eigenvalue fluctuations and the properties of nodes appearing in the localized eigenvectors. These are the most frequently used RMT techniques for analyzing spectral properties in order to achieve a comprehensive understanding of the underlying complex system.

Universality and the deviation from the same

The eigenvalue distribution has the typical triangular shape (Figure 2), with the tail of the distribution related to the exponent of the power law of the degree distribution, as observed for many other biological and real-world networks22,23. In both the disease and normal networks, about 30% of the eigenvalues are zero. This high degeneracy at zero is not surprising, as many biological networks have been shown to yield very high degeneracy at zero22.

Figure 2

Eigenvalue distribution of both normal and disease networks.

The plots depict triangular distribution for both the networks with a high degeneracy at zero.

Further, as depicted in Figure 3, the distribution of the ratio of consecutive eigenvalue spacings (Eq. 4) follows the GOE statistics of random matrix theory for both the disease and normal networks, which reflects that both networks have a minimal amount of randomness22. Randomness in a network might arise from nonsense mutations24 occurring in the underlying system. We remark that in dynamical systems, randomness may be related to the unpredictable nature of the time evolution (for example, in chaotic systems)25, whereas for networks, randomness refers to random connections between nodes26, which for biological systems might have emerged randomly in the course of evolution and not because of any particular functional importance of the connection. For instance, the emergence of modular structure in networks, which is known to be motivated by specific functional roles in evolution27, might be linked with random connections, perhaps resulting from mutations24. Further, randomness in the interactions is known to be important for the functioning of the underlying system. For instance, information processing in the brain is considered to arise from many random long-range connections among different modules28, making the underlying system robust19 (the Supplementary material contains details about the RMT technique). The universal Gaussian orthogonal ensemble (GOE) statistics displayed by the disease network on the one hand indicates the robustness of the cell even in the disease state, which may be crucial for the maintenance and housekeeping processes of the cancerous cell, and on the other hand establishes that the breast cell can be modeled using the GOE of RMT, so that all the techniques developed under the well-established RMT framework can be applied to understand breast cancer.

Figure 3

Spacing distribution.

The ratio of consecutive eigenvalue spacings follows GOE statistics for both networks. The bars represent data points and the solid line represents Eq. 4 with the parameters of GOE statistics.

After the short-range correlations, analyzed through the distribution of the ratio of consecutive level spacings, the second most insightful step in RMT is the analysis of long-range correlations in the eigenvalues using the spectral rigidity test, which is generally done using the Δ3 statistics given by Eq. 5. This test reveals that both networks follow the RMT prediction for GOE statistics (Eq. 6) up to a particular value of L and deviate thereafter. Interestingly, this value of L is smaller for the disease network than for the normal one (Figure 4), suggesting that the normal network is more random than the disease network26 or, equivalently, that the disease network is more ordered than the normal one. This interpretation, combined with the observation that the disease state has fewer connections (pathways expressed) than the normal one, implies that some interactions are hampered or silenced in the course of the mutations leading to the disease11. It follows that these hampered pathways correspond to, or can be treated as, random pathways and, as randomness is one of the essential ingredients for the robustness of a cell26, a lack of sufficient randomness might lead to the disease.

Figure 4

Long-range Correlations (Δ3 statistics) for the normal and the disease network.

Circles denote data points for the normal and the disease networks whereas the solid line is the Δ3 statistics for the GOE.

While the universal part following the RMT prediction reflects the importance of random connections in biological networks, we will see in the following that the non-universal part of the spectra, which deviates from RMT, provides a direct clue about the set of nodes (proteins) relevant for the occurrence of the disease state. This is achieved by analyzing the localization properties of the eigenstates, which provide a quantitative picture of the non-universal part of the spectra.

Important proteins through eigenvector localization

Based on the IPR values calculated using Eq. 7, the eigenstates can be divided into two components: one which follows the RMT prediction of the Porter-Thomas distribution29 and another which deviates from this universality and shows localization (Figure 5). The average IPR calculated using Eq. 8 is larger for the disease network than for the normal one, in direct agreement with the behavior of the Δ3 statistics, which demonstrates that the normal network is more random than the disease network. The part of the unfolded spectra following universality corresponds to random interactions in the underlying system30, whereas the non-universal part can be exploited to obtain system-dependent information. The non-universal part, through the localized eigenvectors, reveals important proteins as explained below. The disease network yields 34 TCNs corresponding to the top 5 most localized eigenvectors, extracted by taking the threshold as 1/IPR31, of which 18 appear to be unique to the disease network and 14 are common to both the disease and the normal networks. In the following section, we briefly discuss the functions of these proteins selected through the localization property of the eigenvectors (Table 2).

Table 2 Eigenvector localization properties. The most localized eigenvectors (Ek) for the disease dataset, their top contributing proteins and the network parameters of these proteins, namely degree and clustering coefficient. The betweenness centrality of all the TCNs in the disease network is zero.

Figure 5

Eigenvector localization for both the normal and disease networks.

The plots clearly show three regions: (i) a degenerate part in the middle, (ii) a large non-degenerate part which follows the GOE statistics of RMT and (iii) non-degenerate parts at both ends and near the zero eigenvalues which deviate from RMT.

Functional properties of disease proteins

The most important outcome of the functional analysis of the proteins corresponding to the TCNs is that all of them are involved in important pathways leading to breast cancer. The first localized eigenvector has five contributing nodes. Of these, MTHFD1L is responsible for the synthesis of purines in mitochondria, but its expression is up-regulated in breast cancer, leading to proliferation, invasiveness and anti-apoptotic activity32. The protein corresponding to the next TCN is ALDH1L1, which controls cell mobility in the normal cell but is silenced in the disease state, thereby making way for uncontrolled proliferation of cells33. The next two proteins, CACNB1 and CACNB2, help in calcium transport in the normal state but are down-regulated in the breast cancer cell, thereby affecting calcium metabolism, which causes poor signaling in the cell34. Mutation in the last TCN, KCNB2, results in concomitant cell proliferation35.

The second localized eigenvector consists of six TCNs, among which KCNB2 and CACNB2 have been discussed above. The third protein, KCNH4, is responsible for potassium transport in a normal cell but undergoes splicing and is silenced in the disease state, thereby hampering an early event in apoptosis and leading to prolonged survival of cells36. The next protein, TEX1, plays a major role in transcription regulation under normal conditions but is over-expressed in breast cancer cells, which causes down-regulation of transcription37. The next TCN, KIAA1, is an RNA-binding protein, but in the cancer state it undergoes mutation and no longer binds to RNA, which hampers proper translation38. The next protein, THOC1, is a part of the TREX complex, which is responsible for regulating transcription in a normal cell; in the breast cancer cell it delays transcriptional expression, leading to an irregular central dogma which makes the cell unorganized37.

There are seven TCNs in the third most localized eigenvector. The first two proteins, KLK5 and KLK10, are from the kallikrein gene family, whose members are generally involved in serine-type peptidase activities. In breast cancer they are down-regulated, leading to the suppression of tumorigenesis39. The next protein, CNTN4, is implicated in nervous system development. Its exact function in breast cancer is not yet known, but its first interacting neighbor is BRCA1, which is a potent candidate in breast cancer cells40. Similarly, the next contributing node, ASTN2, controls neural migration under normal conditions but in breast cancer is found to have undergone chromosomal rearrangement with PTPRG, which is an important gene for the recognition of the cancerous state in the cell41. The next protein, BRMS1L, is found to reduce the expression of mRNA in breast cancer42. Another protein, ARID4B, functions in diverse cellular processes including proliferation, differentiation, apoptosis, oncogenesis and cell fate determination. In the disease state, it is found to cause irregular cell formation and proliferation43. KCNQ5 is another TCN; it is a member of the KCN family, which affects the proliferation of breast cells35.

The fourth localized eigenvector has eight TCNs, of which KCND3 elevates the influx of potassium ions in breast cancer cells44. The others appear in the three most localized eigenvectors and have already been discussed. Among the top contributing nodes of the fifth localized eigenvector, all except THOC6, FCRL3 and FCRL5 have been discussed above for other localized eigenvectors. These three proteins are found in both the normal and the breast cancer cell. THOC6 accounts for negative regulation of apoptosis in the normal cell, whereas in the breast cancer cell it has been found to be silenced37. The next two proteins, FCRL3 and FCRL5, are part of the FCRL complex, which under normal circumstances acts as a protein adapter and in the development of immunity. In breast cancer, they have been found to be over-expressed in immune cells, namely white blood cells (WBC), thereby making the disease state more robust against treatments45.

It follows that all the proteins corresponding to the TCNs in the five most localized eigenvectors, except a few, make a major contribution to promoting breast cancer. Moreover, fourteen common proteins are important for both the normal and the disease state but lead to very different behavior of the cell in the two states. Whereas in the normal cell these common proteins are involved in major cell functions (Supplementary), in the disease cell they are all found to be abnormally expressed or mutated, leading to the disease state.

Preserved structures in localized nodes

In addition to their functional importance for the occurrence of the disease state, the TCNs exhibit interesting structural properties. This is all the more remarkable given that all of these TCNs lie in the low-degree regime of the networks. Moreover, their betweenness centrality is zero, which further rules out any trivial structural significance of these nodes. However, an analysis of their interactions reveals the existence of preserved local structural patterns. Most strikingly, all of them follow the phenomenon of gene duplication46, as depicted in Figure 6, which shows the TCNs forming pairs in which the first node has exactly the same neighbors as the second node. Most remarkably, there are 20 duplicates (proteins sharing the same neighbors and having more than one connection) in the whole network, of which 18 are found among the TCNs of the most localized eigenvectors.
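The duplicate search described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the network is assumed to be stored as a dictionary mapping each protein to the set of its neighbors, and `duplicate_groups`, `min_degree` and the toy protein names are hypothetical.

```python
from collections import defaultdict

def duplicate_groups(adj, min_degree=2):
    """Group nodes that share exactly the same neighbor set.

    Assumption: `adj` maps each node to the set of its neighbors.
    Nodes with fewer than `min_degree` connections are ignored,
    mirroring the "more than one connection" condition in the text.
    """
    groups = defaultdict(list)
    for node, nbrs in adj.items():
        if len(nbrs) >= min_degree:
            groups[frozenset(nbrs)].append(node)
    # Keep only neighbor sets shared by at least two nodes.
    return [sorted(g) for g in groups.values() if len(g) > 1]

# Hypothetical toy network: P1 and P2 both interact with X and Y only.
toy = {
    "P1": {"X", "Y"},
    "P2": {"X", "Y"},
    "X": {"P1", "P2", "Z"},
    "Y": {"P1", "P2"},
    "Z": {"X"},
}
print(duplicate_groups(toy))
```

Under this strict definition two duplicates cannot be neighbors of each other, since each would then appear in the other's neighbor set but not in its own.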

Figure 6

Local structure of top contributing nodes.

(Left panel) The local structure of all TCNs in the disease network. (Right panel) The local structure for the same proteins in the normal network. Yellow represents TCNs and pink represents their first neighbors.

Further insight into this local structure emerges when we analyze the interaction patterns of these proteins in the normal breast network. Comparing the local interaction patterns of the TCNs in the normal network with those in the disease network shows one of three cases: addition of new interaction(s) in the disease state that build a clique of order three, preservation of the clique structure from the normal to the disease state, or removal of interactions present in the normal state while the clique is kept intact. For example, TEX1 and THOC6 retain the same clique structures in both networks (Figure 6). Other proteins, for example BRMS1L and ARID4B, shed some connections of the normal network (Figure 6(right)) while keeping the clique structure intact in the disease state (Figure 6(left)). Further, the nodes KLK5 and KCND3 form new connections in the disease state, yielding clique structures (Figure 6(left)). Thus, in spite of the proteins having fewer connections in the disease state than in the normal state, and in spite of the reduction in cliques of order three from the normal to the disease network, the cliques in the TCNs are preserved. This may be one of the reasons for the poor performance of drugs targeting these proteins, as cliques are known to be building blocks that make the underlying system robust against external perturbations. While the smaller number of nodes with CC = 1 and of cliques of order three in the overall disease network compared with the normal one implies a destruction of building blocks, the conservation or addition of cliques of order three in the TCNs (whose functional importance for the occurrence of the disease has already been emphasized) reflects that mutation in the disease cell drives these proteins to form a stable structure.
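The comparison of cliques of order three around a given protein in the two networks can be illustrated as follows. This is a minimal sketch under stated assumptions: each network is given as an edge list, and `build_adj`, `triangles_of`, `compare_cliques` and the toy edges are hypothetical names and data, not taken from the study.

```python
from itertools import combinations

def build_adj(edges):
    """Build an undirected adjacency dictionary from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def triangles_of(adj, node):
    """Cliques of order three containing `node`, as frozensets of names."""
    nbrs = sorted(adj.get(node, set()))
    return {
        frozenset((node, u, v))
        for u, v in combinations(nbrs, 2)
        if v in adj.get(u, set())
    }

def compare_cliques(normal, disease, node):
    """Classify the triangles around `node` as preserved, gained or lost."""
    t_norm = triangles_of(normal, node)
    t_dis = triangles_of(disease, node)
    return {
        "preserved": t_norm & t_dis,
        "gained": t_dis - t_norm,
        "lost": t_norm - t_dis,
    }

# Hypothetical toy data: P keeps the triangle P-A-B, sheds its link to C
# and gains a new triangle P-A-D in the disease network.
normal = build_adj([("P", "A"), ("P", "B"), ("A", "B"), ("P", "C")])
disease = build_adj([("P", "A"), ("P", "B"), ("A", "B"), ("P", "D"), ("A", "D")])
result = compare_cliques(normal, disease, "P")
```

Here `result["preserved"]` corresponds to the case of a clique retained in both networks and `result["gained"]` to a clique created by a new interaction in the disease state.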

Conclusions

We construct and investigate the breast cancer network under the RMT framework. The analysis reveals that the TCNs of the most localized eigenvectors, despite lying in the low-degree regime and having zero betweenness centrality, exhibit structural significance. All of them form pairs possessing common neighbors. Most remarkably, the disease network has 20 duplicate proteins with more than one connection, of which 18 appear in the localized eigenvectors. This finding becomes more intriguing in the light of the functional analysis of the proteins corresponding to the TCNs, which clearly confirms their functional importance for the occurrence of the disease state.

Furthermore, clique structure and gene duplication, which have been emphasized as important for the robustness as well as the evolution of a system, are found to be crucial for the disease state. The most striking revelation is that, while there is an overall reduction in cliques from the normal to the disease network, reflecting a reduction in building blocks in the disease state, cliques involving important disease proteins are conserved or formed. This reflects that the robustness of the overall system is decreased in the disease, but the interactions of the important proteins involved in promoting the disease are preserved, which might be one of the reasons why the pathways involving these proteins are highly resistant to various treatments.

The Δ3 statistics demonstrates less randomness in the disease state than in the normal one, which might arise from the removal of random connections in the disease state rather than from an enhancement of the modular structure, as the number of nodes having CC = 1 is indeed smaller in the disease state than in the normal one. This in turn indicates that randomness leads to robustness of the system, with the normal breast network being more robust than the disease network.

Detection of important proteins involved in breast cancer using the RMT platform provides a time-efficient and cost-effective approach for diseases which lack in-depth information about important genes. The revelation of clique structures formed or preserved by these proteins provides a further benchmark for designing drugs which target a subgraph instead of an individual protein. The analysis presented here can be extended to other diseases, such as other types of cancer, diabetes and hypertension, to predict various structural and functional aspects of biology47, which in addition may help to compose novel drug targets and to introduce the concept of a single medicine for multiple diseases48.

Methods

Data assimilation and network construction

According to network theory, a network has two basic components, namely nodes and edges. Here we study the PPI network, where the nodes are proteins and the edges denote interactions between them. After considerable effort, we collect the protein interaction data from various literature and bioinformatics sources. To ensure the authenticity of the data, we only take into account proteins which are reviewed and cited. We use various bioinformatics databases, namely GenBank from NCBI and UniProt49,50, which incorporate data available from other resources such as the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource. To add more information, we also take the most widely studied normal and breast cancer cell lines whose protein expression data are known. Numerous cell lines are available, but very few have been fully exploited for proteomic insight. Here, we use the data of the HMEC cell line for the normal network51,52 and of the MCF-7 cell line for the breast cancer network53. Selecting the datasets and authenticating the data for RMT analysis is an extensive job, and we generate this data after a thorough literature search and validation in about 400 hours of rigorous study. After collecting the proteins for both datasets, their interacting partners are downloaded from the STRING database54. The dataset contains 2464 nodes and 15131 connections for the normal network and 2096 nodes and 14183 links for the breast cancer network. Each network turns out to have one big cluster and several small clusters. We then investigate the structural properties of the networks.

Structural measures

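Several statistical measures have been proposed to understand specific features of a network55. First, we define the interaction matrix, or adjacency matrix, of the network as follows:

$$A_{ij} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ are connected} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$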

The most basic structural parameter of a network is the degree of a node ($k_i$), which is defined as the number of neighbors the node has ($k_i = \sum_j A_{ij}$). The degree distribution ρ(k), giving the fraction of vertices with degree k, is known as the fingerprint of the network. Another important parameter is the clustering coefficient (CC) of the network. The clustering coefficient of a node is defined as the ratio of the number of connections among the neighbors of the node to the possible number of connections those neighbors can have. The average clustering coefficient of a network can be written as

$$\langle CC \rangle = \frac{1}{N} \sum_{i=1}^{N} CC_i, \qquad CC_i = \frac{2 e_i}{k_i (k_i - 1)} \qquad (2)$$

where $e_i$ is the number of connections among the $k_i$ neighbors of node i. A node with $CC_i = 1$ forms a clique with its neighbors. Cliques are complete subgraphs in the network, which are known to be the conserved part of the network21. The average clustering coefficient characterizes the overall tendency of nodes to form clusters or groups55. Further, the betweenness centrality of a node i is defined as the fraction of shortest paths between node pairs that pass through the node56,

$$x_i = \sum_{st} \frac{n^{i}_{st}}{g_{st}} \qquad (3)$$

where $n^{i}_{st}$ is the number of shortest paths from s to t that pass through i and $g_{st}$ is the total number of shortest paths from s to t in the network. Another parameter is the diameter of the network, which is the longest of the shortest paths between any two nodes.
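The degree, degree distribution and clustering coefficient described in this section can be computed directly from the neighbor sets. The following is a minimal sketch rather than the code used in the study: the adjacency dictionary `toy`, the function names and the convention that nodes with fewer than two neighbors get CC = 0 are all assumptions made for illustration.

```python
from collections import Counter

def degrees(adj):
    """Degree of each node: the number of its neighbors."""
    return {node: len(nbrs) for node, nbrs in adj.items()}

def degree_distribution(adj):
    """rho(k): fraction of nodes having degree k."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}

def clustering(adj, node):
    """Links among the neighbors of `node` divided by the number of
    possible links among them (assumed 0 when the degree is below 2)."""
    nbrs = sorted(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1 for i, u in enumerate(nbrs) for v in nbrs[i + 1:] if v in adj[u]
    )
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Average of the node clustering coefficients over all nodes."""
    return sum(clustering(adj, n) for n in adj) / len(adj)

def nodes_with_full_clustering(adj):
    """Nodes with CC = 1, i.e. nodes forming a clique with their neighbors."""
    return sorted(n for n in adj if clustering(adj, n) == 1.0)

# Hypothetical toy network: a triangle A-B-C with a pendant node D on C.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
```

In this toy network A and B have CC = 1, C has CC = 1/3 because only one of the three possible links among its neighbors exists, and D has CC = 0 by the stated convention.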

Spectral techniques

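The random matrix analysis of the eigenvalue spectra considers (1) global properties, such as the spectral distribution of eigenvalues ρ(λ), and (2) local properties, such as the eigenvalue fluctuations around ρ(λ). We denote the eigenvalues of a network by λi, i = 1, …, N, with λ1 > λ2 > λ3 > … > λN. The nearest neighbor spacing distribution (NNSD) is known to be one of the most powerful techniques in RMT; we analyze the spacings by calculating the distribution of the ratio of consecutive eigenvalue spacings, which is represented as57

$$P(r) = \frac{1}{Z_\beta} \frac{(r + r^2)^{\beta}}{(1 + r + r^2)^{1 + \frac{3}{2}\beta}} \qquad (4)$$

where r is the ratio of two consecutive eigenvalue spacings, β is the Dyson index (β = 1 for GOE statistics) and $Z_\beta$ is the normalization constant ($Z_1 = 8/27$).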
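As an illustration of the ratio statistics, the sketch below computes the ratios of consecutive spacings from a list of eigenvalues and evaluates the standard GOE surmise for the ratio distribution (β = 1, normalization 8/27). The function names are hypothetical, the ratio convention (each spacing divided by the previous one) is one common choice, and dropping zero spacings produced by degenerate eigenvalues is an assumption made here for illustration, not a step stated in the text.

```python
def spacing_ratios(eigenvalues, tol=1e-12):
    """Ratios of consecutive spacings of the sorted eigenvalues.

    Assumption: spacings not larger than `tol` (degenerate eigenvalues)
    are dropped before the ratios are formed.
    """
    lam = sorted(eigenvalues)
    spacings = [b - a for a, b in zip(lam, lam[1:])]
    spacings = [s for s in spacings if s > tol]
    return [s2 / s1 for s1, s2 in zip(spacings, spacings[1:])]

def goe_ratio_pdf(r):
    """GOE surmise for the ratio distribution (beta = 1, Z = 8/27)."""
    return (27.0 / 8.0) * (r + r * r) / (1.0 + r + r * r) ** 2.5

# Equally spaced levels give ratios of exactly one.
print(spacing_ratios([0.0, 1.0, 2.0, 3.0]))
```

A histogram of `spacing_ratios(...)` for a network spectrum can then be compared with `goe_ratio_pdf` without any unfolding of the eigenvalues.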

The benefit of analyzing the ratio of nearest spacings over the much-used NNSD58 is that this method does not require unfolding of the eigenvalues. Further, the NNSD accounts only for the short-range correlations in the eigenvalues. We probe the long-range correlations in the eigenvalues using the Δ3(L) statistics, which measures the least-square deviation of the spectral staircase function, representing the integrated eigenvalue density, from the best-fitted straight line over a finite interval of length L of the spectrum. In order to obtain the universal properties of the eigenvalues through the Δ3 statistics, it is customary in RMT to unfold the eigenvalues by the transformation $\bar{\lambda}_i = \bar{N}(\lambda_i)$, where $\bar{N}$ is the average integrated eigenvalue density. Since we do not have any analytical form for $\bar{N}$, we unfold the spectrum numerically by polynomial curve fitting58. After unfolding, the average spacing is unity, independent of the system. The Δ3 statistics for an interval starting at x is given by

$$\Delta_3(L; x) = \frac{1}{L} \min_{a,b} \int_{x}^{x+L} \left[ N(\bar{\lambda}) - a\bar{\lambda} - b \right]^2 d\bar{\lambda} \qquad (5)$$

where $N(\bar{\lambda})$ is the staircase function of the unfolded eigenvalues and a and b are the regression coefficients obtained from a least-squares fit58. Averaging over several choices of x gives the spectral rigidity Δ3(L). For GOE statistics, Δ3(L) depends logarithmically on L, i.e.

$$\Delta_3(L) \sim \frac{1}{\pi^2} \ln L \qquad (6)$$
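The spectral rigidity can be estimated numerically along the lines described here. The sketch below is a simplified illustration and not the authors' procedure: it assumes the eigenvalues have already been unfolded, samples the staircase on a uniform grid instead of integrating exactly, and uses hypothetical function names and defaults (`samples`, `n_windows`).

```python
import bisect
import math

def delta3_window(unfolded, x, L, samples=2000):
    """Mean squared deviation of the staircase from its best-fit line
    on [x, x + L], approximated on a uniform midpoint grid."""
    levels = sorted(unfolded)
    h = L / samples
    es = [x + (j + 0.5) * h for j in range(samples)]
    ns = [bisect.bisect_right(levels, e) for e in es]  # staircase N(e)
    # Ordinary least-squares fit N ~ a * e + b.
    me = sum(es) / samples
    mn = sum(ns) / samples
    see = sum((e - me) ** 2 for e in es)
    sen = sum((e - me) * (n - mn) for e, n in zip(es, ns))
    a = sen / see
    b = mn - a * me
    return sum((n - a * e - b) ** 2 for e, n in zip(es, ns)) / samples

def delta3(unfolded, L, n_windows=20):
    """Average of delta3_window over evenly spaced window starts x."""
    lo, hi = min(unfolded), max(unfolded) - L
    if hi <= lo or n_windows < 2:
        raise ValueError("spectrum too short for this L or too few windows")
    starts = [lo + (hi - lo) * j / (n_windows - 1) for j in range(n_windows)]
    return sum(delta3_window(unfolded, x, L) for x in starts) / n_windows

def delta3_goe(L):
    """Leading logarithmic GOE behavior, ln(L) / pi^2."""
    return math.log(L) / math.pi ** 2

# Sanity check on a perfectly rigid, equally spaced spectrum, for which
# the rigidity saturates near 1/12.
picket_fence = [float(i) for i in range(200)]
rigidity = delta3(picket_fence, 20)
```

Plotting `delta3(unfolded, L)` against `delta3_goe(L)` for increasing L shows the value of L up to which a spectrum follows the GOE prediction.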

Further, we use the inverse participation ratio (IPR) to analyze the localization properties of the eigenvectors59. With $E^k_l$ denoting the lth component of the kth eigenvector $E^k$, the IPR of an eigenvector is defined as

$$I^k = \frac{\sum_{l=1}^{N} [E^k_l]^4}{\left( \sum_{l=1}^{N} [E^k_l]^2 \right)^2} \qquad (7)$$

which has two limiting values: (i) a vector with identical components has $I^k = 1/N$, whereas (ii) a vector with one non-zero component and the remainder zero has $I^k = 1$. Thus, the IPR quantifies the reciprocal of the number of eigenvector components that contribute significantly. We further calculate the average IPR in order to measure the overall localization of the network60,

$$\langle \mathrm{IPR} \rangle = \frac{1}{N} \sum_{k=1}^{N} I^k \qquad (8)$$

Note that the IPR defined above separates out the TCNs when the threshold is taken as 1/IPR.
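The IPR, taken as the sum of the fourth powers of the eigenvector components divided by the squared sum of their squares, and the selection of TCNs can be sketched as follows. The selection rule used here is an assumed reading of the 1/IPR threshold: since 1/IPR estimates the number of significantly contributing components, the sketch keeps that many components with the largest squared weight. The function names and the toy vector are hypothetical.

```python
def ipr(vec):
    """Inverse participation ratio of an eigenvector (any normalization)."""
    s2 = sum(c * c for c in vec)
    s4 = sum(c ** 4 for c in vec)
    return s4 / (s2 * s2)

def average_ipr(vectors):
    """Average IPR over a collection of eigenvectors."""
    return sum(ipr(v) for v in vectors) / len(vectors)

def top_contributing_nodes(vec, names):
    """Assumed reading of the 1/IPR threshold: keep the round(1/IPR)
    components with the largest squared weight."""
    n_top = max(1, round(1.0 / ipr(vec)))
    ranked = sorted(range(len(vec)), key=lambda l: vec[l] ** 2, reverse=True)
    return [names[l] for l in ranked[:n_top]]

# Limiting cases: a uniform vector gives 1/N, a single-component vector gives 1.
uniform = [0.5, 0.5, 0.5, 0.5]
single = [0.0, 1.0, 0.0, 0.0]

# Hypothetical localized vector dominated by its first two components.
localized = [0.7, 0.7, 0.1, 0.1]
tcns = top_contributing_nodes(localized, ["P1", "P2", "P3", "P4"])
```

For the localized toy vector 1/IPR is close to two, so the two dominant components are returned as TCNs.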