Lay Summary

Cardiovascular diseases are a leading cause of death and chronic disability worldwide. Part of the reason we cannot control the progression of these diseases, or promote repair and recovery, may result from a lack of understanding of how mechanical forces and material properties change cell signaling during health and disease. To help address this knowledge gap, we systematically analyze all articles in the PubMed citation database related to biophysics and, separately, cardiovascular diseases. We identify genes that have been studied before in these contexts and combine this information into signaling networks that may work together to regulate cell behavior outcomes.

Future Works

This study identified individual genes and linked gene sets that are strongly associated with both biophysical cues and cardiovascular diseases. Future work can integrate detailed information from the articles we identify here to build data-driven, quantitative models to help predict how biophysical changes will influence cell behavior and health.

Introduction

What intracellular networks control cell behavior? How are these biological circuits disrupted when disease strikes, and can regenerative medicine and tissue engineering therapies restore circuit and tissue function to a healthy state? The global effort to answer these basic questions is adding thousands of new articles every year to the scientific literature. This overlaps topics relevant to regenerative medicine and tissue engineering, including advanced materials science, stem cell research, the physical sciences, and developmental biology. Gaining more accurate predictive capability and control of cell signaling can have especially high impact in the area of cardiovascular regenerative medicine, since cardiovascular disease remains the world-leading cause of death and chronic disease [1]. Our goal is to identify biological networks important for cardiovascular health and disease, and whether they also help cells respond to local mechanical forces and material properties in the surrounding tissue.

Cardiovascular diseases impact hundreds of millions of patients and trigger more than $300 billion loss annually in the USA alone [2]. As the number of deaths and disability adjusted life years due to cardiovascular disease grimly increase, public health experts urge countries to reduce the disability and premature deaths [3]. Furthermore, some sectors of the population suffer at higher rates, including ethnic and racial minorities [4] and gender differences [5]. We focused our analysis on four main groups of conditions that drive morbidity and mortality in cardiology: (1) heart attacks, or myocardial infarctions, and coronary artery disease [6]; (2) stroke and other cerebrovascular diseases such as aneurysm and blocked arteries [7, 8]; (3) deep vein thrombosis and pulmonary embolism [2]; and (4) peripheral artery disease [9,10,11]. Each of these groups of cardiovascular diseases includes chronic and acute components, which differ in the cell signaling and tissue remodeling involved and treatment strategies needed [12].

Many tissue engineering and regenerative approaches have been proposed to treat heart and blood vessels disorders during the past three decades [13,14,15,16]. The localized nature of many cardiovascular conditions (e.g., heart attacks, blood clots, vessel narrowing) make them well-suited to benefit from tissue engineering approaches, compared to diseases that affect diffuse regions such as some neurodegenerative diseases. Furthermore, the systemic access of the cardiovascular system can support regenerative medicine approaches such as cell homing and systemic delivery [15]. These efforts have yielded insights from cell therapy clinical trials [17, 18], FDA pre-market approved products [19], and ongoing investment to develop new cardiovascular regenerative therapies [20]. However, regenerative medicine and tissue engineering have not yet successfully reduced the burden of cardiovascular disease on society [21, 22]. Problems have been identified with immature cell phenotypes, failure to integrate with local host tissue, and side effects of poorly controlled biologic agents [6]. Improved modeling of human disease may be possible via stem cell technology and analyses that integrate multiple types of information [23]. Such systems-oriented models can help explain how cardiovascular diseases — or their repair — result from multiple factors working in concert, not single genes (or cells, scaffolds, or drugs) [24].

Living tissues and the laboratory models for studying them feature cells grown within inherently physical environments, including mechanical forces and material properties. Biophysical cues in turn play essential roles in promoting cardiovascular homeostasis [25]. In the cardiovascular system, disruptions in physiologic levels of shear stress, pressure, tissue stiffness, and strain are linked directly and indirectly to triggering cardiovascular disease [2, 5, 25, 26]. This underscores the need for stem cells or other therapeutic cells to be able to respond in site-appropriate ways to the biophysical environment when used as part of a regenerative therapy [27]. This is challenging, since differentiated stem cells often show immature properties compared to their endogenous counterparts [6, 28]. Biophysical cues are one of several types of information to which cells detect and react during disease progression and tissue regeneration. Models that integrate the breadth of published information on biophysics, cellular responses, and disease states can help predict how molecular networks translate biophysical cues into cell signaling and, subsequently, how to engineer therapeutic repair.

Systems biology approaches to study cardiovascular mechanical forces have been used to integrate mechanosignaling transcriptome data at different timepoints [29], generate quantitative computational models of specific stiffness-sensitive signaling pathways like Yap [30], and qualitatively summarize the large number of canonical signaling pathways that appear to participate in shear stress mechanosensing [31]. To more broadly understand cellular response to materials, Darnell and colleagues recently systematically characterized cell responses to distinct combinations of biophysical parameters (substrate stiffness, stress relaxation, and adhesion ligand density) [32]. A complementary but less common approach to these examples of -omics measurement-based systems biology is systematic, high-throughput meta-analysis of existing literature to integrate otherwise siloed experimental data. Text mining for key signaling network information, such as gene and biomolecule names, is one such method that has been successfully applied to discover and study novel signaling networks in disciplines other than cardiovascular bioengineering, including autoimmune diseases [33], precision medicine [34], and pharmacology [35].

In this work, we sought to address a critical gap in identifying molecular circuits that mediate the mechanobiology of cardiovascular disease: identification of relevant molecules, or circuit nodes. We systematically mined the text of 8 custom citation databases, each representing one biophysical cue or major cardiovascular disease. Our high-throughput approach used statistical and bioinformatic tools to identify evidence-based molecular networks in each category. This systematic approach yielded novel and specific gene networks with known biological functions and immediate data-driven relevance to biophysics and cardiovascular disease.

Materials and Methods

Citation Database Generation

Search terms were selected manually with the intent to capture a broad collection of articles related to the topics (biophysical cue or cardiovascular disease), while minimizing the number of unrelated or false positive citations. Searches were performed on PubMed between 9 November 2020 and 18 May 2021. Search phrases, Boolean operators, and wild card variations were used to improve the comprehensive citation lists (see Supplemental Tables 1–3). Article searches and downloading of XML-formatted citations was performed using eDirect commands and custom python code [36]. XML files were converted to tables for downstream data processing. In total, 8 independent citation databases were generated for four biophysical cues (shear, strain, stiffness, and pressure) and four cardiovascular disease sets: coronary artery disease/myocardial infarction (CHD/MI), cerebrovascular disease/stroke (CVD/stroke), deep vein thrombosis/pulmonary embolism (DVT/PE), and peripheral artery disease (PAD).

Gene Frequency Analysis

We made the simplifying assumption in this analysis that genes mentioned in prominent portions of the text, such as the title or abstract, likely play an important role in understanding the article’s findings. Biomolecules were identified using third party-curated lists of human gene and protein official names, official symbols, alternative names and alternative symbols. Annotation files for biomolecule information were sourced online from HGNC [37] (15 June 2020 download) and UniProtKB/Swiss-Prot [38] (01 December 2020 download). Custom python scripts were used to integrate biomolecular annotation information from both sources into a single gene database, using the HGNC ID as the root identifier for each gene. Titles and abstracts for all articles in a given citation database were then screened for mention of any biomolecule’s identifying information in the gene name database (Supplemental Fig. 1). Gene counts were filtered to remove likely false positives, such as gene symbols that are words in the English language (e.g., REST) or common scientific abbreviations (e.g., AFM). The code generated tabular output showing (1) the individual genes (if any) mentioned in every article in the database and (2) total counts of each gene name mentioned in the literature database. Genes that appeared in fewer than 3 articles in the database were neglected, since they lacked sufficient representation in the literature to detect trends during downstream analysis.

In total, the HGNC annotation file included 144,407 molecule identifiers (names, symbols, and aliases) and the UniProt annotation file included 65,922 molecule identifiers (approved symbols, protein names and aliases, gene aliases). We searched the title and abstract text for these \(\sim 2.1*10^{5}\) established gene identifiers for every article in each of the 8 citation databases (\(\sim 1.2*10^{6}\) articles), equivalent to \(\sim 2.6*10^{11}\) bits of information.

Probability and Statistics

To determine whether two genes, x and y, were studied together, we calculated the absolute numbers of articles mentioning both genes, the Pearson correlation coefficient of the two genes, and the related conditional probabilities, P(x|y) = (# articles mentioning both gene x and gene y) / (# articles mentioning gene y). Means and standard deviation were calculated when noted in the text.

Network Analysis

Network Node Derivation

We chose a threshold of 0.2 as the level of conditional probability required to classify two genes as enriched in their mutual occurrence in the literature. The level of 0.2 was chosen to filter out noise, based on the distribution of conditional probability values for all pairwise conditional probabilities (Supplemental Fig. 2). We inferred that gene pairs that are enriched for mentions in the same articles may contribute to similar signaling networks and/or biological functions [39, 40]. We refer to these linked gene sets as molecular networks or circuits. Histogram analysis was performed to tally the number of genes with enriched mentions (sum of gene pairs that satisfy P(A|B) ≥ 0.2) for each gene, to quantify the degree of connectivity and information content of the gene.

Network Architecture

Network architecture was visualized using two approaches. First, we used Cytoscape software [41] to connect nodes (shown as gene symbols) with edges (connecting lines between nodes). Edge thickness and color were proportional to the absolute number of shared articles and the correlation coefficient between the nodes, respectively. Second, we used the software Ingenuity Pathways Analysis (IPA, Qiagen) to define a new pathway containing only nodes that were gene symbols that could be linked exclusively using the conditional probability enrichment criteria (P(A|B) ≥ 0.2) [42]. Using the “connect network” function and allowing only direct connections, IPA software populated the edges in these networks based exclusively on known biological relationships between molecules, without regard to article counts or correlation values. Indirect relationships were excluded since these were far too numerous and made it more difficult to visualize the signaling pathway architecture based on direct connections.

Computer Code

Computer code was written in Python 3. PubMed database interrogation via the command line used eDirect (NCBI) underlying commands executed via the Biopython library. Code output is deposited in the Dryad digital repository.

Results

Generation of Systematic Citation Databases

Interest in biophysical stimuli and cardiovascular diseases has grown steadily in the past 60 years, and approaching exponentially since 2000 (Fig. 1A). In 2020 alone, more than 25,000 new PubMed-indexed articles were published related to shear, stiffness, strain, or pressure. Nearly 60,000 new publications during the same year mentioned CHD/MI, CVD/ stroke, DVT/PE, or PAD. This vast data set is larger than what can be read and synthesized manually. Among the biophysical citation databases, pressure was studied the most prior to 2010. The past 10 years, however, show a rapid increase in annual publications related to stiffness, which is on track to exceed pressure as having the most total articles published since 1960. Among cardiovascular diseases, most articles published before 2010 focused on CHD/MI, compared to a greater focus more recently on CVD/stroke.

Fig. 1
figure 1

Rapid increases in publication rates and gene annotations combine to form a high dimensional dataset describing biophysics and cardiovascular diseases. (A) Thousands of PubMed-indexed articles are published annually related to cardiovascular-relevant biophysical cues (left), along with tens of thousands of new articles annually for cardiovascular disease (right). (B) Custom computer code analyzed citations from PubMed for mentions of standardized molecule names from HGNC or UniProt databases, yielding databases matching articles to -omics-wide gene and protein mentions (see supplemental data). (C) Citation databases each include \(\sim 4*10^{5}-4*10^{6}\) total individual articles (left, navy bars), with approximately 40% mentioning any gene (right, yellow bars) or a high confidence filtered gene(s) (right, green bars). (D) Genes are being mentioned more often over time, appearing in up to one quarter (25%, yellow lines) of article citations published annually since 2000. (E) Systematic analysis of biophysical and cardiovascular disease-related articles, spanning > 60 years, identified \(\sim 10^{4}-10^{5}\) gene data points (individual articles containing gene information) for each topic

To systematically analyze these large datasets, we wrote custom code to perform text recognition in titles and abstracts of PubMed-indexed article citations (Supplemental Fig. 1AB). Text that was screened initially only for official HGNC-curated gene symbols and gene names missed many relevant biomolecular mentions. This high rate of false negatives (genes present but not found when searching only for HGNC symbols) is because a diverse range of aliases are used in the literature. To solve the problem of false negatives, the algorithm was updated so that text was scanned for 6 types of officially recognized biomolecule names including 4 sets of gene symbols, names, symbol aliases, and name aliases from the HGNC database; 1 name list from the Reviewed UniProt (Swiss-Prot) database; and 1 optional category of user-defined names to handle names not documented in either the HGNC or UniProt database (such as “YAP” for the official gene symbol YAP1) (Fig. 1B). Accounting for biomolecule mentions improved the sensitivity of automatically detecting gene mentions in article citations, affecting 7 out of the top 10 most frequently studied genes in the shear biophysical cue database (Supplemental Fig. 1D). This showed that the type of biomolecule identifier used varied greatly depending on the subdiscipline (article dataset) and individual gene. To promote algorithm efficiency, the user-defined names, HGNC gene names, and HGNC gene name alias were searched with highest priority (Fig. 1A) since alternatives like HGNC gene symbols, HGNC symbol aliases, and UniProt names were more likely to overlap with unrelated acronyms.

We designed Boolean filter criteria to reduce the detection of spurious genes (false positives) and improve accuracy (Fig. 1B) of the text mining code. This accounted for a common scenario in which gene names could be confused with other terms (e.g., “CPP” could refer to the gene or to the term cerebral perfusion pressure). We validated this automated filter method for all genes mentioned in ≥ 10 articles in the shear biophysical cue database. The computational algorithm with Boolean predicted positive genes achieved levels of 88% precision, 94% sensitivity, 86% specificity, and 90% accuracy compared to ground truth positives, defined using manual verification (Fig 1C). We performed the same validation procedure for three other biophysical cues and obtained similar scores (not shown). The computational algorithm executed text mining for gene names and the Boolean prediction filter within minutes to hours, depending on the dataset size. In contrast, manual verification of individual true positive genes required much longer (days). For larger datasets or in case increased gene detection sensitivity is desired (e.g., for genes with article counts below 10), manual verification of hundreds of thousands of gene mentions in articles is not feasible. Manual verification also risks inaccuracy due to human error, while execution of computer code is expected to be reproducible and objective. These data demonstrate the need for computational solutions to systematically analyze published literature, and the ability of the algorithm to generate an accurate list of genes mentioned in a citation dataset.

Each of the eight citation databases can be described by output listing (a) genes mentioned in each article, (b) genes in ranked order according to the total number of articles mentioning each gene, and (c) gene mentions over time (Supplemental Tables; Dryad repository). These citation databases range in size from 37,706 articles for PAD to 394,607 articles for CHD/MI (Fig. 1B). Despite differences in dataset size, \(\sim\)25–40% of the articles in every database mentioned genes in prominent citation fields. Approximately 10–20% of the articles in each database mentioned genes predicted by the Boolean filters to be high confidence genes mentioned in the data set (Fig. 1B right, green bars) versus those detected with more generous filters as simply potential gene mentions (Fig. 1B right, yellow-only bars). This indicates that each citation dataset included 4000 to 80,000 individual articles containing high confidence gene mentions, suitable for future detailed inspection. In both the biophysical dataset and the cardiovascular disease set, the percentage of articles mentioning genes has risen since 1998 (Fig. 1D). Shear, strain, and PAD showed the greatest percent of articles mentioning genes between 1998 and 2010, while the percent of articles mentioning genes in the remaining topics has plateaued since 2010 (stiffness, pressure, CVD/stroke, DVT/PE) or increased (CHD/MI). Likewise, the number of genes being studied (Fig. 1E) has increased rapidly since the mid-1990s to the present. Of all the databases, stiffness and CVD/stroke show the fastest growth both in publication number and number of genes mentioned during the past 10 years.

Genes Associated with Cardiovascular-Relevant Biophysical Cues

From the set of > 40,000 genes screened, 1558 genes were mentioned in ≥ 3 articles for one or more of the biophysical cues (Fig. 2A). More than half of these (859 genes) have been reported in multiple biophysical contexts, and 21.6% (337 genes) appeared in the articles of all 4 stimuli. Strain and pressure resembled each other the most in terms of biomolecules studied, as shown by the dendrogram (Fig. 2A). Shear and stiffness had more unique profiles of genes studied (272 and 244 unique genes, respectively, versus 88 and 94 for strain and pressure).

Fig. 2
figure 2

Biophysical literature overlaps with molecular studies and concepts core to regenerative engineering and cell therapies. (A) Clustergram showing relationships between mechanical forces and material properties, and among genes mentioned in these 4 article databases. Hierarchical clustering (average linkage and Euclidean distance) was used to obtain the dendrogram. Color bars (vertical rainbow) highlight gene sets matching specific biophysical patterns (e.g., yellow = predominantly shear). Colors correspond to single large branches in the genes dendrogram (top levels only shown); gaps reflect minor branches with heterogeneous patterns. The list of all clustered genes, including associated color group, can be found in Supplemental Table 4. (B) Heat maps of the pairwise conditional probabilities (left column) and the absolute numbers of shared articles (right column) between genes. Genes shown have at least one enriched connection (P(A|B) ≥ 0.2) to another gene. Outlined squares denote linked gene sets, defined by their shared increased likelihood to be mentioned in the same articles. (C) Histograms show the number of genes with enriched affiliation (P(A|B) ≥ 0.2; y-axis) as a function of gene (x-axis). X-axes gene labels are the same as those in the conditional probability heatmaps in subfigure B. (D) Studies of different biophysical cues overlap with core concepts in regenerative engineering. (E) Data from biophysical articles provides asymmetric insight into the role of each biophysical cue in stem cell therapeutic types or mature cardiovascular cell types. (F) Linked gene sets identified via conditional probability patterns can be visualized using bioinformatics software to show known biological relationships, as in the exemplar network identified from strain articles that relates to kidney and blood pressure functions. (G) Detailed article information, from full text to properties like publication frequency over time (shown), can be analyzed for higher resolution mechanosignaling analysis

In order to rapidly identify genes that are often studied in the same context, we used pairwise conditional probability to measure the likelihood for one gene to be mentioned in an article given the presence of the other gene in the pair. Genes were considered to be enriched in their association when the conditional probability of P(A|B) ≥ 0.2 (Supplemental Fig. 2). The conditional probabilities demonstrated consistent bias in the literature, with one gene in the pair usually driving the likelihood that other genes are studied. This can be seen by the global diagonal asymmetry in the conditional probability heat maps (Fig. 2B, blue heatmaps) and the asymmetry of gene connectedness shown by comparing histograms for the sums of P(A|B) versus P(B|A) (Fig. 2C). One specific example of this asymmetry from the shear dataset is the high conditional probability for genes Selectin -P, -E, and -L to be studied in any article in which Selectin-P Ligand was mentioned, but an absence of the reverse correlation. Articles mentioning the receptors Selectin -P, -E, or -L do not show increased likelihood for the ligand to be mentioned. Applying a filter to highlight conditional probability values above 0.2 (Fig. 2, outlined boxes) rapidly narrowed down the gene association space from the initial conditional probability list of \(\sim 1.6*10^{9}\) (40,000 × 40,000) possible gene pairs to a manageable set of < 100 enriched gene pairs. Individual genes within this list linked to a variable number of other genes, based on meeting the enrichment criteria of P(A|B) ≥ 0.2 (Fig. 2C). Individual genes that directly link to more genes based on the conditional probability threshold contain more information about the related molecular networks. These genes appear in the largest linked gene molecular circuits and indicate that molecular network architecture for biophysical cue sensing includes mainly highly connected genes, rather than genes connected in series. Mathematically linked gene sets could be visualized as molecular networks based on their known biological relatedness, as shown for the strain-sensitive network AGT-ACE-REN (Fig. 2F). Connecting this information with the number of articles overlapping for each gene pair enabled us to identify individual article subsets matching any gene, gene pair, or linked gene set (Fig. 2B, red heatmaps). We could also extract other information from either the citation, such as the molecule studies over time (Fig. 2G), or details from the full text of the articles.

Within the shear stress dataset (948 total genes found in ≥ 3 articles), 180 genes were discussed in ≥ 20 articles (Fig. 2B). Imposing a conditional probability threshold of 0.2 narrowed the list to 49 genes with 84 pairwise interactions. Linking the 49 genes together generated 5 linked gene (or co-occurrence) networks with 3 or more genes. We interpreted high outlier conditional probability values (close to 1.0) as a red flag for possible false positives. Upon inspection, this suspicion was borne out: NEUROD1 and TUBB4B shared the HGNC alias symbol Beta2, and DUSP2-SELP matches comes from the alias PAC1-SELP used in the literature, which in this subfield refers to an antibody that binds to activated α IIbβ3 and not the gene DUSP2 [43]. Subsequent conditional probability values at or near 1.0, while rare, were manually checked and removed if artifacts. The number of genes identified in the stiffness dataset was similar to the shear dataset: 77 pairwise interactions, shared among 44 genes, exceeded the conditional probability threshold. These 44 genes can be grouped into 4 linked gene networks. In the strain dataset, slightly fewer enriched gene interactions passed the conditional probability filter (62 gene interactions, 38 genes). These 38 strain genes contribute to form 4 linked gene networks. The pressure dataset included the fewest (46) gene interactions above conditional probabilities ≥ 0.2. These interactions were shared among 28 genes and could be grouped into 3 linked gene networks. For all four biophysical cues analyzed, the molecules included in each pair of connected genes and each linked gene network (i.e., those sets containing ≥ 3 connected genes) we identified are shown in detail in Fig. 4A.

We also analyzed interests shared by the biophysics and regenerative medicine communities in the biophysical cue citation databases. Publications related to stiffness were the most likely to discuss regenerative medicine and tissue engineering topics (Fig. 2D). Across all biophysical cues, articles mentioned scaffolds approximately twice as often as stem cells or cell therapy was mentioned, indicating a potential gap in the regenerative medicine literature describing the biomechanics and mechanosignaling properties of stem cells and other therapeutic cell types. Mesenchymal Stem Cells (MSC) were the most frequently studied stem cell type in the 4 biophysics datasets (67–84%), with more than 6 times the article counts compared to induced or pluripotent stem cells. MSC studies were also correlated primarily with stiffness studies, which could reflect adherence to a well-established model system in the stiffness mechanosensing community [44]. Pluripotent cells were studied more uniformly across biophysical cues. Among cardiovascular cell types (Fig. 2E), endothelial cells were studied most often and usually in the context of shear, while smooth muscle and cardiomyocytes were studied more uniformly across all cues (with the exception of cardiomyocytes and shear stress, a group containing fewer articles). This similar number of available articles across mechanical cues and material properties is somewhat surprising, given the larger role strain and pressure play in situ for cardiovascular muscle cells.

Genes Associated with Major Cardiovascular Diseases

Since cardiovascular health problems vary widely in their symptoms and causes, we also separated gene analysis of the most common cardiovascular diseases by anatomically themed categories. These 4 groups included (1) coronary heart disease and myocardial infarction (MI/CHD), (2) cerebrovascular disease and stroke (CVD/stroke), (3) deep vein thrombosis and pulmonary embolism (DVT/PE), and (4) peripheral artery disease (PAD). 3129 genes have been mentioned in ≥ 3 articles for at least one of the disease topics (Fig. 3B). 1742 genes were studied in more than 1 disease area and 299 highly studied genes (9.6%) appeared in the articles of all 4 topics. CHD/MI and CVD/stroke resembled each other the most, with substantial overlap in genes studied. Genes related to DVT/PE and PAD were more similar to each other than to either CHD/MI or CVD/stroke, as shown in the dendrogram (Fig. 3B). Both CHD/MI and CVD/stroke also had large gene sets uniquely studied in those diseases, 516 and 853 unique genes to CHD/MI and CVD/stroke respectively (Fig. 3B, green and blue bars). Very few genes were unique to either DVT/PE (14) or PAD (4). Article data size varied considerably across cardiovascular diseases, and resulted in differences in the average article counts mentioning a gene pair between the smallest article database, PAD (\(\bar {x}_{PAD}=8\)), and the largest database, CHD/MI (\(\bar {x}_{CHD/MI}=47\)) (Fig. 3A). To account for this data size variation of 40,000 to 400,000 articles, we could not use the absolute cut-off of genes appearing in at least 20 articles, since this was chosen for the biophysical conditional probability heat maps that all had dataset sizes of 7000–12,000 articles mentioning filtered genes. Instead, we used the equivalent percentage, requiring that genes listed on the conditional probability heatmap appear in ≥ 0.2% of gene-mentioning articles within the database (Fig. 3C).

Fig. 3
figure 3

Cardiovascular disease articles reflect underlying gene networks that overlap many mechanically sensitive components. (A) Heat map showing the number of articles (directly proportional to pixel intensity) as a function of the pair of genes (row and column position) mentioned in the four cardiovascular disease citation databases. The average article data set size for any pair of genes varies between cardiovascular diseases, ranging from 8 to 47 articles (\(\bar {x}\pm \sigma\) shown), independent of the number of genes in each disease that satisfied the threshold criteria. (B) Clustergram showing the relationships between cardiovascular diseases based on genes studied (dendrogram, top) and similar genes based on occurrence in cardiovascular disease specific literature (row order, left dendrogram, and rainbow colorbar). Hierarchical clustering (average linkage and Euclidean distance) was used to obtain the dendrograms. Each color corresponds to single large branches in the genes dendrogram (top levels only shown); gaps in color reflect minor branches with heterogeneous patterns. The list of all clustered genes, including associated color group, can be found in Supplemental Table 5. (C) Conditional probability heat maps showing enriched co-occurrence of gene pairs in the cardiovascular disease citation databases. Genes shown are filtered for one or more enrichments of the gene being studied in the presence of another gene (P(A|B) ≥ 0.2); outlined boxes denote sets of genes linked by shared enrichment patterns. (D) Frequency of biophysics-associated genes that have been studied among the cardiovascular disease article set (top) and the frequency of genes important for cardiovascular disease that have been studied in articles related to biophysics (bottom). Up to 90% of biophysics related genes are included in the cardiovascular disease literature, while up to two-thirds of disease-related genes also appear in biophysics articles

In the CHD/MI dataset (2235 genes≥ 3 articles), 127 genes appeared in ≥ 0.2% articles (Fig. 3C) and 25 genes formed pairwise interactions that had conditional probabilities ≥ 0.2. On average those gene pairs were mentioned in 47 articles (Fig. 3A) and the percentage of color-filled cells (especially among more frequently mentioned genes) implied a high degree of connectivity. For example, PLG and SERPINE1 had the highest chance (P(PLG|SERPINE1) = 0.74) of appearing together in an article, and the pair was mentioned together in 682 articles. 35 gene pairs met the conditional probability threshold of ≥ 0.2, which identified 3 linked gene networks (Fig. 4B). One high conditional probability outlier was manually removed from subsequent analysis: the genes ATP12A and ATP4A share the same alias, “proton pump”, and could not be distinguished as an interacting gene pair.

Fig. 4
figure 4

Gene sets linked using enriched conditional probability scores reveal refined and biologically connected networks. Linked gene networks are shown in tabular form for each of the 8 citation databases for biophysical cues (A) and cardiovascular diseases (B). Related but unconnected molecular networks are shown as separate rows within the same biological process row. Linked gene sets are visualized using IPA as molecular circuits with edges connecting molecules (nodes) in cases of known direct biological interaction. (A) All biophysical cues produced molecular circuits related to inflammation and immune cell adhesion (left) and the extracellular matrix (right). Pairwise only networks are also shown (middle). (B) Cardiovascular disease networks include a large network linking both inflammation and adhesion with extracellular matrix genes (left) and a network related to coagulation (right). Vertical color bars on network diagrams denote the subcellular positions of the molecules in the extracellular space, plasma membrane, cytoplasm, or nucleus (top to bottom)

Within the CVD/stroke dataset (containing 2580 total genes found in ≥ 3 articles), 147 genes appeared in ≥ 0.2% articles, and 40 genes had conditional probabilities ≥ 0.2. CVD/stroke included more gene pairs than CHD/MI, despite having a similar dataset size (Fig. 2C). Gene pairs studied in CVD/stroke (19 ± 54 articles) were documented in smaller article subsets than those in CHD/MI (47 ± 124 articles). The 68 gene pairs were connected to form 6 potential networks (Fig. 4B). High conditional probability values flagged the genes GABPA and NFE2L2 as a potential outlier interaction. Since they share the same alias NRF2, the trio network of GABPA-NFE2L2-HMOX1 could represented as a network pair interaction, NRF2-HMOX1.

In the DVT/PE dataset (464 total genes found in ≥ 3 articles), 79 genes appeared in ≥ 0.2% articles, and 26 gene interactions among 16 genes reflected enriched conditional probabilities ≥ 0.2 (Fig. 2C). Despite having fewer enriched gene pairs and a smaller overall number of citations in the DVT/PE database, the average number of articles supporting each gene pair was similar to the literature for CVD/stroke (24 vs 19 articles). This suggests the DVT/PE community has coalesced to study a more consistent set of molecular interactions. The 26 gene interactions in DVT/PE can be linked together to form 2 potential networks (Fig. 4B). In the PAD dataset (containing 559 genes appearing in at least 3 articles), 151 genes appeared in ≥ 0.2% articles, and 45 genes had conditional probabilities ≥ 0.2. Although the PAD dataset was the smallest of the 4 cardiovascular diseases, it included a broad range of gene interactions (80 pairs). Linking these genes pairs resulted in 3 molecular networks (Fig. 4).

The combined set of 8 citation databases represents a novel and contemporary snapshot of several important subdisciplines within mechanobiology and cardiovascular disease. While these databases can be used as a source for many systematic comparisons across the fields, our current effort focused on gene details relevant to underlying molecular control circuits. We quantified the overlap between the genes that have been studied in the context of both biophysical cues and cardiovascular diseases (Fig. 3D), as a first step to establish a systems-scale link between molecules that may contribute to both processes. We find that genes with biophysical relevance have also nearly always been studied in relation to cerebrovascular disease and stroke (shear: 84%, stiffness: 82%, strain: 85%, pressure: 90%), often also related to CHD/MI, and in 30–40% of cases in DVT/PE and PAD (Fig. 3D, top). These data imply that a rich data set is embedded in the literature that describes how biophysical-relevant genes may be affected during cardiovascular disease.

We also quantified the percent of genes studied in the context of disease for which articles also existed in the biophysical literature (Fig. 3D, bottom). Consistent with the smaller number of genes studied for DVT/PE and PAD (464 and 559, respectively), these were more likely to appear within articles in the biophysical databases, ranging from 55 to 69% representation. More genes have been studied in the context of CHD/MI and CVD/stroke (2235 and 2580, respectively), so even though a smaller fraction of them overlap with the biophysical literature (24–35%), this still amounts to sets of 600–800 genes that overlap between every combination of disease (CHD/MI or CVD/stroke) and biophysical cue (shear, stiffness, strain or pressure). Analyzing these overlapping genes sets in future may help establish the role of mechanosignaling in cardiovascular disease at a systems level, to complement the signaling pathway scale.

Network Analysis

To identify genes that may function together as molecular circuits, we identified all sets of genes, {x1,x2,..xi,...xn}, that could be associated together based on enriched probability of gene pairs to be studied in the same article citations (P(xi− 1|xi) ≥ 0.2). This approach reduced \(\sim 10^{2}\) enriched gene interactions to \(\sim 10^{1}\) molecular networks (≥ 2 linked genes/network), with an average of 9 ± 2 networks identified for each biophysical cue (Fig. 4A) and 8 ± 3 networks identified for each cardiovascular disease (Fig. 4B). A total of 68 individual networks were identified. The size of these networks had a bimodal distribution, with three-quarters of the networks (n = 52) consisting of two or three genes versus the remaining quarter of the networks, comprised of 8 larger networks each identified for biophysical cues and cardiovascular diseases. These larger biophysical cue networks contained more molecules per network on average than the large cardiovascular disease networks (median: 11 versus 5.5 molecules/network). This size difference between biophysical and disease networks may reflect greater consensus by authors in the biophysical cue community about choosing genes to study in parallel; be an artifact of the larger number of co-occurrence articles required to meet the conditional probability enrichment filter in case of large datasets like CHD/MI and CVD/stroke; and/or reflect underlying features of the biological networks themselves based on the literature to date. Among the smaller networks, antioxidant defense molecules CAT and SOD1 showed enriched co-mentions in every biophysical cue and all diseases except DVT/PE. The AP-1 family member FOS and JUN made frequent co-appearances in articles related to shear, strain, and pressure. In contrast to the biophysical cues however, the disease data sets included frequent co-appearance of molecules related to insulin, tumor growth, apoptosis, and thrombosis. Each biophysical stimulus and cardiovascular disease, except CHD/ MI, was also affiliated with a few cue-specific networks.

Our priority was to identify cell processes that may be targets of system-level regulation during biophysical or disease signaling, so we focused subsequent analyses on the larger networks. These networks readily clustered into a few even larger molecular networks formed by combining partially overlapping networks across biophysical cues or diseases. Molecules in these pan-biophysics or pan-disease networks related to non-overlapping cell functions, as noted (Fig. 4 tables, “Biological Process” column). Very intriguingly, the networks identified using this approach detected two key cell functions in all 8 citation databases: inflammation and immune cell adhesion (e.g., networks containing Interleukin and chemokine family members, TNF, Selectins, and CAMs) and extracellular matrix regulation (e.g., networks containing MMP, TIMP, and COL family members, as well as cartilage and bone associated genes). Genes involved in Inflammation/Immune Cell Adhesion and Extracellular Matrix were detected in all 8 datasets, but each field focused on a different subset of genes with varying degree of connections. Using the curated pathway analysis software IPA, we tested whether the linked gene sets that we had identified using only mathematical approaches could also be connected to one another physically via known biological interactions. All networks (n = 68) were highly connected when the software network building tool allowed consideration of direct and indirect interactions. These connections were so numerous that they made interpreting the signaling network diagram difficult, since they represent hundreds of edges connecting molecules and tens of thousands of embedded references collected by IPA that document these edge relationships. We observed a similar challenge building networks from correlation coefficients, in that the number of networks generated is too large to be informative compared to the sparse linked gene sets detected using enriched conditional (directional) probabilities.

To identify sparser networks more tractable for future experimental study, we reduced the signaling pathway connections to allow only direct interaction relationships between molecules within a linked gene set (Fig. 4, molecular network diagrams). No extra molecules were added. Thus, the networks we identified reflect a sparse set of genes, much sparser than the entire collection of inflammation, immune cell adhesion, or extracellular matrix genes known. The connection of these molecules to each other and their relevance to specific biophysical cues and cardiovascular diseases are thoroughly documented, by the custom citations databases created through this work and the biological references embedded within IPA. Furthermore, since networks nodes are only selected based on enriched conditional probabilities, these data-rich, sparse molecular networks are supported by many articles in the literature already that co-mention, and may co-study, multiple genes in a network in parallel. Locating and assimilating experimental observations taken in parallel is a valuable tool in quantitative model building to predict signaling pathway behavior. By using a computational approach to systematically and comprehensively review citation data, we have created a relatively simple tool to help accomplish an otherwise laborious and daunting task. The overlap in networks (Fig. 4A and B) reveals that most molecules in the networks important for cardiovascular disease also contribute to cellular mechanosignaling (Fig. 4B, overlapping genes denoted by solid blue stars). In future, this similarity could motivate integrating citations from different cue databases, to better understand how mechanics may alter cell signaling and impact cardiovascular disease. These networks also suggest a need for multi-factor studies in the pericellular and plasma membrane space, the subcellular location of the majority of genes identified in all networks. Some networks appear enriched for signaling in the plasma membrane (biophysics inflammation and cell adhesion, Fig. 4A left) versus the extracellular space (biophysics extracellular matrix, Fig. 4A right).

Physical Cue Sensing in Cardiovascular Diseases

In addition to mechanical forces and material properties, the cardiovascular system also relies on electrochemical communication to perform pacemaking and muscle contraction. We extended the approach used to study biophysical cues and cardiovascular diseases to analyze bioelectric articles and added 3 electrochemically related cardiovascular diseases: arrhythmia, bradycardia, and tachycardia (Supplemental Fig. 3). In the electric articles set, 2354 genes were mentioned in at least 3 articles, slightly less than the 3130 genes mentioned in ≥ 3 articles in any of the 4 common cardiovascular diseases (CAD/MI; CVD/Stroke; DVT/PE; PAD) but more than the 1235 genes mentioned in ≥ 3 articles in any of the 3 irregular heartbeat disorders (Fig. 5A). Most electric genes were documented in the 4 common cardiovascular disease-related articles (73% of the electric-related genes) and at least one biophysical citation dataset (49% of electric genes). In contrast, only 41% of the electric genes have been extensively studied in abnormal heartbeat-related articles, whereas 83% of the biophysical genes have been studied in at least one of the 4 common cardiovascular disease datasets. Approximately 700 genes have attracted attention from at least one citation database in all 4 subdisciplines (Fig. 5B). Furthermore, many genes (156 genes) have been studied in all 4 biomechanics-related diseases but not irregular heartbeat-related diseases, while a complementary set of genes (91 genes) has been studied in all 3 electric-related diseases but not all biomechanics-related diseases (Fig. 5C). Of the 70 genes present with high confidence in all of the 12 biophysics and cardiovascular disease databases (Fig. 5C-star and G), several of these contribute to the molecular networks derived in Fig. 4. Future studies may investigate the circuit activity and universality of these predicted molecular circuits in homeostasis and disease.

Fig. 5
figure 5

Systematic meta-analysis of citation data identifies thousands of genes studied for cardiovascular disease that also relate to cellular sensing of physical cues. (A) Bar charts showing the total number of genes mentioned in ≥ 3 articles in each database. (B) Venn diagram showing the overlap of individual genes studied in the context of any mechanical force or material property (shear, stiffness, strain, or pressure), electrical cue, any disease associated with irregular heart rhythm (arrhythmia, tachycardia, and bradycardia), and any common cardiovascular diseases (myocardial infarction/coronary artery disease, stroke/cerebrovascular disease, deep vein thrombosis/pulmonary embolism, peripheral artery disease). (C) Venn diagram showing overlap of cited genes, associated with all mechanical forces and stiffness (mentioned in all 4 citation databases), electric cues (1 citation database), common cardiovascular diseases (4 citation databases) and irregular heartbeat disorders (mentioned in all 3 citation databases). (D–F) Upset plots summarizing the number of genes unique to a single topic or the intersections among (D) biophysical databases, (E) cardiovascular disease databases, and (F) any individual disease dataset compared with the 5 biophysical cues datasets. Gene data used for Upset plots is available in the Supplemental Material. (G) Table listing the 70 genes that appeared in every citation database reviewed, suggesting these molecules are considered important across many fields and may deserve closer inspection. Green star highlighting the 70 genes (panel C) matches those itemized in subfigure G

We identified gene overlap among biophysical cues (Fig. 5D; 5 cues), cardiovascular diseases (Fig. 5E, 7 diseases), and between each individual cardiovascular disease compared to all biophysical cues (Fig. 5F; 7 disease-biophysics comparisons). Approximately 300 molecules were mentioned in articles related to all 5 biophysical cues, with additional subsets of genes reported common to electric and between 1 and 3 mechanical/stiffness cue databases (Fig. 5D). All 7 diseases we studied included 100–300 genes that are candidate highly mechanosensitive molecules, defined here as genes that have been studied across all biophysical cues, including mechanical forces, stiffness, and electric cues (Fig. 5F). Since only 70 genes were studied across all biophysical cues and all diseases though (Fig. 5C, green star and Fig. 5G, gene table), this suggests that the remaining few hundred genes studied in all biophysical contexts may reflect mechanosensing observations specific to one or a few diseases. Data for the presence or absence of individual genes (4037 unique genes) within each of the 12 citation databases (biophysics or disease related) is shown in the Supplemental Table 6.

Imposing the same 0.2 conditional probability threshold and 0.2% article threshold (≥ 3 genes) used to generate co-occurrence networks for biophysical cues and cardiovascular diseases, we identified linked genes and molecular networks associated with electric-related articles (Supplemental Fig. 4A). These electric networks included smaller subsets of the Inflammation and Insulin networks identified in the cardiovascular disease datasets (Fig. 4), as well as new networks related to potassium channels (8-gene network), tight junction proteins (5 genes), and gap junction proteins (5 genes) (Supplemental Fig. 4C). Compared to the biophysical and disease datasets, the articles focused on any one of these gene sets tend to be more siloed from the other gene sets, as shown by clusters of high intensity surrounded by larger regions of white on the co-mention heatmap (Supplemental Fig. 4B). It is noteworthy that by relaxing the threshold criteria used to define molecular networks, it is possible to identify additional co-occurring gene sets. For example, by decreasing the conditional probability threshold from 0.2 to 0.1 and decreasing the article threshold from 0.2% of articles to 10 articles, we identified 48 co-occurrence networks and 32 gene pairs in the electric dataset instead of the more stringent 5 networks and 6 pairs.

Discussion

PubMed-indexed articles related to cardiovascular diseases and mechanosignaling, and their overlap with regenerative engineering, encompass more than 1 million articles. We hypothesized that this literature set included studies of non-random groups of genes that operate as effectively latent molecular circuits. We used bioinformatics and computational analysis to systematically meta-analyze these article citations to identify genes and linked gene networks mentioned most often in article citations. We observed the following key results from this high-throughput analysis.

  • Stiffness and cerebrovascular disease/stroke articles are accumulating at the fastest rates (Fig. 1).

  • Analyzing these 4 biophysical cues and 4 cardiovascular diseases gave molecular insight into each field, scalable from overall genome- and citation database-wide trends down to single gene- and single article-resolution.

  • Genes mentioned in biophysics citations (up to 90%) also appear in cardiovascular disease citations, defining a tight molecular coupling between fields (Figs. 2 and 3).

  • Conditional probability filtering efficiently constructed molecular networks from citation databases. Inflammation and immune cell adhesion, and extracellular matrix regulation, dominated the largest molecular networks found across disease and biophysics databases (Fig. 4).

  • The molecular circuits identified connect via direct interactions and are backed by large numbers of field-specific citations. The full text of articles from the custom citation lists, plus the ranked gene lists generated for each field, provides new source material for building quantitative predictive models. These can focus on mechanisms or empirical behavior of genes and networks involved in cardiovascular diseases and biophysical cues.

  • Modeling and follow up experiments can also help distinguish whether observed asymmetries in conditional probability reflect empirical bias in experiment design (e.g., if a receptor is studied in isolation, while its ligand is routinely studied in parallel with the receptor) or a true difference in mechanosignaling (e.g., in case a receptor is mechanosensitive, but not its ligand) (Figs. 2 and 3).

  • A subset of 70 genes have been studied by biophysics, electric, and cardiovascular disease communities (Fig. 5). Future investigation is needed to determine the role this broad interest gene set may play in promoting biophysical homeostasis in cardiovascular cells.

This computational study of cardiovascular mechanobiology included limitations that affected the molecular networks we identified. First, since only article citations are text mined for biomolecule names, these results overlook genes that may be studied in the article full text but not mentioned in the citation. Despite this coarse-graining of the molecular data, this approach had the advantage of prioritizing important signaling networks, since these molecules would likely be the ones to appear prominently in article titles and abstracts. A second option to tune the sensitivity of network discovery is to adjust the combinations of filters applied at each stage of the algorithm. We required at least 20 separate article mentions (or ≥ 0.2% of gene-mentioning articles) to consider that enough information existed in the citation database for the gene for us to draw conclusions about its enriched conditional probability. Different filter choices may help optimize the sensitivity of linked gene networks discovered for other applications. Finally, the comprehensive scope of our current analysis precludes us from discussing quantitative models for any single signaling pathway. Future work could use these citation databases and gene annotation outputs to build data-driven computer models that quantify molecular mechanisms or empirical behavior of signaling pathways and predict how disease or biophysical changes will affect signaling behavior.

Meta-analysis approaches have been used to enhance the impact of existing scholarly works in many fields [45, 46]. Here, we offer new perspective on the genes and molecular networks associated with biophysical cues and cardiovascular diseases. Our ability to identify evidence-based network summary lists for biophysical cues and cardiovascular pathologies (Fig. 4) is made possible thanks to decades of research in cardiovascular bioengineering in particular, a field known for early recognition of the importance of mechanics on cell signaling [47,48,49,50]. This work also builds on efforts to provide systematic and regularly curated consensus ontologies (e.g., HGNC, UniProt) in bioinformatics subfields like gene and protein nomenclature. Our results demonstrate that ontologies play crucial roles in systems biology analyses, with choices about their use affecting the genes and molecular networks subsequently discovered.

There have been previous efforts to use computational text mining to gain insight into gene associations and networks [51,52,53,54,55,56]. These tools provide rapid and computationally powered insight into user-selected individual molecules or articles. Our approach fills a complementary niche in this space, in which the user has less prior knowledge: they desire a broad overview of molecular knowledge in a specific rapidly evolving field plus high resolution data about potential genes of interest (e.g., highly ranked or highly correlated genes), backed by individual citations. Combining these computational tools can help researchers synthesize large amounts of molecular signaling data to help identify bottlenecks or refine strategies for cardiovascular regenerative therapies. This can complement parallel efforts to build quantitative predictive models describing specific aspects of cardiovascular physiology [57, 58]. The citation and ranked gene results reported here for 4 biophysical cues, electric, and 7 cardiovascular diseases paves the way for future data-driven studies of how these molecules alone, or acting together in molecular circuits, support cardiovascular health.