1 Introduction

Covid-19 is an infectious disease caused by transmission of the novel coronavirus (2019-nCoV), leading to respiratory problems and deaths across the globe [1]. It was first detected in China in 2019, and within a few months Covid-19 was declared a global pandemic affecting millions of lives. Scientists and researchers worldwide are exploring different methods to contain the coronavirus. In the initial phase of Covid-19, it was identified to be specifically deleterious to children and older adults, but with the second wave and increased cases, it has been found to be associated with many other factors like type 2 diabetes, obesity, and respiratory illness. A recent study [2] suggests closely observing Covid-19 patients having comorbidities like diabetes or pneumonia. Similarly, many other studies and meta-analyses revealed hypertension, chronic obstructive pulmonary disease, diabetes, cerebrovascular disease, and cardiovascular disease as risk factors of Covid-19 [3,4,5].

Apart from comorbidities, another deciding factor of the progression of Covid-19 is the immune system [6, 7]. Due to variance in the immunity of individuals, their response to the virus is variable. Some patients are asymptomatic and recover through isolation and medicines, while others have mild to severe symptoms requiring hospitalization. In this time of crisis, vaccination has raised a little hope but along with that, there is an urgent need to improve our immune system to minimize the risk of Covid-19 and other related diseases. Diet is a vital component for our immune system and thus considered important for reducing risk of Covid-19 [8]. The nutrients found in fruits and vegetables have anti-inflammatory effects and are suggested for Covid-19 risk management. Vitamin A, vitamin C, selenium, and zinc are known to be potential options for prevention from Covid-19 as they are effective for immune functions [9]. Deficiency of vitamin D is known to be related to several diseases like diabetes, hypertension, and obesity, which are associated with Covid-19 risk. Thus, many researchers [2, 3] are suggesting vitamin D intake to protect against infection. Few recommendations have been suggested for optimal nutrition at different levels of health model such as leafy vegetables, dairy products, nuts, and citrus fruits, consisting of nutrients like iron, zinc, vitamin A, C, B6, and B12 [10]. Other works [7, 8] suggest foods like kiwifruit, broccoli, red pepper, strawberries, and citrus foods which are rich in vitamin C and foods rich in vitamin A like carrot, spinach, sweet potato, and vegetable oils, seeds, spinach, or supplements for vitamin D and E. Similar recommendations have been given like adding foods containing vitamin B6 and omega-3 polyunsaturated fatty acids to the list [11]. 
Consumption of diets rich in saturated fats, refined carbohydrates, and sugars, known as the Western diet, is known to severely affect the immune system, impairing its defense against viruses, and is thus not recommended [12].

Diet and comorbidities contribute significantly towards the prognosis of Covid-19 in infected individuals. Further, comorbidities are also impacted by the kind of diet consumed by different individuals. Thus, in the ongoing pandemic, understanding the dynamics of Covid-19, diet (the terms “food” and “diet” are used interchangeably in this work), and other diseases is important so that progression of diseases can be traced. Understanding the interplay of different diets and diseases is a lot more complicated than is generally known. For example, it is a general notion that a person suffering from diabetes should eat eggs and leafy greens. This might be helpful for some patients but not for others. This is because diets have different interactions with diseases for different individuals depending on other factors like their lifestyle, comorbidities, or environment. Intake of different combinations of food, even at different times of the day, affects metabolism and health. Apart from this, food allergies, relations of foods with disease subtypes, and even different forms of the same food item might affect individuals’ health differently [13, 14]. Moreover, such a disease-diet database is not readily available. In order to comprehend such complex interdependencies, there is a need to integrate authentic data from literature, identify associations, visualize them, and explore future possibilities using computational techniques. Such computational approaches have already been applied in various biological studies, for example, the use of machine learning to predict post-translational modification sites [15–17]. Network analysis is another such solution, which makes it possible to develop complex visualizations and draw inferences from them [18–20]. It is increasingly being used for solving many real-world problems which can be depicted as graphs. For example, social media connections are represented as graphs and analyzed to predict future connections or to understand disease outbreak mechanisms. Another area where networks are scrutinized is the representation of disease associations with entities like genes [21, 22], symptoms [23], microRNA [22], microbes [24], and drugs [25]. The same analogy can be applied to known associations among diet, other diseases, and Covid-19 so as to identify and predict significant unknown associations.

The motivation behind this work lies in the fact that there are associations existing among Covid-19, diet, and other diseases, which if discovered, can prove to be a valuable resource for healthcare researchers. Moreover, lack of an existing database corresponding to such associations is another reason to develop an approach for curation and analysis of such dataset. These predicted interdependencies can be further explored by researchers to understand their impact in real life. Once validated, the associations can be utilized by caregivers to plan healthy eating patterns for boosting immunity and reducing the risk of Covid-19 for patients afflicted with comorbidities. The main contributions of this work are:

  i. Curation of a database pertaining to known associations among Covid-19, diets, and other diseases from existing literature.

  ii. Introduction of Network Analysis as a computational technique for visualization and analysis of curated associations.

  iii. Development of a Network Analysis–based approach to identify and predict unknown associations of diets with Covid-19 and other diseases by comprehending the known associations.

2 State-of-the-art approaches

With more and more research about Covid-19, it is now a well-established fact that it is associated with the immune system and thus its prognosis is closely related to diet and certain comorbidities. There is a need to take advantage of this fact and study the association among these factors. Not much of the literature on this is available, this being a recent disease of concern. Table 1 describes the recent research works related to association of food with Covid-19. It is quite evident from the table that most of the works are based on survey of literature to understand the importance of nutrition for prevention or management of Covid-19. Only a very few studies are based on analysis of collected real-life data [25, 27]. In these studies, data related to levels of vitamins/nutrients (vitamin D, zinc, and selenium) of patients have been collected and analyzed for understanding their association with Covid-19. Since data have been collected from an academic medical center of University of Central Missouri (UCM) in the first study [25] and Tehran University of Medical Sciences in the second, it depends on factors like location, age, or ethnicity of groups involved and their physical parameters, thus covering a local view. The proposed approach aims to provide a global view by considering relations between Covid-19 and other diseases to infer unknown associations. Moreover, this approach is a novel attempt to use literature and computational approaches like Network Analysis to identify the associations between food and disease. Such an approach aims to develop a fast and efficient system which would be beneficial for caregivers and domain experts for planning a healthy lifestyle by eliminating toxic eating patterns. The approach might also act as a baseline to be used for predicting such associations for other related diseases.

Table 1 Comparison of related work

The rest of the paper is organized as follows: Section 3 is a detailed description of Network Analysis and its different approaches that can be utilized for predicting disease-diet associations. Section 4 describes the proposed approach for inferring unknown associations along with the challenges encountered and their solutions. Since this is a new research domain, Section 5 discusses some probable future directions along with conclusions in Section 6.

3 Network Analysis: a technique for identification of associations

This section briefly introduces Network Analysis as an important technique for predicting relations from complex datasets. It also presents some major computational approaches of Network Analysis which have already been used in other domains and discusses its aspects that can be used for diet-related inferences.

3.1 Background

Network Analysis is one of the most suitable computational techniques for analysis of heterogeneous complex data. It utilizes the properties or structure of complex network to infer interesting insights from data. Network analysis offers a means to develop complex visualizations and draw inferences from them [19135]. It is used for solving real-world problems like predicting future connections from social media, understanding criminal networks, and food chains. Networks are also widely used for analysis of associations like disease-symptoms [23] or disease-microRNA []. Recently, Network Analysis has been in use for inferring various associations, for example, to explore disease similarities based on shared symptoms and genes []. Random walks in a network and machine learning have been used to predict microRNA-disease associations in another study [35]. Similarly, classification has been done to predict drug-disease associations from networks [36]. Adjacencies in networks have been used to score drug-disease associations which were further used as features for prediction of unknown links [36]. The same analogy can be utilized for studying complex associations of diseases and diets for Covid-19.

Network visualization and analysis can be useful for studying complex disease-diet associations in multiple dimensions. Figure 1 depicts a snapshot of subgraph of a disease-diet network including 2 disease nodes (Inflammatory Bowel Disease (IBD) and Ulcerative Colitis (UC)), 10 diet nodes, and 15 associations. The nodes in the network either represent a diet or a disease term which were searched in PubMed literature to identify if they are associated or not. The co-occurrence of the terms in research papers represents that they are linked to one another; thus, the weights of associations in the graph indicate the normalized value of their co-occurrences. A complete graph thus generated helps in recognizing unknown associations and in performing predictions. Moreover, multiple dimensions can be considered for further analysis, for example, foods can be represented on different levels like based on their ingredients or vitamins and minerals [13].

Fig. 1 A snapshot of disease-diet subgraph based on term co-occurrences in PubMed publications

In a recent study [13, 14], analysis of disease and food networks has been carried out in different dimensions. Disease-food networks were developed and statistically analyzed using network parameters to realize significant foods. A food item was tagged as significantly harmful for a disease if multiple papers in literature report the same and vice-versa for significantly helpful. Further, a food item was tagged as significant if it is found to be associated with many diseases. Disease-disease network was constructed based on the similar foods to extract network and similarity measures for realizing disease complexity and similar diseases.

3.2 Proposed usage of Network Analysis techniques

Being a vast and novel area of research, Network Analysis can be utilized in multiple ways to extract significant disease and diet associations. The major computational approaches of Network Analysis which can be used for various diet-related inferences are as follows:

3.2.1 Overlapping of networks for prediction

Network overlapping and comparisons are done to realize unknown associations. Two networks can be overlapped if they have same types of nodes. For example, two networks having same diseases were developed in [23] where one was based on similar disease-symptoms and the other on similar disease-genes. The two graphs were integrated to create a single network of shared diseases and genes. The network parameters and analysis of this graph were useful for interpreting disease correlations and predicting unknown associations. Thus, two networks of diseases, in which one is based on diets and the other is based on a factor like symptoms, drugs, or genes, can also be overlapped to extract significant disease associations.
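The overlap idea can be illustrated with a minimal sketch. It assumes two small disease-disease networks stored as edge sets; all edges below are invented for illustration and are not taken from the curated data, and the helper names `normalize` and `overlap` are hypothetical.

```python
# Toy disease-disease networks; every edge here is made-up illustration data.
diet_based = {("T2DM", "NAFLD"), ("T2DM", "Obesity"), ("NAFLD", "Obesity")}
symptom_based = {("NAFLD", "T2DM"), ("Obesity", "Hypertension")}


def normalize(edges):
    """Treat edges as undirected by storing each pair as a frozenset."""
    return {frozenset(edge) for edge in edges}


def overlap(net_a, net_b):
    """Return edges shared by both networks and edges unique to each one."""
    a, b = normalize(net_a), normalize(net_b)
    return a & b, a - b, b - a


shared, only_diet, only_symptom = overlap(diet_based, symptom_based)
```

Edges in `shared` are supported by both views of the diseases, while edges found in only one network are candidates for further inspection.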

3.2.2 Dynamic graphs for understanding temporal relations

Apart from overlapping, a network can be compared to itself as it evolves with time so that the progression of patterns can be observed and future patterns can be forecasted [37, 38]. There are numerous meta-analyses [39–41] in which consumption data of a specific food item by patients suffering from a disease is collected. Similarly, dietary patterns of patients can be recorded over a period and represented in the form of temporal graphs. The graphs can be used to study dynamic trends for understanding disease-diet interactions.

3.2.3 Ranking and clustering for exploring similarity

Ranking identifies the most significant links in a network and orders them accordingly [35]. It is used for fixing the priority of links according to their rank. It also helps to simplify a dataset by identifying vital links. Ranking of diseases has been done previously based on their degree of association in the network [14]. Apart from using only the network parameters for ranking, the network can also be broken down into clusters and ranking can then be performed within each cluster, so that the most significant associations are recognized. Including clustering is expected to enhance the accuracy of ranking. In this way, the disease-diet network can be split into clusters and similar diet and disease links can be ranked.

3.2.4 Machine learning and link prediction for effective analysis

A machine learning framework can enhance traditional link prediction in networks. The features for this task can be extracted using various properties of a network and can be analyzed using machine learning techniques. Approaches like Clustering [42], Classification [36, 43, 138], and Ranking [137, 138] can be applied to the database to predict future links. A framework generated using machine learning and link prediction, as shown in Figure 2, can be used to forecast the possibility of association of a disease with another disease in a curated disease-disease network. Network parameters like SimRank, Common Neighbors, Adamic Adar, Clustering Coefficient, and Diameter [139, 140] can be extracted for different diseases like diabetes type 2, diabetes type 1, lung cancer, and breast cancer and used as features. Different machine learning approaches can then be applied to the retrieved dataset for inferring unknown associations.

Fig. 2 Framework for using Machine Learning approach for Link Prediction in Networks by extracting features including SimRank (SR), Common Neighbors (CN), Adamic Adar (AA), Clustering Coefficient (CC), Diameter (D), and Betweenness Centrality (BC)
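To make the feature-extraction step concrete, the following is a minimal sketch of two of the listed features, Common Neighbors and Adamic Adar, computed for a candidate disease pair. The toy adjacency data and the function names are assumptions for illustration only; the remaining features (SimRank, Clustering Coefficient, Diameter, Betweenness Centrality) are omitted.

```python
import math

# Hypothetical undirected disease-disease network stored as adjacency sets.
graph = {
    "T2DM": {"Obesity", "Hypertension", "NAFLD"},
    "T1DM": {"Hypertension"},
    "Obesity": {"T2DM", "NAFLD"},
    "Hypertension": {"T2DM", "T1DM"},
    "NAFLD": {"T2DM", "Obesity"},
}


def common_neighbors(g, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(g[u] & g[v])


def adamic_adar(g, u, v):
    """Sum of 1/log(degree) over shared neighbours; degree-1 nodes are skipped."""
    return sum(1.0 / math.log(len(g[z])) for z in g[u] & g[v] if len(g[z]) > 1)


def pair_features(g, u, v):
    """Feature vector for a candidate link (u, v), usable as classifier input."""
    return [common_neighbors(g, u, v), adamic_adar(g, u, v)]


features = pair_features(graph, "T2DM", "T1DM")
```

Feature vectors of this kind, computed for linked and unlinked pairs, could then be passed to any classifier to score unknown links.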

3.2.5 Personalized analysis

Work done in literature has focused on generic predictions for diseases. With technological changes and the realization of customized diets, future research should shift its focus to personalized diet analysis. This can be done by collecting data from patients and then developing a network. For example, the Personalized Nutrition Project [44] and the Hundred Person Wellness Project [45] have collected physical and other health parameters for exploring disease progression. In the case of a diet-based analysis, data on patients’ dietary habits can be collected [44], or data on fit individuals can be accumulated using food questionnaires. Parameters like heart rate, blood pressure, Healthy Eating Index (HEI), and body mass index can be extracted and utilized as features. Using such data of different individuals, their similarities can be measured and further used to generate a network. Profiles of similar individuals can be clustered using an appropriate clustering algorithm. Thus, a cluster can be predicted for a new individual based on their features and similarity. Based on the cluster an individual belongs to, a diet can be recommended (as shown in Figure 3). The data shown is sample data, but data for such analysis can be generated by monitoring patients as done in the Personalized Nutrition Project or can be taken from resources like the Eating and Health Module Dataset (American Time Use Survey) [46] as shown in Table 2. Other approaches that can be pursued for personalized inferences are:

  i. Understanding Diets: Diet-oriented predictions can be performed by evaluating data from food questionnaires and dietary habits. Individuals with similar diets can be clustered in a network, and then their network and other parameters can be explored in order to perform ranking.

  ii. Understanding temporal associations: Health parameters of an individual can also be collected at different points of time, and dynamic graphs can be compared to understand the progression or regression of disease based on various parameters.

  iii. Predicting possibility of disease: This can be done by training different machine learning and network algorithms on a diet-based dataset of healthy individuals and individuals suffering from a disease, and creating a suitable prediction model.

Fig. 3 Framework for personalized dietary predictions by extracting features including heart rate (HR), systolic blood pressure (SBP), EUE (for participation in exercise in last 7 days with 1 as yes and 2 as no), body mass index (BMI), steps per day (SD)
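The cluster-assignment idea described in this subsection can be sketched as a nearest-centroid rule. This is only an illustration under stated assumptions: the feature vectors (heart rate, systolic blood pressure, BMI) are invented, the clusters are given rather than learned, and plain Euclidean distance stands in for whatever similarity measure a real study would choose.

```python
import math

# Hypothetical profiles: (heart rate, systolic blood pressure, BMI); values are invented.
clusters = {
    "cluster_A": [(62, 115, 22.0), (66, 118, 23.5)],
    "cluster_B": [(85, 140, 31.0), (88, 145, 33.0)],
}


def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def assign_cluster(profile, clusters):
    """Return the name of the cluster whose centroid is closest to the profile."""
    centroids = {name: centroid(pts) for name, pts in clusters.items()}
    return min(centroids, key=lambda name: math.dist(profile, centroids[name]))


new_person = (84, 138, 30.2)
assigned = assign_cluster(new_person, clusters)
```

In practice the features would usually be standardized first so that no single parameter dominates the distance; a diet could then be suggested from the assigned cluster.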

Table 2 ATUS Eating and Health Module containing fields related to Body Mass Index (ERBMI), diet (EUDIETSODA for kind of soft drink with 1 for diet, 2 for regular and 3 for both/EUDRINK for drinking any beverage other than plain water with 1 for yes and 2 for no), exercise (EUEXERCISE)

4 Exploring diet associations with Covid-19 and other diseases: the proposed approach, challenges, and solutions

As seen in literature, many foods have been suggested as helpful or harmful for enhancing immunity and minimizing the risk of Covid-19 [1347, 8, 10, 47]. Foods are also known to be related to other diseases like diabetes and obesity, which are in turn related to Covid-19. This transitive relation can be used to infer unknown diet associations for Covid-19 and other diseases. Since Network Analysis has a great potential for understanding associations, it has been used in this study to uncover such associations. Network Analysis has been used effectively for identifying associations between diseases and entities like miRNA, symptoms, or drugs [18], but this research area has not been much explored for understanding its associations with diet. When a similar approach is undertaken for diet-based associations, it experiences challenges at different stages. This section discusses the proposed approach, challenges faced, and solutions to eliminate the challenges.

4.1 The proposed approach

In order to identify significant associations, there is a need to develop a stepwise approach which incorporates the necessary tasks of extracting, pre-processing, visualizing, and analyzing data using appropriate techniques. The approach developed in this paper is as shown in Figure 4.

Fig. 4 Proposed approach for exploring associations among Covid-19, diet, and other diseases

4.1.1 Curation of experimentally supported disease-diet associations

Challenge

The first and foremost step of the proposed approach requires reliable data from an authentic database of disease-diet associations in order to filter and establish meaningful associations between diet and Covid-19-related diseases. As no such database exists, curation is the only option. One method to curate such a database is to use existing websites, blogs, or books, but this lacks reliability because such sources provide only general notions regarding dietary habits. The second method is to identify relationships from existing literature by manually reading research papers as done in [14]. In that work, abstracts were read to curate a database of dietary habits and diseases. This is again not the best method as it requires many person-hours.

Solution

Literature mining of medical articles will expedite the curation process as well as preserve its reliability. For curating disease-diet associations, a technique named DIDACE has been proposed, designed, and used as an underlying method for automatically mining literature in our previous work [146]. This technique is further used in this study to extract experimentally supported associations among diet and Covid-19-related diseases. The proposed technique works as per the following steps (shown in Figure 5):

Fig. 5 Steps for curation of associations between diet and Covid-19-related diseases

  i. DIDACE for extracting associations: A software application has been designed in our previous work [146] to extract articles in which disease and diet terms occur together. The resulting database acts as a benchmark containing experimentally supported associations from which the required associations can be selected. Such an approach has been used in many other biomedical studies to produce authentic data [23, 146]. Standard disease and diet terms were downloaded from the research vocabulary named MeSH [48]. The MeSH vocabulary is used to index research articles in the medical literature database PubMed [49]. PubMed articles can be searched via an API called e-utilities. Thus, the application utilizes e-utilities to search disease and diet MeSH terms in PubMed articles. It returns an XML document containing the number of articles in which the disease-diet pair occurs (termed as co-occurrence) along with their ids. The retrieved co-occurrences are further normalized using the Term Frequency-Inverse Document Frequency metric [50]. A detailed description of disease-diet curation is given in the previous work [48, 50]. The distribution of co-occurrences for curated disease-diet associations is as shown in Figure 6. It is evident from the figure that most of the associations have low co-occurrences while very few associations have high values. This is similar to a power law distribution, which is a property of complex networks [47, 141]. Thus, the curated database is a good fit for developing a network.

    Fig. 6 Distribution of co-occurrences for curated disease-diet associations

  ii. Evaluation of curated database: The retrieved disease-diet database contains associations between two variables, namely disease and diet. A statistical significance test needs to be performed to examine the significance of their associations. Since more than 20% of the relations have expected frequencies (co-occurrences) less than 5, Fisher’s exact test is performed on this data to assess statistical significance [51]. The null hypothesis in this case is that the disease terms are independent of the diet terms, whereas the alternative hypothesis suggests a relationship between the two variables. The obtained p-value of 0.0004997 is less than 0.05; thus, the null hypothesis is rejected at the 5% significance level. This affirms a significant relationship between diseases and diets in the retrieved database.

  iii. Selection of Covid-19-related diseases and diets associations: Covid-19 is a novel disease and research regarding its associations with other diseases is still in its infancy. Therefore, this study focuses on diseases for which the literature provides evidence of a relation to Covid-19. A study [52] explored the gene expression patterns of Covid-19 and found high similarities with the characteristic patterns of a very few other diseases. The diseases include diabetes mellitus type 2 (T2DM), leukemia, psoriasis, pulmonary arterial hypertension, and non-alcoholic fatty liver disease (NAFLD). The similarities suggest that persons suffering from these diseases might need to take extra care to reduce Covid-19 risk. Other studies also suggest that patients suffering from T2DM and NAFLD are at a greater risk of developing infections and thus Covid-19 [53–56]. Thus, out of these diseases, only T2DM and NAFLD have been considered for study due to the presence of strong evidence in literature regarding their association with Covid-19. The curated database contains 274131 experimentally supported associations between 1917 diseases and 153 diets. This becomes a baseline for selecting records pertaining only to diseases known to be related to Covid-19. A total of 235 relations between NAFLD-Diets and T2DM-Diets are selected from the curated database.

  iv. Storage and visualization of associations: The retrieved associations database is further stored as a graph in one of the benchmark platforms for graph database management, named Neo4j. The graph contains 137 diet nodes, 2 disease nodes (T2DM and NAFLD), and 235 relationships between them. A subgraph of this graph database is shown in Figure 7, in which the weight of edges depicts co-occurrences normalized using a min-max scaler. Min-max normalization is used so that the complete dataset is viewed on a standard scale even if data from different sources is integrated. The most commonly used range for scaling is 0 to 1 [57], but in this study, a value of 0 would mean absence of association. For example, in this study, the minimum value of normalized co-occurrence is 0.1538 (association between NAFLD and infant food). If this value is scaled with range 0–1, it would convert to 0, which would lead to an ambiguous record. Moreover, the co-occurrences retrieved vary from very small values to very large values. For instance, the maximum value of normalized co-occurrence is 274.65 (association between T2DM and dietary fiber). A range from 0 to 1 would not be a good fit for such a vast difference between minimum and maximum values. Thus, a broader range not starting from 0, i.e., a range from 1 to 10, has been selected for this purpose. This implies that the association between NAFLD and infant food is taken as 1, that between T2DM and dietary fiber as 10, and other values lie between 1 and 10. As depicted in the figure, T2DM and NAFLD are mutually related to beer, coffee, energy drinks, carbonated beverage, and many other diet terms.

    Fig. 7 A subgraph of retrieved Covid-19-related diseases and diets associations graph
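Two of the curation steps above can be sketched in a few lines: building an e-utilities (ESearch) query for a disease-diet MeSH pair and parsing the returned XML, and rescaling co-occurrence values to the 1–10 range. This is a hedged illustration, not the DIDACE implementation: the function names, the exact query string, the `retmax` value, and the sample XML with its ids are all assumptions, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(disease, diet, retmax=20):
    """Build a PubMed ESearch URL for articles indexed with both MeSH terms."""
    term = f'"{disease}"[MeSH Terms] AND "{diet}"[MeSH Terms]'
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})


def parse_esearch(xml_text):
    """Extract the co-occurrence count and article ids from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [elem.text for elem in root.findall("./IdList/Id")]
    return count, ids


def min_max_scale(value, lo, hi, new_lo=1.0, new_hi=10.0):
    """Linearly map value from [lo, hi] to [new_lo, new_hi]."""
    return new_lo + (value - lo) * (new_hi - new_lo) / (hi - lo)


# Invented reply in the ESearch layout; the ids are placeholders, not real PMIDs.
sample = ("<eSearchResult><Count>2</Count>"
          "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>")
count, ids = parse_esearch(sample)
url = esearch_url("Diabetes Mellitus, Type 2", "Dietary Fiber")
weakest = min_max_scale(0.1538, 0.1538, 274.65)    # NAFLD and infant food
strongest = min_max_scale(274.65, 0.1538, 274.65)  # T2DM and dietary fiber
```

With the minimum and maximum reported in the text, the weakest association maps to 1 and the strongest to 10, so no retained association is scaled to 0.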

4.1.2 Curation of disease-disease associations

Challenge

After the Covid-19-related diseases and diets associations are established, the next step is to understand the associations between the underlying diseases. The associations between NAFLD and T2DM can be extracted using similarity measures. Finding an effective similarity measure is also a challenge in this study, as it will ultimately affect the results obtained. One approach is to use the curated disease-diet association graph and find similarity of diseases based on graph using traditional measures [58] like Cosine similarity or Euclidean distance, but this might make the entire database redundant.

Solution

The association between NAFLD and T2DM is extracted using semantic similarity which is based on finding relatedness from semantic meanings of terms [59]. The following tasks are performed in this step (also shown in Figure 8):

Fig. 8 Steps for curation of disease-disease and disease-Covid-19 associations graph

  i. A toolkit named DincRNA [60] provides the DisSim tool, which offers five state-of-the-art methods for calculating similarity of diseases. In this work, Wang’s method [59] is used, which is based on finding semantic similarity between terms using the hierarchical structure of the ontology. This method is used because, unlike traditional similarity metrics, it adds a different perspective (semantic meaning) to the dataset. The retrieved data is further normalized using a min-max scaler with a range of 1–10. This range is the same as in the previous step so that the complete dataset lies on the same scale. Thus, a graph containing disease-diet and disease-disease associations is obtained.

  ii. If a Covid-19 node is introduced in this graph along with its similarity with the other diseases as relationships, then its association with diets can be calculated. Due to its novelty, a database of associations between Covid-19 and other diseases is not available as of now. Thus, similarity of Covid-19 with other diseases is computed in a similar manner as done in the previous step by curating its co-occurrences using PubMed and e-utilities. The retrieved dataset is normalized using a min-max scaler (range 1–10), and records pertaining to relationships with NAFLD and T2DM are used to develop a graph containing disease-diet, disease-disease, and disease-Covid-19 associations.
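For readers unfamiliar with hierarchy-based semantic similarity, the following is a minimal sketch of the general idea behind Wang's method on a toy ontology. It does not reproduce the DisSim tool: the term names, the hierarchy, and the contribution factor `W = 0.5` are all assumptions chosen for illustration.

```python
# Toy is-a hierarchy (child -> parents); terms and structure are invented.
PARENTS = {
    "disease_A": ["metabolic"],
    "disease_B": ["metabolic", "hepatic"],
    "metabolic": ["root"],
    "hepatic": ["root"],
    "root": [],
}
W = 0.5  # assumed semantic contribution factor per edge


def s_values(term):
    """S-value of the term (1.0) and of each ancestor (best product of W along a path)."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for parent in PARENTS[current]:
            candidate = W * s[current]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s


def wang_similarity(a, b):
    """Shared-ancestor S-values divided by the total semantic value of both terms."""
    sa, sb = s_values(a), s_values(b)
    shared = sa.keys() & sb.keys()
    return sum(sa[t] + sb[t] for t in shared) / (sum(sa.values()) + sum(sb.values()))


similarity = wang_similarity("disease_A", "disease_B")
```

Two terms that share many close ancestors obtain a score near 1, whereas terms that meet only at the root obtain a low score.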

4.1.3 Prediction of diet associations for Covid-19 and other diseases

Challenge

Traditional methods of predicting probable links in a network mainly include techniques based on nodes similarity [14], statistics, and probability [61]. There is a need for a more powerful and computationally intelligent approach to handle complex networks. The approach should be able to utilize network structure as well as must possess learning capabilities.

Solution

In this step, inferences are drawn using graph algorithms, namely, the Louvain algorithm (LA) and K-Nearest Neighbors (KNN) using the Page Rank algorithm (PR). These two algorithms (LA and KNN) are selected because they utilize different learning properties, i.e., unsupervised and supervised respectively, to infer relations. Initially, the Louvain algorithm is used to divide the graph database into communities. Since only two communities are identified in this step, another algorithm, KNN, is introduced to further refine the analysis. In KNN, the Page Rank algorithm is applied to rank all the nodes in the graph. This is done so that the ranks can be further used for finding similarity. Thus, LA finds communities in the graph whereas KNN fetches the topmost related nodes in the graph. Using a combination of graph algorithms is intended to give refined results and better precision. The steps performed are discussed as follows (also shown in Figure 9):

Fig. 9 Steps for prediction of diet associations for Covid-19 and comorbidities (bigger size of node depicts a better rank in Page Rank algorithm)

  i. The Louvain algorithm is an unsupervised graph algorithm for inferring communities [62]. Nodes reported in the same community are known to be more related to one another than to those in other communities. The algorithm works on the principle of maximizing modularity gain, developing communities, and repeating the previous steps until no more changes are evident. The modularity gain of moving a node i into a community is defined as in Equation (1):

    $$\mathrm{M}=\left[ \frac{{\Sigma }_{in}+2{k}_{i,in }}{2m}-{\left(\frac{{\Sigma }_{tot}+{k}_{i}}{2m}\right)}^{2}\right]-\left[\frac{{\Sigma }_{in}}{2m}-{\left(\frac{{\Sigma }_{tot}}{2m}\right)}^{2}-{\left(\frac{{k}_{i}}{2m}\right)}^{2}\right]$$
    (1)

    where m is the sum of weights of all the relationships in the graph, \({\Sigma }_{in}\) is the sum of weights of relationships inside the community, \({k}_{i,in}\) is the sum of weights of relationships from node i to other nodes in the community, \({k}_{i}\) is the sum of weights of relationships incident to i, and \({\Sigma }_{tot}\) is the sum of weights of relationships incident to nodes in the community. This algorithm is selected because it makes no prior assumptions regarding the community of nodes and is specifically designed for graphs. The graph retrieved from the above steps is stored in Neo4j, and the Louvain algorithm is run using Cypher queries. Two communities are identified, containing a total of 140 nodes.

  ii.

    To refine the results and enhance the analysis, a second algorithm with a supervised learning property is applied. The KNN algorithm is run for all nodes in the graph to infer the most related nodes. It is a machine learning algorithm that finds the top K most similar nodes for each node in the dataset using a similarity function, which is computed from a given property of the nodes. In this study, the page rank of nodes is taken as that property, based on the principle that more connected nodes are more similar. Page Rank (PR) calculates the rank of a node using the number of nodes with which it is connected [63]. It was introduced by Google for ranking webpages so as to optimize search. Since page rank is a float value in this study, cosine similarity is used as the similarity function by KNN. This algorithm is selected because it is a simple technique with good accuracy and low runtime, which suits our small dataset.
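
As an illustration of Equation (1), the following minimal sketch evaluates the modularity gain for a single candidate move. The function name and the toy numbers are hypothetical and are not taken from the study; the arguments mirror the symbols defined for Equation (1).

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Modularity gain of placing node i in a community, following Equation (1).

    sigma_in  : sum of weights of relationships inside the community
    sigma_tot : sum of weights of relationships incident to nodes in the community
    k_i       : sum of weights of relationships incident to node i
    k_i_in    : sum of weights of relationships from node i into the community
    m         : sum of weights of all relationships in the graph
    """
    two_m = 2.0 * m
    # First bracket of Equation (1): modularity term with node i in the community.
    with_node = (sigma_in + 2.0 * k_i_in) / two_m - ((sigma_tot + k_i) / two_m) ** 2
    # Second bracket: modularity term with node i kept as its own community.
    without_node = sigma_in / two_m - (sigma_tot / two_m) ** 2 - (k_i / two_m) ** 2
    return with_node - without_node


# Hypothetical toy values, chosen only to exercise the formula.
gain = modularity_gain(sigma_in=4.0, sigma_tot=10.0, k_i=3.0, k_i_in=2.0, m=20.0)
print(f"modularity gain: {gain:.4f}")
```

Expanding the two brackets shows that the gain reduces to k_i,in/m minus Σ_tot·k_i/(2m²), so a move is favourable when node i has proportionally more weight into the community than its overall degree would suggest. For the toy values above the gain is approximately 0.0625.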

4.2 Results and evaluation

The two communities identified using the Louvain algorithm contain 46 and 94 nodes respectively. One of the communities contains NAFLD and Covid-19 terms along with 44 diet terms. The second community contains 93 diet terms related to T2DM. The disease and diet nodes retrieved in different communities are described in Table 3.

Table 3 Disease and diet nodes detected in different communities

The KNN and PR algorithms are then applied to refine the results further. The graph database is first used to obtain the page ranks of the nodes, and the K nearest neighbors algorithm is then applied to all the nodes. A higher sample rate increases the accuracy of KNN, so it is set to 0.8, along with a delta threshold of 0.001. The algorithm is run with values of K ranging from 2 to 20. Initially, the two most similar nodes are retrieved; as K increases, further similar nodes are added, as described in Table 4.
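
The two-stage idea described above (rank every node, then retrieve the K most similar nodes by rank) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the toy graph and node names are hypothetical, the search is exhaustive rather than sampled, and the similarity is an assumed inverse-distance form for scalar values, 1/(1+|a-b|), chosen for illustration. The study itself reports cosine similarity and runs the algorithms inside the graph database.

```python
def pagerank(adjacency, damping=0.85, tol=1e-6, max_iter=200):
    """Power-iteration Page Rank on an adjacency dict {node: [neighbours]}."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            neighbours = adjacency[v]
            if neighbours:
                share = damping * rank[v] / len(neighbours)
                for u in neighbours:
                    new_rank[u] += share
            else:
                # A node without outgoing links spreads its rank evenly.
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


def top_k_similar(scores, k):
    """For each node, return the k nodes whose scores are closest to its own."""
    result = {}
    for v, s in scores.items():
        # Assumed similarity for scalar properties: 1 / (1 + |a - b|).
        candidates = [(1.0 / (1.0 + abs(s - t)), u) for u, t in scores.items() if u != v]
        candidates.sort(key=lambda pair: (-pair[0], pair[1]))
        result[v] = [(u, sim) for sim, u in candidates[:k]]
    return result


# Hypothetical undirected toy graph (each edge listed in both directions).
adjacency = {
    "disease_a": ["diet_1", "diet_2", "diet_3"],
    "disease_b": ["diet_3", "diet_4"],
    "diet_1": ["disease_a"],
    "diet_2": ["disease_a"],
    "diet_3": ["disease_a", "disease_b"],
    "diet_4": ["disease_b"],
}
ranks = pagerank(adjacency)
neighbours = top_k_similar(ranks, k=2)
for node, similar in neighbours.items():
    print(node, [(u, round(sim, 4)) for u, sim in similar])
```

In this toy graph, diet_1 and diet_2 are connected identically and therefore receive the same rank, so each is the other's nearest neighbour. Raising k simply appends the next most similar nodes, which mirrors how the result lists grow as K increases.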

Table 4 Associations identified using K Nearest Neighbors and Page Rank algorithms

The obtained associations are validated by searching PubMed research papers for the respective diet terms. Table 5 presents the important associations validated from PubMed literature.

Table 5 Validation of predicted associations for diseases using PubMed literature

All the associations predicted for T2DM have been identified in the literature, as shown in Table 5 (records for the terms “sweetening agents,” “flavoring agents,” and “food additives” are displayed collectively because they are supported by the same literature). Interestingly, most of the diets associated with NAFLD have also been identified, except bread, dietary fats unsaturated, and edible grains. Because Covid-19 is a novel disease, research on it is in its infancy; thus, fewer diets have been validated for it. The diets predicted and validated for Covid-19 risk management include egg yolk, celery, sesame oil, strawberry, raspberries, honey, carrot, kefir, and onion. The predicted associations that could not be validated include food items like beets, watermelon, shellfish, cucumber, egg white, pumpkin, brazil nuts, infant formula, raw food, peach, and carbonated water.

Evaluation for this study uses precision as the measure, owing to the nature of the database. Because the predictions come from a dietary database and relate directly to human health, precision is of utmost importance in this case. High precision might mean returning fewer results, but most of them are correct. This kind of database cannot afford low precision with a high number of incorrect results. Precision is calculated as the ratio of true positives to all positives, where true positives are the actual associations (confirmed through PubMed literature validation) and all positives are the predicted associations. Since the relation between Covid-19 and diets is a very recent domain, validating the corresponding results in the literature is difficult. If only NAFLD and T2DM are considered, precision is around 92.5%, as shown in Table 6. The table lists different combinations of data samples of the results and their corresponding precision. The precision over all the data samples (76.7%) is also notable, given that the results include a novel disease.
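
The precision figures quoted above can be reproduced with a short calculation. The per-disease counts below are an assumption inferred from the text (20 predicted diets per disease, with 20, 17, and 9 validated for T2DM, NAFLD, and Covid-19 respectively); they are not copied from Table 6, and the function names are illustrative.

```python
def precision(validated, predicted):
    """Precision = true positives / all positives (all predicted associations)."""
    if predicted == 0:
        raise ValueError("precision is undefined when nothing is predicted")
    return validated / predicted


# (validated, predicted) counts per disease, inferred from the top-20 lists in the text.
counts = {
    "T2DM": (20, 20),
    "NAFLD": (17, 20),
    "Covid-19": (9, 20),
}


def pooled_precision(diseases):
    """Precision over the pooled predictions of the selected diseases."""
    validated = sum(counts[d][0] for d in diseases)
    predicted = sum(counts[d][1] for d in diseases)
    return precision(validated, predicted)


print(f"NAFLD + T2DM: {pooled_precision(['NAFLD', 'T2DM']):.1%}")
print(f"All diseases: {pooled_precision(['NAFLD', 'T2DM', 'Covid-19']):.1%}")
```

Under these assumed counts, pooling NAFLD and T2DM gives 37 validated out of 40 predicted associations, and adding Covid-19 gives 46 out of 60, which is consistent with the 92.5% and 76.7% reported in the text.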

Table 6 Precision for different data samples of results

4.3 Discussion

In this study, Network Analysis algorithms aim to uncover diet and disease associations from a graph database. Important inferences can be drawn from the results retrieved as follows:

  • Similar diets have been predicted for T2DM and NAFLD, which confirms the overlap seen in Figure 10. The diets include Dietary Fats, Dietary Fibre, Dietary Carbohydrates, Sweetening Agents, Flavoring Agents, Food Additives, Dietary Sucrose, Nutritive Sweeteners, Dietary Supplements, Vegetables, Whole grains, Bread, Dietary Fat Unsaturated, Fruits, Red Meat, Coffee, Nuts, Edible grains, Meat, and High Fructose Corn Syrup. All the predicted diets have been validated from the literature for T2DM, owing to the substantial body of research on the association of T2DM with diets. In the case of NAFLD, all the diets except bread, dietary fats unsaturated, and edible grains have been validated.

  • The results and validations show that diets like whole grains, vegetables, fruits, and nuts are helpful for both NAFLD and T2DM. This agrees with the fact that many researchers suggest the Mediterranean Diet (MD) for prevention and management of chronic diseases like T2DM and NAFLD [99145]. The Mediterranean Diet involves higher consumption of plant-based foods and fish and lower consumption of dairy products and meat, which agrees well with the results obtained.

  • The top 20 predicted diets for Covid-19 include Egg yolk, Celery, Beets, Sesame oil, Watermelon, Peach, Carrot, Strawberry, Raspberries, Shellfish, Brazil nuts, Pumpkin, Infant Formula, Raw foods, Cucumber, Egg white, Carbonated water, Kefir, Onion, and Honey. Of these, 9 associations have been validated: egg yolk, celery, sesame oil, strawberry, raspberries, honey, carrot, kefir, and onion. A validated diet here means that it is currently being investigated in the literature for its prospects in disease prevention and management. The validated list includes food items containing vitamin C and vitamin A, such as strawberry and carrot respectively, as well as items exhibiting antiviral properties, such as kefir.

  • The diet-based associations predicted for Covid-19 are depicted in Figure 11 along with their similarity scores. The most similar diets, as seen in the figure, are egg yolk, celery, beets, sesame oil, and watermelon. Of these 5 diets, 3 (egg yolk, celery, and sesame oil) have been validated in the literature. Such validations are interesting, considering the novelty of the disease.

  • Considering the high similarity scores retrieved for diets with Covid-19 (refer to Figure 11), the associations that could not be validated in the literature should be studied in further research. These include beets, watermelon, shellfish, cucumber, egg white, pumpkin, brazil nuts, infant formula, raw food, peach, and carbonated water, which merit investigation during the pandemic. The significance of the associations can be confirmed by performing a randomized trial or cohort study. Similarly, the associations that could not be validated for NAFLD, namely bread, dietary fats unsaturated, and edible grains, should also be examined in future work.

  • Covid-19 and NAFLD appear in the same community; thus, the diets present in that community should be explored for use with patients suffering from both diseases. The confirmed associations can then be used for planning a better diet for patients suffering from Covid-19, T2DM, or NAFLD, or combinations of these. This approach can help to expedite the process of testing different diets for Covid-19 and act as a baseline for discovering the dynamics of Covid-19 with different comorbidities according to patients’ health status.

Fig. 10 Important communities identified using the Louvain algorithm (diet terms including almonds, cashew, and mustard are clearly visible in one cluster with T2DM whereas pumpkin, apricot etc. are visible in other clusters with Covid-19 and NAFLD. Certain diets are also visible in the overlapping of both clusters including whole grains and nuts)

Fig. 11 Top 20 associations identified for Covid-19 along with similarity scores

5 Limitations and future research directions

Understanding the associations of disease and diet requires integration of multiple dimensions of networks. Owing to the lack of datasets and the novelty of the disease under consideration, this work has limitations, which can be addressed through the following future research directions:

  • The approach followed in this paper covers a small scope, because only a few diseases (T2DM and NAFLD) are considered as comorbidities. Within this scope, it can be used by caregivers for planning dietary recommendations for patients with these comorbidities. The work can also be extended by adding other known diseases to the network. This will require a larger validated database of associations, but it might help domain experts understand generic food and disease dynamics. Moreover, a similar approach can be developed for predicting associations for other novel diseases.

  • The relation of diets with Covid-19 and other comorbidities is predicted through the proposed approach, but its type (harmful/helpful) is still unknown. In addition to finding the strength of correlation with other diets, the dimension of relationship type should also be explored. Some diets may be harmful for one disease but helpful for its subtype, or vice versa. For example, consumption of cheese is known to be helpful for Inflammatory Bowel Disease (IBD) [132], whereas it is harmful for its subtypes Crohn’s Disease (CD) and Ulcerative Colitis (UC) [133]. A graph containing such complex relationships is shown in Figure 12. Such an underlying graph can be utilized to understand the role of different diets for diseases. As in the previous analysis, a Covid-19 node can be introduced into this graph and its relation (harmful/helpful) with other diets can be predicted. Label propagation algorithms can be used for this purpose.

  • It is known that different individuals respond differently to the same diet. A food item that suits one person may not suit another. In conditions like food allergies, diabetes, or pregnancy, individuals might need alternate diet recommendations. The proposed approach does not take this scenario into consideration. A similar approach can be devised in which associations between diets are derived from their associated diseases. If two diet/food items are associated with many mutual diseases, they are considered more similar. Based on this principle, their similarity can be predicted using similarity algorithms. Once the similarities are known, a graph can be generated and clustering/community detection can be performed. Alternate diets can then be recommended using the inferred communities and relationships.

  • This approach aims to explore unknown diet associations for Covid-19 and other diseases using computational methods. The main purpose is to expedite the process of understanding diet-based associations using Network Analysis. The predictions might act as a base for researchers to further explore the actual significance of the associations. Only after the predicted diets are validated by dieticians and medical researchers can the associations be used to design a service in which caregivers enter the comorbidities of patients and plan diets for them.
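
The diet-to-diet similarity direction listed above (diets that share many associated diseases are more similar) could be prototyped along the following lines. This is a minimal sketch, not part of the study: Jaccard similarity is assumed as one possible similarity function, and the diet and disease names are hypothetical placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diet_similarities(diet_to_diseases):
    """Similarity score for every unordered pair of diets."""
    return {
        (x, y): jaccard(diet_to_diseases[x], diet_to_diseases[y])
        for x, y in combinations(sorted(diet_to_diseases), 2)
    }


# Hypothetical associations: each diet maps to the set of diseases it is linked to.
diet_to_diseases = {
    "diet_a": {"disease_1", "disease_2", "disease_3"},
    "diet_b": {"disease_2", "disease_3"},
    "diet_c": {"disease_4"},
}

similarities = diet_similarities(diet_to_diseases)
for pair, score in sorted(similarities.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Pairs with a non-zero score can then be turned into weighted edges of a diet-diet graph, on which community detection would group candidate alternate diets.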

Fig. 12 Harmful and helpful diets for Inflammatory Bowel Disease, Ulcerative Colitis, and Crohn’s Disease

6 Conclusion

Disease-diet association data are among the least explored resources, yet they can be extremely valuable for understanding the progression of novel diseases if mined using appropriate computational methods. Covid-19 is a novel and life-threatening disease that currently requires sustained analysis in multiple dimensions, including disease and diet associations. In this paper, a computational approach has been proposed to identify diet associations for Covid-19 and other diseases (NAFLD and T2DM) using Network algorithms (LA, KNN, and PR). Certain challenges were faced during development of the approach, and computational methods like Network analysis and literature mining were introduced to address them. A blend of disease-diet, disease-disease, and disease-Covid-19 associations, along with computational techniques, aims to infer unknown diet associations with high precision. The computational techniques expedite the exploration of diet associations, which can be further validated by medical researchers or dieticians so as to design future dietary guidelines. This article also explores different ways in which Network analysis can further aid in recognizing unknown associations based on diets and diseases. Such computational frameworks can not only help in predicting unknown associations but can also evolve into personalized healthcare solutions.

6.1 Ethics approval

This article does not contain any studies with human participants or animals performed by any of the authors.