A cost-based multi-layer network approach for the discovery of patient phenotypes

Clinical records frequently include assessments of the characteristics of patients, which may include the completion of various questionnaires. These questionnaires provide a variety of perspectives on a patient's current state of well-being. Not only is it critical to capture the heterogeneity given by these perspectives, but there is also a growing demand for developing cost-effective technologies for clinical phenotyping. Filling out many questionnaires may be a strain for the patients and therefore costly. In this work, we propose COBALT -- a cost-based layer selector model for detecting phenotypes using a community detection approach. Our goal is to minimize the number of features used to build these phenotypes while preserving its quality. We test our model using questionnaire data from chronic tinnitus patients and represent the data in a multi-layer network structure. The model is then evaluated by predicting post-treatment data using baseline features (age, gender, and pre-treatment data) as well as the identified phenotypes as a feature. For some post-treatment variables, predictors using phenotypes from COBALT as features outperformed those using phenotypes detected by traditional clustering methods. Moreover, using phenotype data to predict post-treatment data proved beneficial in comparison with predictors that were solely trained with baseline features.


Introduction
Clinical records contain a wealth of characteristics of the patients: their vital signs, for example, and other subjective markers determined by questionnaires.There have been numerous studies on tinnitus [10,11,16], but due to the heterogeneity of its symptoms there is no standard treatment that fits all patients.Tailored treatments are therefore essential to tackle this issue [38].Phenotyping enables this by capturing heterogeneity and identifying subgroups of patients with comparable traits.For some chronic diseases like tinnitus, it is common that patients are asked to complete a series of questionnaires to assess the severity of their symptoms and its impact on their well-being.However, filling out multiple questionnaires can be a burden for patients [37], which might affect their adherence to filling out crucial questionnaire items.The cost-effectiveness of digital clinical phenotypes is already seen as a high priority, according to Huckvale et al. [25].
As a response, we propose an algorithm that builds a multi-layered network (MLN) of patient features (assessments, questionnaire scores) and derives communities from it, taking missingness and diversity into account.We opt for a MLN representation of the patient data, because MLNs can capture intra-and inter-associations between features, highlighting the heterogeneity of the various perspectives provided by each feature of the data.We propose an extension of our work in [36] by including a cost-aware aspect in our model.Our Cost-Based Layer Selector (COBALT) has thus a cost component that aims to minimize the number of features (questionnaire data) for the construction of the phenotypes, without compromising its quality.Our overarching research question for COBALT is: How can we exploit features and inter-relationships among them for the discovery of phenotypes?
We refine it as follows: 1. To what extent can tinnitus patients similarity be represented and captured using a multi-layer network? 2. How does our approach with a community detection algorithm in a multi-layer network compare to traditional clustering models?3. To what extent does including a cost-sensitive model affect the quality of the phenotypes found? 4. To what extent do missing nodes impact the quality of the phenotypes found?
To this purpose, we investigate how missing data influence the proposed model, as well as how traditional clustering algorithms compare to our approach.We also carry qualitative and quantitative evaluations to ensure that we analyze the impact of our findings with respect to a practical application.More specifically, we build a posttreatment data predictor that learns with baseline features (age, gender and pre-treatment data) and with the phenotypes discovered.We evaluate our approach on a clinical dataset with pre-treatment and post-treatment records of tinnitus patients.
The remainder of the paper is organized as follows: Section 2 presents related work on phenotyping and its applications on multi-layer networks, Section 3 presents the dataset characteristics used in the experiments, Section 4 describes the proposed algorithm and Section 5 describes how the proposed methodology was evaluated.In Section 6 the results are shown and discussed and Section 7 summarizes the main findings.

Related work
The role of cost-awareness in phenotyping research varies among the medical domains.Current tinnitus phenotyping research is centered on clustering methods such as hierarchical clustering analysis and latent clustering analysis [19].There are many surveys on clinical phenotyping [1,4,19], but they do not emphasize the issue of feature acquisition cost when constructing phenotypes.Huckvale et al. [25], on the other hand, identify a set of priorities for the future of clinical digital phenotyping, and they mention the cost-effectiveness of the models as important while implementing digital phenotyping in a clinical environment.Liang et al. [30] also states the importance of cost-effective methods and, more specifically, with respect to the acquisition of data.

Phenotyping with MLNs
Regarding the use of MLNs in the representation of phenotype and genotype data, Lee et al. [29] review computational methods using MLNs to represent the hierarchy of biological systems.They focus on the quality of the representation of interactions between phenotype data, gene data and SNP data in biological systems, rather than finding phenotypes.More recently, Yang et al. [41] use functional clustering methods to capture MLNs from any dimension of their genetic data.Concerning community detection for phenotype discovery, Kramer et al. [28] use the Leiden and Louvain algorithms to find phenotypes using a KNN and CoNet representation of the data.We also use the Leiden algorithm for community discovery in COBALT, but with a different representation of the data.This method discovers a set of communities over all the MLN layers.Hereafter, we use the terms 'set of communities' and 'partition (over the MLN)' interchangeably.An overview of community discovery algorithms for MLNs follows after explaining graph pruning.

Graph pruning in MLNs
Interdonato et al. [26] characterize graph pruning as a network simplification filtering technique.This technique improves computation for methods that do not perform well in large networks.Pruning in networks/graphs is a dimensionality reduction approach that assists in the removal of noisy and redundant edges or nodes.These methods can be grouped into: (i) centrality-based, (ii) node-layer relevance-based and (iii) model-based.The first two focus on node removal, wile the last one focuses on edge pruning.In COBALT, we use the Maximum Likelihood Filter (MLF) of Dianati [15], which is mentioned in [26] as an unbiased edge-filtering method that uses maximum entropy.
The main idea behind MLF is to create a null model which generates a "realized graph" with the same total weight and degree sequence as the original graph and use it to compute a p-value of each edge.Then, edges with p-values above a threshold α (significance level) are pruned.In the realized graph, the nodes are kept the same as in the original graph and with the same degree sequence.Edges are then randomly assigned to a pair of nodes, but with a likelihood.A node with a high degree is more likely to be assigned an edge than a node with a lower degree.This likelihood is computed using the binomial distribution, as follows: where m is an edge from the set of edges E that connected the pair of nodes (i, j); k i , k j are the degrees of nodes i and j, respectively; σ ij is the weight of the undirected edge between i and j, p = kikj 2E 2 and E = 1 2 i k i .The p-value of an edge connecting nodes i and j with a weight of w ij is then denoted as s ij (w ij ).Equation 2shows the calculation of the p-value of an edge with weight w ij .

Variational Bayes with stochastic block model
This method by Ali et al. [2] proposes to detect shared and unshared communities in a multiplex network.The nodes are represented in a community wise connectivity matrix in blocks.This matrix maps the probability of the nodes being connected to other nodes.Nodes are placed in the same block if their edges are stochastically similar.The Poisson distribution is used to calculate the probability of the nodes being connected.
The optimum number of blocks that represent the nodes is determined via the Bayes factor [2] and communities are extracted accordingly.Ali et al. [2] formulate the Weighted Stochastic Block Model (WSBM) adapted to a multiplex network.However, it is not stated if the achieved results outperform a benchmark method like Louvain.

Modularity optimization with GenLouvain
The GenLouvain algorithm is a modularity optimization algorithm that is widely used and considered a benchmark algorithm [24].Modularity is a metric that evaluates the quality of the partition of the network into communities.The multislice modularity is a modularity metric adapted to MLNs introduced by Mucha et al. [33].This metric can be defined as follows: (3) where µ is the number of edges in the MLN, γ is a resolution parameter, A ijα is the value of the edges between i and j in the layer α, C jαβ is an inter-layer edge between the same node j that belongs to layer α and β and k iα represents the degree of node i in the layer α.
The main difference between the GenLouvain algorithm and the traditional Louvain algorithm is that the modularity metric is replaced by the multislice modularity [27].Despite of Louvain being one of the benchmark algorithms for community detection based on modularity optimization, it can lead to sub-optimal partitions as shown by Traag et al. [40].To tackle this, Traag et al. [40] present the Leiden algorithm as an improvement to the Louvain algorithm.They include a refinement phase in which the nodes to be moved to another community do not have necessarily to provide the highest increase on the quality function.This is a major distinction to Louvain, where the approach is greedy and nodes are only assigned to another community if this results in the largest increase of the quality function.Nonetheless, in the Leiden algorithm there is a likelihood associated with the decision on which nodes should be moved to other communities.The higher the increase in the quality function provoked by moving a node, the higher the likelihood of that node to be selected.

AggregationPan
Pan et al. [35] propose an algorithm that aggregates the edge weight matrices and then applies a cut-off, so that the edge weights with low values (< τ ) are converted to 0. However, aggregation is criticized in the literature since its simplification may mask the true nature of the initial modular patterns [24].Plus, there is the need to define τ .

Particle competition
Gao et al. [18] work is based on a particle competition algorithm for MLNs.The fundamental concept is to insert a certain amount of K particles into network nodes.Each particle's goal is to dominate as many nodes as possible while also safeguarding their current dominated nodes from other particles.When a particle visits a node, it gains strength while weakening the other particles in the node.At the end of the algorithm, each particle should represent a community.The particles can move in two ways: random walking and preferential walking.Random walking chooses a node at random from its neighbors, whereas preferential walking visits a neighbor with a high dominance.A balancing parameter is then used to balance these two types of walking.This means that a node's propensity to choose a random or preferential walk is affected by this value.The experiments were, however, not tested in real world networks.

MNLPA
This algorithm is introduced by Alimadadi et al. [3] and consists in a generalization of the label propagation algorithm (LPA) to multiplex networks.At the beginning, each node is assigned a label.Then, similarity measures are used to compare nodes.If these two nodes have a certain similarity metric higher than a given threshold, then the label of the two nodes is replaced by a common label to both.This is done until the stopping criteria are achieved.However, the survey from [24] states that this method is better suited for directed and weighted networks rather than for general MLNs.It is also mentioned that it might be unstable due to the threshold parameter defined, which impacts strongly the network partition.

Similarity functions for MLNs
Bródka et al. [8] presents some methods to compare distributions between layers in a multiple network: (i) dissimilarity index, (ii) Kullback-Leibler, (iii) Jensen-Shannon and (iv) Jeffrey.
For comparing properties with binary and numeric values, some other measures are mentioned by Bródka et al. [8].For properties with binary values: Russel-Rao, Jaccard, coverage, Kulczyński, simple matching coefficient (SMC) and Hemann.For properties with numeric values, the following metrics are normally used: cosine similarity, Pearson correlation similarity, Spearman correlation coefficient.

Comparing the networks
Tantardini et al. [39] present approaches for comparing networks, both when the nodes are aligned and the pairs are known (known-node correspondence, KNC) and when the networks comprise nodes that are not aligned and hence distinct (unknown-node correspondence, UNC).For undirected and weighted networks, some methods are presented in Table 1.
Concerning UNC, it is stated that global statistics are not a reliable tool for comparing layer similarity because they are overly simplistic.[39], the MI-GRAAL has a significant computational cost, but the Portrait divergence is largely efficient with small to medium graphs.

Comparing communities
All of the above metrics focus on describing the similarity between networks based on their edges, nodes and/or other properties.For the current work, we are interested in comparing a specific property -the community structures among layers.Recently, Ghawi and Pfeffer [20] identified some drawbacks when using extrinsic (that require ground truth) evaluation metrics, such as the direction of the comparison.More precisely, if we have 2 layers of a MLN, l α and l β , there are twoways for matching the communities in each one of them.One can use either l α or l β as the basis of the comparison and this may lead to different results.For instance, let C lα = {c 1 , c 2 , ..., c r } be the set of communities in l α and C l β = {c 1 , c 2 , ..., c k } be the set of communities in l β .Let n lα C1 = {n 1 , n 2 , ..., n m } be the set of nodes in c 1 of layer l α and n l β C1 = {n 1 , n 2 , ..., n n } the set of nodes in c 1 of layer l β .In these communities, if we consider l α as the basis of the comparison (the ground truth) and l β as the layer to be compared with the group truth.If we look at node existence, we will check for the number of nodes from n lα C1 that are in n l β C1 .In the case that n lα C1 has 15 nodes and n l β C1 has 30, from which 15 are the same as in l α , then the similarity metric gives a perfect similarity.Therefore, a two-way matching is required, in which each layer is considered the ground truth -one at a time -and then both similarity values are combined into a single metric.Since purity and F-measure are among the most commonly used metrics for clustering evaluation and they allow the comparison between two clustering solutions, they fit the purpose of the current work.
Ghawi and Pfeffer [20] convert the purity in a two-way matching by computing firstly the purity of l α against l β , purity [lα l β ] , and viceversa, purity [l β lα] .The harmonic mean between both values is computed and the final purity is as follows: With this metric, we can study the extent to which the clustering is "pure," with respect to node existence.
F-measure is computed using the recall and precision of the clusters.The precision of a cluster is the same as its purity.The recall metric evaluates the fraction of nodes that are shared between the ground truth cluster and the systemgenerated one.In the end, the F-measure of a cluster is the harmonic mean of its precision and recall.Considering that F [lα l β ] is the F-measure when considering l α as ground truth and l β as the system-generated clustering solution and F [l β lα] is the F-measure of the clustering solution when l β is considered as ground-truth and l α as the system-generated clustering solution.The overall F-measure is given by the harmonic mean of both values:

Materials
The data analyzed in this study refers to chronic tinnitus patients admitted to the University Hospital of Regensburg.The data was gathered between January 3, 2016 and May 28, 2020.
The studies involving human participants were reviewed and approved by the ethics committee of the University Regensburg.The patients/participants provided their written informed consent to participate in this study.
At the time of admission, each patient fills out a series of questionnaires meant to assess some of the patient's mental and physiologic symptoms.The questionnaire data used in this research was gathered from five questionnaires: tinnitus questionnaire by Goebel and Hiller (TQ) [21], tinnitus handicap inventory (THI) [31], tinnitus functional index (TFI) [32], major depression inventory (MDI) [6], and tinnitus impairment questionnaire (TBF12) [22].In total, data from 1087 patients were considered.
Two time points are considered: t 0 denotes the so-called 'screening', where all questionnaires are answered and the treatment is scheduled to start; t 1 denotes the moment of the last visit of the patient at the end of the treatment, whereby some of the questionnaires are answered again and the scores are compared.We also use the expressions 'pre-treatment' and 'before treatment' (moment) for t 0 and 'post-treatment', 'after treatment' (moment) for t 1 .
Table 2 shows the number of patients with available data at t 0 and the number of patients with records at both t 0 and t 1 .The last column shows the ranges of the questionnaire scores.It can be seen that the questionnaires have different value ranges.However, for all of them smaller values are better, in the sense that the patient is in better health.

Methodology
We adopt an iterative and cost-sensitive approach to find communities in a MLN and name it COBALT(Cost-Based Layer Selector).
The first phase is denoted as representation.The nodes, edges and layers are defined in this phase.Then, pruning is applied to remove edges that are not statistically significant.Subsequently, we introduce the cost-sensitive component to our model.At this phase, the minimum set of layers that capture patient phenotypes without compromising its quality is found.This is done iteratively, with layers being added to the structure in a specific order determined by their cost.The cost of each layer is also modeled.Subsection 4.1 describes the representation phase, subsection 4.2 describes the graph pruning phase, subsection 4.3 presents the layer cost model, subsection 4.4 illustrates an example of our MLN, subsection 4.5 describes the search algorithm and subsection 4.6 describes the stopping criteria for the proposed algorithm.

Representation
Let p ∈ P be the set of nodes (for our application: patients) in a graph g ∈ G (for our application: questionnaires).Each g is considered a layer and therefore we denote it instead by l ∈ L. A node p i in layer l is denoted as (p i , l).Let e ∈ E denote the edges in layer l.The edge between two nodes p i and p j in layer l can be defined as e (pi,pj ,l) , which we simplify to (p i , p j , l).We denote the edge weight (which is an attribute of the edge) between these nodes as w pi,pj ,l .
A MLN is constructed to combine the multiple data features into a single structure.This network has L layers and each layer has nodes and connections between them (edges) with a weight associated to it (edge weight).
We span an "intra-layer" edge between each pair of nodes within the same layer, assigning the value of the normalized distance between the nodes in this layer to it (edge weight).We span an "inter-layer" edge across two layers, connecting nodes that represent the same patient in different layers, and assign a value to it based on the distance between the nodes.
When building the inter-layer edges, two types of edges may exist: 1. edges of a node with itself between two layers 2. edges between different nodes located in different layers We generate the inter-layer edges to incorporate the different perspectives given by different features of a node.Hence, only inter-layer edges between the same nodes in different layers are incorporated.We use the same logic as in [36], which we explain in more detail in 4.1.1 and in 4.1.2.

Intra-layer edges
The distances between nodes within a layer are represented by intra-layer edges.A layer represents a feature in our algorithm and in the specific case of our application it represents a questionnaire.These, however, can be generalized to other types of numerical features.Considering two nodes p i and p j in the same layer, the edge between them are defined by the difference between their feature values, which in our case are questionnaire scores.Assuming a questionnaire l ∈ L (L is the set of layers that represent questionnaires), score pi,l and score pj ,l denote the scores of nodes p i and p j in layer l, respectively.
The larger the discrepancy in scores between nodes, the lower their edge weight.This ensures that their connection is represented by a "weak" edge weight if their scores are not close.To accommodate for this, we define the weight as in Equation 6, which describes the transformation 1/x.
In Equation 6, w pi,pj ,l denotes the edge weight that connects p i and p j in layer l.The scores are then normalized by subtracting the mean and dividing by the standard deviation.

Inter-layer edges
Equation 7 shows how to compute the weight w (pi,lα),(pi,l β ) of an inter-layer edge that connects (p i , l α ) and (p i , l β ).
score pi,lα and score pi,l β denote the normalized score of layer l α and l β for node p i , respectively.l α , l β ∈ L correspond to layers α and β, respectively.

Graph pruning
The MLN representation of the previous paragraphs results into a fully-connected network at each layer.To ensure that only the most important edges are retained, we apply next a graph pruning step for inter-layer and intra-layer edges, using the maximum likelihood filter (MLF) proposed by Dianati [14], see subsection 2.2.We fix the threshold for pruning the edges at 0.05, i.e. edges with a p-value higher than 0.05 are removed.

Cost model for layer selection
Certain questionnaires may be more likely to be completed than others (due to the type of questions).This aspect should be considered while deciding on the next layer to be added to the MLN.The similarity of communities between layers is a second criterion of relevance.We intend to add a layer containing additional information about the patients while avoiding adding redundant information.As a result, the more distinct the layers are with respect to their communities, the less redundant they are.The goal is to add a layer with a low community similarity.
Hence, the cost of a layer is calculated using two terms: an availability ratio term and a community similarity term.We formulate the function to measure the cost of a layer l α , C linc,lα , with respect to the incumbent set of layers in the network (l inc ) as: where A linc,lα denotes the availability ratio term and CS linc,lα the community similarity term.These two terms are described hereafter.

Availability ratio term
We term the ratio of completion of the questionnaires as availability ratio.A linc,lα denotes the availability ratio of the questionnaire that is represented by layer l α against l inc .It is the ratio of nodes that are not missing in l α from the nodes that are already in l inc .
Considering the set of nodes in layer l α as p lα = {p 1 , p 2 , ..., p m } and in the incumbent layer or layer set, l inc , as p linc = {p 1 , p 2 , ..., p n }, A linclα is given by: The higher the amount of missing nodes in a layer, the fewer the number of nodes (patients) that are added to the MLN.The goal is to add the maximum information about the nodes in each iteration.Therefore, we aim to maximize the availability ratio term.

Community similarity term
We define the community similarity term as a measure of how similar are communities between layers, with respect to the assigned nodes.More specifically, we focus on quantifying the shared nodes between communities of different layers.
We use the bi-directional F-measure to quantify this term.There are two reasons for this choice: (i) the metric includes both the purity/precision and the recall of the solution and (ii) it handles the absence of ground truth by considering as ground truth one layer at a time.
The community similarity between layers l α and the incumbent layer or set of layers l inc is modeled as: As previously stated, we intend to use layers that provide the maximum additional information about the nodes.If the community structure of two layers is substantially similar, they are seen as redundant.We intend the exact opposite: to add layers with a community structure that differs from that of the incumbent (current) network.As a result, CS linc,lα should be minimized.In Figure 1, the edge that connects nodes p i and p j in layer l α is thicker than the ones that connect them to p k in the same layer.This is because these two nodes are more similar between themselves than with p k .The score of p i is 10% whereas the score of the p j is 20% in l α .Hence, the difference between them, in layer l α , is only 10%.If we compare the score of p k (of 70%), this difference is higher and hence the similarity between p i and p j with p k is lower.The edges are therefore represented by thinner lines.

Illustration example
Since the layers represent different features, the similarity between the same set of entities may differ from layer to layer.An example is illustrated in layers l α and l β , where the pair of nodes p i and p j have a higher distance between them in l β than in l α , making the connection between them "weaker".One of the motivations to use MLNs is the fact that inter-layer edges provide information about the association between features (layers).This association is taken into account when building the communities with a community detection algorithm, by using both intra-layer distances and inter-layer distances for modeling.
The example also illustrates communities generated in each layer, individually.Figure 1 represents a scenario in which a community detection algorithm runs on each single-layer network.In the example, node p i appears in all layers and in the same community, but node p j is assigned to the red community in l α and to the blue community in l β and in l m .Additionally, node p j is assigned to the same community as p i in l α , but to other communities in the other layers.We assume that if the community structure of layers are heterogeneous, then they provide different information about the nodes (patients).These three layers could then be considered as not redundant with respect to its community structure.
As previously mentioned, using a bi-directional metric for layer similarity is crucial in the context of the current work, due to the fact that groundtruth is unknown.For instance, if we compare the red community in l α to the red community in l m and use the l α as the ground truth, then its precision and recall is 1 and 1 2 , respectively.For the blue community, precision and recall are both 0. If we compute the weighted sum of the precision and recall, then this results in 1  2 of precision and 1 4 of recall.Therefore, the F-measure (F [lα lm] ) over all communities is 1 3 .In contrast, when we consider l m as the ground-truth, then the precision and recall for the red community are, respectively, 1 2 and 1.For the blue community, both values are 0. In this case, we also include the grey community and compute its precision and recall, which is 0 for both values.The weighted average of the recall, for this community solution, is 1  6 and of the precision is 1  3 .The F-measure (F [lm lα] ) is then 2 9 .As a result, there is a significant difference when one layer is assumed to be the ground-truth solution versus the other.

Cost-Based Layer Selector -COBALT
On the basis of the aforementioned cost model, we now present our algorithm COBALT for costbased layer selection and community construction.Table 3 displays a description of the main variables used in the pseudocodes that follow.
Algorithm 1 shows the pseudocode for the generation of the starting solution.The input of Algorithm 1 is the set of all single-layer graphs G.For each g ∈ G, COBALT invokes the Leiden algorithm which builds a set of communities and returns two objects: the 'partition' S, which encompasses the community membership of each  (S, q) ← Leiden(g)

6:
S single ← S single ∪ {S} 7: if q bestsingle < q then 8: S bestsingle ← S ; q bestsingle ← q 10: end if 11: end for 12: return g bestsingle , S bestsingle , S single node for the incumbent graph g ∈ G, and q -the modularity of S. The partition S bestsingle with the largest modularity and the corresponding graph g bestsingle are returned as initial solution of COBALT, together with the set of all single-layer partitions.
After the initialization, COBALT gradually adds layers to build up a multi-layer network and derive the best set of communities in it.In each iteration, it invokes the cost model (cf.subsection 4.3) to add the least-cost layer, until a stopping criterion is met.The pseudocode is depicted in Algorithm 2. Note that we consider the first iteration of Algorithm 2 as iteration 2, since the iteration 1 represents Algorithm 1.
The input of Algorithm 2 is the set of singlelayer graphs G and the outputs of Algorithm 1.The graph g bestsingle becomes the first layer of the MLN structure L, which is built iteratively by adding layers.extend L with g 13: (S L , q L ) ← Leiden(L) 14: G candidates ← G candidates \ {g} 15: end while 16: return S L is dynamically updated to remove the graph g added at each iteration.To choose this graph g, the layer cost is computed for each candidate u ∈ G candidates , using the cost formula of Eq. 8.
This computation is made between the candidate u and L, but it demands also the community memberships (for the community similarity term).To make this more clear, we include the corresponding partitions.
Next, the least-cost candidate g is selected.L is expanded to incorporate g and the interlayer edges between its nodes and the nodes in g.Then, the set of communities is built, G candidates is updated and the next iteration starts.
The algorithm runs while there are candidates in G candidates and while the stopping condition SC is false.The stopping condition monitors the cost of the growing MLN, and is specified in the next subsection 4.6.

Stopping condition for COBALT
The main loop of COBALT (cf.Algorithm 2, while-loop) gradually adds each layer; the cost function only decides which layer to add next.Since layers are chosen on the basis of availability ratio and community similarity (Eq.2), and since both factors take positive values, cost cannot be negative.Availability ratio may increase or drop from one iteration to the next, though.Two stopping criteria can be derived from it: • SC1: COBALT stops when the availability ratio decreases towards the previous iteration • SC2: COBALT stops when the availability ratio drops and the community similarity increases where the 'previous iteration' is the 2nd (i.e. the 1st after the initialization) or a later one.We choose SC1 as stopping condition in the while-loop.In our experiments, SC1 is not used, because we study the behavior of COBALT as each layer is added.We rather report at which iteration COBALT would have stopped, and what would have been the effect on the communities' contribution to predictive power.

Evaluation design
For the evaluation of COBALT, we use modularity as internal measure and contribution to predictive performance as external measure.We further quantify the impact of missingness.

Phenotype quality as modularity
To investigate the role of layer cost on phenotype quality, we use modularity as community quality evaluation measure, cf.formula in Eq. 3 and we study how modularity changes as layers are added in a cost-sensitive way.

Community visualization scheme
To acquire insights on how communities change after selecting each additional layer, we use the Fruchterman-Reingold layout [17] as basis for visualization: nodes that are positioned closer to one another have stronger connections between them than with the ones located far apart.In this layout, we use colors for the communities, i.e. each node takes the color of the community it belongs to.Thus, 'good' communities are visualized as graph partitions colored with only one color, while the occurrence of multiple colors in one area of the visualized network indicates that the communities are mixed up.
Since the Fruchterman-Reingold layout is twodimensional, we place the visualizations of the layers used in each iteration for community detection next to each other; thus, we can see how the colors/communities span across layers.

Measuring the contribution of phenotypes to predictive quality
To assess the contribution of the phenotypes discovered by COBALT towards phenotype-sensitive treatment, we first build a 'Baseline' set of regressors, each of which predicts the score of each questionnaire (layer) at t 1 .They use the following features for each patient: age, gender and score of that questionnaire at t 0 .We compare this 'Baseline' to regressors that also exploit community information, namely the ID of the patient's community for each patient/layer node.Since the patient's community changes as layers are added by COBALT, we train one regressor on the community augmented data for each iteration.
For regression we use linear regressor, ridge, LASSO and SVR (support vector regressor).To set the hyperparameters, we apply a grid search.We perform 10-fold cross validation, and we evaluate with mean absolute error (MAE), mean squared error (MSE) and R 2 .
By using the post-treatment score (at t 1 ) of each layer as target and the communities at each iteration as input, we also assess to what extent we can predict a post-treatment score for some layer without exploiting the pre-treatment data (at t 0 ) of this layer.In particular, assume that the posttreatment score of questionnaire/layer l is used as target and that we use as input the communities learned over the layers l 1 (first iteration) and l 2 (second iteration), where l ∈ {l 1 , l 2 }.If the prediction quality is high in the evaluation setting for a given phenotype, then we expect that we can predict the score of l at t 1 without recording layer l for this phenotype at all.Lastly, we compare the contribution to predictive quality achieved by the communities found by COBALT to the predictive quality achieved when using clusters built with traditional clustering algorithms.The clustering techniques used are: (i) agglomerative hierarchical clustering [12], which we denote as 'AHS'; (ii) BIRCH [42], (iii) Gaussian Mixture Models with the Expectation Maximization algorithm [7,13], which we denote as 'GMM EM'; (iv) HDBSCAN [9], (v) k-means [23] and (vi) OPTICS [5].It must be stressed that these clustering models serve only as baselines, because they operate under different conditions than COBALT: they are cost-insensitive, since they exploit all questionnaires/layers, while COBALT uses only the layers up to a given iteration.Moreover, these clustering algorithms do not handle missing values, hence they are trained only on patients that answered all the questionnaires.

Measuring how missingness affects phenotype quality
COBALT is designed to deal with layers that contain only few nodes, i.e. layers for which only few entities (in our application: patients) have delivered data.To measure the influence of missingness, we gradually introduce missingness into a dataset that originally has no missing data.In this dataset, when removing an entity, we remove its corresponding node from all layers.The complete workflow is as follows.If the dataset in use contains missing values, we extract from it the subset D that has no missing values.In the next step, we specify the maximum and minimum of the 'missingness ratio', which we define the percentage of entities to be randomly selected and eliminated from D. Next, we perform a grid search between the minimum and maximum percentage and derive the corresponding subset of D for each value in the grid; for our experiment, we vary the missingness ratio from 0.1(10% of entities removed) to 0.9(90% of the entities removed), with a step of 0.1, i.e. of 10%.Finally, we run COBALT on each derived dataset and measure quality as (i) modularity at each iteration and (ii) predictive quality for the iteration with the best modularity.

Results and Discussion
For our evaluation, we apply COBALT on the tinnitus patient dataset described in section 3.This dataset contains 5 questionnaires: in a realistic scenario, the addition of layers would have stopped at an upper boundary of cost/budget; for the evaluation, we add all layers and study the effects of adding each one.COBALT adds the layers in the following order: (1) THI, (2) MDI, (3) TQ, (4) TBF12 and (5) TFI questionnaire.This indicates that the benefit of adding THI is the highest and the benefit of adding TFI is low.(uppermost subfigure), while the community similarity stagnates at 0 (lowermost subfigure), indicating that there is substantial difference among the communities built while adding layers.Despite the decrease in modularity, it is important to note that the drop is from 0.730 to 0.591, indicating that there are underlying structures in the MLN.The visualizations presented next deliver insights on how communities change as layers are added.

Modularity vs cost per iteration
At the same time, the cost of adding a layer is increasing for all but the 4th layer (cf.middle upper subfigure).This increase is mostly due to the decrease in the availability ratio (cf.middle lower subfigure), since community similarity is always low.The availability ratio decreases slowly, hence SC1 can be triggered already at the 3rd or at the 4th iteration.

Community visualization
In Figures 3 to 8 we show the communities as COBALT adds layers.The visualization is twodimensional and therefore we show one subfigure per layer.

Visualization of one layer
In Figure 3 we depict the 6 communities in the THI layer, which is chosen by COBALT in the first iteration.The communities mark areas of a single color.There is one subgraph with nodes of the 'blue' and the 'darkgreen' communities, and one subgraph with both 'purple' and 'pink' nodes, but these two subgraphs are only small parts of the network.Hence, the visualization scheme captures properly community separation and clearly marks the areas with nodes from more than one community.Score distributions. Figure 4 depicts the distribution of the questionnaire scores for each of the 6 communities of this 1st iteration of COBALT.
The first column of Figure 4 shows the value distribution for sex, with one row per community.Subsequently, we show one column per questionnaire and, inside it, one boxplot per community.We see that all communities are rather homogeneous with respect to the THI questionnaire: the small boxes indicate small variance.The variance increases for the other questionnaires; this is expected, since COBALT considered only the THI data to build these communities.
There are remarkable differences among some of the communities.Community C1 (green boxes for the questionnaires) has the lowest average value in each questionnaire, while community C2 (fade-yellow boxes) has the highest average values, indicating C1 and C2 have considerably different characteristics.In the context of the data used, it seems that C1 accommodates the patients with the most mild tinnitus symptoms, while C2 accommodates the patients with the most severe tinnitus symptoms.As we have seen in the evolution curves of Figure 2, iteration 2 incurs a cost increase and a slight modularity drop.The visualization of the communities in the two layers makes this evident: there is a rather clear separation of colors in the THI network (cf.subfigure 5a), but the colors in the MDI network (cf.subfigure 5b) are mixed.Hence, the use of the inter-layer similarities and the similarities inside the MDI network did not contribute to a good modularity score.

Communities across layers
Figure 6 shows the 4 communities found in the 3rd iteration, where COBALT added the TQ layer.From this iteration on, community induction is driven by the node similarities inside the MDI layer, leading to more homogeneous communities in the MDI layer, while the community colors in the other layers are mixed.A further remarkable aspect in Figure 6 is the high density of the MDI network: the patients are very similar to each other inside this layer.This might have led to communities that are not clearly separated.This can also be seen in Figure 7, which depicts the 4 communities found when COBALT adds TBF12 as 4th layer.
As in the 3rd iteration, nodes are wellseparated into communities with respect to the MDI layer, but not with respect to the other layers.The number of communities is also the same as before: although this does not imply that the communities are exactly the same, it indicates adding the TBF12 layer does not add much information to the previous MLN.
Finally, Figure 8 shows the 4 communities found in the 5th iteration, when the TFI layer is added as last one.
As before, the communities are well separated in the MDI network, but not in the other networks.The modularity (0.591) is also very close to the modularity value of the 4th iteration (0.592).Summarizing, the visualization scheme shows a deterioration of community quality as layers are added.The evolution of layer cost and of its two factors capture this deterioration well; the SC1 stopping criterion would have stopped the MLN expansion after the 3rd iteration by the latest.
In our application scenario, communities serve as phenotypes, intended to contribute to prediction.For that reason, we next report on how the communities contributed to predictive quality.

Phenotypes for prediction
According to our evaluation design, we compare the prediction quality achieved when using COBALT communities to that achieved by the Baseline regressors and to the quality achieved when using clustering instead of COBALT.
The results are on Table 4, which we discuss in detail hereafter.For each of the 5 scores at t 1 , we depict the performance as MSE and MAE (where smaller values are better) and R 2 (where larger values are better).Values marked in boldface are the best achieved for predicting a questionnaire score.

Prediction with the phenotypes of each COBALT iteration
The upper part of Table 4 shows for each iteration of COBALT (first column) the quality of the phenotypes in the MLN network, measured as modularity Q (third column).For each predicted score, we mark in underline the best values achieved in a COBALT iteration.
For each score, the COBALT-augmented regressor outperforms the corresponding Baseline regressor, as can be seen by the underlined values for the scores.When focusing on the phenotypes of each iteration, we observe the following: • iteration 1: COBALT outperforms the Baseline for the THI score and for the MDI and TBF12 scores (the MAE values are identical).
For the TQ and TFI scores, the Baseline is better.the MDI layer itself.However, as can be seen in Figure 5, the communities are more oriented towards THI.• iteration 3: COBALT outperforms the Baseline for the TQ and TBF12 scores, and for the THI score with respect to MSE and MAE.With respect to the MDI score and the TFI score, the Baseline is superior.• iteration 4: COBALT outperforms the Baseline for the THI, MDI, TQ and TBF12 scores, i.e. all but the TFI score.• iteration 5: COBALT outperforms the Baseline for the THI score, but it is inferior to it for all the other scores.
Summarizing, the COBALT phenotypes on the THI and MDI layers suffice to predict 4 out of the 5 scores at t 1 with better MAE and MSE than the Baseline that exploits the scores of all 5 questionnaires at t 0 .The phenotypes of the THI layer alone suffice for predicting 3 out of 5 scores at t 1 .This indicates that the exploitation of phenotypes during prediction is advantageous, and the advantage is higher when adding the least-cost layer.
The influence of the MDI layer must be perceived as an artifact, since this layer improves predictive performance but not for the MDI score itself An explanation is in the density of this layer, which may have resulted in poor-quality communities inside this layer.

COBALT vs clustering
The lower part of Table 4 depicts the prediction quality achieved when using clustering algorithms for phenotype construction instead of COBALT on MLNs.We varied the number of clusters to optimize silhouette, and we report the clusters found for this optimal number.For example, for k-means, the best silhouette was for k = 2.
For each of the 5 scores there is at least one clustering algorithm that outperforms the Baseline (first row).As with COBALT, this indicates that exploiting phenotypes during prediction is of advantage.However, unlike COBALT, there is no clear winner among the clustering algorithms: for the THI score, all algorithms delivered models that were superior to the Baseline; for the TFI score, none did; for the other scores, some models were superior to the Baseline and others were inferior to it.
COBALT outperforms the Clustering approaches when predicting the TFI score (iteration 2) and the TBF12 score (iteration 3), i.e. before the corresponding layers are added.For the MDI score, iteration 4 delivers the best R 2 value, but is inferior to OPTICS with respect to MSE and MAE.For the THI score, K-Means returns the best results.However, there is no clear winner among the Clustering approaches with respect to phenotype contribution: for each of the 5 scores, another algorithm is best for one or more of the three measures, although, unlike COBALT, all Clustering algorithms are trained on all scores at t 0 .In contrast, the phenotypes returned by COBALT on the first two layers outperform the Baseline for all scores except the MDI.
COBALT and Clustering cannot be compared directly on community homogeneity: for clustering, we use silhouette instead of modularity.When juxtaposing the results of the Clustering algorithms, we see that BIRCH returns the phenotypes with the highest silhouette value, but these phenotypes contribute less to prediction than those output by other clustering algorithms.This agrees with our observation on modularity for COBALT: the phenotypes returned at iteration 1 have the highest modularity but the regressors exploiting them are of inferior quality.
Summarizing, the phenotypes returned by COBALT exhibit more consistent predictive performance than achieved by clusters built by individual clustering algorithms, albeit the latter exploit all scores at t 0 .This holds particularly for iteration 2, where a layer is chosen in a costsensitive way, before the stopping criterion SC1 is triggered.Phenotype quality, either the modularity in MLNs or the silhouettes in clustering, are poor indicators of predictive performance.For COBALT, phenotypes built in a cost-sensitivity way lead to competitive predictive performance.

COBALT vs cost-insensitive MLN phenotypes
In our earlier work [36], we built MLN-based phenotypes, optimizing on modularity rather than cost.Table 5, from [36], shows the layer selected at each iteration and the modularity scores achieved.When we compare with the column on Q of Table 4, we see that COBALT achieves higher modularity values.This can be explained by the differences in the data: albeit we use the same dataset and prediction tasks, in [36] we considered only patients that had available data in all layers, whereas COBALT uses all data, allowing for missing values in some layers.
A comparison on prediction quality is not possible because in [36] we predicted for each community separately, and only for the TQ score at t 1 , considering only communities that had enough data for learning and testing.Nonetheless, we can state that COBALT is superior to the predecessor approach of [36] by design, since it allows for missing values.

Impact of missingness
Figure 9 shows how modularity changes from the 1st to the last iteration as we increase missingness from 10% (leftmost, uppermost subfigure) to 90% (rightmost, lowermost subfigure) in steps of percentual points.The data with 0% missingness lead to the modularity values depicted by the dashed line in each of the subfigures.below it but sometimes above it.These findings suggest that subsets of the nodes in the MLN have intra/inter-layer edges and weights that lead to communities of comparable quality to that of 0% missingness.Table 6 depicts prediction quality as we increase missingness from 0% to 10% and then to 20%.Prediction was possible only for THI, MDI and TQ scores at t 1 when the missingness was 0% and 10%, and only for THI when the missingness was increased to 20%.For TFI and TBF12, there were not enough data for training and testing.
For each missingness value we chose the iteration with the highest modularity: this is reflected in the column 'NL' that denotes the Number of Layers (NL) used and may differ depending on the questionnaire score being predicted.The number of patients used for training is shown in the column marked as 'n', which again may differ for each questionnaire score.
In Table 4 we have seen that a decrease in modularity is not necessarily associated with a tendency in the values of the prediction quality measures.This agrees with the results in Table 6, where the sets of communities with the best modularity are chosen but the prediction quality measures vary in both directions.In general, the prediction quality is good, which indicates that a small increase in missingness does not lead to dramatic quality deterioration.However, the amount of training data is so small that a generalization is not possible.

Conclusion
In this work we have presented COBALT, a costbased model to find communities in a single-or multi-layer network structure.We define cost as the cost of acquiring features and test it in a dataset with questionnaire data from chronic tinnitus patients.We also compare our results with traditional clustering methods.
The major findings of this work can be summarized into four: 1. COBALT is able find partitions in the data that are superior to conventional clustering algorithms in terms of our evaluation criteria (performance on post-treatment data prediction) for three of the five questionnaire predictions; 2. COBALT outperforms our prior work with the same dataset (cf.[36]) by achieving higher modularity values.As a result, taking a costeffective strategy and allowing missing values in each layer proved advantageous; 3. The partitions with the best modularity do not lead necessarily to the best inputs for a posttreatment data predictor; 4. Missing values have no substantial impact on the modularity of the partitions.
In the context of clinical decision support, COBALT is intended for personalized diagnostics and treatment planing on the basis of patient phenotypes.COBALT demonstrates that it is possible to build predictive phenotypes without demanding a large number of questionnaires, as was the case in our earlier work [34].Our approach can be used in different ways.First, we have shown that phenotypes are predictive; hence, once a patient's phenotype has been assessed, treatment outcome can be predicted without demanding that the patient fills further questionnaires.Thus, the burden of the patient during the diagnostic procedure can be reduced.Next, the acquisition of information during screening can be focused towards informative features, thus reducing cost without compromising quality.Finally, phenotype-specific treatments can be designed.To this purpose, clinical research is needed to assess the robustness of the identified models and their predictiveness.Moreover, the cost-sensitive selection of questionnaires and the usage of phenotypes in the clinical

Figure 1
Figure1illustrates the structure of a MLN with communities.Thicker edges between nodes correspond to a high similarity between them.Nodes

Fig. 1 :
Fig. 1: Structure of a MLN with communities generated in each layer.

Figure 2 Fig. 2 :
Figure2shows the evolution of the modularity Q per iteration, as well as the layer cost and

Fig. 4 :
Fig. 4: The 6 communities of the THI network added by COBALT as 1st layer, depicting the value distributions inside each questionnaire.

Figure 5
Figure 5 depicts the 5 communities found in the 2nd iteration, where COBALT adds the MDI layer as the one with the lowermost cost.
(a) THI network (b) MDI network (c) TQ network

Table 2 :
Number of patient records per questionnaire at t 0 , t 0 and t 1 and range of each questionnaire -ordered on the number of questionnaires at t 0 descending

Table 3 :
Notation table 2: g bestsingle ← NULL 3: S bestsingle ← NULL; q bestsingle ← −∞ 4: for g ∈ G do 5: To expand L, the list of candidate graphs G candidates is created from G and Algorithm 2 Iterative Layer Selection: Pseudocode of the subsequent iterations of COBALT Input: G, g bestsingle , S bestsingle , S single Output: S best (best set of communities) 1: G candidates ← G \ {g bestsingle } 2: S ← S bestsingle 3: L ← g bestsingle 4: while G candidates = ∅ and SC is False do

Table 4 :
• iteration 2: COBALT outperforms the Baseline for the TQ, TFI and TBF12 scores.For the THI score, COBALT is better with respect to MSE and MAE.For the MDI score, COBALT is inferior than the Baseline.This is remarkable, since the layer included in the 2nd iteration is Prediction of questionnaire scores at t 1 given age, gender, score at t 0 (Baseline, first row of values) and, additionally the phenotype IDs returned by COBALT (upper part), respectively by the clustering algorithms (lower part); for convenience, we ordered the columns with the same order as layers were added by COBALT, but this ordering has no effect on the way the data are read by the predictor.

Table 6 :
Modularity per iteration as missingness ratio is increased from 10% to 90%.The curves indicate that the modularity is not greatly affected by missingness: it always drops 0.4, it always remains close to the reference (dashed) line of 0% missingness, mostly a bit Prediction quality for each percentage value of missingness, showing the iteration (equiv.number of layers NL) that achieves the best modularity.