Conformal link prediction for false discovery rate control

Most link prediction methods return estimates of the connection probability of missing edges in a graph. Such output can be used to rank the missing edges from most to least likely to be a true edge, but does not directly provide a classification into true and non-existent. In this work, we consider the problem of identifying a set of true edges with a control of the false discovery rate (FDR). We propose a novel method based on high-level ideas from the literature on conformal inference. The graph structure induces intricate dependence in the data, which we carefully take into account, as this makes the setup different from the usual setup in conformal inference, where data exchangeability is assumed. The FDR control is empirically demonstrated for both simulated and real data.


Problem and aim
Graphs (or networks) denote data objects that consist of links (edges) between entities (nodes).Real-world examples are ubiquitous and include social networks, computer networks, food webs, molecules, etc.A fundamental problem in network analysis is link prediction (Lü and Zhou, 2011), where the goal is to identify missing links in a partially observed graph.Biological networks such as protein-protein interaction networks (Kovács et al., 2019) or food webs (Terry and Lewis, 2020) are typical examples of incomplete networks: because experimental discovery of interactions is costly, many interactions remain unrecorded.Link prediction can be used to identify promising pairs of nodes for subsequent experimental evaluations.Other applications include friend or product recommendation (Li and Chen, 2013), or identification of relationships between terrorists (Clauset et al., 2008).
In this work, we consider a link prediction problem, where a graph with a set of vertices V = {1, . . ., n} and a set of edges E is only partially observed: namely, we observe a sample of node pairs recorded as interacting (true edges) and a sample of pairs recorded as noninteracting (false edges).The graph can be directed or undirected and self-loops are allowed.The two observed samples of node pairs make up only a part of the set of all pairs V × V , and the non-observed pairs correspond to missing information, where it is not known whether there is an edge or not.The aim is to identify the true edges among the pairs of nodes for which the interaction status has not been recorded.
There exists a variety of approaches for link prediction and they are mainly divided according to two viewpoints.In Ben-Hur and Noble (2005); Bleakley et al. (2007); Li and Chen (2013); Zhang and Chen (2018), link prediction is treated as a classification problem.That is, examples are constructed by associating the label 1 (or 0) with all true (or false) edges.Then, a classifier is learned by using either a data representation for each edge (Zhang and Chen, 2018), or kernels (Ben-Hur and Noble, 2005;Bleakley et al., 2007;Li and Chen, 2013).Another line of research views link prediction rather as an estimation issue, namely as the problem of estimating the true matrix of the probabilities of connection between node pairs.In this line, Tabouy et al. (2020) consider maximum likelihood (ML) estimation for the Stochastic Block Model (SBM) with missing links and propose a variational approach for a variety of missing data patterns.Gaucher et al. (2021) study a low-rank model and a technique based on matrix completion tools, which is also robust to outliers.Mukherjee and Chakrabarti (2019) give an algorithm for graphon estimation in a missing data set-up.Li et al. (2023) study a special case of missing data setup called egocentric or star sampling, where observations are generated by sampling a subset of nodes randomly and then recording all of their connections, and Zhao et al. (2017) consider the case where the recording of true/false edges can be erroneous.Minimax results are derived in Gao et al. (2016) for a least squares estimator in the SBM under a uniform and known missing data pattern and in Gaucher and Klopp (2021) for the ML estimator under a more general setting.
Concretely, the output of all of these methods are scores for all missing edges, ranking them from most likely to least likely to interact.Such an output is satisfying when the application constrains the number of pairs of vertices to be declared as true edges to be fixed, as e.g. in e-recommendation, where we could have to recommend the top 3-best products most likely to interest the consumer.Alternatively, other practical cases may instead require a classification of the missing edges into true and false edges together with a control of the amount of edges that are wrongly declared as true (false positives).Putting the emphasis on false positives is appropriate in many contexts.For instance, in the reconstruction of biological networks, the edges that are classified as true are then tested experimentally in a costly process, which makes it desirable for the user to avoid false positives in the selection step.This is increasingly true for real-world networks that are in general very sparse.The decision of declaring a missing pair as a false edge can be viewed as a type of abstention option: based on the data, we do not have enough evidence to confidently predict it as a true edge.
How to build a reliable classification procedure?Using an ad hoc rule like declaring as true edges the node pairs with a connection probability above the 50% threshold, may lead to a high number of false positives since a) probabilities may not be estimated correctly and b) even if they were, the probability of making a mistake may still be high if there are many node pairs with moderately elevated connection probability.
In this work, we consider the goal of identifying a subset of the missing pairs of nodes for which we can confidently predict the presence of an edge, with a guarantee on the number of edges that are falsely predicted as true.Our problem can be viewed as finding the appropriate threshold (not 50%) for the connection probabilities such that the number of false positives remains below a prescribed level.The optimal threshold depends on the problem itself.In simple settings a low threshold may be satisfactory, as for instance when most connected triplets are indeed triangles.However, on a graph with much stochasticity, the exact prediction of links is a very uncertain endeavor.
The problem is formalized in terms of controlling the false discovery rate (FDR), defined as the average proportion of errors among the pairs of vertices declared to be true edges (proportion of false discoveries).More precisely, the goal is to develop a procedure such that the FDR is below a user-specified level α, which is an error margin that represents the acceptance level for the proportion of false edges in the selection.The interpretation for the user is clear: if, for instance, α is set to 5% and the method returns a set of 100 node pairs, then the number of non-existent edges in this set is expected to be at most 5.

Approach
We propose a method that takes as input the partially observed graph and, using an off-theshelf link prediction method, returns a set of node pairs with an FDR control at level α.The method can be seen as a general wrapper that transforms any link prediction technique into an FDR-controlling procedure.Crucially, even when the quality of the link predictor is not particularly good, our method provides control of the FDR.
Our approach relies on conformal inference (Vovk et al., 2005;Balasubramanian et al., 2014), a statistical framework that provides generic tools to rigourously estimate uncertainty for off-the-shelf ML algorithms in various tasks.In particular, conformal prediction (Angelopoulos and Bates, 2021;Lei and Wasserman, 2014;Romano et al., 2019Romano et al., , 2020;;Tibshirani et al., 2019) enables to build model-free confidence intervals for the output of any ML algorithm, even "black-box", valid in finite samples without any assumptions on the data distribution besides exchangeability of the observations.Another important application of conformal inference is nonparametric hypothesis testing, via the so-called conformal p-value (Vovk et al., 2005;Balasubramanian et al., 2014;Bates et al., 2023).Considering a standard testing problem H 0 : X ∼ P 0 for some multivariate observation X and reference distribution P 0 , conformal p-values measure statistical significance by comparing the test statistic, or score, to a reference set, consisting of values of the score function on observations drawn from P 0 .Crucially, computing a conformal p-value only requires to have at hand a sample from P 0 , rather than knowing the distribution explicitly.Under the exchangeability of the reference scores with the test score when the null hypothesis is true, conformal p-values enable to build valid tests in various contexts, such as FDR control in novelty detection (Bates et al., 2023;Yang et al., 2021;Marandon et al., 2022a;Liang et al., 2022), binary classification (Rava et al., 2021;Jin and Candès, 2022), or two-sample testing (Hu and Lei, 2023).
We propose to use this high-level idea of comparing a score to a set of reference scores representing the null hypothesis being tested, in order to properly threshold the link prediction probabilities for FDR control.In our link prediction set-up, the connection probability for a pair of nodes (i, j) can be seen as a score indicating the relevance of an edge between i and j.The afore-mentioned score comparison then turns into a comparison of the connection probability for a non-observed pair of nodes to connection probabilities of pairs that are known to be non-existent edges.However, the setup is markedly different from previous literature, making this transposition challenging.In particular, the graph structure makes the scores dependent on each other in an intricate way, which requires to build the scores with care.
Contributions.The contributions of this work are summarized as follows: • We introduce a novel method to control the FDR in link prediction (Section 3), which extends ideas from the conformal inference literature to graph-structured data.The proposed method acts as a wrapper that transforms any off-the-shelf link prediction (LP) technique into an FDR-controlling procedure for link prediction.It is designed to provide FDR control regardless of the difficulty of the setting and of the quality of the chosen LP technique.Moreover, the ability to use any LP technique of choice, including the state-of-the-art, makes it flexible and powerful.
• Extensive numerical experiments1 (Section 4) assess the excellent performance of the approach and demonstrate its usefulness compared to the state of the art.

Relation to previous work
Error rate control in statistical learning.Error rate control has notably been considered in novelty detection (Bates et al., 2023;Yang et al., 2021;Marandon et al., 2022a;Liang et al., 2022), binary classification (Geifman and El-Yaniv, 2017;Angelopoulos et al., 2021;Rava et al., 2021;Jin and Candès, 2022), clustering (Marandon et al., 2022b) and graph inference (Rebafka et al., 2022).The setting closest to ours is that of binary classification, in the sense that here the goal is to classify non-observed pairs of nodes as a 'true' or 'false' edge, given that we observe part of both true edges and non-existent edges.In this line, some approaches (e.g., Zhang and Chen, 2018)  However, these approaches cannot be applied in our situation because here data examples are based on the graph structure and thus depend on each other in a complex way.In particular, we do not have i.i.d.data examples as assumed in the classical binary classification setting.In this regard, our method is related to the work of Marandon et al. (2022a) that extends the conformal novelty detection method of Bates et al. (2023) to the case where the learner is not previously trained, but uses the test sample to output scores which makes the scores dependent.This is similar to our problem in the sense that here we aim to calibrate connection probabilities that depend on each other through the graph structure.
Conformal inference applied to graph data.A few recent works (Huang et al., 2023;Lunde et al., 2023) have considered the application of conformal prediction to graph data.However, conformal prediction is concerned with constructing prediction sets, rather than error rate control as considered here.Moreover, in these works, the prediction task concerns the nodes: Huang et al. (2023) considers node classification, while Lunde et al. (2023) studies prediction of node covariates (also called network-assisted regression).By contrast, in our work, the prediction task concerns the edges, and therefore, the specific dependency issue that arises differs from Huang et al. (2023); Lunde et al. (2023).Finally, Luo et al. (2021) employed conformal p-values to detect anomalous edges in a graph.However, their method relies on edge-exchangeability, which is a restrictive assumption, and the guarantee is only for a single edge.
Conformal inference for missing data problems.Link prediction can be seen as a type of missing data problem where the goal is to give a prediction for missing values, rather than with missing values.The latter setting was addressed within conformal prediction e.g. in Zaffran et al. (2023) for a regression task with missing values in the covariates and in Shao and Zhang (2023) for a particular task called matrix prediction.A concurrent work by Gui et al. (2023) investigated conformal prediction for the matrix completion task, in which the aim is to fill in a matrix that has missing values.In particular, the method proposed therein is shown to provide coverage in a general missing data setup where the entries of the sampling matrix are either i.i.d.Bernoullis variables or known independent Bernoullis variables.Gui et al. (2023) also cover the case where the sampling is unknown and non-uniform by providing error bounds for a weighted version of their method, based on tools from conformal prediction under covariate shift (Tibshirani et al., 2019).However, as outlined previously, conformal prediction (1, 4), (2, 3), (2, 5), (3, 5) are observed, along with the non-existent edges (1, 3),(1, 5), (4, 5) but the information concerning the pairs (2, 4) and (3, 4) is missing.We aim to decide for (2, 4) and (3, 4) whether there is a true edge or not. is a different aim from the one considered here, which is FDR control.
Link with multiple testing.The FDR criterion is a staple of multiple testing, where recent works on knockoffs and conformal p-values (Barber and Candès, 2015;Weinstein et al., 2017;Bates et al., 2023;Yang et al., 2021;Marandon et al., 2022a) have provided model-free procedures that come with an FDR control guarantee in finite samples.However in this work, while we do use tools of Bates et al. (2023), our setting does not strictly conform to a known multiple testing framework such as the p-value framework (Benjamini and Hochberg, 1995) (the hypotheses being random) or the empirical Bayes framework (Efron et al., 2001;Sun and Cai, 2007) (the number of hypotheses being itself random).Hence, previous theory in that area cannot be applied.

Problem setup
Let A * = (A * i,j ) 1≤i,j≤n be the adjacency matrix of the true complete graph G, X ∈ R n×d a matrix of node covariates (if available), and Ω = (Ω i,j ) 1≤i,j≤n the sampling matrix such that Ω i,j = 1 if the interaction status (true/false) of (i, j) is observed, and 0 otherwise.We denote by A the observed adjacency matrix with A i,j = Ω i,j A * i,j .Thus, A i,j = 1 indicates that there is an observed true edge between i and j, whereas A i,j = 0 indicates either the observed lack of an edge or an unreported edge.The sampling matrix Ω is assumed to be observed, so that it is known which zero-entries A i,j = 0 correspond to observed false edges and which ones correspond to missing information.The general setting is illustrated in Figure 1.We denote by P the joint distribution of Z * = (A * , X, Ω), Z the observation (A, X, Ω) and Z the observation space.
We next lay out our specific modelling assumptions regarding P .In this work, we make assumptions on the generation of the sampling matrix only.We start by reviewing the main settings encountered in the literature in this regard.As outlined in Tabouy et al. (2020), most works on link prediction do not explicitly model the sampling matrix Ω.To fill this gap, Tabouy et al. (2020) provided a formal taxonomy of missing data patterns in networks by adapting the theory developed in Little and Rubin (2019), dividing them into three cases: • Missing completely at random (MCAR): Ω is independent from A * .
• Missing at random (MAR): Ω is independent from the value of A * on the unobserved part of the network.
• Missing not at random (MNAR): when the setting is neither MCAR nor MAR.Tabouy et al. (2020) give several examples for each case.For instance, the MCAR assumption trivially includes the setting of random-dyad sampling where entries of A * are missing uniformly at random, that is considered in Gao et al. (2016) and is also well studied in the matrix completion literature (Candes and Recht, 2009;Candes and Plan, 2010;Chatterjee, 2015).In the specific context of link prediction, another typical MCAR sampling pattern is star (or egocentric) sampling (Li et al., 2023), that consists in randomly selecting a subset of the nodes and observing the corresponding row of A * .A particular type of MNAR sampling is studied in Gaucher et al. (2021) where the sampling only depends on latent structure (e.g.node communities).
In this work, we consider a type of MNAR sampling called double standard sampling (Tabouy et al., 2020;Sportisse et al., 2020), in which the entries of Ω are independently generated as: ), 1 ≤ i, j ≤ n for some unknown sampling rates w 0 , w 1 .This amounts to consider that true edges together with false edges are missing uniformly at random at a certain status-specific rate, and can be seen as a straightforward generalization of random-dyad sampling that is more relevant for practical applications.Indeed, detecting interactions is typically more of interest rather than the detecting of non-interactions, hence one can expect that in general true edges have more chance of being reported than false edges.
We are interested in classifying the unobserved node pairs {(i, j) : Ω i,j = 0} into true edges and false edges, or in other words, selecting a set of unobserved node pairs to be declared as true edges, based on the observed graph structure.In order to be consistent with the notation of the conformal inference literature, we use the following notations: • We denote by D test (Z) = {(i, j) : Ω i,j = 0} the set of non-sampled (or missing) node pairs and by D(Z) = {(i, j) : Ω i,j = 1} the set of sampled pairs, with D 0 = {(i, j) ∈ D : A * i,j = 0} the set of observed non-existent edges and D 1 = {(i, j) ∈ D : A * i,j = 1} the set of observed true edges.We refer to D test (Z) as the test set.
• We denote by H 0 = {(i, j) : Ω i,j = 0, A * i,j = 0} the (unobserved) set of false edges in the test set and H 1 = {(i, j) : Ω i,j = 0, A * i,j = 1} the (unobserved) set of true edges in the test set.
The notations are illustrated in Figure 2.
In our framework, a selection procedure is a (measurable) function R = R(Z) that returns a subset of D test corresponding to the indices (i, j) where an edge is declared.The aim is to design a procedure R close to H 1 , or equivalently, with R ∩ H 0 (false discoveries) as small as possible.For any such procedure R, the false discovery rate (FDR) of R is defined as the average of the false discovery proportion (FDP) of R under the model parameter P ∈ P, that is, Similarly, the true discovery rate (TDR) is defined as the average of the true discovery proportion (TDP), that is, Our aim is to build a procedure R that controls the FDR while having a TDR (measuring the power of the procedure) as large as possible.
Remark 1.In some applications, missingness can also be present in the covariate matrix X.
While we do not explicitly consider this setup here, our method can readily be applied and control the FDR over the detected edges with missingness in X as long as the distributional assumption on Ω|A * , X is still valid.

Review of conformal p-values for out-of-distribution testing
We first provide a general overview of conformal p-values in the classical setting of out-ofdistribution testing, in which the problem is to test the null hypothesis that a data point Z n+1 is drawn from the same (unknown) distribution P 0 as a i.i.d.data sample Z 1 , . . ., Z n .A conformal p-value (Haroush et al., 2022;Bates et al., 2023) is a non-parametric approach that relies on reducing each multivariate observation Z i to a univariate non-conformity score (or score for short) S i = g(Z i ) ∈ R, that measures the conformity to the data sample (Z j ) 1≤j≤n , to evaluate the statistical evidence of being an outlier: In 1. Sample D cal of size ℓ uniformly without replacement from D 0 ; 2. Learn g on "masked" observation Z train = (A, X, Ω train ), with Ω train given by 3. For each (i, j) ∈ D cal ∪ D test , compute the score S i,j = g(Z train ) i,j 4. For each (i, j) ∈ D test , compute the conformal p-value p (i,j) as given in equation ( 1) 5. Apply BH using (p (i,j) ) (i,j)∈Dtest as input p-values, providing a rejection set R(Z) ⊂ D test .

Conformal link prediction
Let g : Z → R n×n be a scoring function, that takes as input an observation z ∈ Z, which is a tuple consisting of an adjacency matrix, a covariate matrix, and a sampling matrix as described in Section 2, and returns a score matrix (S i,j ) 1≤i,j≤n ∈ R n×n , with S i,j estimating how likely it is that i is connected to j.A scoring function amounts to a link prediction algorithm (also returning in general an indicator of the relevance of an edge between i and j for any pair of nodes (i, j)) and in the sequel we use the two terms exchangeably.The score does not have to be in [0, 1]: for instance, S i,j can be the number of common neighbors between i and j.
To obtain a set of edges with FDR below α, we borrow from the literature on conformal p-values (Bates et al., 2023;Marandon et al., 2022b) to formulate the following idea: some of the observed false edges can be used as a reference set, by comparing the score for a node pair in the test set to scores computed on false edges to determine if it is likely to be a false positive.Effectively, we will declare as edges the pairs that have a test score higher than a cut-off t computed from the calibration set and depending on the level α.In detail, the steps are as follows: 1. Build a reference set D cal of false edges by sampling uniformly without replacement from the set of observed false edges D 0 2. Run an off-the-shelf LP algorithm g on the "masked" observation Z train = (A, X, Ω train ), where (Ω train ) i,j = 0 if (i, j) ∈ D cal and Ω i,j otherwise.
3. Compute the scores for the reference set and for the test set, S i,j = g(Z train ) i,j for (i, j) ∈ D cal ∪ D test ; 4. Compute the conformal p-values (p (i,j) ) (i,j)∈Dtest given by (1) 5. Declare as true edges the node pairs in the test set that are in the rejection set returned by the BH procedure applied to the p-values (p (i,j) ) (i,j)∈Dtest .
The procedure is summarized in Algorithm 1.We now review each step one by one in order to provide the intuition behind the proposed procedure.
Step 1 builds a reference set of false edges that are sampled in such a way as to imitate the missingness of false edges in the test set, i.e. the sampling of H 0 .
Step 2 runs an off-the-shelf LP algorithm g on Z train (instead of Z) so as to treat the edge examples (i, j) in the reference set D cal as unreported information when learning the score, which is necessary to avoid biasing the comparison of the test scores to the reference scores.Otherwise, the scoring g may use the knowledge that the node pairs in the reference set are false edges and produce an overfitted score for those.Together with Step 1, this is crucial to fabricate good reference scores that are representative of the scores of false edges in the test set.Finally, the remaining steps correspond exactly to the conformal procedure of Bates et al. (2023) for out-of-distribution testing and we refer the reader to Section 3.1 and to Bates et al. (2023) for more details.
Remark 2. This type of procedure is designed to control the FDR at level |H 0 | | Dtest | α < α.To maximize power, we recommend to apply the procedure at level α/π 0 where π0 ∈]0, 1[ is an estimate of |H 0 | | Dtest | .In particular, tools from the multiple testing literature on the estimation of the proportion of null hypotheses may be employed, e.g. by using Storey's estimator (Marandon et al., 2022a).
Remark 3.Many LP algorithms (e.g.Zhang and Chen, 2018) are not trained on the entire set of observed edges D but on a subset D train ⊂ D with a 50-50% distribution of true and false edges.As most real-world networks are sparse, typically all observed edges D 1 are used for training and a randomly chosen subset of false edges in D 0 of the same size as D 1 .Then the reference set D cal is naturally chosen among the false edges in D 0 that are not used in D train for learning the predictor.Consequently, in practice choosing a reference set D cal does not diminish the amount of data on which the predictor is learned.
Remark 4. The sample size of the reference set D cal must be large enough to ensure a good power, as pointed out in previous work using conformal p-values in the novelty detection context (Mary and Roquain, 2022;Marandon et al., 2022a;Yang et al., 2021).In particular, Mary and Roquain (2022) give a power result under the condition that where k is the number of "detectable" novelties.Consequently, our recommendation is to choose | D cal | of the order of | D test |/α; this choice works reasonably well in our numerical experiments.

Link with conformal out-of-distribution testing
In this section, we explicit the output of typical link prediction methods, to write each score S i,j as a (learned) function of an edge embedding representing the information that is observed about the pair (i, j).This allows to reframe our approach as using the procedure of Bates et al. (2023) (see Section 3.1) on data objects that have a non-exchangeable dependence structure.
For simplicity of presentation, in this section we consider an undirected graph without node covariates.Link prediction methods output a matrix of predictions/scores (g(Z) i,j ) 1≤i,j≤n , where the prediction for i and j can be written in general as (2) for a real-valued measurable function h and a given fixed K ∈ {1, . . ., n} and where A k i,• is defined as the i-th row of A k .In the (3), the r.v.W i,j can be thought of as the "K-hop  neighborhood" of (i, j).It represents an embedding for the node pair (i, j), that describes a pattern of connection around i and j.If the graph has some structure, it should be observed that the pattern differs when i and j are connected compared to when they are not.Moreover, there should be some similarity between the patterns observed for true edges, as compared to false edges.Figure 3 gives an illustration in the case of a graph with community structure.When there is an edge between i and j (Figure 3a), i and j are involved in a same group of nodes that is densely connected (community).Conversely, when there is no edge between them (Figure 3b), i and j belong to separate groups of densely connected nodes that share few links between them.For instance, in the case of the common neighbors (CN) heuristic (Lü and Zhou, 2011), we have that g(Z) i,j = A T i,• A j,• .Alternatively, when considering supervised approaches such as binary classification-based (Zhang and Chen, 2018;Bleakley et al., 2007), maximum likelihood-based (Kipf and Welling, 2016;Tabouy et al., 2020) or matrix completion-based (Li et al., 2023;Gaucher et al., 2021), the link prediction function can be written as the minimizer of an empirical risk (ERM): g(Z) i,j = ĥ(W i,j ), with ĥ ∈ with L : [0, 1] × {0, 1} → R a loss function and F a function class.In (4), h(W i,j ) is an estimate of the probability that there is an edge between i and j, and the error term L[h(W i,j ), A i,j ] quantifies the difference between the prediction h(W i,j ) and the true A i,j .The ERM formulation for the afore-mentioned supervised approaches can be justified as follows: • Binary classification approaches (Zhang and Chen, 2018;Bleakley et al., 2007): In that case the ERM formulation ( 4) is obvious.For instance, for SEAL (Zhang and Chen, 2018), g(Z) i,j is given by a GNN that takes as input the K-hop subgraph around (i, j), excluding the edge between (i, j) if there is one observed, and augmented with node features that describe the distance of each node in the subgraph to i and to j.
The parameters of the GNN are fitted by minimizing the cross-entropy loss over a set D train ⊂ D of observed true/false edges.In practice, D train is subsampled from D in order to have a 50% − 50% partitioning between true and false edges.
• Maximum likelihood approaches (Kipf and Welling, 2016;Tabouy et al., 2020): Maximum likelihood approaches aim to optimize a lower bound on the likelihood (ELBO).This lower bound is an expectation and therefore, using Monte-Carlo approximation, we end up with a function of the form (4).For instance, for VGAE (Kipf and Welling, 2016), g(Z) i,j is given by the scalar product H T i H j where H u is a node embedding for node u, the embedding matrix H ∈ R n×n being the output of a GNN.It follows that ) for some function ϕ, with L the number of layers of the GNN.
• Matrix completion (Li et al., 2023;Gaucher et al., 2021): e.g., for Li et al. (2023), one can rewrite the estimated probability matrix P as where A in is the sub-matrix of A consisting only of the observed entries.Hence, in that case, g(Z) i,j is of the form (4) with h(W i,j ) = ϕ(A i,• , A j,• ) for some function ϕ.
In this view, our procedure proceeds by learning a binary classification rule for a set of examples indexed by D train = D\ D cal , each learning example consisting of an edge embedding and its "label" (true/false), and then using a reference set of false edges examples indexed by D cal to properly calibrate the edge probabilities for D test .As such, it boils down to the approach proposed by Bates et al. (2023) for FDR control except for a supervised context and with non-exchangeable data objects.Sampling D cal in a manner that mimics the missingness for false edges and learning on the edge examples indexed by D\ D cal allows to control the FDR despite this dependence.Figure 4 provides a high-level sketch.

Numerical experiments
In this section, we study the performance of our method (Algorithm 1) both on simulated data (Section 4.1) and on a real dataset (Section 4.2).We consider two choices for the scoring function g: the GNN-based method of (Zhang and Chen, 2018) called SEAL and the Common Neighbors (CN) heuristic, yielding the procedures CN-conf and SEAL-conf.SEAL is used with a hop number of 2, for the GNN we use GIN (Xu et al., 2018) with 3 layers and 32 neurons, and we train for 10 epochs with a learning rate of 0, 001.In addition, following Remark 2, we consider a version of our method where Algorithm 1 is applied at level α/π 0 , with π0 a suitable estimator of |H 0 |/| D test |.Specifically, following Marandon et al. (2022a), we use Storey's estimator (Storey et al., 2004) given by π0 = (1 − λ) −1 (1 + (i,j)∈Dtest 1{p i,j ≥ λ}) with λ = 1/2, which gives the procedures CN-conf-storey and SEAL-conf-storey.Finally, we compare the performance of our methods to a "naive" procedure for FDR control (called fixed threshold procedure hereafter) in which we select in R(Z) the edges (i, j) ∈ D test for which g(Z) i,j ≥ 1 − α (here we assume that g ∈ [0, 1], otherwise, scores are normalized into [0, 1] by standardizing the values and applying the sigmoid function).Combined with either SEAL or CN, this yields the procedures CN-fixed and SEAL-fixed.If the probabilities g(Z) i,j are poorly estimated, these fixed threshold procedures are expected to not control the FDR at level α in general.

Simulated data
In this section, we evaluate our method by generating the true graph A * from two popular random graph models: the Stochastic Block Model and the graphon model (Matias and Robin,

Stochastic Block Model
To start with, A * is generated from a Stochastic Block Model: In this model, the graph A * displays a variety of connection patterns including community structure and hubs, see Figure 5 for an illustration.
The overall difficulty of the problem resides within two distinct statistical tasks: learning the edge link probabilities (i.e. the score), and the multiple testing issue of controlling the false discovery rate with a given test statistic.The main quantities that govern the difficulty of the learning problem are the density of the graph ρ = n −2 E( i,j A * i,j ) = q,l γ q,l and the number of nodes n (Gaucher and Klopp, 2021).For the FDR control, the difficulty mainly depends on the signal-to-noise ratio SNR = p/ϵ as well as the proportion of true edges to false edges within the test set ρ) and the number of observed false edges w 0 (1 − ρ).Indeed, the higher the SNR, the more true edges display different connection patterns as compared to false edges, whereas increasing the proportion of true edges to false edges within the test set and the calibration sample size improves the estimation of the false discovery rate in Algorithm 1, see Remark 4.
Hence, in order to study the performance of the methods in various conditions, we vary the number of nodes n, the SNR p/ϵ, and the connectivity parameter p (controlling the density ρ for a fixed SNR and a fixed number of nodes n).In each setting, we construct samples D(Z) and test samples D In this SBM, for any parameter values, the connection probabilities cannot be consistently well estimated by a CN heuristic, as some classes have a low probability of connection within their class while being well connected with other classes: for nodes that belong to these groups, it occurs that they share neighbors despite not being connected.Hence, in the case that CN is used, the fixed procedure fails to control the FDR in Figure 6.By contrast, our conformal procedures displays a FDR that is below or close to α across all level values, for all choices of scoring function and for all model configurations.
However, in some model configurations (when p = 0.2 or SNR = 5) the power is near zero for any scoring function across all values of α and n.Moreover, even in the more favorable setting where p = 0.5 and SNR = 10, the power decreases as n increases if α is low enough (α ≤ 0.2).The issue is that the FDR control problem is intrinsically difficult for a model such as the SBM.To illustrate, consider an oracle setting where the true classes (C i ) 1≤i≤n and model parameters are known.In that case, a natural strategy is to declare edges using for each test pair (i, j) the probability P(A * i,j = 1|(C i ) 1≤i≤n ), and if p > ϵ then the pairs that are most likely to be connected are pairs that are in the same class.However, within a class all pairs have the same likelihood of being connected P θ (A * i,j = 1|(C i ) 1≤i≤n ) and so if α < p, it is impossible to control the FDR at α. On principle, one can get a finer test statistic than ) by using the observation A, but if the sampling is homogeneous for true and false edges respectively, as is assumed here, A does not give much more information about the location of true and false edges within a class.Thus, when n increases, the learned scores concentrate around the oracle probabilities P(A * i,j = 1|(C i ) 1≤i≤n ), entailing that if α is too low we get near-zero power.When n is small, the probability estimates are more noisy and the margin of error α may be fully utilized by the conformal procedures, which could explain the better power observed here -however the variance of the FDP naturally increases as n decreases.Nonetheless, across all settings, the most powerful procedure among the ones that control the FDR is a conformal method.

Graphon
Next, A * is generated using a graphon model: We construct samples D(Z) and test samples D test (Z) and choose | D cal | as in the previous section.The FDR and TDR of the methods are displayed in Figure 7 for σ = 0.01, a = 2 and for σ = 0.1, a = 0.5, with n ∈ {50, 100, 200}.In this model, compared to Section 4.1.1,the CN heuristic is not necessarily a poor predictor: by transitivity, the graphon function implies that the more neighbors two nodes have in common, the more likely it is that they are connected.
In the case that σ = 0.01 and a = 2, all procedures control the FDR for all n and α values.However, our conformal procedures are the most powerful by a margin.In the more difficult setup where σ = 0.1 and a = 0.5, the FDR of the fixed procedure is inflated: across all values of n in the case of CN and for small n (n = 50) in the case of SEAL.By contrast, our conformal procedures control the FDR across all values of n in this setup also, with equal or higher power than the fixed procedure when its FDR is below the nominal level.Finally, for all procedures the power improves when n increases, however the gain is more substantial for the conformal methods.

Real data
We evaluate our method on a real dataset: a food web network (Christian and Luczkovich, 1999) downloaded from the Web of Life Repository (https://www.web-of-life.es/).A food web is a network of feeding interactions in an ecological community, in which nodes are species and two species are connected if one eats the other.Food web analysis delivers important insights into the workings of an ecosystem.However, documenting species interaction is in practice labour-intensive and the resources for such investigations finite; hence, food webs are typically incomplete.
In this dataset, there are in total 48 species and 221 interactions recorded between them.We consider a directed and bipartite representation where a set of predators connect to a set of preys: if a species plays both roles (i.e.eats certain species and also gets eaten by others), then we associate with it two separate nodes, one in the predator set and one in the prey set.We thus obtain a network of 81 nodes and 221 edges (see Figure 8).In the corresponding adjacency matrix, there is by construction a zero entry for any node pair (i, j) where i is a prey or both i and j are predators: hence, we remove these node pairs from the set of negative edges {(i, j) : A * i,j = 0} in the sequel.We construct samples D(Z) and test samples D When the scoring function uses the CN heuristic, the fixed threshold procedure fails to control the FDR.Indeed, the bipartite structure entails poor estimation of edge probabilities with such a link prediction heuristic, as prey nodes and predator nodes have, respectively, many neighbors in common but share no connections with each other.Conversely, SEAL is not only learning-based but is also expressive enough that it can learn various connection patterns.Hence, we observe here that when using SEAL, the fixed procedure controls the FDR.By contrast, our conformal procedures control the FDR regardless of the link prediction algorithm used for the scoring function and so even in case of a poor link prediction model.Moreover, the conformal procedures uniformly improve upon the fixed procedure: when the fixed threshold procedure has an FDR under the nominal level, our conformal procedures have a much higher power.

Discussion
We have proposed a novel method that calibrates the output of any link prediction technique for FDR control, using recent ideas from the conformal inference literature.In a nutshell, our method acts as a wrapper that allows to transform an off-the-shelf link prediction technique into a procedure that controls the proportion of falsely detected edges at a user-defined level.Importantly, our proposed method is model-free: no assumptions are made regarding the distribution of the complete graph and the control is provided regardless of the quality of the link prediction model.
In this work, it is assumed that the sampling matrix is generated by sampling at random true and false edges at an unknown rate respectively.However this type of sampling does not include certain practical sampling designs such as egocentric sampling (Li et al., 2023).On the other hand, the key of the method is to sample the calibration node pairs in a manner that mimics the missingness of the false edges in the test set: therefore, if the sampling mechanism is complicated but its parameters are known, the control can still provided as long as the sampling of the calibration sample is adapted.
An important topic in conformal inference is the question of conditional guarantees, that is, achieving error guarantees that hold conditionally on somes component of the data.For instance, in classical conformal prediction, a large body of work (Foygel Barber et al., 2021;Romano et al., 2020Romano et al., , 2019) ) studies the aim of obtaining valid prediction sets conditional on the features of a new test point, while a few others (Löfström et al., 2015;Sadinle et al., 2019) are rather interested in conditioning on the label.In the outlier detection context, Bates et al. (2023) considers guarantees conditional on the calibration data (note that labels are also fixed in that work).In our case, let us note that the control can be considered as conditional on the complete graph A * : indeed, the calibration node pairs are actually sampled in a manner that mimics the missingness of the false edges conditionally on A * .Hence, the randomness of the problem resides in the sampling matrix.However, defining a relevant variable to further condition on is not clear here, since the observed data has a complex dependence structure.
Finally, some related scenarios of interest are settings where some true/false edges are erroneously observed or where the sampling matrix is not observed (Zhao et al., 2017).In-vestigating the extension of the method for these cases represents an interesting avenue of research for future work.

Figure 2 :
Figure 2: Illustration of the notations introduced in Section 2. The test edges D test (left panel) are divided into two subsets (unobserved): true edges H 1 , and false edges H 0 (right panel).

Figure 3 :
Figure3: Example of K-hop (K=2) neighborhood of (i, j) (in color), for when (a) i and j are connected and (b) i and j are not connected.

Figure 5 :
Figure 5: Illustration of the SBM model considered in Section 4.1.Nodes represent classes, edges indicate connection patterns between classes.

Figure 8 :
Figure 8: Food web network of Christian and Luczkovich (1999) with a bipartite representation: prey nodes are colored in blue and predator nodes are colored in yellow.
test (Z) by considering the dataset at hand as the complete network A * and generating the missingness ourselves, using double standard sampling with w 1 = 10% and w 0 chosen such that |H 0 |/|H 1 | = 50%, and use | D cal | = max(|D 0 | − |D 1 |, 5000) for the conformal methods.The performance of each method is then evaluated by computing the FDP and TDP with respect to the ground truth and the FDR and TDR are computed by using 100 Monte-Carlo replications.The FDR and TDR of the methods are displayed in Figure 9 for varying α.

Figure 9 :
Figure 9: FDR (left panel) and TDR (right panel) as a function of the nominal level α, for the food web dataset described in Section 4.2.The bands indicate the standard deviation.
words, the conformal p-value amounts to the rank of the test score S n+1 among the scores of the data sample.If (Z j ) 1≤j≤n+1 is i.i.d.when Z n+1 ∼ P 0 and the scoring function g is fixed, then p is uniformly distributed under the null and hence yields a valid test.In practice, to use a learned score function, one splits the data sample (Z j ) 1≤j≤n into two subsets, a training sample D train = {Z 1 , ..., Z k } and a calibration sample D cal = {Z k+1 , ..., Z n }.The training sample D train is used to learn the score (e.g., using one-class classifiers), whereas the calibration sample D Conformal link prediction Input: Observation Z = (A, X, Ω): (Observed) adjacency matrix A, node covariate matrix X, sampling matrix Ω; LP algorithm g; sample size ℓ of the reference set Define cal is used to compute the p-value (hence n is replaced by n − k in the above equation).More generally, the core idea is that the validity holds as soon as we have exchangeability of the scores S i under the null.Moreover, in the multiple testing Algorithm 1