Exchange-based diffusion in Hb-Graphs

Highlighting important information in a network is commonly achieved by using random walks related to diffusion over such structures. Complex networks, where entities can have multiple relationships, call for a modeling based on hypergraphs. However, the limitation of hypergraphs to binary entities in co-occurrences has led us to introduce a new mathematical structure called hyperbag-graphs, which relies on multisets. This is not only a shift in designation but a real change of mathematical structure, with a new underlying algebra. Diffusion processes commonly start with a stroke at one vertex and diffuse over the network. In the original conference article (Ouvrard et al. 2018) that this article extends, we proposed a two-phase step exchange-based diffusion scheme, in the continuum of spectral network analysis approaches, that takes into account the multiplicities of entities. This diffusion scheme highlights information not only at the level of the vertices but also at the regrouping level. In this paper, we present new contributions: the proofs of conservation and convergence of the extracted sequences of the diffusion process, as well as the illustration of the speed of convergence and a comparison between classical and modified random walks; the algorithms of the exchange-based diffusion and of the modified random walk; the application to two use cases, one based on Arxiv publications and another based on Coco dataset images. All the figures have been revised in this extended version to take the new developments into account.

similarity-based groupings. Connections between entities call for a modeling of potentially complex multi-adic relations that, most of the time, cannot be reduced to pairwise relationships. Hypergraphs, where relations are based on subsets of a given set, enhance the modeling of multi-adic relations. A multi-adic relationship can be viewed either as a group or as a co-occurrence. Hypergraphs can also be used to model the complex co-occurrence networks induced by the different facets of an information space [25].
Nonetheless, in real co-occurrence networks, some elements might occur more than once, or require an individual weighting at the group level. Sets fail to capture this information, while multisets, where elements have a multiplicity, handle it naturally. Moving from sets to multisets is not only a change in designation, but an effective mathematical paradigm shift, as the underlying algebra is not the same. Collections of multisets are then the next step in the modeling. In [26,27], we have introduced hyperbag-graphs (hb-graphs for short) as a generalization of hypergraphs to support multisets. A hb-graph is a family of multisets over the same universe, designated as the vertex set. The multisets play the role of hyperedges in hypergraphs and are called hb-edges. Hypergraphs appear as a sub-category of this new mathematical category that hb-graphs constitute.
The next step consists in highlighting important information conveyed by these complex networks. Traditional approaches in hypergraphs include random walks [1,44]. In particular, in [1], the authors show that putting hyperedge-based weights on vertices provides a better information retrieval. In these two approaches [1,44], the focus is mainly put on vertices; but, most of the time, in such modeling, the information carried by the links is semantically significant, as it represents the reference used for building the co-occurrences. Having a way to highlight important information from the reference is therefore also of interest. Multimedia databases, such as image databases or document databases, are potential applications of such means of highlighting information. For instance, different metadata can be attached to a document, such as authors, author keywords, processed keywords, categories and added tags. If the users are able to attach tags to documents, it can be important to weight them individually in the context of each document. The same can apply to an image, with other features that are based on the image analysis. Hb-graphs are well suited to model such information spaces [29].
We want to address the following research question: "Can we find a network model and a diffusion process that not only rank vertices but also rank hb-edges in hb-graphs?". In [24], we have developed an iterative exchange approach in hb-graphs with two-phase steps that extracts information not only at the vertex level but also at the hb-edge level. In this article, which is an extended version of [24], we not only present the contributions of [24] (the introduction of the exchange-based diffusion process as a means to rank both vertices and hb-edges, the formalization of the exchanges by using hb-graphs, and a novel visualisation of co-occurrence networks), but also add new contributions.
In [24], we have validated our approach by using lab-generated hb-graphs. We continue here to use this approach, as mimicking real datasets by randomly generated ones is a guarantee not only of reproducibility, but also of robustness of the results obtained. We illustrate the information extracted by the exchange process with a hb-graph visualisation that highlights not only vertices but also hb-edges.
We show that the exchange-based diffusion process provides a proper coloring of vertices with high connectivity and highlights hb-edges with a normalisation approach that gives small hb-edges a chance to be highlighted. We apply this approach to process the metadata contained in the results retrieved by querying Arxiv through its API in order to visualize the results: we will show how it can be used to allow further query expansion. We give a last use case on Coco dataset images.
In summary, the contributions of this extended version include: the proofs of conservation and convergence of the extracted sequences of the diffusion process, as well as the illustration of the speed of convergence and comparison to classical and modified random walks; the algorithms of the exchange-based diffusion and the modified random walk; the application to two use cases, one based on Arxiv publications and another one based on images of the Coco dataset.
In Section 2, the mathematical background and the related work are given. The formalisation of the exchange process is constructed in Section 3. Results and evaluation are given in Section 4, and future work and conclusion are addressed in Section 5.

Mathematical background and related work
For the mathematical background, we give a minimal mathematical formalization. However, the interested reader can refer to [28] which contains all the necessary mathematics and to [26,27] for a full introduction on hb-graphs.

Hypergraphs
Hypergraphs have been introduced in [3]; we nonetheless use the definition given in [4], as it relaxes the constraint on the hyperedges to cover the vertex set. A hypergraph over a finite set of vertices is defined as a family of subsets of this vertex set. A hypergraph is said to be edge-weighted if there exists a mapping that associates a positive real number to each hyperedge. Hypergraphs are well suited to model multi-adicity in structures where the traditional pairwise relationship of graphs is insufficient: they are used in many areas such as social networks, in particular collaboration networks [22,23], co-author networks [13,37], chemical reactions [38], genome [6], VLSI design [15] and other applications. Hypergraphs are also used in information retrieval for different purposes such as query formulation in text retrieval [2] and music recommendation [5]. Several applications of hypergraphs exist based on the diffusion process first developed in [44]. In [11], the authors use the diffusion process developed in [44] for 3D-object retrieval and recognition by building multiple hypergraphs of objects based on their 2D-views. In [43], multiple hypergraphs are constructed to characterize the complex relations between landmark images and are gathered into a multi-modal hypergraph that allows the integration of heterogeneous sources to provide content-based visual landmark searches. Hypergraphs are also used in multi-feature indexing to help image retrieval [41]. For each image, a hyperedge gathers the most similar images based on different features. Hyperedges are weighted by average similarity. A spectral clustering algorithm is then applied to divide the dataset into a given number of sub-hypergraphs. A random walk on these sub-hypergraphs retrieves significant images: they are used to build a new inverted index, useful to query images.
In [40], a joint-hypergraph learning is achieved for image retrieval, combining efficiently a semantic hypergraph based on image tags with a visual hypergraph based on image features.
Evaluating the importance of vertices in hypergraphs by random walks has been largely studied. In [44], a random walk on an edge-weighted hypergraph is defined by choosing a hyperedge with a probability proportional to its weight and, within that hyperedge, a vertex chosen uniformly at random. This random walk has a stationary state which is shown in [10] to correspond to a vector proportional to the vector of vertex degrees. This process differs from the one we propose: our diffusion process is done in successive steps from a random initial vertex on vertices and hb-edges, taking into account the multiplicities of the vertices inside each hb-edge.
In [1], the authors use a random walk on hypergraphs using weight functions both on hyperedges and vertices. The vertex weights are hyperedge-based: it is achieved using a vector of weights associated to each vertex. The random walk is similar to the one in [44], but additionally takes into account the vertex weight in the probability law for choosing the vertex inside the hyperedge. They show on a publication dataset that this modified random walk gives a ranking of vertices with higher precision than random walks using unweighted vertices. However, this process differs again from our proposal since our process not only enables simultaneous alternative updates of vertices and hb-edges values but also provides hb-edge ranking. We also introduce a new theoretical framework to perform our diffusion process.
Diffusion processes are tightly tied to random walks. In [17], the authors use random walks in hypergraphs for image matching. In [19], the authors build higher order random walks in hypergraphs and construct a generalised Laplacian attached to the graphs generated from their random walks.

Multisets
Multisets, also known as bags or msets, have a long history of use in many domains. Before developing their use in different domains, we first give the main definitions on multisets, mainly based on [35].
A multiset $A_m = (A, m)$ is a pair composed of a set $A$ of distinct objects, called the universe of the multiset, and of a multiplicity function $m$ with a range potentially in the set of real numbers. The support $A_m^\star$ of the multiset $A_m$ corresponds to the elements of the universe that have a non-zero multiplicity. When the range of the multiplicity function is a subset of the non-negative integers, we call it a natural multiset. A natural multiset can be viewed as an unordered list of elements with possible repetitions.
The m-cardinality of a multiset $A_m$, written $\#_m A_m$, corresponds to the sum of the multiplicities of the elements of its universe.
Different operations can be defined on multisets of the same universe, such as inclusion, union, intersection and sum. As mentioned in [35], De Morgan's laws on multisets do not hold. Defining complementation and difference requires fixing a limit in m-cardinality to the multisets, as given in [12].
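As an illustration of these definitions, here is a minimal Python sketch. It assumes a multiset is stored as a dictionary mapping each element of the universe to its multiplicity; the element names and the helper functions `support` and `m_cardinality` are hypothetical and only serve the example.

```python
# Assumption: a multiset is stored as {element of the universe: multiplicity}.
# The elements "a".."d" and their multiplicities are made up for illustration.
A_m = {"a": 2, "b": 1, "c": 0, "d": 3}


def support(multiset):
    """Elements of the universe that have a non-zero multiplicity."""
    return {x for x, mult in multiset.items() if mult != 0}


def m_cardinality(multiset):
    """Sum of the multiplicities of the elements of the universe."""
    return sum(multiset.values())


def is_natural(multiset):
    """True when all multiplicities are non-negative integers."""
    return all(isinstance(m, int) and m >= 0 for m in multiset.values())


print(support(A_m), m_cardinality(A_m), is_natural(A_m))
```

Here the element "c" belongs to the universe but not to the support, since its multiplicity is zero.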
Multisets, under the name of bags, appear in different domains such as text modeling, image description and audio [32]. In text representation, bags of words were first introduced in [14]: bags are lists of words with repetitions, i.e. multisets of words on a universe. Many applications exist, with different approaches. Bags of words have been used for instance in fraud detection [31]. More recently, bags of words have been used successfully in neural machine translation as a target for the translation, since a sentence can be translated in many different ways [20]. In [8], multi-modal bags of words have been used for cross-domain sentiment analysis.
Bags of visual words is the transcription to image of textual bags of words; in bags of visual words, a visual vocabulary based on image features is built to allow the description of images as bags of these features. Since their introduction in [36], many applications have been achieved: in visual categorization [7], in image classification and filtering [9], in image annotation [39], in action recognition [30], in land-use scene classification [42], in identifying mild traumatic brain injuries [21] and in word image retrieval [33].
Bags of concepts are an extension of bags of words to successive concepts in a text [16]. A recent extension of these concepts is given in [34], where bags of graphs are introduced to encode in graphs the local structure of a digital object: bags of graphs are specialized into bags of singleton graphs and bags of visual graphs. Using hb-graphs as we propose in this article will make it possible to extend this approach, by taking advantage of multi-adicity and also of the multiplicity of vertices specific to each hb-edge.

Hb-graphs
Hb-graphs are introduced in [26]. A hb-graph is a family of multisets with the same universe $V$ and with support a subset of $V$. The msets are called the hb-edges and the elements of $V$ the vertices. We consider for the remainder of the article a hb-graph $\mathcal{H} = (V, E)$, with $V = \{v_i : i \in \llbracket n \rrbracket\}$ and $E = (e_j)_{j \in \llbracket p \rrbracket}$ the family of its hb-edges, where $\llbracket n \rrbracket$ denotes the set of integers from 1 to $n$.
Each hb-edge $e_j \in E$ has $V$ as universe and a multiplicity function associated to it: $m_{e_j} : V \to W$ where $W \subset \mathbb{R}^+$. For a general hb-graph, each hb-edge has to be seen as a weighted system of vertices, where the weights of each vertex are hb-edge dependent.
A hb-graph where the multiplicity range of each hb-edge is a subset of the non-negative integer set is called a natural hb-graph. A hypergraph is a natural hb-graph where the hb-edges have multiplicity one for every vertex of their support.
The support hypergraph of a hb-graph $\mathcal{H} = (V, E)$ is the hypergraph whose vertices are the ones of the hb-graph and whose hyperedges are the supports of the hb-edges, in a one-to-one way.
The m-degree of a vertex $v_i \in V$ of a hb-graph $\mathcal{H}$, written $\deg_m(v_i) = d_m(v_i)$, is defined as the sum of the multiplicities of $v_i$ in each hb-edge of the hb-graph.
The matrix $H = [m_j(v_i)]_{i \in \llbracket n \rrbracket, j \in \llbracket p \rrbracket}$ is called the incidence matrix of the hb-graph $\mathcal{H}$.
A weighted hb-graph $\mathcal{H}_w = (V, E, w_e)$ is a hb-graph $\mathcal{H} = (V, E)$ where the hb-edges are weighted by $w_e : E \to \mathbb{R}^{+*}$. An unweighted hb-graph is then a weighted hb-graph with $w_e(e_j) = 1$ for all $e_j \in E$.
A strict m-path $u e_{j_1} v_{i_1} \dots e_{j_s} v$ in a hb-graph $\mathcal{H}$ from a vertex $u$ to a vertex $v$ is a vertex / hb-edge alternation, where the intermediate vertices belong to the intersection of the hb-edges immediately surrounding them. In a natural hb-graph, a strict m-path is not unique, as many copies of the same vertex can coexist in the intersection. Moreover, in natural hb-graphs, there are two notions of paths, a strict one and a large one: in the latter, some copies of the vertex are possibly not in the intersection of the two surrounding hb-edges and can exist only in one of the two hb-edges.
A strict m-path in a hb-graph corresponds to a unique path in the support hypergraph of the hb-graph, called the support path. In this article, by abuse of language, we call it a path of the hb-graph. The length of a path corresponds to the number of hb-edges it goes through.
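The following Python sketch illustrates these notions on a small hb-graph. The representation (one dictionary of multiplicities per hb-edge), the vertex names and all function names are assumptions made for the example; they are not taken from the article.

```python
# Hypothetical toy hb-graph: 4 vertices, 3 hb-edges stored as {vertex: multiplicity}.
vertices = ["v1", "v2", "v3", "v4"]
hb_edges = [
    {"v1": 2, "v2": 1},
    {"v2": 3, "v3": 1, "v4": 1},
    {"v1": 1, "v4": 2},
]


def incidence_matrix(vertices, hb_edges):
    """n x p matrix whose (i, j) entry is the multiplicity m_j(v_i)."""
    return [[e.get(v, 0) for e in hb_edges] for v in vertices]


def m_degrees(H):
    """m-degree of each vertex: sum of its multiplicities over all hb-edges."""
    return [sum(row) for row in H]


def m_cardinalities(H):
    """m-cardinality of each hb-edge: sum of the multiplicities it holds."""
    return [sum(col) for col in zip(*H)]


def support_hypergraph(hb_edges):
    """Hyperedges of the support hypergraph: supports of the hb-edges."""
    return [frozenset(v for v, m in e.items() if m > 0) for e in hb_edges]


H = incidence_matrix(vertices, hb_edges)
print(H)
print(m_degrees(H), m_cardinalities(H))
print(support_hypergraph(hb_edges))
```

Replacing every non-zero multiplicity by 1 in this representation gives back an ordinary hypergraph, in line with the definition of a hypergraph as a natural hb-graph.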
Representations of hb-graphs can be achieved either by using sub-mset representations or by using edge representations. In the edge representation, an extra-node is added to each hb-edge and the thickness of the link between the extra-node of a hb-edge and the vertices in the support of the hb-edge is made proportional to their multiplicity in the hb-edge. More details on these representations can be found in [26].
We give in Fig. 1 an example of such a representation of a hb-graph for keywords extracted from sentences in which stop words have been removed. The number of occurrences of a word differs from one sentence to another: it is given as a multiplicity specific to the corresponding hb-edge that represents the sentence. The universe of the hb-graph is the set of words from which the stop words have been removed.

Exchange-based diffusion in hb-graphs
We introduce in this section a diffusion process based on the exchange of information between the vertices and the hb-edges. Traditionally, diffusion processes are achieved using an initial stroke on a vertex that propagates over the network structure. Diffusion processes can be approximated using random walks. When the random walk takes place on a network, either graph or hypergraph based, vertices can be ranked by using the number of times they are reached. Teleportation is introduced in these random walks to avoid loops. Several random walks are often necessary in order to average their results.
The idea in the exchange-based diffusion is to propose a mechanism that mimics the behavior of a population where agents (the vertices) have equal resources at the beginning and can exchange them only via the intermediates (the hb-edges) they belong to, which share the resources according to the multiplicities of these agents.
We consider a weighted hb-graph $\mathcal{H}_w = (V, E, w_e)$ with $|V| = n$ and $|E| = p$; we write $H$ its incidence matrix.
At time $t$, we set a distribution of values $\alpha_t$ over the vertex set and a distribution of values $\epsilon_t$ over the hb-edge set. We write $P_{V,t} = (\alpha_t(v_i))_{i \in \llbracket n \rrbracket}$ the row state vector of the vertices at time $t$ and $P_{E,t} = (\epsilon_t(e_j))_{j \in \llbracket p \rrbracket}$ the row state vector of the hb-edges. The initialisation is done such that $\sum_{v_i \in V} \alpha_0(v_i) = 1$; the information value is concentrated uniformly on the vertices at the beginning of the diffusion process and, consequently, each hb-edge has a zero value associated to it: $\alpha_0(v_i) = \frac{1}{n}$ for all $v_i \in V$ and $\epsilon_0(e_j) = 0$ for all $e_j \in E$.

We consider an iterative process with two-phase steps. At every time step, the first phase starts at time $t$ and ends at $t + \frac{1}{2}$, followed by the second phase between time $t + \frac{1}{2}$ and $t + 1$. In Fig. 2, we illustrate this principle using the example in Fig. 1. A more general figure of the principle of this iterative process is given in [24,28]. The iterative process conserves the overall value held by the vertices and the hb-edges, meaning that we have at any time $t$:

$$\sum_{v_i \in V} \alpha_t(v_i) + \sum_{e_j \in E} \epsilon_t(e_j) = 1.$$

During the first phase, between time $t$ and $t + \frac{1}{2}$, each vertex $v_i$ of the hb-graph shares its value $\alpha_t(v_i)$ held at time $t$ with the hb-edges it is connected to.
In an unweighted hb-graph, the fraction of $\alpha_t(v_i)$ given by $v_i$ to the hb-edge $e_j$ is $\frac{m_j(v_i)}{\deg_m(v_i)}$, which corresponds to the ratio of the multiplicity of the vertex $v_i$ in the hb-edge $e_j$ to the total multiplicity of $v_i$ over the hb-edges containing it in their support, i.e. its m-degree. In a weighted hb-graph, each hb-edge has a weight $w_e(e_j)$. The value $\alpha_t(v_i)$ of the vertex $v_i$ is shared by accounting not only for the multiplicity of the vertex in the hb-edge but also for the weight $w_e(e_j)$ of the hb-edge $e_j$.
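The weights of the hb-edges are stored in a column vector:

$$w_E = \left(w_e(e_j)\right)^\top_{j \in \llbracket p \rrbracket}.$$

We also consider the weight diagonal matrix:

$$W_E = \mathrm{diag}\left(\left(w_e(e_j)\right)_{j \in \llbracket p \rrbracket}\right).$$

We introduce the weighted m-degree matrix:

$$D_{w,V} = \mathrm{diag}\left(\left(d_{w,v_i}\right)_{i \in \llbracket n \rrbracket}\right),$$

where $d_{w,v_i}$ is called the weighted m-degree of the vertex $v_i$. It is:

$$d_{w,v_i} = \sum_{j \in \llbracket p \rrbracket} m_j(v_i)\, w_e(e_j),$$

i.e. the $i$-th component of $H w_E$.

The contribution of the vertex $v_i$ to the value $\epsilon_{t+\frac{1}{2}}(e_j)$ attached to the hb-edge $e_j$ of weight $w_e(e_j)$ is:

$$\delta\epsilon_{t+\frac{1}{2}}(e_j \mid v_i) = \frac{m_j(v_i)\, w_e(e_j)}{d_{w,v_i}}\, \alpha_t(v_i).$$

It corresponds to the ratio of the weighted multiplicity of the vertex $v_i$ in $e_j$ to the total weighted m-degree of the hb-edges where $v_i$ is in the support. We remark that if $v_i \notin e_j^\star$, then $\delta\epsilon_{t+\frac{1}{2}}(e_j \mid v_i) = 0$.

The value $\epsilon_{t+\frac{1}{2}}(e_j)$ is calculated by summing over the vertex set:

$$\epsilon_{t+\frac{1}{2}}(e_j) = \sum_{i \in \llbracket n \rrbracket} \delta\epsilon_{t+\frac{1}{2}}(e_j \mid v_i).$$

Hence, we obtain:

$$P_{E,t+\frac{1}{2}} = P_{V,t}\, D_{w,V}^{-1} H W_E. \tag{1}$$

The value given to the hb-edges is subtracted from the value of the corresponding vertex; hence, for all $i \in \llbracket n \rrbracket$:

$$\alpha_{t+\frac{1}{2}}(v_i) = \alpha_t(v_i) - \sum_{j \in \llbracket p \rrbracket} \delta\epsilon_{t+\frac{1}{2}}(e_j \mid v_i).$$

Claim (No information on vertices at $t + \frac{1}{2}$) It holds: $\alpha_{t+\frac{1}{2}}(v_i) = 0$ for all $i \in \llbracket n \rrbracket$.

Proof For all $i \in \llbracket n \rrbracket$:

$$\alpha_{t+\frac{1}{2}}(v_i) = \alpha_t(v_i) - \alpha_t(v_i)\, \frac{\sum_{j \in \llbracket p \rrbracket} m_j(v_i)\, w_e(e_j)}{d_{w,v_i}} = 0.$$

Claim (Conservation of the information of the hb-graph at $t + \frac{1}{2}$) It holds:

$$\sum_{v_i \in V} \alpha_{t+\frac{1}{2}}(v_i) + \sum_{e_j \in E} \epsilon_{t+\frac{1}{2}}(e_j) = 1.$$

During the second phase, which starts at time $t + \frac{1}{2}$, the hb-edges share their values across the vertices they hold, taking into account the vertex multiplicities within the hb-edge. The contribution to $\alpha_{t+1}(v_i)$ given by a hb-edge $e_j$ is proportional to $\epsilon_{t+\frac{1}{2}}(e_j)$, with a factor corresponding to the ratio of the multiplicity $m_j(v_i)$ of the vertex $v_i$ to the hb-edge m-cardinality:

$$\delta\alpha_{t+1}(v_i \mid e_j) = \frac{m_j(v_i)}{\#_m e_j}\, \epsilon_{t+\frac{1}{2}}(e_j).$$

The value $\alpha_{t+1}(v_i)$ is then obtained by summing over all values associated to the hb-edges that are incident to $v_i$:

$$\alpha_{t+1}(v_i) = \sum_{j \in \llbracket p \rrbracket} \delta\alpha_{t+1}(v_i \mid e_j).$$

Writing $D_E = \mathrm{diag}\left(\left(\#_m e_j\right)_{j \in \llbracket p \rrbracket}\right)$ the diagonal matrix of size $p \times p$, it comes:

$$P_{V,t+1} = P_{E,t+\frac{1}{2}}\, D_E^{-1} H^\top. \tag{2}$$

The values given to the vertices are subtracted from the value associated to the corresponding hb-edge. Hence, for all $j \in \llbracket p \rrbracket$:

$$\epsilon_{t+1}(e_j) = \epsilon_{t+\frac{1}{2}}(e_j) - \sum_{i \in \llbracket n \rrbracket} \delta\alpha_{t+1}(v_i \mid e_j).$$

Claim (The hb-edges have 0 value at $t + 1$) It holds: $\epsilon_{t+1}(e_j) = 0$ for all $j \in \llbracket p \rrbracket$.

Proof For all $j \in \llbracket p \rrbracket$:

$$\epsilon_{t+1}(e_j) = \epsilon_{t+\frac{1}{2}}(e_j) - \epsilon_{t+\frac{1}{2}}(e_j)\, \frac{\sum_{i \in \llbracket n \rrbracket} m_j(v_i)}{\#_m e_j} = 0.$$

Claim (Conservation of the information of the hb-graph at $t + 1$) It holds:

$$\sum_{v_i \in V} \alpha_{t+1}(v_i) + \sum_{e_j \in E} \epsilon_{t+1}(e_j) = 1.$$

Regrouping (1) and (2):

$$P_{V,t+1} = P_{V,t}\, T \quad \text{with} \quad T = D_{w,V}^{-1} H W_E D_E^{-1} H^\top. \tag{3}$$

It is valuable to keep a trace of the intermediate state $P_{E,t+\frac{1}{2}} = P_{V,t}\, D_{w,V}^{-1} H W_E$, as it holds the values attached to the hb-edges.

Claim (Stochastic transition matrix) $T$ is a square row stochastic matrix of dimension $n$.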
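Proof Let us consider $A = (a_{ij})_{i \in \llbracket n \rrbracket, j \in \llbracket p \rrbracket} = D_{w,V}^{-1} H W_E \in M_{n,p}$ and $B = (b_{jk})_{j \in \llbracket p \rrbracket, k \in \llbracket n \rrbracket} = D_E^{-1} H^\top \in M_{p,n}$. $A$ and $B$ are non-negative rectangular matrices. Moreover, $a_{ij} = \frac{m_j(v_i)\, w_e(e_j)}{d_{w,v_i}}$ and it holds:

$$\sum_{j \in \llbracket p \rrbracket} a_{ij} = \frac{\sum_{j \in \llbracket p \rrbracket} m_j(v_i)\, w_e(e_j)}{d_{w,v_i}} = 1.$$

Similarly, $b_{jk} = \frac{m_j(v_k)}{\#_m e_j}$ and it holds:

$$\sum_{k \in \llbracket n \rrbracket} b_{jk} = \frac{\sum_{k \in \llbracket n \rrbracket} m_j(v_k)}{\#_m e_j} = 1.$$

We have $P_{V,t+1} = P_{V,t}\, AB$, where $AB = \left(\sum_{j \in \llbracket p \rrbracket} a_{ij} b_{jk}\right)_{i \in \llbracket n \rrbracket, k \in \llbracket n \rrbracket}$. It yields, for all $i \in \llbracket n \rrbracket$:

$$\sum_{k \in \llbracket n \rrbracket} \sum_{j \in \llbracket p \rrbracket} a_{ij} b_{jk} = \sum_{j \in \llbracket p \rrbracket} a_{ij} \sum_{k \in \llbracket n \rrbracket} b_{jk} = \sum_{j \in \llbracket p \rrbracket} a_{ij} = 1.$$

Hence $T = AB$ is a non-negative square matrix with its row sums all equal to 1: it is a row stochastic matrix.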
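The stochasticity claim can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the incidence matrix `H` and the hb-edge weights `w` are made-up values, `transition_matrix` is a hypothetical helper name, and exact rational arithmetic is used so that the row sums can be compared to 1 without rounding.

```python
from fractions import Fraction

# Hypothetical incidence matrix (4 vertices x 3 hb-edges) and hb-edge weights.
H = [[2, 0, 1], [1, 3, 0], [0, 1, 0], [0, 1, 2]]
w = [1, 2, 1]


def transition_matrix(H, weights):
    """Return A, B and T = AB, where A shares vertex values with hb-edges
    and B shares hb-edge values back with vertices."""
    n, p = len(H), len(H[0])
    # Weighted m-degree of each vertex and m-cardinality of each hb-edge.
    d_w = [sum(H[i][j] * weights[j] for j in range(p)) for i in range(n)]
    card = [sum(H[i][j] for i in range(n)) for j in range(p)]
    A = [[Fraction(H[i][j] * weights[j], d_w[i]) for j in range(p)] for i in range(n)]
    B = [[Fraction(H[k][j], card[j]) for k in range(n)] for j in range(p)]
    T = [[sum(A[i][j] * B[j][k] for j in range(p)) for k in range(n)]
         for i in range(n)]
    return A, B, T


A, B, T = transition_matrix(H, w)
print([sum(row) for row in T])          # every row of T sums to 1
print([T[i][i] for i in range(len(T))])  # diagonal entries of T
```

The sketch assumes that every vertex belongs to at least one hb-edge, so that no weighted m-degree is zero.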
Claim (Properties of T) Supposing that the hb-graph is connected, the exchange-based diffusion matrix T is aperiodic and irreducible.
Proof This stochastic matrix is aperiodic: any vertex of the hb-graph retrieves a part of the value it has given to the hb-edges it belongs to, hence every diagonal entry of $T$ is strictly positive. Moreover, as the hb-graph is connected, the matrix is irreducible, as all states can be reached from any state.
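Claim The sequence $\left(P_{V,t}\right)_{t \in \mathbb{N}}$, with $P_{V,t} = (\alpha_t(v_i))_{i \in \llbracket n \rrbracket}$, in a connected hb-graph converges to the state vector $\pi_V$ such that:

$$\pi_V = \left(\frac{d_{w,v_i}}{\sum_{k \in \llbracket n \rrbracket} d_{w,v_k}}\right)_{i \in \llbracket n \rrbracket}.$$

Proof We denote by $\pi$ an eigenvector of $T = (c_{ik})_{i \in \llbracket n \rrbracket, k \in \llbracket n \rrbracket}$ associated to the eigenvalue 1, where $c_{ik} = \sum_{j \in \llbracket p \rrbracket} \frac{m_j(v_i)\, w_e(e_j)\, m_j(v_k)}{d_{w,v_i}\, \#_m e_j}$. We have $\pi T = \pi$. Let us consider $u = \left(d_{w,v_i}\right)_{i \in \llbracket n \rrbracket}$. We have, for all $k \in \llbracket n \rrbracket$:

$$(uT)_k = \sum_{i \in \llbracket n \rrbracket} d_{w,v_i}\, c_{ik} = \sum_{j \in \llbracket p \rrbracket} w_e(e_j)\, m_j(v_k)\, \frac{\sum_{i \in \llbracket n \rrbracket} m_j(v_i)}{\#_m e_j} = \sum_{j \in \llbracket p \rrbracket} w_e(e_j)\, m_j(v_k) = d_{w,v_k} = u_k.$$

Hence, $u$ is a non-negative eigenvector of $T$ associated to the eigenvalue 1. For a connected hb-graph, when we iterate over the stochastic matrix $T$, which is aperiodic and irreducible, convergence to a stationary state is ensured: this stationary state is the probability vector associated to the eigenvalue 1. It is unique and is equal to $\alpha u$, with the scalar $\alpha$ such that $\sum_{k \in \llbracket n \rrbracket} \alpha u_k = 1$.

Claim The sequence $\left(P_{E,t+\frac{1}{2}}\right)_{t \in \mathbb{N}}$ in a connected hb-graph converges towards a state vector $\pi_E$ such that:

$$\pi_E = \left(\frac{w_e(e_j)\, \#_m e_j}{\sum_{k \in \llbracket n \rrbracket} d_{w,v_k}}\right)_{j \in \llbracket p \rrbracket}.$$

Proof We have:

$$\lim_{t \to +\infty} P_{E,t+\frac{1}{2}} = \pi_V\, D_{w,V}^{-1} H W_E = \left(\sum_{i \in \llbracket n \rrbracket} \frac{d_{w,v_i}}{\sum_{k \in \llbracket n \rrbracket} d_{w,v_k}} \cdot \frac{m_j(v_i)\, w_e(e_j)}{d_{w,v_i}}\right)_{j \in \llbracket p \rrbracket} = \pi_E.$$

All components are non-negative and we check that the components of this vector sum to one:

$$\sum_{j \in \llbracket p \rrbracket} w_e(e_j)\, \#_m e_j = \sum_{j \in \llbracket p \rrbracket} w_e(e_j) \sum_{i \in \llbracket n \rrbracket} m_j(v_i) = \sum_{i \in \llbracket n \rrbracket} d_{w,v_i}.$$

These two claims show that this exchange-based process ranks vertices by their weighted m-degree and hb-edges by their weighted m-cardinality.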
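As a sanity check of the two limit vectors, the following sketch verifies on a toy example that the vector of normalised weighted m-degrees is left unchanged by one full two-phase step, and that the intermediate hb-edge state is the vector of normalised weighted m-cardinalities. The incidence matrix, the weights and the function names `phase1` and `phase2` are assumptions made for the illustration.

```python
from fractions import Fraction

# Hypothetical incidence matrix (4 vertices x 3 hb-edges) and hb-edge weights.
H = [[2, 0, 1], [1, 3, 0], [0, 1, 0], [0, 1, 2]]
w = [1, 2, 1]
n, p = len(H), len(H[0])

d_w = [sum(H[i][j] * w[j] for j in range(p)) for i in range(n)]  # weighted m-degrees
card = [sum(H[i][j] for i in range(n)) for j in range(p)]        # m-cardinalities
total = sum(d_w)

# Candidate stationary vectors: normalised weighted m-degrees and
# normalised weighted m-cardinalities.
pi_V = [Fraction(d, total) for d in d_w]
pi_E = [Fraction(w[j] * card[j], total) for j in range(p)]


def phase1(P_V):
    """Vertices share their values with the hb-edges (first phase)."""
    return [sum(P_V[i] * Fraction(H[i][j] * w[j], d_w[i]) for i in range(n))
            for j in range(p)]


def phase2(P_E):
    """hb-edges share their values back with the vertices (second phase)."""
    return [sum(P_E[j] * Fraction(H[i][j], card[j]) for j in range(p))
            for i in range(n)]


print(phase1(pi_V) == pi_E, phase2(pi_E) == pi_V)
```

Exact fractions are used here so that the fixed-point equalities hold without any tolerance.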
We have gathered the two-phase steps of the exchange-based diffusion process in Algorithm 1. The time complexity of this algorithm is expressed in terms of $d_H$, the maximal degree of the vertices in the hb-graph, and of $r_H = \max_{e_j \in E} |e_j^\star|$, the maximal cardinality of the support of a hb-edge. Usually, $d_H$ and $r_H$ are small compared to $n$ and $p$. Algorithm 1 can be refined to determine automatically the number of iterations needed, by fixing an accepted error to ensure convergence on the values for the vertices and by storing the previous state.
This section first addresses the validation of the approach on lab-generated hb-graphs. Secondly, this approach is applied to two use cases: one on the processing of the results of Arxiv querying and another one on Coco dataset images.
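The two-phase iteration, together with the stopping criterion on an accepted error mentioned above, can be sketched as follows. This is not the article's Algorithm 1 but a minimal illustration under assumptions: hb-edges are stored as dictionaries mapping vertex indices to multiplicities, every vertex belongs to at least one hb-edge, and the names `exchange_based_diffusion`, `tol` and `max_iter` are hypothetical.

```python
def exchange_based_diffusion(hb_edges, weights, n_vertices, tol=1e-9, max_iter=10_000):
    """Iterate the two-phase exchange until the vertex values change by less than tol.

    hb_edges: list of dicts {vertex index: multiplicity} (assumed representation).
    Returns the vertex values, the last intermediate hb-edge values and the
    number of iterations performed.
    """
    p = len(hb_edges)
    # Weighted m-degree of each vertex and m-cardinality of each hb-edge.
    d_w = [0.0] * n_vertices
    for e, w in zip(hb_edges, weights):
        for i, m in e.items():
            d_w[i] += m * w
    card = [sum(e.values()) for e in hb_edges]

    alpha = [1.0 / n_vertices] * n_vertices  # uniform initialisation on the vertices
    eps = [0.0] * p                          # hb-edges start with a zero value
    for iteration in range(1, max_iter + 1):
        # Phase 1: each vertex gives its whole value to its hb-edges.
        eps = [w * sum(m * alpha[i] / d_w[i] for i, m in e.items())
               for e, w in zip(hb_edges, weights)]
        # Phase 2: each hb-edge gives its whole value back to its vertices.
        new_alpha = [0.0] * n_vertices
        for j, e in enumerate(hb_edges):
            for i, m in e.items():
                new_alpha[i] += m * eps[j] / card[j]
        delta = max(abs(a - b) for a, b in zip(alpha, new_alpha))
        alpha = new_alpha
        if delta < tol:
            break
    return alpha, eps, iteration


# Hypothetical toy hb-graph with 4 vertices and 3 weighted hb-edges.
toy_edges = [{0: 2, 1: 1}, {1: 3, 2: 1, 3: 1}, {0: 1, 3: 2}]
toy_weights = [1, 2, 1]
alpha, eps, iterations = exchange_based_diffusion(toy_edges, toy_weights, 4)
print(alpha, eps, iterations)
```

Each iteration visits every (vertex, hb-edge) incidence twice, once per phase, which is why sparse dictionaries are used rather than a dense incidence matrix.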

Validation on lab-generated hb-graphs
This diffusion by exchange process has been validated on two experiments: the first one generates a random hb-graph to validate the approach and the second compares the results with a classical and a modified random walk on the hb-graph.
Using lab-generated hb-graphs allows us to test our diffusion on hb-graphs that have different shapes and where the connectivity is always guaranteed. The lab-hb-graph generator includes different parameters to control the connectivity, the number of groups (i.e. sub-hb-graphs) and the way these groups are connected. As shown in Fig. 3, we generate $N_{\max}$ vertices. $N_0$ out of the $N_{\max}$ vertices are regrouped in $V_0$ and will be used for interconnection between the different groups. The remaining $N_{\max} - N_0$ vertices are then separated into $k$ subsets $(V_j)_{j \in \llbracket k \rrbracket}$, corresponding to the vertices of the groups. In each of these $k$ groups $V_j$, we generate two subsets of vertices: a first set $V_{j,1}$ of $N_{j,1}$ vertices and a second set $V_{j,2}$ of $N_{j,2}$ vertices with $N_{j,1} \ll N_{j,2}$, $j \in \llbracket k \rrbracket$. The number of hb-edges to be built is adjustable and shared between the different groups. The m-cardinality $\#_m(e)$ of a hb-edge is chosen randomly below a maximum tunable threshold. The multiplicity given to a vertex is also a random choice, tunable below a threshold. Vertices in $V_{j,1}$ are the vertices considered as important: their presence is required in a certain number of hb-edges per group; the number of important vertices in a hb-edge is randomly fixed below a maximum number. The completion of each hb-edge is done by choosing vertices randomly in the $V_{j,2}$ set. A vertex can be chosen randomly many times, increasing in this case its multiplicity within the hb-edge using the same random approach. The random choice made in these two sets is tuned to follow a power law distribution. It implies that some vertices occur more often than others. Interconnection between the $k$ components is achieved by choosing vertices in $V_0$ and inserting them randomly into the hb-edges built.
Fig. 3 Random hb-graph generation principle
The exchange-based diffusion is then applied to these generated hb-graphs: we analyze not only the validity of this diffusion process but also propose a visualisation of the results that highlights not only vertices but also hb-edges, both on the hb-graph and on its support hypergraph.
We make the hypothesis that vertices with the highest values of $\alpha_T$ correspond to vertices of the network that are important in the sense of being central for the connectivity. To validate this hypothesis, we define a relative eccentricity of vertices from a subset of the vertex set in the hb-graph.
The eccentricity of a vertex in a graph is the length of a maximal shortest path between this vertex and the other vertices of this graph: extending this definition to hb-graphs is straightforward. If the graph is disconnected then each vertex has infinite eccentricity.
The relative eccentricity is then defined as the length of a maximal shortest path starting from a given vertex in a subset $S$ of the vertex set $V$ and ending at any vertex of $V \setminus S$. The relative eccentricity is calculated for every vertex of $S$ provided that it is connected to vertices of $V \setminus S$; otherwise it is set to $-\infty$. The concept of relative eccentricity is illustrated in Fig. 4.
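The definition of relative eccentricity can be sketched with a breadth-first search on the support hypergraph, counting path lengths in hb-edges traversed as stated earlier. The data layout (one set of vertices per support hyperedge) and the function names `hop_distances` and `relative_eccentricity` are assumptions made for this illustration.

```python
import math
from collections import deque


def hop_distances(source, edge_supports):
    """Shortest path lengths from source, counted in hb-edges traversed."""
    incident = {}
    for j, support in enumerate(edge_supports):
        for v in support:
            incident.setdefault(v, []).append(j)
    dist = {source: 0}
    seen_edges = set()
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for j in incident.get(u, []):
            if j in seen_edges:
                continue  # each hb-edge only needs to be expanded once
            seen_edges.add(j)
            for v in edge_supports[j]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return dist


def relative_eccentricity(s, S, vertices, edge_supports):
    """Longest shortest path from s (in S) to the reachable vertices outside S."""
    dist = hop_distances(s, edge_supports)
    reachable = [dist[v] for v in vertices if v not in S and v in dist]
    return max(reachable) if reachable else -math.inf


# Hypothetical support hypergraph: a chain a-b-c-d and a separate pair x-y.
vertices = ["a", "b", "c", "d", "x", "y"]
supports = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"x", "y"}]
print(relative_eccentricity("a", {"a", "b"}, vertices, supports))
print(relative_eccentricity("x", {"x", "y"}, vertices, supports))
```

In this sketch, vertices of $V \setminus S$ that cannot be reached from the starting vertex are simply ignored, and the value $-\infty$ is returned only when none of them is reachable.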
The subset of the vertex set $V$, written $A_V(s_V)$, is built by using a threshold value $s_V$.

The results obtained by this experiment are shown in the two plots of Fig. 5. The first plot corresponds to the maximal length of the path between vertices of $A_V(s_V)$ and vertices of $B_V(s_V)$ that are connected, according to the ratio $r_V$: this path length corresponds to half of the length of the path observed in the extra-vertex graph representation of the hb-graph support hypergraph, as in between two vertices of $V$ there is an extra-vertex that represents the hb-edge (or the support hyperedge). The second curve plots the percentage of vertices of $V$ that are in $A_V(s_V)$ as a function of $r_V$. When $r_V$ increases, the number of elements in $A_V(s_V)$ naturally decreases while they get closer to the elements of $B_V(s_V)$, marking the fact that they are central. Figures 6 and 7 show that high values of $\alpha_T(v)$ correspond to vertices that are highly connected either by degree or by m-degree.
A similar approach is taken for the hb-edges: assuming that the diffusion process stops at time $T$, we use the values $\epsilon_{T-\frac{1}{2}}$ of the hb-edges. The length of the path corresponds to half of the one obtained from the graph, for the same reason as before. We define the ratio $r_E$ relative to $\frac{1}{|E|}$, which corresponds to the normalised value that would be used in the dual hb-graph to initialize the diffusion process. In Fig. 8, we observe for the hb-edges the same trend as the one observed for vertices: the length of the maximal path between two hb-edges decreases as the ratio $r_E$ increases, while the percentage of hb-edges of $E$ that are in $A_E(s_E)$ decreases. Figure 9 shows the high correlation between the value of $\epsilon(e)$ and the cardinality of $e$; Fig. 10 shows that the correlation between the value of $\epsilon(e)$ and the m-cardinality of $e$ is even stronger.
The number of iterations needed to reach a significant convergence depends on the initial conditions; we tried different initializations, either uniform or applying strokes on different numbers of nodes. We observed that the more uniformly the information is spread over the network, the fewer iterations are needed for convergence. No matter the configuration, the most important vertices in terms of connectivity are always the most highlighted.

Figures 11 and 12 depict the convergence observed with a uniform initial distribution, as described in the former section. In Fig. 11, we can see how the $\alpha$-values, as observed in Fig. 6, reflect the m-degree of the vertex they are associated with: 200 iterations are more than enough to rank the vertices by m-degree. In Fig. 12, we can observe an analogous phenomenon with the $\epsilon$-values associated to hb-edges, which reflect the m-cardinality of the hb-edges. Again, 200 iterations are sufficient to converge in the studied cases. The number of iterations needed to converge depends on the structure of the network. In the transitory phase, the vertices need to exchange with the hb-edges; the process requires some iterations before converging and its behavior depends on the node connectivity and the hb-edge composition. Investigating this transitory phase, to have more indications on the way the $\epsilon$-values and the $\alpha$-values vary, is an open question.

We show an example of exchange-based diffusion on a lab-generated hb-graph in Fig. 13a and on its support hypergraph in Fig. 13b. The vertices are colored according to the value of a ratio computed from their $\alpha$-values. For hb-edges, the ratio compares $\epsilon_{T-\frac{1}{2}}(e)$ to $\epsilon_{\mathrm{norm}}(e)$: when this ratio is low, hb-edges are colored in a blueish hue; when this ratio is high, i.e. when the hb-edges have a high $\epsilon_{T-\frac{1}{2}}(e)$ compared to what was expected with $\epsilon_{\mathrm{norm}}(e)$, they are colored in a reddish hue.
It is worth mentioning that diffusing only on the support hypergraph of a hb-graph highlights only nodes that are highly connected inside a group; the ones at the intersection of the different groups have less importance in this case. The diffusion on the hb-graph captures the centrality of these vertices that are peripheral to the groups. Hence, taking the multiplicities into account brings valuable information on the network and on the centrality of some vertices.
To compare our exchange-based diffusion process to a baseline, we consider a classical random walk. In this classical random walk, the walker on a vertex $v$ chooses an incident hb-edge with a uniform probability law and, when on a hb-edge $e$, chooses a vertex inside the hb-edge with a uniform probability law. We allow teleportation from a vertex to another vertex with a tunable value $\gamma$: $1 - \gamma$ represents the probability of being teleported. We choose the classical value $\gamma = 0.85$. We count the number of passages of the walker through each vertex and each hb-edge. We stop the random walk when the hb-graph is fully explored. We iterate the random walk $N$ times, with $N$ varying.
To improve the results of the classical random walk, we propose a modified random walk on hb-graphs, described in Algorithm 2. When the walker is on a vertex v, it chooses a hb-edge at random with the probability distribution $\left(\frac{w_e(e_i)\, m_i(v)}{\deg_{w,m}(v)}\right)_{i \in \llbracket p \rrbracket}$; when the walker is on a hb-edge e, it chooses a vertex at random with the probability distribution $\left(\frac{m_e(v_i)}{\#_m(e)}\right)_{i \in \llbracket n \rrbracket}$. We allow teleportation as in the classical random walk. As in the classical random walk, we count the number of passages of the walker through each vertex and each hb-edge, and we stop the random walk when the hb-graph is fully explored. We repeat the random walk N times, for various values of N. Assigning a multiplicity of 1 to every vertex and a weight of 1 to every hb-edge (with the vertex degree and the hb-edge cardinality used instead of their multiplicity-based counterparts) recovers the classical random walk from the modified random walk.

Figure 14 shows a good correlation between the rank obtained by a thousand modified random walks and the rank obtained after two hundred iterations of our diffusion process, especially for the first two hundred vertices of the network, which are generally the ones targeted. The lack of anti-correlation between the rank obtained by the random walk and the m-degree and the degree of the vertices, shown respectively in Figs. 15 and 16, is mainly due to the vertices with low m-degrees or degrees, but this lack decreases with the modified random walk.

Fig. 15 Comparison of the rank obtained by a thousand modified random walks after total discovery of the vertices in the hb-graph and m-degree of vertices

Fig. 16 Comparison of the rank obtained by a thousand modified random walks after total discovery of the vertices in the hb-graph and degree of vertices
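The modified random walk described above can be illustrated with a minimal Python sketch. It is not Algorithm 2 itself: the dictionary representation of the hb-graph, the function names and the choice of teleporting to a vertex drawn uniformly among all vertices are assumptions made for illustration, and every hb-edge is assumed to be non-empty with strictly positive multiplicities.

```python
import random
from collections import Counter


def modified_random_walk(hb_edges, weights=None, gamma=0.85, rng=None):
    """One walk, stopped once every vertex and every hb-edge has been visited.

    hb_edges: dict hb-edge -> {vertex: multiplicity > 0}  (assumed representation)
    weights:  dict hb-edge -> weight w_e(e); defaults to 1 for every hb-edge
    gamma:    probability of following the hb-graph; 1 - gamma is the
              teleportation probability
    """
    rng = rng or random.Random()
    weights = weights or {e: 1.0 for e in hb_edges}
    if any(not members for members in hb_edges.values()):
        raise ValueError("every hb-edge must contain at least one vertex")
    vertices = sorted({v for members in hb_edges.values() for v in members})

    # incident[v] holds (hb-edge, w_e(e) * m_e(v)); the scores sum to deg_{w,m}(v)
    incident = {v: [] for v in vertices}
    for e, members in hb_edges.items():
        for v, m in members.items():
            incident[v].append((e, weights[e] * m))

    vertex_visits, edge_visits = Counter(), Counter()
    v = rng.choice(vertices)
    vertex_visits[v] += 1
    while len(vertex_visits) < len(vertices) or len(edge_visits) < len(hb_edges):
        if rng.random() > gamma:
            # teleportation (assumption: uniform over all vertices)
            v = rng.choice(vertices)
        else:
            # vertex -> hb-edge, proportionally to w_e(e) * m_e(v)
            edges, scores = zip(*incident[v])
            e = rng.choices(edges, weights=scores)[0]
            edge_visits[e] += 1
            # hb-edge -> vertex, proportionally to the multiplicity m_e(v)
            members = hb_edges[e]
            v = rng.choices(list(members), weights=list(members.values()))[0]
        vertex_visits[v] += 1
    return vertex_visits, edge_visits


def repeated_walks(hb_edges, n_walks, **kwargs):
    """Accumulate the passage counts of n_walks independent walks."""
    vertex_total, edge_total = Counter(), Counter()
    for _ in range(n_walks):
        vertex_visits, edge_visits = modified_random_walk(hb_edges, **kwargs)
        vertex_total.update(vertex_visits)
        edge_total.update(edge_visits)
    return vertex_total, edge_total
```

With all multiplicities and weights set to 1, both random choices become uniform, which corresponds to the classical random walk used as a baseline.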
We can remark in Fig. 17 that the correlation is a bit lower with a thousand classical random walks, because more vertices are ranked differently by the two approaches. Comparing Figs. 18 and 19, we can see that the ranks in the classical random walk rely more on the degree (Fig. 19) than on the m-degree (Fig. 18), especially for vertices with small (m-)degrees; but there is still a misclassification for lower (m-)degree vertices.
We have compared the three methods from a computational time perspective; the results are shown in Table 1. The diffusion process is clearly faster. The modified random walk takes longer than the classical random walk, essentially because of the overhead due to the large number of divisions. Many optimizations can be foreseen to make this modified random walk run faster. The random walks can be easily parallelized, and so can the diffusion process. The number of iterations in the diffusion process can also be optimized. These issues will be addressed in future work.

Application to Arxiv querying
We use the standard Arxiv API 3 to perform searches on the Arxiv database. When performing a search, the query is transformed into a vector of words which is the basis for the retrieval of documents. The most relevant documents are retrieved based on a similarity measure between the query vector and the word vectors associated with individual documents. Arxiv relies on Lucene's built-in Vector Space Model of information retrieval and the Boolean model. 4 The Arxiv API returns the metadata associated with the documents that have the highest scores for the query performed. This metadata, filled in by the authors when submitting a preprint, contains information such as authors, Arxiv categories and abstract.

Fig. 17 Comparison of the rank obtained by a thousand classical random walks after total discovery of the vertices in the hb-graph and rank obtained with 200 iterations of the exchange-based diffusion process
We process these abstracts using TextBlob, a natural language processing Python library 5 and extract the nouns using the tagged text.
Nouns in the abstract of each document are scored with TF-IDF, the Term Frequency - Inverse Document Frequency. Although it is a classical measure, we recall its definition here: $\mathrm{TF\text{-}IDF}(x, d) = \mathrm{TF}(x, d) \times \mathrm{IDF}(x, d)$, with $\mathrm{TF}(x, d)$ the relative frequency of $x$ in $d$ and $\mathrm{IDF}(x, d)$ the inverse document frequency.
Writing $n_d$ the total number of terms in document $d$ and $n_x$ the number of occurrences of $x$ in $d$: $\mathrm{TF}(x, d) = \frac{n_x}{n_d}$. Writing $N$ the total number of documents and $n_{x \in d}$ the number of documents having an occurrence of $x$, we have, in its usual form: $\mathrm{IDF}(x, d) = \log \frac{N}{n_{x \in d}}$.

Scoring each noun in each abstract of the retrieved documents generates a hb-graph $H_Q$ whose universe is the set of nouns contained in the abstracts. Each hb-edge contains a multiset of nouns extracted from a given abstract, with a multiplicity function that represents the TF-IDF score of each noun.
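As an illustration of this construction, the following Python sketch builds one hb-edge per abstract with TF-IDF scores as multiplicities. It assumes the common form IDF = log(N / n_{x∈d}) with a natural logarithm, and the input format (a dictionary of already extracted noun lists) and the function name are hypothetical. Nouns occurring in every document get a zero score and are left out of the hb-edges.

```python
import math
from collections import Counter


def tfidf_hb_graph(docs):
    """Build a hb-graph from noun lists.

    docs: dict document id -> list of nouns, with repetitions (assumed input)
    Returns: dict document id -> {noun: TF-IDF score}, one hb-edge per document.
    """
    n_docs = len(docs)
    # n_{x in d}: number of documents containing at least one occurrence of x
    doc_freq = Counter()
    for nouns in docs.values():
        doc_freq.update(set(nouns))

    hb_edges = {}
    for doc_id, nouns in docs.items():
        counts = Counter(nouns)   # n_x for each noun x of the document
        total = len(nouns)        # n_d
        hb_edge = {}
        for x, n_x in counts.items():
            # assumption: IDF = log(N / n_{x in d}), natural logarithm
            score = (n_x / total) * math.log(n_docs / doc_freq[x])
            if score > 0:
                hb_edge[x] = score
        hb_edges[doc_id] = hb_edge
    return hb_edges
```

For instance, with `docs = {"d1": ["graph", "graph", "walk"], "d2": ["graph", "image"]}`, the noun "graph" appears in both documents and is dropped, while "walk" and "image" are kept as members of their respective hb-edges.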
The exchange-based diffusion process is then applied to the hb-graph $H_Q$. We show two typical examples for the same query: the first 50 results in Fig. 20 and the first 100 results in Fig. 21. The number of iterations needed to reach convergence is less than 10 in these two cases; with 500 results, around 10 iterations are needed for all hb-edges but one, for which 30 iterations are needed.

Fig. 18 Comparison of the rank obtained by a thousand classical random walks after total discovery of the vertices in the hb-graph and m-degree of vertices
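A two-phase exchange of this kind can be sketched in Python as follows. This is only an illustrative sketch and not the algorithm of the paper: it assumes that vertices pass their value to incident hb-edges in proportion to w_e(e) m_e(v) / deg_{w,m}(v), that hb-edges pass it back in proportion to m_e(v) / #_m(e) (the same proportions as the transition probabilities of the modified random walk), and that the initial value is uniform over the vertices. The function name and the dictionary representation are hypothetical.

```python
def exchange_diffusion(hb_edges, weights=None, n_iter=50):
    """Two-phase exchange between vertices and hb-edges (illustrative sketch).

    hb_edges: dict hb-edge -> {vertex: multiplicity > 0}  (assumed representation)
    weights:  dict hb-edge -> weight; defaults to 1 for every hb-edge
    Returns (alpha, epsilon): values held by the vertices after the last full
    step, and by the hb-edges at the last half step.
    """
    weights = weights or {e: 1.0 for e in hb_edges}
    vertices = sorted({v for members in hb_edges.values() for v in members})

    # weighted m-degree of each vertex and m-cardinality of each hb-edge
    deg = {v: 0.0 for v in vertices}
    for e, members in hb_edges.items():
        for v, m in members.items():
            deg[v] += weights[e] * m
    card = {e: sum(members.values()) for e, members in hb_edges.items()}

    # assumption: uniform initial distribution over the vertices
    alpha = {v: 1.0 / len(vertices) for v in vertices}
    epsilon = {e: 0.0 for e in hb_edges}
    for _ in range(n_iter):
        # first phase: vertices give their whole value to incident hb-edges
        epsilon = {
            e: sum(weights[e] * m / deg[v] * alpha[v] for v, m in members.items())
            for e, members in hb_edges.items()
        }
        # second phase: hb-edges give their whole value back to their vertices
        alpha = {v: 0.0 for v in vertices}
        for e, members in hb_edges.items():
            for v, m in members.items():
                alpha[v] += m / card[e] * epsilon[e]
    return alpha, epsilon
```

Under these assumptions, the total value is conserved at each half step, so the vertex values and the hb-edge values each sum to 1, and hb-edges (here, documents) can be ranked by their value.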
As the hb-edges correspond to documents in the Arxiv database, we compare the central documents obtained in the results of the queries: we observe that the ranking obtained from the diffusion values at step $49+\frac{1}{2}$ differs significantly from the relevance ranking given by the Arxiv API. In the exchange-based diffusion, the ranking sorts documents depending on their respective word weights and their centrality, as we have seen in the experimental part on random hb-graphs.

Fig. 19 Comparison of the rank obtained by a thousand classical random walks after total discovery of the vertices in the hb-graph and degree of vertices

Table 1 Time taken for doing 100, 200, 500 and 1000 iterations of the diffusion algorithm and 100, 200, 500 and 1000 classical and modified random walks on different hb-graphs
Moreover, we have observed that when the number of retrieved results increases, the top 5 and top 10 documents sometimes change drastically, as newly retrieved documents may be more central with respect to the words they contain. While the gap seems small when few documents are retrieved, it widens as the number of documents increases. Increasing the number of results reveals the full theoretical hb-graph obtained from the whole dataset of the query performed, and hence reveals the subjects central to this dataset.

Application to an image database
We have applied the exchange-based diffusion to a database of images. We have used a hb-graph modeling of the objects detected in individual images to build a network of co-occurrences. Each image has been processed using a RetinaNet neural network to label the objects it contains, and each object is then counted in its own category. The database used is the 2014 training set of the COCO dataset 6 [18]. A pre-trained RetinaNet 7 provides bounding boxes corresponding to concepts, each with an associated probability. We then choose a threshold below which we reject the bounding box: it has been fixed at 0.5, as proposed by the library developer. Hence, we can associate to each image its concepts and their multiplicities.
Two hb-graphs can be built. The first is the hb-graph of images $H_{Im}$, where the vertex set is constituted of the different concepts (objects) that the images hold and where each hb-edge is related to an image, regrouping its concepts with their respective multiplicities. The second is the hb-graph of concepts $H_{Co}$: the vertex set corresponds to the image set, and each hb-edge regroups the images holding a given concept, with a multiplicity that corresponds to the number of times the concept occurs in the image. These two hb-graphs are dual of each other. We now focus on the hb-graph of images.
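The two constructions can be sketched in Python as follows. The detection format (a list of concept and probability pairs per image) and the function names are assumptions made for illustration; they do not correspond to the output format of any particular detection library.

```python
from collections import Counter


def image_hb_graph(detections, threshold=0.5):
    """Hb-graph of images: one hb-edge per image, vertices are concepts.

    detections: dict image id -> list of (concept, probability) pairs,
                one pair per bounding box (assumed input format)
    Bounding boxes with a probability below the threshold are rejected.
    """
    h_im = {}
    for image_id, boxes in detections.items():
        counts = Counter(concept for concept, prob in boxes if prob >= threshold)
        if counts:  # images without any retained concept yield no hb-edge
            h_im[image_id] = dict(counts)
    return h_im


def dual_hb_graph(hb_edges):
    """Swap the roles of vertices and hb-edges, keeping the multiplicities."""
    dual = {}
    for e, members in hb_edges.items():
        for v, m in members.items():
            dual.setdefault(v, {})[e] = m
    return dual
```

Applying `dual_hb_graph` to the hb-graph of images gives the hb-graph of concepts, and applying it twice gives back the hb-graph of images.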
198 images of the COCO 2014 training dataset have been randomly selected to build the original image hb-graph. To ensure connectivity, only the main connected component of the original image hb-graph is kept: it is constituted of 175 images. This component is designated as the hb-graph in the remainder. We then apply the diffusion on this connected hb-graph. A typical result is presented in Fig. 22: the concepts are the vertices, and the images are the extra-vertices corresponding to the hb-edges. The coloration of vertices (the nodes of the concepts) and of hb-edges (the extra-nodes representing images) is the same as the one used in Fig. 13. Images containing persons are more reddish than images without persons, as the concept of person is central to this component. But many of the images with persons that are highlighted in red contain other concepts that are also seen as important. This is the case for image Ref. 237245 in Fig. 23, which shows one person with one TV, two concepts that are central. Nonetheless, if the second concept is less important, as for the image in Fig. 23 which contains 2 persons and 2 surfboards, the image is seen as less important, since the concept of surfboard is less important than that of TV. It is worth mentioning that images containing a lot of persons are not systematically highlighted in red: for instance, image Ref. 348954 in Fig. 23, with 7 persons, 1 bicycle, 1 traffic light and 1 backpack, is seen as less important than image Ref. 347167 in Fig. 23, with 8 persons, 2 cups and 1 laptop. The closer to red the images are, the more central they are to the sample drawn; hence, these images can potentially be used to summarize this sample, for instance by selecting the top 20% of images according to their importance in the exchange-based diffusion, given by the c(e)-value calculated from the diffusion process, as shown in Fig. 24.
This summarization strategy can be refined with more complex strategies in order to fully cover the dataset concepts: this is left as future work.
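Two of the steps described above, keeping only the main connected component of the image hb-graph and selecting the top 20% of images by importance, can be sketched in Python as follows. The function names are hypothetical, the size of a component is assumed to be measured by its number of hb-edges (images), and the scores are assumed to be given as a dictionary of hb-edge values produced by the diffusion.

```python
import math


def largest_component(hb_edges):
    """Keep the hb-edges of the largest connected component.

    hb_edges: dict hb-edge -> {vertex: multiplicity}  (assumed representation)
    Two hb-edges are connected when they share at least one vertex.
    """
    incident = {}
    for e, members in hb_edges.items():
        for v in members:
            incident.setdefault(v, []).append(e)

    seen, best = set(), []
    for start in hb_edges:
        if start in seen:
            continue
        # depth-first traversal over hb-edges through shared vertices
        component, stack = [], [start]
        seen.add(start)
        while stack:
            e = stack.pop()
            component.append(e)
            for v in hb_edges[e]:
                for other in incident[v]:
                    if other not in seen:
                        seen.add(other)
                        stack.append(other)
        if len(component) > len(best):
            best = component
    return {e: hb_edges[e] for e in best}


def top_fraction(scores, fraction=0.2):
    """Return the hb-edges with the highest scores (at least one if any)."""
    k = max(1, math.ceil(fraction * len(scores)))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The selection in `top_fraction` is a plain ranking cut; it does not try to cover all the concepts of the dataset, which is the refinement mentioned above as future work.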

Future work and conclusion
Through this study, hb-graphs, by enabling hb-edge-based multiplicities of elements, have proven to be efficient in retrieving the important part of a co-occurrence network. The proposed two-phase step diffusion makes it possible to retrieve information not only for vertices but also for hb-edges. The two use cases show the potential of such approaches.
Different applications can be envisaged, in particular in the search of tagged multimedia documents, for refining the results and the scoring of the documents retrieved. Ranking tagged documents by this means could help in creating visualisation summaries. Our approach is seen as a strong basis to refine the approach of [41]; it can also be viewed as a means to perform query expansion and disambiguation by using additional highly scored words in the network, and as a way of making recommendations based on the scoring of a document according to its main words.
Jean-Marie Le Goff is a senior applied physicist and computer scientist who focuses on applying advanced IT techniques and concepts to Particle Physics. He first studied how to move objects over the internet using CORBA to service the control system middleware of the Large Hadron Collider (LHC) experiments at CERN. He then worked on extending the concepts of Unified Modeling Language (UML) layers with a description-driven layer for classes and objects, which led to the development of a software package (C.R.I.S.T.A.L.) dedicated to the tracking and assembly of detector parts. This versatile software found applications outside particle physics, in particular in industry as Enterprise Resource Planning (ERP) software and Business Process Management (BPM), and in accounting and finance. He is currently working on the use of emerging graph, semantic and structural abstraction techniques for data management and visualization, in conjunction with techniques acquired in his previous work. This led to the development of the Collaboration Spotting software, a generic platform for visual analytics of complex datasets. The platform is being used to build various demonstrators, including for compatibility and dependency relationships in software and metadata of an experiment at CERN, in scientometrics with publication and patent information, pharmaco-analytics and neurosciences. Jean-Marie Le Goff holds a PhD in experimental particle physics and a DPhil in Computer Science. From 06/2003 to 06/2009, he was Visiting Professor at the University of the West of England, Bristol, UK.
Stéphane Marchand-Maillet is Professor in the Department of Computer Science at the University of Geneva, Switzerland, where he leads the Viper group. His research is directed towards large-scale, high-dimensional distributed machine learning and information mining and indexing, with applications to data modelling and prediction. He has authored, co-authored or edited a number of publications on these topics. He and his group are part of several national, European and international projects in the domain. He is Senior PC Member of the International Joint Conference on AI (IJCAI, one of the oldest established conferences in AI). He was general co-chair of the International Conference of the ACM-SIG on Information Retrieval in 2010 (ACM-SIGIR 2010) and general co-chair of the 16th IEEE Conference in Business Informatics in 2014 (IEEE-CBI 2014).