Loguinov et al. [33] proposed some guidelines for incremental construction of a de Bruijn graph. However, the paper stops short of an actual implementation, as it does not address the problem of maintaining the de Bruijn structure in the presence of churn, that is, the dynamics of nodes joining and leaving.
For the purpose of implementation, we choose the parameters K = 8 and D = 8 for the de Bruijn graph. This allows about 16.7 million (8^8 = 16,777,216) nodes in the network, labeled with octal strings from “00000000” to “77777777”. Despite this size, the diameter of the de Bruijn graph remains O(1) (8 to be precise, since the diameter equals D). The DHT overlay in our system therefore supports efficient look-ups, and an epidemic dissemination-based protocol can be designed to infect all the nodes in just a few rounds.
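The parameter choice can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `node_count`, `to_label` and `out_neighbors` are hypothetical and do not come from the implementation described here. It relies on the standard de Bruijn rule that a node's out-neighbors are obtained by dropping the first digit of its label and appending any one of the K digits.

```python
K = 8  # alphabet size: labels are written with octal digits
D = 8  # label length, which is also the diameter of the graph


def node_count(k: int = K, d: int = D) -> int:
    """Number of virtual nodes in a de Bruijn graph with parameters (k, d)."""
    return k ** d


def to_label(value: int, k: int = K, d: int = D) -> str:
    """Write an integer ID as a d-digit base-k label (valid for k <= 10)."""
    digits = []
    for _ in range(d):
        value, remainder = divmod(value, k)
        digits.append(str(remainder))
    return "".join(reversed(digits))


def out_neighbors(label: str, k: int = K) -> list[str]:
    """Out-neighbors: shift the label left by one digit and append each digit."""
    return [label[1:] + str(digit) for digit in range(k)]


if __name__ == "__main__":
    print(node_count())                  # total number of virtual nodes
    print(to_label(node_count() - 1))    # largest label
    print(out_neighbors("00100000"))     # the K out-neighbors of one node
```

Each virtual node has exactly K outgoing edges, so the out-degree stays at 8 regardless of how many nodes the ID space holds.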
Node information structure
We refer to the nodes in the underlying de Bruijn graph as “virtual” nodes. To maintain the underlying graph, a physical node in our system is responsible for a range of virtual nodes with consecutive IDs. The range of a physical node is referred to as its zone. Initially, when a single physical node joins the DHT, it becomes responsible for all virtual nodes from 00000000 to 77777777. The virtual ID-space may be visualized as a ring, where each physical node is responsible for only an arc segment of the ring. Figure 10 illustrates an example of three physical nodes responsible for three different zones.
The structure is similar to Chord [48]. However, unlike in the Chord overlay, the ID of a physical node responsible for an arc of the ID space may lie anywhere within that arc. A node A has an outgoing edge to another node B if there is at least one virtual node in A’s zone having an outgoing edge to a virtual node in B’s zone. Each node keeps a list of its outgoing and incoming edges. For each node in these lists, it records the address and the zone ID. The details of the structure maintained at each node are given in Table 4.
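A minimal sketch of such per-node state is shown below. The class and field names (`Zone`, `PeerInfo`, `NodeState`) are hypothetical and are not taken from Table 4; the wrap-around handling in `contains` is also an assumption, motivated only by the ring view of the ID space.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Zone:
    """An arc of consecutive virtual IDs, inclusive at both ends."""
    start: int
    end: int

    def contains(self, virtual_id: int) -> bool:
        if self.start <= self.end:
            return self.start <= virtual_id <= self.end
        # Assumed case: the arc wraps past the largest ID back to zero.
        return virtual_id >= self.start or virtual_id <= self.end


@dataclass
class PeerInfo:
    """What a node records about each node it shares an edge with."""
    address: str
    zone: Zone


@dataclass
class NodeState:
    """State kept by one physical node (hypothetical layout)."""
    node_id: int
    address: str
    zone: Zone
    outgoing: list[PeerInfo] = field(default_factory=list)
    incoming: list[PeerInfo] = field(default_factory=list)


if __name__ == "__main__":
    first = NodeState(node_id=0o12345670, address="198.51.100.7:4000",
                      zone=Zone(0, 0o77777777))
    print(first.zone.contains(first.node_id))
```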
Table 4 Structure maintained at each node
Joining of nodes
When a node A tries to join, it selects a random ID and forwards a join request to identify the owner of the zone in which the chosen ID falls. For convenience of description, we use the following convention: any intermediate node receiving the join request initiated by A is referred to as B, while the node whose zone contains the random ID chosen by A is referred to as C. The joining procedure is split into three parts, according to the actions of A, B and C, as explained below.
Node A requests the rendezvous server (whose public IP address is known to all) to provide A’s external address (IP address) and a list of random peers with their addresses. A picks one peer randomly from the list supplied by the rendezvous server and sends a join request to that peer. The request contains A’s external address and a random virtual ID from the entire ID-space, i.e., 00000000 to 77777777.
The join request initiated by A tries to identify C, the owner of the random ID sent in the join request. When node A sends such a request for the first time, the ID is derived from the SHA-1 hash value of its external address. On retries, a random ID from the entire ID-space is picked instead.
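The ID selection can be sketched as follows. The text does not say how the 160-bit SHA-1 digest is mapped onto the 8-digit octal ID space, so the reduction modulo 8^8 used here is an assumption, as are the function names and the use of `secrets` for the random retries.

```python
import hashlib
import secrets

ID_SPACE = 8 ** 8  # number of virtual IDs for K = 8, D = 8


def initial_join_id(external_address: str) -> int:
    """First attempt: derive the ID from the SHA-1 hash of the address.

    Assumption: the digest is read as a big-endian integer and reduced
    modulo the size of the ID space.
    """
    digest = hashlib.sha1(external_address.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % ID_SPACE


def retry_join_id() -> int:
    """Retries: pick a uniformly random ID from the whole ID space."""
    return secrets.randbelow(ID_SPACE)


def choose_join_id(external_address: str, attempt: int) -> int:
    """attempt == 0 is the first request; any later attempt is a retry."""
    if attempt == 0:
        return initial_join_id(external_address)
    return retry_join_id()


if __name__ == "__main__":
    address = "203.0.113.5:4000"  # example address, not from the paper
    print(format(choose_join_id(address, 0), "08o"))
    print(format(choose_join_id(address, 1), "08o"))
```

Hashing the address makes the first attempt deterministic for a given peer, while the random retries avoid hitting the same busy zone owner again.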
On receiving the join request, B forwards it to the next node on the routing path. If a routing path is not provided in the request, B creates one using its own ID and the destination ID given in the join request. The resulting path is sent along with the join request.
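One way to build such a path is the textbook de Bruijn shift route: find the longest suffix of the source label that is also a prefix of the destination label, then shift in the remaining destination digits one at a time. The sketch below illustrates that standard technique; it is not necessarily the exact procedure used by B, and the function name is hypothetical.

```python
def de_bruijn_path(src: str, dst: str) -> list[str]:
    """Return the virtual-node labels visited when routing from src to dst.

    Both labels must have the same length D. The path has at most D hops.
    """
    if len(src) != len(dst):
        raise ValueError("labels must have the same length")
    d = len(src)
    # Longest suffix of src that equals a prefix of dst.
    overlap = 0
    for length in range(d, 0, -1):
        if src[d - length:] == dst[:length]:
            overlap = length
            break
    path = [src]
    current = src
    # Each hop drops the first digit and appends the next missing digit of dst.
    for digit in dst[overlap:]:
        current = current[1:] + digit
        path.append(current)
    return path


if __name__ == "__main__":
    print(de_bruijn_path("0012", "1230"))
    print(de_bruijn_path("00000000", "77777777"))
```

In the overlay, each label on the path would be handled by whichever physical node owns the zone containing it, so consecutive labels owned by the same node cost no network hop.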
On receiving the join request from A (possibly through intermediate nodes), C replies to A with its zone ID and its incoming and outgoing links. C does not accept further join requests until the joining of A is complete or a timeout of 10 seconds occurs.
The reply from C contains a structure holding the virtual node label of C, the external address of C, and the outgoing and incoming edges of C. C then waits for A to complete the joining process and to send information about A’s new zone. After that, C updates its own zone by sending A the keys (with their values) that A will now manage, notifying its neighbors about the change in zone, and dropping the edges that no longer exist because its zone has shrunk.
Next, A takes the half of C’s zone that does not contain C’s ID and chooses a random ID (label) from that half as its own ID. A sends its ID and its zone ID to C, and the attachment of node A to the DHT overlay is complete. A needs to perform the following two actions to complete the joining process: (i) disseminate its ID and zone information over all the links shared by C; (ii) identify the links to be dropped, check whether there is any new link to C, and make the corresponding changes to its lists of incoming and outgoing edges.
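The zone split can be sketched as below. This assumes that zones are integer ranges that do not wrap around the end of the ID space, that the zone holds at least two IDs, and that "half" means splitting at the integer midpoint; the function name and return layout are hypothetical.

```python
import random


def split_zone(start: int, end: int, owner_id: int, rng=random):
    """Split the zone [start, end] owned by C for a joining node A.

    Returns (kept, given, new_id): the half C keeps (the one containing
    owner_id), the half handed to A, and a random ID for A inside its half.
    """
    if not start <= owner_id <= end:
        raise ValueError("owner_id must lie inside the zone")
    if start == end:
        raise ValueError("a zone with a single ID cannot be split")
    mid = (start + end) // 2
    lower, upper = (start, mid), (mid + 1, end)
    if owner_id <= mid:
        kept, given = lower, upper
    else:
        kept, given = upper, lower
    new_id = rng.randint(given[0], given[1])  # inclusive at both ends
    return kept, given, new_id


if __name__ == "__main__":
    kept, given, new_id = split_zone(0, 0o77777777, 0o12345670)
    print([format(v, "08o") for v in kept])
    print([format(v, "08o") for v in given])
    print(format(new_id, "08o"))
```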
Finally, A accepts the load shared by C. Due to space limitations, the algorithm is not included here. Interested readers can review the Send Join Request and Receive Join Request algorithms in Appendix ?? and Appendix ??, respectively.
Join example
To keep the example small, we show the node join process in a de Bruijn graph with K = 2 and D = 4. Figure 11a and b illustrate the joining process of the first two nodes in the system.
The joining of the third node is shown in Fig. 12a. Notice that it requires the removal of the link A→B (dashed line). For the joining of the fourth node, two new links must be inserted, as shown by the dotted lines in Fig. 12b.
Leaving of nodes
Consider a node A leaving the system. A node C is identified that can merge the zone of A into its own zone. The leave process has two parts, described below, and it relies on the leaving node being aware of both its predecessor and its successor in the DHT overlay.
A identifies a successor zone and a predecessor zone. The first virtual ID belonging to the successor zone is equal to A’s Zone.endID + 1. Similarly, the last virtual ID belonging to the predecessor zone is equal to A’s Zone.startID − 1. A picks one of the two IDs, finds the owner of the corresponding zone, say C, and sends it a leave request. The leave request contains A’s ID, zone, and incoming and outgoing links. If C agrees within the timeout period of 5 seconds, A sends its load to C and the leave process is complete. Otherwise, A tries the same with the other neighboring node. If the timeout occurs again, the entire procedure is repeated, on the assumption that the chosen nodes were busy with another leave procedure. Algorithm 8 in Appendix C specifies the precise steps executed by A.
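The selection of the two neighboring IDs and the retry loop might look as follows. This is a sketch and not a transcription of Algorithm 8: the wrap-around modulo the ID-space size is an assumption based on the ring view of the ID space, trying the successor first is an arbitrary choice, the bound `max_rounds` is an assumption (the text gives no limit on repetitions), and `request_leave` is a hypothetical callback standing in for the network exchange.

```python
from typing import Callable, Optional

ID_SPACE = 8 ** 8        # number of virtual IDs for K = 8, D = 8
LEAVE_TIMEOUT_S = 5.0    # timeout stated in the text


def leave_probe_ids(zone_start: int, zone_end: int) -> tuple[int, int]:
    """Return (first ID of the successor zone, last ID of the predecessor zone)."""
    successor_first = (zone_end + 1) % ID_SPACE
    predecessor_last = (zone_start - 1) % ID_SPACE
    return successor_first, predecessor_last


def try_leave(zone_start: int, zone_end: int,
              request_leave: Callable[[int, float], bool],
              max_rounds: int = 3) -> Optional[int]:
    """Ask each neighbor in turn to take over the zone.

    request_leave(virtual_id, timeout) should return True if the owner of
    virtual_id agreed within the timeout. Returns the ID whose owner agreed,
    or None if every round failed.
    """
    for _ in range(max_rounds):
        for virtual_id in leave_probe_ids(zone_start, zone_end):
            if request_leave(virtual_id, LEAVE_TIMEOUT_S):
                return virtual_id
    return None


if __name__ == "__main__":
    # Stub network: only the predecessor's owner agrees.
    accepted = try_leave(0o40000000, 0o77777777,
                         lambda vid, timeout: vid == 0o37777777)
    print(format(accepted, "08o"))
```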
Node C receiving the request checks if it is going to leave the system. If not, it accepts the request and agrees to take over the load by merging the two zones. The merging process includes merging of the incoming and outgoing links as well. C then notifies all the linked nodes regarding the update. Algorithm 9 in Appendix D gives a step-wise description of C’s operations.
During join and leave, nodes whose zones change notify the nodes linked to them by outgoing or incoming edges. On receiving such a notification, a node adds or drops the links affected by the change. Every two minutes, each node sends a keep-alive message to all linked nodes. These nodes update the timestamp stored with the zone information of the sender. If no update has been received from a linked node in the last five minutes, that node is considered dead, and the owner of the successor zone takes responsibility for the orphaned zone.
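The liveness rule reduces to a timestamp comparison. The sketch below uses the two intervals given in the text; the dictionary layout (peer address mapped to the time of its last keep-alive, in seconds) and the function names are assumptions.

```python
KEEPALIVE_INTERVAL_S = 2 * 60   # a keep-alive is sent every two minutes
DEAD_AFTER_S = 5 * 60           # silence for five minutes marks a peer dead


def record_keepalive(last_seen: dict[str, float], peer: str, now: float) -> None:
    """Refresh the timestamp stored for a linked peer."""
    last_seen[peer] = now


def dead_peers(last_seen: dict[str, float], now: float) -> list[str]:
    """Return linked peers from which no update arrived in the last five minutes."""
    return [peer for peer, seen in last_seen.items() if now - seen > DEAD_AFTER_S]


if __name__ == "__main__":
    seen: dict[str, float] = {}
    record_keepalive(seen, "peer-a", now=0.0)
    record_keepalive(seen, "peer-b", now=250.0)
    print(dead_peers(seen, now=301.0))
```

With a two-minute sending interval, a peer is declared dead only after at least two consecutive keep-alives have been missed, which tolerates a single lost message.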
If there is an outgoing edge from a virtual node in one zone to a virtual node in another zone, then there is an outgoing edge from the node owning the first zone to the node owning the second. The zones of neighboring nodes change frequently, and a brute-force search for such an edge in the underlying graph would slow down join and leave operations.
Suppose node A receives a zone update from another node B. A first checks whether the size of its own zone is larger than or equal to N/K, where N = K^D is the number of virtual nodes. If so, an edge from A to B exists (because every possible suffix of length D − 1 is present in A’s zone). Otherwise, A determines the range [LA,RA] of the suffixes of length D − 1 occurring in its own zone and the range [LB,RB] of the prefixes of length D − 1 occurring in B’s zone. If LA ∈ [LB,RB] or RA ∈ [LB,RB], then there exists an edge from node A to node B. Similarly, there exists an edge if LB or RB lies within [LA,RA]. The following examples show how range overlap is used to determine connectivity.
- Consider two zones: node A’s zone [‘00000000’, ‘17777777’] and node B’s zone [‘40000000’, ‘77777777’]. Since the size of node A’s zone is larger than ‘10000000’, it contains all possible suffixes of length D − 1. Therefore, A has a link to every node in the system. The same holds for B, and hence both have outgoing edges to each other.
- Consider two other zones: node A’s zone [‘00000000’, ‘00777777’] and node B’s zone [‘01000000’, ‘01777777’]. The range of suffixes of length D − 1 for node A is [‘0000000’, ‘0777777’]. The range of prefixes of length D − 1 for node B is [‘0100000’, ‘0177777’]. The intersection of the two ranges is non-empty, and hence an edge exists from node A to node B. (In the underlying graph, ‘00100000’ has an outgoing edge to ‘01000000’.)
- For the same two zones, the range of suffixes of length D − 1 for B is [‘1000000’, ‘1777777’]. The range of prefixes of length D − 1 for A is [‘0000000’, ‘0077777’]. The intersection of the two ranges is empty, and hence no edge can exist from B to A.
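The range-overlap test and the examples above can be reproduced with integer arithmetic: for an ID x, the suffix of length D − 1 is x mod K^(D−1) and the prefix of length D − 1 is x div K. The sketch below is written under two assumptions not stated in the text: zones are inclusive integer ranges that do not wrap around the end of the ID space, and a suffix range that wraps past K^(D−1) − 1 is handled by splitting it into two pieces. The function names are hypothetical. The endpoint-membership conditions given in the text are equivalent to the usual interval-overlap check used in `intervals_overlap`.

```python
K, D = 8, 8
N = K ** D             # number of virtual nodes
M = K ** (D - 1)       # N / K: number of distinct suffixes of length D - 1


def intervals_overlap(lo1: int, hi1: int, lo2: int, hi2: int) -> bool:
    """True if the closed intervals [lo1, hi1] and [lo2, hi2] intersect."""
    return lo1 <= hi2 and lo2 <= hi1


def has_edge(a_zone: tuple[int, int], b_zone: tuple[int, int]) -> bool:
    """True if some virtual node in a_zone has an out-edge into b_zone."""
    a_start, a_end = a_zone
    b_start, b_end = b_zone
    # A zone of at least N/K consecutive IDs contains every possible suffix.
    if a_end - a_start + 1 >= M:
        return True
    la, ra = a_start % M, a_end % M       # suffix range of A's zone
    lb, rb = b_start // K, b_end // K     # prefix range of B's zone
    if la <= ra:
        return intervals_overlap(la, ra, lb, rb)
    # Assumed case: the suffix range wraps, so test both pieces.
    return (intervals_overlap(la, M - 1, lb, rb)
            or intervals_overlap(0, ra, lb, rb))


if __name__ == "__main__":
    # First example: both zones are at least N/K in size.
    print(has_edge((0o00000000, 0o17777777), (0o40000000, 0o77777777)))
    print(has_edge((0o40000000, 0o77777777), (0o00000000, 0o17777777)))
    # Second and third examples: edge from A to B, but not from B to A.
    a, b = (0o00000000, 0o00777777), (0o01000000, 0o01777777)
    print(has_edge(a, b))
    print(has_edge(b, a))
```

The check costs a constant number of comparisons per neighbor, in contrast to enumerating the virtual nodes of a zone.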