An agent-based framework for performance modeling of an optimistic parallel discrete event simulator
Authors
Abstract
Purpose
The performance of an optimistic parallel discrete event simulator (PDES), in terms of the total simulation execution time of an experiment, depends on a large set of variables. Many of them have a complex and generally unknown relationship with the simulation execution time. In this paper, we describe an agent-based performance model of a PDES kernel of the kind typically used to simulate large, complex networks on multiple processors or machines. The agent-based paradigm greatly simplifies the modeling of system dynamics by representing a component logical process (LP) as an autonomous agent that interacts with other LPs through event queues and with its environment, which comprises the processor on which it resides.
Method
We model each LP with a “base” agent class that captures a generic behavioral model and can be extended to model LP behavior in more detail. The base class focuses only on the details that most likely influence the overall simulation execution time of the experiment.
Results
We apply this framework to study a local incentive-based partitioning algorithm in which each LP makes an informed local decision about its assignment to a processor, resulting in a system akin to a self-organizing network. The agent-based model allows us to study the overall effect of the local incentive-based cost function on the simulation execution time of the experiment, which we consider to be the global performance metric.
Conclusion
This work demonstrates the utility of the agent-based approach in modeling a PDES kernel in order to evaluate the effects of a large number of variable factors, such as the LP graph properties and the load balancing criteria, on the total simulation execution time of an experiment.
Keywords
Agent-based modeling, Parallel simulation, Self-organizing system, Game theory, Load balancing
Background
Parallel and distributed simulation techniques allow us to create and run large-scale models that also permit fine-grained modeling of the simulated entities. For example, the rapid growth of large-scale communication networks, particularly the Internet and the online social networks it supports, has brought an increased emphasis on the need to model and simulate them in order to understand their macroscopic behavior and to design and test defenses against network attacks, e.g., denial-of-service and Border Gateway Protocol (BGP) attacks (Sriram et al. 2006). It has been widely observed that in many cases the results of such experiments are affected significantly by the chosen level of abstraction. For example, Chertov and Fahmy (2011) studied the need to model forwarding devices such as switches or routers and the sensitivity of the experimental results to their level of detail in the model. Parallel and distributed simulation distributes the simulation tasks across multiple machines and, in this way, exploits their combined resources. A challenge here is to maintain synchronization among collaborating machines in order to preserve the causality of processed events while at the same time attaining a significant speedup.
Most parallel simulation mechanisms use the concept of a logical process (LP), a software component that runs in parallel with other LPs via processor time sharing or by running simultaneously on different processors in the case of multiprocessor systems. Each LP has a set of local variables that define its state. The state of the simulation at a given time is defined by the combined state of all LPs at that time. LPs communicate with each other using interprocess communication mechanisms such as shared memory or pipes. For the specific case of network simulation, a simulator kernel creates a distinct LP as a computational unit for each element of the simulated network model (Nicol and Fujimoto 1994). For example, all the properties and methods associated with a simulated router are executed and maintained locally in a distinct LP that represents it. This aids in the distribution of computation load across machines and also restricts an LP's communication to its “neighboring” set of LPs (i.e., its neighbors in the network topology under simulation). In this way, we can represent the LPs and their interdependencies as a graph.
In this work, we use an agent-based framework to simulate the behavior of a PDES (parallel discrete event simulator) kernel wherein each LP is represented by an autonomous agent with its own set of distinct local variables. The agent-based modeling paradigm of LPs is novel and especially appropriate in the context of optimistic synchronization among LPs, which requires each LP to maintain a local simulation state and use messages or events to communicate and synchronize with other LPs. Agent-based modeling allows us to study the resulting global performance metrics of the simulator kernel. The total time required for simulation is a nontrivial and complex (in many cases nondeterministic) function of various system parameters. The dynamics of an optimistic PDES kernel are easier to describe at the level of an LP in terms of its event generation and processing behavior. These local dynamics, their causal effects and the effect of state rollbacks have a direct consequence on the global performance metric of the system, which in this case is the total simulation wallclock time of the experiment. We use this framework to evaluate an LP assignment scheme in which each LP makes its own informed choice of a processor based on a novel incentive-based cost criterion. The agent-based modeling framework also makes it convenient to implement our assignment scheme, since the scheme is based on a local cost criterion at the level of LPs, which are the agents in our model. This is akin to a self-organizing network of LP agents that form clusters across the machines.
In the following subsections we detail some important background ideas and then develop our agent-based model of the simulator in the sections thereafter.
Need for parallel and distributed simulation
Networks in the form of social connections, information-carrying links, collaborative associations and reputation systems are ubiquitous. Large and complex networks result from the association of entities distinct in character, preferences, and inherent design. Network modeling is essential for understanding the behavior of a combined system of heterogeneous individual nodes that interact with each other. It is used, e.g., to test routing and forwarding protocols and defenses against network attacks, and to study community structures in social networks. A typical communication network may contain tens of thousands of nodes with a large number of links. The nodes in reality can be whole routers, CPUs, or switches, individual router ports, firewalls, or end systems. Sriram et al. (2006) studied the large-scale impact of BGP peering session attacks that can cause cascading failures, permanent route oscillations or gradually degrading router behavior, using an Autonomous System (AS) level topology that was downsampled from a typical AS-level topology consisting of 23,000 ASs and 96,000 BGP peering links (Zhang 2013). There is a fidelity/complexity tradeoff when simulating such a network. Some methods scale down the network under study (Psounis et al. 2003) by omitting some intricate details and cleverly choosing the parameters expected to maximally affect the simulation results, so as to minimize the loss of fidelity. Papers such as (Gupta et al. 2008) introduced new methods to represent groups of nodes by a single node, while (Carl and Kesidis 2008) studied path-preserving scaled-down topologies for large-scale testing of routing protocols. Dimitropoulos et al. (2009) suggested using annotations on a simple unweighted, undirected graph to represent the original network.
However, the reliability of scaleddown methodologies depends largely on the assumption of low sensitivity of the outcome of the experiment to certain microscopic factors which may be at best crudely modeled by scaledown. An important macroscopic behavior of a network might be due (in an a priori unknown way) to a microscopic behavior ignored in the simulated model, thus resulting in inaccurate simulation results.
Parallel and distributed simulation: synchronization for causality
With the advancement of distributed processing systems, computing power increased and one could thus feasibly represent networks using more refined models. Simultaneously, the theory of distributed simulation developed alongside practical implementations of simulators such as PARSEC (Bagrodia et al. 1998). In the case of network simulation, one can represent each node in a network by a logical process (LP). An LP is an object that consists of a set of variables known as its “state”, along with functions specific to the type of node being modeled. It normally maintains an event list containing timestamped events to be processed, where each event is stamped with its time of execution (in simulation time). The local variables of an LP change when an event is processed, and their values at any given time define the state of the LP at that time. An LP maintains a local (simulation) time variable that contains the time stamp of the event currently being processed (if busy) or of the most recent event processed (if idle).
The methods to synchronize event execution between LPs running in parallel can be classified into two major types: conservative (Chandy and Misra 1981) and optimistic, e.g., Time Warp (Jefferson 1985). In the conservative methods, LPs strictly follow time causality, i.e., all events are processed in the strict order of their time stamps. Each LP tries to ensure, before processing an event A, that no other event, say B, with a time stamp less than that of A will arrive at the LP after event A has been processed. This time causality is enforced by assuming that the graph topology of LPs is fixed and known beforehand. LPs communicate via messages, and each message-carrying link guarantees that messages are delivered in the order of their time stamps. Synchronizing null messages are exchanged between LPs to assure each other that no event with a time stamp less than a specified time will be sent. Optimistic synchronization, in contrast, allows non-causal event execution. An LP is allowed to process events “ahead of time” without any synchronization assurance. Each LP maintains its own local time and processes the event with the lowest time stamp in its event list. If an LP receives an event with a time stamp less than its local time, it rolls back to the time stamp of that event. Rolling back means restoring a state prior to the time stamp of the event that triggered the rollback. In order to roll back to a prior state, an LP must archive the past history of its states and events. The combined system of LPs maintains a global variable, the global time, which is equal to the minimum local time across all the LPs. Hence the global time is indicative of the overall progress of the simulation.
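The rollback mechanism described above can be sketched in a few lines. The following Python fragment is an illustrative toy, not part of any real PDES kernel: the class name, the dictionary state, and the integer payload are all assumptions made for the example.

```python
import copy

class OptimisticLP:
    """Minimal sketch of an optimistically synchronized logical process.

    State is archived before each event so the LP can roll back when a
    straggler (late) event arrives. Names here are illustrative.
    """

    def __init__(self):
        self.local_time = 0
        self.state = {"count": 0}
        self.history = []          # (timestamp, archived state) pairs

    def process(self, timestamp, payload):
        if timestamp < self.local_time:
            self.rollback(timestamp)        # straggler: undo optimistic work
        self.history.append((timestamp, copy.deepcopy(self.state)))
        self.local_time = timestamp
        self.state["count"] += payload      # stand-in for real event handling

    def rollback(self, to_time):
        # Restore the newest archived state with timestamp < to_time.
        while self.history and self.history[-1][0] >= to_time:
            _, self.state = self.history.pop()

    def fossil_collect(self, global_time):
        # States older than the global time can never be rolled back to.
        self.history = [h for h in self.history if h[0] >= global_time]

lp = OptimisticLP()
for t, v in [(1, 10), (5, 10), (9, 10)]:
    lp.process(t, v)
lp.process(3, 10)          # straggler forces rollback of events at t=5, 9
print(lp.local_time, lp.state["count"])   # → 3 20
```

Note how the straggler at t = 3 discards the optimistic work done at t = 5 and t = 9, leaving only the effects of the events at t = 1 and t = 3.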
Using agentbased framework for performance modeling of a PDES kernel
A system modeled using an agent-based framework is characterized or described in terms of the actions, behavior, beliefs, etc. of an individual autonomous agent within a large group of such agents. To a large extent, the paradigm of agent-based modeling can be said to use a bottom-up approach to characterize, represent, predict or recreate a system or a phenomenon. However, agents have also been used to drive large-scale software systems (Jennings 2001) where the architecture is specified using a top-down approach. A common underlying theme motivating agent-based modeling is that most complex phenomena observed in the physical world are the consequence of a much simpler set of rules governing the dynamics of the large number of constituent entities of which they are composed. For example, Axelrod's model of cultural dissemination (Axelrod 1997) tries to explain the consensus towards cultures and the simultaneous existence of different cultures in society using a simple set of rules of interaction between individuals; other examples include behavioral models that explain the flocking of birds (Reynolds 1987) and ant colony optimization (Dorigo et al. 2006). Agent-based modeling has received a large amount of interest in the last decade, even from non-computing research communities in areas such as the social sciences and ecology (Niazi and Hussain 2011). It is a comparatively new method of modeling, and a large number of existing models can be extended under this paradigm. Borschev and Filippov (2004) illustrate how the classic predator-prey model can be enriched by an agent-based model that makes more realistic assumptions without any significant addition to the model complexity. A large number of agent-based modeling toolkits, such as NetLogo (Tisue and Wilensky 2004) and Repast (North et al. 2007), have been designed in the last decade. A useful comparative evaluation of the existing toolkits has been provided in (2007).
High-level performance prediction of simulators has previously been done using probabilistic models of event thread generation. These statistical methods use analytical means instead of simulation to predict the performance or speedup of synchronized iterative algorithms on multiprocessors. For instance, Agrawal and Chakradhar (1992) use maximum order statistics on the random variables that affect the speedup on parallel machines, and Xu and Chung (2004) propose similar analytical methods for performance prediction of synchronous simulation. Agent-based modeling has also been explored as an alternative to equation-based modeling (Agent-based modeling vs. equation-based modeling). We argue that in the case of optimistic synchronization, analytical models are not feasible due to the complex and time-varying interdependencies of the various parameters, and hence we resort to an agent-based approach. In the following sections we demonstrate the application of agent-based modeling to performance modeling of an optimistic PDES kernel. The model is used to compute the simulation wallclock time of the kernel for an experiment described by an event initialization list, based on the dynamics and interactions at the level of LPs. The agent-based model can be used to evaluate and compare different cost criteria for LP assignment. In this work we focus on the additive cost framework described in (Kurve et al. 2011a) (see Appendix B: a local cost function) and an alternative cost framework (see Appendix C: an alternate cost framework). This model can serve as a testbed to evaluate other cost criteria that may be more complex than the simple additive relationship in modeling the effects of computation load imbalance and communication delays on the number of rollbacks and, consequently, the simulation execution time.
LP assignment and simulation time
The total simulation time of the experiment is sensitive to the assignment of LPs to machines. In our model we focus on two aspects that are direct consequences of the LP assignment: the resulting load imbalance across the machines and the interprocessor communication^{a}. We study the effect of optimizing these parameters on the total simulation time of the experiment. Our bi-criteria optimization algorithm assumes an additive relationship, i.e., the total simulation time is represented mathematically as the sum of the load balancing cost and a weighted interprocess communication cost. In the case of an optimistic PDES, both parameters induce synchronization overheads in the form of rollback events that slow down the advance of the global simulation time. The graph partitioning problem is a well-known NP-complete problem that considers both these aspects. It is formally defined as follows.
Let G = (V, E) be an undirected graph where V is the set of nodes and E is the set of edges. Suppose the nodes and edges are weighted: let $w_i$ be the weight of the $i$th node and $c_{ij}$ the weight of the undirected edge $\{i, j\}$. The K-way graph partitioning problem aims to find K subsets $V_1, V_2, \ldots, V_K$ such that $V_i \cap V_j = \emptyset$ for all $i \neq j$, $\bigcup_{i=1}^{K} V_i = V$, and $\sum_{j \in V_i} w_j \approx \frac{\sum_k w_k}{K}$ for all $i$, with the sum of the weights of edges whose incident vertices belong to different subsets minimized.
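The two quantities in this definition, the edge cut and the load imbalance, are easy to compute for a given assignment. The helper below is an illustrative Python sketch (not part of any partitioning library); it measures imbalance as the excess of the heaviest part over the perfectly balanced load, one of several reasonable choices.

```python
def partition_costs(nodes, edges, assignment, K):
    """Edge-cut and load-imbalance of a K-way partition.

    nodes: {node: weight}, edges: {(i, j): weight} with i < j,
    assignment: {node: part index in 0..K-1}.
    """
    # Total weight of edges whose endpoints lie in different parts.
    cut = sum(w for (i, j), w in edges.items()
              if assignment[i] != assignment[j])
    loads = [0.0] * K
    for n, w in nodes.items():
        loads[assignment[n]] += w
    target = sum(nodes.values()) / K       # perfectly balanced load
    imbalance = max(loads) - target
    return cut, imbalance

nodes = {0: 1, 1: 1, 2: 1, 3: 1}
edges = {(0, 1): 5, (1, 2): 1, (2, 3): 5, (0, 3): 1}
# Keeping the heavy edges inside parts yields a small, balanced cut.
print(partition_costs(nodes, edges, {0: 0, 1: 0, 2: 1, 3: 1}, 2))  # → (2, 0.0)
```

A partition that separates the heavy edges, e.g. {0, 2} versus {1, 3}, would pay a cut of 12 for the same balance, which is exactly the tradeoff the K-way objective penalizes.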
Heuristics to solve the graph partitioning problem primarily make use of spectral bisection methods (Pothen et al. 1990) or multilevel coarsening and refinement techniques (Karypis and Kumar 1996). Spectral bisection methods calculate the eigenvector corresponding to the second smallest eigenvalue of the graph's Laplacian matrix, known as the Fiedler vector. These methods give very good results; however, finding the Fiedler vector is computationally expensive. For “geometric” graphs in which coordinates are associated with the vertices, geometric methods are available that are randomized and quicker than spectral bisection. Multilevel partitioning algorithms are by far the most popular techniques. All of the methods discussed above are centralized implementations involving access to global state information. A periodic refresh of the partition is needed in most cases because of the highly time-varying nature of the load generated across different LPs. The exchange of chunks of nodes identified using a sparse cut metric was studied in Kurve et al. (2011b). In our work (Kurve et al. 2011a), we present an alternative in which decisions are made at the LP or node level. We model a game-theoretic framework where the local cost for each LP incentivizes it to choose a processor so as to minimize the total simulation time of the experiment (represented by the “social welfare”). LPs can make distributed decisions, eliminating the need for a separate software or hardware resource to refresh the partition across the machines. We are particularly interested in the class of games known as “potential games” (Monderer and Shapley 1996), which guarantee a descent in a global cost function (indicative of the system performance) with each decision made at the local (machine or node) level. The global state an LP requires to make an accurate processor choice is independent of the total number of LPs and scales linearly with the number of neighbors of the LP.
Please refer to Appendix A: initial partitioning algorithm and Appendix B: a local cost function for more details on the local cost criteria. Note that the local incentive-based criteria can be applied in other scenarios that involve the two competing incentives of load balancing and clustering. For instance, Kurve et al. (2013) studied the application of the local incentive criteria in super-peer based peer-to-peer (P2P) networks.
Method
1. We do not consider any specific simulation scenario; both the LP agent class and the event class are generic enough to accommodate different particularities. The LPs can represent a router or a switch in network simulation, or logic gates in gate-level simulation of VLSI circuits. We describe the LP agent and event classes in detail in the following subsections.
2. One of the inputs to the model is a list of events with which the event lists of the LPs are initialized before the simulation begins.
3. We abstract the simulation model across different scenarios using a graph of LPs and an event generation model, defined by the initial event list and the cause-effect relationships between events, which together determine the event execution thread across LPs during simulation.
4. The parallel processing hardware is modeled as a simulated artifact in terms of the number of processors, the relative speed of each processor, and the mean intra- and inter-processor communication delays.
5. We focus our attention on the total simulation time and the synchronization overhead in terms of the number of rollbacks, which depend largely on the event generation and processing model rather than on an actual specification of the simulation scenario.
To illustrate the abstraction mechanism, consider an oversimplified scenario of N peers sharing content using a peer-to-peer (P2P) system such as Gnutella. We would like to simulate the query generation and resolution mechanism in the system. Suppose the queries are resolved using a random walk over the overlay network graph. Using our method of abstraction, we can represent this with a graph of LPs that mirrors the graph of the overlay network. Each peer is represented by a unique LP in the graph. The events for each LP correspond to queries generated by the LP itself and those forwarded by its neighbors. In this case the event list of an LP is initialized based on the query generation rate of each peer. The query forwarding mechanism is represented using limited-hop event forwarding and a time to process each event. Note that our abstraction is agnostic about the actual changes to the local state of an LP (representing a peer) during the simulation, because these do not directly affect the total simulation time.
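The random-walk query resolution in this example can be sketched as follows. This is an illustrative Python fragment; the function name, the adjacency-list representation and the seeded generator are assumptions made for the example, not part of the model described above.

```python
import random

def run_query_thread(adjacency, start, hop_count, rng):
    """Trace one query event thread as a random walk over the LP graph.

    Each hop corresponds to one forwarded event; the thread dies when
    hop_count reaches zero, mirroring limited-hop event forwarding.
    """
    path = [start]
    node = start
    while hop_count > 0 and adjacency[node]:
        node = rng.choice(adjacency[node])   # forward to a random neighbor
        path.append(node)
        hop_count -= 1
    return path

adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
path = run_query_thread(adjacency, start=0, hop_count=4, rng=random.Random(7))
print(len(path) <= 5)   # at most the start node plus 4 hops → True
```

Each element of the returned path corresponds to one event in the thread, which is all the abstraction needs to account for the query's contribution to simulation load.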
The LP agent class
Properties of an LP agent base class:
localtime: time stamp of the event currently being processed
state: local variables of the LP
eventlist: list containing pending events
eventlisthistory: list archiving the local state of the LP before each event processed in the past
busytick: time remaining, in wallclock units, to finish processing the current event
busy?: boolean indicating whether the LP is busy processing an event or idle
mymachine: current machine assigned to the LP
simtime: time in wallclock ticks needed to process the current event
Member functions of an LP agent base class:
process_causal_event(): processes a causal event
process_noncausal_event(): processes a non-causal event
process_rollback_event(): processes a rollback event
generate_event(agentlist): generates new events, for the thread of the currently processing event, in each of the agents in agentlist
create_neighbor_list(): returns an agentlist that is a subset of the LP's neighbors
simulate(): member function called at every tick of wallclock time
get_busy_time(): returns the time needed to process the current event; depends on the type of event, the current processor load and the processor speed
compute_node_weight(): computes the node weight
compute_edge_weight(): computes the weight of the edges to all of the LP's neighbors
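The fields and methods above can be gathered into a class skeleton. The Python sketch below is an assumption-laden rendering of the base class (the paper's model is built in NetLogo): only the tick-driven bookkeeping is implemented, `busy?` becomes `busy`, and `get_busy_time` is a placeholder that a subclass would override.

```python
class LPAgentBase:
    """Sketch of the LP agent base class with the fields listed above.

    The process_*_event hooks are left to subclasses, in keeping with
    the generic design; method bodies here are illustrative.
    """

    def __init__(self, machine):
        self.localtime = 0            # time stamp of event being processed
        self.state = {}               # LP-local variables
        self.eventlist = []           # pending events, sorted by time stamp
        self.eventlisthistory = []    # archived (event, state) pairs
        self.busytick = 0             # wallclock ticks left on current event
        self.busy = False             # the paper's "busy?" flag
        self.mymachine = machine
        self.simtime = 0              # wallclock ticks the current event needs

    def simulate(self):
        """Called once per wallclock tick by the engine."""
        if self.busy:
            self.busytick -= 1
            if self.busytick == 0:
                self.busy = False
        elif self.eventlist:
            event = self.eventlist.pop(0)
            self.localtime = event["time"]
            self.simtime = self.get_busy_time(event)
            self.busytick = self.simtime
            self.busy = True

    def get_busy_time(self, event):
        # Placeholder: real cost depends on event type and processor load.
        return 1

lp = LPAgentBase(machine=0)
lp.eventlist.append({"time": 5})
lp.simulate()                     # picks up the event
print(lp.busy, lp.localtime)      # → True 5
lp.simulate()                     # one tick of processing finishes it
print(lp.busy)                    # → False
```

The important design point is that `simulate()` does a bounded amount of work per wallclock tick, which is what lets the engine advance all agents in lockstep.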
Event generation and processing model
Properties of an event base class:
eventthreadnumber: thread number of the event
eventtime: time stamp of the event
eventtype: type of event: regular or rollback
eventtick: waiting time, in wallclock units, before the event is ready to be processed; this is used to model communication delays between LPs
eventhopcount: the maximum hop count of the event thread; event threads survive for a limited number of hops equal to this value
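These fields map naturally onto a small value class. The following Python dataclass is an illustrative sketch of the event base class (the `ready`/`tick` helpers are assumptions that encode the eventtick delay mechanism described below).

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Sketch of the event base class fields listed above."""
    eventthreadnumber: int   # unique id of the event thread
    eventtime: int           # simulation-time stamp
    eventtype: str           # e.g. "regular" or "rollback"
    eventtick: int           # wallclock delay before it can be processed
    eventhopcount: int       # remaining hops before the thread dies

    def ready(self):
        return self.eventtick == 0

    def tick(self):
        # Called once per wallclock tick to model communication delay.
        if self.eventtick > 0:
            self.eventtick -= 1

e = Event(1, 40, "regular", 2, 3)
e.tick(); e.tick()
print(e.ready())   # → True
```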
In our model, each thread of events has a unique number denoted by eventthreadnumber. We define a thread as a sequence of events in which each event is generated by a previous event in the sequence. Each event in the event initialization list starts its own thread of events across LPs. This helps us track the spread of events in a limited-scope flooded packet-flow scenario. The time stamps of events are stored in the eventtime variable. The time stamp is the simulation time at which the event is supposed to occur. The eventtype variable tells the LP what its reaction to the event should be; generally, there is a specified procedure call for every type of event, and a rollback event is one of the types by default. In order to model the delay in communication (in wallclock time) between two LPs, we use the variable eventtick, which is set to a value by the LP that generates the event. At every tick of wallclock time, eventtick is decremented by one unit until it reaches zero. An event can be processed by an LP only if its corresponding eventtick variable is zero. The eventlisthistory variable is a list containing information about the events already processed by the LP, along with the state of the LP before each event was processed. In the case of a rollback, the LP restores its prior state from this list. The LP regularly flushes from the history the data of events with time stamps less than the global time, because the LP will never have to roll back to a state prior to the global time. Note that in such a model of the simulator, the functions that actually process events are generic. All that is needed is the time in wallclock ticks, stored in the variable simtime, required for processing the event, and the neighboring nodes where new events will be created as a result of processing it.
The simtime is a function of the speed of the machine on which the LP resides, given by mymachine, which in turn depends on the number of LPs residing on that machine. We use a simple limited-scope event forwarding scheme within the generic framework of the simulator model. In this scheme, events are generated at random times by randomly chosen LPs and traverse the network for a limited number of hops, i.e., each node that receives such a packet forwards it to a randomly chosen neighbor, provided the hop count given by eventhopcount is not zero. In our experiments, we randomly initialize the eventlist of each LP to generate such threads of events in such a way that the load generated across the LPs becomes highly dynamic during the course of the simulation. More specifically, we generate “hot spots” of traffic, i.e., clusters of LPs that generate large amounts of traffic over a short period of (simulation) time. The locations of these hot spots change regularly.
LP assignment algorithm
Initial static LP-to-machine assignment
We argue that due to the initial uncertainty of the computation and communication costs and their dynamic nature, an optimum partition cannot be found a priori. As a result, we employ a simple initial partitioning method in which each machine chooses an initial “focal node” from among the nodes of the graph and then expands hop by hop to include neighboring nodes. The idea is to have connected subgraphs within each partition so as to minimize interprocess communication. To avoid contention between partitions, we require that each machine wait a random amount of time after every hop and check for a semaphore before claiming ownership of new nodes. Unit edge and node weights are assumed during initial partitioning. The choice of the focal nodes is important to ensure a high probability of a good initial partition. As will be described later, the iterative partition refinement algorithm converges to a local minimum; hence, a good initial partition might improve the chances of converging to the global minimum or a good local minimum. Refer to Appendix A: initial partitioning algorithm for additional details.
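The hop-by-hop expansion can be sketched as interleaved breadth-first searches from the focal nodes. This Python fragment is a simplified, sequential stand-in for the distributed procedure: the random waits and the semaphore are omitted, and round-robin turns approximate contention-free claiming.

```python
from collections import deque

def initial_partition(adjacency, focal_nodes):
    """Hop-by-hop expansion from one focal node per machine.

    Machines take turns claiming unowned neighbors one hop at a time,
    producing connected subgraphs around each focal node.
    """
    owner = {n: k for k, n in enumerate(focal_nodes)}
    frontiers = [deque([n]) for n in focal_nodes]
    claimed = True
    while claimed:
        claimed = False
        for k, frontier in enumerate(frontiers):   # round-robin turns
            next_frontier = deque()
            while frontier:
                node = frontier.popleft()
                for nb in adjacency[node]:
                    if nb not in owner:            # claim unowned neighbor
                        owner[nb] = k
                        next_frontier.append(nb)
                        claimed = True
            frontiers[k] = next_frontier
    return owner

# A 6-node path graph split from focal nodes 0 and 5.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
owner = initial_partition(adjacency, [0, 5])
print([owner[n] for n in range(6)])   # → [0, 0, 0, 1, 1, 1]
```

With unit node weights, well-separated focal nodes give each machine a connected, roughly equal share, which is exactly what the focal-node placement in Appendix A aims for.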
Iterative partition refinement
The partition refinement algorithm utilizes the cost framework developed in our game-theoretic study. During the course of the simulation, each LP can decide to switch processors either synchronously or asynchronously. In synchronous transfers, which we refer to as a refinement step, LPs take turns evaluating their costs and the prospective new machine based on the cost framework. In asynchronous transfers, LPs decide to switch without a global synchronization scheme. However, asynchronous transfers might lead to inconsistent decisions due to the likelihood of simultaneous transfers of more than one LP. Note that such inconsistencies are avoided if the simultaneous transfers occur between two different pairs of machines. We thus assume the use of a processor mutex in the case of asynchronous transfers that allows only a single transfer per processor. The transfer rate in our model is controlled by the global parameter partitionrefinefreq, which determines how frequently the node weights (corresponding to the computation cost of each LP) and the edge weights (corresponding to the communication cost between LPs) are recomputed and the LPs are triggered to evaluate processor transfers.
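A single refinement decision can be sketched as follows. The local cost used here, the load on the candidate machine plus a weighted cut to neighbors on other machines, is an illustrative stand-in for the incentive function of Appendix B, and all names and the alpha weight are assumptions made for the example.

```python
def best_machine(lp, owner, node_w, edge_w, loads, alpha):
    """One refinement decision for a single LP.

    Assumed local cost of joining machine m: the load on m (including
    this LP if it moves) plus alpha times the weight of edges to
    neighbors NOT on m.
    """
    current = owner[lp]
    costs = {}
    for m in range(len(loads)):
        load = loads[m] + (0 if m == current else node_w[lp])
        cut = sum(w for (a, b), w in edge_w.items()
                  if lp in (a, b) and owner[b if a == lp else a] != m)
        costs[m] = load + alpha * cut
    return min(costs, key=costs.get)

owner = {0: 0, 1: 0, 2: 0, 3: 1}        # machine 0 is overloaded
node_w = {0: 1, 1: 1, 2: 1, 3: 1}
edge_w = {(0, 1): 1, (1, 2): 1, (2, 3): 1}
loads = [3, 1]
# LP 2 is lightly tied to machine 0 and sees a lighter machine 1.
print(best_machine(2, owner, node_w, edge_w, loads, alpha=0.5))  # → 1
```

A refinement step would apply this decision to each LP in turn, updating `owner` and `loads` after every accepted transfer, which is the descent that the potential-game framework guarantees improves the global cost.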
Hence the most dissatisfied node is the one with the maximum value of I. If the most dissatisfied node in the system has I = 0, the algorithm has converged to a local minimum. We can prove that this algorithm converges to a local optimum of the social welfare function (Kurve et al. 2011a).
Simulation engine
Finally, the pseudocode for the entire simulation engine of the agent-based model is shown in algorithm 4. It begins with the initialization of the global and local variables, including the eventlist of all the LP agents. At each tick of wallclocktime, the simulate() member function of every LP agent is invoked exactly once. The simulation continues, and the wallclock time is incremented, until all the LP agents have an empty eventlist. partitionrefinefreq defines how frequently the LP assignment algorithm is invoked; specifically, it is the interval between two invocations of the LP assignment algorithm. Before every invocation of the LP assignment algorithm, the node and edge weights are recomputed: compute_node_weight() and compute_edge_weight() are used to estimate the computation cost of an LP and the communication cost over the edges to its neighboring LPs. There can be many different ways of defining these functions. In our experiments, we use the length of the eventlist as indicative of the computation cost that the LP will generate in the near future. Hence, the node weight is calculated as the length of a weighted eventlist in which each event is weighted by eventweight, i.e., by the number of machine cycles needed to process the event. The communication cost should quantify the amount of “event traffic” between two LPs. We estimate the edge weight from the history of event traffic across the edge; specifically, we count the number of events generated across the edge during a fixed interval of time given by estimationwindow. We then multiply this count by 100 to make its magnitude comparable to the computation cost and assign it to the weight of the link.
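The structure of the engine's main loop can be sketched as below. This is an illustrative Python rendering of the loop described above, not the paper's algorithm 4 itself; the `Stub` agent and the no-op refine callback are assumptions made so the sketch runs standalone.

```python
def run_engine(agents, partition_refine_freq, refine):
    """Sketch of the engine's main wallclock loop.

    agents: objects with .simulate(), .eventlist and .busy;
    refine: callback that recomputes weights and triggers LP transfers.
    Returns the number of wallclock ticks the run took.
    """
    wallclock = 0
    while any(a.eventlist or a.busy for a in agents):
        wallclock += 1
        for a in agents:
            a.simulate()                       # exactly once per tick
        if wallclock % partition_refine_freq == 0:
            refine(agents)                     # reweigh and reassign LPs
    return wallclock

class Stub:
    """Trivial agent that consumes one pending event per tick."""
    def __init__(self, n):
        self.eventlist = list(range(n))
        self.busy = False
    def simulate(self):
        if self.eventlist:
            self.eventlist.pop()

agents = [Stub(3), Stub(1)]
print(run_engine(agents, 10, lambda ags: None))   # → 3
```

The returned tick count is precisely the model's global performance metric: the simulation wallclock time of the experiment.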
Results and discussion
We evaluated the iterative LP assignment algorithms on the agent-based model described above to compare the speedup in the simulation of an experiment. We also observed the performance under different system settings, such as synchronous and asynchronous LP transfers, and for two different random graph generation models. As described previously, one of the inputs to the model is a graph of LPs. The first model of random graphs that we used to generate an instance of the LP graph was the preferential-attachment model, which is scale-free and, according to some studies, can be used to model the Internet topology at the level of Autonomous Systems (Bu and Towsley 2002). The second model incorporates geometric information along with the degree of a node, i.e., each node has associated coordinates in two dimensions, and while forming links, each node randomly chooses another from among the set of its 15 closest nodes (in terms of distance with respect to the coordinate axes). Our NetLogo-based model of the LP graph consisted of 300 LP nodes generated from one of the models above, each of which was an instance of the LP agent class described in Section 3.1. The event initialization list was created keeping in mind the highly dynamic nature of the load generated during a network simulation, i.e., we created two genres of event threads: one that was uniformly distributed across all nodes with a relatively large value of eventhopcount, and a second that had a shorter hop count and was localized to a closely connected set of nodes. The former created a wider scope of event interdependencies between nodes that were far apart in geodesic distance, while the latter generated hot spots of traffic that were localized in space and time.
Note that the processorspeed is the normalized speed of the processor that the LP resides on. We set the inter-processor communication delay to 10000 wall-clock ticks uniformly across all the links between processors, i.e., any event created by an LP for another LP residing on a different processor experiences a delay of 10000 wall-clock ticks. We set the value of estimationwindow, described in the previous section, to 10000 clock ticks.
Conclusion
We presented a method of agent-based modeling of an optimistic parallel discrete-event simulator. The agent-based modeling paradigm is novel to the study of PDES performance and greatly expedites the evaluation of performance optimization techniques under different system settings. The model was used to evaluate and compare two LP-to-machine assignment algorithms. In this process, we gained several insights into the complex relationship between the LP agents and the simulation time. We observed that the comparative weight given to the computation cost and the communication delay cost in the LP assignment scheme depends on the properties of the LP graph. We studied the sensitivity of the simulator performance to other parameters, such as the partition refinement frequency and the graphical properties of the network under simulation. The need for dynamic load balancing was further emphasized by the results of our experiments, in which we studied the distribution of the rollback count curves as the simulation execution time advanced. One can use the proposed agent-based modeling framework to study different LP-to-machine assignment techniques that characterize system performance using different objective functions. This framework can also be extended to study different rollback strategies and methods of estimating LP load and communication patterns based on knowledge of current and past events.
Appendix
Appendix A: initial partitioning algorithm
In this appendix, we describe the initial partitioning algorithm. As mentioned previously, the choice of focal nodes is important to obtain a good initial partition. The focal nodes are chosen so that they are at maximum geodesic distance from each other. Ideally, we should find focal nodes such that each one of them is at least $2N_{|V|/K}$ geodesic distance away from the others, where $N_{|V|/K}$ is the mean number of hops that cover $|V|/K$ nodes. In this case, we should know the properties of the underlying graph to calculate $N_{|V|/K}$. We find this value for the example of an Erdős–Rényi random graph model.
Theorem 1
For an Erdős–Rényi random graph with V nodes and edge probability p, the expected number of nodes acquired during the (k+1)^{th} hop of a breadth-first expansion is $(V-N_{k})\left(1-(1-p)^{N_{k}-N_{k-1}}\right)$, where N_{k} denotes the number of nodes obtained by the k^{th} hop.
Proof
Suppose after the k^{th} hop, the graph is divided into two sets of nodes: the set A of nodes obtained by the k^{th} hop and the set A^{′} of nodes not yet obtained. Let |A| = N_{k} and |A^{′}| = V − N_{k}. Denote the set of newly acquired nodes in the k^{th} hop by B, where |B| = N_{k} − N_{k−1}.
For any node a∈A^{′}:
P(a is not connected to any node in B) = $(1-p)^{N_{k}-N_{k-1}}$ and
P(a is connected to at least one node in B) = $1-(1-p)^{N_{k}-N_{k-1}}$.
The number of nodes acquired during the (k+1)^{th} hop is binomial, with (V − N_{k}) trials and success probability $1-(1-p)^{N_{k}-N_{k-1}}$. Hence the expected number of nodes acquired during the (k+1)^{th} hop is $(V-N_{k})\left(1-(1-p)^{N_{k}-N_{k-1}}\right)$ □
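The conditional expectation derived above can be checked numerically: given N_{k} and the current layer size, each unreached node joins the next hop independently with probability 1 − (1−p)^{|B|}, so observed BFS layer sizes in a sampled G(n, p) should match the prediction on average. The following Monte Carlo check is illustrative and not part of the paper.

```python
import random

def bfs_layers(n, p, rng):
    """Sample G(n, p) edges lazily (deferred decisions) and return the
    BFS layer sizes starting from node 0."""
    unreached = set(range(1, n))
    layer = {0}
    sizes = [1]
    while layer:
        nxt = set()
        for v in list(unreached):
            # v joins the next layer iff it has an edge to some current-layer node;
            # these edges were never examined before, so fresh sampling is exact
            if any(rng.random() < p for _ in layer):
                nxt.add(v)
                unreached.remove(v)
        sizes.append(len(nxt))
        layer = nxt
    return sizes

def mean_prediction_error(n=200, p=0.03, trials=300, seed=42):
    """Average of (observed layer size) - (predicted conditional mean)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        sizes = bfs_layers(n, p, rng)
        N = sizes[0]                       # N_k: nodes obtained so far
        for k in range(len(sizes) - 1):
            B = sizes[k]                   # size of the current layer
            if B == 0:
                break
            predicted = (n - N) * (1 - (1 - p) ** B)
            diffs.append(sizes[k + 1] - predicted)
            N += sizes[k + 1]
    return sum(diffs) / len(diffs)
```

Averaged over many samples, the error should be close to zero, since the binomial argument in the proof is exact conditionally on N_{k} and N_{k−1}.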
where d_{ G } is the geodesic distance between the two nodes. We can attempt to find such nodes using heuristics. For example, start by assigning an arbitrary set of distinct nodes as focal nodes, one per machine. In round-robin fashion, each machine takes a turn at finding, from among the neighbors of its current focal node, a node that increases the minimum of its geodesic distances to the focal nodes of the other machines; this node becomes the machine's new focal node. The process is iterated until no further improvement is possible. We repeat this process over multiple initializations of the focal-node set, and the best set of focal nodes is retained. In the next phase, starting at the focal nodes, the partitions collect nodes in their neighborhoods, thus expanding their clusters. We can use random waiting times between two successive hops, together with semaphores, to deal with contention issues, i.e., when two or more machines try to claim ownership of the same node.
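The round-robin focal-node heuristic can be sketched as follows. The function names are illustrative, and a single initialization is shown, whereas the text iterates over multiple initializations and keeps the best result.

```python
from collections import deque

def bfs_dist(adj, src):
    """Geodesic (hop-count) distances from src via breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def find_focal_nodes(adj, k, init):
    """Round-robin local search: each machine moves its focal node to a
    neighbor that strictly improves its minimum geodesic distance to the
    other machines' focal nodes, until no improvement is possible."""
    focal = list(init)
    improved = True
    while improved:
        improved = False
        for m in range(k):
            others = [f for i, f in enumerate(focal) if i != m]
            dists = [bfs_dist(adj, f) for f in others]
            def score(v):
                # minimum geodesic distance from v to the other focal nodes
                return min(d.get(v, float("inf")) for d in dists)
            candidates = [focal[m]] + list(adj[focal[m]])
            best = max(candidates, key=score)
            if score(best) > score(focal[m]):
                focal[m] = best
                improved = True
    return focal
```

On a 10-node path graph with two machines initialized at adjacent middle nodes, the focal nodes drift to the two endpoints, the maximally separated pair.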
Appendix B: a local cost function
We will now describe our iterative local incentive for the partitioning algorithm (Kurve et al. 2011a) in detail. Consider an undirected, weighted graph G = (V, E) representing the network model to be simulated. As described previously, the graph can be interpreted as a model of a network of LPs as vertices or nodes, with weights assigned to the nodes and edges of G. Thus, we first want to estimate the computational load generated by each node and the amount of communication between the logical processes associated with each node, in order to assign node weights and edge weights, respectively, to the graph. Second, we wish to find a distributed technique to equitably load-balance among the machines while also taking into consideration the amount of inter-machine communication, the latter reflecting the risk of rollback. We address the first problem in the following sections and focus on the second problem now.

Let K be the number of machines (K < |V|). The graph G is partitioned among K machines or fewer, since in some cases where the cost of inter-processor communication is high, partitioning the workload among fewer than K machines might be optimal, i.e., some machines may not be assigned any LPs.

Let b_{ i } represent the computational load of the i^{th} LP.

Let c_{ i j } denote the cost of communicating over the edge {i, j}, representing the average amount of traffic between the i^{th} and j^{th} LPs.

Let r_{ i } ∈ {1, 2, …, K} be the partition chosen by the i^{th} LP.

Let the normalized capacity or speed of the k^{th} machine be ${w}_{k}=\frac{{s}_{k}}{\sum_{j=1}^{K}{s}_{j}},$
where s_{ j } is the speed of the j^{ th } machine.
In this way the load balancing mechanism is implicitly manifested in the local cost function. The second term in the sum represents the weight of edges that connect the i^{ th } node with nodes in other partitions. This term incentivizes the node to choose a partition with which it is well connected.
Thus, at a Nash equilibrium, no node, say node i, can improve its cost by unilaterally changing its current processor ${r}_{i}^{\ast}$, provided that the decisions of all other nodes are given by the assignment vector ${\mathbf{r}}_{-i}^{\ast}$.
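Best-response dynamics for this game can be sketched as follows. The actual cost function (6) is not reproduced in this excerpt, so local_cost below uses an assumed form consistent with the description: the normalized load of the chosen machine plus the weight of the node's cut edges. The function names and the stopping rule are illustrative.

```python
def local_cost(i, k, r, b, w, c):
    """Assumed local cost of node i choosing machine k: normalized load of
    machine k (including i's own load b[i]) plus the weight of i's edges
    that would be cut by the choice. Not the paper's exact function (6)."""
    load = sum(b[j] for j in range(len(r)) if (r[j] if j != i else k) == k)
    cut = sum(cij for j, cij in c[i].items() if r[j] != k)
    return load / w[k] + cut

def best_response(r, b, w, c, max_rounds=100):
    """Round-robin best-response dynamics: each node switches to its
    cheapest machine if that strictly improves its local cost. Stops at a
    Nash equilibrium (no unilateral improvement) or after max_rounds."""
    K = len(w)
    for _ in range(max_rounds):
        moved = False
        for i in range(len(r)):
            costs = [local_cost(i, k, r, b, w, c) for k in range(K)]
            best = min(range(K), key=lambda k: costs[k])
            if costs[best] < costs[r[i]]:   # strict improvement only
                r[i] = best
                moved = True
        if not moved:
            return r, True                   # Nash equilibrium reached
    return r, False
```

On a four-node path with unit loads, unit edge costs, and two equal-speed machines, the balanced assignment that cuts a single edge is already a Nash equilibrium: every unilateral deviation raises both the load term and the cut term.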
Appendix C: an alternate cost framework
Note that this cost function also accounts for machine-to-machine level overhead, similar to the cost function described previously.
The above standard formulation of the graph partitioning problem (Van Den Bout and Thomas Miller 1990) is a quadratic integer programming problem whose convexity depends on the network graph; in most cases it will not be convex. We can think of decomposing this problem into a set of K subproblems, each solved by a single partition. However, with the constraints $\sum_{k}{x}_{kj}=1,\phantom{\rule{1em}{0ex}}\forall \phantom{\rule{1em}{0ex}}j\in V,$ such a decomposition is difficult to realize. Instead, we study the effect of sequential node-by-node transfers on (8). We can prove that for the local node cost function (6), Nash equilibria exist at the local optima (minima) of the centralized cost function (8). Thus, we can define a new cost function for each node as given by (6). Note that at each locally optimal point of (8), no node can improve its cost (6) given that the assignments of all other nodes remain fixed. Hence, the assignment vectors at these points are Nash equilibria of this game. And since the node decisions always perform descent on (8), convergence to one of the equilibrium points is guaranteed.
Endnote
^{a}High-volume inter-LP communication is not just overhead or delay; it also indicates the threat of rollback.
Acknowledgements
The local incentive costbased LP assignment method was presented at the International Workshop on Modeling Analysis and Control of Complex Network (CNET) 2011 at San Francisco. This work is supported in part by the NSF under grant number 0916179.
Supplementary material
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.