Neighbourhood message passing computation on a lattice with cP systems

We propose neighbourhood message passing (NMP), an abstract framework for loopy belief propagation (BP), as used in stereo matching (SM). We focus here on generic inter-processing-element messaging over a two-dimensional square grid, but our results apply to lattices of any shape through minimal modification. Specifically, this paper investigates three cP systems (a type of P system) models for loopy BP: one based on the classical globally synchronous BP, and two novel variants, (totally) asynchronous and locally synchronous. To model the classic globally synchronous NMP, we extend cP systems messaging rules with antiport features, similar to those used in other P systems. Next, we propose a novel version of NMP by extending it to the asynchronous case. We then derive a locally synchronous NMP variant, which arises naturally as a middle ground between our asynchronous and the classical globally synchronous variants. To clarify the operation of the asynchronous NMP system, we supply a short worked example. Following this, we analyse the proposed asynchronous system and prove that it uses precisely the same number of messages as the globally synchronous variant. We further put forward some runtime and correctness conjectures. Furthermore, we experimentally investigate the asynchronous system's run-time characteristics. Messages spread from a given location on the lattice similarly in both the asynchronous and synchronous versions, even in the face of slow channels. We also conduct computer experiments and find that, in practice, the locally synchronous system is usually faster than the traditional globally synchronous approach (by about 5–13%), and the asynchronous system is typically quicker still (often by approximately another 10%). We thus believe that the asynchronous variant is a promising novel approach for faithful implementations of NMP and should be preferred.


Introduction
We present a framework for passing messages between discrete points on a finite lattice. Each point updates its internal data (and thus the messages it sends) based on messages received. Each point is uniquely connected to other points and so has its own neighbourhood, being the specific other points it communicates with. A key aspect of this neighbourhood message passing (NMP) is that the messages for a neighbour depend upon the messages received previously from all neighbours, except the neighbour to which the current message is sent. We focus here on the square grid lattice but emphasise that the principles of NMP apply equally to all lattices.
In this paper, we term the individual computational units in the lattice processing elements (proxels). These are the logical base computational units for NMP and are represented by cP systems (P systems with compound objects, formerly P systems with complex objects) top-level cells. In NMP, each element on the lattice is an individual communicative proxel, which starts with one or more input data. The proxels then perform their part of a computation by alternating between problem-dependent internal computations and exchanging messages with connected proxels. The messages a proxel receives update the proxel's internal data, while the messages sent are computed based upon those data, specifically, using all the data except the datum most recently received from the intended recipient neighbour. This n − 1 property is the key distinguishing factor of NMP from similar processes. (Elements of this article were presented at the International Conference on Membrane Computing 2020 in "Belief Propagation for Stereo Matching in cP systems" by Cooper and Nicolescu. The current article is a significant revision and expansion of part of the conference paper.)
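As a minimal illustrative sketch (not part of the formal model), the n − 1 update described above can be expressed as a short Python function; the names incoming and combine are placeholders for the problem-dependent data and aggregation.

```python
# Minimal sketch of the "n - 1" update: the message prepared for neighbour j
# combines the latest data received from every neighbour except j itself.
# 'combine' stands in for the problem-dependent internal computation (the oracle).

def prepare_outgoing(incoming, combine):
    """incoming: dict neighbour -> latest datum received from that neighbour."""
    outgoing = {}
    for target in incoming:
        others = [datum for source, datum in incoming.items() if source != target]
        outgoing[target] = combine(others)
    return outgoing

if __name__ == "__main__":
    received = {"up": 3, "right": 7, "down": 1, "left": 4}
    print(prepare_outgoing(received, sum))
    # {'up': 12, 'right': 8, 'down': 14, 'left': 11}
```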
Conceptually, there are two distinct aspects of the NMP process: (i) the message passing itself, which occurs externally to proxels over channels connected to other proxels; and (ii) the data/message updates, which occur internally and are largely orthogonal to the inter-proxel communication. The focus of this paper is the external side; communication with an oracle stands in for the internal computation.
A visual example of the NMP process for the four-neighbourhood on the square lattice is shown in Fig. 1. A proxel in the grid has received messages from its neighbours to the left, right, and bottom at generation t − 1 and has used the data from those messages in preparing its new message to send to the neighbour above at iteration t. The same is performed for each other neighbour too. Later, after reaching a defined stopping point, the proxel will compute a final output based on its internal data at that point. Again, the nature of the final computation is problem-dependent.
There are (at least) two separate ways to model the proxels' communication: a globally synchronous, system-wide view, where states and steps apply across the system as a whole and every proxel carries out its operations simultaneously according to the same rules; or a per-proxel view, where each proxel has its own state and advances asynchronously without regard to other proxels' objects, states, and rulesets. We explore both and find in the process that an intermediate third, locally synchronous, model arises naturally as a hybrid of the other two.
The concept of asynchronicity used here-and in cP generally-is that from distributed computing with a focus on messaging times rather than internal processing. Briefly, in the asynchronous model of distributed algorithms, external messages may take any arbitrary time (in ℝ ) to reach their destinations [4,21,23], though internal updates are typically assumed to occur (near-)instantaneously. This asynchronicity concept is thus substantially different to that used in other areas of membrane computing [7,12,32].
Modelling and implementation of NMP prove surprisingly challenging. The globally synchronous form (Sect. 3) is reasonably simple. It requires only nine rules, most of which have precisely one term on both the left-hand and right-hand sides, and none of which use promoters or inhibitors (see Sect. 2 for more on these concepts). The challenge arises with the asynchronous version. The possibility of messages from the same generation arriving at different times necessitates bookkeeping. Proxels must track from which neighbours messages have been received for a given generation, to decide which neighbours the proxel is ready to send a new message to. This leads to inescapable complexity and the possibility of deadlock through unsuitable design.
[Fig. 1: Diagram of the central concept of four-neighbourhood NMP on a grid [8]. Each circle is a proxel. The central proxel needs to compute a new message to send at time t to the proxel above it, so it performs a computation to combine the results of the messages received from the other three neighbouring cells at time t − 1 (indicated by the thin and fat arrows). The same is applied to every other neighbour, too.]
NMP shares similarities and overlaps with areas such as Cellular Automata, Gossip Protocols and Consensus Algorithms (see, e.g., [9,17]). NMP is distinguished from these other areas in a few ways, however. First, other approaches do not usually impose the " n − 1 neighbours" condition of message updates as found in NMP. Second, in other areas messaging tends to be only a part of the process, whereas in NMP it is the heart of it. Internal computation is only performed to aggregate/average the messages received from neighbours to prepare new messages. Third, neither a global nor local consensus is sought in NMP-the closest equivalent would be the (tautological) concept of individual consensus per proxel. Instead, the result of each proxel comprises one distinct part of the final result. Finally, often in these other areas, the neighbours are selected at random rather than being a neighbourhood arising naturally from the structure of the modelled problem. Generalisation across these fields may be possible but is not pursued further.
This paper makes the following contributions: It is the first to use an antiport rule in cP systems, and, to the best of our knowledge, it is also the first time asynchronous NMP has been explored in any context, P systems or otherwise. We begin with a summary of cP systems before describing a straightforward globally synchronous NMP system using the aforementioned antiport rule. Then, we supply an asynchronous proxel-specific version of the same, followed by an adaptation of the asynchronous system to a locally synchronous system. Next is a short example of a potential evolution of the asynchronous system to clarify its operation. Then, we analyse the asynchronous system. The analysis proves that the asynchronous system sends the same number of messages as the globally synchronous system (one per generation per neighbour) but that the data used to compute new messages may vary slightly depending on the ordering of messages received. We also state some conjectures we have not (dis)proven so far. Immediately following this are experiments related to NMP and the cP systems variants. The empirical results verify the operation of the asynchronous system and support our hypothesis that the locally synchronous and asynchronous systems are faster than an equivalent globally synchronous version.

A minimal overview of cP systems
The following is a stripped-down synopsis of cP systems, included in the interests of self-containment. For a comprehensive description and explanation of cP systems, the reader is referred to [14,15,24,25]. All choices of names/symbols for objects and variables in this section are made purely for illustration purposes and bear no relation to the following sections.
cP systems is another variant of P systems [29][30][31], complementary to the classic trinity of Cell-like [28], Tissue-like [22], and Neural-like P systems [18]. The core cP systems unit is the top-level cell, which is arranged as a nested tree of subcells. There can be an arbitrary number of top-level cells in the environment, communicating via a form of message-passing over channels. The internal operation of cP systems' top-level cells closely resembles Cell-like P systems. Only top-level cells have rules, however. All sub-cells are inert symbolic objects operated upon by the top-level cell's rules. These sub-cells are represented as labelled multisets within the top-level cells.
Sub-cells can be either atoms or compound terms, which are multisets labelled by functors (the name 'functor' is also commonly used as a shorthand for compound terms). Atoms are indivisible symbols. They can be of any type relevant to a system but are static objects without other intrinsic properties. Atoms are written as the name of the atom's type. Compound terms are objects, labelled with functors, which may hold both atoms and other compound terms. They are written with the functor's type, followed by opening and closing parentheses surrounding the term's contents.
Both functors and atoms are written using lower-case letters, typically from the Latin alphabet. E.g., an x atom may be written as x, while a y functor might be written y(a xx z). Empty functors are written either with nothing between the parentheses or with a lower-case lambda: b() or b(λ). When there are multiple copies of a given atom present, the count is usually written as a superscript. The earlier functor thus could look like y(a x² z).
Rules are arguably the most significant difference between cP systems and other P systems variants. cP systems rules still work in a maximally parallel top-down weak-priority order, update the multisets inside the cells, and depend on the presence of given objects in the cell to make the rules applicable, but also include variables. These variables, written with upper-case letters, are matched with the actual contents of the cells through a one-way unification process [20], where the variables are substituted for contents within a cell at 'run-time'.
Rules are written with exactly one each of a starting state, left-hand-side (LHS), application mode, ending state and right-hand-side (RHS), followed by zero or more promoters and inhibitors. These rules take the form s_i LHS →_α s_j RHS (optionally followed by promoters and inhibitors), where s_i is the starting state; LHS is the multiset of objects required to be present in the given top-level cell and which will be deleted by the rule; α ∈ {1, +} is the application mode, either exactly once or maximally parallel, respectively; s_j is the ending state; and RHS is the multiset of objects the rule will create inside the top-level cell. Exactly once mode means that the rule is applied at most one time during a given step. Maximally parallel application means that the rule is applied at the current step as many times as there are available objects to enable it.
For example, if a top-level cell has two functors a(b²) and a(b³), and an applicable rule's LHS is a(B), B will be unified non-deterministically to b² or b³. Or, if the rule uses a(bB), B will be unified to b or bb. If the rule instead uses a(BC), then a non-deterministic choice will be made to decide how to split the matched multiset (b² or b³) between B and C. Promoters are objects that must be present within the cell for the rule to be applicable but are not destroyed by the rule. Inhibitors are objects that must be absent for the rule to be applicable, although the rule may create them. If promoters are present, they are denoted following a | per promoter, and inhibitors by ¬, e.g., | a(A) or ¬ b(B). Traditionally, inhibitors and promoters are written below the main rule body, but this is not mandatory.
Top-level cells may exchange messages over channels. Each top-level cell may hold one or more appropriately labelled endpoints for any relevant channels and may, in its rules, try both to send and receive messages via those endpoints. A message is written inside angle brackets and marked either on the LHS with a question mark or on the RHS with an exclamation mark to denote receiving or sending. E.g., ⟨a(b)⟩! c would represent a message a(b) to be sent via channel c , and ⟨d(e)⟩? f would represent a message d(e) to be received via channel f . Both sending and receiving use pattern matching. For sending, any cP systems term matching the pattern in the rule may be removed from the top-level cell and passed to a waiting recipient or placed into a buffer multiset in the channel. Receiving works similarly in that any object matching the pattern of a receipt rule either offered by a top-level cell on the other side of the endpoint or stored in the channel's buffer multiset may be withdrawn. If more than one object in the buffer matches the pattern, one is selected randomly. Importantly, this means that ordinary cP systems channels do not behave as FIFO queues.
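As an informal sketch of these channel semantics, the following Python class models a buffered, non-FIFO channel where a receive withdraws a random message matching the requested pattern; the class and its method names are illustrative assumptions, not part of cP systems notation.

```python
import random

# Sketch of a cP systems channel: sends always succeed (messages are buffered),
# and a receive withdraws a *random* message matching the requested pattern,
# so the channel is not a FIFO queue. The pattern is a predicate here.

class Channel:
    def __init__(self):
        self.buffer = []          # multiset of buffered message objects

    def send(self, message):      # non-blocking: the channel always accepts
        self.buffer.append(message)

    def receive(self, pattern):   # returns None if nothing matches (rule not applicable)
        matches = [m for m in self.buffer if pattern(m)]
        if not matches:
            return None
        chosen = random.choice(matches)
        self.buffer.remove(chosen)
        return chosen

if __name__ == "__main__":
    ch = Channel()
    ch.send(("v", 1, "data-a"))
    ch.send(("v", 2, "data-b"))
    ch.send(("w", 1, "data-c"))
    # receive any 'v' message, regardless of arrival order
    print(ch.receive(lambda m: m[0] == "v"))
```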
Numerical operations are simulated with unary arithmetic (see, e.g., [3,6]). Natural numbers are represented by counting copies of the unary digit atom, 1, inside a given functor. E.g., the number three can be represented as a(111), a(1³) or a(3).
Like most other P systems families, cP systems typically evolve synchronously in a stepwise fashion following an implicit global clock. Rules are applied based on whether the available multiset(s) within the system match the rules' LHS and promoters and not the inhibitors. The consumption of removed objects plus the creation of new objects happens instantaneously at the end of a step.

Exactly once vs maximally parallel
To clarify further the respective operation of the 'exactly once' and 'maximally parallel' application modes, consider this example rule:

s₁ a(X) →_+ s₁ | a(X)

The rule will eliminate all but one instance of a(X) from the containing top-level cell. Its application to the objects {a(1), a(1), a(2), a(2), a(2)} will leave only {a(1), a(2)}. This is because the application of the rule will invariably be equivalent to the following ground rule applications: one consuming a(1) (with the other a(1) as promoter) and two consuming a(2) (with the remaining a(2) as promoter). After these three, no further instantiations are possible because there are no longer sufficient remaining objects to satisfy both the LHS and promoter.
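A rough Python rendering of this maximally parallel duplicate elimination, assuming (as above) that the rule consumes one a(X) while another a(X) remains as promoter, is:

```python
from collections import Counter

# Sketch of the duplicate-elimination example: applied maximally parallel,
# the rule keeps consuming an a(X) while another a(X) remains as promoter,
# so exactly one copy of each distinct a(X) survives the step.

def eliminate_duplicates(objects):
    counts = Counter(objects)
    applications = 0
    for value, count in counts.items():
        applications += count - 1   # each application consumes one surplus copy
        counts[value] = 1
    return counts, applications

if __name__ == "__main__":
    cell = ["a(1)", "a(1)", "a(2)", "a(2)", "a(2)"]
    remaining, n = eliminate_duplicates(cell)
    print(sorted(remaining.elements()), "after", n, "ground applications")
    # ['a(1)', 'a(2)'] after 3 ground applications
```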

cP systems notation
A specific cP system can be described as a 6-tuple (T, A, O, R, S, s), where: T is the set of top-level cells at the start of the evolution of the system; A is the alphabet of the system; O is the set of multisets of initial objects in the top-level cells; R is the set of rulesets for each top-level cell; S is the set of possible states of the system; and s ∈ S is the starting state of each top-level cell in the system.

Indexed notation for compound terms
Indexed compound terms appear commonly in cP systems definitions. These are compound terms tagged with one or more sub-terms to distinguish different instances of the same functor type. They are still classic cP systems objects with nested terms, but some subterms are used only to distinguish between instances of the encapsulating compound term. This is roughly analogous to tagging a term with its index in a logical vector/array, or the key in a typical dictionary/associative array data structure.
For example, compound v terms appear in multiple rules later in this paper. Ordinarily, these would be represented as fully nested terms and used inside a given proxel to stand for a set of tagged data. They are labelled by neighbour X and annotated with a 'generation count' G (further explained in Sect. 4). Both values track metadata about a datum. Lastly, the datum D stored by the term is given. In many common programming languages, accessing each datum might be written like v[X], where v is a dictionary indexed by neighbour.
The v functors could instead be written in the indexed form v(X)(G)(D), a shorthand that can be expanded back out to the full nested form automatically. This shorthand is purely a notational convenience, without effect on rule application and system evolution. When a concrete instance of a v compound term is indexed by a ground term (i.e., not a variable) k, e.g., v(k)(_)(_), we refer to it as k-tagged.
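Continuing the dictionary analogy, a hypothetical Python rendering of the indexed v terms might look like the following; the tuple layout is an illustrative assumption only.

```python
# Sketch of the dictionary analogy for indexed v terms: a proxel's v objects
# v(X)(G)(D) behave like a mapping from neighbour label X to (generation G, datum D).

v = {
    1: (0, "d-up"),     # v(1)(0)(d-up)
    2: (0, "d-right"),  # v(2)(0)(d-right)
}

generation, datum = v[2]           # roughly the "v[X]" access mentioned above
v[2] = (generation + 1, "d-new")   # replacement by a newer message for neighbour 2
print(v)
```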

Blocking vs non-blocking message receipt in cP systems
In cP systems, sending messages from one top-level cell to others is non-blocking by default. The channels connecting the cells buffer message objects if needed. Thus, the mediating channel always accepts an outgoing message, even if the holder of the other endpoint is not ready to receive the message. Receiving messages via channels is also ordinarily a non-blocking operation in cP systems, albeit for a different reason: if there are no eligible incoming messages available, then the relevant rule to receive over that channel will simply not apply, regardless of other rules' applicability. It may be useful to simulate the nature of a blocking receipt, however. This can be achieved with added dedicated states. The beginning state for the intended blocking receipt should not be used as the beginning state of any other rule. This state is the ending state for another rule, which enters the blocking receipt. The ending state for the blocking receipt rule should return the cell to its standard process.
Using the dedicated state ensures that no other rules may be used inside a particular top-level cell at a given step. Effectively, the top-level cell becomes quiescent until another top-level cell offers to send a message to the first cell. At that point, messages are exchanged as appropriate, and the receiving top-level cell returns to its standard processing. We use this later in Sect. 4 (rules 7 and 10 in Ruleset 2).
As an example, consider the two cP systems rules below:

s₂ ⟨x(X)⟩?z →_+ s₃ x(X)
s₂ →_1 s₄

Assume these are the only two rules in the ruleset that begin in state s₂, and that the top-level cell this ruleset applies to is presently in state s₂. Therefore, the only rule which may be applied at the next step is one or the other of these; the different ending states prevent the application of both.
When applied in the typical top-down weak priority order, the first rule has priority. If one or more messages available on the channel z match the pattern-here, a functor x with any contents-then the first rule applies, and the new x terms are created in the top-level cell at the end of the step. Conversely, if no such message is on hand, the top-level cell will unconditionally transition to state s 4 . Thus, receiving messages through the first rule while the second is available is a non-blocking receipt.
What if the second rule was not there? In that case, the top-level cell will be unable to make progress until at least one matching message becomes available via channel z . Thus, receiving messages through the first rule without the second rule being available is a blocking receipt. The toplevel cell is blocked from making forward progress until a suitable message arrives. Blocking receipt rules can lead to potential deadlocks in a communicating system, and care must be taken to avoid this.
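The difference between the two behaviours can be sketched in Python as follows; the queue-based channel here is a simplification (cP channels are not FIFO), used only to contrast falling through to a fallback rule with waiting for a message.

```python
import queue

# Sketch of non-blocking vs blocking receipt. With a fallback rule available
# (non-blocking), an empty channel lets the cell take the fallback transition;
# without one (blocking), the cell simply waits for a matching message.

def non_blocking_receive(channel):
    try:
        return ("received", channel.get_nowait())   # first rule: message available
    except queue.Empty:
        return ("fallback", None)                    # second rule: move to state s4

def blocking_receive(channel):
    return ("received", channel.get())               # waits until a message arrives

if __name__ == "__main__":
    ch = queue.Queue()
    print(non_blocking_receive(ch))   # ('fallback', None): no message, take the other rule
    ch.put("x(contents)")
    print(non_blocking_receive(ch))   # ('received', 'x(contents)')
    ch.put("x(more)")
    print(blocking_receive(ch))       # ('received', 'x(more)'); would wait if empty
```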

Antiport communication rules in cP systems
We propose and use an antiport rule [26,27] in Sect. 3. To the best of our knowledge, this is the first use of an antiport rule in cP systems. Briefly, antiport rules both permit and require the bidirectional exchange of objects between cells during a single rule execution. Thus, if one side is unready to send and receive, the rule cannot run during the step. This averts deadlock from both sides waiting on the other to send a message.
In the context of cP systems, this means that a given rule must involve a receipt over a channel on the LHS and a send on the same channel on the RHS. For example,

s₁ ⟨a(X)⟩??c →_1 s₂ ⟨b(X)⟩!!c

would be a valid antiport rule because the same channel c is used on both sides of the rule. To distinguish clearly between antiport rules and other messaging rules, we write two marks between the message and the channel label, instead of the usual one.
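A rough Python analogue of the all-or-nothing character of an antiport exchange, with illustrative cell and field names, is:

```python
# Sketch of an antiport exchange between two cells over one channel: the
# exchange happens only when both sides have something to offer, and then
# both messages move in the same step (all-or-nothing).

def antiport_exchange(cell_a, cell_b):
    if cell_a.get("outgoing") is None or cell_b.get("outgoing") is None:
        return False                      # one side not ready: rule cannot run
    cell_a["incoming"] = cell_b.pop("outgoing")
    cell_b["incoming"] = cell_a.pop("outgoing")
    return True

if __name__ == "__main__":
    a = {"outgoing": "v(a-data)"}
    b = {"outgoing": None}
    print(antiport_exchange(a, b))        # False: b has nothing to send yet
    b["outgoing"] = "v(b-data)"
    print(antiport_exchange(a, b), a, b)  # True, messages swapped in one step
```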

System-wide-synchronicity messaging
This section presents the globally synchronous NMP variant, wherein the entire system evolves as one, and every proxel executes the same rule(s) simultaneously. For this paper, we assume that all proxels, their oracles and channels behave correctly and follow their respective protocols without faults, corruptions, lost messages, et cetera. This significantly simplifies the presented systems by allowing us to omit error handling and describe only the intended operation.
The layout of the system, from the perspective of an arbitrary proxel in a neighbourhood grid, is depicted in Fig. 2. The central circle is this proxel and is labelled to represent its ID in the overall system. Two-way channels connect it to each neighbour and its oracle (o). The neighbours themselves are anonymous, but the proxel's end of each channel is labelled with a number from 1 to 4, standing for the neighbours expected to sit above, to the right, below, and to the left of it, respectively. The oracle is smaller, blurred, and surrounded with a dashed line to reflect that it is a stand-in for an unspecified process that ordinarily would take place inside the proxel.
In our approach, each proxel has no direct knowledge of its neighbouring proxels. Instead, it interacts with channels that connect to those neighbours; the channels interpose between the proxels. Furthermore, each label for a neighbour in each proxel is not the system's ID for that neighbour but the system's label for a connecting channel endpoint. Nevertheless, as a shortcut, we will refer to proxels as simply communicating with their neighbours. Messages exchanged by proxels are termed neighbourhood messaging (NM) messages, while messages exchanged with oracles are named oracle query (OQ) messages.
The OQ phase is when each proxel performs its internal computation. As mentioned earlier, these computations are problem-dependent and thus cannot be modelled generically. Instead, we use communication with an oracle assigned to each proxel to simulate this aspect of NMP. The NM phase is the focus of this work. This is where proxels exchange messages with their neighbours, based on data received earlier from other neighbours. The finalisation phase occurs after each proxel reaches a stopping point. At this point, each proxel performs a final computation to decide its output for the lattice, incorporating the latest data received from neighbours.

Figure 3 depicts this progression as a state machine. States s₁ and s₂ correspond to the OQ phase; s₃ to the NM phase; and s₄ and s₅ to the finalisation phase. State s₆ is the halting phase of the system's evolution. In an application of NMP, phases 2 and 3 do not change; only phase 4 is replaced. The system's evolution is sketched at a high level in Algorithm 1. We assume that, at the start of the system's evolution, the correct number of proxels are already in place, and that they each hold an appropriately initialised iteration counter and their relevant channel endpoints only. Everything else required by each proxel will be supplied by the oracle or generated by rules during the proxel's evolution. The rules for the globally synchronous system are listed in Ruleset 1 and explained below. We first define the globally synchronous NMP system, explain the intended operation of the ruleset, and then describe each of the ground terms used.
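The following Python sketch captures the intent of the globally synchronous loop at roughly the level of Algorithm 1; the grid wiring, oracle and finalise functions are illustrative stand-ins rather than the actual Ruleset 1 rules.

```python
# High-level sketch of the globally synchronous NMP loop (cf. Algorithm 1):
# every proxel advances in lock-step through OQ (oracle query) and NM
# (neighbourhood messaging) each iteration. The grid wiring, 'oracle' and
# 'finalise' functions are illustrative assumptions, not the paper's rules.

def synchronous_nmp(neighbours, initial, iterations, oracle, finalise):
    """neighbours: proxel -> list of neighbouring proxels.
    initial: proxel -> initial datum (the zeroth-generation 'messages')."""
    inbox = {p: {q: initial[q] for q in neighbours[p]} for p in neighbours}
    for _ in range(iterations):
        # OQ phase: each proxel asks its oracle for the new outgoing messages,
        # each built from all received data except the intended recipient's.
        outbox = {p: {q: oracle([d for r, d in inbox[p].items() if r != q])
                      for q in neighbours[p]}
                  for p in neighbours}
        # NM phase: all messages for this generation are exchanged simultaneously.
        inbox = {p: {q: outbox[q][p] for q in neighbours[p]} for p in neighbours}
    # finalisation phase: each proxel computes its output from the latest data.
    return {p: finalise(list(inbox[p].values())) for p in neighbours}

if __name__ == "__main__":
    # tiny 2 x 2 grid, four-neighbourhood reduced to the existing neighbours
    grid = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    data = {0: 1, 1: 2, 2: 3, 3: 4}
    print(synchronous_nmp(grid, data, 2, sum, sum))
```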

System definition
By Sect. 2.2, a cP system can be defined as a 6-tuple (T, A, O, R, S, s). T is a set of top-level cells, each being a single proxel, numbered from one to the total size of the lattice; these numbers correspond to the proxel IDs of Fig. 2. Each proxel's starting multiset in O is empty, barring appropriate channel endpoints.

Definitions of terms used
Atoms
o: The label of the channel used to communicate with the oracle.
Functors
i: A generation counter. Used to count the number of generations of NM remaining before moving to the finalisation phase.
v: The v compound terms described in Sect. 2.3, except without the generation counter; the i counter serves the same purpose. These serve as NM messages.
w: The same as the v terms but used as messages to the oracle for OQ.
w′: The same as the w objects but marked to indicate to the oracle that they are to be used for finalisation rather than OQ.
z: Final output functor. Holds the result of the proxel's computation.

Processing element-specific asynchronous messaging
This section presents a variant of the above system where the rules are applied in a proxel-specific manner, and proxels run asynchronously (achieved by adding receipt tokens to the system), so that the steps of one proxel do not necessarily occur contemporaneously with those of another. There is no longer a single state spanning the entire system. Instead, each proxel has its own state dictating its operation independently. The same conceptual phases are kept, though the OQ and NM phases are combined because they now may occur interleaved. During the OQ and NM phase, each proxel independently works through a series of rounds, following the typical idea of a round in theoretical asynchronous systems: the proxel receives one or more messages, processes the messages and updates its internal data, then sends out new messages based on the results.
The most notable change from Sect. 3 is perhaps the use of receipt tokens. Receipt tokens record that the proxel has received a message from a particular neighbour since the last time the proxel sent out messages. This is vital to ensure that the correct number of messages are sent and sent only after receiving appropriate messages from other neighbours. Initialisation of the proxel also differs slightly from Sect. 3. Along with the channels, each proxel also starts with an adjacency list object, a(A) , which holds a copy of the atoms for each neighbour/channel.
Another important change is the augmentation of the data objects with message generations. These are equivalent to iterations, but because messages may arrive in different rounds, the messages must track their progression themselves. The data pre-populating the proxels are the iteration counter (now a "maximum generation counter") and zeroth-generation messages. Each outgoing message for generation n + 1 is based on the relevant received messages for generation n.
As with Sect. 3, we show in Fig. 4 the system's progression through its states. In this system, an OQ and NM round can be regarded as the steps taken to proceed from state s₁ until reaching s₁ again. Thus, receiving one or more messages, followed potentially by preparing and sending new messages, forms a round. The joint OQ and NM phase is made up of rules (1) through (6) plus (8) in Ruleset 2. Rules (7), (9) and (10) make up the finalisation phase. Again, we also supply a pseudocode interpretation of the procedure in Algorithm 2. An essential assumption for Ruleset 2 is that the oracle returns all expected results at the same step. After a proxel sends one or more OQ messages to the oracle, the proxel blocks as per Sect. 2.4 until it receives NM messages from the oracle. One NM message is expected per OQ message. We also assume this for the finalisation phase.
In preparing the new OQ messages to go to the oracle, rule (1) discards the information about which datum came from which neighbour. When adapting the rules to a particular problem, this must be borne in mind. The changes needed to preserve the data are likely minor but unexplored further here.

System definition
T is a set of top-level cells, each standing for a single proxel, numbered sequentially. These numbers correspond to the proxel IDs of Fig. 2. Each proxel's starting multiset in O is an adjacency list term a, its maximum generation counter i, its set of starting v messages and one receipt token per neighbour. For most proxels, the applicable rules are Ruleset 2, but for border and corner proxels, rule (1) is replaced with a rule from Ruleset 3 (see Sect. 4.4).

Description of rules
1. Consider in turn all receipt tokens r, if any (in any arbitrary order). For r(k₁) and v(k₁)(g₁)(_), compare generation number g₁ against the generation numbers of all other current values v(k_j)(g_j)(_), k_j ∈ {k₂, k₃, k₄} = {1, 2, 3, 4}⧵{k₁}. Without loss of generality, if g₁ ≤ g₂, g₃ and g₁ < I (i.e., the maximum generation count), then create a message w to send to the oracle for OQ. This message has the data for k₁, k₂ and k₃, which are used by the oracle in computing the new message to send to k₄. The generation count for this w message is g₁ + 1. (A sketch of how this round logic fits together is given after this list.)
2. If there were no OQ messages generated by rule (1), then move straight to the process of testing for termination and receiving new messages.
3. If there is more than one OQ message about a given neighbour, delete all but one of them. The messages will be identical, but duplicates may arise due to the unification of different neighbours to Y and Z in rule (1). This is precisely the rule from Sect. 2.1.
4. Send the remaining OQ messages to the oracle to compute new outgoing NM messages.
5. Forward newly prepared NM messages v from the oracle to the relevant neighbours.
6. Delete all extant receipt tokens after completing an OQ and NM phase.
7. If messages with a generation count equal to I have been received from every neighbour, transition to the finalisation phase.
8. Receive a message v(_)(G)(D) from neighbour X and replace the current v object for X with it. The neighbour label contained inside the message is overwritten with the current proxel's label for the receiving channel. Create a receipt token r(X).
9. Send all v messages to the oracle to compute the final result for the proxel.
10. Identical to rule (7) in Ruleset 1.
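The sketch below, in Python, shows how rules (1)–(6) and (8) combine into one asynchronous round for a four-neighbourhood proxel; the data layout and the summing oracle are illustrative assumptions only.

```python
# Sketch of one asynchronous OQ/NM round for a four-neighbourhood proxel,
# loosely following rules (1)-(6) and (8) of Ruleset 2. The oracle is replaced
# by a summing placeholder; all names here are illustrative assumptions.

def send(target, generation, datum):                           # stand-in for channel output
    print(f"send v({target})({generation})({datum})")

def receive(proxel, sender, generation, datum):
    """Rule (8): replace the stored value for 'sender' and record a receipt token."""
    proxel["v"][sender] = (generation, datum)
    proxel["r"].add(sender)

def round_of_sending(proxel, oracle=sum):
    """Rules (1)-(6): build deduplicated OQ messages, query the oracle, forward."""
    v, max_gen = proxel["v"], proxel["max_gen"]
    outgoing = {}
    for k1 in proxel["r"]:                                      # rule (1): per receipt token
        g1 = v[k1][0]
        if g1 >= max_gen:
            continue
        for target in v:
            if target == k1:
                continue
            sources = [k for k in v if k != target]             # the n - 1 source channels
            if all(g1 <= v[k][0] for k in sources):
                outgoing[target] = (g1 + 1, oracle([v[k][1] for k in sources]))
                # keying by target collapses duplicates, as rule (3) would
    for target, (gen, datum) in outgoing.items():               # rules (4)+(5): oracle, forward
        send(target, gen, datum)
    proxel["r"].clear()                                         # rule (6): delete receipt tokens

if __name__ == "__main__":
    node = {"v": {1: (0, 2), 2: (0, 5), 3: (0, 1), 4: (0, 4)},
            "r": {1, 2, 3, 4}, "max_gen": 2}
    round_of_sending(node)    # one generation-1 message per neighbour
    receive(node, 2, 1, 9)    # rule (8): a generation-1 message arrives from neighbour 2
    round_of_sending(node)    # not yet enough generation-1 data: nothing sent
```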

Definitions of terms
Atoms
The same as in Sect. 3.
Functors
i: Maximum generations counter.
n: Neighbourhood functor. Stores the label for one of the channels/neighbours to which the current proxel is connected.
r: Receipt token, showing that a message was recently received from neighbour X.
v: Serves as both the internal data stores and the NM messages of the system. These are the v compound terms described in Sect. 2.3.
w: Identical in most respects to the v functors, except used for messaging with the oracle instead of neighbouring proxels. That is, these are OQ, rather than NM, messages.
w′: Identical to w, except marked to indicate to the oracle that the messages sent are to be used for finalisation computations instead of OQ.

Rules for four-neighbourhood proxels with fewer than four neighbours
Ruleset 2 considers only the case where a proxel has exactly four neighbours. This does not work correctly for those proxels with three neighbours (i.e., on the grid border, but not in a corner) or two neighbours (i.e., in a grid corner). The only adjustment necessary to handle these edge cases is to change the number of neighbours considered in rule (1). These alternative rules are listed in Ruleset 3 as rule (1a) and rule (1b) for the border and corner cases, respectively.
An alternative is the use of 'dummy' or 'sentinel' proxels with special rules outside the grid, connected to the corner and border proxels. The specifics of the special rules are situation-dependent and so unexplored here.

Traces
A trace is a sequence of snapshots, connected by arrows ⇒ describing the evolution of message generations. A snapshot is an m-tuple, where m is the number of neighbours for the relevant proxel: (g 1 , g 2 , g 3 , g 4 ) for a four-neighbourhood, where g k is the generation number of the last message v(k)(g k )(_) received from neighbour k . The period between snapshots is a round, as described above. The starting state and finalisation phase may be included in a trace as single rounds along with OQ and NM.
By convention, for a four-neighbourhood proxel, the entries in a snapshot are written clockwise starting from the top. That is, the first entry is the neighbour above the current proxel, the second entry the neighbour to the right, et cetera. If there is no message about a given neighbour present inside the relevant proxel, '-' is used in place of the corresponding generation number.
Optionally, snapshots may be decorated further: each component g_k can be followed by a question mark (?), showing that a message was just received from that neighbour at the start of the current round, and each such g_k? can be further followed by one or more stars (*), one for each message sent because of the corresponding incoming message.
Contemporaneous receipts from different neighbours may lead to the same outgoing message(s). As a convention, the left-most g_k? which triggers the outgoing message will be marked with the corresponding *. For example, a snapshot where the latest iterations received from the first through fourth neighbours are 2, 3, 1 and 2, respectively, and a message has just been received from neighbour four, would look like (2, 3, 1, 2?*). A trace documenting this could look like (2, 3, 1, 1)₁ ⇒ (2, 3, 1, 2?*)₂.
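A small Python helper, given here purely as an illustration of the notation, can produce such decorated snapshots:

```python
# Sketch of the snapshot notation: generation numbers clockwise from the top,
# '?' marking a message just received, '*' marking messages sent because of it.

def snapshot(generations, received=(), stars=None):
    stars = stars or {}
    parts = []
    for k, g in enumerate(generations, start=1):
        entry = "-" if g is None else str(g)
        if k in received:
            entry += "?"
        entry += "*" * stars.get(k, 0)
        parts.append(entry)
    return "(" + ", ".join(parts) + ")"

if __name__ == "__main__":
    print(snapshot((2, 3, 1, 1)))                                # (2, 3, 1, 1)
    print(snapshot((2, 3, 1, 2), received={4}, stars={4: 1}))    # (2, 3, 1, 2?*)
```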

Alternative 'locally synchronous' rules
It is possible to define a ruleset where each proxel is locally synchronous. The difference between the locally synchronous and asynchronous NMP processes is that a proxel working in the asynchronous style will start a new round after receiving any message(s)-it only needs to receive one before starting a new round. In contrast, a locally synchronous proxel will remain in a blocking wait until it has received messages for the next generation from all of its neighbours.
Going from the asynchronous ruleset to a locally synchronous one requires only changing Ruleset 2's rule (8), splitting it into two related rules, as shown in Ruleset 4. Combined, the two change the stopping condition of the message receipt process. Where the asynchronous rules block until a message arrives over any channel, the locally synchronous rules block until exactly one message has been received from every neighbour. Rule (8a) terminates the blocking wait once all messages are received. Rule (8b) continues message receipt but prevents receiving more than one message from a neighbour during the same round.
Similarly, the pseudocode from Algorithm 2 requires only two changes, on lines 20 and 21. On line 20, the condition for the while loop changes from |R| = 0 to R ≠ A , reflecting that the proxel now waits until it has received a message from every neighbour, rather than one or more. On line 21, the set from which the proxel receives changes from A to A⧵R , reflecting that only a single message should be received from each neighbour during the current round of NMP.
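The two waiting conditions can be contrasted in a short Python sketch; receive_any and receive_from stand in for blocking receives over the relevant channels and are illustrative assumptions.

```python
# Sketch of the waiting conditions on the receipt loop. A is the set of
# neighbours, R the set of neighbours from which a message has been received
# this round; the receive functions are placeholders for blocking receives.

def asynchronous_wait(A, receive_any):
    R = set()
    while len(R) == 0:                  # wait for at least one message (any neighbour)
        sender = receive_any(A)
        R.add(sender)
    return R

def locally_synchronous_wait(A, receive_from):
    R = set()
    while R != A:                       # wait until every neighbour has reported in
        sender = receive_from(A - R)    # at most one message per neighbour per round
        R.add(sender)
    return R

if __name__ == "__main__":
    import random
    neighbours = {1, 2, 3, 4}
    fake_receive = lambda allowed: random.choice(sorted(allowed))
    print(asynchronous_wait(neighbours, fake_receive))
    print(locally_synchronous_wait(neighbours, fake_receive))
```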
In the context of cP systems, there is no meaningful difference between the end result of the globally synchronous ruleset of Sect. 3 and that of the locally synchronous rules. In practice, with implementations on standard electronic computers, however, there is typically a difference in total running times, due to eliminating the need to synchronise across the entire lattice. Each proxel now needs to wait only for its slowest neighbour during a given round, rather than the entire lattice waiting for the overall slowest proxel. The comparative performance of all three of the globally synchronous, locally synchronous, and asynchronous rules is investigated further in Sect. 8.6.
The locally synchronous rules necessarily work similarly to the asynchronous rules regarding receipts. In both cases, the messages due from every neighbour for the next generation may not all arrive during the same rule execution step. The locally synchronous version blocks until it has received a message for the next generation from every neighbour, instead of moving on after receiving any message. Sending differs more. In the asynchronous system, it is possible for a proxel not to have received all messages needed to compute the new outgoing message for a neighbour at the start of a round. This is untrue for the locally synchronous version, which guarantees that a proxel has sufficient information for every outgoing message at the start of the next round.

Asynchronous system example
This section steps through a simple example of the evolution of Sect. 4's asynchronous variant of NMP. The grid in question is found in Fig. 5, and this example focuses on the node marked "1". For simplicity, but without loss of generality, we focus on a border point in the grid, i.e., a proxel with a three-neighbourhood. Two generations of NM are shown (i.e., the maximum generation count is 2: i(2)) for demonstration purposes. Node 1 starts in state s₁ and its adjacency list object is a(n(2) n(3) n(4)).
Embodying the unknown processes of the oracle, the data for the incoming v and w terms have been randomly generated for this example and have no further significance.
To supply a partial demonstration of the expected operation of the rule(s) which would replace the oracle, we compute the contents of outgoing messages with a simple pair of oracle rules that merely sum the data received and return the newly computed outgoing message to the relevant node.
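A Python sketch of this stand-in oracle behaviour (illustrative only, since the real internal computation is problem-dependent) is:

```python
# Sketch of the example oracle used in this walk-through: it simply sums the
# data carried by an incoming w(X)(G)(d(..) d(..)) query and returns the new
# outgoing NM message v(X)(G)(sum) for the same neighbour and generation.

def summing_oracle(target, generation, data):
    return (target, generation, sum(data))       # v(target)(generation)(sum)

if __name__ == "__main__":
    # e.g. the round-one query for neighbour 4, w(4)(1)(d(1) d(4)):
    print(summing_oracle(4, 1, [1, 4]))          # (4, 1, 5), i.e. v(4)(1)(5)
```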
The total trace for this example can be assembled from the rounds described below. The state of node 1 at the end of each round is shown in Table 1. The round number for each row in the table matches the round number described further below. Where the contents of a functor change between the end of one round and the end of the next, the functor and its changed contents are written in boldface to highlight the modifications.

[Fig. 5: Basic 3 × 3 grid graph used for the example. This example will focus on the perspective of node 1. Other nodes are assumed to work equivalently, but their evolution is not discussed. Nodes are labelled internally with their ID in the system (cf. Fig. 2), and each intermediary channel with the identity assigned to it by the nearby proxel.]

[Table 1: Objects present inside node 1 at the end of each round in the example (columns: round number; all objects in node 1 at the end of the round). Functors whose contents have changed from one step to the next, and the changed contents themselves, are marked in boldface.]

Round one. Node 1 begins round one pre-populated with the initial messages 'from' each neighbour plus the receipt tokens, its adjacency list, and its maximum generations counter. Having up-to-date messages and receipt tokens for all three neighbours makes rule (1a) applicable, and node 1 enters the sending cycle. Two new OQ w messages are generated relating to each neighbour (because the other two neighbours will take the place of Y in applications of rule (1a); recall that there is no Z in the rule for node 1, as it only has three neighbours). Rule (3) eliminates these duplicates, however, leaving one copy of each message for rule (4) to send to the oracle. These outgoing messages to the oracle are w(2)(1)(…), w(3)(1)(… d(7)) and w(4)(1)(d(1) d(4)) for neighbours 2, 3, and 4, respectively. Node 1 then enters a 'blocking wait' (see Sect. 2.4) and stays inactive until the oracle returns the new NM messages to send to each neighbour. Rule (5) sees node 1 receive those messages and then immediately forward them to the neighbours. Following the example oracle rules described above, the messages sent by node 1 are v(2)(1)(11), v(3)(1)(8) and v(4)(1)(5) for neighbours 2, 3, and 4, respectively.
The application of rule (5) brings the sending cycle to a close, and node 1 moves to state s 4 . In this state, rule (6) clears away any receipt tokens present in the proxel and then moves to the termination check of rule (7). At this point, none of the v messages hold generation counters equal to the maximum generation count, so processing will continue. The round ends with node 1 entering another blocking wait until messages arrive from one or more of its neighbours.
Round two. In the second round, node 1 receives messages from only neighbours 2 and 4. This generates receipt tokens for those neighbours and sees node 1 enter a new sending cycle. The presence of receipt tokens for both neighbours 2 and 4 leads to the creation of two duplicate w OQ messages about neighbour 3. In effect, 2 and 4 swap roles as X and Y in rule (1a). Again, rule (3) eliminates the duplicate, and the single message w(3)(2)(d(9) d(5)) is sent to the oracle with rule (4). In turn, by rule (5), node 1 receives back v(3)(2)(14) and forwards it to neighbour 3, completing the sending cycle. As before, node 1 moves from sending to clearing receipt tokens, checking whether it has completed NM, and finally becomes quiescent once more, awaiting further messages from neighbours.
Round three. Round three begins with receiving a single message, the final message to be received from neighbour 4. Unlike the earlier rounds, no new messages are sent out due to said receipt. Neighbour 4 was already one of the neighbours with the highest generation count among the messages stored in node 1, and so while it can serve as X in rule (1a), neither neighbour 2 nor neighbour 3 can serve as Y because their generation counts are lower than neighbour 4's.
Since rule (1a) does not apply, rule (2) is applied-the first time in the node's evolution. Rule (2) unconditionally transitions node 1 to removing all extant receipt tokens, before it progresses to the termination check, then returns to awaiting new messages.
Round four. Neighbour 2 sends its final message, and neighbour 3 its first message. The receipt from neighbour 3 leads to new messages relating to neighbours 2 and 4. The receipt from neighbour 2 could ordinarily lead to another message to be sent to neighbour 3, as it would pair up with neighbour 4 to act as X and Y, respectively, in rule (1a). This does not occur, however, as neighbour 2 (and 4) has reached the maximum generation count, so rule (1a) is no longer applicable for it.
Round five. Only one more message is still to be received, from neighbour 3 specifically. Node 1 will wait until said message is received, but at that point, all message generation counters will be equal to the maximum, and thus no further messages will be sent out to the neighbours.
Following the application of rule (8) to receive the final message from neighbour 3, node 1 returns to state s₁ and the rules are again explored in top-down order. This time, rule (1a) is inapplicable, so rule (2) is applied, transitioning node 1 to state s₄. Rule (2) is the only rule applied at this step because it is the only rule which moves from state s₁ to state s₄. Next, rule (6) deletes the last receipt token. Again, rule (6) is the only rule applied because it is the only one to feature its start and end states. Finally, without any v messages with a generation count below the maximum present in the proxel, node 1 moves to state s₆ and the finalisation phase by rule (7).
Round six. Round six is the final round of the system's evolution, for node 1 at least. With node 1 in state s₆, the only applicable rule is rule (9). The effect of rule (9) is to send every v message stored in node 1 to the oracle, renaming them to w′ messages so that the oracle recognises them as finalisation messages rather than OQ messages for NM purposes.
Eventually, the oracle will return the final result message and the evolution of node 1 halts. When all nodes halt, the evolution of the system will have terminated.

Asynchronous system analysis
This section examines and analyses the asynchronous proxel-specific system of Sect. 4. Firstly, we describe and prove several properties of the system, particularly that the asynchronous system exchanges precisely the same number of NM messages as the globally synchronous system of Sect. 3. Then, we state three conjectures about the system, which we have yet to prove or disprove mathematically, before examining the asynchronous system's complexity from the standpoint of the rules and symbols used.
Recall that the main difference in behaviour between the globally synchronous, locally synchronous and asynchronous variants is the timing of when they send out new messages, based on when they have received messages from the previous generation.

Asynchronous messaging properties and behaviours
Consider a top-level cell/proxel 0 in the asynchronous framework of Sect. 4, where the maximum generation count is I, specified via i(I). Without loss of generality, we consider only a four-neighbourhood system and assume that 0 is a non-border/non-corner proxel with four neighbours, k, accessible via channels labelled by k ∈ {1, 2, 3, 4}.

Lemma 3 For each channel k, the generation number appearing in proxel 0's current k-tagged value v(k)(_)(_) starts at 0 and only ever increases, by exactly one at each change.

Proof By induction. Proxel 0 starts with value v(k)(0)(_). Assume that this hypothesis holds for a given g ≥ 0, appearing as a generation number in value v(k)(g)(_). Rule (8) gives the only possibility of changing this number, by replacing the current value by another v value, where the generation number is strictly greater by one, v(k)(g + 1)(_). ◻

Remark 4
The conclusion of Lemma 3 will hold even in the hypothetical case that proxel k would be allowed to send multiple copies of the same message v. Only the first received copy will be accepted, and the rest will be ignored, courtesy of rule (8). However, this is not the case, as excess w OQ messages are deleted by rule (3).

Lemma 5 Proxel 0 receives the messages sent over each channel k in increasing generation order, missing none.

Proof Channels are reliable, so no message is lost. Channels in cP systems are not inherently FIFO, but proxel 0 will still pick messages with increasing generation numbers, by Lemma 3. As defined by the asynchronous version of the model, any messages arriving out-of-order are not lost but stored in a multiset at the end of the channel and will be picked when their time comes. ◻
Lemma 6 Proxel 0 sends a generation g value v (0 < g ≤ I) over channel k if and only if it has received a generation g − 1 value v from each of the other three channels.

Proof Only if. Assume that 0 sends out a generation g message v over channel k. This may only happen by rule (5), where 0 just forwards the same message v, received as w from its oracle. Rule (4) triggers the oracle to build this message w, based on three other v messages v(k₁)(g − 1)(_), v(k₂)(_)(_) and v(k₃)(_)(_), where k, k₁, k₂, k₃ are all distinct. By Lemma 5, 0 must have also received two generation g − 1 values v, from k₂ and k₃. Thus, proxel 0 has received three generation g − 1 values v, one from each of k₁, k₂ and k₃.

If. These values need not arrive at the same round. Without loss of generality, assume that 0 has just received the first value, v(k₁)(g′)(d₁), while the other values may have been already received and even updated, v(k₂)(g′ + h₂)(d₂) and v(k₃)(g′ + h₃)(d₃), with h₂, h₃ ≥ 0, where g′ = g − 1. Rule (1) applies and creates at least two identical w messages w(k₄)(g′ + 1)(d), where k₄ = k is the remaining channel. Additional w messages may be created, if either h₂ = 0, or h₃ = 0, or both. This does not matter, as excess w messages are deleted by rule (3), so only one w(k₄)(g′ + 1)(d) remains. This message goes to the oracle, which creates and returns the value w(k₄)(g′ + 1)(d′) = v(k₄)(g)(d′), further forwarded by rule (5) to k₄, over channel k₄. ◻

Remark 7 Lemma 6 also holds for all variants. The critical difference is that with the asynchronous variant, the new outgoing message for generation g is sent as soon as possible after those three messages for generation g − 1 have been received. By contrast, the other two variants await receipt of the final generation g message before sending any new ones.

Theorem 8 For each generation g ∈ [0, I]: (i) proxel 0 receives one generation g value v over each channel k; and (ii) if g < I, proxel 0 sends one generation g + 1 value v over each channel k.

Proof By bounded induction on g. Base case: g = 0. Each proxel 0 begins with ("receives") four k-tagged generation 0 values, one for each k ∈ {1, 2, 3, 4}. By Lemma 6, applied four times, 0 sends out four generation 1 values v, one over each channel k.

Induction step: assume the induction hypothesis holds for g ∈ [0, I), and we prove that it still holds for g + 1. The induction hypothesis holds for all proxels, including 0 and k. Thus, by part (ii), each proxel k sends a generation g + 1 value v, which is further received by 0, as v(k)(g + 1)(_). This proves part (i) of the induction step. If g + 1 < I, rule (7) does not apply and, as shown by Lemma 6, applied four times, 0 sends out four generation g + 2 values v, one over each channel k. This proves part (ii) of the hypothesis. ◻

Theorem 9 Proxel 0 does not send any message to its neighbours at generation g > I.
Proof OQ w messages (and thus NM v messages) are created only by rule (1), with generation counts g + 1 , only if g + 1 ≤ I . The promoter i(G1_) for rule (1) prevents the creation of any w message with a larger generation count. ◻

Theorem 10
For each proxel 0, the neighbourhood (inter-proxel) messaging loop stops after receiving the last generation I value.
Proof After receiving the last generation I message, rule (1) does not apply (per Theorem 9), and instead control flows by rule (2) to state s₄, then to s₅ by rule (6). Finally, rule (7) applies, and the control flow breaks out of the main messaging loop, proceeding to state s₆ and the finalisation phase, where no further NM occurs. ◻

Remark 12
The evolution of our asynchronous version can be suggestively summarised by analysing the following four typical cases of proxel 0 snapshots, which focus on the received values having the least generation number, here indicated by g. Note that each star (*) shows the channel of the incoming value that triggers the sending of a generation g + 1 message, not the channel over which this is sent. Without loss of generality, in the case of confluent nondeterminism, stars are allocated to the first possible choice (for illustrative purposes).
1. (g?***, g?*, g?, g?): 0 receives all four generation g values in the same round, by rule (8). As shown by Lemma 6, applied four times, 0 sends four generation g + 1 values. Rule (1) will initially create six OQ messages per neighbour (for a total of 24), but the excess ones will be deleted, so exactly one w message per channel will remain and be sent to the oracle.
2. (g?***, g?*, g?, g₄), g < g₄: 0 receives three generation g values in the same round, by rule (8). The fourth value is at least one generation ahead. This case evolves similarly to case (1) above. As shown by Lemma 6, still applied four times, 0 sends four generation g + 1 values. The rules will create four OQ messages for each of the first three channels, and six OQ messages for the last channel (for a total of 18), but the excess ones will be deleted, so precisely one OQ message per channel will remain and be sent.
3. (g?***, g?*, g₃, g₄), g < g₃, g₄: 0 receives two generation g values in the same round, by rule (8). The last two values are at least one generation ahead. This case evolves similarly to case (2) above. As shown by Lemma 6, still applied four times, 0 sends four generation g + 1 values. The rules will create two OQ messages for each of the first two channels, and four OQ messages for the last two channels (for a total of 12), but the excess ones will be deleted, so only one OQ message per channel will remain and be sent.
4. (g?***, g₂, g₃, g₄), g < g₂, g₃, g₄: 0 receives a single generation g value in this round, by rule (8). The last three values are at least one generation ahead. As shown by Lemma 6, applied three times, 0 sends three generation g + 1 values. The rules will create two OQ messages for each of the last three channels (for a total of 6), but the excess ones will be deleted. The question arises: when will 0 send its fourth generation g + 1 value, over channel 1? Brief answer: that message was already sent, before the current round! Indeed, Lemmas 3 and 5 show that, from each channel k ∈ {2, 3, 4}, 0 has received, in order, messages with generation numbers g, g + 1, …, g_k. Consider the round when 0 receives the last such message for generation g. Without loss of generality, assume that this was from channel 2. Thus, at that round, the summary snapshot was (g₁, g?, g′₃, g′₄), with g₁ < g ≤ g′₃, g′₄. By Lemma 6, one generation g + 1 value was then sent over channel 1.
Note also that, at the outset of cases (1, 2, 3), no other messages for generation g + 1 will have been sent yet because too few messages at generation g have been received.

Rule and symbol complexity
Overall, the rule and symbol complexity of the system is relatively low, but given that some elements of any specific NMP process are abstracted away (e.g., by the oracle), this is likely an underestimate for a given final implementation.

Rule complexity
The total system requires ten rules: eight for the joint NM and OQ phases (only two of which directly involve interaction with the oracle, though others prepare for said interaction) and two for the finalisation phase (both of which interact with the oracle too). The rules that involve direct interaction with the oracle, rules (4), (5), (9) and (10), will normally be replaced with new problem-specific rules, however. Thus, it can be said that there are six rules relating to the structure and process of NMP and four stub rules which stand in for a more complex process.

Symbol complexity
Across all ten rules, two atoms, eight functors, and eight states are used. This gives a total symbol complexity of 18 distinct symbols. As with the rule complexity, more symbols will likely be used when the OQ phase is expanded to encompass the desired computation process.

Generational confluence
Both synchronous variants are essentially deterministic, and thus the final result is deterministic and confluent for a given lattice, ruleset and starting state. With the locally synchronous variant, there can be different orderings between sends and receipts of messages, but since the generations of the input messages are guaranteed, there will be no difference in the outgoing messages. Not even this will occur with the globally synchronous variant, though: every message for a generation is instead exchanged simultaneously with every other one, and thus there can be no non-determinism.
The asynchronous system is not (necessarily) confluent, however. Messages may be received in such an order that the contents of some messages might not be used in the preparation of new messages, since two messages can potentially be received from one neighbour between receipts from a different neighbour. Nevertheless, as Remark 12 shows, for a proxel to send a message for generation g + 1 to neighbour k, it must have received messages for at least generation g from all other neighbours. Therefore, no input data will be 'out-of-date'. Thus, we conjecture that the final results should be similar for many "well-defined" cases. We term this concept generational confluence.
Conjecture 13 An NMP system following the asynchronous rules of Sect. 4 will be approximately/generationally confluent. Under a wide variety of practical conditions (e.g., when the base globally synchronous version is convergent), every evolution of the system will produce a similar result, no matter the specific ordering of the messaging. Experimental evidence in Sect. 8.7 supports this conjecture.

Conjecture 14
The asynchronous behaviour might lead to a 'better quality' result, for a particular domain-specific meaning of quality. Specifically, the potential for proxels to forgo the use of data from particular messages in favour of newer data will lead to a final output no worse than the synchronous variants'.
Determining the truth or falsity of Conjecture 13 is still an open problem requiring further study (but see Sect. 8.7). Ideally, one would also clarify and quantify what "similar" means in this context. We intend to investigate Conjecture 14, as applied to belief propagation (BP) stereo matching (SM), in more detail in future work.

Conjecture 15 No matter the NMP variant employed, the most generations any given proxel may be ahead of any other proxel on the lattice depends on their distance from each other, as measured by the smallest number of intermediate proxels along any path between the two.
The requirement to have received messages from every other neighbour before preparing the message to send to the final neighbour means that, at a certain point, a proxel transitively depends on its neighbours' neighbours in an ever-expanding radius. A given proxel in a four-neighbourhood cannot send a new message to neighbour 4 until it has received messages from neighbours 1-3. In turn, neighbour 4 cannot send messages to any of its other neighbours besides the original proxel until it has received the due message from the original proxel. The same then applies to those neighbours of the original proxel's neighbour 4, et cetera.
This reasoning gives rise to the intuition in Conjecture 15. Suppose one proxel stops sending messages but has not yet reached its maximum generation. In that case, the neighbouring proxels will successively be unable to send further messages themselves, which will ripple across the lattice and eventually prevent further proxels from sending further messages as well. Until the ripple has progressed across the lattice, however, the more remote proxels may still send new messages. Therefore, in all cases where two proxels have a path between them, there is an interdependency, albeit in some cases far removed. We have not yet found a suitable and convincing formal proof of this idea.

Experiments
We performed experiments with short prototype programs to validate Ruleset 2 and investigate the comparative properties of the NMP variants. In particular, we have: (i) constructed visualisations of the 'spread of influence' across a grid of proxels when using different functions to compute each proxel's value, and with differing delays on channels based on their direction of connection; (ii) simulated the operation of the ruleset without actually passing messages, to verify empirically that Theorem 11 holds for an arbitrary proxel; (iii) implemented and benchmarked the three variants of the NMP algorithm operating on a four-neighbourhood grid, to assess whether the asynchronous version genuinely completes faster on average; (iv) investigated the variants' convergence behaviour; and (v) compared the timings of the variants for when the first, 'average' and last proxel finish. Source code for the described scripts and programs may be found at https://github.com/jcoo092/NeighbourhoodMessagePassingExperiments.
The cP systems models proposed were a constructive starting point for developing the experimental programs. The models led quite directly to the implementations, with most subsequent development performed to fit the model better to the host language's features. The converse is also true: implementing the programs uncovered problems in earlier drafts of the rules.
One of the more significant examples of this benefit comes from early development: the validation program, as it then was, would usually end normally but occasionally crashed or entered an infinite loop. Investigating the cause revealed a potential race condition in Ruleset 2 (as it was then); depending on scheduling, proxels could send either too few or too many messages. Practical experimentation thus led directly to a more robust cP systems ruleset.

Visualisations
Sequences of visualisations of the evolution of a sample four-neighbourhood system were created to explore the messaging behaviour of NMP. These sequences reflect the changes on a small grid that result from NMP under Sect. 5's system. Each square on a grid is one proxel, and the proxels are assumed to be connected in the usual four-neighbourhood manner.
Each sequence was generated using a script that computed values over the grid at the end of each round of messaging and wrote the results out to a file. In each case, the grid starts with all zero values, except for the centre proxel, which starts with a 'full-strength' internal datum; these are represented by wholly white and completely black squares, respectively. The value to display at a given grid point for each round was, in turn, computed by selecting the maximum from among the data stored in the corresponding proxel. Each sequence uses a different method for computing the values present in a given square, and so depicts a different aspect of the spread of the 'influence' of the initial data in the centre proxel. The approach used here was effectively the locally synchronous variant, but the results apply equally to the other variants.
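As a concrete illustration, the following is a minimal C# sketch of the kind of driver loop just described. The grid size, round count, output format and helper names are assumptions made for illustration only, not a transcription of the original script.

```csharp
using System;
using System.IO;
using System.Linq;

class VisualisationDriver
{
    const int Size = 9;      // assumed grid side length
    const int Rounds = 8;    // assumed number of messaging rounds

    static void Main()
    {
        // Every proxel starts at 0.0 except the centre, which holds a
        // 'full-strength' datum (1.0, rendered as a black square).
        var values = new double[Size, Size];
        values[Size / 2, Size / 2] = 1.0;

        for (int round = 0; round < Rounds; round++)
        {
            values = OneRound(values);                 // one round of NMP updates
            WriteGrid(values, $"round_{round}.csv");   // displayed value: max of the proxel's stored data
        }
    }

    // Placeholder for one round of message passing; the per-proxel update
    // functions are sketched further below.
    static double[,] OneRound(double[,] grid) => grid;

    static void WriteGrid(double[,] grid, string path)
    {
        using var writer = new StreamWriter(path);
        for (int r = 0; r < Size; r++)
            writer.WriteLine(string.Join(",",
                Enumerable.Range(0, Size).Select(c => grid[r, c].ToString("F3"))));
    }
}
```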

Maximum over messages received and internal datum
The sequence in Fig. 6 was produced using the maximum of the proxel's current messages received from the other neighbours and the value stored internally. At each step, the datum was updated to be the maximum of itself and the messages received. The net effect was to ensure that once a message carrying information from the centre proxel arrived, the current proxel would turn black and remain that way. Figure 6 shows that this information propagates outwards as expected.
More interesting is Fig. 7. The same update computations were performed, but a two-round delay was introduced for the delivery of messages to the neighbours below and to the left of the sending proxel. Messages destined for the neighbours to the right and above the sending proxel were delivered without delays, as in Fig. 6. The periodic symmetry of the shape may seem surprising at first but is logical. Every proxel on the grid is a neighbour of those to its top and right and will only receive messages from those neighbours after the two-round delay, which consequently means that they, in turn, only send out messages to their bottom and left neighbours after that delay. See also Conjecture 15. What might happen if there is a systematic delay in only one direction, so that messages from all other neighbours might arrive relatively quickly? This is examined in Fig. 8, where a two-round delay was introduced for messages sent to the left only. In this case, the overall shape formed by the proxels on the grid that have received a message containing 'influence' from the centre proxel still invariably ends up returning to a symmetrical, balanced shape. It is a different shape, however, from that seen in either Fig. 6 or Fig. 7, and is not centred on the centre proxel. Instead, it grows with a bias towards the right.

Maximum over messages received
The sequence in Fig. 9 was produced by taking the maximum over the messages received from the other neighbours when computing the new outgoing messages for each proxel. 6 As with the sequence in Fig. 6, the influence from the centre proxel spreads outward, similar to an expanding '+' shape. Notably, however, points in the grid (except the centre proxel for one round) oscillate between the minimum and the maximum, showing that the influence from the centre proxel is only relevant for every second message update computation. Felzenszwalb and Huttenlocher [10] noticed this oscillating behaviour and used it to halve the number of computations needed to produce identical results in their improved loopy BP SM implementation.

Mean over messages received
The sequence in Fig. 10 was produced using the mean of the messages received from the other neighbours when computing the new outgoing messages for each proxel. This sequence makes it clear that as the influence from a given proxel diffuses across the grid, its impact upon the proxels gradually dilutes. After just four rounds of message passing, the 'signal strength' from the centre proxel has become so weak as to be almost imperceptible in the most distant locations it has reached, e.g., points (0, 4) and (4, 8), assuming a zero-based grid count beginning in the upper-left corner.
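For concreteness, the three per-proxel value computations used for Figs. 6–10 can be sketched as follows. This is a minimal C# rendering with illustrative names; the original scripts may have differed in detail.

```csharp
using System;
using System.Linq;

static class UpdateRules
{
    // Figs. 6-8: maximum over the messages last received and the internal datum;
    // the datum is then replaced by this maximum, so 'influence' is sticky.
    public static double MaxWithInternal(double internalDatum, double[] received) =>
        Math.Max(internalDatum, received.Max());

    // Fig. 9: maximum over received messages only; without the sticky internal
    // datum, values oscillate between the minimum and maximum.
    public static double MaxOverMessages(double[] received) =>
        received.Max();

    // Fig. 10: mean over received messages, so influence dilutes as it diffuses.
    public static double MeanOverMessages(double[] received) =>
        received.Average();
}
```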

Validation of rules for asynchronous neighbourhood messaging
A short script was written in F#. 7 This script takes the perspective of a single arbitrary proxel on a four-neighbourhood grid and simulates NMP. Actual data are irrelevant to this test, so OQ updates were omitted. The proxel 'receives messages' from one or more randomly selected neighbours during each round. Said receipts are represented by incrementing the received-message generation counts and creating receipt tokens. The proxel then generates NM messages per Ruleset 2 based on the generation counts and receipt tokens and increments the sent-message counts as appropriate. Finally, the script prints the results and then terminates once messaging completes. Inspecting the state of the proxel at various points in the system's execution strongly supported Theorem 11. If exactly g messages were received from each neighbour, where g is the maximum generation count, the same number of messages would consequently be sent. Furthermore, messages were indeed sent based on the ordering of messages received, as expected.
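The sending condition the validation script exercises can be illustrated with the fragment below. It is only a C# rendering of the n − 1 condition from Remark 12, with hypothetical names and counters; it is not a transcription of the F# script or of Ruleset 2.

```csharp
using System.Linq;

class ProxelValidation
{
    const int Neighbours = 4;        // four-neighbourhood grid
    const int MaxGeneration = 20;    // assumed maximum generation count

    readonly int[] received = new int[Neighbours]; // highest generation received per neighbour
    readonly int[] sent = new int[Neighbours];     // highest generation sent per neighbour

    // A message for generation g + 1 may be sent to neighbour k only once messages
    // for at least generation g have arrived from every *other* neighbour (Remark 12).
    bool CanSendTo(int k)
    {
        int g = sent[k];                        // next message to k is generation g + 1
        if (g >= MaxGeneration) return false;
        return Enumerable.Range(0, Neighbours)
                         .Where(j => j != k)
                         .All(j => received[j] >= g);
    }

    public void Receive(int fromNeighbour) => received[fromNeighbour]++;

    public void SendWherePossible()
    {
        for (int k = 0; k < Neighbours; k++)
            while (CanSendTo(k)) sent[k]++;     // catch up as far as Remark 12 permits
    }
}
```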

Timing of variations
A C# program was written to investigate the comparative running time of each of the three variants described above using the four-neighbourhood grid. Each proxel is represented by a separate asynchronous Task from .NET's Task Parallel Library, 8 with communication between proxels conducted via a channel construct provided by .NET. 9 All three approaches are included in the same program and selected via a configuration file supplied at runtime. The use of Tasks, which the .NET runtime schedules to work in parallel when possible, means that all three variants are expected to run faster on a CPU with more cores. More proxels can execute physically (as well as logically) simultaneously on the larger CPU, so while the total CPU time used across all cores might stay constant, "wall clock" time should reduce.
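A minimal sketch of this kind of structure is given below, using Tasks and System.Threading.Channels. The message type, member names and the simplified waiting loop are illustrative assumptions rather than the benchmark program itself.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative message type: sender index, generation number and payload.
record Message(int FromNeighbour, int Generation, double Value);

class Proxel
{
    // One inbound channel per proxel; the writers point at the neighbours' inboxes.
    public Channel<Message> Inbox { get; } = Channel.CreateUnbounded<Message>();
    public ChannelWriter<Message>[] Neighbours { get; set; } = Array.Empty<ChannelWriter<Message>>();

    public async Task RunAsync(int myIndex, int maxGenerations)
    {
        for (int g = 1; g <= maxGenerations; g++)
        {
            // Send this generation's message to every neighbour.
            for (int k = 0; k < Neighbours.Length; k++)
                await Neighbours[k].WriteAsync(new Message(myIndex, g, ComputeOutgoing(k)));

            // Simplified locally synchronous behaviour: wait for as many messages as
            // there are neighbours before advancing (the real rules track generations
            // per neighbour; the asynchronous variant instead applies the n - 1
            // condition per neighbour and can run ahead).
            for (int i = 0; i < Neighbours.Length; i++)
                Store(await Inbox.Reader.ReadAsync());
        }
    }

    double ComputeOutgoing(int k) => 0.0;  // stands in for the oracle/OQ phase
    void Store(Message m) { }              // stands in for updating internal data
}
```

With one such Task per proxel, the .NET scheduler multiplexes the blocked proxels over the available cores, which is why more cores are expected to reduce wall-clock time.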
Multiple variables besides the variant were varied over different runs to explore the relative running-time behaviour of each variant. These included: the size of the grid; the maximum generation count; the use and length of a work-intensive wait period when computing a new message to send (essentially, the OQ phase); and the use and length of delays in the delivery of messages over the channels while permitting the proxels to continue operating, including different delay lengths for different directions.
Running time results were gathered using the Stopwatch class from .NET, which supplies access to underlying high-performance operating system time measurement facilities. The program was instrumented to start the timer immediately before the initialisation of the proxels, but after reading and parsing the configuration file. The timer is then stopped immediately after the last proxel finishes running when control flow returns to the program's main function. The total elapsed milliseconds are then printed to the command line. Numerical results are presented in Tables 2, 3 and 4. The numbers presented in the tables are the mean (rounded to the nearest whole number) of five runs for each variant under each parameter set.
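The instrumentation can be illustrated with a short sketch; proxel start-up is elided, and the placeholder Task array stands in for the real collection of proxel Tasks.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class TimingHarness
{
    static async Task Main()
    {
        // Configuration is read and parsed before the timer starts (omitted here).
        var stopwatch = Stopwatch.StartNew();

        Task[] proxelTasks = Array.Empty<Task>();   // placeholder for the proxels' Tasks
        await Task.WhenAll(proxelTasks);            // returns once the last proxel finishes

        stopwatch.Stop();
        Console.WriteLine(stopwatch.ElapsedMilliseconds);
    }
}
```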
Early testing showed no meaningful difference between variants regarding generation count, so in all instances, the results are from running the program with a maximum generation count of 20 and an oracle delay of approximately one-quarter of a millisecond per message. The delays over channels were varied between equal delays of 30 ms, three channels with 30 ms and one with 150 ms, and two channels with 30 ms and two with 150 ms. The test program was run on three different computers with three different CPUs, each with a different number of physical and logical cores. The CPUs were an Intel® Core™ i7-7700, with 4 physical/8 logical cores; an Intel® Core™ i7-8750H, with 6 physical/12 logical cores; and an Intel® Xeon® Silver 4116, with 24 physical/48 logical cores.
The data in Tables 2, 3 and 4 are also presented graphically as clustered bar charts, grouped by channel-delay configuration: equal delays on all channels; a longer delay on one channel; and longer delays on two channels. Within each bar cluster, in every instance, the bars are the timings for (from left to right) the globally synchronous, locally synchronous and asynchronous variants. The y-axes do not follow the same scale, however.
There appear to be only two consistent trends in these data. First, in almost all cases, the asynchronous approach was the fastest, although in some instances it was approximately the same as the locally synchronous case. Second, increasing the number of cores available appears to favour the asynchronous case. In all cases, the running time decreased with an increase in the core count, but the percentage decrease in running time was always greater for the asynchronous case than the globally synchronous case when comparing runs with the same parameters between the 4/8 and 24/48-core CPUs. We suspect that this latter trend means that the asynchronous version is more scalable with respect to the count of CPU cores, but we have not investigated this thoroughly yet.
We suspect there is likely some upper limit on the improvement in running time offered by the asynchronous version, however. The dependence upon the receipt of messages from every other neighbour before the preparation of a new outgoing message for the final neighbour ensures that any given proxel can only get so far ahead of its neighbours before it must wait for them (à la Conjecture 15).
The smallest performance difference on the same test between the globally synchronous and asynchronous cases was on the 6/12-core CPU, on a 299 × 299 proxel grid with equal delays on all channels, where the globally synchronous approach took only approximately 9% longer than the asynchronous version. The largest performance difference between the two was on the 24/48-core CPU, on a 99 × 99 proxel grid with two longer delays on the channels, where the globally synchronous approach took roughly 51% longer. The precise reason these were the relative fastest and slowest test runs is unclear. Across each CPU, the widest differences between the variants are seen when using a 99 × 99 proxel grid, or a total of 9801 proxels. Perhaps this is closest to a 'sweet spot' where the .NET runtime can best schedule blocked Tasks on each core without becoming overwhelmed by overheads related to their management. 10
Comparing the globally and locally synchronous cases, there is only one experiment where the mean running time for the locally synchronous variant was higher than for the globally synchronous case: the 499 × 499 grid with equal delays on all channels, running on the 4/8-core CPU. Even this is close, with only a 3.2% increase in running time. On many of the other tests, the locally synchronous version reduces the running time by roughly 5-13%, yet still produces the same answer to the computation at hand (see Sect. 8.7). On this basis, the locally synchronous approach can be regarded as strictly superior to the globally synchronous variant and should be favoured over it.

Convergence
Both synchronous variants will produce the same final computed result because the ordering of messages varies only within a single generation, and thus the inputs used to compute new messages will be the same at each round. As delineated in Conjecture 13, we also hypothesise that the asynchronous system will produce comparable but non-identical results to the other two. To gather evidence for this hypothesis, the test program used in Sect. 8.6 was further instrumented to provide output at the end describing the state of each proxel in the grid.
The grid was divided into quadrants, and each proxel was assigned an internal value. Proxels in the upper-left quadrant were initialised with a value of 0.0, those in the lower-left and upper-right received 0.5, and those in the lower-right received 1.0. The value for the next message to a neighbour was computed as the mean of the values received from the other three neighbours plus the proxel's own internal value. At the end of each round, proxels updated their internal values to the mean of the values most recently received from each neighbour (this bears some resemblance to the formula used for Fig. 10).
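The initialisation and update rules just described can be sketched as follows. Helper names are illustrative, and the outgoing-message formula reflects one possible reading of the description above (the mean taken over the three received values together with the proxel's own internal value).

```csharp
using System.Linq;

static class ConvergenceSetup
{
    // Quadrant initialisation: upper-left 0.0, lower-right 1.0, the other two 0.5.
    public static double InitialValue(int row, int col, int size)
    {
        bool top = row < size / 2, left = col < size / 2;
        if (top && left) return 0.0;
        if (!top && !left) return 1.0;
        return 0.5;
    }

    // Outgoing message to a neighbour: mean over the values received from the
    // other three neighbours together with the proxel's own internal value
    // (one reading of the description above).
    public static double NextMessage(double[] receivedFromOthers, double internalValue) =>
        receivedFromOthers.Append(internalValue).Average();

    // End-of-round internal update: mean of the values most recently received
    // from all four neighbours.
    public static double NextInternal(double[] lastReceived) =>
        lastReceived.Average();
}
```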
Once messaging finished, every proxel wrote its ending internal value to a file. At this point, the values were rounded to two decimal places to account for the inherent numerical instability of typical (e.g., those of IEEE standard 754 [1,13]) floating-point numbers. These values were then extracted and compared between runs of variants with the same input configuration.

Hamming distance
To provide a quick measure of the level of difference between variants, a Hamming distance was computed between proxels in the same grid position. Matched proxels with the same value were assigned 0, and matched proxels with different values were assigned 1. The total distance between the two sequences was the total of the assigned values, i.e., the total distance was computed as $d = \sum_{i=1}^{n} [a_i \neq b_i]$, where d is the Hamming distance, n is the total number of proxels in the grid, $a_i$ is the i-th entry in one of the sequences, $b_i$ is the i-th entry in the other sequence, and the bracket contributes 1 when the two entries differ and 0 otherwise. An average distance between variants was then calculated by dividing the sum by n. This average provides a rough measure of the level of difference in the final results between variants and is equivalent to the proportion of the grid's proxels with different results between variants.
Globally synchronous vs locally synchronous To verify that the globally synchronous and locally synchronous variants produce the same final result for every proxel, the results from the two variants were compared using the above formula. In every instance, the total distance was 0, meaning there was no difference whatsoever. This appears to confirm that the globally synchronous and locally synchronous variants always produce the same final result.
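The distance computation above can be sketched as follows; the final values are assumed to have been rounded to two decimal places already, per the procedure described earlier.

```csharp
using System.Linq;

static class HammingComparison
{
    // d = sum over i of [a_i != b_i]; the average distance d / n is the
    // proportion of proxels whose final values differ between the two runs.
    // Assumes both sequences have the same length (one entry per proxel).
    public static (int Total, double Average) Distance(double[] a, double[] b)
    {
        int d = a.Zip(b, (x, y) => x != y ? 1 : 0).Sum();
        return (d, (double)d / a.Length);
    }
}
```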
Locally synchronous vs asynchronous Knowing that the globally synchronous and locally synchronous variants produce the same result, the asynchronous variant was compared against only the locally synchronous one. The runs with equal delays on all channels produced the smallest distances in all cases. In most cases, the runs with a single unbalanced channel delay produced the largest differences, sometimes more than double that of those with two unbalanced channels. Tables 5, 6 and 7 list the total distance scores and the scores expressed as a percentage of the total proxel count. Smaller grids produce smaller absolute distances but greater relative (percentage) distances. Interestingly, proportionally the lowest scores in all cases were seen on the 299 × 299 grids. The CPUs with more cores tended to have higher distances on smaller grids but smaller distances on larger ones. We are yet to find a convincing explanation for the latter two observations.

Absolute difference
The Hamming distance gives a clear indication of the proportion of the grid which computed a different result under the locally synchronous and asynchronous variants, but it might under- or over-estimate the scale of those differences. Using the same empirical output data, the distance was therefore recomputed as the sum of the absolute differences in final proxel values between variants, i.e., $d = \sum_{i=1}^{n} |a_i - b_i|$, where n, $a_i$ and $b_i$ are the same as above, and d is the new distance measure. Tables 8, 9 and 10 show the results of computing the absolute differences, both as their sums and as those sums expressed as a percentage of the grid-wide sum of final proxel values from the globally synchronous/locally synchronous processing. The latter metric conveys the overall size of the difference in results. The results show that the difference is less than 2% in all cases and usually less than one-hundredth of 1% on larger grids, supporting Conjecture 13. This also appears to provide some evidence against Conjecture 14: it seems unlikely that a difference of under 1% will have much impact on the 'quality' of the final results.
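The corresponding absolute-difference measure can be sketched similarly (names are illustrative):

```csharp
using System;
using System.Linq;

static class AbsoluteComparison
{
    // d = sum over i of |a_i - b_i|, also expressed as a percentage of the
    // grid-wide sum of the synchronous variants' final values.
    public static (double Sum, double PercentOfSyncTotal) Distance(double[] syncValues, double[] asyncValues)
    {
        double sum = syncValues.Zip(asyncValues, (x, y) => Math.Abs(x - y)).Sum();
        return (sum, 100.0 * sum / syncValues.Sum());
    }
}
```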

Progressive completion of proxels
Conceivably, in some problem domains, valuable information could be gleaned from a partially completed NMP process. Given that each proxel runs independently, they will not necessarily all finish simultaneously, particularly with the locally synchronous and asynchronous variants. These two variants give each proxel greater latitude to act as and when appropriate and, as shown in Sect. 8.6, also tend to complete the entire process more quickly. These factors suggest that some proxels might complete their respective duties long before the grid as a whole has reached the end. If so, then problems in the aforementioned domains perhaps do not need to wait until the entire grid has finished before using the computed information or prompting some action.
To investigate the likelihood of a sizeable proportion of the proxels finishing early, the NMP program used in Sects. 8.6 and 8.7 was changed so that every proxel stores the elapsed time from the start of measurement to when it finishes processing. These times are then written out at the end with the other measurements collected. The measurements were processed to identify the minimum, mean, and maximum completion times for proxels on the grid, i.e., the number of elapsed milliseconds from the start of the NMP process until the first proxel recorded its completion time; the arithmetic mean of the completion times across the entire grid; and the elapsed milliseconds until the last proxel recorded its completion time.
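The per-proxel completion times can then be summarised as in the following sketch:

```csharp
using System.Linq;

static class CompletionSummary
{
    // Minimum, arithmetic mean and maximum of the recorded per-proxel
    // completion times (all in milliseconds since the start of measurement).
    public static (double Min, double Mean, double Max) Summarise(double[] completionTimesMs) =>
        (completionTimesMs.Min(), completionTimesMs.Average(), completionTimesMs.Max());
}
```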
Tables 11, 12 and 13 detail the recorded results from the progressive completion measurements across grids of assorted sizes. Here, grids of 11 × 10, 51 × 50 and 101 × 100 proxels were used. As above, all measurements are presented in milliseconds. The left-most column in each table is the number of proxels in the grid from which the timings were recorded. The second-left-most column lists which variant of the NMP system was used for that entry. The middle three columns record the total processing times for the fastest, average, and slowest proxels, respectively. The right-most two columns express the difference between the times for the fastest and slowest, or the average and slowest, proxels, respectively. In these latter two, the differences are expressed as a percentage of the slowest proxel's running time.
As a proportion of the maximum running time, the largest difference between the minimum and maximum running times is seen on the 110 proxel grid with the 8-core CPU using the locally synchronous approach, where there is still roughly 75% of the total time to go when the fastest proxel finishes. The same experiment also observed the largest gap from mean to maximum running times. Proportionately, the smallest difference between minimum and maximum is seen on the 110 proxel grid with the 48-core CPU using the globally synchronous approach, where there is just a 0.5% time difference. The smallest difference from the mean to the maximum is found in the 10,100 proxel grid with the 8-core CPU using the asynchronous approach, with a 0.14% difference. It appears that on smaller grids, the time difference from the fastest to the slowest proxels completing can be significant, but on larger grids, it is usually brief compared to the total running time.
In absolute terms, the largest difference between the minimum and maximum running times is observed on the 10,100 proxel grid with the 8-core CPU, using the locally synchronous approach, with a gap of 5.036 seconds. At 1.985 seconds, the largest gap from mean to maximum is also seen using the locally synchronous variant on the 10,100 proxel grid, but running on the 48-core CPU. The smallest minimum-to-maximum gap is 8 ms, observed on the 110 proxel grid with the 48-core CPU and using the globally synchronous variant. The shortest mean-to-maximum gap is seen in the same experiment, at 5 ms.
Visual comparisons of the information listed in Tables 11, 12 and 13 are shown in Figs. 14, 15 and 16. Each figure shows three line charts, which compare the timings (from left) for the grids with 110, 2550 and 10,100 proxels, respectively. (Fig. 15 charts the minimum, mean, and maximum termination times for proxels executing on the CPU with 12 cores, and Fig. 16 does the same for the CPU with 48 cores; see also Tables 12 and 13.) The charts highlight that the size of the gap between the minimum and maximum times for the globally synchronous approach tends to be much smaller, as is clear from the relatively flat slopes of its lines. The asynchronous systems mostly follow the same pattern as the locally synchronous ones. It is noticeable, however, that for the largest grid, on all three CPUs, the line segments between the mean and maximum completions are relatively flat, suggesting that the final proxel completes only a brief time after the average one. We are unsure of the reason for this, but we hypothesise that it relates to the release of system resources by completed proxels. As soon as an asynchronous proxel completes, its resources can be released, and the system scheduler no longer needs to account for it. With 10,100 proxels, it seems plausible that one or more system resources are over-saturated, or there are too many active threads for the scheduler to handle well. Therefore, as some proxels finish, more resources become available, meaning the next proxel can complete faster, and so on in a virtuous cycle (at least until the relevant resources are no longer over-saturated).
Considering that the locally synchronous and asynchronous approaches often complete entirely before the fastest proxel finishes in the globally synchronous process, a significant benefit is already available from using one of the other two, even without considering progressive completion. Assuming the above hypothesis on the reason behind the rapid mean-to-max completion for the asynchronous variant is correct, it suggests that even if the 'quality' of the results due to progressive completion is uninteresting, there may still be a practical benefit to be gained from it in simulations of large lattices. Regardless, effective implementation of real progressive completion, for both cP systems and computer programs, remains a further open problem.
Based on these results and those of Sect. 8.7, it appears that, in general, the asynchronous approach should be taken as the 'default' when an explicit message passing implementation for NMP is used. If the problem domain at hand is one where precisely the same final result as the globally synchronous variant is essential, then the locally synchronous variant should be used. There is no clear reason to prefer the globally synchronous variant.

Experimental limitations
These results indicate trends, but there are limitations to the data, stemming primarily from a lack of diversity in the testing. The experiments were conducted using only one programming language and runtime, C# on .NET (specifically version 5.0.301 of .NET). All processors used were Intel® x86_64 processors. All tests were performed on computers running either Windows 10 or Windows Server 2016. We conjecture that similar results will be obtained using other technologies, as there is no obvious basis for undue influence from a programming language, CPU, or operating system in these experiments, assuming due consideration is paid to the proper granularity of assigning proxels to processors. Indeed, we predict that more noteworthy results will be seen on platforms that use lightweight tasks.
Arguably, too few iterations of each test were performed to consider the results statistically valid, 11 although we note that, usually, the difference between the minimum and maximum running times for each test was at most a few percent. All timing experiments were performed with a ratio of one Task to one proxel. This preserved a direct correspondence between the cP systems models and the practical implementation but may be suboptimal for resource allocation. A thorough exploration of the proper granularity of proxels for the available processing resources could be made in the future.
All experiments were based on four-neighbourhood lattices. The theorems and lemmas described in Sect. 7.1 hold for neighbourhoods of any size, but the performance of NMP implementations may (or may not) vary according to the number of neighbours ordinary proxels have. The effect is uncertain at this point. On the one hand, a higher number of neighbours could force the synchronous variants to spend even more time waiting for their neighbours than the asynchronous variant, which would favour the latter. On the other hand, the overhead of stepping through the messaging loop more often might outweigh the benefits of asynchronicity. The exact balance is likely implementation-dependent.

Conclusion and future work
After briefly summarising the core aspects of cP systems (P systems with compound objects, formerly P systems with complex objects), this paper presented three variants of a generic system for NMP computation on a lattice in cP systems: (i) a globally synchronous one, in which every top-level cell, representing a proxel, performs the same actions simultaneously; (ii) an asynchronous one, where different top-level cells may be in different states, perform different actions simultaneously and may be at a different generation count to their neighbours; and (iii) a locally synchronous one, partway between the other two, in which proxels evolve separately but synchronise sufficiently to keep to the same generation count across the system.
An example for the asynchronous variant was provided to clarify the operation and progression of the rules. Next, we analysed the asynchronous system. We proved that in both the synchronous and asynchronous versions, every proxel sends precisely one message per neighbour, per generation, and halts once it has reached its maximum number of generations. The main difference between the two is that, in preparing the outgoing message for neighbour k and generation g + 1 under the asynchronous version, some received input messages may already be from a later generation (greater than g). No message used will ever be from a generation less than g, however.
We found that the systems have relatively low rule and symbol complexity for a cP systems implementation but note that the systems are only a partial view: oracles were used to abstract over the rules for proxel-internal computations, the nature of which are domain-specific. The variants are, in our view, essentially a high-level abstraction of various NMP-based algorithms employing communication. While there can be significant differences in the nature of the computations performed within each proxel, they all have fundamentally the same structure: computing updates over some data within a given proxel and then exchanging those with other proxels.
We performed multiple experiments to investigate the NMP variants. First, we visualised the spread of 'influence' from a proxel across a lattice under different circumstances. Then, after validating Theorem 11 with a working program, we measured the overall completion times for each variant across multiple grid sizes and CPUs with different core counts. The results showed that the asynchronous variant was the fastest in nearly all cases, although the locally synchronous variant was as fast on smaller grids and with fewer cores.
Final proxel values from the same starting conditions were compared between the globally synchronous and locally synchronous variants and the locally synchronous and asynchronous variants. The results showed that the globally synchronous and locally synchronous variants always produce identical final results, while the asynchronous variant's final results are usually extremely close to those of the other two. The times to completion for the fastest, average and slowest proxels in the grid were determined and analysed. All three times tended to be quite close for the globally synchronous approach. The time to completion for the fastest and average proxels tended to be similar between the locally synchronous and asynchronous approaches but the slowest proxel generally finished notably earlier with the asynchronous variant.

Future work
We plan to adapt the asynchronous system to loopy belief propagation (BP) stereo matching (SM) (see, e.g., [5,11,33]), using NMP precepts. Preliminary work has begun on implementing a close approximation of the asynchronous system as a framework in a common programming language to explore the effectiveness of this approach in modern computer systems.
In the system presented above, the size and shape of the grid and the communication topology between neighbours are permanently fixed at system initialisation. Greater flexibility is usually not needed, but it might be useful in some circumstances, such as when a dynamic communication topology is necessary for a particular algorithm. Modifying the system from Sect. 4 accordingly and analysing the impacts on running computer programs is a clear potential extension to this paper's work.
At present, every proxel stays active throughout the system's evolution until it has sent and received all its scheduled messages. Permitting proxels to deactivate at appropriate points before they have reached their maximum generation count could save processing resources when the potential for parallelism is bounded. As discussed with Conjecture 15, however, any proxel which ceases messaging before reaching the maximum generation count can eventually affect the entire system. Implementing effective early stopping is an unsolved problem.
We have also yet to examine the systems with respect to communication complexity measures such as those found in [19]. The precise results presented there are not directly applicable to this work, given the use of different P systems models, but the underlying concepts appear directly relevant. Finding ways to define and quantify the communication complexity and 'cost' ([19]'s ComR and ComW) seems particularly pertinent for determining the overall characteristics of a system.
Investigating the uses of alternative hardware for NMP would be worthwhile because of the potential for large-scale parallelism offered by devices such as GPUs and FPGAs. A GPU implementation of the NMP variants, in particular, seems like a promising target for further investigation. Each proxel in a NMP system likely has small processing requirements individually, but an entire system might have many proxels. The cores on a GPU are limited in their capabilities compared to a CPU's cores, but there are many more of them, and modern GPUs often comfortably run processes with millions of live threads. Thus, NMP and GPUs are potentially a superb match. The high levels of messaging and control flow, however, might not fit well with GPU architecture, as GPU performance is generally averse to branching. We also speculate that our novel approaches may prove superior for systems such as FPGAs and semi-distributed systems like high-performance computing clusters.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions. No funding was received to assist with the preparation of this manuscript.