1 Introduction

The Open Community Runtime (OCR) is a task-based model for parallel and distributed computing [16]. The OCR working group maintains a specification [15], which describes the OCR application programming interface (OCR API) and the expected behavior of a runtime system that implements the OCR API. Among other things, it provides a memory model, which describes how data are handled and how changes to the data are propagated; in other words, it provides a consistency model. The model is based on the happens-before relationship, which describes which operations (made by the user code within tasks) are guaranteed to happen in a particular order. Simply put, if one operation modifies data, the data are read by another operation, and the first operation happens-before the second operation, the second operation is guaranteed to see the changes made by the first operation. Having such a model is essential for programmers’ understanding of the code, but it can also be used in tools that detect errors such as data races [20, 22].

However, the memory model only deals with the user’s data, which are stored in data blocks. It does not cover changes made to other OCR objects (like tasks or events) using the OCR API. It would be possible to assume that the full effect of each OCR API call is performed before the function call returns to the calling code (the user’s code of a task), but this is too restrictive. The overall idea of the OCR is to allow as much asynchronicity as possible. It would be beneficial to be able to return from the API call immediately, while the effect of the call is still being evaluated by the runtime system. This may improve execution efficiency, but it is also beneficial for resiliency, which is another design goal of the OCR. If applying the effect a task has on the state of OCR objects can be delayed to a later point (even after the task has finished executing), it becomes easier to isolate that task from the rest of the runtime state. This may simplify checkpointing or allow redundant task execution, where a task may be executed twice at different locations, but only changes made by one of the two task clones are propagated to the global runtime state.

The goal of the following text is to fill the gaps in the OCR specification and to give a clean definition of the way the state of all OCR objects (not just data blocks) changes. The proposal is designed to allow the deferred execution model described in the previous paragraph, providing the API user (the application developer) with clear guarantees, while giving the runtime system the option to handle the API calls asynchronously. The specification already expects some of the changes not to be immediate. For example, once all dependences of a task have been satisfied, the task is ready to run. But the actual start of the execution does not have to be immediate. When the API call that fulfills the last dependence of the task is made, its effect does include the execution of the task, but the task is not expected to start executing right away. The same is true for events connected via a dependence. If an event is satisfied, the connected events should be satisfied as well, but these indirect satisfactions are not expected to happen immediately. The specification does not clearly define these situations, so our proposal also provides the necessary clarification.

The proposed object model, which defines the way the state of OCR objects is affected by the OCR API calls, is based on the existing OCR memory model. It turns out that the memory model provides a very good basis for the object model. The object model fully reuses the happens-before relation used to define the memory model, making it a natural and well-fitting extension, which does not change the fundamental ideas laid down by the memory model. As a result, existing implementations do not have to be changed to implement the proposed object model. The reference implementation created by Intel/Rice University [16], a derived implementation from PNNL [14], and our OCR-Vx implementation [5,6,7] all already conform to the proposed object model. In some cases, they may provide stronger guarantees than required, so the model gives them room for optimizations.

2 Related work

Originally, the C and C++ programming languages did not have a memory model that would define their behavior in a multi-threaded environment. Parallel execution was possible with a single thread per process, with the behavior of the synchronization primitives defined outside the language, for example in POSIX standards. The memory models were finally introduced by the C++11 and C11 standards. The memory model defines the semantics of computer memory storage. A simplified view is that the memory model prescribes the conditions under which a change made by a write operation to an object stored in memory needs to be seen by some read operation. For example, if both operations are in the same thread and there is a sequence point between them, the change has to be seen. A more complex example is when the thread that performed the write operation then releases a mutex. If this mutex is then acquired by another thread and that thread then reads the value, the change must be visible. In C/C++ the objects that control the execution (threads, mutexes, etc.) are managed the same way as “normal” objects used to store the data used by the program. For example, to use a mutex created by another thread, the user must be synchronized the same way as if it wanted to read a value written by the first thread. Also, the mutex can only be destroyed in a way that guarantees that no one else may be using it, just like when destroying a data object. The threads and mutexes don’t live in separate worlds, one belonging to the application and the other to some runtime system. Similarly to C/C++, the OCR memory model is based on a relation that defines which changes are visible at different places. However, synchronization (a key to defining visibility) in C/C++ is controlled by locking and atomic operations. OCR uses a completely different approach: synchronization is defined using events and task dependences. Also, events are special objects, managed differently from application data.

Intel Threading Building Blocks (TBB, [11]) is a task-based parallelization library. The computation is performed by tasks that are synchronized with dependences, similarly to OCR. However, TBB is based on the C++ memory model. The management of application data is left completely to the application code and the TBB objects (e.g., tasks) are also normal C++ objects. For example, every task has a counter that tracks the number of unsatisfied incoming dependences. This counter is decremented using an atomic operation (covered by the C++ memory model) when a dependence is satisfied. No special treatment is necessary.

MPI is an example of a distributed programming model with such special objects. Application data are stored in the memory of the process and managed through the usual C/C++ (or Fortran) means, but MPI objects like communicators follow a different set of rules. The state of a communicator is distributed among the processes that are members of the communicator and the individual processes only get a handle which they can use to refer to the communicator in MPI calls. Some MPI objects are local, like or . Less obvious examples are and , which are used for communication, but they are managed locally. These behave as normal “local” objects. The and objects are “global”. Most operations that modify these are collective operations and need to be done within a communicator by all members. As all members of a communicator need to make all of the calls and make them in the same order, this clearly defines when their effect takes place—there is a clear “before” and “after” for each operation. For example, creates new communicators from an existing one. Since both members of the old communicator and the new communicators must participate in the call and no other processes can use the new communicators, the point in time where the new communicators can be used is clearly defined. Note that MPI does not guarantee that collective operations are synchronized with point-to-point communication and there is also no guarantee about ordering of collective operations on different communicators. It is the user’s responsibility to avoid race conditions in such cases. Still, the requirement to modify “global” objects via collective operations on clearly scoped communicators is a fairly elegant solution to the problem of defining the ordering of such operations. It is still possible to run into synchronization issues, but the clear model helps in isolating these [3, 8]. The MPI approach is not applicable in OCR, as there are no communicators in OCR and no collective calls. 
OCR objects have global visibility and any object can be modified by an OCR API call made by any task.

Some models avoid the problem by not using such global objects. For example, in OpenSHMEM (a PGAS library) scope of collective operations is defined by providing start, stride, and size—three numerical parameters passed to the function calls (like and ). Clearly, this is not possible in OCR, as tasks and events need to be created (and destroyed) in order for any computation to happen.

UPC++ [23] is a very interesting example of an APGAS model, which extends C++ for distributed computing. It is based on earlier work on UPC—Unified Parallel C [4]. UPC++ allows tasks to be executed on remote nodes, using futures for synchronization in a way that is similar to the way events are used in OCR. A typical invocation scenario for an asynchronous (remote) operation is to make a call which initializes the operation and returns a future. The future becomes ready once the operation has completed. This constrains the ordering of the operation: It cannot start before the call is made and it must finish before the future is satisfied. If one operation writes data to memory and another reads it, the way to ensure that the change is visible is to make sure that the second operation cannot start before the future that corresponds to the first operation becomes ready. Furthermore, UPC++ allows the application to define teams, which correspond to derived MPI communicators, and invoke collective operations on those teams. The ordering of these operations is similar to MPI. Unlike MPI, UPC++ does not guarantee ordering of point-to-point communication between two ranks (if order needs to be preserved, it must be enforced by dependences via futures). As a result, the fact that there is also no implicit ordering between collective and point-to-point operations does not really add to the overall complexity, as it might in MPI. Another type of entity in UPC++ is distributed objects. These are objects with globally valid name, similar to the way OCR objects are identified with GUIDs. Distributed creation of these objects is possibly the aspect of UPC++ most relevant to the work presented in this text. However, despite the creation being a collective call (it is local in OCR), no guarantees are given that after the constructor is called on one rank the name is valid on other ranks. 
The user must ensure this by using a barrier or other custom synchronization, which guarantees that the constructor is called on a rank before the name is used. It would not be sufficient to construct the name and immediately use it to invoke a task on another rank, since the invocation could be processed before the recipient constructs the name. Such usage is generally valid in OCR. The objective of our work is to formalize the conditions under which it is valid.

The OpenCL standard includes a complex memory model, which prescribes the way different memory types are read and written [9, 17]. One part of the model also describes how the host-side and device-side commands get synchronized. There are significant analogies between this and synchronization in OCR. The OpenCL queues contain commands (like kernel execution or memory transfers) and events, which are similar to OCR tasks and events. A happens-before relation is established among the commands, events, and API calls. It defines the order of operations and also controls visibility of memory operations, which is also similar to the way synchronization works in OCR.

Mixing different types of objects or the ways objects are accessed can be problematic. In MPI, care needs to be taken when combining collective communication with point-to-point. Similarly, combining atomic and non-atomic variables in C++ can lead to serious issues, even with the C++11 memory model [2, 12]. In UPC++, the interaction of point-to-point communication, collectives, atomic operations, and global objects needs to be carefully managed. It is easy for an application developer to make incorrect assumptions about the way different types of operations interact, expecting some ordering to be enforced, while it is in fact not guaranteed. In OCR, there are also two types of objects, which are managed in two different ways. Data blocks, which contain the application data, need to be acquired by a task, before they are modified using C operations on pointers. The OCR objects are modified either by OCR API calls made inside tasks or implicitly by the runtime system. Our objective is to base both on the same foundations, reducing the cognitive burden placed on the programmer.

3 The Open Community Runtime and synchronization

To decouple work from execution units, all work in OCR is performed inside tasks. These tasks are scheduled by the OCR runtime system and they can be freely moved between execution units, including moving a task to a different node in a compute cluster. Giving such scheduling flexibility to the runtime system can improve execution efficiency, especially on heterogeneous systems [1, 10, 18].

To facilitate this task movement, all data need to be decoupled from storage. This is achieved by storing all application data in data blocks. A data block represents one contiguous piece of memory—a fixed-size sequence of bytes that can be used to store application data. A task can only access contents of a data block that it has acquired. There are only two ways to acquire a data block. The task may create a new data block or the data block is specified as a dependence of the task before it starts. This makes it more difficult for applications to manage their data but it makes the runtime system aware of all data a task may access, giving it much greater flexibility. The runtime system is allowed to move the data stored inside a data block around, including moving it to a different cluster node. The runtime system can make multiple copies of the data. It is even possible to provide different tasks with different (possibly old) versions of the data. Naturally, the runtime system needs to observe a set of rules which dictate when and how the different copies need to be synchronized. This is the OCR memory model.

Fig. 1. An example of OCR tasks, data blocks, and synchronization

There is only one way to synchronize execution of OCR tasks. It is possible to specify task dependences to build a directed acyclic graph (a DAG) of tasks, which the runtime system follows when making scheduling decisions. Events are lightweight objects that can be used in the DAG along with tasks, to specify more complex dependences. Output events are a special kind of event. Every task has an associated output event, which signals that the task has finished executing. Anything that depends on that output event cannot start before the task finishes. Figure 1 shows an example with three tasks, two events (one output, one normal), and a single data block. The dependences ensure that the two tasks at the bottom (tasks 2 and 3) of the figure cannot start before task 1 finishes.

In the example, two different tasks acquire the same data block, obtaining access to its data. As we can see, the second task (task 3) is synchronized to run after the first task (task 1). Because of this, the OCR memory model guarantees that the second task sees all changes made by the first task. If there was no synchronization between the tasks, they could still both access the same data block, but there would be no guarantee that one task would see the changes made by the other task. This is the intention of the OCR specification. Task dependences not only ensure that tasks are executed in a certain order, but they also govern the visibility of updates to the data.

While the way the data stored in data blocks is updated is clearly defined by the memory model, the way the OCR objects are changed is not defined in detail. However, all OCR objects contain some state information or (meta)data. This may be read-only, like the size of a data block, but it may also be mutable. For example, the state of each OCR object can be changed by destroying the object.

Latch events are a good example of why these updates are important. A latch event is a special kind of event, used to define more advanced synchronization patterns. A latch event contains a counter. This counter may be incremented or decremented, either explicitly from a task or by a dependence. The latter case can, for example, be used to automatically decrement the counter after a task finishes. The latch event is used to wait for the counter to reach zero.

Fig. 2. An example of a latch event modified from two tasks

In Fig. 2, two tasks explicitly update the counter. As the counter starts at 1 and one operation is an increment and the other is a decrement, their order is clearly important. If the counter is decremented first, it reaches zero, allowing all tasks that depend on it to start. If it is incremented and then decremented, it moves from 1 to 2 and then back, preventing the dependent tasks from starting. The existing OCR specification does not specify what should happen in such a case.
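
The two outcomes can be traced with a toy model of the latch counter. This is only a sketch: the `LatchEvent` class and its method names are hypothetical and are not part of the OCR API, and the rule that updates arriving after the latch has triggered are ignored is an assumption made to keep the example small.

```python
class LatchEvent:
    """Toy latch: a counter that triggers once when it reaches zero."""

    def __init__(self, initial):
        self.counter = initial
        self.triggered = False

    def increment(self):
        self._update(+1)

    def decrement(self):
        self._update(-1)

    def _update(self, delta):
        if self.triggered:
            return  # assumption: a triggered latch ignores later updates
        self.counter += delta
        if self.counter == 0:
            self.triggered = True  # dependent tasks may now start


def run(order):
    """Apply the named operations in the given order to a latch starting at 1."""
    latch = LatchEvent(initial=1)
    for operation in order:
        getattr(latch, operation)()
    return latch.counter, latch.triggered


decrement_first = run(["decrement", "increment"])
increment_first = run(["increment", "decrement"])
```

With the decrement applied first the latch triggers immediately; with the increment applied first the counter returns to 1 and the latch stays untriggered.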

The first issue to deal with is the atomicity of the update. Even though not explicitly stated, it was clearly the intention of the authors of the specification to make the update an atomic operation. Even if the counter is updated from multiple tasks concurrently, the final value should be equivalent to some sequential ordering of the changes. It should not be possible for an update to interrupt another update, producing an inconsistent result.

Clearly, this does not solve the problem presented in Fig. 2. We also need to define a way that the application can use to specify the order in which multiple changes are applied to an object (e.g., the latch event). We call this the object model. It turns out that the foundations laid by the memory model can also be used for the object model. The basic idea is to use the synchronization established via events to also order changes made to OCR objects.

It may be tempting to define the object model by requiring all changes that tasks make to OCR objects to be immediately evaluated in an atomic way. Since tasks can only start (and make changes to OCR objects) after all their dependences have been satisfied, this would ensure that changes to OCR objects follow the dependences. However, not all changes to OCR objects are invoked directly by tasks. Some are performed implicitly by the runtime system. Still, this simple object model could be extended to also cover these cases.

There is a reason to use a more complex object model. Since the OCR targets distributed systems, it is possible that the data and metadata of the OCR objects are not available on the node that is executing the task that changes them. Requiring the change to be fully processed immediately would entail blocking the tasks until the change can be completed. For performance reasons, we want a model that allows the change to be applied asynchronously while the tasks continue to work, hiding the communication latency. Still, we would like this more complex model to be mostly compatible with the view that all changes are performed immediately. It turns out that this is possible, if we build our system around dependences in a way that is similar to the way the memory model handles data block changes.

Returning to the example shown in Fig. 2, to obtain a correct result, the application would need to ensure that dependences are set up to ensure that the two tasks run in a certain order. In this particular example, it only makes sense to make the task that performs the decrement depend on the task that increments the counter. The object model would then ensure that the counter is first incremented to 2 and then decremented back to 1. We allow the runtime system to make the actual updates in a deferred way, as long as the proper ordering is maintained. The value can actually be changed even after both tasks have finished. The OCR is designed in a way that prevents such delays from breaking the application. This is discussed in detail in Sect. 6, but the fundamental reason is that no task is allowed to wait for such a change to actually take place. It can be used in dependences but not as an explicit wait operation inside a task.

An important thing to note is that we also need to be careful about changes made by a single task. One task could increment and then decrement the counter of the latch. We need to make sure that even if the application of these changes is delayed, we still get the correct result. This is in fact a direct consequence of our requirement to apply changes in the order of their synchronization. Operations within a task are considered to be synchronized according to their execution order (as per the rules of the C language).
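
The ordering requirement described in the last two paragraphs can be sketched as follows. The sketch assumes that every task records its counter updates in a list (in program order) and that the runtime applies the recorded lists later, task by task, in an order compatible with the task dependences. The function `apply_deferred` and the task names are hypothetical; a real runtime would track dependences through events rather than through a dictionary.

```python
from graphlib import TopologicalSorter


def apply_deferred(initial, task_ops, depends_on):
    """Apply recorded counter updates after the tasks have finished.

    task_ops maps a task name to its updates in program order;
    depends_on maps a task name to the tasks it must run after.
    Returns every value the counter takes, starting with the initial one.
    """
    sorter = TopologicalSorter()
    for task in task_ops:
        sorter.add(task, *depends_on.get(task, ()))
    counter = initial
    history = [counter]
    for task in sorter.static_order():  # predecessors come first
        for delta in task_ops.get(task, []):  # program order within a task
            counter += delta
            history.append(counter)
    return history


# Two tasks as in Fig. 2, with the decrement depending on the increment.
two_tasks = apply_deferred(
    initial=1,
    task_ops={"dec_task": [-1], "inc_task": [+1]},
    depends_on={"dec_task": ["inc_task"]},
)

# A single task that increments and then decrements the counter.
one_task = apply_deferred(1, {"task": [+1, -1]}, {})
```

In both cases the counter goes from 1 to 2 and back to 1, so it never reaches zero even though all updates are applied after the tasks have ended.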

4 The big picture

Before describing the way the OCR object model is built, we should first have a look at the big picture—see how the individual pieces fit together to define how OCR implementations and applications should behave.

First, there is the happens-before relation which defines how various actions performed by an OCR application are synchronized. Similarly to the C/C++ memory model, if one action happens-before another one, the second action should be able to see the changes made by the first one. In C/C++, synchronization is established through library calls (like working with mutexes) and atomic operations. In OCR, synchronization is established only by events and dependences. The relation is built on the actions that were actually executed by the application, not those in the source code. After the application terminates, we can look at all the actions and build the happens-before relation.

The OCR memory model uses the happens-before relation to define when a change (write operation) made to a data block must be visible to an operation that reads the data block. This means that the OCR runtime system must propagate these changes and ensure that the read operation accesses the correct modified version of the data block. The proposed object model, which will be described later in the text, restricts the way the state of OCR objects (like events) is updated in a similar way, based on the happens-before relation. Note that we are using synchronization established by events (the happens-before relation) to define how events behave. It may look as if the happens-before relation is being used while it is being constructed.

This is not actually the case. In a distributed environment, it is not possible to exactly define what the relation looks like at a given point in time. What can be done is to look at a finished OCR application, build the relation and determine if the program execution was correct—whether the constraints imposed by the memory and object models were observed. The common goal of runtime system and application developers is ensuring that the answer is always “yes”. From the runtime system’s point of view, the goal is to ensure that any “correct OCR application” is executed in a way that adheres to the models. From this point of view, a correct OCR application can be roughly defined as an application that follows the OCR API specification and uses synchronization (events and dependences) to ensure proper ordering of actions that need to happen in a certain order. For example, it has to ensure that any object is created before it is used.

Creating such a runtime system in a shared-memory environment is fairly straightforward. Later in the text, we will show a way to also do that in a distributed-memory environment where messages are used to communicate between distributed processes.

5 The sequenced-before, synchronized-with, and happens-before relations

The OCR 1.2.0 specification provides a memory model for OCR programs, to define how tasks can access data in data blocks concurrently. The memory model is based on three relations. First, the order among operations within a single task is defined by the sequenced-before relationship. This is provided by the C language used to implement the tasks and it is the natural ordering of operations performed by a C program, as one would expect. The second relation is synchronized-with, which is defined by dependences among OCR events and tasks. The simplest example is a task whose output event is used as a dependence for another task. In this case, it is natural to expect that the second task comes after the first task. There are more complex examples of synchronized-with, which will be covered later. The third relation is happens-before, which is the transitive closure of the union of sequenced-before and synchronized-with.
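
As a small illustration of the third relation, the following sketch computes happens-before as the transitive closure of the union of the other two relations for a pair of connected tasks. The action names (`t1.write`, `t2.read`, and so on) are made up for the example, and the naive fixed-point loop is chosen for readability rather than efficiency.

```python
def transitive_closure(pairs):
    """Return the smallest transitive relation that contains the given pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Order of actions inside each task (given by the C language).
sequenced_before = {
    ("t1.write", "t1.end"),
    ("t2.start", "t2.read"),
}

# Task 2 depends on the output event of task 1.
synchronized_with = {
    ("t1.end", "t2.start"),
}

happens_before = transitive_closure(sequenced_before | synchronized_with)
write_visible_to_read = ("t1.write", "t2.read") in happens_before
```

The pair `("t1.write", "t2.read")` is in neither input relation; it appears only in the closure, which is what lets the memory model conclude that the read must see the write.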

The OCR specification does not provide a full definition of synchronized-with; it only shows a simple case of two connected tasks. To properly define it, we need two things. First, we need to define the set that the relation is applied to (domain and range). Second, we need to define which pairs of objects from the set are in the relation.

The domain and range of synchronized-with are the OCR API calls made by the application. We also include implicit operations performed by the runtime system in response to certain situations. For example, if two tasks are connected by dependences formed by a chain of events, the events in the middle of the chain are satisfied automatically by the runtime system in response to the previous links in the chain being satisfied. The actual list is given in “Appendix C”.

To describe the synchronized-with relation, we need to clarify the behavior of tasks and events. All tasks and events have a certain number of pre-slots. These are the actual targets of dependences. A task with five pre-slots (the number is defined when the task is created) needs to be set as a target of exactly five dependences. Events have one or two pre-slots, depending on the type of event. A pre-slot is said to be satisfied either if it is a target of a dependence and the dependence itself is satisfied or if an explicit OCR API call is used to satisfy the pre-slot. The source of every dependence is an event. A dependence is satisfied when the source event is triggered. An event is triggered when its triggering condition (defined by the event type) is satisfied. Whether an event’s triggering condition is satisfied depends only on the satisfaction of the event’s pre-slots.

Note that we have just described the way satisfaction propagates through events. A satisfied incoming dependence satisfies an event’s pre-slot. In the case of basic events, this is enough to satisfy the event’s triggering condition. The event gets triggered and satisfies all outgoing dependences. Tasks are sinks in the signal propagation. Satisfying a task’s pre-slot may allow the task to eventually start but it requires no further propagation of the satisfaction signal. Besides sinks, we also need sources. There are two kinds: output events, triggered by completion of the corresponding task, and explicit pre-slot satisfactions performed by OCR API calls inside tasks.
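
The propagation just described can be sketched for the graph of Fig. 1. The `Node` class and the helper functions are hypothetical and only model basic events with a single pre-slot and tasks as sinks. As a simplifying assumption, the output event of task 1 is modelled as a basic event whose pre-slot is satisfied when the task completes.

```python
class Node:
    """A task or a basic event with a fixed number of pre-slots."""

    def __init__(self, name, num_preslots, is_task=False):
        self.name = name
        self.is_task = is_task
        self.satisfied = [False] * num_preslots
        self.targets = []  # (node, slot) pairs reached by outgoing dependences


def add_dependence(source_event, target, slot):
    """Connect an event to a pre-slot of another event or task."""
    source_event.targets.append((target, slot))


def satisfy(node, slot, ready):
    """Satisfy one pre-slot and propagate the signal through triggered events."""
    node.satisfied[slot] = True
    if not all(node.satisfied):
        return
    if node.is_task:
        ready.append(node.name)  # tasks are sinks: nothing more to propagate
        return
    for target, target_slot in node.targets:  # the event is triggered
        satisfy(target, target_slot, ready)


out1 = Node("output event of task 1", 1)
event = Node("event", 1)
task2 = Node("task 2", 1, is_task=True)
task3 = Node("task 3", 1, is_task=True)
add_dependence(out1, event, 0)
add_dependence(out1, task3, 0)
add_dependence(event, task2, 0)

ready = []
ready_before = list(ready)
satisfy(out1, 0, ready)  # task 1 has finished
```

Before the output event is satisfied no task is ready; afterwards both task 2 (through the intermediate event) and task 3 (directly) have all their pre-slots satisfied.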

The following rules build the synchronized-with relation: A task cannot start before its dependences are defined and satisfied; an event’s pre-slots are satisfied before it satisfies pre-slots of connected events and tasks; a pre-slot of an event or task that is a target of a dependence can only be satisfied after the dependence is defined. Once again, a detailed definition is given in “Appendix C”.

Going back to the example in Fig. 1, we can see if synchronized-with is defined as expected. The output event of task 1 is satisfied by an implicit OCR API call made at the end of task 1. As per our rules, this satisfaction call is synchronized-with all satisfactions that happen as a result of triggering of the output event. It satisfies pre-slots of the non-output event and task 3. From transitivity, we now know that the satisfaction of the output event at the end of task 1 is synchronized-with satisfaction of the pre-slot of task 3, which in turn is synchronized-with the actual start of task 3 (the first rule). Similarly, we obtain that the end of task 1 is synchronized-with the beginning of task 2, because satisfaction of the pre-slot of the non-output event is synchronized-with satisfaction of the connected pre-slot of task 2.

By combining the sequenced-before and synchronized-with relations, we know that all operations inside task 1 happen-before all operations in task 3. Therefore, the release of the data block in task 1 happens-before acquisition of the data block in task 3 and changes made to the contents of the data block in task 1 must be visible in task 3.

6 Deferred execution

For performance reasons and to support resiliency features of the OCR, it is beneficial to allow the runtime system to defer evaluation of the OCR API calls, while allowing the user’s code of the task to keep running. The performance benefits are clear. Forcing the operation to fully complete before returning means that in a distributed environment the task may be blocked waiting for communication to complete. Deferring operations is beneficial to resiliency, because it allows the runtime system to execute tasks in a speculative way. If the operations are not just running in the background (as they might be for performance reasons) but are completely suspended, the task can be run without affecting the overall state of the computation. This can be very useful when the runtime system is making a snapshot of the computation. This snapshot may be later used to restart a failed computation. Allowing speculative task execution might decrease the cost (performance degradation) of making the snapshot.

Fig. 3. Comparing normal and deferred execution of a task that makes two calls to the OCR API. The arrows correspond to threads and bold parts the intervals where the threads are active (not blocked)

An example comparing normal and deferred execution is shown in Fig. 3. On the left, we can see that the thread needs to be paused while communication takes place. On the right, the operations are placed in a local queue, which allows them to be evaluated later. The task runs to completion uninterrupted. Multiple queues (for multiple tasks) can be multiplexed to the same worker thread, saving resources.
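
The right-hand side of Fig. 3 can be approximated by a minimal sketch in which API calls only append a record to a local queue and the runtime applies the records later. The class `DeferredRuntime`, its methods, and the event names are hypothetical; the sketch ignores communication and multiplexing and only shows that the task completes before any of its calls takes effect.

```python
from collections import deque


class DeferredRuntime:
    """Toy runtime that records API calls and applies them later."""

    def __init__(self):
        self.satisfied = set()  # global state: names of satisfied events
        self.queue = deque()    # calls recorded but not yet applied

    def satisfy_event(self, name):
        # Returns immediately; the effect is applied in drain().
        self.queue.append(("satisfy", name))

    def drain(self):
        # Apply the recorded calls in the order in which they were made.
        while self.queue:
            operation, name = self.queue.popleft()
            if operation == "satisfy":
                self.satisfied.add(name)


def task(runtime):
    """A task that makes two API calls and runs to completion uninterrupted."""
    runtime.satisfy_event("event_a")
    runtime.satisfy_event("event_b")
    return "task finished"


runtime = DeferredRuntime()
result = task(runtime)
state_after_task = set(runtime.satisfied)
runtime.drain()
state_after_drain = set(runtime.satisfied)
```

After the task returns the global state is still unchanged; both events become satisfied only when the queue is drained.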

To determine whether it is possible to actually defer the evaluation of the OCR operations, we need to examine the different effects of OCR API calls. There are three basic options, which are described in the following paragraphs.

The API calls change the internal state of OCR objects. All object types may be created or destroyed, events may be satisfied, tasks may have their dependences set, etc. The OCR API does not provide any way for the user’s code to query the state of the OCR objects, nor can the code wait for an object’s state to change. For example, the only way a task can find out that an event has been satisfied is to have the event connected to the task’s pre-slot via a dependence. There is no API call that would allow a running task to find out the state of the event or to wait for the event to be satisfied. Therefore, it is possible to delay evaluating the API calls, even until after the task finishes. Naturally, the runtime system must take care to preserve the semantics of the API calls—the result produced in the presence of deferred execution should be equivalent to the result of non-deferred execution. Ensuring this is the objective of the proposed object model.

However, there is a second way in which a task can interact with objects outside of its own code and local data. A task may change the contents of data blocks and these data blocks may be read by other tasks. Clearly, it would not be a good idea to allow such changes to reach other tasks before the API calls are evaluated. If that were the case, a task could change the state of an OCR object via an API call and store the information that it has done so in a data block for other tasks to see. Another task would see that the change has been made, but if the API call gets deferred, its view would be inconsistent with the actual state of the system. Fortunately, to make changes to a data block visible to another task, the task that made the change has to release the data block. The data block is released either implicitly by the runtime system at the end of a task or via an API call. The runtime system can therefore also defer the release operation, preventing other tasks from seeing the updated information before the object state is changed. The OCR memory model is therefore a good fit for deferred API execution.
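The argument can be checked on a toy model. The sketch below assumes a hypothetical `DeferredRuntime` class in which both the API call and the release of the data block are placed in the same queue; the names `e`, `db`, and `e_satisfied` are illustrative only.

```python
class DeferredRuntime:
    """Toy runtime: one task's calls are queued, other tasks see only applied state."""

    def __init__(self):
        self.events = {}     # object state visible to the rest of the system
        self.published = {}  # data block contents visible to other tasks
        self.pending = []    # deferred operations of the running task

    def satisfy(self, event):
        self.pending.append(("satisfy", event, None))

    def release(self, block, contents):
        # The release is queued behind earlier calls, so it cannot overtake them.
        self.pending.append(("release", block, dict(contents)))

    def step(self):
        kind, name, payload = self.pending.pop(0)
        if kind == "satisfy":
            self.events[name] = "satisfied"
        else:
            self.published[name] = payload

    def consistent(self):
        # A reader that sees the flag in the data block must also see the event satisfied.
        flag = self.published.get("db", {}).get("e_satisfied", False)
        return (not flag) or self.events.get("e") == "satisfied"


rt = DeferredRuntime()
rt.satisfy("e")                          # API call, deferred
rt.release("db", {"e_satisfied": True})  # release of the modified block, also deferred

checks = [rt.consistent()]
while rt.pending:
    rt.step()
    checks.append(rt.consistent())
```

At every point of the deferred evaluation, a task that can see the flag in the data block can also see the satisfied event, so its view stays consistent.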

The final effect of an OCR API call is returning values to the calling code. There are three kinds of return values: error codes, handles of newly created objects, and pointers to newly created data blocks. The error codes are not a major issue, since the specification already assumes that the runtime system may not be able to correctly identify all error conditions. If the evaluation is deferred, the call may immediately return an OK status. A special nanny mode, in which the errors are checked strictly, should be provided by the OCR runtime systems. As for the other two kinds, the runtime system can be designed in a way that allows it to immediately return a valid handle of an object as well as a valid pointer to a data block, without fully evaluating the OCR API calls. The actual creation of the object is deferred, but the runtime system must ensure that it is created with the same handle as the one returned to the user’s code.
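One possible way to return a valid handle before the object exists is sketched below. The handle layout (node identifier plus a node-local counter), the `Runtime` class, and the status value `0` are assumptions made for this illustration; they are not taken from the OCR specification or from any existing runtime.

```python
import itertools


class Runtime:
    """Toy node-local runtime that hands out handles before objects exist."""

    OK = 0  # assumed status code for success

    def __init__(self, node_id):
        self.node_id = node_id
        self._next_id = itertools.count()
        self.objects = {}   # objects that have actually been created
        self.deferred = []  # creation requests not yet evaluated

    def create_event(self):
        # Reserve the handle locally so it can be returned without communication.
        handle = (self.node_id, next(self._next_id))
        self.deferred.append(handle)
        return Runtime.OK, handle

    def evaluate_deferred(self):
        # The object is created later, under exactly the handle returned earlier.
        for handle in self.deferred:
            self.objects[handle] = {"type": "event", "state": "unsatisfied"}
        self.deferred.clear()


rt = Runtime(node_id=3)
status, handle = rt.create_event()
exists_before = handle in rt.objects
rt.evaluate_deferred()
exists_after = handle in rt.objects
```

The calling code can store and pass on `handle` right away, while the object itself appears only once the deferred creation is evaluated.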

For a low-level example and further detailed discussion of deferred execution, see “Appendix D”.

7 Object model

At this point, it is important to clarify how the concepts described in preceding sections (OCR memory model, the happens-before relation, and deferred execution) fit together. The extended synchronized-with relation defined in Sect. 5 and “Appendix C” serves to fill in a gap in the OCR specification, where the relation is not fully defined. With the extended synchronized-with, we can fully define the happens-before relation. This is then used by the OCR memory model to define when changes made to a data block by one task have to be seen by another task. The deferred execution model described in Sect. 6 is a way to delay evaluation of OCR API calls, while maintaining correctness with respect to the OCR memory model.

The missing piece is defining a similar model for OCR objects. Clearly, it would be good to reuse the already defined happens-before relation. On the other hand, forcing the effects of API calls to be applied immediately would go against the deferred execution model. Our proposal is to use a solution similar to the memory model: we use the happens-before relation to define where a change made to an OCR object’s state needs to be seen by other actions that use that state. This way, if the change (write) is deferred, we can also defer the use (read), ensuring that they still happen in the right order, therefore maintaining correctness without having to block tasks while we wait for the changes to be processed.

To define the object model, we need to split the effects of OCR API calls into two groups: immediate effects and delayed effects. The simplest explanation is that the delayed effects are effects that would not have to be evaluated immediately even if we used a simplified object model that tries to evaluate all OCR API calls right away. The most obvious example is starting a task as a result of its pre-slots getting satisfied. If the last pre-slot is satisfied by an OCR API call, no one would actually expect the task to be started as part of the satisfaction call. It is a delayed effect of the call. A less obvious, but at least as important, example is the satisfaction of a pre-slot of the first event in a chain of events. The first event’s pre-slot should be satisfied immediately, fulfilling the event’s triggering condition. However, the actual triggering of the event is akin to starting a task. It can be delayed, which also delays the triggering of the whole event chain.

With the groundwork already in place, the definition of the object model itself is simple: The object model requires all effects to be evaluated atomically. Furthermore, for any two operations connected by the happens-before relation, it requires that the immediate effects of the first call are seen when the immediate effects of the second call are evaluated and also when the delayed effects of both calls are evaluated.

Notice that this allows evaluation to be deferred, as long as correct ordering is maintained. Also note that it is possible to evaluate delayed effects of the first operation after effects (both immediate and delayed) of the second operation. The object model only guarantees that a delayed effect is evaluated after (and therefore can see the results of) the immediate effects of the same operation but does not give any constraints about how much it can be delayed beyond that.
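These ordering rules can be expressed as a small checker. The sketch below is an informal illustration rather than the formal definition: it assumes that every operation has exactly one immediate effect (`"imm"`) and one delayed effect (`"del"`), that effects are evaluated one at a time (which models atomicity), and that the happens-before pairs are given explicitly. The function name is hypothetical.

```python
def satisfies_object_model(order, happens_before):
    """Check an evaluation order against the ordering rules stated above.

    order: list of (kind, op) pairs, where kind is "imm" or "del".
    happens_before: set of (first_op, second_op) pairs.
    """
    pos = {effect: i for i, effect in enumerate(order)}
    ops = {op for _, op in order}
    # A delayed effect is evaluated after the immediate effects of the same operation.
    for op in ops:
        if pos[("imm", op)] > pos[("del", op)]:
            return False
    # Immediate effects of the first call precede both kinds of effects of the second.
    for first, second in happens_before:
        if pos[("imm", first)] > pos[("imm", second)]:
            return False
        if pos[("imm", first)] > pos[("del", second)]:
            return False
    return True


hb = {("a", "b")}
# Delayed effect of "a" evaluated after all effects of "b": permitted.
allowed = satisfies_object_model(
    [("imm", "a"), ("imm", "b"), ("del", "b"), ("del", "a")], hb)
# Immediate effect of "b" evaluated before that of "a": not permitted.
forbidden = satisfies_object_model(
    [("imm", "b"), ("imm", "a"), ("del", "a"), ("del", "b")], hb)
```

The first order shows the relaxation discussed above: the delayed effect of the first operation may come last without violating the model.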

A formal definition is given in “Appendix E” along with a discussion of some low-level implications.

Fig. 4 An example of a latch event updated indirectly from a single task

In most cases, the presented object model only provides a formal background for programmers’ natural understanding of the way objects in OCR behave. It confirms the notion that two synchronized updates should happen in the specified order. However, there are cases where the exact behavior is not immediately obvious and where the formal model helps by providing a clear definition of the expected behavior. Consider the example given in Fig. 4.

If a task directly increments the counter of a latch event by satisfying the correct pre-slot and then decrements it the same way, it is natural to expect that the counter is first incremented and then decremented. However, in this example, two extra events are put between the task and the latch event. The task first satisfies the event connected to the increment pre-slot of the latch event and then the one connected to the decrement pre-slot. Satisfaction of the latch event’s pre-slot is a delayed effect of satisfying the interposed event. Therefore, the object model places no restrictions on the relative order of the two delayed effects (increment pre-slot satisfaction and decrement pre-slot satisfaction) and the counter could be first decremented and then incremented. This may look inconvenient, but in a distributed environment it would be too costly to require chains of connected events to be fully evaluated as an immediate effect of the satisfaction of the first event in the chain.
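The consequence for the latch counter can be made concrete by enumerating both permitted orders of the two delayed effects. This is a toy model with a hypothetical function name and an assumed initial counter value of zero; it tracks only the counter and ignores the triggering of the latch.

```python
from itertools import permutations


def latch_counter_traces(initial=0):
    """Enumerate counter traces for every order of the two delayed effects."""
    delayed_effects = [("satisfy increment pre-slot", +1),
                       ("satisfy decrement pre-slot", -1)]
    traces = set()
    for order in permutations(delayed_effects):
        counter = initial
        trace = []
        for _, delta in order:
            counter += delta
            trace.append(counter)
        traces.add(tuple(trace))
    return traces


traces = latch_counter_traces()
```

Both traces end with the same counter value, but the intermediate value differs, which is exactly the behavior a program must not rely on.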

8 Implementations of the object model

In shared-memory systems, implementation of the proposed model is straightforward. When an API call is made by a task, its immediate effects are evaluated as part of the API call, making the changes visible to all subsequent calls. This is possible, as we do not need to hide the latency of communication. Some care needs to be taken to either properly protect the changes with locks or use atomic operations, but this is also fairly easy. Therefore, we will focus on distributed systems in the following text. We assume that the state of any object is maintained by one of the nodes in the system, the owner of the object. Interaction among nodes is facilitated by point-to-point messages. Changing the state of an object is performed by sending a message to its owner. Messages and delayed operations are managed by system workers, available on each node. These workers do not guarantee any ordering among the operations, except for the cases specified later in this text. When the state of an OCR object is updated, the change is either atomic or properly protected by a lock. If a delayed operation causes further operations to be invoked, those are also delayed. An example of such a situation is the satisfaction of a chain of events. Each “hop” in the chain may be delayed, so all of them are processed by the system workers. Note that there may be OCR implementations that do not satisfy our constraints. For example, the state of an OCR object may not be maintained (owned) exclusively by a single node.

There are many ways a runtime system may implement the object model. We will provide several options in the following sections. A low-level discussion is provided in “Appendix F”. In the appendix, we also show a way to prove that these implementations actually ensure that the object model is maintained.

8.1 Blocking

As we have already mentioned, it is possible to block the calling task while an operation is being evaluated, avoiding the deferred execution model. With our object model, we can formulate a more precise definition for this option. It requires all immediate effects of an OCR API call to be fully evaluated before the call returns. The delayed effects can be evaluated by system workers at a later point in time.

The reference OCR implementation created by Intel and Rice University [16] uses this approach. All OCR API calls are translated to messages even if the operation can be processed locally. A message can be processed either in a blocking mode, where the sending task blocks until the message is processed, or in a non-blocking mode. The authors have decided which kind is appropriate by analyzing the individual cases and deciding whether a blocking call is required or not. This corresponds to our selection of immediate and delayed effects.

The downside of this solution is the fact that the tasks need to be blocked while waiting for the remote operation to finish. To compensate, the runtime system allows other tasks to execute on the same worker thread, while the original task is waiting. As a result, the original task may be suspended for longer than just the duration of the remote operation, but the overall utilization of the available compute units can be significantly improved.

This solution is in line with the OCR design philosophy of handing over the control of the execution to the runtime system. If there is always alternative work to do and the increased duration of the task does not have an adverse effect on the task schedule, it can be very efficient. So far, experience suggests that this works well for some codes, but for others it may be problematic to the point where switching off the feature altogether (forcing the whole thread to stall while waiting for the response) may significantly improve performance. Still, it is an interesting alternative to the deferred execution model.

8.2 Immediate confirmation protocol

To allow deferred execution, we can enqueue all operations to be executed later by a background worker thread. To satisfy the object model, the operations have to be handled by the worker in a certain way. First, they are executed in the order in which they are enqueued. Second, the worker must send out all messages that implement immediate effects of the operation and wait for all of them to be processed before moving on to delayed effects of the operation and then to the next operation. To check that a message has been processed, confirmation messages are used.
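The two rules can be sketched as a simple worker loop. The function `run_worker`, the message names, and the `deliver` callback that stands in for the network are all assumptions of this illustration; a real worker would wait for confirmations asynchronously.

```python
def run_worker(operations, deliver):
    """Process operations in order; return a trace of what happened.

    operations: list of (name, immediate_messages, delayed_effects).
    deliver: function that models the network; it takes the sent messages and
             returns them in the order in which their confirmations arrive.
    """
    trace = []
    for name, immediate, delayed in operations:
        unconfirmed = set(immediate)
        for message in immediate:
            trace.append(("send", message))
        # Wait until every immediate message has been confirmed.
        for message in deliver(immediate):
            unconfirmed.discard(message)
            trace.append(("confirmed", message))
        if unconfirmed:
            raise RuntimeError("missing confirmation for " + name)
        # Only now may the delayed effects be handed over for evaluation.
        for effect in delayed:
            trace.append(("delayed", effect))
    return trace


operations = [
    ("add_dependence", ["update source", "update destination"], []),
    ("satisfy_event", ["satisfy first event"], ["trigger event chain"]),
]
# Confirmations may arrive in any order; here they arrive reversed.
trace = run_worker(operations, deliver=lambda messages: list(reversed(messages)))
```

In the resulting trace, no message of the second operation is sent before both confirmations of the first operation have arrived, and the delayed effect follows the confirmation of its own operation.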

Initially, the distributed OCR-Vx implementation (OCR-Vdm) did not use confirmation messages [7]. This was a source of race conditions. For example, if an object was created and the newly created handle was immediately used to satisfy an event that was at the beginning of a chain of events connected via dependences, it was possible for the event chain to be processed before the creation message, if the chain involved a third node (other than the node running the task and the node that owns the created object). A dependency mechanism was introduced to messages, preventing this kind of race condition, but it was not sufficient for all cases. For example, the indirect latch example in Fig. 4 could still be processed incorrectly.

To solve this, confirmation messages were introduced. If the OCR API call modified the state of multiple objects (e.g., adding a dependence modifies the origin and destination of the dependence), the message could bounce around multiple nodes, making confirmation difficult. Later, this was changed to multiple smaller messages, each of which only facilitated an update of a single object. At this point, it became apparent that a theoretical view of the problem was needed to justify the design, culminating in this work. The object model, where updates to each object are treated separately, validated the use of multiple simple messages with direct confirmation.

8.3 Further refinement

The immediate confirmation protocol can be further relaxed to improve efficiency while still providing all guarantees required by the object model. If multiple messages in a row are sent to the same remote node for processing, it is possible to only confirm the last one. This assumes that messages sent to one node cannot overtake each other and that they are processed in the order in which they are received. Some designs might not provide such guarantees, but as they hold in OCR-Vdm, we already have a working experimental implementation of this optimization on a development branch.
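A minimal sketch of this rule follows. Given the sequence of target nodes of consecutive messages, the hypothetical function `mark_confirmations` decides which messages request a confirmation; it relies on the in-order delivery and processing assumption stated above.

```python
def mark_confirmations(targets):
    """For a sequence of message target nodes, decide which messages need a confirmation.

    Assumes messages to one node are delivered and processed in sending order.
    """
    flags = []
    for i, node in enumerate(targets):
        # Only the last message of a run to the same node is confirmed.
        last_in_run = i + 1 == len(targets) or targets[i + 1] != node
        flags.append(last_in_run)
    return flags


targets = ["n1", "n1", "n1", "n2", "n1"]
flags = mark_confirmations(targets)
```

For the five messages in this example, only three confirmations are requested: one for the run of three messages to `n1`, one for the message to `n2`, and one for the final message to `n1`.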

As sending a long batch of messages to the same remote node is going to be rare, we might further improve the situation by reordering messages. In general, the object model does not allow that, so it is necessary to individually determine which pairs of messages can be reordered. For example, if the messages implement event satisfaction and two messages target the same event, they obviously cannot be reordered. Even if they target different events, they still cannot be reordered, because a delayed effect of satisfying the second event could then overtake the immediate effect of satisfying the first event, producing incorrect behavior. However, two messages that add outgoing dependences to an event can be reordered, as the order of dependences does not matter. We could not, however, swap either of these messages with any message that can lead to satisfaction of the event.

A typical example where reordering may help significantly is the common scenario of creating a new task. After the task is created, most of its incoming dependences are defined immediately. Defining such a dependence entails two messages: one to the task and another to the source of the dependence. We can move all messages for the task forward, forming one larger group that can be sent in one batch and confirmed with a single confirmation message. The messages to the dependence sources can be grouped by owning nodes and sent in groups as well.
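The grouping can be sketched as follows, assuming that the reordering is valid for these messages as argued above. The function name, the message tuples, and the node names are hypothetical and serve only to count batches.

```python
from collections import defaultdict


def batch_dependence_messages(task, task_owner, sources, owner_of):
    """Group the messages needed to define a new task's incoming dependences.

    sources: list of source objects, one per pre-slot of the task.
    owner_of: mapping from source object to the node that owns it.
    """
    # All messages for the task form one batch with a single confirmation.
    task_batch = [("set pre-slot", task, slot, source)
                  for slot, source in enumerate(sources)]
    # Messages to the dependence sources are grouped by owning node.
    source_batches = defaultdict(list)
    for slot, source in enumerate(sources):
        source_batches[owner_of[source]].append(("add target", source, task, slot))
    return (task_owner, task_batch), dict(source_batches)


owner_of = {"e0": "n1", "e1": "n2", "e2": "n1"}
task_part, source_part = batch_dependence_messages(
    "t", "n0", ["e0", "e1", "e2"], owner_of)
confirmations_needed = 1 + len(source_part)
```

In this example, six individual messages are reduced to three batches, so three confirmations are needed instead of six.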

9 Conclusion

The existing OCR specification does not sufficiently define the synchronized-with relation and the way synchronization is applied to the runtime objects. We have filled in these gaps and also provided some examples of how the proposal can be implemented by a runtime system to ensure correct synchronization. It turns out that the existing definition can be naturally extended from covering only data stored in data blocks to all runtime objects, without breaking established OCR practices (and programs). The implicit assumptions made in the specification and mostly adopted by application developers had to be made explicit and clarified.

The formal model can be used to reason about concurrency issues in OCR programs, possibly even allowing automatic checking tools to be deployed to find instances where the program violates the rules set by the OCR specification and the object model. For example, such a tool could detect an object that is being destroyed without proper synchronization to ensure that the destruction happens after all uses of the object.