Distributed tracing of OPC UA method calls

With increasing digitalization, shopfloor architectures transition to service-oriented, distributed layouts in which the complexity of monitoring communication between systems becomes a major challenge. Distributed tracing assists in establishing causality and hence supports the analysis of latency aspects, wrongly configured communication endpoints, and bottlenecks. In this paper, we present a first feasibility study, which investigates to what extent it is possible to trace OPC UA method calls in a distributed manner using the Zipkin framework. We show how this standard can be used in conjunction with the Eclipse Milo OPC UA open source stack and how it can be integrated into our industry demonstrator “Factory in a Box”.


Introduction
As digitalization advances, shopfloor architectures transition to service-oriented, distributed layouts. As a communication protocol, OPC UA is introduced at multiple hierarchy levels, e.g., cell, station, substation, and even component (see, for example, the VDMA R+A OPC UA Demonstrator [17]). The two trends together create highly distributed CPPS, in which the complexity of monitoring communication between systems becomes a major challenge.
Distributed tracing assists in understanding which communication and action on the shopfloor is the result of what other action. This, in turn, helps establish causality and hence supports the analysis of latency aspects, wrongly configured communication endpoints, and bottlenecks.
As a centralized schema for monitoring is typically only feasible when a single vendor provides an integrated solution, yet with elements coming from multiple vendors, a more lightweight approach is more feasible.
In this paper, we present a first feasibility study to what extent it is possible to trace OPC UA method calls in a distributed manner. To this end, we propose to use the widely used Zipkin standard for tracing information modeling and trace correlation information propagation. We show how this standard can be used in conjunction with the Eclipse Milo OPC UA open source stack and integrated into our industry demonstrator: "Factory in a Box". We also discuss the various ways OPC UA interaction affordances can be instrumented, and the open issues deriving from a communication environment that aside from method calls also uses data access, monitored items, and events.

Motivating scenario
Our lab-scale production cell, as part of a university-funded demonstrator project Factory in a Box 1 (FIAB), aims at illustrating the concepts that enable flexible production. Our particular demonstrator can customize the drawings on a piece of paper at multiple plotting stations. The production cell consists of the following machine types: input stations that provide pallets with paper, plotters that load the pallets and draw images in one color each, turntables that transport the pallets between plotters, and finally, output stations where the finished product (i.e., paper) is placed.
Communication between the machines is purely based on OPC UA. Plotters and output stations are designed and programmed with the IEC 61499 2 industry standard using the open source IDE Eclipse 4diac [13] and the respective FORTE runtime environment that builds on the Open62541 OPC UA [5] server. The turntables (and virtual Fig. 1. Example control flow through a distributed system (left), with instrumented and non-instrumented components, using method calls and events (full lines); displaying component internal control flow with dashed lines. Info icons mark the instrumentation points where trace information is sent to the central trace collector (e.g., a Zipkin server). The diagram to the right displays the resulting spans and their relations. Span names identify where the tracing information has been captured twins of the prior introduced machines) are implemented in Java, using the Eclipse Milo OPC UA framework. Hence, despite the toy character of the setup, the used software and communication infrastructure is industrial grade. We use Lego Mindstorm EV3-based PLCs as a basis as this enables rapid and cheap machine prototyping without the need to test individual subcomponents such as motors, actuators, and sensors. Also, it ensures their seamless integration.
In the scope of this article, we are especially interested in tracking individual orders through the FIAB shopfloor, and hence, need to trace how the machines communicate for fulfilling each order. Given the highly dynamic layout, one critical support is quick detection of incomplete or incorrect wiring of communication paths between machines, production deadlocks, identify the potential for parallelizing activities, and understanding to what extend the production time of each order is determined by the machine one the one hand and coordination needs among the machines on the other hand. In short, we need to understand how the behavior of machines (i.e., method calls, etc.) is causally related to the individual production orders.

Distributed tracing background
This section provides only a brief introduction to the concepts used in distributed tracing. The interested reader finds a more thorough discussion in [11].
The idea of distributed tracing is tracking how a batch, an order, a single command, or event is processed by a (distributed) production system. The ripple effect of further requests or events is then captured in a "trace". Figure 1 (left) depicts a simplified control flow graph through an example system. Different approaches are possible to know which request/events belong to the same trace (see also Sect. 5). The state-of-the-art approach to obtain such a correlation is using metadata propagation of a trace identifier. This implies the instrumentation of the various system components with tracepoints. These tracepoints are typically placed where a request or event enters a component, respectively, where a response or event is returned.
A key aspect of distributed tracing is capturing the causality amongst requests, i.e., how is the sending of one request was caused by the reception of another, earlier request. In the example flow it would be potentially difficult to assess based only on timestamps and trace identifier information alone whether request 4 occurred due to request 2 or due to request 3. To this end, trace metadata propagation typically also includes a parent request identifier. For example, both requests 2 and 3 will have request 1 as a parent, request 4 will have request 2 as a parent.
Tracepoints by themselves just represent events in a system. Even with causality relationships among these events (that then build up an acyclical directed graph), users find it usually hard to easily interpret such trace information. Most modern tracing approaches thus utilize "spans" to group events that meaningfully describe a component's processing activities. These spans build up a tree where lower spans describe the more detailed, fine granular processing. Figure 1 (right) depicts the span hierarchy for the depicted control flow on the left. Note that the uppermost client span (i.e., the root span), is complete before the end of Consumer D and potentially even Producer D span due to their asynchronous execution.
Ultimately, span-based distributed tracing such as Zipkin foresees the sending of spans to the central span collection server (e.g., a Zipkin server) upon the finishing event of a span (indicated in Fig. 1 with the info icons).

Zipkin
In Zipkin, a span entry defines the following properties which we briefly introduce. We discuss a more detailed mapping to OPC UA in Sect. 3.
-traceId: same random id for all spans that belong to the same trace. -name: the activity or operation this span represents, e.g., an OPC UA skill or the OPC UA method name if available. -parentId: empty if this is the root span, otherwise parent span's id. -id: uniquely identifies this span -kind: additional, optional information to allow interpretation of timestamp, duration, and remoteEndpoint. We discuss the implications of the possible values (CLIENT, SERVER, PRODUCER, and CONSUMER) below. -timestamp: an as accurate as possible measure representing the start of this span. -duration: how long this span lasted. The interpretation depends on the span's kind. -localEndpoint: describes the networked component that created this span in terms of a serviceName, ipv4 or ipv6 address, and port number. -remoteEndpoint: describes the networked component to which a causal relation exists in the scope of this trace. The span's kind determines how to interpret the remoteEndpoint. -annotations: capture additional events that occur between the start and end of this span. Each annotation consists of a timestamp and a string describing the event, e.g., for capturing when an error occurred, perhaps an internal state transition happened, etc.
-tags: provide additional endpoint or vendor-specific information such as version information, query string, request parameters, etc. as a set of key-value pairs.
Aside from traceId, and id, (and parentId for child spans), no other properties are strictly mandatory.
CLIENT describes a span from the view of a component that requests some service from a remote component (described in the span's remoteEndpoint), capturing in the span's timestamp the instant the request was sent and measuring in the span's duration the time until a response is received. SERVER describes a span from the view of a component that receives a service request from a remote client (described in the span's remoteEndpoint), capturing in the span's timestamp the instant the request was received, and measuring in the span's duration the time it takes to send back a response. PRODUCER describes the span from the view of a component that dispatches an event to an event broker (described in the span's re-moteEndpoint) without any knowledge of who the receiving consumers will be. The span's timestamp indicates the time of handing the event over to the broker. The span's duration, if present, measures the time between handing over the event, and the event being actually sent to consumers. CONSUMER described the span from the view of a component that receives an event from an event broker (described in the span's re-moteEndpoint) without any explicit knowledge of who the sender of the event is. The span's timestamp indicates the time of obtaining the event from the broker, the duration, if present, measuring the time between the reception by the event broker and the start of being processed by the consumer component.
Note that in the case of CLIENT and SERVER the span's duration is typically more meaningful as it describes a component's/service's/skill's latency as well as network latency, whereas the span's duration for PRODUCER and CONSUMER only measures the latency of the event broker and requires application-specific knowledge to measure any type of service/skill latency.

Mapping Zipkin to OPC UA
Two aspects need consideration when applying distributed tracing to OPC UA. First, the propagation of trace meta-data, and second, the use of spans.

Propagating trace correlation information
Zipkin uses the B3 header specification 3 for which mappings to HTTP headers, Apache Kafka record header, gRPC ASCII headers, or JMS headers already exist. We propose to encode the B3 equivalent information in an OPC UA method call's additionalHeader element defined in the OPC UA standard. 4 At the moment, no standardization exists yet that would manage how multiple, independent header extensions can coexist in the additional header element. Hence, the exact injection and extraction procedure of the B3 header from OPC UA method calls is very likely to change; the encoding of the B3 information itself, however, will not be affected by this. 3 https://github.com/openzipkin/b3-propagation. 4 https://reference.opcfoundation.org/v104/Core/docs/Part4/7.28/.

<B3
t r a c e I d ="80f198ee56343ba864fe8b2a57d3 . . . " s p a n I d = \ e457b5a2e4d86bd1" p a r e n t S p a n I d ="05e3ac9a4f6e3b90" / > In order to provide the B3 information for injection into the actual OPC UA additionalHeader element, we needed to extend a few classes in the Eclipse Milo stack (v0.3.8) as the header is not sufficiently exposed via the stack's API. Specifically, on the client-side we extended the OpcUaClient and the UaStackClient to enable handing over and insertion of B3 information into the header and the resulting dispatching of span information to the Zipkin server upon method call completion. On the server-side, we extended the Server-SymmetricHandler, StackServer, SessionManager, and Session to extract the B3 header information from the request, and thereby enable any server logic to pass on the B3 information for further propagation, and likewise, the dispatching of span information to the Zipkin server. Note that the client can send span information to the Zipkin server also when invoking a method on a non-instrumented server, as it doesn't rely on header information in the invocation response. We additionally relied on the Brave library 5 for the actual dispatching of the span information to Zipkin.

Implications of OPC UA usage
Contemporary web-based systems primarily use a procedure callcentric architecture where a request is typically expecting a response with the desired content. In contrast, with OPC UA, method requests are expected to finish as quickly as possible, thus being most often used merely as triggers to longer running actions on the serverside. The result and success of such actions are then inspected using separate method calls, data access operations, monitored items, or events.
Consequently, even when instrumenting OPC UA method calls and creating spans of kind "SERVER", respectively "CLIENT", this will result in a span structure more similar to spans of kind "PRO-DUCER" and "CONSUMER". We demonstrate this behavior in Sect. 4 where we also discuss the opportunity to lift tracing instrumentation into the component logic to obtain a more easily understandable span hierarchy.
Aside from method invocations, requests to read and write an OPC UA node's value via OPC UA Data Access can be equally instrumented with B3 header extension information. Here, however, we have to differentiate the direction of the correlation information flow for reads and writes. A write can be interpreted as a form of request. Note, that this will not be the case for every data item; hence such instrumentation only makes sense where a write is expected to trigger some behavior on the server-side that propagates correlation information and is traced. Here, the request header will contain the relevant B3 correlation information. Upon dispatching the write request, the OPC UA client may then create a span of kind "PRO-DUCER", and the server should create a span of kind "CONSUMER" whenever that node is next accessed locally. A read, in contrast, can be used as a mechanism to providing a return value from a longrunning action, or a trigger to execute some next steps on the client side. In both cases, the B3 header in the response will be meaningful to propagate context from server to client. Creating of spans is reversed here. Whenever the node value changes at the server, the server may create it as a span of kind "PRODUCER", and the OPC UA client will create a span of kind "CONSUMER" whenever it reads

Fig. 2. Conceptual architecture excerpt of the Factory in a Box showing OPC UA method calls (full lines), and machine internal communication paths (dashed lines)
the value. Note that in the former case, the client could also retain the B3 propagation information from the initial long-running action trigger, thus not relying on any B3 response header information. For both read and write, however, there is more instrumentation needed on the server-side than for method invocations, as the server has to maintain B3 information for every node that is used for correlation propagation. The same read-specific aspects hold for propagating correlation information when using monitored items.
Transparent instrumentation of OPC UA events is more difficult, as, at the time of writing, no standardized mechanism for including meta-information in an event is available. Hence, any correlation information needs to be explicitly modeled in the event data model.

Evaluation use case
In the Factory in a Box scenario, the primary communication via OPC UA occurs between the MES and the machines (input station, output station, turntables, plotters) as well as between machines to coordinate the handover of pallets. In doing so, we can obtain a single trace from the MES' transport request(s) to the turntable(s) and the subsequent handover handshakes. Figure 2 provides a conceptual architecture overview of the involved components in the Factory in a Box for transporting a pallet from input station to Turntable 1, on to Turntable 2, a plotter, and ultimately reaching the output station. To this end, turntables offer an OPC UA server endpoint for receiving transport commands (i.e., method invocations), and then connect with respective OPC UA clients to the other machines to coordinate the pallet handover. A turntable also exposes its OPC UA server to other turntables to coordinate pallet handover between turntables. Note, that the input station and output station need not be controlled from the MES as they signal via their state (i.e., accessible as an OPC UA node) whether they are ready to provide a pallet, respectively, whether they can take on a pallet. Figure 3 displays the resulting trace information as displayed in the Zipkin UI for an exemplary transport scenario where a pallet is moved from input station to output station via the turntables without, for the sake of clarity, involving a stop at a plotter. Note that a turntable needs to internally forward the trace correlation information obtained when receiving the transport request to the handshake com-ponent which coordinates the handover. This requires explicit extraction of the B3 header information from the OPC UA method call and thus is specific to the communication means used within the machines (here IEC 61499 for plotters, and Akka actor framework for turntables). The exact mechanism involved is outside the scope of this article. In FIAB, we limited the instrumentation to sending spans from the client-side as not all server-side OPC UA stacks involved in FIAB are yet instrumented (i.e., currently only Eclipse Milo, but not yet the open62514 server used in the Forte framework on the plotters and input and output stations).
There are several aspects visible in the trace of Fig. 3 (left). The root span is created in the MES (indicated by "DEFAULT" spans), where the transport system coordinator has separate OPC UA clients to connect to both involved turntables. A transport request is then dispatched to the two turntables at the same time. This is visible in the timestamps of the two "/transportrequest" spans that reside at nesting level two. Figure 3 (right) provides details of one such request in the span's tags, here specifically the method invocation parameters and OPC UA method node identifier.
The spans are listed in their dispatch order by timestamp. In this particular case, the MES happened to dispatch the transport request to Turntable 2 first, and immediately afterward to Turntable 1. Turntable 1 then proceeds to initiate the handshake to the input station (the two bottom spans). Only once Turntable 1 has completed loading the pallet and is in a ready state will Turntable 2 commence with the handshake to Turntable 1 (the two brown spans), and then the handshake to the output station. The time before and after the set of two handshake method calls is spent on turning and conveying (internal component behavior that involves no traceable OPC UA method calls). The attentive reader will notice that the spans' duration is extremely short. As we outlined in the previous section, the method calls are merely used as triggers to an action such as "initiate" or "start" a handshake and trigger a transport request. To obtain spans of longer duration, it is necessary to push the tracing instrumentation from the OPC UA stack up into the (sub) components that manage logic at a higher abstraction level, for example, the state machine that governs the handshake. There, a component could then start a span upon receiving an internal "start" re- Fig. 3. Example trace of a transport request from the MES to two turntables, as well as the handshake communication to an input station, among the turntables, and an output station. Span details of one request displayed on the right quest, capture in span annotations the events triggered by receiving of state updates via monitored items, and eventually close the span upon transitioning into the handshake state machine's "Completed" state. Any OPC UA-level spans would then be captured as they currently are but only as sub spans of a higher-level handshake span. Similarly, the MES transport system coordinator would have to manage separate spans for each end-to-end transport request.
Nevertheless, already at the current level of trace instrumentation, the available trace information allows to inspect the timing of requests, here for example, between initiate and start, and analyze any deviations thereof over time, or the context of requests in case one of them fails.

Related work
A plethora of runtime monitoring approaches [8] focus on the monitoring of resources, e.g., for fault detection or resource usage [15], often supported by model-driven techniques to simplify instrumentation [6]. In this category, approaches that address distributed systems, mostly focus on aspects and events of the individual distributed components but not their interaction. Here passing of correlation metadata is less of a requirement, or implicitly provided in the identifier of the instrumented components [16].
As production systems become more flexible and need to support frequent reconfiguration on the move to lot-size one, per order, batch, or request tracing becomes important. To this end, metadata propagation based techniques such as Zipkin (originally developed by Twitter) are suitable candidates to support these existing approaches. Zipkin is not the only distributed tracing technology with multiple frameworks emerging over the past decade (e.g., Dapper [12] at Google, Jaeger [10] at Uber, or Canopy [3] at Facebook).
These tracing infrastructures are then the basis for sophisticated use cases such as anomaly detection [7] or attack path detection [14]. Additional use cases and their support by distributed tracing frameworks are discussed in depth by Sambasivan et al. [9]. As the number of different use cases extends to debugging, taint propagation, auditing, or provenance, a generic multipurpose context propagation mechanism becomes sensible [4].
Alternative approaches to metadata propagation-based monitoring include treating the distributed system as a black box and thus only derive insights from analyzing vast amounts of heterogeneous log entries (e.g., [2]) or schema-based approaches (e.g., [1]) that define a model for each component on how to interpret logged system calls. Such approaches, however, have significant limits in their ability to enable inspection of the causal relationships of requests (and their properties) for individual traces.

Conclusions and outlook
In this article, we motivated the potential for distributed tracing to better understand causality in highly distributed and dynamic CPPS. We introduced how to integrate Zipkin with OPC UA method calls and discussed that additional instrumentation is also useful when reading and writing to nodes via OPC UA Data Access. We also highlighted that due to the event-centric nature of CPPS, method callbased spans are not ideally suited to quickly and easily understand span structures. To this end, span generation based on component internal models such as state machines seems very promising. As we outlined in Sect. 5, Zipkin is just one applicable framework, with the OpenTracing APIs 6 aiming for interoperability. However, all similar approaches that have their origin in the web and cloud domain face similar challenges concerning passing correlation information and, most importantly, the event-centric nature of OPC UA communication.
Funding Note Open access funding provided by Johannes Kepler University Linz.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.