1 Introduction

Railway signalling is a safety-critical system whose responsibility is to guarantee the safe and efficient operation of railway networks. In recent decades there have been proposals to utilise distributed system concepts (e.g. [13, 24]) in railway signalling as a way to increase railway network capacity and reduce maintenance costs. These emerging distributed railway signalling concepts propose using radio-based communication technology to decentralise contemporary signalling systemsFootnote 1. Because of their complex concurrent behaviour, distributed systems are notoriously difficult to validate, and this could curtail the development and deployment of novel distributed signalling solutions.

In recent years there has been a push (e.g. [12, 22]) by industries with a strong focus on distributed systems to incorporate formal methods into their system development processes to improve system assurance and time-to-market. Yet, although the railway domain has for years proved to be a fruitful area for applying various formal methods [3, 7], considerably less has been done by industry and academia in applying them to distributed railway systems. Therefore, the long-term aim of our research is to lower the barriers to applying formal methods in developing correct-by-construction distributed signalling systems.

In order to manage the modelling and verification complexity of distributed protocols we are working towards an integrated multifaceted methodology, which is based on three concepts: stepwise refinement, communication modelling patterns and validation through proofs. In spite of advancements in proof automation, it might be too onerous to mathematically prove the model in early development stages. Therefore, it is also desirable that the framework should support model animation and scenario validation. It is also paramount that the framework supports quantitative evaluation; as stated by Fantechi and Haxthausen [10], distributed signalling solutions will only be adopted in practice if system availability is demonstrated. The authors of the related research discussed in [10] did not consider liveness and fairness properties, which directly affect system availability. In our proposed multifaceted methodology we integrate stochastic simulators for quantitative analysis.

In this paper, we present research that uses the proposed methodology to formally develop and verify a distributed railway signalling protocol which would deliver the benefits of decentralised signalling while meeting high safety requirements. The developed distributed signalling protocol is based on serialisability and is inspired by protocols used in transaction processing [4, 8, 11] in centralised and distributed database systems. The main objective of our protocol is to guarantee mutual exclusion of railway sections while ensuring system liveness. In a nutshell, our key contributions are a formally proved distributed railway section allocation protocol inspired by past protocols for database systems, and the formalisation of the multifaceted verification framework.

Related Work. In Fantechi and Haxthausen [10] the authors formalise the railway interlocking problem as a distributed mutual exclusion problem and discuss the related literature on distributed interlocking (e.g. [9, 13, 24]). In principle all railway models share similar high-level safety, liveness and fairness requirements, as summarised on page 2 in [10]. One difference between our work and the studies overviewed in [10] is the interlocking engineering concept and the system model (e.g. allowed message delays). Another difference is the formal consideration of liveness and fairness requirements. In our work we not only prove the safety properties of the protocol, but also ensure system liveness and fairness, and analyse performance.

A similar distributed signalling concept is presented as a case study in [1]. The authors verified their system design via a simulation approach and only considered scenarios with up to two trains. In our verification approach we prove the distributed signalling system mathematically and hence guarantee its safety for any number of trains. In the paper by Morley [21] the author formally proved a distributed protocol used in real-world railway signalling systems to reserve a route that is jointly controlled by adjacent signalling systems. Even though the distributed signalling concepts of the two works are different, the effects of message delays on safety were considered in both.

The rest of the paper is organised as follows. Section 2 outlines the motivation for developing the protocol, semi-formally describes its functionality, elicits the requirements and introduces its specifications and the properties to be proved. Section 3 further discusses the integrated methodology we are proposing. The following section briefly discusses formal model development and also provides technical details on property verification and performance analysis. In the last section we summarise our work and discuss future work directions.

2 Distributed Resource Allocation Model and Protocol

Distributed railway signalling can increase network capacity (as trains could run closer together), improve the system's resilience to delays and possibly reduce repair costs. On the other hand, the increased system complexity and the safety-critical (SIL4) nature of signalling require the highest level of safety assurance. In order to apply formal methods one must clearly state system requirements and specifications. In the following subsections we describe an abstract model of the distributed railway system and its requirements, as well as the \(\mathsf {stage_1}\) of the distributed protocol, which guarantees the safety and liveness of the distributed system.

2.1 High-Level Distributed System Model and Requirements

We abstract the railway model: instead of trains, routes and switches, our system model consists of agents and resources (resource controllers). The system model permits message exchanges only between agents and resources, and messages can be delayed. Each resource controller has an associated queue-like memory, where the agents' allocation order can be stored. A resource also has a promise pointer (ppt) and a read pointer (rpt), which respectively indicate the currently available slot in the queue and the reserved slot (with an associated agent) that currently uses the resource. An agent has an objective, which is a collection of resources the agent will attempt to reserve (all at the same time) before using and eventually releasing them.

  • \(\mathsf {SAF_1 \,|} \) A resource will not be allocated to different agents at the same time.

  • \(\mathsf {SAF_2 \,|} \) An agent will not use a resource until all requested resources are allocated.

  • \(\mathsf {LIV_1 \; \;|} \) An agent must eventually be allocated the requested set of resources.

  • \(\mathsf {LIV_2 \; \;|} \) Resource allocation must be guaranteed in the presence of message delays.

Requirements 1: High-level systems safety and liveness requirements

The main objective of the protocol is to enable safe and deadlock-free distributed atomic reservation of collections of resources, where by a safe resource reservation we mean that no two different agents have reserved the same resource at the same time. The protocol must also guarantee that each agent eventually gets all requested resources - partial request satisfaction is not permitted. The main high-level safety and liveness requirements of the distributed system are expressed in Requirements 1.

The following section attempts to justify the need for an adequate distributed protocol by discussing problematic distributed resource allocation scenarios.

2.2 Problematic Distributed Resource Allocation Scenarios

Let us consider Scenarios 1–2 (visualised in Fig. 1) to see how requirement \(\mathsf {LIV_1}\) cannot be guaranteed (while ensuring \(\mathsf {SAF_2}\)) without an adequate distributed resource allocation protocol.

Scenario 1. In this scenario, agents \(\mathsf {a_0}\) and \(\mathsf {a_1}\) are attempting to reserve the same set of resources \(\mathsf {\{r_0, r_1\}}\). Each agent starts by sending request messages to both resources. Once a resource receives a request message, it replies with the current value of the promised pointer (\(\mathsf {ppt(r_k)}\)) and then increments \(\mathsf {ppt(r_k)}\). For instance, in this scenario, resource \(\mathsf {r_0}\) first received a request message from agent \(\mathsf {a_0}\) and thus replied with the value \(\mathsf {ppt(r_0)}\) = 0, which was then followed by a message to \(\mathsf {a_1}\) with an incremented \(\mathsf {ppt(r_0)}\) value of 1. In the figure, we denote by \(\mathsf {a_{n}^\mathbf * }\) the \(\mathsf {ppt(r_k)}\) value sent to \(\mathsf {a_n}\). Request messages at resource \(\mathsf {r_1}\) were received and replied to in the opposite order.

Fig. 1.

Problematic scenarios: Scenario 1 (left) and Scenario 2 (right)

In this preliminary protocol, after an agent receives promised pointer values from all requested resources, it sends messages to the requested resources to lock them at the promised queue-slots. In this scenario, agent \(\mathsf {a_0}\) was promised queue-slots \(\mathsf {\{(r_0, 0), (r_1, 1)\}}\) while \(\mathsf {a_1}\) was promised \(\mathsf {\{(r_0, 1), (r_1, 0)\}}\). If the agents locked these exact queue-slots, resource \(\mathsf {r_0}\) would allow \(\mathsf {a_0}\) to use it first, while \(\mathsf {r_1}\) would concurrently allow \(\mathsf {a_1}\). The distributed system would deadlock and fail to satisfy the \(\mathsf {LIV_2}\) requirement, as both agents would wait for the second use message to ensure \(\mathsf {SAF_2}\).

In order to prevent this cross-blocking type of deadlock, an agent should repeatedly re-request the same set of resources (and not lock them) until all received promised queue slot values are the same. We define the process of an agent attempting to receive the same promised queue slots as the agent forming a distributed lane (\(\mathsf {dl}\)).

A distributed lane of agent \(\mathsf {a_n}\) is \(\mathsf {dl(a_n) \, = \,\{(r_k,s),\,(r_{k+1},s), \, \dots \,, \, (r_{k+m},s) \}}\), where \(\mathsf {\{r_k,\, r_{k+1}, \, \dots \,, \, r_{k+m}\}}\) are all resources requested by agent \(\mathsf {a_n}\) and \(\mathsf {s}\) is the queue slot value promised by all requested resources. It is important to note that this solution relies on the assumption that there is a non-zero probability of distinct messages arriving at the same destination in different orders, even if they are simultaneously sent by different sources.
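As a small illustration, the lane-forming condition can be stated as a one-line predicate (a sketch in Python; the function name and data representation are ours, not part of the protocol):

```python
def forms_lane(promised):
    """promised maps each requested resource to its promised queue slot.

    A distributed lane exists iff every requested resource promised the
    same slot value s.
    """
    return len(set(promised.values())) == 1

# Scenario 1, first round: a_0 was promised {(r_0, 0), (r_1, 1)} -> no lane.
assert not forms_lane({"r0": 0, "r1": 1})
# After re-requesting, a_0 received {(r_0, 2), (r_1, 2)} -> lane formed.
assert forms_lane({"r0": 2, "r1": 2})
```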

The modified situation is depicted in Scenario 1 where, after initially receiving slots \(\mathsf {\{(r_0, 0), (r_1, 1)\}}\) and \(\mathsf {\{(r_0, 1), (r_1, 0)\}}\), agents \(\mathsf {\{a_0, a_1\}}\) mutually re-request the resources. This time they receive slots \(\mathsf {\{(r_0, 2), (r_1, 2)\}}\) and \(\mathsf {\{(r_0, 3), (r_1, 3)\}}\), and are able to form distributed lanes \(\mathsf {dl_0(a_0)}\) and \(\mathsf {dl_1(a_1)}\).

Scenario 2. However, simply re-requesting the same resources might result in a different problem. In Scenario 2, agent \(\mathsf {a_1}\) has requested and been allocated a single resource \(\mathsf {r_1}\), which in turn modified \(\mathsf {ppt(r_1)}\) to 1 while \(\mathsf {ppt(r_0)}\) remained 0. If another agent \(\mathsf {a_0}\) attempts to reserve resources \(\mathsf {\{r_0, r_1\}}\), it will never receive the same promised pointer values from both resources and, hence, will not be able to lock them.

To address the two issues described above, we developed a two-stage protocol, where the \(\mathsf {stage_1}\) of the distributed protocol specifies how an agent forms a distributed lane. \(\mathsf {Stage_2}\) of the protocol, which is out of the scope of this paper, addresses other deadlock scenarios which can occur after agents form distributed lanes. In the following subsection we semi-formally describe the \(\mathsf {stage_1}\) of the protocol.

2.3 Semi-formal Description of the \(\mathsf {Stage_1}\)

An agent which intends to reserve a set of resources starts by sending \(\mathsf {request}\) messages to the resources that are part of the agent's current objective. In the provided pseudocode excerpt, we first denote the relations \(\mathsf {sent\_requests}\) and \(\mathsf {objective}\), which are mappings from agents to resource collections (ln. 1–3, Algorithm 1). The \(\mathsf {request}\) messages are sent by an agent \(\mathsf {a_n}\) to a resource \(\mathsf {r_k}\) (\(\mathsf {r_k \in objective[a_n]}\)) until \(\mathsf {sent\_requests[a_n]}\) = \(\mathsf {objective[a_n]}\) (the images are equal). When a resource \(\mathsf {r_k}\) receives a \(\mathsf {request}\) message from an agent \(\mathsf {a_n}\), it responds with a \(\mathsf {reply}\) message containing the current promised pointer value \(\mathsf {ppt(r_k)}\) and increments the promised pointer (ln. 2–4, Algorithm 2). After sending all \(\mathsf {request}\) messages an agent waits until \(\mathsf {reply}\) messages are received from all requested resources and then makes a decision.

Algorithm 1: agent-side pseudocode of \(\mathsf {stage_1}\) (figure)

When all received promised pointer values are the same (a distributed lane can be formed), an agent completes \(\mathsf {stage_1}\) by sending \(\mathsf {write}\) messages, containing the negotiated index, to all requested resources (ln. 14–17, Algorithm 1). But if one of the received promised pointer values differs, the agent starts a renegotiation cycle (ln. 5–13, Algorithm 1) by sending \(\mathsf {srequest}\) messages containing a desired slot index to the resources. The desired index is computed by taking the maximum of all received promised pointer values and adding a constant (one is sufficient) (ln. 6, Algorithm 1). A resource replies to an \(\mathsf {srequest}\) message with the higher of the current \(\mathsf {ppt(r_k)}\) and the received \(\mathsf {srequest}\) value, and updates the promised pointer accordingly (ln. 5–7, Algorithm 2). After sending all \(\mathsf {srequest}\) messages, an agent waits for \(\mathsf {reply}\) messages and then restarts the loop if the received slot indices are not the same.
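The \(\mathsf {stage_1}\) logic above can be condensed into the following executable sketch (Python; the class and function names are ours, and the sketch serialises the exchanges of a single agent, whereas the real protocol is asynchronous and message-passing):

```python
class Resource:
    """Resource controller holding a promise pointer (ppt)."""

    def __init__(self, ppt=0):
        self.ppt = ppt

    def on_request(self):
        # Reply with the current ppt, then increment it (ln. 2-4, Algorithm 2).
        slot, self.ppt = self.ppt, self.ppt + 1
        return slot

    def on_srequest(self, desired):
        # Reply with the higher of the current ppt and the desired slot,
        # then advance the promise pointer (ln. 5-7, Algorithm 2).
        slot = max(self.ppt, desired)
        self.ppt = slot + 1
        return slot


def negotiate_lane(objective, resources):
    """Run stage_1 for one agent until all promised slots coincide."""
    replies = {r: resources[r].on_request() for r in objective}
    while len(set(replies.values())) > 1:        # indices differ: renegotiate
        desired = max(replies.values()) + 1      # ln. 6, Algorithm 1
        replies = {r: resources[r].on_srequest(desired) for r in objective}
    slot = next(iter(replies.values()))
    return {(r, slot) for r in objective}        # the distributed lane


# Scenario 2: r_1 already served another agent, so its ppt is offset by one.
resources = {"r0": Resource(ppt=0), "r1": Resource(ppt=1)}
lane = negotiate_lane(["r0", "r1"], resources)
assert lane == {("r0", 2), ("r1", 2)}   # lane formed after one srequest round
```

In the Scenario 2 setup shown, the first round yields differing slots (0 and 1), and a single \(\mathsf {srequest}\) round with desired index 2 forms the lane.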

Algorithm 2: resource-side pseudocode of \(\mathsf {stage_1}\) (figure)
  • \(\mathsf {SAF_3 \,|}\) An agent will not send \(\mathsf {write}\) (form a distributed lane) messages until all received promised pointer values are identical.

  • \(\mathsf {SAF_4 \,|} \) Agents with overlapping resource objectives will negotiate distributed lanes with different indices.

  • \(\mathsf {LIV_3 \,|} \) An agent will eventually negotiate a distributed lane.

Requirements 2: Low-level protocol \(\mathsf {stage_1}\) safety and liveness requirements

It is important to note that the \(\mathsf {stage_1}\) protocol's solution to the described deadlock scenarios has a stochastic nature, and one needs to guarantee that a desirable state is probabilistically reachable. In Requirements 2 we summarise the requirements for \(\mathsf {stage_1}\) of the protocol.

After an agent completes \(\mathsf {stage_1}\) and thus negotiates a distributed lane, it starts protocol \(\mathsf {stage_2}\) to prevent other deadlock scenarios. Because the paper's verification focus is on the properties of \(\mathsf {stage_1}\) (to which all the complementary verification and analysis techniques are applied), we provide the description of protocol \(\mathsf {stage_2}\) in the online appendixFootnote 2.

3 Multifaceted Modelling and Verification Framework

As stated before, the long-term objectives of our research are to reduce the modelling and verification effort of distributed systems and to have a multifaceted framework for studying protocols from all relevant perspectives. In the introduction we defined the key formal concepts the framework should rely on, and in the previous section we discussed the protocol requirements we need to guarantee.

The following subsections propose an engineering process combining different formal techniques, each of which is efficient at handling part of the above requirements and helps to manage modelling and verification complexity.

Fig. 2.

Multifaceted modelling and verification framework

3.1 Formalised Multifaceted Verification Framework

For any adequate formal system development, system requirements should be clearly stated, and so, this is the first step (Step 1 in Fig. 2) in the modelling process. Currently, we do not suggest or provide a specific structural approach for defining distributed system requirements. The next step (Step 2) in the process is developing and verifying a pivotal formal model. The purpose of formally modelling a distributed system is to have a formal artefact, which can be animated, analysed and formally verified.

For the development and verification of pivotal functional system models we selected the Event-B [2] specification language, which has previously been successfully used for modelling and verification of various distributed protocols [5, 15, 16]. The Event-B method provides an expressive modelling language and a flexible refinement mechanism, and is also proof driven, meaning model correctness is demonstrated by generating and discharging proof obligations with the available automated theorem provers [6, 17]. The method is supported by tools such as ProB [19], which enable animating and model-checking a model. On the other hand, the Event-B method does not have adequate probabilistic reasoning support, which, for example, was essential for verifying the distributed railway section reservation protocol. Therefore, it was decided to integrate the well-known PRISM [14] stochastic model checker into the framework, so that stochastic system properties can be verified.

The last step (Step 3) in the proposed engineering process is analysing the developed distributed system's performance. For that, we have implemented a high-fidelity protocol simulator which can help to evaluate protocols under normal or stressed conditions. The following subsections provide more detail on how each of the formal techniques would be used in the development and verification of a distributed protocol.

3.2 Step 2: Developing Functional Pivot Models in Event-B

A formal functional Event-B model can have a multitude of uses, but the main application is for formally proving properties about the distributed system. The completed distributed system’s model in Step 2 should cover all requirements and specifications, and would be considered correct when all generated proof obligations are proved.

The model development approach we propose is rather standard and starts with an abstract model which formally specifies the objective of the distributed protocol. In fact, distributed aspects of the system are ignored at this model level and the abstract model considers a centralised configuration. The abstract model is then iteratively refined by introducing more details about the distributed protocol, primarily by modelling communication aspects. To reduce modelling effort we previously developed communication modelling patterns and described a generic model refinement plan in [23]. A key aspect of our methodology is scenario validation and analysis. Particularly in early protocol development stages, it might be too onerous to verify a model only to discover design mistakes. To facilitate design exploration we apply animation and model-checking enabled by ProB. Nonetheless, the final (concrete) model should be proved by adding invariants to the model and discharging the generated proof obligations with the available automated theorem provers.

3.3 Step 2: Proving Stochastic Properties with PRISM

As the distributed signalling protocol has a stochastic nature, it was important to formally demonstrate that a satisfying state can be reached. Probabilistic and liveness properties are hard to formalise and prove in the Event-B method. Therefore, it was decided to prove the progress of the protocol outside of Event-B by redeveloping part of the model (\(\mathsf {stage_1}\)) in the PRISM model checker.

The drawback of using the PRISM model checker is that, if a bounded problem abstraction cannot be found, verification is limited to bounded models. As we could not find an abstraction of the protocol's \(\mathsf {stage_1}\), we created a skeleton model which can be instantiated to model specific scenarios of \(\mathsf {stage_1}\) with \(\mathsf {n}\) agents, \(\mathsf {m}\) resources and other initial conditions. Additionally, we developed a model generator, which can automatically instantiate the skeleton model to capture a random scenario and run the probabilistic verification conditions.
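A generator of this kind can be sketched as simple template instantiation (illustrative only; the constant names below are hypothetical and do not reflect the actual skeleton model):

```python
from string import Template

# Hypothetical fragment of a PRISM skeleton; only scenario parameters vary.
SKELETON = Template("""\
dtmc
const int AGENTS = $agents;
const int RESOURCES = $resources;
const int QUEUE_DEPTH = $depth;
// ... agent and resource modules instantiated per scenario ...
""")

def instantiate(agents, resources, depth):
    """Produce a concrete bounded model for one scenario of stage_1."""
    return SKELETON.substitute(agents=agents, resources=resources, depth=depth)

model = instantiate(agents=3, resources=2, depth=10)
```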

3.4 Step 3: Analysing System’s Performance

With Event-B and PRISM we aim to demonstrate that the protocol addresses the formulated requirements, but in our application domain it is also necessary to understand how the protocol would perform under various conditions if it were deployed in a real system. To conduct such a simulation we have implemented a high-fidelity protocol simulator that can be populated with any number of resources and agents while realising any conceivable agent goal formation and message delivery policies.

The simulator is parametrised with a probability function for picking a certain message out of the pool of available messages. The probability function is itself parametrised by message source, destination, timestamp and type. The simulation helps to answer how fast, in terms of vital steps such as messages sent, the protocol's \(\mathsf {stage_1}\) can be completed and how the performance is affected by message delays. With this delivery function (denoted D) we can simulate slow agents and resources; fair, arbitrary and unfair delivery policies; agents that operate much faster than others; and so on.
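The delivery function can be sketched as a weighted random choice over the in-transit pool (Python; the message representation and the `weight` signature are our assumptions):

```python
import random

def make_delivery_policy(weight):
    """Build a delivery function D that picks the next message to deliver.

    `weight(msg)` scores a message from its source, destination, timestamp
    and type; the relative scores define the delivery policy.
    """
    def deliver(in_transit, rng=random):
        weights = [weight(m) for m in in_transit]
        return rng.choices(in_transit, weights=weights, k=1)[0]
    return deliver

# Example: an unfair policy that strongly delays messages addressed to r_1.
unfair = make_delivery_policy(lambda m: 0.01 if m["dst"] == "r1" else 1.0)
```

Fair delivery corresponds to a constant weight, while skewed weights yield arbitrary or unfair policies.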

4 Formal Protocol Modelling, Verification and Analysis

In this section we present the application of the previously introduced modelling and verification framework to the development of the distributed railway signalling protocol. In Sect. 2 we defined the protocol's requirements (Step 1), thus the following subsections focus on the formal methodology aspects.

4.1 Step 2. Formal Protocol Model Development in Event-B

We apply the Event-B formalism to develop a high-fidelity functional model and prove the protocol's functional correctness requirements. We follow the modelling process presented in Sect. 3.2. It is important to note that the protocol model was redeveloped multiple times as various deadlock scenarios were found with the ProB animator and model-checker. Below, we overview the final (verified) model.

Modelling started with creating an abstract model context which contains constants, given sets and uninterpreted functions. In the abstract context, we introduced three (finite) sets to respectively represent agents (\(\mathsf {agt}\)), resources (\(\mathsf {res}\)) and objectives (\(\mathsf {obj}\)). The context also contains an objective function, which is a mapping from objectives to collections of resources, and an enumerated set for the agents' status counter.

The dynamic protocol parts, such as message exchanges, are modelled as variables and events computing the next variable states, and are contained in a machine. According to the proposed model development process, the initial (abstract) machine should summarise the objective of the protocol, which is an agent completing an objective (locking all necessary resources). To capture that, the abstract protocol machine contains two events, respectively modelling an agent locking and then releasing a free objective. The abstract model is refined mostly by modelling communication aspects of the distributed signalling protocol, for which we use a backward unfolding style where each refinement step introduces the preceding protocol step. Below, we overview the refinement chain and the properties we proved at that modelling stage.

Refinement 1 (Abstract ext.). In this refinement we introduce resources into the model and an agent now tries to fulfil the objective by locking resources. The previous two events (lock/release) are each decomposed into two, capturing iterative locking and releasing of resources.

Refinement 2. The abstract models are first refined with the \(\mathsf {stage_2}\) part of the protocol. In the refinement, \(\mathsf {r\_2}\), we introduced \(\mathsf {lock}\), \(\mathsf {response}\) and \(\mathsf {release}\) messages and the associated events into the model. In this step we also demonstrated that the protocol \(\mathsf {stage_2}\) ensures safe distributed resource reservation by proving an invariant which states that no two agents that requested intersecting collections of resources will both be at the resource-consuming stage.

Refinement 3. Model \(\mathsf {r\_3}\) is the bridge between protocol stages \(\mathsf {stage_1}\) and \(\mathsf {stage_2}\) and introduces two new messages, \(\mathsf {write}\) and \(\mathsf {pready}\), into the model.

Refinement 4. The final refinement step, \(\mathsf {r\_4}\), models \(\mathsf {stage_1}\) of the distributed protocol, which is responsible for creating distributed lanes. The remaining messages \(\mathsf {request}\), \(\mathsf {reply}\), \(\mathsf {srequest}\) and the associated events are introduced together with the distributed lane data structure. In this refinement we prove that distributed lanes are correctly formed (req. \(\mathsf {SAF_{3\text {-}4}}\)).

4.2 Step 2: Proving Functional Correctness Properties in Event-B

As shown in Sect. 2.2 (Scenarios 1–2), the high-level system requirements can only be met if an agent invariably and correctly forms a distributed lane. The probabilistic lane-forming eventuality (\(\mathsf {LIV_3}\)) is discussed separately, while in the following paragraphs we focus on the proofs regarding requirements \(\mathsf {SAF_{3\text {-}4}}\).

\(\mathsf {SAF_3}\) is required to ensure that an agent's resource objectives are satisfied either in full or not at all. The model addresses this via event guards restricting the enabling states of the event that generates an outgoing \(\mathsf {write}\) message. To cross-check this implementation we add an invariant that directly shows that \(\mathsf {SAF_3}\) is maintained in the model. For illustrative purposes we focus on the details of verifying the slightly more interesting case of \(\mathsf {SAF_4}\) and assume that \(\mathsf {SAF_3}\) is proven.

Requirement \(\mathsf {SAF_4}\) addresses potential cross-blocking deadlocks or resource double locking due to distributed lane overriding. The strategy for proving the requirement is to show that agents that are interested in at least one common resource (related agents) always form distributed lanes with differing indices. We start by assuming that agents only form distributed lanes if all received indices are the same (proved as \(\mathsf {SAF_3}\)). Then, if a resource (or resources) shared between any two related agents sends unique promised pointer values to these agents, those indices will be the distributed lane deciders, as all other indices from different resources must be the same to form a distributed lane. Hence, to prove \(\mathsf {SAF_4}\) it is enough to show that each resource replies to a \(\mathsf {request}\) or \(\mathsf {special \, request}\) message with a unique promised pointer value.

Fig. 3.

Event-B model excerpt of a resource sending a \(\mathsf {reply}\) message (Color figure online)

To prove that all resources reply to a \(\mathsf {request}\) or \(\mathsf {special \, request}\) message with a unique promised pointer value, we first introduced a history variable \(\mathsf {his_{ppt}}\) into our model. The main idea behind the history variable is to chronologically store the promised pointer values sent by a resource. We also introduced a time-stamp variable \(\mathsf {his_{wr}}\) to chronologically order the promised pointer values stored in the history variable.

After introducing the history variables, we modified the events \(\mathsf {resource\_reply\_general}\) and \(\mathsf {resource\_reply\_special}\), which in the protocol update the promised pointer variables, by adding two new actions (see Fig. 3). The first action, \(\mathbf {act_4}\), updates the history variable with the promised pointer value (\(\mathsf {ppt(res)}\)) that was sent to the agent at the current time stamp (\(\mathsf {his_{wr}(res)}\)). The second action, \(\mathbf {act_5}\), simply increments the resource's time-stamp variable \(\mathsf {his_{wr}(res)}\).

We can then add the main invariant to prove (\(\mathbf {inv\_saf\_4}\)), which states that if we take any two entries \(\mathsf {n1, n2}\) of the history variable for the same resource, where one index is larger, then the larger entry must have the larger promised pointer value.
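In Event-B-style notation, the invariant reads roughly as follows (a paraphrase of the formulation in our model, not a verbatim excerpt):

\[
\mathbf{inv\_saf\_4}:\;\; \forall\, res,\, n_1,\, n_2 \,\cdot\;
n_1 \in \mathrm{dom}(\mathsf{his_{ppt}}(res)) \,\wedge\,
n_2 \in \mathrm{dom}(\mathsf{his_{ppt}}(res)) \,\wedge\, n_1 < n_2
\;\Rightarrow\; \mathsf{his_{ppt}}(res)(n_1) < \mathsf{his_{ppt}}(res)(n_2)
\]

That is, each resource's history of sent promised pointer values is strictly increasing, so no value is ever sent twice.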

To prove that \(\mathsf {resource\_reply\_\{general, special\}}\) preserve \(\mathbf {inv\_saf\_4}\), the following properties play the key role: (1) the domain of \(\mathsf {his_{ppt}}\) (i.e., the 'indices' of \(\mathsf {his_{ppt}}\)) is \(\{0, \ldots , \mathsf {his_{wr}} - 1\}\); (2) \(\mathsf {his_{ppt}}(\mathsf {his_{wr}}-1) < \mathsf {his_{ppt}}(\mathsf {his_{wr}})\). Property (2) holds because \(\mathsf {his_{ppt}}(\mathsf {his_{wr}})\) is the maximum of the promised pointer (\(\mathsf {ppt}\)) and the special request slot number, and the promised pointer is incremented whenever \(\mathsf {resource\_reply\_\{general, special\}}\) occurs. We also specified these properties as an invariant (\(\mathbf {inv\_his\_ppt}\)) and proved that they are preserved by the events, which helped to prove \(\mathbf {inv\_saf\_4}\).

Proof Statistics. In Table 1 we provide the overall proof statistics of the Event-B protocol model, which may be used as a metric of the model's complexity. The majority of the generated proof obligations were automatically discharged with the available solvers, and even a large fraction of the interactive proofs required a minimal number of steps. We believe that the high proof automation was due to the use of modelling patterns [23] and SMT-based verification support [6, 17].

Table 1. Event-B protocol model proof statistics

4.3 Step 2: Proving Liveness (req. \(\mathsf {LIV_3}\)) with PRISM

In this subsection, we discuss the stochastic model checking results with which we intend to demonstrate that the \(\mathsf {LIV_3}\) requirement is preserved. In particular, we focus on showing that the \(\mathsf {LIV_3}\) requirement is ensured in Scenario 2 (Sect. 2.2).

In order to demonstrate that the \(\mathsf {LIV_3}\) requirement holds in Scenario 2 (Sect. 2.2), we used the \(\mathsf {stage_1}\) protocol's skeleton PRISM model to replicate Scenario 2. In this experiment we were interested in observing the effect a promised pointer offset has on the probability of an agent forming a distributed lane while the upper limit of the promised pointer is increasedFootnote 3 (\(\mathsf {n}\) in Scenario 2). Early experiments showed that verification would not scale well (several hours for a single data point) if we increased the number of resources and agents above two resources and three agents (each agent trying to reserve both resources), so we kept these parameters constant.

For each scenario, we ran a quantitative property: \(\mathsf {P \,= \,? \, [F \, dist_0 \, > \, \text {-}1] }\), which asks what the probability is of an agent negotiating a distributed lane before the upper promised pointer limit is reached. The three curves (red, green and violet) in Fig. 4 show the effect a promised pointer offset has on the negotiation probability as the queue depth is increased. The results suggest that increasing the offset reduces the probability of negotiating a distributed lane at a given queue depth, but the probability still approaches one as the number of rounds is increased (Fig. 4).

Fig. 4. Scenario 2 with varied resource promised pointer offset and queue depth.

To further examine the effects of the offset, we considered a different experiment in which the same quantitative property is checked while the number of possible renegotiations is kept constant and the offset is increased (light blue plot). The results indicate that the offset has an effect only up to a specific threshold; beyond it, the probability of an agent negotiating a distributed lane is unaffected by the offset. These results suggest that the situation in Scenario 2 does not violate the \(\mathsf {LIV_3}\) requirement, as distributed lanes can still be negotiated.

4.4 Step 3: Analysing Performance

The goal of this part is to study the protocol’s performance under various stress conditions and thus provide assurance of its applicability in real-life situations. To build the simulation, we simply capture the protocol’s \(\mathsf {stage_1}\) behaviour in a program. We are also able to obtain bounds on the number of messages required to form lanes in different setups. Given point-to-point transmission times, these bounds can be translated directly into real-life time bounds.

Simulation Construction. The simulation is set up as a collection of actors of two types - agents and resources - and an orchestration component observing and recording message passing among the actors. A message is said to be in transit as soon as it is created by an actor. Every act of message receipt (and receipt only) advances the simulation (world) clock by one unit. Hence, any number of computations leading to message creation can occur in parallel, but message delivery is sequential. To model delays, we define a function that probabilistically picks the message to be delivered among all the messages currently in transit. A special message, called skip, is circulated to simulate the idle passage of time; it is resent immediately upon receipt by an implicit idle actor.
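As an illustration, the event loop described above might be sketched as follows; the class and function names are our own, not taken from the actual implementation.

```python
import random

class Message:
    """A protocol message; the field names mirror the notation used below."""
    def __init__(self, src, dst, kind, created):
        self.s, self.d, self.c, self.o = src, dst, kind, created

SKIP = Message("idle", "idle", "SKIP", 0)  # simulates idle passage of time

def run(actors, initial, steps, pick):
    """Deliver messages one at a time; only receipt advances the world clock."""
    in_transit = list(initial) + [SKIP]
    clock = 0
    for _ in range(steps):
        m = pick(in_transit, clock)   # probabilistic delay model
        in_transit.remove(m)
        clock += 1                    # one time unit per receipt
        if m is SKIP:
            in_transit.append(SKIP)   # the implicit idle actor resends skip
        else:
            # the receiving actor may create any number of new messages,
            # all of which are immediately in transit
            in_transit.extend(actors[m.d].receive(m, clock))
    return clock
```

An actor here only needs a `receive(message, clock)` method returning the list of messages it creates in response; the `pick` argument is the delay-modelling function discussed above.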

Fig. 5. Time to form all or first lanes, logarithmic scale.

Let \(\mathbb {M}\) be the set of all messages that can be generated by agents and resources. Also, let \(\mathsf {skip} \notin \mathbb {M}\) denote the skip message and \(\mathbb {M}'= \mathbb {M} \cup \{\mathsf {skip}\}\). By its structure, the set \(\mathbb {M}'\) is countable (each message is identified by a unique integer) and one can define a measure space over \(\mathbb {M}'\). Let D signify the probability that some message \(m \in M \subseteq \mathbb {M}'\) from the message pool M is selected for reception. We define D via the current message pool, the attributes of m, such as its source, destination, time stamp and protocol stage, and the world time: \(D = D(M, m, t) = D(M, m.s, m.d, m.c, m.o, t)\). Here M is the set of available messages, m.s and m.d are the message’s source and destination agent or resource, m.c is the message type (e.g., \(\mathsf {WRITE}\)), m.o is the message timestamp (the point of its creation) and t is the world clock. By defining different probability distributions D we are able to address most scenarios of interest.

Uniform Distribution. With this distribution the simulator picks a message from M uniformly at random. It is an artificial setting, as the time a message spends in transit bears no influence on its probability of arrival. Counter-intuitively, the said probability may even decrease with the passage of time when new messages are created more quickly than they are delivered. The skip message has the same probability as the rest, so the system “speeds up” when M is large. The plots in Fig. 5 show how the protocol performance changes when the number of resources (Resource line), agents (Agent lines), and resources an agent attempts to acquire (Agent goal) increase. We plot separately the time to form all lanes and the time to form any first lane. The values plotted are averaged over 10000 runs.
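Under the uniform setting, D collapses to \(1/|M|\) for every message, skip included. A small sketch of this pick function and of the per-message delivery probability (our own illustration, not the actual code):

```python
import random

def uniform_pick(pool, clock):
    """Uniform delivery: D(M, m, t) = 1/|M|; time in transit is ignored."""
    return random.choice(pool)

def delivery_prob(pool_size):
    """Chance that one particular in-transit message (skip included)
    is the next one delivered."""
    return 1.0 / pool_size
```

Because skip competes equally with real messages, a growing pool makes idle steps rarer, which is the “speed up” effect noted above; conversely, any single message’s chance of being delivered next shrinks as the pool grows.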

5 Conclusions and Future Work

In this paper we proposed a multifaceted framework with which we aim to reduce the modelling and verification effort for distributed (railway signalling) systems. The framework was applied in the development of a novel distributed signalling protocol. Starting only with high-level system requirements, we developed an early formal protocol prototype which, with the help of ProB, was refined as subtle deadlock scenarios were discovered. This in part is the advantage of the stepwise development supported by Event-B, as complex distributed models can be decomposed into smaller problems and errors can be found earlier. Stepwise distributed protocol development, as also shown before [5, 15, 16], together with adequate tools [6, 17], helped to achieve fairly high verification automation. On the other hand, protocol verification was complicated by the need for stochastic reasoning and the inadequate support in Event-B for reasoning about probabilistic properties. The current solution relied on redeveloping the model in the stochastic model checker PRISM, which did not scale well for the verification of larger scenarios. As a future direction it is essential to address this problem, most likely by improving stochastic reasoning support in Event-B. In the future we would also like to achieve a much closer tool integration and to support automatic translation to PRISM and the stochastic simulator.