Replicating RESTART with Prolonged Retrials: An Experimental Report

Statistical model checking uses Monte Carlo simulation to analyse stochastic formal models. It avoids state space explosion, but requires rare event simulation techniques to efficiently estimate very low probabilities. One such technique is RESTART. Villén-Altamirano recently showed, by way of a theoretical study and an ad-hoc implementation, that a generalisation of RESTART to prolonged retrials offers improved performance. In this paper, we demonstrate our independent replication of the original experimental results. We implemented RESTART with prolonged retrials in the FIG and modes tools, and applied it to the models used originally. To do so, we had to resolve ambiguities in the original work and refine our setup multiple times. We ultimately confirm the previous results, but our experience also highlights the need for precise documentation of experiments to enable replicability in computer science.

In stochastic timed systems, the times between faults, customer interarrival times, transmission delays, or exponential backoff wait times follow (continuous) probability distributions. Probabilistic model checking [3] can compute dependability metrics like reliability and availability in the Markovian case. To evade state space explosion and to evaluate non-Markovian systems, statistical model checking (SMC [2]) has become a popular alternative. At its core, SMC is Monte Carlo simulation for formal models. It faces a runtime explosion when estimating the probability p of a rare event with a sufficiently low error, e.g. an error of ±10⁻¹⁰ for p ≈ 10⁻⁹ (i.e. a relative error of 0.1). Rare event simulation (RES) techniques [17] address this problem. They can broadly be categorised into importance sampling and importance splitting: the former changes the probability distributions while the latter changes the simulation algorithm to make the rare event more likely. Both techniques then compensate for these changes in the statistical evaluation. RES has garnered the interest of mathematicians and computer scientists alike. The scientific outcomes range from theoretical studies of a RES technique's limit behaviour and optimality [8,14,16] over experimental validation in Matlab studies or ad-hoc implementations [10,11,19] to application reports on larger case studies [5,12,18] as well as automated tools [4,6,15,18] that accept a loss of optimality in exchange for practicality.
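The scale of the runtime explosion is easy to quantify. A crude Monte Carlo estimator of p from n independent runs has relative standard error √((1 − p)/(n·p)), so the number of runs needed grows like 1/(p·ε²) for relative error ε. The following sketch (our own arithmetic, not taken from the paper) makes this concrete:

```python
import math

def runs_needed(p, rel_err):
    """Runs for a crude Monte Carlo estimate of probability p to reach the
    given relative standard error: solve sqrt((1 - p)/(n*p)) <= rel_err."""
    return math.ceil((1 - p) / (rel_err ** 2 * p))

# For p ~ 1e-9 at relative error 0.1, roughly 1e11 runs are required,
# which is what makes dedicated rare event simulation techniques necessary:
print(f"{runs_needed(1e-9, 0.1):.1e}")
```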
Two recent papers showed theoretically [21] and empirically [19] that prolonging retrials in the Restart importance splitting technique [22] reduces the required number of samples for the same error, with optimal runtime around prolonging by 1 to 2 levels. The models and parameters used in [19] are described in supplementary material [20], but the implementation is not publicly available. In this paper, we demonstrate our replication of the results of [19,21], where replication "means that an independent group can obtain the same result using artifacts which they develop completely independently" in the ACM terminology [1]. To this end, we implemented Restart with prolonged retrials (Restart-P) in the FIG rare event simulator [4] and the modes statistical model checker [7] of the Modest Toolset [13]. We recreated the models in the IOSA and Modest languages, and ran experiments following the original setup.
Our experiments confirm the behaviour and performance improvements of Restart-P reported in [19,21]. However, we encountered ambiguities in the textual and pictorial descriptions of Restart-P and the experimental setup in the original papers, some of which we could only resolve with input from the author of [19,21]. Different parts of our work thus reside on different levels between replication and reproduction (which "means that an independent group can obtain the same result using the author's own artifacts" [1]). Throughout the paper, we document where we achieved fully independent replication, where information from private communication was needed, and where we had to ultimately resort to requesting and inspecting the source code for the original implementation.
The contribution of this paper is thus threefold: (1) We provide pseudocode for Restart-P in Sect. 2 that clarifies the technical details w.r.t. [19,21].

Restart with Prolonged Retrials
Let a stochastic timed discrete-event model be given as a tuple ⟨S, s0, step, F⟩ of a set of states S, an initial state s0 ∈ S, a function step: S → [0, ∞) × S, where step(s) samples a random path from s to the next event and returns a pair ⟨t, s′⟩ of its duration and next state, and a subset of rare event states F ⊆ S. A simulation run is a sequence of states obtained by repeatedly applying step. Models with general probability distributions encode their memory in the states.
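This interface is straightforward to mirror in code. The following sketch (our own illustration, not from the paper) instantiates ⟨S, s0, step, F⟩ for a simple M/M/1 queue, with the queue length as state; all names and rates are invented for the example:

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    n: int  # queue length; general distributions would carry clock memory here

S0 = State(0)
ARRIVAL, SERVICE = 1.0, 2.0  # illustrative rates

def step(s, rng):
    """Sample the next event from s: returns (duration t, next state s')."""
    total = ARRIVAL + (SERVICE if s.n > 0 else 0.0)
    t = rng.expovariate(total)  # exponential holding time (Markovian case)
    if rng.random() * total < ARRIVAL:
        return t, State(s.n + 1)
    return t, State(s.n - 1)

def in_F(s, L=10):
    return s.n >= L  # rare event states: queue length at least L

# a simulation run is obtained by repeatedly applying step:
rng = random.Random(42)
s, clock = S0, 0.0
for _ in range(5):
    t, s = step(s, rng)
    clock += t
```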
Importance splitting uses an importance function f_I: S → [0, ∞) indicating "how close" a state is to the rare event. Partition the range of f_I into k + 1 nonempty intervals to obtain a level function f_L: S → { 0, ..., k } with f_L(s1) < f_L(s2) ⇒ f_I(s1) < f_I(s2). For simplicity, assume f_I(s0) = 0 and step(s) = ⟨t, s′⟩ ⇒ f_L(s′) ≤ f_L(s) + 1 (a step moves up by at most one level). Let C_i = { s | f_L(s) ≥ i }; then "thresholds T_i of f_I are defined so that each set C_i is associated" with the states on level i and above. f_I, f_L, and the splitting function f_S, which gives the splitting factor for each level, are specified by experts or derived automatically [6]. Importance splitting with Restart starts a run (the main trial) from s0 that, whenever it moves up from s in current level l − 1 to s′ in level l, spawns f_S(l) − 1 new child runs (retrials of level l) from s′. Retrials end when they move down below their creation level. The trials' weights in probability estimation are appropriately reduced to compensate. Restart with prolonged retrials of depth j, denoted Restart-P_j, is defined as follows in [21] (shortened and adapted to our notation): In Restart-P_j, each of the retrials of level i finishes when it leaves set C_{i−j}; that is, it continues until it down-crosses the threshold i − j. If one of these trials again up-crosses the threshold where it was generated (or any other between i − j + 1 and i), a new set of retrials is not performed. If j ≥ i, the retrials are cut when they reach the threshold 0. The main trial, which continues after leaving set C_{i−j}, potentially leads to new sets of retrials if it up-crosses threshold T_i after having left set C_{i−j}. If the main trial reaches the threshold 0, it collects the weight of all the retrials (which have been cut at that threshold) and thus, new sets of retrials of level 1 are performed next time the main trial up-crosses threshold T_1.
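Before turning to the subtleties, the level and weight bookkeeping above can be made concrete. This sketch (our own, with invented thresholds and splitting factors) derives f_L from f_I via thresholds and computes the compensating weights:

```python
from math import prod

THRESHOLDS = [3, 6, 9]        # T_1 < T_2 < T_3 over the range of f_I (invented)
F_S = {1: 3, 2: 3, 3: 3}      # splitting factor f_S(l) per level (invented)

def f_I(s):
    return s                  # illustration: importance = queue length

def f_L(s):
    # level = number of thresholds T_i with f_I(s) >= T_i, so f_L : S -> {0..k}
    return sum(1 for t in THRESHOLDS if f_I(s) >= t)

def weight(level):
    # a trial on level l carries weight 1 / prod_{i=1..l} f_S(i), so the total
    # weight of the f_S(l) trials created at T_l equals their parent's weight
    return 1.0 / prod(F_S[i] for i in range(1, level + 1))
```

Note that f_L is monotone in f_I as required, and that weight(l) · f_S(l) = weight(l − 1): this weight conservation is the compensation that keeps the estimator unbiased.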
In addition, [21, Fig. 1] graphically illustrates the behaviour of Restart-P_1. The original Restart [22] is Restart-P_0. The above textual description clearly conveys the core idea of Restart-P, but we found it to omit three technical details:
- The condition for when an up-going retrial spawns new retrials is more complex than with Restart. We became aware of this when comparing the textual description with the graphical depiction in [21, Fig. 1]. In fact, we need to keep track of the last level at which a retrial will split, and decrement that value when it moves more than j levels down. (Independent replication.)
- The definitions in [19,21] do not include 0 in the range of values for i in C_i and T_i. Our definitions would associate T_0 with states s where f_I(s) = 0. Implemented in FIG, this led to increasing underestimation as the prolongation depth j increased. Only once we interpreted threshold 0 to refer to level 0 (i.e. states s where f_L(s) = 0) did we obtain consistent estimations across different j. The correctness of this interpretation was confirmed by the author of [19,21] in private communication. (Semi-independent replication.)
- When a trial reaches, or spends time in, a state in F, we must weight this event's influence on the statistical estimate by a factor of 1/∏_{i=1}^{f_L(s)} f_S(i) in the original Restart. With this weight calculation, FIG produced subtle underestimations on some of the models from [20] when j > 0. We finally requested the source code for the original experiments and found that f_L(s) must be replaced by the level on which the current trial was last split, i.e. the value must not change when moving down ≤ j levels. (Resembles a reproduction.)
We make these details explicit in Algorithm 1, for the case of estimating the long-run average time spent in F (i.e. steady-state simulation). FIG evolved as described above and is thus mostly a replication. modes was extended with prolongations later, using a recursive formulation of the algorithm gleaned from the original code. It thus lacks the complete independence of a replication as per [1].
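To convey the shape of such an algorithm, the following self-contained Python sketch shows one possible reading of Restart-P_j for steady-state estimation on a toy M/M/1 queue. It is emphatically our own simplified interpretation, not the authoritative implementation: the model, thresholds, splitting factors, and the exact handling of the three details above (last-split tracking, the level-0 cut-off, and last-split weighting) are all assumptions made for illustration.

```python
import random
from math import prod

# Toy model: M/M/1 queue with utilisation 0.5 (all parameters invented).
LAMBDA, MU = 1.0, 2.0
THRESH = [2, 4, 6]            # thresholds T_1..T_3 on the queue length
F_S = {1: 2, 2: 2, 3: 2}      # splitting factors f_S(l)
RARE = 6                      # F = { n : n >= 6 }; true long-run fraction 0.5**6

def f_L(n):
    return sum(1 for t in THRESH if n >= t)

def step(n, rng):
    total = LAMBDA + (MU if n > 0 else 0.0)
    t = rng.expovariate(total)
    return t, (n + 1 if rng.random() * total < LAMBDA else n - 1)

def weight(ls):
    return 1.0 / prod(F_S[i] for i in range(1, ls + 1))

def trial(n, created, ls, j, rng, budget=None):
    """One (re)trial of Restart-P_j; returns the weighted time in F accumulated
    by this trial and its descendants. `created` is the level the trial was
    spawned on (0 for the main trial), `ls` the level it last split on;
    `budget` bounds the main trial's simulated time (retrials have none)."""
    acc, used = 0.0, 0.0
    while True:
        t, n2 = step(n, rng)
        if budget is not None and used + t >= budget:
            t = budget - used          # truncate the final sojourn
        if n >= RARE:
            acc += t * weight(ls)      # time in F, weighted by last-split level
        used += t
        if budget is not None and used >= budget:
            return acc
        lo, ln = f_L(n), f_L(n2)
        n = n2
        if ln == lo + 1 and ln == ls + 1:
            ls = ln                    # first up-cross above last-split level:
            for _ in range(F_S[ls] - 1):   # spawn f_S(l) - 1 retrials here
                acc += trial(n, ls, ls, j, rng)
        elif ln < lo:
            if budget is None and ln < max(created - j, 1):
                return acc             # retrial ends j levels below creation
            if ln + j < ls:
                ls = ln + j            # decrement last-split level (detail 1)
            if budget is not None and ln == 0:
                ls = 0                 # main trial back at level 0 (detail 2)

def restart_p(j, horizon, seed=1):
    rng = random.Random(seed)
    return trial(0, 0, 0, j, rng, budget=horizon) / horizon
```

For j = 0 this reduces to standard Restart, where weight conservation (f_S(l) trials of weight 1/∏_{i≤l} f_S(i) replace one of weight 1/∏_{i≤l−1} f_S(i)) makes the estimator unbiased; the true value in this toy setting is 0.5⁶ ≈ 0.0156.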

Experiments
Table 2 in [21] provides steady-state estimates, numbers of samples, and runtimes obtained using Restart-P_j on a Jackson (i.e. Markov) 2-tandem queueing network for j ∈ { 0, ..., 4 }. The same data is given in [19] for j ∈ { 0, ..., 2 } on a similar system with three queues and on a seven-node network, in Jackson and non-Jackson (using Erlang and hyperexponential distributions) variants. The original articles and extra material [20] describe the models and the experimental setup:
- The set F is characterised. E.g. for the 3-tandem network, it contains the states where the third queue has ≥ L = 30 (Jackson) or 14 (non-Jackson) packets.
- All probability distributions and the f_I, f_L, and f_S functions are characterised.
- Maximum time values T_max for the steady-state simulations are specified for all models.
- The statistical evaluation aims for a relative error of 0.1 with 95 % confidence (except for the tandem queue, where the error is 0.005); Restart-P runs are performed sequentially until the half-width of the 95 % confidence interval is below 10 % (resp. 0.5 %) of the current estimate. (Note that this guarantees the requested confidence only asymptotically for decreasing width [9].)
In our replication attempt, we had to resolve the following unspecified aspects:
- The queue capacities C > L are not documented, but influence the estimate: for C close to L, the steady-state probability is underestimated. We settled for C = 2 · L in FIG's IOSA models; the influence of C − L rapidly diminishes beyond small values. (Replication.)
- FIG by default uses the batch means technique for steady-state simulation, where a single run is partitioned into equal-duration batches, each of which provides one sample value. In communication with the original author, we found that each of their samples results from an independent run. We adapted FIG to do the same. It is the default in modes. (Semi-independent replication.)
- We also found in this communication that the original runs perform no splitting for the first 40 clients served; this part of the run is ignored as an initial transient phase. We confirmed this in the source code. We measured the average time to serve 40 clients for each model and use the result as the transient phase duration with FIG and modes, since neither tool supports a transient phase based on clients served. (Semi-independent replication.)
The original experiments were realised in a single file of C code that represents both the algorithm and the models, specialised to queueing models with transition probabilities and service rates specified in constant arrays. In fact, the code we received implemented the 2-tandem queueing network only. We extended this code with a compile-time choice among the models described in [20], and fixed a few small bugs. We thus have four sets of results to compare, shown in Table 1: the original numbers given in [19,21], plus those from our new executions of the adapted code, modes, and FIG. In the table, time is in seconds, p̂ is the estimate, p is the true steady-state probability where it can be derived, and n is the number of samples needed by the statistical evaluation. The adapted code and FIG ran on an Intel Xeon E5-2630 v3 (2.4-3.2 GHz) system; modes ran on an Intel Core i7-4790 (3.6-4.0 GHz, 4 physical/8 logical cores) system. The adapted code and FIG are single-threaded, whereas modes used 7 simulation threads. The adapted code is tailor-made for these models, while FIG has to encode them in the more general IOSA framework, making it slower; modes in turn profits from a special-case implementation for CTMC that speeds up the Markovian cases. Comparing runtimes between tools is thus of limited use. The estimates are the centres of the confidence intervals returned by the tools, with confidence and relative width as described above. Each ⟨estimate, n, time⟩ triple was selected from 5 tool executions by picking the one with the median runtime. We underline the best runtimes among the values for j. However, the wide confidence intervals (except for the 2-tandem network), the few executions, and the in-principle unsound stopping criterion that we reproduce from the original experiments mean that results, including the best values of j, vary a lot for different random seeds. The original experimental setup is thus insufficient for drawing conclusions about the precise tradeoffs between specific values of j, but may at most expose an overall trend.
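The sequential stopping rule used in these experiments can be sketched as follows. This is a simplified stand-in for the tools' statistical evaluation, not their actual code; the normal-quantile confidence interval and the minimum sample count are our own choices, and stopping as soon as the interval is narrow enough is exactly the only-asymptotically-sound criterion discussed above [9].

```python
import math
import random

Z95 = 1.96  # normal quantile for 95% confidence

def sequential_estimate(sample, rel_width=0.1, min_n=50, seed=0):
    """Draw samples until the 95% CI half-width is below rel_width times the
    running mean; returns (mean, half_width, n)."""
    rng = random.Random(seed)
    n, s, s2 = 0, 0.0, 0.0
    while True:
        x = sample(rng)
        n, s, s2 = n + 1, s + x, s2 + x * x
        if n < min_n:
            continue
        mean = s / n
        var = max(s2 / n - mean * mean, 0.0) * n / (n - 1)  # sample variance
        half = Z95 * math.sqrt(var / n)
        if mean > 0.0 and half <= rel_width * mean:
            return mean, half, n

# illustrative use: estimating the mean of an Exp(5) sojourn time (true: 0.2)
mean, half, n = sequential_estimate(lambda rng: rng.expovariate(5.0))
```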
Nevertheless, our estimates are mostly within the margin of error around the original or true results. We confirm the main experimental conclusion of [19,21]: as j increases, n decreases, but from some point onwards (mostly j > 1 or 2), runtime increases due to the overhead of more retrials surviving longer. For the non-Jackson triple tandem network, none of our results matches the numbers of [19]. Since the original code, albeit adapted, agrees with FIG and modes rather than with the original results, we suspect an error in [19] or [20] w.r.t. this one model.

Conclusion
We demonstrated the extension of the FIG and modes rare event simulation tools to support prolonged retrials in Restart importance splitting. These implementations and experiments were the outcome of an exercise in independently replicating, from a computer science perspective, experimental research originally performed in mathematics. We confirm the key findings of the earlier work. At the same time, we document several issues, small but critical technical details of the algorithm and experimental setup, where the publicly available information was insufficient for a completely independent replication. In particular, we noticed that replicating randomised/statistical algorithms poses a particular challenge, since small errors may result in subtle mis-estimations that are often drowned in the overall statistical error. In the end, however, all issues could be resolved thanks to the exceptional support, responsiveness, and openness of the original author, José Villén-Altamirano, whom we thank earnestly. However, such support cannot be expected for experimental work in general, in particular where temporary staff like Ph.D. students, who eventually graduate and move to new institutions or industry, perform the bulk of the experiments. This paper thus also highlights the need for computer science and the formal verification community to continue their push for artifact evaluation and archived, publicly available reproduction packages. A reproduction package for our experiments is archived at DOI 10.6084/m9.figshare.12269462.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original authors and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Table 1. Experimental results for the examples considered in [19,21]. Note: from inspecting the original source code, we found that the queues are practically unbounded (implemented as 32-bit integer counters), which we reproduce in the Modest models for modes.