Stochastic parareal: an application of probabilistic methods to time-parallelisation

Parareal is a well-studied algorithm for numerically integrating systems of time-dependent differential equations by parallelising the temporal domain. Given approximate initial values at each temporal sub-interval, the algorithm locates a solution in a fixed number of iterations using a predictor-corrector, stopping once a tolerance is met. This iterative process combines solutions located by inexpensive (coarse resolution) and expensive (fine resolution) numerical integrators. In this paper, we introduce a stochastic parareal algorithm aimed at accelerating the convergence of the deterministic parareal algorithm. Instead of providing the predictor-corrector with a deterministically located set of initial values, the stochastic algorithm samples initial values from dynamically varying probability distributions in each temporal sub-interval. All samples are then propagated in parallel using the expensive integrator. The set of sampled initial values yielding the most continuous (smoothest) trajectory across consecutive sub-intervals are fed into the predictor-corrector, converging in fewer iterations than the deterministic algorithm with a given probability. The performance of the stochastic algorithm, implemented using various probability distributions, is illustrated on low-dimensional systems of ordinary differential equations (ODEs). We provide numerical evidence that when the number of sampled initial values is large enough, stochastic parareal converges almost certainly in fewer iterations than the deterministic algorithm, maintaining solution accuracy. Given its stochastic nature, we also highlight that multiple simulations of stochastic parareal return a distribution of solutions that can represent a measure of uncertainty over the ODE solution.


Introduction
In its most basic form, parallel computing is the process by which an algorithm is partitioned into a number of sub-problems that can be solved simultaneously without prior knowledge of each other. An early incursion into time-parallel computation is due to Nievergelt, who proposed launching fine solutions from M trial initial values within each temporal sub-interval and combining them afterwards. Whilst an excellent first incursion into the field, for nonlinear problems this direct method suffered from interpolation errors, and questions remained over how to efficiently scale it up to systems of ODEs. We generalise the original idea of Nievergelt by combining it with the parareal algorithm, generating the M initial values at each sub-interval using probability distributions based on information known about the solutions at the current iteration.
The paper is organised as follows. In Section 2, we recall the parareal algorithm and its properties from a multiple shooting viewpoint, including a known modification that will enable us to extend the algorithm to incorporate stochastic sampling. In Section 3, we introduce the stochastic parareal algorithm. Following an explanation of its key features, we elucidate how a variety of different sampling rules can be flexibly interchanged in order to carry out the sampling. In Section 4, we conduct numerical experiments to illustrate the performance of stochastic parareal against its deterministic counterpart using the different sampling rules. Findings are presented for three ODE systems of increasing complexity, with two additional examples given in the Appendix. Results indicate that for sufficiently many samples, stochastic parareal almost certainly beats the convergence rate of the deterministic algorithm and generates (stochastic) solutions of comparable numerical accuracy. For the multivariate ODEs, results show that performance is improved by generating correlated, as opposed to uncorrelated, random samples. In Section 5, conclusions are drawn and avenues for future research are discussed.

The parareal algorithm
Following previously outlined descriptions of the parareal algorithm [20,8], henceforth referred to as 'parareal' or P, consider a system of d ∈ N ODEs

du/dt = f(u(t), t), u(T_0) = u_0, t ∈ [T_0, T_N], (1)

where f : R^d × [T_0, T_N] → R^d and the time domain [T_0, T_N] is partitioned into N sub-intervals [T_n, T_{n+1}]. Having partitioned the time domain, N smaller sub-problems

du_n/dt = f(u_n(t|U_n), t), u_n(T_n) = U_n, t ∈ [T_n, T_{n+1}], (2)

for n = 0, . . ., N − 1 can theoretically be solved in parallel on N processors, henceforth denoted P_1, . . ., P_N. Each solution u_n(t|U_n) is defined over [T_n, T_{n+1}] given the initial values U_n ∈ R^d at t = T_n. Note however that only the initial value U_0 = u_0(T_0) = u_0 is known, whereas the rest (U_n for n ≥ 1) need to be determined before (2) can be solved in parallel.

[Figure 2: The first runs of G and F are given in yellow and blue respectively, and the second run of G in red. The red dots represent the solution after applying the predictor-corrector (5).]

These initial values must satisfy the continuity conditions

U_0 = u_0, U_n = u_{n−1}(T_n|U_{n−1}) for n = 1, . . ., N, (3)
which form a (nonlinear) system of N + 1 equations that ensure solutions match at each T_n ∀n ≥ 1. System (3) is solved for U_n using the Newton-Raphson method to form the iterative system

U^{k+1}_0 = u_0, (4a)
U^{k+1}_n = u_{n−1}(T_n|U^k_{n−1}) + (∂u_{n−1}/∂U_{n−1})(T_n|U^k_{n−1}) (U^{k+1}_{n−1} − U^k_{n−1}), (4b)

for n = 1, . . ., N, where k = 0, 1, 2, . . . is the iteration number. This system contains the unknown solutions u_n and their partial derivatives which, even if known, would be computationally expensive to calculate.
To solve (4) and evaluate u_n at discrete times, P utilises two numerical integrators. The first is a numerically fast coarse integrator G that integrates over [T_n, T_{n+1}] using initial values U_n. The second is a fine integrator F that runs much slower compared to G but has much greater numerical accuracy, chosen such that it integrates over the same interval [T_n, T_{n+1}] with initial values U_n. In our implementation, the difference between fast and slow integration is guaranteed by setting the time steps for G and F as δT and δt respectively, with δt ≪ δT (recall Figure 1). The key principle is that, if used to integrate (1) over [T_0, T_N] serially, F would take an infeasible amount of computational time, hence the need to use P in the first place. Therefore G is permitted to run serially across multiple sub-intervals rapidly, whereas the slower solver F is only permitted to run in parallel on sub-intervals. This is a strict requirement for running P, else numerical speedup will not be achieved.
Given that (4a) is known a priori for all k, Lions et al. [8] approximate the first term in equation (4b) using the fine solver F(U^k_{n−1}) and the second term by a coarse finite difference of the derivative using G(U^{k+1}_{n−1}) − G(U^k_{n−1}). One could instead approximate the derivative using a fine finite difference F(U^{k+1}_{n−1}) − F(U^k_{n−1}); however, (4b) then becomes a sequential calculation using just F, exactly what we wish to avoid. Assuming G meets the conditions required for (4b) to converge (see Section 2.1), using the coarse approximation becomes reasonable and enables the parallel computation of (4b) [20]. The result is that an initial guess for the initial values U^0_n (found using G) is improved at successive parareal iterations k using the predictor-corrector update rule

U^{k+1}_n = G(U^{k+1}_{n−1}) + F(U^k_{n−1}) − G(U^k_{n−1}). (5)

Pseudocode for an implementation of P is given in Algorithm 1 alongside a graphical description of the first iteration in Figure 2. During the 'zeroth' iteration, G is applied to u_0, generating initial values Û^0_n ∀n ∈ {1, . . ., N} sequentially (lines 1-6). Immediately following, F is run in parallel from these initial values to generate a more accurate set of fine solution states Ũ^0_n (lines 8-11). Next, G is run in the first time sub-interval, from the known initial value, to 'predict' the solution Û^1_1 at T_1. It is subsequently 'corrected' using rule (5), and the predictor-corrector solution U^1_1 is found at T_1. This prediction and correction process (lines 12-16) repeats sequentially until U^1_n is found ∀n ∈ {1, . . ., N}. After checking whether the stopping criterion has been met (lines 17-21), the algorithm either starts the next iteration or stops.
Algorithm 1 Parareal (P)
1: Set counters k = I = 0 and define U^k_n, Û^k_n and Ũ^k_n as the predictor-corrector, coarse, and fine solutions at the n-th time step and k-th iteration respectively (note that U^k_0 = Û^k_0 = Ũ^k_0 = u_0 ∀k).
2: // Calculate initial values at the start of each sub-interval T_n by running G serially on processor P_1.
3: for n = 1 to N do
4:   Û^0_n = G(U^0_{n−1})
5:   U^0_n = Û^0_n
6: end for
7: for k = 1 to N do
8:   // Propagate the solutions from iteration k − 1 with F on processors P_{I+1}, . . ., P_N in parallel.
9:   for n = I + 1 to N do
10:    Ũ^{k−1}_n = F(U^{k−1}_{n−1})
11:  end for
12:  // Propagate the predictor-corrector solution (at iteration k) with G on any available processor. Then correct this value using coarse and fine solutions obtained during iteration k − 1 (this step cannot be carried out in parallel).
13:  for n = I + 1 to N do
14:    Û^k_n = G(U^k_{n−1})
15:    U^k_n = Û^k_n + Ũ^{k−1}_n − Û^{k−1}_n
16:  end for
17:  // Check whether the stopping criterion is met, saving all solutions up to time step T_I before the next iteration. If tolerance is met for all time steps, the algorithm stops.
18:  I = max{n : ‖U^k_m − U^{k−1}_m‖∞ < ε ∀m ≤ n}
19:  if I == N then
20:    return {U^k_n}
21:  end if
22: end for
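The structure of Algorithm 1 can also be sketched as a short, runnable program. This is a hedged illustration rather than the authors' implementation: both solvers are fixed-step RK4 as in Section 2.2, the function and variable names are ours, and the F sweep is written serially here although it would run in parallel in practice.

```python
import numpy as np

def rk4(f, u0, t0, t1, n_steps):
    """Fixed-step RK4 integration of du/dt = f(t, u) from t0 to t1."""
    u, t = np.atleast_1d(np.asarray(u0, dtype=float)), t0
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        k1 = f(t, u)
        k2 = f(t + h / 2, u + h * k1 / 2)
        k3 = f(t + h / 2, u + h * k2 / 2)
        k4 = f(t + h, u + h * k3)
        u = u + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return u

def parareal(f, u0, T0, TN, N, coarse_steps, fine_steps, eps=1e-10):
    """Minimal parareal sketch: returns the initial values U_n at the
    sub-interval boundaries and the iteration count k_d.

    G and F are RK4 with coarse_steps / fine_steps per sub-interval;
    the F sweep over n is the part that would run in parallel.
    """
    T = np.linspace(T0, TN, N + 1)
    d = np.atleast_1d(u0).size
    G = lambda u, n: rk4(f, u, T[n], T[n + 1], coarse_steps)
    F = lambda u, n: rk4(f, u, T[n], T[n + 1], fine_steps)

    U = np.zeros((N + 1, d))
    U[0] = u0
    for n in range(N):                       # zeroth iteration: serial coarse sweep
        U[n + 1] = G(U[n], n)
    G_old = np.array([G(U[n], n) for n in range(N)])

    for k in range(1, N + 1):
        F_old = np.array([F(U[n], n) for n in range(N)])  # parallel in practice
        U_new = U.copy()
        for n in range(N):                   # serial predict-and-correct sweep (5)
            G_new = G(U_new[n], n)
            U_new[n + 1] = G_new + F_old[n] - G_old[n]
            G_old[n] = G_new
        if np.max(np.abs(U_new - U)) < eps:  # stopping criterion (6) over all n
            return U_new, k
        U = U_new
    return U, N
```

On a linear test problem such as du/dt = −u, the scheme reproduces the serial fine solution at the sub-interval boundaries once converged.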

Stopping criteria and other properties
The algorithm is said to have converged up to time T_I if

‖U^k_n − U^{k−1}_n‖∞ < ε ∀n ≤ I, (6)

for some small fixed tolerance ε, noting that ‖·‖∞ denotes the usual infinity norm [21,17]. Taking the relative, instead of absolute, errors in (6) would also be appropriate; however, in our numerical experiments the results (not reported) did not change. Once I = N, we say P has taken k_d = k iterations to converge, yielding a solution with numerical accuracy of the order of the F solver. In its original formulation, P iteratively improves the solution across all time steps, regardless of whether they have converged or not, up until the tolerance has been met for all T_n. The modified formulation presented here, however, only iterates on the solutions in the unconverged sub-intervals [23,17]. This has no effect on the convergence rate k_d and becomes especially important once we introduce the stochastic modifications in Section 3.
It should be clear that at least one sub-interval will converge during each iteration k, as F will be run directly from a converged initial value. Therefore it will take at most k_d = N iterations for P to converge, equivalent to running F over the N sub-intervals serially (yielding zero parallel speedup). To achieve significant parallel speedup, multiple sub-intervals need to converge during an iteration so that k_d ≪ N. Assuming G takes a negligible amount of time to run compared to the parallel components (along with any other serial computations), an approximate upper bound on the speedup achieved by P is N/k_d.
One challenge in attempting to minimise k d , hence the overall runtime of P , is to identify optimal solvers F and G for implementation.Whereas F is assumed to have high-accuracy and be computationally expensive to run, G must be chosen such that it runs significantly faster (usually orders of magnitude) than F whilst being sufficiently numerically accurate to converge in as few iterations as possible.Usually G is chosen such that its solutions have lower numerical accuracy, coarser temporal resolution, and/or reduced physics/coarser spatial resolution (when solving PDEs).There is currently no rigorous method for choosing the two solvers, however they should be chosen such that the ratio between the time taken to run F over [T n , T n+1 ] and the time taken to run G over the same interval is large.
In this paper, we fix both F and G as explicit fourth-order Runge-Kutta methods (henceforth RK4) for ease of implementation. More detailed analysis of the mathematical conditions required for P to converge, i.e. for the numerical solution U_n to approach the fine solution, can be found in [25,20,8,22].

A stochastic parareal algorithm
In this section, we introduce the stochastic parareal algorithm, an extension of parareal that incorporates randomness and utilises its well-studied deterministic convergence properties to locate a solution in k s < k d iterations.A summary of additional notation required to describe the algorithm, henceforth referred to as P s , is provided in Table 1.
The idea behind P_s is to sample M vectors of initial values α^k_{n,1}, . . ., α^k_{n,M} at each unconverged sub-interval T_n, in the neighbourhood of the current predictor-corrector solution U^k_n, from a given probability distribution (with sufficiently broad support), and propagate them all in parallel using F. Given that a sufficient number of samples is taken, one will be closer (in the Euclidean sense) to the true root that equation (5) is converging toward. Among them, we select an optimal α̂^k_n by identifying which samples generate the most continuous trajectory, at the fine resolution, in phase space across [T_0, T_N]. Therefore, at each iteration, we stochastically "jump" toward more accurate initial values and then feed them into the predictor-corrector (5). For increasing values of M, the convergence rate k_s decreases, satisfying k_s < k_d with probability one, as shown numerically in Section 4.

The algorithm
Suppose again we aim to solve system (1), adopting the same conditions, properties, and notation as discussed in Section 2. P_s follows the first iteration (k = 1) of P identically (see line 1 of Algorithm 2). This is because information about initial values at the different temporal resolutions (i.e. results from F and G) is required to construct the appropriate probability distributions for sampling. Following the convergence check, we assume (for the purposes of explaining the stochastic iterations) that only the first sub-interval [T_0, T_1] converged during k = 1, leaving N − 1 unconverged sub-intervals. At this point we know the most up-to-date predictor-corrector solutions U^1_n ∀n ∈ {1, . . ., N} and the stochastic iterations can begin (henceforth k = 2).

Table 1: Additional notation used to describe P_s.

Φ^{k−1}_n : The d-dimensional probability distribution used to sample initial values at time T_n and iteration k.
μ^{k−1}_n : The d-dimensional vector of marginal means at time T_n and iteration k.
σ^{k−1}_n : The d-dimensional vector of marginal standard deviations at time T_n and iteration k.
ρ^{k−1}_{n,(i,j)} : Pairwise correlation coefficients between the i-th and j-th components of the M fine-resolution propagated samples at time T_n and iteration k.
R^{k−1}_n : The symmetric positive semi-definite d × d correlation matrix with elements ρ^{k−1}_{n,(i,j)} at time T_n and iteration k.
Σ^{k−1}_n : The symmetric positive semi-definite d × d covariance matrix at time T_n and iteration k.

At any unconverged T_n (n > 1), we sample M vectors of initial values, denoted α^{k−1}_{n,1}, . . ., α^{k−1}_{n,M}, from the distribution Φ^{k−1}_n with marginal means μ^{k−1}_n, marginal standard deviations σ^{k−1}_n, and correlation structure given by the matrix R^{k−1}_n. These quantities depend upon the information available at iteration k − 1, and the correlation matrix is introduced to take into account the dependence between components of the ODE system (lines 3-9). The elements of R^{k−1}_n, for k ≥ 3, are defined using the Pearson correlation coefficient ρ^{k−1}_{n,(i,j)}, i, j ∈ {1, . . ., d}, (7) which are essentially the estimated pairwise correlation coefficients of the M d-dimensional fine-resolution propagations of the sampled initial values at T_n from the previous iteration. Note that other types of linear correlation coefficient could also be chosen. Since no propagated samples are yet available when k = 2, we sample from a multivariate distribution with uncorrelated components.
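Concretely, estimating the elements of R^{k−1}_n amounts to an empirical Pearson correlation computation over the M fine-resolution propagations. A minimal sketch, where the array of propagated samples is synthetic and the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 50, 3

# Hypothetical fine-resolution propagations of the M sampled initial values
# from the previous iteration: one row per sample, one column per ODE
# component, i.e. an (M, d) array (synthetic data for illustration).
mixing = np.array([[1.0, 0.8, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
propagated = rng.normal(size=(M, d)) @ mixing

# Pairwise Pearson coefficients between components i and j; rowvar=False
# treats columns as variables, yielding the symmetric d x d matrix R.
R = np.corrcoef(propagated, rowvar=False)
```

The resulting matrix is symmetric with unit diagonal, as required of a correlation matrix.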
Of the M sampled initial values at each T_n (n > 1), only one is retained, denoted by α̂^{k−1}_n, chosen such that it minimises the Euclidean distance between the fine solution and the sampled values (lines 23-28).

[Figure 3: Samples α^1_{n,m} (green dots) are taken at T_2 and T_3 from distributions with means U^1_2 and U^1_3 and some finite standard deviations respectively. These values, along with U^1_2 and U^1_3 themselves, are propagated in parallel forward in time using F (green lines). The optimally chosen samples α̂^1_n (refer to text for how these are chosen) are then propagated forward in time using G (yellow lines).]
To select the optimal samples, start from the converged initial value at T_2 given by the fine solver: F(U^{k−1}_1). Calculate the Euclidean distance between F(U^{k−1}_1) and each of the M samples α^{k−1}_{2,1}, . . ., α^{k−1}_{2,M}. The sample minimising this distance is chosen as α̂^{k−1}_2. Repeat for later T_n, minimising the distance between F(α̂^{k−1}_{n−1}) and one of the samples α^{k−1}_{n,1}, . . ., α^{k−1}_{n,M}. This process must be run sequentially and relies on the modification to P made in Section 2, namely that solutions are not altered once converged. Referring again to Figure 3, the corresponding coarse trajectories of these optimally chosen samples α̂^{k−1}_n must also be calculated to carry out the predictor-corrector step (lines 29-32). At this point, the set of initial values {α̂^{k−1}_2, . . ., α̂^{k−1}_{N−1}} has been selected from the ensemble of random samples, effectively replacing the previously found U^{k−1}_n. The coarse and fine propagations of these values are now used in the predictor-corrector (lines 33-37) such that

U^k_n = G(U^k_{n−1}) + F(α̂^{k−1}_{n−1}) − G(α̂^{k−1}_{n−1}). (9)

Using the same stopping criterion (6) from P (lines 38-42), the algorithm either stops or runs another stochastic parareal iteration.
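The sequential selection step above can be sketched as follows. The function and argument names are ours, and we assume the fine propagations of all M samples are already available (they are computed in parallel by F):

```python
import numpy as np

def select_smoothest(F_conv, samples, F_samples):
    """Sequentially pick the sampled initial values giving the most
    continuous fine trajectory across consecutive sub-intervals.

    F_conv:    fine propagation of the converged value into the first
               unconverged sub-interval, shape (d,).
    samples:   samples[n] holds the M candidate initial values at the
               n-th unconverged boundary, shape (M, d).
    F_samples: F_samples[n] holds the fine propagation of each candidate
               across its own sub-interval, shape (M, d).
    Returns the list of chosen sample indices J_n.
    """
    chosen, target = [], F_conv
    for cands, fine in zip(samples, F_samples):
        # Keep the sample closest (Euclidean) to the incoming fine trajectory.
        J = int(np.argmin(np.linalg.norm(cands - target, axis=1)))
        chosen.append(J)
        target = fine[J]  # fine propagation of the chosen sample
    return chosen
```

This greedy pass mirrors the description above: each choice depends only on the fine propagation of the previously chosen sample, so it must run sequentially.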

Sampling rules
The probability distributions Φ^{k−1}_n incorporate different combinations of available information about the initial values at the different temporal resolutions, i.e. the coarse, fine, and predictor-corrector initial values respectively. This information is used to define the marginal means and standard deviations in the 'sampling rules' outlined in the following subsections. Note how the distributions vary with time T_n and iteration k, so that the accuracy of the distributions increases as the initial values in P_s are iteratively improved. Using Gaussian and copula distributions, we analyse the performance of different sampling rules within P_s in Section 4. This will give us a more comprehensive understanding of which choices of distribution family and sampling rule have the greatest impact on the convergence rate k_s. In all tested cases, we observe that taking correlated samples significantly improves the performance of P_s compared to the independent setting.

Algorithm 2 Stochastic parareal (P_s) (key steps; see text for the full description)
⋮
5:  if k ≥ 3 then
6:    Estimate the correlation matrix R^{k−1}_n from the propagated samples (see (7)).
⋮
13:   for m = 1 to M do
14:     if m == 1 then // the predictor-corrector value U^{k−1}_n itself is also propagated
⋮
23: // Select the most continuous fine trajectory from the ensemble sequentially.
24: for n = I + 1 to N − 1 do
25:   J = argmin_m ‖F(α̂^{k−1}_{n−1}) − α^{k−1}_{n,m}‖_2
26:   α̂^{k−1}_n = α^{k−1}_{n,J} // store optimal initial value
27:   Ũ^{k−1}_{n+1} = Ũ_{n+1,J} // store most optimal fine trajectories
28: end for
29: // Run G from the optimal samples (can run in parallel).
⋮
33: // Predict and correct the initial values.
⋮
38: // Check whether the stopping criterion is met.
⋮

Multivariate Gaussian
First, we consider perturbing the initial values using stochastic "noise", i.e. considering errors compared to the true initial values to be normally distributed, a standard method for modelling uncertainty. For the marginal means μ^{k−1}_n, we choose either the fine resolution values F(U^{k−2}_{n−1}) (prior to correction) or the predictor-corrector values U^{k−1}_n. For the marginal standard deviations, we choose σ^{k−1}_n = |F(U^{k−2}_{n−1}) − U^{k−1}_n|, as these are of the order of the corrections made by the predictor-corrector, and each marginal decreases toward zero as the algorithm converges (as expected). For the correlation coefficients ρ^{k−1}_{n,(i,j)}, we calculate the linear correlation between the F-propagated samples using Pearson's method (recall (7)). Note that |·| denotes the component-wise absolute value. Testing revealed that alternative marginal standard deviations did not span sufficiently large distances around μ^{k−1}_n for sampling to be efficient, i.e. they required much higher sampling to perform as well. Samples are therefore taken according to the following sampling rules:

Rule 1: Φ^{k−1}_n = N(F(U^{k−2}_{n−1}), Σ^{k−1}_n),
Rule 2: Φ^{k−1}_n = N(U^{k−1}_n, Σ^{k−1}_n),

where Σ^{k−1}_n is the covariance matrix built from σ^{k−1}_n and R^{k−1}_n. Note that a linear combination of the rules, or taking half the samples from each, appears to work well, with performance similar to the individual rules themselves (results not shown).
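As an illustration of the Gaussian rules, the covariance matrix can be assembled from the marginal standard deviations and the estimated correlation matrix as Σ = D R D with D = diag(σ). A sketch with hypothetical values; all numbers and variable names here are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quantities at one unconverged time T_n (d = 2):
fine_val = np.array([1.00, 3.05])  # F(U^{k-2}_{n-1}), fine-resolution value
pc_val   = np.array([1.02, 3.00])  # U^{k-1}_n, predictor-corrector value
R        = np.array([[1.0, 0.6],
                     [0.6, 1.0]])  # estimated correlation matrix R^{k-1}_n

# Marginal standard deviations: component-wise |fine - predictor-corrector|,
# which shrinks toward zero as the algorithm converges.
sigma = np.abs(fine_val - pc_val)

# Covariance from marginal std devs and correlations: Sigma = D R D.
Sigma = np.diag(sigma) @ R @ np.diag(sigma)

M = 8
rule1 = rng.multivariate_normal(fine_val, Sigma, size=M)  # mean = fine value
rule2 = rng.multivariate_normal(pc_val, Sigma, size=M)    # mean = pred.-corr. value
```

Setting R to the identity recovers the uncorrelated sampling used for comparison in Section 4.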

Multivariate copula
Alternatively, we consider errors that may have a different (possibly non-Gaussian) dependency structure. We consider another multivariate distribution, known as a copula, with uniformly distributed marginals scaled such that they have the same marginal means and standard deviations as the Gaussian distributions in Section 3.2.1.
A copula C : [0, 1] d → [0, 1] is a joint cumulative distribution function, of a d-dimensional random vector, with uniform marginal distributions [26].Sklar's theorem [27] states that any multivariate cumulative distribution function with continuous marginal distributions can be written in terms of d uniform marginal distributions and a copula that describes the correlation structure between them.Whilst there are numerous families of copulas, we consider the symmetric t-copula, C t , the copula underlying the multivariate t-distribution, which depends on the parameter ν and the correlation matrix R k−1 n .We fix ν = 1 so that samples have a higher probability of being drawn toward the edges of the [0, 1] d hypercube, see [26].Note that ν → ∞ can be thought of as sampling from the Gaussian copula.
Correlated samples χ = (χ_1, . . ., χ_d)^T ∼ C_t that are generated in [0, 1]^d then need to be re-scaled such that each marginal is uniformly distributed in an interval [x_i, y_i] ⊂ R, with mean μ_i and standard deviation σ_i for i ∈ {1, . . ., d}. By definition, a marginal uniform distribution on [x_i, y_i] has mean (x_i + y_i)/2 and variance (y_i − x_i)²/12, which we set to μ_i and σ²_i respectively. Solving these equations, we find that the desired marginals are uniform distributions on [μ_i − √3 σ_i, μ_i + √3 σ_i]. This guarantees that the generated samples χ ∼ C_t have the same marginal means μ_i and standard deviations σ_i as the Gaussian distributions, which allows us to compare performance results in Section 4. The t-copula sampling rules (Rule 3 and Rule 4) are therefore defined component-wise as

α_i = μ_i + √3 σ_i (2χ_i − 1), i ∈ {1, . . ., d},

with χ ∼ C_t and parameters μ_i and σ_i chosen to be identical to Rule 1 and Rule 2 respectively.
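A sketch of this construction, assuming ν = 1 so that the t CDF reduces to the closed-form Cauchy CDF; the function name and test values are ours:

```python
import numpy as np

def t_copula_samples(M, R, mu, sigma, rng):
    """Correlated samples whose marginals are uniform with mean mu[i] and
    standard deviation sigma[i], generated via a t-copula with nu = 1."""
    d = len(mu)
    L = np.linalg.cholesky(R)
    # Multivariate t with nu = 1: correlated normals over sqrt(chi2_1 / 1).
    z = rng.standard_normal((M, d)) @ L.T
    w = rng.chisquare(1, size=(M, 1))
    y = z / np.sqrt(w)
    # The t CDF with nu = 1 is the Cauchy CDF, so chi lies in [0, 1]^d with
    # uniform marginals and the t-copula dependence structure.
    chi = 0.5 + np.arctan(y) / np.pi
    # Rescale each marginal to U[mu - sqrt(3) sigma, mu + sqrt(3) sigma],
    # which has mean mu and variance sigma^2.
    return mu + np.sqrt(3.0) * sigma * (2.0 * chi - 1.0)

rng = np.random.default_rng(2)
mu, sigma = np.array([1.0, 3.0]), np.array([0.1, 0.2])
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])
alpha = t_copula_samples(20000, R, mu, sigma, rng)
```

With many samples, the empirical marginal means and standard deviations match μ and σ, so performance against the Gaussian rules can be compared on equal terms.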
Note that this copula construction can be extended to other families of copulas (e.g. the Gaussian copula) and to copulas with different marginal distributions (e.g.Gaussian, t, or logistic marginals).Therefore, sampling from the Gaussian multivariate distributions in Section 3.2.1 can be considered a special case of these more general copulas.

Convergence
Independent simulations of P_s will return a numerical solution U^k_n which, along with k_s, varies stochastically, since the optimal initial values chosen will vary between simulations. Given the randomness in these quantities, we can discuss the convergence of P_s in two ways.
Firstly, consider convergence in terms of minimising the random variable k_s by studying P(k_s < k_d). Given there are no analytical results for P guaranteeing that k_d < N for any given problem, proving that P(k_s < k_d) = 1, or at least E(k_s) = Σ_{k=1}^N k P(k_s = k) < k_d, appears to be analytically intractable. However, we can qualitatively discuss P(k_s < k_d) and E(k_s) with respect to the number of samples M. Consider the following cases:

• M = 1: Running P_s is equivalent to running P, hence the convergence of P_s follows from that of P and therefore k_s = k_d.

• 1 < M < ∞: For finitely many samples, we estimate the discrete probability distributions P(k_s = k) and observe, in all numerical experiments (see Section 4), that P(k_s < k_d) → 1 for M ≈ 10. Moreover, we observe that P(k_s < k_d) increases and E(k_s) decreases for increasing M, with E(k_s) < k_d for all values of M tested.
• M → ∞ If we were able to take infinitely many samples, P s effectively samples every possible value in the support of Φ, i.e. every initial value that has a non-zero probability of being sampled from Φ. Therefore, if Φ has infinite support, e.g. the Gaussian distribution, all possible initial values in R d are sampled and propagated, hence the fine solution will be recovered almost surely in E(k s ) = k s = 2 iterations.Note this is the smallest value k s can take to converge assuming convergence does not occur following the first iteration -although the algorithm could be modified such that convergence occurs almost surely in k s = 1 iterations as M → ∞.In Section 4.1, we illustrate this property numerically by taking a large number of samples for a single realisation of P s .
In the scenario that P_s converges in k_s = N iterations (irrespective of the value of M), it will return the fine solution just as P does when k_d = N (having propagated the exact initial value at T_0 sequentially N times using F).

Secondly, consider convergence in terms of the stochastic solution U^k_n in (9) approaching (in the mean-square sense) the fine solution U_n as k increases. In the numerical experiments in Section 4, we did not observe a case where P_s fails to converge to the fine solution. In fact, we observe tight confidence intervals on the numerical errors between U^k_n and U_n upon multiple realisations of P_s (see Figure 7 and Figure 9). We may expect P_s to fail to converge (i.e. solutions blow up) in cases in which P also fails; typically this means that a more accurate coarse solver is required for both algorithms. It should be noted, however, that because P_s samples M initial values, only a subset of the M propagated solutions may blow up in each time sub-interval due to the poor coarse solver, and so P_s could in fact locate a solution in cases where P cannot, although this is not tested here.

Computational complexity
At each iteration, P_s runs the fine solver more frequently than P, albeit still in parallel, and therefore requires a larger number of processors. The first iteration of P_s requires N processors; however, once sampling begins at k ≥ 2, it requires at most M(N − I − 1) + 1 processors, assuming I sub-intervals converge during k = 1. Whilst we assume processors are in abundance, this number scales directly with M, and so it is important to keep M to a minimum if limited processing power is available in practice.
As the stochastic iterations progress, the number of processors required, i.e. M(N − I − 1) + 1, decreases as the number of converged sub-intervals I increases, leaving a growing number of processors idle. Each additional sub-interval that converges leaves M idle processors that we can re-assign to do additional sampling and propagation. We assign each set of M idle processors to the earliest unconverged sub-interval with the least number of samples, ensuring all processors are working at all times to explore the solution space for the true initial values (see Figure 4 for an illustration). We do not explicitly write the pseudocode for re-assigning the idle processors in Algorithm 2 (lines 10-22) to avoid additional complexity; the process is, however, implemented in the numerical experiments in Section 4.

[Figure 4: Illustration of a possible processor configuration if two sub-intervals were to converge between iterations two and three of P_s. The letter C denotes a converged sub-interval, the number 1 denotes a sub-interval where we propagate the converged initial value from the preceding sub-interval using F, and the letter M denotes the number of samples taken in an unconverged sub-interval.]
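The re-assignment heuristic described above can be sketched as a few lines of bookkeeping; this is our illustrative reading of the rule, not pseudocode from Algorithm 2:

```python
def assign_idle(samples_per_interval, n_converged, M):
    """Re-assign the M processors freed by each newly converged sub-interval
    to the earliest unconverged sub-interval with the fewest samples.

    samples_per_interval: current sample counts for the unconverged
                          sub-intervals, ordered by time.
    n_converged:          number of sub-intervals that just converged,
                          each freeing one batch of M processors.
    """
    counts = list(samples_per_interval)
    for _ in range(n_converged):       # one batch of M processors per converged interval
        fewest = min(counts)
        target = counts.index(fewest)  # earliest interval with the fewest samples
        counts[target] += M
    return counts
```

For example, two newly converged sub-intervals with M = 4 and counts [4, 4, 4] give [8, 8, 4]: each freed batch goes to the earliest least-sampled interval.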
In terms of timings, an iteration of P_s takes approximately the same wallclock time as one of P. In this regard, we assume that the extra serial costs in P_s, e.g. correlation estimation, selecting optimal samples, and the extra G runs, take negligible wallclock time compared to a single run of F. Therefore, the wallclock time for P_s will be lower than for P if k_s < k_d. This comes at the cost of requiring O(MN) processors rather than N to solve the problem. However, it should be highlighted that if P_s converges in even one iteration fewer than P, we avoid an extra run of F, which may save a large amount of wallclock time.

Numerical results
In this section, we compare the numerical performance of P and P_s on systems of one, two, and three ODEs of varying complexity. Both algorithms use RK4 methods to carry out the integration, with G using time step δT and F using time step δt, the latter at least 75 times smaller than the former. We quantify the performance of P_s by estimating the distributions of k_s for each sampling rule and by measuring the accuracy of the stochastic solutions against those obtained serially with F. Since a limited number of processors were available for these experiments, the results are based not on wallclock runtimes but on comparing the convergence rates k_d and k_s, which are independent of the number of processors used. Additional results for these test cases, as well as two further test problems, are given in the Appendix.

Scalar nonlinear equation
First, we consider the nonlinear ODE

du_1/dt = sin(u_1) cos(u_1) − 2u_1 + e^{−t/100} sin(5t) + ln(1 + t) cos(t), (10)

with initial value u_1(0) = 1 [7]. We discretise the time interval t ∈ [0, 100] using N = 40 sub-intervals, coarse time steps δT = 100/80, and fine time steps δt = 100/8000. Numerical solutions to (10) are shown on the interval [0, 18] in Figure 5a. Deterministically, P locates a solution in k_d = 25 iterations using error tolerance ε = 10^{−10}. P_s converges in a varying number of iterations k_s, with P(k_s < k_d) = 1; see Figure 5b for the convergence of ten independent simulations using M = 3. From this plot, we can see that by taking just three samples, P_s reduces the convergence rate by almost a factor of two, from 25 to approximately 14 (on average). When M is increased above one, P_s begins generating stochastic solutions that converge in a varying number of iterations k_s. In order to accurately compare k_d with the discrete random variable k_s, we run 2000 independent simulations of P_s to estimate the distribution of k_s for a given M.

[Figure 5: (a) Numerical solutions to (10) over [0, 18] using F serially and a single realisation of P_s; only a subset of the fine time steps of the P_s solution are shown for clarity. (b) Errors at successive iterations of P (black line) and ten independent realisations of P_s (blue lines); the horizontal dashed red line represents the stopping tolerance ε = 10^{−10}. Both panels use P_s with sampling rule 1 and M = 3.]
Upon estimating these distributions, it was found that P(k_s < 25) = 1 for each of the four sampling rules (for all M > 1), meaning that by doubling the number of processors we can beat parareal with probability one. The estimated distributions of k_s as a function of M, using sampling rule 1 (the other rules perform similarly), are given in Figure 6a. The stacked bars represent the estimated discrete probability of a simulation converging in a given number of iterations. The results show P_s converging in just five iterations in the best case, demonstrating that P_s has the potential to yield significant parallel speedup, given that sufficiently many samples are drawn. Figure 6b emphasises the power of the stochastic method, showing that the estimated expected value E(k_s) decreases as M increases, with the estimated standard deviation sd(k_s) = (Σ_{k=1}^N (k − E(k_s))² P(k_s = k))^{1/2} decreasing too. The improved performance of P_s as M increases reflects what was discussed in Section 3.3. In particular, we ran a single realisation of P_s with M = 10^6, observing that P_s converged in four iterations (result not shown), confirming that k_s continues to decrease for increasing M. Looking at Figure 6b, we see that sampling rule 1 yields the lowest expected values of k_s for small values of M, with all sampling rules performing similarly for large M.
To verify the accuracy of the stochastic solutions, we plot the difference between the mean of 2000 independent realisations of P s and the serially calculated F solution in Figure 7. Also shown is the confidence interval given by two standard deviations of the stochastic solutions (which is at most O(10 −11 )) and the error generated by P .Accuracy is maintained with respect to the fine solution across the time interval, even more so than the P solution.See Appendix A for numerical results of P s applied to a stiff scalar nonlinear ODE.In this case, the stiffness of the equation demands a higher value of M to improve k s -something we also observe for the Brusselator in Section 4.2.
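For reference, the serial fine solve of (10) used as the accuracy benchmark can be reproduced with a fixed-step RK4 sweep; a minimal sketch in which the function names are ours:

```python
import numpy as np

def rhs(t, u):
    """Right-hand side of the scalar nonlinear ODE (10)."""
    return (np.sin(u) * np.cos(u) - 2.0 * u
            + np.exp(-t / 100.0) * np.sin(5.0 * t)
            + np.log(1.0 + t) * np.cos(t))

def rk4_solve(f, u0, t0, t1, n_steps):
    """Serial fixed-step RK4 over [t0, t1] (the role of F run serially)."""
    u, t = float(u0), t0
    h = (t1 - t0) / n_steps
    traj = [u]
    for _ in range(n_steps):
        k1 = f(t, u)
        k2 = f(t + h / 2, u + h * k1 / 2)
        k3 = f(t + h / 2, u + h * k2 / 2)
        k4 = f(t + h, u + h * k3)
        u += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
        traj.append(u)
    return np.array(traj)

# Fine-resolution reference: delta t = 100/8000 over [0, 100], u_1(0) = 1.
u = rk4_solve(rhs, 1.0, 0.0, 100.0, 8000)
```

The damping term −2u_1 keeps the solution bounded despite the growing ln(1 + t) forcing, so the sweep is well behaved at this step size.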

The Brusselator system
Next, consider the Brusselator system

du_1/dt = A + u_1² u_2 − (B + 1) u_1,
du_2/dt = B u_1 − u_1² u_2, (11)

a pair of stiff nonlinear ODEs that model an auto-catalytic chemical reaction [28]. Using parameters (A, B) = (1, 3), trajectories of the system exhibit oscillatory behaviour in phase space, approaching a limit cycle (as t → ∞) that contains the unstable fixed point (1, 3)^T. Now that d > 1, we use bivariate distributions to sample the initial values, meaning we can compare the effects of including or excluding the correlations between variables. System (11) is solved using initial values u(0) = (1, 3.07)^T over time interval t ∈ [0, 15.3] with N = 25, δT = 15.3/25, and δt = 15.3/2500 [29]. The numerical solution to (11) in phase space and the convergence of the successive errors are reported in Appendix B. With these parameters and a tolerance of ε = 10^{−6}, P takes k_d = 7 iterations to stop and return a numerical solution.
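As a quick sanity check on (11), the vector field with (A, B) = (1, 3) indeed vanishes at the fixed point (1, 3)^T; a short sketch with our own names:

```python
import numpy as np

A, B = 1.0, 3.0

def brusselator(t, u):
    """Brusselator right-hand side (11) with parameters (A, B)."""
    u1, u2 = u
    return np.array([A + u1**2 * u2 - (B + 1.0) * u1,
                     B * u1 - u1**2 * u2])

# The fixed point (A, B/A) = (1, 3): the vector field vanishes there.
fp = np.array([A, B / A])
print(brusselator(0.0, fp))  # -> [0. 0.]
```

Near the initial value (1, 3.07)^T used above, the field is small but nonzero, which is what drives the slow spiral toward the limit cycle.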
The estimated distributions of k s for sampling rule 1 are given in Figure 8a. Even though P takes just k d = 7 iterations to stop, we observe that P s can still reach the desired tolerance in 5 or 6 iterations, albeit requiring larger values of M. We believe this is due to the stiffness of the system and the poor accuracy of the explicit G solver; the results presented for the stiff ODE in Appendix A appear to confirm this. Using adaptive time-stepping methods could reduce the value of M needed to converge in fewer iterations k s [24]. The solid lines in Figure 8b show that, using sampling rules 1 or 3, P s only requires M ≈ 10 to beat parareal almost certainly, i.e. to guarantee that P(k s < 7) → 1. Sampling rules 1 and 3 outperform 2 and 4 for this particular system. Note, however, the stark decrease in performance if uncorrelated samples are instead generated within P s (dashed lines). This demonstrates the importance of accounting for the dependence between variables in nonlinear systems such as (11). Observe again, in Figure 9, how the mean P s solutions attain equivalent, or better, accuracy than the P solutions, with standard deviations of at most O(10^{-6}).
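The effect of discarding correlations can be made concrete with a simple bivariate Gaussian sampler; this is an illustrative sketch, not the implementation used for our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_initial_values(mean, cov, M, correlated=True):
    """Draw M candidate initial values around a mean (e.g. a fine solution
    or predictor-corrector value). If correlated=False, the off-diagonal
    covariance between components is discarded, mimicking the dashed-line
    (uncorrelated) experiments of Figure 8b."""
    cov = np.asarray(cov, dtype=float)
    if not correlated:
        cov = np.diag(np.diag(cov))  # keep marginal variances only
    return rng.multivariate_normal(np.asarray(mean, dtype=float), cov, size=M)

# Strongly dependent components, as between u1 and u2 near the limit cycle.
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
correlated_samples = sample_initial_values([1.0, 3.0], cov, M=20000)
uncorrelated_samples = sample_initial_values([1.0, 3.0], cov, M=20000,
                                             correlated=False)
```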
Further results of P s applied to (11), and to an additional two-dimensional nonlinear system, are presented in Appendix B and Appendix C respectively. Observe that the less stiff system in Appendix C requires less sampling to improve k s than the Brusselator does, further highlighting the demand for higher M in stiff systems to improve the convergence rate.

The Lorenz system
Finally, we consider the Lorenz system (12), a simplified model for weather prediction [30]. With the parameters (γ 1 , γ 2 , γ 3 ) = (10, 28, 8/3), (12) exhibits chaotic behaviour in which trajectories with initial values close to one another diverge exponentially. This will test the robustness of P s, as small numerical differences between initial values mean that errors can grow rapidly as time progresses. Solving (12) and running P s to compare the performance of the sampling rules, we see again in Figure 10a that taking correlated samples is much more efficient than not, and that only M ≈ 10 samples are required to beat parareal with probability one. For the chaotic trajectories generated by (12), sampling close to the predictor-corrector (rules 2 and 4) yields superior performance compared to rules 1 and 3 for small values of M. Figure 10b displays estimated distributions for varying M using sampling rule 2, yielding a best convergence rate of k s = 16 for approximately 25% of runs with M = 1000. These results demonstrate the robustness of P s and that the sampling and propagation process is not impeded by the exponential divergence of nearby trajectories. Additional results of P s applied to (12) are given in Appendix D.
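A sketch of the Lorenz vector field, assuming its standard form with the parameters above (our assumption); the nontrivial equilibria sit at (±c, ±c, γ 2 − 1) with c = sqrt(γ 3 (γ 2 − 1)):

```python
import numpy as np

def lorenz(t, u, g1=10.0, g2=28.0, g3=8.0 / 3.0):
    """Lorenz right-hand side (assumed standard form of system (12)):
    x' = g1 (y - x),  y' = x (g2 - z) - y,  z' = x y - g3 z."""
    x, y, z = u
    return np.array([g1 * (y - x),
                     x * (g2 - z) - y,
                     x * y - g3 * z])

# Both the origin and the nontrivial fixed point are equilibria.
c = np.sqrt((8.0 / 3.0) * 27.0)
origin_rhs = lorenz(0.0, np.array([0.0, 0.0, 0.0]))
fixed_point_rhs = lorenz(0.0, np.array([c, c, 27.0]))
```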

Conclusions
In this paper we have extended the parareal algorithm using probabilistic methods, developing a stochastic parareal algorithm for solving systems of ODEs in a time-parallel manner. Instead of passing deterministically calculated initial values into parareal's predictor-corrector, stochastic parareal selects more accurate values from a randomly sampled set, at each temporal sub-interval, in order to converge in fewer iterations. In Section 4, we compared performance against the deterministic parareal algorithm on several low-dimensional ODE systems of increasing complexity by calculating the estimated convergence rate distributions (over multiple independent realisations of stochastic parareal) with increasing numbers of random samples M. By taking just M ≈ 10 (correlated) samples, the estimated probability of converging sooner than parareal approached one in all test cases. Similarly, we observed numerical convergence toward the fine (exact) solution with accuracy of similar order to parareal and, in the spirit of probabilistic numerics [31, 32], obtained a measure of uncertainty over the ODE solution upon multiple realisations of the algorithm. The probability that stochastic parareal converges faster than standard parareal depends on a number of factors: the complexity of the problem being solved, the number of time sub-intervals (N), the accuracy of the coarse integrator (G), the number of random samples (M), and the type of sampling rule in use. Sampling rules 1 and 3 (sampling close to the fine solutions) outperformed rules 2 and 4 (sampling close to the predictor-corrector solutions) for the ODE systems in Section 4.1, Section 4.2, and SM1. The reverse was true, however, for the systems in Section 4.3 and SM3, making it difficult to determine an optimal rule for a general ODE system. To overcome having to choose a particular sampling rule, one could linearly combine different rules or even sample from multiple rules simultaneously. We would suggest sampling from probability distributions with infinite support, i.e. the Gaussians (rules 1 and 2), so that samples can be taken anywhere in R^d with non-zero probability. Having finite support may have created difficulty for the uniform marginal t-copulas (rules 3 and 4) because samples could only be taken in a finite hyperrectangle in R^d, which is problematic if the exact solution were to lie outside of this space.
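The support argument can be illustrated directly: Gaussian samples may land anywhere in R^d, whereas uniform marginals are confined to a finite hyperrectangle (drawn independently here, without the t-copula dependence, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Gaussian marginals (rules 1 and 2): support is all of R^2, so with enough
# samples some will land arbitrarily far from the mean.
gaussian_samples = rng.normal(loc=0.0, scale=1.0, size=(100000, 2))

# Uniform marginals on a hyperrectangle [a, b]^2 (as for rules 3 and 4):
# no sample can ever fall outside the box, however many are drawn.
a, b = -2.0, 2.0
uniform_samples = rng.uniform(a, b, size=(100000, 2))
```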
When solving stiff ODEs (see Section 4.2 and SM1), the results indicated that stochastic parareal demands increasingly high sampling to converge sooner than parareal, compared with non-stiff systems. For example, when taking M = 100 samples for the non-stiff scalar ODE in Figure 6, the expected convergence rate decreases from 25 to 7, whereas for the stiff scalar ODE in SM1 the rate only drops from 8 to 6. A similar observation can be made in the two-dimensional test cases in Section 4.2 (stiff) and SM3 (non-stiff). These results exemplify the role that system complexity, e.g. stiffness or chaos, plays in the performance of both algorithms. In SM1, stochastic parareal was also shown to perform more efficiently on problems that deterministic parareal itself struggles with, i.e. cases in which the accuracy of the coarse integrator G is poor. In SM3 it was also observed that, for low sample numbers (M = 2), stochastic parareal actually converged in one more iteration than parareal in fewer than 2.5% of cases. This suggests there may be a minimum number of samples required to beat parareal in some situations, something to be investigated with further experimentation.
In summary, we have demonstrated that probabilistic methods and additional processors can, for low-dimensional ODEs, be used to increase the parallel scalability of existing time-parallel algorithms. Next we need to investigate possible improvements and generalisations: for example, determining whether the algorithm scales for larger systems of equations, essential if it is to be used for solving PDE problems, and whether we can design adaptive sampling rules that do not need to be specified a priori to simulation. Finally, we would aim to make use of the whole ensemble of fine propagated trajectories rather than using only one, avoiding the waste of valuable information about the solution at the coarse and fine resolutions. An approach in this direction has recently been proposed [33], making use of ideas from the field of probabilistic numerics to adopt a more Bayesian approach to this problem.

On the contrary, Figure 13 shows how increasing the number of coarse steps in the interval [0, 10] from 20 to 60 drastically decreases the probabilities of P s converging sooner than P. Increasing the number of coarse steps increases the accuracy of the G solver, hence P reaches the stopping tolerance in fewer iterations k d. This result suggests that whilst P s can still converge faster than P by using more samples, it works more efficiently for particular problems where k d is relatively large (i.e. when P obtains low rates of parallel speedup).
Finally, we calculated the errors of the mean P s solution with respect to the F solution (Figure 14a) and the analytical solution u 1 (Figure 14b). In both cases, we observe that the mean solution attains accuracy comparable to P against both the fine and analytical solutions.

B The Brusselator system
The additional results here complement subsection 4.2. The numerical solutions to system (4.2) in the phase plane are given in Figure 15a alongside the errors at successive iterations of five runs of P s in Figure 15b. In Figure 16 we report the expected value of k s as a function of M for each of the sampling rules. This result suggests that larger values of M are required to reduce E(k s ) even further for this stiff system.

C An additional two-dimensional nonlinear system
Consider system (14), whose solutions, for initial values within the box [0, π] × [0, π], converge toward a square-shaped limit cycle on the edges of the box (see Figure 17a) [34]. The system is solved on t ∈ [0, 60], starting at u(0) = (3/2, 3/2)^T, using N = 30 sub-intervals and time steps δT = 60/30 and δt = 60/3000. As shown in Figure 17b, P takes k d = 20 iterations to converge with tolerance ε = 10^{-8}, whilst the ten realisations of P s shown take between 17 and 19 iterations. Contrary to the previous two-dimensional test problem, Figure 18a shows that sampling close to the predictor-corrector values (rules 2 and 4) yields lower expected values of k s. In this case, the bivariate Gaussian outperforms the t-copula; however, the reverse is true for rules 1 and 3. Results using uncorrelated samples generated inferior performance and hence are not shown here. The detailed distributions of k s in Figure 18b, using sampling rule 2, show a best performance of k s = 14 with 100 samples and P(k s < 20) = 1 for M ≈ 10. They do, however, reveal that in a limited number of cases, using two samples, P s can in fact take more than k d = 20 iterations to converge (k s = 21 in this small fraction of realisations). This suggests that there may be a minimum number of samples required to beat the convergence rate of parareal for some systems of equations, something to be investigated in future work.

D The Lorenz system
The additional results here complement subsection 4.3. Figure 19 displays E(k s ) against M, indicating that, for the Lorenz system, generating correlated samples close to the predictor-corrector solutions (sampling rules 2 and 4) yields the lowest expected values of k s. In Figure 20, we plot the absolute errors between the mean P s solution and the fine solver. Notice that accuracy is maintained with respect to the parareal solution in this chaotic system, even as the errors grow with increasing time.
where f : R^d × [T 0 , T N ] → R^d is a smooth multivariate (possibly nonlinear) function, u : [T 0 , T N ] → R^d the time-dependent vector solution, and u 0 ∈ R^d the initial values at T 0. Decompose the time domain into N sub-intervals such that [T 0 , T N ] = [T 0 , T 1 ] ∪ · · · ∪ [T N −1 , T N ], with each sub-interval taking fixed length ∆T := T n − T n−1 for n = 1, . . . , N (see Figure 1 for a schematic).
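The uniform decomposition described above translates directly to code; a short sketch using the Section 4.2 settings (t ∈ [0, 15.3], N = 25):

```python
import numpy as np

def decompose(T0, TN, N):
    """Uniform decomposition of [T0, TN] into N sub-intervals of fixed
    length dT = (TN - T0) / N, returning the boundaries T_0, ..., T_N."""
    boundaries = np.linspace(T0, TN, N + 1)
    dT = (TN - T0) / N
    return boundaries, dT

# Brusselator setup from Section 4.2: t in [0, 15.3] with N = 25.
Ts, dT = decompose(0.0, 15.3, 25)
```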

Figure 1 :
Figure 1: Schematic of the time domain decomposition. Three levels of temporal discretisation are shown: sub-intervals (size ∆T ), coarse intervals (size δT ), and fine intervals (size δt). Note how the discretisations align with one another and that δt ≤ δT ≤ ∆T in our implementation.

Figure 2 :
Figure 2: First iteration of the parareal algorithm to numerically evaluate the solution of a scalar ODE, either available or obtained via the fine solver (black line). The first runs of G and F are given in yellow and blue respectively, and the second run of G in red. The red dots represent the solution after applying the predictor-corrector (5).
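The predictor-corrector iteration illustrated here can be sketched end to end on a scalar test problem; the test equation, Euler solvers, and step counts below are illustrative choices of ours, not the paper's setup. After N iterations, parareal reproduces the serial fine solution exactly:

```python
import numpy as np

# Test problem u' = lam * u on [0, 1], decomposed into N sub-intervals.
lam, u0, N = -1.0, 1.0, 10
dT = 1.0 / N

def G(u):   # coarse solver: one explicit Euler step across a sub-interval
    return u + dT * lam * u

def F(u):   # fine solver: 100 explicit Euler steps across a sub-interval
    dt = dT / 100
    for _ in range(100):
        u = u + dt * lam * u
    return u

# Iteration 0: serial coarse prediction of the initial value at each T_n.
U = np.empty(N + 1)
U[0] = u0
for n in range(N):
    U[n + 1] = G(U[n])

# Predictor-corrector: U_{n+1}^{k+1} = G(U_n^{k+1}) + F(U_n^k) - G(U_n^k).
for k in range(N):
    F_old = np.array([F(U[n]) for n in range(N)])  # parallelisable in n
    G_old = np.array([G(U[n]) for n in range(N)])
    U_new = U.copy()
    for n in range(N):                             # serial correction sweep
        U_new[n + 1] = G(U_new[n]) + F_old[n] - G_old[n]
    U = U_new
```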


Figure 3 :
Figure 3: An illustration of the sampling and propagation process within P s following iteration k = 1. The fine solution is given in black, the k = 0 fine solutions in blue, the k = 1 coarse solutions in red, and the k = 1 predictor-corrector solutions as red dots. With M = 5, four samples α 1 n,m (green dots) are taken at T 2 and T 3 from distributions with means U 1 2 and U 1 3 and some finite standard deviations respectively. These values, along with U 1 2 and U 1 3 themselves, are propagated forward in time in parallel using F (green lines). The optimally chosen samples α 1 n (refer to the text for how these are chosen) are then propagated forward in time using G (yellow lines).

Figure 5 :
Figure 5: (a) Numerical solution of (10) over [0, 18] using F serially and a single realisation of P s. Note that only a subset of the fine time steps of the P s solution are shown for clarity. (b) Errors at successive iterations of P (black line) and ten independent realisations of P s (blue lines). The horizontal dashed red line represents the stopping tolerance ε = 10^{-10}. Note that both panels use P s with sampling rule 1 and M = 3.

Figure 6 :
Figure 6: (a) Estimated discrete distributions of k s as a function of M for sampling rule 1. (b) Estimated expectation of k s as a function of M, calculated using the estimated distributions of k s for each sampling rule, with error bars representing ± two standard deviations sd(k s ). Distributions in both panels are estimated by simulating 2000 independent realisations of P s for each M.

Figure 7 :
Figure 7: Errors of P (red) and mean P s (black) solutions against the serial F solution over time. The mean error is obtained by running 2000 independent realisations of P s with sampling rule 2 and M = 4; the confidence interval representing the mean ± two standard deviations is shown in light blue.

Figure 8 :
Figure 8: (a) Estimated discrete probabilities of k s as a function of M for sampling rule 1. (b) Estimated probability that the convergence rate k s is smaller than k d = 7 as a function of M for the sampling rules with (solid lines) and without (dashed lines) correlations. Distributions were estimated by simulating 2000 independent realisations of P s for each M.

Figure 9 :
Figure 9: Errors of P (red) and mean P s (black) solutions against the F solution. The mean error is obtained by running 2000 independent realisations of P s with sampling rule 4 and M = 200; the confidence interval representing the mean ± two standard deviations is shown in light blue. Panel (a) displays errors for the u 1 component of the solution whilst (b) displays the u 2 component.

Figure 10 :
Figure 10: (a) Estimated probability that the convergence rate k s is smaller than k d = 20 as a function of M for the sampling rules with (solid lines) and without (dashed lines) correlations. Distributions were estimated by simulating 2000 independent realisations of P s for each M. (b) Estimated discrete probabilities of k s as a function of M for sampling rule 2.

Figure 12 :
Figure 12: (a) Estimated discrete distributions of k s as a function of M for sampling rule 1. (b) Estimated expectation of k s as a function of M, calculated using the estimated distributions of k s for each sampling rule, with error bars representing ± two standard deviations sd(k s ). Distributions were estimated by simulating 2000 independent realisations of P s for each M.

Figure 13 :
Figure 13: Estimated probabilities that the convergence rate k s is smaller than k d as a function of M using sampling rule 3. The curves show P(k s < 8) (black), P(k s < 5) (red), and P(k s < 4) (blue) using coarse steps δT = 10/20, 10/40, 10/60 respectively, noting that P converges in k d = 8, 5, 4 iterations for these coarse steps respectively. As before, 2000 independent realisations of P s were run for each M.

Figure 14 :
Figure 14: Errors of P (red) and mean P s (black) solutions against (a) the F solution and (b) the analytical solution u 1. The mean error is obtained by running 2000 independent realisations of P s with sampling rule 1 and M = 10; the confidence interval representing the mean ± two standard deviations is shown in light blue.

Figure 15 :
Figure 15: (a) Numerical solution of system (4.2) in the phase plane using F and P s. Note again that only a subset of the fine time steps of the (mean) P s solution are shown for clarity. (b) Errors at successive iterations of P (black lines) and five independent realisations of P s (blue lines). The horizontal dashed black line represents the tolerance ε = 10^{-6}. Both panels run P s with sampling rule 1 and M = 10.

Figure 16 :
Figure 16: Estimated expectation of k s as a function of M, calculated using the estimated distributions of k s for each sampling rule, with error bars representing ± two standard deviations sd(k s ) shown for the correlated sampling rules (solid lines). Uncorrelated sampling rules are shown with dashed lines, without error bars. Distributions are estimated by simulating 2000 independent realisations of P s for each M.

Figure 17 :
Figure 17: (a) Numerical solution of (14) over [0, 60] using F serially and P s. Note again that only a subset of the fine time steps of the (mean) P s solution are shown for clarity. (b) Errors at successive iterations of P and ten independent realisations of P s. The dashed black line represents the tolerance ε = 10^{-8}. Note that both panels use P s with sampling rule 2 and M = 20.

Figure 18 :
Figure 18: (a) Estimated expectation of k s as a function of M, calculated using the estimated distributions of k s for each sampling rule, with error bars representing ± two standard deviations sd(k s ). (b) Estimated discrete distributions of k s as a function of M for sampling rule 2. Distributions were calculated by simulating 2000 independent realisations of P s for each M.

Figure 19 :
Figure 19: Estimated expectation of k s as a function of M, calculated using the estimated distributions of k s for each sampling rule, with error bars representing ± two standard deviations sd(k s ) shown for the correlated sampling rules (solid lines). Uncorrelated sampling rules are shown with dashed lines, without error bars. Distributions are estimated by simulating 2000 independent realisations of P s for each M.

Figure 20 :
Figure 20: Absolute errors of P (red) and mean P s (black) solutions against the F solution in each component: u 1 (top), u 2 (middle), and u 3 (bottom). The mean error is obtained by running 2000 independent realisations of P s with sampling rule 2 and M = 500; the confidence interval representing the mean ± two standard deviations is shown in light blue.

Table 1 :
Additional notation used to describe the stochastic parareal algorithm.