Annual Review of Statistics and Its Application

Simulation-Based Bayesian Analysis

I consider the development of Markov chain Monte Carlo (MCMC) methods, from late-1980s Gibbs sampling to present-day gradient-based methods and piecewise-deterministic Markov processes. In parallel, I show how these ideas have been implemented in successive generations of statistical software for Bayesian inference. These software packages have been instrumental in popularizing applied Bayesian modeling across a wide variety of scientific domains. They provide an invaluable service to applied statisticians in hiding the complexities of MCMC from the user while providing a convenient modeling language and tools to summarize the output from a Bayesian model. As research into new MCMC methods remains very active, it is likely that future generations of software will incorporate new methods to improve the user experience.


INTRODUCTION
Over the last 30 years, Markov chain Monte Carlo (MCMC) methods have expanded the scope of applied Bayesian statistics across diverse scientific fields. The growth in popularity of MCMC has been accelerated by the wide availability of free software. Essential features of this software include:

1. Providing a flexible and expressive modeling language that allows users to build large complex models from simple components. This capability represented a radically different approach from traditional statistical software, which provided a set of canned routines for fixed classes of models. With a flexible Bayesian modeling language, the user can easily extend existing models to account for issues such as measurement error, missing data, censoring and truncation, or hierarchical structure (Gilks et al. 1993).

2. Hiding the implementation details of MCMC from the user. Several competing sampling methods to analyze a given model may be available. The software takes care of the choice of sampling method. In addition, most sampling methods have parameters that need to be tuned for optimal performance. The software also takes care of adaptively tuning the sampling method. Ultimately, the user does not need to care about how the samples are drawn as long as the sampling is efficient. From the point of view of statistical methodology, the software packages are "black boxes," or, perhaps more accurately, "gray boxes," since they may allow partial control over the sampling methods.

3. Communicating with other software. Bayesian software is typically designed to do one thing only (efficient sampling), allowing the complexities of data formatting and analyzing output to be carried out in a more suitable environment such as R or Python.
Examples of Bayesian software with these features include BUGS (Lunn et al. 2012), JAGS (Plummer 2017), NIMBLE (de Valpine et al. 2022), and Stan (Stan Dev. Team 2022). The same description may be used for other packages that do not use MCMC, such as R-INLA (Integrated Nested Laplace Approximation).
In this article I review the historical development of software for Bayesian inference. The article focuses on MCMC, which remains the dominant methodology in this area, but also touches on alternatives such as INLA. The review tracks the way that advances in software have been driven by advances in statistical methodology. The main thesis of the review is that the success of general-purpose Bayesian statistical software relies on two important pillars: (a) the existence of statistical methods that can be easily tuned for optimal performance and so are widely applicable and (b) the availability of reusable software components.

MARKOV CHAIN MONTE CARLO
The 1980s saw the convergence of three powerful ideas: graphical models, Gibbs sampling, and Bayesian inference. At the same time, exponential growth in computing power reached a critical point that enabled these ideas to be translated from theory to practice. In this section I review some essential information on MCMC methods, covering only the material necessary to understand the rest of the article. A more comprehensive overview is presented by Brooks et al. (2011).
Bayesian computations involve high-dimensional integration to derive the posterior distribution. Analytic solutions to these integrals are rare and are generally limited to conjugate prior distributions, that is, where the prior and the posterior come from the same family. Conjugate priors are useful only in the simplest models. Numerical approximations such as Gaussian quadrature do not scale well to high dimensions. This situation changed with the rediscovery of MCMC methods in the 1980s (Gelfand & Smith 1990). MCMC was first developed in the 1950s to solve problems in statistical physics (Metropolis et al. 1953). The wide applicability of MCMC to statistical models was not fully appreciated until the rapid development of desktop computing made the methods computationally feasible and accessible. With MCMC, we relax the search for an exact solution to the posterior distribution, instead relying on the empirical distribution of a sequence of samples. Furthermore, we do not require these samples to be independent. Instead, we create a Markov chain that has the target distribution as its stationary distribution. Under suitable regularity conditions, the Markov chain will converge to the target distribution, allowing inference from posterior samples after a suitable "burn-in" period.
To make matters more concrete, suppose we wish to estimate the distribution p(θ) for θ ∈ R^d. In Bayesian applications, the distribution that we wish to estimate is the posterior distribution p(θ | D) for some data D, but for notational convenience we suppress the dependence on D. The aim of Monte Carlo inference is to conduct inference on p(θ) by drawing a sequence of samples θ^(1), θ^(2), . . ., θ^(T). Then, for a function g(θ) with finite expectation under p(θ), the expectation is estimated by the sample mean

ḡ^(T) = (1/T) ∑_{t=1}^{T} g(θ^(t)).   1.
In MCMC, the sequence θ^(1), θ^(2), . . ., θ^(T) is a sequence of dependent samples from a Markov chain with p(θ) as its stationary distribution. Under mild regularity conditions, the Markov chain is ergodic, so that ḡ^(T) is a consistent estimator of E{g(θ)}:

ḡ^(T) → E{g(θ)} as T → ∞, with probability 1.   2.

In practice, convergence may be extremely slow, so that the sample mean in Equation 1 may be a poor estimator even with a large number of iterations T. Ideally, the Markov chain should be geometrically ergodic. Then, the total variation distance between the distribution of θ^(t) and the target p(θ) diminishes as t → ∞ at a geometric rate. When the chain is geometrically ergodic, a central limit theorem applies to the sample mean in Equation 1, which has an asymptotic normal distribution as T → ∞.
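The behavior of the ergodic average can be illustrated with a short simulation (a hypothetical sketch, not from the article): an AR(1) chain whose stationary distribution is standard normal, used to estimate E{g(θ)} with g(θ) = θ², which equals 1.

```python
import math
import random

def ar1_chain(T, rho=0.9, seed=1):
    """Simulate an AR(1) Markov chain whose stationary distribution is N(0, 1).
    Successive samples are dependent, but the chain is (geometrically) ergodic."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)
    theta, out = 0.0, []
    for _ in range(T):
        theta = rho * theta + s * rng.gauss(0.0, 1.0)
        out.append(theta)
    return out

chain = ar1_chain(100_000)
# Ergodic average of g(theta) = theta^2; converges to E(theta^2) = 1.
g_bar = sum(x * x for x in chain) / len(chain)
```

Despite the strong dependence between successive draws, the sample mean is a consistent estimator of the expectation.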
The particular form of MCMC developed in the 1980s is named Gibbs sampling (Geman & Geman 1984), in honor of Josiah Willard Gibbs. Suppose we have d random variables θ = (θ_1, . . ., θ_d) with joint distribution p(θ). At each iteration of the Gibbs sampler, we visit each variable in turn and replace its current value θ_i with a new value θ′_i for i in 1, . . ., d. The new value is a sample from the full conditional distribution p(θ_i | θ_−i), where θ_−i = (θ_1, . . ., θ_{i−1}, θ_{i+1}, . . ., θ_d) is the state of the random variables, except for θ_i, before we draw the new sample. The worst-case computational complexity of the Gibbs sampler is O(d²). If the full conditional density p(θ_i | θ_−i) depends on all d − 1 elements of θ_−i, then the computational cost of sampling the new value θ′_i is O(d) as d increases. At each iteration we draw d samples, once each for i = 1, . . ., d, which is also O(d), giving an overall complexity of O(d²). An algorithm of this computational complexity would be impractical for very large models. However, in practice many statistical models have a sparse representation in terms of a conditional independence graph (CIG), in which variables are represented by nodes. The absence of an edge between two nodes in a CIG indicates that the two corresponding variables are independent of each other, conditional on the values of all other nodes in the graph.
In a CIG, the full conditional distribution p(θ_i | θ_−i) of variable θ_i depends only on its neighbors in the graph. An archetypal CIG is a lattice, illustrated in Figure 1. In this example, the highlighted node depends only on its four neighbors. The remaining nodes do not contribute to its full conditional distribution. The same argument applies to all nodes. As a result, sampling a new value for a given node takes constant time, regardless of the size of the lattice. Therefore, the computational complexity of the Gibbs sampler scales linearly with the size of the graph.

Figure 1
Lattice illustrating conditional independence in a graph. The light gray node is conditionally independent of the white nodes given its dark gray neighbors. The remaining nodes, in white, do not contribute to its full conditional distribution.
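As a concrete sketch of Gibbs sampling (an illustration, not from the article), consider a bivariate normal target with unit variances and correlation ρ, where both full conditionals are available in closed form: x | y ~ N(ρy, 1 − ρ²) and symmetrically for y.

```python
import math
import random

def gibbs_bivariate_normal(T, rho=0.8, seed=42):
    """Gibbs sampler for a bivariate normal with zero means, unit variances,
    and correlation rho. Each coordinate is replaced in turn by a draw from
    its full conditional distribution."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    xs, ys = [], []
    for _ in range(T):
        x = rho * y + s * rng.gauss(0.0, 1.0)  # draw from p(x | y)
        y = rho * x + s * rng.gauss(0.0, 1.0)  # draw from p(y | x)
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = gibbs_bivariate_normal(50_000)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sample_corr = (sum(x * y for x, y in zip(xs, ys)) / n - mx * my) / (
    math.sqrt(sum(x * x for x in xs) / n - mx * mx)
    * math.sqrt(sum(y * y for y in ys) / n - my * my)
)
```

The empirical correlation of the output recovers the target correlation ρ.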
Gibbs sampling on a graph fits perfectly with Bayesian inference. If a node on a graph represents an observed quantity, then the Gibbs sampler skips that node, leaving it at its observed value. The stationary distribution of the resulting Markov chain is the conditional distribution of the unobserved nodes given the observed nodes, in other words, the posterior distribution that is the target of Bayesian inference.
Sampling from the full conditional distribution p(θ_i | θ_−i) is not always feasible. However, it is not necessary to draw an independent sample from the full conditional. It is sufficient to make a transition that is reversible with respect to the full conditional distribution. Let p_t(θ′_i | θ_i, θ_−i) be the probability of a transition from the current value θ_i to the new value θ′_i. Reversibility is defined by the detailed balance condition:

p(θ_i | θ_−i) p_t(θ′_i | θ_i, θ_−i) = p(θ′_i | θ_−i) p_t(θ_i | θ′_i, θ_−i).   3.
There are two qualitatively different probability terms in Equation 3. The conditional probability p(θ_i | θ_−i) is derived from the joint distribution p(θ), which is defined by the model. The transition probability p_t(θ′_i | θ_i, θ_−i) is determined by the transition kernel of the Markov chain and is completely independent of p(θ). Many choices of p_t exist for a given model. We emphasize the distinction between the two probabilities with the subscript t for the transition probabilities. The goal of statistical software for MCMC is to determine appropriate transition probabilities without any user input and, in fact, to hide these probabilities from the user.
Under broad conditions, any transition kernel can be made reversible by the addition of a Metropolis-Hastings acceptance step (Metropolis et al. 1953, Hastings 1970). We can rewrite the detailed balance condition in Equation 3 as R = 1, where

R = [p(θ′_i | θ_−i) p_t(θ_i | θ′_i, θ_−i)] / [p(θ_i | θ_−i) p_t(θ′_i | θ_i, θ_−i)].   4.

The proposed value θ′_i is accepted with probability min(1, R); otherwise we remain at the current value θ_i. The addition of the acceptance step makes the transition reversible. The modified algorithm is known as Metropolis within Gibbs.
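A minimal sketch of a single Metropolis-within-Gibbs update (an illustration; with a symmetric random-walk proposal the transition terms p_t cancel, so R reduces to a ratio of full conditional densities):

```python
import math
import random

def mh_update(theta_i, log_fc, step, rng):
    """One Metropolis-Hastings update of a single coordinate theta_i.
    log_fc gives the log full conditional density up to an additive constant.
    The symmetric random-walk proposal makes the p_t terms cancel in R."""
    proposal = theta_i + rng.gauss(0.0, step)
    log_R = log_fc(proposal) - log_fc(theta_i)
    if rng.random() < math.exp(min(0.0, log_R)):  # accept with probability min(1, R)
        return proposal
    return theta_i  # otherwise remain at the current value

# Illustration: the full conditional is a standard normal density.
rng = random.Random(7)
log_fc = lambda t: -0.5 * t * t
theta, draws = 0.0, []
for _ in range(50_000):
    theta = mh_update(theta, log_fc, step=2.4, rng=rng)
    draws.append(theta)

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

The step size is a tuning parameter of the kind that, as discussed above, the software is expected to adapt on the user's behalf.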

Markov Chain Monte Carlo in Practice
While Markov chain theory guarantees that the chain will converge to the target distribution, it is relatively silent on how long convergence will take. Theoretical research on convergence bounds did not lead to practically usable criteria. Therefore, the determination of MCMC convergence remains largely empirical. In practice, it is customary to discard the initial portion of the Markov chain. This process is referred to as burn-in. The burn-in period may coincide with the samplers adapting their behavior for optimal efficiency (Andrieu & Thoms 2008). After burn-in, samples of the parameters of interest can be monitored. Visual inspection of the monitored values normally uses trace plots, which show how each scalar parameter changes with time. When trace plots show high autocorrelation, the chain is said to display poor mixing. In this case it is customary to thin the chain by monitoring only every mth iteration for m ≫ 1. The discarded samples do not contribute any additional information about the target distribution when the autocorrelation is high.
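Autocorrelation and thinning can be illustrated with a short sketch (hypothetical; autocorr is an assumed helper, not part of any package discussed here):

```python
import math
import random

def autocorr(chain, lag):
    """Sample autocorrelation of a scalar chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var

# A slowly mixing chain: AR(1) with lag-1 autocorrelation 0.9.
rng = random.Random(3)
rho = 0.9
s = math.sqrt(1.0 - rho * rho)
x, chain = 0.0, []
for _ in range(100_000):
    x = rho * x + s * rng.gauss(0.0, 1.0)
    chain.append(x)

acf1 = autocorr(chain, 1)            # close to 0.9: poor mixing
thinned = chain[::10]                # monitor only every 10th iteration
acf1_thinned = autocorr(thinned, 1)  # close to 0.9**10, i.e., about 0.35
```

Thinning trades a shorter monitored record for much weaker dependence between the retained samples.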
It is also customary to run several parallel chains, where the initial values and the transitions are completely independent of one another. The use of the term "parallel chain" predates the current multicore era of computing and simply refers to the realizations of the Markov chain being statistically independent. However, in practice most software will run parallel chains on different cores.
It remains to make a determination of whether the chain has converged sufficiently for the monitored values to be considered an approximate sample from the posterior distribution. After some vigorous activity on convergence diagnostics in the early 1990s (reviewed in Cowles & Carlin 1996), consensus has largely settled on the Gelman-Rubin diagnostic (Gelman & Rubin 1992), which is based on starting parallel Markov chains from widely dispersed starting points. The degree of convergence can then be assessed by comparing within-chain and between-chain variation, with a summary statistic R̂ that tends to one as the Markov chains converge. The Gelman-Rubin diagnostic has evolved over the years, notably with the introduction of the split-R̂ statistic that separately compares the beginning and the end of each chain, and it continues to be refined (Vats & Knudson 2021, Vehtari et al. 2021).
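A minimal sketch of the split-R̂ computation (an illustration of the idea only; production implementations such as that described by Vehtari et al. 2021 add further refinements such as rank normalization):

```python
import math
import random

def split_rhat(chains):
    """Split-R-hat for a scalar parameter: split each chain in half, then
    compare within-chain and between-chain variation. Values near 1 suggest
    the chains are exploring the same distribution."""
    halves = []
    for c in chains:
        n = len(c) // 2
        halves.append(c[:n])
        halves.append(c[n:2 * n])
    m, n = len(halves), len(halves[0])
    means = [sum(h) / n for h in halves]
    grand = sum(means) / m
    B = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)     # between-chain variance
    W = sum(sum((x - mu) ** 2 for x in h) / (n - 1)
            for h, mu in zip(halves, means)) / m                 # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return math.sqrt(var_plus / W)

rng = random.Random(11)
# Four chains drawn from the same distribution: R-hat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
# Four chains stuck around widely separated means: R-hat well above 1.
stuck = [[rng.gauss(2.0 * k, 1.0) for _ in range(2000)] for k in range(4)]

rhat_mixed = split_rhat(mixed)
rhat_stuck = split_rhat(stuck)
```

When the chains have not mixed over the same distribution, the between-chain variance inflates R̂ well above one.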

BUGS
BUGS (Bayesian inference Using Gibbs Sampling) is a long-running project to develop Bayesian modeling software. The BUGS project began in 1989 at the Medical Research Council (MRC) Biostatistics Unit in Cambridge, United Kingdom.
The BUGS developers maintain that the BUGS project was independent of Gelfand & Smith's (1990) paper, which appeared around the time of the launch of the BUGS project. It was in fact inspired by earlier developments in artificial intelligence. In the 1980s it was understood that expert systems could represent qualitative knowledge in the form of a directed acyclic graph (DAG), or influence diagram, and that uncertainty could be represented by a probability distribution on the graph (Lauritzen et al. 1990). Furthermore, efficient algorithms for updating the probability in the presence of new data were developed (Lauritzen & Spiegelhalter 1988). These ideas, together with an understanding of the value of object-oriented programming, formed the conceptual foundation of the BUGS project.
The BUGS software has evolved through four different versions, which, confusingly, have different names. These versions are described below, but first I describe the key conceptual contribution of the BUGS project to Bayesian modeling, which is the BUGS language.

The BUGS Language and Directed Acyclic Graphs
Central to the success of BUGS has been the BUGS language (Thomas 2006) for specifying graphical models. The BUGS language has been adopted by other projects, such as JAGS and NIMBLE, so it is important to distinguish the language from the software.
The BUGS language is syntactically similar to S, or R, but is declarative, not procedural. Statements in BUGS define the static relationships between variables in the model, not instructions on how to carry out a calculation.
I illustrate the BUGS language with the Eight Schools example, which was originally developed by Rubin (1981) and has since been popularized by Gelman et al. (2014, section 5.5) as a teaching example. The data come from an experiment to see how short-term coaching could improve scores on a standardized school test. The coaching was applied across J = 8 schools with varying results.
The estimates of the coaching effect y 1 . . . y J from the eight schools have known standard errors σ 1 . . . σ J , which vary across schools because varying numbers of students take the training in each school. The aim is to estimate the average effect, accounting for the heterogeneous information coming from the participating schools.
The model used to analyze the Eight Schools data is an example of a normal-normal hierarchical model,

y_j ~ N(θ_j, σ_j²), θ_j ~ N(µ, τ²),   5.

where θ_j is a random effect representing the effect of training in school j. The same model has an alternative noncentered parameterization, which is used as the basis of the BUGS model:

y_j ~ N(θ_j, σ_j²), θ_j = µ + τ η_j, η_j ~ N(0, 1).   6.
Listing 1 shows the BUGS code for the Eight Schools example. The code is an almost exact translation of the mathematical description of the noncentered parameterization in Equation 6, except that we must also supply proper prior distributions for mu and tau.

Listing 1 BUGS code for the Eight Schools example
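A plausible sketch of the model in the BUGS language, consistent with the surrounding description (the uniform prior on tau and the node names theta and eta are assumptions; the original listing may differ):

```
model {
  for (j in 1:J) {
    y[j] ~ dnorm(theta[j], precision.y[j])
    precision.y[j] <- 1 / (sigma[j] * sigma[j])
    theta[j] <- mu + tau * eta[j]
    eta[j] ~ dnorm(0, 1)
  }
  mu ~ dnorm(0, 1.0E-6)
  tau ~ dunif(0, 100)
}
```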
The BUGS language uses an S-like syntax to define the model in terms of a series of relations. Relations may be of two types. A stochastic relation, denoted by a tilde (~), defines a random variable on the left-hand side that is parameterized in terms of the quantities on the right. A logical relation, denoted by a left arrow (<-), defines a quantity on the left-hand side that is a deterministic function of the quantities on the right.
Listing 1 shows that relations indexed by j appear inside a for loop. Despite appearances, the for loop is not a control flow statement but may be considered a macro that expands the contents inside the curly braces J times, replacing index j with every integer value from 1 to J.

Figure 2
Directed acyclic graph for the Eight Schools model. The graph was created with the WinBUGS Doodle tool, which allows users to interactively create a graphical model.
The normal distribution in the BUGS language (dnorm) is parameterized in terms of the mean and the precision, which is the reciprocal of the variance. This was an early language design decision that simplified the use of conjugate prior distributions (a gamma prior on the precision of a normal random variable is conjugate). If the user wishes to define a weakly informative prior for a normal random variable, then it should have a low precision, as observed in the small value of 10^−6 assigned to the precision parameter of mu. For the outcome variable y[j], the standard deviation parameter sigma[j] is transformed into a precision parameter precision.y[j], which is then passed as the second argument of the dnorm distribution.
The BUGS language is largely agnostic about what is data and what is a parameter. There are some restrictions, however. Any quantity that does not appear on the left-hand side of a relation must be supplied as data. In the Eight Schools example (Listing 1), these quantities are J, used as the limit of the for loop, and sigma[1]. . . sigma[J], used to define the precision of y[1]. . . y[J]. Apart from this restriction, the BUGS language makes no distinction between data and parameters. This distinction is made when the data are supplied to the model. For the Eight Schools example, we may choose not to supply data values for y[1:J], and in this case the parameters are sampled from their prior distribution. If we supply data values for y[1:J], then BUGS will choose appropriate samplers for the unobserved stochastic variables and sample them from their posterior distribution given y[1:J].
The BUGS language description defines a DAG, as shown in Figure 2, with directed edges leading from quantities on the right-hand side of a relation to the quantity on the left. The BUGS program takes the description of the model and translates it into a virtual graphical model (VGM) in computer memory. The joint prior distribution of the variables v_1, . . ., v_n in the model factorizes as

p(v_1, . . ., v_n) = ∏_i p(v_i | Parents(v_i)),

where Parents(v_i) denotes the set of parents of node v_i.

Figure 3
Deriving a conditional independence graph (right) from a directed acyclic graph (left). Nodes A and C are "married" in the conditional independence graph because they have a common child node, B.
In order to determine the conditional independence relationships for Gibbs sampling, the DAG must be translated into a CIG. This process involves three steps. First, we marginalize out any deterministic nodes so that only stochastic nodes remain. Second, we draw an undirected edge between two nodes with a common stochastic child. Third, we replace all directed edges in the graph with undirected edges. This step is known as moralizing the graph because it marries parents. This process is illustrated for a simple graph in Figure 3.
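The moralization step can be sketched in Python (a hypothetical illustration; the input DAG is assumed to contain stochastic nodes only, i.e., deterministic nodes have already been marginalized out):

```python
def moralize(parents):
    """Convert a DAG of stochastic nodes, given as {node: set of parents},
    into a conditional independence graph, returned as a set of undirected
    edges (each edge is a frozenset of two node names)."""
    edges = set()
    for child, ps in parents.items():
        ps = sorted(ps)
        # Replace each directed edge parent -> child with an undirected edge.
        for p in ps:
            edges.add(frozenset((p, child)))
        # "Marry" every pair of parents that share this child.
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                edges.add(frozenset((ps[i], ps[j])))
    return edges

# The example from Figure 3: A -> B <- C, so A and C are married.
cig = moralize({"A": set(), "C": set(), "B": {"A", "C"}})
```

On the Figure 3 example this yields three undirected edges: A-B, C-B, and the marriage edge A-C.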

Classic BUGS
The first version of the BUGS program, retrospectively named classic BUGS, was written in Modula-2 and ran on MS-DOS and Unix (Gilks et al. 1994, Spiegelhalter et al. 1996a). It included a scripting language for running a model as well as facilities for reading and writing data files. Input data could be prepared in S-PLUS, the commercially available distribution of S, and then written to file using the dump() function. Conversely, output files from BUGS could be read into S-PLUS and analyzed with a suite of functions called CODA (Convergence Diagnostics and Output Analysis), which was later developed into an R package (Plummer et al. 2006).
The workhorse algorithm for classic BUGS was adaptive rejection sampling (ARS) (Gilks & Wild 1992). ARS draws a sample from a univariate distribution with a log concave density. It works by first constructing a piecewise-linear hull around the log density that is easy to sample from. Then, rejection sampling is used to determine whether to accept the sampled value on the basis of how closely the hull approximates the log density. As the name suggests, the algorithm is adaptive: Rejected sampling points are used to improve the accuracy of the hull approximation so that the probability of a successful draw increases with every rejection.
An expert system built into classic BUGS recognized situations where the full conditional distribution of a node was log concave and applied the ARS algorithm. This ability to identify and sample log-concave distributions enabled Gibbs sampling on a graph without the user being restricted to conjugate priors, the strong condition that had previously restricted the application of Bayesian models. The algorithm was later generalized to adaptive rejection Metropolis sampling (ARMS), which added a Metropolis-Hastings acceptance step when the full conditional is not log concave, although ARMS is efficient only in situations when the log density is close to being log concave (Gilks et al. 1995).
In addition to providing the software, the BUGS authors provided a rich set of examples from the contemporaneous literature illustrating the use of BUGS (Spiegelhalter et al. 1996b,c). Many of these examples were taken from Breslow & Clayton (1993), who assembled a set of examples to illustrate approximate inference for generalized linear mixed models (GLMMs). These worked examples provided templates for new users to get started with modeling in BUGS by adapting an existing example to their own problems.

WinBUGS
In 1996, the BUGS project moved to Imperial College London, where the second generation of BUGS was developed. WinBUGS (Lunn et al. 2000) was written in Component Pascal and built with the Black Box Component Builder, originally a commercial product from Oberon Microsystems that is now available as open source software (see https://blackboxframework.org). As its name suggests, WinBUGS was designed to run on Windows, and it had a rich graphical user interface (GUI). Initially, the user was required to use the GUI to run the model, but that changed in later versions, when a scripting language was introduced. An interface to the R language was developed in the form of the R2WinBUGS package (Sturtz et al. 2005).
A distinguishing feature of the Black Box Component Builder is its focus on compound documents, which may contain a richer set of entities than text alone. This feature was used in the WinBUGS manual (Spiegelhalter et al. 2003), where, for each example, a single compound document contains the model description, the data, initial values, and a human-readable rich-text description. This might be considered an early example of reproducible research. However, Black Box compound documents did not gain widespread use in statistics outside of WinBUGS.
The modular structure of WinBUGS allowed a richer set of samplers than the classic BUGS software and led to the development of several extensions for specialized tasks:
■ The WinBUGS Differential Interface for models defined by differential equations.
■ The WinBUGS Developer Interface, which enabled users to define their own functions and hard-code them into WinBUGS (Lunn 2003, Wetzels et al. 2010).
Some of these extensions improve on the flexibility of the BUGS language by allowing users to define special classes of models more easily while retaining the flexibility of WinBUGS. Some extensions provide specialized sampling routines to allow for more efficient sampling. The final version of WinBUGS was 1.4.3. It is still available at the MRC Biostatistics Unit website but is no longer actively developed.

OpenBUGS
OpenBUGS (see https://www.openbugs.net) was the third incarnation of the BUGS software. It began in 2004, when lead developer Andrew Thomas moved to Helsinki. OpenBUGS was the first open source version of BUGS and was released under the GNU General Public License, version 2. OpenBUGS is not limited to the Windows operating system; it also runs on Linux using a command line interface rather than a GUI. An interface to R is provided by the BRugs package.
OpenBUGS was developed in parallel with WinBUGS, with the rationale that WinBUGS was the stable release version while OpenBUGS was the experimental development version subject to incompatible changes from one release to another. OpenBUGS was considered stable enough to be a reliable replacement for WinBUGS from version 3.0.7. OpenBUGS 3.2.3 was the final release of OpenBUGS.
The GeoBUGS extension to OpenBUGS (Thomas et al. 2004) includes some features specifically for geospatial models. However, the WinBUGS extensions described in the previous section were not ported over to OpenBUGS.

MultiBUGS
The current version of BUGS is MultiBUGS (Goudie et al. 2020), which was designed specifically for a multicore computing environment. Previous generations of software could count on Moore's law, which posited a doubling of the number of transistors every 18 months, thereby guaranteeing increased performance on new hardware without fundamental changes to the software. In the mid-2000s, Moore's law ran up against the limitations of clock speed and power consumption, so hardware design switched to providing multicore processors. These offer improved performance only if the software is designed for a multicore environment.
Gibbs sampling can be parallelized in three ways. The first way is to run parallel chains on different cores. The second way involves parallelizing the log density calculations that occur at each step of the MCMC algorithm. Many MCMC algorithms require the calculation of the log density of the variables being sampled. This density is typically a sum of separate terms, and when the number of terms is large, there may be a speed benefit in doing these calculations in parallel. The third way to parallelize MCMC calculations is based on parallel updating of different nodes in the same model. This can be done if the CIG can be partitioned into sets of nodes {S 1 . . . S K } such that no nodes in S k have an edge between them for k = 1. . . K.
Partitioning the CIG in this way is an example of a vertex coloring problem. Figure 4 illustrates a coloring of the lattice from Figure 1. Only K = 2 colors are required to separate nodes that are conditionally independent of one another. The Gibbs sampling algorithm can alternate between updating nodes of the first color in parallel and nodes of the other color in parallel.
The regular structure of the lattice in Figure 4 makes the coloring problem easy. In general, vertex coloring is computationally hard. For K > 2 it is NP-complete to determine whether a graph admits a K-coloring (Garey et al. 1976). However, a CIG in a BUGS model is derived from a DAG with a rich underlying structure that can be used to simplify the problem. Goudie et al. (2020) note that each node in the DAG has a depth defined by the shortest distance from a root node (one with no parents). Nodes with the same depth can be updated in parallel if they have no common child nodes.
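The depth-based strategy described by Goudie et al. (2020) can be sketched as follows (a hypothetical illustration with made-up node names; in a full implementation one would also check that nodes at the same depth share no common children):

```python
def depths(parents):
    """Depth of each node in a DAG, given as {node: list of parents}:
    0 for root nodes (no parents), otherwise one more than the minimum
    depth of the parents (shortest distance from a root)."""
    depth = {}
    def d(v):
        if v not in depth:
            ps = parents[v]
            depth[v] = 0 if not ps else 1 + min(d(p) for p in ps)
        return depth[v]
    for v in parents:
        d(v)
    return depth

# Hypothetical hierarchical model: mu and tau are roots, each theta[j]
# depends on both, and each y[j] depends on its own theta[j].
dag = {"mu": [], "tau": [],
       "theta1": ["mu", "tau"], "theta2": ["mu", "tau"],
       "y1": ["theta1"], "y2": ["theta2"]}
depth = depths(dag)

# Group nodes by depth: candidates for parallel updating.
groups = {}
for v, k in depth.items():
    groups.setdefault(k, []).append(v)
```

Here theta1 and theta2 sit at the same depth and have no common child, so they are candidates for simultaneous updating on different cores.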

Figure 4
The lattice admits a 2-coloring, such that all dark gray nodes are conditionally independent of one another and can be updated in parallel. Likewise, all the light gray nodes can be updated in parallel.
In a modern high-performance computing environment with many cores, the three forms of parallelization can yield important improvements in run time. Goudie et al. (2020) cite the example of a large GLMM with 425,112 observations and 20,426 random effects. Using 48 cores, MultiBUGS was faster than OpenBUGS by a factor of 64.

JAGS
JAGS (Just Another Gibbs Sampler) is a clone of BUGS that has a completely independent code base but aims for similar functionality (Plummer 2017). JAGS is written in C++ and runs on Windows, MacOS, and Linux. It is published under the GNU General Public License, version 2. JAGS would not have been possible without the R math library, which provides high-quality algorithms for random number generation and calculation of quantities associated with probability distributions. At least four interfaces between JAGS and R exist (Su & Yajima 2015, Denwood 2016, Kellner 2021, Plummer 2021). JAGS has a modular structure. A module may be loaded at run time that adds new functions and distributions to the BUGS language. A module may also define new sampling methods and new monitors, which are ways of summarizing the streaming data generated by the Markov chain. Modules contributed by third parties include the wiener module (Wabersich & Vandekerckhove 2014) for inference on Wiener processes and the pexm module (Mayrink et al. 2021) for models using piecewise-exponential distributions.
The workhorse sampling method for JAGS is slice sampling (Neal 2003), which can be applied to both continuous- and discrete-valued nodes. The glm module, which comes with the JAGS distribution, incorporates efficient samplers for GLMMs. These samplers are based on the principle of data augmentation, a commonly used technique to simplify the structure of a graphical model by adding new nodes (Hobert 2011). In this case, data augmentation reduces GLMMs with binary outcomes (Albert & Chib 1993, Holmes & Held 2006, Polson et al. 2013) or binomial and Poisson outcomes (Frühwirth-Schnatter et al. 2009) to a linear model with normal outcomes.
This reduction to a normal linear model allows block updating of all the parameters in the linear predictor, which is much more efficient than Gibbs sampling. The underlying engine for the linear model uses sparse matrix algebra (Davis 2006), which handles fixed and random effects simultaneously.
Block updating solves one of the disadvantages of Gibbs sampling. It is strongly dependent on the parameterization of the model. If two variables have a high posterior correlation but are updated independently using Gibbs sampling, then the Markov chain will exhibit high autocorrelation for both variables. Practitioners can learn to recognize situations where high posterior correlation is likely to occur and reparameterize the model to avoid the problem. A better solution is to do block updating of correlated parameters so that they move together.
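The effect of block updating on mixing can be illustrated numerically (a hypothetical sketch, not code from any of the packages discussed): single-site Gibbs on a bivariate normal with correlation ρ = 0.99 produces a chain with lag-1 autocorrelation near ρ², whereas drawing the pair jointly produces essentially independent draws.

```python
import math
import random

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a scalar chain."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    return sum((xs[t] - m) * (xs[t + 1] - m) for t in range(n - 1)) / n / var

rho = 0.99
s = math.sqrt(1.0 - rho * rho)
rng = random.Random(5)

# Single-site Gibbs on a highly correlated bivariate normal: slow mixing.
x, y, gibbs_x = 0.0, 0.0, []
for _ in range(20_000):
    x = rho * y + s * rng.gauss(0.0, 1.0)
    y = rho * x + s * rng.gauss(0.0, 1.0)
    gibbs_x.append(x)

# Block update: draw (x, y) jointly, so successive draws are independent.
block_x = []
for _ in range(20_000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    block_x.append(z1)            # x = z1
    y = rho * z1 + s * z2         # y | x, drawn in the same joint move

gibbs_acf = lag1_autocorr(gibbs_x)   # near rho**2 = 0.9801: poor mixing
block_acf = lag1_autocorr(block_x)   # near 0: independent draws
```

The correlated parameters barely move under single-site updating but mix immediately when updated as a block.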

NIMBLE
NIMBLE (Numerical Inference for statistical Models for Bayesian and Likelihood Estimation) is another engine for Bayesian modeling (de Valpine et al. 2017, 2022). NIMBLE uses a dialect of the BUGS language with an extended syntax. One of these extensions is the use of named arguments for distributions. This feature allows distributions to have multiple parameterizations, freeing the user from the historical restriction to parameterizations that were convenient for conjugacy but not convenient for understanding (de Valpine et al. 2022).
The extended BUGS syntax can be illustrated with the Eight Schools example (Listing 2), which can be compared with the BUGS version (Listing 1). In the NIMBLE version, the user can specify directly that σ i is the prior standard deviation of y i , removing the need for a transformation.
Likewise, the prior distribution of µ is defined in terms of its large standard deviation rather than its small precision.

Listing 2 NIMBLE BUGS code for the Eight Schools example
NIMBLE is built on top of R, and it brings the flexibility and transparency of the R language to Bayesian modeling. Model objects can be queried and modified, and the choice of samplers can be customized. NIMBLE users can also extend the scope of the BUGS language by writing their own functions and distributions. New samplers can also be defined in a simplified version of the R language.
Running MCMC models in interpreted R code is too slow to be practically useful. To overcome this limitation, NIMBLE has a back end that converts R code to C++ and then compiles it into object code. The details of this compilation process are hidden from the user, who continues to use the R interface and does not need any understanding of C++. The NIMBLE compiler is not tied to the use of Bayesian models but can be used to speed up any numerical code written in R, as long as it is compatible with the simplified version of R supported by the compiler. The NIMBLE compiler allows familiar control flow statements from R and supports linear algebra through the Eigen library (Guennebaud et al. 2010).
NIMBLE's modular design creates a clean separation between the front end and the back end. Models defined in the BUGS language do not have to be analyzed using MCMC for inference but can in principle be analyzed with any algorithm. Sequential Monte Carlo algorithms (bootstrap particle filter, auxiliary particle filter, ensemble Kalman filter, iterated filtering 2, and particle MCMC) are provided in the nimbleSMC package (Michaud et al. 2021, NIMBLE Dev. Team 2021).

GRADIENT-BASED METHODS
Gradient-based methods for MCMC may be motivated by analogy with optimization problems. Suppose that we are interested only in finding the posterior mode, not the full posterior distribution. In this case, we could use an optimization method to find the maximum value of the log density. This could be done with direct search optimization methods, which rely only on function evaluations. However, it is more efficient to use optimization methods that use the gradient of the log density, such as gradient ascent.
Returning to the problem of finding the full posterior distribution, MCMC methods that rely only on evaluating the log density without using gradient information are inefficient. This inefficiency manifests as random walk behavior, where the Markov chain tends to go back over a previously explored area of the sample space instead of moving to a new one, thus providing no new information about the posterior distribution while using computation time.
Random walk behavior can cause long excursions into the tail of the distribution, where the density is low and the chain is supposed to spend only a small proportion of its time. When tail excursions occur, practical remedies such as extending the run time and increasing the thinning interval will not, in general, be effective, as the required run time will be too long to be practically useful.
Gradient-based MCMC methods rely on evaluating the gradient of the log density at each step. The implementation of gradient-based methods relies heavily on automatic differentiation (autodiff ), a methodology from computer algebra for efficiently calculating the derivatives of functions expressed as computer programs (Baydin et al. 2018). Briefly, autodiff takes a function that evaluates the log density, analyzes the syntax of the function body, and creates a new function that returns the gradient. This process needs to be done only once, and then the gradient function can be called at each iteration. Autodiff works with functions of multiple arguments and can therefore calculate the partial derivatives of the log density with respect to all continuous parameters, allowing block updating of all of these parameters.
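The principle of automatic differentiation can be illustrated with a toy forward-mode implementation using dual numbers. This is a sketch only: production systems such as the one in Stan use reverse-mode autodiff over an expression graph, but the key idea, turning a function that evaluates the log density into a function that evaluates its gradient, is the same. All names here (Dual, grad, log_density) are illustrative, not part of any real library:

```python
class Dual:
    """Minimal forward-mode autodiff: carries a value and its derivative together."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__
    def __sub__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val - o.val, self.dot - o.dot)
    def __rsub__(self, o):
        return Dual(o) - self

def grad(f):
    """Return a function computing the gradient of f: R^d -> R by seeding
    each input coordinate with derivative 1 in turn."""
    def g(x):
        out = []
        for i in range(len(x)):
            args = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(x)]
            out.append(f(args).dot)
        return out
    return g

# Log density (up to a constant) of N(mu, 1) observations y, as a function of mu
y = [1.0, 2.0, 3.0]
def log_density(theta):
    mu = theta[0]
    return sum(-0.5 * (yi - mu) * (yi - mu) for yi in y)

grad_log_density = grad(log_density)
print(grad_log_density([0.0]))   # analytic gradient at mu = 0 is sum(y - mu) = 6
```

The transformation is done once, and the resulting gradient function can then be called at every MCMC iteration.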

Langevin Diffusions
Langevin diffusions are continuous-time Markov processes based on the Langevin stochastic differential equation

dθ_t = ∇ log p(θ_t) dt + √2 dW_t,   (8)

where W_t is a Wiener process or Brownian motion. A Langevin diffusion has p(θ) as its stationary distribution but has a tendency to drift toward areas of higher density. The drift term ∇ log p(θ) dt suppresses some, but not all, of the random walk behavior contributed by the Brownian motion. For MCMC, a discrete-time approximation to the Langevin dynamics is used with a time step of size τ:

θ_{t+1} = θ_t + τ ∇ log p(θ_t) + √(2τ) ε_t,   ε_t ∼ N_d(0, I).

This approximation requires the use of a Metropolis-Hastings acceptance step after each move in order to correct for any errors in the discrete-time approximation. Therefore, the corresponding algorithm is called the Metropolis-adjusted Langevin algorithm (MALA). MALA has a single parameter, the step size τ, which needs to be tuned for optimal performance. Note that the aim is not to accurately reproduce the Langevin dynamics of Equation 8 but rather to efficiently explore the parameter space while preserving p(θ) as the stationary distribution. Therefore, there is a compromise between large values of τ, which permit larger moves at each time step, and smaller values of τ, which have higher acceptance probabilities. In high dimensions, the optimal acceptance rate is 0.574, and τ should be adaptively tuned to attain this target (Roberts & Rosenthal 1998). MALA has been somewhat superseded by Hamiltonian Monte Carlo, which contains MALA as a special case.
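MALA is compact enough to sketch directly. The code below (plain Python/NumPy; the standard bivariate normal target and the step size τ = 0.4 are illustrative choices) proposes one Euler step of the Langevin dynamics, θ + τ ∇ log p(θ) plus N(0, 2τI) noise, and applies the Metropolis-Hastings correction, which must account for the asymmetry of the proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(theta):
    return -0.5 * np.dot(theta, theta)   # target: standard normal, up to a constant

def grad_log_p(theta):
    return -theta

def mala(theta0, tau, n_iter):
    """Metropolis-adjusted Langevin algorithm: Langevin proposal + MH correction."""
    theta = np.asarray(theta0, dtype=float)
    d = len(theta)
    samples = np.empty((n_iter, d))
    accepted = 0
    for t in range(n_iter):
        # One Euler step of the Langevin dynamics, plus Gaussian noise
        mean_fwd = theta + tau * grad_log_p(theta)
        prop = mean_fwd + np.sqrt(2 * tau) * rng.standard_normal(d)
        # The proposal is not symmetric, so the MH ratio includes the q terms
        mean_rev = prop + tau * grad_log_p(prop)
        log_q_fwd = -np.dot(prop - mean_fwd, prop - mean_fwd) / (4 * tau)
        log_q_rev = -np.dot(theta - mean_rev, theta - mean_rev) / (4 * tau)
        log_alpha = log_p(prop) - log_p(theta) + log_q_rev - log_q_fwd
        if np.log(rng.uniform()) < log_alpha:
            theta = prop
            accepted += 1
        samples[t] = theta
    return samples, accepted / n_iter

samples, acc_rate = mala(np.zeros(2), tau=0.4, n_iter=20000)
print(f"acceptance rate {acc_rate:.2f}")
print(f"sample mean {samples.mean(axis=0)}, sample sd {samples.std(axis=0)}")
```

In practice τ would be tuned adaptively toward the 0.574 acceptance target rather than fixed by hand.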

Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC), or hybrid Monte Carlo (Duane et al. 1987, Neal 2011, Betancourt 2017), is based on Hamiltonian dynamics for physical systems. In the statistical implementation of Hamiltonian dynamics, the parameter space θ ∈ R^d is supplemented with a set of momentum parameters ϕ ∈ R^d. The joint coordinates (θ, ϕ) define the state of the dynamical system in phase space.
The Hamiltonian

H(θ, ϕ) = V(θ) + K(ϕ)

represents the total energy of the system as the sum of potential energy V(θ) = −log p(θ) and kinetic energy K(ϕ) = ϕᵀ M⁻¹ ϕ / 2. The mass matrix M is usually taken to be the identity matrix I_d.
The evolution of the dynamical system in time is determined by the Hamiltonian equations

dθ/dt = ∂H/∂ϕ = M⁻¹ ϕ,
dϕ/dt = −∂H/∂θ = ∇ log p(θ),

which determine the transfer of energy between potential and kinetic energy while keeping the total energy H constant. Hamiltonian dynamics are time reversible and preserve volumes in phase space.
Hamiltonian dynamics can be used for MCMC. At the start of each iteration, new momentum variables are drawn at random, ϕ ∼ N_d(0, M). Then the process evolves according to Hamiltonian dynamics for a given time T. The transition in the position parameters θ is a reversible MCMC transition in detailed balance with p(θ). Refreshing the momentum variables at the start of each iteration ensures that the chain is ergodic. Without refreshment, Hamiltonian dynamics is periodic and will follow a closed loop back to its starting point.
As with the Langevin diffusion, Hamiltonian dynamics in continuous time cannot be exactly simulated on a computer. For this reason, a discretized version is used for MCMC, which alternates updates to the position parameters (Equation 9) and the momentum parameters (Equation 10) in discrete time steps of size h:

θ ← θ + h M⁻¹ ϕ,   (9)

ϕ ← ϕ + h ∇ log p(θ).   (10)

Each HMC move runs for L steps, simulating Hamiltonian dynamics for time T = Lh. The discretization process introduces errors, of which the most important is that the discretized version of Hamiltonian dynamics does not preserve volumes in phase space. As a consequence, the discretized Hamiltonian dynamics will either collapse to a single point or tend to infinity (Neal 2011). This serious problem can be solved using the Störmer-Verlet or leapfrog integrator, in which each HMC update begins and ends with a half-update of the momentum parameters (Equation 10) with step size h/2 instead of h (Neal 2011). The shortest leapfrog path, which contains a single update to the position parameters (Equation 9), is equivalent to MALA when M = I_d.

HMC has two parameters: the step size h and the number of steps L. The step size h needs to be sufficiently small to ensure a high acceptance probability, but not so small that the costly density and gradient calculations are wasted without advancing the dynamics. The current standard is to choose h during an adaptation phase to reach a target acceptance probability of 0.65, derived by Beskos et al. (2013) for the high-dimensional limit of HMC.
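A single HMC transition with the leapfrog integrator can be sketched in a few lines. The code below (plain Python/NumPy; the standard normal target in five dimensions, h = 0.25, and L = 10 are illustrative choices) refreshes the momentum, runs a leapfrog trajectory with half-steps for the momentum at the start and end, and accepts or rejects based on the change in total energy H:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_p(theta):
    return -0.5 * np.dot(theta, theta)   # target: standard normal, up to a constant

def grad_log_p(theta):
    return -theta

def hmc_step(theta, h, L):
    """One HMC transition with M = I_d: refresh momentum, run L leapfrog steps,
    then accept or reject based on the change in total energy."""
    phi0 = rng.standard_normal(len(theta))
    theta_new, phi = theta.copy(), phi0.copy()
    phi = phi + 0.5 * h * grad_log_p(theta_new)       # initial momentum half-step
    for step in range(L):
        theta_new = theta_new + h * phi               # full position step (Equation 9)
        if step < L - 1:
            phi = phi + h * grad_log_p(theta_new)     # full momentum step (Equation 10)
    phi = phi + 0.5 * h * grad_log_p(theta_new)       # final momentum half-step
    h_old = -log_p(theta) + 0.5 * np.dot(phi0, phi0)
    h_new = -log_p(theta_new) + 0.5 * np.dot(phi, phi)
    if np.log(rng.uniform()) < h_old - h_new:
        return theta_new, True
    return theta, False

theta = np.zeros(5)
draws = np.empty((5000, 5))
n_acc = 0
for i in range(5000):
    theta, acc = hmc_step(theta, h=0.25, L=10)
    n_acc += acc
    draws[i] = theta
acc_rate = n_acc / 5000
print(f"acceptance rate {acc_rate:.2f}")
```

Because the leapfrog integrator nearly conserves H, the acceptance rate is high even though each trajectory moves a long way through the parameter space.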
The choice of integration time T is considerably harder, especially when the variance of the target distribution is heterogeneous between the different parameters θ_1, …, θ_d. In this case, use of a constant T can lead to arbitrarily poor mixing (Neal 2011). This problem can be mitigated by drawing T from an exponential distribution (Bou-Rabee & Sanz-Serna 2017).
The breakthrough in adaptively choosing the integration time T came from the No U-Turn Sampler (NUTS) (Hoffman & Gelman 2014). NUTS exploits the periodicity of Hamiltonian dynamics to find a maximum integration time T. To maximize the efficiency with which HMC explores the parameter space, the path should go as far as possible without turning back on itself, that is, without executing a U-turn. NUTS can be thought of as a stopping rule for HMC.
The subtlety of the NUTS algorithm lies in the way it preserves reversibility. From the current point, NUTS explores discretized Hamiltonian dynamics (Equations 9 and 10) in both forward and backward directions, randomly choosing the direction and doubling the number of leapfrog steps at each iteration. This process continues until the ends of the path start to turn back toward each other. The next NUTS sample is drawn at random from the points on the path.
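The test for "turning back toward each other" can be written down compactly. The sketch below is a simplification (the full NUTS algorithm applies this check recursively to subtrees of the doubling path, following Hoffman & Gelman 2014): the path with endpoints θ⁻, θ⁺ and momenta ϕ⁻, ϕ⁺ stops expanding once either endpoint is moving back toward the other.

```python
import numpy as np

def u_turn(theta_minus, theta_plus, phi_minus, phi_plus):
    """Stop doubling when either end of the trajectory starts moving back
    toward the other end (simplified NUTS stopping criterion)."""
    delta = theta_plus - theta_minus
    return bool(np.dot(delta, phi_minus) < 0 or np.dot(delta, phi_plus) < 0)

# Endpoints still moving apart: keep doubling
print(u_turn(np.zeros(2), np.ones(2), np.ones(2), np.ones(2)))    # False
# The forward end has turned back toward the backward end: stop
print(u_turn(np.zeros(2), np.ones(2), np.ones(2), -np.ones(2)))   # True
```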
The NUTS path is constructed in such a way that the probability of constructing the given path is the same starting from any one of the points it contains. This uniform distribution ensures reversibility of the chain.

STAN
Stan (Stan Dev. Team 2022), named in honor of the physicist Stanislaw Ulam, who coinvented the Monte Carlo method (Metropolis & Ulam 1949), is the latest general-purpose software for Bayesian modeling. Stan is designed for scalability and speed and was written from the start to exploit a multicore computing environment. Stan is written in C++ and published under the 3-clause BSD license.
Stan takes a different approach to model building from BUGS, NIMBLE, and JAGS by using a different language to define the model. Rather than creating a VGM, the Stan language defines the terms contributing to the log density of the joint distribution of the parameters, log p(θ). Users can use built-in density functions for common distributions, but they can also define their own functions in the Stan language to calculate contributions to the log density.
Listing 3 shows the Eight Schools example coded in the Stan language. Unlike the BUGS language, the Stan language is very explicit about the distinction between data and parameters, which must be declared in separate blocks. Deterministic functions of other variables are declared in a third block. Each variable declaration declares the data type, possibly including bounds, and the dimensions for vector or matrix quantities.
Although the declarations make the Stan code much longer than the BUGS code, the definition of the model itself is very succinct, due to the vectorization of the Stan language. The model is defined by three lines. Two lines in the model block show the contributions of the variables η and y to the log density (target), and the declaration of θ also defines it as a linear function of µ, τ , and η.

Listing 3 Stan code for the Eight Schools example
Another difference between the BUGS and Stan code is that Stan does not require all parameters to have a proper prior distribution. In the Eight Schools example, there is no prior distribution for the parameters mu and tau. Implicitly, these parameters have an improper flat prior.
Stan uses a C++ compiler to compile the model description into object code, which can then be executed with any data set as input. This property distinguishes Stan from BUGS and JAGS, which need to reinterpret the model description and create a new VGM if there are any changes to the data. This difference makes Stan ideal for MCMC analysis of simulated data, as the expensive compilation step needs to be run only once.
The main inference engine for Stan is HMC with NUTS. The HMC/NUTS algorithm is used to update all parameters of the model simultaneously. In contrast, BUGS, NIMBLE, and JAGS may use different sampling methods for different parameters. As a consequence of this approach, Stan does not admit discrete-valued parameters. Models that use discrete-valued parameters in the BUGS language may be rewritten in Stan to remove the discrete parameters by marginalization.
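The marginalization trick can be illustrated with a two-component normal mixture, where the discrete parameter is the component indicator of each observation. In Stan such a model is coded by summing the indicator out of the likelihood; the sketch below (plain Python, illustrative rather than Stan code, with arbitrary data and parameter values) computes that marginal log likelihood, using a log-sum-exp to keep the computation numerically stable:

```python
import math

def log_normal_pdf(y, mu, sigma):
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def log_sum_exp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_lik_marginal(y, w, mu1, mu2, sigma):
    """Log likelihood of a two-component mixture with the discrete component
    indicators summed out: log p(y_i) = log(w N(y_i|mu1) + (1-w) N(y_i|mu2))."""
    total = 0.0
    for yi in y:
        total += log_sum_exp(math.log(w) + log_normal_pdf(yi, mu1, sigma),
                             math.log(1 - w) + log_normal_pdf(yi, mu2, sigma))
    return total

y = [-2.1, -1.9, 2.0, 2.2]
print(log_lik_marginal(y, w=0.5, mu1=-2.0, mu2=2.0, sigma=0.5))
```

With the indicators removed, every remaining parameter is continuous and the model can be sampled by HMC.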
Like NIMBLE, Stan allows different forms of inference. In addition to NUTS for MCMC, Stan allows penalized maximum-likelihood estimation as well as approximate Bayesian inference with variational inference (Blei et al. 2016).
Stan has interfaces to R (Stan Dev. Team 2021) and Python (Riddell et al. 2021), among others. In addition, the rstanarm (Goodrich et al. 2020) and brms (Bürkner 2017) packages allow users to define regression models using the compact Wilkinson-Rogers-style notation used by the (g)lm and (g)lmer functions and still have access to the Stan HMC sampler as a back end.

INLA
INLA (integrated nested Laplace approximation; Rue et al. 2009) is an alternative methodology for Bayesian inference applicable to a large class of models. Unlike MCMC, INLA is a deterministic approximation to the posterior.
INLA can be used for latent Gaussian models. These are two-level hierarchical models where the observable data y_1, …, y_n are conditionally independent given a set of latent random variables η and possibly a set of hyperparameters θ_1:

p(y | η, θ_1) = ∏_{i=1}^{n} p(y_i | η_i, θ_1).   (11)
The latent variables η are expressed in terms of other latent quantities as a linear predictor,

η_i = β_0 + Σ_{j=1}^{n_b} β_j z_{ji} + Σ_{k=1}^{n_f} f^{(k)}(u_{ki}) + ε_i,   (12)

where β is a vector of coefficients for covariates z_j ∈ R^{n_b}, f^{(1)}, …, f^{(n_f)} are unknown functions of the covariates u_i ∈ R^{n_f}, and ε_1, …, ε_n represent unstructured variation. Let x be the collection of all latent quantities in the model:

x = (η, β, f^{(1)}, …, f^{(n_f)}).

Then, x has a prior Gaussian distribution with mean zero and precision matrix Q(θ_2) depending on hyperparameters θ_2. Practical restrictions are that the full set of hyperparameters θ = (θ_1, θ_2) is of low dimension and the precision matrix Q is sparse. If Equation 12 seems abstract and difficult to understand, that is because it is written in a way that admits the full generality of models that can be analyzed by INLA. This class includes time series models, generalized additive models, generalized additive mixed models, geoadditive models, and univariate volatility models.
INLA provides fast and accurate deterministic approximations to the marginal posteriors p(x i | y) and p(θ | y). It does not provide the joint posterior distribution of x, but the correlations are often not of interest. INLA is distinguished by being exceptionally fast compared with MCMC. Rue et al. (2009, p. 322) suggested that INLA "outperforms MCMC algorithms to such an extent that, for latent Gaussian models, resorting to MCMC sampling rarely makes sense in practice." The R-INLA software (see https://www.r-inla.org) provides an implementation of INLA as an R package. The INLA method has been extremely successful in a wide range of applied problems. Its success illustrates that there is an alternative to the approach taken by most Bayesian software. Instead of trying to provide a universal solution to an unlimited class of Bayesian models, it is possible to conduct inference accurately and efficiently on a prescribed class of models.

PROBABILISTIC PROGRAMMING LANGUAGES
In the last 10 years, a convergence of interests in machine learning and computer science has given rise to the concept of probabilistic programming. This statistical modeling tool takes concepts from programming language design and applies them to the design and analysis of statistical models.
A full survey of probabilistic programming languages is beyond the scope of this review. Notable probabilistic programming languages include Church, based on Scheme (Goodman et al. 2008); Figaro, based on Scala (Pfeffer 2016); and PyMC3, based on Python (Salvatier et al. 2016).
Notably, PyMC3 provides an implementation of the NUTS algorithm for HMC. The Stan developers describe Stan as a probabilistic programming language (Gelman et al. 2015). Arguably, NIMBLE is a probabilistic programming language that extends R.

Robustness Versus Efficiency
The optimal behavior of MCMC methods in high dimensions has been well studied for the case when the joint probability distribution of the parameters θ factorizes into independent and identically distributed (i.i.d.) components (Gelman et al. 1997, Roberts & Rosenthal 1998, Beskos et al. 2013),

p(θ) = ∏_{i=1}^{d} f(θ_i),   (13)

for some density function f(·) common to all dimensions. This scenario is somewhat simplistic, especially when compared with the complex distributions to which MCMC is applied in practice. Nevertheless, it permits analysis of the optimal step size for each algorithm and the way it scales with dimension d. Table 1 summarizes the results. For all methods, the optimal step size decreases with d, so the Markov chain becomes less efficient with increasing dimension of the parameter space. However, the rate at which the optimal step size decreases is slower for MALA compared with random walk Metropolis-Hastings, and slower for HMC than for MALA.

Since the i.i.d. limit in Equation 13 is somewhat artificial, the behavior of all these methods may be very poor in practice. The robustness of MCMC methods has attracted increasing interest. The question of robustness is extremely important for general-purpose software for Bayesian inference, which aims to efficiently solve a large class of applied problems.

The word "robust" is an overloaded word in statistics, and its application to MCMC methods is no different. There are different ways to interpret the notion of robustness in this context. One is to consider which kinds of distributions an MCMC method can efficiently address. For heavy-tailed distributions, MALA and HMC are not geometrically ergodic (Roberts & Tweedie 1996, Livingstone et al. 2019). If the gradient of the log density goes to zero in the tails, then gradient-based methods no longer have useful additional information for efficiently exploring the parameter space, and the random walk behavior that gradient-based methods are meant to suppress reappears. Conversely, for light-tailed distributions, the calculation of the gradients required by MALA and HMC may break down (Roberts & Tweedie 1996, Livingstone et al. 2019), rendering these methods unusable when simpler methods remain applicable.

Another way of thinking about robustness is to consider how well MCMC methods can be adaptively tuned for optimal performance. HMC exhibits poor behavior when the posterior distribution is heterogeneous, with much higher variance in some dimensions than others (Neal 2011). The mixing behavior of the worst component can be arbitrarily poor (Riou-Durand & Vogrinc 2022). In certain applications, this problem can be alleviated by some form of preconditioning. This might be straightforward manipulation of the data, such as standardizing predictor variables, as recommended by Gelman et al. (2014, section 16.3) for logistic regression. Livingstone & Zanella (2022) have proposed an alternative to MALA, called the Barker proposal, which maintains the d^{−1/3} scaling with dimension of MALA but has improved robustness to tuning. The Barker proposal uses a proposal distribution that is skewed in the direction of the gradient of the log density. In contrast, MALA has separate deterministic and random noise terms in each step.

Nonreversible Markov Chain Monte Carlo
Recent developments in MCMC have focused on nonreversible Markov chains as a way of suppressing random walk behavior and thus providing more efficient samples (Neal 1999, Diaconis et al. 2000, Bierkens 2016). Figure 5 shows a bimodal distribution constructed from a mixture of two Gaussians, and Figure 6 shows trace plots for two Markov chains exploring this distribution. The first is a random walk Metropolis-Hastings sampler, which is reversible. The second is a guided walk (Gustafson 1998), which is the same as the random walk except that the direction of each step is the same as in the previous iteration until a proposal is rejected. The guided walk then changes direction at the next iteration. The modified sampler is nonreversible. As shown by the trace plots, the nonreversible sampler is more efficient at switching between the two modes of the distribution. The momentum of the guided walk allows it to cross the gap between the two modes.

Figure 5: A bimodal distribution constructed from a mixture of two Gaussians.

Figure 6: Trace plots for a random walk Metropolis algorithm and a guided walk on the bimodal target density shown in Figure 5. Red horizontal lines indicate the two modes of the distribution.
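The two samplers compared in Figure 6 are easy to reproduce. The sketch below (plain Python/NumPy; the equal mixture of N(−2, 1) and N(2, 1) and the step size are illustrative choices, not the exact target used in the figures) implements both the reversible random walk and the nonreversible guided walk:

```python
import numpy as np

rng = np.random.default_rng(7)

def log_p(x):
    # Bimodal target: an equal mixture of N(-2, 1) and N(2, 1), up to a constant
    return np.logaddexp(-0.5 * (x + 2) ** 2, -0.5 * (x - 2) ** 2)

def random_walk(n_iter, step=1.0):
    x, out = 0.0, np.empty(n_iter)
    for t in range(n_iter):
        prop = x + step * rng.standard_normal()
        if np.log(rng.uniform()) < log_p(prop) - log_p(x):
            x = prop
        out[t] = x
    return out

def guided_walk(n_iter, step=1.0):
    """Gustafson's guided walk: keep stepping in the same direction v until a
    proposal is rejected, then reverse direction. The chain is nonreversible."""
    x, v, out = 0.0, 1, np.empty(n_iter)
    for t in range(n_iter):
        prop = x + v * abs(step * rng.standard_normal())
        if np.log(rng.uniform()) < log_p(prop) - log_p(x):
            x = prop
        else:
            v = -v          # reverse direction on rejection
        out[t] = x
    return out

rw, gw = random_walk(20000), guided_walk(20000)
print(f"random walk visits: [{rw.min():.1f}, {rw.max():.1f}]")
print(f"guided walk visits: [{gw.min():.1f}, {gw.max():.1f}]")
```

Plotting the two chains against iteration number typically reproduces the qualitative pattern of Figure 6, with the guided walk switching modes more freely.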
This is an example of a so-called lifted Markov chain that has been augmented by a velocity variable taking values in {−1, 1} (Chen et al. 1999, Turitsyn et al. 2011). Lifted Markov chains are easier to study in the continuous-time limit when we make the step size smaller and smaller, adjusting the timescale accordingly. Figure 7 shows the effect of reducing the step size by a factor of 10. As the figure suggests, in the limit, the sampler becomes a piecewise-deterministic process. Change points are determined by a nonhomogeneous Poisson process in which the rate depends on the gradient of the log density in the direction of travel. Notably, the rate of change is zero when the process is moving toward a higher-density area. The chain always crosses one of the modes of the distribution before changing direction. There are two multivariate generalizations of this process: the zig-zag sampler (Bierkens et al. 2016) and the bouncy particle sampler (Bouchard-Côté et al. 2018). In the zig-zag sampler, each component changes direction independently, whereas in the bouncy particle sampler, the whole path reflects off a tangent to the log density. Reference implementations of these samplers can be found in the R package RZigZag (Bierkens 2019).
Figure 7: Continuous-time limit of the guided walk shown in Figure 6 as the step size decreases.

The zig-zag and bouncy particle samplers are examples of piecewise-deterministic Markov processes (PDMPs) (Davis 1984). These are continuous-time stochastic processes that evolve deterministically in between change points that occur randomly in time. The events that occur at change points may also be stochastic. For example, the momentum vector may either change in a deterministic way or be refreshed by random sampling to ensure irreducibility. The first use of PDMPs for MCMC simulation was in the computational physics literature (Peters & de With 2012).
Like HMC, PDMPs augment the parameter space with additional parameters representing momentum. The momentum component allows a PDMP to quickly reach the mode of the distribution when the initial value is far away (Bierkens et al. 2016). Unlike HMC, however, the velocity is always constant. The simpler deterministic dynamics of PDMPs are easy to simulate on a computer without approximation error. In particular, PDMPs have no rejection step. Fearnhead et al. (2018) note that PDMPs are a large class of MCMC methods whose potential has not yet been fully explored. A promising feature of PDMPs in the era of big data is that the gradient of the log density, ∇ log p(θ), does not need to be calculated exactly but can be estimated from a subsample of data points. Unlike other MCMC methods, this subsampling of the gradient does not introduce any bias, so the stationary distribution of the PDMP remains the same (Bierkens et al. 2016). This attribute makes PDMPs highly suitable for "tall" data with a very large number of observations.
PDMPs have no tuning parameters. As such, they are highly promising methods for implementation in general-purpose MCMC software. However, there is an additional layer of complexity in implementing PDMPs, which is sampling from a nonhomogeneous Poisson process to find the change-point times. Such sampling is feasible only if a suitably tight upper bound can be found for the gradient of the log density, which determines the rate of the Poisson process. The bound can be piecewise constant and thus needs to be calculated only in the range (t, t + δ) for some fixed value δ. When a bound on the gradient has been determined, the next change point can then be simulated from a time-homogeneous Poisson process and thinned by rejection sampling (Lewis & Shedler 1979). For general-purpose Bayesian software, the outstanding problem is how to compute a bound on the gradient for an arbitrary posterior distribution. This may require the use of optimization methods to find the maximum of the log gradient within the time window (t, t + δ).
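The thinning construction itself is simple to sketch. The code below (plain Python standard library; the rate function λ(t) = t and its bound are arbitrary illustrative choices, not drawn from any particular PDMP) simulates the first change-point time of a nonhomogeneous Poisson process by proposing candidate events from a homogeneous process at the bounding rate and accepting each candidate with probability λ(t) divided by the bound:

```python
import random

random.seed(3)

def first_event_by_thinning(rate, bound, t_max):
    """First event time of a nonhomogeneous Poisson process with intensity
    rate(t) <= bound on [0, t_max], by thinning a homogeneous process."""
    t = 0.0
    while t < t_max:
        t += random.expovariate(bound)          # candidate from the bounding rate
        if t < t_max and random.random() < rate(t) / bound:
            return t                            # candidate accepted
    return None                                 # no event in the window

# Example: linearly increasing intensity lambda(t) = t, bounded by 5 on [0, 5]
times = [first_event_by_thinning(lambda t: t, bound=5.0, t_max=5.0)
         for _ in range(20000)]
times = [t for t in times if t is not None]
mean_t = sum(times) / len(times)
print(f"mean first-event time {mean_t:.2f}")
```

For λ(t) = t, the first-event time has survival function exp(−t²/2), with mean √(π/2) ≈ 1.25, which the simulation reproduces. The tighter the bound, the fewer candidates are wasted, which is why a good gradient bound matters for PDMP efficiency.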
The practical challenge of implementing PDMPs in existing software is that existing frameworks are based on discrete-time MCMC. The latest generation of software, including NIMBLE and Stan, has separated the inference algorithm from the model description and hence could, in principle, incorporate continuous-time Markov processes as an alternative back end to discrete-time MCMC. Alternatively, it may be possible to have a discrete-time analog of PDMPs (Sherlock & Thiery 2022, Bertazzi et al. 2021). However, it is not certain whether the advantages of PDMPs can be retained in discrete time.

SUMMARY
As noted in Section 1, the high-dimensional integrals required for Bayesian inference are not generally tractable. However, several compromises that give approximate solutions are available. For MCMC, the main compromise is the time required to get a sufficient number of samples to accurately approximate the posterior distribution. INLA offers another compromise for the wide class of latent Gaussian models. INLA gives up on estimating the full posterior but provides fast and accurate approximations to the posterior margins. Users who require only a point estimate of the posterior mode may turn to the fast approximations provided by variational Bayes methods (Blei et al. 2016). In summary, different methods offer a range of compromises to applied Bayesian statisticians. The enduring popularity of MCMC suggests that it currently offers the most favorable compromise in many applications. Whether this will remain true as the demand for analyses of larger data sets increases remains to be determined. Green et al. (2015) conclude that approximate methods will be at the heart of future progress in statistical computing. It is notable that many of the theoretical innovations in MCMC methodology have come from applications in physics (Metropolis et al. 1953, Duane et al. 1987, Peters & de With 2012) before being adapted more widely. This observation suggests that methodologists should keep a weather eye on the computational physics literature.
Over the history of Bayesian software for MCMC, there has been a shift away from Gibbs sampling on a graph toward block-updating methods that act simultaneously on all parameters and are capable of scaling to high-dimensional parameter spaces. MCMC methods continue to be developed along these lines. PDMP methods are interesting but have not yet bridged the gap between theoretical development and practice. There are also promising innovations in robustness of MCMC methods that will improve the user experience when implemented in software.
The existence of freely available statistical software is essential for propagating statistical methods. For example, the R language and its associated CRAN repository (see https://cran.r-project.org), as of late 2022, host over 19,000 third-party packages. Among these are reference implementations of some of the methods discussed in this review. While Bayesian software has not matched this level of developer engagement, it has allowed scientific problems in many different domains to be addressed by Bayesian methods. This level of engagement with applied statisticians needs to be maintained.
