This tutorial will introduce you to working with Bayesian Statistics. Distinct from frequentist statistics, which is concerned with accepting or rejecting the null hypothesis, Bayesian Statistics asks what the probability of different hypotheses is, given the data and our prior beliefs about the world. We will start with an introduction to the conceptual underpinnings of Bayesian Statistics, and then see how we can actually fit Bayesian models in R.
By the end of this tutorial, you will understand the conceptual underpinnings of Bayesian Statistics and how to fit and interpret Bayesian models using the brms package in R.
The tutorial was created by Dr Aislinn Keogh.
Bayesian Statistics gets its name from Thomas Bayes (1702-1761), an English statistician, philosopher and Presbyterian minister who formulated a very influential theorem in probability theory called Bayes’ rule. First, we want to get the intuition for Bayes’ rule [1].
Imagine your friend complains of a headache, and you’re trying to decide between three possible causes for this headache: a brain tumour, dehydration, or athlete’s foot. Reaching a diagnosis requires you to consider two questions: how likely is a headache given each underlying condition, and how prevalent is each condition in the first place?
Starting with the first question, let’s say this is our assessment: a headache is a very likely symptom of a brain tumour, a likely symptom of dehydration, and a very unlikely symptom of athlete’s foot.
So if all we cared about was the likelihood of symptoms given each underlying condition, we would conclude that your friend either has a brain tumour or is dehydrated.
Next let’s say we make the following assessment about our second question: brain tumours are extremely rare, whereas dehydration and athlete’s foot are both fairly common.
So if all we cared about was the prevalence of each underlying condition, we would conclude that your friend is either dehydrated or suffering from athlete’s foot.
But most people don’t reach either of these conclusions: instead, they bring the two quantities together in a smart way to conclude that the headache is probably caused by dehydration. How do we do this? Enter: Bayes’ rule.
Bayes’ rule allows us to estimate the probability that our friend has a particular medical condition given their symptoms — \(P(illness \mid symptoms)\) — based on our answers to the two questions we considered above: the likelihood of a symptom given an illness — \(P(symptoms \mid illness)\) — and the prior probability of each illness — \(P(illness)\). Bayes’ rule provides a convenient way of expressing the quantity we want to know in terms of the quantities we already (roughly) know:
\[P(illness \mid symptoms) \propto P(symptoms \mid illness) \cdot P(illness)\]
Where \(\propto\) is read as “is proportional to”. The full equation has an extra (less interesting) term whose sole purpose is to make the output into a proper probability distribution, as follows:
\[P(illness \mid symptoms) = \frac{P(symptoms \mid illness) \cdot P(illness)}{P(symptoms)}\]
Let’s step through each of these components in turn: the term on the left, \(P(illness \mid symptoms)\), is the posterior (the thing we want to know); \(P(symptoms \mid illness)\) is the likelihood; \(P(illness)\) is the prior; and \(P(symptoms)\) is the normalising term that turns the output into a proper probability distribution.
Hopefully you’ll agree that Bayes’ rule makes intuitive sense. If the likelihood of symptoms given a certain illness is high, this will increase the posterior probability of that illness. Likewise, if the prior probability of a certain illness is high, this will also increase the posterior probability of that illness. And really importantly, if the prior probability of an illness is very low, we’d need some pretty strong evidence to make us accept that as a diagnosis.
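To make this concrete, here’s a small numerical sketch of Bayes’ rule in R. The likelihood and prior values are invented for illustration (they are not the tutorial’s own assessments), and we treat the three conditions as the only possible causes, so that dividing by the sum plays the role of \(P(symptoms)\).
likelihood <- c(tumour = 0.95, dehydration = 0.80, athletes_foot = 0.01)   # P(symptoms | illness)
prior      <- c(tumour = 0.0001, dehydration = 0.30, athletes_foot = 0.20) # P(illness)
unnormalised <- likelihood * prior            # numerator of Bayes' rule
posterior <- unnormalised / sum(unnormalised) # normalise so the posterior sums to 1
round(posterior, 3)                           # dehydration comes out as by far the most probable cause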
In inferential statistics, we want to use the data we have collected to answer some research question. To do so, we need to formalise two things:
Note that this is not specific to Bayesian statistics! You might not have thought about these questions in exactly these terms before, but this is also what you’re doing when you fit a frequentist model in lme4. For example, if you’ve ever fit a logistic regression model using glm() with family=binomial, you’ve formalised the assumption that your data was drawn from a binomial distribution. And when you use a model formula like outcome ~ predictor1 + predictor2, you’re formalising the fact that your parameters of interest are the main effects of predictor1 and predictor2. For example:
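(A hypothetical sketch: the data frame my_data and its columns accuracy, predictor1 and predictor2 are invented for illustration.)
# a frequentist logistic regression: a binary outcome modelled by two predictors
freq_model <- glm(accuracy ~ predictor1 + predictor2,
                  data = my_data,
                  family = binomial) # formalises the assumption of a binomial likelihood
summary(freq_model)                  # point estimates and standard errors for each main effect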
The radical difference between Bayesian statistics and frequentist statistics is in how we think of these parameters:
On first reading, you may feel that this description is wrong: after all, frequentist models give you standard errors on all your coefficients, don’t they? Aren’t these SEs precisely capturing uncertainty about the point estimate? The confusion here is basically down to the fact that the underlying math behind frequentist models is really pretty unintuitive and generally poorly understood. We don’t have space to get into the details here, but suffice it to say that frequentist models technically can’t be interpreted in terms of uncertainty, as much as we instinctively want to make this leap; Bayesian models can and must be interpreted this way.
Perhaps more saliently, one really important practical consequence of what might feel like a very abstract distinction between frequentist and Bayesian analyses is this: there is no concept of significance in Bayesian statistics. Models allocate their beliefs over a range of possible values, and we as researchers observe the extent to which the model estimates are consistent with our hypotheses. However, this is not a binary decision of rejecting or failing to reject the null hypothesis (as in frequentist statistics). The fact that we can’t describe results in terms of significance/non-significance is usually the main thing that surprises (and often scares) people when they are learning about Bayesian statistics, or reading a paper that uses this approach. We’ll come back to this point in the practical part of the tutorial!
Let’s look in more detail at how we apply Bayes’ rule — combining our data with our prior assumptions — to arrive at a probability distribution over plausible values for our parameter(s) [2].
Imagine that we’ve run a lexical decision experiment. Participants are shown a word, and they have to decide whether it is a real word or not. We have data from 100 trials. Out of those 100 trials, participants answered correctly 80 times.
\[correct = \frac{80}{100}\]
What is the probability of answering correctly? Based on the data alone, our best guess at the probability of a correct response (i.e. the probability that a new participant would answer correctly on any given trial) would have to be 0.8, right? We know it can’t be 0, because we’ve observed some successes, and we know it can’t be 1, because we’ve also observed some failures. If the underlying probability of success was only 0.5 (i.e. participants were just guessing randomly), we would be pretty surprised to see this many successes. But 0.7 or 0.9 don’t seem like totally unreasonable estimates. In short, we have intuitions about how probable different probabilities of success are, given the data we observed. We can visualise these intuitions as a probability distribution:
This is a posterior probability distribution, because it takes into account information about the data, as well as (implicitly, so far) information about how probable we think each probability is to begin with. Let’s make that implicit information explicit and see how we arrived at this posterior distribution.
We might enter the experiment with no expectations about how well or how poorly participants will perform. This is known as a uniform prior, and it allocates equal probability over all possible probabilities of success:
Alternatively, perhaps we entered the experiment with a belief (for whatever reason) that participants would perform around chance (i.e. that they would just be flipping a coin on each trial). This is known as an informative prior, and it allocates highest probability to our chance performance value (since this is a binary outcome task, 0.5), gradually reducing the probability density as we move further away from this value:
The important thing to be aware of is that, given enough data, the mean of your posterior distribution will eventually approach the maximum likelihood estimate indicated by your data (in this case, 0.8, since we had 80 successes out of 100 trials), no matter the shape of the prior. However, the more informative the prior, the more data it takes to overcome it: with only a small amount of data, the posterior will stay close to the prior. You can see this in action here.
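If you want to check the arithmetic yourself, here’s a minimal sketch using the conjugate Beta-Binomial update (the informative Beta(20, 20) prior is just an example I’ve chosen; the tutorial’s own plots may have been generated differently): a Beta(a, b) prior combined with a Binomial likelihood gives a Beta(a + successes, b + failures) posterior.
successes <- 80
failures  <- 20

post_uniform     <- c(a = 1,  b = 1)  + c(successes, failures) # uniform prior = Beta(1, 1)
post_informative <- c(a = 20, b = 20) + c(successes, failures) # informative prior centred on 0.5

post_uniform["a"] / sum(post_uniform)         # posterior mean ~0.79: close to the data's 0.8
post_informative["a"] / sum(post_informative) # posterior mean ~0.71: pulled towards the prior's 0.5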
When people are starting out in Bayesian statistics, they often worry about the influence of the prior, and whether this is a sufficiently objective or scientific way to do data analysis. This is an understandable concern! But try not to panic: remember that all analyses are subjective, and all analyses make assumptions. The great thing about the Bayesian approach is that it lets you formalise your assumptions and put them up for debate. And as we’ve seen above, the model’s reliance on the priors decreases as the amount of data increases. It’s also worth saying that if you did actively want to manipulate or mess with your analysis, using priors would be an embarrassingly transparent way to do so.
Let’s return to the posterior distribution we saw before (which, by the way, comes from a uniform-prior model):
The x-axis shows every possible probability for answering the lexical decision task correctly. The model allocates its belief over these probabilities based on the data we observed and the prior beliefs we encoded. The posterior probability distribution therefore shows the model’s belief about how likely different probabilities are to be the true probability that generated the data. Here, the model considers probabilities of success between about 0.7 and 0.9 to be the most plausible, given the observed data and our uniform prior beliefs. We usually report posterior distributions by summarising their central tendency and dispersion:
The 95% Credible Interval (abbreviated as 95% CrI) defines a range in which the model is 95% certain that the true value lies. The 95% CrI has this interpretation because it contains 95% of the probability mass of the entire posterior distribution. The version reported by default (and used throughout this tutorial) is the equal-tailed interval, which spans the distribution from the 2.5th quantile up to the 97.5th quantile; a closely related alternative is the highest-density interval, the “tallest” and narrowest region containing 95% of the mass, and for roughly symmetric posteriors the two are very similar. Here’s how it looks, shaded in light blue:
The interpretation of the Bayesian Credible Interval — the range within which, with 95% probability, the true value lies — is the interpretation we often instinctively want the frequentist confidence interval to have. This is considered by some to be a point in favour of the Bayesian framework: it reflects our intuitions a little bit better.
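To tie this back to the running example, here’s a quick sketch (continuing the Beta-Binomial assumption above) of how an equal-tailed 95% CrI falls out of the posterior’s quantiles:
# posterior under the uniform prior was Beta(81, 21); its 2.5th and 97.5th quantiles give the 95% CrI
qbeta(c(0.025, 0.975), shape1 = 81, shape2 = 21) # roughly 0.71 to 0.87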
To get started, we’re going to have a look at the model-building workflow using an example dataset: palmerpenguins. This dataset contains measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. First, let’s get an idea of what’s in the dataset.
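The code throughout this tutorial assumes a few packages are already attached, and that a colour-blind-friendly palette called cbpalette exists for the plots. Here’s a minimal setup sketch (the exact palette colours are an assumption; any vector of hex colours would do):
library(tidyverse)  # data wrangling (mutate, drop_na, %>%) and ggplot2
library(brms)       # fitting Bayesian models; also provides dstudent_t()/qstudent_t()
library(bayesplot)  # mcmc_trace(), mcmc_dens(), mcmc_areas()
library(extraDistr) # dtnorm()/qtnorm() for the truncated normal
cbpalette <- c("#0072B2", "#D55E00", "#009E73", "#CC79A7") # assumed plotting palette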
# install.packages("palmerpenguins", repos="http://cran.us.r-project.org") # run if this is your first time using this dataset
penguins <- palmerpenguins::penguins
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Null values are a pain for data visualisation and model fitting, so let’s see how many of those we’ve got.
colSums(is.na(penguins))
## species island bill_length_mm bill_depth_mm
## 0 0 2 2
## flipper_length_mm body_mass_g sex year
## 2 2 11 0
In the real world, I’d probably want to be selective about only dropping rows where columns I actually cared about contained null values, but for ease here I’m just going to drop every row that contains at least 1 null value: since the biggest number in the table above is 11, this is still going to leave us with 333 rows.
penguins <- penguins %>% drop_na()
There are a few things we could model in this dataset, but I’ve picked an easy research question to get us started: does flipper length depend on sex?
To set us up to answer this question, I’m first going to add a new column to the dataset which centres the predictor using ±0.5 sum coding (for more on contrast coding, see here). There are other ways to set contrasts without creating whole new columns (e.g. using contr.sum(2)/2), but it can be easier to interpret model output if you force yourself to be more explicit. If you don’t specify contrasts, brms will default to treatment coding (like lme4).
penguins <- penguins %>%
mutate(sex_contrsum = ifelse(sex=="male", -0.5, 0.5))
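If you’d rather not create a new column, here’s a sketch of the contr.sum(2)/2 alternative mentioned above; we’ll stick with the explicit sex_contrsum column for the models below.
contrasts(penguins$sex) <- contr.sum(2)/2 # attach the contrast to the factor itself
contrasts(penguins$sex)                   # female = 0.5, male = -0.5, matching sex_contrsum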
If you’ve done linear modelling before, you’ve probably already encountered the concept of model family e.g. you can fit a basic linear model to continuous outcome data, but you need a logistic model for binary outcome data. The same applies in Bayesian statistics: we select a likelihood function based on the kinds of values that the outcome variable can take on.
To help us choose a likelihood function, let’s first visualise the flipper length data. N.B. I’m also splitting the plot by species just to make it a bit cleaner, but we won’t be including species in the model right away (because doing so requires us to think a bit harder about contrast coding, and this is not a class on contrast coding!).
penguins %>%
ggplot(aes(x=flipper_length_mm)) +
facet_wrap(~species) +
geom_histogram(aes(fill=sex), position="identity", alpha=0.5)
If I was being fairly liberal, I’d say that these distributions look normal-ish. I say “ish” because flipper length obviously can’t be negative, so technically these are not normal distributions. But they have roughly the right kind of shape (i.e. not super skewed) and it’s definitely continuous data (rather than e.g. binary). This might not feel particularly satisfying, but it’s pretty standard practice to use a normal likelihood function in these kinds of cases, much as you would probably just fit a basic linear model without transforming the data if you were running frequentist stats. So try not to worry too much about it!
If we’re using a normal likelihood, this means we can express flipper length (F) as follows:
\(F \sim Normal(\mu, \sigma)\)
Or, in words: flipper length is distributed according to a normal distribution with mean \(\mu\) and standard deviation \(\sigma\).
Our likelihood function contains two parameters that define its shape: \(\mu\) (the mean) and \(\sigma\) (the standard deviation). We also want our model to estimate one other parameter: the effect of sex on flipper length, or in other words, the difference between male and female penguins’ mean flipper length.
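Putting those pieces together, one way to write out the model we’re about to fit (where \(\alpha\) is the intercept, i.e. the grand mean, and \(\beta\) is the effect of sex on the ±0.5 coded predictor) is:
\[flipper_i \sim Normal(\mu_i, \sigma)\]
\[\mu_i = \alpha + \beta \cdot sex_i\]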
Every parameter in a Bayesian model needs to have a prior that tells the model what kinds of values are plausible for that parameter to take on. So how do we choose priors for these three parameters? There are many different schools of thought, but these are the basic types of priors you’ll usually encounter: flat (or uninformative) priors, which treat a very wide range of values as equally plausible; weakly informative (or regularising) priors, which gently down-weight extreme values without committing to a specific estimate; and informative priors, which encode strong existing knowledge about where the parameter should lie.
Weaker/broader priors are generally less philosophically alarming to researchers trained in frequentist statistics, and these are likely to be the people reviewing your papers. So unless you can point to a lot of prior knowledge in your particular domain to justify very informative priors, I would generally recommend regularising priors.
Let’s step through each of the parameters in turn.
Unless you’re a penguin expert, you may not have strong intuitions about how long flippers can be. However, we can use common sense to set some sensible lower/upper bounds on the mean. For example: a flipper presumably has to be at least a few centimetres long to be any use for swimming, and it surely can’t be longer than a metre (penguins just aren’t that big).
So we could start off with something incredibly broad that respects these couple of very minor restrictions, e.g. a uniform prior with a lower bound of 50 (5cm) and an upper bound of 1000 (1m).
To get an idea of what kind of values a particular prior would permit your parameter to take, you can use a q function (formally, the inverse cumulative distribution function): this is one of a family of functions (known as the dpqr functions) that comes with every distribution. All q functions take as their first argument a probability or vector of probabilities, followed by all the parameters required to define the distribution. Since we’ll be looking at 95% credible intervals for each of our parameter values, the probabilities we want to pass in are always 0.025 (the 2.5th percentile) and 0.975 (the 97.5th percentile): the area between these two percentiles comprises 95% of the distribution.
In this case, we want the qunif function for the uniform distribution. The uniform distribution has two parameters: min (the lower bound) and max (the upper bound).
qunif(c(0.025, 0.975), min = 50, max = 1000)
## [1] 73.75 976.25
So this output tells us that such a prior defines an expectation that mean flipper length should be between 73.75mm and 976.25mm. This seems very likely to be true! But it’s a uniform distribution, so we’re saying that all values in that range are equally plausible, which maybe seems unnecessarily permissive.
To make things a bit more restrictive, we could instead assume that mean flipper length is drawn from a normal distribution, where extreme values are less likely. For example:
qnorm(c(0.025, 0.975), mean = 500, sd = 250)
## [1] 10.009 989.991
So this prior says that mean flipper length should be between about 10cm and about 99cm. Because the standard deviation here is quite high, this distribution is going to be fairly spread out around the mean, which is good if we’re only trying to be weakly regularising. We can see the spread by visualising the distribution: I’ve chosen the range on the x-axis by adding/subtracting 3 standard deviations from the mean, which I know (thanks to math) is going to include 99% of the probability mass.
m = 500
sd = 250
ggplot(tibble(x = c(m-(3*sd), m+(3*sd))), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = m, sd = sd), colour=cbpalette[1]) +
labs(x = "flipper length (mm)", y = 'Probability\ndensity') +
theme(panel.grid=element_blank())
This looks a bit better: completely implausible values have very low probability, but we’re fairly permissive about other values.
However, you may have noticed that the mean of this distribution seems much too high given the data we actually have:
penguins$flipper_length_mm %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 172 190 197 201 213 231
There is a delicate balancing act here: we want our prior to be on the right order of magnitude, but we don’t want our model to overfit to this specific dataset in case we collect data from a much larger penguin species in future and want to be able to use the same statistical models to analyse that data. Priors are also meant to be exactly that: our prior expectations before observing any data.
There’s no single correct answer, but I suggest that we stick with a normal distribution (the central limit theorem tells us that sample means will tend to be normally distributed, so this makes sense for a prior on the mean) but shift it a bit further to the left to be more in line with the properties of the penguin species we’re dealing with here. For example:
qnorm(c(0.025, 0.975), mean = 250, sd = 125)
## [1] 5.004502 494.995498
Standard deviations can only be positive, but luckily, brms is clever and knows this, so we have to worry even less about whether the normal distribution is technically a valid choice than we did for the mean.
However, because brms is going to do some truncating under the hood, if we want to see what kinds of values different priors would correspond to, we do need to use the q function for the truncated normal distribution: qtnorm. This distribution has four parameters: mean, sd, a (the lower bound) and b (the upper bound). We’re just going to specify a lower bound in this case, since that’s the end of the distribution where we want the truncation to take place. For example:
qtnorm(c(0.025, 0.975), mean = 0, sd = 100, a = 0)
## [1] 3.133798 224.140273
This prior on sigma sets an expectation that the standard deviation of flipper sizes should be in the above range (still on the mm scale). This seems fine to me, but if you’re running the code yourself you could always try playing around with a few different values! As always, you can also see the effect of truncating the normal distribution by visualising the prior:
m = 0
sd = 100
a = 0
ggplot(tibble(x = c(m-(3*sd), m+(3*sd))), aes(x = x)) +
stat_function(fun = dtnorm, args = list(mean = m, sd = sd, a=0), colour=cbpalette[1]) +
labs(x = "flipper length (mm)", y = 'Probability\ndensity') +
theme(panel.grid=element_blank())
Ok, so this is the parameter we’re most interested in: the main effect of sex. We know that this parameter will have to have much smaller values than the mean flipper length itself, because the difference between male and female penguins is not likely to be on the order of magnitude of one whole flipper! However, we may not want to set a really specific prior on which direction the effect will go in i.e. we might want to allow for male flipper length to be greater than female flipper length, or vice versa.
The fact that we want to allow this parameter to take either positive or negative values means we’re definitely justified in using a normal distribution this time, hurrah! For example:
qnorm(c(0.025, 0.975), mean = 0, sd = 100)
## [1] -195.9964 195.9964
So this prior says that we don’t know which direction the effect will go in, but we expect the difference between male and female penguins to be in the above range (still on the mm scale). This seems pretty sensible: very small differences (close to 0) are going to have highest prior probability, but much bigger differences are possible. However, you could of course make this prior either wider or narrower if you wanted to be more permissive or had stronger existing domain knowledge.
Before we dive in to running the model, we want to check that the priors we’ve chosen generate data that looks vaguely sensible. Enter: prior predictive checks.
brms gives us a way of generating simulated data from our priors, by using all the same syntax as we’re going to use for the real model but adding one extra argument to make it ignore the actual data. The basic template for a prior predictive check is as follows:
brm(
my_outcome ~ my_predictor,
data = my_data,
family = gaussian(),
prior = c(
prior(SOMETHING, class = Intercept),
prior(SOMETHING, class = sigma),
prior(SOMETHING, class = b, coef = my_colname)
),
sample_prior = "only"
)
The model formula uses standard lme4 syntax.
The data= argument names the data frame where the data to analyse can be found.
The family= argument defines the model family: in this case, we’re using a normal likelihood, so we want gaussian() (this is the default, so you can leave this argument out if you like, but I think it’s good to be explicit for our own sake).
The prior= argument defines the priors that the model will use. If there is no prior= argument, the model will use the default priors (more on this later!).
sample_prior = "only" is what makes this model into a prior predictive model: it ignores the data and uses only the priors to estimate the posteriors (basically just reproducing the priors). Removing this line will cause the model to take the data into account when estimating posteriors, and we’ll do this when we properly fit the model.

So here’s the prior predictive model for the priors I chose above. Note that if you want to use a uniform prior on any parameter, you’ll need to also specify lb= and ub= (lower bound and upper bound respectively) as additional arguments to prior().
prior_pred <- brm(
flipper_length_mm ~ sex_contrsum,
data = penguins,
family = gaussian(),
prior = c(
prior(normal(250, 125), class = Intercept),
prior(normal(0, 100), class = sigma),
prior(normal(0, 100), class = b, coef = sex_contrsum)
),
sample_prior = "only",
refresh = 0, # this prevents brms from printing a lot of unnecessary output: you'll still see warnings if relevant (e.g. non-convergence)
seed = 42 # set a random seed to make the results reproducible
)
This model has estimated posteriors that reproduce the priors, ignoring the data, so we can use these posteriors to generate some new data and see whether it looks reasonable.
Generating predictive data is such a common thing to do that brms comes with a useful function that helps us do it graphically: pp_check(). It takes the following arguments:
type = "stat" is one possible type of plot, which applies some summary statistic to the generated data.
stat = mean says that our summary statistic is going to be the function mean().
prefix = "ppd" hides the observed data, which is shown by default, displaying only the predictive distributions (“ppd” = “prior predictive distribution”). It doesn’t make sense to include the data in the plot, because our prior predictive model didn’t take it into account.

We can also add any of the usual ggplot layers to the plot!
pp_check(
prior_pred,
type = "stat",
stat = mean,
prefix = "ppd"
) +
labs(x = "flipper_length_mm", title = "Prior predictive distribution means")
As we can see, these priors are generating data which is pretty well in line with both our intuitions and what we know about the data, so we should be fairly happy with them. One quick note though: you may be worried that the model has some probability mass on values equal to and less than zero, which is clearly impossible. Luckily, because we’re never going to observe such values in the data, this isn’t going to be a problem when we run the real model. Remember that this was quite a permissive prior, so the model is going to rely more heavily on the data. Have a look back at this plot once you’ve run the real model to see how much narrower the estimates are once we take the data into account!
Exercise: try using q functions and prior predictive checks to find plausible priors for the parameters of a model of your own.

Ok, we’re nearly ready to fit the model! Before we do so, just a quick note on how Bayesian models work under the hood. Bayesian data analysis is often seen as a cutting-edge, up-and-coming method, even though Bayes’ rule dates back to 1763. Why is this? What stopped the Bayesian framework from taking off sooner? The short answer is: computational limitations.
In slightly more detail, finding posterior distributions analytically (i.e. using calculus) works sometimes, but it’s not possible for every kind of model. Another way to find the posterior is to approximate it by drawing samples from it. This is computationally challenging, and it’s only fairly recently that sampling methods have been developed so that posteriors can be estimated by any old researcher with a laptop.
The method that brms uses is called Markov Chain Monte Carlo, or MCMC. It treats the posterior as a kind of landscape, and identifies the areas of highest probability essentially by running a physics simulation. We don’t need to understand this in detail: the key point is that, if we take enough samples, we can get a decent approximation of the shape of the posterior without needing to compute it exactly.
By default, brms models explore the posterior landscape four separate times (these are called chains), for 2,000 iterations per chain. The first 1,000 iterations of each chain are warm-up samples, which are used to tune the sampler and then discarded; the remaining 1,000 iterations per chain are the post-warm-up draws that brms will summarise when it describes each parameter’s posterior to us. You can always fiddle with the arguments to the brm() call to run more chains or iterations.
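For example, here’s a sketch of how those arguments look in practice (the specific numbers are arbitrary, and priors are omitted to keep the sketch short; chains, iter, warmup and cores are standard brm() arguments):
my_fit <- brm(
  flipper_length_mm ~ sex_contrsum,
  data = penguins,
  family = gaussian(),
  chains = 6,    # six chains instead of the default four
  iter = 4000,   # 4,000 iterations per chain...
  warmup = 2000, # ...of which the first 2,000 are discarded as warm-up
  cores = 6,     # run the chains in parallel
  refresh = 0,
  seed = 42
)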
All we need to do to run the proper model is remove the sample_prior = "only" line from the code we ran for our prior predictive check:
flipper_fit <- brm(
flipper_length_mm ~ sex_contrsum,
data = penguins,
family = gaussian(),
prior = c(
prior(normal(250, 125), class = Intercept),
prior(normal(0, 100), class = sigma),
prior(normal(0, 100), class = b, coef = sex_contrsum)
),
refresh = 0,
seed = 42
)
Recall that the MCMC chains that sample from the posterior are moving around the posterior landscape, trying to find where the bulk of the probability mass is. Sometimes they might never find it, or might not do so reliably. When the chains don’t properly sample from the posterior, we say the model has not converged.
Ideally, the chains will “mix”: in other words, all four chains will be sampling from the same region of the posterior. There are a couple of ways to tell that the chains are mixing properly.
First, we’re going to have a look at some so-called “trace-plots”.
mcmc_trace(flipper_fit, pars = c("b_Intercept", "b_sex_contrsum"))
Trace plots track where the four chains were as they traversed the posterior during sampling. The y-axis shows the parameter values the chains visited, and the x-axis shows the sample indices (each chain drew 1,000 samples after the warm-up period). If the chains mixed well, then the trace plots should look like “fat hairy caterpillars”. And these ones do!
If you see trace plots where the chains aren’t overlapping as densely, or, even worse, where some chains have wandered off in a completely different direction and never mingled with the others, you should be worried (although brms will probably also have warned you of non-convergence in this case).
The second thing to check is more quantitative. For each parameter, the model gives us a diagnostic measure called Rhat. Rhat is a measure of how well the chains have mixed. It compares the estimates within chains and between chains, and if the chains have mixed well, then the result should be near 1. We consider the model to have converged if all Rhat values are equal to 1.00; even an Rhat of 1.01 should make you suspicious.
The model summary gives us this information: we’re looking for a column called Rhat.
summary(flipper_fit)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: flipper_length_mm ~ sex_contrsum
## Data: penguins (Number of observations: 333)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 200.94 0.75 199.46 202.39 1.00 4072 2834
## sex_contrsum -7.14 1.50 -10.05 -4.22 1.00 4187 2958
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 13.62 0.55 12.60 14.74 1.00 4283 3164
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
If there is any Rhat value greater than 1.00, it means that the chains didn’t mix well enough and we cannot trust the posterior estimates we’ve been given. If you do ever fit a model that yields Rhat > 1.00, this page offers some solutions you could try.
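If you’d rather check Rhat programmatically than scan the summary by eye, one option is the rhat() function (a quick sketch):
rhats <- rhat(flipper_fit)      # named vector with one Rhat value per parameter
any(rhats > 1.01, na.rm = TRUE) # FALSE means nothing looks suspicious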
The final check we want to run on our model is a posterior predictive check. We’ve encountered the concept of predictive checks before: with prior predictive checks, we drew sample values from each parameter’s prior, used those values to define the likelihood, and used that likelihood to generate simulated outcomes. Now, with posterior predictive checks, we’re going to draw sample values from each parameter’s posterior, and use them to do the exact same thing.
In fact, the processes are so similar that we can use pp_check() again to give us the means of the posterior predictive distributions, and compare them to the mean of the observed data (which is included by default if you don’t specify prefix = "ppd"):
pp_check(
flipper_fit,
type = "stat",
stat = mean
) +
labs(x = "flipper_length_mm", title = "Posterior distribution means")
With the priors I used, the mean of the observed data (the dark vertical line) is smack in the middle of the posterior predictive distribution of means — hurray! That’s what we want to see. A little bit off to either side is also OK, but if the observed data were far away from the bulk of the posterior predictive distribution, that would suggest that we might want to change the model architecture to better reflect the generative process behind the data.
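Comparing means is only one kind of posterior predictive check. Another common option (a quick sketch) is to overlay the densities of datasets simulated from the posterior on the density of the observed data:
pp_check(flipper_fit, ndraws = 50) # 50 simulated datasets (light lines) vs. the observed data (dark line)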
Reassured that our model does a decent job of capturing the data, let’s now delve into the actual estimates given by summary().
Under Estimate, we get the mean of the posterior distribution of each parameter.
Under Est.Error, we get the standard deviation of the posterior.
Under l-95% CI, we get the 2.5th quantile of the posterior (the lower bound of the 95% Credible Interval).
Under u-95% CI, we get the 97.5th quantile of the posterior (the upper bound of the 95% Credible Interval).
summary(flipper_fit)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: flipper_length_mm ~ sex_contrsum
## Data: penguins (Number of observations: 333)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 200.94 0.75 199.46 202.39 1.00 4072 2834
## sex_contrsum -7.14 1.50 -10.05 -4.22 1.00 4187 2958
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 13.62 0.55 12.60 14.74 1.00 4283 3164
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
You can also use the fixef() function to extract only the posterior summaries of the fixed effects.
fixef(flipper_fit)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 200.940810 0.7462072 199.46074 202.39358
## sex_contrsum -7.137227 1.5029720 -10.04872 -4.22041
Let’s look at those estimates individually to understand what they mean.
First, the intercept. This is the model’s estimate for mean flipper length across male and female penguins (because we used sum coding; if we had stuck with the default treatment coding, this would have been the estimate for the ‘reference level’). With the priors I used, estimated mean flipper length is 200.94mm (95% CrI: [199.46, 202.39]).
Because the posterior is a distribution of belief over plausible parameter values, this means that the model thinks there’s a 95% probability that the intercept of this model (i.e. mean flipper length) is between 199.46 and 202.39. The model is also very certain that this value is positive: the mass of the posterior distribution is nowhere near zero. This makes sense, since flipper length can’t be negative!
We can also visualise the posterior distribution as a density plot:
mcmc_dens(flipper_fit, pars = "b_Intercept") +
labs(title = "Posterior distribution of b_Intercept")
Next, the coefficient for sex. With the priors I used, the mean effect of sex (i.e. the difference between male and female penguins) is -7.14 (95% CrI: [-10.05, -4.22]). Because the entire 95% CrI is negative, the model is quite certain that this is a negative effect: female penguins have smaller flippers than male penguins. However, this is (as expected) a much smaller effect than the mean itself.
mcmc_dens(flipper_fit, pars = "b_sex_contrsum") +
labs(title = "Posterior distribution of b_sex_contrsum")
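Because the posterior is just a big collection of samples, we can also compute quantities from the draws directly. Here’s a sketch using as_draws_df() to find the proportion of posterior samples in which the sex effect is negative:
draws <- as_draws_df(flipper_fit) # one row per post-warm-up sample, one column per parameter
mean(draws$b_sex_contrsum < 0)    # proportion of samples below zero: very close to 1 here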
We can also look at any or all posterior distributions with their means and 95% CrI highlighted using mcmc_areas(). If you show both parameters on this plot, it will be a bit hard to see the shading (because they’re on quite different scales), but this can work really nicely for some models where parameters are a bit closer together. You could also use this function to look at each parameter individually.
mcmc_areas(flipper_fit,
pars = c("b_Intercept", "b_sex_contrsum"), # add/remove terms as desired
prob = 0.95) +
geom_vline(xintercept = 0, linetype = 'dotted')
If you were going to write this model up for a paper, here’s how you might report it.
We fit a Bayesian linear model with a normal likelihood predicting flipper length (in millimetres) as a function of sex. The model used weakly regularising priors which we selected based on prior predictive checks. The model converged (as indicated by all Rhats = 1).
The model estimates a negative effect of sex (\(\beta\) = -7.14, 95% CrI [-10.05, -4.22]) meaning that female penguins have shorter flippers on average than male penguins.
Note again that we’re not referring to this as a significant result, or talking about it in terms of rejecting the null hypothesis that male and female penguins have the same flipper length. This can feel a bit uncomfortable if you’re used to reporting frequentist models — and you will almost certainly get queries on it, especially from the people who review your papers — but it is the correct way to report a Bayesian analysis. You will definitely see people arguing that a result is significant because the 95% CrI doesn’t include zero, but try not to be tempted to do this! Remember, there is no such thing as significance in the Bayesian framework.
You can also use the xtable library to generate a LaTeX version of your model summary as a table:
library(xtable)
xtable(fixef(flipper_fit))
I mentioned above that if you leave out the prior= argument when fitting a model, brms will use its default priors. These will differ from model to model depending on the model family and the properties of your data, but you can always find out what they are for your specific model.
There are two ways to do this. Either, before actually running the model, you can pass the model specification into the get_prior() function:
get_prior(flipper_length_mm ~ sex_contrsum,
data = penguins,
family = gaussian())
Or, you can fit the model with default priors and then pass that model object into the prior_summary() function: you’ll get the same output from these two methods.
Ok, so here’s what this output is telling us: the population-level effects (class b) get flat priors by default, listed once for the class as a whole and once again for each individual coef, while the Intercept and sigma get Student’s t priors whose location and scale brms has derived from the data. The flat priors aren’t very interesting, but let’s get an idea of what this Student’s t-distribution looks like. We can inspect it in the same way as we were doing for normal distributions earlier.
df = 3
m = 197
sd = 16.3
ggplot(tibble(x = c(m-(3*sd), m+(3*sd))), aes(x = x)) +
stat_function(fun = dstudent_t, args = list(df = df, mu = m, sigma = sd), colour=cbpalette[1]) +
labs(x = "flipper length (mm)", y = 'Probability\ndensity') +
theme(panel.grid=element_blank())
qstudent_t(c(0.025, 0.975), df=3, mu=197, sigma=16.3)
## [1] 145.1261 248.8739
From the plot and the 95% credible interval, we can see that the extra weight in the tails means that this distribution encodes less certainty than a normal distribution in the range it covers, but because brms has picked a pretty low value for sigma, that range is much narrower than in the prior we used. You might be worried that it’s too restrictive given our lack of prior knowledge about penguins. Let’s see how it affects the model!
flipper_fit_default <- brm(
flipper_length_mm ~ sex_contrsum,
data = penguins,
family = gaussian(),
refresh = 0,
seed = 42
)
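Now that we have a model fitted with the default priors, this is also a natural point to use the prior_summary() function mentioned above, to confirm exactly which priors were applied:
prior_summary(flipper_fit_default) # should match the get_prior() output above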
If we were running this analysis for real, we’d need to check the trace plots and Rhats and run a posterior predictive check before diving into the model estimates, but rest assured that we did this behind the scenes and everything looks fine, so we’re safe to just go ahead and check the summary!
summary(flipper_fit_default)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: flipper_length_mm ~ sex_contrsum
## Data: penguins (Number of observations: 333)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 200.90 0.74 199.42 202.33 1.00 4841 3011
## sex_contrsum -7.12 1.50 -10.12 -4.23 1.00 5031 3201
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 13.61 0.54 12.62 14.71 1.00 4185 2690
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Recall that my model with user-defined priors had the following coefficient estimates: Intercept = 200.94 (95% CrI [199.46, 202.39]) and sex_contrsum = -7.14 (95% CrI [-10.05, -4.22]).
The estimates of the model with default priors are, I think you’ll agree, strikingly similar. What does this tell us? Well, it seems like the influence of the prior is pretty minimal: probably there’s just so much data that the likelihood is dominating in determining the shape of the posterior.
If you want to play more with this, you could try filtering the data down to fewer rows (e.g. just look at the penguins on Torgersen island) and compare models with user-defined vs. default priors: remember that priors have more of an influence a) when there’s less data and b) when they’re more informative. If you want to get really advanced, you can read more about prior sensitivity analyses.
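Here’s a sketch of what that comparison might look like, using update() to refit the earlier model on just the Torgersen penguins (a much smaller dataset, so the priors should matter more):
torgersen <- penguins %>% filter(island == "Torgersen")
flipper_fit_small <- update(flipper_fit, newdata = torgersen, refresh = 0, seed = 42)
fixef(flipper_fit_small) # compare with a default-prior refit on the same subset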
Obviously, the model we’ve been looking at so far is extremely simple: we only have one categorical predictor with two levels. There are other columns in this dataset we might want to use as predictors: for example, maybe body mass is a better predictor of flipper length than sex. On your own time, why not try adding some more predictors to the model and thinking about how you’d interpret the coefficients? Remember: every parameter (including any interaction terms) needs a prior!
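As a starting point, here’s a sketch of how you might add (centred) body mass as a second predictor. The normal(0, 10) prior on its slope is just my guess that the change in flipper length per gram of body mass is small:
penguins <- penguins %>%
  mutate(body_mass_c = body_mass_g - mean(body_mass_g)) # centre so the intercept stays the grand mean

flipper_fit_mass <- brm(
  flipper_length_mm ~ sex_contrsum + body_mass_c,
  data = penguins,
  family = gaussian(),
  prior = c(
    prior(normal(250, 125), class = Intercept),
    prior(normal(0, 100), class = sigma),
    prior(normal(0, 100), class = b, coef = sex_contrsum),
    prior(normal(0, 10), class = b, coef = body_mass_c)
  ),
  refresh = 0,
  seed = 42
)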
Once you get into multiple regression territory, it can be helpful to see coefficient estimates for each level of each predictor rather than trying to intuit that from summary(). You can do this by passing your model into the ggpredict() function from the ggeffects package. Here’s an example for the model we built earlier:
library(ggeffects)
ggpredict(flipper_fit)
## $sex_contrsum
## # Predicted values of flipper_length_mm
##
## sex_contrsum | Predicted | 95% CI
## -----------------------------------------
## -0.50 | 204.50 | 202.49, 206.52
## 0.50 | 197.37 | 195.26, 199.54
##
##
## attr(,"class")
## [1] "ggalleffects" "list"
## attr(,"model.name")
## [1] "flipper_fit"
This tutorial was created by Dr Aislinn Keogh. Dr Keogh was a PhD student in Linguistics at the University of Edinburgh from 2021 to 2025, and a CDCS Teaching Fellow.
Suggested citation: Keogh, A. (2025). Introduction to Bayesian Statistics. CDCS Tutorials
This tutorial is published under a CC BY-NC 4.0 license. The palmerpenguins dataset is available under a CC0 license.
Please help us keep our tutorials up to date. If you find something that is not working, email us at cdcs@ed.ac.uk.
sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggeffects_1.5.2 xtable_1.8-4 extraDistr_1.10.0 bayesplot_1.11.0
## [5] brms_2.20.4 Rcpp_1.0.12 lubridate_1.9.3 forcats_1.0.0
## [9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [13] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.4 gridExtra_2.3 inline_0.3.19
## [4] rlang_1.1.3 magrittr_2.0.3 ggridges_0.5.6
## [7] matrixStats_1.2.0 compiler_4.3.2 mgcv_1.9-0
## [10] loo_2.6.0 vctrs_0.6.5 reshape2_1.4.4
## [13] pkgconfig_2.0.3 fastmap_1.1.1 backports_1.4.1
## [16] ellipsis_0.3.2 labeling_0.4.3 utf8_1.2.4
## [19] threejs_0.3.3 promises_1.2.1 rmarkdown_2.25
## [22] markdown_1.12 tzdb_0.4.0 nloptr_2.0.3
## [25] haven_2.5.4 xfun_0.41 cachem_1.0.8
## [28] jsonlite_1.8.8 highr_0.10 later_1.3.2
## [31] parallel_4.3.2 R6_2.5.1 dygraphs_1.1.1.6
## [34] bslib_0.6.1 stringi_1.8.3 StanHeaders_2.32.5
## [37] boot_1.3-28.1 numDeriv_2016.8-1.1 jquerylib_0.1.4
## [40] rstan_2.32.5 knitr_1.45 zoo_1.8-12
## [43] base64enc_0.1-3 splines_4.3.2 glmmTMB_1.1.11
## [46] httpuv_1.6.14 Matrix_1.6-5 igraph_2.0.1.1
## [49] timechange_0.3.0 tidyselect_1.2.0 rstudioapi_0.15.0
## [52] abind_1.4-5 yaml_2.3.8 TMB_1.9.17
## [55] codetools_0.2-19 miniUI_0.1.1.1 curl_5.2.0
## [58] pkgbuild_1.4.3 lattice_0.21-9 plyr_1.8.9
## [61] bayestestR_0.14.0 shiny_1.8.0 withr_3.0.0
## [64] bridgesampling_1.1-2 posterior_1.5.0 coda_0.19-4.1
## [67] evaluate_0.23 RcppParallel_5.1.7 xts_0.13.2
## [70] pillar_1.9.0 tensorA_0.36.2.1 checkmate_2.3.1
## [73] DT_0.31 stats4_4.3.2 reformulas_0.4.1
## [76] insight_0.20.4 shinyjs_2.1.0 distributional_0.3.2
## [79] generics_0.1.3 hms_1.1.3 rstantools_2.4.0
## [82] munsell_0.5.0 scales_1.3.0 minqa_1.2.6
## [85] gtools_3.9.5 glue_1.7.0 tools_4.3.2
## [88] shinystan_2.6.0 lme4_1.1-35.1 colourpicker_1.3.0
## [91] mvtnorm_1.2-4 grid_4.3.2 rbibutils_2.3
## [94] QuickJSR_1.1.3 crosstalk_1.2.1 datawizard_0.12.3
## [97] colorspace_2.1-0 nlme_3.1-163 palmerpenguins_0.1.1
## [100] cli_3.6.2 fansi_1.0.6 Brobdingnag_1.2-9
## [103] V8_5.0.0 gtable_0.3.4 sass_0.4.8
## [106] digest_0.6.34 htmlwidgets_1.6.4 farver_2.1.1
## [109] htmltools_0.5.7 lifecycle_1.0.4 mime_0.12
## [112] MASS_7.3-60 shinythemes_1.2.0
[1] This example is taken from the Simulating Language course in the School of Philosophy, Psychology and Language Sciences at the University of Edinburgh, with thanks to Professors Kenny Smith and Simon Kirby.
[2] This section of the tutorial is based on the Bayes, stat! workshop, by Dr Elizabeth Pankratz.