This tutorial will introduce you to working with Bayesian Statistics. Distinct from frequentist statistics, which is concerned with accepting or rejecting the null hypothesis, Bayesian Statistics asks what the probability of different hypotheses is, given the data and our prior beliefs about the world. We will start with an introduction to the conceptual underpinnings of Bayesian Statistics, and then see how we can actually fit Bayesian models in R.

By the end of this tutorial, you will understand:

  • How beliefs about the world are formalised
  • How different priors influence a model’s estimates
  • How to interpret posterior distributions
  • The workflow for running a Bayesian analysis
  • How to fit and inspect Bayesian models using the brms package in R

The tutorial was created by Dr Aislinn Keogh.

1 Theory

1.1 Introducing Bayes

Bayesian Statistics gets its name from Thomas Bayes (1702-1761), an English statistician, philosopher and Presbyterian minister who formulated a very influential theorem in probability theory called Bayes’ rule. First, we want to get the intuition for Bayes’ rule1.

Imagine your friend complains of a headache, and you’re trying to decide between three possible causes for this headache: a brain tumour, dehydration, or athlete’s foot. Reaching a diagnosis requires you to consider two questions:

  1. How likely is it that someone with each medical condition would exhibit the symptom of having a headache?
  2. How common is each medical condition?

Starting with the first question, let’s say this is our assessment:

  • Brain tumour: headache is very likely
  • Dehydration: headache is very likely
  • Athlete’s foot: headache is very unlikely

So if all we cared about was the likelihood of symptoms given each underlying condition, we would conclude that your friend either has a brain tumour or is dehydrated.

Next let’s say we make the following assessment about our second question:

  • Brain tumour: very rare
  • Dehydration: very common
  • Athlete’s foot: very common

So if all we cared about was the prevalence of each underlying condition, we would conclude that your friend is either dehydrated or suffering from athlete’s foot.

But most people don’t reach either of these conclusions: instead, they bring the two quantities together in a smart way to conclude that the headache is probably caused by dehydration. How do we do this? Enter: Bayes’ rule.

1.1.1 Bayes’ rule

Bayes’ rule allows us to estimate the probability that our friend has a particular medical condition given their symptoms — \(P(illness \mid symptoms)\) — based on our answers to the two questions we considered above: the likelihood of a symptom given an illness — \(P(symptoms \mid illness)\) — and the prior probability of each illness — \(P(illness)\). Bayes’ rule provides a convenient way of expressing the quantity we want to know in terms of the quantities we already (roughly) know:

\[P(illness \mid symptoms) \propto P(symptoms \mid illness) \cdot P(illness)\]

Where \(\propto\) is read as “is proportional to”. The full equation has an extra (less interesting) term whose sole purpose is to make the output into a proper probability distribution, as follows:

\[P(illness \mid symptoms) = \frac{P(symptoms \mid illness) \cdot P(illness)}{P(symptoms)}\]

Let’s step through each of these components in turn:

  • \(P(illness \mid symptoms)\): The thing we want to know is called the posterior
  • \(P(symptoms \mid illness)\): The probability of a particular set of symptoms given that you have a specific illness is called the likelihood
  • \(P(illness)\): The probability that you have a particular illness before I have any evidence from your symptoms is called the prior
  • \(P(symptoms)\): The term on the bottom (the probability of symptoms independent of illness) is just a normalising constant: it’s not very interesting to us since it’s the same for all illnesses. It’s often known as the marginal or the marginal likelihood

Hopefully you’ll agree that Bayes’ rule makes intuitive sense. If the likelihood of symptoms given a certain illness is high, this will increase the posterior probability of that illness. Likewise, if the prior probability of a certain illness is high, this will also increase the posterior probability of that illness. And really importantly, if the prior probability of an illness is very low, we’d need some pretty strong evidence to make us accept that as a diagnosis.
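To make this concrete with some purely illustrative numbers: suppose a headache occurs in 90% of brain tumour cases, 90% of dehydration cases and 1% of athlete’s foot cases, and that the prior probabilities of the three conditions are 0.001, 0.2 and 0.2 respectively. Multiplying likelihood by prior gives unnormalised posterior values of \(0.9 \times 0.001 = 0.0009\), \(0.9 \times 0.2 = 0.18\) and \(0.01 \times 0.2 = 0.002\). After normalising, dehydration accounts for over 98% of the posterior probability, which matches the intuitive diagnosis.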

1.2 Bayesian inference for statistics

In inferential statistics, we want to use the data we have collected to answer some research question. To do so, we need to formalise two things:

  1. What kind of distribution is our data drawn from?
  2. What parameter(s) are we trying to estimate?

Note that this is not specific to Bayesian statistics! You might not have thought about these questions in exactly these terms before, but this is also what you’re doing when you fit a frequentist model in lme4. For example, if you’ve ever fit a logistic regression model using glm() with family=binomial, you’ve formalised the assumption that your data was drawn from a binomial distribution. And when you use a model formula like outcome ~ predictor1 + predictor2, you’re formalising the fact that your parameters of interest are the main effects of predictor1 and predictor2. For example:

  • We might want to know how well participants in an experiment perform on a binary outcome task (i.e. responses can either be correct or incorrect). As above, the data in such a task is drawn from a binomial distribution i.e. the number of successes in a sequence of binary outcome trials. The parameter we’re trying to estimate is a probability (intuitively) or log-odds (formally) of a correct response.
  • Alternatively, we might want to know how many people click a link in Version A of our marketing newsletter compared to Version B. The data here is drawn from a Poisson distribution, which captures how many times an event is likely to occur over a specific period. The parameters we’re trying to estimate are click rates (intuitively, average counts; formally, log-rates): the average number of clicks (intercept) and the difference between newsletter versions (slope).
  • As a final example, we might want to know how different patients respond to a particular treatment depending on their age and sex. Some kinds of data (e.g. change in blood pressure) are drawn from a normal distribution: a continuous number line between negative infinity and positive infinity. The parameters we’re trying to estimate are numbers along that line: blood pressure could increase (positive coefficient), decrease (negative coefficient) or stay the same (0). And as well as the slopes for age and sex, we might also want to estimate an interaction term e.g. perhaps older men respond better than younger men, but the effect of age goes in the opposite direction or is just not observable for women.
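If it helps to see those three scenarios as frequentist code, here’s a sketch of what each might look like (the data frames and column names are hypothetical):

# Binary outcome (correct/incorrect responses): binomial likelihood
glm(correct ~ condition, family = binomial, data = task_data)

# Count outcome (clicks per recipient): Poisson likelihood
glm(clicks ~ version, family = poisson, data = newsletter_data)

# Continuous outcome (change in blood pressure): normal likelihood, with an age-by-sex interaction
lm(bp_change ~ age * sex, data = trial_data)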

1.2.1 Bayesian vs. frequentist statistics

The radical difference between Bayesian statistics and frequentist statistics is in how we think of these parameters:

  • The frequentist view: Parameters have fixed values and can be described by point estimates (unknown but certain)
  • The Bayesian view: Parameters are random variables and can only be described by probability distributions (unknown and uncertain)

On first reading, you may feel that this description is wrong: after all, frequentist models give you standard errors on all your coefficients, don’t they? Aren’t these SEs precisely capturing uncertainty about the point estimate? The confusion here is basically down to the fact that the underlying math behind frequentist models is really pretty unintuitive and generally poorly understood. We don’t have space to get into the details here, but suffice it to say that frequentist standard errors and confidence intervals describe how an estimate would vary across hypothetical repeated samples, not our uncertainty about the parameter itself, as much as we instinctively want to make this leap; Bayesian models can and must be interpreted in terms of uncertainty about parameters.

Perhaps more saliently, one really important practical consequence of what might feel like a very abstract distinction between frequentist and Bayesian analyses is this: there is no concept of significance in Bayesian statistics. Models allocate their beliefs over a range of possible values, and we as researchers observe the extent to which the model estimates are consistent with our hypotheses. However, this is not a binary decision of rejecting or failing to reject the null hypothesis (as in frequentist statistics). The fact that we can’t describe results in terms of significance/non-significance is usually the main thing that surprises (and often scares) people when they are learning about Bayesian statistics, or reading a paper that uses this approach. We’ll come back to this point in the practical part of the tutorial!

1.2.2 How do we use Bayes’ rule to estimate parameter values?

Let’s look in more detail at how we apply Bayes’ rule — combining our data with our prior assumptions — to arrive at a probability distribution over plausible values for our parameter(s)2.

Imagine that we’ve run a lexical decision experiment. Participants are shown a word, and they have to decide whether it is a real word or not. We have data from 100 trials. Out of those 100 trials, participants answered correctly 80 times.

\[correct = \frac{80}{100}\]

What is the probability of answering correctly? Based on the data alone, our best guess at the probability of a correct response (i.e. the probability that a new participant would answer correctly on any given trial) would have to be 0.8, right? We know it can’t be 0, because we’ve observed some successes, and we know it can’t be 1, because we’ve also observed some failures. If the underlying probability of success was only 0.5 (i.e. participants were just guessing randomly), we would be pretty surprised to see this many successes. But 0.7 or 0.9 don’t seem like totally unreasonable estimates. In short, we have intuitions about how probable different probabilities of success are, given the data we observed. We can visualise these intuitions as a probability distribution:
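Here’s a sketch of how you could plot such a distribution yourself: with a uniform prior, the posterior over the probability of success after 80 correct responses in 100 trials is a Beta(81, 21) distribution (a standard conjugacy result, used here purely for illustration).

library(ggplot2)

# Posterior over the probability of a correct response, assuming a uniform
# prior and 80 successes out of 100 trials (i.e. a Beta(81, 21) distribution)
ggplot(data.frame(p = c(0, 1)), aes(x = p)) +
  stat_function(fun = dbeta, args = list(shape1 = 81, shape2 = 21)) +
  labs(x = "probability of a correct response", y = 'Probability\ndensity')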

This is a posterior probability distribution, because it takes into account information about the data, as well as (implicitly, so far) information about how probable we think each probability is to begin with. Let’s make that implicit information explicit and see how we arrived at this posterior distribution.

1.2.2.1 The prior

We might enter the experiment with no expectations about how well or how poorly participants will perform. This is known as a uniform prior, and it allocates equal probability over all possible probabilities of success:

Alternatively, perhaps we entered the experiment with a belief (for whatever reason) that participants would perform around chance (i.e. that they would just be flipping a coin on each trial). This is known as an informative prior, and it allocates highest probability to our chance performance value (since this is a binary outcome task, 0.5), gradually reducing the probability density as we move further away from this value:
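Here’s a sketch of what these two kinds of prior might look like side by side (the Beta(10, 10) used for the informative prior is just one illustrative choice of a distribution peaked at 0.5):

library(ggplot2)

# Uniform prior: every probability of success is equally plausible
# Informative prior: probability mass concentrated around chance performance (0.5)
ggplot(data.frame(p = c(0, 1)), aes(x = p)) +
  stat_function(aes(colour = "uniform prior"), fun = dunif) +
  stat_function(aes(colour = "informative prior"), fun = dbeta,
                args = list(shape1 = 10, shape2 = 10)) +
  labs(x = "probability of success", y = 'Probability\ndensity', colour = NULL)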

The important thing to be aware of is that, given enough data, the mean of your posterior distribution will eventually approach the maximum likelihood estimate indicated by your data (in this case, 0.8, since we had 80 successes out of 100 trials), no matter the shape of the prior. However:

  • The less data you have, the stronger the influence of the prior
  • The less informative the prior, the stronger the influence of the data (or, equivalently: the more informative the prior, the weaker the influence of the data)

You can see this in action here.

When people are starting out in Bayesian statistics, they often worry about the influence of the prior, and whether this is a sufficiently objective or scientific way to do data analysis. This is an understandable concern! But try not to panic: remember that all analyses are subjective, and all analyses make assumptions. The great thing about the Bayesian approach is that it lets you formalise your assumptions and put them up for debate. And as we’ve seen above, the model’s reliance on the priors decreases as the amount of data increases. It’s also worth saying that if you did actively want to manipulate or mess with your analysis, using priors would be an embarrassingly transparent way to do so.

1.2.2.2 The posterior

Let’s return to the posterior distribution we saw before (which, by the way, comes from a uniform-prior model):

The x-axis shows every possible probability for answering the lexical decision task correctly. The model allocates its belief over these probabilities based on the data we observed and the prior beliefs we encoded. The posterior probability distribution therefore shows the model’s belief about how likely different probabilities are to be the true probability that generated the data. Here, the model considers probabilities of success between about 0.7 and 0.9 to be the most plausible, given the observed data and our uniform prior beliefs. We usually report posterior distributions by summarising their central tendency and dispersion:

  • Central tendency: mean or median (depending on how skewed the distribution is)
  • Dispersion: conventionally, the 95% credible interval

The 95% Credible Interval (abbreviated as 95% CrI) defines the range in which the model is 95% certain that the true value lies. The 95% CrI has this interpretation because it contains 95% of the probability mass of the entire posterior distribution. Conventionally, it spans the distribution from the 2.5th quantile up to the 97.5th quantile (an “equal-tailed” interval); for a roughly symmetric posterior like this one, that is essentially the same as the highest-density 95% (i.e. the “tallest”, “narrowest” region possible). Here’s how it looks, shaded in light blue:

The interpretation of the Bayesian Credible Interval — the range within which, with 95% probability, the true value lies — is the interpretation we often instinctively want the frequentist confidence interval to have. This is considered by some to be a point in favour of the Bayesian framework: it reflects our intuitions a little bit better.

2 Practical

2.1 Data wrangling and setup

To get started, we’re going to have a look at the model-building workflow using an example dataset: palmerpenguins. This dataset contains measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. First, let’s get an idea of what’s in the dataset.
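The code below assumes that a few packages have been loaded, plus a user-defined colour palette (cbpalette) that some of the plots refer to. Here’s a minimal setup sketch; the specific colours are an illustrative choice.

library(tidyverse)   # data wrangling (%>%, mutate, drop_na) and ggplot2
library(brms)        # fitting Bayesian regression models
library(bayesplot)   # mcmc_trace(), mcmc_dens(), mcmc_areas()
library(extraDistr)  # qtnorm() and dtnorm() for the truncated normal distribution

cbpalette <- c("#0072B2", "#D55E00", "#009E73", "#CC79A7") # colour-blind-friendly palette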

# install.packages("palmerpenguins", repos="http://cran.us.r-project.org") # run if this is your first time using this dataset
penguins <- palmerpenguins::penguins
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Null values are a pain for data visualisation and model fitting, so let’s see how many of those we’ve got.

colSums(is.na(penguins))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 2                 2 
## flipper_length_mm       body_mass_g               sex              year 
##                 2                 2                11                 0

In the real world, I’d probably want to be selective about only dropping rows where columns I actually cared about contained null values, but for ease here I’m just going to drop every row that contains at least 1 null value. The rows with missing measurements are also missing sex, so this only removes 11 rows, leaving us with 333.

penguins <- penguins %>% drop_na()

2.2 The model-building workflow

There are a few things we could model in this dataset, but I’ve picked an easy research question to get us started: does flipper length depend on sex?

To set us up to answer this question, I’m first going to add a new column to the dataset which centres the predictor using ±0.5 sum coding (for more on contrast coding, see here). There are other ways to set contrasts without creating whole new columns (e.g. using contr.sum(2)/2) but it can be easier to interpret model output if you force yourself to be more explicit. If you don’t specify contrasts, brms will default to treatment coding (like lme4).

penguins <- penguins %>% 
  mutate(sex_contrsum = ifelse(sex=="male", -0.5, 0.5))
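It’s worth sanity-checking a hand-coded contrast column against the original factor; a quick way to do that:

# Cross-tabulate the factor against the new numeric column: every male should
# be coded -0.5 and every female +0.5
table(penguins$sex, penguins$sex_contrsum)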

2.2.1 Choosing a likelihood

If you’ve done linear modelling before, you’ve probably already encountered the concept of model family e.g. you can fit a basic linear model to continuous outcome data, but you need a logistic model for binary outcome data. The same applies in Bayesian statistics: we select a likelihood function based on the kinds of values that the outcome variable can take on.

To help us choose a likelihood function, let’s first visualise the flipper length data. N.B. I’m also splitting the plot by species just to make it a bit cleaner, but we won’t be including species in the model right away (because doing so requires us to think a bit harder about contrast coding, and this is not a class on contrast coding!).

penguins %>% 
  ggplot(aes(x=flipper_length_mm)) +
  facet_wrap(~species) +
  geom_histogram(aes(fill=sex), position="identity", alpha=0.5)

If I was being fairly liberal, I’d say that these distributions look normal-ish. I say “ish” because flipper length obviously can’t be negative, so technically these are not normal distributions. But they have roughly the right kind of shape (i.e. not super skewed) and it’s definitely continuous data (rather than e.g. binary). This might not feel particularly satisfying, but it’s pretty standard practice to use a normal likelihood function in these kinds of cases, much as you would probably just fit a basic linear model without transforming the data if you were running frequentist stats. So try not to worry too much about it!

If we’re using a normal likelihood, this means we can express flipper length (F) as follows:

\(F \sim Normal(\mu, \sigma)\)

Or, in words: flipper length is distributed according to a normal distribution with mean \(\mu\) and standard deviation \(\sigma\).

2.2.2 Choosing priors

Our likelihood function contains two parameters that define its shape: \(\mu\) (the mean) and \(\sigma\) (the standard deviation). We also want our model to estimate one other parameter: the effect of sex on flipper length, or in other words, the difference between male and female penguins’ mean flipper length.

Every parameter in a Bayesian model needs to have a prior that tells the model what kinds of values are plausible for that parameter to take on. So how do we choose priors for these three parameters? There are many different schools of thought, but these are the basic types of priors you’ll usually encounter:

  • Flat/uninformative priors: Priors that treat all values within a wide range as equally plausible (i.e. a uniform distribution).
  • Weakly regularising priors: Priors that are fairly broad, allowing a wide range of possible values, but down-weighting or ruling out impossibly extreme values.
  • Principled/informative priors: Priors that are quite narrow, reflecting more a priori certainty about plausible values e.g. from existing domain knowledge.

Weaker/broader priors are generally less philosophically alarming to researchers trained in frequentist statistics, and these are likely to be the people reviewing your papers. So unless you can point to a lot of prior knowledge in your particular domain to justify very informative priors, I would generally recommend regularising priors.

Let’s step through each of the parameters in turn.

2.2.2.1 A prior for mu (mean flipper length)

Unless you’re a penguin expert, you may not have strong intuitions about how long flippers can be. However, we can use common sense to set some sensible lower/upper bounds on the mean. For example:

  • Flippers must have a length, which means that mean length must be greater than 0.
  • Flippers must be shorter than the penguin itself, and even the biggest penguins in the world are shorter than an average human, so e.g. a mean length of 2000mm (2m) would be implausible — even if there was one monstrous penguin who had such long flippers, they’d definitely be an outlier!

So we could start off with something incredibly broad with these couple of very minor restrictions e.g. a uniform prior with a lower bound of 50 (5cm) and an upper bound of 1000 (1m).

To get an idea of what kind of values a particular prior would permit your parameter to take, you can use a q function (formally, the inverse cumulative distribution function): this is one of a family of functions (known as the dpqr functions) that comes with every distribution. All q functions take as their first argument a probability or vector of probabilities, followed by all the parameters required to define the distribution. Since we’ll be looking at 95% credible intervals for each of our parameter values, the probabilities we want to pass in are always 0.025 (the 2.5th percentile) and 0.975 (the 97.5th percentile): the area between these two percentiles comprises 95% of the distribution.

In this case, we want the qunif function for the uniform distribution. The uniform distribution has two parameters: min (the lower bound) and max (the upper bound).

qunif(c(0.025, 0.975), min = 50, max = 1000)
## [1]  73.75 976.25

So this output tells us that such a prior defines an expectation that mean flipper length should be between 73.75mm and 976.25mm. This seems very likely to be true! But it’s a uniform distribution, so we’re saying that all values in that range are equally plausible, which maybe seems unnecessarily permissive.

To make things a bit more restrictive, we could instead assume that mean flipper length is drawn from a normal distribution, where extreme values are less likely. For example:

qnorm(c(0.025, 0.975), mean = 500, sd = 250)
## [1]  10.009 989.991

So this prior says that mean flipper length should be between about 10cm and about 99cm. Because the standard deviation here is quite high, this distribution is going to be fairly spread out around the mean, which is good if we’re only trying to be weakly regularising. We can see the spread by visualising the distribution: I’ve chosen the range on the x-axis by adding/subtracting 3 standard deviations from the mean, which I know (thanks to math) is going to include over 99% of the probability mass.

m = 500
sd = 250

ggplot(tibble(x = c(m-(3*sd), m+(3*sd))), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = m, sd = sd), colour=cbpalette[1]) +
  labs(x = "flipper length (mm)", y = 'Probability\ndensity') +
  theme(panel.grid=element_blank())

This looks a bit better: completely implausible values have very low probability, but we’re fairly permissive about other values.

However, you may have noticed that the mean of this distribution seems much too high given the data we actually have:

penguins$flipper_length_mm %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     172     190     197     201     213     231

There is a delicate balancing act here: we want our prior to be on the right order of magnitude, but we don’t want our model to overfit to this specific dataset in case we collect data from a much larger penguin species in future and want to be able to use the same statistical models to analyse that data. Priors are also meant to be exactly that: our prior expectations before observing any data.

There’s no single correct answer, but I suggest that we stick with a normal distribution (the central limit theorem tells us that sample means will tend to be normally distributed, so this makes sense for a prior on the mean) but shift it a bit further to the left to be more in line with the properties of the penguin species we’re dealing with here. For example:

qnorm(c(0.025, 0.975), mean = 250, sd = 125)
## [1]   5.004502 494.995498

2.2.2.2 A prior for sigma (standard deviation of flipper lengths)

Standard deviations can only be positive, but luckily, brms is clever and knows this, so we have to worry even less about whether the normal distribution is technically a valid choice than we did for the mean.

However, because brms is going to do some truncating under the hood, if we want to see what kinds of values different priors would correspond to, we do need to use the q function for the truncated normal distribution: qtnorm (from the extraDistr package). This distribution has four parameters: mean, sd, a (the lower bound) and b (the upper bound). We’re just going to specify a lower bound in this case, since that’s the end of the distribution where we want the truncation to take place. For example:

qtnorm(c(0.025, 0.975), mean = 0, sd = 100, a = 0)
## [1]   3.133798 224.140273

This prior on sigma sets an expectation that the standard deviation of flipper sizes should be in the above range (still on the mm scale). This seems fine to me, but if you’re running the code yourself you could always try playing around with a few different values! As always, you can also see the effect of truncating the normal distribution by visualising the prior:

m = 0
sd = 100
a = 0

ggplot(tibble(x = c(m-(3*sd), m+(3*sd))), aes(x = x)) +
  stat_function(fun = dtnorm, args = list(mean = m, sd = sd, a=0), colour=cbpalette[1]) +
  labs(x = "standard deviation of flipper length (mm)", y = 'Probability\ndensity') +
  theme(panel.grid=element_blank())

2.2.2.3 A prior for beta (the difference between male and female penguins)

Ok, so this is the parameter we’re most interested in: the main effect of sex. We know that this parameter will have to have much smaller values than the mean flipper length itself, because the difference between male and female penguins is not likely to be on the order of magnitude of one whole flipper! However, we may not want to set a really specific prior on which direction the effect will go in i.e. we might want to allow for male flipper length to be greater than female flipper length, or vice versa.

The fact that we want to allow this parameter to take either positive or negative values means we’re definitely justified in using a normal distribution this time, hurrah! For example:

qnorm(c(0.025, 0.975), mean = 0, sd = 100)
## [1] -195.9964  195.9964

So this prior says that we don’t know which direction the effect will go in, but we expect the difference between male and female penguins to be in the above range (still on the mm scale). This seems pretty sensible: very small differences (close to 0) are going to have highest prior probability, but much bigger differences are possible. However, you could of course make this prior either wider or narrower if you wanted to be more permissive or had stronger existing domain knowledge.

2.2.3 Prior predictive checks

Before we dive in to running the model, we want to check that the priors we’ve chosen generate data that looks vaguely sensible. Enter: prior predictive checks.

brms gives us a way of generating simulated data from our priors, by using all the same syntax as we’re going to use for the real model but adding one extra argument to make it ignore the actual data. The basic template for a prior predictive check is as follows:

brm(
  my_outcome ~ my_predictor,
  data = my_data,
  family = gaussian(),
  prior = c(
    prior(SOMETHING, class = Intercept),
    prior(SOMETHING, class = sigma),
    prior(SOMETHING, class = b, coef = my_colname)
  ),
  sample_prior = "only"
)

  • The first argument is the model formula in lme4 syntax.
  • The data= argument names the data frame where the data to analyse can be found.
  • The family= argument defines the model family: in this case, we’re using a normal likelihood, so we want gaussian() (this is the default so you can leave this argument out if you like, but I think it’s good to be explicit for our own sake).
  • The prior= argument defines the priors that the model will use. If there is no prior= argument, the model will use the default priors (more on this later!).
  • The line sample_prior = "only" is what makes this model into a prior predictive model: it ignores the data and uses only the priors to estimate the posteriors (basically just reproducing the priors). Removing this line will cause the model to take the data into account when estimating posteriors, and we’ll do this when we properly fit the model.

So here’s the prior predictive model for the priors I chose above. Note that if you want to use a uniform prior on any parameter, you’ll need to also specify lb= and ub= (lower bound and upper bound respectively) as additional arguments to prior().

prior_pred <- brm(
  flipper_length_mm ~ sex_contrsum,
  data = penguins,
  family = gaussian(),
  prior = c(
    prior(normal(250, 125), class = Intercept),
    prior(normal(0, 100), class = sigma),
    prior(normal(0, 100), class = b, coef = sex_contrsum)
  ),
  sample_prior = "only",
  refresh = 0, # this prevents brms from printing a lot of unnecessary output: you'll still see warnings if relevant (e.g. non-convergence)
  seed = 42 # set a random seed to make the results reproducible
)

This model has estimated posteriors that reproduce the priors, ignoring the data, so we can use these posteriors to generate some new data and see whether it looks reasonable.

Generating predictive data is such a common thing to do that brms comes with a useful function that helps us do it graphically: pp_check(). It takes the following arguments:

  • A model object.
  • type = "stat" is one possible type of plot, which applies some summary statistic to the generated data.
  • stat = mean says that our summary statistic is going to be the function mean().
  • prefix = "ppd" hides the observed data, which is shown by default, displaying only the predictive distributions (“ppd” = “prior predictive distribution”). It doesn’t make sense to include the data in the plot, because our prior predictive model didn’t take it into account.

We can also add any of the usual ggplot layers to the plot!

pp_check(
  prior_pred,
  type = "stat",
  stat = mean,
  prefix = "ppd"
) +
  labs(x = "flipper_length_mm", title = "Prior predictive distribution means")

As we can see, these priors are generating data which is pretty well in line with both our intuitions and what we know about the data, so we should be fairly happy with them. One quick note though: you may be worried that the model has some probability mass on values equal to and less than zero, which is clearly impossible. Luckily, because we’re never going to observe such values in the data, this isn’t going to be a problem when we run the real model. Remember that this was quite a permissive prior, so the model is going to rely more heavily on the data. Have a look back at this plot once you’ve run the real model to see how much narrower the estimates are once we take the data into account!

2.2.4 Summary of the model-building workflow

  1. Find a likelihood suitable for the outcome data.
  2. Identify the parameters required by the likelihood and by the line you want to fit.
  3. Use the q functions and prior predictive checks to find plausible priors for each of those parameters.

2.3 Running the analysis

2.3.1 Finding posteriors through sampling

Ok, we’re nearly ready to fit the model! Before we do so, just a quick note on how Bayesian models work under the hood. Bayesian data analysis is often seen as a cutting-edge, up-and-coming method, even though Bayes’ rule dates back to 1763. Why is this? What stopped the Bayesian framework from taking off sooner? The short answer is: computational limitations.

In slightly more detail, finding posterior distributions analytically (i.e. using calculus) works sometimes, but it’s not possible for every kind of model. Another way to find the posterior is to approximate it by drawing samples from it. This is computationally challenging, and it’s only fairly recently that sampling methods have been developed so that posteriors can be estimated by any old researcher with a laptop.

The method that brms uses is called Markov Chain Monte Carlo, or MCMC (specifically, a variant called Hamiltonian Monte Carlo, implemented by the NUTS sampler you’ll see mentioned in the model output). It treats the posterior as a kind of landscape, and identifies the areas of highest probability essentially by running a physics simulation. We don’t need to understand this in detail: the key point is that, if we take enough samples, we can get a decent approximation of the shape of the posterior without needing to compute it exactly.

By default, brms models explore the posterior landscape four separate times (these are called chains), for 2,000 iterations per chain:

  • The first 1,000 iterations are used to explore the posterior landscape and hopefully eventually find a higher-probability area: this is called the “warm-up” or “burn-in” period.
  • The next 1,000 iterations are spent traversing the highest density area and drawing samples from it: it’s those samples that brms will summarise when it describes each parameter’s posterior to us.

You can always fiddle with the arguments to the brm() call to run more chains or iterations.
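For example, here’s a sketch of what that might look like (these are all standard brm() arguments; the values are illustrative, and the priors are omitted for brevity):

flipper_fit_long <- brm(
  flipper_length_mm ~ sex_contrsum,
  data = penguins,
  family = gaussian(),
  chains = 4,    # number of independent chains
  iter = 4000,   # total iterations per chain (warm-up + sampling)
  warmup = 2000, # how many of those iterations are warm-up
  cores = 4,     # run the chains in parallel
  seed = 42
)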

2.3.2 Fitting and inspecting the model

All we need to do to run the proper model is remove the sample_prior = "only" line from the code we ran for our prior predictive check:

flipper_fit <- brm(
  flipper_length_mm ~ sex_contrsum,
  data = penguins,
  family = gaussian(),
  prior = c(
    prior(normal(250, 125), class = Intercept),
    prior(normal(0, 100), class = sigma),
    prior(normal(0, 100), class = b, coef = sex_contrsum)
  ),
  refresh = 0,
  seed = 42
)

2.3.2.1 Checking convergence

Recall that the MCMC chains that sample from the posterior are moving around the posterior landscape, trying to find where the bulk of the probability mass is. Sometimes they might never find it, or might not do so reliably. When the chains don’t properly sample from the posterior, we say the model has not converged.

Ideally, the chains will “mix”: in other words, all four chains will be sampling from the same region of the posterior. There are a couple of ways to tell that the chains are mixing properly.

2.3.2.1.1 Trace plots

First, we’re going to have a look at some so-called “trace-plots”.

mcmc_trace(flipper_fit, pars = c("b_Intercept", "b_sex_contrsum"))

Trace plots track where the four chains were as they traversed the posterior during sampling. The y-axis shows the parameter values each chain was sampling, and the x-axis shows the iteration index (each chain drew 1,000 samples after the warm-up period). If the chains mixed well, then the trace plots should look like “fat hairy caterpillars”. And these ones do!

If you see trace plots where chains aren’t overlapping as densely, or even worse, where some chains have wandered off in a completely different direction and never mingled with the others, you should be worried (although probably brms will also have warned you of non-convergence in this case).

2.3.2.1.2 Rhats

The second thing to check is more quantitative. For each parameter, the model gives us a diagnostic measure called Rhat. Rhat is a measure of how well the chains have mixed. It compares the estimates within chains and between chains, and if the chains have mixed well, then the result should be near 1. We consider the model to have converged if all Rhat values are equal to 1.00; even an Rhat of 1.01 should make you suspicious.

The model summary gives us this information: we’re looking for a column called Rhat.

summary(flipper_fit)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: flipper_length_mm ~ sex_contrsum 
##    Data: penguins (Number of observations: 333) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept      200.94      0.75   199.46   202.39 1.00     4072     2834
## sex_contrsum    -7.14      1.50   -10.05    -4.22 1.00     4187     2958
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    13.62      0.55    12.60    14.74 1.00     4283     3164
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

If there is any Rhat value greater than 1.00, it means that the chains didn’t mix well enough and we cannot trust the posterior estimates we’ve been given. If you do ever fit a model that yields Rhat > 1.00, this page offers some solutions you could try.
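If you’d rather check the Rhats programmatically than read them off the summary, here’s a sketch (rhat() returns one value per parameter):

rhat(flipper_fit)       # named vector of Rhat values, one per parameter
max(rhat(flipper_fit))  # should be (very close to) 1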

2.3.2.2 Posterior predictive checks

The final check we want to run on our model is a posterior predictive check. We’ve encountered the concept of predictive checks before: with prior predictive checks, we drew sample values from each parameter’s prior, used those values to define the likelihood, and used that likelihood to generate simulated outcomes. Now, with posterior predictive checks, we’re going to draw sample values from each parameter’s posterior, and use them to do the exact same thing.

In fact, the processes are so similar that we can use pp_check() again to give us the means of the posterior predictive distributions, and compare them to the mean of the observed data (which is included by default if you don’t specify prefix = "ppd"):

pp_check(
  flipper_fit,
  type = "stat",
  stat = mean
) +
  labs(x = "flipper_length_mm", title = "Posterior distribution means")

With the priors I used, the mean of the observed data (the dark vertical line) is smack in the middle of the posterior predictive distribution of means — hurray! That’s what we want to see. A little bit off to either side is also OK, but if the observed data were far away from the bulk of the posterior predictive distribution, that would suggest that we might want to change the model architecture to better reflect the generative process behind the data.
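Another common posterior predictive check overlays the densities of many simulated datasets on the density of the observed data; this is pp_check()’s default plot type, shown here as a sketch:

# Light lines: datasets simulated from the posterior; dark line: observed data
pp_check(flipper_fit, type = "dens_overlay", ndraws = 100)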

2.4 Interpreting and reporting model results

Reassured that our model does a decent job of capturing the data, let’s now delve into the actual estimates given by summary().

  • Under Estimate, we get the mean of the posterior distribution of each parameter.
  • Under Est.Error, we get the standard deviation.
  • Under l-95% CI, we get the 2.5th quantile of the posterior (the lower bound of the 95% Credible Interval).
  • Under u-95% CI, we get the 97.5th quantile of the posterior (the upper bound of the 95% Credible Interval).

summary(flipper_fit)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: flipper_length_mm ~ sex_contrsum 
##    Data: penguins (Number of observations: 333) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept      200.94      0.75   199.46   202.39 1.00     4072     2834
## sex_contrsum    -7.14      1.50   -10.05    -4.22 1.00     4187     2958
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    13.62      0.55    12.60    14.74 1.00     4283     3164
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

You can also use the fixef() function to extract only the posterior summaries of the fixed effects.

fixef(flipper_fit)
##                Estimate Est.Error      Q2.5     Q97.5
## Intercept    200.940810 0.7462072 199.46074 202.39358
## sex_contrsum  -7.137227 1.5029720 -10.04872  -4.22041

Let’s look at those estimates individually to understand what they mean.

2.4.1 Interpreting coefficient estimates

First, the intercept. This is the model’s estimate for mean flipper length across male and female penguins (because we used sum coding; if we had stuck with the default treatment coding, this would have been the estimate for the ‘reference level’). With the priors I used, estimated mean flipper length is 200.94mm (95% CrI: [199.46, 202.39]).

Because the posterior is a distribution of belief over plausible parameter values, this means that the model thinks there’s a 95% probability that the intercept of this model (i.e. mean flipper length) is between 199.46 and 202.39. The model is also very certain that this is a positive effect: the mass of the posterior distribution is nowhere near zero. This makes sense since flipper length can’t be negative!

We can also visualise the posterior distribution as a density plot:

mcmc_dens(flipper_fit, pars = "b_Intercept") +
  labs(title = "Posterior distribution of b_Intercept")

Next, the coefficient for sex. With the priors I used, the mean effect of sex (i.e. the difference between male and female penguins) is -7.14 (95% CrI: [-10.05, -4.22]). Because the entire 95% CrI is negative, the model is quite certain that this is a negative effect: female penguins have smaller flippers than male penguins. However, this is (as expected) a much smaller effect than the mean itself.

mcmc_dens(flipper_fit, pars = "b_sex_contrsum") +
  labs(title = "Posterior distribution of b_sex_contrsum")

We can also look at any or all posterior distributions with their means and 95% CrI highlighted using mcmc_areas() — if you show both parameters on this plot it will be a bit hard to see the shading (because they’re on quite different scales), but this can work really nicely for some models where parameters are a bit closer together. You could also use this function to look at each parameter individually.

mcmc_areas(flipper_fit, 
           pars = c("b_Intercept", "b_sex_contrsum"), # add/remove terms as desired
           prob = 0.95) +
  geom_vline(xintercept = 0, linetype = 'dotted')

2.4.2 Reporting the results

If you were going to write this model up for a paper, here’s how you might report it.

We fit a Bayesian linear model with a normal likelihood predicting flipper length (in millimetres) as a function of sex. The model used weakly regularising priors which we selected based on prior predictive checks. The model converged (as indicated by all Rhats = 1).

The model estimates a negative effect of sex (\(\beta\) = -7.14, 95% CrI [-10.05, -4.22]) meaning that female penguins have shorter flippers on average than male penguins.

Note again that we’re not referring to this as a significant result, or talking about it in terms of rejecting the null hypothesis that male and female penguins have the same flipper length. This can feel a bit uncomfortable if you’re used to reporting frequentist models — and you will almost certainly get queries on it, especially from the people who review your papers — but it is the correct way to report a Bayesian analysis. You will definitely see people arguing that a result is significant because the 95% CrI doesn’t include zero, but try not to be tempted to do this! Remember, there is no such thing as significance in the Bayesian framework.

You can also use the xtable library to generate a LaTeX version of your model summary as a table:

library(xtable)
xtable(fixef(flipper_fit))

2.5 Bonus exercise: understanding default priors

I mentioned above that if you leave out the prior= argument when fitting a model, brms will use its default priors. These will differ from model to model depending on the model family and the properties of your data, but you can always find out what they are for your specific model.

There are two ways to do this. Either, before actually running the model, you can pass the model specification into the get_prior() function:

get_prior(flipper_length_mm ~ sex_contrsum,
  data = penguins,
  family = gaussian())

Or, you can fit the model with default priors and then pass that model object into the prior_summary() function: you’ll get the same output from either method.

Ok, so here’s what this output is telling us:

  • The model is going to use flat (i.e. uniform) priors on all slope coefficients (class = b). Here, we only have one (sex_contrsum), which gets its own row just in case you weren’t sure that it came under that ‘all slope coefficients’ group. If you added other predictors, you would also see rows for each of those with their column names under coef.
  • The prior on the intercept (mean flipper length) is a scaled and shifted Student’s t-distribution with degrees of freedom (\(\nu\)) = 3, mean = 197 and scale = 16.3 (brms derives these defaults from the data itself: roughly the median and the median absolute deviation of the outcome). Student’s t is very similar to the normal distribution but has heavier tails. The amount of probability mass in the tails is given by the degrees of freedom: smaller numbers result in heavier tails.
  • The prior on sigma (the standard deviation of flipper lengths) is very similar, but with mean = 0 and lb (lower bound) = 0, because standard deviations can only be positive.

The flat priors aren’t very interesting, but let’s get an idea of what this Student’s t-distribution looks like. We can inspect it in the same way as we were doing for normal distributions earlier.

df = 3
m = 197
sd = 16.3

ggplot(tibble(x = c(m-(3*sd), m+(3*sd))), aes(x = x)) +
  stat_function(fun = dstudent_t, args = list(df = df, mu = m, sigma = sd), colour=cbpalette[1]) +
  labs(x = "flipper length (mm)", y = 'Probability\ndensity') +
  theme(panel.grid=element_blank())

qstudent_t(c(0.025, 0.975), df=3, mu=197, sigma=16.3)
## [1] 145.1261 248.8739

From the plot and the 95% credible interval, we can see that the extra weight in the tails means that this distribution encodes less certainty than a normal distribution in the range it covers, but because brms has picked a pretty low value for sigma, that range is much narrower than in the prior we used. You might be worried that it’s too restrictive given our lack of prior knowledge about penguins. Let’s see how it affects the model!

flipper_fit_default <- brm(
  flipper_length_mm ~ sex_contrsum,
  data = penguins,
  family = gaussian(),
  refresh = 0,
  seed = 42
)

If we were running this analysis for real, we’d need to check the trace plots and Rhats and run a posterior predictive check before diving into the model estimates, but rest assured that we did this behind the scenes and everything looks fine, so we’re safe to just go ahead and check the summary!

summary(flipper_fit_default)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: flipper_length_mm ~ sex_contrsum 
##    Data: penguins (Number of observations: 333) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept      200.90      0.74   199.42   202.33 1.00     4841     3011
## sex_contrsum    -7.12      1.50   -10.12    -4.23 1.00     5031     3201
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma    13.61      0.54    12.62    14.71 1.00     4185     2690
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Recall that my model with user-defined priors had the following coefficient estimates:

  • Intercept: 200.94 (95% CrI: [199.46, 202.39])
  • Main effect of sex: -7.14 (95% CrI: [-10.05, -4.22])

The estimates of the model with default priors are, I think you’ll agree, strikingly similar. What does this tell us? Well, it seems like the influence of the prior is pretty minimal: probably there’s just so much data that the likelihood is dominating in determining the shape of the posterior.

If you want to play more with this, you could try filtering the data down to fewer rows (e.g. just look at the penguins on Torgersen island) and compare models with user-defined vs. default priors: remember that priors have more of an influence a) when there’s less data and b) when they’re more informative. If you want to get really advanced, you can read more about prior sensitivity analyses.
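Here’s a sketch of how that comparison could go: update() re-fits a brms model with new data without re-typing the whole specification (the object names here are illustrative).

# Subset to a single island to get a much smaller dataset
penguins_torgersen <- penguins %>% filter(island == "Torgersen")

# Re-fit both models on the smaller dataset
flipper_fit_small <- update(flipper_fit, newdata = penguins_torgersen)
flipper_fit_small_default <- update(flipper_fit_default, newdata = penguins_torgersen)

# Compare the coefficient estimates: with less data, the priors matter more
fixef(flipper_fit_small)
fixef(flipper_fit_small_default)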

2.6 Bonus exercise: adding other predictors

Obviously, the model we’ve been looking at so far is extremely simple: we only have one categorical predictor with two levels. There are other columns in this dataset we might want to use as predictors: for example, maybe body mass is a better predictor of flipper length than sex. On your own time, why not try adding some more predictors to the model and thinking about how you’d interpret the coefficients? Remember: every parameter (including any interaction terms) needs a prior!
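For instance, here’s one possible starting point (a sketch only: the standardised body mass column and the single shared prior on all slopes are illustrative choices, not part of the original tutorial):

penguins <- penguins %>%
  mutate(body_mass_z = as.numeric(scale(body_mass_g))) # standardise body mass

flipper_fit_mass <- brm(
  flipper_length_mm ~ sex_contrsum + body_mass_z,
  data = penguins,
  family = gaussian(),
  prior = c(
    prior(normal(250, 125), class = Intercept),
    prior(normal(0, 100), class = sigma),
    prior(normal(0, 100), class = b)  # one prior shared by all slope coefficients
  ),
  refresh = 0,
  seed = 42
)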

Once you get into multiple regression territory, it can be helpful to see the model’s predicted outcome values for each level of each predictor, rather than trying to intuit them from summary(). You can do this by passing your model into the ggpredict() function from the ggeffects package. Here’s an example for the model we built earlier:

library(ggeffects)
ggpredict(flipper_fit)
## $sex_contrsum
## # Predicted values of flipper_length_mm
## 
## sex_contrsum | Predicted |         95% CI
## -----------------------------------------
##        -0.50 |    204.50 | 202.49, 206.52
##         0.50 |    197.37 | 195.26, 199.54
## 
## 
## attr(,"class")
## [1] "ggalleffects" "list"        
## attr(,"model.name")
## [1] "flipper_fit"

3 Credits

This tutorial was created by Dr Aislinn Keogh. Dr Keogh was a PhD student in Linguistics at the University of Edinburgh from 2021 to 2025, and a CDCS Teaching Fellow.

Suggested citation: Keogh, A. (2025). Introduction to Bayesian Statistics. CDCS Tutorials

This tutorial is published under a CC BY-NC 4.0 license. The palmerpenguins dataset is available under a CC0 license.

Please help us keep our tutorials up to date. If you find something that is not working, email us at cdcs@ed.ac.uk.

sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggeffects_1.5.2   xtable_1.8-4      extraDistr_1.10.0 bayesplot_1.11.0 
##  [5] brms_2.20.4       Rcpp_1.0.12       lubridate_1.9.3   forcats_1.0.0    
##  [9] stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2       readr_2.1.5      
## [13] tidyr_1.3.1       tibble_3.2.1      ggplot2_3.4.4     tidyverse_2.0.0  
## 
## loaded via a namespace (and not attached):
##   [1] Rdpack_2.6.4         gridExtra_2.3        inline_0.3.19       
##   [4] rlang_1.1.3          magrittr_2.0.3       ggridges_0.5.6      
##   [7] matrixStats_1.2.0    compiler_4.3.2       mgcv_1.9-0          
##  [10] loo_2.6.0            vctrs_0.6.5          reshape2_1.4.4      
##  [13] pkgconfig_2.0.3      fastmap_1.1.1        backports_1.4.1     
##  [16] ellipsis_0.3.2       labeling_0.4.3       utf8_1.2.4          
##  [19] threejs_0.3.3        promises_1.2.1       rmarkdown_2.25      
##  [22] markdown_1.12        tzdb_0.4.0           nloptr_2.0.3        
##  [25] haven_2.5.4          xfun_0.41            cachem_1.0.8        
##  [28] jsonlite_1.8.8       highr_0.10           later_1.3.2         
##  [31] parallel_4.3.2       R6_2.5.1             dygraphs_1.1.1.6    
##  [34] bslib_0.6.1          stringi_1.8.3        StanHeaders_2.32.5  
##  [37] boot_1.3-28.1        numDeriv_2016.8-1.1  jquerylib_0.1.4     
##  [40] rstan_2.32.5         knitr_1.45           zoo_1.8-12          
##  [43] base64enc_0.1-3      splines_4.3.2        glmmTMB_1.1.11      
##  [46] httpuv_1.6.14        Matrix_1.6-5         igraph_2.0.1.1      
##  [49] timechange_0.3.0     tidyselect_1.2.0     rstudioapi_0.15.0   
##  [52] abind_1.4-5          yaml_2.3.8           TMB_1.9.17          
##  [55] codetools_0.2-19     miniUI_0.1.1.1       curl_5.2.0          
##  [58] pkgbuild_1.4.3       lattice_0.21-9       plyr_1.8.9          
##  [61] bayestestR_0.14.0    shiny_1.8.0          withr_3.0.0         
##  [64] bridgesampling_1.1-2 posterior_1.5.0      coda_0.19-4.1       
##  [67] evaluate_0.23        RcppParallel_5.1.7   xts_0.13.2          
##  [70] pillar_1.9.0         tensorA_0.36.2.1     checkmate_2.3.1     
##  [73] DT_0.31              stats4_4.3.2         reformulas_0.4.1    
##  [76] insight_0.20.4       shinyjs_2.1.0        distributional_0.3.2
##  [79] generics_0.1.3       hms_1.1.3            rstantools_2.4.0    
##  [82] munsell_0.5.0        scales_1.3.0         minqa_1.2.6         
##  [85] gtools_3.9.5         glue_1.7.0           tools_4.3.2         
##  [88] shinystan_2.6.0      lme4_1.1-35.1        colourpicker_1.3.0  
##  [91] mvtnorm_1.2-4        grid_4.3.2           rbibutils_2.3       
##  [94] QuickJSR_1.1.3       crosstalk_1.2.1      datawizard_0.12.3   
##  [97] colorspace_2.1-0     nlme_3.1-163         palmerpenguins_0.1.1
## [100] cli_3.6.2            fansi_1.0.6          Brobdingnag_1.2-9   
## [103] V8_5.0.0             gtable_0.3.4         sass_0.4.8          
## [106] digest_0.6.34        htmlwidgets_1.6.4    farver_2.1.1        
## [109] htmltools_0.5.7      lifecycle_1.0.4      mime_0.12           
## [112] MASS_7.3-60          shinythemes_1.2.0

  1. This example is taken from the Simulating Language course in the School of Philosophy, Psychology and Language Sciences at the University of Edinburgh, with thanks to Professors Kenny Smith and Simon Kirby.↩︎

  2. This section of the tutorial is based on the Bayes, stat! workshop, by Dr Elizabeth Pankratz.↩︎