This tutorial will help you perform in R the statistical analysis and visualisation tasks that you would normally perform in SPSS.
R is a programming language and software environment primarily used for statistical computing and data analysis. It provides a wide variety of statistical and graphical techniques, making it a popular choice among statisticians, data analysts, and researchers. You can manipulate and analyse data using R alone, but these tasks are much easier when using its most popular user-friendly interface, R Studio.
R Studio is an integrated development environment (IDE) for R that enhances the user experience by providing a user-friendly interface, tools for plotting, debugging, and workspace management.
R Markdown is an authoring format that enables users to create dynamic documents, reports, and presentations directly from R. It combines the power of R with the flexibility of Markdown, allowing for the integration of code, results, and narrative text in a single document. R Markdown documents can be rendered to various formats, including HTML, PDF, and MS Word. Markdown itself is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
The first thing you need to do is get set up in R Studio. There are two options: you can download the software and run it locally on your PC, or you can access R Studio online.
Setting up online is on average easier but it will entail uploading your data online.
If you prefer to install the software locally:

- Windows: run the .exe file that was just downloaded.
- macOS: download and run the .pkg file for the latest R version.
- Linux: you can install R from your package manager (for Debian/Ubuntu sudo apt-get install r-base, and for Fedora sudo yum install R), but we don't recommend this approach as the versions provided by this are usually out of date. In any case, make sure you have at least R 3.5.1. To install R Studio, download the installer for your distribution and install it (for Debian/Ubuntu, run sudo dpkg -i rstudio-x.yy.zzz-amd64.deb at the terminal).

Different institutions have their own cloud-based computational notebook services that allow you to run your code online (e.g. https://noteable.edina.ac.uk/ for The University of Edinburgh), but these services are normally available only to staff and students. If your institution does not have its own service, you can use the cloud-based server at https://posit.co/. The basic version is free, but you will have a limit on the number of projects and collaborators you can have.
The R Studio interface has four windows: the Console, the Source, the Environment / History / Connections / Tutorial window, and the Files / Plots / Packages / Help / Viewer window.
Below you can find a quick explanation of what each window contains. For a full overview of the R Studio interface see https://bookdown.org/daniel_dauber_io/r4np_book/the-rstudio-interface.html
Located in the bottom left, this is where you normally see the result of the code (but if you are using RMarkdown documents you will see the code running under each chunk of code as well). You can run code directly in here, but you will normally do it from the Source window instead.
Located in the top left, this is where you are going to write your code and open any files you want to work on. If you are going to use RMarkdown, you can magnify this window and ignore any other windows except maybe the Environment one, because you are going to see the output of your code under each chunk.
Located on the top right, this window contains a series of sub-windows: Environment, History, Connections, Git, and Tutorial.
In Environment you are going to see all the objects that you create in the R Studio environment, from the datasets you import to the variables you create. Once you have created an object you can visualise it by double-clicking on it. On the top ribbon you can also see an "Import Dataset" icon that will allow you to select the dataset you want to import into R. If you click on it and then on From Text (readr) you can navigate to where your dataset is.
Reproducibility trick - instead of pressing Import, copy and paste the bit of code that you see under Code Preview into your script (and make sure that the file is in the same folder as the script). In this way everyone will be able to replicate your steps. In the History tab you will see all the computations you perform in the Console.
In the Connections tab you can connect to external databases.
In the Git tab you can manage your work if you are working with Version Control (more on setting up with Git and GitHub in RStudio https://happygitwithr.com/rstudio-git-github).
In the Tutorial tab you can find more tutorials and material to learn R and R Studio.
The bottom right window allows you to see and open files, visualise the plots you create, install and load packages via the interface (for reproducibility it is always better to do it from the source code), get help about functions and packages, visualise non-static images (Viewer), and visualise your presentations (if you are using Quarto https://quarto.org/).
As we said before, R Markdown is a specific type of file format designed to produce documents that include both code and text. It allows you to simultaneously:
1: save and execute code, and
2: produce high quality documents that include both code and text
You can run each chunk of code by pressing ctrl + enter together. For a more in-depth overview on how to get set up and create RMarkdown files see https://intro2r.com/rmarkdown_r.html and https://github.com/DCS-training/Interactive-Analysis-Reports-with-R-Markdown.github.io
Now that we are all set up, let's start exploring what we can do with R and RStudio. If you want to explore the basics of R and R Studio further, you can have a look at this introductory video.
For anyone who wants a useful resource to continue working through R, I recommend the Pirate's guide to RRRRRR.
Please note that there are multiple ways to conduct our analyses in R. So don't be afraid if you've come across something different in the past. Just use the method that you find works best for you.
To run the code, click the small green arrow in the code chunk, or press ctrl + enter together.
Today, we will be working with the Palmer Penguins data set. This data set investigates the differences in flipper length, bill length, and bill depth between different species of penguin, across different islands, and between the sexes.
R packages are extensions to the R programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R.
In R, installing a package and loading a package are two distinct processes. Installing a package involves downloading it from CRAN (the Comprehensive R Archive Network) or another repository and placing it in your local library, using the install.packages("packageName") function. This step is only necessary once per package version. Loading a package, on the other hand, is done every time you start a new R session and wish to use the functions within that package. This is achieved with the library(packageName) or require(packageName) function. Installation adds the package to your system, while loading makes the package's functions and datasets available for use in your current R session.
If you do not have the libraries already installed in your environment you need to uncomment the install.packages rows (remove the hashtag). Otherwise just run the cell as it is. To check if you have a package already you can run the cell below. If you get an error, it means that the package that throws the error is not already installed. Otherwise you can go to the Packages tab and check if they are listed.
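As a minimal sketch of what that cell could look like (the exact package list is an assumption, based on the functions used later in this tutorial):
# Install once per machine (uncomment any package you are missing)
# install.packages("tidyverse")
# install.packages("palmerpenguins")
# install.packages("rstatix")
# install.packages("DescTools")
# install.packages("correlation")
# install.packages("naniar")
# install.packages("missMethods")
# install.packages("jtools")
# install.packages("ggstats")

# Load in every new R session
library(tidyverse)      # data wrangling, %>%, ggplot2, readr
library(palmerpenguins) # the penguins data set
library(rstatix)        # t_test()
library(DescTools)      # LeveneTest(), PostHocTest()
library(correlation)    # correlation()
library(naniar)         # pct_miss(), vis_miss(), gg_miss_var()
library(missMethods)    # apply_imputation()
library(jtools)         # theme_apa()
library(ggstats)        # ggcoef_table(), ggcoef_compare()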
NB: you may get some red warnings when you run it. This is normal! It is just telling you that the package was either created for a different version of R or that it contains functions that have the same name as functions in other packages.
In the case of the image above, if you want to use the compose function from the flextable package you will have to specify it by adding flextable:: to compose(), otherwise the system would use the compose function from purrr (one of the tidyverse sub-packages).
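As a generic sketch of this idea (the toy functions below are illustrative and not taken from the tutorial itself):
# purrr::compose() chains functions together, whereas flextable::compose() edits
# table cells - the package prefix makes it unambiguous which one we mean
add_one <- function(x) x + 1
double <- function(x) x * 2
add_then_double <- purrr::compose(double, add_one) # functions are applied right to left
add_then_double(3) # add_one(3) = 4, then double(4) = 8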
summary(penguins) # summary stats of our data
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
str(penguins) # inspecting data structure
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
head(penguins) # view first 6 rows
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
view(penguins) # Look at the whole data set. Not advised for big data projects.
We can also use R to generate plots, for example:
#base R
plot(penguins$species,
penguins$flipper_length_mm )
plot(penguins$bill_length_mm,
penguins$flipper_length_mm )
hist(penguins$flipper_length_mm)
boxplot(penguins$flipper_length_mm )
# ggplot - grammar of graphics
## Here we have assigned our plot code to "plot_1", using the assignment arrow `<-`.
plot_1 <- ggplot(data = penguins,
aes(x = species, y = flipper_length_mm)) +
geom_boxplot()
plot_1
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## We can use the assigned object to further customise our code if we want to
plot_1 + facet_wrap(~sex) +
labs(title = "Pretty Penguins Plot",
subtitle = "My sister was bitten by a penguin once...",
x = "Penguin Species",
y = "Flipper Length"
)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## We can also use ggplot to visualise statistical models
plot_2 <- ggplot(data = penguins,
aes(x = bill_length_mm, y = flipper_length_mm, color = species)) +
geom_point() +
geom_smooth(method = "lm")
plot_2
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
plot_2 + facet_grid(~sex ~island) +
labs(title = "Another Pretty Penguins Plot",
subtitle = str_wrap("No realli! She was karving her initials on the penguin with the sharpened end of an interspace toothbrush given to her by Svenge...", 80),
x = "Bill length (mm)",
y = "Flipper Length (mm)") +
theme(legend.position = "bottom")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Now that we’ve seen some examples of data visualisation, and seen some basics of working with R, it’s time to run some analyses on the data. On the menu today, we will run through: - t-tests - ANOVAS (and posthoc tests) - Correlation analyses - Regression analyses
Our classic t-test. Our go-to tool for investigating differences between 2 groups. For this example, we are going to investigate if there is a significant difference in flipper length between male and female penguins. But in order to do so, we will first need to examine our data to see if it is appropriate.
summary(penguins$sex)
## female male NA's
## 165 168 11
As we can see, there are some NA's in the dataset. So, unfortunately, we will need to filter out the androgynous penguins in order to run the analysis (Sorry Penguin Bowie). This will be done with the filter() function. And for simplicity, we will remove all other NA's from the data.
t_test_data <- penguins %>%
filter(
sex != "NA's", # != is the logical operator for "does not equal" - this will remove the NA's from sex.
species == "Gentoo"
) %>%
na.omit() # na.omit removes NA from our data.
summary(t_test_data$sex)
## female male
## 58 61
Now that our data is appropriately filtered for the t-test, it’s time to run some quick diagnostics to determine if our assumptions are met.
# Checking normality
hist(t_test_data$flipper_length_mm, breaks = 30)
shapiro.test(t_test_data$flipper_length_mm)
##
## Shapiro-Wilk normality test
##
## data: t_test_data$flipper_length_mm
## W = 0.96148, p-value = 0.00176
ks.test(t_test_data$flipper_length_mm, 'pnorm')
## Warning in ks.test.default(t_test_data$flipper_length_mm, "pnorm"): ties should
## not be present for the Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: t_test_data$flipper_length_mm
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
# Checking for equal variances - this is to determine which t test can be used
LeveneTest(flipper_length_mm ~ sex, data = t_test_data)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 6.8152 0.01022 *
## 117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## As normality is not met, the double peak in the histogram may be caused by sex differences (as indicated by unequal variance). We will use ggplot to check.
ggplot(t_test_data,
aes(x = flipper_length_mm, fill = sex)) +
geom_histogram(position = "dodge") +
facet_wrap(~sex) +
labs(title = "Normal distribution, within each sex",
subtitle = "We apologise for the fault in the subtitles...")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## re-running tests - normality assumed!
norm_test_m <- t_test_data %>%
filter(sex == "male")
shapiro.test(norm_test_m$flipper_length_mm)
##
## Shapiro-Wilk normality test
##
## data: norm_test_m$flipper_length_mm
## W = 0.96184, p-value = 0.05452
norm_test_f <- t_test_data %>%
filter(sex == "female")
shapiro.test(norm_test_f$flipper_length_mm)
So our data meets the normality assumption, but does not meet the equal variance assumption. What do we do? Well, we opt for Welch's t-test. Thankfully, this is already the default option in R. Additionally, many applied statistical researchers are calling for its use over Student's t-test, as it is more robust to unequal variances, and performs just as well with equal variance. That is, it does both with less risk of misleading outputs.
## Default Welch's
t_test(flipper_length_mm ~ sex, data = t_test_data) # rstatix version
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 flipper_length_mm female male 58 61 -9.94 107. 7.11e-17
t.test(flipper_length_mm ~ sex, data = t_test_data) #base R version
##
## Welch Two Sample t-test
##
## data: flipper_length_mm by sex
## t = -9.9417, df = 106.69, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -10.595669 -7.072505
## sample estimates:
## mean in group female mean in group male
## 212.7069 221.5410
# Student (if needed)
t_test(flipper_length_mm ~ sex, var.equal = TRUE, data = t_test_data) # rstatix version
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 flipper_length_mm female male 58 61 -9.85 117 4.87e-17
t.test(flipper_length_mm ~ sex, var.equal = TRUE,data = t_test_data) #base R version
##
## Two Sample t-test
##
## data: flipper_length_mm by sex
## t = -9.8515, df = 117, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -10.610010 -7.058165
## sample estimates:
## mean in group female mean in group male
## 212.7069 221.5410
### Summary statistics for reporting
t_test_data %>%
group_by(sex) %>%
summarise(mean_flipper_length = round(mean(flipper_length_mm), 2),
sd_flipper_length =round(sd(flipper_length_mm), 2))
## # A tibble: 2 × 3
## sex mean_flipper_length sd_flipper_length
## <fct> <dbl> <dbl>
## 1 female 213. 3.9
## 2 male 222. 5.67
ggplot(t_test_data,
aes(x = sex, y = flipper_length_mm))+
geom_boxplot() +
theme_apa() +
labs(title = "Sex Differences in Gentoo Penguin Flipper Lengths",
subtitle = "Mind you, Penguin bites kan be pretty nastiii...",
x = "Penguin sex",
y = "Flipper Length (mm)")
Now that we have covered our t-test, let's dive into the ANOVA world. This time we will examine differences in flipper length between the different species of penguin.
anova_data <- penguins %>%
na.omit()
As previously, we need to test our assumptions for the ANOVA, else we risk misleading results. And once again, we start by checking for homogeneity of variance with Levene's test.
LeveneTest(flipper_length_mm ~ species, data = anova_data)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.4428 0.6426
## 330
## Checking whether flipper length is normally distributed within each species. We will use ggplot to check.
ggplot(anova_data,
aes(x = flipper_length_mm, fill = species)) +
geom_histogram(position = "dodge") +
facet_wrap(~species) +
labs(title = "Normal distribution, within each species",
subtitle = "We apologise again for the fault in the subtitles...")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## re-running tests - normality assumed!
norm_test_a <- anova_data %>%
filter(species == "Adelie")
shapiro.test(norm_test_a$flipper_length_mm)
##
## Shapiro-Wilk normality test
##
## data: norm_test_a$flipper_length_mm
## W = 0.9934, p-value = 0.7427
norm_test_c <- anova_data %>%
filter(species == "Chinstrap")
shapiro.test(norm_test_c$flipper_length_mm)
##
## Shapiro-Wilk normality test
##
## data: norm_test_c$flipper_length_mm
## W = 0.98891, p-value = 0.8106
norm_test_g <- anova_data %>%
filter(species == "Gentoo")
shapiro.test(norm_test_g$flipper_length_mm)
##
## Shapiro-Wilk normality test
##
## data: norm_test_g$flipper_length_mm
## W = 0.96148, p-value = 0.00176
So our data is normally distributed across the 3 species (apart from Gentoo). But it approximates normality, and so we can continue with our analyses.
Now it's time to define our analysis. In R, we define our ANOVA model using the aov function. From there, our go-to formula is: aov(outcome_variable ~ predictor_group_variable, data).
We then use summary to view the results, and can run the model through post-hoc tests if required.
# Defining our model
model <- aov(flipper_length_mm ~ species, data = anova_data)
# Viewing results
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 50526 25263 567.4 <2e-16 ***
## Residuals 330 14693 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Conducting post hoc analyses
PostHocTest(model,
method = "bonferroni",
conf.level = 0.95, ordered = FALSE)
##
## Posthoc multiple comparisons of means : Bonferroni
## 95% family-wise confidence level
##
## $species
## diff lwr.ci upr.ci pval
## Chinstrap-Adelie 5.72079 3.363512 8.078068 3.8e-08 ***
## Gentoo-Adelie 27.13255 25.149622 29.115487 < 2e-16 ***
## Gentoo-Chinstrap 21.41176 18.970990 23.852539 < 2e-16 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To get hold of our summary statistics, we need to do some data wrangling. This will make use of the tidyverse family of functions.
First off, we use group_by() to tell R which categorical variable we want our summaries to be focused on. We can use as many group variables as we like here, but make sure your order follows any hierarchical patterns of your data… Otherwise, the interpretation might be odd.
From there, we use summarise() to provide our summarised statistics across our chosen group variable. We choose a meaningful name, and then apply a function to our variable. As seen below, you can combine functions. We're wrapping ours in the round() function, so that we can decide how many decimal places we want the analysis to go to.
anova_data %>%
group_by(species) %>% # Try playing around with the "sex" and "island" variables here.
summarise(mean_flipper_length = round(mean(flipper_length_mm), 2),
sd_flipper_length =round(sd(flipper_length_mm), 2))
## # A tibble: 3 × 3
## species mean_flipper_length sd_flipper_length
## <fct> <dbl> <dbl>
## 1 Adelie 190. 6.52
## 2 Chinstrap 196. 7.13
## 3 Gentoo 217. 6.59
# task: include a `median()` summarised statistic for flipper_length. Call it median_flipper_length (if you like).
Now that we have conducted our ANOVA, it’s time to visualise the analysis to ensure that our understanding of the analysis is accurate.
Here we will overlap different plot aspects to communicate our model assumptions. The geom_violin() is used to show the distributions of each group. The geom_boxplot() is used to communicate the median and interquartile range of our data groups. There are so many options for customising, so it can be adjusted entirely for your needs.
ggplot(anova_data,
aes(x = species,
y = flipper_length_mm,
fill = species) # fill allows us to fill an object colour by a variable. Useful here for making a more visually appealing plot.
)+
geom_violin( # violin plots helps us visualise distributions
alpha = .7 # alpha adjusts transparency of points
) +
geom_boxplot(alpha = .7, width = .5) +
theme_apa() +
theme(legend.position = "none") + #
labs(title = "Penguininal Differences in Flipper Lengths",
subtitle = "...",
x = "Penguin Species",
y = "Flipper Length (mm)")
Now to move on to examining correlations. Once again, we will need to perform some initial data tidying to ensure we can run our analyses.
corr_data <- penguins %>%
dplyr::select(flipper_length_mm, bill_length_mm, bill_depth_mm) %>%
na.omit() # we're using na.omit to remove missing values. Another useful function!
stratified_corr_data <- penguins %>%
dplyr::select(flipper_length_mm, bill_length_mm,
bill_depth_mm, species) %>%
group_by(species) %>%
na.omit()
For correlations, we need to inspect the residuals (the distance between each point on a scatter graph and the line of best fit). So we will temporarily delay the assumption tests for now, and jump straight to our analysis.
results <- correlation(corr_data)
results
## # Correlation Matrix (pearson-method)
##
## Parameter1 | Parameter2 | r | 95% CI | t(340) | p
## --------------------------------------------------------------------------------
## flipper_length_mm | bill_length_mm | 0.66 | [ 0.59, 0.71] | 16.03 | < .001***
## flipper_length_mm | bill_depth_mm | -0.58 | [-0.65, -0.51] | -13.26 | < .001***
## bill_length_mm | bill_depth_mm | -0.24 | [-0.33, -0.13] | -4.46 | < .001***
##
## p-value adjustment method: Holm (1979)
## Observations: 342
summary(results, redundant = TRUE)
## # Correlation Matrix (pearson-method)
##
## Parameter | flipper_length_mm | bill_length_mm | bill_depth_mm
## ----------------------------------------------------------------------
## flipper_length_mm | | 0.66*** | -0.58***
## bill_length_mm | 0.66*** | | -0.24***
## bill_depth_mm | -0.58*** | -0.24*** |
##
## p-value adjustment method: Holm (1979)
strat_results <- correlation(stratified_corr_data)
strat_results
## # Correlation Matrix (pearson-method)
##
## Group | Parameter1 | Parameter2 | r | 95% CI | t | df | p
## ----------------------------------------------------------------------------------------------
## Adelie | flipper_length_mm | bill_length_mm | 0.33 | [0.18, 0.46] | 4.21 | 149 | < .001***
## Adelie | flipper_length_mm | bill_depth_mm | 0.31 | [0.16, 0.45] | 3.95 | 149 | < .001***
## Adelie | bill_length_mm | bill_depth_mm | 0.39 | [0.25, 0.52] | 5.19 | 149 | < .001***
## Chinstrap | flipper_length_mm | bill_length_mm | 0.47 | [0.26, 0.64] | 4.34 | 66 | < .001***
## Chinstrap | flipper_length_mm | bill_depth_mm | 0.58 | [0.40, 0.72] | 5.79 | 66 | < .001***
## Chinstrap | bill_length_mm | bill_depth_mm | 0.65 | [0.49, 0.77] | 7.01 | 66 | < .001***
## Gentoo | flipper_length_mm | bill_length_mm | 0.66 | [0.55, 0.75] | 9.69 | 121 | < .001***
## Gentoo | flipper_length_mm | bill_depth_mm | 0.71 | [0.61, 0.79] | 10.98 | 121 | < .001***
## Gentoo | bill_length_mm | bill_depth_mm | 0.64 | [0.53, 0.74] | 9.24 | 121 | < .001***
##
## p-value adjustment method: Holm (1979)
## Observations: 68-151
summary(strat_results, redundant = TRUE)
## # Correlation Matrix (pearson-method)
##
## Group | Parameter | flipper_length_mm | bill_length_mm | bill_depth_mm
## ----------------------------------------------------------------------------------
## Adelie | flipper_length_mm | | 0.33*** | 0.31***
## Adelie | bill_length_mm | 0.33*** | | 0.39***
## Adelie | bill_depth_mm | 0.31*** | 0.39*** |
## Chinstrap | flipper_length_mm | | 0.47*** | 0.58***
## Chinstrap | bill_length_mm | 0.47*** | | 0.65***
## Chinstrap | bill_depth_mm | 0.58*** | 0.65*** |
## Gentoo | flipper_length_mm | | 0.66*** | 0.71***
## Gentoo | bill_length_mm | 0.66*** | | 0.64***
## Gentoo | bill_depth_mm | 0.71*** | 0.64*** |
##
## p-value adjustment method: Holm (1979)
For correlations, testing the assumptions is best done through visualising the data. Now before we even get to residuals, play around and compare the plots when species is and is not accounted for (try uncommenting the color = species lines in the code below). Behold Simpson's Paradox in all its glory!
# data visualisation
ggplot(penguins,
aes(x =bill_length_mm, y = flipper_length_mm
# ,color = species
)) +
geom_point(position = "jitter", alpha = .7 ) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(penguins,
aes(x =bill_depth_mm, y = flipper_length_mm
#,color = species
)) +
geom_point(position = "jitter", alpha = .7 ) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(penguins,
aes(x =bill_length_mm, y = bill_depth_mm
# ,color = species
)) +
geom_point(position = "jitter", alpha = .7 ) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
And now to the final stages of this session. Let’s regress to the regression. We will focus here on predicting penguin flipper length. We will also use some features of R to help you statistically determine which model best explains the variance in the penguin data.
regression_data <- penguins %>%
na.omit()
Defining our linear models in R is nice and simple. It follows a similar formula to the aov() model setting from above. We set our outcome variable first, and then define our predictor variables. As a bonus, R will also automatically dummy code any predictor variables for us!

For those interested in interaction/moderation analyses - R also makes this very simple. Replace the + with a * and it will add the interaction term to your model. Play around with the models below if you want to test it.
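For instance, a minimal sketch of what an interaction model could look like (this model is purely illustrative and is not one of the models we compare below):
# Hypothetical interaction model: `*` includes both main effects plus their interaction term
model_interaction <- lm(flipper_length_mm ~ bill_length_mm * sex,
                        data = regression_data)
summary(model_interaction) # the bill_length_mm:sexmale row is the interaction term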
# Step 1 - defining our models
## general pattern: lm(outcome ~ predictor, data)
model_1 <- lm(flipper_length_mm ~ bill_length_mm,
data = regression_data )
model_2 <- lm(flipper_length_mm ~ bill_length_mm + bill_depth_mm ,
data = regression_data )
model_3 <- lm(flipper_length_mm ~ bill_length_mm + bill_depth_mm + sex ,
data = regression_data )
# Step 2 - viewing and interpreting results
summary(model_1)
##
## Call:
## lm(formula = flipper_length_mm ~ bill_length_mm, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.413 -7.837 0.652 8.360 21.321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 127.3304 4.7291 26.93 <2e-16 ***
## bill_length_mm 1.6738 0.1067 15.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.63 on 331 degrees of freedom
## Multiple R-squared: 0.4265, Adjusted R-squared: 0.4248
## F-statistic: 246.2 on 1 and 331 DF, p-value: < 2.2e-16
summary(model_2)
##
## Call:
## lm(formula = flipper_length_mm ~ bill_length_mm + bill_depth_mm,
## data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.658 -5.813 0.048 6.073 22.929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.21816 6.43671 30.17 <2e-16 ***
## bill_length_mm 1.40892 0.08931 15.78 <2e-16 ***
## bill_depth_mm -3.21782 0.24801 -12.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.663 on 330 degrees of freedom
## Multiple R-squared: 0.6203, Adjusted R-squared: 0.618
## F-statistic: 269.5 on 2 and 330 DF, p-value: < 2.2e-16
summary(model_3)
##
## Call:
## lm(formula = flipper_length_mm ~ bill_length_mm + bill_depth_mm +
## sex, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.2138 -4.6707 0.2532 5.3634 18.9069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 227.68101 6.69289 34.018 <2e-16 ***
## bill_length_mm 1.00057 0.08994 11.125 <2e-16 ***
## bill_depth_mm -4.41011 0.25271 -17.451 <2e-16 ***
## sexmale 9.84594 1.03044 9.555 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.676 on 329 degrees of freedom
## Multiple R-squared: 0.7027, Adjusted R-squared: 0.7
## F-statistic: 259.3 on 3 and 329 DF, p-value: < 2.2e-16
Unfortunately, R does not provide standardised coefficients by default, and getting hold of them ourselves takes a little more work. But fortunately, converting the output to get hold of standardised coefficients has been provided for you here! It makes use of the scale() function, which standardises your continuous variables by turning them into Z-scores. This allows for clearer comparison of effect sizes between coefficients.
As categorical variables are already dummy coded, they do not need to be standardised - as they already have been to some extent.
# Step 1 - defining our models
## general pattern: lm(outcome ~ predictor, data)
standardised_model_3 <- lm(scale(flipper_length_mm) ~ scale(bill_length_mm) + scale(bill_depth_mm) + sex,
data = regression_data )
## Compare standardised and default model. What similarities and differences do you see?
summary(standardised_model_3)
##
## Call:
## lm(formula = scale(flipper_length_mm) ~ scale(bill_length_mm) +
## scale(bill_depth_mm) + sex, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87030 -0.33325 0.01806 0.38267 1.34897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.35441 0.04771 -7.428 9.57e-13 ***
## scale(bill_length_mm) 0.39040 0.03509 11.125 < 2e-16 ***
## scale(bill_depth_mm) -0.61963 0.03551 -17.451 < 2e-16 ***
## sexmale 0.70249 0.07352 9.555 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5477 on 329 degrees of freedom
## Multiple R-squared: 0.7027, Adjusted R-squared: 0.7
## F-statistic: 259.3 on 3 and 329 DF, p-value: < 2.2e-16
summary(model_3)
##
## Call:
## lm(formula = flipper_length_mm ~ bill_length_mm + bill_depth_mm +
## sex, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.2138 -4.6707 0.2532 5.3634 18.9069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 227.68101 6.69289 34.018 <2e-16 ***
## bill_length_mm 1.00057 0.08994 11.125 <2e-16 ***
## bill_depth_mm -4.41011 0.25271 -17.451 <2e-16 ***
## sexmale 9.84594 1.03044 9.555 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.676 on 329 degrees of freedom
## Multiple R-squared: 0.7027, Adjusted R-squared: 0.7
## F-statistic: 259.3 on 3 and 329 DF, p-value: < 2.2e-16
We can use the anova() function to statistically compare our models on the basis of sum of squares and degrees of freedom. R will then perform an F test to help us determine if one model significantly improves upon the other.
# Step 3 - statistically comparing models
anova(model_1, model_2, model_3)
## Analysis of Variance Table
##
## Model 1: flipper_length_mm ~ bill_length_mm
## Model 2: flipper_length_mm ~ bill_length_mm + bill_depth_mm
## Model 3: flipper_length_mm ~ bill_length_mm + bill_depth_mm + sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 331 37401
## 2 330 24767 1 12634.0 214.405 < 2.2e-16 ***
## 3 329 19387 1 5379.9 91.299 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_2, model_3)
## Analysis of Variance Table
##
## Model 1: flipper_length_mm ~ bill_length_mm + bill_depth_mm
## Model 2: flipper_length_mm ~ bill_length_mm + bill_depth_mm + sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 330 24767
## 2 329 19387 1 5379.9 91.299 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By wrapping our model in the plot() function, we can visualise the assumption tests underlying our analyses.
# Step 4 - assumption tests
plot(model_1)
Let's say we want to visualise our third model. We have a few different coefficients to take care of. Thankfully, ggplot allows us to play with a variety of aesthetics (such as colour and facets) to represent several predictors at once.
summary(model_3)
##
## Call:
## lm(formula = flipper_length_mm ~ bill_length_mm + bill_depth_mm +
## sex, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.2138 -4.6707 0.2532 5.3634 18.9069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 227.68101 6.69289 34.018 <2e-16 ***
## bill_length_mm 1.00057 0.08994 11.125 <2e-16 ***
## bill_depth_mm -4.41011 0.25271 -17.451 <2e-16 ***
## sexmale 9.84594 1.03044 9.555 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.676 on 329 degrees of freedom
## Multiple R-squared: 0.7027, Adjusted R-squared: 0.7
## F-statistic: 259.3 on 3 and 329 DF, p-value: < 2.2e-16
ggplot(data = regression_data, # defining the data
aes(x= bill_length_mm, # main predictor variable
col = bill_depth_mm, # setting color as 2nd predictor
y = flipper_length_mm) # y is our outcome variable
) +
geom_point(position = "jitter", # jitter adds a teeny amount of movement to each point. Useful for overlapping data
alpha = .7 # alpha is a measure of transparency. Useful for overlapping points
) +
geom_smooth(method = "lm")+
facet_wrap(~sex) + # Finally setting sex as another predictor by faceting
theme_apa()+
theme(legend.position = "right") +
labs(title = "We can use ggplot to visualise multiple dimensions",
subtitle = "This can be very useful for regression models",
caption = "Just be careful not to overcomplicate it",
x = "Bill Length (mm)",
y = "Flipper Length (mm)",
color = "Bill Depth (mm)"
)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
We can also use R to do some fancy coefficient plotting - with confidence intervals. This can be a useful way to help improve our understanding of our model through using a visual representation of our coefficients. This is achieved using the ggstats package, which allows for quick plots, and versatility with the ggplot features.
# Coefficient plot
ggcoef_table(model_1)
ggcoef_table(model_2)
ggcoef_table(model_3)
The ggstats package also allows us to directly compare our models through the ggcoef_compare() function. This can be very useful in detecting and understanding any mediation and potential moderation effects that might be present in our analysis.
# Comparing coefficient plots
## step 1 - create a list item of our models
models <- list(
"basic model" = model_1,
"second model" = model_2,
"full model" = model_3
)
ggcoef_compare(models) # All in one plot
ggcoef_compare(models, type = "faceted") # Faceted to split by model - easier to interpret.
Linear regressions, ANOVAs, and t-tests are practically the same thing (mathematically, the ANOVA and the t-test are just specific forms of linear models).
R is more forgiving than SPSS here, as we do not need to manually dummy code our categorical variables! Fewer accidental errors for us.
This allows us to play around and compare between these analyses to help improve our conceptual understanding of the analyses we're conducting.
reg_model <- lm(flipper_length_mm ~ species ,
regression_data)
summary(reg_model)
##
## Call:
## lm(formula = flipper_length_mm ~ species, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1027 -4.8235 -0.1027 4.7647 19.8973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.1027 0.5522 344.25 < 2e-16 ***
## speciesChinstrap 5.7208 0.9796 5.84 1.25e-08 ***
## speciesGentoo 27.1326 0.8241 32.92 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.673 on 330 degrees of freedom
## Multiple R-squared: 0.7747, Adjusted R-squared: 0.7734
## F-statistic: 567.4 on 2 and 330 DF, p-value: < 2.2e-16
We can transform our linear model into an ANOVA-style output as well. Setting + 0 removes the intercept, so that each species (including the reference level) gets its own coefficient: its group mean. This in effect converts our analysis to an ANOVA with post-hoc tests.
Try comparing the output of the linear regression model with the anova_style_model below. What do you notice about the coefficients?
#
anova_style_model <- lm(flipper_length_mm ~ species + 0,
regression_data)
summary(anova_style_model)
##
## Call:
## lm(formula = flipper_length_mm ~ species + 0, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1027 -4.8235 -0.1027 4.7647 19.8973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## speciesAdelie 190.1027 0.5522 344.2 <2e-16 ***
## speciesChinstrap 195.8235 0.8092 242.0 <2e-16 ***
## speciesGentoo 217.2353 0.6117 355.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.673 on 330 degrees of freedom
## Multiple R-squared: 0.9989, Adjusted R-squared: 0.9989
## F-statistic: 1.011e+05 on 3 and 330 DF, p-value: < 2.2e-16
Now let’s re-run our ANOVA and post-hoc tests to compare with the regression outputs. What do we notice?
anova_model <- aov(flipper_length_mm ~ species , regression_data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 50526 25263 567.4 <2e-16 ***
## Residuals 330 14693 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
post_hoc <- PostHocTest(anova_model,
method = "lsd",
conf.level = 0.95, ordered = FALSE)
post_hoc
##
## Posthoc multiple comparisons of means : Fisher LSD
## 95% family-wise confidence level
##
## $species
## diff lwr.ci upr.ci pval
## Chinstrap-Adelie 5.72079 3.793645 7.647935 1.3e-08 ***
## Gentoo-Adelie 27.13255 25.511448 28.753661 < 2e-16 ***
## Gentoo-Chinstrap 21.41176 19.416359 23.407171 < 2e-16 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing outputs - look at the F value, model significance, degrees of freedom, and even the model coefficients.
What similarities/differences do you notice?
summary(reg_model)
##
## Call:
## lm(formula = flipper_length_mm ~ species, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1027 -4.8235 -0.1027 4.7647 19.8973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.1027 0.5522 344.25 < 2e-16 ***
## speciesChinstrap 5.7208 0.9796 5.84 1.25e-08 ***
## speciesGentoo 27.1326 0.8241 32.92 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.673 on 330 degrees of freedom
## Multiple R-squared: 0.7747, Adjusted R-squared: 0.7734
## F-statistic: 567.4 on 2 and 330 DF, p-value: < 2.2e-16
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 50526 25263 567.4 <2e-16 ***
## Residuals 330 14693 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(anova_style_model)
##
## Call:
## lm(formula = flipper_length_mm ~ species + 0, data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1027 -4.8235 -0.1027 4.7647 19.8973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## speciesAdelie 190.1027 0.5522 344.2 <2e-16 ***
## speciesChinstrap 195.8235 0.8092 242.0 <2e-16 ***
## speciesGentoo 217.2353 0.6117 355.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.673 on 330 degrees of freedom
## Multiple R-squared: 0.9989, Adjusted R-squared: 0.9989
## F-statistic: 1.011e+05 on 3 and 330 DF, p-value: < 2.2e-16
post_hoc
##
## Posthoc multiple comparisons of means : Fisher LSD
## 95% family-wise confidence level
##
## $species
## diff lwr.ci upr.ci pval
## Chinstrap-Adelie 5.72079 3.793645 7.647935 1.3e-08 ***
## Gentoo-Adelie 27.13255 25.511448 28.753661 < 2e-16 ***
## Gentoo-Chinstrap 21.41176 19.416359 23.407171 < 2e-16 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now have a look at the box plot below. How does this translate to each of the model summaries above?
ggplot(regression_data,
aes( x= species, y = flipper_length_mm, fill = species)) +
geom_point(position = "jitter", alpha = .7) +
geom_boxplot(alpha = .6)
In this second part we are going to explore how to import and organise data and how to export the results of your analysis.
Tips:
- Make sure to save your script in the same folder as your dataset.
- This will make your life much easier when it comes to re-running your analysis.
- With R, we can work in a way that protects your raw data and allows you to retrace your steps.
- We can also use R to save new versions of our data, in any format we want. This can be very useful if you want to re-run your analyses in another software package (e.g., AMOS for path analyses).
This is a more technical aspect of using R. But, doing so will help ensure good data management practices. So whilst it may appear scary, this will help to keep your research organised, your files discoverable, and your ethics committee happy. The key concept for today is that we need all aspects of our analysis in the same folder. This includes the script we are working with, and the data set. So let's run through these steps:
- Go to Session on the top panel, move to Set Working Directory, and click To Source File Location. This will tell R to work within the folder that your current R script is saved in.
- Alternatively, if you are switched on with your path locations, you can use setwd("~.../.../") [fill in the gaps as appropriate]. However, this can be tricky. I tend to copy and paste from the above steps, as I'm prone to typos.

A key aspect of any data analysis is to keep our raw data unchanged, so that we can re-trace our steps, correct any errors, and make our analyses reproducible. R makes this very easy, as you would need to manually code a re-save if you wanted to overwrite your raw data… which takes deliberate effort.
Instead, it allows us to work with a copy of our data, which we can rename at a whim. It only exists within the session, until we decide to export it - which again takes deliberate effort. If however you are still concerned about protecting the integrity of your raw data, you can always create a new sub-folder to store a backup of your raw-data.
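For example, here is a minimal sketch of that backup step. The folder and file names are just placeholders, and it assumes the data has already been read in as df_raw (which we do just below).
dir.create("raw_backup", showWarnings = FALSE) # create the backup sub-folder if it does not already exist
write_csv(df_raw, "raw_backup/survey_data_raw.csv") # save an untouched copy of the raw data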
Anyhow, back to uploading the data! Within R, we have two ways of doing this. The first is through the graphic interface, which we can find by clicking the table icon in the Environment tab. The second is through coding. As I am lazy/"pragmatic", I tend to use the graphic interface for my initial uploading of the data, and then I copy and paste the generated code into my script. This allows me to quickly re-run my code in future sessions without having to navigate the folder system of my computer.
df_raw<-read_csv("~survey_data.csv")
## Rows: 42 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (27): id, group, eventsscore, dass1, dass2, dass3, dass4, dass5, dass6, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Today we will be using a questionnaire-based study. The research was investigating the relationships between stressful life events, psychological distress and social support.
The data contains a convenience sample of 40 individuals who attend a community group.
The group had recently run a ten-week mindfulness training course; therefore, the researcher also recorded whether each respondent had completed this course. This allows us to compare whether the course was effective in reducing symptoms of stress, anxiety and depression.
You will see that the following variables are in the data set: id, group, number of stressful events experienced in the last year (eventsscore), scores for each of the 21 items on the Depression, Anxiety and Stress Scale (DASS; dass1 - dass21), social support from significant other (SOsupport), friends (friendsupport) and family (famsupport).
The DASS-21 scale has three subscales of 7 items each. The items for each scale are:
Stress: items 1, 6, 8, 11, 12, 14, 18
Anxiety: items 2, 4, 7, 9, 15, 19, 20
Depression: items 3, 5, 10, 13, 16, 17, 21
This is where R comes into its own. Data tidying in R is fantastic. Our script is our own personal notebook - every step of the data tidying process is recorded. And if we need to make any changes for any reason (looking at you reviewer #2…), these changes can be applied and re-run with little effort - meaning that we do not have to repeat the steps of coding and calculating any following analysis, we just get to re-run the analysis script.
All the steps required for tidying can be overwhelming. My tip is to tidy in layers. Perform each stage separately in manageable bite-sized pieces. The most important aspect for this stage is not the coding itself, but rather your note keeping. Use the hashtag # to note what you are doing, and why (especially if complicated).
And don’t worry if you ever feel that a problem is too complex to manage. Do what you can in R, and that is good enough for now. As you build more confidence and experience, you’ll be able to start coding up any additional steps that are required (and be better equipped at knowing what to search to solve your problems!).
The key part of this exercise will be the trusty pipe operator %>%. This is a tidyverse function that is very powerful when it comes to data tidying. It effectively allows us to take an R object and apply changes to it. Its power comes from it being readable. For now, you can translate it as "take this object from the left, and do that function on the right to it".
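As a tiny illustrative sketch (assuming the df_raw data we import below):
# Without the pipe, functions wrap around each other and read inside-out
summary(filter(df_raw, group == 1))
# With the pipe: "take df_raw, then keep the group 1 rows, then summarise them"
df_raw %>%
  filter(group == 1) %>%
  summary()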
head(df_raw) # view first 6 rows
## # A tibble: 6 × 27
## id group eventsscore dass1 dass2 dass3 dass4 dass5 dass6 dass7 dass8 dass9
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 3 1 0 1 0 2 2 0 0 0
## 2 2 1 4 3 1 NA 1 1 3 1 2 2
## 3 3 2 1 2 1 1 1 2 2 1 1 1
## 4 4 1 7 3 2 3 3 NA 3 2 3 2
## 5 5 2 3 0 0 0 0 1 0 0 1 0
## 6 6 1 7 2 0 NA 1 0 1 0 1 1
## # ℹ 15 more variables: dass10 <dbl>, dass11 <dbl>, dass12 <dbl>, dass13 <dbl>,
## # dass14 <dbl>, dass15 <dbl>, dass16 <dbl>, dass17 <dbl>, dass18 <dbl>,
## # dass19 <dbl>, dass20 <dbl>, dass21 <dbl>, SOsupport <dbl>,
## # friendsupport <dbl>, famsupport <dbl>
tail(df_raw) # view last 6 rows
## # A tibble: 6 × 27
## id group eventsscore dass1 dass2 dass3 dass4 dass5 dass6 dass7 dass8 dass9
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 37 2 2 1 0 0 0 1 2 0 1 0
## 2 38 2 3 0 0 0 0 NA 1 0 0 1
## 3 39 2 6 1 1 1 1 1 1 1 1 1
## 4 40 2 5 2 0 2 0 3 2 1 1 1
## 5 41 2 3 NA NA NA NA NA NA NA NA NA
## 6 42 1 2 3 1 0 1 2 NA NA NA NA
## # ℹ 15 more variables: dass10 <dbl>, dass11 <dbl>, dass12 <dbl>, dass13 <dbl>,
## # dass14 <dbl>, dass15 <dbl>, dass16 <dbl>, dass17 <dbl>, dass18 <dbl>,
## # dass19 <dbl>, dass20 <dbl>, dass21 <dbl>, SOsupport <dbl>,
## # friendsupport <dbl>, famsupport <dbl>
summary(df_raw) # View summary of data - is everything formatted correctly?
## id group eventsscore dass1
## Min. : 1.00 Min. :1.000 Min. :0.000 Min. :0.000
## 1st Qu.:11.25 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000
## Median :21.50 Median :2.000 Median :3.000 Median :1.000
## Mean :21.50 Mean :1.595 Mean :3.738 Mean :1.341
## 3rd Qu.:31.75 3rd Qu.:2.000 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :42.00 Max. :2.000 Max. :9.000 Max. :3.000
## NA's :1
## dass2 dass3 dass4 dass5 dass6
## Min. :0.0000 Min. :0 Min. :0.0000 Min. :0.000 Min. :0.00
## 1st Qu.:0.0000 1st Qu.:0 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.00
## Median :1.0000 Median :1 Median :1.0000 Median :1.000 Median :1.00
## Mean :0.8537 Mean :1 Mean :0.8537 Mean :1.297 Mean :1.35
## 3rd Qu.:1.0000 3rd Qu.:2 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:2.00
## Max. :3.0000 Max. :3 Max. :3.0000 Max. :3.000 Max. :3.00
## NA's :1 NA's :5 NA's :1 NA's :5 NA's :2
## dass7 dass8 dass9 dass10 dass11
## Min. :0.000 Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.000 1st Qu.:0.0
## Median :1.000 Median :1.000 Median :1.0 Median :1.000 Median :1.0
## Mean :0.675 Mean :1.175 Mean :0.9 Mean :1.111 Mean :1.3
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.0 3rd Qu.:2.000 3rd Qu.:2.0
## Max. :2.000 Max. :3.000 Max. :3.0 Max. :3.000 Max. :3.0
## NA's :2 NA's :2 NA's :2 NA's :6 NA's :2
## dass12 dass13 dass14 dass15 dass16
## Min. :0.000 Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.0 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :1.000 Median :1.0 Median :1.000 Median :1.000
## Mean :1.275 Mean :1.075 Mean :1.4 Mean :1.075 Mean :1.083
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.0 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :3.000 Max. :3.000 Max. :3.0 Max. :3.000 Max. :3.000
## NA's :2 NA's :2 NA's :2 NA's :2 NA's :6
## dass17 dass18 dass19 dass20
## Min. :0.0000 Min. :0.000 Min. :0.0 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.000
## Median :1.0000 Median :1.000 Median :1.0 Median :1.000
## Mean :0.9444 Mean :1.225 Mean :0.9 Mean :0.875
## 3rd Qu.:2.0000 3rd Qu.:2.000 3rd Qu.:1.0 3rd Qu.:1.250
## Max. :3.0000 Max. :3.000 Max. :3.0 Max. :3.000
## NA's :6 NA's :2 NA's :2 NA's :2
## dass21 SOsupport friendsupport famsupport
## Min. :0.0000 Min. : 4.00 Min. : 7.00 Min. : 4.00
## 1st Qu.:0.0000 1st Qu.:19.75 1st Qu.:19.00 1st Qu.:16.75
## Median :1.0000 Median :24.50 Median :20.50 Median :20.00
## Mean :0.9722 Mean :22.70 Mean :20.95 Mean :19.10
## 3rd Qu.:2.0000 3rd Qu.:28.00 3rd Qu.:24.75 3rd Qu.:23.25
## Max. :3.0000 Max. :28.00 Max. :28.00 Max. :28.00
## NA's :6 NA's :2 NA's :2 NA's :2
To remind ourselves of what we have done, we will use comments to summarise the process. Additionally, we will call the new data df_1 (actually, you can call it whatever you want. I like using numbers for my stages, as it intuitively reminds me of the processes of my tidying).
Below, we will relabel and recode the group variable to make it more meaningful for our interpretation. R can handle these factors, and will do the required dummy coding behind the scenes. So no more needing to double check what our numbers represent!
df_1 <- df_raw %>%
mutate( group = case_when(group == "1"~ "control",
group == "2" ~ "experiment"),
group = as.factor(group)
) # Using mutate to recode variables appropriately
summary(df_1$group)
## control experiment
## 17 25
We may have also noticed that our dataset contains missing values. Thankfully R has loads of useful packages to make working with missing data much easier. Getting to know what's missing in your life/data is very important to ensure that our analysis is accurate. For today, we will give a quick and easy overview of working with the missing data, using the naniar package.
pct_miss(df_1) # provides percentage of total missing data
## [1] 5.908289
miss_case_summary(df_1) # breaks down missing data percentage by each case
## # A tibble: 42 × 3
## case n_miss pct_miss
## <int> <int> <dbl>
## 1 41 24 88.9
## 2 42 19 70.4
## 3 2 2 7.41
## 4 4 2 7.41
## 5 6 2 7.41
## 6 17 2 7.41
## 7 20 2 7.41
## 8 21 2 7.41
## 9 25 2 7.41
## 10 8 1 3.70
## # ℹ 32 more rows
vis_miss(df_1) # Visually inspect data
We can see that case 41 and 42 have a lot of missing entries from the above summary. These participants need to be removed.
df_2 <- df_1[-c(41,42), ] # using base R to manually remove participants, using the minus symbol.
pct_miss(df_2)
## [1] 2.222222
miss_case_summary(df_2)
## # A tibble: 40 × 3
## case n_miss pct_miss
## <int> <int> <dbl>
## 1 2 2 7.41
## 2 4 2 7.41
## 3 6 2 7.41
## 4 17 2 7.41
## 5 20 2 7.41
## 6 21 2 7.41
## 7 25 2 7.41
## 8 8 1 3.70
## 9 10 1 3.70
## 10 16 1 3.70
## # ℹ 30 more rows
vis_miss(df_2)
gg_miss_var(df_2) # using gg_miss_var to visualise missing data by variable.
For the others, we will impute based on row-mean values within each subscale. Whilst not as sophisticated as methods such as Multiple Imputation by Chained Equations, it is much simpler to understand conceptually. It is also a robust alternative that has comparable effectiveness for dealing with missing data in psychometric subscales under conditions of a low percentage of Missing Completely At Random data (Parent, 2012). The missing data for today was simulated to match these conditions.
Other options are more appropriate for robust missing data analysis, and examples of working with such instances of missing data can be found in our CDCS Much Ado About Nothing GitHub materials (https://github.com/DCS-training/Much-ado-about-nothing-missing-data-in-research).
Conveniently, we will notice that all of our missing variables all belong to the depression subscale… How magically convenient! This may/will not be the case for real data, but it is helpful for today, as it stops us being overwhelmed with code.
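#Step 1 - selecting the depression subscale items that contain missing values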
Depression_data <- df_2 %>%
dplyr::select(dass3, dass5, dass10, dass16, dass17, dass21)
#Step 2 - applying rowwise imputation to the selected data
Depression_fix <- apply_imputation(Depression_data,
FUN = mean,
type = "rowwise")
#Step 3 - updating dataset with imputed data - do this by subscale to help brain manage.
df_3 <- df_2 %>% mutate(
dass3 = Depression_fix$dass3,
dass5 = Depression_fix$dass5,
dass10 = Depression_fix$dass10,
dass16 = Depression_fix$dass16,
dass17 = Depression_fix$dass17,
dass21 = Depression_fix$dass21,
)
## Task - compute the DASS subscales. Stress has already been calculated in the code below as an example. Don't forget to use a comma between each command in the `mutate()` function.
# Stress: items dass1, dass6, dass8, dass11, dass12, dass14, dass18
#
# Anxiety: items dass2, dass4, dass7, dass9, dass15, dass19, dass20
#
# Depression: items dass3, dass5, dass10, dass13, dass16, dass17, dass21
#Uncomment the tip below and complete the missing bits
#df_4 <- df_3 %>%
#mutate( # Using mutate to create composite variables
#stress = dass1 + dass6 + dass8 + dass11 + dass12 + dass14 + dass18,
#anxiety = ... ,
# depression = ...
# )
Below is the solution to the exercise above.
## Results
df_4 <- df_3 %>%
mutate( # Using mutate to create composite variables
stress = dass1 + dass6 + dass8 + dass11 + dass12 + dass14 + dass18,
anxiety = dass2 + dass4 + dass7 + dass9 + dass15 + dass19 + dass20 ,
depression = dass3 + dass5 + dass10 + dass13 + dass16 + dass17 + dass21
)
Here we will be performing the bane of most researchers: filtering. This includes removing suspect bots, removing duplicates, removing participants who did not finish the survey, and removing outliers.
Today, we will focus on removing outliers on the basis of Z-scores. This is when the data is standardised so that the mean value is 0 and each participant's score is expressed as its position in standard deviation units. Any participant with a Z-score below -3 or above +3 can be considered a large/influential outlier (Osborne & Overbay, 2004). As such, there is justification to remove them from the data. Prior to this, we need to inspect our data.
summary(df_4) # Using summary to inspect our data
## id group eventsscore dass1 dass2
## Min. : 1.00 control :16 Min. :0.0 Min. :0.0 Min. :0.00
## 1st Qu.:10.75 experiment:24 1st Qu.:2.0 1st Qu.:1.0 1st Qu.:0.00
## Median :20.50 Median :3.5 Median :1.0 Median :1.00
## Mean :20.50 Mean :3.8 Mean :1.3 Mean :0.85
## 3rd Qu.:30.25 3rd Qu.:5.0 3rd Qu.:2.0 3rd Qu.:1.00
## Max. :40.00 Max. :9.0 Max. :3.0 Max. :3.00
## dass3 dass4 dass5 dass6 dass7
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:0.000
## Median :1.000 Median :1.00 Median :1.000 Median :1.00 Median :1.000
## Mean :1.069 Mean :0.85 Mean :1.222 Mean :1.35 Mean :0.675
## 3rd Qu.:2.000 3rd Qu.:1.00 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:1.000
## Max. :3.000 Max. :3.00 Max. :3.000 Max. :3.00 Max. :2.000
## dass8 dass9 dass10 dass11 dass12
## Min. :0.000 Min. :0.0 Min. :0.00 Min. :0.0 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.00 1st Qu.:0.0 1st Qu.:0.000
## Median :1.000 Median :1.0 Median :1.00 Median :1.0 Median :1.000
## Mean :1.175 Mean :0.9 Mean :1.01 Mean :1.3 Mean :1.275
## 3rd Qu.:2.000 3rd Qu.:1.0 3rd Qu.:2.00 3rd Qu.:2.0 3rd Qu.:2.000
## Max. :3.000 Max. :3.0 Max. :3.00 Max. :3.0 Max. :3.000
## dass13 dass14 dass15 dass16 dass17
## Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.0 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :1.0 Median :1.000 Median :1.000 Median :1.000
## Mean :1.075 Mean :1.4 Mean :1.075 Mean :1.056 Mean :0.865
## 3rd Qu.:2.000 3rd Qu.:2.0 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.250
## Max. :3.000 Max. :3.0 Max. :3.000 Max. :3.000 Max. :3.000
## dass18 dass19 dass20 dass21 SOsupport
## Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.0000 Min. : 4.00
## 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:19.75
## Median :1.000 Median :1.0 Median :1.000 Median :1.0000 Median :24.50
## Mean :1.225 Mean :0.9 Mean :0.875 Mean :0.9625 Mean :22.70
## 3rd Qu.:2.000 3rd Qu.:1.0 3rd Qu.:1.250 3rd Qu.:1.8125 3rd Qu.:28.00
## Max. :3.000 Max. :3.0 Max. :3.000 Max. :3.0000 Max. :28.00
## friendsupport famsupport stress anxiety
## Min. : 7.00 Min. : 4.00 Min. : 1.000 Min. : 1.000
## 1st Qu.:19.00 1st Qu.:16.75 1st Qu.: 3.000 1st Qu.: 2.000
## Median :20.50 Median :20.00 Median : 7.500 Median : 4.500
## Mean :20.95 Mean :19.10 Mean : 9.025 Mean : 6.125
## 3rd Qu.:24.75 3rd Qu.:23.25 3rd Qu.:13.250 3rd Qu.: 9.250
## Max. :28.00 Max. :28.00 Max. :21.000 Max. :19.000
## depression
## Min. : 0.000
## 1st Qu.: 1.875
## Median : 5.650
## Mean : 7.260
## 3rd Qu.:12.125
## Max. :21.000
# Using plots to visually inspect our data
boxplot(df_4$SOsupport) # Any outliers in boxplot?
hist(df_4$anxiety) # How is the data distributed? (Does it make sense regarding the literature?)
## Task - visually inspect the outliers and distribution of the stress and depression subscales below:
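One possible answer to the task above, following the same pattern as the SOsupport and anxiety examples:
boxplot(df_4$stress)      # Any outliers in the stress subscale?
hist(df_4$stress)         # How are the stress scores distributed?
boxplot(df_4$depression)  # Any outliers in the depression subscale?
hist(df_4$depression)     # How are the depression scores distributed?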
Now that we have visually inspected our data, we can apply the code to remove our outliers.
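As a reminder of what scale() (used in the code below) is doing under the hood, the Z-score is simply each value minus the variable's mean, divided by its standard deviation. A minimal sketch with made-up numbers:
x <- c(2, 4, 4, 4, 5, 5, 7, 9)     # illustrative values
z_manual <- (x - mean(x)) / sd(x)  # Z-scores computed by hand
z_scaled <- as.numeric(scale(x))   # the same result using scale()
all.equal(z_manual, z_scaled)      # TRUE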
# Creating Z score of study measures - useful for subjectively removing outliers.
df_5 <- df_4 %>%
mutate(Z_score_life_events = scale(eventsscore),
Z_score_anxiety = scale(anxiety),
Z_score_stress = scale(stress),
Z_score_depression = scale(depression),
Z_score_SOsupport = scale(SOsupport),
Z_score_friendsupport = scale(friendsupport),
Z_score_famsupport = scale(famsupport)
)
summary(df_5) # Which variable has a Z-score that breaks our +3/-3 boundary?
## id group eventsscore dass1 dass2
## Min. : 1.00 control :16 Min. :0.0 Min. :0.0 Min. :0.00
## 1st Qu.:10.75 experiment:24 1st Qu.:2.0 1st Qu.:1.0 1st Qu.:0.00
## Median :20.50 Median :3.5 Median :1.0 Median :1.00
## Mean :20.50 Mean :3.8 Mean :1.3 Mean :0.85
## 3rd Qu.:30.25 3rd Qu.:5.0 3rd Qu.:2.0 3rd Qu.:1.00
## Max. :40.00 Max. :9.0 Max. :3.0 Max. :3.00
## dass3 dass4 dass5 dass6 dass7
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:0.000
## Median :1.000 Median :1.00 Median :1.000 Median :1.00 Median :1.000
## Mean :1.069 Mean :0.85 Mean :1.222 Mean :1.35 Mean :0.675
## 3rd Qu.:2.000 3rd Qu.:1.00 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:1.000
## Max. :3.000 Max. :3.00 Max. :3.000 Max. :3.00 Max. :2.000
## dass8 dass9 dass10 dass11 dass12
## Min. :0.000 Min. :0.0 Min. :0.00 Min. :0.0 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.00 1st Qu.:0.0 1st Qu.:0.000
## Median :1.000 Median :1.0 Median :1.00 Median :1.0 Median :1.000
## Mean :1.175 Mean :0.9 Mean :1.01 Mean :1.3 Mean :1.275
## 3rd Qu.:2.000 3rd Qu.:1.0 3rd Qu.:2.00 3rd Qu.:2.0 3rd Qu.:2.000
## Max. :3.000 Max. :3.0 Max. :3.00 Max. :3.0 Max. :3.000
## dass13 dass14 dass15 dass16 dass17
## Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.0 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :1.0 Median :1.000 Median :1.000 Median :1.000
## Mean :1.075 Mean :1.4 Mean :1.075 Mean :1.056 Mean :0.865
## 3rd Qu.:2.000 3rd Qu.:2.0 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.250
## Max. :3.000 Max. :3.0 Max. :3.000 Max. :3.000 Max. :3.000
## dass18 dass19 dass20 dass21 SOsupport
## Min. :0.000 Min. :0.0 Min. :0.000 Min. :0.0000 Min. : 4.00
## 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:19.75
## Median :1.000 Median :1.0 Median :1.000 Median :1.0000 Median :24.50
## Mean :1.225 Mean :0.9 Mean :0.875 Mean :0.9625 Mean :22.70
## 3rd Qu.:2.000 3rd Qu.:1.0 3rd Qu.:1.250 3rd Qu.:1.8125 3rd Qu.:28.00
## Max. :3.000 Max. :3.0 Max. :3.000 Max. :3.0000 Max. :28.00
## friendsupport famsupport stress anxiety
## Min. : 7.00 Min. : 4.00 Min. : 1.000 Min. : 1.000
## 1st Qu.:19.00 1st Qu.:16.75 1st Qu.: 3.000 1st Qu.: 2.000
## Median :20.50 Median :20.00 Median : 7.500 Median : 4.500
## Mean :20.95 Mean :19.10 Mean : 9.025 Mean : 6.125
## 3rd Qu.:24.75 3rd Qu.:23.25 3rd Qu.:13.250 3rd Qu.: 9.250
## Max. :28.00 Max. :28.00 Max. :21.000 Max. :19.000
## depression Z_score_life_events.V1 Z_score_anxiety.V1
## Min. : 0.000 Min. :-1.6847872 Min. :-0.9199542
## 1st Qu.: 1.875 1st Qu.:-0.7980571 1st Qu.:-0.7404509
## Median : 5.650 Median :-0.1330095 Median :-0.2916928
## Mean : 7.260 Mean : 0.0000000 Mean : 0.0000000
## 3rd Qu.:12.125 3rd Qu.: 0.5320381 3rd Qu.: 0.5609477
## Max. :21.000 Max. : 2.3054983 Max. : 2.3111044
## Z_score_stress.V1 Z_score_depression.V1 Z_score_SOsupport.V1
## Min. :-1.2580338 Min. :-1.1032597 Min. :-3.182635
## 1st Qu.:-0.9445051 1st Qu.:-0.8183269 1st Qu.:-0.502073
## Median :-0.2390656 Median :-0.2446623 Median : 0.306350
## Mean : 0.0000000 Mean : 0.0000000 Mean : 0.000000
## 3rd Qu.: 0.6623293 3rd Qu.: 0.7393056 3rd Qu.: 0.902030
## Max. : 1.8772529 Max. : 2.0879874 Max. : 0.902030
## Z_score_friendsupport.V1 Z_score_famsupport.V1
## Min. :-2.3888948 Min. :-2.3771801
## 1st Qu.:-0.3339315 1st Qu.:-0.3699585
## Median :-0.0770611 Median : 0.1416862
## Mean : 0.0000000 Mean : 0.0000000
## 3rd Qu.: 0.6507384 3rd Qu.: 0.6533309
## Max. : 1.2072909 Max. : 1.4011194
### Note: Z-scores between -3 and +3 are recommended as cut-offs for non-exponential data. This may not always be appropriate, so know your data first.
#We can use the results of the above summary to identify the variable that needs filtering.
df_6 <- df_5 %>% filter(
Z_score_SOsupport <= 3 ,
Z_score_SOsupport >= -3
)
## Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
## ℹ Please use one dimensional logical vectors instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
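This warning appears because scale() returns a one-column matrix rather than a plain numeric vector. It is harmless here, but if you want to avoid it, one option is to convert the scaled values to a plain numeric vector when creating the Z-score columns. The name df_5_alt and the generated column names below are illustrative:
# Converting the scale() output with as.numeric() avoids the one-column-matrix warning
df_5_alt <- df_4 %>%
  mutate(across(c(eventsscore, anxiety, stress, depression,
                  SOsupport, friendsupport, famsupport),
                ~ as.numeric(scale(.x)),
                .names = "Z_score_{.col}"))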
summary(df_6)
## id group eventsscore dass1
## Min. : 1.00 control :16 Min. :0.000 Min. :0.000
## 1st Qu.:10.50 experiment:23 1st Qu.:2.000 1st Qu.:1.000
## Median :20.00 Median :4.000 Median :1.000
## Mean :20.46 Mean :3.846 Mean :1.308
## 3rd Qu.:30.50 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :40.00 Max. :9.000 Max. :3.000
## dass2 dass3 dass4 dass5
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000
## Median :1.0000 Median :1.000 Median :1.0000 Median :1.000
## Mean :0.8462 Mean :1.071 Mean :0.8462 Mean :1.228
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :3.0000 Max. :3.000 Max. :3.0000 Max. :3.000
## dass6 dass7 dass8 dass9
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.000 Median :1.0000 Median :1.000 Median :1.0000
## Mean :1.359 Mean :0.6667 Mean :1.179 Mean :0.8974
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.000 Max. :2.0000 Max. :3.000 Max. :3.0000
## dass10 dass11 dass12 dass13 dass14
## Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.00
## Median :1.00 Median :1.000 Median :1.000 Median :1.000 Median :1.00
## Mean :1.01 Mean :1.308 Mean :1.282 Mean :1.077 Mean :1.41
## 3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.00
## Max. :3.00 Max. :3.000 Max. :3.000 Max. :3.000 Max. :3.00
## dass15 dass16 dass17 dass18
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.500
## Median :1.000 Median :1.000 Median :1.0000 Median :1.000
## Mean :1.077 Mean :1.058 Mean :0.8615 Mean :1.256
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.5000 3rd Qu.:2.000
## Max. :3.000 Max. :3.000 Max. :3.0000 Max. :3.000
## dass19 dass20 dass21 SOsupport
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 8.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:20.00
## Median :1.0000 Median :1.0000 Median :1.0000 Median :25.00
## Mean :0.8974 Mean :0.8974 Mean :0.9615 Mean :23.18
## 3rd Qu.:1.0000 3rd Qu.:1.5000 3rd Qu.:1.8750 3rd Qu.:28.00
## Max. :3.0000 Max. :3.0000 Max. :3.0000 Max. :28.00
## friendsupport famsupport stress anxiety
## Min. : 7.00 Min. : 4.00 Min. : 1.000 Min. : 1.000
## 1st Qu.:19.00 1st Qu.:16.50 1st Qu.: 3.000 1st Qu.: 2.000
## Median :21.00 Median :20.00 Median : 8.000 Median : 4.000
## Mean :21.31 Mean :18.87 Mean : 9.103 Mean : 6.128
## 3rd Qu.:25.50 3rd Qu.:23.00 3rd Qu.:13.500 3rd Qu.: 9.500
## Max. :28.00 Max. :28.00 Max. :21.000 Max. :19.000
## depression Z_score_life_events.V1 Z_score_anxiety.V1
## Min. : 0.000 Min. :-1.6847872 Min. :-0.9199542
## 1st Qu.: 1.750 1st Qu.:-0.7980571 1st Qu.:-0.7404509
## Median : 5.500 Median : 0.0886730 Median :-0.3814444
## Mean : 7.267 Mean : 0.0204630 Mean : 0.0005753
## 3rd Qu.:12.250 3rd Qu.: 0.5320381 3rd Qu.: 0.6058235
## Max. :21.000 Max. : 2.3054983 Max. : 2.3111044
## Z_score_stress.V1 Z_score_depression.V1 Z_score_SOsupport.V1
## Min. :-1.2580338 Min. :-1.1032597 Min. :-2.5018576
## 1st Qu.:-0.9445051 1st Qu.:-0.8373224 1st Qu.:-0.4595249
## Median :-0.1606834 Median :-0.2674569 Median : 0.3914471
## Mean : 0.0121593 Mean : 0.0010131 Mean : 0.0816060
## 3rd Qu.: 0.7015204 3rd Qu.: 0.7583011 3rd Qu.: 0.9020303
## Max. : 1.8772529 Max. : 2.0879874 Max. : 0.9020303
## Z_score_friendsupport.V1 Z_score_famsupport.V1
## Min. :-2.3888948 Min. :-2.3771801
## 1st Qu.:-0.3339315 1st Qu.:-0.4093158
## Median : 0.0085623 Median : 0.1416862
## Mean : 0.0612537 Mean :-0.0359261
## 3rd Qu.: 0.7791736 3rd Qu.: 0.6139737
## Max. : 1.2072909 Max. : 1.4011194
boxplot(df_6$SOsupport) # New boxplot
boxplot(df_4$SOsupport) # Boxplot with outliers
Now that our data is ready, we can save a copy. This can save us time on future projects, by sparing us the hassle of needing to re-run our tidying script every time. It also allows us to share our data with others who might not want to go through the hassle/pain of data tidying. Because our working location is already set, our newly saved data will go straight into our project folder.
For today, we will work with CSV as the file format. It's the recommended format, as it can be opened with many different pieces of software. In terms of data safety it is also preferred, as the format is not tied to any one program and so is safe from system updates or software being discontinued. This is because CSV files keep data organised through simple yet effective comma-separated values (hence the name). This also allows them to store much larger datasets than SAV and xlsx file formats, as there are fewer complications to manage.
This makes it a commonly used file format in data science activities… So if you ever want to make life difficult for anyone who might steal personal data, make sure to include plenty of commas in your responses. It will beautifully vandalise their dataset.
#Make sure to check your folder to view your data. Open it up in excel or SPSS if you like!
write.csv(df_6, "~tidy_data.csv")
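One optional tweak: write.csv() adds row names as an extra first column by default. If you would rather keep the file to just your variables, you can switch that off (everything else mirrors the call above):
write.csv(df_6, "~tidy_data.csv", row.names = FALSE) # same file, without the automatic row-name column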
After the above step is complete, we can now re-run our analysis without having to retrace the steps of cleaning every time. In my own practice, I keep a separate script for tidying and a separate script for analysis. I will then use the tidied data for my analysis. Now it is time for a quick break, and then we will continue in our part-2 analysis document.
Osborne, J. W., & Overbay, A. (2004). The power of outliers (and why researchers should always check for them). Practical Assessment, Research, and Evaluation, 9(1), 6.
Parent, M. C. (2013). Handling item-level missing data: Simpler is just as good. The Counseling Psychologist, 41(4), 568-600.
Finally, we are going to quickly learn about how R can be used to render documents. This is done using the Knit button. To tidy our end document, we will also use some extra options in our code chunks. echo=FALSE is used to prevent our code from being rendered, while message=FALSE prevents messages (and warning=FALSE prevents warnings) from appearing in the rendered document.
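For example, a chunk header combining these options might look like this (the chunk name is illustrative):
```{r summary-table, echo=FALSE, message=FALSE, warning=FALSE}
# the code in this chunk still runs and its output appears in the knitted
# document, but the code itself, messages and warnings are hidden
```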
#code to load our newly saved file back into R.
df_tidy <- read_csv("~tidy_data.csv") %>% mutate(group = as.factor(group))
summary(df_tidy)
Now that the hard work of preparing our data is complete, we can run our analyses! Today we will perform t-tests to compare the effect of the intervention on depression, anxiety and stress.
From there, we will perform a regression analysis to predict depression scores whilst controlling for exposure to stressful life events, social support, and the intervention.
We will also use this to demonstrate how we can use R to generate our analysis reports. So no more copy and pasting. Just smoooooooth sailing.
Here we will be taking advantage of the flextable and gtsummary packages to conduct and report our statistics. These packages make it super easy to generate publication-ready tables, which can be rendered into Word documents through R!
For more information on how to use the packages, please check out the flextable vignettes and the gtsummary vignettes.
The code might look a little intimidating, but we will walk you through it so that you can also use this for your own analysis.
For each analysis, we will display the raw version and the tidied version. We will also show you how to render your analysis to finish the session.
The code for the summary statistics table has been set to display most of the common descriptive summary statistics. Any columns that are not needed can be deleted manually in the rendered Word document (or by adjusting the arguments of tbl_summary() - see its R documentation).
NB If you are running the chunk below to generate a report that should contain only the output tables, make sure to change echo=TRUE in the header of the markdown chunk to echo=FALSE.
table <- df_tidy %>%
dplyr::select(stress, anxiety, depression, eventsscore, SOsupport) %>% # Select relevant columns
tbl_summary(
type = all_continuous() ~ "continuous", # Specify that all selected columns are continuous
statistic = all_continuous() ~ "{mean} ({sd})", # Customize the statistics
label = list(stress ~ "Stress Score", # Optional: Customize labels for better readability
anxiety ~ "Anxiety Score",
depression ~ "Depression Score",
eventsscore ~ "Event Score",
SOsupport ~ "Social Support")
)
as_flex_table(table) # Converts table to flextable, which makes it prettier and allows for rendering into Word.
Characteristic | N = 39¹
---|---
Stress Score | 9.1 (6.4)
Anxiety Score | 6.1 (5.6)
Depression Score | 7 (7)
Event Score | 3.85 (2.27)
Social Support | 23.2 (5.1)
¹ Mean (SD)
This is probably the most complicated part of the worksheet, as it requires using a combination of different functions. However, we have tried to compile the code so that you can copy, paste and adjust to your own needs.
# Compiling alphas
## Step 1: compiling subscale variables
stress_alpha <- df_tidy %>% dplyr::select(dass1, dass6, dass8,
dass11, dass12, dass14, dass18)
anxiety_alpha <- df_tidy %>% dplyr::select(dass2, dass4, dass7,
dass9, dass15, dass19, dass20)
depression_alpha <- df_tidy %>% dplyr::select(dass3, dass5, dass10,
dass13, dass16, dass17, dass21)
## Step 2: Running the cronbach.alpha test, and collecting only the score.
## If the commands below are not working, run `install.packages("ltm")` in the console first.
str_alpha_value <- ltm::cronbach.alpha(stress_alpha)[[1]]
anx_alpha_value <- ltm::cronbach.alpha(anxiety_alpha)[[1]]
dep_alpha_value <- ltm::cronbach.alpha(depression_alpha)[[1]]
## Step 3: Creating subscale name and Cronbach alpha value objects - make sure the values align with the names
alpha_values <- c(str_alpha_value , anx_alpha_value, dep_alpha_value)
Measure_name <- c("DASS Stress",
"DASS Anxiety",
"DASS Depression")
## Step 4: Creating a pretty table
data.frame(Measure_name, alpha_values) %>% # Choosing our objects from step 3
mutate(alpha_values = round(alpha_values, 2)) %>% # Rounding to 2 decimal places
dplyr::rename( `Measure` = Measure_name, # Renaming so we can include spaces
`Cronbach alpha` = alpha_values ) %>%
flextable() # Using flextable() to make prettier table that renders to word.
Measure | Cronbach alpha
---|---
DASS Stress | 0.96
DASS Anxiety | 0.95
DASS Depression | 0.97
t_test(stress ~ group, data = df_tidy)
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 stress control experiment 16 23 2.71 30.6 0.0108
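For reference, the t_test() function used above appears to come from the rstatix package (an assumption based on the tibble-style output it returns); the base R equivalent, which runs the same Welch two-sample test in a less tidy format, is:
t.test(stress ~ group, data = df_tidy) # Welch two-sample t-test in base R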
df_tidy %>%
dplyr::select(group, anxiety, stress, depression) %>%
tbl_summary(by = group,
missing = "no",
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>%
add_p(everything() ~ "t.test") %>%
modify_header( # header needs to be modified to display test statistc and df
statistic = "**t-statistic**",
parameter = "**df**"
) %>%
# add a header to the statistic column, which is hidden by default
modify_fmt_fun(c(statistic , parameter) ~ style_sigfig)
Characteristic | control, N = 16¹ | experiment, N = 23¹ | t-statistic | df | p-value²
---|---|---|---|---|---
anxiety | 9.1 (6.6) | 4.1 (3.8) | 2.7 | 22 | 0.013
stress | 12.3 (6.2) | 6.9 (5.7) | 2.7 | 31 | 0.011
depression | 11 (6) | 4 (5) | 3.5 | 28 | 0.001
¹ Mean (SD)
² Welch Two Sample t-test
cor_data <- df_tidy %>%
dplyr::select( anxiety, stress, depression, eventsscore , SOsupport)
# Calculate correlation
results <- correlation(cor_data)
# Capture the summary output
summary_output <- capture.output(summary(results, redundant = TRUE))
# Wrap the output in markdown code block to avoid header interpretation
cat("```\n", paste(summary_output, collapse = "\n"), "\n```")
# Correlation Matrix (pearson-method)
Parameter | anxiety | stress | depression | eventsscore | SOsupport
----------------------------------------------------------------------
anxiety | | 0.90*** | 0.76*** | 0.38 | -0.39
stress | 0.90*** | | 0.74*** | 0.37 | -0.46*
depression | 0.76*** | 0.74*** | | 0.42* | -0.48*
eventsscore | 0.38 | 0.37 | 0.42* | | -0.24
SOsupport | -0.39 | -0.46* | -0.48* | -0.24 |
p-value adjustment method: Holm (1979)
Please note that the correlation matrix is left with redundant values. This is to allow you to decide whether you want to manually display the lower or the upper triangle. Also, a bug in the correlation package rearranges the display in a displeasing manner when redundant = FALSE. But thankfully, we can edit our output to our needs in the rendered Word document.
df_tidy %>% dplyr::select( anxiety, stress, depression, eventsscore , SOsupport) %>%
rename(`Anxiety` = anxiety ,
`Stress` = stress,
`Depression` = depression,
`Stressful Life Events` = eventsscore,
`Significant Other Support`= SOsupport) %>%
correlation::correlation() %>%
summary(redundant = TRUE) %>%
display()
Parameter | Anxiety | Stress | Depression | Stressful Life Events | Significant Other Support
---|---|---|---|---|---
Anxiety | | 0.90*** | 0.76*** | 0.38 | -0.39
Stress | 0.90*** | | 0.74*** | 0.37 | -0.46*
Depression | 0.76*** | 0.74*** | | 0.42* | -0.48*
Stressful Life Events | 0.38 | 0.37 | 0.42* | | -0.24
Significant Other Support | -0.39 | -0.46* | -0.48* | -0.24 |
p-value adjustment method: Holm (1979)
# Setting the model
model <- lm(depression ~ group + eventsscore + SOsupport, df_tidy)
# Ugly (but useful) results
summary(model)
##
## Call:
## lm(formula = depression ~ group + eventsscore + SOsupport, data = df_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.482 -3.849 -1.341 2.748 11.982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.8993 4.8397 3.698 0.00074 ***
## groupexperiment -4.4211 2.0045 -2.206 0.03407 *
## eventsscore 0.5451 0.4332 1.258 0.21662
## SOsupport -0.4367 0.1798 -2.429 0.02044 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.353 on 35 degrees of freedom
## Multiple R-squared: 0.4062, Adjusted R-squared: 0.3553
## F-statistic: 7.981 on 3 and 35 DF, p-value: 0.0003486
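Before tidying up the output, it can also be worth glancing at the residual diagnostics for the model. This is not part of the original worksheet, but a quick base R check is:
# Four standard diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1)) # reset the plotting layout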
I have recently discovered the flextable package, and it is amazing! It will convert your model into a publication-ready format in Word with minimal effort. Simply take your defined model and wrap it in the as_flextable() function (below we first tidy the model with tbl_regression() and then convert it with as_flex_table()). More information on flextable can be found in its documentation.
# Pretty results
model <- tbl_regression(model)
as_flex_table(model)
Characteristic | Beta | 95% CI¹ | p-value
---|---|---|---
group | | |
control | — | — |
experiment | -4.4 | -8.5, -0.35 | 0.034
eventsscore | 0.55 | -0.33, 1.4 | 0.2
SOsupport | -0.44 | -0.80, -0.07 | 0.020
¹ CI = Confidence Interval
And what use is a fancy table without a fancier plot?! APA guidelines are a little hazy when it comes to plot standards, but here is a good approximation. The title, subtitle and caption have been left empty here, as sometimes it is more useful to edit these within Word (sometimes…).
#Generating plot
plot <- ggplot(df_tidy,
aes(x = eventsscore, y = depression, col = group)
) +
geom_point(alpha = .7, position = "jitter")+
geom_smooth(method = "lm", se = FALSE)
#Adding labels and APA theme
plot +
labs(x = "Exposure to stressful life events",
y = "Depression score (DASS)",
col = "Condition") +
theme_apa(legend.pos = "right",
legend.use.title = TRUE)
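If you also want the plot as a standalone image file (for example, to drop into a manuscript), ggplot2's ggsave() saves the last plot displayed; the file name and dimensions below are illustrative:
ggsave("depression_by_events.png", width = 7, height = 5, dpi = 300) # saves the last plot shown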
If you are running this in R Markdown, you can now click on Knit to prepare your Word document with all of the prepared analyses.