## Usage and Adaptation of Data Carpentry Materials
Most material found in this document has been adapted from the [Data Carpentry](https://datacarpentry.org/r-socialsci/) materials, under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/). Minor amendments have been made for compatibility.
## Objectives of the session:
- Load external data from a .csv file into a data frame.
- Summarise the contents of a data frame.
- Describe the difference between a factor and a string.
- Convert between strings and factors.
- Examine and change date formats.
## Questions to be able to answer:
- What is a data.frame?
- How can I read a complete csv file into R?
- How can I get basic summary information about my dataset?
- How can I change the way R treats strings in my dataset?
- Why would I want strings to be treated differently?
- How are dates represented in R and how can I change the format?
## What are data frames and tibbles?
Data frames are the *de facto* data structure for tabular data in `R`, and what
we use for data processing, statistics, and plotting.
A data frame is the representation of data in the format of a table where the
columns are vectors that all have the same length. Data frames are analogous to
the more familiar spreadsheet in programs such as Excel, with one key difference.
Because columns are vectors, each column must contain a single type of data (e.g., characters, integers,
factors). For example, here is a figure depicting a data frame comprising a
numeric, a character, and a logical vector.
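
A minimal sketch of such a data frame, built by hand with base R's `data.frame()`. The object name, column names, and values are hypothetical and chosen only to illustrate the three column types:

```r
# Hypothetical values, not taken from any real dataset
household <- data.frame(
  no_members     = c(3, 7, 10),          # numeric column
  village        = c("A", "B", "C"),     # character column
  has_irrigation = c(TRUE, FALSE, TRUE), # logical column
  stringsAsFactors = FALSE               # keep strings as characters on older R versions
)

# Each column is a vector of a single type, and all columns have the same length
sapply(household, class)
```

Every column holds exactly three values. Passing vectors whose lengths cannot be recycled to a common length makes `data.frame()` stop with an error.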
## Data Frame Reading
Data frames can be created by hand, but most commonly they are generated by the
functions `read_csv()` or `read_table()`; in other words, when importing
spreadsheets from your hard drive (or the web).
## Presentation of the SAFI Data
SAFI (Studying African Farmer-Led Irrigation) is a study looking at farming
and irrigation methods in Tanzania and Mozambique. The survey data
was collected through interviews conducted between November 2016 and
June 2017. For this lesson, we will be using a subset of the
available data. For information about the full teaching dataset used
in other lessons in this workshop, see the [dataset description](https://www.datacarpentry.org/socialsci-workshop/data/).
## Importing data
You are going to load the data in R's memory using the function `read_csv()`
from the **`readr`** package, which is part of the **`tidyverse`**; learn
more about the **`tidyverse`** collection of packages
[here](https://www.tidyverse.org/).
**`readr`** gets installed as part of the **`tidyverse`** installation.
When you load the **`tidyverse`** (`library(tidyverse)`), the core packages
(the packages used in most data analyses) get loaded, including **`readr`**.
## An Import Example
```{r, purl=FALSE}
library(tidyverse)
interviews <- read_csv("https://raw.githubusercontent.com/datacarpentry/r-socialsci/main/episodes/data/SAFI_clean.csv")
interviews
```
## Side-note on Conflicts
Before proceeding, this is a good opportunity to talk about conflicts.
Some packages introduce function names that are already in use by pre-loaded
R packages. For instance, loading the tidyverse above introduced two
conflicting functions: `filter()` and `lag()`. Both names already exist in the
stats package, which is pre-loaded in R. If we now call `filter()`, R will use
`dplyr::filter()` and not `stats::filter()`, because by default R uses the
function from the most recently loaded package. Conflicted functions may cause
trouble later, so it is important to be aware of them and to handle them
properly when needed.
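
One way to handle a conflict is to name the package explicitly with `::`. The sketch below assumes **`dplyr`** is installed (it is part of the tidyverse). The data frame `toy` and its values are hypothetical.

```r
library(dplyr)  # attaching dplyr masks stats::filter() and stats::lag()

# Hypothetical toy data, not part of the SAFI dataset
toy <- data.frame(x = 1:5)

# Unqualified filter() now resolves to dplyr's row-subsetting function
big_rows <- filter(toy, x > 2)

# The masked stats version is still reachable with an explicit namespace;
# here it computes a centred 3-point moving average of a numeric vector
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
```

Writing `dplyr::filter()` or `stats::filter()` in full makes the intended function unambiguous, regardless of the order in which packages were loaded.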
## Inspecting data frames
When we print a `tbl_df` object (like `interviews` here), a lot of information
about our data frame is already displayed, such as the number of rows, the
number of columns, the names of the columns, and, as we just saw, the class of
data stored in each column. However, there are functions to extract this
information from data frames. Here is a non-exhaustive list of some of these
functions. Let's try them out!
## Inspecting functions
Size:
- `dim(interviews)` - returns a vector with the number of rows as the first
element, and the number of columns as the second element (the **dim**ensions of
the object)
- `nrow(interviews)` - returns the number of rows
- `ncol(interviews)` - returns the number of columns
Content:
- `head(interviews)` - shows the first 6 rows
- `tail(interviews)` - shows the last 6 rows
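
A short sketch of the size and content functions, using a small hypothetical data frame (`toy`) so that it runs without downloading the SAFI data. The same calls work on `interviews`.

```r
# Hypothetical 10-row data frame used only for illustration
toy <- data.frame(id = 1:10, score = seq(0.5, 5, by = 0.5))

dim(toy)          # number of rows, then number of columns
nrow(toy)         # number of rows
ncol(toy)         # number of columns

head(toy)         # first 6 rows by default
tail(toy, n = 3)  # the n argument changes how many rows are shown
```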
## Inspecting functions 2
Names:
- `names(interviews)` - returns the column names (synonym of `colnames()` for
`data.frame` objects)
Summary:
- `str(interviews)` - structure of the object and information about the class,
length and content of each column
- `summary(interviews)` - summary statistics for each column
- `glimpse(interviews)` - returns the number of columns and rows of the tibble,
the names and class of each column, and previews as many values as will fit on
the screen. Unlike the other inspecting functions listed above, `glimpse()` is
not a "base R" function, so you need to have the `dplyr` or `tibble` package
loaded to be able to execute it.
Note: most of these functions are "generic." They can be used on other types of
objects besides data frames or tibbles.
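
The base R functions for names and summaries can be sketched on a small hypothetical data frame (`toy`, with made-up values). The last line illustrates the "generic" point by calling `summary()` on a plain vector.

```r
# Hypothetical data frame used only for illustration
toy <- data.frame(
  village    = c("A", "B", "C"),
  no_members = c(3, 7, 10),
  stringsAsFactors = FALSE
)

names(toy)      # column names
colnames(toy)   # same result for a data frame

str(toy)        # class, length and first values of each column
summary(toy)    # per-column summary statistics

# summary() is generic: it also accepts a single numeric vector
summary(toy$no_members)
```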
## Factors
R has a special data class, called factor, to deal with categorical data that
you may encounter when creating plots or doing statistical analyses. Factors are
very useful and actually contribute to making R particularly well suited to
working with data. So we are going to spend a little time introducing them.
Factors represent categorical data. They are stored as integers associated with
labels and they can be ordered (ordinal) or unordered (nominal). Factors
create a structured relation between the different levels (values) of a
categorical variable, such as days of the week or responses to a question in
a survey. This can make it easier to see how one element relates to the
other elements in a column. While factors look (and often behave) like
character vectors, they are actually treated as integer vectors by `R`. So
you need to be very careful when treating them as strings.
## Factor Example
Once created, factors can only contain a pre-defined set of values, known as
*levels*. By default, R always sorts levels in alphabetical order. For
instance, if you have a factor with 2 levels:
```{r, purl=TRUE}
respondent_floor_type <- factor(c("earth", "cement", "cement", "earth"))
```
## Factor Example Continued
R will assign `1` to the level `"cement"` and `2` to the level `"earth"`
(because `c` comes before `e`, even though the first element in this vector is
`"earth"`). You can see this by using the function `levels()` and you can find
the number of levels using `nlevels()`:
```{r, purl=FALSE}
levels(respondent_floor_type)
nlevels(respondent_floor_type)
```
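
To see the integer codes behind the labels, and to override the alphabetical default, something like the following can be used. It is a sketch that rebuilds the same factor so it runs on its own. The name `reordered` is introduced here only for illustration.

```r
respondent_floor_type <- factor(c("earth", "cement", "cement", "earth"))

# Underlying integer codes: "cement" is level 1, "earth" is level 2
as.integer(respondent_floor_type)

# Supplying levels explicitly overrides the alphabetical default
reordered <- factor(respondent_floor_type, levels = c("earth", "cement"))
levels(reordered)
as.integer(reordered)  # the codes change, the labels do not
```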
## Factors in a Data Set
When our data has encoded a categorical variable as a string, we can use the
`as.factor()` function to convert it to a factor. This is useful for further
data wrangling and visualisation.
```{r, purl=FALSE}
memb_assoc <- interviews$memb_assoc
memb_assoc
```
## As a factor...
```{r, purl=FALSE}
memb_assoc <- as.factor(memb_assoc)
memb_assoc
```
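
Going the other way, `as.character()` converts a factor back to strings. The sketch below uses hypothetical values rather than the SAFI column. It also shows a common pitfall with factors whose labels are numbers.

```r
# Hypothetical responses, including a missing value
memb_assoc_chr <- c("yes", "no", "yes", NA)

memb_assoc_fct <- as.factor(memb_assoc_chr)   # string -> factor
back_to_chr <- as.character(memb_assoc_fct)   # factor -> string

# Pitfall: as.numeric() on a factor returns the integer codes, not the labels
year_fct <- factor(c(1990, 1983, 1977))
as.numeric(year_fct)                 # codes based on sorted levels
as.numeric(as.character(year_fct))   # the original numbers
```

Converting through `as.character()` first is the safer route whenever the labels themselves carry the values you need.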
## Formatting Dates
One of the most common issues that new (and experienced!) R users have is
converting date and time information into a variable that is appropriate and
usable during analyses. A best
practice for dealing with date data is to ensure that each component of your
date is available as a separate variable. In our dataset, we have a
column `interview_date` which contains information about the
year, month, and day that the interview was conducted. Let's
convert those dates into three separate columns.
## Overview of Dates Data
Let's extract our `interview_date` column and inspect the structure:
```{r, purl=FALSE}
dates <- interviews$interview_date
str(dates)
```
## Splitting This Up
When we imported the data in R, `read_csv()` recognized that this column
contained date information. We can now use the `day()`, `month()` and `year()`
functions from the **`lubridate`** package to extract this information from
the date, and create new columns in our data frame to store it:
```{r, purl=FALSE}
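# lubridate provides day(), month() and year(); tidyverse 2.0.0 and later
# attach it automatically, older versions need this explicit call
library(lubridate)
interviews$day <- day(dates)
interviews$month <- month(dates)
interviews$year <- year(dates)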
```
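
The extraction above works because the column was already parsed as a date. When dates arrive as text in another layout, they can be parsed and re-formatted with base R's `as.Date()` and `format()`. This is a minimal sketch with hypothetical day/month/year strings, not values from the SAFI file.

```r
# Hypothetical date strings in day/month/year order
raw_dates <- c("17/11/2016", "04/06/2017")

# Parse the text by describing its layout
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Re-format as ISO 8601 text (year-month-day)
iso_dates <- format(parsed, "%Y-%m-%d")

# Extract single components as integers
years  <- as.integer(format(parsed, "%Y"))
months <- as.integer(format(parsed, "%m"))
```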