
Commit

Merge pull request #15 from ceciliamescobedo/post-recipes
recipes post
lwjohnst86 authored Nov 27, 2023
2 parents 1928230 + 5b70bf1 commit f2f9feb
Showing 6 changed files with 377 additions and 2 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
_freeze
_site
.Rbuildignore
_dev
13 changes: 13 additions & 0 deletions README.md
@@ -0,0 +1,13 @@
# R Virtual Coding Club

<!-- badges: start -->

<!-- badges: end -->

This is an informal space to learn things in R. Read the
[website](https://coding-club.rostools.org) for details on how we run
it.

## Admin

- Thumbnails for posts are taken from <https://www.pexels.com/>.
2 changes: 0 additions & 2 deletions index.qmd
@@ -55,6 +55,4 @@ the blog. To add a post, read through:
tools. For all the programs we'll use, follow [these installation
instructions](https://guides.rostools.org/pre-course.html#installation-instructions).

<!-- Note: Images taken from https://www.pexels.com/ -->

## Blogs
2 files could not be displayed.
363 changes: 363 additions & 0 deletions posts/preprocessing-with-recipes/index.qmd
@@ -0,0 +1,363 @@
---
title: "Preprocessing of Data: Understanding the Recipes Package"
description: "Performing data transformation using the recipes package"
author:
- "Cecilia Martinez Escobedo"
date: "2023-10-25"
date-modified: last-modified
image: images/thumbnail.jpg
categories:
- quarto
- recipes
- data transformation
- tidymodels
---

> This session was recorded and uploaded to YouTube:
{{< video https://www.youtube.com/embed/7ULqg32j_j0?si=EVtd4GDRniY-Ns79 >}}

In this session, we will learn how to use the `{recipes}` package to
transform data for statistical modeling. `{recipes}` is a powerful tool
for pre-processing data in a tidy and reproducible way. This package is
an integral part of the `{tidymodels}` workflow, which provides a
unified framework for building and evaluating statistical models. For
more information about the `{recipes}` package, please visit the
[recipes webpage](https://recipes.tidymodels.org/).

## Why do we need to pre-process data?

Well, it's all about setting the stage for statistical tests!

**Data pre-processing** is the process of transforming raw data into a
format that is suitable for statistical modeling. Each statistical test
has its own assumptions, so sometimes we need to fine-tune our data to
meet those criteria. These transformations are tailored to the specific
statistical test at hand. Pre-processing may include:

- **Cleaning the data:** This can include removing missing values or
outliers.

- **Transforming the data:** This may involve converting variables to
different scales, creating new variables, or combining existing
variables. An example of this is a logarithmic transformation.

- **Normalizing the data**: This involves scaling the variables so that
they have a comparable mean and variance (for example, a mean of zero
and a standard deviation of one).

In today's session we'll focus on two fundamental pre-processing
techniques: transformations and normalizations! Keep in mind that the
transformations you choose will be determined by both the statistical
test you are performing and your research question. It's also essential
to run some basic checks to ensure your data aligns with the assumptions
of your chosen test.

## Pre-processing data with recipes

The `{recipes}` package provides a simple and efficient way to
pre-process data for statistical modeling. It works by creating a
**recipe**, which is a sequence of pre-processing steps that are applied
to the data right before it enters statistical modeling.

To create a recipe, we use the `recipe()` function. This function
requires a formula as input, which specifies the dependent variable
(outcome) and the independent variables (predictors) in the model. We
also need a data frame with the raw data.

Once we have created a recipe, we can prepare it using the `prep()`
function. This function takes the recipe (and the training data) as
input, estimates the quantities needed for each step, and returns a
prepared recipe. The `bake()` function then applies the prepared recipe
to a data frame (such as the test data set, if a data split was
performed) and returns a pre-processed data frame.
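
Before the worked example below, here is a minimal sketch of that
pattern. The names used here (`raw_data`, `outcome`, `predictor_1`,
`predictor_2`, `my_recipe`) are placeholders for illustration only and
are not used later in this post.

```{r}
#| eval: false
# A minimal sketch of the recipe workflow (placeholder names, not run)
my_recipe <- recipe(outcome ~ predictor_1 + predictor_2, data = raw_data) %>%
  step_log(all_outcomes()) %>% # one or more step_*() transformations
  prep() # estimate the quantities needed for each step

# Apply the prepared recipe: new_data = NULL returns the pre-processed
# training data; pass a test set instead to pre-process new data.
preprocessed_data <- bake(my_recipe, new_data = NULL)
```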

## Example

Let's use the `{recipes}` package to pre-process the penguins data set
from the `{palmerpenguins}` package.

```{r setup}
#| output: false
library(tidymodels)
library(tidyverse)
library(palmerpenguins)
```

Let's first take a quick look into the data!

```{r}
# Take a quick look at the penguins dataset
penguins %>%
  glimpse()
```

Before we dive into data transformations, it is essential to have a
clear research question in mind. Remember that the data transformations
depend on the research question.

In this case, our research question is:

> Given knowledge of body mass, species, and sex, can we accurately
> estimate bill depth in penguins?

![Penguin bill depth](images/penguin.png){#fig-penguin-bill-depth align="center"}

Given our research question, we will fit a simple linear regression
model using the `lm()` function in R. Then, we will use the `tidy()`
function to view the output of the model.

The `tidy()` function constructs a tibble that summarizes the model's
statistical findings, including the estimates (also known as
coefficients) and p-values. For more information about the `tidy()`
function, consult its [documentation
webpage](https://cran.r-project.org/web/packages/broom/vignettes/broom.html).

```{r}
# Let's define our model
test_model <- lm(bill_depth_mm ~ body_mass_g + species + sex, data = penguins)
# Let's take a look into the output of our model
tidy(test_model)
```

The previous model estimates bill depth based on body mass, species, and
sex. The p-values for body mass, species, and sex indicate statistically
significant associations, but the estimates appear to be very small.
This can indicate that the relationships between the variables are
statistically significant, but weak.

An interpretation of the results can be as follows: if we keep all other
variables (i.e., species and sex) constant, a one-unit increase in body
mass is associated with a very small increase (0.000706 mm) in bill
depth.
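
To put that number on a more intuitive scale, we can multiply the
coefficient by 1000 to see the expected difference in bill depth for a
1 kg (1000 g) difference in body mass (a quick calculation using the
coefficient stored in the fitted model):

```{r}
# Expected difference in bill depth (mm) for a 1 kg difference in body
# mass, holding species and sex constant
coef(test_model)[["body_mass_g"]] * 1000
```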

### Verifying the assumptions of a linear regression model

Linear regression models require certain assumptions to be met in order
to be valid. These assumptions include:

- **Linearity:** The relationship between the dependent variable and
the independent variables is linear.

- **Normality:** The residuals of the model are normally distributed.
The residuals of a model are the differences between the observed
values and the predicted values of the dependent variable.

- **Homoscedasticity:** The variance of the residuals is constant
across all values of the independent variables.

We can use residual plots to check whether our model meets these
assumptions.

One of the easiest residual plots to interpret is the **Q-Q plot of the
residuals**. In a Q-Q plot, the residuals of the model are plotted
against the theoretical quantiles of the normal distribution. If the
residuals are normally distributed, the points in the Q-Q plot will fall
along a straight line.

Let's check if our model meets those assumptions using the `plot()`
function. The function `plot()` will generate several plots to check
your model assumptions. Here we only want the Q-Q plot of the residuals
of the model, which we get by using the argument `which = 2`.

```{r}
# Check linear model assumptions
plot(test_model, which = 2)
```

The resulting Q-Q plot shows that the residuals of the model are not
normally distributed, as the points in the plot do not fall along a
straight line (they deviate upwards at the upper end of the plot). This
suggests that we may need to transform the data before fitting the
linear regression model.

If the Q-Q plot of the residuals deviates upwards, it means that there
are more outliers in the upper tail of the distribution than in the
lower tail. This can be caused by a number of factors, including a
bimodal distribution of the dependent variable (i.e., bill depth). If
the dependent variable has a bimodal distribution, the residuals will
also have a bimodal distribution. This is because the linear regression
model will try to fit a straight line through the center of the
distribution, which will result in errors for the outliers in the upper
and lower tails.

Based on this, let's make a histogram to investigate the distribution of
bill depth! We will use the `ggplot()` function for this.

```{r}
ggplot(penguins, aes(x = bill_depth_mm)) +
  geom_histogram()
```

The histogram shows that the distribution of bill depth appears to be
bimodal. This suggests that the Q-Q plot is deviating upwards because of
the bimodal distribution of bill depth.

To address this issue, we can transform the bill depth variable using a
log transformation. After transforming the bill depth variable, we can
generate a new Q-Q plot of the residuals to check whether they are now
closer to normally distributed.
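
As a quick sanity check of this idea, we can log-transform the outcome
directly in the model formula and look at the Q-Q plot again. This is
just a shortcut to preview the effect; the recipes-based approach
follows in the next section.

```{r}
# Quick preview: log-transform the outcome in the formula and re-check
# the Q-Q plot of the residuals
quick_log_model <- lm(log(bill_depth_mm) ~ body_mass_g + species + sex, data = penguins)
plot(quick_log_model, which = 2)
```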

### Creating a recipe for data transformation

Based on this, we will now create a recipe to transform our data. This
will help us improve the normality of the residuals and make the data
more interpretable.

To create a recipe we will perform the following steps:

**Step 1: Specify the model**

We first need to specify the model formula for the recipe. We do this
using the `recipe()` function. In this case, we will use the same model
as before. The bill depth (`bill_depth_mm`) is our outcome variable (y),
and the body mass (`body_mass_g`), species (`species`), and sex (`sex`)
are our predictors/exposures.
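
For illustration, Step 1 on its own could look like the following (just
a sketch; the full recipe is assembled further down). Printing the
recipe shows the role assigned to each variable, without changing the
data yet. The `bill_recipe` object is only used here.

```{r}
# Step 1 on its own: recipe() records variable roles (outcome vs. predictors)
bill_recipe <- recipe(bill_depth_mm ~ body_mass_g + species + sex, data = penguins)
bill_recipe
```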

**Step 2: Specify the data transformations**

Next, we need to specify the data transformations that we want to
perform. Every data transformation starts with `step_`. A complete list
of all the possible transformations that you can perform using the
recipes package can be consulted in the [recipes reference
section](https://recipes.tidymodels.org/reference/index.html).

In this example, we will perform a logarithmic transformation using
`step_log()`. This will help us improve the normality of the residuals.
We also want to perform a normalization using `step_normalize()`. This
will make the data more interpretable by setting the mean of the
predictors to zero.

Normalization is a data transformation technique that sets the mean of a
variable to zero and the standard deviation to one. This can be useful
for improving the interpretability of the data, especially when using
linear regression models.

For example, let's say we have a linear regression model that predicts
bill depth based on body mass. If we do not normalize the data, the
intercept of the model will represent the bill depth when the body mass
is zero. However, this is not a very meaningful value, since penguins do
not have zero body mass.

After normalizing the data, the intercept of the model will represent
the bill depth when the body mass is at the mean value. This is a much
more meaningful value, as it represents the average bill depth of
penguins.
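
As a small illustration of this point (a sketch, separate from the
recipe we build below), we can center body mass by hand and compare the
two intercepts; `penguins_centered` and `body_mass_centered` are helper
objects created only for this comparison.

```{r}
# Center body mass by hand and compare the intercepts of the two models
penguins_centered <- penguins %>%
  mutate(body_mass_centered = body_mass_g - mean(body_mass_g, na.rm = TRUE))

# Intercept: predicted bill depth at a body mass of 0 g (not meaningful)
tidy(lm(bill_depth_mm ~ body_mass_g, data = penguins))

# Intercept: predicted bill depth at the average body mass (meaningful)
tidy(lm(bill_depth_mm ~ body_mass_centered, data = penguins_centered))
```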

In every transformation using `step_`, you always need to specify what
you want to transform. In this case, we want to log transform all
numeric variables, so we use `step_log(all_double())`. For the
normalization, we use `step_normalize(all_numeric_predictors())`. This
specification tells the recipe to normalize only body mass
(`body_mass_g`), since it is the only numeric predictor.

**Step 3: Prepare and bake the data**

Finally, we need to prepare and bake the data using the `prep()` and
`bake()` functions. As mentioned before, these steps are particularly
useful when creating a recipe outside the `{tidymodels}` workflow and
when data splitting has been performed.

Prepping the data will perform all the calculations for the data
transformations in the training data set. Baking the data will perform
the data transformations on the test or validation set.
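
As a sketch of how this would look with a data split (not run here,
since we do not split the data in this post), the recipe is prepped on
the training set and baked on the test set. The objects `penguin_split`,
`penguin_train`, `penguin_test`, `split_recipe`, and `baked_test` are
hypothetical and created only for this illustration.

```{r}
#| eval: false
# Sketch: prep() on the training set, bake() on the test set
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

split_recipe <- penguin_train %>%
  recipe(bill_depth_mm ~ body_mass_g + species + sex) %>%
  step_log(all_double()) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep() # calculations use the training data only

baked_test <- bake(split_recipe, new_data = penguin_test) # apply them to the test data
```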

In this example, we did not perform data splitting, so we will set
`bake(new_data = NULL)`.

```{r}
transformed_penguins <- penguins %>%
  # Step 1: specify the model
  recipe(bill_depth_mm ~ body_mass_g + species + sex) %>%
  # Step 2: specify data transformations
  step_log(all_double()) %>%
  # Step 2: specify data transformations
  step_normalize(all_numeric_predictors()) %>%
  # Step 3: prepare data
  prep() %>%
  # Step 3: bake the data
  bake(new_data = NULL)
transformed_penguins
```

The variable `transformed_penguins` will contain the pre-processed
(transformed) data, which is ready for statistical analysis.

We can now re-run our linear model using the pre-processed data. To do
this, we simply change the `data` argument to use the
`transformed_penguins` data. We will call this the `transformed_model`,
and we will use `tidy()` to look at the output of the model. We will
also use `plot()` to check the linear model assumptions.

```{r}
# Fit linear model on transformed data
transformed_model <- lm(bill_depth_mm ~ body_mass_g + species + sex, data = transformed_penguins)
# Look at the output of the model
tidy(transformed_model)
# Check linear assumptions
plot(transformed_model, which = 2)
```

As we observe, the new Q-Q plot shows that the residuals are now
normally distributed (after log transformation).

When we log transform the outcome variable in a linear regression model,
the coefficients of the model represent the percentage change in the
outcome variable for a one unit increase in the predictor variable,
assuming that all other predictor variables are held constant.

For example, in the model we have been discussing, the coefficient for
the `body_mass_g` variable is 0.03463. This means that a one-unit
increase in body mass is associated with a 3.463% increase in bill
depth, assuming that the species and sex of the penguin are held
constant.
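
As a quick check of this reading: the "percentage change" interpretation
is an approximation that works well for small coefficients; the exact
change implied by a log-transformed outcome is `exp(coefficient) - 1`.

```{r}
# Compare the approximate and exact percentage change in bill depth for
# a one-unit increase in (normalized) body mass
body_mass_coef <- coef(transformed_model)[["body_mass_g"]]
c(
  approximate = body_mass_coef * 100,
  exact = (exp(body_mass_coef) - 1) * 100
)
```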

It is important to note that the interpretation of the coefficients of a
linear model changes after a log transformation. Before the log
transformation, the coefficients represented the change in the outcome
variable for a one-unit increase in the predictor variable, measured in
the original units. After log-transforming the outcome (and normalizing
the predictors), the coefficients represent the approximate percentage
change in the outcome variable for a one-unit increase in the predictor
variable, where one unit now corresponds to one standard deviation.

Now, let's try pre-processing our data in a different way!

We will now log transform all outcome variables using
`step_log(all_outcomes())`. We will also remove highly correlated
variables using `step_corr(all_numeric_predictors())`, and we will
remove variables with near-zero variance using
`step_nzv(all_numeric_predictors())`.

```{r}
transformed_penguins <- penguins %>%
  recipe(bill_depth_mm ~ body_mass_g + species + sex) %>%
  # Log transform the outcome (bill depth)
  step_log(all_outcomes()) %>%
  # Normalize the numeric predictors (body mass)
  step_normalize(all_numeric_predictors()) %>%
  # Remove highly correlated numeric predictors
  step_corr(all_numeric_predictors()) %>%
  # Remove numeric predictors with near-zero variance
  step_nzv(all_numeric_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
```

**Keep in mind that the order in which you perform the pre-processing
steps matters!**

The recommended pre-processing ordering, taken from the `{recipes}`
website, is as follows:

1. Impute
2. Handle factor levels
3. Individual transformations for skewness and other issues
4. Discretize (if needed and if you have no other choice)
5. Create dummy variables
6. Create interactions
7. Normalization steps (center, scale, range, etc)
8. Multivariate transformation (e.g. PCA, spatial sign, etc)

For more information regarding the order in which the pre-processing
steps should be performed, see the [Ordering of
Steps](https://recipes.tidymodels.org/articles/Ordering.html) article on
the recipes website.
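
As a sketch of what this ordering could look like for the penguins
recipe (for illustration only; `step_impute_mean()` and `step_dummy()`
were not used above, and the numbers in the comments refer to the list
above):

```{r}
# A recipe following the recommended ordering (not all steps are needed here)
ordered_recipe <- penguins %>%
  recipe(bill_depth_mm ~ body_mass_g + species + sex) %>%
  step_impute_mean(all_numeric_predictors()) %>% # 1. impute
  step_log(all_outcomes()) %>% # 3. individual transformations
  step_dummy(all_nominal_predictors()) %>% # 5. create dummy variables
  step_normalize(all_numeric_predictors()) # 7. normalization
ordered_recipe
```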

Done!
