
Commit

Merge pull request #15 from ceciliamescobedo/post-recipes
recipes post
lwjohnst86 authored Nov 27, 2023
2 parents 1928230 + 5b70bf1 commit f2f9feb
Showing 6 changed files with 377 additions and 2 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
_freeze
_site
.Rbuildignore
_dev
13 changes: 13 additions & 0 deletions README.md
@@ -0,0 +1,13 @@
# R Virtual Coding Club

<!-- badges: start -->

<!-- badges: end -->

This is an informal space to learn things in R. Read the
[website](https://coding-club.rostools.org) for details on how we run
it.

## Admin

- Thumbnails for posts are taken from <https://www.pexels.com/>.
2 changes: 0 additions & 2 deletions index.qmd
@@ -55,6 +55,4 @@ the blog. To add a post, read through:
tools. For all the programs we'll use, follow [these installation
instructions](https://guides.rostools.org/pre-course.html#installation-instructions).

<!-- Note: Images taken from https://www.pexels.com/ -->

## Blogs
2 files could not be displayed.
363 changes: 363 additions & 0 deletions posts/preprocessing-with-recipes/index.qmd
@@ -0,0 +1,363 @@
---
title: "Preprocessing of Data: Understanding the Recipes Package"
description: "Performing data transformation using the recipes package"
author:
- "Cecilia Martinez Escobedo"
date: "2023-10-25"
date-modified: last-modified
image: images/thumbnail.jpg
categories:
- quarto
- recipes
- data transformation
- tidymodels
---

> This session was recorded and uploaded to YouTube:
{{< video https://www.youtube.com/embed/7ULqg32j_j0?si=EVtd4GDRniY-Ns79 >}}

In this session, we will learn how to use the `{recipes}` package to
transform data for statistical modeling. `{recipes}` is a powerful tool
for pre-processing data in a tidy and reproducible way. This package is
an integral part of the `{tidymodels}` workflow, which provides a
unified framework for building and evaluating statistical models. For
more information about the `{recipes}` package, please visit the
[recipes webpage](https://recipes.tidymodels.org/).

## Why do we need to pre-process data?

Well, it's all about setting the stage for statistical tests!

**Data pre-processing** is the process of transforming raw data into a
format that is suitable for statistical modeling. Each statistical test
has its own assumptions, so sometimes we need to fine-tune our data to
meet those criteria. These transformations are tailored to the specific
statistical test at hand. Pre-processing may include:

- **Cleaning the data:** This can include removing missing values or
outliers.

- **Transforming the data:** This may involve converting variables to
different scales, creating new variables, or combining existing
variables. An example of this is a logarithmic transformation.

- **Normalizing the data**: This involves scaling the variables so that
they have a comparable mean and variance (for example, a mean of zero
and a standard deviation of one).

In today's session we'll focus on two fundamental pre-processing
techniques: transformations and normalizations! Keep in mind that the
transformations you choose will be determined by both the statistical
test you are performing and your research question. It's also essential
to run some basic checks to ensure your data aligns with the assumptions
of your chosen test.

## Pre-processing data with recipes

The `{recipes}` package provides a simple and efficient way to
pre-process data for statistical modeling. It works by creating a
**recipe**, which is a sequence of pre-processing steps that are applied
to the data right before it enters statistical modeling.

To create a recipe, we use the `recipe()` function. This function
requires a formula as input, which specifies the dependent variable
(outcome) and the independent variables (predictors) in the model. We
also need a data frame with the raw data.

Once we have created a recipe, we can prepare it using the `prep()`
function. This function takes the recipe (and the training data) as
input, estimates the quantities needed for each step, and returns a
prepared recipe. The `bake()` function then applies the prepared recipe
to a data frame (such as the test data set, if a data split was
performed) and returns a pre-processed data frame.
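
Before the worked example below, here is a minimal sketch of that
pattern. The names used here (`raw_data`, `outcome`, `predictor_1`,
`predictor_2`, `my_recipe`) are placeholders for illustration only and
are not used later in this post.

```{r}
#| eval: false
# A minimal sketch of the recipe workflow (placeholder names, not run)
my_recipe <- recipe(outcome ~ predictor_1 + predictor_2, data = raw_data) %>%
  step_log(all_outcomes()) %>% # one or more step_*() transformations
  prep() # estimate the quantities needed for each step

# Apply the prepared recipe: new_data = NULL returns the pre-processed
# training data; pass a test set instead to pre-process new data.
preprocessed_data <- bake(my_recipe, new_data = NULL)
```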

## Example

Let's use the `{recipes}` package to pre-process the penguins data set
from the `{palmerpenguins}` package.

```{r setup}
#| output: false
library(tidymodels)
library(tidyverse)
library(palmerpenguins)
```

Let's first take a quick look into the data!

```{r}
# Take a quick look at the penguins dataset
penguins %>%
  glimpse()
```

Before we dive into data transformations, it is essential to have a
clear research question in mind. Remember that the data transformations
depend on the research question.

In this case, our research question is:

> Given knowledge of body mass, species, and sex, can we accurately
> estimate bill depth in penguins?

![Penguin bill depth](images/penguin.png){#fig-penguin-bill-depth align="center"}

Given our research question, we will fit a simple linear regression
model using the `lm()` function in R. Then, we will use the `tidy()`
function to view the output of the model.

The `tidy()` function constructs a tibble that summarizes the model's
statistical findings, including the estimates (also known as
coefficients) and p-values. For more information about the `tidy()`
function, consult its [documentation
webpage](https://cran.r-project.org/web/packages/broom/vignettes/broom.html).

```{r}
# Let's define our model
test_model <- lm(bill_depth_mm ~ body_mass_g + species + sex, data = penguins)
# Let's take a look into the output of our model
tidy(test_model)
```

The previous model estimates bill depth based on body mass, species, and
sex. The p-values for body mass, species, and sex indicate statistically
significant associations, but the estimates appear to be very small.
This can indicate that the relationships between the variables are
statistically significant, but weak.

An interpretation of the results can be as follows: if we keep all other
variables (i.e., species and sex) constant, a one-unit increase in body
mass is associated with a very small increase (0.000706 mm) in bill
depth.
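
To put that number on a more intuitive scale, we can multiply the
coefficient by 1000 to see the expected difference in bill depth for a
1 kg (1000 g) difference in body mass (a quick calculation using the
coefficient stored in the fitted model):

```{r}
# Expected difference in bill depth (mm) for a 1 kg difference in body
# mass, holding species and sex constant
coef(test_model)[["body_mass_g"]] * 1000
```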

### Verifying the assumptions of a linear regression model

Linear regression models require certain assumptions to be met in order
to be valid. These assumptions include:

- **Linearity:** The relationship between the dependent variable and
the independent variables is linear.

- **Normality:** The residuals of the model are normally distributed.
The residuals of a model are the differences between the observed
values and the predicted values of the dependent variable.

- **Homoscedasticity:** The variance of the residuals is constant
across all values of the independent variables.

We can use residual plots to check whether our model meets these
assumptions.

One of the easiest residual plots to interpret is the **Q-Q plot of the
residuals**. In a Q-Q plot, the residuals of the model are plotted
against the theoretical quantiles of the normal distribution. If the
residuals are normally distributed, the points in the Q-Q plot will fall
along a straight line.

Let's check if our model meets those assumptions using the `plot()`
function. The function `plot()` will generate several plots to check
your model assumptions. Here we only want the Q-Q plot of the residuals
of the model, which we get by using the argument `which = 2`.

```{r}
# Check linear model assumptions
plot(test_model, which = 2)
```

The resulting Q-Q plot shows that the residuals of the model are not
normally distributed, as the points in the plot do not fall along a
straight line (they deviate upwards at the upper end of the plot). This
suggests that we may need to transform the data before fitting the
linear regression model.

If the Q-Q plot of the residuals deviates upwards, it means that there
are more outliers in the upper tail of the distribution than in the
lower tail. This can be caused by a number of factors, including a
bimodal distribution of the dependent variable (i.e., bill depth). If
the dependent variable has a bimodal distribution, the residuals will
also have a bimodal distribution. This is because the linear regression
model will try to fit a straight line through the center of the
distribution, which will result in errors for the outliers in the upper
and lower tails.

Based on this, let's make a histogram to investigate the distribution of
bill depth! We will use the `ggplot()` function for this.

```{r}
ggplot(penguins, aes(x = bill_depth_mm)) +
  geom_histogram()
```

The histogram shows that the distribution of bill depth appears to be
bimodal. This suggests that the Q-Q plot is deviating upwards because of
the bimodal distribution of bill depth.

To address this issue, we can transform the bill depth variable using a
log transformation. After transforming the bill depth variable, we can
generate a new Q-Q plot of the residuals to check whether they are now
closer to normally distributed.
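
As a quick sanity check of this idea, we can log-transform the outcome
directly in the model formula and look at the Q-Q plot again. This is
just a shortcut to preview the effect; the recipes-based approach
follows in the next section.

```{r}
# Quick preview: log-transform the outcome in the formula and re-check
# the Q-Q plot of the residuals
quick_log_model <- lm(log(bill_depth_mm) ~ body_mass_g + species + sex, data = penguins)
plot(quick_log_model, which = 2)
```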

### Creating a recipe for data transformation

Based on this, we will now create a recipe to transform our data. This
will help us improve the normality of the residuals and make the data
more interpretable.

To create a recipe we will perform the following steps:

**Step 1: Specify the model**

We first need to specify the model formula for the recipe. We do this
using the `recipe()` function. In this case, we will use the same model
as before. The bill depth (`bill_depth_mm`) is our outcome variable (y),
and the body mass (`body_mass_g`), species (`species`), and sex (`sex`)
are our predictors/exposures.
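
For illustration, Step 1 on its own could look like the following (just
a sketch; the full recipe is assembled further down). Printing the
recipe shows the role assigned to each variable, without changing the
data yet. The `bill_recipe` object is only used here.

```{r}
# Step 1 on its own: recipe() records variable roles (outcome vs. predictors)
bill_recipe <- recipe(bill_depth_mm ~ body_mass_g + species + sex, data = penguins)
bill_recipe
```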

**Step 2: Specify the data transformations**

Next, we need to specify the data transformations that we want to
perform. Every data transformation starts with `step_`. A complete list
of all the possible transformations that you can perform using the
recipes package can be consulted in the [recipes reference
section](https://recipes.tidymodels.org/reference/index.html).

In this example, we will perform a logarithmic transformation using
`step_log()`. This will help us improve the normality of the residuals.
We also want to perform a normalization using `step_normalize()`. This
will make the data more interpretable by setting the mean of the
predictors to zero.

Normalization is a data transformation technique that sets the mean of a
variable to zero and the standard deviation to one. This can be useful
for improving the interpretability of the data, especially when using
linear regression models.

For example, let's say we have a linear regression model that predicts
bill depth based on body mass. If we do not normalize the data, the
intercept of the model will represent the bill depth when the body mass
is zero. However, this is not a very meaningful value, since penguins do
not have zero body mass.

After normalizing the data, the intercept of the model will represent
the bill depth when the body mass is at the mean value. This is a much
more meaningful value, as it represents the average bill depth of
penguins.
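
As a small illustration of this point (a sketch, separate from the
recipe we build below), we can center body mass by hand and compare the
two intercepts; `penguins_centered` and `body_mass_centered` are helper
objects created only for this comparison.

```{r}
# Center body mass by hand and compare the intercepts of the two models
penguins_centered <- penguins %>%
  mutate(body_mass_centered = body_mass_g - mean(body_mass_g, na.rm = TRUE))

# Intercept: predicted bill depth at a body mass of 0 g (not meaningful)
tidy(lm(bill_depth_mm ~ body_mass_g, data = penguins))

# Intercept: predicted bill depth at the average body mass (meaningful)
tidy(lm(bill_depth_mm ~ body_mass_centered, data = penguins_centered))
```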

In every transformation using `step_`, you always need to specify what
you want to transform. In this case, we want to log transform all
numeric variables, so we use `step_log(all_double())`. For the
normalization, we use `step_normalize(all_numeric_predictors())`. This
specification tells the recipe to normalize only body mass
(`body_mass_g`), since it is the only numeric predictor.

**Step 3: Prepare and bake the data**

Finally, we need to prepare and bake the data using the `prep()` and
`bake()` functions. As mentioned before, these steps are particularly
useful when creating a recipe outside the `{tidymodels}` workflow and
when data splitting has been performed.

Prepping the data will perform all the calculations for the data
transformations in the training data set. Baking the data will perform
the data transformations on the test or validation set.
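
As a sketch of how this would look with a data split (not run here,
since we do not split the data in this post), the recipe is prepped on
the training set and baked on the test set. The objects `penguin_split`,
`penguin_train`, `penguin_test`, `split_recipe`, and `baked_test` are
hypothetical and created only for this illustration.

```{r}
#| eval: false
# Sketch: prep() on the training set, bake() on the test set
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

split_recipe <- penguin_train %>%
  recipe(bill_depth_mm ~ body_mass_g + species + sex) %>%
  step_log(all_double()) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep() # calculations use the training data only

baked_test <- bake(split_recipe, new_data = penguin_test) # apply them to the test data
```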

In this example, we did not perform data splitting, so we will set
`bake(new_data = NULL)`.

```{r}
transformed_penguins <- penguins %>%
  # Step 1: specify the model
  recipe(bill_depth_mm ~ body_mass_g + species + sex) %>%
  # Step 2: specify data transformations
  step_log(all_double()) %>%
  # Step 2: specify data transformations
  step_normalize(all_numeric_predictors()) %>%
  # Step 3: prepare data
  prep() %>%
  # Step 3: bake the data
  bake(new_data = NULL)
transformed_penguins
```

The variable `transformed_penguins` will contain the pre-processed
(transformed) data, which is ready for statistical analysis.

We can now re-run our linear model using the pre-processed data. To do
this, we simply change the `data` argument to use the
`transformed_penguins` data. We will call this the `transformed_model`,
and we will use `tidy()` to look at the output of the model. We will
also use `plot()` to check the linear model assumptions.

```{r}
# Fit linear model on transformed data
transformed_model <- lm(bill_depth_mm ~ body_mass_g + species + sex, data = transformed_penguins)
# Look at the output of the model
tidy(transformed_model)
# Check linear assumptions
plot(transformed_model, which = 2)
```

As we observe, the new Q-Q plot shows that the residuals are now
normally distributed (after log transformation).

When we log transform the outcome variable in a linear regression model,
the coefficients of the model represent the percentage change in the
outcome variable for a one unit increase in the predictor variable,
assuming that all other predictor variables are held constant.

For example, in the model we have been discussing, the coefficient for
the `body_mass_g` variable is 0.03463. This means that a one-unit
increase in body mass is associated with a 3.463% increase in bill
depth, assuming that the species and sex of the penguin are held
constant.
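
As a quick check of this reading: the "percentage change" interpretation
is an approximation that works well for small coefficients; the exact
change implied by a log-transformed outcome is `exp(coefficient) - 1`.

```{r}
# Compare the approximate and exact percentage change in bill depth for
# a one-unit increase in (normalized) body mass
body_mass_coef <- coef(transformed_model)[["body_mass_g"]]
c(
  approximate = body_mass_coef * 100,
  exact = (exp(body_mass_coef) - 1) * 100
)
```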

It is important to note that the interpretation of the coefficients of a
linear model changes after a log transformation. Before the log
transformation, the coefficients represented the change in the outcome
variable for a one-unit increase in the predictor variable, measured in
the original units. After log-transforming the outcome (and normalizing
the predictors), the coefficients represent the approximate percentage
change in the outcome variable for a one-unit increase in the predictor
variable, where one unit now corresponds to one standard deviation.

Now, let's try pre-processing our data in a different way!

We will now log transform all outcome variables using
`step_log(all_outcomes())`. We will also remove highly correlated
variables using `step_corr(all_numeric_predictors())`, and we will
remove variables with near-zero variance using
`step_nzv(all_numeric_predictors())`.

```{r}
transformed_penguins <- penguins %>%
  recipe(bill_depth_mm ~ body_mass_g + species + sex) %>%
  # Log transform the outcome (bill depth)
  step_log(all_outcomes()) %>%
  # Normalize the numeric predictors (body mass)
  step_normalize(all_numeric_predictors()) %>%
  # Remove highly correlated numeric predictors
  step_corr(all_numeric_predictors()) %>%
  # Remove numeric predictors with near-zero variance
  step_nzv(all_numeric_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
```

**Keep in mind that the order in which you perform the pre-processing
steps matters!**

The recommended pre-processing ordering, taken from the `{recipes}`
website, is as follows:

1. Impute
2. Handle factor levels
3. Individual transformations for skewness and other issues
4. Discretize (if needed and if you have no other choice)
5. Create dummy variables
6. Create interactions
7. Normalization steps (center, scale, range, etc)
8. Multivariate transformation (e.g. PCA, spatial sign, etc)

For more information regarding the order in which the pre-processing
steps should be performed, see the [Ordering of
Steps](https://recipes.tidymodels.org/articles/Ordering.html) article on
the recipes website.
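
As a sketch of what this ordering could look like for the penguins
recipe (for illustration only; `step_impute_mean()` and `step_dummy()`
were not used above, and the numbers in the comments refer to the list
above):

```{r}
# A recipe following the recommended ordering (not all steps are needed here)
ordered_recipe <- penguins %>%
  recipe(bill_depth_mm ~ body_mass_g + species + sex) %>%
  step_impute_mean(all_numeric_predictors()) %>% # 1. impute
  step_log(all_outcomes()) %>% # 3. individual transformations
  step_dummy(all_nominal_predictors()) %>% # 5. create dummy variables
  step_normalize(all_numeric_predictors()) # 7. normalization
ordered_recipe
```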

Done!
