add hw3 - Kinga Frańczak #2

Open · wants to merge 10 commits into main
157 changes: 157 additions & 0 deletions Homeworks/Homework-I/Frańczak_Kinga/Frańczak_Kinga_hw1.Rmd
---
title: "Homework 1"
author: "Kinga Frańczak"
output:
html_document:
df_print: paged
---

# Preparation

For these predictions I am using the data frame `insurance.csv`, which contains information about patients and their medical insurance charges. The goal is to predict the cost of insurance.

## Loading Packages

```{r message=FALSE, warning=FALSE}
library(DALEX)
library(tidyverse)
library(caret)
library(ranger)
library(dplyr)
```

## Loading Data Frame

```{r}
insurance <- read.csv("insurance.csv")
head(insurance)
```

## Splitting Data into Train Set and Test Set

```{r}
set.seed(42)
index <- createDataPartition(insurance$charges, p = 0.8, list = FALSE)

train <- insurance[index,]
test <- insurance[-index,]
```

## Creating Models

```{r}
lr_model <- lm(charges ~., data = train)
pred_lr <- predict(lr_model, test)
postResample(pred_lr, test$charges)
```

```{r}
ranger_model <- ranger(charges ~., data = train)
pred_ranger <- predict(ranger_model, test)
postResample(pred_ranger$predictions, test$charges)
```
I created two models: a linear regression model and a random forest model. The random forest model performed better than the linear regression model, so I will use it in the next steps.
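
To make the comparison easier to read, the two sets of hold-out metrics can be stacked into one table; a minimal sketch reusing `pred_lr`, `pred_ranger` and `test` from the chunks above.

```{r}
# Stack the two postResample() results (RMSE, R squared, MAE) into one matrix.
rbind(
  linear_regression = postResample(pred_lr, test$charges),
  random_forest = postResample(pred_ranger$predictions, test$charges)
)
```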

## Choosing Observation
```{r}
observation <- test[13, ]
observation
```
# 1. Model Prediction for Observation

```{r}
p <- predict(ranger_model, observation)
p$predictions
observation$charges
```
The prediction made by the model is not very accurate; the relative approximation error is equal to 0.22.

```{r}
(p$predictions - observation$charges)/observation$charges
```


# 2. Break Down

```{r}
explainer_rf <- DALEX::explain(ranger_model,
data = test[,-7],
y = test$charges)
```

```{r}
bd_pr <- predict_parts(explainer = explainer_rf,
new_observation = observation,
type = "break_down")

plot(bd_pr)
```

According to the Break Down decomposition, the variables `smoker` and `age` have the biggest impact on the predicted value. The value "no" of the `smoker` variable decreases the prediction, while `age` equal to 53 increases it.
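
The exact contributions behind the plot can also be read off as a table; a minimal sketch, assuming the `variable` and `contribution` columns of the DALEX `break_down` object.

```{r}
# The Break Down contributions from the plot above, as a plain table.
as.data.frame(bd_pr)[, c("variable", "contribution")]
```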

# 3. Shapley Values

```{r}
shap_pr <- predict_parts(explainer = explainer_rf,
new_observation = observation,
type = "shap")

plot(shap_pr)
```

The `age` and `smoker` variables have the largest effect on the prediction, which was also visible in the Break Down plot. According to the Shapley value plot, the `age` variable is more important than the `smoker` variable; for the Break Down plot the reverse is true. On average, the `bmi` variable contributes more to the prediction than the Break Down profile suggests. The `region` and `sex` variables have the least importance for the model.
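
This ordering can be checked numerically by averaging the contributions over the sampled orderings; a minimal sketch, assuming the `variable_name` and `contribution` columns of the DALEX `shap` output.

```{r}
# Mean Shapley contribution per variable, sorted by absolute size.
as.data.frame(shap_pr) %>%
  group_by(variable_name) %>%
  summarise(mean_contribution = mean(contribution)) %>%
  arrange(desc(abs(mean_contribution)))
```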

# 4. Effects of Different Variable Values

```{r}
observation2 <- test %>%
filter(smoker == "yes") %>%
filter(age < 30) %>%
filter(bmi == 28.5)
observation2
```
```{r}
p2 <- predict(ranger_model, observation2)
p2$predictions
```
```{r}
observation2$charges
```

```{r}
abs(p2$predictions - observation2$charges)/observation2$charges
```

The prediction for observation 2 is quite inaccurate; the relative approximation error is equal to 0.75.
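
To judge how unusual this error is, the same relative error can be computed over the whole test set; a minimal sketch reusing `pred_ranger` and `test` from the chunks above.

```{r}
# Distribution of the relative approximation error on the test set.
summary(abs(pred_ranger$predictions - test$charges) / test$charges)
```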

```{r}
bd_pr2 <- predict_parts(explainer = explainer_rf,
new_observation = observation2,
type = "break_down")

plot(bd_pr2)
```
```{r}
shap_pr2 <- predict_parts(explainer = explainer_rf,
new_observation = observation2,
type = "shap")

plot(shap_pr2)
```

# 5. Analysis

According to the Shapley values plot for observation 1, the `age` and `smoker` variables are the most important for the prediction. `age` increases the prediction and `smoker` decreases it. Both variables have a noticeably higher effect on the prediction than the remaining ones. A similar relationship can be seen for observation 2: `age` and `smoker` are again the most important variables, but the one with the highest contribution differs. For observation 1 it is `age`, and for observation 2 it is `smoker`.

Observations 1 and 2 have different values of those two variables. The `smoker` variable is “no” for the first observation and “yes” for the second. However, in both cases the `smoker` variable has a negative contribution to the prediction, which could be a reason for the large difference between the predicted and observed values for observation 2.

The observations also differ on the second variable. The person from the first observation is 53 years old, nearly twice the age of the person from the second observation, who is 27. Here the difference is visible in the Shapley values: the `age` variable increases the prediction for observation 1 and decreases it for observation 2.

The next two variables, with medium effects on the prediction, are `children` and `bmi`. A different one of the two is more important for each observation: for the first observation it is `children`, and for the second it is `bmi`. The `bmi` values are similar in both cases, equal to 28.1 and 28.5 for the first and second observation respectively, and their Shapley values are negative. The `children` variable differs: it is equal to 3 for observation 1 and 0 for observation 2. Its Shapley value is positive for the first observation and negative for the second.

The `region` and `sex` variables have the smallest contributions for both observations. In both cases the Shapley value for `region` is slightly larger than for `sex`. Both variables take different values in each observation, yet in every case the contribution to the prediction is positive.
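
The comparison above can be summarized in one table of mean Shapley contributions; a minimal sketch, assuming the same `shap` column names as before.

```{r}
# Mean Shapley contributions for observation 1 and observation 2 side by side.
s1 <- as.data.frame(shap_pr) %>%
  group_by(variable_name) %>%
  summarise(obs1 = mean(contribution))
s2 <- as.data.frame(shap_pr2) %>%
  group_by(variable_name) %>%
  summarise(obs2 = mean(contribution))
inner_join(s1, s2, by = "variable_name")
```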


# Conclusions

For both observation 1 and observation 2, the `age` and `smoker` variables have the highest effect on the prediction, and the `region` and `sex` variables have the lowest. There are variables, such as `smoker`, whose different values have similar effects on the prediction, and variables, such as `age`, whose different values give Shapley values of opposite signs. The `bmi` variable, which was very similar for both observations, had a similar effect on the prediction.
366 changes: 366 additions & 0 deletions Homeworks/Homework-I/Frańczak_Kinga/Frańczak_Kinga_hw1.html


232 changes: 232 additions & 0 deletions Homeworks/Homework-II/Frańczak_Kinga/hw2.Rmd
---
title: "Homework 2"
author: "Kinga Frańczak"
output: html_document
---

# Preparations

## Loading Packages

```{r message=FALSE, warning=FALSE}
library(DALEX)
library(tidyverse)
library(caret)
library(ranger)
library(dplyr)
library(DALEXtra)
library(lime)
```

## Loading and Preparing Data Frame

```{r}
insurance <- read.csv("insurance.csv")
head(insurance)
```

```{r}
insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)
```


## Splitting Data into Train Set and Test Set

```{r}
set.seed(42)
index <- createDataPartition(insurance$charges, p = 0.8, list = FALSE)

train <- insurance[index,]
test <- insurance[-index,]
```

## Creating Model

```{r}
ranger_model <- ranger(charges ~., data = train)
pred_ranger <- predict(ranger_model, test)
postResample(pred_ranger$predictions, test$charges)
```

The random forest model used in this homework is the same as in homework 1, chosen because of its better performance.

# 1. Selecting Observations

## Observation

```{r}
observation1 <- test[13, ]
observation1
```

## Prediction for Chosen Observation

```{r}
p1 <- predict(ranger_model, observation1)
p1$predictions
observation1$charges
```

# 2. Prediction Decomposition with LIME Method

```{r}
explainer_rf <- DALEX::explain(ranger_model,
data = test[,-7],
y = test$charges,
label = "random forest")
```
```{r}
model_type.dalex_explainer <- DALEXtra::model_type.dalex_explainer
predict_model.dalex_explainer <- DALEXtra::predict_model.dalex_explainer

lime_pr1 <- predict_surrogate(explainer = explainer_rf,
new_observation = observation1[ ,-7],
n_features = 3,
n_permutations = 1000,
type = "lime")
```


```{r}
lime_pr1
plot(lime_pr1)
```

The three most important features according to the LIME method are `smoker`, `age` and `bmi`. There are some differences between the LIME feature weights and the Shapley values of the same variables. The three features with the largest effect on the prediction for observation 1 according to Shapley values are `age`, `smoker` and `children`; the `bmi` variable is fourth.

The order of variables is not the only difference. Although the direction of each variable's contribution (positive or negative) is the same, the relative magnitudes differ. With Shapley values the absolute contributions of `age` and `smoker` are roughly the same, whereas with the LIME method the absolute contribution of `smoker` is approximately four times larger than that of `age`.
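
The weights behind the LIME plot can be pulled out directly to check that ratio; a minimal sketch, assuming the `feature` and `feature_weight` columns of the `lime` explanation object.

```{r}
# Feature weights from the LIME explanation, and the smoker-to-age ratio.
w <- as.data.frame(lime_pr1)[, c("feature", "feature_weight")]
w
abs(w$feature_weight[w$feature == "smoker"]) /
  abs(w$feature_weight[w$feature == "age"])
```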

# 3. Comparison of LIME Decomposition for Different Observations

## Selecting Observations for Comparison

```{r}
observation3 <- insurance %>%
filter(age > 51) %>%
filter(smoker == 'yes') %>%
filter(bmi == 26.29)

observation6 <- insurance %>%
filter(age > 51) %>%
filter(smoker == 'no') %>%
filter(bmi == 34.8)

observation7 <- insurance %>%
filter(age > 51) %>%
filter(bmi == 20.1) %>%
filter(smoker == 'no')

observation4 <- insurance %>%
filter(age == 18) %>%
filter(bmi>27,bmi<29) %>%
filter(smoker == 'no') %>%
filter(sex=='female')

observation5 <- insurance %>%
filter(age == 34) %>%
filter(bmi == 27.5) %>%
filter(smoker == 'no') %>%
filter(sex=='female')
```

## Comparison with an Observation with a Different Value of `smoker`

```{r}
observation3
```
```{r}
lime_pr3 <- predict_surrogate(explainer = explainer_rf,
new_observation = observation3[ ,-7],
n_features = 3,
n_permutations = 1000,
type = "lime")
plot(lime_pr3)
```

## Comparison with Observations with Different Values of `age`

```{r}
observation4
observation5
```
```{r}
lime_pr4 <- predict_surrogate(explainer = explainer_rf,
new_observation = observation4[ ,-7],
n_features = 3,
n_permutations = 1000,
type = "lime")
lime_pr5 <- predict_surrogate(explainer = explainer_rf,
new_observation = observation5[ ,-7],
n_features = 3,
n_permutations = 1000,
type = "lime")
plot(lime_pr4)
plot(lime_pr5)

```

## Comparison with Observations with Different Values of `bmi`

```{r}
observation6
observation7
```
```{r}
lime_pr6 <- predict_surrogate(explainer = explainer_rf,
new_observation = observation6[ ,-7],
n_features = 3,
n_permutations = 1000,
type = "lime")
lime_pr7 <- predict_surrogate(explainer = explainer_rf,
new_observation = observation7[ ,-7],
n_features = 3,
n_permutations = 1000,
type = "lime")
plot(lime_pr6)
plot(lime_pr7)

```

# 4. Analysis

## For Different Values in `smoker`

```{r}
lime_pr1
lime_pr3
```

The `age` and `bmi` variables for observations 1 and 3 were classified into the same categories. The `smoker` variable is equal to "no" for observation 1 and "yes" for observation 3. The weights of the variables are similar for both observations. For the `smoker` variable the weights have similar absolute values but different signs: for "no" the weight is negative, and for "yes" it is positive.
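
The sign flip is easy to see by pulling the `smoker` rows out of both explanations; a minimal sketch, assuming the same `lime` explanation columns as above (`bind_rows` is used because the explanation contains list columns). The same extraction works for the `age` and `bmi` comparisons below.

```{r}
# The smoker rows from both LIME explanations: similar magnitude, opposite sign.
bind_rows(as.data.frame(lime_pr1), as.data.frame(lime_pr3)) %>%
  filter(feature == "smoker") %>%
  select(case, feature_desc, feature_weight)
```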

## For Different Values in `age`

```{r}
lime_pr4
lime_pr5
```

The weights of the `smoker` and `bmi` variables, which were classified into the same categories for observations 1, 4 and 5, are similar. The weight of the `age` feature is largest for values higher than 51, where it has a positive effect on the prediction. For the two other observations the absolute contribution is smaller, with a negative effect on the prediction. As the age gets smaller, the weight gets smaller, changing sign in the process.

## For Different Values in `bmi`

```{r}
lime_pr6
lime_pr7
```

As in the two examples above, the weights of the `smoker` and `age` variables were similar for the similar values in observations 1, 6 and 7. The `bmi` variable was classified into a different category for each of those observations; however, the weight of `bmi` for observations 1 and 7 is quite similar and has a negative effect on the prediction. The value of `bmi` is the highest for observation 6, and for that observation the weight of this feature has a positive effect on the prediction.

# 5. Conclusion

After comparing the LIME decompositions for different observations, it is possible to conclude that:

* for different observations, the same features are the most important for the prediction
* if the value of a variable was classified into the same category for different observations, its weight will be similar





