-
Notifications
You must be signed in to change notification settings - Fork 0
/
Subscription Churn.Rmd
930 lines (669 loc) · 36.2 KB
/
Subscription Churn.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
---
title: "Subscription Churn"
output: github_document
author: "Julian"
date: "May 25, 2017"
---
## Motivation
The goal of this analysis is to predict subscription churn and identify factors or behaviors that singificantly influence its probability.
Based on this previous [**survival analysis**](https://github.com/jwinternheimer/analysis/blob/master/business_subscription_survival.md?utm_content=buffer67cab&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer), we will focus on the probability of a subscription churning _within the first 60 days after being created_.
We will hope to infer some causal relationships between the features and subscription churn. This will hopefully lead to some ideas for future pitches and experiments.
```{r setup, include = F}
# Load libraries
library(buffer); library(dplyr); library(ggplot2); library(tidyr)
```
## Methodology
In this analysis we will gather data from [**Looker**]([**this look**](https://looker.buffer.com/looks/3861) and SQL. We will visualize a few of the distributions of different features and build single-variable models from each of the features in order to get a feel for the predictive power of each feature.
We will also try a decision trees, logistic regression, and finally a random forests model to predict churn and estimate the probability of each subscription churning.
We will use the area under the ROC curve (AUC) as a measure of success for the models. For each different classifier (churn or not-churn) we’d get by picking a different churn probability threshold, we can plot both the true positive rate and false positive rate. This curve is the receiver operating characteristic (ROC) curve. This curve represents every possible tradeoff between sensitivity and specificity that is available.
The area under the curve (AUC) is 1.0 for perfect classifiers and 0.5 for classifiers that do no better than random guessing. I believe this will serve as a good gauge of our classifier models. :)
## Thoughts and conclusions
We tried several different approaches: single-variable models, decision trees, logistic regression, and random forests. Of those methods, random forests seems to perform the best in terms of churn prediction.
There are several features that seem to be quite important in these models. The billing cycle and account age at the time the subscription was created both seem to be quite important. The number of updates sent in the first 60 days and the number of helpscout conversations had both seem to be important as well.
However, it seems that there are a couple things going on here that aren't accounted for in the model. You could say that monthly customers have more opportunities to churn than yearly -- they renew each month. Also, there is a distinct pattern on a group of annual customers churning very early on -- perhaps they were automatically upgraded, or something.
For these reasons _**I would suggest building separate churn models for annual and monthly subscriptions**_. I think that would help us get more granular and learn more about why people churn. It may also help us boost accuracy. Theoretically the models in this analysis should account for monthly and annual plans, but let's go through the exercise of building the models manually just to learn more.
## Data collection and prep
The data we'll use in this analysis comes from [**this look**](https://looker.buffer.com/looks/3861) which contains subscriptions that were created in 2016 and have at least one successful payment. The dataset includes the follwing features:
- User ID
- Number of team members
- Number of profiles
- The creation date of the subscription
- The cancelation date of the subscription
- The current status of the subscription
- The plan ID of the subscription
- The billing interval (month or year)
- The dollar value of the subscription per payment
- The number of days it took the user to send 3 updates with Buffer
- The number of successful charges
- The country of the credit card
- The number of NPS survey responses
- The overall NPS score
- The number of helpscout conversations
I wanted to normalize the updates data a bit, so we will only look at updates sent in the first 60 days _after_ the subscription was created. We may also look at the proportion of subscriptions that churned within the first 60 days.
```{r}
# Get the data from Looker
subs <- get_look(3861)
```
Great, we have over 60K subscriptions in our dataset. Let's clean up the data a bit and get it ready for analysis.
```{r}
# Rename columns
colnames(subs) <- c('user_id', 'user_joined_at', 'team_members', 'profiles',
'created_at', 'canceled_at', 'id', 'status', 'plan', 'billing_cycle',
'amount', 'days_to_activate', 'country', 'referred_by_marketing',
'visited_before_joining', 'charges', 'nps_responses', 'nps_score',
'helpscout_convos')
# Set dates as date objects
subs$user_joined_at <- as.Date(subs$user_joined_at, "%Y-%m-%d")
subs$created_at <- as.Date(subs$created_at, "%Y-%m-%d")
subs$canceled_at <- as.Date(subs$canceled_at, "%Y-%m-%d")
# Set NPS score as a factor
subs$nps_score <- as.factor(subs$nps_score)
```
We also would like to get the number of updates sent by these users _in the first 60 days after starting the subscription_. Let's see if we can't get these updates from a SQL query.
```{r eval = F}
# Get redshift connection
con <- redshift_connect()
```
```{sql eval = F}
select
subscriptions.id
, count(distinct updates.id) as updates_count
, count(distinct updates.profile_id) as update_profiles
from updates
left join users
on updates.user_id = users.user_id
left join stripe._subscriptions as subscriptions
on users.billing_stripe_id = subscriptions.customer
where updates.status != 'service'
and updates.date >= subscriptions.start
and updates.date <= (subscriptions.start + 60)
and date_trunc('year', start) = '2016-01-01'
group by 1
```
```{r echo = F}
# Load updates data
load("updates.Rda")
```
Alright! Now let's join the updates counts into the `subs` dataframe.
```{r warning = F, message = F}
# Join updates and subs
subs <- subs %>%
left_join(updates, by = c('id' = 'id'))
```
Now I want to know how many days users were active in the first 60 days after starting a subscription. Let's write a query to help us get that number.
```{sql eval = F}
select
subscriptions.id
, count(distinct date(actions_taken.date)) as days_active
from stripe._subscriptions as subscriptions
left join users
on users.billing_stripe_id = subscriptions.customer
left join actions_taken
on actions_taken.user_id = users.user_id
where date_trunc('year', subscriptions.start) = '2016-01-01'
and actions_taken.date >= subscriptions.start
and actions_taken.date <= (subscriptions.start + 60)
group by 1
```
```{r echo = F}
load("activity.Rda")
```
Awesome! Let's join the `activity` dataframe to our original `subs` dataframe.
```{r warning = F, message = F}
# Join activity and subs
subs <- subs %>%
left_join(activity, by = c('id' = 'id'))
```
Great, we've collected all of the data we need so far. Let's take care of missing values.
```{r}
# Replace NAs with 0s
subs$helpscout_convos[is.na(subs$helpscout_convos)] <- 0
subs$nps_responses[is.na(subs$nps_responses)] <- 0
subs$updates_count[is.na(subs$updates_count)] <- 0
subs$days_active[is.na(subs$days_active)] <- 0
subs$profiles[is.na(subs$profiles)] <- 0
subs$team_members[is.na(subs$team_members)] <- 0
subs$update_profiles[is.na(subs$update_profiles)] <- 0
# Replace NAs in billing cycle
subs$billing_cycle <- as.character(subs$billing_cycle)
subs$billing_cycle[subs$billing_cycle == ""] <- ifelse("Month" %in% subs$plan, "month", "year")
subs$billing_cycle <- as.factor(subs$billing_cycle)
```
Now let's calculate the length of the subscription in days for those that have cancelled and add an indicator to show if the subscription has been canceled. We may also want to know if a user had churned within the first 60 days, so let's add the indicator variable `churned_within_60_days` as well.
```{r}
# Calculate subscription length
subs <- subs %>%
mutate(length = as.numeric(canceled_at - created_at),
churned = ifelse(is.na(canceled_at), 0, 1)) %>%
mutate(churned_within_60_days = ifelse((length <= 60 & !(is.na(length))), 1, 0))
```
We also know that users have differing numbers of profiles, and that the number of profiles is correllated with the number of updates scheduled. To account for this multicollinearity, let's create the metric `updates_per_profile`. We need to also account for users that have 0 values for profiles _and_ have updates.
```{r}
# Calculate updates per profile
subs <- subs %>%
mutate(updates_per_profile = ifelse(updates_count > 0 & profiles == 0,
updates_count / update_profiles, updates_count / profiles))
# Replace NAs with 0
subs$updates_per_profile[is.na(subs$updates_per_profile)] <- 0
subs$updates_per_profile[subs$updates_count == 0] <- 0
```
That's great! One last metric I'd like to create: the age of the Buffer account when the subscription was created.
```{r}
# Calculate buffer account age
subs <- subs %>%
mutate(account_age = as.numeric(created_at - user_joined_at))
```
Now let's move on to the exploratory analysis portion.
## Exploratory analysis
Now that we've prepped the data, let's summarize and visualize some of its features. For example, how many of these 60K subscriptions have churned?
```{r}
subs %>%
group_by(churned) %>%
summarise(subscriptions = n_distinct(id)) %>%
mutate(percentage = subscriptions / sum(subscriptions))
```
Around 56% of the subscriptions have churned. Interesting! Let's look at the percentage that churned within 60 days.
```{r}
subs %>%
group_by(churned_within_60_days) %>%
summarise(subscriptions = n_distinct(id)) %>%
mutate(percentage = subscriptions / sum(subscriptions))
```
Around 26% of subscriptions churned within 60 days.
Now let's look at the distribution of `updates_per_profile` and see how that correlates with the length of a subscription.
```{r echo = F}
# Plot distribution of updates per profile
ggplot(subs, aes(x = updates_per_profile)) +
geom_density() +
scale_x_continuous(limits = c(0, 600)) +
labs(x = 'Updates Per Profile', y = 'Density')
```
Interesting, this looks to be close to a power-law distribution, with many users having a low `updates_per_profile` value and a few users having very high value. A log transformation might help normalize this data. :)
```{r}
# Plot distribution of log updates per profile
ggplot(subs, aes(x = log(updates_per_profile))) +
geom_density() +
labs(x = 'Log Updates Per Profile', y = 'Density')
```
There we are, a nice bell-shaped curve. This transformation might not be necessary -- we'll need to look at the residuals of the model to determine that -- but it's nice to see a bell shape anyway. :)
Let's take a look at the churn rates and see if they correlate with the number of updates scheduled per profile. Because it is a numeric variable, it may be easier to "bin" the data into buckets and calculate the churn rate for each bucket.
We'll use the following code to make 10 bins.
```{r}
# Make the cuts to bin the data
cuts <- unique(as.numeric(quantile(subs$updates_per_profile, probs = seq(0, 1, 0.1))))
cuts <- c(-1, cuts)
# Cut the updates per profile into bins
subs <- subs %>%
mutate(updates_bin = cut(updates_per_profile, cuts))
```
Now, let's look at the number and percentage of subscriptions that have churned for each bin.
```{r}
subs %>%
group_by(updates_bin, churned) %>%
summarise(subscriptions = n_distinct(id)) %>%
mutate(percentage = subscriptions / sum(subscriptions)) %>%
filter(churned == T)
```
Wow! We can see that there seems to be a pretty clear negative correlation between the number of updates sent per profile and the percentage of subscriptions that have churned! Let's plot it out just to see.
```{r echo = F}
# Group subscriptions by update bin
by_bin <- subs %>%
group_by(updates_bin, churned) %>%
summarise(subscriptions = n_distinct(id)) %>%
mutate(percentage = subscriptions / sum(subscriptions)) %>%
filter(churned == T)
# Visualize the churn percentages
ggplot(by_bin, aes(x = updates_bin, y = percentage)) +
geom_bar(stat = 'identity') +
labs(x = 'Number of Updates Bucket', y = 'Churn Rate')
# Remove by_bin dataframe
rm(by_bin)
```
That's pretty neat to see. Now let's look at the `days_active` variable. Let's begin by visualizing its distribution.
```{r}
# Plot distribution of days_active
ggplot(subs, aes(x = days_active)) +
geom_density() +
scale_x_continuous(limits = c(0, 100))
```
Interestingly, this also seems to have a power-law distribution. Let's see if a log transformation helps again.
```{r}
# Plot distribution of the log of days_active
ggplot(subs, aes(x = log(days_active))) +
geom_density()
```
Maybe not! Let's look and see if there is any correlation with churn. We'll use the same approach of cutting the `days_active` number into bins or buckets, and calculating the churn rate for each bucket.
```{r}
# Make the cuts to bin the data
cuts <- unique(as.numeric(quantile(subs$days_active, probs = seq(0, 1, 0.1), na.rm = T)))
cuts <- c(-1, cuts)
# Cut the days active variable into bins
subs <- subs %>%
mutate(days_active_bin = cut(days_active, cuts))
```
Now, let's look at the number and percentage of subscriptions that have churned for each bin.
```{r echo = F}
# Group subscriptions by days active bin
by_bin <- subs %>%
group_by(days_active_bin, churned) %>%
summarise(subscriptions = n_distinct(id)) %>%
mutate(percentage = subscriptions / sum(subscriptions)) %>%
filter(churned == T)
# Visualize the churn percentages
ggplot(by_bin, aes(x = days_active_bin, y = percentage)) +
geom_bar(stat = 'identity') +
labs(x = 'Number of Days to Activate Bucket', y = 'Churn Rate')
# Remove by_bin dataframe
rm(by_bin)
```
Ok, that's good to see! It looks like there is another clear negative correlation. Interestingly, users that are not active on any day are less likely to churn than those who are active 1 to 2 days out of the first 60!
How about NPS scores? Let's view the churn rate by NPS score.
```{r}
# Group subscriptions by NPS score and calculate churn rate
subs %>%
group_by(nps_score, churned) %>%
summarise(subscriptions = n_distinct(id)) %>%
mutate(percentage = subscriptions / sum(subscriptions)) %>%
filter(churned == T)
```
Interesting! Most users gave Buffer a high NPS rating (9 or 10), yet the churn rate is highest for those people! The churn rate is lowest for those that gave Buffer a 7 or 8.
## Single variable models
To see how well our prediction model performs, we'll need something to compare it to. Single-variable models can serve as good benchmarks, and they also give us a good indication of how predictive individual features are.
First let's split our dataset into training and testing sets.
```{r}
# Set seed for reproducible results
set.seed(2356)
# Set random groups
subs$rgroup <- runif(dim(subs)[[1]])
# Split out training and testing sets
training <- subset(subs, rgroup <= 0.8)
testing <- subset(subs, rgroup > 0.8)
```
Let's start with a very simple model, using the `billing_cycle` variable. Let's make a table that shows the number of subscriptions that churn _in the first 60 days_ for each level of `billing_cycle`.
```{r}
# Create table
billing_cycle_table <- table(cycle = training[, 'billing_cycle'],
churn = training[, 'churned_within_60_days'],
useNA = 'ifany')
print(billing_cycle_table)
```
We can create our first single-variable model based on the `billing_cycle` like this.
```{r}
# First single-variable model
print(billing_cycle_table[, 2]/(billing_cycle_table[, 1] + billing_cycle_table[, 2]))
```
This is our churn probability prediction for each value of `billing_cycle`! We can now define a function that creates a single-variable model for any categorical variable.
```{r warning = F, message = F}
# Given a vector of outcomes (outCol), a categorical training variable (varCol),
# and a prediction variable (appCol), use outCol and varCol to build a single-variable model
# and then apply the model to appCol to get new predictions.
pos <- 1
make_prediction_categorical <- function(outCol, varCol, appCol) {
# Find how often the outcome is positive during training
pPos <- sum(outCol == pos) / length(outCol)
# We need this to handle NA values
naTab <- table(as.factor(outCol[is.na(varCol)]))
# Get stats on how often outcome is positive for NA values in training
pPosWna <- (naTab/sum(naTab))[pos]
vTab <- table(as.factor(outCol),varCol)
# Get stats on how often outcome is positive, conditioned on levels of the variable
pPosWv <- (vTab[pos, ] + 1.0e-3 * pPos) / (colSums(vTab) + 1.0e-3)
# Make predictions by looking up levels of appCol
pred <- pPosWv[appCol]
# Add predictions for NA values
pred[is.na(appCol)] <- pPosWna
# Add in predictions for levels of appCol that weren’t known during training
pred[is.na(pred)] <- pPos
pred
}
```
Let's test it out! Now we can apply single-variable models to all of our datasets.
```{r warning = F, message = F}
# Specify categorical varaibles
catVars <- c('plan', 'billing_cycle', 'country', 'referred_by_marketing',
'visited_before_joining', 'nps_score')
# Loop through categorical varaibles and mkae predictions
for(v in catVars) {
# Make prediction for each categorical variable
pi <- paste('pred', v, sep='_')
# Do it for the training and testing datasets
training[, pi] <- make_prediction_categorical(training[, "churned_within_60_days"],
training[, v], training[,v])
testing[, pi] <- make_prediction_categorical(training[, "churned_within_60_days"],
training[, v], testing[,v])
}
```
Once we have the predictions, we can find the categorical variables that have a good AUC both on the training data and on the testing data not used during training. These are likely the more useful variables.
```{r warning = F, message = F}
library('ROCR')
# Define a function to calculate AUC
calcAUC <- function(predcol,outcol) {
perf <- performance(prediction(predcol,outcol==pos),'auc')
as.numeric([email protected])
}
```
Now, for each of the categorical variables, we calculate the AUC based on the predictions that we made earlier.
```{r warning = F, message = F}
for(v in catVars) {
# Name the prediction variables
pi <- paste('pred', v, sep = '_')
# Find the AUC of the variable on the training set
aucTrain <- calcAUC(training[, pi], training[, "churned_within_60_days"])
# Find the AUC of the variable on the testing set
aucTest <- calcAUC(testing[, pi], testing[, "churned_within_60_days"])
# Print the results
print(sprintf("%s, trainAUC: %4.3f testingAUC: %4.3f", pi, aucTrain, aucTest))
}
```
None of these AUC values appear to be very good. A value of 0.5 is as good as a random guess. Let's look at our numeric variables now.
### Numeric variables
There are many ways to use numeric features to make predictions. A common method is to bin the numeric feature into a number of ranges and then use the range labels as a categorical variable. R can do this with `quantile()` and `cut()` commands.
Let's try to score the numeric variables by AUC.
```{r warning = F, message = F}
# Define a function that makes predictions
make_prediction_numeric <- function(outCol, varCol, appCol) {
# Make the cuts to bin the data
cuts <- unique(as.numeric(quantile(varCol, probs = seq(0, 1, 0.1), na.rm = T)))
cuts <- c(-1, cuts)
varC <- cut(varCol, cuts)
appC <- cut(appCol, cuts)
# Now apply the categorical make prediction function
make_prediction_categorical(outCol, varC, appC)
}
```
Now let's apply this function to the numeric variables.
```{r message = F, warning = F}
# Define numeric variables
numVars <- c('team_members', 'profiles', 'amount', 'days_to_activate','charges',
'updates_per_profile', 'updates_count','days_active', 'helpscout_convos',
'account_age', 'charges')
# Loop through the columns and apply the formula
for(v in numVars) {
# Name the prediction column
pi <- paste('pred', v, sep = '_')
# Make predictions
training[, pi] <- make_prediction_numeric(training[, "churned_within_60_days"],
training[, v], training[,v])
testing[, pi] <- make_prediction_numeric(training[, "churned_within_60_days"],
training[, v], testing[,v])
# Find the AUC of the variable on the training set
aucTrain <- calcAUC(training[, pi], training[, "churned_within_60_days"])
# Find the AUC of the variable on the testing set
aucTest <- calcAUC(testing[, pi], testing[, "churned_within_60_days"])
# Print the results
print(sprintf("%s, trainAUC: %4.3f testingAUC: %4.3f", pi, aucTrain, aucTest))
}
```
Cool! That's about it for our single-variable models. Let's try to do better than random guesses then. :)
## Decision trees
Building decision trees involves proposing many possible _data cuts_ and then choosing the best cuts based on simultaneous competing criteria of predictive power, cross-validation strength, and interaction with other chosen cuts.
One of the advantages of using a package for decision tree work is not having to worry about the construction details.
```{r}
# Import library
library(rpart)
# Define the outcome
outcome <- "churned_within_60_days"
# Define formula to use
formula <- paste(outcome, '~', paste(c(catVars, numVars), collapse = ' + '), sep = '')
# Fit a decision tree model
tmodel <- rpart(formula, data = training)
# Calculate AUC on training set
print(calcAUC(predict(tmodel, newdata = training), training[, outcome]))
# Calculate AUC on testing set
print(calcAUC(predict(tmodel, newdata = testing), testing[, outcome]))
```
Wow, this model actually did surprisingly well, and maintained that level on the `testing` set! Let's get a better look into what the decision tree model looks like!
```{r}
print(tmodel)
```
Each row is a _node_ of the tree. This tree has 6 nodes. Node 1 is always called the root. Each node, other than the root node, has a parent, and the parent of node _k_ is node `floor(k/2)`.
The indentation also indicates how deep in the tree the node is. Each node other than the root is named by what condition must be true to move from the parent to the node.
We can represent this with a graph too.
```{r}
# Plot decision tree
par(cex = 0.7)
plot(tmodel)
text(tmodel)
```
Oh ok, I think I see what's happening. The model is counting the number of successful charges! Of course that will be correlated with the percentage of subscriptions that churn within two months! Let's remove that variable from the model.
```{r}
# Redefine numeric variables
numVars <- c('team_members', 'profiles', 'amount', 'days_to_activate', 'updates_per_profile',
'days_active', 'nps_score', 'helpscout_convos', 'account_age')
# Define formula to use
formula <- paste(outcome, '~', paste(c(catVars, numVars), collapse = ' + '), sep = '')
# Fit a decision tree model
tmodel <- rpart(formula, data = training)
# Calculate AUC on training set
print(calcAUC(predict(tmodel, newdata = training), training[, outcome]))
# Calculate AUC on testing set
print(calcAUC(predict(tmodel, newdata = testing), testing[, outcome]))
```
Again the AUC values are quite good, and they don't contain the number of successful charges. Let's take a closer look at the tree model.
```{r}
print(tmodel)
```
And let's plot the nodes.
```{r}
# Plot decision tree
par(cex = 0.7)
plot(tmodel)
text(tmodel)
```
Interesting, the nodes are made up primarily of groups of plan IDs. I'm not sure how useful this will be for us. The first branch, however, is interesting. If the Buffer account was created on the same day, there is a _much_ higher churn rate!
Let's remove the `plan` variable from the model now.
```{r}
# Redefine categorical variables
catVars <- c('billing_cycle', 'country', 'referred_by_marketing', 'visited_before_joining')
# Define formula to use
formula <- paste(outcome, '~', paste(c(catVars, numVars), collapse = ' + '), sep = '')
# Fit a decision tree model
tmodel <- rpart(formula, data = training)
# Calculate AUC on training set
print(calcAUC(predict(tmodel, newdata = training), training[, outcome]))
# Calculate AUC on testing set
print(calcAUC(predict(tmodel, newdata = testing), testing[, outcome]))
```
Great, the AUCs are still relatively good. Let's take a look at that tree now.
```{r}
# Print decision tree
print(tmodel)
```
Alright, now we're getting a bit more useful information.
```{r}
# Plot decision tree
par(cex = 0.7)
plot(tmodel)
text(tmodel)
```
For accounts that are _not_ in their first day, if the billing cycle is yearly, only around 3.3% churn in their first 60 days. If the billing cycle is monthly, around 24.9% churn within 60 days.
If the account is not in its first day and the billing cycle is monthly, if the number of updates per profile is greater than 0.36, the churn rate is around 22.9% if the updates per profile is less than 0.36, the churn rate rises to 48%.
For accounts that _are_ in their first day, if the billing cycle is yearly, around 6% churn. If the billing cycle is monthly, a whopping 62% churn in the first two months!
## Logistic regression
Logistic regression is the most important member of a class of models called _generalized linear models_. Logistic regresion can directly predict values that are restricted to the (0, 1) interval, such as probabilities.
Let's build the model.
```{r}
# Define formula
formula <- "churned_within_60_days ~ billing_cycle + referred_by_marketing + visited_before_joining + team_members + profiles + amount + days_to_activate + updates_per_profile + days_active + nps_score + helpscout_convos + account_age"
# Train regression model
model <- glm(formula, data = training, family = binomial(link = "logit"))
```
Let's make some predictions for both the training and testing sets.
```{r}
# Make predictions
training$pred <- predict(model, newdata = training, type = "response")
testing$pred <- predict(model, newdata = testing, type = "response")
```
Our goal is to use the model to classify new instances into one of two categories. We want the model to give high scores to positive instances and low scores otherwise.
Let's plot the double-density plot of the predictions.
```{r}
# Plot double density plot
ggplot(training, aes(x = pred, color = as.factor(churned_within_60_days),
linetype = as.factor(churned_within_60_days))) +
geom_density() +
labs(x = "Predicted Churn Probability", y = "", color = "Churned", linetype = "Churned")
```
Hmm that's not great. Not too bad. Ideally, we'd like the distribution of scores to be separated, with the scores of the negative instances to be concentrated on the left, and the distribution for the positive instances to be concentrated on the right. In this case, both distributions are concentrated on the left. This isn't surprising
In order to use the model as a classifier, you must pick a threshold, above which scores will be classified as positive and below as negative.
When you pick a threshold, you're trying to balance the _precision_ of the classifier (what fraction of the predicted positives are true positives) and its _recall_ (how many of the true positives the classifier finds).
If the score distributions of the positive and negative instances are well separated, then we can pick an appropriate threshold in the "valley" between the two peaks. In the current case, the score distributions aren't well separated, which indicates that the model can't build a classifier that simultaneously achieves good recall and good precision.
But we can build a classifier that identifies a subset of situations with a higher-than-average rate of churn. We'll call the ratio of the classifier precision to the average rate of positives the _enrichment rate_.
The higher we set the threshold, the more _precise_ the classifier will be; but we'll also miss a higher percentage of at-risk situations.
We'll use the training set to pick a threshold and use the training set to evaluate its performance.
To help pick a threshold, we can use a plot that shows both enrichment and recall as a function of the threshold.
```{r warning = F, message = F}
library(ROCR); library(gridExtra)
# Create a prediction object
predObj <- prediction(training$pred, training$churned_within_60_days)
# Create an ROCR object to calculate precision as a function of threshold
precObj <- performance(predObj, measure = "prec")
# Create an ROCR object to calculate recall as a function of threshold
recObj <- performance(predObj, measure = "rec")
# Extract precision and recall from S4 objects with @ notation
precision <- ([email protected])[[1]]
recall <- ([email protected])[[1]]
# The x values (thresholds) are the same in both predObj and recObj
prec.x <- ([email protected])[[1]]
```
Phew, now let's build a data frame with thresholds, precision, and recall.
```{r}
# Build data frame
rocFrame <- data.frame(threshold = prec.x, precision = precision, recall = recall)
```
Now we can build a couple plots.
```{r}
# Calculate rate of churn in the training set
pnull <- mean(as.numeric(training$churned_within_60_days))
p1 <- ggplot(rocFrame, aes(x = threshold)) +
geom_line(aes(y = precision / pnull)) +
coord_cartesian(xlim = c(0, 0.5), ylim = c(0, 5)) +
labs(x = "Threshold", y = "", title = "Enrichment and Threshold")
p2 <- ggplot(rocFrame, aes(x = threshold)) +
geom_line(aes(y = recall)) +
coord_cartesian(xlim = c(0, 0.5)) +
labs(x = "Threshold", y = "", title = "Recall and Threshold")
grid.arrange(p1, p2, ncol = 1)
```
Looking at the plots, you can see that higher thresholds result in more precise classifications, at the cost of missing more cases; a lower threhold will identify more cases, at the cost of many more false positives.
A threshold of 0.3 (which is slightly higher than the overall rate of churn) might be a good tradeoff. The resulting classifier will identify a set of potential churn situations that finds about 70% of all the true churn situations, with a true positive rate 1.5 times higher than the overall population.
Let's evaluate 0.3 as a threshold.
```{r}
# Build confusion matrix
confuse <- table(pred = testing$pred > 0.3, churned = testing$churned_within_60_days)
confuse
```
The rows contain predicted negatives and positives, and the columns contain actual negatives and positives. Let's calculate precision and recall.
```{r}
# Calculate precision
precision <- confuse[2, 2] / sum(confuse[2, ])
precision
```
Around 37% of the subscriptions we predicted to churn actually did churn in the testing set.
```{r}
# Calculate recall
recall <- confuse[2, 2] / sum(confuse[, 2])
recall
```
We capture over 70% of the true churn cases in the testing set.
```{r}
# Calculate enrichment
enrich <- precision / mean(as.numeric(testing$churned_within_60_days))
enrich
```
The ratio of the classifier precision to the average rate of positives is 1.44. The subscriptions that we predict to churn end up churning at a rate 44% higher than the overall population. :)
Let's calculate the AUCs
```{r}
# Calculate AUC on training set
print(calcAUC(training$pred, training[, outcome]))
# Calculate AUC on testing set
print(calcAUC(testing$pred, testing[, outcome]))
```
This isn't quite as good as the decision tree model, but is quite interpretable!
```{r}
# Summarise glm model
summary(model)
```
## Random forests
```{r warning = F, message = F}
# Load library
library(randomForest)
# Specify features
features <- c('team_members', 'profiles', 'billing_cycle', 'updates_count','referred_by_marketing',
'visited_before_joining', 'account_age', 'days_active', 'nps_responses',
'helpscout_convos', 'updates_per_profile')
# Train the model
fmodel <- randomForest(x = training[, features], y = as.factor(training$churned_within_60_days),
ntree = 500, nodesize = 10, importance = T)
```
```{r}
# Define function to calculate log likelihood
loglikelihood <- function(y, py) {
pysmooth <- ifelse(py == 0, 1e-12, ifelse(py == 1, 1-1e-12, py))
sum(y * log(pysmooth) + (1 -y) * log(1 - pysmooth))
}
accuracyMeasures <- function(pred, truth, name = "model") {
dev.norm <- -2*loglikelihood(as.numeric(truth), pred)/length(pred)
# Normalize the deviance by the number of data points
# so that we can compare the deviance across training and test sets.
ctable <- table(truth = truth, pred=(pred >= 0.3))
print(ctable)
accuracy <- sum(diag(ctable))/sum(ctable)
precision <- ctable[2,2]/sum(ctable[,2])
recall <- ctable[2,2]/sum(ctable[2,])
f1 <- precision*recall
data.frame(model = name, accuracy = accuracy, f1 = f1, dev.norm)
}
# Report the quality of the model on the training set
accuracyMeasures(predict(fmodel, newdata = training[, features], type = 'prob')[, '1'],
training$churned_within_60_days == 1, name = "random forest, training")
# Report the quality of the model on the testing set
accuracyMeasures(predict(fmodel, newdata = testing[, features], type = 'prob')[, '1'],
testing$churned_within_60_days == 1, name = "random forest, testing")
```
Let's look at how important each variable was to the random forest model.
```{r}
# Plot variable importance
varImpPlot(fmodel, type = 1)
```
We can see here that `billing_cycle` and `account_age` were both very important, followed (on a smaller scale) by `days_active` and `updates_count`.
Let's calculate AUC.
```{r}
# Calculate AUC on training set
print(calcAUC(predict(fmodel, newdata = training[, features], type = 'prob')[, '1'],
training[, outcome]))
# Calculate AUC on testing set
print(calcAUC(predict(fmodel, newdata = testing[, features], type = 'prob')[, '1'],
testing[, outcome]))
```
This is the best we've gotten so far, but the AUC being much lower for the testing set suggests overfitting. Let's simplify the model by reducing the number of features.
```{r}
# Specify features
features <- c('profiles', 'billing_cycle', 'updates_count', 'account_age', 'days_active',
'helpscout_convos')
# Train the model
fmodel <- randomForest(x = training[, features], y = as.factor(training$churned_within_60_days),
ntree = 500, nodesize = 10, importance = T)
```
Let's report the accuracy of the new model.
```{r}
# Report the quality of the model on the training set
accuracyMeasures(predict(fmodel, newdata = training[, features], type = 'prob')[, '1'],
training$churned_within_60_days == 1, name = "random forest, training")
# Report the quality of the model on the testing set
accuracyMeasures(predict(fmodel, newdata = testing[, features], type = 'prob')[, '1'],
testing$churned_within_60_days == 1, name = "random forest, testing")
```
And calculate the AUCs.
```{r}
# Calculate AUC on training set
print(calcAUC(predict(fmodel, newdata = training[, features], type = 'prob')[, '1'],
training[, outcome]))
# Calculate AUC on testing set
print(calcAUC(predict(fmodel, newdata = testing[, features], type = 'prob')[, '1'],
testing[, outcome]))
```
## Thoughts and conclusions
We've tried several different approaches: single-variable models, decision trees, logistic regression, and random forests. Of those methods, random forests seems to perform the best in terms of churn prediction.
There are several features that seem to be quite important in these models. The billing cycle and account age at the time the subscription was created both seem to be quite important. The number of updates sent in the first 60 days and the number of helpscout conversations had both seem to be important as well.
However, it seems that there are a couple things going on here that aren't accounted for in the model. You could say that monthly customers have more opportunities to churn than yearly -- they renew each month. Also, there is a distinct pattern on a group of annual customers churning very early on -- perhaps they were automatically upgraded, or something.
For these reasons _**I would suggest building separate churn models for annual and monthly subscriptions**_. I think that would help us get more granular and learn more about why people churn. It may also help us boost accuracy. Theoretically the models in this analysis should account for monthly and annual plans, but let's go through the exercise of building the models manually just to learn more.