Unraveling Kickstarter Success: A Machine Learning Approach Using Random Forest and Logistic Regression

Introduction

In recent years, crowd-funding platforms have revolutionized the way innovators, artists, and entrepreneurs seek financial backing for their creative ventures. Among these platforms, Kickstarter has emerged as a pioneering force, enabling project creators to connect directly with a global audience and secure funding for their ideas. However, the success of Kickstarter projects is far from guaranteed, with numerous campaigns failing to reach their funding goals.

This study aims to delve into the multifaceted factors that play a crucial role in determining the triumph or downfall of Kickstarter projects. Leveraging the power of machine learning techniques, we have chosen to employ two popular algorithms—Random Forest and Logistic Regression—to undertake a comprehensive analysis of the variables influencing campaign outcomes.

The Random Forest algorithm’s inherent ability to handle complex and correlated data, combined with Logistic Regression’s capacity to predict binary outcomes, empowers us to gain valuable insights into the interplay of project-specific attributes, campaign characteristics, and external factors on Kickstarter project success. By exploring a rich dataset encompassing past Kickstarter campaigns, we seek to identify patterns, trends, and key predictors that can potentially enhance the chances of project creators achieving their funding targets.

This research not only offers substantial contributions to the field of crowdfunding analysis but also aims to provide practical implications for project creators aiming to launch successful campaigns on Kickstarter and other similar platforms. Ultimately, our findings aspire to foster a deeper understanding of the dynamics of crowdfunding success, promoting innovation and creativity in the digital age.

Let’s start by loading the required packages.

library(broom)
library(tidyverse)
library(tidymodels)
library(ranger)

Data Processing

Loading the data set. The data set was obtained via Kaggle.The target variable is “State”

dtdt<-read.csv('ks-projects-201801.csv', header = T)
dtdt$state<-as.factor(dtdt$state)
levels(dtdt$state)
## [1] "canceled"   "failed"     "live"       "successful" "suspended" 
## [6] "undefined"

There are six levels of State(Failed, Successful, Cancelled, Live, Suspended and Undefined) but we will only be interested in Successful and Failed project.

dtdt1<-dtdt%>%filter(state == "failed" | state == "successful") %>% droplevels

Take a look at the currency variable.

dtdt1$currency<-as.factor(dtdt1$currency)
round(prop.table(table(dtdt1$currency)),2)
## 
##  AUD  CAD  CHF  DKK  EUR  GBP  HKD  JPY  MXN  NOK  NZD  SEK  SGD  USD 
## 0.02 0.04 0.00 0.00 0.04 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.79

79% of the total amount is in US dollars, 9% is in British pounds, and 4% is in euros. The USD and GBP make up over 88% of the currency. As a result, we shall divide into the USD, GBP, and Others.

## Transforming Currency variable into 4 levels
levels(dtdt1$currency) <- 
  list(GBP="GBP", EUR="EUR",USD = "USD",
       other = c("MXN", "NZD", 
                                         "SEK","AUD","CAD","JPY", "HKD", "SGD", "CHF", "NOK", "DKK"))

levels(dtdt1$currency)
## [1] "GBP"   "EUR"   "USD"   "other"

Let’s look at the project’s launch date and its completion date. Using these two pieces of information, we can build a new variable called duration. Duration represents the amount of time the project took to succeed or fail.

dtdt1$deadline <- as.Date(dtdt1$deadline)
dtdt1$launched <- as.Date(dtdt1$launched)
dtdt2<-dtdt1%>%mutate(Duration=deadline-launched)
dtdt2$Duration<-as.numeric(dtdt2$Duration)

Take a look at the country variable.

dtdt2$country<-as.factor(dtdt2$country)
round(prop.table(table(dtdt2$country)),2)
## 
##   AT   AU   BE   CA   CH   DE   DK   ES   FR   GB   HK   IE   IT   JP   LU   MX 
## 0.00 0.02 0.00 0.04 0.00 0.01 0.00 0.01 0.01 0.09 0.00 0.00 0.01 0.00 0.00 0.00 
## N,0"   NL   NO   NZ   SE   SG   US 
## 0.00 0.01 0.00 0.00 0.00 0.00 0.79

We make the assumption that Country will have the same impact on our model as the currency variable based on its distribution. Therefore, the country variable won’t be included in our modeling.Let’s look at our data’s percentage of missing values.

mean(!complete.cases(dtdt2))
## [1] 0.0006331499
dtdt2<-na.omit(dtdt2)

selected variables for the analysis

data<-dtdt2%>%select(main_category,currency,state,backers,usd_pledged_real,usd_goal_real,Duration)

Exploratory Analysis

p1 <- ggplot(data, aes(x = log(backers), y = log(usd_pledged_real), color=state)) +
  geom_jitter()  +
  ylab("USD Pledged(log)")+ xlab("Backers(log)") +
  ggtitle("The Amount of USD Pledged vs Backers")
p1
There is a correlation between the amount of USD pledged and the number of backers for both unsuccessful and successful projects on Kickstarter.
 p2<-ggplot(data, aes(log(usd_goal_real),  fill = state)) +
  geom_boxplot() +
  xlab("USD Goal(log))") + ylab("") +
  ggtitle("USD Goal for projects")+theme_bw()
p2
According to the boxplot, the average USD goal for unsuccessful Kickstarter projects is larger than the average USD pledged for profitable projects. In other words, the majority of failed projects appear to have greater USD goal.
p3 <- ggplot(data, aes(x = currency, fill = state)) +
  geom_bar(position = "dodge") +
  ylab("Number of projects") + xlab("") +
  ggtitle("Number of Projects Based on Currency")+theme_bw()
p3
Most successful ad=nd failed projects on kickstarter were funded in USD. Just few were funded in in GBP, EUR and other currency.
p4 <- ggplot(data, aes(x = main_category, fill = state)) +
  geom_bar(position = 'dodge')  +
  ylab("Number of projects") + xlab("") +scale_x_discrete(guide = guide_axis(angle = 90)) +
  ggtitle("Number of Projects Based on Categories")+theme_bw()
p4
Art, Games, Film & Video, Musixe and Publishing have most of the successful projects.
p5<- ggplot(data, aes(x = log(backers), y = log(usd_goal_real), color=state)) +
  geom_jitter()  +
  ylab("USD Goal(log)") + xlab("Backers(log)") +
  ggtitle("USD Goal vs Backers")
p5

For both Failed and Successful projects on Kickstarter, there is a positive relationship between USD goal and the backers

 p6<-ggplot(data, aes(log(usd_pledged_real),  fill = state)) +
  geom_boxplot() +
  xlab("USD pledged (log)") + ylab("") +
  ggtitle("USD pledged")+theme_bw()
p6
The average amount of money committed on Kickstarter is larger for successful projects than it is for unsuccessful ones.

Modelling

Dividing the data into Train and Test

set.seed(10000)
nd_data <- initial_split(data, prop = 0.60, strata = state)
trainset <- training(nd_data)
testset <- testing(nd_data)

A training set and a test set will be created at random from the data using the “initial split” function. The proportion of the training set that is used in the test set can be found by specifying prop. The proportion of each “state” level will remain consistent by incorporating the “Strata” parameter in the training and test sets.

Data Pre-Processing

Let’s pre-process our data by converting to dummy variable , scaling and centering it.

data_recipe<- recipe(state~., data = trainset)%>%
  step_dummy(all_nominal_predictors(), one_hot = T)%>%
  step_center(all_numeric_predictors())%>%
  step_scale(all_numeric_predictors())

Specifying the number of folds to be used during the training, in our case, we will be using 5 folds to be repeated once.

cv_ks<- vfold_cv(trainset, v = 5, repeats = 1, strata = state)
measure <- metric_set(accuracy) 

##Implementing the Machine Learning Models.

Two machine learning models will be fitted. Namely: Logistic Regression and Random Forest. These models are known to perform better on classification task.

Logistic Regression Model

log_mod <- logistic_reg() %>% 
  set_engine("glm")%>%
  set_mode("classification")

logistic_model_fit  <- 
  workflow() %>% 
  add_model(log_mod) %>% 
  add_recipe(data_recipe) %>%
  fit(trainset)

Let’s examine the performance of the logistic regression on the test dataset.

logistic_model_fit    %>%
  augment(testset, type.predict = "response") %>%
  accuracy(state, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.999
log_class<- predict(logistic_model_fit  , new_data = testset,
                       type = 'class')
prob_preds <- predict(logistic_model_fit  , new_data = testset, 
                      type = 'prob')
logistic_res<- testset %>% 
  select(state) %>% 
  bind_cols(log_class, prob_preds)

logistic_res%>% 
  # Plot ROC curve
  roc_curve(truth = state, .pred_failed) %>% 
  autoplot()

99.8% Accuracy on Logistic regression? Fantastic!!!

Fitting Random Forest Model

rf_mod<- 
  rand_forest(mtry = tune(),
              trees = tune(),
              min_n = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("ranger")

rf_wf<- 
  workflow() %>% 
  add_recipe(data_recipe) %>% 
  add_model(rf_mod)

r_grid<- grid_random(
 mtry() %>% range_set(c( 1,  5)),
  trees() %>% range_set(c( 100, 150)), 
  min_n() %>% range_set(c(2,  8)),
  size = 10)

tune_random <- 
  rf_wf %>% 
  tune_grid(
    resamples = cv_ks, 
    grid = r_grid, 
    ##control = ctrl, 
    metrics = measure)
show_best(tune_random)
## # A tibble: 5 × 9
##    mtry trees min_n .metric  .estimator  mean     n   std_err .config           
##   <int> <int> <int> <chr>    <chr>      <dbl> <int>     <dbl> <chr>             
## 1     5   118     7 accuracy binary     0.996     5 0.0000975 Preprocessor1_Mod…
## 2     5   130     8 accuracy binary     0.996     5 0.000105  Preprocessor1_Mod…
## 3     5   130     6 accuracy binary     0.996     5 0.000190  Preprocessor1_Mod…
## 4     4   150     4 accuracy binary     0.993     5 0.000110  Preprocessor1_Mod…
## 5     2   140     8 accuracy binary     0.947     5 0.00168   Preprocessor1_Mod…

For the parameter tuning, the random forest model with the best accuracy is the model with mtry=5, trees=117 and min_n=4

rfo <- rf_wf%>%
  finalize_workflow(select_best(tune_random )) %>%
  fit(trainset)
rfo%>%
  augment(testset, type.predict = "response") %>%
  accuracy(state, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.996
rf_class<- predict(rfo, new_data = testset,
                       type = 'class')
rf_prob <- predict(rfo , new_data = testset, 
                      type = 'prob')
rf_main <- testset %>% 
  select(state) %>% 
  bind_cols(rf_class, rf_prob)

rf_main%>% 
  # Plot ROC curve
  roc_curve(truth = state, .pred_failed) %>% 
  autoplot()

for the random fores model, the accuracy was 99.6%

logistic_model_fit %>%
  tidy() %>%
  mutate(term =  ordered(term)|> fct_reorder(p.value)) %>%
  ggplot(aes(p.value, term, fill = p.value < 0.05)) +
  geom_col() +
  theme(legend.position = "none")
categories such as Publishing, Photography, Crafts, Art, Games, Design, food are highly successeful when it comes to getting funded on Kickstarter.

Key Points

Logistic regression outperformed Random forest model in this project Categories such as Publishing, Photography, Crafts, Art, Games, Design, food are the most funded categories.

Hamid Abdulsalam
Hamid Abdulsalam
Data Scientist

Statistical inference, machine learning, deep learning, and its use in the financial and medical fields are some of my research interests.