Unraveling Kickstarter Success: A Machine Learning Approach Using Random Forest and Logistic Regression
Introduction
In recent years, crowd-funding platforms have revolutionized the way innovators, artists, and entrepreneurs seek financial backing for their creative ventures. Among these platforms, Kickstarter has emerged as a pioneering force, enabling project creators to connect directly with a global audience and secure funding for their ideas. However, the success of Kickstarter projects is far from guaranteed, with numerous campaigns failing to reach their funding goals.
This study aims to delve into the multifaceted factors that play a crucial role in determining the triumph or downfall of Kickstarter projects. Leveraging the power of machine learning techniques, we have chosen to employ two popular algorithms—Random Forest and Logistic Regression—to undertake a comprehensive analysis of the variables influencing campaign outcomes.
The Random Forest algorithm’s inherent ability to handle complex and correlated data, combined with Logistic Regression’s capacity to predict binary outcomes, empowers us to gain valuable insights into the interplay of project-specific attributes, campaign characteristics, and external factors on Kickstarter project success. By exploring a rich dataset encompassing past Kickstarter campaigns, we seek to identify patterns, trends, and key predictors that can potentially enhance the chances of project creators achieving their funding targets.
This research not only offers substantial contributions to the field of crowdfunding analysis but also aims to provide practical implications for project creators aiming to launch successful campaigns on Kickstarter and other similar platforms. Ultimately, our findings aspire to foster a deeper understanding of the dynamics of crowdfunding success, promoting innovation and creativity in the digital age.
Let’s start by loading the required packages.
library(broom)
library(tidyverse)
library(tidymodels)
library(ranger)
Data Processing
Loading the data set. The data set was obtained via Kaggle.The target variable is “State”
dtdt<-read.csv('ks-projects-201801.csv', header = T)
dtdt$state<-as.factor(dtdt$state)
levels(dtdt$state)
## [1] "canceled" "failed" "live" "successful" "suspended"
## [6] "undefined"
There are six levels of State(Failed, Successful, Cancelled, Live, Suspended and Undefined) but we will only be interested in Successful and Failed project.
dtdt1<-dtdt%>%filter(state == "failed" | state == "successful") %>% droplevels
Take a look at the currency variable.
dtdt1$currency<-as.factor(dtdt1$currency)
round(prop.table(table(dtdt1$currency)),2)
##
## AUD CAD CHF DKK EUR GBP HKD JPY MXN NOK NZD SEK SGD USD
## 0.02 0.04 0.00 0.00 0.04 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.79
79% of the total amount is in US dollars, 9% is in British pounds, and 4% is in euros. The USD and GBP make up over 88% of the currency. As a result, we shall divide into the USD, GBP, and Others.
## Transforming Currency variable into 4 levels
levels(dtdt1$currency) <-
list(GBP="GBP", EUR="EUR",USD = "USD",
other = c("MXN", "NZD",
"SEK","AUD","CAD","JPY", "HKD", "SGD", "CHF", "NOK", "DKK"))
levels(dtdt1$currency)
## [1] "GBP" "EUR" "USD" "other"
Let’s look at the project’s launch date and its completion date. Using these two pieces of information, we can build a new variable called duration. Duration represents the amount of time the project took to succeed or fail.
dtdt1$deadline <- as.Date(dtdt1$deadline)
dtdt1$launched <- as.Date(dtdt1$launched)
dtdt2<-dtdt1%>%mutate(Duration=deadline-launched)
dtdt2$Duration<-as.numeric(dtdt2$Duration)
Take a look at the country variable.
dtdt2$country<-as.factor(dtdt2$country)
round(prop.table(table(dtdt2$country)),2)
##
## AT AU BE CA CH DE DK ES FR GB HK IE IT JP LU MX
## 0.00 0.02 0.00 0.04 0.00 0.01 0.00 0.01 0.01 0.09 0.00 0.00 0.01 0.00 0.00 0.00
## N,0" NL NO NZ SE SG US
## 0.00 0.01 0.00 0.00 0.00 0.00 0.79
We make the assumption that Country will have the same impact on our model as the currency variable based on its distribution. Therefore, the country variable won’t be included in our modeling.Let’s look at our data’s percentage of missing values.
mean(!complete.cases(dtdt2))
## [1] 0.0006331499
dtdt2<-na.omit(dtdt2)
selected variables for the analysis
data<-dtdt2%>%select(main_category,currency,state,backers,usd_pledged_real,usd_goal_real,Duration)
Exploratory Analysis
p1 <- ggplot(data, aes(x = log(backers), y = log(usd_pledged_real), color=state)) +
geom_jitter() +
ylab("USD Pledged(log)")+ xlab("Backers(log)") +
ggtitle("The Amount of USD Pledged vs Backers")
p1
p2<-ggplot(data, aes(log(usd_goal_real), fill = state)) +
geom_boxplot() +
xlab("USD Goal(log))") + ylab("") +
ggtitle("USD Goal for projects")+theme_bw()
p2
p3 <- ggplot(data, aes(x = currency, fill = state)) +
geom_bar(position = "dodge") +
ylab("Number of projects") + xlab("") +
ggtitle("Number of Projects Based on Currency")+theme_bw()
p3
p4 <- ggplot(data, aes(x = main_category, fill = state)) +
geom_bar(position = 'dodge') +
ylab("Number of projects") + xlab("") +scale_x_discrete(guide = guide_axis(angle = 90)) +
ggtitle("Number of Projects Based on Categories")+theme_bw()
p4
p5<- ggplot(data, aes(x = log(backers), y = log(usd_goal_real), color=state)) +
geom_jitter() +
ylab("USD Goal(log)") + xlab("Backers(log)") +
ggtitle("USD Goal vs Backers")
p5
For both Failed and Successful projects on Kickstarter, there is a positive relationship between USD goal and the backers
p6<-ggplot(data, aes(log(usd_pledged_real), fill = state)) +
geom_boxplot() +
xlab("USD pledged (log)") + ylab("") +
ggtitle("USD pledged")+theme_bw()
p6
Modelling
Dividing the data into Train and Test
set.seed(10000)
nd_data <- initial_split(data, prop = 0.60, strata = state)
trainset <- training(nd_data)
testset <- testing(nd_data)
A training set and a test set will be created at random from the data using the “initial split” function. The proportion of the training set that is used in the test set can be found by specifying prop. The proportion of each “state” level will remain consistent by incorporating the “Strata” parameter in the training and test sets.
Data Pre-Processing
Let’s pre-process our data by converting to dummy variable , scaling and centering it.
data_recipe<- recipe(state~., data = trainset)%>%
step_dummy(all_nominal_predictors(), one_hot = T)%>%
step_center(all_numeric_predictors())%>%
step_scale(all_numeric_predictors())
Specifying the number of folds to be used during the training, in our case, we will be using 5 folds to be repeated once.
cv_ks<- vfold_cv(trainset, v = 5, repeats = 1, strata = state)
measure <- metric_set(accuracy)
##Implementing the Machine Learning Models.
Two machine learning models will be fitted. Namely: Logistic Regression and Random Forest. These models are known to perform better on classification task.
Logistic Regression Model
log_mod <- logistic_reg() %>%
set_engine("glm")%>%
set_mode("classification")
logistic_model_fit <-
workflow() %>%
add_model(log_mod) %>%
add_recipe(data_recipe) %>%
fit(trainset)
Let’s examine the performance of the logistic regression on the test dataset.
logistic_model_fit %>%
augment(testset, type.predict = "response") %>%
accuracy(state, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.999
log_class<- predict(logistic_model_fit , new_data = testset,
type = 'class')
prob_preds <- predict(logistic_model_fit , new_data = testset,
type = 'prob')
logistic_res<- testset %>%
select(state) %>%
bind_cols(log_class, prob_preds)
logistic_res%>%
# Plot ROC curve
roc_curve(truth = state, .pred_failed) %>%
autoplot()
99.8% Accuracy on Logistic regression? Fantastic!!!
Fitting Random Forest Model
rf_mod<-
rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_mode("classification") %>%
set_engine("ranger")
rf_wf<-
workflow() %>%
add_recipe(data_recipe) %>%
add_model(rf_mod)
r_grid<- grid_random(
mtry() %>% range_set(c( 1, 5)),
trees() %>% range_set(c( 100, 150)),
min_n() %>% range_set(c(2, 8)),
size = 10)
tune_random <-
rf_wf %>%
tune_grid(
resamples = cv_ks,
grid = r_grid,
##control = ctrl,
metrics = measure)
show_best(tune_random)
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 118 7 accuracy binary 0.996 5 0.0000975 Preprocessor1_Mod…
## 2 5 130 8 accuracy binary 0.996 5 0.000105 Preprocessor1_Mod…
## 3 5 130 6 accuracy binary 0.996 5 0.000190 Preprocessor1_Mod…
## 4 4 150 4 accuracy binary 0.993 5 0.000110 Preprocessor1_Mod…
## 5 2 140 8 accuracy binary 0.947 5 0.00168 Preprocessor1_Mod…
For the parameter tuning, the random forest model with the best accuracy is the model with mtry=5, trees=117 and min_n=4
rfo <- rf_wf%>%
finalize_workflow(select_best(tune_random )) %>%
fit(trainset)
rfo%>%
augment(testset, type.predict = "response") %>%
accuracy(state, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.996
rf_class<- predict(rfo, new_data = testset,
type = 'class')
rf_prob <- predict(rfo , new_data = testset,
type = 'prob')
rf_main <- testset %>%
select(state) %>%
bind_cols(rf_class, rf_prob)
rf_main%>%
# Plot ROC curve
roc_curve(truth = state, .pred_failed) %>%
autoplot()
for the random fores model, the accuracy was 99.6%
logistic_model_fit %>%
tidy() %>%
mutate(term = ordered(term)|> fct_reorder(p.value)) %>%
ggplot(aes(p.value, term, fill = p.value < 0.05)) +
geom_col() +
theme(legend.position = "none")
Key Points
Logistic regression outperformed Random forest model in this project Categories such as Publishing, Photography, Crafts, Art, Games, Design, food are the most funded categories.