Traditionally, we use statistics to estimate models and test hypotheses with data. We collect data, describe variables with summaries such as averages, variances, and distributions, and then explain the relationships between two or more variables. The roots of statistics lie in working with data and checking theory against data (Breiman 2001). This data modeling approach relies heavily on theory and hypotheses: data are first generated or collected to test the model, in an attempt to explain the relationship before predicting. Machine learning takes a different approach. It focuses on prediction, using algorithmic models without first developing a theory and then hypotheses. UC Berkeley statistics professor Leo Breiman compares the two cultures of statistics in his famous 2001 paper and explains the difference. He argues that, with so much data coming in from all sources and directions, the data model approach alone may not make the most effective use of data to solve the problem, and he suggests employing algorithmic models to improve prediction. An algorithm is a sequence of computational and/or mathematical steps for solving a problem. The goal of algorithmic modeling is to identify an algorithm that operates on the predictor variables (x) to best predict the response variable (y).
The ultimate goal of data modeling is to explain and predict the variable of interest using data. Machine learning pursues this goal with computer algorithms in order to make predictions more effectively and solve the problem. According to Carnegie Mellon computer science professor Tom M. Mitchell, machine learning is the study of computer algorithms that allow computer programs to improve automatically through experience. To statisticians, the "improve through experience" part is the process of validation or cross-validation. Learning is done through repeated exercises on the data: a computer or statistical program runs repeated estimations, much as a human learns from experience and improves actions and decisions. In machine learning, this is called the training process.
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." - Tom M. Mitchell.
## Resampling

Imagine you select a sample, such as a group of respondents or interviewees, to gather responses to a series of questions. From these responses you can draw inferences about the opinion of the population. Could you take another sample, and another, and so on, to obtain answers that better represent the population? Alternatively, you can select a portion of the existing sample and check whether its answers differ from those of the whole sample. What happens if we repeat the process, taking another portion and another portion and recording the answers each time? This is called resampling.
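As a rough illustration, resampling can be sketched in R by repeatedly drawing portions of one observed sample with replacement and recording the answer each time (a bootstrap). The data and variable names below are made up for illustration and are not part of the credit card example used later.

# A minimal sketch of resampling (bootstrap), assuming a made-up sample of yes/no answers
set.seed(123)
responses <- rbinom(200, size = 1, prob = 0.6)                          # one observed sample
boot_means <- replicate(1000, mean(sample(responses, replace = TRUE)))  # resample and record
mean(boot_means)   # average answer across resamples
sd(boot_means)     # how much the answer varies from resample to resample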
Tree-based methods for regression and classification involve stratifying or segmenting the predictor space into a number of simple regions. The set of splitting rules used to segment the predictor space can be summarized in a tree, which is why these methods are known as decision trees.
## Supervised Learning (Classification): Tree-based models
## Decision Tree: rpart
## Conditional Inference Tree: party
## Random Forest: randomForest
## Packages used: AER, rpart, rpart.plot, rattle, party, randomForest
#install.packages("rpart")
#install.packages("rpart.plot")
#install.packages("rattle")
library(rpart)
library(rpart.plot)
library(rattle)
# AER Package (AER: Applied Econometrics with R)
# install.packages("AER")
library(AER)
# CreditCard dataset
# card: "Was the application for a credit card accepted?"
# reports: Number of major derogatory reports.
# age: Age in years plus twelfths of a year.
# income: Yearly income (in USD 10,000)
# owner: Does the individual own their home?
# months: Months living at current address.
# help("CreditCard") for dataset detail
# Reference: Greene, W.H. (2003). Econometric Analysis, 5th edition.
# Upper Saddle River, NJ: Prentice Hall.
# Link: http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm.
# Load dataset
data(CreditCard)
# Subset data including predictor variables
bankcard <- subset(CreditCard, select = c(card, reports, age, income, owner, months))
# Recode card to numeric 0/1 (0 = declined, 1 = approved)
bankcard$card <- ifelse(bankcard$card == "yes", 1, 0)
library(descr)
freq(bankcard$card)
## bankcard$card
## Frequency Percent
## 0 296 22.44
## 1 1023 77.56
## Total 1319 100.00
set.seed(1001)
# Randomly shuffle the rows
newbankcard <- bankcard[sample(nrow(bankcard)),]
# Indexing for training data
t_idx <- sample(seq_len(nrow(bankcard)), size = round(0.70 * nrow(bankcard)))
# Build train and test data
traindata <- newbankcard[t_idx,]
testdata <- newbankcard[ - t_idx,]
# Decision tree model
dtree_creditcard <- rpart::rpart(formula = card ~ ., data = traindata, method = "class", control = rpart.control(cp = 0.001)) # complexity parameter
# Plot Decision tree
rattle::fancyRpartPlot(dtree_creditcard, type = 1, main = "", caption = "Credit card approval" )
# Predict classes on the test data
resultdt <- predict(dtree_creditcard, newdata = testdata, type = "class")
# Confusion matrix
cm_creditcarddt <- table(testdata$card, resultdt, dnn = c("Actual", "Predicted"))
cm_creditcarddt
## Predicted
## Actual 0 1
## 0 35 57
## 1 18 286
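# The three values below appear to be, in order, the predicted approval rate
# (precision for approved), the precision for declined, and the overall
# accuracy; the code that produced them is not shown, but a sketch consistent
# with the confusion matrix above is:
cm_creditcarddt[2, 2] / sum(cm_creditcarddt[, 2])   # predicted approval rate
cm_creditcarddt[1, 1] / sum(cm_creditcarddt[, 1])   # precision for declined
sum(diag(cm_creditcarddt)) / sum(cm_creditcarddt)   # overall accuracy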
## [1] 0.8338192
## [1] 0.6603774
## [1] 0.8106061
# install.packages("party")
library(party)
# Conditional inference tree (card is numeric 0/1, so ctree fits a regression tree;
# its predictions are rounded to 0/1 below)
cit <- ctree(card ~ ., data = traindata)
plot(cit, main = "Conditional Inference Tree: Credit card approval")
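# The cross-tabulation below appears to count approvals against the number of
# derogatory reports on the full bankcard data (the totals match 296 and 1023);
# the original code is not shown, but a sketch along these lines is:
table(bankcard$card, bankcard$reports, dnn = c("Approved", "No. of reports"))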
## No. of reports
## Approved 0 1 2 3 4 5 6 7 9 10 11 12 14
## 0 145 47 37 20 16 11 5 6 2 1 4 1 1
## 1 915 90 13 4 1 0 0 0 0 0 0 0 0
# Confusion matrix
cm_creditcardcit <- table(testdata$card, round(predict(cit, newdata = testdata)), dnn = c("Actual", "Predicted"))
# Predicted Approval rate
cm_creditcardcit[4] / sum(cm_creditcardcit[, 2])
## [1] 0.8271955
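# The next two values appear to be the precision for declined and the overall
# accuracy of the conditional inference tree; the original code is not shown,
# but a sketch consistent with these outputs is:
cm_creditcardcit[1, 1] / sum(cm_creditcardcit[, 1])   # precision for declined
sum(diag(cm_creditcardcit)) / sum(cm_creditcardcit)   # overall accuracy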
## [1] 0.7209302
## [1] 0.8156566
# Random Forest
# install.packages("randomForest")
library(randomForest)
set.seed(1001)
# randomForest model (card is numeric 0/1, so this fits a regression forest;
# do.trace = 100 prints the out-of-bag error every 100 trees)
rf_creditcard <- randomForest(card ~ ., data = traindata, importance = TRUE, proximity = TRUE, do.trace = 100)
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 100 | 0.1254 72.82 |
## 200 | 0.1247 72.41 |
## 300 | 0.1244 72.25 |
## 400 | 0.1244 72.28 |
## 500 | 0.1239 71.97 |
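# The variable importance table below comes from the fitted forest; the
# original code is not shown, but a sketch along these lines is:
round(importance(rf_creditcard), 3)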
## %IncMSE IncNodePurity
## reports 45.173 31.136
## age 7.795 10.158
## income 17.449 12.509
## owner 17.838 3.943
## months 7.410 10.297
# %IncMSE indicates variable importance (the increase in MSE when the variable is permuted)
# IncNodePurity relates to the loss function by which the best splits are chosen
# Predict scores on the test data (the regression forest returns a numeric value between 0 and 1)
resultrf <- predict(rf_creditcard, newdata = testdata)
# Classify as approved when the predicted score exceeds 0.6
resultrf_Approved <- ifelse(resultrf > 0.6, 1, 0)
# Confusion matrix
cm_creditcardrf <- table(testdata$card, resultrf_Approved, dnn = c("Actual", "Predicted"))
cm_creditcardrf
## Predicted
## Actual 0 1
## 0 32 60
## 1 12 292
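# As before, the three values below appear to be the predicted approval rate,
# the precision for declined, and the overall accuracy; the original code is
# not shown, but a sketch consistent with the confusion matrix above is:
cm_creditcardrf[2, 2] / sum(cm_creditcardrf[, 2])   # predicted approval rate
cm_creditcardrf[1, 1] / sum(cm_creditcardrf[, 1])   # precision for declined
sum(diag(cm_creditcardrf)) / sum(cm_creditcardrf)   # overall accuracy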
## [1] 0.8295455
## [1] 0.7272727
## [1] 0.8181818
Williams, G. (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. New York: Springer.