1 What is machine learning?

Traditionally, we use statistics to estimate models and test hypotheses with data. We collect data, describe variables with summaries such as averages, variances and distributions, and then explain relationships between two or more variables. The roots of statistics lie in working with data and checking theory against data (Breiman 2001). This data modeling approach relies heavily on theory and hypotheses: data are generated or collected to test a model, with the aim of explaining the relationship before predicting. Machine learning takes a different approach. It focuses on prediction using algorithmic models, without first developing a theory and then a hypothesis. UC Berkeley statistics professor Leo Breiman compares these two cultures of statistical modeling in his influential 2001 paper. He argues that, with so much data arriving from all sources and directions, the data modeling approach alone may not make the best use of data to solve the problem, and he suggests employing algorithmic models to improve prediction. An algorithm is a sequence of computational and/or mathematical steps for solving a problem. The goal of algorithmic modeling is to find an algorithm that operates on the predictor variables (x) to best predict the response variable (y).

The ultimate goal of data modeling is to explain and predict the variable of interest using data. Machine learning pursues the same goal, but relies on computer algorithms to make the prediction and solve the problem more effectively. According to Carnegie Mellon computer science professor Tom M. Mitchell, machine learning is the study of computer algorithms that allow computer programs to improve automatically through experience. To statisticians, the “improve through experience” part corresponds to validation or cross-validation. Learning happens through repeated exercises on the data: the computer or statistical program runs repeated estimations, much as a human learns from experience and improves actions and decisions. In machine learning this is called the training process.

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” - Tom M. Mitchell
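As a minimal sketch of this “improve through experience” idea, the cross-validation loop below repeatedly fits and evaluates a model; the built-in mtcars data, the linear model, and the choice of k = 5 folds are assumptions for illustration only.

# k-fold cross-validation: repeated estimation of performance on held-out data
set.seed(1001)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to one of k folds

cv_errors <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]                  # "experience" E: fit on k - 1 folds
  test  <- mtcars[folds == i, ]                  # "performance" P: evaluate on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)       # the "task" T: predicting mpg
  cv_errors[i] <- mean((test$mpg - predict(fit, newdata = test))^2)
}
mean(cv_errors)   # cross-validated estimate of prediction error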

Resampling

Imagine you select a sample, such as a group of respondents or interviewees, to gather answers to a series of questions. From these responses you infer the opinion of the population. Could you take another sample, and another, and so on, to obtain answers that better represent the population? Alternatively, you can select a portion of the existing sample and check whether its answers differ from those of the whole sample. What happens if we repeat the process, drawing another portion and another portion and recording the answers each time? This is called resampling.
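A minimal sketch of this resampling idea in R; the made-up yes/no responses, the portion size, and the number of repetitions are arbitrary choices for illustration.

# Resampling: repeatedly draw portions of an existing sample and record the answers
set.seed(1001)
responses <- rbinom(200, size = 1, prob = 0.6)               # hypothetical yes/no answers from 200 respondents

resampled_means <- replicate(1000, {
  portion <- sample(responses, size = 150, replace = TRUE)   # draw a portion of the sample, with replacement
  mean(portion)                                              # record the answer for this portion
})

mean(responses)                                 # estimate from the original sample
quantile(resampled_means, c(0.025, 0.975))      # how much the estimate varies across resampled portions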

1.1 Tree-based models

Tree-based methods for regression and classification involve stratifying or segmenting the predictor space into a number of simple regions. The set of splitting rules used to segment the predictor space can be summarized in a tree, which is why these are known as decision trees.

  • Tree-based methods are simple and useful for interpretation.
  • However, they are typically not competitive with the best supervised learning approaches in terms of prediction accuracy.
  • Methods such as bagging, random forests, and boosting grow multiple trees which are then combined to yield a single consensus prediction.
  • Combining a large number of trees can improve prediction accuracy but at the expense of interpretation.

1.1.1 Decision Tree

  • One way to make predictions in a regression problem is to divide the predictor space (i.e. all the possible values of \(X_1, X_2, \ldots, X_p\)) into distinct regions, say \(R_1, R_2, \ldots, R_J\)
  • Then for every \(X\) that falls in a particular region (say \(R_j\)) we make the same prediction.

1.1.1.1 Tree building process

  • Predictor space: the set of possible values of \(X_1, X_2, \ldots, X_p\)
  • Divide the predictor space into \(J\) distinct and non-overlapping regions, \(R_1, R_2, \ldots, R_J\)
  • Predict with the mean of the training observations in each region \(R_j\) (see the sketch after this list)
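A minimal sketch of this region-mean idea; the simulated x and y below and the rpart settings are illustrative assumptions, not part of the workshop data.

# Fit a small regression tree and check that the prediction in each region
# equals the mean of the training observations that fall in it
library(rpart)
set.seed(1001)
simdata <- data.frame(x = runif(200, 0, 10))
simdata$y <- ifelse(simdata$x < 5, 2, 8) + rnorm(200)   # two underlying regions

fit <- rpart(y ~ x, data = simdata, method = "anova")   # regression tree; its splits define the regions R_j
fit

leaf <- fit$where                          # terminal node (region) of each training observation
tapply(simdata$y, leaf, mean)              # mean response in each region ...
tapply(predict(fit), leaf, unique)         # ... equals the tree's prediction for that region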

1.1.2 Terminology for Trees

  • Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.
  • The points along the tree where the predictor space is split are referred to as internal nodes e.g. Years<4.5 and Hits<117.5.
  • Terminal nodes (or leaves) are at the ends of the branches; they hold the predictions and are the most meaningful part of the tree for interpretation (see the sketch below)
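A brief sketch of where splits such as Years<4.5 and Hits<117.5 come from, assuming the Hitters baseball data from the ISLR package is available; the log-salary response, the maxdepth restriction, and the exact cut points rpart chooses are illustrative and may differ slightly from the textbook figure.

# Small salary tree: internal nodes hold the splitting rules, terminal nodes (leaves) sit at the bottom
# install.packages("ISLR")
library(ISLR)
library(rpart)
library(rpart.plot)

hitters <- na.omit(Hitters)                 # drop players with missing Salary
salary_tree <- rpart(log(Salary) ~ Years + Hits, data = hitters,
                     control = rpart.control(maxdepth = 2))
rpart.plot(salary_tree, main = "Internal nodes at the top, terminal nodes (leaves) at the bottom")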

1.1.3 Random Forest

  • Build a number of decision trees on bootstrapped training samples
  • When building these trees, each time a split is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors (usually \(m\approx\sqrt{p}\))
  • Why a random sample of \(m\) predictors instead of all \(p\) predictors for splitting?
  • Suppose we have one very strong predictor in the data set along with a number of other moderately strong predictors; then in the collection of bagged trees, most or all of them will use the very strong predictor for the first split!
  • The bagged trees will therefore look similar, and their predictions will be highly correlated
  • Averaging many highly correlated quantities does not lead to as large a variance reduction as averaging uncorrelated quantities
  • Random forests “de-correlate” the bagged trees, leading to a greater reduction in variance (see the sketch after this list)
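A minimal sketch of the de-correlation idea, using the built-in mtcars data as a stand-in; the only difference between the two fits below is mtry, the number of predictors sampled as split candidates at each split.

# Bagging is a random forest with mtry = p; a random forest uses a smaller mtry (about sqrt(p))
library(randomForest)
set.seed(1001)

p <- ncol(mtcars) - 1                                                   # number of predictors
bagged <- randomForest(mpg ~ ., data = mtcars, mtry = p)                # all p predictors considered at each split
rf     <- randomForest(mpg ~ ., data = mtcars, mtry = floor(sqrt(p)))   # only m ~ sqrt(p) candidates per split

bagged   # compare the out-of-bag error of the two fits
rf       # the de-correlated forest is typically competitive or better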

1.1.4 Hands-on workshop: Credit card approval

## Supervised Learning (Classification): Tree-based models
## Decision Tree: rpart
## Conditional Inference Tree: party
## Random Forest: randomForest
## Packages used: AER, rpart, rpart.plot, rattle, party, randomForest

#install.packages("rpart")
#install.packages("rpart.plot")
#install.packages("rattle")

library(rpart)
library(rpart.plot)
library(rattle)

# AER Package (AER: Applied Econometrics with R)
# install.packages("AER")
library(AER)

# CreditCard dataset
# card:   "Was the application for a credit card accepted?"
# reports: Number of major derogatory reports.
# age:     Age in years plus twelfths of a year.
# income:  Yearly income (in USD 10,000)
# owner:   Does the individual own their home?
# months:  Months living at current address.
# help("CreditCard") for dataset detail
# Reference: Greene, W.H. (2003). Econometric Analysis, 5th edition. 
#            Upper Saddle River, NJ: Prentice Hall. 
# Link: http://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm.

# Load dataset
data(CreditCard)

# Subset data including predictor variables
bankcard <- subset(CreditCard, select = c(card, reports, age, income, owner, months))

# Recode card to 0, 1
bankcard$card <- ifelse(bankcard$card == "yes", 1, 0)

# Frequency table of the recoded outcome
# install.packages("descr")
library(descr)
freq(bankcard$card)

## bankcard$card 
##       Frequency Percent
## 0           296   22.44
## 1          1023   77.56
## Total      1319  100.00
set.seed(1001)
# Shuffle the rows into random order
newbankcard <- bankcard[sample(nrow(bankcard)),]

# Randomly select 70% of the row indices for the training set
t_idx <- sample(seq_len(nrow(bankcard)), size = round(0.70 * nrow(bankcard)))

# Build train and test data
traindata <- newbankcard[t_idx,]
testdata <- newbankcard[ - t_idx,]

# Decision tree model
dtree_creditcard <- rpart::rpart(formula = card ~ ., data = traindata, method = "class", control = rpart.control(cp = 0.001)) # complexity parameter

# Plot Decision tree 
rattle::fancyRpartPlot(dtree_creditcard, type = 1, main = "", caption = "Credit card approval" )

resultdt <- predict(dtree_creditcard, newdata = testdata, type = "class")

# Confusion matrix
cm_creditcarddt <- table(testdata$card, resultdt, dnn = c("Actual", "Predicted"))
cm_creditcarddt
##       Predicted
## Actual   0   1
##      0  35  57
##      1  18 286
# Predicted approval rate (precision: correctly predicted approvals / all predicted approvals)
cm_creditcarddt[4] / sum(cm_creditcarddt[, 2])
## [1] 0.8338192
# Predicted denial rate (precision: correctly predicted denials / all predicted denials)
cm_creditcarddt[1] / sum(cm_creditcarddt[, 1])
## [1] 0.6603774
# Accuracy
accuracydt <- sum(diag(cm_creditcarddt)) / sum(cm_creditcarddt)
accuracydt
## [1] 0.8106061
# install.packages("party") 
library(party)

# Conditional Inference Tree
cit <- ctree(card ~ ., data = traindata)
plot(cit, main = "Conditional Inference Tree: Credit card approval")

table(bankcard$card, bankcard$reports, dnn = c("Approved", "No. of reports"))
##         No. of reports
## Approved   0   1   2   3   4   5   6   7   9  10  11  12  14
##        0 145  47  37  20  16  11   5   6   2   1   4   1   1
##        1 915  90  13   4   1   0   0   0   0   0   0   0   0
# Confusion matrix (card is numeric 0/1, so ctree returns numeric predictions, which are rounded to 0/1)
cm_creditcardcit = table(testdata$card, round(predict(cit, newdata = testdata)), dnn = c("Actual", "Predicted"))

# Predicted approval rate (precision: correctly predicted approvals / all predicted approvals)
cm_creditcardcit[4] / sum(cm_creditcardcit[, 2])
## [1] 0.8271955
# Predicted denial rate (precision: correctly predicted denials / all predicted denials)
cm_creditcardcit[1] / sum(cm_creditcardcit[, 1])
## [1] 0.7209302
# Accuracy
accuracycit <- sum(diag(cm_creditcardcit)) / sum(cm_creditcardcit)
accuracycit
## [1] 0.8156566
# Random Forest
# install.packages("randomForest")
library(randomForest)
set.seed(1001)

# Random forest model (card is numeric 0/1, so a regression forest is fit; do.trace prints out-of-bag error every 100 trees)
rf_creditcard <- randomForest(card ~ ., data = traindata, importance = T, proximity = T, do.trace = 100)
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  100 |   0.1254    72.82 |
##  200 |   0.1247    72.41 |
##  300 |   0.1244    72.25 |
##  400 |   0.1244    72.28 |
##  500 |   0.1239    71.97 |
plot(rf_creditcard)

round(importance(rf_creditcard), 3) # to three decimal place
##         %IncMSE IncNodePurity
## reports  45.173        31.136
## age       7.795        10.158
## income   17.449        12.509
## owner    17.838         3.943
## months    7.410        10.297
# %IncMSE: increase in out-of-bag MSE when the variable is permuted (higher = more important)
# IncNodePurity: total decrease in node impurity, the loss function by which the best splits are chosen

# Predict on the test data and convert the numeric forest predictions to 0/1 using a 0.6 cutoff
resultrf <- predict(rf_creditcard, newdata = testdata)
resultrf_Approved <- ifelse(resultrf > 0.6, 1, 0)

# Confusion matrix
cm_creditcardrf <- table(testdata$card, resultrf_Approved, dnn = c("Actual", "Predicted"))
cm_creditcardrf
##       Predicted
## Actual   0   1
##      0  32  60
##      1  12 292
# Predicted approval rate (precision: correctly predicted approvals / all predicted approvals)
cm_creditcardrf[4] / sum(cm_creditcardrf[, 2])
## [1] 0.8295455
# Predicted denial rate (precision: correctly predicted denials / all predicted denials)
cm_creditcardrf[1] / sum(cm_creditcardrf[, 1])
## [1] 0.7272727
# Accuracy
accuracyrf <- sum(diag(cm_creditcardrf)) / sum(cm_creditcardrf)
accuracyrf
## [1] 0.8181818
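For convenience, the three test-set accuracies computed above can be collected side by side; the data.frame layout below is just one way of presenting them.

# Side-by-side comparison of the three models on the same test set
data.frame(model = c("Decision tree (rpart)",
                     "Conditional inference tree (party)",
                     "Random forest (randomForest)"),
           accuracy = round(c(accuracydt, accuracycit, accuracyrf), 4))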

1.2 References

Breiman, Leo (2001). “Statistical Modeling: The Two Cultures.” Statistical Science, 16(3), 199–231.

Mitchell, Tom M. (1997). Machine Learning. New York: McGraw-Hill.

Williams, Graham (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. New York: Springer.