Goal

The goal of this project is to predict how well each participant performed their exercises. This is captured by the “classe” variable, a factor with 5 levels (A, B, C, D, and E).

Since this is a classification problem, I will use decision tree and random forest methods, choose the model with the highest accuracy, and apply it to the provided validation data set.

library(tidyverse)
library(caret)
library(randomForest)

Data Source

The data for this project comes from this source: http://groupware.les.inf.puc-rio.br/har.

Data Exploration and Cleaning

data <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
validation <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")

dim(data)
## [1] 19622   160
dim(validation)
## [1]  20 160
  • The data set has 19,622 observations of 160 variables, while the validation set has 20 observations of 160 variables.
  • Before we proceed with data cleaning, let's partition the data into training and test sets (70:30)
inTrain <- createDataPartition(y=data$classe,p=0.7,list=F)

training <- data[inTrain,]
testing <- data[-inTrain,]

# Convert the outcome variable to a factor
training$classe <- factor(training$classe)
testing$classe <- factor(testing$classe)

dim(training)
## [1] 13737   160
dim(testing)
## [1] 5885  160
  • The training and test data sets have 13,737 and 5,885 observations, respectively.
  • An initial examination shows that the first 7 columns hold bookkeeping information (row indices, user names, timestamps, and window markers) that is not useful for predicting the “classe” variable, so we remove them. The dropped columns are listed below for reference.
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
validation <- validation[,-c(1:7)]
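For reference, a quick look at the dropped columns (a small check against the original data frame; the column names in the comment are an assumption about this particular CSV):

# First seven columns: bookkeeping fields, not sensor measurements
names(data)[1:7]
# expected: "X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
# "cvtd_timestamp", "new_window", "num_window"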
  • Next, we will check for missing values
NA_total <- sapply(1:ncol(training),function(x)sum(is.na(training[,x])))
NA_total
##   [1]     0     0     0     0     0     0     0     0     0     0 13446 13446
##  [13]     0 13446 13446     0 13446 13446     0 13446 13446 13446 13446 13446
##  [25] 13446 13446 13446 13446 13446     0     0     0     0     0     0     0
##  [37]     0     0     0     0     0     0 13446 13446 13446 13446 13446 13446
##  [49] 13446 13446 13446 13446     0     0     0     0     0     0     0     0
##  [61]     0     0     0     0     0     0     0 13446 13446 13446 13446 13446
##  [73] 13446 13446 13446 13446     0     0     0     0     0     0     0     0
##  [85]     0 13446 13446     0 13446 13446     0 13446 13446     0     0 13446
##  [97] 13446 13446 13446 13446 13446 13446 13446 13446 13446     0     0     0
## [109]     0     0     0     0     0     0     0     0     0     0     0     0
## [121]     0     0     0 13446 13446     0 13446 13446     0 13446 13446     0
## [133]     0 13446 13446 13446 13446 13446 13446 13446 13446 13446 13446     0
## [145]     0     0     0     0     0     0     0     0     0
  • Many columns have more than 90% of their values missing (13,446 NAs out of 13,737 rows). We will remove these columns, after the quick check below.
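Before dropping them, a quick sanity check on the per-column NA fraction (a small sketch; output not shown) confirms that the affected columns are ~98% empty:

na_frac <- colMeans(is.na(training))  # fraction of NAs in each column
table(na_frac > 0.9)                  # columns are either complete or ~98% NA
max(na_frac)                          # ~0.979 for the sparse columns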
NA_cols <- which(NA_total>0)
training <- training[,-NA_cols]
testing <- testing[,-NA_cols]
validation <- validation[,-NA_cols]
dim(training)
## [1] 13737    86
dim(testing)
## [1] 5885   86
  • Additionally, we remove columns with near-zero variance
zerovar_cols <- nearZeroVar(training,saveMetrics = TRUE)
training <- training[,zerovar_cols$nzv==FALSE]
testing <- testing[,zerovar_cols$nzv==FALSE]
validation <- validation[,zerovar_cols$nzv==FALSE]
dim(training)
## [1] 13737    53
dim(testing)
## [1] 5885   53

Model building

  • Cross-validation - I will use 3-fold cross-validation.
  • I will fit two models - one using a decision tree and the other using a random forest.

Decision tree

set.seed(333)
cv <- trainControl(method="cv",number=3,verboseIter = TRUE)
dt_model <- train(classe~.,data=training,method="rpart",trControl=cv)
## + Fold1: cp=0.03174 
## - Fold1: cp=0.03174 
## + Fold2: cp=0.03174 
## - Fold2: cp=0.03174 
## + Fold3: cp=0.03174 
## - Fold3: cp=0.03174 
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.0317 on full training set
predict_dt_model <- predict(dt_model,testing)
dt_accuracy <- confusionMatrix(predict_dt_model,testing$classe)$overall['Accuracy']
print(dt_accuracy)
##  Accuracy 
## 0.5338997
  • The accuracy is only ~53% with the decision tree model, which means the estimated out-of-sample error is ~47% - far too high.
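For a visual sense of what the tree learned, the final rpart model can be plotted (a quick sketch; assumes the rpart.plot package is installed, which the analysis above does not load):

library(rpart.plot)             # assumed available; not loaded above
# Visualize the tuned tree that caret selected (cp ~ 0.032)
rpart.plot(dt_model$finalModel)

Let's see how the random forest model performs.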

Random forest

rf_model <- train(classe~.,data=training,method="rf",trControl=cv)
## + Fold1: mtry= 2 
## - Fold1: mtry= 2 
## + Fold1: mtry=27 
## - Fold1: mtry=27 
## + Fold1: mtry=52 
## - Fold1: mtry=52 
## + Fold2: mtry= 2 
## - Fold2: mtry= 2 
## + Fold2: mtry=27 
## - Fold2: mtry=27 
## + Fold2: mtry=52 
## - Fold2: mtry=52 
## + Fold3: mtry= 2 
## - Fold3: mtry= 2 
## + Fold3: mtry=27 
## - Fold3: mtry=27 
## + Fold3: mtry=52 
## - Fold3: mtry=52 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 27 on full training set
predict_rf_model <- predict(rf_model,testing)
rf_accuracy <- confusionMatrix(predict_rf_model,testing$classe)$overall['Accuracy']
print(rf_accuracy)
##  Accuracy 
## 0.9909941
  • The random forest model performs much better, with ~99% accuracy, which means the estimated out-of-sample error is ~1%.
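To see which sensor features drive the forest, caret's varImp can rank the predictors (a quick sketch; output not shown):

rf_imp <- varImp(rf_model)  # importance from the underlying randomForest fit
plot(rf_imp, top = 10)      # plot the 10 most influential predictors

I will use this model to predict the validation data.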

Validation prediction

validation_model <- predict(rf_model,validation)
validation_model
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
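For readability, the predictions can be paired with case numbers (a small sketch; the case numbers are simply the row order of the validation file):

# Pair each validation case with its predicted classe
data.frame(case = seq_along(validation_model), classe = validation_model)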

Conclusion

I employed decision tree and random forest models to predict “classe”, which quantifies how well a participant performed the exercise. Based on accuracy and estimated out-of-sample error on a held-out test set, the random forest performs much better than the decision tree.
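As a compact summary, the two accuracy estimates computed above can be tabulated (a quick sketch reusing dt_accuracy and rf_accuracy):

# Side-by-side comparison on the same held-out test set
data.frame(model = c("Decision tree","Random forest"),
           accuracy = as.numeric(c(dt_accuracy,rf_accuracy)),
           est_oos_error = as.numeric(c(1-dt_accuracy,1-rf_accuracy)))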