The goal of this project is to predict how well each participant performed their exercises. The outcome is the “classe” variable, a factor with five levels (A, B, C, D, and E).
Since this is a classification problem, I will fit a decision tree and a random forest, choose the model with the higher accuracy on a held-out test set, and apply it to the provided validation data set.
library(tidyverse)
library(caret)
library(randomForest)
The data for this project comes from this source: http://groupware.les.inf.puc-rio.br/har.
# Full training data (to be split below) and the 20-case validation set
data <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
validation <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
dim(data)
## [1] 19622 160
dim(validation)
## [1] 20 160
# Split the data 70/30 into training and testing sets
inTrain <- createDataPartition(y=data$classe,p=0.7,list=F)
training <- data[inTrain,]
testing <- data[-inTrain,]
# Convert the outcome variable to a factor
training$classe <- factor(training$classe)
testing$classe <- factor(testing$classe)
dim(training)
## [1] 13737 160
dim(testing)
## [1] 5885 160
# The first seven columns (row id, user name, timestamps, window indicators)
# are identifiers rather than sensor measurements, so drop them
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
validation <- validation[,-c(1:7)]
# Count the missing values in each remaining column
NA_total <- sapply(1:ncol(training),function(x)sum(is.na(training[,x])))
NA_total
## [1] 0 0 0 0 0 0 0 0 0 0 13446 13446
## [13] 0 13446 13446 0 13446 13446 0 13446 13446 13446 13446 13446
## [25] 13446 13446 13446 13446 13446 0 0 0 0 0 0 0
## [37] 0 0 0 0 0 0 13446 13446 13446 13446 13446 13446
## [49] 13446 13446 13446 13446 0 0 0 0 0 0 0 0
## [61] 0 0 0 0 0 0 0 13446 13446 13446 13446 13446
## [73] 13446 13446 13446 13446 0 0 0 0 0 0 0 0
## [85] 0 13446 13446 0 13446 13446 0 13446 13446 0 0 13446
## [97] 13446 13446 13446 13446 13446 13446 13446 13446 13446 0 0 0
## [109] 0 0 0 0 0 0 0 0 0 0 0 0
## [121] 0 0 0 13446 13446 0 13446 13446 0 13446 13446 0
## [133] 0 13446 13446 13446 13446 13446 13446 13446 13446 13446 13446 0
## [145] 0 0 0 0 0 0 0 0 0
# Drop columns that are mostly NA (13446 of 13737 rows), applying the
# training-set selection to all three data sets
NA_cols <- which(NA_total>0)
training <- training[,-NA_cols]
testing <- testing[,-NA_cols]
validation <- validation[,-NA_cols]
dim(training)
## [1] 13737 86
dim(testing)
## [1] 5885 86
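As a side note, the per-column NA counts computed with `sapply()` above can be expressed more compactly with `colSums(is.na(...))`; a self-contained toy illustration (the data frame `df` here is made up, not the project data):

```r
# Equivalent, vectorised way to find mostly-NA columns
df <- data.frame(a = 1:5, b = c(1, NA, NA, NA, NA), c = letters[1:5])
na_total <- colSums(is.na(df))  # is.na(df) is a logical matrix; TRUEs sum per column
na_total                        # a: 0, b: 4, c: 0
keep <- na_total == 0
names(df)[keep]                 # "a" "c"
```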
# Identify near-zero-variance predictors, again based on the training set
zerovar_cols <- nearZeroVar(training,saveMetrics = TRUE)
training <- training[,zerovar_cols$nzv==FALSE]
testing <- testing[,zerovar_cols$nzv==FALSE]
validation <- validation[,zerovar_cols$nzv==FALSE]
dim(training)
## [1] 13737 53
dim(testing)
## [1] 5885 53
# Set a seed for reproducibility and use 3-fold cross-validation for tuning
set.seed(333)
cv <- trainControl(method="cv",number=3,verboseIter = TRUE)
# Decision tree (CART via rpart)
dt_model <- train(classe~.,data=training,method="rpart",trControl=cv)
## + Fold1: cp=0.03174
## - Fold1: cp=0.03174
## + Fold2: cp=0.03174
## - Fold2: cp=0.03174
## + Fold3: cp=0.03174
## - Fold3: cp=0.03174
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.0317 on full training set
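The `+ Fold`/`- Fold` lines above come from caret constructing the cross-validation folds internally. A base-R sketch of what 3-fold cross-validation does with the rows (the row count here is a toy value, not taken from the project data):

```r
# Conceptual 3-fold CV: partition row indices into 3 folds, then
# fit on two folds and evaluate on the held-out fold, rotating
set.seed(333)
n <- 150                                     # toy number of rows
fold_id <- sample(rep(1:3, length.out = n))  # random fold assignment
table(fold_id)                               # 50 rows per fold
for (k in 1:3) {
  train_idx <- which(fold_id != k)           # 100 rows to fit on
  test_idx  <- which(fold_id == k)           # 50 held-out rows to score
  # fit the model on train_idx, predict on test_idx, record accuracy ...
}
```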
predict_dt_model<- predict(dt_model,testing)
dt_accuracy <- confusionMatrix(predict_dt_model,testing$classe)$overall['Accuracy']
print(dt_accuracy)
## Accuracy
## 0.5338997
# Random forest with the same 3-fold cross-validation
rf_model <- train(classe~.,data=training,method="rf",trControl=cv)
## + Fold1: mtry= 2
## - Fold1: mtry= 2
## + Fold1: mtry=27
## - Fold1: mtry=27
## + Fold1: mtry=52
## - Fold1: mtry=52
## + Fold2: mtry= 2
## - Fold2: mtry= 2
## + Fold2: mtry=27
## - Fold2: mtry=27
## + Fold2: mtry=52
## - Fold2: mtry=52
## + Fold3: mtry= 2
## - Fold3: mtry= 2
## + Fold3: mtry=27
## - Fold3: mtry=27
## + Fold3: mtry=52
## - Fold3: mtry=52
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 27 on full training set
predict_rf_model <- predict(rf_model,testing)
rf_accuracy <- confusionMatrix(predict_rf_model,testing$classe)$overall['Accuracy']
print(rf_accuracy)
## Accuracy
## 0.9909941
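A natural companion to these accuracies is the estimated out-of-sample error, which on a held-out test set is simply 1 minus the accuracy. Using the two accuracy values printed above:

```r
# Estimated out-of-sample error = 1 - test-set accuracy
dt_error <- 1 - 0.5338997   # decision tree: roughly 46.6%
rf_error <- 1 - 0.9909941   # random forest: roughly 0.9%
round(c(rpart = dt_error, rf = rf_error), 4)
```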
# Apply the better model (random forest) to the 20 validation cases
validation_model <- predict(rf_model,validation)
validation_model
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
I used decision trees and a random forest to predict “classe”, which quantifies how well a person performs an exercise. Based on test-set accuracy (0.534 for the decision tree versus 0.991 for the random forest) and the correspondingly lower estimated out-of-sample error, the random forest performs much better, so it was used for the validation predictions.