Goal

The goal of this project is to predict how well each participant performed their exercises. This is captured by the “classe” variable, a factor with 5 levels (A, B, C, D, and E).

Since this is a classification problem, I will use decision tree and random forest methods, choose the model with the highest accuracy, and apply it to the provided validation data set.

library(tidyverse)
library(caret)
library(randomForest)

Data Source

The data for this project comes from this source: http://groupware.les.inf.puc-rio.br/har.

Data Exploration and Cleaning

data <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
validation <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")

dim(data)
## [1] 19622   160
dim(validation)
## [1]  20 160
  • The data set has 19,622 observations of 160 variables, while the validation set has 20 observations of 160 variables.
  • Before we proceed with data cleaning, let's partition the data into training and test sets (70:30)
inTrain <- createDataPartition(y=data$classe,p=0.7,list=F)

training <- data[inTrain,]
testing <- data[-inTrain,]

# Convert the outcome variable to a factor
training$classe <- factor(training$classe)
testing$classe <- factor(testing$classe)

dim(training)
## [1] 13737   160
dim(testing)
## [1] 5885  160
  • The training and test data sets have 13,737 and 5,885 observations, respectively.
  • An initial examination shows that the first 7 columns hold bookkeeping information (row indices, user names, timestamps, and window markers) that is not useful for predicting the “classe” variable, so we remove them. The dropped columns are listed below for reference.
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
validation <- validation[,-c(1:7)]
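For reference, a quick look at the dropped columns (a small check against the original data frame; the column names in the comment are an assumption about this particular CSV):

# First seven columns: bookkeeping fields, not sensor measurements
names(data)[1:7]
# expected: "X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
# "cvtd_timestamp", "new_window", "num_window"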
  • Next, we will check for missing values
NA_total <- sapply(1:ncol(training),function(x)sum(is.na(training[,x])))
NA_total
##   [1]     0     0     0     0     0     0     0     0     0     0 13446 13446
##  [13]     0 13446 13446     0 13446 13446     0 13446 13446 13446 13446 13446
##  [25] 13446 13446 13446 13446 13446     0     0     0     0     0     0     0
##  [37]     0     0     0     0     0     0 13446 13446 13446 13446 13446 13446
##  [49] 13446 13446 13446 13446     0     0     0     0     0     0     0     0
##  [61]     0     0     0     0     0     0     0 13446 13446 13446 13446 13446
##  [73] 13446 13446 13446 13446     0     0     0     0     0     0     0     0
##  [85]     0 13446 13446     0 13446 13446     0 13446 13446     0     0 13446
##  [97] 13446 13446 13446 13446 13446 13446 13446 13446 13446     0     0     0
## [109]     0     0     0     0     0     0     0     0     0     0     0     0
## [121]     0     0     0 13446 13446     0 13446 13446     0 13446 13446     0
## [133]     0 13446 13446 13446 13446 13446 13446 13446 13446 13446 13446     0
## [145]     0     0     0     0     0     0     0     0     0
  • Many columns have more than 90% of their values missing (13,446 NAs out of 13,737 rows). We will remove these columns, after the quick check below.
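Before dropping them, a quick sanity check on the per-column NA fraction (a small sketch; output not shown) confirms that the affected columns are ~98% empty:

na_frac <- colMeans(is.na(training))  # fraction of NAs in each column
table(na_frac > 0.9)                  # columns are either complete or ~98% NA
max(na_frac)                          # ~0.979 for the sparse columns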
NA_cols <- which(NA_total>0)
training <- training[,-NA_cols]
testing <- testing[,-NA_cols]
validation <- validation[,-NA_cols]
dim(training)
## [1] 13737    86
dim(testing)
## [1] 5885   86
  • Additionally, we remove columns with near-zero variance
zerovar_cols <- nearZeroVar(training,saveMetrics = TRUE)
training <- training[,zerovar_cols$nzv==FALSE]
testing <- testing[,zerovar_cols$nzv==FALSE]
validation <- validation[,zerovar_cols$nzv==FALSE]
dim(training)
## [1] 13737    53
dim(testing)
## [1] 5885   53

Model building

  • Cross-validation - I will use 3-fold cross-validation.
  • I will fit two models - one using a decision tree and the other using a random forest.

Decision tree

set.seed(333)
cv <- trainControl(method="cv",number=3,verboseIter = TRUE)
dt_model <- train(classe~.,data=training,method="rpart",trControl=cv)
## + Fold1: cp=0.03174 
## - Fold1: cp=0.03174 
## + Fold2: cp=0.03174 
## - Fold2: cp=0.03174 
## + Fold3: cp=0.03174 
## - Fold3: cp=0.03174 
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.0317 on full training set
predict_dt_model <- predict(dt_model,testing)
dt_accuracy <- confusionMatrix(predict_dt_model,testing$classe)$overall['Accuracy']
print(dt_accuracy)
##  Accuracy 
## 0.5338997
  • The accuracy is only ~53% with the decision tree model, which means the estimated out-of-sample error is ~47% - far too high.
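For a visual sense of what the tree learned, the final rpart model can be plotted (a quick sketch; assumes the rpart.plot package is installed, which the analysis above does not load):

library(rpart.plot)             # assumed available; not loaded above
# Visualize the tuned tree that caret selected (cp ~ 0.032)
rpart.plot(dt_model$finalModel)

Let's see how the random forest model performs.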

Random forest

rf_model <- train(classe~.,data=training,method="rf",trControl=cv)
## + Fold1: mtry= 2 
## - Fold1: mtry= 2 
## + Fold1: mtry=27 
## - Fold1: mtry=27 
## + Fold1: mtry=52 
## - Fold1: mtry=52 
## + Fold2: mtry= 2 
## - Fold2: mtry= 2 
## + Fold2: mtry=27 
## - Fold2: mtry=27 
## + Fold2: mtry=52 
## - Fold2: mtry=52 
## + Fold3: mtry= 2 
## - Fold3: mtry= 2 
## + Fold3: mtry=27 
## - Fold3: mtry=27 
## + Fold3: mtry=52 
## - Fold3: mtry=52 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 27 on full training set
predict_rf_model <- predict(rf_model,testing)
rf_accuracy <- confusionMatrix(predict_rf_model,testing$classe)$overall['Accuracy']
print(rf_accuracy)
##  Accuracy 
## 0.9909941
  • The random forest model performs much better, with ~99% accuracy, which means the estimated out-of-sample error is ~1%.
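To see which sensor features drive the forest, caret's varImp can rank the predictors (a quick sketch; output not shown):

rf_imp <- varImp(rf_model)  # importance from the underlying randomForest fit
plot(rf_imp, top = 10)      # plot the 10 most influential predictors

I will use this model to predict the validation data.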

Validation prediction

validation_model <- predict(rf_model,validation)
validation_model
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
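For readability, the predictions can be paired with case numbers (a small sketch; the case numbers are simply the row order of the validation file):

# Pair each validation case with its predicted classe
data.frame(case = seq_along(validation_model), classe = validation_model)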

Conclusion

I employed decision tree and random forest models to predict “classe”, which quantifies how well a participant performed the exercise. Based on accuracy and estimated out-of-sample error on a held-out test set, the random forest performs much better than the decision tree.
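As a compact summary, the two accuracy estimates computed above can be tabulated (a quick sketch reusing dt_accuracy and rf_accuracy):

# Side-by-side comparison on the same held-out test set
data.frame(model = c("Decision tree","Random forest"),
           accuracy = as.numeric(c(dt_accuracy,rf_accuracy)),
           est_oos_error = as.numeric(c(1-dt_accuracy,1-rf_accuracy)))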