Let’s see if we can find a model for predicting whether the exercise is being done correctly.

First, load the data and strip out any columns containing NA values, along with any columns that aren’t contributing:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library(caret)
## Loading required package: lattice
library(plyr)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC()  # register the multicore backend so caret can train models in parallel
set.seed(1234)
src_test <- read.csv("/Users/aaronbrady1/ocw/data_science/machine_learning/project/data/pml-testing.csv", header = TRUE)
src_train <- read.csv("/Users/aaronbrady1/ocw/data_science/machine_learning/project/data/pml-training.csv", header = TRUE)
# strip out any columns that contain NAs in the training set...
na_cols <- colSums(is.na(src_train))
src_train <- src_train[, na_cols == 0]
src_test <- src_test[, na_cols == 0]
# ...and any columns that contain NAs in the test set
na_cols <- colSums(is.na(src_test))
src_train <- src_train[, na_cols == 0]
src_test <- src_test[, na_cols == 0]

# we don't need the row id, user name, timestamps, or window bookkeeping columns
removeCols <- function(cols) {
  # drop the named columns from both data sets; <<- assigns back into the
  # global environment so the cleaned frames replace the originals
  src_train <<- src_train[, !(names(src_train) %in% cols)]
  src_test <<- src_test[, !(names(src_test) %in% cols)]
}
removeCols(grep('timestamp', colnames(src_test), value = TRUE))
removeCols(c('X', 'user_name', 'new_window', 'num_window'))

# how are we doing for near-zero-variance predictors?
print(nearZeroVar(src_train))
## integer(0)

Okay, we’ve gotten rid of all the superficially unimportant data, and nearZeroVar came up with nothing, so we’re good.
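
A quick sanity check on what survived the cleaning (a sketch; the resulting dimensions aren’t shown here):

dim(src_train)
dim(src_test)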

Now that we have the data we want, create the partitions:

set.seed(1234)
inTrain <- createDataPartition(src_train$classe, p = .8, list=FALSE)
training <- src_train[inTrain, ]
testing <- src_train[-inTrain, ]

First up, let’s try an rpart tree, because they are easy to understand and print nicely:

set.seed(1234)
rpartFit <- train(classe ~ ., method = 'rpart', data = training)
## Loading required package: rpart
print(rpartFit$finalModel)
## n= 15699 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 15699 11235 A (0.28 0.19 0.17 0.16 0.18)  
##    2) roll_belt< 130.5 14383  9931 A (0.31 0.21 0.19 0.18 0.11)  
##      4) pitch_forearm< -33.95 1248     6 A (1 0.0048 0 0 0) *
##      5) pitch_forearm>=-33.95 13135  9925 A (0.24 0.23 0.21 0.2 0.12)  
##       10) magnet_dumbbell_y< 439.5 11117  7968 A (0.28 0.18 0.24 0.19 0.11)  
##         20) roll_forearm< 122.5 6873  4065 A (0.41 0.18 0.18 0.17 0.06) *
##         21) roll_forearm>=122.5 4244  2852 C (0.08 0.18 0.33 0.23 0.18) *
##       11) magnet_dumbbell_y>=439.5 2018  1003 B (0.03 0.5 0.042 0.23 0.19) *
##    3) roll_belt>=130.5 1316    12 E (0.0091 0 0 0 0.99) *
fancyRpartPlot(rpartFit$finalModel)

rpartPred <- predict(rpartFit, newdata=testing)
summary(rpartPred)
##    A    B    C    D    E 
## 2065  511 1018    0  329
confusionMatrix(rpartPred,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1025  310  325  290  115
##          B   20  271   23  102   95
##          C   69  178  336  251  184
##          D    0    0    0    0    0
##          E    2    0    0    0  327
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4994          
##                  95% CI : (0.4836, 0.5151)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3451          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9185  0.35705  0.49123   0.0000  0.45354
## Specificity            0.6295  0.92415  0.78944   1.0000  0.99938
## Pos Pred Value         0.4964  0.53033  0.33006      NaN  0.99392
## Neg Pred Value         0.9510  0.85698  0.88021   0.8361  0.89037
## Prevalence             0.2845  0.19347  0.17436   0.1639  0.18379
## Detection Rate         0.2613  0.06908  0.08565   0.0000  0.08335
## Detection Prevalence   0.5264  0.13026  0.25950   0.0000  0.08386
## Balanced Accuracy      0.7740  0.64060  0.64033   0.5000  0.72646

Ugh, rpart is not very good: only about 50% accurate. Let’s try boosting to get some key features:

set.seed(1234)
gbmFit <- train(classe ~ ., method = "gbm", data = training, verbose = FALSE)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.1.3
## Loading required package: survival
## Loading required package: splines
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loaded gbm 2.1.1
summary(gbmFit)

##                                       var    rel.inf
## roll_belt                       roll_belt 22.3198467
## pitch_forearm               pitch_forearm 10.7487812
## yaw_belt                         yaw_belt  8.4474295
## magnet_dumbbell_z       magnet_dumbbell_z  7.7064731
## magnet_dumbbell_y       magnet_dumbbell_y  6.3646577
## roll_forearm                 roll_forearm  4.6273659
## magnet_belt_z               magnet_belt_z  4.0461978
## roll_dumbbell               roll_dumbbell  3.4138967
## gyros_belt_z                 gyros_belt_z  3.0515055
## pitch_belt                     pitch_belt  2.6731955
## accel_forearm_x           accel_forearm_x  2.5415391
## magnet_forearm_z         magnet_forearm_z  2.4146177
## gyros_dumbbell_y         gyros_dumbbell_y  2.2243675
## accel_dumbbell_y         accel_dumbbell_y  1.9284495
## accel_dumbbell_x         accel_dumbbell_x  1.8584779
## accel_forearm_z           accel_forearm_z  1.7306543
## yaw_arm                           yaw_arm  1.3709726
## magnet_arm_z                 magnet_arm_z  1.2106224
## roll_arm                         roll_arm  1.0206263
## magnet_arm_x                 magnet_arm_x  0.9172204
## accel_dumbbell_z         accel_dumbbell_z  0.8646804
## magnet_belt_y               magnet_belt_y  0.8449430
## magnet_belt_x               magnet_belt_x  0.8363704
## magnet_dumbbell_x       magnet_dumbbell_x  0.8308878
## magnet_forearm_x         magnet_forearm_x  0.7673637
## accel_belt_z                 accel_belt_z  0.6545404
## magnet_arm_y                 magnet_arm_y  0.6223349
## total_accel_dumbbell total_accel_dumbbell  0.6144934
## gyros_belt_y                 gyros_belt_y  0.5669822
## accel_arm_x                   accel_arm_x  0.5584667
## gyros_arm_y                   gyros_arm_y  0.4938171
## gyros_dumbbell_x         gyros_dumbbell_x  0.4536457
## magnet_forearm_y         magnet_forearm_y  0.2529316
## accel_forearm_y           accel_forearm_y  0.2042467
## gyros_dumbbell_z         gyros_dumbbell_z  0.2003007
## total_accel_forearm   total_accel_forearm  0.1997011
## accel_arm_z                   accel_arm_z  0.1586373
## gyros_forearm_z           gyros_forearm_z  0.1490166
## accel_arm_y                   accel_arm_y  0.1097432
## total_accel_belt         total_accel_belt  0.0000000
## gyros_belt_x                 gyros_belt_x  0.0000000
## accel_belt_x                 accel_belt_x  0.0000000
## accel_belt_y                 accel_belt_y  0.0000000
## pitch_arm                       pitch_arm  0.0000000
## total_accel_arm           total_accel_arm  0.0000000
## gyros_arm_x                   gyros_arm_x  0.0000000
## gyros_arm_z                   gyros_arm_z  0.0000000
## pitch_dumbbell             pitch_dumbbell  0.0000000
## yaw_dumbbell                 yaw_dumbbell  0.0000000
## yaw_forearm                   yaw_forearm  0.0000000
## gyros_forearm_x           gyros_forearm_x  0.0000000
## gyros_forearm_y           gyros_forearm_y  0.0000000
gbmPred <- predict(gbmFit, newdata=testing)
summary(gbmPred)
##    A    B    C    D    E 
## 1125  767  688  635  708
confusionMatrix(gbmPred,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1100   21    0    0    4
##          B   10  727   22    2    6
##          C    2   10  654   19    3
##          D    3    0    8  618    6
##          E    1    1    0    4  702
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9689         
##                  95% CI : (0.963, 0.9741)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9607         
##  Mcnemar's Test P-Value : 0.001416       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9857   0.9578   0.9561   0.9611   0.9736
## Specificity            0.9911   0.9874   0.9895   0.9948   0.9981
## Pos Pred Value         0.9778   0.9478   0.9506   0.9732   0.9915
## Neg Pred Value         0.9943   0.9899   0.9907   0.9924   0.9941
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2804   0.1853   0.1667   0.1575   0.1789
## Detection Prevalence   0.2868   0.1955   0.1754   0.1619   0.1805
## Balanced Accuracy      0.9884   0.9726   0.9728   0.9780   0.9859

Okay, 97% accurate on the test set; not too bad. Looking at the variable importances, though, there are quite a few columns that don’t contribute anything, so let’s get rid of those:

removeCols(c('total_accel_belt','gyros_belt_x','accel_belt_x','accel_belt_y','pitch_arm'))
removeCols(c('total_accel_arm','gyros_arm_x','gyros_arm_z','pitch_dumbbell','yaw_dumbbell'))
removeCols(c('yaw_forearm','gyros_forearm_x','gyros_forearm_y'))
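
Hardcoding those names is brittle; the same zero-influence columns could instead be pulled straight out of the variable-importance table that summary(gbmFit) printed above (a sketch):

gbmImp <- summary(gbmFit)                                  # data frame with var and rel.inf columns
removeCols(as.character(gbmImp$var[gbmImp$rel.inf == 0]))  # drop everything with zero relative influence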

# re-create the training/testing partitions with the columns removed
training <- src_train[inTrain, ]
testing <- src_train[-inTrain, ]

Let’s try a random forest with 3-fold cross-validation resampling:

rfFit <- train(classe ~ ., method = 'rf', data = training, trControl = trainControl(method = "cv", number = 3))
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
summary(rfFit)
##                 Length Class      Mode     
## call                4  -none-     call     
## type                1  -none-     character
## predicted       15699  factor     numeric  
## err.rate         3000  -none-     numeric  
## confusion          30  -none-     numeric  
## votes           78495  matrix     numeric  
## oob.times       15699  -none-     numeric  
## classes             5  -none-     character
## importance         39  -none-     numeric  
## importanceSD        0  -none-     NULL     
## localImportance     0  -none-     NULL     
## proximity           0  -none-     NULL     
## ntree               1  -none-     numeric  
## mtry                1  -none-     numeric  
## forest             14  -none-     list     
## y               15699  factor     numeric  
## test                0  -none-     NULL     
## inbag               0  -none-     NULL     
## xNames             39  -none-     character
## problemType         1  -none-     character
## tuneValue           1  data.frame list     
## obsLevels           5  -none-     character
rfPred <- predict(rfFit, newdata=testing)
summary(rfPred)
##    A    B    C    D    E 
## 1120  758  683  641  721
confusionMatrix(rfPred,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    4    0    0    0
##          B    0  754    4    0    0
##          C    0    1  679    3    0
##          D    0    0    1  640    0
##          E    0    0    0    0  721
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9967          
##                  95% CI : (0.9943, 0.9982)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9958          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9934   0.9927   0.9953   1.0000
## Specificity            0.9986   0.9987   0.9988   0.9997   1.0000
## Pos Pred Value         0.9964   0.9947   0.9941   0.9984   1.0000
## Neg Pred Value         1.0000   0.9984   0.9985   0.9991   1.0000
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1922   0.1731   0.1631   0.1838
## Detection Prevalence   0.2855   0.1932   0.1741   0.1634   0.1838
## Balanced Accuracy      0.9993   0.9961   0.9957   0.9975   1.0000

Over 99% accurate. Pretty good!
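
Since testing was held out from model fitting, the expected out-of-sample error can be estimated directly from that confusion matrix; a quick sketch:

cm <- confusionMatrix(rfPred, testing$classe)
1 - as.numeric(cm$overall["Accuracy"])  # estimated out-of-sample error; roughly 0.0033 given the accuracy above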

Random forest with our reduced feature set seems good, so let’s apply it to the held-out test data; this last bit is just for the assignment:
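
A minimal sketch of the prediction call, assuming the cleaned src_test from above:

answers <- predict(rfFit, newdata = src_test)
summary(answers)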

## A B C D E 
## 7 8 1 1 3

Result: every assignment answer predicted correctly, yay!