Let’s see if we can find a model for predicting whether the exercise is being performed correctly.
First, load the data and strip out any columns containing NA values, along with columns that don’t contribute to prediction:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library(caret)
## Loading required package: lattice
library(plyr)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC() # register the parallel backend so caret can train across cores
set.seed(1234)
src_test <- read.csv("/Users/aaronbrady1/ocw/data_science/machine_learning/project/data/pml-testing.csv", header = TRUE)
src_train <- read.csv("/Users/aaronbrady1/ocw/data_science/machine_learning/project/data/pml-training.csv", header = TRUE)
# drop any column containing NA values, checking first against the training set, then the test set
na_cols <- colSums(is.na(src_train))
src_train <- src_train[,na_cols == 0]
src_test <- src_test[,na_cols == 0]
na_cols <- colSums(is.na(src_test))
src_train <- src_train[,na_cols == 0]
src_test <- src_test[,na_cols == 0]
# we don't need the id, name, or timestamp columns
removeCols <- function(cols) {
  # <<- updates the data frames in the global environment
  src_train <<- src_train[, !(names(src_train) %in% cols)]
  src_test <<- src_test[, !(names(src_test) %in% cols)]
}
removeCols(grep('timestamp', colnames(src_test), value = TRUE))
removeCols(c('X', 'user_name', 'new_window', 'num_window'))
# how are we doing for near-zero-variance predictors?
print(nearZeroVar(src_train))
## integer(0)
Okay, we’ve removed the superficially unimportant columns, and nearZeroVar flagged nothing further, so we’re in good shape.
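(Had nearZeroVar flagged anything, dropping those columns would be a one-liner; a minimal sketch using the integer indices it returns:)
nzv <- nearZeroVar(src_train)  # integer indices of near-zero-variance columns
if (length(nzv) > 0) {
  src_train <- src_train[, -nzv]
  src_test <- src_test[, -nzv]
}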
Now that we have the data we want, create the partitions:
set.seed(1234)
inTrain <- createDataPartition(src_train$classe, p = .8, list=FALSE)
training <- src_train[inTrain, ]
testing <- src_train[-inTrain, ]  # hold-out set; the assignment's pml-testing.csv stays untouched
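A quick sanity check on the split sizes; the training count should match the n = 15699 that rpart reports below:
nrow(training)  # ~80% of the rows
nrow(testing)   # the remaining ~20%, held out for validation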
First up, let’s try an rpart tree, because decision trees are easy to understand and print nicely:
set.seed(1234)
rpartFit <- train(classe ~ ., method = 'rpart', data = training)
## Loading required package: rpart
print(rpartFit$finalModel)
## n= 15699
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 15699 11235 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 14383 9931 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.95 1248 6 A (1 0.0048 0 0 0) *
## 5) pitch_forearm>=-33.95 13135 9925 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 439.5 11117 7968 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 122.5 6873 4065 A (0.41 0.18 0.18 0.17 0.06) *
## 21) roll_forearm>=122.5 4244 2852 C (0.08 0.18 0.33 0.23 0.18) *
## 11) magnet_dumbbell_y>=439.5 2018 1003 B (0.03 0.5 0.042 0.23 0.19) *
## 3) roll_belt>=130.5 1316 12 E (0.0091 0 0 0 0.99) *
fancyRpartPlot(rpartFit$finalModel)
rpartPred <- predict(rpartFit, newdata=testing)
summary(rpartPred)
## A B C D E
## 2065 511 1018 0 329
confusionMatrix(rpartPred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1025 310 325 290 115
## B 20 271 23 102 95
## C 69 178 336 251 184
## D 0 0 0 0 0
## E 2 0 0 0 327
##
## Overall Statistics
##
## Accuracy : 0.4994
## 95% CI : (0.4836, 0.5151)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3451
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9185 0.35705 0.49123 0.0000 0.45354
## Specificity 0.6295 0.92415 0.78944 1.0000 0.99938
## Pos Pred Value 0.4964 0.53033 0.33006 NaN 0.99392
## Neg Pred Value 0.9510 0.85698 0.88021 0.8361 0.89037
## Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Detection Rate 0.2613 0.06908 0.08565 0.0000 0.08335
## Detection Prevalence 0.5264 0.13026 0.25950 0.0000 0.08386
## Balanced Accuracy 0.7740 0.64060 0.64033 0.5000 0.72646
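For reference, that headline accuracy can be recomputed directly from the predictions:
mean(rpartPred == testing$classe)  # ~0.50 on the hold-out set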
Ugh, rpart is not very good: only about 50% accurate. Let’s try boosting to pick out some key features:
set.seed(1234)
gbmFit <- train(classe ~ ., method = "gbm", data = training, verbose = FALSE)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.1.3
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loaded gbm 2.1.1
summary(gbmFit)
## var rel.inf
## roll_belt roll_belt 22.3198467
## pitch_forearm pitch_forearm 10.7487812
## yaw_belt yaw_belt 8.4474295
## magnet_dumbbell_z magnet_dumbbell_z 7.7064731
## magnet_dumbbell_y magnet_dumbbell_y 6.3646577
## roll_forearm roll_forearm 4.6273659
## magnet_belt_z magnet_belt_z 4.0461978
## roll_dumbbell roll_dumbbell 3.4138967
## gyros_belt_z gyros_belt_z 3.0515055
## pitch_belt pitch_belt 2.6731955
## accel_forearm_x accel_forearm_x 2.5415391
## magnet_forearm_z magnet_forearm_z 2.4146177
## gyros_dumbbell_y gyros_dumbbell_y 2.2243675
## accel_dumbbell_y accel_dumbbell_y 1.9284495
## accel_dumbbell_x accel_dumbbell_x 1.8584779
## accel_forearm_z accel_forearm_z 1.7306543
## yaw_arm yaw_arm 1.3709726
## magnet_arm_z magnet_arm_z 1.2106224
## roll_arm roll_arm 1.0206263
## magnet_arm_x magnet_arm_x 0.9172204
## accel_dumbbell_z accel_dumbbell_z 0.8646804
## magnet_belt_y magnet_belt_y 0.8449430
## magnet_belt_x magnet_belt_x 0.8363704
## magnet_dumbbell_x magnet_dumbbell_x 0.8308878
## magnet_forearm_x magnet_forearm_x 0.7673637
## accel_belt_z accel_belt_z 0.6545404
## magnet_arm_y magnet_arm_y 0.6223349
## total_accel_dumbbell total_accel_dumbbell 0.6144934
## gyros_belt_y gyros_belt_y 0.5669822
## accel_arm_x accel_arm_x 0.5584667
## gyros_arm_y gyros_arm_y 0.4938171
## gyros_dumbbell_x gyros_dumbbell_x 0.4536457
## magnet_forearm_y magnet_forearm_y 0.2529316
## accel_forearm_y accel_forearm_y 0.2042467
## gyros_dumbbell_z gyros_dumbbell_z 0.2003007
## total_accel_forearm total_accel_forearm 0.1997011
## accel_arm_z accel_arm_z 0.1586373
## gyros_forearm_z gyros_forearm_z 0.1490166
## accel_arm_y accel_arm_y 0.1097432
## total_accel_belt total_accel_belt 0.0000000
## gyros_belt_x gyros_belt_x 0.0000000
## accel_belt_x accel_belt_x 0.0000000
## accel_belt_y accel_belt_y 0.0000000
## pitch_arm pitch_arm 0.0000000
## total_accel_arm total_accel_arm 0.0000000
## gyros_arm_x gyros_arm_x 0.0000000
## gyros_arm_z gyros_arm_z 0.0000000
## pitch_dumbbell pitch_dumbbell 0.0000000
## yaw_dumbbell yaw_dumbbell 0.0000000
## yaw_forearm yaw_forearm 0.0000000
## gyros_forearm_x gyros_forearm_x 0.0000000
## gyros_forearm_y gyros_forearm_y 0.0000000
gbmPred <- predict(gbmFit, newdata=testing)
summary(gbmPred)
## A B C D E
## 1125 767 688 635 708
confusionMatrix(gbmPred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1100 21 0 0 4
## B 10 727 22 2 6
## C 2 10 654 19 3
## D 3 0 8 618 6
## E 1 1 0 4 702
##
## Overall Statistics
##
## Accuracy : 0.9689
## 95% CI : (0.963, 0.9741)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9607
## Mcnemar's Test P-Value : 0.001416
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9857 0.9578 0.9561 0.9611 0.9736
## Specificity 0.9911 0.9874 0.9895 0.9948 0.9981
## Pos Pred Value 0.9778 0.9478 0.9506 0.9732 0.9915
## Neg Pred Value 0.9943 0.9899 0.9907 0.9924 0.9941
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2804 0.1853 0.1667 0.1575 0.1789
## Detection Prevalence 0.2868 0.1955 0.1754 0.1619 0.1805
## Balanced Accuracy 0.9884 0.9726 0.9728 0.9780 0.9859
Okay, about 97% accurate on the hold-out set; not too bad. Looking at the variable importance, though, there are quite a few columns that contribute nothing at all, so let’s get rid of those:
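The zero-importance names are typed out by hand below; as a sketch, they could instead be pulled programmatically from the fitted gbm (assuming gbm’s summary.gbm, with plotit = FALSE to suppress the plot):
imp <- summary(gbmFit$finalModel, plotit = FALSE)  # data frame of var / rel.inf
deadCols <- as.character(imp$var[imp$rel.inf == 0])
# removeCols(deadCols) would then drop them all in one call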
removeCols(c('total_accel_belt','gyros_belt_x','accel_belt_x','accel_belt_y','pitch_arm'))
removeCols(c('total_accel_arm','gyros_arm_x','gyros_arm_z','pitch_dumbbell','yaw_dumbbell'))
removeCols(c('yaw_forearm','gyros_forearm_x','gyros_forearm_y'))
# re-grab the training/testing partitions (same inTrain rows, fewer columns)
training <- src_train[inTrain, ]
testing <- src_train[-inTrain, ]
Let’s try a random forest with 3-fold cross-validation resampling:
rfFit <- train(classe ~ ., method = 'rf', data = training, trControl = trainControl(method = "cv", number = 3))
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
summary(rfFit)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 15699 factor numeric
## err.rate 3000 -none- numeric
## confusion 30 -none- numeric
## votes 78495 matrix numeric
## oob.times 15699 -none- numeric
## classes 5 -none- character
## importance 39 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 15699 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 39 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 5 -none- character
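Since we trained with 3-fold CV, caret’s resampling results already give an out-of-sample error estimate before we even touch the hold-out set; the standard train-object accessors show it:
rfFit$results   # accuracy/kappa per candidate mtry, averaged over the 3 folds
rfFit$resample  # per-fold accuracy/kappa for the selected model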
rfPred <- predict(rfFit, newdata=testing)
summary(rfPred)
## A B C D E
## 1120 758 683 641 721
confusionMatrix(rfPred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 4 0 0 0
## B 0 754 4 0 0
## C 0 1 679 3 0
## D 0 0 1 640 0
## E 0 0 0 0 721
##
## Overall Statistics
##
## Accuracy : 0.9967
## 95% CI : (0.9943, 0.9982)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9958
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9934 0.9927 0.9953 1.0000
## Specificity 0.9986 0.9987 0.9988 0.9997 1.0000
## Pos Pred Value 0.9964 0.9947 0.9941 0.9984 1.0000
## Neg Pred Value 1.0000 0.9984 0.9985 0.9991 1.0000
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1922 0.1731 0.1631 0.1838
## Detection Prevalence 0.2855 0.1932 0.1741 0.1634 0.1838
## Balanced Accuracy 0.9993 0.9961 0.9957 0.9975 1.0000
99.7% accurate on the hold-out set. Pretty good!
The random forest with the pruned feature set looks good, so let’s apply it to the assignment’s test data. This is just for the submission:
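(The chunk that produced the summary below wasn’t echoed; a minimal equivalent, assuming the cleaned-up src_test, would be:)
answers <- predict(rfFit, newdata = src_test)  # predict the 20 assignment cases
summary(answers)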
## A B C D E
## 7 8 1 1 3
Result: every assignment answer predicted correctly, yay!