Out Of Sample Error Rate
This analysis trains a prediction model on the accelerometer data. The algorithm I will be using for this exercise is a random forest classifier.

The first step is to load the training data and subset it into a training and a testing set. I am subsetting the training data to create an additional test set because I want a separate testing set that will give an unbiased estimate of the model's out-of-sample error before the model has to classify the actual test set. The caret package will be used for data subsetting, training and cross-validation of the model:

```r
library(caret)
## Warning: package 'caret' was built under R version 2.15.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 2.15.3
```

Training data subsets:

```r
# Global training data
training <- read.table("P:/Coursera/pml-training.csv", header = TRUE, sep = ",")

# Remove the first 8 columns, as those are just 'housekeeping' columns for the data
training2 <- training[, -seq(from = 1, to = 8, by = 1)]

# Seed the random number generator for subsetting
set.seed(1234)

# Test subset: 30% of the global training data
inTest <- createDataPartition(y = training2$classe, p = 0.3, list = F)
testSub <- training2[inTest, ]

# Training subset: remaining 70% of the global training data
trainingSub <- training2[-inTest, ]
```

The training data consists of 152 variables (160 excluding the first 8), but many of the variables are sparse, meaning that they only have observations for a few of the data points. These sparse variables may have predictive value, but because they are observed so infrequently they are fairly useless for classifying the majority of data points that lack those observations. It therefore makes sense to filter these inputs out and focus the prediction effort on variables that have at least 90% of their observations filled in.

```r
# Function for determining the sparseness of a variable
sparseness <- function(a) {
    n <- length(a)
    na.count <- sum(is.na(a))
    return((n - na.count)/n)
}

# Sparseness of input variables, based on the training subset
variable.sparseness <- apply(trainingSub, 2, sparseness)

# Trim down the subsets by removing sparse variables
trimTrainSub <- trainingSub[, variable.sparseness > 0.9]
```

Choosing my prediction algorithm: the predictor I intend to use for this classification problem is a random forest. The reasons I am employing a random forest are: after filtering out sparse variables there are still 52 input variables to work with, and random forests are particularly well suited to problems with many candidate predictors.
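Once a model is fit on the trimmed training subset, the held-out `testSub` yields the out-of-sample error estimate as the share of misclassified held-out cases. The sketch below illustrates that computation; simulated data and a `glm` stand in for the real data file and the random forest so the example runs on its own (all variable names here mirror the write-up but are illustrative):

```r
# Sketch: computing the out-of-sample error rate on a held-out subset.
# Simulated data and a logistic regression stand in for the accelerometer
# data and the random forest; the error-rate computation is the point.
set.seed(1234)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$classe <- factor(ifelse(d$x1 + d$x2 + rnorm(n, sd = 0.5) > 0, "A", "B"))

inTest <- sample(seq_len(n), size = 0.3 * n)   # 30% test subset
testSub <- d[inTest, ]
trainSub <- d[-inTest, ]

# Fit on the training subset only
fit <- glm(classe ~ x1 + x2, data = trainSub, family = binomial)

# Predicted class for each held-out case (glm models P(second level) = P("B"))
p <- predict(fit, newdata = testSub, type = "response")
pred <- factor(ifelse(p > 0.5, "B", "A"), levels = levels(testSub$classe))

# Out-of-sample error rate: proportion of held-out cases misclassified
oos.error <- mean(pred != testSub$classe)
```

Because `testSub` played no role in fitting, `oos.error` is an unbiased estimate of the error the model will make on genuinely new data.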
From Cross Validated: "Difference between 'in-sample' and 'pseudo out-of-sample' forecasts" (http://stats.stackexchange.com/questions/74865/difference-between-in-sample-and-pseudo-out-of-sample-forecasts)

Question: Probably a very basic question to the forecasters around here, but I was wondering whether there is an explicit difference between in-sample forecasts and pseudo out-of-sample forecasts. Both are meant in the context of evaluating and comparing forecasting models.

Accepted answer: Suppose you have data $\{Y_t,X_{t-h}\}_{t=h+1}^T$, where $h \in \{1,2,\ldots\}$, and your goal is to build a model, say $\hat f(X_{t-h})$, to predict $Y_t$ given $X_{t-h}$. For concreteness, suppose the data is daily and $T$ corresponds to today. In-sample analysis means estimating the model using all available data up to and including $T$, and then comparing the model's fitted values to the actual realizations. However, this procedure is known to draw an overly optimistic picture of the model's forecasting ability, since common fitting criteria (e.g. squared error or likelihood) tailor the model to the estimation sample. Pseudo out-of-sample analysis instead estimates the model using only data up to some cutoff before $T$, forecasts the remaining held-out observations, and compares those forecasts to the realizations, mimicking genuine real-time forecasting.
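The contrast in the answer can be made concrete with a small base-R sketch (the AR(1) setup and all names here are illustrative, not from the answer): fit once on the full series for the in-sample measure, then refit on an initial window and forecast the held-out tail for the pseudo out-of-sample measure.

```r
# Toy illustration of in-sample vs pseudo out-of-sample forecast evaluation,
# using a simple AR(1)-style regression in base R.
set.seed(1)
T.len <- 200
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = T.len))

# In-sample: fit on all T observations, compare fitted values to actuals
fit.all <- lm(y[-1] ~ y[-T.len])
in.sample.rmse <- sqrt(mean(residuals(fit.all)^2))

# Pseudo out-of-sample: fit only on the first 150 points, forecast the rest
split <- 150
fit.sub <- lm(y[2:split] ~ y[1:(split - 1)])
preds <- coef(fit.sub)[1] + coef(fit.sub)[2] * y[split:(T.len - 1)]
oos.rmse <- sqrt(mean((y[(split + 1):T.len] - preds)^2))

# The in-sample RMSE is typically the smaller (over-optimistic) of the two
c(in.sample = in.sample.rmse, pseudo.oos = oos.rmse)
```

The pseudo out-of-sample figure is the more honest guide to forecasting performance, because the forecast-period data never influenced the fitted coefficients.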
From Cross Validated: "Computing Out of Bag error in Random Forest" (http://stats.stackexchange.com/questions/68740/computing-out-of-bag-error-in-random-forest)

Question: I am implementing a Random Forest classifier as a side project, and I am a bit unclear on what the correct approach is to compute, say, the OOB estimate for the classifier error rate. My understanding is that typically, for each tree in the forest, one creates a training sample from the original sample by drawing examples with repetition, and what is left out can be used to compute out-of-bag estimates. The part I am unclear about is how to aggregate the errors across the different out-of-bag samples. The naive approach would be, for each tree, to count how many OOB examples are misclassified, and to compute the average misclassification rate over all of them (total misclassified / total examples out of bag). However, it seems to me that this would in essence compute the average classification error of the individual trees, missing the fact that the forest takes a majority vote over the verdict of each tree, compensating for "weaker" trees.
A more complicated way would be to take each OOB example, look up for each tree whether it was included in that tree's training sample or not, and take a majority vote over all the trees that did not use that example for training; the misclassification rate of those majority-vote predictions would then be the OOB error estimate.
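The second approach described above — a per-example majority vote over only the trees that did not train on that example — is the standard OOB aggregation. A base-R sketch of that bookkeeping follows; a 1-nearest-neighbour rule stands in for each tree so the example is self-contained (the learner is illustrative; the vote aggregation is the point):

```r
# Sketch of per-example OOB error: for each example, majority-vote over
# only the bootstrap "trees" that did NOT include it in their sample.
set.seed(7)
n <- 120
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "A", "B"))

B <- 50                                      # number of "trees"
votes <- matrix(NA_character_, nrow = n, ncol = B)

for (b in seq_len(B)) {
  boot <- sample(seq_len(n), replace = TRUE)  # bootstrap training sample
  oob <- setdiff(seq_len(n), boot)            # examples this "tree" never saw
  for (i in oob) {
    # 1-NN prediction from the bootstrap sample (tree stand-in)
    d2 <- colSums((t(x[boot, , drop = FALSE]) - x[i, ])^2)
    votes[i, b] <- as.character(y[boot][which.min(d2)])
  }
}

# Majority vote per example, over only the trees for which it was OOB
oob.pred <- apply(votes, 1, function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) return(NA_character_)
  names(sort(table(v), decreasing = TRUE))[1]
})

# OOB error estimate: misclassification rate of the majority-vote predictions
oob.error <- mean(oob.pred != as.character(y), na.rm = TRUE)
```

Each example is out-of-bag for roughly a third of the trees (since a bootstrap sample omits each example with probability about $e^{-1}$), so with enough trees every example receives votes and the aggregate estimate reflects the forest's ensemble vote rather than the average single-tree error.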