Transforming Data and Optimizing ML Regression Models¶
In this project, we will look into the Auto MPG dataset.
We will prepare the data for machine learning and compare the performance of different regression models on the Auto-MPG data set. Eleven forms of regression are compared, and each has a different predictive ability near the boundary of the training data. For example, while Ridge regression allows you to analyze data even when severe multicollinearity is present and helps prevent overfitting, Lasso regression (least absolute shrinkage and selection operator) performs variable selection that aims to increase prediction accuracy by identifying a simpler model. Each form has its own importance and specific conditions under which it is best suited.
The following regression models will be studied:
1. Ordinary Least Squares Regression
2. Stepwise Linear Regression
3. Principal Component Regression (PCA)
4. Partial Least Squares Regression
5. Ridge Regression
6. Random Forest Regression
7. Support Vector Regression
8. Lasso Regression
9. Decision Tree Regression
10. Quantile Regression
11. kNN Regression
1. Loading the Data¶
We will use the read.csv() function to load the data we prepared earlier.
df<-read.csv("auto-mpg.csv")
dim(df)
head(df, 5)
str(df)
Before starting the statistical analysis, we need to remove the non-numeric identifier columns (the row index X and car_name) from the data using negative subscripts:
df<-df[c(-1)] #removing X
df<-df[c(-9)] #removing car_name
head(df, 5)
2. Data Transformation¶
We will visualize the data with scatterplots to see the association, or correlation, between variables. In some cases the relationship between two variables may not appear linear; a correlation analysis can still be informative in such cases, and a residual plot can reveal whether the data follow a random pattern or whether a nonlinear relationship is present.
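Since residual plots are mentioned above, here is a minimal sketch of one, assuming the df object loaded above and a simple linear fit of mpg on horsepower; curvature in the residuals would signal a nonlinear relationship.
# fit a simple linear model and inspect its residuals for non-random structure
simple_fit <- lm(mpg ~ horsepower, data = df)
plot(df$horsepower, resid(simple_fit), xlab = "horsepower", ylab = "residual", pch = 16)
abline(h = 0, col = "blue", lwd = 2)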
We will first look at a correlation map and then at the scatterplot matrix.
#install.packages("corrplot")
#install.packages("ggcorrplot")
library("ggcorrplot")
cor(df)
corr=cor(df)
ggcorrplot(corr, hc.order = TRUE, type = "lower",lab = TRUE)
A scatterplot matrix is a convenient way to see the relationship between each pair of variables in the data frame.
plot(df,col="black",cex=0.7, col.axis='blue', pch=16)
It can be seen from the figures that there is a linear correlation between displacement, horsepower, and weight. These inter-correlations will cause multicollinearity, which is one of the serious problems that should be resolved before starting to model the data.
Zooming in on these three variables confirms the strong linear correlations between displacement, horsepower, and weight.
multicol<-df[3:5]
plot(multicol,col="black",cex=1, col.axis='blue', pch=19)
As the raw-data scatterplots show, the horsepower-mpg, displacement-mpg, and weight-mpg relationships follow a reciprocal pattern (y' = 1/y). Transforming these variables can make the relationships linear, but we also have to keep the multicollinearity in mind. For that reason we will apply the reciprocal transformation only to the displacement and weight variables; this strengthens their linear relationship with mpg while reducing the multicollinearity. For some regression models the multicollinearity may not need to be addressed at all, because they were designed to handle exactly this kind of correlated data.
df[['weight']]=1/(df[['weight']]) #reciprocal transformation
df[['displacement']]=(1/df[['displacement']]) #reciprocal transformation
We will apply a logarithmic transformation (followed by a square root, as in the code below) to mpg and horsepower to create a negative linear correlation. This follows the power model of data transformation.
df[['horsepower']]=sqrt(log(df[['horsepower']]))
df[['mpg']]=sqrt(log(df[['mpg']])) # after applying log() and sqrt(), the mpg values are closer to a normal distribution
plot(df[c(4,1)], col="black",cex=2, col.axis='blue', pch=20) #the scatterplot shows a negative linear correlation
abline(lm(df$mpg ~ df$horsepower), col="blue", lwd = 5)
plot(df,col="black",cex=0.7, col.axis='blue', pch=16)
3. Splitting the Data¶
The code below splits the data into inputs (x) and output (y), and into training and test sets.
#install.packages("caTools")
#Split data
set.seed(2)
library(caTools)#for sample.split() function
split<-sample.split(df[0:7], SplitRatio = 0.7)
#we divided data in ratio 0.7
xtraining<-subset(df[c(-1)], split=="TRUE")
xtest<-subset(df[c(-1)], split=="FALSE")
ytraining<-subset(df[c(1)], split=="TRUE")
ytest<-subset(df[c(1)], split=="FALSE")
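A side note on the split: sample.split() is normally given the outcome vector rather than a block of columns, so a more conventional version of the same idea would key the split on mpg. A minimal sketch (the resulting partition may differ from the one used above, so it is not used downstream):
# the usual caTools idiom: split on the outcome column
split_alt <- sample.split(df$mpg, SplitRatio = 0.7)
table(split_alt) # roughly 70% TRUE / 30% FALSE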
4. Optimizing ML Regression Models¶
4.1. Ordinary Least Squares (OLS) Regression¶
Ordinary Least Squares (OLS) regression is a linear model that seeks to find a set of coefficients for a line/hyper-plane that minimise the sum of the squared errors.
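In matrix form, OLS minimizes the residual sum of squares, which has the closed-form solution:
$$\hat{\beta} = \underset{\beta}{\arg\min}\; \lVert y - X\beta \rVert^2 = (X^{\top}X)^{-1}X^{\top}y$$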
#install.packages("car")
library(car)
# fit models
OLS_fit <- lm(ytraining$mpg~., xtraining)
# make predictions
OLS_predictions <- predict(OLS_fit, xtest)
# mse
OLS_mse <- mean((ytest$mpg - OLS_predictions)^2)
OLS_se <- ((ytest$mpg - OLS_predictions)^2)
# accuracy
compare <- cbind (actual=ytest$mpg, OLS_predictions)
compare=as.data.frame(compare)
OLS_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
OLS_accuracy<-round(mean(OLS_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-OLS_predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Ordinary Least Squares (OLS) Regression")))
abline(lm(compare$actual ~ compare$OLS_predictions), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(OLS_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", OLS_accuracy), pos = 1, font = 4, col='blue')
As mentioned before, multicollinearity is important, so we will use the Variance Inflation Factor: vif() identifies the correlations between the independent variables and the strength of those correlations.
vif(OLS_fit)
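For reference, the VIF of the j-th predictor is derived from the R^2 obtained by regressing that predictor on all the other predictors,
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
which can be checked by hand for, say, horsepower (a small sketch using the xtraining predictors defined above):
# manual VIF for horsepower: regress it on the remaining predictors
1 / (1 - summary(lm(horsepower ~ ., data = xtraining))$r.squared)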
The VIF scores for the predictor variables cylinders, displacement, horsepower, and weight are high, which might be problematic. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity; typically, in practice, there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity. When faced with collinearity, there are two simple solutions:
1) The first is to drop one of the problematic variables from the regression.
2) Combine the collinear variables together into a single predictor.
Combining variables measured in different units would decrease the accuracy and increase the MSE here, so we will apply the first solution. For that, we need to check variable importance.
We will use the varImp() function to check variable importance.
library(caret)
#'List the chosen features'
varImp(OLS_fit)
As the variable importance table shows, displacement has the lowest importance for predicting mpg, and it is also highly correlated with horsepower and weight, so we can remove displacement from the features.
#re-create the 0.7 split of the predictors, this time without displacement
x1training<-subset(df[c(2,4,5,6,7,8)], split=="TRUE")
x1test<-subset(df[c(2,4,5,6,7,8)], split=="FALSE")
# fit model
OLS_fit1 <- lm(ytraining$mpg~., x1training)
# make predictions
OLS_predictions1 <- predict(OLS_fit1, x1test)
# mse
OLS_mse1 <- mean((ytest$mpg - OLS_predictions1)^2)
# accuracy
compare1 <- cbind (actual=ytest$mpg, OLS_predictions1)
compare1=as.data.frame(compare1)
ACR1<-(apply(compare1, 1, min)/apply(compare1, 1, max))*100
OLS_accuracy1<-round(mean(ACR1), 1)
R2 <- 1 - (sum((ytest$mpg-OLS_predictions1 )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
R2
plot(compare1, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Ordinary Least Squares (OLS) Regression")))
abline(lm(compare1$actual ~ compare1$OLS_predictions1), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(OLS_mse1, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", round(OLS_accuracy1, 3)), pos = 1, font = 4, col='blue')
4.2. Stepwise Linear Regression (SLR)¶
Stepwise Linear Regression uses linear regression to discover which subset of attributes in the dataset results in the best performing model. It is stepwise because each iteration of the method changes the set of attributes and fits a model to evaluate the performance of that set.
# fit model
base_fit <- lm(ytraining$mpg~., xtraining) #Stepwise Linear Regression (SLR)
# perform step-wise feature selection
SLR_fit <- step(base_fit) #Stepwise Linear Regression (SLR)
# make predictions
SLR_predictions <- predict(SLR_fit, xtest)
# mse
SLR_mse <- mean((ytest$mpg - SLR_predictions)^2)
SLR_se <- ((ytest$mpg - SLR_predictions)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, SLR_predictions)
compare=as.data.frame(compare)
SLR_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
SLR_accuracy<-round(mean(SLR_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-SLR_predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Stepwise Linear Regression (SLR)")))
abline(lm(compare$actual ~ compare$SLR_predictions), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(SLR_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", SLR_accuracy), pos = 1, font = 4, col='blue')
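With no scope argument, step(base_fit) above performs a backward search driven by AIC. A minimal sketch of making the search direction and penalty explicit is shown below; k = log(n) switches the criterion to BIC and is an illustrative choice, not part of the original analysis.
# stepwise search in both directions with a BIC penalty (sketch only, not used later)
SLR_fit_bic <- step(base_fit, direction = "both", k = log(nrow(xtraining)), trace = 0)
summary(SLR_fit_bic)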
4.3. Principal Component Regression (PCR)¶
We can use PCR for solving the multicollinearity issue without dropping any variable.
Principal Component Regression (PCR) creates a linear regression model using the outputs of a Principal Component Analysis (PCA) to estimate the coefficients of the model. PCR is useful when the data has highly correlated predictors.
# load the package
#install.packages("pls")
library(pls)
# fit model
PCR_fit <- pcr(ytraining$mpg~., data=xtraining, validation="CV")
# make predictions
PCR_predictions <- predict(PCR_fit, xtest, ncomp=6)
#mse
PCR_mse <- mean((ytest$mpg - PCR_predictions)^2)
PCR_se <- ((ytest$mpg - PCR_predictions)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, PCR_predictions)
compare=as.data.frame(compare)
PCR_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
PCR_accuracy<-round(mean(PCR_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-PCR_predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Principal Component Regression (PCR)")))
abline(lm(compare$actual ~ compare$PCR_predictions), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(PCR_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", PCR_accuracy), pos = 1, font = 4, col='blue')
4.4. Partial Least Squares (PLS) Regression¶
Partial Least Squares (PLS) Regression creates a linear model of the data in a transformed projection of problem space. Like PCR, PLS is appropriate for data with highly-correlated predictors.
# load the package
library(pls)
# fit model
PLS_fit <- plsr(ytraining$mpg~., data=xtraining, validation="CV")
# make predictions
PLS_predictions <- predict(PLS_fit, xtest, ncomp=7)
# summarize accuracy
PLS_mse<- mean((ytest$mpg-PLS_predictions)^2)
PLS_se<- ((ytest$mpg-PLS_predictions)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, PLS_predictions)
compare=as.data.frame(compare)
PLS_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
PLS_accuracy<-round(mean(PLS_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-PLS_predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Partial Least Squares (PLS) Regression")))
abline(lm(compare$actual ~ compare$PLS_predictions), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(PLS_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", PLS_accuracy), pos = 1, font = 4, col='blue')
4.5. Ridge Regression¶
Ridge Regression is a remedial measure taken to alleviate multicollinearity among regression predictor variables in a model.
#install.packages('ridge') #for Ridge Regression
library(ridge) #for Ridge Regression
RR_fit<-linearRidge(ytraining$mpg~., data=xtraining)
# make predictions
RR_predictions <- predict(RR_fit, xtest)
# summarize accuracy
RR_mse <- mean((ytest$mpg-RR_predictions)^2)
RR_se <- ((ytest$mpg-RR_predictions)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, RR_predictions)
compare=as.data.frame(compare)
RR_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
RR_accuracy<-round(mean(RR_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-RR_predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Ridge Regression")))
abline(lm(compare$actual ~ compare$RR_predictions), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(RR_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", RR_accuracy), pos = 1, font = 4, col='blue')
4.6. Random Forest Regression¶
Random Forest is a flexible, easy-to-use machine learning algorithm that produces a good result most of the time, even without hyper-parameter tuning. It is one of the most used algorithms because of its simplicity and the fact that it can be used for both classification and regression tasks.
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
library(randomForest) #Random Forest Regression
RFR_fit <- randomForest(ytraining$mpg~., data=xtraining)
# make predictions
RFR_predict<- predict(RFR_fit, xtest)
# summarize accuracy
RFR_mse<- mean((ytest$mpg - RFR_predict)^2)
RFR_se<-((ytest$mpg - RFR_predict)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, RFR_predict)
compare=as.data.frame(compare)
RFR_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
RFR_accuracy<-round(mean(RFR_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-RFR_predict )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Random Forest Regression")))
abline(lm(compare$actual ~ compare$RFR_predict), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(RFR_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", RFR_accuracy), pos = 1, font = 4, col='blue')
importance(RFR_fit)
varImpPlot(RFR_fit)
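The forest above uses the randomForest defaults (500 trees, mtry of roughly p/3 for regression). A minimal sketch of tuning mtry with tuneRF() from the same package; the step factor and improvement threshold are illustrative assumptions:
# search over mtry, doubling each step, keeping changes that improve OOB error by >1%
set.seed(2)
tuneRF(xtraining, ytraining$mpg, ntreeTry = 500, stepFactor = 2, improve = 0.01, trace = TRUE)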
4.7. Support Vector Regression (SVR)¶
Support Vector Regression (SVR) applies the support vector machine idea to regression: it fits a function that keeps as many training points as possible within an epsilon-insensitive margin and penalizes only the points that fall outside it.
#Install Package
#install.packages("e1071") #for Support Vector Regression (SVR)
#Load Library
library(e1071)
svm_model <- svm(ytraining$mpg~ ., data=xtraining, type="eps-regression", kernel="linear") # numeric response, so svm() fits an eps-regression model
# make predictions
svm_predictions <- predict(svm_model, xtest)
# summarize accuracy
svm_mse<- mean((ytest$mpg - svm_predictions)^2)
svm_se<- ((ytest$mpg - svm_predictions)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, svm_predictions)
compare=as.data.frame(compare)
svm_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
svm_accuracy<-round(mean(svm_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-svm_predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Support Vector Regression (SVR)")))
abline(lm(compare$actual ~ compare$svm_predictions), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(svm_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", svm_accuracy), pos = 1, font = 4, col='blue')
4.8. LASSO Regression¶
LASSO (Least Absolute Shrinkage and Selection Operator) is quite similar to ridge, but let's understand the difference between them by implementing it on our data.
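The only difference between the two penalties is the norm applied to the coefficients; in glmnet terms, alpha = 1 selects the lasso and alpha = 0 selects ridge:
$$\hat{\beta}^{\,\text{ridge}} = \underset{\beta}{\arg\min}\ \sum_i (y_i - x_i^{\top}\beta)^2 + \lambda \sum_j \beta_j^2,
\qquad
\hat{\beta}^{\,\text{lasso}} = \underset{\beta}{\arg\min}\ \sum_i (y_i - x_i^{\top}\beta)^2 + \lambda \sum_j \lvert \beta_j \rvert$$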
#install.packages("glmnet", repos = "http://cran.us.r-project.org") #for LASSO Regression
library(glmnet)
# glmnet() expects a numeric predictor matrix and a response vector
x <- as.matrix(xtraining) # predictor matrix
y <- ytraining$mpg # response (mpg)
lambda <- 10^seq(10, -2, length = 100)
lasso_fit <- glmnet(x, y, alpha = 1, lambda = lambda) # alpha = 1 selects the lasso penalty
set.seed(1)
cv.out=cv.glmnet(x,y,alpha=1) # cross-validate lambda for the lasso
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
testx <- as.matrix(xtest) # test predictors as a matrix
lasso_predictions <- predict(lasso_fit, s = bestlam, newx = testx)
# summarize accuracy
lasso_mse<- mean((ytest$mpg - lasso_predictions)^2)
lasso_se<- ((ytest$mpg - lasso_predictions)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, lasso_pred=lasso_predictions)
compare=as.data.frame(compare)
lasso_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
lasso_accuracy<-round(mean(lasso_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-lasso_predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Lasso Regression")))
abline(lm(compare), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(lasso_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", lasso_accuracy), pos = 1, font = 4, col='blue')
The lasso and ridge regression results (MSE, R^2, and accuracy) come out very close to each other on this data set.
4.9. Decision Tree Regression¶
# Regression tree with rpart (method = "anova" grows a regression tree)
library(rpart)
# fit model
DTR_fit <- rpart(ytraining$mpg~., xtraining, method="anova")
printcp(DTR_fit) # display the results
plotcp(DTR_fit) # visualize cross-validation results
summary(DTR_fit) # detailed summary of splits
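A quick way to look at the fitted tree itself, using base rpart plotting (the sizing arguments are chosen only for legibility):
# draw the regression tree and label its splits
plot(DTR_fit, uniform = TRUE, margin = 0.05)
text(DTR_fit, cex = 0.7)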
DTR_predictions <- predict(DTR_fit, xtest)
# summarize accuracy
DTR_mse<- mean((ytest$mpg - DTR_predictions)^2)
DTR_se<- ((ytest$mpg - DTR_predictions)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, DTR_predictions)
compare=as.data.frame(compare)
DTR_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
DTR_accuracy<-round(mean(DTR_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-DTR_predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Decision Tree Regression")))
abline(lm(compare), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(DTR_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", DTR_accuracy), pos = 1, font = 4, col='blue')
4.10. Quantile Regression¶
Quantile regression is an extension of linear regression that we generally use when outliers, high skewness, or heteroscedasticity exist in the data. In linear regression, we predict the mean of the dependent variable for given independent variables. Since the mean does not describe the whole distribution, modeling the mean is not a full description of the relationship between the dependent and independent variables. Quantile regression instead predicts a chosen quantile (or percentile) of the response for given independent variables.
#install.packages("quantreg")
library(quantreg) #for Quantile Regression
QR_fit<- rq(ytraining$mpg~.,xtraining,tau = 0.48)
QR_predictions <- predict(QR_fit, xtest)
# mse
QR_mse<- mean((ytest$mpg - QR_predictions)^2)
QR_se<- ((ytest$mpg - QR_predictions)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, QR_predictions)
compare=as.data.frame(compare)
QR_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
QR_accuracy<-round(mean(QR_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-QR_predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("Quantile Regression")))
abline(lm(compare), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(QR_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", QR_accuracy), pos = 1, font = 4, col='blue')
4.11. kNN Regression¶
kNN regression predicts each test point as the average mpg of its k nearest neighbours in predictor space.
#install.packages("FNN")
library(FNN)
library(ISLR)
library(class)
library(MASS)
knn_fit <- knnreg(xtraining, ytraining$mpg, k = 3) # knnreg() comes from the caret package loaded earlier
predictions <-predict(knn_fit, xtest)
# mse
kNN_mse<- mean((ytest$mpg - predictions)^2)
kNN_se<- ((ytest$mpg - predictions)^2)
# calculate accuracy
compare <- cbind (actual=ytest$mpg, predictions)
compare=as.data.frame(compare)
kNN_ACR<-(apply(compare, 1, min)/apply(compare, 1, max))*100
kNN_accuracy<-round(mean(kNN_ACR), 2)
R2 <- 1 - (sum((ytest$mpg-predictions )^2)/sum((ytest$mpg-mean(ytest$mpg))^2))
plot(compare, col="black",cex=2, col.axis='blue', pch=20, main = substitute(
paste("kNN Regression")))
abline(lm(compare), col="blue", lwd = 5)
text(x=1.6, y=1.9, labels = paste('R^2=', round(R2, 3)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.88, labels = paste("MSE =", round(kNN_mse, 5)), pos = 1, font = 4, col='black')
text(x=1.6, y=1.86, labels = paste("Accuracy% =", kNN_accuracy), pos = 1, font = 4, col='blue')
5. Comparison Summary¶
Models <- matrix(c(OLS_mse,SLR_mse, PCR_mse, PLS_mse, RR_mse, RFR_mse, svm_mse, lasso_mse, DTR_mse,QR_mse, kNN_mse, OLS_accuracy, SLR_accuracy, PCR_accuracy, PLS_accuracy, RR_accuracy, RFR_accuracy, svm_accuracy, lasso_accuracy, DTR_accuracy, QR_accuracy, kNN_accuracy), ncol=11,byrow=TRUE)
colnames(Models) <- c("OLS","SLR","PCR","PLS" ,"RR", "RFR", "svm", "Lasso","DTR", "QR", "kNN")
rownames(Models) <- c("mse","accuracy")
Models <- as.table(Models)
comp_mse<-Models[1,1:11]
comp_accuracy<-Models[2,1:11]
# draw the MSE comparison without default axes so we can label them ourselves
plot(comp_mse, las=1, type="o", col="blue",axes=FALSE, ann=FALSE)
# x axis labelled with the model names
axis(1, at=1:11, lab=colnames(Models))
# y axis with horizontal labels
axis(2, las=1)
# Create box around plot
box()
# Title in blue bold/italic
title(main="compare mse", col.main="blue", font.main=4)
# Label the x and y axes in blue
title(xlab="Models", col.lab='blue')
title(ylab="mse", col.lab='blue')
# draw the accuracy comparison without default axes
plot(comp_accuracy, type="o", col="blue",axes=FALSE, ann=FALSE)
# x axis labelled with the model names
axis(1, at=1:11, lab=colnames(Models))
# y axis with horizontal labels
axis(2, las=1)
# Create box around plot
box()
# Title in blue bold/italic
title(main="compare Accuracy ", col.main="blue", font.main=4)
# Label the x and y axes in blue
title(xlab="Models", col.lab='blue')
title(ylab="Accuracy", col.lab='blue')
All_ACR <- cbind (OLS_ACR,SLR_ACR, PCR_ACR, PLS_ACR, RR_ACR, RFR_ACR, svm_ACR, lasso_ACR, DTR_ACR,QR_ACR, kNN_ACR)
boxplot(All_ACR, col.axis='blue', las = 2,
col = c("red","sienna","palevioletred1","blue","green","Purple","Yellow",
"Orange","Tan", "darkgreen", "lightblue1"),col.main="blue",
main = substitute(paste("Spread the Accuracy of the All the Regression Models")))
All_se <- cbind (OLS_se,SLR_se, PCR_se, PLS_se, RR_se, RFR_se, svm_se, lasso_se, DTR_se,QR_se, kNN_se)
All_se<-as.data.frame(All_se)
names(All_se)[names(All_se) == 1] = "lasso_se" # the lasso column inherits a numeric name from the prediction matrix; rename it
boxplot(All_se, col.axis='blue', las = 2,
col = c("red","sienna","palevioletred1","blue","green","Purple","Yellow",
"Orange","Tan", "darkgreen", "lightblue1"),col.main="blue",
main = substitute(paste("Spread the se of the All the Regression Models")))
boxplot(All_se, ylim=c(0, 0.005), col.axis='blue', las = 2,
col = c("red","sienna","palevioletred1","blue","green","Purple","Yellow",
"Orange","Tan", "darkgreen", "lightblue1"),col.main="blue",
main = substitute(paste("Spread the se of the All the Regression Models")))
The accuracy, R squared (R^2), and mean squared error (MSE) of Random Forest Regression are 98.81%, 0.912, and 0.00087, respectively. These results show that Random Forest Regression gives the best prediction results for the Auto-MPG data.
Published: October 05, 2018