Matrics_Validation

An Analysis of the Effect of Variable Coefficients on ML Regression Models' Accuracy

In [4]:
from PIL import Image
Image.open("MLRegession.jpg")
Out[4]:

R-squared ($R^2$), Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) are the most commonly used metrics for measuring accuracy when predicting continuous variables. In this post we will examine how the coefficients of the variables (CoV) affect the MAE, MSE, $R^2$, and accuracy. We will apply the same linear regression to 4 different datasets whose variables have different coefficients, to explain how and why the MSE, MAE, $R^2$, and accuracy change. First, keeping the MSE and MAE fixed, we will observe how the $R^2$ and accuracy change with the coefficients of the variables. Second, keeping the $R^2$ and accuracy fixed, we will observe how the MSE and MAE change with the coefficients of the variables.
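As a quick reference before we start, here is a minimal sketch of how these four metrics can be computed with scikit-learn; the toy arrays y_true and y_pred are made up purely for illustration, and RMSE is simply the square root of the MSE:

import numpy as np
from sklearn import metrics

# Toy actual and predicted values, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.3])

mse = metrics.mean_squared_error(y_true, y_pred)    # average squared error
mae = metrics.mean_absolute_error(y_true, y_pred)   # average absolute error
rmse = np.sqrt(mse)                                 # error in the same units as Y
r2 = metrics.r2_score(y_true, y_pred)               # 1 - residual SS / total SS

print(mse, mae, rmse, r2)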

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import count
from sklearn import metrics
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
%matplotlib inline

1. Constant MSE and Increasing $R^2$

We will create 4 different datasets (D1, D2, D3, D4) with different variable coefficients. We keep the MSE and MAE fixed for D1, D2, D3, and D4, while the coefficient of variables (CoV) takes the values 0.1, 0.5, 1, and 10 for D1, D2, D3, and D4 respectively. All four datasets follow the same general form, summarized after the list below.

Data1 (D1): $Y=0.1+0.1X_1+0.1X_2+0.1X_3+\text{Random Error}$

While the constant term and the coefficient of variables (CoV) are 0.1, we add a random error with a coefficient of error (CoE) of 0.1. We will see how the $accuracy$ and $R^2$ are affected by the rate $CoE/CoV = 0.1/0.1 = 1$.

Data2 (D2): $Y=0.5+0.5X_1+0.5X_2+0.5X_3+\text{Random Error}$

While the constant term and CoV are 0.5, we add a random error with CoE = 0.1 and observe the $accuracy$ and $R^2$ under the rate $CoE/CoV = 0.1/0.5 = 0.2$.

Data3 (D3): $Y=1+X_1+X_2+X_3+\text{Random Error}$

While the constant term and CoV are 1, we add a random error with CoE = 0.1. How will the $accuracy$ and $R^2$ be affected by the rate $CoE/CoV = 0.1/1 = 0.1$?

Data4 (D4): $Y=10+10X_1+10X_2+10X_3+\text{Random Error}$

While the constant term and CoV are 10, we add a random error with CoE = 0.1, giving the rate $CoE/CoV = 0.1/10 = 0.01$.
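In other words, the four datasets share one general form, where only the coefficient $\alpha$ changes while the noise scale stays at 0.1 (with $\epsilon \sim N(0,1)$, as in the code below):

$Y = \alpha + \alpha X_1 + \alpha X_2 + \alpha X_3 + 0.1\,\epsilon, \qquad \alpha \in \{0.1,\ 0.5,\ 1,\ 10\}$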

In [36]:
# Problem dimensions and sample size
input_dim = 3
output_dim = 1
size = 1000
total_size = size * input_dim

# Features: 1000 samples of 3 uniform random variables
rand_state = np.random.RandomState(42)
X = np.array([rand_state.rand() for i in range(total_size)], dtype=np.float32)
X = X.reshape(-1, input_dim)

# Coefficients: the same value alfa[i] for the constant and all three variables
alfa = [0.1, 0.5, 1, 10]
coeff = []
for a in alfa:
    coeff.append(np.array([a, a, a], dtype=np.float32))

# Random error with a fixed coefficient of error (CoE) of 0.1 for all datasets
noise = 0.1 * rand_state.normal(0, 1, size)

# Build the four targets: Y_i = alfa[i] + X . coeff[i] + noise
Y = []
for i in range(len(coeff)):
    Y.append((X.dot(coeff[i]) + alfa[i] + noise).reshape(-1, output_dim))

# 70/30 train/test split, then fit one linear regression per dataset
train_size = int(0.7 * size)
X_train, X_test = [], []
Y_train, Y_test = [], []
LR, Y_test_pred = [], []

for i in range(len(coeff)):
    X_train.append(X[:train_size])
    X_test.append(X[train_size:])
    Y_train.append(Y[i][:train_size])
    Y_test.append(Y[i][train_size:])
    LR.append(LinearRegression().fit(X_train[i], Y_train[i]))
    Y_test_pred.append(LR[i].predict(X_test[i]))
In [37]:
# Collect actual and predicted test values side by side, one DataFrame per dataset
Y_test_ = []
Y_test_pred_ = []
new = []
for i in range(len(coeff)):
    Y_test_.append(pd.DataFrame(Y_test[i]))
    Y_test_pred_.append(pd.DataFrame(Y_test_pred[i]))
    new.append(pd.concat([Y_test_[i], Y_test_pred_[i]], axis=1, join='inner'))
    new[i].columns = ['actual', 'Predict']
In [38]:
# Accuracy: mean of the row-wise ratio min(actual, predicted) / max(actual, predicted)
# over the 300 test rows
Acr = []
Accuracy = []
for i in range(len(coeff)):
    Acr.append(new[i].iloc[0:301].min(axis=1) / new[i].iloc[0:301].max(axis=1))
    Accuracy.append(np.mean(Acr[i]))
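For clarity, the accuracy (ACR) used in this post is not classification accuracy; as computed above, it is the average ratio of the smaller to the larger value in each actual/predicted pair on the test set:

$ACR = \frac{1}{n}\sum_{i=1}^{n}\frac{\min(y_i, \hat{y}_i)}{\max(y_i, \hat{y}_i)}$

A perfect prediction gives $ACR = 1$, and (for positive values) multiplying both $y_i$ and $\hat{y}_i$ by the same factor leaves it unchanged.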
In [39]:
D_MSE = []
D_MAE = []
D_R2 = []
D_EVS = []
D_Accuracy = []
# Calculate the metrics for each dataset
for i in range(len(coeff)):
    D_MSE.append(metrics.mean_squared_error(Y_test_[i], Y_test_pred_[i]))
    D_MAE.append(metrics.mean_absolute_error(Y_test_[i], Y_test_pred_[i]))
    D_R2.append(metrics.r2_score(Y_test_[i], Y_test_pred_[i]))
    D_EVS.append(metrics.explained_variance_score(Y_test_[i], Y_test_pred_[i]))
    D_Accuracy.append(Accuracy[i])

Now we will compare the metrics.

In [9]:
data = [['MSE',D_MSE[0], D_MSE[1], D_MSE[2], D_MSE[3]],['MAE',D_MAE[0], D_MAE[1], D_MAE[2], D_MAE[3]],
        ['R^2',D_R2[0], D_R2[1], D_R2[2], D_R2[3]], ['ACR', D_Accuracy[0], D_Accuracy[1], D_Accuracy[2], D_Accuracy[3]]]
df = pd.DataFrame(data,columns=['Metric','D1', 'D2', 'D3', 'D4'])
print (np.round(df,3))
####################
with sns.axes_style("darkgrid", {"axes.facecolor": "0.8"}):
    sns.set_context("poster")
    sns.factorplot(data=np.round(df, 3), kind="point", col="Metric", color='purple')
  Metric     D1     D2     D3     D4
0    MSE  0.011  0.011  0.011  0.011
1    MAE  0.085  0.085  0.085  0.085
2    R^2  0.157  0.852  0.959  1.000
3    ACR  0.706  0.933  0.966  0.996

We kept the error fixed for D1, D2, D3, and D4, while the CoV values are 0.1, 0.5, 1, and 10 for D1, D2, D3, and D4 respectively. This gives a CoE/CoV rate of 0.1/0.1 = 1 for D1 and 0.1/10 = 0.01 for D4. The MSE and MAE stay constant because the error coefficient is fixed at 0.1 for all of D1, D2, D3, and D4. On the other hand, the CoE/CoV rate falls from 1 to 0.01 as we move from D1 to D4, and the results show that this rate drives the accuracy: as the noise becomes smaller relative to the signal, the $R^2$ and accuracy increase from D1 to D4.
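This behavior follows directly from the definition of $R^2$ used by r2_score, which compares the residual error to the variance of the target. With the residual MSE held fixed and the target's variance growing with the square of the coefficients, the ratio shrinks and $R^2$ approaches 1:

$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \approx 1 - \frac{MSE}{\mathrm{Var}(Y)}$

For example, going from D1 to D4 multiplies the coefficients by 100, so $\mathrm{Var}(Y)$ grows by a factor of $100^2$ while the MSE stays near 0.011, which is why $R^2$ climbs from about 0.16 to nearly 1.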

2. Constant $R^2$ and Increasing MSE

In this part, we will again create 4 different datasets (D1, D2, D3, D4). This time we fix the $\frac{CoE}{CoV}$ rate at 1 for D1, D2, D3, and D4; that is, the CoE and CoV values will both be 0.1, 1, 10, and 100 for D1, D2, D3, and D4 respectively.

Data1 (D1): $Y=0.1+0.1X_1+0.1X_2+0.1X_3+\text{Random Error}$

While the constant term and CoV are 0.1, a random error with CoE = 0.1 will be added.

Data2 (D2): $Y=1+X_1+X_2+X_3+\text{Random Error}$

While the constant term and CoV are 1, we will add a random error with CoE = 1.

Data3 (D3): $Y=10+10X_1+10X_2+10X_3+\text{Random Error}$

While the constant term and CoV are 10, we will add a random error with CoE = 10.

Data4 (D4): $Y=100+100X_1+100X_2+100X_3+\text{Random Error}$

While the constant term and CoV are 100, we will add a random error with CoE = 100.
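Because the constant, the coefficients, and the error scale all equal the same value $\alpha$ in each dataset, every dataset here is simply a scaled copy of the same underlying signal (with $\epsilon \sim N(0,1)$, as in the code below):

$Y = \alpha\left(1 + X_1 + X_2 + X_3 + \epsilon\right), \qquad \alpha \in \{0.1,\ 1,\ 10,\ 100\}$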

In [190]:
# Problem dimensions and sample size
input_dim = 3
output_dim = 1
size = 1000
total_size = size * input_dim

# Features: 1000 samples of 3 uniform random variables
rand_state = np.random.RandomState(42)
X = np.array([rand_state.rand() for i in range(total_size)], dtype=np.float32)
X = X.reshape(-1, input_dim)

# Coefficients: the same value alfa[i] for the constant and all three variables
alfa = [0.1, 1, 10, 100]
coeff = []
for a in alfa:
    coeff.append(np.array([a, a, a], dtype=np.float32))

# Random error scaled by the same alfa[i] as the coefficients (CoE = CoV),
# re-seeding so every dataset uses the same base draw
noise = []
for a in alfa:
    rand_state = np.random.RandomState(42)
    noise.append(rand_state.normal(0, 1, size) * a)

# Build the four targets: Y_i = alfa[i] + X . coeff[i] + noise_i
Y = []
for i in range(len(coeff)):
    Y.append((X.dot(coeff[i]) + alfa[i] + noise[i]).reshape(-1, output_dim))

# 70/30 train/test split, then fit one linear regression per dataset
train_size = int(0.7 * size)
X_train, X_test = [], []
Y_train, Y_test = [], []
LR, Y_test_pred = [], []
X_train_with_const = []  # statsmodels-style design matrix (not used below)

for i in range(len(coeff)):
    X_train.append(X[:train_size])
    X_test.append(X[train_size:])
    Y_train.append(Y[i][:train_size])
    Y_test.append(Y[i][train_size:])
    X_train_with_const.append(sm.add_constant(X_train[i]))
    LR.append(LinearRegression().fit(X_train[i], Y_train[i]))
    Y_test_pred.append(LR[i].predict(X_test[i]))
In [191]:
# Collect actual and predicted test values side by side, one DataFrame per dataset
Y_test_ = []
Y_test_pred_ = []
new = []
for i in range(len(coeff)):
    Y_test_.append(pd.DataFrame(Y_test[i]))
    Y_test_pred_.append(pd.DataFrame(Y_test_pred[i]))
    new.append(pd.concat([Y_test_[i], Y_test_pred_[i]], axis=1, join='inner'))
    new[i].columns = ['actual', 'Predict']
In [192]:
# Accuracy: mean of the row-wise ratio min(actual, predicted) / max(actual, predicted)
Acr = []
Accuracy = []
for i in range(len(coeff)):
    Acr.append(new[i].iloc[0:301].min(axis=1) / new[i].iloc[0:301].max(axis=1))
    Accuracy.append(np.mean(Acr[i]))
In [193]:
D_MSE = []
D_MAE = []
D_R2 = []
D_EVS = []
D_Accuracy = []
# Calculate the metrics for each dataset
for i in range(len(coeff)):
    D_MSE.append(metrics.mean_squared_error(Y_test_[i], Y_test_pred_[i]))
    D_MAE.append(metrics.mean_absolute_error(Y_test_[i], Y_test_pred_[i]))
    D_R2.append(metrics.r2_score(Y_test_[i], Y_test_pred_[i]))
    D_EVS.append(metrics.explained_variance_score(Y_test_[i], Y_test_pred_[i]))
    D_Accuracy.append(Accuracy[i])
In [194]:
data = [['MSE',D_MSE[0], D_MSE[1], D_MSE[2], D_MSE[3]],['MAE',D_MAE[0], D_MAE[1], D_MAE[2], D_MAE[3]],
        ['R^2',D_R2[0], D_R2[1], D_R2[2], D_R2[3]], ['ACR', D_Accuracy[0], D_Accuracy[1], D_Accuracy[2], D_Accuracy[3]]]
df = pd.DataFrame(data,columns=['Metric','D1', 'D2', 'D3', 'D4'])
print (np.round(df,3))
####################
with sns.axes_style("darkgrid", {"axes.facecolor": "0.8"}):
    sns.set_context("poster")
    sns.factorplot(data=np.round(df, 3), kind="point", col="Metric", color='purple')
  Metric     D1     D2      D3        D4
0    MSE  0.009  0.941  94.053  9405.268
1    MAE  0.078  0.777   7.773    77.734
2    R^2  0.144  0.144   0.144     0.144
3    ACR  0.738  0.738   0.738     0.738
In [195]:
with sns.axes_style("darkgrid"):
    sns.set_context("poster")
    MSE = sns.factorplot(data=np.round(df.iloc[0:1, 1:5], 3), kind="point", color='purple')
    MSE.set_axis_labels("Data", "MSE")
    MAE = sns.factorplot(data=np.round(df.iloc[1:2, 1:5], 3), kind="point", color='purple')
    MAE.set_axis_labels("Data", "MAE")
    R2 = sns.factorplot(data=np.round(df.iloc[2:3, 1:5], 3), kind="point", color='purple')
    R2.set_axis_labels("Data", "$R^2$")
    ACR = sns.factorplot(data=np.round(df.iloc[3:4, 1:5], 3), kind="point", color='purple')
    ACR.set_axis_labels("Data", "Accuracy")

It can be seen from the figures that while the MSE and MAE increase, the $R^2$ and accuracy stay constant. We applied a $\frac{CoE}{CoV}$ rate of 1 for D1, D2, D3, and D4. Although the CoE of the random error was increased, the $R^2$ and accuracy did not change, because $\frac{CoE}{CoV}=1$ for all of D1, D2, D3, and D4.
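The constant $R^2$ also follows from the scaled-copy form above: multiplying both the signal and the noise by the same factor $c$ multiplies every squared residual and the target variance by $c^2$, so the ratio inside $R^2$ (and the min/max ratio inside ACR) is unchanged while the MSE grows by $c^2$:

$R^2 = 1 - \frac{\sum_i c^2 (y_i - \hat{y}_i)^2}{\sum_i c^2 (y_i - \bar{y})^2} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad MSE \to c^2\,MSE$

This matches the table: going from D1 to D2 scales everything by 10, the MSE grows by roughly $10^2$ (0.009 to 0.941), and $R^2$ stays at 0.144.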

3. Conclusion

A detailed analysis of the metrics has been made to observe the effect of the variables' coefficients on the ML regression models' accuracy. Applying a linear regression model to D1 in the second experiment gave MSE ≈ 0.009 and accuracy ≈ 74%. Applying the same linear regression model to D4 gave MSE ≈ 9405, yet the accuracy was still ≈ 74%.

The analysis results showed that the $R^2$ and accuracy values are a strong function of the $\frac{CoE}{CoV}$ rate, and that there is no universal threshold for the MSE when evaluating regression models, because its magnitude depends on the scale of the target.

Published: November 27, 2018