PREDICTING WAGES - 2¶

In this section, we'll discuss the quality of the predictions provided by modern approaches.

We begin by recalling that the best prediction rule for outcome Y using features/regressors Z is the function g(Z), equal to the conditional expectation of Y using Z.

g(Z)= E(Y|Z)


Modern Regression Methods, namely Lasso, Random Forest, Boosted Trees, and Neural Networks, when appropriately tuned and under some regularity conditions, provide estimated prediction rules $\hat{g}$ that approximate the best prediction rule g(Z).

Theoretical work demonstrates that under appropriate regularity conditions and with appropriate choices of tuning parameters, the mean squared approximation error can be small once the sample size n is sufficiently large, namely,

E$_Z$(gˆ(Z) − g(Z)) → 0, as n → ∞

where E$_Z$ denotes the expectation taken over Z , holding everything else fixed.

These results do rely on various structured assumptions, such as sparsity in the case of Lasso, and others, to deliver these guarantees in modern high-dimensional settings, where the number of features is large.

Under these conditions we expect that the sample MSE and R$^2$ would agree with the out-of-sample MSE and R$^2$.

Importing the necessary libraries and overview of the dataset¶

In [ ]:
#importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')
In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Loading the data¶

In [ ]:
train = pd.read_csv("/content/wage2015_train.csv") # importing train dataset
test = pd.read_csv("/content/wage2015_test.csv")# importing test dataset
In [ ]:
train.head() # first 5 records of training dataset
Out[ ]:
wage lwage sex white black hisp shs hsg scl clg ... health age exp1 exp2 exp3 exp4 occ occ2 ind ind2
0 11.639676 2.454420 0 1 0 0 0 0 1 0 ... 1 62 40.0 16.00 64.000 256.0000 4620.0 15 8290.0 18
1 26.442308 3.274965 0 1 0 0 0 0 0 1 ... 1 23 0.0 0.00 0.000 0.0000 2300.0 8 8470.0 18
2 20.000000 2.995732 1 1 0 0 0 0 0 1 ... 1 31 7.0 0.49 0.343 0.2401 2720.0 9 8590.0 19
3 26.442308 3.274965 1 1 0 0 0 0 0 1 ... 0 31 7.0 0.49 0.343 0.2401 5300.0 17 8660.0 20
4 25.480769 3.237924 0 1 0 0 0 0 0 1 ... 1 28 4.0 0.16 0.064 0.0256 40.0 1 7470.0 14

5 rows × 37 columns

In [ ]:
test.head() # first 5 records of test dataset
Out[ ]:
wage lwage sex white black hisp shs hsg scl clg ... health age exp1 exp2 exp3 exp4 occ occ2 ind ind2
0 9.615385 2.263364 0 1 0 0 0 0 0 1 ... 1 31 7.0 0.49 0.343 0.2401 3600.0 11 8370.0 18
1 11.538462 2.445686 1 1 0 0 0 0 1 0 ... 1 35 13.0 1.69 2.197 2.8561 6050.0 18 170.0 1
2 28.846154 3.361977 0 1 0 0 0 0 0 1 ... 0 46 22.0 4.84 10.648 23.4256 2015.0 6 9470.0 22
3 19.230769 2.956512 1 1 0 0 0 1 0 0 ... 1 57 37.0 13.69 50.653 187.4161 5240.0 17 5680.0 9
4 11.500000 2.442347 1 1 0 0 0 1 0 0 ... 0 23 3.0 0.09 0.027 0.0081 8740.0 21 7790.0 16

5 rows × 37 columns

Checking the info of the data¶

In [ ]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6348 entries, 0 to 6347
Data columns (total 37 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   wage      6348 non-null   float64
 1   lwage     6348 non-null   float64
 2   sex       6348 non-null   int64  
 3   white     6348 non-null   int64  
 4   black     6348 non-null   int64  
 5   hisp      6348 non-null   int64  
 6   shs       6348 non-null   int64  
 7   hsg       6348 non-null   int64  
 8   scl       6348 non-null   int64  
 9   clg       6348 non-null   int64  
 10  mw        6348 non-null   int64  
 11  so        6348 non-null   int64  
 12  we        6348 non-null   int64  
 13  union     6348 non-null   int64  
 14  vet       6348 non-null   int64  
 15  cent      6348 non-null   int64  
 16  ncent     6348 non-null   int64  
 17  fam1      6348 non-null   int64  
 18  fam2      6348 non-null   int64  
 19  fam3      6348 non-null   int64  
 20  child     6348 non-null   int64  
 21  fborn     6348 non-null   int64  
 22  cit       6348 non-null   int64  
 23  school    6348 non-null   int64  
 24  pens      6348 non-null   int64  
 25  fsize10   6348 non-null   int64  
 26  fsize100  6348 non-null   int64  
 27  health    6348 non-null   int64  
 28  age       6348 non-null   int64  
 29  exp1      6348 non-null   float64
 30  exp2      6348 non-null   float64
 31  exp3      6348 non-null   float64
 32  exp4      6348 non-null   float64
 33  occ       6348 non-null   float64
 34  occ2      6348 non-null   int64  
 35  ind       6348 non-null   float64
 36  ind2      6348 non-null   int64  
dtypes: float64(8), int64(29)
memory usage: 1.8 MB
  • The dataset has 6348 observations and 37 different variables.
  • There are no missing values in the dataset.

Univariate Analysis¶

Checking the summary statistics of the dataset¶

In [ ]:
# Printing the summary statistics for the dataset
train.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
wage 6348.0 18.151370 10.390258 0.015385 10.576923 15.384615 23.367102 57.692308
lwage 6348.0 2.735925 0.602504 -4.174387 2.358675 2.733368 3.151328 4.055124
sex 6348.0 0.541115 0.498346 0.000000 0.000000 1.000000 1.000000 1.000000
white 6348.0 0.697385 0.459426 0.000000 0.000000 1.000000 1.000000 1.000000
black 6348.0 0.184625 0.388024 0.000000 0.000000 0.000000 0.000000 1.000000
hisp 6348.0 0.225425 0.417895 0.000000 0.000000 0.000000 0.000000 1.000000
shs 6348.0 0.062697 0.242436 0.000000 0.000000 0.000000 0.000000 1.000000
hsg 6348.0 0.307656 0.461560 0.000000 0.000000 0.000000 1.000000 1.000000
scl 6348.0 0.299779 0.458197 0.000000 0.000000 0.000000 1.000000 1.000000
clg 6348.0 0.239918 0.427067 0.000000 0.000000 0.000000 0.000000 1.000000
mw 6348.0 0.187146 0.390059 0.000000 0.000000 0.000000 0.000000 1.000000
so 6348.0 0.349716 0.476918 0.000000 0.000000 0.000000 1.000000 1.000000
we 6348.0 0.278198 0.448147 0.000000 0.000000 0.000000 1.000000 1.000000
union 6348.0 0.020006 0.140033 0.000000 0.000000 0.000000 0.000000 1.000000
vet 6348.0 0.030088 0.170844 0.000000 0.000000 0.000000 0.000000 1.000000
cent 6348.0 0.365942 0.481731 0.000000 0.000000 0.000000 1.000000 1.000000
ncent 6348.0 0.348929 0.476669 0.000000 0.000000 0.000000 1.000000 1.000000
fam1 6348.0 0.536232 0.498725 0.000000 0.000000 1.000000 1.000000 1.000000
fam2 6348.0 0.139256 0.346241 0.000000 0.000000 0.000000 0.000000 1.000000
fam3 6348.0 0.130277 0.336635 0.000000 0.000000 0.000000 0.000000 1.000000
child 6348.0 0.162571 0.369003 0.000000 0.000000 0.000000 0.000000 1.000000
fborn 6348.0 0.163989 0.370294 0.000000 0.000000 0.000000 0.000000 1.000000
cit 6348.0 0.907845 0.289267 0.000000 1.000000 1.000000 1.000000 1.000000
school 6348.0 0.148236 0.355361 0.000000 0.000000 0.000000 0.000000 1.000000
pens 6348.0 0.386421 0.486967 0.000000 0.000000 0.000000 1.000000 1.000000
fsize10 6348.0 0.109483 0.312269 0.000000 0.000000 0.000000 0.000000 1.000000
fsize100 6348.0 0.158633 0.365362 0.000000 0.000000 0.000000 0.000000 1.000000
health 6348.0 0.714713 0.451586 0.000000 0.000000 1.000000 1.000000 1.000000
age 6348.0 33.825142 11.485257 16.000000 25.000000 30.000000 40.000000 85.000000
exp1 6348.0 12.160996 11.535660 0.000000 3.000000 8.000000 19.000000 65.000000
exp2 6348.0 2.809403 4.602627 0.000000 0.090000 0.640000 3.610000 42.250000
exp3 6348.0 8.391547 20.015144 0.000000 0.027000 0.512000 6.859000 274.625000
exp4 6348.0 29.073584 97.415275 0.000000 0.008100 0.409600 13.032100 1785.062500
occ 6348.0 8466.476528 19570.421355 10.000000 2320.000000 4700.000000 6200.000000 100000.000000
occ2 6348.0 12.876812 6.688689 1.000000 8.000000 15.000000 17.000000 22.000000
ind 6348.0 7843.791273 11938.950959 170.000000 4950.000000 7380.000000 8270.000000 100000.000000
ind2 6348.0 13.232199 5.890789 1.000000 9.000000 14.000000 18.000000 23.000000

Observations:

  • The average wage is about 18 dollars while the highest wage is 58 and lowest is 0.015385.
  • Average age is 34 years, while min and max are from 16 to 40 years, which shows the data is diverse and is taken from different age groups.
  • Average experience is 12 years, while min and max from 0 to 65 years, which shows the data is diverse and is taken from different experience groups.
In [ ]:
fig = plt.figure(figsize=(10,4))

#  subplot #1
plt.subplot(121)
sns.histplot(data = train, x = 'wage')

#  subplot #2
plt.subplot(122)
sns.boxplot(data = train, x = 'wage')

plt.show()
No description has been provided for this image
  • This is a plot of the target variable, which is the log of the wage. The variable has a right-skewed distribution. There are outliers in the wage variable, which makes it reasonable because have high wages as compared to others. To reduce skewness in modeling, the variable must be log-transformed.
In [ ]:
fig = plt.figure(figsize=(10,4))

#  subplot #1
plt.subplot(121)
sns.histplot(data = train, x = 'lwage')

#  subplot #2
plt.subplot(122)
sns.boxplot(data = train, x = 'lwage')

plt.show()
No description has been provided for this image
  • After log transformation, the wage variable follows normal distribution but is a little left-skewed and and there are outliers.
In [ ]:
sns.displot(x='age',data=train)
plt.show()
No description has been provided for this image
  • We can observe from the plot that the data is collected from various age groups. The majority of the data we have come from people between the ages of 20 and 30.One explanation for this is that the number of persons who have never married tends to decline as they get aged.
In [ ]:
sns.countplot(x="sex", data=train)
plt.show()
No description has been provided for this image
  • Our data seems pretty balanced in case of gender which is good.

Now that we've completed our analysis, let's move on to modelling.¶

In [ ]:
# Linear Control Variables. Use This Control Variables for Tree Based Machine Learning Methods.
x_col = ["sex","white","black","hisp","shs","hsg","scl","clg","mw","so","we","union","vet","cent","ncent","fam1","fam2","fam3","child","fborn","cit","school","pens","fsize10","fsize100","health","age","exp1","occ2","ind2"]
# Quadratic Control Variables (Flexible Specification). Use This Control Variables for Linear methods
xL_col = ["sex","white","black","hisp","shs","hsg","scl","clg","mw","so","we","union","vet","cent","ncent","fam1","fam2","fam3","child","fborn","cit","school","pens","fsize10","fsize100","health","age","exp1","exp2","exp3","exp4","occ2","ind2"]

Regardless of the theoretical assumptions, we can measure out-of-sample performance directly by performing data splitting, as we did in the classical setting.

Recall that,

  1. We use a random part of data for estimating/training the prediction rule,
  2. We use the other part to evaluate the quality of the prediction rule, recording out-of-sample mean squared error (can also look at R$^2$).

Recall that the part of the data used for estimation is called training sample.The part of the data used for evaluation is called the testing or validation sample


Indeed, suppose we use n observations for training and m for testing or validation. Let capital V denote the indices of the observations in the test sample.

Then the out-of-sample or test Mean Squared Error is defined as the average squared prediction error where we predict Y$_k$ in the test sample by gˆ(Z$_k$), where the prediction rule $\hat{g}$ was computed on the training sample.

The out-of-sample R$^2$ is defined accordingly as 1 minus the ratio of the test MSE to the variation of the outcome in the test sample. Here we illustrate the ideas using a dataset of 12697(train and test combined) observations from the March Current Population Survey Supplement 2015.

In this dataset, the outcome observations Yi’s are log wage of never-married workers living in the U.S; and the features Zi’s consist of a variety of worker characteristics, including experience, race, education, 23 industry, and 22 occupation indicators, and some other characteristics.

In [ ]:
X_train = train[x_col]
X_test = test[x_col]
In [ ]:
# outcome variable log wage
y_train = train['lwage']
y_test = test['lwage']

We will estimate the two sets of prediction rules,

  1. Linear Models
  2. Nonlinear Models.

Linear Models¶

In linear models, we estimate $\hat{g}(Z)$= βˆ$^r$ X , we can generate X in two ways:

› in a basic model, X consists of 72 raw regressors Z and a constant;

› in a flexible model, X consists of Z , four polynomials in experience, and all two-way interactions of these variables; this gives us 2336 regressors;

We estimate $\hat{β}$ by linear regression/least squares and by penalized regression methods Lasso, Cross-Validated Lasso, Ridge, and Elastic Net.

In [ ]:
# preparing the quadratic large specification data for train and test

from Scikit-learn.preprocessing import PolynomialFeatures
# Initialize the PolynomialFeatures object

poly = PolynomialFeatures(interaction_only=True, include_bias=False)

# Preparing the data
XL_train = train[xL_col]
XL_test = test[xL_col]

# Transform the training data
XL_train_poly = poly.fit_transform(XL_train)
XL_train_poly = pd.DataFrame(XL_train_poly, columns=poly.get_feature_names_out(XL_train.columns))

# Transform the test data
XL_test_poly = poly.transform(XL_test)
XL_test_poly = pd.DataFrame(XL_test_poly, columns=poly.get_feature_names_out(XL_test.columns))
In [ ]:
from Scikit-learn.linear_model import LinearRegression
# Linear Model: Quadratic (Large) Specification without intercept
model_train_L = LinearRegression(fit_intercept=False)
model_train_L.fit(XL_train,y_train)
model_test_L = LinearRegression(fit_intercept=False)
model_test_L.fit(XL_test,y_test)


# Linear Model: Linear Specification without intercept
model_train = LinearRegression(fit_intercept=False)
model_train.fit(X_train,y_train)
model_test = LinearRegression(fit_intercept=False)
model_test.fit(X_test,y_test)
Out[ ]:
LinearRegression(fit_intercept=False)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression(fit_intercept=False)
In [ ]:
# Linear Model: Linear Specification with intercept
model_lm = LinearRegression()
model_lm.fit(X_train,y_train)

# Linear Model: Quadratic (Large) Specification with intercept
model_lm2 = LinearRegression()
model_lm2.fit(XL_train,y_train)
Out[ ]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [ ]:
from Scikit-learn.linear_model import Lasso, LassoCV, RidgeCV, ElasticNetCV

# Basic Lasso model on original data
model_lasso = Lasso()
model_lasso.fit(X_train, y_train)

# LassoCV model on original data
# cv - number of cross validations to be checked, n_jobs = -1 - use all the parallel processing power of the system
model_lasso_cv = LassoCV(alphas=[0.1, 1, 10], cv=11, n_jobs=-1)
model_lasso_cv.fit(X_train, y_train)

# RidgeCV model on original data
model_ridge_cv = RidgeCV(alphas=[0.1, 1, 10], cv=11)
model_ridge_cv.fit(X_train, y_train)

# ElasticNetCV model on original data
model_elasticnet_cv = ElasticNetCV(alphas=[0.5], cv=11, n_jobs=-1)
model_elasticnet_cv.fit(X_train, y_train)
Out[ ]:
ElasticNetCV(alphas=[0.5], cv=11, n_jobs=-1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
ElasticNetCV(alphas=[0.5], cv=11, n_jobs=-1)
In [ ]:
# basic lasso model on quadratic transform data
model_lasso = Lasso()
model_lasso.fit(X_train, y_train)

# LassoCV model on original data
model_lasso_cv = LassoCV(alphas=[1], cv=11, n_jobs=-1)
model_lasso_cv.fit(X_train, y_train)

# RidgeCV model on quadratic transform data
# Provide a list of positive alpha values for RidgeCV
model_ridge_cv_L = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=11)
model_ridge_cv_L.fit(XL_train, y_train)

# ElasticNetCV model on original data
model_elasticnet_cv = ElasticNetCV(alphas=[0.5], cv=11, n_jobs=-1)
model_elasticnet_cv.fit(X_train, y_train)
Out[ ]:
ElasticNetCV(alphas=[0.5], cv=11, n_jobs=-1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
ElasticNetCV(alphas=[0.5], cv=11, n_jobs=-1)

Non-Linear Models¶

In nonlinear models, we estimate the prediction rule of the form $\hat{g}(Z)$. We use Random Forest, Regression Trees, and Boosted Trees to estimate them.

In [ ]:
from Scikit-learn.ensemble import RandomForestRegressor, GradientBoostingRegressor
# random forest regresison on original data

rf = RandomForestRegressor(n_estimators=2000,min_samples_leaf=5) # n_estimators - The number of trees in the forest, min_samples_leaf - The minimum number of samples required to be at a leaf node.
rf.fit(X_train,y_train)

# gradient boost regresison on original data
gbr = GradientBoostingRegressor(n_estimators=1000,min_weight_fraction_leaf=0.5,max_depth=2,learning_rate=0.01)
# max_depth - The maximum depth of the tree
# min_weight_fraction_leaf - The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node

gbr.fit(X_train,y_train)
Out[ ]:
GradientBoostingRegressor(learning_rate=0.01, max_depth=2,
                          min_weight_fraction_leaf=0.5, n_estimators=1000)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingRegressor(learning_rate=0.01, max_depth=2,
                          min_weight_fraction_leaf=0.5, n_estimators=1000)
In [ ]:
from Scikit-learn.tree import DecisionTreeRegressor
# decision tree regresison on original data
dt = DecisionTreeRegressor()
dt.fit(X_train,y_train)
dt_pruned = DecisionTreeRegressor(max_depth=5)
dt_pruned.fit(X_train,y_train)
Out[ ]:
DecisionTreeRegressor(max_depth=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeRegressor(max_depth=5)
In [ ]:
# predictions on test data for all the models build so far
y_pred_lm = model_lm.predict(X_test)
y_pred_lm2 = model_lm2.predict(XL_test)
y_pred_lasso = model_lasso.predict(X_test)
y_pred_lasso_cv = model_lasso_cv.predict(X_test)
y_pred_ridge_cv = model_ridge_cv.predict(X_test)
y_pred_elasticnet_cv = model_elasticnet_cv.predict(X_test)
y_pred_lasso_L = model_lasso_L.predict(XL_test)
y_pred_lasso_cv_L = model_lasso_cv_L.predict(XL_test)
y_pred_ridge_cv_L = model_ridge_cv_L.predict(XL_test)
y_pred_elasticnet_cv_L = model_elasticnet_cv_L.predict(XL_test)
y_pred_rf = rf.predict(X_test)
y_pred_gbr = gbr.predict(X_test)
y_pred_dt = dt.predict(X_test)
y_pred_dt_pruned = dt_pruned.predict(X_test)
In [ ]:
from Scikit-learn.metrics import r2_score,mean_squared_error
# linear model metrics
lm_r2 = r2_score(y_test,y_pred_lm)
lm_mse = mean_squared_error(y_test,y_pred_lm)
print("lm_r2",lm_r2)
print("lm_mse",lm_mse)
print()

# quadratic model metrics
lm2_r2 = r2_score(y_test,y_pred_lm2)
lm2_mse = mean_squared_error(y_test,y_pred_lm2)
print("lm2_r2",lm2_r2)
print("lm2_mse",lm2_mse)
print()

# lasso model metrics
lasso_r2 = r2_score(y_test,y_pred_lasso)
lasso_mse = mean_squared_error(y_test,y_pred_lasso)
print("lasso_r2",lasso_r2)
print("lasso_mse",lasso_mse)
print()

# lassocv model metrics
lasso_cv_r2 = r2_score(y_test,y_pred_lasso_cv)
lasso_cv_mse = mean_squared_error(y_test,y_pred_lasso_cv)
print("lasso_cv_r2",lasso_cv_r2)
print("lasso_cv_mse",lasso_cv_mse)
print()

# ridgecv model metrics
ridge_cv_r2 = r2_score(y_test,y_pred_ridge_cv)
ridge_cv_mse = mean_squared_error(y_test,y_pred_ridge_cv)
print("ridge_cv_r2",ridge_cv_r2)
print("ridge_cv_mse",ridge_cv_mse)
print()

# elasticnetcv model metrics
elasticnet_cv_r2 = r2_score(y_test,y_pred_elasticnet_cv)
elasticnet_cv_mse = mean_squared_error(y_test,y_pred_elasticnet_cv)
print("elasticnet_cv_r2",elasticnet_cv_r2)
print("elasticnet_cv_mse",elasticnet_cv_mse)
print()

# lasso on quadratic data metrics
lasso_L_r2 = r2_score(y_test,y_pred_lasso_L)
lasso_L_mse = mean_squared_error(y_test,y_pred_lasso_L)
print("lasso_L_r2",lasso_L_r2)
print("lasso_L_mse",lasso_L_mse)
print()

# lassocv on quadratic data metrics
lasso_cv_L_r2 = r2_score(y_test,y_pred_lasso_cv_L)
lasso_cv_L_mse = mean_squared_error(y_test,y_pred_lasso_cv_L)
print("lasso_cv_L_r2",lasso_cv_L_r2)
print("lasso_cv_L_mse",lasso_cv_L_mse)
print()

# ridgecv on quadratic data metrics
ridge_cv_L_r2 = r2_score(y_test,y_pred_ridge_cv_L)
ridge_cv_L_mse = mean_squared_error(y_test,y_pred_ridge_cv_L)
print("ridge_cv_L_r2",ridge_cv_L_r2)
print("ridge_cv_L_mse",ridge_cv_L_mse)
print()

# elasticnetcv on quadratic data metrics
elasticnet_cv_L_r2 = r2_score(y_test,y_pred_elasticnet_cv_L)
elasticnet_cv_L_mse = mean_squared_error(y_test,y_pred_elasticnet_cv_L)
print("elasticnet_cv_L_r2",elasticnet_cv_L_r2)
print("elasticnet_cv_L_mse",elasticnet_cv_L_mse)
print()

# randomforest model metrics
rf_r2 = r2_score(y_test,y_pred_rf)
rf_mse = mean_squared_error(y_test,y_pred_rf)
print("rf_r2",rf_r2)
print("rf_mse",rf_mse)
print()

# gradientboost model metrics
gbr_r2 = r2_score(y_test,y_pred_gbr)
gbr_mse = mean_squared_error(y_test,y_pred_gbr)
print("gbr_r2",gbr_r2)
print("gbr_mse",gbr_mse)
print()

# decision tree model metrics
dt_r2 = r2_score(y_test,y_pred_dt)
dt_mse = mean_squared_error(y_test,y_pred_dt)
print("dt_r2",dt_r2)
print("dt_mse",dt_mse)
print()

# pruned decision tree model metrics
dt_pruned_r2 = r2_score(y_test,y_pred_dt_pruned)
dt_pruned_mse = mean_squared_error(y_test,y_pred_dt_pruned)
print("dt_pruned_r2",dt_pruned_r2)
print("dt_pruned_mse",dt_pruned_mse)
lm_r2 0.2564591769934085
lm_mse 0.3127664499897118

lm2_r2 0.2602708126287935
lm2_mse 0.31116310595069446

lasso_r2 0.04811131393532875
lasso_mse 0.4004068585258824

lasso_cv_r2 0.04811131393532875
lasso_cv_mse 0.4004068585258824

ridge_cv_r2 0.25655045286270683
ridge_cv_mse 0.31272805528598224

elasticnet_cv_r2 0.11137542178631354
elasticnet_cv_mse 0.3737951516604702

lasso_L_r2 0.04811131393532875
lasso_L_mse 0.4004068585258824

lasso_cv_L_r2 0.04811131393532875
lasso_cv_L_mse 0.4004068585258824

ridge_cv_L_r2 0.26035753454659527
ridge_cv_L_mse 0.31112662684218034

elasticnet_cv_L_r2 0.12063635718219501
elasticnet_cv_L_mse 0.36989958897214126

rf_r2 0.28773110798677815
rf_mse 0.29961208033241493

gbr_r2 -0.0002358120438648026
gbr_mse 0.420743817159267

dt_r2 -0.29610208022972073
dt_mse 0.5451983723214303

dt_pruned_r2 0.21722036231287212
dt_pruned_mse 0.32927204644077146

Conclusion:¶

  • The above result shows results for a single split of data into the training and testing part. We can observe that the Random Forest prediction algorithm performs the best here, with the lowest testing MSE or highest R-squared.

  • Other methods, such as the quadratic model, Cross-Validated Ridge, and Pruned regression tree, do similarly well.

  • Remarkably , OLS on a simple model with 72 regressors performs extremely well, almost as well as the sophisticated version of Random Forest.

  • Since the performance of OLS on a simple model is statistically indistinguishable from that of Random Forest, we may choose this method to be the winner here as OLS method is more interpretable than random forest and using the coefficient of variables in the OLS method we can select the more important features with valid reasons.

In [ ]:
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/My Drive/Colab Notebooks/Copy of FDS_Project_LearnerNotebook_FullCode.ipynb"