Data Scientist Employee Attrition - Job Change of Data Scientists¶

Problem Statement¶

A company that is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses conducted by the company. Many people sign up for their training. The company wants to know which of these candidates want to work for the company after training or looking for new employment because it helps to reduce the cost and time as well as the quality of training or planning the courses and categorization of candidates. Information related to demographics, education and experience is provided by candidates during signup and enrollment.

This dataset is designed to understand the factors that lead a person to leave their current job, and it is hence useful for HR research. By building a model that uses the current credentials, demographics, and work experience related data, you will predict the probability that a candidate is looking for a new job, as well as interpret the main factors that affect an employee's decision whether to continue or attrite.

Data Description¶

  • Enrollee_id: Unique ID for candidate
  • City: City code
  • City_development_index: Developement index of the city (scaled)
  • Gender: Gender of candidate
  • Relevent_experience: Relevant experience of candidate
  • Enrolled_university: Type of University course enrolled if any
  • Education_level: Education level of candidate
  • Major_discipline: Education major discipline of candidate
  • Experience: Candidate total experience in years
  • Company_size: No of employees in current employer's company
  • Company_type: Type of current employer
  • Last_new_job: Difference in years between previous job and current job
  • Training_hours: training hours completed
  • Target: 0 – Not looking for job change, 1 – Looking for a job change
In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Introductory Steps¶

Importing Libraries¶

In [ ]:
pip install scikeras
Collecting scikeras
  Downloading scikeras-0.13.0-py3-none-any.whl.metadata (3.1 kB)
Requirement already satisfied: keras>=3.2.0 in /usr/local/lib/python3.10/dist-packages (from scikeras) (3.4.1)
Requirement already satisfied: scikit-learn>=1.4.2 in /usr/local/lib/python3.10/dist-packages (from scikeras) (1.5.2)
Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (1.4.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (1.26.4)
Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (13.8.1)
Requirement already satisfied: namex in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (0.0.8)
Requirement already satisfied: h5py in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (3.11.0)
Requirement already satisfied: optree in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (0.12.1)
Requirement already satisfied: ml-dtypes in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (0.4.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from keras>=3.2.0->scikeras) (24.1)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.4.2->scikeras) (1.13.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.4.2->scikeras) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.4.2->scikeras) (3.5.0)
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from optree->keras>=3.2.0->scikeras) (4.12.2)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.2.0->scikeras) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras>=3.2.0->scikeras) (2.18.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich->keras>=3.2.0->scikeras) (0.1.2)
Downloading scikeras-0.13.0-py3-none-any.whl (26 kB)
Installing collected packages: scikeras
Successfully installed scikeras-0.13.0
In [ ]:
pip install scikit-learn==1.2.2
Collecting scikit-learn==1.2.2
  Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.2.2) (1.26.4)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.2.2) (1.13.1)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.2.2) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.2.2) (3.5.0)
Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.6/9.6 MB 25.2 MB/s eta 0:00:00
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.5.2
    Uninstalling scikit-learn-1.5.2:
      Successfully uninstalled scikit-learn-1.5.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scikeras 0.13.0 requires scikit-learn>=1.4.2, but you have scikit-learn 1.2.2 which is incompatible.
Successfully installed scikit-learn-1.2.2

The code was executed after downgrading to version 1.2.2, hence it's advisable to downgrade to avoid potential unnecessary issues. After installing scikit-learn, restart the session then follow below steps.

In [ ]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn import model_selection
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
import warnings
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dense, Input, Dropout,BatchNormalization
from scikeras.wrappers import KerasClassifier

import random
from tensorflow.keras import backend
random.seed(1)
np.random.seed(1)
tf.random.set_seed(1)
warnings.filterwarnings("ignore")

Loading the Data¶

In [ ]:
# Loeading the data. Fill the blank with the file path of the csv file
#Data = pd.read_csv('_________')

dataset = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Six_-_Deep_Learning/Data_Scientist_Employee_Attrition/Data.csv')
Data = dataset.copy()
In [ ]:
# Checking the number of rows and columns in the data
Data.shape
Out[ ]:
(19158, 14)
  • The dataset has 19158 rows and 14 columns

Data Overview¶

In [ ]:
# Let's view the first 5 rows of the data
Data.head()
Out[ ]:
Enrollee_id City City_development_index Gender Relevent_experience Enrolled_university Education_level Major_discipline Experience Company_size Company_type Last_new_job Training_hours Target
0 8949 city_103 0.920 Male Has relevent experience no_enrollment Graduate STEM >20 NaN NaN 1 36 1
1 29725 city_40 0.776 Male No relevent experience no_enrollment Graduate STEM 15 50-99 Pvt Ltd >4 47 0
2 11561 city_21 0.624 NaN No relevent experience Full time course Graduate STEM 5 NaN NaN never 83 0
3 33241 city_115 0.789 NaN No relevent experience NaN Graduate Business Degree <1 NaN Pvt Ltd never 52 1
4 666 city_162 0.767 Male Has relevent experience no_enrollment Masters STEM >20 50-99 Funded Startup 4 8 0
In [ ]:
# Let's view the last 5 rows of the data
Data.tail()
Out[ ]:
Enrollee_id City City_development_index Gender Relevent_experience Enrolled_university Education_level Major_discipline Experience Company_size Company_type Last_new_job Training_hours Target
19153 7386 city_173 0.878 Male No relevent experience no_enrollment Graduate Humanities 14 NaN NaN 1 42 1
19154 31398 city_103 0.920 Male Has relevent experience no_enrollment Graduate STEM 14 NaN NaN 4 52 1
19155 24576 city_103 0.920 Male Has relevent experience no_enrollment Graduate STEM >20 50-99 Pvt Ltd 4 44 0
19156 5756 city_65 0.802 Male Has relevent experience no_enrollment High School NaN <1 500-999 Pvt Ltd 2 97 0
19157 23834 city_67 0.855 NaN No relevent experience no_enrollment Primary School NaN 2 NaN NaN 1 127 0
In [ ]:
# Let's check the datatypes of the columns in the dataset
Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Enrollee_id             19158 non-null  int64  
 1   City                    19158 non-null  object 
 2   City_development_index  19158 non-null  float64
 3   Gender                  14650 non-null  object 
 4   Relevent_experience     19158 non-null  object 
 5   Enrolled_university     18772 non-null  object 
 6   Education_level         18698 non-null  object 
 7   Major_discipline        16345 non-null  object 
 8   Experience              19093 non-null  object 
 9   Company_size            13220 non-null  object 
 10  Company_type            13018 non-null  object 
 11  Last_new_job            18735 non-null  object 
 12  Training_hours          19158 non-null  int64  
 13  Target                  19158 non-null  int64  
dtypes: float64(1), int64(3), object(10)
memory usage: 2.0+ MB
  • There are 19,158 observations and 14 columns in the data.
  • 10 columns are of the object datatype and 4 columns are numerical.
In [ ]:
# Let's check for duplicate values in the data
Data.duplicated().sum()
Out[ ]:
0
In [ ]:
# Let's check for missing values in the data
round(Data.isnull().sum() / Data.isnull().count() * 100, 2)
Out[ ]:
0
Enrollee_id 0.00
City 0.00
City_development_index 0.00
Gender 23.53
Relevent_experience 0.00
Enrolled_university 2.01
Education_level 2.40
Major_discipline 14.68
Experience 0.34
Company_size 30.99
Company_type 32.05
Last_new_job 2.21
Training_hours 0.00
Target 0.00

In [ ]:
Data["Target"].value_counts(1)
Out[ ]:
proportion
Target
0 0.750652
1 0.249348

In [ ]:
# Let's view the statistical summary of the numerical columns in the data
Data.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
Enrollee_id 19158.0 16875.358179 9616.292592 1.000 8554.25 16982.500 25169.75 33380.000
City_development_index 19158.0 0.828848 0.123362 0.448 0.74 0.903 0.92 0.949
Training_hours 19158.0 65.366896 60.058462 1.000 23.00 47.000 88.00 336.000
Target 19158.0 0.249348 0.432647 0.000 0.00 0.000 0.00 1.000

Outside of the Enrollee_id (an ID column) and the Target variable, there are only two numerical columns in the dataset.

  • The maximum number of training hours is 336, but at least 75% of employees had finished their training within 88 hours.
  • 50% of the rows have a city development index between 0.903 and 0.949, so the dataset is weighted towards the higher side with respect to this column.
In [ ]:
# Let's check the number of unique values in each column
Data.nunique()
Out[ ]:
0
Enrollee_id 19158
City 123
City_development_index 93
Gender 3
Relevent_experience 2
Enrolled_university 3
Education_level 5
Major_discipline 6
Experience 22
Company_size 8
Company_type 6
Last_new_job 6
Training_hours 241
Target 2

  • Each value of the column 'Employee_id' is a unique identifier for an employee. Hence we can drop this column as it will not add any predictive power or value to the model.
  • The 'City' column has 123 unique categories.
In [ ]:
for i in Data.describe(include=["object"]).columns:
    print("Unique values in", i, "are :")
    print(Data[i].value_counts())
    print("*" * 50)
Unique values in City are :
City
city_103    4355
city_21     2702
city_16     1533
city_114    1336
city_160     845
            ... 
city_129       3
city_111       3
city_121       3
city_140       1
city_171       1
Name: count, Length: 123, dtype: int64
**************************************************
Unique values in Gender are :
Gender
Male      13221
Female     1238
Other       191
Name: count, dtype: int64
**************************************************
Unique values in Relevent_experience are :
Relevent_experience
Has relevent experience    13792
No relevent experience      5366
Name: count, dtype: int64
**************************************************
Unique values in Enrolled_university are :
Enrolled_university
no_enrollment       13817
Full time course     3757
Part time course     1198
Name: count, dtype: int64
**************************************************
Unique values in Education_level are :
Education_level
Graduate          11598
Masters            4361
High School        2017
Phd                 414
Primary School      308
Name: count, dtype: int64
**************************************************
Unique values in Major_discipline are :
Major_discipline
STEM               14492
Humanities           669
Other                381
Business Degree      327
Arts                 253
No Major             223
Name: count, dtype: int64
**************************************************
Unique values in Experience are :
Experience
>20    3286
5      1430
4      1403
3      1354
6      1216
2      1127
7      1028
10      985
9       980
8       802
15      686
11      664
14      586
1       549
<1      522
16      508
12      494
13      399
17      342
19      304
18      280
20      148
Name: count, dtype: int64
**************************************************
Unique values in Company_size are :
Company_size
50-99        3083
100-500      2571
10000+       2019
Oct-49       1471
1000-4999    1328
<10          1308
500-999       877
5000-9999     563
Name: count, dtype: int64
**************************************************
Unique values in Company_type are :
Company_type
Pvt Ltd                9817
Funded Startup         1001
Public Sector           955
Early Stage Startup     603
NGO                     521
Other                   121
Name: count, dtype: int64
**************************************************
Unique values in Last_new_job are :
Last_new_job
1        8040
>4       3290
2        2900
never    2452
4        1029
3        1024
Name: count, dtype: int64
**************************************************
  • The 'City' column has 123 unique categories, and the city with the highest number of employees is City 103.
  • Over 90% of the employees in this dataset are males, so it is highly gender-skewed.
  • Most of the employees (~70%) have relevant experience in Data Science.
  • 70% of the employees did not enroll in any of the courses.
  • Most of the employees have a Bachelor's level of education (Graduation) but not more than that (~62% of the total number of employees). There are very few employees with Master's degrees and PhDs.
  • Almost all the employees have previous experience (Related and Non-Related).
  • ~75% of the employees are from private companies.

Data Pre-processing¶

In [ ]:
# ID column consists of uniques ID for clients and hence will not add value to the modeling
Data.drop(columns="Enrollee_id", inplace=True)

EDA¶

Univariate Analysis¶

In [ ]:
# Function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
Data.dtypes
Out[ ]:
0
City object
City_development_index float64
Gender object
Relevent_experience object
Enrolled_university object
Education_level object
Major_discipline object
Experience object
Company_size object
Company_type object
Last_new_job object
Training_hours int64
Target int64

In [ ]:
histogram_boxplot(Data, "City_development_index")
No description has been provided for this image
  • From the above plot, we observe that there are many people from cities having a development index more than 0.9.
In [ ]:
histogram_boxplot(Data, "Training_hours")
No description has been provided for this image
  • From the plot, we observe that the measures of central tendency with respect to training hours seem to be 70, despite a maximum value over 300 hours. So most of the people in this dataset have undergone traning for less than 100 hours.
In [ ]:
# Function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [ ]:
labeled_barplot(Data, "Gender")
No description has been provided for this image
  • There are far more males in this dataset in comparison to females.
  • Over 90% of this dataset is male, representing a highly gender-skewed dataset. This could be a limitation with respect to implementing this model in the real world, since gender balance is highly important to create machine learning models that are practically implemented on datasets related to people.
In [ ]:
labeled_barplot(Data, "Relevent_experience")
No description has been provided for this image
  • Most of the employees have relevant prior experience (~70%).
  • 30% of the employees, however, have no relevant experience.
In [ ]:
labeled_barplot(Data, "Enrolled_university")
No description has been provided for this image
  • Most of the employees did not enroll in any of the courses.
  • Approximately 20% of the employees have enrolled themselves in full-time courses.
  • Only 6% have enrolled in part-time courses.
In [ ]:
labeled_barplot(Data, "Education_level")
No description has been provided for this image
  • Approximately 62% of employees have a Bachelor's (Graduate) level of education, but not more than that.
  • Approx 23% of employees have a Master's degree as their highest level of education.
  • There are very few employees (~1.5%) with only a High School level of education or below.
In [ ]:
labeled_barplot(Data, "Major_discipline")
No description has been provided for this image
  • Approximately 88% of employees have opted for STEM as their major discipline.
In [ ]:
labeled_barplot(Data, "Experience")
No description has been provided for this image
  • Approximately 17% of total employees have over 20 years of work experience.
In [ ]:
labeled_barplot(Data, "Company_type")
No description has been provided for this image
  • Approximately 75% of the total employees are from a private limited company, showing the skew of the profile towards the private sector.
In [ ]:
print(Data.Target.value_counts())
labels = 'Looking for job change', 'Not Looking for job change'
#sizes = [ds.is_promoted[ds['is_promoted']==1].count(), ds.is_promoted[ds['is_promoted']==0].count()]
sizes = [Data.Target[Data['Target']==1].count(),Data.Target[Data['Target']==0].count()]
explode = (0, 0.1)
fig1, ax1 = plt.subplots(figsize=(10, 8))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')
plt.title("Proportion", size = 20)
plt.show()
Target
0    14381
1     4777
Name: count, dtype: int64
No description has been provided for this image
  • This pie chart shows that the actual distribution of classes is itself imbalanced for the target variable.
  • Only ~25% of the employees in this dataset are actually looking for a job change.

Hence, this dataset and problem statement represent an example of Imbalanced Classification, which has unique challenges in comparison to performing classification over balanced target variables.

Bivariate Analysis¶

In [ ]:
### Function to plot distributions


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
In [ ]:
distribution_plot_wrt_target(Data, "City_development_index", "Target")
No description has been provided for this image
  • From the above plot, we observe that employees from cities having a development index over 0.9, are not willing to switch their jobs.
In [ ]:
distribution_plot_wrt_target(Data, "Training_hours", "Target")
No description has been provided for this image
  • We observe that the distribution of the training hours with respect to the target variable is rightly skewed, and from the box plot for both classes the median traning hours are around 50.
In [ ]:
# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left",
        frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [ ]:
stacked_barplot(Data, "Gender", "Target")
Target      0     1    All
Gender                    
All     11262  3388  14650
Male    10209  3012  13221
Female    912   326   1238
Other     141    50    191
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  • From the above plot, it is observed that the likelihood of the employee choosing a job switch does not depend on their gender.
In [ ]:
stacked_barplot(Data, "Relevent_experience", "Target")
Target                       0     1    All
Relevent_experience                        
All                      14381  4777  19158
Has relevent experience  10831  2961  13792
No relevent experience    3550  1816   5366
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  • From the above plot, we see that employees from Non-relevant experience are more likely to be switching their job.
In [ ]:
stacked_barplot(Data, "Enrolled_university", "Target")
Target                   0     1    All
Enrolled_university                    
All                  14118  4654  18772
no_enrollment        10896  2921  13817
Full time course      2326  1431   3757
Part time course       896   302   1198
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  • Employees who have taken full-time courses in universities are the ones who are more likely to be trying to switch jobs.
In [ ]:
stacked_barplot(Data, "Education_level", "Target")
Target               0     1    All
Education_level                    
All              14025  4673  18698
Graduate          8353  3245  11598
Masters           3426   935   4361
High School       1623   394   2017
Phd                356    58    414
Primary School     267    41    308
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  • Employees who completed Graduation and Master's degrees are more likely to be trying to switch their jobs.
In [ ]:
stacked_barplot(Data, "Major_discipline", "Target")
Target                0     1    All
Major_discipline                    
All               12117  4228  16345
STEM              10701  3791  14492
Humanities          528   141    669
Other               279   102    381
Business Degree     241    86    327
No Major            168    55    223
Arts                200    53    253
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  • Employees who took STEM or Business Degrees as their major discipline are slightly more likely to change their job.
In [ ]:
stacked_barplot(Data, "Experience", "Target")
Target          0     1    All
Experience                    
All         14339  4754  19093
>20          2783   503   3286
3             876   478   1354
4             946   457   1403
5            1018   412   1430
2             753   374   1127
6             873   343   1216
7             725   303   1028
<1            285   237    522
1             316   233    549
9             767   213    980
10            778   207    985
8             607   195    802
11            513   151    664
15            572   114    686
14            479   107    586
12            402    92    494
13            322    77    399
16            436    72    508
17            285    57    342
19            251    53    304
18            237    43    280
20            115    33    148
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  • From the above plot, it's clear that employees having a work experience of less than 3 years are trying to switch their jobs.
In [ ]:
stacked_barplot(Data, "Last_new_job", "Target")
Target            0     1    All
Last_new_job                    
All           14112  4623  18735
1              5915  2125   8040
never          1713   739   2452
2              2200   700   2900
>4             2690   600   3290
3               793   231   1024
4               801   228   1029
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
  • Employees who have never switched their job before are the most likely to be looking for a job change.
In [ ]:
###Dropping these columns as they will not add value to the modeling
Data.drop(['Company_size','Gender','City'], axis=1, inplace=True)
In [ ]:
## Separating all the categorical columns for imputation
cat_col_df = Data.drop(['City_development_index','Training_hours','Target'], axis=1)

Missing Value Imputation¶

  • We will impute the missing values in columns using their mode.
In [ ]:
## Separating Independent and Dependent Columns
X = Data.drop(['Target'],axis=1)
Y = Data[['Target']]
In [ ]:
Y.head()
Out[ ]:
Target
0 1
1 0
2 0
3 1
4 0
In [ ]:
# Splitting the dataset into the Training and Testing set.

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 42,stratify = Y)
In [ ]:
X_train.isnull().sum()
Out[ ]:
0
City_development_index 0
Relevent_experience 0
Enrolled_university 317
Education_level 362
Major_discipline 2258
Experience 50
Company_type 4881
Last_new_job 343
Training_hours 0

In [ ]:
imputer_mode = SimpleImputer(strategy="most_frequent")
X_train[["Enrolled_university","Education_level","Major_discipline","Experience","Company_type","Last_new_job"]] = imputer_mode.fit_transform(
    X_train[["Enrolled_university","Education_level","Major_discipline","Experience","Company_type","Last_new_job"]])

X_test[["Enrolled_university","Education_level","Major_discipline","Experience","Company_type","Last_new_job"]] = imputer_mode.transform(
    X_test[["Enrolled_university","Education_level","Major_discipline","Experience","Company_type","Last_new_job"]])
In [ ]:
# Checking that no column has missing values in train and test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
City_development_index    0
Relevent_experience       0
Enrolled_university       0
Education_level           0
Major_discipline          0
Experience                0
Company_type              0
Last_new_job              0
Training_hours            0
dtype: int64
------------------------------
City_development_index    0
Relevent_experience       0
Enrolled_university       0
Education_level           0
Major_discipline          0
Experience                0
Company_type              0
Last_new_job              0
Training_hours            0
dtype: int64

Encoding Categorical Columns¶

  • We will be using the Label Encoding technique to encode the values of the categorical columns in this dataset.
In [ ]:
from sklearn.preprocessing import LabelEncoder
labelencoder_RE = LabelEncoder()
X_train['Relevent_experience']= labelencoder_RE.fit_transform(X_train['Relevent_experience'])
X_test['Relevent_experience']= labelencoder_RE.transform(X_test['Relevent_experience'])
In [ ]:
labelencoder_EN = LabelEncoder()
X_train['Enrolled_university'] =  labelencoder_EN.fit_transform(X_train['Enrolled_university'])
X_test['Enrolled_university'] =  labelencoder_EN.transform(X_test['Enrolled_university'])
In [ ]:
labelencoder_EL = LabelEncoder()
X_train['Education_level']=  labelencoder_EL.fit_transform(X_train['Education_level'])
X_test['Education_level']=  labelencoder_EL.transform(X_test['Education_level'])
In [ ]:
labelencoder_MD = LabelEncoder()
X_train['Major_discipline']=  labelencoder_MD.fit_transform(X_train['Major_discipline'])
X_test['Major_discipline']=  labelencoder_MD.transform(X_test['Major_discipline'])
In [ ]:
labelencoder_EX = LabelEncoder()
X_train['Experience']=  labelencoder_EX.fit_transform(X_train['Experience'])
X_test['Experience']=  labelencoder_EX.transform(X_test['Experience'])
In [ ]:
labelencoder_CT = LabelEncoder()
X_train['Company_type']=  labelencoder_CT.fit_transform(X_train['Company_type'])
X_test['Company_type']=  labelencoder_CT.transform(X_test['Company_type'])
In [ ]:
labelencoder_LNJ = LabelEncoder()
X_train['Last_new_job']=  labelencoder_LNJ.fit_transform(X_train['Last_new_job'])
X_test['Last_new_job']=  labelencoder_LNJ.transform(X_test['Last_new_job'])
In [ ]:
X_train.head()
Out[ ]:
City_development_index Relevent_experience Enrolled_university Education_level Major_discipline Experience Company_type Last_new_job Training_hours
17855 0.624 0 2 0 5 1 5 0 90
17664 0.920 1 2 4 5 15 5 5 15
13404 0.896 0 2 0 5 3 2 4 36
13366 0.920 0 2 0 5 15 1 0 53
15670 0.855 0 0 0 5 15 5 0 158
In [ ]:
y_train.head()
Out[ ]:
Target
17855 0
17664 0
13404 0
13366 0
15670 1
In [ ]:
###Checking the shape of train and test sets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(15326, 9)
(3832, 9)
(15326, 1)
(3832, 1)

Model Building¶

A model can make wrong predictions in the following ways:¶

  • Predicting an employee is looking for a job, when he/she is not looking for it.
  • Predicting an employee is not looking for a job, when he/she is in fact looking for one.

Which case is more important?¶

Both cases are actually important for the purposes of this case study. Not giving a chance to a deserving employee (by wrongly classifying them as likely to attrite) might lead to decreased productivity, and the company might lose a good employee affecting the organization's growth. However, giving chances to a non-deserving employee (as they are likely to attrite) would lead to a financial loss for the company, and giving such employees an increased amount of responsibility might again affect the company's growth.

How to reduce this loss i.e need to reduce False Negatives as well as False Positives?¶

Since both errors are important for us to minimize, the company would want the F1 Score evaluation metric to be maximized/ Hence, the focus should be on increasing the F1 score rather than focusing on just one metric i.e. Recall or Precision.

Model 1¶

In [ ]:
backend.clear_session()
#Fixing the seed for random number generators so that we can ensure we receive the same output everytime
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
# Initializing the ANN
model = Sequential()
# The amount of nodes (dimensions) in hidden layer should be the average of input and output layers, in this case 64.
# This adds the input layer (by specifying input dimension) AND the first hidden layer (units)
model.add(Dense(activation = 'relu', input_dim = 9, units=64))
#Add 1st hidden layer
model.add(Dense(32, activation='relu'))
# Adding the output layer
# Notice that we do not need to specify input dim.
# we have an output of 1 node, which is the the desired dimensions of our output (stay with the bank or not)
# We use the sigmoid because we want probability outcomes
model.add(Dense(1, activation = 'sigmoid'))
In [ ]:
# Create optimizer with default learning rate
# Compile the model
model.compile(optimizer='SGD', loss='binary_crossentropy', metrics=['accuracy'])
In [ ]:
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 64)                640       
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 2753 (10.75 KB)
Trainable params: 2753 (10.75 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
history=model.fit(X_train, y_train,
          validation_split=0.2,
          epochs=50,
          batch_size=32,verbose=1)
Epoch 1/50
384/384 [==============================] - 3s 4ms/step - loss: 0.6304 - accuracy: 0.7440 - val_loss: 0.6566 - val_accuracy: 0.7515
Epoch 2/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5711 - accuracy: 0.7504 - val_loss: 0.5643 - val_accuracy: 0.7515
Epoch 3/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5645 - accuracy: 0.7502 - val_loss: 0.5679 - val_accuracy: 0.7515
Epoch 4/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5637 - accuracy: 0.7499 - val_loss: 0.5634 - val_accuracy: 0.7515
Epoch 5/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5619 - accuracy: 0.7504 - val_loss: 0.6195 - val_accuracy: 0.6644
Epoch 6/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5606 - accuracy: 0.7502 - val_loss: 0.5782 - val_accuracy: 0.7515
Epoch 7/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5585 - accuracy: 0.7504 - val_loss: 0.5593 - val_accuracy: 0.7515
Epoch 8/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5571 - accuracy: 0.7502 - val_loss: 0.5677 - val_accuracy: 0.7515
Epoch 9/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5566 - accuracy: 0.7503 - val_loss: 0.5730 - val_accuracy: 0.7515
Epoch 10/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5568 - accuracy: 0.7502 - val_loss: 0.5540 - val_accuracy: 0.7515
Epoch 11/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5557 - accuracy: 0.7503 - val_loss: 0.5790 - val_accuracy: 0.7456
Epoch 12/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5534 - accuracy: 0.7502 - val_loss: 0.5762 - val_accuracy: 0.7515
Epoch 13/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5532 - accuracy: 0.7505 - val_loss: 0.5864 - val_accuracy: 0.7456
Epoch 14/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5538 - accuracy: 0.7503 - val_loss: 0.5517 - val_accuracy: 0.7515
Epoch 15/50
384/384 [==============================] - 2s 6ms/step - loss: 0.5529 - accuracy: 0.7503 - val_loss: 0.5553 - val_accuracy: 0.7515
Epoch 16/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5519 - accuracy: 0.7502 - val_loss: 0.5854 - val_accuracy: 0.7469
Epoch 17/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5523 - accuracy: 0.7499 - val_loss: 0.5673 - val_accuracy: 0.7515
Epoch 18/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5502 - accuracy: 0.7498 - val_loss: 0.5502 - val_accuracy: 0.7515
Epoch 19/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5501 - accuracy: 0.7498 - val_loss: 0.5570 - val_accuracy: 0.7515
Epoch 20/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5505 - accuracy: 0.7506 - val_loss: 0.5489 - val_accuracy: 0.7515
Epoch 21/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5507 - accuracy: 0.7498 - val_loss: 0.5606 - val_accuracy: 0.7515
Epoch 22/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5501 - accuracy: 0.7499 - val_loss: 0.5493 - val_accuracy: 0.7515
Epoch 23/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5492 - accuracy: 0.7504 - val_loss: 0.6562 - val_accuracy: 0.5753
Epoch 24/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5496 - accuracy: 0.7498 - val_loss: 0.5564 - val_accuracy: 0.7515
Epoch 25/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5498 - accuracy: 0.7499 - val_loss: 0.5640 - val_accuracy: 0.7515
Epoch 26/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5492 - accuracy: 0.7509 - val_loss: 0.5933 - val_accuracy: 0.7515
Epoch 27/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5491 - accuracy: 0.7501 - val_loss: 0.5554 - val_accuracy: 0.7534
Epoch 28/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5479 - accuracy: 0.7505 - val_loss: 0.5496 - val_accuracy: 0.7515
Epoch 29/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5489 - accuracy: 0.7500 - val_loss: 0.5566 - val_accuracy: 0.7505
Epoch 30/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5482 - accuracy: 0.7497 - val_loss: 0.5408 - val_accuracy: 0.7515
Epoch 31/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5482 - accuracy: 0.7506 - val_loss: 0.5466 - val_accuracy: 0.7511
Epoch 32/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5476 - accuracy: 0.7499 - val_loss: 0.5477 - val_accuracy: 0.7502
Epoch 33/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5464 - accuracy: 0.7507 - val_loss: 0.5988 - val_accuracy: 0.7365
Epoch 34/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5476 - accuracy: 0.7506 - val_loss: 0.5641 - val_accuracy: 0.7534
Epoch 35/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5467 - accuracy: 0.7498 - val_loss: 0.5461 - val_accuracy: 0.7515
Epoch 36/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5462 - accuracy: 0.7508 - val_loss: 0.5534 - val_accuracy: 0.7515
Epoch 37/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5456 - accuracy: 0.7501 - val_loss: 0.6286 - val_accuracy: 0.6458
Epoch 38/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5461 - accuracy: 0.7505 - val_loss: 0.5490 - val_accuracy: 0.7515
Epoch 39/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5457 - accuracy: 0.7510 - val_loss: 0.5594 - val_accuracy: 0.7515
Epoch 40/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5454 - accuracy: 0.7505 - val_loss: 0.5439 - val_accuracy: 0.7511
Epoch 41/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5456 - accuracy: 0.7502 - val_loss: 0.5483 - val_accuracy: 0.7515
Epoch 42/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5447 - accuracy: 0.7500 - val_loss: 0.5736 - val_accuracy: 0.7515
Epoch 43/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5453 - accuracy: 0.7505 - val_loss: 0.5876 - val_accuracy: 0.7515
Epoch 44/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5448 - accuracy: 0.7503 - val_loss: 0.5420 - val_accuracy: 0.7515
Epoch 45/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5454 - accuracy: 0.7502 - val_loss: 0.5398 - val_accuracy: 0.7511
Epoch 46/50
384/384 [==============================] - 1s 4ms/step - loss: 0.5448 - accuracy: 0.7506 - val_loss: 0.5451 - val_accuracy: 0.7515
Epoch 47/50
384/384 [==============================] - 2s 4ms/step - loss: 0.5440 - accuracy: 0.7511 - val_loss: 0.5393 - val_accuracy: 0.7508
Epoch 48/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5434 - accuracy: 0.7507 - val_loss: 0.5374 - val_accuracy: 0.7511
Epoch 49/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5447 - accuracy: 0.7507 - val_loss: 0.5410 - val_accuracy: 0.7508
Epoch 50/50
384/384 [==============================] - 2s 5ms/step - loss: 0.5439 - accuracy: 0.7505 - val_loss: 0.5467 - val_accuracy: 0.7515
In [ ]:
# Capturing learning history per epoch
hist  = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

# Plotting accuracy at different epochs
plt.plot(hist['loss'])
plt.plot(hist['val_loss'])
plt.legend(("train" , "valid") , loc =0)

#Printing results
results = model.evaluate(X_test, y_test)
120/120 [==============================] - 0s 2ms/step - loss: 0.5455 - accuracy: 0.7508
No description has been provided for this image

There is noise in the loss behavior here. Sometimes, the loss function fluctuates a lot during training, which makes the convergence slow. These fluctuations are due to the nature of Stochastic Gradient Descent that produces noisy updates in the parameters.

Let's check the other metrices.

In [ ]:
y_pred=model.predict(X_test)
y_pred = (y_pred > 0.5)
y_pred
120/120 [==============================] - 0s 2ms/step
Out[ ]:
array([[False],
       [False],
       [False],
       ...,
       [False],
       [False],
       [False]])
In [ ]:
def make_confusion_matrix(cf,
                          group_names=None,
                          categories='auto',
                          count=True,
                          percent=True,
                          cbar=True,
                          xyticks=True,
                          xyplotlabels=True,
                          sum_stats=True,
                          figsize=None,
                          cmap='Blues',
                          title=None):
    '''
    This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.
    Arguments
    '''


    # CODE TO GENERATE TEXT INSIDE EACH SQUARE
    blanks = ['' for i in range(cf.size)]

    if group_names and len(group_names)==cf.size:
        group_labels = ["{}\n".format(value) for value in group_names]
    else:
        group_labels = blanks

    if count:
        group_counts = ["{0:0.0f}\n".format(value) for value in cf.flatten()]
    else:
        group_counts = blanks

    if percent:
        group_percentages = ["{0:.2%}".format(value) for value in cf.flatten()/np.sum(cf)]
    else:
        group_percentages = blanks

    box_labels = [f"{v1}{v2}{v3}".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]
    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])


    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS
    if sum_stats:
        #Accuracy is sum of diagonal divided by total observations
        accuracy  = np.trace(cf) / float(np.sum(cf))



    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS
    if figsize==None:
        #Get default figure size if not set
        figsize = plt.rcParams.get('figure.figsize')

    if xyticks==False:
        #Do not show categories if xyticks is False
        categories=False


    # MAKE THE HEATMAP VISUALIZATION
    plt.figure(figsize=figsize)
    sns.heatmap(cf,annot=box_labels,fmt="",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)


    if title:
        plt.title(title)
In [ ]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
No description has been provided for this image

Here, the 0% of False Positives is because we gave 0.5 as the threshold to the model, and as this is an imbalanced dataset, we should calculate the threshold using the AUC-ROC curve.

In [ ]:
#Accuracy as per the classification report
from sklearn import metrics
cr=metrics.classification_report(y_test,y_pred)
print(cr)
              precision    recall  f1-score   support

           0       0.75      1.00      0.86      2877
           1       0.00      0.00      0.00       955

    accuracy                           0.75      3832
   macro avg       0.38      0.50      0.43      3832
weighted avg       0.56      0.75      0.64      3832

As you can see, the above model has a good accuracy but a poor F1-score. This could be due to the imbalanced dataset. We observe that the False positive rates are also high, which should be considerably lower.

  1. Imbalanced dataset: As you have seen in the EDA, this dataset is imbalanced, and it contains more examples that belong to the 0 class.

  2. Decision Threshold: Due to the imbalanced dataset, we can use ROC-AUC to find the optimal threshold and use the same for prediction.

Let's try to change the optimizer, tune the decision threshold, increase the layers and configure some other hyperparameters accordingly, in order to improve the model's performance.

Model 2¶

In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
model1 = Sequential()
      #Adding the hidden and output layers
model1.add(Dense(256,activation='relu',kernel_initializer='he_uniform',input_dim = X_train.shape[1]))
model1.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model1.add(Dense(64,activation='relu',kernel_initializer='he_uniform'))
model1.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
model1.add(Dense(1, activation = 'sigmoid'))
      #Compiling the ANN with Adam optimizer and binary cross entropy loss function
optimizer = tf.keras.optimizers.Adam(0.001)
model1.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
In [ ]:
model1.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 256)               2560      
                                                                 
 dense_1 (Dense)             (None, 128)               32896     
                                                                 
 dense_2 (Dense)             (None, 64)                8256      
                                                                 
 dense_3 (Dense)             (None, 32)                2080      
                                                                 
 dense_4 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 45825 (179.00 KB)
Trainable params: 45825 (179.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
history1 = model1.fit(X_train,y_train,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50
192/192 [==============================] - 3s 5ms/step - loss: 0.8645 - accuracy: 0.6954 - val_loss: 0.5785 - val_accuracy: 0.7355
Epoch 2/50
192/192 [==============================] - 1s 4ms/step - loss: 0.6333 - accuracy: 0.7193 - val_loss: 0.5575 - val_accuracy: 0.7469
Epoch 3/50
192/192 [==============================] - 1s 4ms/step - loss: 0.6248 - accuracy: 0.7277 - val_loss: 0.6609 - val_accuracy: 0.7518
Epoch 4/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5867 - accuracy: 0.7365 - val_loss: 0.5417 - val_accuracy: 0.7518
Epoch 5/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5684 - accuracy: 0.7414 - val_loss: 0.6028 - val_accuracy: 0.7502
Epoch 6/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5612 - accuracy: 0.7440 - val_loss: 0.5552 - val_accuracy: 0.7531
Epoch 7/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5678 - accuracy: 0.7418 - val_loss: 0.5352 - val_accuracy: 0.7524
Epoch 8/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5477 - accuracy: 0.7486 - val_loss: 0.5888 - val_accuracy: 0.7133
Epoch 9/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5500 - accuracy: 0.7467 - val_loss: 0.5536 - val_accuracy: 0.7469
Epoch 10/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5519 - accuracy: 0.7459 - val_loss: 0.5435 - val_accuracy: 0.7462
Epoch 11/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5524 - accuracy: 0.7479 - val_loss: 0.5832 - val_accuracy: 0.7267
Epoch 12/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5432 - accuracy: 0.7495 - val_loss: 0.5810 - val_accuracy: 0.7492
Epoch 13/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5422 - accuracy: 0.7517 - val_loss: 0.5275 - val_accuracy: 0.7518
Epoch 14/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5358 - accuracy: 0.7506 - val_loss: 0.5385 - val_accuracy: 0.7482
Epoch 15/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5415 - accuracy: 0.7493 - val_loss: 0.5355 - val_accuracy: 0.7508
Epoch 16/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5400 - accuracy: 0.7518 - val_loss: 0.5309 - val_accuracy: 0.7508
Epoch 17/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5347 - accuracy: 0.7518 - val_loss: 0.5275 - val_accuracy: 0.7534
Epoch 18/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5391 - accuracy: 0.7490 - val_loss: 0.5269 - val_accuracy: 0.7511
Epoch 19/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5326 - accuracy: 0.7514 - val_loss: 0.5265 - val_accuracy: 0.7521
Epoch 20/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5383 - accuracy: 0.7513 - val_loss: 0.5343 - val_accuracy: 0.7508
Epoch 21/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5405 - accuracy: 0.7499 - val_loss: 0.5330 - val_accuracy: 0.7489
Epoch 22/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5343 - accuracy: 0.7514 - val_loss: 0.5240 - val_accuracy: 0.7498
Epoch 23/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5407 - accuracy: 0.7519 - val_loss: 0.5386 - val_accuracy: 0.7515
Epoch 24/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5303 - accuracy: 0.7522 - val_loss: 0.5197 - val_accuracy: 0.7515
Epoch 25/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5308 - accuracy: 0.7510 - val_loss: 0.5403 - val_accuracy: 0.7436
Epoch 26/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5238 - accuracy: 0.7532 - val_loss: 0.5160 - val_accuracy: 0.7551
Epoch 27/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5288 - accuracy: 0.7534 - val_loss: 0.5376 - val_accuracy: 0.7495
Epoch 28/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5244 - accuracy: 0.7544 - val_loss: 0.5237 - val_accuracy: 0.7495
Epoch 29/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5192 - accuracy: 0.7554 - val_loss: 0.5187 - val_accuracy: 0.7511
Epoch 30/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5191 - accuracy: 0.7554 - val_loss: 0.5211 - val_accuracy: 0.7580
Epoch 31/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5151 - accuracy: 0.7588 - val_loss: 0.5259 - val_accuracy: 0.7518
Epoch 32/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5190 - accuracy: 0.7562 - val_loss: 0.5129 - val_accuracy: 0.7524
Epoch 33/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5170 - accuracy: 0.7564 - val_loss: 0.5074 - val_accuracy: 0.7547
Epoch 34/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5310 - accuracy: 0.7529 - val_loss: 0.5229 - val_accuracy: 0.7511
Epoch 35/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5171 - accuracy: 0.7540 - val_loss: 0.5370 - val_accuracy: 0.7498
Epoch 36/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5190 - accuracy: 0.7574 - val_loss: 0.5282 - val_accuracy: 0.7515
Epoch 37/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5115 - accuracy: 0.7587 - val_loss: 0.5492 - val_accuracy: 0.7567
Epoch 38/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5138 - accuracy: 0.7564 - val_loss: 0.5206 - val_accuracy: 0.7505
Epoch 39/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5147 - accuracy: 0.7542 - val_loss: 0.5091 - val_accuracy: 0.7518
Epoch 40/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5100 - accuracy: 0.7569 - val_loss: 0.5161 - val_accuracy: 0.7557
Epoch 41/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5103 - accuracy: 0.7572 - val_loss: 0.5286 - val_accuracy: 0.7515
Epoch 42/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5137 - accuracy: 0.7560 - val_loss: 0.5233 - val_accuracy: 0.7560
Epoch 43/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5094 - accuracy: 0.7563 - val_loss: 0.5135 - val_accuracy: 0.7551
Epoch 44/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5084 - accuracy: 0.7613 - val_loss: 0.5451 - val_accuracy: 0.7534
Epoch 45/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5105 - accuracy: 0.7564 - val_loss: 0.5131 - val_accuracy: 0.7622
Epoch 46/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5162 - accuracy: 0.7583 - val_loss: 0.5211 - val_accuracy: 0.7599
Epoch 47/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5063 - accuracy: 0.7604 - val_loss: 0.5137 - val_accuracy: 0.7554
Epoch 48/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5101 - accuracy: 0.7591 - val_loss: 0.5139 - val_accuracy: 0.7590
Epoch 49/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5102 - accuracy: 0.7591 - val_loss: 0.5346 - val_accuracy: 0.7531
Epoch 50/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5043 - accuracy: 0.7615 - val_loss: 0.5104 - val_accuracy: 0.7609
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history1.history['loss'])
plt.plot(history1.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
No description has been provided for this image

As we increased the depth of the neural network and changed the optimizer to Adam, we can see smoother loss curves for both train and validation.

In [ ]:
from sklearn.metrics import roc_curve

from matplotlib import pyplot


# predict probabilities
yhat1 = model1.predict(X_test)
# keep probabilities for the positive outcome only
yhat1 = yhat1[:, 0]
# calculate roc curves
fpr, tpr, thresholds1 = roc_curve(y_test, yhat1)
# calculate the g-mean for each threshold
gmeans1 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans1)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds1[ix], gmeans1[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step
Best Threshold=0.219626, G-Mean=0.686
No description has been provided for this image

Let's tune the threshold using ROC-AUC

There are many ways we could locate the threshold with the optimal balance between false positive and true positive rates.

Firstly, the true positive rate is called the Sensitivity. The inverse of the false-positive rate is called the Specificity.

Sensitivity = True Positive / (True Positive + False Negative)

Specificity = True Negative / (False Positive + True Negative)

Where:

Sensitivity = True Positive Rate

Specificity = 1 – False Positive Rate

The Geometric Mean or G-Mean is a metric for imbalanced classification that, if optimized, will seek a balance between the sensitivity and the specificity.

G-Mean = sqrt(Sensitivity * Specificity)

One approach would be to test the model with each threshold returned from the call roc_auc_score(),

and select the threshold with the largest G-Mean value.

In [ ]:
#Predicting the results using best as a threshold
y_pred_e1=model1.predict(X_test)
y_pred_e1 = (y_pred_e1 > thresholds1[ix])
y_pred_e1
120/120 [==============================] - 0s 2ms/step
Out[ ]:
array([[False],
       [ True],
       [False],
       ...,
       [False],
       [False],
       [False]])
In [ ]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm1=confusion_matrix(y_test, y_pred_e1)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm1,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
No description has been provided for this image
In [ ]:
#Accuracy as per the classification report
from sklearn import metrics
cr=metrics.classification_report(y_test,y_pred_e1)
print(cr)
              precision    recall  f1-score   support

           0       0.86      0.71      0.78      2877
           1       0.43      0.66      0.52       955

    accuracy                           0.70      3832
   macro avg       0.65      0.69      0.65      3832
weighted avg       0.76      0.70      0.71      3832

As the number of layers in the neural network has increased, we can see that the macro F1 score has increased.

Now let's try to use the Batch Normalization technique and check to see if we can increase the F1 score.

Model 3¶

In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
model2 = Sequential()
model2.add(Dense(128,activation='relu',input_dim = X_train.shape[1]))
model2.add(BatchNormalization())
model2.add(Dense(64,activation='relu',kernel_initializer='he_uniform'))
model2.add(BatchNormalization())
model2.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
model2.add(Dense(1, activation = 'sigmoid'))
In [ ]:
model2.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 128)               1280      
                                                                 
 batch_normalization (Batch  (None, 128)               512       
 Normalization)                                                  
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 batch_normalization_1 (Bat  (None, 64)                256       
 chNormalization)                                                
                                                                 
 dense_2 (Dense)             (None, 32)                2080      
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 12417 (48.50 KB)
Trainable params: 12033 (47.00 KB)
Non-trainable params: 384 (1.50 KB)
_________________________________________________________________
In [ ]:
optimizer = tf.keras.optimizers.Adam(0.001)
model2.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
In [ ]:
history_2 = model2.fit(X_train,y_train,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50
192/192 [==============================] - 5s 7ms/step - loss: 0.5859 - accuracy: 0.7278 - val_loss: 0.5599 - val_accuracy: 0.7515
Epoch 2/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5484 - accuracy: 0.7472 - val_loss: 0.5417 - val_accuracy: 0.7531
Epoch 3/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5325 - accuracy: 0.7512 - val_loss: 0.5247 - val_accuracy: 0.7547
Epoch 4/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5189 - accuracy: 0.7558 - val_loss: 0.5643 - val_accuracy: 0.7114
Epoch 5/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5116 - accuracy: 0.7582 - val_loss: 0.5220 - val_accuracy: 0.7674
Epoch 6/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5080 - accuracy: 0.7573 - val_loss: 0.5059 - val_accuracy: 0.7629
Epoch 7/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5026 - accuracy: 0.7632 - val_loss: 0.5113 - val_accuracy: 0.7567
Epoch 8/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5020 - accuracy: 0.7648 - val_loss: 0.5041 - val_accuracy: 0.7635
Epoch 9/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5021 - accuracy: 0.7657 - val_loss: 0.5022 - val_accuracy: 0.7596
Epoch 10/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5006 - accuracy: 0.7656 - val_loss: 0.5543 - val_accuracy: 0.7401
Epoch 11/50
192/192 [==============================] - 1s 8ms/step - loss: 0.4989 - accuracy: 0.7658 - val_loss: 0.5143 - val_accuracy: 0.7632
Epoch 12/50
192/192 [==============================] - 1s 7ms/step - loss: 0.4979 - accuracy: 0.7644 - val_loss: 0.5328 - val_accuracy: 0.7560
Epoch 13/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4965 - accuracy: 0.7619 - val_loss: 0.5328 - val_accuracy: 0.7603
Epoch 14/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4977 - accuracy: 0.7656 - val_loss: 0.5061 - val_accuracy: 0.7655
Epoch 15/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4956 - accuracy: 0.7661 - val_loss: 0.5337 - val_accuracy: 0.7482
Epoch 16/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4934 - accuracy: 0.7670 - val_loss: 0.5103 - val_accuracy: 0.7652
Epoch 17/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4944 - accuracy: 0.7664 - val_loss: 0.5074 - val_accuracy: 0.7665
Epoch 18/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4911 - accuracy: 0.7693 - val_loss: 0.5158 - val_accuracy: 0.7603
Epoch 19/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4926 - accuracy: 0.7677 - val_loss: 0.5053 - val_accuracy: 0.7632
Epoch 20/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4915 - accuracy: 0.7691 - val_loss: 0.4974 - val_accuracy: 0.7665
Epoch 21/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4898 - accuracy: 0.7690 - val_loss: 0.4980 - val_accuracy: 0.7609
Epoch 22/50
192/192 [==============================] - 1s 8ms/step - loss: 0.4904 - accuracy: 0.7692 - val_loss: 0.5135 - val_accuracy: 0.7606
Epoch 23/50
192/192 [==============================] - 1s 7ms/step - loss: 0.4911 - accuracy: 0.7721 - val_loss: 0.5017 - val_accuracy: 0.7570
Epoch 24/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4884 - accuracy: 0.7734 - val_loss: 0.5114 - val_accuracy: 0.7688
Epoch 25/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4877 - accuracy: 0.7699 - val_loss: 0.4956 - val_accuracy: 0.7599
Epoch 26/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4873 - accuracy: 0.7739 - val_loss: 0.5418 - val_accuracy: 0.7505
Epoch 27/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4872 - accuracy: 0.7734 - val_loss: 0.4981 - val_accuracy: 0.7753
Epoch 28/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4869 - accuracy: 0.7697 - val_loss: 0.5835 - val_accuracy: 0.7202
Epoch 29/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4871 - accuracy: 0.7693 - val_loss: 0.5027 - val_accuracy: 0.7639
Epoch 30/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4867 - accuracy: 0.7723 - val_loss: 0.4992 - val_accuracy: 0.7671
Epoch 31/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4850 - accuracy: 0.7726 - val_loss: 0.5193 - val_accuracy: 0.7622
Epoch 32/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4856 - accuracy: 0.7723 - val_loss: 0.5019 - val_accuracy: 0.7661
Epoch 33/50
192/192 [==============================] - 1s 7ms/step - loss: 0.4859 - accuracy: 0.7724 - val_loss: 0.5027 - val_accuracy: 0.7668
Epoch 34/50
192/192 [==============================] - 1s 8ms/step - loss: 0.4840 - accuracy: 0.7741 - val_loss: 0.4995 - val_accuracy: 0.7678
Epoch 35/50
192/192 [==============================] - 1s 8ms/step - loss: 0.4844 - accuracy: 0.7708 - val_loss: 0.5129 - val_accuracy: 0.7590
Epoch 36/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4834 - accuracy: 0.7754 - val_loss: 0.5039 - val_accuracy: 0.7707
Epoch 37/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4820 - accuracy: 0.7732 - val_loss: 0.4988 - val_accuracy: 0.7655
Epoch 38/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4827 - accuracy: 0.7759 - val_loss: 0.5014 - val_accuracy: 0.7658
Epoch 39/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4800 - accuracy: 0.7768 - val_loss: 0.5364 - val_accuracy: 0.7564
Epoch 40/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4825 - accuracy: 0.7700 - val_loss: 0.5179 - val_accuracy: 0.7551
Epoch 41/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4829 - accuracy: 0.7739 - val_loss: 0.5087 - val_accuracy: 0.7639
Epoch 42/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4801 - accuracy: 0.7761 - val_loss: 0.5285 - val_accuracy: 0.7609
Epoch 43/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4816 - accuracy: 0.7749 - val_loss: 0.5055 - val_accuracy: 0.7648
Epoch 44/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4800 - accuracy: 0.7740 - val_loss: 0.5154 - val_accuracy: 0.7551
Epoch 45/50
192/192 [==============================] - 2s 8ms/step - loss: 0.4811 - accuracy: 0.7756 - val_loss: 0.5968 - val_accuracy: 0.6905
Epoch 46/50
192/192 [==============================] - 2s 8ms/step - loss: 0.4796 - accuracy: 0.7768 - val_loss: 0.5083 - val_accuracy: 0.7616
Epoch 47/50
192/192 [==============================] - 1s 7ms/step - loss: 0.4800 - accuracy: 0.7732 - val_loss: 0.5088 - val_accuracy: 0.7645
Epoch 48/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4795 - accuracy: 0.7750 - val_loss: 0.5253 - val_accuracy: 0.7541
Epoch 49/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4789 - accuracy: 0.7774 - val_loss: 0.5032 - val_accuracy: 0.7613
Epoch 50/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4776 - accuracy: 0.7794 - val_loss: 0.5017 - val_accuracy: 0.7678
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_2.history['loss'])
plt.plot(history_2.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
No description has been provided for this image

Unfortunately, from the above plot we observe that there is a lot of noise in the model, and it and seems to have overfitted on the training data because there is a significant difference in performance between train and validation.

In [ ]:
from sklearn.metrics import roc_curve

from matplotlib import pyplot


# predict probabilities
yhat2 = model2.predict(X_test)
# keep probabilities for the positive outcome only
yhat2 = yhat2[:, 0]
# calculate roc curves
fpr, tpr, thresholds2 = roc_curve(y_test, yhat2)
# calculate the g-mean for each threshold
gmeans2 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans2)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds2[ix], gmeans2[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step
Best Threshold=0.228691, G-Mean=0.698
No description has been provided for this image
In [ ]:
y_pred_e2=model2.predict(X_test)
y_pred_e2 = (y_pred_e2 > thresholds2[ix])
y_pred_e2
120/120 [==============================] - 0s 2ms/step
Out[ ]:
array([[False],
       [ True],
       [False],
       ...,
       [False],
       [False],
       [False]])
In [ ]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm2=confusion_matrix(y_test, y_pred_e2)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm2,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
No description has been provided for this image
In [ ]:
#Accuracy as per the classification report
from sklearn import metrics
cr2=metrics.classification_report(y_test,y_pred_e2)
print(cr2)
              precision    recall  f1-score   support

           0       0.87      0.75      0.81      2877
           1       0.47      0.65      0.54       955

    accuracy                           0.73      3832
   macro avg       0.67      0.70      0.67      3832
weighted avg       0.77      0.73      0.74      3832

The Train and Validation curves seem to show overfitting despite having a good F1 score.

Let's try to use the Dropout technique and check to see if it can reduce the False Negative rate.

Model 4¶

In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
model3 = Sequential()
model3.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
model3.add(Dropout(0.2))
model3.add(Dense(128,activation='relu'))
model3.add(Dropout(0.2))
model3.add(Dense(64,activation='relu'))
model3.add(Dropout(0.2))
model3.add(Dense(32,activation='relu'))
model3.add(Dense(1, activation = 'sigmoid'))
In [ ]:
model3.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 256)               2560      
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 128)               32896     
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense_2 (Dense)             (None, 64)                8256      
                                                                 
 dropout_2 (Dropout)         (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 32)                2080      
                                                                 
 dense_4 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 45825 (179.00 KB)
Trainable params: 45825 (179.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
optimizer = tf.keras.optimizers.Adam(0.001)
model3.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
In [ ]:
history_3 = model3.fit(X_train,y_train,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50
192/192 [==============================] - 3s 5ms/step - loss: 0.6245 - accuracy: 0.7349 - val_loss: 0.5638 - val_accuracy: 0.7515
Epoch 2/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5763 - accuracy: 0.7480 - val_loss: 0.5716 - val_accuracy: 0.7515
Epoch 3/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5708 - accuracy: 0.7498 - val_loss: 0.5578 - val_accuracy: 0.7515
Epoch 4/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5607 - accuracy: 0.7504 - val_loss: 0.5542 - val_accuracy: 0.7515
Epoch 5/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5575 - accuracy: 0.7502 - val_loss: 0.5534 - val_accuracy: 0.7515
Epoch 6/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5561 - accuracy: 0.7504 - val_loss: 0.5482 - val_accuracy: 0.7515
Epoch 7/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5524 - accuracy: 0.7504 - val_loss: 0.5420 - val_accuracy: 0.7515
Epoch 8/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5511 - accuracy: 0.7503 - val_loss: 0.5492 - val_accuracy: 0.7515
Epoch 9/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5484 - accuracy: 0.7505 - val_loss: 0.5377 - val_accuracy: 0.7515
Epoch 10/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5443 - accuracy: 0.7504 - val_loss: 0.5390 - val_accuracy: 0.7515
Epoch 11/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5424 - accuracy: 0.7503 - val_loss: 0.5482 - val_accuracy: 0.7515
Epoch 12/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5418 - accuracy: 0.7502 - val_loss: 0.5393 - val_accuracy: 0.7515
Epoch 13/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5420 - accuracy: 0.7504 - val_loss: 0.5325 - val_accuracy: 0.7515
Epoch 14/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5402 - accuracy: 0.7503 - val_loss: 0.5395 - val_accuracy: 0.7515
Epoch 15/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5391 - accuracy: 0.7504 - val_loss: 0.5308 - val_accuracy: 0.7515
Epoch 16/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5372 - accuracy: 0.7504 - val_loss: 0.5257 - val_accuracy: 0.7515
Epoch 17/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5320 - accuracy: 0.7500 - val_loss: 0.5254 - val_accuracy: 0.7515
Epoch 18/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5316 - accuracy: 0.7510 - val_loss: 0.5213 - val_accuracy: 0.7515
Epoch 19/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5298 - accuracy: 0.7502 - val_loss: 0.5237 - val_accuracy: 0.7524
Epoch 20/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5276 - accuracy: 0.7526 - val_loss: 0.5205 - val_accuracy: 0.7528
Epoch 21/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5211 - accuracy: 0.7510 - val_loss: 0.5208 - val_accuracy: 0.7626
Epoch 22/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5205 - accuracy: 0.7529 - val_loss: 0.5123 - val_accuracy: 0.7554
Epoch 23/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5222 - accuracy: 0.7562 - val_loss: 0.5088 - val_accuracy: 0.7596
Epoch 24/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5188 - accuracy: 0.7563 - val_loss: 0.5043 - val_accuracy: 0.7661
Epoch 25/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5145 - accuracy: 0.7553 - val_loss: 0.5007 - val_accuracy: 0.7733
Epoch 26/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5114 - accuracy: 0.7570 - val_loss: 0.5030 - val_accuracy: 0.7671
Epoch 27/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5114 - accuracy: 0.7595 - val_loss: 0.4977 - val_accuracy: 0.7704
Epoch 28/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5125 - accuracy: 0.7598 - val_loss: 0.5018 - val_accuracy: 0.7720
Epoch 29/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5141 - accuracy: 0.7572 - val_loss: 0.5052 - val_accuracy: 0.7655
Epoch 30/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5084 - accuracy: 0.7604 - val_loss: 0.5040 - val_accuracy: 0.7658
Epoch 31/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5063 - accuracy: 0.7620 - val_loss: 0.5007 - val_accuracy: 0.7616
Epoch 32/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5063 - accuracy: 0.7602 - val_loss: 0.5087 - val_accuracy: 0.7674
Epoch 33/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5049 - accuracy: 0.7631 - val_loss: 0.5160 - val_accuracy: 0.7697
Epoch 34/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5077 - accuracy: 0.7615 - val_loss: 0.5058 - val_accuracy: 0.7658
Epoch 35/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5030 - accuracy: 0.7619 - val_loss: 0.5005 - val_accuracy: 0.7665
Epoch 36/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4993 - accuracy: 0.7631 - val_loss: 0.5014 - val_accuracy: 0.7694
Epoch 37/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5032 - accuracy: 0.7624 - val_loss: 0.5223 - val_accuracy: 0.7528
Epoch 38/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5012 - accuracy: 0.7617 - val_loss: 0.5056 - val_accuracy: 0.7652
Epoch 39/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5001 - accuracy: 0.7660 - val_loss: 0.4976 - val_accuracy: 0.7733
Epoch 40/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5006 - accuracy: 0.7635 - val_loss: 0.4955 - val_accuracy: 0.7753
Epoch 41/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4984 - accuracy: 0.7636 - val_loss: 0.5039 - val_accuracy: 0.7642
Epoch 42/50
192/192 [==============================] - 1s 4ms/step - loss: 0.4985 - accuracy: 0.7663 - val_loss: 0.4957 - val_accuracy: 0.7701
Epoch 43/50
192/192 [==============================] - 1s 4ms/step - loss: 0.4984 - accuracy: 0.7662 - val_loss: 0.4946 - val_accuracy: 0.7694
Epoch 44/50
192/192 [==============================] - 1s 4ms/step - loss: 0.4988 - accuracy: 0.7686 - val_loss: 0.5082 - val_accuracy: 0.7626
Epoch 45/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4995 - accuracy: 0.7651 - val_loss: 0.5048 - val_accuracy: 0.7629
Epoch 46/50
192/192 [==============================] - 1s 6ms/step - loss: 0.4959 - accuracy: 0.7673 - val_loss: 0.5058 - val_accuracy: 0.7717
Epoch 47/50
192/192 [==============================] - 1s 7ms/step - loss: 0.4969 - accuracy: 0.7680 - val_loss: 0.4990 - val_accuracy: 0.7668
Epoch 48/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4978 - accuracy: 0.7660 - val_loss: 0.4958 - val_accuracy: 0.7769
Epoch 49/50
192/192 [==============================] - 1s 4ms/step - loss: 0.4951 - accuracy: 0.7680 - val_loss: 0.5011 - val_accuracy: 0.7658
Epoch 50/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4949 - accuracy: 0.7676 - val_loss: 0.4969 - val_accuracy: 0.7727
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_3.history['loss'])
plt.plot(history_3.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
No description has been provided for this image

From the above plot, we observe that both the curves train and validation are smooth.

In [ ]:
from sklearn.metrics import roc_curve

from matplotlib import pyplot


# predict probabilities
yhat3 = model3.predict(X_test)
# keep probabilities for the positive outcome only
yhat3 = yhat3[:, 0]
# calculate roc curves
fpr, tpr, thresholds3 = roc_curve(y_test, yhat3)
# calculate the g-mean for each threshold
gmeans3 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans3)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds3[ix], gmeans3[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step
Best Threshold=0.223736, G-Mean=0.699
No description has been provided for this image
In [ ]:
y_pred_e3=model3.predict(X_test)
y_pred_e3 = (y_pred_e3 > thresholds3[ix])
y_pred_e3
120/120 [==============================] - 0s 2ms/step
Out[ ]:
array([[False],
       [ True],
       [False],
       ...,
       [False],
       [False],
       [False]])
In [ ]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm3=confusion_matrix(y_test, y_pred_e3)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm3,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
No description has been provided for this image
In [ ]:
#Accuracy as per the classification report
from sklearn import metrics
cr3=metrics.classification_report(y_test,y_pred_e3)
print(cr3)
              precision    recall  f1-score   support

           0       0.87      0.72      0.79      2877
           1       0.45      0.67      0.54       955

    accuracy                           0.71      3832
   macro avg       0.66      0.70      0.66      3832
weighted avg       0.76      0.71      0.73      3832

The Dropout technique helped the model reduce the loss function of both train and validation. The F1 score also seems to be fine, with a decrease in the False Negative rate.

Now, let's try to use some of the Hyperparameter Optimization techniques we have learnt, such as RandomizedSearchCV, GridSearchCV and Keras Tuner to increase the F1 score of the model.

Model 5¶

Random Search CV¶

Some important hyperparameters to look out for while optimizing neural networks are:

  • Type of Architecture

  • Number of Layers

  • Number of Neurons in a layer

  • Regularization hyperparameters

  • Learning Rate

  • Type of Optimizer

  • Dropout Rate

In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
def create_model_v4():
    np.random.seed(1337)
    model = Sequential()
    model.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
    model.add(Dropout(0.3))
    #model.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(64,activation='relu'))
    model.add(Dropout(0.2))
    #model.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
    #model.add(Dropout(0.3))
    model.add(Dense(32,activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    #compile model
    optimizer = tf.keras.optimizers.Adam()
    model.compile(optimizer = optimizer,loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model

We are using Random search to optimize two hyperparameters - Batch size & Learning Rate.

You can also optimize other hyperparameters as mentioned above.

In [ ]:
keras_estimator = KerasClassifier(build_fn=create_model_v4, optimizer="Adam", verbose=1)
# define the grid search parameters
learn_rate = [0.01, 0.1, 0.001]
batch_size = [32, 64, 128]
param_random = dict(optimizer__learning_rate=learn_rate, batch_size=batch_size)

kfold_splits = 3
random= RandomizedSearchCV(estimator=keras_estimator,
                    verbose=1,
                    cv=kfold_splits,
                    param_distributions=param_random,n_jobs=-1)
In [ ]:
random_result = random.fit(X_train, y_train,validation_split=0.2,verbose=1)

# Summarize results
print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))
means = random_result.cv_results_['mean_test_score']
stds = random_result.cv_results_['std_test_score']
params = random_result.cv_results_['params']
Fitting 3 folds for each of 9 candidates, totalling 27 fits
384/384 [==============================] - 5s 7ms/step - loss: 0.6308 - accuracy: 0.7337 - val_loss: 0.5739 - val_accuracy: 0.7515
Best: 0.750620 using {'optimizer__learning_rate': 0.01, 'batch_size': 32}
In [ ]:
estimator_v4=create_model_v4()

estimator_v4.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_5 (Dense)             (None, 256)               2560      
                                                                 
 dropout_3 (Dropout)         (None, 256)               0         
                                                                 
 dense_6 (Dense)             (None, 128)               32896     
                                                                 
 dropout_4 (Dropout)         (None, 128)               0         
                                                                 
 dense_7 (Dense)             (None, 64)                8256      
                                                                 
 dropout_5 (Dropout)         (None, 64)                0         
                                                                 
 dense_8 (Dense)             (None, 32)                2080      
                                                                 
 dense_9 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 45825 (179.00 KB)
Trainable params: 45825 (179.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
optimizer = tf.keras.optimizers.Adam()
estimator_v4.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
history_4=estimator_v4.fit(X_train, y_train, epochs=50, batch_size = 64, verbose=1,validation_split=0.2)
Epoch 1/50
192/192 [==============================] - 2s 5ms/step - loss: 0.6414 - accuracy: 0.7316 - val_loss: 0.5798 - val_accuracy: 0.7515
Epoch 2/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5813 - accuracy: 0.7496 - val_loss: 0.5842 - val_accuracy: 0.7515
Epoch 3/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5700 - accuracy: 0.7494 - val_loss: 0.5637 - val_accuracy: 0.7515
Epoch 4/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5643 - accuracy: 0.7502 - val_loss: 0.5583 - val_accuracy: 0.7515
Epoch 5/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5619 - accuracy: 0.7503 - val_loss: 0.5593 - val_accuracy: 0.7515
Epoch 6/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5574 - accuracy: 0.7503 - val_loss: 0.5598 - val_accuracy: 0.7515
Epoch 7/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5536 - accuracy: 0.7502 - val_loss: 0.5454 - val_accuracy: 0.7515
Epoch 8/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5525 - accuracy: 0.7505 - val_loss: 0.5472 - val_accuracy: 0.7515
Epoch 9/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5515 - accuracy: 0.7503 - val_loss: 0.5404 - val_accuracy: 0.7515
Epoch 10/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5482 - accuracy: 0.7504 - val_loss: 0.5384 - val_accuracy: 0.7515
Epoch 11/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5468 - accuracy: 0.7504 - val_loss: 0.5470 - val_accuracy: 0.7515
Epoch 12/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5433 - accuracy: 0.7504 - val_loss: 0.5352 - val_accuracy: 0.7515
Epoch 13/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5435 - accuracy: 0.7504 - val_loss: 0.5317 - val_accuracy: 0.7515
Epoch 14/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5438 - accuracy: 0.7503 - val_loss: 0.5449 - val_accuracy: 0.7515
Epoch 15/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5418 - accuracy: 0.7504 - val_loss: 0.5403 - val_accuracy: 0.7515
Epoch 16/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5392 - accuracy: 0.7504 - val_loss: 0.5295 - val_accuracy: 0.7515
Epoch 17/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5390 - accuracy: 0.7504 - val_loss: 0.5292 - val_accuracy: 0.7515
Epoch 18/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5360 - accuracy: 0.7504 - val_loss: 0.5278 - val_accuracy: 0.7515
Epoch 19/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5350 - accuracy: 0.7504 - val_loss: 0.5269 - val_accuracy: 0.7515
Epoch 20/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5348 - accuracy: 0.7502 - val_loss: 0.5229 - val_accuracy: 0.7515
Epoch 21/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5288 - accuracy: 0.7498 - val_loss: 0.5393 - val_accuracy: 0.7596
Epoch 22/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5282 - accuracy: 0.7522 - val_loss: 0.5195 - val_accuracy: 0.7518
Epoch 23/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5259 - accuracy: 0.7497 - val_loss: 0.5098 - val_accuracy: 0.7521
Epoch 24/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5231 - accuracy: 0.7514 - val_loss: 0.5060 - val_accuracy: 0.7580
Epoch 25/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5193 - accuracy: 0.7552 - val_loss: 0.5147 - val_accuracy: 0.7697
Epoch 26/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5176 - accuracy: 0.7557 - val_loss: 0.5076 - val_accuracy: 0.7596
Epoch 27/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5193 - accuracy: 0.7567 - val_loss: 0.5014 - val_accuracy: 0.7710
Epoch 28/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5215 - accuracy: 0.7565 - val_loss: 0.5038 - val_accuracy: 0.7626
Epoch 29/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5191 - accuracy: 0.7560 - val_loss: 0.5132 - val_accuracy: 0.7652
Epoch 30/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5154 - accuracy: 0.7525 - val_loss: 0.5060 - val_accuracy: 0.7652
Epoch 31/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5131 - accuracy: 0.7586 - val_loss: 0.5011 - val_accuracy: 0.7674
Epoch 32/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5109 - accuracy: 0.7613 - val_loss: 0.5035 - val_accuracy: 0.7701
Epoch 33/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5121 - accuracy: 0.7578 - val_loss: 0.5050 - val_accuracy: 0.7727
Epoch 34/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5109 - accuracy: 0.7613 - val_loss: 0.5095 - val_accuracy: 0.7665
Epoch 35/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5073 - accuracy: 0.7644 - val_loss: 0.5054 - val_accuracy: 0.7720
Epoch 36/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5099 - accuracy: 0.7621 - val_loss: 0.4966 - val_accuracy: 0.7697
Epoch 37/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5069 - accuracy: 0.7591 - val_loss: 0.5191 - val_accuracy: 0.7606
Epoch 38/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5068 - accuracy: 0.7610 - val_loss: 0.4989 - val_accuracy: 0.7701
Epoch 39/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5089 - accuracy: 0.7621 - val_loss: 0.4977 - val_accuracy: 0.7717
Epoch 40/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5032 - accuracy: 0.7585 - val_loss: 0.5053 - val_accuracy: 0.7655
Epoch 41/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5037 - accuracy: 0.7615 - val_loss: 0.5049 - val_accuracy: 0.7616
Epoch 42/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5081 - accuracy: 0.7632 - val_loss: 0.5035 - val_accuracy: 0.7674
Epoch 43/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5054 - accuracy: 0.7633 - val_loss: 0.5020 - val_accuracy: 0.7671
Epoch 44/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5042 - accuracy: 0.7628 - val_loss: 0.4991 - val_accuracy: 0.7701
Epoch 45/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5029 - accuracy: 0.7636 - val_loss: 0.4999 - val_accuracy: 0.7599
Epoch 46/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5019 - accuracy: 0.7637 - val_loss: 0.4988 - val_accuracy: 0.7704
Epoch 47/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5039 - accuracy: 0.7629 - val_loss: 0.5096 - val_accuracy: 0.7648
Epoch 48/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5053 - accuracy: 0.7642 - val_loss: 0.4981 - val_accuracy: 0.7697
Epoch 49/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5000 - accuracy: 0.7649 - val_loss: 0.4930 - val_accuracy: 0.7733
Epoch 50/50
192/192 [==============================] - 1s 5ms/step - loss: 0.4996 - accuracy: 0.7664 - val_loss: 0.4963 - val_accuracy: 0.7674
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_4.history['loss'])
plt.plot(history_4.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
No description has been provided for this image

From the above plot, we observe that there is noise in the training behavior of the model.

In [ ]:
from sklearn.metrics import roc_curve

from matplotlib import pyplot


# predict probabilities
yhat4 = estimator_v4.predict(X_test)
# keep probabilities for the positive outcome only
yhat4 = yhat4[:, 0]
# calculate roc curves
fpr, tpr, thresholds4 = roc_curve(y_test, yhat4)
# calculate the g-mean for each threshold
gmeans4 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans4)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds4[ix], gmeans4[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step
Best Threshold=0.235663, G-Mean=0.707
No description has been provided for this image
In [ ]:
y_pred_e4=estimator_v4.predict(X_test)
y_pred_e4 = (y_pred_e4 > thresholds4[ix])
y_pred_e4
120/120 [==============================] - 0s 2ms/step
Out[ ]:
array([[False],
       [ True],
       [False],
       ...,
       [False],
       [False],
       [False]])
In [ ]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm4=confusion_matrix(y_test, y_pred_e4)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm4,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
No description has been provided for this image
In [ ]:
#Accuracy as per the classification report
from sklearn import metrics
cr4=metrics.classification_report(y_test,y_pred_e4)
print(cr4)
              precision    recall  f1-score   support

           0       0.87      0.76      0.81      2877
           1       0.48      0.66      0.55       955

    accuracy                           0.73      3832
   macro avg       0.67      0.71      0.68      3832
weighted avg       0.77      0.73      0.75      3832

Hyperparameter tuning is used here to get a better F1 score, but the F1 score may differ each time.

Other hyperparameters can also be tuned to get better performance on the metrics.

Here, the F1 score of the model has decreased in comparison to the previous best performance, as Random Search CV will choose the hyperparameters randomly, and hence has a very low chance of finding a highly optimal configuration.

Let's use the more exhaustive Grid Search CV and see if the F1 score increases.

Model 6¶

Grid Search CV¶

Some important hyperparameters to look out for while optimizing neural networks are:

  • Type of Architecture

  • Number of Layers

  • Number of Neurons in a layer

  • Regularization hyperparameters

  • Learning Rate

  • Type of Optimizer

  • Dropout Rate

In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
def create_model_v5():
    np.random.seed(1337)
    model = Sequential()
    model.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
    model.add(Dropout(0.3))
    #model.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(64,activation='relu'))
    model.add(Dropout(0.2))
    #model.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
    #model.add(Dropout(0.3))
    model.add(Dense(32,activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    #compile model
    optimizer = tf.keras.optimizers.Adam()
    model.compile(optimizer = optimizer,loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model

We're using Grid Search to optimize two hyperparameters - Batch Size & Learning Rate.

You can also optimize the other hyperparameters as mentioned above.

In [ ]:
keras_estimator = KerasClassifier(build_fn=create_model_v4, optimizer="Adam", verbose=1)
# define the grid search parameters
learn_rate = [0.01, 0.1, 0.001]
batch_size = [32, 64, 128]
param_grid = dict(optimizer__learning_rate=learn_rate, batch_size=batch_size)

kfold_splits = 3
grid = GridSearchCV(estimator=keras_estimator,
                    verbose=1,
                    cv=kfold_splits,
                    param_grid=param_grid,n_jobs=-1)
In [ ]:
import time

# store starting time
begin = time.time()


grid_result = grid.fit(X_train, y_train,validation_split=0.2,verbose=1)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

time.sleep(1)
# store end time
end = time.time()

# total time taken
print(f"Total runtime of the program is {end - begin}")
Fitting 3 folds for each of 9 candidates, totalling 27 fits
384/384 [==============================] - 4s 6ms/step - loss: 0.6308 - accuracy: 0.7337 - val_loss: 0.5739 - val_accuracy: 0.7515
Best: 0.750620 using {'batch_size': 32, 'optimizer__learning_rate': 0.01}
Total runtime of the program is 93.2235975265503

The best model has the following configuration:

( It may vary each time the code runs )

Result of Grid Search

{'batch_size': 64, 'learning_rate": 0.01}

Let's create the final model with the above mentioned configuration

In [ ]:
estimator_v5=create_model_v5()

estimator_v5.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_5 (Dense)             (None, 256)               2560      
                                                                 
 dropout_3 (Dropout)         (None, 256)               0         
                                                                 
 dense_6 (Dense)             (None, 128)               32896     
                                                                 
 dropout_4 (Dropout)         (None, 128)               0         
                                                                 
 dense_7 (Dense)             (None, 64)                8256      
                                                                 
 dropout_5 (Dropout)         (None, 64)                0         
                                                                 
 dense_8 (Dense)             (None, 32)                2080      
                                                                 
 dense_9 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 45825 (179.00 KB)
Trainable params: 45825 (179.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
optimizer = tf.keras.optimizers.Adam(grid_result.best_params_['optimizer__learning_rate'])
estimator_v5.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
history_5=estimator_v5.fit(X_train, y_train, epochs=50, batch_size = 64, verbose=1,validation_split=0.2)
Epoch 1/50
192/192 [==============================] - 3s 5ms/step - loss: 0.6093 - accuracy: 0.7413 - val_loss: 0.5598 - val_accuracy: 0.7515
Epoch 2/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5624 - accuracy: 0.7503 - val_loss: 0.5598 - val_accuracy: 0.7515
Epoch 3/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5588 - accuracy: 0.7502 - val_loss: 0.5516 - val_accuracy: 0.7515
Epoch 4/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5555 - accuracy: 0.7503 - val_loss: 0.5450 - val_accuracy: 0.7515
Epoch 5/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5544 - accuracy: 0.7504 - val_loss: 0.5481 - val_accuracy: 0.7515
Epoch 6/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5573 - accuracy: 0.7503 - val_loss: 0.5630 - val_accuracy: 0.7515
Epoch 7/50
192/192 [==============================] - 1s 8ms/step - loss: 0.5553 - accuracy: 0.7503 - val_loss: 0.5488 - val_accuracy: 0.7515
Epoch 8/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5531 - accuracy: 0.7500 - val_loss: 0.5421 - val_accuracy: 0.7515
Epoch 9/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5478 - accuracy: 0.7495 - val_loss: 0.5449 - val_accuracy: 0.7515
Epoch 10/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5483 - accuracy: 0.7491 - val_loss: 0.5484 - val_accuracy: 0.7515
Epoch 11/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5496 - accuracy: 0.7506 - val_loss: 0.5499 - val_accuracy: 0.7515
Epoch 12/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5501 - accuracy: 0.7501 - val_loss: 0.5415 - val_accuracy: 0.7515
Epoch 13/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5483 - accuracy: 0.7495 - val_loss: 0.5361 - val_accuracy: 0.7515
Epoch 14/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5471 - accuracy: 0.7498 - val_loss: 0.5452 - val_accuracy: 0.7515
Epoch 15/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5487 - accuracy: 0.7503 - val_loss: 0.5406 - val_accuracy: 0.7518
Epoch 16/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5485 - accuracy: 0.7503 - val_loss: 0.5495 - val_accuracy: 0.7515
Epoch 17/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5453 - accuracy: 0.7503 - val_loss: 0.5353 - val_accuracy: 0.7515
Epoch 18/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5458 - accuracy: 0.7497 - val_loss: 0.5419 - val_accuracy: 0.7515
Epoch 19/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5484 - accuracy: 0.7495 - val_loss: 0.5386 - val_accuracy: 0.7515
Epoch 20/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5446 - accuracy: 0.7500 - val_loss: 0.5338 - val_accuracy: 0.7515
Epoch 21/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5430 - accuracy: 0.7497 - val_loss: 0.5317 - val_accuracy: 0.7515
Epoch 22/50
192/192 [==============================] - 1s 8ms/step - loss: 0.5456 - accuracy: 0.7507 - val_loss: 0.5344 - val_accuracy: 0.7515
Epoch 23/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5446 - accuracy: 0.7500 - val_loss: 0.5378 - val_accuracy: 0.7515
Epoch 24/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5451 - accuracy: 0.7498 - val_loss: 0.5320 - val_accuracy: 0.7515
Epoch 25/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5484 - accuracy: 0.7498 - val_loss: 0.5266 - val_accuracy: 0.7515
Epoch 26/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5421 - accuracy: 0.7489 - val_loss: 0.5257 - val_accuracy: 0.7515
Epoch 27/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5457 - accuracy: 0.7494 - val_loss: 0.5309 - val_accuracy: 0.7515
Epoch 28/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5501 - accuracy: 0.7500 - val_loss: 0.5359 - val_accuracy: 0.7515
Epoch 29/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5475 - accuracy: 0.7489 - val_loss: 0.5359 - val_accuracy: 0.7518
Epoch 30/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5420 - accuracy: 0.7499 - val_loss: 0.5281 - val_accuracy: 0.7515
Epoch 31/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5421 - accuracy: 0.7482 - val_loss: 0.5216 - val_accuracy: 0.7515
Epoch 32/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5416 - accuracy: 0.7507 - val_loss: 0.5475 - val_accuracy: 0.7515
Epoch 33/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5408 - accuracy: 0.7498 - val_loss: 0.5469 - val_accuracy: 0.7515
Epoch 34/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5472 - accuracy: 0.7492 - val_loss: 0.5219 - val_accuracy: 0.7511
Epoch 35/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5396 - accuracy: 0.7500 - val_loss: 0.5275 - val_accuracy: 0.7515
Epoch 36/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5470 - accuracy: 0.7507 - val_loss: 0.5505 - val_accuracy: 0.7515
Epoch 37/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5502 - accuracy: 0.7488 - val_loss: 0.5500 - val_accuracy: 0.7515
Epoch 38/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5497 - accuracy: 0.7505 - val_loss: 0.5362 - val_accuracy: 0.7518
Epoch 39/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5456 - accuracy: 0.7509 - val_loss: 0.5466 - val_accuracy: 0.7515
Epoch 40/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5469 - accuracy: 0.7515 - val_loss: 0.5229 - val_accuracy: 0.7515
Epoch 41/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5383 - accuracy: 0.7498 - val_loss: 0.5477 - val_accuracy: 0.7515
Epoch 42/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5475 - accuracy: 0.7510 - val_loss: 0.5384 - val_accuracy: 0.7511
Epoch 43/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5478 - accuracy: 0.7507 - val_loss: 0.5323 - val_accuracy: 0.7515
Epoch 44/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5454 - accuracy: 0.7511 - val_loss: 0.5328 - val_accuracy: 0.7515
Epoch 45/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5389 - accuracy: 0.7512 - val_loss: 0.5344 - val_accuracy: 0.7521
Epoch 46/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5471 - accuracy: 0.7502 - val_loss: 0.5398 - val_accuracy: 0.7511
Epoch 47/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5480 - accuracy: 0.7506 - val_loss: 0.5403 - val_accuracy: 0.7505
Epoch 48/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5425 - accuracy: 0.7501 - val_loss: 0.5454 - val_accuracy: 0.7515
Epoch 49/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5452 - accuracy: 0.7503 - val_loss: 0.5334 - val_accuracy: 0.7515
Epoch 50/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5436 - accuracy: 0.7517 - val_loss: 0.5489 - val_accuracy: 0.7515
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_5.history['loss'])
plt.plot(history_5.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
No description has been provided for this image

From the above plot, we observe that both curves - train and validation, are smooth.

In [ ]:
from sklearn.metrics import roc_curve

from matplotlib import pyplot


# predict probabilities
yhat5 = estimator_v5.predict(X_test)
# keep probabilities for the positive outcome only
yhat5 = yhat5[:, 0]
# calculate roc curves
fpr, tpr, thresholds5 = roc_curve(y_test, yhat5)
# calculate the g-mean for each threshold
gmeans5 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans5)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds5[ix], gmeans5[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step
Best Threshold=0.258756, G-Mean=0.509
No description has been provided for this image
In [ ]:
y_pred_e5=estimator_v5.predict(X_test)
y_pred_e5 = (y_pred_e5 > thresholds5[ix])
y_pred_e5
120/120 [==============================] - 0s 2ms/step
Out[ ]:
array([[ True],
       [ True],
       [ True],
       ...,
       [False],
       [False],
       [False]])
In [ ]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm5=confusion_matrix(y_test, y_pred_e5)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm5,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
No description has been provided for this image
In [ ]:
#Accuracy as per the classification report
from sklearn import metrics
cr5=metrics.classification_report(y_test,y_pred_e5)
print(cr5)
              precision    recall  f1-score   support

           0       0.77      0.70      0.73      2877
           1       0.29      0.37      0.32       955

    accuracy                           0.62      3832
   macro avg       0.53      0.53      0.53      3832
weighted avg       0.65      0.62      0.63      3832

Hyperparameter tuning with Grid Search has been used here to get a better F1 score, but the F1 score might differ each time.

Other hyperparameters can also be tuned to get better metrics.

Here, the F1 score of the model, while better than in Randomized Search, is slightly lower than in Model 4 (the Dropout model).

Dask¶

  • There is also another library called Dask, sometimes used in the industry to provide a performance boost to Hyperparameter Tuning due to its parallelized computing procedure.
  • Dask also has the option of implementing Grid Search similar to the Grid Search in Scikit-learn.

You may install the Dask library in Anaconda prompt using the below code:

  • !pip install dask-ml --user
In [ ]:
# Try below code to install dask in Google Colab
!pip install dask-ml
Requirement already satisfied: dask-ml in /usr/local/lib/python3.10/dist-packages (2023.3.24)
Requirement already satisfied: dask[array,dataframe]>=2.4.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (2023.8.1)
Requirement already satisfied: distributed>=2.4.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (2023.8.1)
Requirement already satisfied: numba>=0.51.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (0.58.1)
Requirement already satisfied: numpy>=1.20.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.23.5)
Requirement already satisfied: pandas>=0.24.2 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.5.3)
Requirement already satisfied: scikit-learn>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.2.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.11.4)
Requirement already satisfied: dask-glm>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (0.3.2)
Requirement already satisfied: multipledispatch>=0.4.9 in /usr/local/lib/python3.10/dist-packages (from dask-ml) (1.0.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from dask-ml) (23.2)
Requirement already satisfied: cloudpickle>=0.2.2 in /usr/local/lib/python3.10/dist-packages (from dask-glm>=0.2.0->dask-ml) (2.2.1)
Requirement already satisfied: sparse>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from dask-glm>=0.2.0->dask-ml) (0.15.1)
Requirement already satisfied: click>=8.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (8.1.7)
Requirement already satisfied: fsspec>=2021.09.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (2023.6.0)
Requirement already satisfied: partd>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (1.4.1)
Requirement already satisfied: pyyaml>=5.3.1 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (6.0.1)
Requirement already satisfied: toolz>=0.10.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (0.12.0)
Requirement already satisfied: importlib-metadata>=4.13.0 in /usr/local/lib/python3.10/dist-packages (from dask[array,dataframe]>=2.4.0->dask-ml) (7.0.1)
Requirement already satisfied: jinja2>=2.10.3 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (3.1.3)
Requirement already satisfied: locket>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (1.0.0)
Requirement already satisfied: msgpack>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (1.0.7)
Requirement already satisfied: psutil>=5.7.2 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (5.9.5)
Requirement already satisfied: sortedcontainers>=2.0.5 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (2.4.0)
Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (3.0.0)
Requirement already satisfied: tornado>=6.0.4 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (6.3.2)
Requirement already satisfied: urllib3>=1.24.3 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (2.0.7)
Requirement already satisfied: zict>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from distributed>=2.4.0->dask-ml) (3.0.0)
Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51.0->dask-ml) (0.41.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.2->dask-ml) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24.2->dask-ml) (2023.3.post1)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.2.0->dask-ml) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.2.0->dask-ml) (3.2.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.10/dist-packages (from importlib-metadata>=4.13.0->dask[array,dataframe]>=2.4.0->dask-ml) (3.17.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2>=2.10.3->distributed>=2.4.0->dask-ml) (2.1.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas>=0.24.2->dask-ml) (1.16.0)
In [ ]:
# importing library
from dask_ml.model_selection import GridSearchCV as DaskGridSearchCV

Try to run the code twice if you encounter any error while improting Dask

  • Dask is the same as regular Grid Search in its functioning.
  • We just have to change the function from GridSearchCV to DaskGridSearchCV.
In [ ]:
def create_model_v6():
    np.random.seed(1337)
    model = Sequential()
    model.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
    model.add(Dropout(0.3))
    #model.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(64,activation='relu'))
    model.add(Dropout(0.2))
    #model.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
    #model.add(Dropout(0.3))
    model.add(Dense(32,activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    #compile model
    optimizer = tf.keras.optimizers.Adam()
    model.compile(optimizer = optimizer,loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model
In [ ]:
keras_estimator = KerasClassifier(build_fn=create_model_v6, verbose=1)
# define the grid search parameters
learn_rate = [0.01, 0.1, 0.001]
batch_size = [32, 64, 128]
param_grid = dict(optimizer__learning_rate=learn_rate, batch_size=batch_size)

kfold_splits = 3
dask = DaskGridSearchCV(estimator=keras_estimator,
                    cv=kfold_splits,
                    param_grid=param_grid,n_jobs=-1)
In [ ]:
import time

# store starting time
begin = time.time()


dask_result = dask.fit(X_train, y_train,validation_split=0.2,verbose=1)

# Summarize results
print("Best: %f using %s" % (dask_result.best_score_, dask_result.best_params_))
means = dask_result.cv_results_['mean_test_score']
stds = dask_result.cv_results_['std_test_score']
params = dask_result.cv_results_['params']

time.sleep(1)
# store end time
end = time.time()

# total time taken
print(f"Total runtime of the program is {end - begin}")
256/256 [==============================] - 7s 14ms/step - loss: 0.6593 - accuracy: 0.7196 - val_loss: 0.5841 - val_accuracy: 0.7529
256/256 [==============================] - 7s 14ms/step - loss: 0.7176 - accuracy: 0.7145 - val_loss: 0.5892 - val_accuracy: 0.7529
160/160 [==============================] - 1s 3ms/step
160/160 [==============================] - 1s 3ms/step
256/256 [==============================] - 5s 9ms/step - loss: 0.7053 - accuracy: 0.7071 - val_loss: 0.6040 - val_accuracy: 0.7534
256/256 [==============================] - 5s 8ms/step - loss: 0.6678 - accuracy: 0.7171 - val_loss: 0.5782 - val_accuracy: 0.7529
160/160 [==============================] - 0s 2ms/step
160/160 [==============================] - 1s 4ms/step
256/256 [==============================] - 6s 9ms/step - loss: 0.6752 - accuracy: 0.7170 - val_loss: 0.5914 - val_accuracy: 0.7529
160/160 [==============================] - 1s 3ms/step
256/256 [==============================] - 7s 10ms/step - loss: 0.6436 - accuracy: 0.7315 - val_loss: 0.5745 - val_accuracy: 0.7534
160/160 [==============================] - 1s 3ms/step
256/256 [==============================] - 5s 11ms/step - loss: 0.6765 - accuracy: 0.7112 - val_loss: 0.5897 - val_accuracy: 0.7529
160/160 [==============================] - 0s 2ms/step
256/256 [==============================] - 5s 10ms/step - loss: 0.6626 - accuracy: 0.7225 - val_loss: 0.6091 - val_accuracy: 0.7529
160/160 [==============================] - 1s 6ms/step
256/256 [==============================] - 5s 10ms/step - loss: 0.7261 - accuracy: 0.7096 - val_loss: 0.5780 - val_accuracy: 0.7534
160/160 [==============================] - 0s 3ms/step
128/128 [==============================] - 5s 10ms/step - loss: 0.7253 - accuracy: 0.6993 - val_loss: 0.5844 - val_accuracy: 0.7529
128/128 [==============================] - 5s 11ms/step - loss: 0.6936 - accuracy: 0.7100 - val_loss: 0.6266 - val_accuracy: 0.7485
80/80 [==============================] - 0s 4ms/step
80/80 [==============================] - 0s 2ms/step
128/128 [==============================] - 4s 11ms/step - loss: 0.6966 - accuracy: 0.7060 - val_loss: 0.5896 - val_accuracy: 0.7534
128/128 [==============================] - 4s 11ms/step - loss: 0.7000 - accuracy: 0.7131 - val_loss: 0.5938 - val_accuracy: 0.7529
80/80 [==============================] - 0s 2ms/step
80/80 [==============================] - 0s 4ms/step
128/128 [==============================] - 5s 11ms/step - loss: 0.7100 - accuracy: 0.7161 - val_loss: 0.5920 - val_accuracy: 0.7529
80/80 [==============================] - 0s 3ms/step
128/128 [==============================] - 5s 10ms/step - loss: 0.6674 - accuracy: 0.7191 - val_loss: 0.6174 - val_accuracy: 0.7534
80/80 [==============================] - 0s 4ms/step
128/128 [==============================] - 4s 9ms/step - loss: 0.7987 - accuracy: 0.6962 - val_loss: 0.5932 - val_accuracy: 0.7529
80/80 [==============================] - 0s 3ms/step
128/128 [==============================] - 4s 9ms/step - loss: 0.7365 - accuracy: 0.7010 - val_loss: 0.5929 - val_accuracy: 0.7529
80/80 [==============================] - 0s 5ms/step
128/128 [==============================] - 4s 13ms/step - loss: 0.6770 - accuracy: 0.7163 - val_loss: 0.6244 - val_accuracy: 0.7534
64/64 [==============================] - 4s 19ms/step - loss: 0.7037 - accuracy: 0.7044 - val_loss: 0.6116 - val_accuracy: 0.7529
80/80 [==============================] - 1s 4ms/step
40/40 [==============================] - 1s 6ms/step
64/64 [==============================] - 4s 13ms/step - loss: 0.7867 - accuracy: 0.6869 - val_loss: 0.5894 - val_accuracy: 0.7529
64/64 [==============================] - 4s 14ms/step - loss: 0.8034 - accuracy: 0.6915 - val_loss: 0.6499 - val_accuracy: 0.6825
40/40 [==============================] - 0s 2ms/step
40/40 [==============================] - 0s 3ms/step
64/64 [==============================] - 3s 13ms/step - loss: 0.6962 - accuracy: 0.7050 - val_loss: 0.5872 - val_accuracy: 0.7529
40/40 [==============================] - 0s 4ms/step
64/64 [==============================] - 3s 11ms/step - loss: 1.1132 - accuracy: 0.6487 - val_loss: 0.5882 - val_accuracy: 0.7529
40/40 [==============================] - 1s 19ms/step
64/64 [==============================] - 3s 15ms/step - loss: 0.6810 - accuracy: 0.7094 - val_loss: 0.6400 - val_accuracy: 0.7534
40/40 [==============================] - 0s 5ms/step
64/64 [==============================] - 4s 18ms/step - loss: 0.6977 - accuracy: 0.7139 - val_loss: 0.6101 - val_accuracy: 0.7529
40/40 [==============================] - 0s 4ms/step
64/64 [==============================] - 4s 12ms/step - loss: 0.8207 - accuracy: 0.6831 - val_loss: 0.5936 - val_accuracy: 0.7529
40/40 [==============================] - 0s 3ms/step
64/64 [==============================] - 4s 11ms/step - loss: 0.7796 - accuracy: 0.6939 - val_loss: 0.6233 - val_accuracy: 0.7534
40/40 [==============================] - 0s 2ms/step
384/384 [==============================] - 3s 5ms/step - loss: 0.6440 - accuracy: 0.7283 - val_loss: 0.5697 - val_accuracy: 0.7515
Best: 0.750620 using {'batch_size': 32, 'optimizer__learning_rate': 0.01}
Total runtime of the program is 86.2575147151947

Unfortunately, Dask took more time to run the model when compared to Grid Search CV, and this is because Dask has some requirements to perform well:

  • The dimension of the dataset should be large.
  • Dask shows a significant performance improvement in computation when the number and range of hyperparameters we are tuning is large.

Since the dataset dimensions and hyperparameter number/range were small for this example, Dask couldn't show a significant improvement.

We can also use another optimization technique - Keras Tuner.

In [ ]:
## Install Keras Tuner
!pip install keras-tuner
Requirement already satisfied: keras-tuner in /usr/local/lib/python3.10/dist-packages (1.4.6)
Requirement already satisfied: keras in /usr/local/lib/python3.10/dist-packages (from keras-tuner) (2.15.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from keras-tuner) (23.2)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from keras-tuner) (2.31.0)
Requirement already satisfied: kt-legacy in /usr/local/lib/python3.10/dist-packages (from keras-tuner) (1.0.5)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->keras-tuner) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->keras-tuner) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->keras-tuner) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->keras-tuner) (2023.11.17)

Keras Tuner¶

In [ ]:
from tensorflow import keras
from tensorflow.keras import layers
from kerastuner.tuners import RandomSearch
In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)

Hyperparameters

  • How many hidden layers should the model have?
  • How many neurons should the model have in each hidden layer?
  • Learning Rate
In [ ]:
def build_model(h):
    model = keras.Sequential()
    for i in range(h.Int('num_layers', 2, 10)):
        model.add(layers.Dense(units=h.Int('units_' + str(i),
                                            min_value=32,
                                            max_value=256,
                                            step=32),
                               activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(
        optimizer=keras.optimizers.Adam(
            h.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model

Initialize a tuner (here, RandomSearch). We use objective to specify the objective to select the best models, and we use max_trials to specify the number of different models to try.

In [ ]:
tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=3,
     project_name='Job_')
Reloading Tuner from ./Job_/tuner0.json
In [ ]:
tuner.search_space_summary()
Search space summary
Default search space size: 12
num_layers (Int)
{'default': None, 'conditions': [], 'min_value': 2, 'max_value': 10, 'step': 1, 'sampling': 'linear'}
units_0 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_1 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
learning_rate (Choice)
{'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True}
units_2 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_3 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_4 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_5 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_6 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_7 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_8 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_9 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
In [ ]:
### Searching the best model on X and y train
tuner.search(X_train, y_train,
             epochs=5,
             validation_split = 0.2)
In [ ]:
## Printing the best models with their hyperparameters
tuner.results_summary()
Results summary
Results in ./Job_
Showing 10 best trials
Objective(name="val_accuracy", direction="max")

Trial 3 summary
Hyperparameters:
num_layers: 5
units_0: 32
units_1: 64
learning_rate: 0.01
units_2: 96
units_3: 256
units_4: 256
units_5: 160
units_6: 192
units_7: 224
units_8: 224
Score: 0.7516851425170898

Trial 0 summary
Hyperparameters:
num_layers: 9
units_0: 224
units_1: 96
learning_rate: 0.001
units_2: 32
units_3: 32
units_4: 32
units_5: 32
units_6: 32
units_7: 32
units_8: 32
Score: 0.7514677047729492

Trial 2 summary
Hyperparameters:
num_layers: 9
units_0: 192
units_1: 64
learning_rate: 0.001
units_2: 160
units_3: 32
units_4: 224
units_5: 32
units_6: 256
units_7: 96
units_8: 192
Score: 0.7514677047729492

Trial 1 summary
Hyperparameters:
num_layers: 5
units_0: 160
units_1: 160
learning_rate: 0.001
units_2: 224
units_3: 128
units_4: 224
units_5: 64
units_6: 160
units_7: 64
units_8: 32
Score: 0.7514677047729492

Trial 4 summary
Hyperparameters:
num_layers: 10
units_0: 128
units_1: 32
learning_rate: 0.0001
units_2: 160
units_3: 160
units_4: 160
units_5: 224
units_6: 96
units_7: 128
units_8: 96
units_9: 32
Score: 0.7514677047729492

Model 7¶

  • Let's create a model with the above mentioned best configuration given by Keras Tuner.
In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
model7 = Sequential()
model7.add(Dense(160,activation='relu',kernel_initializer='he_uniform',input_dim = X_train.shape[1]))
model7.add(Dense(160,activation='relu',kernel_initializer='he_uniform'))
model7.add(Dense(224,activation='relu',kernel_initializer='he_uniform'))
model7.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model7.add(Dense(224,activation='relu',kernel_initializer='he_uniform'))
model7.add(Dense(1, activation = 'sigmoid'))
In [ ]:
model7.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 160)               1600      
                                                                 
 dense_1 (Dense)             (None, 160)               25760     
                                                                 
 dense_2 (Dense)             (None, 224)               36064     
                                                                 
 dense_3 (Dense)             (None, 128)               28800     
                                                                 
 dense_4 (Dense)             (None, 224)               28896     
                                                                 
 dense_5 (Dense)             (None, 1)                 225       
                                                                 
=================================================================
Total params: 121345 (474.00 KB)
Trainable params: 121345 (474.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
optimizer = tf.keras.optimizers.Adam(0.001)
model7.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
In [ ]:
history_7 = model7.fit(X_train,y_train,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50
192/192 [==============================] - 3s 6ms/step - loss: 1.1225 - accuracy: 0.7018 - val_loss: 0.5539 - val_accuracy: 0.7495
Epoch 2/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5978 - accuracy: 0.7343 - val_loss: 0.5578 - val_accuracy: 0.7479
Epoch 3/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5746 - accuracy: 0.7434 - val_loss: 0.5542 - val_accuracy: 0.7515
Epoch 4/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5630 - accuracy: 0.7464 - val_loss: 0.5432 - val_accuracy: 0.7521
Epoch 5/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5655 - accuracy: 0.7437 - val_loss: 0.5516 - val_accuracy: 0.7515
Epoch 6/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5569 - accuracy: 0.7482 - val_loss: 0.5395 - val_accuracy: 0.7508
Epoch 7/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5542 - accuracy: 0.7462 - val_loss: 0.5374 - val_accuracy: 0.7511
Epoch 8/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5495 - accuracy: 0.7503 - val_loss: 0.5600 - val_accuracy: 0.7518
Epoch 9/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5518 - accuracy: 0.7497 - val_loss: 0.5385 - val_accuracy: 0.7502
Epoch 10/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5465 - accuracy: 0.7485 - val_loss: 0.5408 - val_accuracy: 0.7511
Epoch 11/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5534 - accuracy: 0.7467 - val_loss: 0.5846 - val_accuracy: 0.7257
Epoch 12/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5443 - accuracy: 0.7502 - val_loss: 0.5672 - val_accuracy: 0.7495
Epoch 13/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5455 - accuracy: 0.7524 - val_loss: 0.5340 - val_accuracy: 0.7508
Epoch 14/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5418 - accuracy: 0.7502 - val_loss: 0.5534 - val_accuracy: 0.7518
Epoch 15/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5444 - accuracy: 0.7498 - val_loss: 0.5480 - val_accuracy: 0.7515
Epoch 16/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5409 - accuracy: 0.7516 - val_loss: 0.5424 - val_accuracy: 0.7492
Epoch 17/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5373 - accuracy: 0.7511 - val_loss: 0.5501 - val_accuracy: 0.7515
Epoch 18/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5430 - accuracy: 0.7508 - val_loss: 0.5453 - val_accuracy: 0.7508
Epoch 19/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5356 - accuracy: 0.7524 - val_loss: 0.5365 - val_accuracy: 0.7511
Epoch 20/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5399 - accuracy: 0.7502 - val_loss: 0.5406 - val_accuracy: 0.7502
Epoch 21/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5378 - accuracy: 0.7511 - val_loss: 0.5323 - val_accuracy: 0.7544
Epoch 22/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5331 - accuracy: 0.7510 - val_loss: 0.5280 - val_accuracy: 0.7524
Epoch 23/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5387 - accuracy: 0.7498 - val_loss: 0.5293 - val_accuracy: 0.7502
Epoch 24/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5315 - accuracy: 0.7530 - val_loss: 0.5293 - val_accuracy: 0.7606
Epoch 25/50
192/192 [==============================] - 1s 8ms/step - loss: 0.5264 - accuracy: 0.7508 - val_loss: 0.5237 - val_accuracy: 0.7518
Epoch 26/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5262 - accuracy: 0.7502 - val_loss: 0.5344 - val_accuracy: 0.7495
Epoch 27/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5237 - accuracy: 0.7520 - val_loss: 0.5626 - val_accuracy: 0.7534
Epoch 28/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5270 - accuracy: 0.7498 - val_loss: 0.5218 - val_accuracy: 0.7508
Epoch 29/50
192/192 [==============================] - 1s 4ms/step - loss: 0.5219 - accuracy: 0.7526 - val_loss: 0.5211 - val_accuracy: 0.7518
Epoch 30/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5203 - accuracy: 0.7524 - val_loss: 0.5198 - val_accuracy: 0.7570
Epoch 31/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5185 - accuracy: 0.7560 - val_loss: 0.5277 - val_accuracy: 0.7573
Epoch 32/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5156 - accuracy: 0.7548 - val_loss: 0.5132 - val_accuracy: 0.7590
Epoch 33/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5194 - accuracy: 0.7573 - val_loss: 0.5290 - val_accuracy: 0.7541
Epoch 34/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5177 - accuracy: 0.7535 - val_loss: 0.5235 - val_accuracy: 0.7554
Epoch 35/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5102 - accuracy: 0.7560 - val_loss: 0.5377 - val_accuracy: 0.7599
Epoch 36/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5104 - accuracy: 0.7610 - val_loss: 0.5215 - val_accuracy: 0.7573
Epoch 37/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5112 - accuracy: 0.7524 - val_loss: 0.5206 - val_accuracy: 0.7590
Epoch 38/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5109 - accuracy: 0.7588 - val_loss: 0.5187 - val_accuracy: 0.7580
Epoch 39/50
192/192 [==============================] - 1s 7ms/step - loss: 0.5106 - accuracy: 0.7578 - val_loss: 0.5129 - val_accuracy: 0.7551
Epoch 40/50
192/192 [==============================] - 1s 6ms/step - loss: 0.5055 - accuracy: 0.7591 - val_loss: 0.5241 - val_accuracy: 0.7590
Epoch 41/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5065 - accuracy: 0.7595 - val_loss: 0.5358 - val_accuracy: 0.7551
Epoch 42/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5111 - accuracy: 0.7577 - val_loss: 0.5263 - val_accuracy: 0.7570
Epoch 43/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5047 - accuracy: 0.7623 - val_loss: 0.5209 - val_accuracy: 0.7590
Epoch 44/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5019 - accuracy: 0.7630 - val_loss: 0.5187 - val_accuracy: 0.7652
Epoch 45/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5043 - accuracy: 0.7619 - val_loss: 0.5164 - val_accuracy: 0.7603
Epoch 46/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5062 - accuracy: 0.7605 - val_loss: 0.5308 - val_accuracy: 0.7567
Epoch 47/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5040 - accuracy: 0.7617 - val_loss: 0.5312 - val_accuracy: 0.7648
Epoch 48/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5015 - accuracy: 0.7634 - val_loss: 0.5178 - val_accuracy: 0.7665
Epoch 49/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5001 - accuracy: 0.7630 - val_loss: 0.5166 - val_accuracy: 0.7632
Epoch 50/50
192/192 [==============================] - 1s 5ms/step - loss: 0.5025 - accuracy: 0.7604 - val_loss: 0.5249 - val_accuracy: 0.7599
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_7.history['loss'])
plt.plot(history_7.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
No description has been provided for this image

From the above plot, we observe that the train and validation curves are smooth.

In [ ]:
from sklearn.metrics import roc_curve

from matplotlib import pyplot


# predict probabilities
yhat7 = model7.predict(X_test)
# keep probabilities for the positive outcome only
yhat7 = yhat7[:, 0]
# calculate roc curves
fpr, tpr, thresholds7 = roc_curve(y_test, yhat7)
# calculate the g-mean for each threshold
gmeans7 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans7)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds7[ix], gmeans7[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step
Best Threshold=0.232952, G-Mean=0.678
No description has been provided for this image
In [ ]:
y_pred_e7=model7.predict(X_test)
y_pred_e7 = (y_pred_e7 > thresholds7[ix])
y_pred_e7
120/120 [==============================] - 0s 2ms/step
Out[ ]:
array([[False],
       [ True],
       [False],
       ...,
       [False],
       [False],
       [False]])
In [ ]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm7=confusion_matrix(y_test, y_pred_e7)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm7,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
No description has been provided for this image
In [ ]:
#Accuracy as per the classification report
from sklearn import metrics
cr7=metrics.classification_report(y_test,y_pred_e7)
print(cr7)
              precision    recall  f1-score   support

           0       0.86      0.70      0.77      2877
           1       0.42      0.66      0.51       955

    accuracy                           0.69      3832
   macro avg       0.64      0.68      0.64      3832
weighted avg       0.75      0.69      0.71      3832

  • After using the suggested hyperparameters from Keras Tuner, the F1 score has slightly increased, and the False Negative rate is higher in comparison to the previous optimization technique model.

  • Further, you can add Batch Normalization and Dropout to the model and check the F1 score.

  • Let's try to apply SMOTE to balance this dataset and then apply hyperparamter tuning accordingly.

SMOTE + Keras Tuner¶

In [ ]:
##Applying SMOTE on train and test
from imblearn.over_sampling import SMOTE
smote=SMOTE(sampling_strategy='not majority')
X_sm , y_sm = smote.fit_resample(X_train,y_train)
In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
def build_model_2(h):
    model = keras.Sequential()
    for i in range(h.Int('num_layers', 2, 10)):
        model.add(layers.Dense(units=h.Int('units_' + str(i),
                                            min_value=32,
                                            max_value=256,
                                            step=32),
                               activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(
        optimizer=keras.optimizers.Adam(
            h.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='binary_crossentropy',
        metrics=['accuracy'])
    return model
In [ ]:
tuner_2 = RandomSearch(
    build_model_2,
    objective='val_accuracy',
    max_trials=5,
    executions_per_trial=3,
    project_name='Job_Switch')
Reloading Tuner from ./Job_Switch/tuner0.json
In [ ]:
tuner_2.search_space_summary()
Search space summary
Default search space size: 12
num_layers (Int)
{'default': None, 'conditions': [], 'min_value': 2, 'max_value': 10, 'step': 1, 'sampling': 'linear'}
units_0 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_1 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
learning_rate (Choice)
{'default': 0.01, 'conditions': [], 'values': [0.01, 0.001, 0.0001], 'ordered': True}
units_2 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_3 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_4 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_5 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_6 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_7 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_8 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
units_9 (Int)
{'default': None, 'conditions': [], 'min_value': 32, 'max_value': 256, 'step': 32, 'sampling': 'linear'}
In [ ]:
tuner_2.search(X_sm, y_sm,
             epochs=5,
             validation_split = 0.2)
In [ ]:
tuner_2.results_summary()
Results summary
Results in ./Job_Switch
Showing 10 best trials
Objective(name="val_accuracy", direction="max")

Trial 1 summary
Hyperparameters:
num_layers: 5
units_0: 160
units_1: 160
learning_rate: 0.001
units_2: 224
units_3: 128
units_4: 224
units_5: 64
units_6: 160
units_7: 64
units_8: 32
Score: 0.3924380640188853

Trial 2 summary
Hyperparameters:
num_layers: 9
units_0: 192
units_1: 64
learning_rate: 0.001
units_2: 160
units_3: 32
units_4: 224
units_5: 32
units_6: 256
units_7: 96
units_8: 192
Score: 0.3745472927888234

Trial 0 summary
Hyperparameters:
num_layers: 9
units_0: 224
units_1: 96
learning_rate: 0.001
units_2: 32
units_3: 32
units_4: 32
units_5: 32
units_6: 32
units_7: 32
units_8: 32
Score: 0.3596986730893453

Trial 4 summary
Hyperparameters:
num_layers: 10
units_0: 128
units_1: 32
learning_rate: 0.0001
units_2: 160
units_3: 160
units_4: 160
units_5: 224
units_6: 96
units_7: 128
units_8: 96
units_9: 32
Score: 0.27350427707036334

Trial 3 summary
Hyperparameters:
num_layers: 5
units_0: 32
units_1: 64
learning_rate: 0.01
units_2: 96
units_3: 256
units_4: 256
units_5: 160
units_6: 192
units_7: 224
units_8: 224
Score: 0.14906562368075052
In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
model9 = Sequential()
model9.add(Dense(160,activation='relu',kernel_initializer='he_uniform',input_dim = X_train.shape[1]))
model9.add(Dense(160,activation='relu',kernel_initializer='he_uniform'))
model9.add(Dense(224,activation='relu',kernel_initializer='he_uniform'))
model9.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
model9.add(Dense(224,activation='relu',kernel_initializer='he_uniform'))
model9.add(Dense(1, activation = 'sigmoid'))
      #Compiling the ANN with Adam optimizer and binary cross entropy loss function
optimizer = tf.keras.optimizers.Adam(0.001)
model9.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
In [ ]:
model9.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 160)               1600      
                                                                 
 dense_1 (Dense)             (None, 160)               25760     
                                                                 
 dense_2 (Dense)             (None, 224)               36064     
                                                                 
 dense_3 (Dense)             (None, 128)               28800     
                                                                 
 dense_4 (Dense)             (None, 224)               28896     
                                                                 
 dense_5 (Dense)             (None, 1)                 225       
                                                                 
=================================================================
Total params: 121345 (474.00 KB)
Trainable params: 121345 (474.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
history_9 = model9.fit(X_sm,y_sm,batch_size=64,epochs=50,verbose=1,validation_split = 0.2)
Epoch 1/50
288/288 [==============================] - 4s 6ms/step - loss: 0.9404 - accuracy: 0.5938 - val_loss: 1.1380 - val_accuracy: 0.0674
Epoch 2/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6742 - accuracy: 0.6132 - val_loss: 0.8120 - val_accuracy: 0.3644
Epoch 3/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6514 - accuracy: 0.6346 - val_loss: 1.2521 - val_accuracy: 0.1171
Epoch 4/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6498 - accuracy: 0.6296 - val_loss: 0.6781 - val_accuracy: 0.6395
Epoch 5/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6404 - accuracy: 0.6382 - val_loss: 0.9168 - val_accuracy: 0.1921
Epoch 6/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6399 - accuracy: 0.6413 - val_loss: 0.9899 - val_accuracy: 0.1199
Epoch 7/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6361 - accuracy: 0.6459 - val_loss: 0.8910 - val_accuracy: 0.2925
Epoch 8/50
288/288 [==============================] - 2s 7ms/step - loss: 0.6282 - accuracy: 0.6488 - val_loss: 0.9776 - val_accuracy: 0.2119
Epoch 9/50
288/288 [==============================] - 2s 7ms/step - loss: 0.6317 - accuracy: 0.6475 - val_loss: 0.8264 - val_accuracy: 0.3086
Epoch 10/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6275 - accuracy: 0.6549 - val_loss: 0.9303 - val_accuracy: 0.2193
Epoch 11/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6230 - accuracy: 0.6574 - val_loss: 0.7799 - val_accuracy: 0.3679
Epoch 12/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6210 - accuracy: 0.6600 - val_loss: 0.6063 - val_accuracy: 0.7942
Epoch 13/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6181 - accuracy: 0.6626 - val_loss: 0.6872 - val_accuracy: 0.4748
Epoch 14/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6195 - accuracy: 0.6632 - val_loss: 0.9843 - val_accuracy: 0.2436
Epoch 15/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6132 - accuracy: 0.6656 - val_loss: 0.7729 - val_accuracy: 0.4411
Epoch 16/50
288/288 [==============================] - 1s 4ms/step - loss: 0.6304 - accuracy: 0.6586 - val_loss: 0.9492 - val_accuracy: 0.2338
Epoch 17/50
288/288 [==============================] - 2s 6ms/step - loss: 0.6123 - accuracy: 0.6666 - val_loss: 0.7792 - val_accuracy: 0.3329
Epoch 18/50
288/288 [==============================] - 2s 7ms/step - loss: 0.6102 - accuracy: 0.6684 - val_loss: 1.0104 - val_accuracy: 0.2262
Epoch 19/50
288/288 [==============================] - 2s 5ms/step - loss: 0.6018 - accuracy: 0.6793 - val_loss: 0.7396 - val_accuracy: 0.5758
Epoch 20/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6002 - accuracy: 0.6821 - val_loss: 0.9011 - val_accuracy: 0.3568
Epoch 21/50
288/288 [==============================] - 1s 5ms/step - loss: 0.6021 - accuracy: 0.6786 - val_loss: 0.6261 - val_accuracy: 0.7223
Epoch 22/50
288/288 [==============================] - 1s 4ms/step - loss: 0.5972 - accuracy: 0.6861 - val_loss: 0.9532 - val_accuracy: 0.3179
Epoch 23/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5929 - accuracy: 0.6904 - val_loss: 0.7367 - val_accuracy: 0.5619
Epoch 24/50
288/288 [==============================] - 1s 4ms/step - loss: 0.5914 - accuracy: 0.6868 - val_loss: 0.9272 - val_accuracy: 0.3314
Epoch 25/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5906 - accuracy: 0.6908 - val_loss: 1.1307 - val_accuracy: 0.2442
Epoch 26/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5920 - accuracy: 0.6919 - val_loss: 0.7521 - val_accuracy: 0.4837
Epoch 27/50
288/288 [==============================] - 2s 6ms/step - loss: 0.5881 - accuracy: 0.6935 - val_loss: 1.0727 - val_accuracy: 0.1841
Epoch 28/50
288/288 [==============================] - 2s 7ms/step - loss: 0.5877 - accuracy: 0.6917 - val_loss: 0.9807 - val_accuracy: 0.3096
Epoch 29/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5836 - accuracy: 0.6987 - val_loss: 1.0255 - val_accuracy: 0.2618
Epoch 30/50
288/288 [==============================] - 1s 4ms/step - loss: 0.5837 - accuracy: 0.6971 - val_loss: 0.7151 - val_accuracy: 0.4959
Epoch 31/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5816 - accuracy: 0.6995 - val_loss: 0.5556 - val_accuracy: 0.7555
Epoch 32/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5758 - accuracy: 0.7031 - val_loss: 0.6548 - val_accuracy: 0.6026
Epoch 33/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5773 - accuracy: 0.7029 - val_loss: 0.6953 - val_accuracy: 0.6189
Epoch 34/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5780 - accuracy: 0.7068 - val_loss: 0.7301 - val_accuracy: 0.6295
Epoch 35/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5760 - accuracy: 0.7056 - val_loss: 1.1270 - val_accuracy: 0.1925
Epoch 36/50
288/288 [==============================] - 2s 6ms/step - loss: 0.5762 - accuracy: 0.6980 - val_loss: 0.6591 - val_accuracy: 0.6202
Epoch 37/50
288/288 [==============================] - 2s 7ms/step - loss: 0.5766 - accuracy: 0.7053 - val_loss: 0.6824 - val_accuracy: 0.5556
Epoch 38/50
288/288 [==============================] - 2s 6ms/step - loss: 0.5735 - accuracy: 0.7020 - val_loss: 0.8294 - val_accuracy: 0.4641
Epoch 39/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5755 - accuracy: 0.7031 - val_loss: 0.7203 - val_accuracy: 0.5737
Epoch 40/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5697 - accuracy: 0.7089 - val_loss: 0.9359 - val_accuracy: 0.3507
Epoch 41/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5700 - accuracy: 0.7080 - val_loss: 0.7763 - val_accuracy: 0.5791
Epoch 42/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5635 - accuracy: 0.7146 - val_loss: 0.6702 - val_accuracy: 0.6415
Epoch 43/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5684 - accuracy: 0.7126 - val_loss: 0.8945 - val_accuracy: 0.4557
Epoch 44/50
288/288 [==============================] - 1s 4ms/step - loss: 0.5673 - accuracy: 0.7131 - val_loss: 0.7436 - val_accuracy: 0.5259
Epoch 45/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5649 - accuracy: 0.7097 - val_loss: 0.9512 - val_accuracy: 0.3316
Epoch 46/50
288/288 [==============================] - 2s 7ms/step - loss: 0.5626 - accuracy: 0.7160 - val_loss: 0.6197 - val_accuracy: 0.6491
Epoch 47/50
288/288 [==============================] - 2s 7ms/step - loss: 0.5670 - accuracy: 0.7099 - val_loss: 0.6066 - val_accuracy: 0.6632
Epoch 48/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5607 - accuracy: 0.7162 - val_loss: 0.7814 - val_accuracy: 0.4872
Epoch 49/50
288/288 [==============================] - 1s 5ms/step - loss: 0.5579 - accuracy: 0.7186 - val_loss: 0.6836 - val_accuracy: 0.6347
Epoch 50/50
288/288 [==============================] - 1s 4ms/step - loss: 0.5643 - accuracy: 0.7142 - val_loss: 0.7058 - val_accuracy: 0.6452
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_9.history['loss'])
plt.plot(history_9.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
No description has been provided for this image

From the above plot, we observe that there is a lot of noise in the model.

In [ ]:
from sklearn.metrics import roc_curve

from matplotlib import pyplot


# predict probabilities
yhat9 = model9.predict(X_test)
# keep probabilities for the positive outcome only
yhat9 = yhat9[:, 0]
# calculate roc curves
fpr, tpr, thresholds9 = roc_curve(y_test, yhat9)
# calculate the g-mean for each threshold
gmeans9 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans9)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds9[ix], gmeans9[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step
Best Threshold=0.442253, G-Mean=0.679
No description has been provided for this image
In [ ]:
y_pred_e9=model9.predict(X_test)
y_pred_e9 = (y_pred_e9 > thresholds9[ix])
y_pred_e9
120/120 [==============================] - 0s 2ms/step
Out[ ]:
array([[False],
       [ True],
       [False],
       ...,
       [False],
       [False],
       [False]])
In [ ]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm9=confusion_matrix(y_test, y_pred_e9)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm9,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
No description has been provided for this image
In [ ]:
#Accuracy as per the classification report
from sklearn import metrics
cr6=metrics.classification_report(y_test,y_pred_e9)
print(cr6)
              precision    recall  f1-score   support

           0       0.86      0.72      0.78      2877
           1       0.43      0.64      0.51       955

    accuracy                           0.70      3832
   macro avg       0.64      0.68      0.65      3832
weighted avg       0.75      0.70      0.71      3832

After applying the SMOTE technique to the data, the F1 score increased, and the False Negative rate decreased, but if you see the loss curves of train and validation, the model seems to have overfit.

Let's use Grid Search CV and see if we can increase the model's performance on the metrics.

In [ ]:
backend.clear_session()
np.random.seed(42)
import random
random.seed(42)
tf.random.set_seed(42)
In [ ]:
def create_model_v7():
    np.random.seed(1337)
    model = Sequential()
    model.add(Dense(256,activation='relu',input_dim = X_train.shape[1]))
    model.add(Dropout(0.3))
    #model.add(Dense(128,activation='relu',kernel_initializer='he_uniform'))
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(64,activation='relu'))
    model.add(Dropout(0.2))
    #model.add(Dense(32,activation='relu',kernel_initializer='he_uniform'))
    #model.add(Dropout(0.3))
    model.add(Dense(32,activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    #compile model
    optimizer = tf.keras.optimizers.Adam()
    model.compile(optimizer = optimizer,loss = 'binary_crossentropy', metrics = ['accuracy'])
    return model
In [ ]:
keras_estimator = KerasClassifier(build_fn=create_model_v7, verbose=1)
In [ ]:
# define the grid search parameters
batch_size= [32, 64, 128]
Learn_rate = [0.001,0.01,0.1]
param_grid = dict(optimizer__learning_rate=learn_rate, batch_size=batch_size)

kfold_splits = 3
grid = GridSearchCV(estimator=keras_estimator,
                    verbose=1,
                    cv=kfold_splits,
                    param_grid=param_grid,n_jobs=-1)
grid_result = grid.fit(X_train, y_train,validation_split=0.2,verbose=1)
Fitting 3 folds for each of 9 candidates, totalling 27 fits
384/384 [==============================] - 3s 5ms/step - loss: 0.6308 - accuracy: 0.7337 - val_loss: 0.5739 - val_accuracy: 0.7515
In [ ]:
# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
Best: 0.750620 using {'batch_size': 32, 'optimizer__learning_rate': 0.01}
In [ ]:
estimator_v7=create_model_v7()

estimator_v7.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_5 (Dense)             (None, 256)               2560      
                                                                 
 dropout_3 (Dropout)         (None, 256)               0         
                                                                 
 dense_6 (Dense)             (None, 128)               32896     
                                                                 
 dropout_4 (Dropout)         (None, 128)               0         
                                                                 
 dense_7 (Dense)             (None, 64)                8256      
                                                                 
 dropout_5 (Dropout)         (None, 64)                0         
                                                                 
 dense_8 (Dense)             (None, 32)                2080      
                                                                 
 dense_9 (Dense)             (None, 1)                 33        
                                                                 
=================================================================
Total params: 45825 (179.00 KB)
Trainable params: 45825 (179.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [ ]:
optimizer = tf.keras.optimizers.Adam()
estimator_v7.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
history_7=estimator_v7.fit(X_sm, y_sm, epochs=50, batch_size = grid_result.best_params_['batch_size'], verbose=1,validation_split=0.2)
Epoch 1/50
576/576 [==============================] - 4s 5ms/step - loss: 0.7003 - accuracy: 0.6014 - val_loss: 0.8608 - val_accuracy: 0.0000e+00
Epoch 2/50
576/576 [==============================] - 4s 6ms/step - loss: 0.6665 - accuracy: 0.6224 - val_loss: 0.9427 - val_accuracy: 0.0000e+00
Epoch 3/50
576/576 [==============================] - 3s 5ms/step - loss: 0.6595 - accuracy: 0.6238 - val_loss: 0.8662 - val_accuracy: 0.0000e+00
Epoch 4/50
576/576 [==============================] - 3s 4ms/step - loss: 0.6542 - accuracy: 0.6246 - val_loss: 0.8600 - val_accuracy: 0.0000e+00
Epoch 5/50
576/576 [==============================] - 3s 4ms/step - loss: 0.6477 - accuracy: 0.6252 - val_loss: 0.8651 - val_accuracy: 0.0000e+00
Epoch 6/50
576/576 [==============================] - 3s 4ms/step - loss: 0.6400 - accuracy: 0.6319 - val_loss: 0.9245 - val_accuracy: 0.1034
Epoch 7/50
576/576 [==============================] - 4s 7ms/step - loss: 0.6341 - accuracy: 0.6440 - val_loss: 0.8075 - val_accuracy: 0.2605
Epoch 8/50
576/576 [==============================] - 3s 4ms/step - loss: 0.6329 - accuracy: 0.6465 - val_loss: 0.7977 - val_accuracy: 0.4218
Epoch 9/50
576/576 [==============================] - 3s 4ms/step - loss: 0.6272 - accuracy: 0.6545 - val_loss: 0.8649 - val_accuracy: 0.3044
Epoch 10/50
576/576 [==============================] - 3s 4ms/step - loss: 0.6224 - accuracy: 0.6621 - val_loss: 0.8300 - val_accuracy: 0.3629
Epoch 11/50
576/576 [==============================] - 3s 5ms/step - loss: 0.6196 - accuracy: 0.6655 - val_loss: 0.7618 - val_accuracy: 0.4270
Epoch 12/50
576/576 [==============================] - 4s 6ms/step - loss: 0.6144 - accuracy: 0.6707 - val_loss: 0.6803 - val_accuracy: 0.5374
Epoch 13/50
576/576 [==============================] - 3s 4ms/step - loss: 0.6110 - accuracy: 0.6766 - val_loss: 0.6796 - val_accuracy: 0.6047
Epoch 14/50
576/576 [==============================] - 3s 4ms/step - loss: 0.6085 - accuracy: 0.6791 - val_loss: 0.9325 - val_accuracy: 0.2827
Epoch 15/50
576/576 [==============================] - 3s 4ms/step - loss: 0.6046 - accuracy: 0.6838 - val_loss: 0.7305 - val_accuracy: 0.5113
Epoch 16/50
576/576 [==============================] - 3s 5ms/step - loss: 0.6021 - accuracy: 0.6843 - val_loss: 0.8137 - val_accuracy: 0.4405
Epoch 17/50
576/576 [==============================] - 3s 6ms/step - loss: 0.6018 - accuracy: 0.6842 - val_loss: 0.8529 - val_accuracy: 0.3822
Epoch 18/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5989 - accuracy: 0.6880 - val_loss: 0.8450 - val_accuracy: 0.4500
Epoch 19/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5967 - accuracy: 0.6895 - val_loss: 0.6954 - val_accuracy: 0.5828
Epoch 20/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5925 - accuracy: 0.6931 - val_loss: 0.7040 - val_accuracy: 0.5808
Epoch 21/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5937 - accuracy: 0.6934 - val_loss: 0.6715 - val_accuracy: 0.6241
Epoch 22/50
576/576 [==============================] - 3s 6ms/step - loss: 0.5930 - accuracy: 0.6937 - val_loss: 0.8507 - val_accuracy: 0.3870
Epoch 23/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5895 - accuracy: 0.6967 - val_loss: 0.7462 - val_accuracy: 0.5545
Epoch 24/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5889 - accuracy: 0.6962 - val_loss: 0.7266 - val_accuracy: 0.5487
Epoch 25/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5888 - accuracy: 0.6978 - val_loss: 0.7798 - val_accuracy: 0.5248
Epoch 26/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5889 - accuracy: 0.6979 - val_loss: 0.7571 - val_accuracy: 0.5109
Epoch 27/50
576/576 [==============================] - 3s 6ms/step - loss: 0.5880 - accuracy: 0.6981 - val_loss: 0.8523 - val_accuracy: 0.4072
Epoch 28/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5867 - accuracy: 0.6971 - val_loss: 0.7692 - val_accuracy: 0.5695
Epoch 29/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5845 - accuracy: 0.7052 - val_loss: 0.7931 - val_accuracy: 0.4913
Epoch 30/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5841 - accuracy: 0.7034 - val_loss: 0.6327 - val_accuracy: 0.6391
Epoch 31/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5852 - accuracy: 0.7017 - val_loss: 0.6510 - val_accuracy: 0.6295
Epoch 32/50
576/576 [==============================] - 3s 6ms/step - loss: 0.5851 - accuracy: 0.7000 - val_loss: 0.6746 - val_accuracy: 0.6269
Epoch 33/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5842 - accuracy: 0.6992 - val_loss: 0.6463 - val_accuracy: 0.6812
Epoch 34/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5829 - accuracy: 0.7013 - val_loss: 0.7065 - val_accuracy: 0.5945
Epoch 35/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5833 - accuracy: 0.7020 - val_loss: 0.7968 - val_accuracy: 0.4894
Epoch 36/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5828 - accuracy: 0.7027 - val_loss: 0.6168 - val_accuracy: 0.7051
Epoch 37/50
576/576 [==============================] - 3s 6ms/step - loss: 0.5832 - accuracy: 0.7010 - val_loss: 0.7796 - val_accuracy: 0.4426
Epoch 38/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5818 - accuracy: 0.7042 - val_loss: 0.6748 - val_accuracy: 0.6393
Epoch 39/50
576/576 [==============================] - 2s 4ms/step - loss: 0.5801 - accuracy: 0.7029 - val_loss: 0.7662 - val_accuracy: 0.5482
Epoch 40/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5815 - accuracy: 0.7038 - val_loss: 0.7658 - val_accuracy: 0.5522
Epoch 41/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5806 - accuracy: 0.7015 - val_loss: 0.7803 - val_accuracy: 0.5778
Epoch 42/50
576/576 [==============================] - 3s 6ms/step - loss: 0.5796 - accuracy: 0.7015 - val_loss: 0.7343 - val_accuracy: 0.5732
Epoch 43/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5778 - accuracy: 0.7046 - val_loss: 0.8265 - val_accuracy: 0.5133
Epoch 44/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5774 - accuracy: 0.7065 - val_loss: 0.7490 - val_accuracy: 0.5900
Epoch 45/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5772 - accuracy: 0.7049 - val_loss: 0.8295 - val_accuracy: 0.4846
Epoch 46/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5795 - accuracy: 0.7057 - val_loss: 0.6819 - val_accuracy: 0.6132
Epoch 47/50
576/576 [==============================] - 4s 6ms/step - loss: 0.5789 - accuracy: 0.7056 - val_loss: 0.6177 - val_accuracy: 0.7112
Epoch 48/50
576/576 [==============================] - 3s 5ms/step - loss: 0.5775 - accuracy: 0.7069 - val_loss: 0.7920 - val_accuracy: 0.4863
Epoch 49/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5777 - accuracy: 0.7054 - val_loss: 0.7217 - val_accuracy: 0.6236
Epoch 50/50
576/576 [==============================] - 3s 4ms/step - loss: 0.5767 - accuracy: 0.7060 - val_loss: 0.6357 - val_accuracy: 0.6814
In [ ]:
#Plotting Train Loss vs Validation Loss
plt.plot(history_7.history['loss'])
plt.plot(history_7.history['val_loss'])
plt.title('model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
No description has been provided for this image

From the above plot, we observe that there is a lot of noise in the model.

Grid Search CV also does not seem to work that well on the SMOTE data.

In [ ]:
from sklearn.metrics import roc_curve

from matplotlib import pyplot


# predict probabilities
yhat10 = estimator_v7.predict(X_test)
# keep probabilities for the positive outcome only
yhat10 = yhat10[:, 0]
# calculate roc curves
fpr, tpr, thresholds10 = roc_curve(y_test, yhat10)
# calculate the g-mean for each threshold
gmeans10 = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans10)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds10[ix], gmeans10[ix]))
# plot the roc curve for the model
pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
pyplot.plot(fpr, tpr, marker='.')
pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
# show the plot
pyplot.show()
120/120 [==============================] - 0s 2ms/step
Best Threshold=0.497505, G-Mean=0.685
No description has been provided for this image
In [ ]:
y_pred_e10=estimator_v7.predict(X_test)
y_pred_e10 = (y_pred_e10 > thresholds10[ix])
y_pred_e10
120/120 [==============================] - 0s 2ms/step
Out[ ]:
array([[False],
       [ True],
       [False],
       ...,
       [False],
       [False],
       [False]])
In [ ]:
#Calculating the confusion matrix
from sklearn.metrics import confusion_matrix
cm10=confusion_matrix(y_test, y_pred_e10)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'Not Changing Job','Changing Job']
make_confusion_matrix(cm10,
                      group_names=labels,
                      categories=categories,
                      cmap='Blues')
No description has been provided for this image
In [ ]:
#Accuracy as per the classification report
from sklearn import metrics
cr10=metrics.classification_report(y_test,y_pred_e10)
print(cr10)
              precision    recall  f1-score   support

           0       0.86      0.72      0.79      2877
           1       0.44      0.65      0.52       955

    accuracy                           0.71      3832
   macro avg       0.65      0.69      0.65      3832
weighted avg       0.76      0.71      0.72      3832

Oversampling using SMOTE did not help improve the F1 score.

In this dataset, the SMOTE oversampling technique does not work well, as both the models we tried building have overfitted on the training dataset.

So, our final model here can be Model 4, which uses the Dropout regularization technique and works on the imbalanced dataset.

Suggested Areas of Improvement¶

  • Build any one Machine Learning model, and use that to get the feature importance of the variables. Try to use that in the neural network model.

  • You can try to do better feature engineerning by removing the flaws of the skewed variables if required.

Business Recommendations¶

  • The HR department of the company can deploy the final model from this exercise to identify with a reasonable degree of accuracy whether an employee is likely to switch jobs or not, and this process seems to be easier and more time-efficient than other methods.
In [ ]:
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/My Drive/Colab Notebooks/Copy of FDS_Project_LearnerNotebook_FullCode.ipynb"