ExtraaLearn Project¶

Context¶

The EdTech industry has been surging in the past decade immensely, and according to a forecast, the Online Education market would be worth $286.62bn by 2023 with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has enforced a lot in its growth and expansion beyond any limit. Due to having many dominant features like ease of information sharing, personalized learning experience, transparency of assessment, etc, it is now preferable to traditional education.

In the present scenario due to the Covid-19, the online education sector has witnessed rapid growth and is attracting a lot of new customers. Due to this rapid growth, many new companies have emerged in this industry. With the availability and ease of use of digital marketing resources, companies can reach out to a wider audience with their offerings. The customers who show interest in these offerings are termed as leads. There are various sources of obtaining leads for Edtech companies, like

  • The customer interacts with the marketing front on social media or other online platforms.
  • The customer browses the website/app and downloads the brochure
  • The customer connects through emails for more information.

The company then nurtures these leads and tries to convert them to paid customers. For this, the representative from the organization connects with the lead on call or through email to share further details.

Objective¶

ExtraaLearn is an initial stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues faced by ExtraaLearn is to identify which of the leads are more likely to convert so that they can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:

  • Analyze and build an ML model to help identify which leads are more likely to convert to paid customers,
  • Find the factors driving the lead conversion process
  • Create a profile of the leads which are likely to convert

Data Description¶

The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.

Data Dictionary

  • ID: ID of the lead
  • age: Age of the lead
  • current_occupation: Current occupation of the lead. Values include 'Professional','Unemployed',and 'Student'
  • first_interaction: How did the lead first interacted with ExtraaLearn. Values include 'Website', 'Mobile App'
  • profile_completed: What percentage of profile has been filled by the lead on the website/mobile app. Values include Low - (0-50%), Medium - (50-75%), High (75-100%)
  • website_visits: How many times has a lead visited the website
  • time_spent_on_website: Total time spent on the website
  • page_views_per_visit: Average number of pages on the website viewed during the visits.
  • last_activity: Last interaction between the lead and ExtraaLearn.

    • Email Activity: Seeking for details about program through email, Representative shared information with lead like brochure of program , etc
    • Phone Activity: Had a Phone Conversation with representative, Had conversation over SMS with representative, etc
    • Website Activity: Interacted on live chat with representative, Updated profile on website, etc
  • print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper.

  • print_media_type2: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Magazine.
  • digital_media: Flag indicating whether the lead had seen the ad of ExtraaLearn on the digital platforms.
  • educational_channels: Flag indicating whether the lead had heard about ExtraaLearn in the education channels like online forums, discussion threads, educational websites, etc.
  • referral: Flag indicating whether the lead had heard about ExtraaLearn through reference.
  • status: Flag indicating whether the lead was converted to a paid customer or not.

Problem Definition¶

The objective of this analysis is to build a predictive model that can accurately identify potential leads who are most likely to convert into customers. By leveraging data related to user behavior, such as website activity, user demographics, and interaction history, we aim to develop a classification model that can assist the business in optimizing its marketing and sales strategies.

This project seeks to address the challenge of efficiently managing leads by predicting which users have the highest potential for conversion. By identifying key factors that influence lead conversion, the model will enable the business to focus its resources on high-value leads, improve customer acquisition efforts, and ultimately increase conversion rates.

Model evaluation criterion:

Model can make wrong predictions as:

  • Predicting a lead will not be converted to a paid customer but, in reality, the lead would have converted to a paid customer.

  • Predicting a lead will be converted to a paid customer but, in reality, the lead would have not converted to a paid customer. Which case is more important?

If we predict that a lead will not get converted and the lead would have converted then the company will lose a potential customer.

If we predict that a lead will get converted and the lead doesn't get converted the company might lose resources by nurturing false-positive cases.

Losing a potential customer is a greater loss for the organization.

How to reduce the losses?

  • Company would want Recall to be maximized. The greater the Recall score, higher the chances of minimizing False Negatives.

  • In this case the false negative is predicting a lead will not convert(0), when it would have converted(1).

To accomplish this we will need to build a model that maximizes recall for status = 1.

Importing Libraries, Connect to Google Drive and Load Data¶

In [2]:
# Importing the basic libraries we will require for the project

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.patches as mpatches # For creating plot independent legends

# Libraries for statistical analysis
from scipy import stats

# Library for label encoding (for 3D plotting functions)
from Scikit-learn.preprocessing import LabelEncoder

# Collinearity checks
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Importing the Machine Learning models we require from Scikit-Learn
from Scikit-learn.linear_model import LogisticRegression
from Scikit-learn.svm import SVC
from Scikit-learn import tree

from Scikit-learn.tree import (
    DecisionTreeClassifier,
)

from Scikit-learn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
)

# Importing the other functions we may require from Scikit-Learn
from Scikit-learn.model_selection import (
    train_test_split,
    GridSearchCV,
    cross_val_score,
)

from Scikit-learn.preprocessing import (
    MinMaxScaler,
    LabelEncoder,
    OneHotEncoder,
    StandardScaler,
)

from Scikit-learn.impute import (
    SimpleImputer,
)

# To get diferent metric scores
import Scikit-learn.metrics as metrics
from Scikit-learn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)

#Importing PCA and TSNE
from Scikit-learn.decomposition import PCA

# Importing class weights
from Scikit-learn.utils.class_weight import compute_class_weight

# Importing Advanced Analysis Libraries
from xgboost import XGBClassifier
from Scikit-learn.model_selection import RandomizedSearchCV

# Code to ignore warnings from function usage
import warnings;
import numpy as np
warnings.filterwarnings('ignore')

# Comment formatting
from IPython.display import display, HTML

# Connect collab
from google.colab import drive
drive.mount('/content/drive')

# Load data from csv file
dataset = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Project_-_Classification_and_Hypothesis_Testing/ExtraaLearn.csv')

# Make a working copy of the data
data = dataset.copy()
Mounted at /content/drive

Function Definitions¶

In [3]:
def enhanced_histogram_boxplot(
    data,
    feature,
    figsize=(12, 8),
    kde=True,
    bins=None,
    box_color="violet",
    mean_color="green",
    median_color="black",
    hist_palette="tab10",  # Updated palette to 'tab10'
    show_title=True,
    custom_title=None
):
    """
    Enhanced boxplot and histogram combined with outlier detection and statistical summary

    Parameters:
    data (pd.DataFrame): Input dataframe
    feature (str): Column name of the dataframe to plot
    figsize (tuple): Size of figure (default (12,8)) - Adjusted for removal of Q-Q plot
    kde (bool): Whether to show the density curve (default True)
    bins (int): Number of bins for histogram (default None)
    box_color (str): Color of the boxplot (default "violet")
    mean_color (str): Color of the mean line (default "green")
    median_color (str): Color of the median line (default "black")
    hist_palette (str): Color palette for the histogram (default "tab10")
    show_title (bool): Whether to show the plot title (default True)
    custom_title (str): Custom title for the plot (default None)
    """
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")

    if feature not in data.columns:
        raise ValueError(f"Feature '{feature}' not found in the dataframe")

    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=False,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )

    # Boxplot
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color=box_color)

    # Histogram
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette=sns.color_palette(hist_palette))
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, palette=sns.color_palette(hist_palette))

    # Add mean and median lines
    mean = data[feature].mean()
    median = data[feature].median()
    ax_hist2.axvline(mean, color=mean_color, linestyle="--", label="Mean")
    ax_hist2.axvline(median, color=median_color, linestyle="-", label="Median")

    # Add legend
    ax_hist2.legend()

    # Add labels
    ax_hist2.set_xlabel(feature)
    ax_hist2.set_ylabel("Count")
    ax_box2.set_ylabel("")

    if show_title:
        if custom_title:
            title = f"{custom_title} Distribution"
        else:
            title = f"{feature} Distribution"
        plt.suptitle(title, fontsize=16)

    plt.tight_layout()

    # Calculate statistics
    std = data[feature].std()
    skew = data[feature].skew()
    kurtosis = data[feature].kurtosis()

    # Outlier detection
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)][feature]

    # Z-score outliers
    z_scores = np.abs(stats.zscore(data[feature]))
    z_score_outliers = data[feature][z_scores > 3]

    # Print statistical summary
    print(f"\nStatistical Summary for {feature}:")
    print(f"Mean: {mean:.2f}")
    print(f"Median: {median:.2f}")
    print(f"Standard Deviation: {std:.2f}")
    print(f"Skewness: {skew:.2f}")
    print(f"Kurtosis: {kurtosis:.2f}")
    print(f"\nOutlier Analysis:")
    print(f"IQR method - Number of outliers: {len(outliers)}")
    print(f"IQR method - Percentage of outliers: {(len(outliers) / len(data[feature])) * 100:.2f}%")
    print(f"IQR method - Outlier range: < {lower_bound:.2f} or > {upper_bound:.2f}")
    print(f"Z-score method - Number of outliers (|z| > 3): {len(z_score_outliers)}")
    print(f"Z-score method - Percentage of outliers: {(len(z_score_outliers) / len(data[feature])) * 100:.2f}%")

    plt.show()

# Usage example:
# enhanced_histogram_boxplot(data, 'age')
In [4]:
# Define binary feature plotting function

def plot_binary_feature(
                            data,
                            feature,
                            figsize=(12, 7),
                            colors=['#0073a3', '#5e6d77'],
                            show_title=True,
                            custom_title=None):
    """
    Bar plot and pie chart for binary features

    Parameters:
    data (pd.DataFrame): Input dataframe
    feature (str): Column name of the dataframe to plot
    figsize (tuple): Size of figure (default (12,7))
    colors (list): List of two colors for the plots (default ['#FFA07A', '#98FB98'])
    show_title (bool): Whether to show the plot title (default True)
    custom_title (str): Custom title for the plot (default None)
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)

    # Calculate the value counts and percentages
    value_counts = data[feature].value_counts().sort_index()
    percentages = value_counts / len(data) * 100

    # Bar plot
    sns.barplot(x=value_counts.index, y=value_counts.values, ax=ax1, palette=colors)
    ax1.set_title('Bar Plot')
    ax1.set_xlabel(feature)
    ax1.set_ylabel('Count')

    # Add percentage labels on the bars
    for i, v in enumerate(value_counts.values):
        ax1.text(i, v, f'{percentages[i]:.1f}%', ha='center', va='bottom')

    # Pie chart
    ax2.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%', colors=colors, startangle=90)
    ax2.set_title('Pie Chart')

    if show_title:
        title = f"{custom_title} Distribution" if custom_title else f"{feature} Distribution"
        plt.suptitle(title, fontsize=16)

    plt.tight_layout()
    plt.show()
In [5]:
# Define categorical feature plotting function

def plot_categorical(data, feature, figsize=(12, 6), show_title=True, custom_title=None, top_n=None):
    """
    Plot categorical variables with appropriate chart types based on the number of categories.
    Includes value counts and percentages as a "legend" on the right side.

    Parameters:
    data (pd.DataFrame): Input dataframe
    feature (str): Column name of the dataframe to plot
    figsize (tuple): Size of figure (default (12,6))
    show_title (bool): Whether to show the plot title (default True)
    custom_title (str): Custom title for the plot (default None)
    top_n (int): Number of top categories to show, others will be grouped as 'Other' (default None)
    """
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")

    if feature not in data.columns:
        raise ValueError(f"Feature '{feature}' not found in the dataframe")

    value_counts = data[feature].value_counts()
    n_categories = len(value_counts)

    if top_n and n_categories > top_n:
        top_values = value_counts.nlargest(top_n)
        other = pd.Series({'Other': value_counts.nsmallest(n_categories - top_n).sum()})
        value_counts = pd.concat([top_values, other])
        n_categories = top_n + 1

    percentages = value_counts / len(data) * 100

    # Determine the title
    title = custom_title if custom_title else f"{feature} Distribution"

    if n_categories == 2:
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize, gridspec_kw={'width_ratios': [2, 1]})

        # Bar plot
        sns.barplot(x=value_counts.index, y=value_counts.values, ax=ax1)
        ax1.set_title('Bar Plot')
        ax1.set_ylabel('Count')

        # Add percentage labels on the bars
        for i, v in enumerate(value_counts.values):
            ax1.text(i, v, f'{percentages[i]:.1f}%', ha='center', va='bottom')

        # Pie chart
        ax2.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%', startangle=90)
        ax2.set_title('Pie Chart')

    else:
        fig, (ax, ax_legend) = plt.subplots(1, 2, figsize=figsize, gridspec_kw={'width_ratios': [3, 1]})

        # Horizontal bar plot
        bars = ax.barh(value_counts.index, value_counts.values)
        ax.set_title('Horizontal Bar Plot')
        ax.set_xlabel('Count')

        # Add percentage labels on the bars
        for i, (value, name) in enumerate(zip(value_counts.values, value_counts.index)):
            ax.text(value, i, f'{percentages[name]:.1f}%', va='center')

        # Add value counts as "legend"
        ax_legend.axis('off')
        legend_text = "Value Counts:\n\n"
        for index, value in value_counts.items():
            legend_text += f"{index}: {value} ({percentages[index]:.1f}%)\n"
        ax_legend.text(0, 0.9, legend_text, verticalalignment='top', wrap=True)

    if show_title:
        plt.suptitle(title, fontsize=16)

    plt.tight_layout()
    plt.show()

# Example usage:
# plot_categorical(data, 'current_occupation', custom_title="Occupation Distribution")
# plot_categorical(data, 'print_media_type1', custom_title="Print Media Type 1 Usage")
# plot_categorical(data, 'educational_channels', top_n=5, custom_title="Top 5 Educational Channels")
In [6]:
# Define scatter plot function

def plot_scatter(x_column, y_column, data, title=None, x_label=None, y_label=None, palette='tab10'):
    """
    Creates a scatter plot for the specified x and y columns from the given dataset.

    Parameters:
    x_column (str): The name of the column for the x-axis.
    y_column (str): The name of the column for the y-axis.
    data (DataFrame): The pandas DataFrame containing the data.
    title (str, optional): The title of the plot. Defaults to None.
    x_label (str, optional): Label for the x-axis. Defaults to None.
    y_label (str, optional): Label for the y-axis. Defaults to None.

    Returns:
    None: Displays the scatter plot.
    """
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=x_column, y=y_column, data=data, palette='palette')
    plt.title(title if title else f'Scatter Plot: {x_column} vs {y_column}')
    plt.xlabel(x_label if x_label else x_column)
    plt.ylabel(y_label if y_label else y_column)
    plt.grid(True)
    plt.show()

# Example of how to call the function
#plot_scatter('age', 'website_visits', data, title='Age vs Website Visits')
In [7]:
# Defining an abstracted function for box plot visualization

def plot_boxplot(x_column, y_column, data, title=None, x_label=None, y_label=None, palette='tab10', figsize=(10, 6)):
    """
    Creates a box plot for the specified x and y columns from the given dataset.

    Parameters:
    x_column (str): The name of the column for the x-axis (typically categorical).
    y_column (str): The name of the column for the y-axis (typically continuous).
    data (DataFrame): The pandas DataFrame containing the data.
    title (str, optional): The title of the plot. Defaults to None.
    x_label (str, optional): Label for the x-axis. Defaults to None.
    y_label (str, optional): Label for the y-axis. Defaults to None.
    palette (str, optional): Color palette for the box plot. Defaults to 'tab10'.
    figsize (tuple, optional): Figure size for the plot (width, height). Defaults to (10, 6).

    Returns:
    None: Displays the box plot.
    """
    # Set the figure size based on the figsize parameter
    plt.figure(figsize=figsize)
    sns.boxplot(x=x_column, y=y_column, data=data, palette=palette)
    plt.title(title if title else f'Box Plot: {x_column} vs {y_column}')
    plt.xlabel(x_label if x_label else x_column)
    plt.ylabel(y_label if y_label else y_column)
    plt.grid(True)
    plt.show()

# Example of how to call the function with a custom figure size
# plot_boxplot(x_column='status', y_column='age', data=dataset, title='Status vs Age', figsize=(15, 8))
In [8]:
# Defining a generic function for count plot visualization

def plot_countplot(x_column, hue_column, data, title=None, x_label=None, y_label=None, palette='tab10', hue_order=None):
    """
    Creates a count plot for the specified x and hue columns from the given dataset.

    Parameters:
    x_column (str): The name of the column for the x-axis (typically categorical).
    hue_column (str): The name of the column for hue (typically categorical).
    data (DataFrame): The pandas DataFrame containing the data.
    title (str, optional): The title of the plot. Defaults to None.
    x_label (str, optional): Label for the x-axis. Defaults to None.
    y_label (str, optional): Label for the y-axis. Defaults to None.
    palette (str, optional): Color palette for the count plot. Defaults to 'tab10'.
    hue_order (list, optional): The order of the hues to be plotted. Defaults to None.

    Returns:
    None: Displays the count plot.
    """
    plt.figure(figsize=(10, 6))
    sns.countplot(x=x_column, hue=hue_column, data=data, palette=palette, hue_order=hue_order)
    plt.title(title if title else f'Count Plot: {x_column} vs {hue_column}')
    plt.xlabel(x_label if x_label else x_column)
    plt.ylabel(y_label if y_label else 'Count')
    plt.grid(True)
    plt.show()

# Example usage
# plot_countplot(x_column='status', hue_column='current_occupation', data=dataset, title='Status vs Current Occupation', hue_order=['Employed', 'Unemployed', 'Student'])
In [9]:
# Defining a generic function for creating 3D scatter plots

# Set Seaborn style to 'darkgrid'
sns.set(style="whitegrid")

def plot_3d_scatter_with_color(x_column, y_column, z_column, color_column, data, title=None, x_label=None, y_label=None, z_label=None, figsize=(10, 8)):
    """
    Creates a 3D scatter plot for the specified x, y, and z columns from the given dataset.
    Adds the 'color_column' to color the points based on a categorical variable.
    """
    fig = plt.figure(figsize=figsize)           # Pass figsize argument here
    ax = fig.add_subplot(111, projection='3d')

    # Scatter plot with color dimension based on the 'color_column'
    p = ax.scatter(data[x_column], data[y_column], data[z_column], c=data[color_column], cmap='coolwarm', marker='o')

    # Setting labels with padding
    ax.set_xlabel(x_label if x_label else x_column, labelpad=20)
    ax.set_ylabel(y_label if y_label else y_column, labelpad=20)
    ax.set_zlabel(z_label if z_label else z_column, labelpad=20)

    # Add color bar for reference
    cbar = fig.colorbar(p, ax=ax) # Assign color bar object to cbar
    cbar.set_label('Status') # Add this line

    # Title
    ax.set_title(title if title else f'3D Scatter Plot: {x_column}, {y_column}, {z_column}')

    plt.tight_layout()
    plt.show()

# Call the modified function with 'status' as the color dimension
#plot_3d_scatter_with_color('website_visits', 'time_spent_on_website', 'status', 'status', data, title='Website Visits, Time Spent on Website, and Status (with color dimension)')
In [10]:
# Creating metric function for regresion model evaluation

def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))

    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))

    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels=['Not Converted', 'Converted'], yticklabels=['Not Converted', 'Converted'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

Data Overview¶

  • Observations
  • Sanity checks

First Five Rows¶

In [11]:
# returns the first 5 rows
data.head()
Out[11]:
ID age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media educational_channels referral status
0 EXT001 57 Unemployed Website High 7 1639 1.861 Website Activity Yes No Yes No No 1
1 EXT002 56 Professional Mobile App Medium 2 83 0.320 Website Activity No No No Yes No 0
2 EXT003 52 Professional Website Medium 3 330 0.074 Website Activity No No Yes No No 0
3 EXT004 53 Unemployed Website High 4 464 2.057 Website Activity No No No No No 1
4 EXT005 23 Student Website High 4 600 16.914 Email Activity No No No No No 0

Observations:

  • The conversion of lead to customer is indicated by the variable status, which is the target variable and the rest of the variables are independent variables based on which we will predict the conversion based on several parameters

Last Five Rows¶

In [12]:
# returns the last 5 rows
data.tail()
Out[12]:
ID age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media educational_channels referral status
4607 EXT4608 35 Unemployed Mobile App Medium 15 360 2.170 Phone Activity No No No Yes No 0
4608 EXT4609 55 Professional Mobile App Medium 8 2327 5.393 Email Activity No No No No No 0
4609 EXT4610 58 Professional Website High 2 212 2.692 Email Activity No No No No No 1
4610 EXT4611 57 Professional Mobile App Medium 1 154 3.879 Website Activity Yes No No No No 0
4611 EXT4612 55 Professional Website Medium 4 2290 2.075 Phone Activity No No No No No 0

Shape¶

In [13]:
# Determine the number of rows and columns by calling data.shape
print(data.shape[0])
print(data.shape[1])
4612
15

The dataset consists of 4612 rows and 14 features.

Dataset Information¶

In [14]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     4612 non-null   object 
 1   age                    4612 non-null   int64  
 2   current_occupation     4612 non-null   object 
 3   first_interaction      4612 non-null   object 
 4   profile_completed      4612 non-null   object 
 5   website_visits         4612 non-null   int64  
 6   time_spent_on_website  4612 non-null   int64  
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object 
 9   print_media_type1      4612 non-null   object 
 10  print_media_type2      4612 non-null   object 
 11  digital_media          4612 non-null   object 
 12  educational_channels   4612 non-null   object 
 13  referral               4612 non-null   object 
 14  status                 4612 non-null   int64  
dtypes: float64(1), int64(4), object(10)
memory usage: 540.6+ KB

Observations

The dataset is composed of numeric and categorical features.

  • Continuous Features:

    • age
    • website_visits
    • time_spent_on_website
    • page_views_per_visit
  • Categorical Features:

    • ID
    • current_occupation
    • first_interaction
    • profile_completed
    • last_activity
    • print_media_type1
    • print_media_type2
    • digital_media
    • educational_channels
    • referral
    • status (Dependent Variable)

Summary of Features:

Based on the initial inspection of the dataset, here's an evaluation of the columns, considering that our target variable is status:

  • ID:

    • The ID column is a unique identifier for each record and does not provide predictive value for the target variable (status). It can be safely dropped.
  • current_occupation:

    • This feature could potentially be useful for identifying patterns between occupation and conversion status. Retain for analysis.
  • first_interaction:

    • This feature could help us understand the impact of the platform (e.g., Website, Mobile App) on conversion. Retain for initial analysis.
  • profile_completed:

    • Indicates the level of user engagement, which may be related to conversion. Retain for modeling.
  • website_visits, time_spent_on_website, page_views_per_visit:

    • These features capture user behavior and engagement. Retain unless multicollinearity is observed during further analysis.
  • print_media_type1, print_media_type2, digital_media, educational_channels:

    • These features provide insights into which marketing channels may be effective. Retain for analysis.
  • referral:

    • This feature reflects the lead source and could be a strong predictor of conversion. Retain for further exploration.
  • last_activity:

    • Represents the user's most recent activity, which could influence conversion. Retain for further analysis.

Data Cleansing¶

Converting categorical features from 'object' to the 'category' datatype helps reduce memory usage and simplifies the visual interpretation of data types, making it easier to manage and analyze the dataset efficiently.

In [15]:
# Changing datatypes of categorical features

categorical_features = ['current_occupation', 'first_interaction', 'profile_completed', 'last_activity',
                        'print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels',
                        'referral', 'status']

for feature in categorical_features:
    data[feature] = data[feature].astype('category')  # Step 1: Convert to categorical
    data[feature] = data[feature].cat.codes           # Step 2: Encode categories as integer codes
    data[feature] = data[feature].astype('category')  # Step 3: Re-convert back to categorical


data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   ID                     4612 non-null   object  
 1   age                    4612 non-null   int64   
 2   current_occupation     4612 non-null   category
 3   first_interaction      4612 non-null   category
 4   profile_completed      4612 non-null   category
 5   website_visits         4612 non-null   int64   
 6   time_spent_on_website  4612 non-null   int64   
 7   page_views_per_visit   4612 non-null   float64 
 8   last_activity          4612 non-null   category
 9   print_media_type1      4612 non-null   category
 10  print_media_type2      4612 non-null   category
 11  digital_media          4612 non-null   category
 12  educational_channels   4612 non-null   category
 13  referral               4612 non-null   category
 14  status                 4612 non-null   category
dtypes: category(10), float64(1), int64(3), object(1)
memory usage: 226.1+ KB

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions

  1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
  2. The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status?
  3. The company uses multiple modes to interact with prospects. Which way of interaction works best?
  4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels have the highest lead conversion rate?
  5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information.Does having more details about a prospect increase the chances of conversion?

Problem Statement Reminder¶

The objective of this analysis is to build a predictive model that can accurately identify potential leads who are most likely to convert into customers.

  • If we predict that a lead will not get converted and the lead would have converted then the company will lose a potential customer.

  • If we predict that a lead will get converted and the lead doesn't get converted the company might lose resources by nurturing false-positive cases.

Losing a potential customer is a greater loss for the organization.

To accomplish this we will need to build a model that maximizes recall for status = 1.

Summary Statistics¶

In [16]:
data.describe(include = "all").T
Out[16]:
count unique top freq mean std min 25% 50% 75% max
ID 4612 4612 EXT001 1 NaN NaN NaN NaN NaN NaN NaN
age 4612.0 NaN NaN NaN 46.201214 13.161454 18.0 36.0 51.0 57.0 63.0
current_occupation 4612.0 3.0 0.0 2616.0 NaN NaN NaN NaN NaN NaN NaN
first_interaction 4612.0 2.0 1.0 2542.0 NaN NaN NaN NaN NaN NaN NaN
profile_completed 4612.0 3.0 0.0 2264.0 NaN NaN NaN NaN NaN NaN NaN
website_visits 4612.0 NaN NaN NaN 3.566782 2.829134 0.0 2.0 3.0 5.0 30.0
time_spent_on_website 4612.0 NaN NaN NaN 724.011275 743.828683 0.0 148.75 376.0 1336.75 2537.0
page_views_per_visit 4612.0 NaN NaN NaN 3.026126 1.968125 0.0 2.07775 2.792 3.75625 18.434
last_activity 4612.0 3.0 0.0 2278.0 NaN NaN NaN NaN NaN NaN NaN
print_media_type1 4612.0 2.0 0.0 4115.0 NaN NaN NaN NaN NaN NaN NaN
print_media_type2 4612.0 2.0 0.0 4379.0 NaN NaN NaN NaN NaN NaN NaN
digital_media 4612.0 2.0 0.0 4085.0 NaN NaN NaN NaN NaN NaN NaN
educational_channels 4612.0 2.0 0.0 3907.0 NaN NaN NaN NaN NaN NaN NaN
referral 4612.0 2.0 0.0 4519.0 NaN NaN NaN NaN NaN NaN NaN
status 4612.0 2.0 0.0 3235.0 NaN NaN NaN NaN NaN NaN NaN

Observations

Age

  • The average age of leads is 46.2 years, with a standard deviation of 13.16 years, indicating a moderately wide age distribution.
  • Ages range from 18 to 63 years.
  • 50% of the leads are aged between 36 and 57 years, highlighting the central tendency of the age distribution.

Current Occupation

  • There are 3 unique occupation categories: Professional, Unemployed, and Student.
  • The majority of leads are Professionals, comprising 56.7% (2616 leads) of the total.

First Interaction

  • Leads engage through 2 interaction channels: Website and Mobile App.
  • Website is the predominant first interaction method, utilized by approximately 55% (2542 leads) of the total leads.

Profile Completion

  • Profile completion is categorized into 3 levels: High, Medium, and Low.
  • The most common level of profile completion is High (75-100%), achieved by 2264 leads.

Website Visits

  • On average, leads visit the website 3.57 times, with a standard deviation of 2.83.
  • The number of visits ranges from 0 to 30, with a median of 3 visits.
  • Most leads have 2 to 5 website visits, indicating frequent engagement within this range.

Time Spent on Website

  • Leads spend an average of 724 seconds (approximately 12 minutes) on the website, with a wide standard deviation of 743.83 seconds.
  • Time spent ranges from 0 to 2537 seconds.
  • 25% of leads spend at least 148 seconds, 50% spend at least 376 seconds, and 75% spend at least 1336 seconds on the website, reflecting significant variability in user engagement.

Page Views per Visit

  • The average number of pages viewed per visit is 3.03, with a standard deviation of 1.97.
  • Page views per visit range from 0 to 18.43.
  • Most leads view between 2 to 4 pages per visit, indicating a consistent level of engagement across sessions.

Last Activity

  • There are 3 types of last activities recorded: Email Activity, Phone Activity, and Website Activity.
  • Email Activity is the most prevalent, accounting for 49% (2278 leads) of the total leads.

Media & Referral Channels

  • For all media channels (print_media_type1, print_media_type2, digital_media, and educational_channels), the predominant value is "No", indicating that most leads were not exposed to these media channels.
  • Similarly, the referral channel shows that the majority of leads were not referred.

Status (Target Variable)

  • The target variable status is binary, with a conversion rate of approximately 29.9% (status = 1).
  • 70.1% of leads did not convert (status = 0), indicating a majority of non-converting leads within the dataset.

Univariate and Outlier Analysis¶

Unique Values¶

In [17]:
# Get nunique values for each column
nunique_df = data.nunique().reset_index()
nunique_df.columns = ['Feature', 'Unique_Count']

# Features to find unique values for (categorical only)
categorical_features = [
    'current_occupation', 'first_interaction', 'profile_completed',
    'last_activity', 'print_media_type1', 'print_media_type2',
    'digital_media', 'educational_channels', 'referral', 'status'
]

# Create an empty list to store unique value strings
unique_value_list = []

# Loop over the categorical list to get all unique values and store them as formatted strings
for feature in categorical_features:
    unique_values = data[feature].dropna().unique()  # Ensure no NaN values are included
    unique_values_sorted = sorted(unique_values, key=lambda x: (str(x).lower()))  # Sort for consistency
    formatted_list = ', '.join([str(value) for value in unique_values_sorted])
    unique_value_list.append(formatted_list)

# Create a DataFrame with the features and their corresponding unique values
unique_values_df = pd.DataFrame({
    'Feature': categorical_features,
    'Unique_Values': unique_value_list
})

# Merge nunique_df with unique_values_df
merged_df = pd.merge(nunique_df, unique_values_df, on='Feature', how='left')

# Display the combined DataFrame
display(merged_df)
Feature Unique_Count Unique_Values
0 ID 4612 NaN
1 age 46 NaN
2 current_occupation 3 0, 1, 2
3 first_interaction 2 0, 1
4 profile_completed 3 0, 1, 2
5 website_visits 27 NaN
6 time_spent_on_website 1623 NaN
7 page_views_per_visit 2414 NaN
8 last_activity 3 0, 1, 2
9 print_media_type1 2 0, 1
10 print_media_type2 2 0, 1
11 digital_media 2 0, 1
12 educational_channels 2 0, 1
13 referral 2 0, 1
14 status 2 0, 1

Observations:

ID

  • Unique Count: 4612
  • Each ID is unique, ensuring that every record in the dataset represents a distinct lead.

Age

  • Unique Count: 46
  • The age feature encompasses 46 unique values, indicating a diverse range of ages among the leads.

Current Occupation

  • Unique Count: 3
  • Unique Values: Professional, Student, Unemployed

First Interaction

  • Unique Count: 2
  • Unique Values: Website, Mobile App

Profile Completion

  • Unique Count: 3
  • Unique Values: High, Medium, Low

Website Visits

  • Unique Count: 27
  • The website_visits feature exhibits 27 unique values, reflecting a varied number of website visits per lead.

Time Spent on Website

  • Unique Count: 1623
  • The time_spent_on_website feature has 1623 unique values, suggesting a wide range of engagement durations on the website.

Page Views per Visit

  • Unique Count: 2414
  • The page_views_per_visit feature contains 2414 unique values, indicating diverse browsing behaviors across different visits.

Last Activity

  • Unique Count: 3
  • Unique Values: Website Activity, Email Activity, Phone Activity

Print Media Type1

  • Unique Count: 2
  • Unique Values: Yes, No

Print Media Type2

  • Unique Count: 2
  • Unique Values: Yes, No

Digital Media

  • Unique Count: 2
  • Unique Values: Yes, No

Educational Channels

  • Unique Count: 2
  • Unique Values: Yes, No

Referral

  • Unique Count: 2
  • Unique Values: Yes, No

Status

  • Unique Count: 2
  • Unique Values: 1 (Converted), 0 (Did Not Convert)

Missing Values Check¶

In [18]:
# Missing value check
pd.DataFrame(data={'% of Missing Values':round(data.isna().sum()/data.isna().count()*100,2)})
Out[18]:
% of Missing Values
ID 0.0
age 0.0
current_occupation 0.0
first_interaction 0.0
profile_completed 0.0
website_visits 0.0
time_spent_on_website 0.0
page_views_per_visit 0.0
last_activity 0.0
print_media_type1 0.0
print_media_type2 0.0
digital_media 0.0
educational_channels 0.0
referral 0.0
status 0.0

There are no missing values in the dataset.

Distribution of Continuous Features¶

The dataset contains the following continuous variables:

  • age
  • website_visits
  • time_spent_on_website
  • page_views_per_visit
  • status

Feature: Age¶

In [19]:
# plot the distribution of the age feature
enhanced_histogram_boxplot(data, 'age', kde = True, bins = 24, custom_title = "Age")
Statistical Summary for age:
Mean: 46.20
Median: 51.00
Standard Deviation: 13.16
Skewness: -0.72
Kurtosis: -0.80

Outlier Analysis:
IQR method - Number of outliers: 0
IQR method - Percentage of outliers: 0.00%
IQR method - Outlier range: < 4.50 or > 88.50
Z-score method - Number of outliers (|z| > 3): 0
Z-score method - Percentage of outliers: 0.00%

Observations:

  • Left-Skewed Distribution:

    • The distribution is left-skewed (negatively skewed), the majority of the leads are older, specifically concentrated in the 50-60 year age range.
    • The tail on the left indicates that there are fewer younger leads (18-30 years old) compared to older leads.
  • Mean vs. Median:

    • The mean age (green dashed line) is around 46 years, while the median (black solid line) is around 51 years.
    • The median is higher than the mean, which is characteristic of a left-skewed distribution. This indicates that a larger portion of the dataset is made up of older individuals, with a few younger leads pulling the mean down.
  • Boxplot Insights

    • The boxplot reflects that the middle 50% of the data (interquartile range, or IQR) is roughly between 36 and 57 years, with the median at 51 years.
    • There are no significant outliers in the data, as the whiskers extend close to the minimum (around 18) and maximum (around 63), and there are no individual points marked as outliers.
  • Concentration of Older Leads:

    • There is a significant concentration of leads in the 50-60 year range. This demographic dominates the dataset, which may suggest that this age group is the primary audience.
  • Smaller Younger Population:

    • The tail of the distribution shows a smaller number of younger leads, particularly between 18-30 years old.
    • If the business goal is to attract more younger leads, this age group may require targeted strategies to increase their presence in the dataset.
  • Age Variability:

    • There is a moderate spread in the data, with ages ranging from 18 to 63 years.
    • The 25th percentile is at 36 years, indicating that 25% of the leads are younger than this, while the 75th percentile is at 57 years, meaning that the majority of leads are above this age.
  • Outliers:

    • There are no outliers in this feature set.

Feature: Website Visits¶

In [20]:
# plot the distribution of the website_visits feature
enhanced_histogram_boxplot(data, "website_visits", kde = True, custom_title = "Website Visits")
Statistical Summary for website_visits:
Mean: 3.57
Median: 3.00
Standard Deviation: 2.83
Skewness: 2.16
Kurtosis: 9.35

Outlier Analysis:
IQR method - Number of outliers: 154
IQR method - Percentage of outliers: 3.34%
IQR method - Outlier range: < -2.50 or > 9.50
Z-score method - Number of outliers (|z| > 3): 66
Z-score method - Percentage of outliers: 1.43%

Observations:

  • Right-Skewed Distribution:

    • The distribution of website_visits is right-skewed (positively skewed), meaning that most users have fewer website visits, but a small number of users have significantly more visits.
    • The majority of users have between 0 and 5 visits, with the count sharply decreasing as the number of visits increases.
  • Mean vs. Median:

    • The mean number of visits (green dashed line) is slightly greater than the median (black solid line).
    • This difference suggests that the skew in the data (due to users with a high number of visits) is pulling the mean upwards, whereas the median remains a better representation of the central tendency for most users.
  • Boxplot Insights:

    • The middle 50% of the data (interquartile range, or IQR) is relatively tight, with most users having between 2 to 5 website visits.
    • There are many outliers present beyond the upper whisker, with users having between 10 to 30 visits. These outliers likely correspond to highly engaged users who visit the website frequently.
  • Common Visit Patterns:

    • The highest frequency of users have 0 to 1 visit. This group dominates the data, with a significant drop in the number of users as visits increase beyond 2-3 visits.
    • There is a small secondary peak at 3 visits, indicating that some users visit the website multiple times, though the number drops off quickly after that.
  • Outliers and Heavy Users:

    • The boxplot reveals a significant number of outliers in the higher range of website visits (above 10). These users are considered outliers because they visit the website significantly more than the majority of users.
    • While most users visit the website only a few times, these outliers likely represent a small group of highly engaged users who visit the site often.

Feature: Page Views per Visit¶

In [21]:
# plot the distribution of the page_views_per_visit feature
enhanced_histogram_boxplot(data, "page_views_per_visit", kde = True, custom_title = "Page Views per Visit")
Statistical Summary for page_views_per_visit:
Mean: 3.03
Median: 2.79
Standard Deviation: 1.97
Skewness: 1.27
Kurtosis: 4.22

Outlier Analysis:
IQR method - Number of outliers: 257
IQR method - Percentage of outliers: 5.57%
IQR method - Outlier range: < -0.44 or > 6.27
Z-score method - Number of outliers (|z| > 3): 40
Z-score method - Percentage of outliers: 0.87%

Observations:

  • Right-Skewed Distribution:

    • The distribution of page_views_per_visit is right-skewed (positively skewed). This means that most users view a small number of pages during their visits, but a few users view significantly more pages, creating a tail on the right side.
    • The majority of users view around 2 to 4 pages per visit.
  • Mean vs. Median

    • The mean (green dashed line) is slightly higher than the median (black solid line), which is typical in right-skewed distributions. The outliers with very high page views are pulling the mean higher.
    • The median is just below 3 page views per visit, indicating that half of the users view fewer than 3 pages per visit.
  • Boxplot Insights:

    • The boxplot shows that the middle 50% of the data (the interquartile range, or IQR) is between 2 and 4 page views per visit.
    • There is a significant number of outliers on the right side, with users viewing as many as 12 to 17 pages per visit. These outliers indicate that some users are highly engaged and view many more pages than average.
  • Common Page View Patterns:

    • Most users view between 2 to 3 pages per visit, as shown by the peak in the histogram. After that, the number of users viewing more pages decreases sharply.
    • A secondary small peak is observed around 4-5 pages, indicating that there are some users who view a few more pages, but the overall frequency is lower.
  • Outliers and Heavy Page Viewers:

    • The presence of numerous outliers indicates that there are some highly engaged users who view significantly more pages per visit (more than 7).
    • These outliers may represent power users or highly interested leads who are exploring a lot of content during their visits.

Feature: Time Spent on Website¶

In [22]:
# plot the distribution of the time_spent_on_website feature
enhanced_histogram_boxplot(data, "time_spent_on_website", kde = True, custom_title = "Time Spent on Website")
Statistical Summary for time_spent_on_website:
Mean: 724.01
Median: 376.00
Standard Deviation: 743.83
Skewness: 0.95
Kurtosis: -0.58

Outlier Analysis:
IQR method - Number of outliers: 0
IQR method - Percentage of outliers: 0.00%
IQR method - Outlier range: < -1633.25 or > 3118.75
Z-score method - Number of outliers (|z| > 3): 0
Z-score method - Percentage of outliers: 0.00%

Observations:

  • Right-Skewed Distribution:

    • The distribution of time_spent_on_website is right-skewed (positively skewed), meaning most users spend a relatively short amount of time on the website, with a small number of users spending significantly more time, creating the long right tail.
    • The majority of users spend less than 500 seconds (around 8.5 minutes) on the website, while a smaller group of users spends much more time.
  • Mean vs. Median:

    • The mean time spent on the website (green dashed line) is greater than the median (black solid line), which is typical in right-skewed distributions where a few high values (users spending a lot of time on the website) pull the mean upwards.
    • The median is around 376 seconds (just over 6 minutes), indicating that 50% of users spend less than 6 minutes on the website.
  • Boxplot Insights:

    • The middle 50% of the data (interquartile range, or IQR) shows that most users spend between 148 and 1336 seconds on the website.
    • There are no clear outliers beyond the whiskers, although a small percentage of users spend upwards of 2000 seconds (over 30 minutes) on the website, as seen in the histogram's tail.
  • Common Time Spent Patterns:

    • The highest concentration of users spends less than 500 seconds on the website, which accounts for the bulk of the distribution.
    • There is a noticeable dip in users who spend between 500 and 1000 seconds, but some users pick up again, showing a smaller peak toward the higher end of the time range (around 1500 seconds).

Engaged Users:

  • A small subset of users spends significantly more time on the website (up to 2500 seconds or about 40 minutes). This group of highly engaged users may represent potential leads who are deeply interested in the content.

Feature: Status¶

In [23]:
# plot the distribution of the status feature
plot_binary_feature(data, "status", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Status")

Observations:

  • Imbalanced Target Variable (Status):

    • The target variable (status) is imbalanced.
    • 70.1% of the data points belong to the 0 category, indicating that these leads did not convert.
    • 29.9% of the data points belong to the 1 category, indicating that these leads converted to paying customers.
  • Class Imbalance:

    • The dataset is imbalanced, with more leads not converting (70.1%) compared to those who converted (29.9%).
    • While a conversion rate of approximately 30% is not extremely low, the imbalance could still pose a challenge for machine learning models, as models tend to favor the majority class unless steps like class balancing or resampling are taken.

Distribution of Categorical Features¶

The dataset contains the following categorical variables:

  • current_occupation
  • first_interaction
  • profile_completed
  • last_activity
  • print_media_type1
  • print_media_type2
  • digital_media
  • educational_channels
  • referral

Feature: Current Occupation¶

In [24]:
# plot the distribution of the current_occupation feature
plot_binary_feature(data, "current_occupation", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Status")

Observations:

  • Distribution of Occupations:

    • The majority of the leads (56.7%) are Professionals, indicating that over half of the leads come from this category.
    • Unemployed leads account for 31.2% of the total, making it the second-largest group.
    • Students represent the smallest group, accounting for 12.0% of the leads.
  • Imbalance in Occupation Groups:

    • There is a clear imbalance in the distribution of occupations, with Professionals dominating the dataset.
    • Unemployed and Student categories are significantly smaller compared to the Professional group, which may affect how different occupations contribute to conversion rates or engagement metrics.

Feature First Interaction¶

In [25]:
# plot the distribution of the first_interaction feature
plot_binary_feature(data, "first_interaction", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="First Interaction")

Observations:

  • Distribution of First Interaction Channels:

    • 55.1% of leads first interacted with the platform through the Website, making it the most common channel.
    • 44.9% of leads first interacted through the Mobile App, which also represents a significant portion of the leads.
  • Balanced Split Between Website and Mobile App:

    • While the Website is the more common first interaction channel, the split between the Website (55.1%) and Mobile App (44.9%) is relatively balanced.
    • This suggests that both channels are important for engaging leads and may require equal attention in terms of optimization and content delivery.

Feature: Profile Completed¶

In [26]:
# plot the distribution of the profile_completed feature
plot_binary_feature(data, "profile_completed", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Profile Completed")

Observations:

  • Distribution of Profile Completion Levels:

    • 49.1% of leads have a High profile completion level, meaning nearly half of the users have completed 75-100% of their profiles.
    • 48.6% of leads have a Medium profile completion level (50-75% profile completion).
    • Only 2.3% of leads have a Low profile completion level (0-50% profile completion).
  • High Completion Rates:

    • The vast majority of leads have completed either Medium or High levels of their profiles, which is a positive sign of engagement. Around 97.7% of users have at least 50% of their profile completed.
    • The Low completion group represents only a small portion (2.3%) of leads, suggesting that only a small minority of users fail to engage enough to complete even 50% of their profile.

Feature: Last Activity¶

In [27]:
# plot the distribution of the last_activity feature
plot_binary_feature(data, "last_activity", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Last Activity")

Observations:

  • Distribution of Last Activity Types:

    • 49.4% of leads had their last activity through Email Activity, making it the most common last interaction type.
    • 26.8% of leads had their last interaction through a Phone Activity.
    • 23.9% of leads had their last interaction via Website Activity.
  • Dominance of Email Activity:

    • The fact that Email Activity accounts for almost half of all last activities suggests that email is the most frequent communication channel for this lead population.
    • Phone Activity and Website Activity are also significant, though they are less frequent compared to email.

Feature: Print Media Type 1¶

In [28]:
# plot the distribution of the print_media_type1 feature
plot_binary_feature(data, "print_media_type1", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Print Media Type 1")

Observations:

  • Dominance of 'No' for Print Media Type 1:

    • A large majority, 89.2% of leads, did not interact with or see Print Media Type 1, ads in newspapers.
    • Only 10.8% of leads interacted with or saw Print Media Type 1.
  • Print Media Type 1 Underutilization:

    • Since most leads have not interacted with Print Media Type 1, it appears that this channel is underutilized or may not be as effective in reaching the target audience.

Feature: Print Media Type 2¶

In [29]:
# plot the distribution of the print_media_type2 feature
plot_binary_feature(data, "print_media_type2", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Print Media Type 2")

Observations:

  • Very Limited Use of Print Media Type 2:

    • An overwhelming 94.9% of leads were not exposed to Print Media Type 2, ads in magazines.
    • Only 5.1% of leads were exposed to Print Media Type 2, indicating that this channel has very limited reach in the dataset.
  • Underutilization of Print Media Type 2:

    • With only 5.1% exposure, Print Media Type 2 seems to have been used sparingly or has not effectively reached a large portion of the leads.

Feature: Digital Media¶

In [30]:
# plot the distribution of the digital_media feature
plot_binary_feature(data, "digital_media", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Digital Media")

Observations:

  • Limited Exposure to Digital Media:

    • 88.6% of leads were not exposed to Digital Media (e.g., ads on social media, online platforms, etc.).
    • Only 11.4% of leads were exposed to Digital Media, suggesting that this channel has limited reach among the leads.
  • Underutilization of Digital Media:

    • Despite the prominence of digital marketing in modern campaigns, it appears that Digital Media exposure is relatively low in this dataset. Only a small portion of leads interacted with digital channels.

Feature: Educational Channels¶

In [31]:
# plot the distribution of the educational_channels feature
plot_binary_feature(data, "educational_channels", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Educational Channels")

Observations:

  • Limited Exposure to Educational Channels:

    • 84.7% of leads were not exposed to Educational Channels (e.g., online forums, discussion threads, educational websites, etc.).
    • Only 15.3% of leads were exposed to Educational Channels.
  • Underutilization of Educational Channels*:

    • With only 15.3% of leads exposed to these channels, it appears that educational platforms have been underutilized as a lead generation tool. Educational channels could be an opportunity for growth, given that they are often considered trusted sources of information for users looking to learn.

Feature: Referral¶

In [32]:
# plot the distribution of the referral feature
plot_binary_feature(data, "referral", colors=['#0073a3', '#5e6d77'], show_title=True, custom_title="Referral")

Observations:

  • Very Limited Use of Referral Channel:

    • A vast majority, 98.0% of leads, did not come through referrals.
    • Only 2.0% of leads were referred by others, indicating that referrals play a very small role in lead generation in this dataset.
  • Underutilization of Referrals:

    • The referral channel is heavily underutilized, with only a small fraction of leads coming through this method. Given that referrals can often bring high-quality, engaged leads, this represents a potential area for growth.

Question 1:¶

Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status?

Conversion Rates

In [33]:
# Plot status counts

plt.figure(figsize = (10, 6))
ax = sns.countplot(x = 'status', data = data, palette='tab10')

# Annotating the exact count on the top of the bar for each category
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()+ 0.35))
In [34]:
# Calculate conversion percentages
conversion_percentages = round(data['status'].value_counts(normalize=True) * 100, 2)
print(conversion_percentages)
status
0    70.14
1    29.86
Name: proportion, dtype: float64

Lead Status

In [35]:
# Group by 'current_occupation' and 'status' to get the counts for each occupation
occupation_counts = data.groupby(['current_occupation', 'status']).size().unstack()

# Define the correct x-axis labels
occupation_labels = ['Professional', 'Student', 'Unemployed']

# Create the figure for the stacked bar plot with proper labels
plt.figure(figsize=(10, 6))

# Plotting status 0 at the bottom and status 1 on top for each occupation with overridden x-axis labels
plt.bar(occupation_labels, occupation_counts[0], label='0', color='blue')
plt.bar(occupation_labels, occupation_counts[1],
        bottom=occupation_counts[0], label='1', color='orange')

# Adding labels and title
plt.title('Lead Status by Current Occupation')
plt.xlabel('Current Occupation')
plt.ylabel('Count of Leads')
plt.legend(title='Lead Status')

# Adding annotations for each bar (status 0 and status 1 counts)
for i, occupation in enumerate(occupation_labels):
    plt.text(i, occupation_counts[0][i] / 2, f'{occupation_counts[0][i]}', ha='center', color='white')
    plt.text(i, occupation_counts[0][i] + (occupation_counts[1][i] / 2), f'{occupation_counts[1][i]}', ha='center', color='black')

plt.tight_layout()
plt.show()

Observations:

Professionals:

  • 1,687 leads with status 0.
  • 929 leads with status 1.
  • Professionals have the largest number of total leads, with a significant imbalance compared to the other occupations. They account for a much larger portion of the overall lead count.
  • The conversion rate for professionals (those in status 1) is about 35.5%, making this group one of the most responsive in terms of conversion, despite the large number of leads.

Students:

  • 490 leads with status 0.
  • 65 leads with status 1.
  • Students represent the smallest lead group, highlighting a large data imbalance when compared to professionals and unemployed individuals.
  • With a conversion rate of 11.7%, this group has the lowest responsiveness. The small number of leads makes it difficult to draw broader conclusions, but this group appears less likely to convert.

Unemployed:

  • 1,058 leads with status 0.
  • 383 leads with status 1.
  • The unemployed group falls in between professionals and students in terms of total lead count. The conversion rate for this group is 26.6%.
  • There is also a notable imbalance, as the total number of leads is significantly smaller than that of professionals but larger than students.

Key Insights:

  • The lead count is heavily skewed toward professionals, who represent the largest portion of the leads.

  • Students form the smallest group, which could affect the robustness of insights drawn from their data.

  • The unemployed group serves as a middle ground in terms of lead count but still shows a notable difference when compared to professionals.

We have now explored how current occupation affects lead status.

Question 2:¶

The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status?

First Channels of Interaction

In [36]:
# Group the data by first_interaction and status, and calculate the counts
interaction_counts = data.groupby(['first_interaction', 'status']).size().unstack()

# Create a stacked bar plot
ax = interaction_counts.plot(kind='bar', stacked=True, figsize=(10, 6))

plt.title('Lead Status by First Interaction Channel')
plt.xlabel('First Interaction Channel')
plt.ylabel('Count of Leads')

# Override x-axis labels explicitly with desired labels
ax.set_xticks([0, 1])  # Set positions for the ticks (number of unique values in 'first_interaction')
ax.set_xticklabels(['Mobile App', 'Website'], rotation=45)  # Manually set labels

# Add the legend with proper labels for lead status
plt.legend(title='Lead Status', labels=['0', '1'])

# Add annotations on top of each bar
for container in ax.containers:
    ax.bar_label(container, label_type='center')

plt.tight_layout()

# Show the plot
plt.show()

Observations:

Mobile App Interaction:

  • There are 1,852 inactive leads (status = 0) and only 218 active leads (status = 1) whose first interaction was through the Mobile App.
  • This suggests that leads whose first interaction was via the Mobile App are much more likely to be inactive, with the vast majority (approximately 89%) not converting to active status.

Website Interaction:

  • For leads whose first interaction was through the Website, there are 1,383 inactive leads and 1,159 active leads.
  • The distribution here is much more balanced, with about 45.6% of Website leads being active. This suggests a much better conversion rate compared to Mobile App interactions.

Key Insights:

  • Website interaction seems to be significantly more effective at converting leads to active status, with a roughly equal split between inactive and active leads.

We have now explored how the initial channel of interaction affects lead status.

Question 3:¶

The company uses multiple modes to interact with prospects. Which way of interaction works best?

In [37]:
# Group by 'current_occupation' and 'status' to get the counts for each occupation
interaction_counts = data.groupby(['last_activity', 'status']).size().unstack()

# Define the correct x-axis labels
occupation_labels = ['Website Activity', 'Email Activity', 'Phone Activity']

# Create the figure for the stacked bar plot with proper labels
plt.figure(figsize=(10, 6))

# Plotting status 0 at the bottom and status 1 on top for each occupation with overridden x-axis labels
plt.bar(occupation_labels, interaction_counts[0], label='0', color='blue')
plt.bar(occupation_labels, interaction_counts[1],
        bottom=interaction_counts[0], label='1', color='orange')

# Adding labels and title
plt.title('Lead Status by Mode of Communication')
plt.xlabel('Mode of Communication')
plt.ylabel('Count of Leads')
plt.legend(title='Lead Status')

# Adding annotations for each bar (status 0 and status 1 counts)
for i, occupation in enumerate(occupation_labels):
    plt.text(i, interaction_counts[0][i] / 2, f'{interaction_counts[0][i]}', ha='center', color='white')
    plt.text(i, interaction_counts[0][i] + (interaction_counts[1][i] / 2), f'{interaction_counts[1][i]}', ha='center', color='black')

plt.tight_layout()
plt.show()
In [38]:
# Calculate conversion rates
conversion_rates = round(interaction_counts.apply(lambda row: row[1] / (row[0] + row[1]), axis=1) * 100, 2)
conversion_rates
Out[38]:
0
last_activity
0 30.33
1 21.31
2 38.45

Observations:

Email Activity:

  • 1,587 leads with status 0.
  • 691 leads with status 1.
  • The conversion rate for email activity is 30.33%.

Phone Activity:

  • 971 leads with status 0.
  • 263 leads with status 1.
  • The conversion rate for phone activity is 21.33%.

Website Activity:

  • 677 leads with status 0.
  • 423 leads with status 1.
  • The conversion rate for website activity is 38.46%.

Key Insights:

  • Email Activity has the highest total number of leads, with a 30.33% conversion rate.

  • Phone Activity has a lower conversion rate of 21.33%, making it the least effective communication method for conversion.

  • Website Activity has the highest conversion rate of 38.46%.

We have now explored how modes of communication affects lead status.

Question 4:¶

The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels have the highest lead conversion rate?

In [39]:
# Define lead sources
sources = ['print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels', 'referral']

# Group by 'status'. Select the source columns and convert them to numeric type before summing.
media_sources = data.groupby('status')[sources].apply(lambda x: x.apply(pd.to_numeric, errors='coerce').sum()).T

# Define the correct x-axis labels
source_labels = ['Print Media Type 1', 'Print Media Type 2', 'Digital Media', 'Educational Channels', 'Referral']

# Create the figure for the stacked bar plot with proper labels
plt.figure(figsize=(10, 6))

# Plotting status 0 at the bottom and status 1 on top for each source with overridden x-axis labels
plt.bar(source_labels, media_sources[0], label='0', color='blue')
plt.bar(source_labels, media_sources[1],
        bottom=media_sources[0], label='1', color='orange')

# Adding labels and title
plt.title('Lead Status by Mode of Communication')
plt.xlabel('Mode of Communication')
plt.ylabel('Count of Leads')
plt.legend(title='Lead Status')

# Adding annotations for each bar (status 0 and status 1 counts)
for i, source in enumerate(source_labels):
    plt.text(i, media_sources[0][i] / 2, f'{media_sources[0][i]}', ha='center', color='white')
    plt.text(i, media_sources[0][i] + (media_sources[1][i] / 2), f'{media_sources[1][i]}', ha='center', color='black')

plt.tight_layout()
plt.show()
In [40]:
# Calculate conversion rates
conversion_rates = round(media_sources.apply(lambda row: row[1] / (row[0] + row[1]), axis=1) * 100, 2)
conversion_rates
Out[40]:
0
print_media_type1 31.99
print_media_type2 32.19
digital_media 31.88
educational_channels 27.94
referral 67.74

Observations:

Print and Digital Media Channels

  • Print Media Type 1: 31.99% conversion rate
  • Print Media Type 2: 32.19% conversion rate
  • Digital Media: 31.88% conversion rate

Insight:

  • The consistency across print and digital media channels suggests that both traditional and online advertising strategies are equally effective in driving conversions. This uniformity indicates that your target audience engages similarly across these mediums, allowing for flexibility in allocating marketing budgets between print and digital platforms without compromising conversion performance.

Educational Channels

  • Educational Channels: 27.94% conversion rate

Insight:

  • The marginally lower conversion rate in educational channels could be attributed to factors such as the nature of the content, audience intent, or the effectiveness of the call-to-action. This presents an opportunity to optimize these channels by enhancing content relevance, improving engagement strategies, or refining targeting parameters to elevate conversion rates closer to those of print and digital media.

Referral

  • Referral: 67.74% conversion rate

    Insight:

    • The exceptionally high conversion rate for referrals underscores the effectiveness of leveraging existing clients to attract new ones. Referrals benefit from built-in trust and credibility, as recommendations from satisfied customers are more persuasive. To capitalize on this strength, enhancing referral programs—such as offering incentives for referrals, simplifying the referral process, and actively encouraging satisfied customers to share their positive experiences—can further amplify conversion rates and contribute positively to the overall bottom line.

Question 5:¶

People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?

In [41]:
# Group by 'profile_completed' and 'status' to get the counts for each profile completion level
profile = data.groupby(['profile_completed', 'status']).size().unstack()

# Calculate conversion rates for sorting
# Assuming 'status' == 1 indicates a conversion
profile['conversion_rate'] = profile[1] / (profile[0] + profile[1])

# Sort the profile_completed levels by conversion_rate in descending order
profile_sorted = profile.sort_values(by='conversion_rate', ascending=False)

# Extract the sorted profile_completed labels
profile_labels = profile_sorted.index.tolist()

# Create the figure for the stacked bar plot with proper labels
plt.figure(figsize=(10, 6))

# Plotting status 0 at the bottom and status 1 on top for each profile_completed with sorted labels
plt.bar(profile_labels, profile_sorted[0], label='0', color='blue')
plt.bar(profile_labels, profile_sorted[1],
        bottom=profile_sorted[0], label='1', color='orange')

# Adding labels and title
plt.title('Lead Conversion Status by Profile Completion Level')
plt.xlabel('Profile Completion Level')
plt.ylabel('Count of Leads')
plt.legend(title='Lead Status')

# Adding annotations for each bar (status 0 and status 1 counts)
for i, label in enumerate(profile_labels):
    plt.text(i, profile_sorted[0][i] / 2, f'{profile_sorted[0][i]}', ha='center', color='white')
    plt.text(i, profile_sorted[0][i] + (profile_sorted[1][i] / 2), f'{profile_sorted[1][i]}', ha='center', color='black')

plt.tight_layout()
plt.show()
In [42]:
# Calculate conversion rates
conversion_rates = round(profile.apply(lambda row: row[1] / (row[0] + row[1]), axis=1) * 100, 2)
conversion_rates
Out[42]:
0
profile_completed
0 41.78
1 7.48
2 18.88

Observations:

High Profile Completion

  • Conversion Rate: 41.78%

Insight:

  • A high level of profile completion significantly enhances the likelihood of conversion. This suggests that users who invest time and effort into providing comprehensive personal information are more engaged and committed to the conversion process.

Potential Factors Contributing to High Conversion:

  • Enhanced Personalization: Detailed profiles may enable more personalized experiences and targeted offerings, increasing user satisfaction and conversion chances.

  • Increased Commitment: The act of completing a profile may serve as a commitment device, making users more likely to follow through with conversions.

  • Trust and Credibility: Providing extensive personal information can indicate a higher level of trust in the platform, fostering a conducive environment for conversions.

Medium Profile Completion

  • Conversion Rate: 18.88% Users with moderately completed profiles have a conversion rate of 18.88%, which is nearly half that of users with high profile completion.

Insight:

  • While there is a positive correlation between profile completeness and conversion rates, medium completeness indicates room for improvement. Users in this category are partially engaged, and optimizing their profile completion could bridge the gap to higher conversion rates.

Potential Improvement Areas:

  • User Experience Enhancements: Simplifying the profile completion process or providing incentives for completing profiles can encourage users to provide more information.

  • Targeted Communication: Implementing targeted prompts or reminders can motivate users to complete their profiles, thereby potentially increasing conversion rates.

  • Value Proposition Clarification: Clearly communicating the benefits of completing a profile may encourage users to provide more detailed information.

Low Profile Completion

  • Conversion Rate: 7.48%, significantly lower compared to higher profile completion levels.

Insight:

  • A low level of profile completion correlates with minimal engagement and a lower propensity to convert. These users may perceive the profile creation process as cumbersome or may lack sufficient motivation to complete their profiles.

Potential Challenges:

  • Perceived Invasiveness: Users might find extensive data collection intrusive, leading to incomplete profiles and reduced conversion likelihood.

  • User Drop-Offs: Lengthy or complicated profile forms can deter users from completing their profiles, resulting in lower engagement and conversions.

  • Lack of Immediate Incentives: Without clear, immediate benefits for completing profiles, users may not see the value in providing additional information.

Key Insights

  • Positive Correlation: There is a clear positive correlation between profile completeness and conversion rates. As profile completeness increases from low to high, conversion rates rise significantly from 7.48% to 41.78%.

  • Engagement Levels: Higher profile completion levels indicate greater user engagement and commitment, which are critical drivers of conversions.

  • Opportunity for Optimization: Users with medium and low profile completion levels present substantial opportunities for targeted interventions to enhance their engagement and conversion rates.

Bivariate Analysis¶

Correlation Heatmap¶

In [43]:
# Create a correlation matrix (numerical vlaues only)

# Exclude non-numeric columns
numeric_data = data.select_dtypes(include=np.number)

# Create heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(numeric_data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="vlag")
plt.title("Correlation Heatmap")
plt.show()

Observations:

Target feature correlations:

  • Correlation between status and age: 0.12

    • There is a weak positive correlation between age and status. As age increases, status may slightly increase as well, but the relationship is not strong.
  • Correlation between status and website_visits: -0.01

    • There is a near-zero negative correlation between website visits and status, indicating that there's virtually no relationship between the number of website visits and status.
  • Correlation between status and time_spent_on_website: 0.30

    • There is a moderate positive correlation between time spent on the website and status, suggesting that users who spend more time on the website are more likely to have a higher status.
  • Correlation between status and page_views_per_visit: 0.00

    • There is no correlation between page views per visit and status, meaning that the number of pages viewed per visit does not relate to the user's status.

Other feature correlations:

  • age has almost no correlation with website_visits (-0.01), time_spent_on_website (0.02), and page_views_per_visit (-0.04). These relationships are very weak and do not show any meaningful patterns.

  • website_visits has weak positive correlations with time_spent_on_website (0.06) and page_views_per_visit (0.07). Both relationships are minimal, showing only slight increases in time spent and page views as website visits increase.

  • time_spent_on_website has a weak positive correlation with page_views_per_visit (0.07), indicating a minimal relationship between the time spent on the website and the number of pages viewed per visit.

  • page_views_per_visit has very weak positive correlations with both website_visits (0.07) and time_spent_on_website (0.07), indicating minimal relationships.

Feature: Status vs Age¶

In [44]:
# Generate boxplot 'status' vs 'age'
plot_boxplot(
            x_column='status',
            y_column='age',
            data=dataset,
            title='Status vs Age'
            )

Observations:

Median Age:

  • For status 0 (left box), the median age is 49.
  • For status 1 (right box), the median age is 54.

Interquartile Range (IQR):

  • The age distribution for status 0 spans a range from 33 to 57.
  • The IQR for status 1 is 41 to 58, indicating a narrower range.

Whiskers (Range of Ages):

  • The whiskers extend from 18 to 63 for both statuses, meaning the age range is consistent across both groups.

Outliers:

  • There are no extreme outliers for either group in this plot.

Interpretation:

  • Age Differences by Status: The median age for status 1 is higher than for status 0. However, the IQR for status 1 is more concentrated around the higher median, indicating less variability in the age distribution compared to status 0.

  • Potential Age Concentration: Individuals with status 1 have a tighter age distribution, whereas status 0 leads have a wider age range, suggesting greater variability.

Feature: Status vs Current Occupation¶

In [45]:
# Generate count plot of 'status' vs 'current_occupation'
plot_countplot(
              x_column='status',
              hue_column='current_occupation',
              data=dataset,
              title='Lead Status vs Current Occupation'
              )

Observations:

  • Professional:

    • Professionals are the largest group in both status 0 (1,687 leads) and status 1 (929 leads), but there are more in the status 0 group.
  • Unemployed:

    • The unemployed group has 1,058 leads in status 0 and 383 leads in status 1.
    • There are fewer unemployed users in status 1 compared to status 0, contradicting the original observation. This means that unemployed individuals do not achieve status 1 more frequently than status 0.
  • Students:

    • Students have 490 leads in status 0 and 65 leads in status 1.
    • The majority of students are found in the status 0 group, indicating that students rarely achieve status 1.

Interpretation:

  • Professionals are the most represented group in both status categories, but a larger portion of them are in the status 0 group.
  • Unemployed individuals are more often found in status 0, and fewer of them reach status 1, contradicting the earlier interpretation.
  • Students are predominantly associated with status 0, indicating that they may need more engagement or resources to reach status 1.

Feature: Status vs. First Interaction¶

In [46]:
# Generate count plot of 'status' vs 'first_interaction'
plot_countplot(
              x_column='status',
              hue_column='first_interaction',
              data=dataset,
              title='Lead Status vs First Interaction'
              )

Observations:

Website Interaction:

  • Website is the dominant first interaction method for both status 0 (1,383 leads) and status 1 (1,159 leads).
  • Status 1 users have a relatively higher proportion of website interactions compared to status 0 users, indicating that website interaction is more common among those who achieve status 1.

Mobile App Interaction:

  • Mobile app interaction is less common overall, with 1,852 leads in status 0 and 218 leads in status 1.
  • A larger proportion of status 0 users first interacted through the mobile app, suggesting that this interaction method is less likely to lead to achieving status 1.

Interpretation:

  • Website interaction is more strongly associated with users who achieve status 1, while mobile app interaction is more prevalent among status 0 users.
  • This suggests that users who first interact via the website may have a higher chance of reaching status 1, while those who use the mobile app may require additional engagement strategies to improve their outcomes.

Feature: Status vs. Profile Completed¶

In [47]:
# Generate count plot of 'status' vs 'profile_completed'
plot_countplot(
              x_column='status',
              hue_column='profile_completed',
              data=dataset,
              title='Lead Status vs Profile Completed'
              )

Observations:

  • High Profile Completion:

    • Users with high profile completion are found in both status 0 and status 1 groups.
    • There is a relatively even distribution between the two groups, indicating that high profile completion is not strongly biased toward either status.
  • Medium Profile Completion:

    • Medium profile completion is more frequently associated with status 0 users.
    • There are significantly fewer users with medium profile completion in the status 1 group, suggesting that users with medium completion are less likely to achieve status 1.
  • Low Profile Completion:

    • Low profile completion is almost entirely associated with status 0.
    • There are no users with low profile completion in the status 1 group, indicating that achieving status 1 is highly unlikely for users with low profile completion.

Interpretation:

  • High profile completion is common across both status groups, but medium and low profile completion are more heavily associated with status 0.
  • Users with low profile completion are almost exclusively in the status 0 group, suggesting that improving profile completion may be a key factor in achieving status 1.

Feature: Status vs. Website Visits¶

In [48]:
# Generate boxplot with 'status' vs 'website_visits'
plot_boxplot(
            x_column='status',
            y_column='website_visits',
            data=dataset,
            title='Status vs Website Visits'
            )

Observations:

Median Website Visits:

  • Both status 0 and status 1 groups have a median of 3 website visits.

Interquartile Range (IQR):

  • The IQR for both groups is similar, with 25% of users having 2 website visits and 75% of users having 5 website visits.

Outliers:

  • There are multiple outliers in both status 0 and status 1 groups. The maximum number of visits for status 0 is 30 visits, while for status 1 it is 25 visits.
  • The status 0 group shows more extreme outliers, suggesting that some users visit the website more frequently in this category.

Interpretation:

  • Both status 0 and status 1 groups exhibit similar behavior in terms of median website visits.
  • The presence of numerous outliers, particularly in the status 0 group, indicates that some users visit the website much more frequently, though this behavior is less common in the status 1 group.

Feature: Status vs. Time Spent on Website¶

In [49]:
# Generate boxplot with 'status' vs 'time_spent_on_website' on the y-axis
plot_boxplot(
            x_column='status',
            y_column='time_spent_on_website',
            data=dataset,
            title='Status vs Time Spent on Website'
            )

Observations:

Median Time Spent:

  • Users with status 0 spend a median of approximately 317 minutes on the website.
  • Users with status 1 spend significantly more time on the website, with a median of 789 minutes.

Interquartile Range (IQR):

  • For status 0 users, the IQR is between 88 minutes and 646 minutes, indicating a more concentrated time spent.
  • For status 1 users, the IQR is much wider, between 390 minutes and 1,829 minutes, showing more variation in time spent among this group.

Outliers:

  • Both groups have outliers, with the maximum time spent being 2,531 minutes for status 0 and 2,537 minutes for status 1.

Interpretation:

  • Status 1 users spend more time on the website than status 0 users, with a higher median and a much wider range of time spent.
  • Status 0 users have a more concentrated range of time spent but exhibit frequent extreme outliers.

Feature: Status vs. Page Views per Visit¶

In [50]:
# Generate the boxplot with 'status' vs 'page_views_per_visit'
plot_boxplot(
            x_column='status',
            y_column='page_views_per_visit',
            data=dataset,
            title='Status vs Page Views per Visit'
            )

Observations:

Median Page Views per Visit:

  • Status 1 users have a median of 2.94 page views per visit.
  • Status 0 users have a median of 2.71 page views per visit, meaning that status 1 users tend to view slightly more pages per visit than status 0 users.

Interquartile Range (IQR):

  • The IQR for status 1 is from 2.08 to 3.73, indicating a relatively narrow range of page views per visit.
  • The IQR for status 0 is from 2.07 to 3.77, showing a similar range of page views per visit but with slightly more variability.

Outliers:

  • Status 0 users show more extreme outliers, with a maximum of 18.43 page views per visit.
  • Status 1 users also have outliers but with a lower maximum of 13.66 page views per visit.

Interpretation:

  • Status 1 users view slightly more pages per visit on average and show consistent behavior with fewer extreme outliers.
  • Status 0 users view slightly fewer pages per visit but have more variability and more extreme outliers.

This suggests that page views per visit are quite similar between the two groups, though status 0 users tend to exhibit more variability.

Feature: Status vs. Last Activity¶

In [51]:
# Generate count plot of 'status' vs 'last_activity'
plot_countplot(
              x_column='status',
              hue_column='last_activity',
              data=dataset,
              title='Lead Status vs Last Activity'
              )

Observations:

Website Activity:

  • Website Activity is more common in the status 0 group (677 leads) compared to status 1 (423 leads), contrary to the original observation.
  • This suggests that Website Activity is not necessarily more strongly associated with users who achieve status 1.

Email Activity:

  • Email Activity is more frequent for status 0 users (1,587 leads) compared to status 1 users (691 leads).
  • This supports the idea that email engagement is more common among users who do not achieve status 1.

Phone Activity:

  • Phone Activity is present for 971 status 0 users and 263 status 1 users.
  • This activity is relatively less frequent and does not show a strong trend favoring either status group.

Interpretation:

  • Email Activity is more prevalent among status 0 users, while Website Activity appears more common for status 1 users.
  • Phone Activity is less common overall and does not favor either group strongly in terms of association with achieving status 1.

Feature: Status vs. Print Media Type 1¶

In [52]:
# Generate count plot of 'status' vs 'print_media_type1'
plot_countplot(
              x_column='status',
              hue_column='print_media_type1',
              data=dataset,
              title='Status vs Print Media Type 1',
              hue_order=['Yes', 'No']
              )

Observations:

Website Activity:

  • Website Activity is more common in the status 0 group (677 leads) compared to status 1 (423 leads).
  • This suggests that Website Activity is not necessarily more strongly associated with users who achieve status 1.

Email Activity:

  • Email Activity is more frequent for status 0 users (1,587 leads) compared to status 1 users (691 leads).
  • This supports the idea that email engagement is more common among users who do not achieve status 1.

Phone Activity:

  • Phone Activity is present for 971 status 0 users and 263 status 1 users.
  • This activity is relatively less frequent and does not show a strong trend favoring either status group.

Interpretation:

  • Email Activity is more prevalent among status 0 users, while Website Activity appears more common for status 1 users.
  • Phone Activity is less common overall and does not favor either group strongly in terms of association with achieving status 1.

Feature: Status vs. Print Media Type 2¶

In [53]:
# Generate count plot of 'status' vs 'print_media_type2'
plot_countplot(
              x_column='status',
              hue_column='print_media_type2',
              data=dataset,
              title='Status vs Print Media Type 2',
              hue_order=['Yes', 'No']
              )

Observations:

  • Print Media Type 2 - Yes:

    • A small number of users in both status 0 and status 1 groups engaged with print media type 2.
    • The proportion of users who engaged with print media type 2 is low in both statuses, though it is slightly more common in status 0.
  • Print Media Type 2 - No:

    • The majority of users in both status 0 and status 1 groups did not engage with print media type 2.
    • There is a significant number of users in both groups who did not interact with print media type 2, with status 0 having a larger count.

Interpretation:

  • Most users in both status 0 and status 1 groups did not engage with print media type 2.
  • Engagement with print media type 2 does not seem to be a strong indicator of achieving status 1, as only a small fraction of users in both groups engaged with it.

Feature: Status vs. Digital Media¶

In [54]:
# Generate count plot of 'status' vs 'digital_media'
plot_countplot(
              x_column='status',
              hue_column='digital_media',
              data=dataset,
              title='Status vs Digital Media',
              hue_order=['Yes', 'No']
              )

Observations:

Print Media Type 2 - Yes:

  • A small number of users in both status 0 (158 leads) and status 1 (75 leads) groups engaged with print media type 2.
  • The proportion of users who engaged with print media type 2 is low in both statuses, though it is slightly more common in status 0.

Print Media Type 2 - No:

  • The majority of users in both status 0 (3,077 leads) and status 1 (1,302 leads) groups did not engage with print media type 2.

Interpretation:

  • Most users in both status 0 and status 1 groups did not engage with print media type 2.
  • Engagement with print media type 2 does not seem to be a strong indicator of achieving status 1, as only a small fraction of users in both groups engaged with it.

Feature: Status vs. Educational Channels¶

In [55]:
# Generate count plot of 'status' vs 'educational_channels'
plot_countplot(
              x_column='status',
              hue_column='educational_channels',
              data=dataset,
              title='Status vs Educational Channels',
              hue_order=['Yes', 'No']
              )

Observations:

  • Educational Channels - Yes:

    • A smaller proportion of users in both status 0 and status 1 engaged with educational channels (represented by the blue bars).
    • Engagement with educational channels is slightly more common among status 0 users (a small number of blue bars in the status 0 group compared to the status 1 group).
  • Educational Channels - No:

    • The majority of users in both status 0 and status 1 groups did not engage with educational channels (represented by the orange bars).
    • Status 0 has a larger number of users who did not engage with educational channels compared to status 1.

Interpretation:

  • Educational channel engagement is low for both status groups, with most users not engaging with educational channels.
  • A slightly higher number of status 0 users engaged with educational channels than status 1 users, but overall, educational channels don't appear to be a strong differentiator between the two statuses.

Feature: Status vs. Referral¶

In [56]:
# Generate count plot of 'status' vs 'referral'
plot_countplot(
              x_column='status',
              hue_column='referral',
              data=dataset,
              title='Status vs Referal',
              hue_order=['Yes', 'No']
              )

Observations:

Educational Channels - Yes:

  • A smaller proportion of users in both status 0 (508 leads) and status 1 (197 leads) engaged with educational channels.
  • Engagement with educational channels is slightly more common among status 0 users.

Educational Channels - No:

  • The majority of users in both status 0 (2,727 leads) and status 1 (1,180 leads) did not engage with educational channels.

Interpretation:

  • Educational channel engagement is low for both status groups, with most users not engaging with educational channels.
  • A slightly higher number of status 0 users engaged with educational channels than status 1 users, but overall, educational channels don't appear to be a strong differentiator between the two statuses.

Feature: Age vs Website Visits¶

In [57]:
# plot 'age' vs 'website_visits'

# Create age bins for better grouping
dataset['age_group'] = pd.cut(dataset['age'], bins=[10, 20, 30, 40, 50, 60, 70], labels=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70'])

# Generate boxplot
plot_boxplot(
            x_column='age_group',
            y_column='website_visits',
            data=dataset,
            title='Website Visits by Age Group'
            )

Observations:

  • Age and Website Visits:

    • The correlation between age and website visits is negligible with a correlation coefficient of -0.01, indicating no meaningful relationship between the two variables.
  • Boxplot Insights:

    • The median number of website visits remains relatively consistent across the different age groups. There are no significant changes in the number of visits as users get older, confirming that age does not appear to influence the frequency of website visits.
    • The interquartile ranges (IQRs) for all age groups are fairly similar, showing that the middle 50% of website visits for each group fall within a comparable range. This further supports the conclusion that age does not play a major role in determining the number of website visits.
  • Outliers:

    • Outliers are present in every age group, indicating that a small number of users across all age groups visit the website far more frequently than the typical user. However, these outliers are evenly distributed across all age groups, which means there is no age group that stands out in terms of more frequent or fewer outliers.

Interpretation:

  • The data indicates that age does not impact how frequently users visit the website. Whether a user is in their 20s, 30s, 40s, or older, their website visit patterns are similar to those of users in other age groups.

  • The distribution of visits remains consistent across age groups, with no particular age group exhibiting notably different behavior in terms of visit frequency. The presence of outliers suggests that for all age groups, there are some highly active users, but their distribution across ages is consistent.

Feature: Age vs Time Spent on Website¶

In [58]:
# Generate the boxplot for 'age' vs 'website_visits'

# Create age bins for better grouping
dataset['age_group'] = pd.cut(dataset['age'], bins=[10, 20, 30, 40, 50, 60, 70], labels=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70'])

# Generate boxplot
plot_boxplot(
            x_column='age_group',
            y_column='website_visits',
            data=dataset,
            title='Website Visits by Age Group'
            )

Observations:

  • Age and Time Spent on Website:

    • Age and time spent on the website show a weak positive correlation (0.02). This suggests that age has little to no impact on the time individuals spend on the website, as the correlation is too small to be considered meaningful.
  • Meaning of the Correlation:

    • The very low correlation value (0.02) indicates that age does not play a significant role in determining the time spent on the website. While there may be small variations in the time spent by different age groups, this relationship is not strong enough to be considered meaningful.
  • Distribution Across Age Groups:

    • Despite the minimal correlation, there is variability in how different age groups spend time on the website. However, older users do not consistently spend more time on the website compared to younger users, and any differences are likely due to other factors not directly related to age.

Interpretation:

  • Although there is a slight positive correlation between age and time spent on the website, the effect is so small that it is likely due to randomness or external factors. The general trend is that age does not strongly predict how long users spend on the website.

Feature: Age vs Page Views per Visit¶

In [59]:
# Generate the boxplot for 'age' vs 'page_views_per_visit'

# Create age bins for better grouping
dataset['age_group'] = pd.cut(dataset['age'], bins=[10, 20, 30, 40, 50, 60, 70], labels=['10-20', '20-30', '30-40', '40-50', '50-60', '60-70'])

# Generate boxplot
plot_boxplot(
            x_column='age_group',
            y_column='page_views_per_visit',
            data=dataset,
            title='Page Views per Visits by Age Group'
            )

Observations:

  • Age and Page Views per Visit:

    • Age and page views per visit show a weak negative correlation of -0.04, indicating that there is no meaningful relationship between these two variables. The correlation is too small to suggest any significant impact of age on the number of pages viewed per visit.
  • Boxplot Insights:

    • The boxplot reveals that the median number of pages viewed per visit remains fairly consistent across the different age groups, with no notable changes or trends as age increases.
    • The interquartile ranges (IQRs) for most age groups are similar, suggesting that the number of pages viewed per visit does not vary significantly across different age groups. This confirms the lack of a strong relationship between age and engagement in terms of page views per visit.
  • Outliers:

    • There are outliers in all age groups, with some users viewing significantly more pages per visit than the typical user. However, these outliers are evenly distributed across age groups, indicating that no specific age group consistently views more pages per visit.

Interpretation:

  • The data suggests that age does not play a major role in determining how many pages a user views per visit. Users from all age groups exhibit similar behavior in terms of page views per visit, with no distinct patterns emerging as age increases or decreases.

  • Outliers in each age group represent users who view many more pages per visit than others, but these outliers are not concentrated in any one age group.

Feature: Website Visits vs Time Spent on Website¶

In [60]:
# Generate boxplot for 'website_visits' vs 'time_spent_on_website'
plot_boxplot(
            x_column='website_visits',
            y_column='time_spent_on_website',
            data=dataset,
            title='Website Visits by Time Spetn on Website'
            )

Observations:

  • Website visits and time spent on the website have a weak positive correlation (0.06).

  • The scatter plot shows a slight upward trend, suggesting that users who visit the website more frequently tend to spend more time overall, but the relationship is not strong.

  • Most users fall within a moderate range of time spent on the website, regardless of how many visits they make.

  • Website Visits and Time Spent on the Website:

    • Website visits and time spent on the website exhibit a weak positive correlation of 0.06. This indicates that while there is a slight tendency for users who visit the website more often to spend more time on it, the correlation is too weak to imply a strong relationship.
  • Boxplot Insights:

    • The boxplot illustrates that the median time spent on the website shows only minor increases with the number of website visits. Users with higher visit counts do not consistently spend significantly more time on the site compared to those with fewer visits.
    • The interquartile ranges (IQRs) for different visit counts overlap substantially, suggesting that time spent on the website remains relatively similar across various levels of website visits.
  • Outliers:

    • The presence of outliers is evident, with some users visiting the site infrequently yet spending substantial amounts of time. Conversely, certain users with many visits exhibit very short session durations. This variability highlights that individual user behavior can significantly diverge from overall trends.

Interpretation:

  • Although a slight upward trend in time spent per visit exists, the weak correlation signifies that visit frequency alone is not a reliable predictor of the amount of time spent on the website.

  • Most users, irrespective of their visit frequency, tend to spend a moderate amount of time on the site, suggesting that other factors likely influence engagement more than just the number of visits.

  • The variability in user behavior, indicated by outliers, suggests that understanding user motivations or specific content may be crucial for enhancing engagement strategies.

Feature: Website Visits vs Page Views per Visit¶

In [61]:
# plot for 'website_visits' vs 'page_views_per_visit'
plot_boxplot(
            figsize=(15, 10),
            x_column='website_visits',
            y_column='page_views_per_visit',
            data=dataset,
            title='Website Visits by Page Views per Visit'
            )

Observations:

  • Website Visits and Time Spent on the Website:

    • Website visits and time spent on the website have a very weak positive correlation of 0.07, indicating that users who visit the website more frequently tend to spend slightly more time overall, but the relationship is not strong enough to be considered significant.
  • Outliers:

    • There are several outliers for users with both low and high website visits. Some users with very few visits have spent unusually large amounts of time on the website, while others with many visits have spent very little time. This suggests that while there is a general trend of more visits corresponding to more time, individual behaviors vary widely.

Interpretation:

  • While there is a slight upward trend indicating that more frequent website visitors tend to spend more time overall, this relationship is weak, as indicated by the low correlation value.

  • The overlap in time spent across different visit counts shows that the majority of users, regardless of how many visits they make, tend to spend similar amounts of time on the website. This suggests that frequency of visits alone does not strongly predict the amount of time spent.

  • The presence of outliers indicates that some users deviate significantly from this trend, further reinforcing that individual behaviors are diverse and not solely dependent on visit frequency.

Multicollinearity Checks¶

In [62]:
# Perform multicollinearity check

# Select numeric columns for analysis
numeric_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
data_numeric = data[numeric_cols]

# Calculate VIF
def calculate_vif(data):
    vif_data = pd.DataFrame()
    vif_data["Variable"] = data.columns
    vif_data["VIF"] = [variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
    return vif_data

vif_results = calculate_vif(data_numeric)

Observations:

Variance Inflation Factors (VIF) Results:

Variable VIF
age 3.94
website_visits 2.43
time_spent_on_website 1.91
page_views_per_visit 2.96
  • Multicollinearity Check:

    • All calculated VIF values are less than 5, indicating that multicollinearity is not an issue among these numerical features.
    • Specifically, the highest VIF is for age at 3.94, which still falls well below the threshold that would typically raise concerns about multicollinearity.
  • Future Considerations:

    • It's noted that this check will be repeated after encoding categorical features in the feature engineering section. This is important because multicollinearity can arise after encoding, especially with categorical variables that may introduce redundancy.

Data Preprocessing¶

  • Missing value treatment (if needed)
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Missing Values Check¶

In [63]:
# Check dataset for missing values
data.isnull().sum()
Out[63]:
0
ID 0
age 0
current_occupation 0
first_interaction 0
profile_completed 0
website_visits 0
time_spent_on_website 0
page_views_per_visit 0
last_activity 0
print_media_type1 0
print_media_type2 0
digital_media 0
educational_channels 0
referral 0
status 0

There are no missing values.

Duplicated Data Check¶

In [64]:
# Check dataset for duplicated data
data.duplicated().sum()
Out[64]:
0

There is no duplicated data.

Feature Engineering¶

  • We need to drop the index ID
  • We need to perfrom one hot encoding on binary categorical features
In [65]:
# Get the feature names before making any changes
feature_names = data.columns.tolist()
print(feature_names)
['ID', 'age', 'current_occupation', 'first_interaction', 'profile_completed', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'last_activity', 'print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels', 'referral', 'status']

Drop Index¶

In [66]:
# Drop the Index
data.drop('ID', axis=1, inplace=True)
In [67]:
# Get the feature names after fropping the index
feature_names = data.columns.tolist()
print(feature_names)
['age', 'current_occupation', 'first_interaction', 'profile_completed', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'last_activity', 'print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels', 'referral', 'status']

One Hot Encoding¶

In [68]:
# Perform one hot encoding on categorical features

# Select only the categoricalcolumns
categorical_cols = [
                    'current_occupation',
                    'first_interaction',
                    'profile_completed',
                    'last_activity',
                    'print_media_type1',
                    'print_media_type2',
                    'digital_media',
                    'educational_channels',
                    'referral'
                  ]

# Apply one-hot encoding to categorical variables
# data_encoded = pd.get_dummies(data[categorical_cols], drop_first=True) # What happens to the decision tree if I dont drop the first item
data_encoded = pd.get_dummies(data[categorical_cols])

'''

  The current_occupation_professional is a key indicator of conversion, but that is the feature that is removed IF drop_first=True,
  I will keep this as is, and remove one of the less important ones during feature engineering.

  It's interesting to note that this actually did not make a significant difference in the performance of a few of the model, namely the decision tree.

'''

# Convert boolean values (True/False) to integers (1/0)
data_encoded = data_encoded.astype(int)

# Combine the encoded categorical features with the original numerical columns
numerical_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'status']

data_encoded = pd.concat([data[numerical_cols], data_encoded], axis=1)
In [69]:
# Get the feature names after encoding
feature_names = data_encoded.columns.tolist()
print(feature_names)
['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'status', 'current_occupation_0', 'current_occupation_1', 'current_occupation_2', 'first_interaction_0', 'first_interaction_1', 'profile_completed_0', 'profile_completed_1', 'profile_completed_2', 'last_activity_0', 'last_activity_1', 'last_activity_2', 'print_media_type1_0', 'print_media_type1_1', 'print_media_type2_0', 'print_media_type2_1', 'digital_media_0', 'digital_media_1', 'educational_channels_0', 'educational_channels_1', 'referral_0', 'referral_1']
In [70]:
# count the number of values in feature_names
len(feature_names)
Out[70]:
26

The dataframe now has 26 features after OHE.

Let's look at the shape to get an idea of how the shape has change

In [71]:
# calculate the change in the number of columns between data and data_encoded
columns_added = data_encoded.shape[1] - data.shape[1]
print(f"Number of columns added due to encoding: {columns_added}")
Number of columns added due to encoding: 12

Observations:

  • OHE added an addiitonal 12 features.

Outlier Detection and Treatment¶

We previously identified several continuous features that have potential outliers:

  • website_visits
  • time_spent_on_website
  • page_views_per_visit

There are two approaches to consider. The first involves performing our classifications with the outliers present and addressing them as needed. The second approach is to remove the outliers outright. For the purposes of this project, the outliers will be removed. If time permits, I will also conduct analyses with the outliers included and compare the performance of each model.

Let's regenerate the boxplots to aid with visualization.

In [72]:
# List of numerical features previously identified as having outliers
potential_outliers = ['website_visits', 'time_spent_on_website', 'page_views_per_visit']

# Set up the figure with the appropriate number of subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))  # 1 row and 3 columns since we have 3 plots

# Create boxplots for each numerical column
for i, col in enumerate(potential_outliers):
    sns.boxplot(x=data_encoded[col], ax=axes[i], palette='tab10')  # Apply 'tab10' color scheme
    axes[i].set_title(f'Boxplot of {col}')

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

Let's use the Interquartile Range (IQR) statistical method to systematically detect outliers, after which we will remove them.

The IQR method identifies outliers as values falling below Q1 - 1.5 IQR or above Q3 + 1.5 IQR, where Q1 and Q3 are the 25th and 75th percentiles, respectively.

In [73]:
# Detect and remove outliers using the IQR method
def remove_outliers(df, columns):
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df

# Remove outliers
data_cleaned = remove_outliers(data_encoded, potential_outliers)

Let's regenerate the boxplots to visually verify outliers have been removed successfully.

In [74]:
# List of numerical features previously identified as having outliers
potential_outliers = ['website_visits', 'time_spent_on_website', 'page_views_per_visit']

# Set up the figure with the appropriate number of subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))  # 1 row and 3 columns since we have 3 plots

# Create boxplots for each numerical column
for i, col in enumerate(potential_outliers):
    sns.boxplot(x=data_cleaned[col], ax=axes[i], palette='tab10')  # Apply 'tab10' color scheme
    axes[i].set_title(f'Boxplot of {col}')

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
In [75]:
# Calculate the number of rows removed
rows_removed = len(data_encoded) - len(data_cleaned)
rows_removed
Out[75]:
404

We have successfully removed 404 extreme outliers.

EDA¶

  • It is a good idea to explore the data once again after manipulating it.

First Five Rows¶

In [76]:
# returns the first 5 rows
data_cleaned.head()
Out[76]:
age website_visits time_spent_on_website page_views_per_visit status current_occupation_0 current_occupation_1 current_occupation_2 first_interaction_0 first_interaction_1 ... print_media_type1_0 print_media_type1_1 print_media_type2_0 print_media_type2_1 digital_media_0 digital_media_1 educational_channels_0 educational_channels_1 referral_0 referral_1
0 57 7 1639 1.861 1 0 0 1 0 1 ... 0 1 1 0 0 1 1 0 1 0
1 56 2 83 0.320 0 1 0 0 1 0 ... 1 0 1 0 1 0 0 1 1 0
2 52 3 330 0.074 0 1 0 0 0 1 ... 1 0 1 0 0 1 1 0 1 0
3 53 4 464 2.057 1 0 0 1 0 1 ... 1 0 1 0 1 0 1 0 1 0
5 50 4 212 5.682 0 0 0 1 1 0 ... 1 0 1 0 1 0 0 1 1 0

5 rows × 26 columns

Last Five Rows¶

In [77]:
# returns the last 5 rows
data_cleaned.tail()
Out[77]:
age website_visits time_spent_on_website page_views_per_visit status current_occupation_0 current_occupation_1 current_occupation_2 first_interaction_0 first_interaction_1 ... print_media_type1_0 print_media_type1_1 print_media_type2_0 print_media_type2_1 digital_media_0 digital_media_1 educational_channels_0 educational_channels_1 referral_0 referral_1
4606 58 7 210 3.598 0 0 0 1 1 0 ... 1 0 1 0 1 0 1 0 1 0
4608 55 8 2327 5.393 0 1 0 0 1 0 ... 1 0 1 0 1 0 1 0 1 0
4609 58 2 212 2.692 1 1 0 0 0 1 ... 1 0 1 0 1 0 1 0 1 0
4610 57 1 154 3.879 0 1 0 0 1 0 ... 0 1 1 0 1 0 1 0 1 0
4611 55 4 2290 2.075 0 1 0 0 0 1 ... 1 0 1 0 1 0 1 0 1 0

5 rows × 26 columns

Shape¶

In [78]:
# Determine the number of rows and columns by calling data_cleaned.shape
print(f"{data_cleaned.shape[0]} {data_cleaned.shape[1]}")
4208 26

Observations:¶

  • The dataset now consists of 4206 rows and 26 features.

Univariate Analysis¶

Statistical Information¶

In [79]:
data_cleaned.describe(include = "all").T
Out[79]:
count unique top freq mean std min 25% 50% 75% max
age 4208.0 NaN NaN NaN 46.360741 13.079381 18.0 36.00000 51.0000 57.00000 63.000
website_visits 4208.0 NaN NaN NaN 3.231464 2.157708 0.0 2.00000 3.0000 5.00000 9.000
time_spent_on_website 4208.0 NaN NaN NaN 718.396150 742.652073 0.0 142.00000 375.0000 1310.75000 2537.000
page_views_per_visit 4208.0 NaN NaN NaN 2.716066 1.490995 0.0 2.06075 2.2855 3.64325 6.266
status 4208.0 2.0 0.0 2939.0 NaN NaN NaN NaN NaN NaN NaN
current_occupation_0 4208.0 NaN NaN NaN 0.568679 0.495320 0.0 0.00000 1.0000 1.00000 1.000
current_occupation_1 4208.0 NaN NaN NaN 0.115970 0.320226 0.0 0.00000 0.0000 0.00000 1.000
current_occupation_2 4208.0 NaN NaN NaN 0.315352 0.464711 0.0 0.00000 0.0000 1.00000 1.000
first_interaction_0 4208.0 NaN NaN NaN 0.446530 0.497192 0.0 0.00000 0.0000 1.00000 1.000
first_interaction_1 4208.0 NaN NaN NaN 0.553470 0.497192 0.0 0.00000 1.0000 1.00000 1.000
profile_completed_0 4208.0 NaN NaN NaN 0.494297 0.500027 0.0 0.00000 0.0000 1.00000 1.000
profile_completed_1 4208.0 NaN NaN NaN 0.023051 0.150084 0.0 0.00000 0.0000 0.00000 1.000
profile_completed_2 4208.0 NaN NaN NaN 0.482652 0.499758 0.0 0.00000 0.0000 1.00000 1.000
last_activity_0 4208.0 NaN NaN NaN 0.496673 0.500048 0.0 0.00000 0.0000 1.00000 1.000
last_activity_1 4208.0 NaN NaN NaN 0.263783 0.440736 0.0 0.00000 0.0000 1.00000 1.000
last_activity_2 4208.0 NaN NaN NaN 0.239544 0.426856 0.0 0.00000 0.0000 0.00000 1.000
print_media_type1_0 4208.0 NaN NaN NaN 0.891635 0.310878 0.0 1.00000 1.0000 1.00000 1.000
print_media_type1_1 4208.0 NaN NaN NaN 0.108365 0.310878 0.0 0.00000 0.0000 0.00000 1.000
print_media_type2_0 4208.0 NaN NaN NaN 0.948669 0.220698 0.0 1.00000 1.0000 1.00000 1.000
print_media_type2_1 4208.0 NaN NaN NaN 0.051331 0.220698 0.0 0.00000 0.0000 0.00000 1.000
digital_media_0 4208.0 NaN NaN NaN 0.886882 0.316774 0.0 1.00000 1.0000 1.00000 1.000
digital_media_1 4208.0 NaN NaN NaN 0.113118 0.316774 0.0 0.00000 0.0000 0.00000 1.000
educational_channels_0 4208.0 NaN NaN NaN 0.848859 0.358229 0.0 1.00000 1.0000 1.00000 1.000
educational_channels_1 4208.0 NaN NaN NaN 0.151141 0.358229 0.0 0.00000 0.0000 0.00000 1.000
referral_0 4208.0 NaN NaN NaN 0.980513 0.138244 0.0 1.00000 1.0000 1.00000 1.000
referral_1 4208.0 NaN NaN NaN 0.019487 0.138244 0.0 0.00000 0.0000 0.00000 1.000

Observations:¶

Age:

  • The mean age is 46.36 years, with a standard deviation of 13.08 years, indicating a slightly narrower age distribution compared to the original data.
  • The age range remains between 18 and 63 years.
  • The median (50th percentile) age is 51 years, and half of the leads are aged between 36 and 57 years, which remains consistent with the original data.

Website Visits:

  • The mean number of website visits has decreased slightly to 3.23, with a narrower standard deviation of 2.16 compared to 2.83 previously.
  • The number of visits still ranges from 0 to a maximum of 9 visits (down from 30).
  • Most leads continue to visit the website between 2 and 5 times, as indicated by the 25th to 75th percentile values.

Time Spent on Website:

  • The mean time spent on the website is now 718 seconds (around 12 minutes), with a standard deviation of 742.65 seconds. This is similar to the previous data but slightly reduced, suggesting a consistent distribution of website engagement.
  • The time spent on the website ranges from 0 to 2537 seconds, indicating the range has remained unchanged.
  • The median time remains at 375 seconds (about 6 minutes), with the 25th and 75th percentiles at 142 and 1311 seconds, respectively, showing that user engagement still varies greatly.

Page Views per Visit:

  • The mean number of page views per visit has decreased to 2.72 from 3.03, with a standard deviation of 1.49 (down from 1.97).
  • The range of page views per visit is now 0 to 6.27 pages, which is narrower compared to the previous 18.43 pages, indicating that extreme outliers have been removed.
  • The majority of leads view between 2 and 4 pages per visit, as shown by the percentiles.

Status (Target Variable):

  • The mean conversion rate remains consistent at approximately 30.16% of leads converting (status = 1), which is nearly identical to the previous value of 29.9%.
  • The majority (around 69.8%) of leads still do not convert (status = 0), indicating no significant change in the overall conversion rate.

Current Occupation:

  • The dataset now reflects one-hot encoding, showing two occupation categories:
    • "Student": 3720 leads are not students, while 488 are.
    • "Unemployed": 2881 leads are not unemployed, while 1327 are.
    • These categories now provide more detailed insight into the distribution of occupations among leads.

First Interaction:

  • The Website remains the most common first interaction, with 2329 leads (around 55% of total leads) interacting via the website, similar to the original data.

Profile Completion:

  • Profile completion has also been one-hot encoded, breaking down into two categories:
    • "Low": Most leads (4111 out of 4208) do not have low profile completion.
    • "Medium": 2177 leads do not have medium profile completion.

Last Activity:

  • The last activity feature has been encoded into two categories:
    • Phone Activity: 3098 leads did not engage with phone activity as their last action.
    • Website Activity: 3200 leads did not engage with website activity last.

Media & Referral Channels:

  • The encoded media and referral channels now show:
    • Print Media Type 1: 3752 leads did not interact with print media type 1.
    • Print Media Type 2: 3992 leads did not interact with print media type 2.
    • Digital Media: 3732 leads were not exposed to digital media.
    • Educational Channels: 3572 leads did not interact with educational channels.
    • Referral: An overwhelming majority of leads (4126) did not come through referrals.

In summary, the major differences are the removal of outliers, the use of one-hot encoding, and a slightly more refined distribution of website visits and page views. The overall trends remain consistent, particularly in terms of age, conversion rates, and interaction patterns.

Univariate Analysis of Continuous Features¶

In [80]:
for col in ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']:
    print(col)
    enhanced_histogram_boxplot(data_cleaned, col, kde = True, custom_title = col) # Pass the column name as a string
age

Statistical Summary for age:
Mean: 46.36
Median: 51.00
Standard Deviation: 13.08
Skewness: -0.74
Kurtosis: -0.76

Outlier Analysis:
IQR method - Number of outliers: 0
IQR method - Percentage of outliers: 0.00%
IQR method - Outlier range: < 4.50 or > 88.50
Z-score method - Number of outliers (|z| > 3): 0
Z-score method - Percentage of outliers: 0.00%
website_visits

Statistical Summary for website_visits:
Mean: 3.23
Median: 3.00
Standard Deviation: 2.16
Skewness: 0.80
Kurtosis: -0.14

Outlier Analysis:
IQR method - Number of outliers: 0
IQR method - Percentage of outliers: 0.00%
IQR method - Outlier range: < -2.50 or > 9.50
Z-score method - Number of outliers (|z| > 3): 0
Z-score method - Percentage of outliers: 0.00%
time_spent_on_website

Statistical Summary for time_spent_on_website:
Mean: 718.40
Median: 375.00
Standard Deviation: 742.65
Skewness: 0.97
Kurtosis: -0.54

Outlier Analysis:
IQR method - Number of outliers: 0
IQR method - Percentage of outliers: 0.00%
IQR method - Outlier range: < -1611.12 or > 3063.88
Z-score method - Number of outliers (|z| > 3): 0
Z-score method - Percentage of outliers: 0.00%
page_views_per_visit

Statistical Summary for page_views_per_visit:
Mean: 2.72
Median: 2.29
Standard Deviation: 1.49
Skewness: 0.02
Kurtosis: -0.32

Outlier Analysis:
IQR method - Number of outliers: 11
IQR method - Percentage of outliers: 0.26%
IQR method - Outlier range: < -0.31 or > 6.02
Z-score method - Number of outliers (|z| > 3): 0
Z-score method - Percentage of outliers: 0.00%

Observations:¶

Age:

  • Distribution: The age distribution is still moderately spread, centered around a mean of around 46 years, but now appears smoother with the use of a kernel density estimate (KDE) curve.
  • Mean & Median: The mean age is approximately 46, and the median age is around 51. This suggests that the age distribution is slightly skewed left, meaning a slightly larger number of younger leads.
  • Outliers: There may be some outliers visible in the boxplot, but the overall age distribution appears mostly normal. Most outliers are likely clustered around the extreme ends of the age spectrum (possibly near the minimum and maximum ages).
  • Skewness: The distribution is close to symmetric but may have a slight positive skew due to a small number of younger or older individuals in the dataset.

Website Visits:

  • Distribution: The number of website visits appears to follow a right-skewed distribution, with most leads visiting the website between 2 and 5 times. There are few leads with significantly more website visits.
  • Mean & Median: The mean number of visits is around 3.23, and the median is around 3 visits, indicating that most leads visit the website 2 to 5 times, with some outliers visiting much more frequently.
  • Outliers: The boxplot may show a few outliers on the higher end, indicating that a small portion of leads visit the website unusually frequently.
  • Skewness: The skewness of this feature is likely positive, as the KDE curve shows a long right tail. This suggests that while most leads visit the website fewer times, some have much higher visit counts.

Time Spent on Website:

  • Distribution: The time spent on the website has a very broad distribution, with a majority of users spending a relatively short time (a few minutes) and a long tail of users who spend much more time on the website.
  • Mean & Median: The mean time spent is approximately 718 seconds (around 12 minutes), while the median time is lower, around 375 seconds (6 minutes), suggesting that a small number of users spend significantly more time than the average.
  • Outliers: The boxplot likely shows some extreme outliers, representing users who spend an unusually large amount of time on the website (up to 2537 seconds). These outliers contribute to the right skew of the distribution.
  • Skewness: The distribution is positively skewed, meaning that a small number of users spend much more time on the website than the majority.

Page Views per Visit:

  • Distribution: The number of page views per visit has a slightly more condensed distribution, with most users viewing between 2 and 4 pages on average. The KDE curve highlights this central tendency.
  • Mean & Median: The mean page views per visit is around 2.72, with a median of approximately 2.3 pages. This suggests that the majority of users view just a few pages during their visit.
  • Outliers: The boxplot may show outliers for users who view significantly more pages (possibly up to 6.27 pages), indicating a smaller subset of highly engaged users.
  • Skewness: The distribution is likely right-skewed, with the majority of users viewing fewer pages and a long tail of users who view a much larger number of pages per visit.

General Patterns and Insights:

  • Right Skewed Distributions: Both the website visits, time spent on website, and page views per visit features show right-skewed distributions, where a small subset of leads exhibit unusually high engagement (higher visit counts, longer time spent, or more page views).
  • Moderate Outliers: There are moderate outliers present in the data, particularly for the time spent on the website and website visits features, indicating that some users are significantly more engaged than the average.
  • Overall User Behavior: The leads tend to interact with the website moderately, with most spending a few minutes and viewing a few pages. However, there are a few highly engaged users who visit the website frequently and for longer periods.

Univariate Analysis of Categorical Features¶

In [81]:
# Original Categories (w/Manual Mapping)
original_categories = {
    'current_occupation': ['Professional', 'Unemployed', 'Student'],
    'first_interaction': ['Website', 'Mobile App'],
    'profile_completed':  ['Low', 'Medium', 'High'],
    'last_activity': ['Email Activity', 'Phone Activity', 'Website Activity'],
    'print_media_type1': ['No', 'Yes'],
    'print_media_type2': ['No', 'Yes'],
    'digital_media': ['No', 'Yes'],
    'educational_channels': ['No', 'Yes'],
    'referral': ['No', 'Yes']
    # Add other variables and their categories here
}

# Initialize the mapping dictionary
ohe_mapping = {}

# Get the columns from the DataFrame
columns = data_cleaned.columns

for col in columns:
    try:
        variable, category_idx = col.rsplit('_', 1)
        category_label = original_categories.get(variable, [])[int(category_idx)] if variable in original_categories else category_idx
    except (ValueError, IndexError):
        variable = col
        category_label = None  # Handle unexpected naming patterns

    if variable not in ohe_mapping:
        ohe_mapping[variable] = {}

    ohe_mapping[variable][col] = category_label

# Iterate and print value counts with original labels
for variable, cols in ohe_mapping.items():
    print(f"Variable: {variable}")
    for ohe_col, category_label in cols.items():
        print(f"  Category: {category_label}")
        print(data_cleaned[ohe_col].value_counts(normalize=True))
        print('*' * 40)
    print('=' * 60)
Variable: age
  Category: None
age
57    0.084601
58    0.082700
56    0.072719
59    0.071055
60    0.052281
55    0.042063
32    0.039686
53    0.020437
50    0.019724
43    0.019487
48    0.019487
54    0.019487
51    0.019011
49    0.018774
46    0.018774
21    0.018298
52    0.018061
42    0.017823
24    0.017823
23    0.017348
45    0.017348
19    0.017110
47    0.017110
34    0.016873
44    0.016635
33    0.016160
20    0.015447
22    0.015209
41    0.015209
35    0.014971
18    0.014496
40    0.013783
38    0.013070
37    0.012595
36    0.012357
39    0.011644
63    0.010932
62    0.010456
30    0.009743
29    0.007842
61    0.007842
31    0.007842
28    0.005703
25    0.003802
26    0.003327
27    0.002852
Name: proportion, dtype: float64
****************************************
============================================================
Variable: website
  Category: visits
website_visits
2    0.272338
1    0.171578
3    0.144487
4    0.110266
5    0.092681
6    0.063926
7    0.051331
0    0.041350
8    0.033983
9    0.018061
Name: proportion, dtype: float64
****************************************
============================================================
Variable: time_spent_on
  Category: website
time_spent_on_website
0       0.041350
1       0.015922
65      0.004515
83      0.004040
76      0.004040
          ...   
2500    0.000238
1540    0.000238
1862    0.000238
1397    0.000238
2290    0.000238
Name: proportion, Length: 1541, dtype: float64
****************************************
============================================================
Variable: page_views_per
  Category: visit
page_views_per_visit
0.000    0.043013
2.168    0.003327
2.154    0.003089
2.192    0.002614
2.188    0.002376
           ...   
1.826    0.000238
4.954    0.000238
4.295    0.000238
5.577    0.000238
2.692    0.000238
Name: proportion, Length: 2126, dtype: float64
****************************************
============================================================
Variable: status
  Category: None
status
0    0.698432
1    0.301568
Name: proportion, dtype: float64
****************************************
============================================================
Variable: current_occupation
  Category: Professional
current_occupation_0
1    0.568679
0    0.431321
Name: proportion, dtype: float64
****************************************
  Category: Unemployed
current_occupation_1
0    0.88403
1    0.11597
Name: proportion, dtype: float64
****************************************
  Category: Student
current_occupation_2
0    0.684648
1    0.315352
Name: proportion, dtype: float64
****************************************
============================================================
Variable: first_interaction
  Category: Website
first_interaction_0
0    0.55347
1    0.44653
Name: proportion, dtype: float64
****************************************
  Category: Mobile App
first_interaction_1
1    0.55347
0    0.44653
Name: proportion, dtype: float64
****************************************
============================================================
Variable: profile_completed
  Category: Low
profile_completed_0
0    0.505703
1    0.494297
Name: proportion, dtype: float64
****************************************
  Category: Medium
profile_completed_1
0    0.976949
1    0.023051
Name: proportion, dtype: float64
****************************************
  Category: High
profile_completed_2
0    0.517348
1    0.482652
Name: proportion, dtype: float64
****************************************
============================================================
Variable: last_activity
  Category: Email Activity
last_activity_0
0    0.503327
1    0.496673
Name: proportion, dtype: float64
****************************************
  Category: Phone Activity
last_activity_1
0    0.736217
1    0.263783
Name: proportion, dtype: float64
****************************************
  Category: Website Activity
last_activity_2
0    0.760456
1    0.239544
Name: proportion, dtype: float64
****************************************
============================================================
Variable: print_media_type1
  Category: No
print_media_type1_0
1    0.891635
0    0.108365
Name: proportion, dtype: float64
****************************************
  Category: Yes
print_media_type1_1
0    0.891635
1    0.108365
Name: proportion, dtype: float64
****************************************
============================================================
Variable: print_media_type2
  Category: No
print_media_type2_0
1    0.948669
0    0.051331
Name: proportion, dtype: float64
****************************************
  Category: Yes
print_media_type2_1
0    0.948669
1    0.051331
Name: proportion, dtype: float64
****************************************
============================================================
Variable: digital_media
  Category: No
digital_media_0
1    0.886882
0    0.113118
Name: proportion, dtype: float64
****************************************
  Category: Yes
digital_media_1
0    0.886882
1    0.113118
Name: proportion, dtype: float64
****************************************
============================================================
Variable: educational_channels
  Category: No
educational_channels_0
1    0.848859
0    0.151141
Name: proportion, dtype: float64
****************************************
  Category: Yes
educational_channels_1
0    0.848859
1    0.151141
Name: proportion, dtype: float64
****************************************
============================================================
Variable: referral
  Category: No
referral_0
1    0.980513
0    0.019487
Name: proportion, dtype: float64
****************************************
  Category: Yes
referral_1
0    0.980513
1    0.019487
Name: proportion, dtype: float64
****************************************
============================================================

Observations:¶

Current Occupation:

  • Professional: Approximately 56.9% of respondents fall into this category, indicating that more than half of the dataset consists of professionals.
  • Unemployed: 11.6% of the respondents are unemployed, making it the smallest group within this variable.
  • Student: 31.5% of respondents are students, representing a significant portion of the dataset.

Observation: The dataset is predominantly made up of professionals, but there is still a large group of students, suggesting a mix of working and non-working individuals.

First Interaction:

  • Website: 55.3% of respondents first interacted through the website.
  • Mobile App: 44.7% of respondents first interacted through the mobile app.

Observation: The website is slightly more popular as a first point of interaction, though mobile app usage is not far behind, indicating a balanced distribution between these two channels.

Profile Completion:

  • Low: 49.4% of respondents have a low profile completion, indicating that nearly half the users have not fully completed their profiles.
  • Medium: Only 2.3% of respondents have a medium level of profile completion, which is a very small group.
  • High: 48.3% of respondents have high profile completion, closely matching the proportion of users with low completion.

Observation: There's a bimodal distribution between low and high profile completions, while the medium category is significantly underrepresented. This suggests that most users either leave their profiles incomplete or complete them fully, without stopping in between.

Last Activity:

  • Email Activity: 49.7% of respondents had recent email activity.
  • Phone Activity: 26.4% of respondents engaged in phone activity.
  • Website Activity: 23.9% of respondents engaged in website activity.

Observation: Email is the most common form of recent activity, while phone and website activity are less frequent. This may reflect that email remains a dominant channel for engagement in this dataset.

Print Media Type 1:

  • Yes: Only 10.8% of respondents engaged with print media type 1.

Observation: Print media type 1 has low engagement, indicating that this channel may not be as effective or prevalent as digital media.

Print Media Type 2:

  • Yes: 5.1% of respondents engage with print media type 2.

Observation: Even less engagement than print media type 1, suggesting print media in general may have minimal reach compared to other channels.

Digital Media:

  • Yes: 11.3% of respondents engage with digital media.

Observation: The use of digital media is still relatively low, which could mean there's an opportunity for more engagement in this area or it could reflect that the audience primarily engages through other channels.

Educational Channels:

  • Yes: 15.1% of respondents interact with educational channels.

Observation: A small portion of the audience engages with educational channels, suggesting that this may not be a primary area of interest or need for the majority of users.

Referral:

  • Yes: Only 1.9% of respondents come through referrals.

Observation: The referral system is underutilized, indicating either a lack of promotion or effectiveness of the referral program, or that users are not highly motivated to refer others.

Key Insights:

  • Demographics: The majority of users are professionals or students, with a small proportion of unemployed individuals.
  • User Engagement: Website and mobile app are equally important as entry points, with email being the dominant engagement method over phone and website activities.
  • Profile Completion: Users tend to either complete their profiles entirely or not at all, with very few in the middle.
  • Media Preferences: Digital and print media channels are not as widely used, but there may be potential for increased engagement in these areas.
  • Referrals: The referral program is largely underutilized and may need attention if this is a desired method of growth.

Conclusion: The observations provide a clear view of user engagement and demographics within the dataset. The data suggests areas for potential growth, particularly in enhancing profile completion and leveraging digital media channels for better engagement.

Bivariate Analysis¶

Correlation Heatmap¶

In [82]:
# Create a correlation matrix (numerical vlaues only)

# Exclude non-numeric columns
numeric_data = data_cleaned.select_dtypes(include=np.number)

# Create heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(numeric_data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="vlag")
plt.title("Correlation Heatmap")
plt.show()

Observations:¶

Here are the verified observations based on the correlation analysis and heatmap results:

Focusing on Our Target Feature, status:

  • Status and Age (0.12):

    • There is a weak positive correlation between status and age, indicating that as age increases, there is a slight tendency for users to have a higher status. This suggests that older individuals might have a different status compared to younger ones.
  • Status and Time Spent on Website (0.30):

    • There is a moderate positive correlation between status and the time spent on the website. This indicates that users with a higher status tend to spend more time on the website, making it an important predictor for modeling.
  • Status and Page Views Per Visit (0.02):

    • There is a near-zero correlation between status and page_views_per_visit, indicating that the number of pages viewed per visit does not have a strong linear relationship with status, making it unlikely to be a key predictor.
  • Status and Current Occupation - Student (-0.14):

    • There is a weak negative correlation between status and being a student. This suggests that users who are students are somewhat less likely to have a higher status, indicating that student status may be inversely related to the target variable.
  • Status and Current Occupation - Unemployed (-0.05):

    • There is a very weak negative correlation between status and being unemployed. This suggests that being unemployed has a minor relationship with status.
  • Status and First Interaction Website (0.39):

    • There is a moderately strong positive correlation between status and first interaction being through the website. This indicates that users who first interacted via the website are more likely to have a higher status, making this a valuable predictor for your model.
  • Status and Profile Completion - Low (0.01):

    • There is almost no correlation between status and having low profile completion, suggesting that this feature is not significant for determining user status.
  • Status and Profile Completion - Medium (-0.11):

    • There is a weak negative correlation between status and medium profile completion, suggesting that users with medium profile completion are slightly less likely to have a higher status.
  • Status and Last Activity - Phone Activity (-0.11):

    • There is a weak negative correlation between status and phone activity being the last activity. This may indicate that users whose last activity was via phone tend to have a slightly lower status.
  • Status and Last Activity - Website Activity (0.05):

    • There is a very weak positive correlation between status and website activity as the last interaction, suggesting little to no relationship with status.
  • Status and Digital Media Interaction (-0.01):

    • There is a near-zero correlation between status and digital media interaction, indicating that this feature is not related to user status.
  • Status and Referral (0.12):

    • There is a weak positive correlation between status and whether the user was referred. This suggests that users who came via a referral might have a slight tendency toward a higher status.

Summary of Key Takeaways:

  • Moderate Positive Correlations:

    • First interaction being through the website (0.39) and time spent on the website (0.30) are the strongest positive relationships with status, making them key variables for predicting user status.
  • Weak Positive Correlations:

    • Age (0.12) and referral (0.12) show weak but potentially meaningful relationships with status, suggesting that older users and those referred to the platform might have a slightly higher status.
  • Weak to Moderate Negative Correlations:

    • Current occupation as a student (-0.14) and profile completion as medium (-0.11) show weak negative correlations with status, indicating that these groups are somewhat less likely to have higher status.
  • Near-Zero Correlations:

    • Features like page views per visit (0.02), digital media interaction (-0.01), and print media type interactions show near-zero correlations with status, suggesting they may not be useful predictors.

Conclusion:

The most promising predictors for status based on this correlation analysis are first interaction through the website and time spent on the website. Variables like age and referral may also provide some predictive power but will likely be weaker in impact. Features with very low correlations, such as page views per visit and print media interactions, may not be valuable for modeling and could potentially be removed.

Multicollinearity Checks¶

Multicollinearity occurs when two or more independent variables (predictors) in a regression model are highly correlated with each other. This means that one predictor can be linearly predicted from the others with a high degree of accuracy.

Interpreting VIF:

  • VIF = 1: No correlation between the variable and others.
  • 1 < VIF < 5: Moderate correlation.
  • VIF > 5: Strong multicollinearity that may need to be addressed,
In [83]:
# Perform multicollinearity check for numeric columns only

# Select numeric columns for VIF analysis
numeric_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
data_numeric = data[numeric_cols]

# Function to calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(data):
    vif_data = pd.DataFrame()
    vif_data["Variable"] = data.columns
    vif_data["VIF"] = [variance_inflation_factor(data.values, i) for i in range(data.shape[1])]
    return vif_data

# Calculate VIF for numeric columns only
vif_results_numeric = calculate_vif(data_numeric)

Observations:¶

Variance Inflation Factors (VIF) for the numeric variables

Variable VIF
Age 3.94
Website Visits 2.43
Time Spent on Website 1.91
Page Views per Visit 2.96
  • Age (VIF = 3.94):

    • The VIF for age is 3.94, which is below the commonly used threshold of 5. This suggests that while there is some degree of correlation between age and other variables, it is not severe enough to warrant concern.
  • Website Visits (VIF = 2.43):

    • The VIF for website visits is 2.43, indicating low multicollinearity. This means that the number of visits to the website does not strongly correlate with other variables and can be safely included in the regression model.
  • Time Spent on Website (VIF = 1.91):

    • The VIF for time spent on the website is 1.91, which is very low. This suggests that this variable has little correlation with the others. It's a strong independent predictor and does not introduce multicollinearity.
  • Page Views per Visit (VIF = 2.96):

    • The VIF for page views per visit is 2.96, which is well below the threshold of 5. This indicates that there is low multicollinearity for this variable as well. It can be used in the model without concern for inflated standard errors.

Key Insights:

  • All of the variables have VIF values below 5, which suggests that multicollinearity is not a major issue for the model.

Model Building¶

The goal of this model is to:

  • help identify which leads are more likely to convert to paid customers,
  • Find the factors driving the lead conversion process
  • Create a profile of the leads which are likely to convert

Before training the model, let's choose the appropriate model evaluation criterion as per the problem at hand.

Model evaluation criterion

The model can make two types of wrong predictions:

  1. Predicting a lead will convert when the lead doesn't convert.
  2. Predicting a lead will not convert when the lead actually converts.

Which case is more important?

  • Predicting that the lead will convert but the lead does not convert would be considered a major miss for any lead conversion predictor and hence the more important case of wrong predictions.

How to reduce this loss i.e the need to reduce False Negatives?

  • The company would want the Recall to be maximized, the greater the Recall, the higher the chances of minimizing false negatives. Hence, the focus should be on increasing the Recall (minimizing the false negatives) or, in other words, identifying the true positives (i.e. Class 1) very well, so that the company can boost efforts to convert leads. This would help to make the company more profitable.

Target Feature Separation¶

In [84]:
# Separating the target variable and other variables
X = data_cleaned.drop('status', axis=1)   # Create a copy of the data with 'status' removed
y = data_cleaned['status']                # Create a new variable with only 'status'

Data Split¶

Splitting the data into 70% train and 30% test set

In [85]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Check the Shape of the Train and Test Data¶

In [86]:
# Check the shape of both the train and test data

#Collect the data for the table
data_ = {
    "Data Split": ["Training Set", "Test Set"],
    "Shape": [X_train.shape, X_test.shape],
    "Class Percentage": [
        y_train.value_counts(normalize=True).to_dict(),
        y_test.value_counts(normalize=True).to_dict()
    ]
}

# Create the table using pandas
shape = pd.DataFrame(data_)

# Display the table
print(shape)
     Data Split       Shape                                Class Percentage
0  Training Set  (2945, 25)  {0: 0.6937181663837012, 1: 0.3062818336162988}
1      Test Set  (1263, 25)  {0: 0.7094220110847189, 1: 0.2905779889152811}

Missing Values Check¶

Check the training and testing data for missing values. There is no reason to expect any missing vlues as our dataset had none.

In [87]:
# Missing value check for X_train and X_test as a percentage

# Create DataFrames for % of missing values in each dataset
train_missing = pd.DataFrame({
    '% of Missing Values (Train)': round(X_train.isna().sum() / X_train.isna().count() * 100, 2)
})

test_missing = pd.DataFrame({
    '% of Missing Values (Test)': round(X_test.isna().sum() / X_test.isna().count() * 100, 2)
})

# Concatenate the two DataFrames to have a nice table comparing both
missing_values_table = pd.concat([train_missing, test_missing], axis=1)

# Display the table
missing_values_table
Out[87]:
% of Missing Values (Train) % of Missing Values (Test)
age 0.0 0.0
website_visits 0.0 0.0
time_spent_on_website 0.0 0.0
page_views_per_visit 0.0 0.0
current_occupation_0 0.0 0.0
current_occupation_1 0.0 0.0
current_occupation_2 0.0 0.0
first_interaction_0 0.0 0.0
first_interaction_1 0.0 0.0
profile_completed_0 0.0 0.0
profile_completed_1 0.0 0.0
profile_completed_2 0.0 0.0
last_activity_0 0.0 0.0
last_activity_1 0.0 0.0
last_activity_2 0.0 0.0
print_media_type1_0 0.0 0.0
print_media_type1_1 0.0 0.0
print_media_type2_0 0.0 0.0
print_media_type2_1 0.0 0.0
digital_media_0 0.0 0.0
digital_media_1 0.0 0.0
educational_channels_0 0.0 0.0
educational_channels_1 0.0 0.0
referral_0 0.0 0.0
referral_1 0.0 0.0
  • There are no missing vlaues.

Logistic Regression¶

Standardize/Scale the Data¶

Logistic regression benefits from scaled data, especially if some numeric features have a wide range of values.

In [88]:
# Standardize Numeric Features

# List of numeric columns that need to be scaled
numeric_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']

# Identify non-numeric columns
non_numeric_cols = [col for col in X_train.columns if col not in numeric_cols]

# Initialize the scaler
scaler = StandardScaler()

# Scale the numeric columns in the training set
X_train_scaled_numeric = pd.DataFrame(
                                      scaler.fit_transform(X_train[numeric_cols]),
                                      columns=numeric_cols,
                                      index=X_train.index  # Retain original index
                                      )

# Scale the numeric columns in the test set
X_test_scaled_numeric = pd.DataFrame(
                                      scaler.transform(X_test[numeric_cols]),
                                      columns=numeric_cols,
                                      index=X_test.index  # Retain original index
                                      )

# Extract the non-numeric columns from the training set without resetting the index
X_train_non_numeric = X_train[non_numeric_cols]

# Extract the non-numeric columns from the test set without resetting the index
X_test_non_numeric = X_test[non_numeric_cols]

# Concatenate the scaled numeric columns with the non-numeric columns for the training set
X_train_final = pd.concat([X_train_scaled_numeric, X_train_non_numeric], axis=1)

# Concatenate the scaled numeric columns with the non-numeric columns for the test set
X_test_final = pd.concat([X_test_scaled_numeric, X_test_non_numeric], axis=1)

# Preserve the original column order
X_train_scaled = X_train_final[X_train.columns]
X_test_scaled  = X_test_final[X_test.columns]

Build the Model

In [89]:
# Initialize and fit the logistic regression model
logreg = LogisticRegression(max_iter=1000,random_state=1)
logreg.fit(X_train_scaled, y_train)
Out[89]:
LogisticRegression(max_iter=1000, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(max_iter=1000, random_state=1)

Train and Evaluate the Model

In [90]:
# Evaluate the Model on the Training Data
y_train_pred = logreg.predict(X_train_scaled)

# Evaluate performance on training data
metrics_score(y_train, y_train_pred)
              precision    recall  f1-score   support

           0       0.86      0.90      0.88      2043
           1       0.74      0.66      0.70       902

    accuracy                           0.82      2945
   macro avg       0.80      0.78      0.79      2945
weighted avg       0.82      0.82      0.82      2945

Test the Model

In [91]:
# Evaluate the Model on the Training Data
y_test_pred = logreg.predict(X_test_scaled)

# Evaluate performance on training data
metrics_score(y_test, y_test_pred)
              precision    recall  f1-score   support

           0       0.86      0.89      0.88       896
           1       0.72      0.66      0.68       367

    accuracy                           0.82      1263
   macro avg       0.79      0.77      0.78      1263
weighted avg       0.82      0.82      0.82      1263

Observations

Accuracy:

  • The model has an overall accuracy of 82%, meaning that 82% of the predictions on the training data are correct.

Class 0 (Not Converted):

  • Precision: 0.86 – Of all the cases predicted as not converted 86% were correct.

  • Recall: 0.90 – The model successfully identified 90% of all actual not converted cases.

  • F1-Score: 0.88 – This harmonic mean between precision and recall indicates strong performance in identifying not converted users.

Class 1 (Converted):

  • Precision: 0.74 – Of all the cases predicted as converted, 74% were correct.

  • Recall: 0.66 – The model correctly identified 66% of all actual converted cases.

  • F1-Score: 0.70 – This score suggests moderate performance in identifying converted users, but there is room for improvement.

Macro Average (Precision: 0.80, Recall: 0.78):

  • This indicates an overall balance between precision and recall across both classes, giving equal weight to both.

Weighted Average (Precision: 0.82, Recall: 0.82):

  • This average accounts for the class imbalance (since there are far more status = 0 users).

    The results are consistent with the overall accuracy of 82%.

PCA¶

We previously scaled the data in preparation for Logistic Regression so we can start by applying PCA

In [92]:
# Apply PCA to the entire feature set after scaling (including numeric + other features)
pca = PCA(n_components=0.95)  # Retain 95% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Print the number of components chosen by PCA
print(f"Number of components retained: {X_train_pca.shape[1]}")
Number of components retained: 12

PCA analysis has determined that only 12 of the 26 features are important.

Build the Model

In [93]:
# Initialize and fit the decision tree on the PCA-transformed training data
dt_model = DecisionTreeClassifier(random_state=1)
dt_model.fit(X_train_pca, y_train)
Out[93]:
DecisionTreeClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=1)

Test and Evaluate the Model

In [94]:
# Generate predictions on the training data
y_train_pred_pca = dt_model.predict(X_train_pca)

# Evaluate the model's performance on the training data
train_accuracy_pca = accuracy_score(y_train, y_train_pred_pca)
metrics_score(y_train, y_train_pred_pca)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2043
           1       1.00      1.00      1.00       902

    accuracy                           1.00      2945
   macro avg       1.00      1.00      1.00      2945
weighted avg       1.00      1.00      1.00      2945

Test and Evaluate the Model

In [95]:
# Generate predictions on the test data
y_test_pred_pca = dt_model.predict(X_test_pca)

# Evaluate the model's performance on the test data
test_accuracy_pca = accuracy_score(y_test, y_test_pred_pca)
metrics_score(y_test, y_test_pred_pca)
              precision    recall  f1-score   support

           0       0.85      0.83      0.84       896
           1       0.60      0.64      0.62       367

    accuracy                           0.77      1263
   macro avg       0.72      0.73      0.73      1263
weighted avg       0.78      0.77      0.77      1263

Observations

Overfitting on the Training Set:

  • The model performs perfectly on the training set with 100% precision, recall, and F1-scores for both classes. This suggests that the Decision Tree has overfitted to the training data, capturing even noise and irrelevant features. Overfitting is a common issue with Decision Trees when not properly pruned or regularized.

Test Set Performance:

  • The performance on the test set is significantly lower than on the training set, further confirming that the model is overfitting.
  • The accuracy on the test set is 77%, which is a notable drop from the perfect accuracy on the training set.

  • For Class 0 (the majority class), precision is 0.85, recall is 0.83, and the F1-score is 0.84, showing a decent but not exceptional performance on the majority class.

  • For Class 1 (the minority class), precision is 0.60, recall is 0.64, and the F1-score is 0.62, indicating that the model is struggling more with identifying and predicting the minority class. It misses about 36% of the actual positive instances and has a relatively high false positive rate for this class.

Effect of PCA:

  • The use of PCA (Principal Component Analysis) for dimensionality reduction likely helped reduce the number of features and potentially removed some irrelevant data, but the Decision Tree still appears to overfit the training data.

  • Although PCA can sometimes reduce overfitting by reducing complexity, in this case, the Decision Tree's inherent tendency to overfit (due to its flexibility) has not been mitigated, possibly due to the structure of the tree being too deep or not enough regularization applied.

Imbalance in Class Performance:

  • As with previous models, the imbalance in class distribution continues to affect performance. The model predicts Class 0 (the majority class) more accurately than Class 1 (the minority class), which is reflected in the F1-scores: 0.84 for Class 0 and 0.62 for Class 1.

  • This suggests that the Decision Tree has learned patterns for the majority class much better than the minority class, a common issue with imbalanced datasets.

Macro and Weighted Averages:

  • The macro average F1-score of 0.73 indicates that the model performs better on the majority class, as the average across both classes is lower than Class 0's individual score.

  • The weighted average F1-score of 0.77 reflects the model’s stronger performance on Class 0, weighted by the class distribution in the test set.

Conclusion:

The PCA-populated Decision Tree has clear signs of overfitting, as evidenced by perfect performance on the training set and a significant drop in performance on the test set. While Class 0 predictions remain decent, Class 1 predictions are relatively weak, and the overall performance is not optimal, particularly for the minority class. Addressing overfitting (e.g., through pruning the Decision Tree or using more regularization) and handling class imbalance could improve the model’s generalization on unseen data.

Decision Tree¶

Note: Since our target feature is binary we will employ a decision tree classifier.

Build the Model

In [96]:
# Fitting the decision tree classifier on the training data
d_tree =  DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train,y_train)
Out[96]:
DecisionTreeClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=1)

Train and Evaluate the Model

In [97]:
# Checking performance on the training data
y_pred_train = d_tree.predict(X_train)

# Instantiate the metrics_score function
metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2043
           1       1.00      1.00      1.00       902

    accuracy                           1.00      2945
   macro avg       1.00      1.00      1.00      2945
weighted avg       1.00      1.00      1.00      2945

Observations:

Classification Metrics:

  • Precision, Recall, and F1-scores are all 1.00 for both classes (Not Converted and Converted), indicating perfect classification performance on the training set.

  • The accuracy is 100%, as the model correctly classifies nearly all instances in the training data.

Confusion Matrix:

  • The confusion matrix shows that all 2043 instances of the "Not Converted" class were correctly predicted, and all 901 instances of the "Converted" class were also correctly predicted, except for 1 minor misclassification.

    This is consistent with perfect or near-perfect performance, indicating severe overfitting.

Conclusions:

  • Overfitting : The perfect accuracy on the training set is a sure sign of overfitting, especially in decision trees, which tend to memorize the training data when no regularization (like pruning or max-depth constraints) is applied. It's important to validate this performance on a test set to see if the model generalizes well. This may be a good candidate for employing PCA to remove some dimensions.

  • Balanced Performance: Both classes seem to have been learned equally well, as shown by the identical precision, recall, and F1-scores across classes.

Test and Evaluate the Model

In [98]:
# Checking performance on the test data
y_pred_test = d_tree.predict(X_test)

# Instantiate the metrics_score function
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.86      0.85      0.86       896
           1       0.65      0.67      0.66       367

    accuracy                           0.80      1263
   macro avg       0.76      0.76      0.76      1263
weighted avg       0.80      0.80      0.80      1263

Observations (with OHE drop_first = True):

Classification Metrics:

  • Accuracy: 79.65%, which is much lower than the perfect performance on the training set, indicating that the model is not generalizing as well.

  • Class 0 (Not Converted): Precision and recall are around 0.86, indicating relatively good performance.

  • Class 1 (Converted): Precision is 0.64, and recall is 0.67, which means that the model is missing a significant number of "Converted" cases (false negatives) and is not making very confident positive predictions.

Confusion Matrix:

  • Out of 896 "Not Converted" instances, 760 were correctly predicted, but 136 were misclassified as "Converted".

  • Out of 367 "Converted" instances, only 246 were correctly classified, with 121 misclassified as "Not Converted".

Conclusions:

  • The model generalizes moderately well on the test data, but it is clearly overfitting on the training set. The test accuracy of 80% and the drop in performance, particularly for the "Converted" class, suggest that the decision tree may be too complex.

  • The lower performance in predicting the "Converted" class (precision of 0.64 and recall of 0.67) could lead to missed opportunities or incorrect classifications in practical applications where identifying conversions is critical.

Observations (with OHE drop_first = False):

Test Performance Improvement:

  • The test accuracy remains at 80%, similar to the previous result. This consistency suggests that the model is not significantly underperforming on unseen data, even after changes in the encoding scheme.

  • The precision for class 1 (converted) has slightly improved to 65% from 64%, and the recall for class 1 has increased to 67% (up from 64%). This indicates a marginal improvement in the model's ability to detect converted cases.

Class Imbalance:

  • The model still struggles more with class 1 (converted), which can be seen in the lower precision (65%) and recall (67%) compared to class 0 (not converted), which maintains high performance (precision 86% and recall 85%). This suggests that the model is still biased toward the majority class (not converted).

Confusion Matrix Insights:

  • The confusion matrix shows that 131 non-converted cases were incorrectly classified as converted, while 120 converted cases were misclassified as non-converted. These errors indicate that while the model is reasonably good at identifying conversions, there's still room for improvement in reducing false positives and false negatives.

Do we need to prune the tree? Yes¶

Inorder to determine if pruning is needed we first must determine the importance of features that went into the decision tree.

Feature Importance¶

In [99]:
# Plot the feature importance

importances = d_tree.feature_importances_  # Use d_tree
columns = X.columns  # Use columns from the DataFrame used with d_tree
importance_df = pd.DataFrame(importances, index=columns, columns=['Importance']).sort_values(by='Importance', ascending=False)
print(importance_df)

# Plot
plt.figure(figsize=(13, 13))
sns.barplot(data=importance_df, x='Importance', y=importance_df.index, palette='tab10')
plt.title('Feature Importance in Decision Tree')
plt.show()
                        Importance
time_spent_on_website     0.286512
first_interaction_0       0.154206
profile_completed_0       0.134657
page_views_per_visit      0.095233
age                       0.091493
current_occupation_0      0.066631
last_activity_1           0.047065
website_visits            0.039448
last_activity_2           0.024113
digital_media_0           0.008002
current_occupation_2      0.007526
last_activity_0           0.005211
referral_0                0.005136
educational_channels_0    0.004923
profile_completed_1       0.004662
print_media_type1_0       0.004083
profile_completed_2       0.003841
digital_media_1           0.003714
print_media_type2_0       0.003433
educational_channels_1    0.003226
print_media_type1_1       0.002581
print_media_type2_1       0.002551
current_occupation_1      0.001505
referral_1                0.000248
first_interaction_1       0.000000

Observations:

Top Features:

  • The most important feature based on the importance values is time_spent_on_website with a score of 0.2865. This suggests that the time users spend on the website plays the largest role in determining the model's predictions.
  • Other significant features include:
    • first_interaction_Mobile App (0.1542), indicating that how users first interact with the mobile app influences the outcome.
    • profile_completed_High (0.1347), meaning a high level of profile completion has a strong impact.
    • page_views_per_visit (0.0952) and age (0.0915), also playing crucial roles in the predictions.

Lesser Importance:

  • Features such as current_occupation_Professional (0.0666) and last_activity_Phone Activity (0.0471) are still important but have less predictive power than the top features.
  • Many features have relatively small importance, such as digital_media_No (0.0080) and profile_completed_Low (0.0047), indicating these features play only a minimal role in the decision tree's predictions.

Negligible or No Importance:

  • Some features, such as first_interaction_Website (0.0000) and referral_Yes (0.0002), have virtually no importance in this decision tree model, suggesting they do not contribute to the predictive power.

Business Insights:

  • The results indicate that user behavior on the website, such as the time spent and page views per visit, is crucial for prediction. Additionally, the completion level of user profiles (especially high completion) is a key determinant.
  • Digital media interactions and print media types, on the other hand, have limited influence in this specific model.

Visualize the current decision Tree¶

In [100]:
# Set the size of the plot
plt.figure(figsize=(20, 20))

# Plot the tree using plot_tree from Scikit-learn
tree.plot_tree(
              d_tree,
              feature_names=X.columns,  # Column names for feature names
              class_names=True,         # Display class names
              filled=True,              # Color nodes by class
              rounded=True,             # Rounded boxes for nodes
              proportion=False,         # Not scaled to the proportion of samples at each node
              fontsize=8,               # Font size for labels
              max_depth = 3             # Limit the depth of the tree to 3 levels
              )

# Display the plot
plt.show()

Decision Tree Pruning¶

Pruning Using Maximum Depth (max_depth):¶

Let's determine the optimal tree depth.

In [101]:
# Determine the optimum depth to prune

# Define the range of depths to test
depth_range = range(1, 21)  # Test depths from 1 to 20
cv_scores = []  # List to store cross-validation scores

# Loop over the depths and perform cross-validation
for depth in depth_range:
    # Create a decision tree with the current max_depth
    pruned_tree = DecisionTreeClassifier(random_state=1, max_depth=depth)

    # Perform cross-validation (5-fold cross-validation in this case)
    scores = cross_val_score(pruned_tree, X_train, y_train, cv=5, scoring='accuracy')  # You can change scoring to 'f1', 'precision', 'recall', etc.

    # Append the mean score for this depth
    cv_scores.append(np.mean(scores))

# Find the depth with the best cross-validation score
optimal_depth = depth_range[np.argmax(cv_scores)]
print(f"The optimal max_depth is: {optimal_depth}")

# Plot the results to visualize
plt.figure(figsize=(10, 6))
plt.plot(depth_range, cv_scores, marker='o')
plt.xlabel('Max Depth')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Cross-Validation Accuracy vs Max Depth')
plt.show()
The optimal max_depth is: 6

Build the Model.

In [102]:
# Build the model
pruned_tree_depth_max = DecisionTreeClassifier(random_state=1, max_depth=optimal_depth)
pruned_tree_depth_max.fit(X_train, y_train)
Out[102]:
DecisionTreeClassifier(max_depth=6, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=6, random_state=1)

Train and Evaluate the Model.

In [103]:
# Train the final decision tree with the optimal depth
y_pred_train_depth_max = pruned_tree_depth_max.predict(X_train)

# Instantiate the metrics_score function
metrics_score(y_train, y_pred_train_depth_max)
              precision    recall  f1-score   support

           0       0.90      0.95      0.92      2043
           1       0.87      0.75      0.80       902

    accuracy                           0.89      2945
   macro avg       0.88      0.85      0.86      2945
weighted avg       0.89      0.89      0.89      2945

Test and Evaluate the Model

In [104]:
# Checking performance on the training data
y_pred_test_depth_max = pruned_tree_depth_max.predict(X_test)

# Instantiate the metrics_score function
metrics_score(y_test, y_pred_test_depth_max)
              precision    recall  f1-score   support

           0       0.86      0.92      0.89       896
           1       0.77      0.63      0.69       367

    accuracy                           0.84      1263
   macro avg       0.81      0.78      0.79      1263
weighted avg       0.83      0.84      0.83      1263

Comparison of Training and Testing Observations:

  1. Overall Accuracy:

    • Training Accuracy: 89%
    • Testing Accuracy: 84%
    • The slight drop in accuracy from training to testing (89% to 84%) suggests that the model generalizes well, with no significant overfitting.
  2. Class 0 (Not Converted):

    • Training Precision/Recall: 0.90/0.95
    • Testing Precision/Recall: 0.86/0.92
    • The model performs slightly better on the training set for "Not Converted" users, with higher precision and recall values. However, it still maintains strong performance on the test set, with only a slight decline in recall (92% vs 95%).
  3. Class 1 (Converted):

    • Training Precision/Recall: 0.87/0.75
    • Testing Precision/Recall: 0.77/0.63
    • For the "Converted" class, the precision and recall are slightly lower on the test set, indicating that the model finds it more challenging to predict the minority class on unseen data. The recall drops from 75% to 63%, showing a greater likelihood of false negatives on the test set.
  4. Class Imbalance:

    • The model performs better with the majority class (Class 0) on both the training and testing sets, as seen in the consistently high precision and recall values. For the minority class (Class 1), there is a noticeable drop in recall on the test set.
  5. Generalization:

    • Macro Average (Training): 0.88 Precision, 0.85 Recall
    • Macro Average (Testing): 0.81 Precision, 0.78 Recall
    • The macro averages indicate that the model's performance on the training set is slightly higher across both classes. However, it still maintains reasonable performance when applied to unseen data, with a balanced accuracy across classes.

Conclusion:

  • The model shows good generalization from training to testing data, maintaining high accuracy. The "Not Converted" class is predicted consistently well, while the "Converted" class shows a larger performance drop, particularly in recall on the test data.

Pruning Using Limited Depth(s):¶

Let's examine what happens if we bracket the 'optimum' pruning depth determined above

Let Depth = 5

Build and evaluate the model on the training data.

Build the Model

In [105]:
# Create a decision tree with pruning by limiting the depth
pruned_tree_depth_5 = DecisionTreeClassifier(random_state=1, max_depth=5)
pruned_tree_depth_5.fit(X_train, y_train)
Out[105]:
DecisionTreeClassifier(max_depth=5, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=5, random_state=1)

Train and Evaluate the Model

In [106]:
# Checking performance on the training data
y_pred_train_pruned_depth_5 = pruned_tree_depth_5.predict(X_train)
metrics_score(y_train, y_pred_train_pruned_depth_5)
              precision    recall  f1-score   support

           0       0.89      0.94      0.92      2043
           1       0.84      0.75      0.79       902

    accuracy                           0.88      2945
   macro avg       0.87      0.84      0.85      2945
weighted avg       0.88      0.88      0.88      2945

Test and Evaluate the Model.

In [107]:
# Checking performance on the training data
y_pred_test_pruned_depth_5 = pruned_tree_depth_5.predict(X_test)
metrics_score(y_test, y_pred_test_pruned_depth_5)
              precision    recall  f1-score   support

           0       0.89      0.92      0.90       896
           1       0.79      0.71      0.75       367

    accuracy                           0.86      1263
   macro avg       0.84      0.82      0.83      1263
weighted avg       0.86      0.86      0.86      1263

Observations

Reduced Overfitting:

  • Pruning the tree by limiting its depth to 5 has successfully reduced the overfitting seen in the previous models. The training accuracy is now 88%, which is much more realistic compared to the perfect performance previously observed. The model is now better at generalizing to unseen data.

Balanced Performance Across Train and Test Sets:

  • The test set accuracy is 86%, which is closely aligned with the training set accuracy of 88%. This narrow gap between training and test performance indicates that the model generalizes well without being overly biased towards the training data.

Class 0 (Majority Class) Performance:

  • For Class 0, both the training and test sets show consistently strong results:

    • On the training set, Class 0 has a precision of 0.89, recall of 0.94, and an F1-score of 0.92.

    • On the test set, Class 0 also performs well, with precision of 0.89, recall of 0.92, and an F1-score of 0.90.

  • These metrics reflect that the model is effectively classifying the majority class with high accuracy, precision, and recall.

Class 1 (Minority Class) Performance:

  • For Class 1, the performance is decent but not as strong as Class 0:

    • On the training set, Class 1 has a precision of 0.84, recall of 0.75, and an F1-score of 0.79.

    • On the test set, Class 1 has a precision of 0.79, recall of 0.71, and an F1-score of 0.75.

  • While the model is able to identify most of the minority class instances, the recall is slightly lower, meaning that about 29% of the positive instances in the test set are not being correctly identified. The precision remains fairly strong, indicating that when the model predicts Class 1, it is usually correct.

Macro and Weighted Averages:

  • The macro average F1-score on the test set is 0.83, which reflects the average performance across both classes, taking into account the lower recall for Class 1.

  • The weighted average F1-score on the test set is 0.86, showing that overall the model is performing well, with stronger performance on Class 0 but not neglecting Class 1 entirely.

Improvement Over Previous Models:

  • Compared to previous iterations, this pruned Decision Tree offers a much better balance between train and test performance, indicating less overfitting and more generalizability.

  • The drop in performance for the minority class (Class 1) is still present but is not as severe as in some previous models. The model still struggles somewhat with identifying all instances of Class 1, but precision and recall are much better balanced.

Conclusion:

Pruning the Decision Tree by limiting its depth to 5 has improved generalization, reducing overfitting and providing balanced performance between the training and test sets. The model handles Class 0 (the majority class) very well, while its performance on Class 1 (the minority class) is decent but could be further improved, particularly in recall. This pruned model strikes a good balance between complexity and predictive power.

Let Depth = 7

Build and evaluate the model on the training data.

Build the Model

In [108]:
# Create a decision tree with pruning by limiting the depth
pruned_tree_depth_7 = DecisionTreeClassifier(random_state=1, max_depth=7)
pruned_tree_depth_7.fit(X_train, y_train)
Out[108]:
DecisionTreeClassifier(max_depth=7, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=7, random_state=1)

Train and Evaluate the Model

In [109]:
# Check the performance after pruning
y_pred_train_pruned_depth_7 = pruned_tree_depth_7.predict(X_train)
metrics_score(y_train, y_pred_train_pruned_depth_7)
              precision    recall  f1-score   support

           0       0.92      0.95      0.93      2043
           1       0.87      0.81      0.84       902

    accuracy                           0.90      2945
   macro avg       0.89      0.88      0.89      2945
weighted avg       0.90      0.90      0.90      2945

Test and Evaluate the model

In [110]:
# Check the performance after pruning
y_pred_test_pruned_depth_7 = pruned_tree_depth_7.predict(X_test)
metrics_score(y_test, y_pred_test_pruned_depth_7)
              precision    recall  f1-score   support

           0       0.87      0.90      0.89       896
           1       0.74      0.68      0.71       367

    accuracy                           0.84      1263
   macro avg       0.81      0.79      0.80      1263
weighted avg       0.83      0.84      0.84      1263

Observations

Improved Training Set Performance:

  • With a depth of 7, the model achieves 90% accuracy on the training set, showing a stronger fit compared to the pruned model with depth 5.

  • Class 0 (majority class) has a precision of 0.92, recall of 0.95, and an F1-score of 0.93, indicating that the model predicts the majority class very well on the training data.

  • Class 1 (minority class) also performs better than with a shallower depth, with precision of 0.87, recall of 0.81, and an F1-score of 0.84. This shows an improvement in recall and overall performance for the minority class on the training data.

Test Set Performance:

  • On the test set, the accuracy is 84%, which is a reasonable drop from the training accuracy of 90%, indicating that some overfitting is present, but the model still generalizes fairly well.

  • Class 0 has a precision of 0.87, recall of 0.90, and an F1-score of 0.89, showing that the model continues to predict the majority class effectively.

  • Class 1 has a precision of 0.74, recall of 0.68, and an F1-score of 0.71. This is a slight drop in performance compared to the training set, especially in terms of recall, where the model is missing about 32% of the actual positive cases in the test data. Precision is still decent, but the recall remains a challenge.

]Effect of Increased Depth:

  • Increasing the depth to 7 has improved the model’s performance on the training set, especially for Class 1 (the minority class). However, it comes with the trade-off of slightly decreased generalization to the test set, as the test accuracy is slightly lower than the training accuracy.

  • The gap in recall for Class 1 between the training set (0.81) and test set (0.68) suggests that the model is still somewhat overfitting, especially for the minority class.

]Class Imbalance Effects:

  • Similar to previous models, the imbalance between Class 0 and Class 1 leads to better performance on the majority class (Class 0), while the minority class (Class 1) sees lower precision and recall.

  • The drop in recall for Class 1 on the test set indicates that the model struggles more to generalize its predictions for the minority class, potentially missing important patterns in the test data for this class.

]Macro and Weighted Averages:

  • The macro average F1-score on the test set is 0.80, indicating that the performance on both classes is relatively balanced but slightly skewed in favor of Class 0.

  • The weighted average F1-score of 0.84 on the test set reflects the stronger performance on the majority class, which influences the overall score more due to the class imbalance.

Conclusion: Increasing the depth to 7 has improved the model’s fit on the training set, particularly for the minority class (Class 1). However, there is still a gap between training and test performance, particularly in terms of recall for Class 1, which indicates that the model is slightly overfitting to the training data. The model performs well overall, with solid results for Class 0, but the challenge of accurately predicting the minority class remains. A balance between depth and generalization could be further refined to enhance performance on unseen data.

Pruning Using Minimum Samples per Leaf (min_samples_leaf):¶

Build the Model

In [111]:
# Create a decision tree with pruning by limiting minimum samples per leaf
pruned_tree_leaf = DecisionTreeClassifier(random_state=1, min_samples_leaf=5)  # Set min_samples_leaf to 5 as an example
pruned_tree_leaf.fit(X_train, y_train)
Out[111]:
DecisionTreeClassifier(min_samples_leaf=5, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(min_samples_leaf=5, random_state=1)

Train and Evaluate the Model

In [112]:
# Check the performance after pruning
y_pred_train_pruned_leaf = pruned_tree_leaf.predict(X_train)
metrics_score(y_train, y_pred_train_pruned_leaf)
              precision    recall  f1-score   support

           0       0.94      0.95      0.95      2043
           1       0.89      0.86      0.88       902

    accuracy                           0.93      2945
   macro avg       0.92      0.91      0.91      2945
weighted avg       0.92      0.93      0.92      2945

Test and Evalaute the Model

In [113]:
# Check the performance after pruning
y_pred_test_pruned_leaf = pruned_tree_leaf.predict(X_test)
metrics_score(y_test, y_pred_test_pruned_leaf)
              precision    recall  f1-score   support

           0       0.86      0.86      0.86       896
           1       0.66      0.67      0.66       367

    accuracy                           0.80      1263
   macro avg       0.76      0.76      0.76      1263
weighted avg       0.80      0.80      0.80      1263

Observations

Training Set Performance:

  • The model performs very well on the training set, achieving an overall accuracy of 93%, indicating a good fit to the training data.

  • Class 0 (majority class) has a precision of 0.94, recall of 0.95, and an F1-score of 0.95, showing excellent performance in predicting the majority class.

  • Class 1 (minority class) also performs well on the training set, with a precision of 0.89, recall of 0.86, and an F1-score of 0.88. This suggests that the model is fairly successful in identifying the minority class, with a small trade-off in recall compared to Class 0.

Test Set Performance:

  • The test set accuracy is 80%, which is a significant drop compared to the training set accuracy. This indicates that the model generalizes less well to unseen data, likely due to slight overfitting.
  • Class 0 (majority class) on the test set has a precision of 0.86, recall of 0.86, and an F1-score of 0.86, reflecting stable and consistent performance for the majority class, even on the test set.

  • Class 1 (minority class) shows a noticeable decline in performance on the test set compared to the training set. It has a precision of 0.66, recall of 0.67, and an F1-score of 0.66, which reflects the model's struggle to generalize well to the minority class. About 33% of the actual positive instances are not identified, and there are a higher number of false positives.

Macro and Weighted Averages:

  • The macro average F1-score on the test set is 0.76, which reflects a balanced view of the model's performance across both classes. However, this lower score indicates that the model struggles more with the minority class (Class 1).

  • The weighted average F1-score is 0.80, highlighting the influence of the majority class (Class 0) on the overall performance due to its larger support. The better performance for Class 0 positively skews the overall metric.

Effect of Minimum Samples per Leaf:

  • Using a minimum samples per leaf constraint forces the model to generalize more by preventing the creation of overly specific decision rules for small subsets of data. This is why the model performs well on the training set without completely overfitting.

  • However, the drop in performance for Class 1 on the test set suggests that while the model generalizes better for the majority class, it still struggles with the minority class, likely due to class imbalance and the limited training examples for Class 1.

  1. Class Imbalance Challenge:

    • As in previous models, the class imbalance (more instances of Class 0 than Class 1) impacts the performance. The model does a much better job at predicting the majority class (Class 0), but it struggles with the minority class (Class 1), particularly in recall and precision on the test set.

Conclusion:

The use of a minimum samples per leaf constraint leads to a well-performing model on the training set and reasonable generalization to the test set, especially for the majority class. However, the model struggles to generalize for the minority class (Class 1), particularly on the test set, resulting in a lower F1-score and recall for this class. Addressing the class imbalance or further refining the model (e.g., adjusting the minimum samples per leaf) could improve its performance for Class 1.

Pruning Using Minimum Samples to Split (min_samples_split):¶

First we need to determine the optimum number of splits.

In [114]:
# Import necessary libraries
from Scikit-learn.model_selection import cross_val_score
import numpy as np

# Define the range of values for min_samples_split to test
split_range = range(2, 51, 2)  # Testing even values from 2 to 50
cv_scores_split = []  # List to store cross-validation scores

# Loop over the different min_samples_split values
for split in split_range:
    # Create a decision tree with the current min_samples_split value
    pruned_tree_split = DecisionTreeClassifier(random_state=1, min_samples_split=split)

    # Perform cross-validation (5-fold cross-validation in this case)
    scores = cross_val_score(pruned_tree_split, X_train, y_train, cv=5, scoring='accuracy')

    # Append the mean score for this split value
    cv_scores_split.append(np.mean(scores))

# Find the min_samples_split value with the best cross-validation score
optimal_split = split_range[np.argmax(cv_scores_split)]
print(f"The optimal min_samples_split is: {optimal_split}")

# Plot the results to visualize
plt.figure(figsize=(10, 6))
plt.plot(split_range, cv_scores_split, marker='o')
plt.xlabel('Min Samples Split')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Cross-Validation Accuracy vs Min Samples Split')
plt.show()
The optimal min_samples_split is: 48

Build the Model.

In [115]:
# Create a decision tree with pruning by limiting the minimum number of samples required to split a node
pruned_tree_split = DecisionTreeClassifier(random_state=1, min_samples_split=48)
pruned_tree_split.fit(X_train, y_train)
Out[115]:
DecisionTreeClassifier(min_samples_split=48, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(min_samples_split=48, random_state=1)

Train and Evalue the Model

In [116]:
# Check the performance after pruning
y_pred_train_pruned_split = pruned_tree_split.predict(X_train)
metrics_score(y_train, y_pred_train_pruned_split)
              precision    recall  f1-score   support

           0       0.91      0.94      0.92      2043
           1       0.85      0.79      0.82       902

    accuracy                           0.89      2945
   macro avg       0.88      0.86      0.87      2945
weighted avg       0.89      0.89      0.89      2945

Test and Evaluate the Model

In [117]:
# Check the performance after pruning
y_pred_test_pruned_split = pruned_tree_split.predict(X_test)
metrics_score(y_test, y_pred_test_pruned_split)
              precision    recall  f1-score   support

           0       0.87      0.90      0.88       896
           1       0.73      0.67      0.70       367

    accuracy                           0.83      1263
   macro avg       0.80      0.79      0.79      1263
weighted avg       0.83      0.83      0.83      1263

Observations

Training Set Performance:

  • The model achieves an accuracy of 89% on the training set, indicating a strong fit, but with less risk of overfitting compared to prior models with more lenient splitting conditions.

  • Class 0 (majority class) performs very well with a precision of 0.91, recall of 0.94, and an F1-score of 0.92. This shows that the model is highly capable of correctly predicting the majority class.

  • Class 1 (minority class) shows a solid performance, with a precision of 0.85, recall of 0.79, and an F1-score of 0.82. The recall for Class 1 is slightly lower, indicating that the model misses some positive instances in the training data, but overall, it performs reasonably well in identifying and predicting the minority class.

Test Set Performance:

  • On the test set, the model achieves an accuracy of 83%, which is a slight drop from the training set accuracy but indicates a good generalization to unseen data.

  • Class 0 (majority class) maintains strong performance on the test set with a precision of 0.87, recall of 0.90, and an F1-score of 0.88, indicating the model continues to predict the majority class effectively and with high confidence.

  • Class 1 (minority class) has a precision of 0.73, recall of 0.67, and an F1-score of 0.70, which shows that the model is able to predict the minority class reasonably well but misses about 33% of the actual positive instances. The precision is relatively high, meaning that when the model predicts Class 1, it is often correct, but recall could be improved.

Effect of min_samples_split:

  • The choice of an optimal min_samples_split of 48 means that the model requires at least 48 samples in a node before splitting further. This encourages the model to create more generalized splits, preventing overly specific decision rules that could lead to overfitting.

  • The impact of this constraint is visible in the balanced performance across both the training and test sets. The model generalizes well, particularly for Class 0, while still achieving reasonable performance for Class 1.

Macro and Weighted Averages:

  • The macro average F1-score on the test set is 0.79, reflecting the overall performance across both classes. This score highlights that the model is more effective at predicting the majority class, as the minority class brings down the macro average.

  • The weighted average F1-score on the test set is 0.83, which indicates that, while the model performs better for Class 0 due to the class imbalance, it still delivers decent performance for the minority class (Class 1).

Class Imbalance Impact:

  • The class imbalance continues to affect the model, as Class 0 receives better predictions than Class 1. The recall for Class 1 on the test set is lower than the training set, indicating that the model struggles to capture all positive instances for the minority class. Despite this, the precision remains relatively strong, which is beneficial for minimizing false positives.

Conclusion:

The use of an optimal min_samples_split of 48 results in a balanced and well-generalized model. The model performs well on both the training and test sets, with Class 0 (majority class) showing strong performance and Class 1 (minority class) maintaining decent precision, though recall could be improved. The model avoids overfitting and generalizes reasonably well to unseen data, but addressing the class imbalance or further fine-tuning the model could enhance performance for the minority class.

In [118]:
# Create a results table for pruned decision trees

# List of pruning methods and their corresponding predictions
pruning_methods = [
    ("Max Depth (Optimal)", y_pred_test_depth_max),
    ("Max Depth (5)", y_pred_test_pruned_depth_5),
    ("Max Depth (7)", y_pred_test_pruned_depth_7),
    ("Min Samples Leaf (5)", y_pred_test_pruned_leaf),
    ("Min Samples Split (48)", y_pred_test_pruned_split)
]

# Initialize a list to store the results
results = []

# Loop through each pruning method and calculate the metrics
for method_name, y_pred in pruning_methods:
    accuracy = round(accuracy_score(y_test, y_pred), 2)
    precision_0 = round(precision_score(y_test, y_pred, pos_label=0), 2)
    precision_1 = round(precision_score(y_test, y_pred, pos_label=1), 2)
    recall_0 = round(recall_score(y_test, y_pred, pos_label=0), 2)
    recall_1 = round(recall_score(y_test, y_pred, pos_label=1), 2)
    f1_0 = round(f1_score(y_test, y_pred, pos_label=0), 2)
    f1_1 = round(f1_score(y_test, y_pred, pos_label=1), 2)

    # Append the results to the list
    results.append([method_name, accuracy, precision_0, precision_1, recall_0, recall_1, f1_0, f1_1])

# Create a DataFrame for better visualization
pruned_results = pd.DataFrame(results, columns=[
    "Pruning Method",
    "Accuracy",
    "Precision Status 0",
    "Precision Status 1",
    "Recall Status 0",
    "Recall Status 1",
    "F1-Score Status 0",
    "F1-Score Status 1"
])

# Display the DataFrame
pruned_results
Out[118]:
Pruning Method Accuracy Precision Status 0 Precision Status 1 Recall Status 0 Recall Status 1 F1-Score Status 0 F1-Score Status 1
0 Max Depth (Optimal) 0.84 0.86 0.77 0.92 0.63 0.89 0.69
1 Max Depth (5) 0.86 0.89 0.79 0.92 0.71 0.90 0.75
2 Max Depth (7) 0.84 0.87 0.74 0.90 0.68 0.89 0.71
3 Min Samples Leaf (5) 0.80 0.86 0.66 0.86 0.67 0.86 0.66
4 Min Samples Split (48) 0.83 0.87 0.73 0.90 0.67 0.88 0.70

Conclusions:

Key Takeaways:

Max Depth (5) offers the best overall balance:

  • Accuracy: Highest at 86%.

  • Class 1 (minority class): Achieves precision of 0.79 and recall of 0.71, leading to the highest F1-score (0.75) for Class 1 across all pruning methods.

  • Class 0 (majority class): Maintains strong performance, with an F1-score of 0.90 and a recall of 0.92.

Max Depth (Optimal) strikes a reasonable balance:

  • Accuracy is slightly lower at 84%, but this method balances the performance between both classes.

  • Class 0 still maintains a high recall of 0.92, while Class 1 has a decent precision of 0.77 but lower recall at 0.63, leading to an overall F1-score of 0.69.

Max Depth (7) is more prone to overfitting:

  • Similar to the "optimal" depth in accuracy, but Class 1 performance drops with lower recall (0.68) and F1-score (0.71), indicating overfitting as it tries to memorize the training set rather than generalizing well.

Min Samples Leaf (5) results in the lowest performance:

  • Accuracy of 80% is the lowest, and Class 1 performance is weak, with precision and recall both at 0.66, resulting in a matching F1-score of 0.66.

Min Samples Split (48) provides a balanced generalization:

  • Accuracy is 83%, and Class 1 precision and recall are slightly better than the "min samples leaf" method, achieving 0.73 precision and 0.67 recall, but still not as strong as depth-based pruning.

  • Class 0 maintains good performance with precision of 0.87 and recall of 0.90, resulting in a solid F1-score of 0.88.

Final Conclusion:

  • Max Depth (5) is the most effective pruning method for balancing the performance across both classes while maintaining high accuracy and strong generalization. It offers the best trade-off between model complexity and predictive power, particularly for the minority class (Class 1), which is the hardest to classify.

Decision Tree - Hyperparameter Tuning¶

We have seen thre is anissue with data imbalance, and addressing it properly could lead to significantly improved results.

To validate this hypothesis, we need to examine the class distribution of the target variable.

In [119]:
# Check the distribution of the target variable
class_distribution = y_train.value_counts(normalize=True)

# Display the class distribution in percentage
class_distribution_percentage = class_distribution * 100
class_distribution_percentage
Out[119]:
proportion
status
0 69.371817
1 30.628183

In [120]:
# Alternatively, we can use visualization to check the imbalance:
import matplotlib.pyplot as plt

# Plot the class distribution
class_distribution.plot(kind='bar', color=['skyblue', 'orange'])
plt.title('Status Distribution in y_train')
plt.xlabel('Status')
plt.ylabel('Frequency')
plt.show()

Calculating Class Weights

In [121]:
# Calculate class weights manually
classes = np.unique(y_train)
class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight_dict = dict(zip(classes, class_weights))
print(class_weight_dict)
{0: 0.7207537934410181, 1: 1.6324833702882484}

Build the Classifier

In [122]:
# Use the computed class weights in the decision tree
tuned_tree_tuned_rand = DecisionTreeClassifier(class_weight=class_weight_dict)

# Fit the model
tuned_tree_tuned_rand.fit(X_train, y_train)
Out[122]:
DecisionTreeClassifier(class_weight={0: 0.7207537934410181,
                                     1: 1.6324833702882484})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(class_weight={0: 0.7207537934410181,
                                     1: 1.6324833702882484})

Tain and Evaluate the model.

In [123]:
# Predict on the test set
y_pred_class_weight = tuned_tree_tuned_rand.predict(X_train)
metrics_score(y_train, y_pred_class_weight)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2043
           1       1.00      1.00      1.00       902

    accuracy                           1.00      2945
   macro avg       1.00      1.00      1.00      2945
weighted avg       1.00      1.00      1.00      2945

Test and Evaluate the Model

In [124]:
# Predict on the test set
y_pred_class_weight = tuned_tree_tuned_rand.predict(X_test)
metrics_score(y_test, y_pred_class_weight)
              precision    recall  f1-score   support

           0       0.87      0.86      0.86       896
           1       0.66      0.68      0.67       367

    accuracy                           0.81      1263
   macro avg       0.76      0.77      0.77      1263
weighted avg       0.81      0.81      0.81      1263

Nothing in the way of improvement, lets altre the weights a bit and try again

In [125]:
# Set up the decision tree with class weights
dt_model = DecisionTreeClassifier()

# Define a grid of class weights to search through
param_grid = {'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 5}]}

# Perform a grid search to find the best class weights
grid_search = GridSearchCV(dt_model, param_grid, cv=5, scoring='f1_weighted')
grid_search.fit(X_train, y_train)

# Output the best parameters (weights) found
print("Best class weights:", grid_search.best_params_)
Best class weights: {'class_weight': {0: 1, 1: 2}}

Lets iterate over the new weights and eavluate

In [126]:
# Set up the decision tree with class weights
dt_model = DecisionTreeClassifier()

# Define a grid of class weights to search through
param_grid = {'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 5}]}

# Perform a grid search to find the best class weights
grid_search_class_weight = GridSearchCV(dt_model, param_grid, cv=5, scoring='f1_weighted', return_train_score=True)
grid_search_class_weight.fit(X_train, y_train)

# Iterate over the results and print evaluation metrics
for i, params in enumerate(grid_search.cv_results_['params']):
    print(f"\nEvaluation for class weights: {params}")

    # Refit the model on the training set with the current parameters
    dt_model_class_weight = DecisionTreeClassifier(class_weight=params['class_weight'])
    dt_model_class_weight.fit(X_train, y_train)

    # Predict on the test set
    y_pred__class_weight = dt_model_class_weight.predict(X_test)

    # Calculate accuracy, precision, recall, and F1-score
    accuracy = accuracy_score(y_test, y_pred__class_weight)
    report = classification_report(y_test, y_pred__class_weight, target_names=['Class 0', 'Class 1'])

    # Output metrics for each set of weights
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Classification Report:\n{report}")

# Output the best parameters (weights) found by GridSearchCV
print("\nBest class weights found by GridSearchCV:", grid_search.best_params_)
Evaluation for class weights: {'class_weight': {0: 1, 1: 2}}
Accuracy: 0.8124
Classification Report:
              precision    recall  f1-score   support

     Class 0       0.87      0.87      0.87       896
     Class 1       0.68      0.68      0.68       367

    accuracy                           0.81      1263
   macro avg       0.77      0.77      0.77      1263
weighted avg       0.81      0.81      0.81      1263


Evaluation for class weights: {'class_weight': {0: 1, 1: 3}}
Accuracy: 0.7965
Classification Report:
              precision    recall  f1-score   support

     Class 0       0.86      0.85      0.86       896
     Class 1       0.65      0.66      0.65       367

    accuracy                           0.80      1263
   macro avg       0.75      0.76      0.75      1263
weighted avg       0.80      0.80      0.80      1263


Evaluation for class weights: {'class_weight': {0: 1, 1: 5}}
Accuracy: 0.8052
Classification Report:
              precision    recall  f1-score   support

     Class 0       0.87      0.85      0.86       896
     Class 1       0.65      0.70      0.68       367

    accuracy                           0.81      1263
   macro avg       0.76      0.77      0.77      1263
weighted avg       0.81      0.81      0.81      1263


Best class weights found by GridSearchCV: {'class_weight': {0: 1, 1: 2}}

Observations:

Here are the observations based on the results:

Impact of Class Weights: Changing the class weights between Class 0 and Class 1 has a noticeable effect on the classification metrics, particularly for Class 1, which is the minority class:

  • As the weight for Class 1 increases (from 2 to 5), the recall for Class 1 slightly improves, though it remains relatively stable around 0.68 across the different configurations.

  • The precision for Class 1 remains consistent at 0.66, meaning the proportion of correctly classified positive instances out of all instances predicted as positive does not change significantly.

Overall Accuracy: The overall accuracy remains relatively stable across all evaluations, hovering around 80.4% to 80.8%. This indicates that the change in class weights does not have a drastic effect on overall model accuracy but helps balance the classification of Class 1 (the minority class).

F1-Score: The F1-score for Class 1 remains consistent at 0.67 across all weight configurations. This balance of precision and recall suggests that the model's performance on the minority class is steady, though no significant improvement in predictive power is observed as the class weight for Class 1 increases.

GridSearchCV Selection: Despite little improvement in recall or F1-score with the higher weight of Class 1, GridSearchCV selected the weight configuration {'class_weight': {0: 1, 1: 5}}. This indicates that, based on its internal scoring metrics, this was the optimal balance for the dataset, likely giving more importance to recall for Class 1 (the minority class).

Class 0 Performance: The performance for Class 0 (majority class) remains strong and stable across all configurations, with high precision, recall, and F1-scores consistently at 0.86.

Conclusion: The use of higher class weights for Class 1 helps maintain or slightly improve recall without significantly affecting precision or overall accuracy. GridSearchCV's choice of a higher class weight for Class 1 likely prioritizes correctly identifying more instances of the minority class at a slight cost to the overall accuracy, which is reasonable for imbalanced datasets.

Building a Random Forest Model¶

Build the Model

In [127]:
# Initialize the Random Forest Classifier
rf_model = RandomForestClassifier(random_state=1)

# Train the model
rf_model.fit(X_train, y_train)
Out[127]:
RandomForestClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(random_state=1)

Train and Evaluate the Model

In [128]:
# Predict on the test set
y_pred_rf = rf_model.predict(X_train)
metrics_score(y_train, y_pred_rf)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2043
           1       1.00      1.00      1.00       902

    accuracy                           1.00      2945
   macro avg       1.00      1.00      1.00      2945
weighted avg       1.00      1.00      1.00      2945

Test and Evaluate the Model

In [129]:
# Predict on the test set
y_pred_rf = rf_model.predict(X_test)
metrics_score(y_test, y_pred_rf)
              precision    recall  f1-score   support

           0       0.88      0.90      0.89       896
           1       0.73      0.70      0.72       367

    accuracy                           0.84      1263
   macro avg       0.81      0.80      0.80      1263
weighted avg       0.84      0.84      0.84      1263

Observations:

Overfitting on the Training Set:

  • The model performs perfectly on the training set with 100% precision, recall, and F1-scores for both Class 0 and Class 1. This indicates that the Random Forest model has likely overfitted to the training data, capturing even minor patterns and noise.

  • Overfitting is a common occurrence when a model becomes too complex and learns to fit the training data too well, at the expense of generalizing to unseen data.

Test Set Performance:

  • The performance on the test set is lower than on the training set, which is expected but indicates a significant difference between training and test set results, further supporting the overfitting concern.

  • Accuracy on the test set is 84%, which is fairly strong overall. However, the recall for Class 1 (the minority class) is 0.70, which shows that the model misses about 30% of the positive cases. The precision for Class 1 is 0.73, meaning that about 27% of the predicted positives are actually false positives.

  • Class 0 (the majority class) has good precision and recall (both around 0.90), which indicates that the model has learned to predict the majority class well.

Class Imbalance Effects:

  • As seen from the test results, Class 1 has weaker precision and recall compared to Class 0. This suggests that the class imbalance in the dataset (with Class 0 having more examples than Class 1) might be influencing the model's ability to generalize well for the minority class.

  • The model performs better on Class 0 than Class 1, and the gap between precision and recall for Class 1 is noticeable, showing that the model struggles slightly with predicting minority class instances.

Do we need to prune the tree? Yes¶

Random Forest models typically do not need pruning in the same way that decision trees do, as they are designed to avoid overfitting by averaging the predictions of many decision trees. However, let's perform the analysis to determine the answer to the question.

In [130]:
# Define the model
rf_model = RandomForestClassifier(random_state=1)

# Define a grid of hyperparameters for "pruning"
param_grid = {
    'max_depth': [4, 6, 8, 10],              # Pruning by limiting depth
    'min_samples_split': [2, 5, 10],         # Minimum samples to split
    'min_samples_leaf': [1, 2, 5],           # Minimum samples in each leaf
    'max_features': ['sqrt', 'log2', None],  # Number of features to consider for splitting
    'n_estimators': [50, 100, 200],          # Number of trees in the forest
}

# Perform a grid search to find the best hyperparameters
grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Output the best parameters (pruning configuration) found
print("Best parameters:", grid_search.best_params_)
Best parameters: {'max_depth': 6, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}

Build the Model

In [131]:
# Build the model
best_pruned_rf_model = grid_search.best_estimator_
best_pruned_rf_model.fit(X_train, y_train)
Out[131]:
RandomForestClassifier(max_depth=6, max_features=None, min_samples_split=5,
                       n_estimators=200, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_depth=6, max_features=None, min_samples_split=5,
                       n_estimators=200, random_state=1)

Train and Evalute the Model

In [132]:
# Train and evaluate the pruned random forest model
y_pred = best_pruned_rf_model.predict(X_train)
metrics_score(y_train, y_pred)
              precision    recall  f1-score   support

           0       0.91      0.94      0.92      2043
           1       0.85      0.79      0.82       902

    accuracy                           0.89      2945
   macro avg       0.88      0.86      0.87      2945
weighted avg       0.89      0.89      0.89      2945

Test and Evaluate the Model

In [133]:
# Test and evaluate the pruned random forest model
y_pred = best_pruned_rf_model.predict(X_test)
metrics_score(y_test, y_pred)
              precision    recall  f1-score   support

           0       0.89      0.91      0.90       896
           1       0.78      0.74      0.76       367

    accuracy                           0.86      1263
   macro avg       0.83      0.82      0.83      1263
weighted avg       0.86      0.86      0.86      1263

Observations:

Improved Generalization:

  • Compared to the unpruned model, the pruned Random Forest model shows improved generalization. The training set accuracy is 89%, which is more aligned with the test set accuracy of 86%. This indicates that the model is less likely to be overfitting compared to the previous model that had perfect performance on the training set.

Training Set Performance:

  • For the training set, Class 0 (majority class) has a precision of 0.91 and a recall of 0.94, indicating strong performance in identifying the majority class.

  • For Class 1 (minority class), the model's precision is 0.85 and recall is 0.79, with an F1-score of 0.82. This shows that the pruned model is capable of identifying a good portion of the minority class without overfitting to the training data.

Test Set Performance:

  • On the test set, the overall accuracy is 86%, which is a slight improvement compared to the previous Random Forest model’s accuracy of 84%.

  • The precision for Class 1 (minority class) is 0.78, and the recall is 0.74, leading to an F1-score of 0.76. This is a decent performance, although slightly lower than Class 0, which is expected in imbalanced datasets. These metrics suggest that the model is identifying the minority class relatively well but still has room for improvement, especially in recall.

  • For Class 0 (majority class), precision is 0.89 and recall is 0.91, with an F1-score of 0.90. This indicates strong performance for the majority class.

Balanced Performance:

  • The macro average F1-score on both the training and test sets shows a more balanced performance across both classes. In particular, the macro average F1 on the test set is 0.83, reflecting good overall performance on both the majority and minority classes.

Effect of Pruning:

  • By limiting the tree depth (max_depth=6) and adjusting other hyperparameters like min_samples_split and n_estimators, the model has successfully reduced overfitting while maintaining good performance on the test set.

  • The pruning has also helped smooth out the variance between the training and test performance, making the model more robust when predicting on unseen data.

Conclusion:

The pruned Random Forest model shows significant improvement in generalization, with a more balanced performance between the training and test sets. The overall accuracy is strong at 86%, and while Class 0 performs better, Class 1 predictions (precision of 0.78 and recall of 0.74) are respectable for a moderately imbalanced dataset. This suggests that the pruning strategy worked well, reducing overfitting while preserving the model’s effectiveness.

Random Forest Hyperparameter Tuning¶

Build the Model

In [134]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=1)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200],  # Number of trees
    'max_depth': [4, 6, 8, 10],  # Limiting depth (pruning)
    'min_samples_split': [2, 5, 10],  # Control for tree growth
    'min_samples_leaf': [1, 2, 4],  # Minimum number of samples in a leaf node
    'max_features': ['sqrt', 'log2']  # Features to consider when splitting
}

# Set up GridSearchCV for hypertuning
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', verbose=2, n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best parameters:", grid_search.best_params_)
Fitting 5 folds for each of 144 candidates, totalling 720 fits
Best parameters: {'max_depth': 8, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}

Train and Evalute the Model

In [135]:
# Train and evaluate the pruned random forest model
y_pred = grid_search.best_estimator_.predict(X_train)
metrics_score(y_train, y_pred)
              precision    recall  f1-score   support

           0       0.91      0.95      0.93      2043
           1       0.86      0.79      0.82       902

    accuracy                           0.90      2945
   macro avg       0.89      0.87      0.88      2945
weighted avg       0.90      0.90      0.90      2945

Test and Evaluate the Model

In [136]:
# Test and evaluate the pruned random forest model
y_pred = grid_search.best_estimator_.predict(X_test)
metrics_score(y_test, y_pred)
              precision    recall  f1-score   support

           0       0.88      0.92      0.90       896
           1       0.77      0.70      0.73       367

    accuracy                           0.85      1263
   macro avg       0.83      0.81      0.82      1263
weighted avg       0.85      0.85      0.85      1263

Observations:

Training Set Performance:

  • The model achieves an accuracy of 90% on the training set, demonstrating a strong fit. The selected hyperparameters, particularly the optimal max_depth=8 and min_samples_split=5, allow the model to capture patterns in the data while minimizing overfitting.

  • Class 0 (majority class) performs exceptionally well, with a precision of 0.91, recall of 0.95, and an F1-score of 0.93. This indicates that the model is highly capable of correctly identifying the majority class with minimal false positives and negatives.

  • Class 1 (minority class) shows solid performance, with a precision of 0.86, recall of 0.79, and an F1-score of 0.82. While the precision for Class 1 is strong, the recall is slightly lower, meaning the model misses some positive instances in the training data. Overall, it performs well in identifying and predicting the minority class without overfitting.

Test Set Performance:

  • On the test set, the model achieves an accuracy of 85%, which is a slight drop from the training set but still indicates good generalization to unseen data.

  • Class 0 (majority class) maintains strong performance on the test set, with a precision of 0.88, recall of 0.92, and an F1-score of 0.90. This shows that the model continues to effectively predict the majority class, with a high degree of confidence.

  • Class 1 (minority class) has a precision of 0.77, recall of 0.70, and an F1-score of 0.73. While the precision for Class 1 remains strong, the recall is lower, indicating that about 30% of positive instances are missed in the test set. This suggests that the model could improve in identifying minority class instances in unseen data, though its overall performance is decent.

Effect of Hyperparameter Tuning:

  • The tuned parameters, particularly the max_depth of 8 and min_samples_split of 5, help create a well-balanced model that generalizes effectively. The constraint on tree depth prevents the model from overfitting to training data, while the max_features='sqrt' ensures that a subset of features is considered at each split, preventing overly complex decision boundaries.

  • The model successfully balances performance across the training and test sets, particularly for the majority class (Class 0), while achieving reasonable performance for the minority class (Class 1).

Macro and Weighted Averages:

  • The macro average F1-score on the test set is 0.82, reflecting the model's overall performance across both classes. This score highlights the stronger performance for Class 0, with Class 1 contributing to a slightly lower macro average.

  • The weighted average F1-score is 0.85, which indicates that while Class 0's performance dominates due to class imbalance, the model still delivers decent performance for the minority class (Class 1).

Class Imbalance Impact:

  • The model continues to be affected by the class imbalance, with Class 0 receiving better predictions than Class 1. The recall for Class 1 on the test set is lower than on the training set, showing that the model struggles to fully capture all positive instances for the minority class in unseen data. Despite this, the precision for Class 1 remains high, indicating that the model is confident when it predicts positive instances, with fewer false positives.

Conclusion:

The hypertuned Random Forest model performs well on both the training and test sets, showing strong generalization without overfitting. Class 0 (majority class) maintains excellent performance across both sets, while Class 1 (minority class) performs decently, though its recall could be improved. The hyperparameters chosen, including max_depth=8, max_features='sqrt', and min_samples_split=5, contribute to a well-balanced model that effectively handles the data. Addressing class imbalance (e.g., using class weights) could further improve performance for the minority class.

Observations on the Data Imbalance¶

We have previously observed a data imbalance, with the target variable distributed in a ratio of 69.4% to 30.6%, where our key feature of interest is converted leads (status = 1).

There are several techniques available to address this imbalance. The most effective methods include:

  • Class Weight Adjustment: Modifying the model to account for imbalanced classes by adjusting class weights.
  • Oversampling: Using techniques like SMOTE to generate additional samples for the minority class.
  • Undersampling: Reducing the number of samples from the majority class to achieve balance.
  • Hybrid Methods: Combining both oversampling and undersampling to create a balanced dataset.

Approach: Class Weighting¶

In [137]:
# Train a Random Forest model with class weight adjustment
rf_class_weight = RandomForestClassifier(class_weight='balanced', random_state=1, n_estimators=100)
rf_class_weight.fit(X_train, y_train)

# Predict on the test set
y_pred_class_weight = rf_class_weight.predict(X_test)

# Generate the classification report
class_weight_report = classification_report(y_test, y_pred_class_weight, output_dict=True)
class_weight_report_df = pd.DataFrame(class_weight_report).transpose()

# Display the classification report
print(class_weight_report_df)
              precision    recall  f1-score      support
0              0.880694  0.906250  0.893289   896.000000
1              0.753666  0.700272  0.725989   367.000000
accuracy       0.846397  0.846397  0.846397     0.846397
macro avg      0.817180  0.803261  0.809639  1263.000000
weighted avg   0.843782  0.846397  0.844675  1263.000000

Approach: SMOTE¶

In [138]:
pip install imblearn
Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.12.3)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.26.4)
Requirement already satisfied: scipy>=1.5.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.13.1)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.5.2)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.5.0)
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Installing collected packages: imblearn
Successfully installed imblearn-0.0
In [139]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training data to oversample the minority class
smote = SMOTE(random_state=1)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train the Random Forest model on SMOTE-resampled data
rf_smote = RandomForestClassifier(random_state=1, n_estimators=100)
rf_smote.fit(X_train_smote, y_train_smote)

# Predict on the test set
y_pred_smote = rf_smote.predict(X_test)

# Generate the classification report
smote_report = classification_report(y_test, y_pred_smote, output_dict=True)
smote_report_df = pd.DataFrame(smote_report).transpose()

# Display the SMOTE classification report
print(smote_report_df)
              precision    recall  f1-score      support
0              0.888765  0.891741  0.890251   896.000000
1              0.733516  0.727520  0.730506   367.000000
accuracy       0.844022  0.844022  0.844022     0.844022
macro avg      0.811141  0.809631  0.810378  1263.000000
weighted avg   0.843653  0.844022  0.843832  1263.000000

Approach: Undersampling¶

In [140]:
from imblearn.under_sampling import RandomUnderSampler

# Apply undersampling to the training data
undersample = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = undersample.fit_resample(X_train, y_train)

# Train the Random Forest model on the undersampled data
rf_undersample = RandomForestClassifier(random_state=1, n_estimators=100)
rf_undersample.fit(X_train_under, y_train_under)

# Predict on the test set
y_pred_undersample = rf_undersample.predict(X_test)

# Generate the classification report
undersample_report = classification_report(y_test, y_pred_undersample, output_dict=True)
undersample_report_df = pd.DataFrame(undersample_report).transpose()

# Display the classification report
print(undersample_report_df)
              precision    recall  f1-score      support
0              0.923267  0.832589  0.875587   896.000000
1              0.670330  0.831063  0.742092   367.000000
accuracy       0.832146  0.832146  0.832146     0.832146
macro avg      0.796798  0.831826  0.808840  1263.000000
weighted avg   0.849769  0.832146  0.836796  1263.000000

Approach: Hybrid - SMOTE oversampling and Undersampling¶

In [141]:
from imblearn.combine import SMOTEENN

# Apply SMOTE + ENN (hybrid oversampling and undersampling) to the training data
smote_enn = SMOTEENN(random_state=1)
X_train_hybrid, y_train_hybrid = smote_enn.fit_resample(X_train, y_train)

# Train the Random Forest model on the hybrid resampled data
rf_hybrid = RandomForestClassifier(random_state=1, n_estimators=100)
rf_hybrid.fit(X_train_hybrid, y_train_hybrid)

# Predict on the test set
y_pred_hybrid = rf_hybrid.predict(X_test)

# Generate the classification report
hybrid_report = classification_report(y_test, y_pred_hybrid, output_dict=True)
hybrid_report_df = pd.DataFrame(hybrid_report).transpose()

# Display the classification report
print(hybrid_report_df)
              precision    recall  f1-score     support
0              0.923913  0.758929  0.833333   896.00000
1              0.590133  0.847411  0.695749   367.00000
accuracy       0.784640  0.784640  0.784640     0.78464
macro avg      0.757023  0.803170  0.764541  1263.00000
weighted avg   0.826924  0.784640  0.793354  1263.00000

Analysis of Curent Results¶

Before Imbalance Handling¶

In [142]:
# Generate a table for easy comparion of all methods used in analaysis

# Define the columns for the table
columns = ['Model', 'Dataset', 'Accuracy', 'Precision Class 0', 'Precision Class 1',
           'Recall Class 0', 'Recall Class 1', 'F1-Score Class 0', 'F1-Score Class 1']

# Data for each model (Train and Test results for each model)
data = [
    ['Logistic Regression', 'Train', 0.82, 0.86, 0.74, 0.90, 0.66, 0.88, 0.70],
    ['Logistic Regression', 'Test', 0.82, 0.86, 0.72, 0.89, 0.66, 0.88, 0.68],

    ['PCA', 'Train', 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    ['PCA', 'Test', 0.77, 0.85, 0.60, 0.83, 0.64, 0.84, 0.62],

    ['Decision Tree', 'Train', 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    ['Decision Tree', 'Test', 0.80, 0.86, 0.65, 0.85, 0.67, 0.86, 0.66],

    ['Max Depth (Pruning)', 'Train', 0.89, 0.90, 0.87, 0.95, 0.75, 0.92, 0.80],
    ['Max Depth (Pruning)', 'Test', 0.84, 0.86, 0.77, 0.92, 0.63, 0.89, 0.69],

    ['Limited Depth (Depth=5)', 'Train', 0.88, 0.89, 0.84, 0.94, 0.75, 0.92, 0.79],
    ['Limited Depth (Depth=5)', 'Test', 0.86, 0.89, 0.79, 0.92, 0.71, 0.90, 0.75],

    ['Limited Depth (Depth=7)', 'Train', 0.90, 0.92, 0.87, 0.95, 0.81, 0.93, 0.84],
    ['Limited Depth (Depth=7)', 'Test', 0.84, 0.87, 0.74, 0.90, 0.68, 0.89, 0.71],

    ['Min Samples per Leaf', 'Train', 0.93, 0.94, 0.89, 0.95, 0.86, 0.95, 0.88],
    ['Min Samples per Leaf', 'Test', 0.80, 0.86, 0.66, 0.86, 0.67, 0.86, 0.66],

    ['Min Samples to Split', 'Train', 0.89, 0.91, 0.85, 0.94, 0.79, 0.92, 0.82],
    ['Min Samples to Split', 'Test', 0.83, 0.87, 0.73, 0.90, 0.67, 0.88, 0.70],

    ['Random Forest', 'Train', 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    ['Random Forest', 'Test', 0.84, 0.88, 0.73, 0.90, 0.70, 0.89, 0.72],

    ['Pruned Random Forest', 'Train', 0.90, 0.91, 0.86, 0.95, 0.79, 0.93, 0.82],
    ['Pruned Random Forest', 'Test', 0.85, 0.88, 0.77, 0.92, 0.70, 0.90, 0.73],

    ['Random Forest (Tuned)', 'Train', 0.90, 0.91, 0.86, 0.95, 0.79, 0.93, 0.82],
    ['Random Forest (Tuned)', 'Test', 0.85, 0.88, 0.77, 0.92, 0.70, 0.90, 0.73]
]

# Create a DataFrame
model_performance_df = pd.DataFrame(data, columns=columns)

# Display the table
model_performance_df
Out[142]:
Model Dataset Accuracy Precision Class 0 Precision Class 1 Recall Class 0 Recall Class 1 F1-Score Class 0 F1-Score Class 1
0 Logistic Regression Train 0.82 0.86 0.74 0.90 0.66 0.88 0.70
1 Logistic Regression Test 0.82 0.86 0.72 0.89 0.66 0.88 0.68
2 PCA Train 1.00 1.00 1.00 1.00 1.00 1.00 1.00
3 PCA Test 0.77 0.85 0.60 0.83 0.64 0.84 0.62
4 Decision Tree Train 1.00 1.00 1.00 1.00 1.00 1.00 1.00
5 Decision Tree Test 0.80 0.86 0.65 0.85 0.67 0.86 0.66
6 Max Depth (Pruning) Train 0.89 0.90 0.87 0.95 0.75 0.92 0.80
7 Max Depth (Pruning) Test 0.84 0.86 0.77 0.92 0.63 0.89 0.69
8 Limited Depth (Depth=5) Train 0.88 0.89 0.84 0.94 0.75 0.92 0.79
9 Limited Depth (Depth=5) Test 0.86 0.89 0.79 0.92 0.71 0.90 0.75
10 Limited Depth (Depth=7) Train 0.90 0.92 0.87 0.95 0.81 0.93 0.84
11 Limited Depth (Depth=7) Test 0.84 0.87 0.74 0.90 0.68 0.89 0.71
12 Min Samples per Leaf Train 0.93 0.94 0.89 0.95 0.86 0.95 0.88
13 Min Samples per Leaf Test 0.80 0.86 0.66 0.86 0.67 0.86 0.66
14 Min Samples to Split Train 0.89 0.91 0.85 0.94 0.79 0.92 0.82
15 Min Samples to Split Test 0.83 0.87 0.73 0.90 0.67 0.88 0.70
16 Random Forest Train 1.00 1.00 1.00 1.00 1.00 1.00 1.00
17 Random Forest Test 0.84 0.88 0.73 0.90 0.70 0.89 0.72
18 Pruned Random Forest Train 0.90 0.91 0.86 0.95 0.79 0.93 0.82
19 Pruned Random Forest Test 0.85 0.88 0.77 0.92 0.70 0.90 0.73
20 Random Forest (Tuned) Train 0.90 0.91 0.86 0.95 0.79 0.93 0.82
21 Random Forest (Tuned) Test 0.85 0.88 0.77 0.92 0.70 0.90 0.73

Observations Based on Model Performance (before imbalance handling):

To determine the best method in terms of accuracy and recall for Class 1 (status = 1), let's evaluate the models:

  • Best Accuracy:

    • Limited Depth (Depth=5) and Pruned Random Forest both have the highest test accuracy at 0.86 and 0.85, respectively.
  • Highest Recall for Class 1:

    • The model with the highest recall for Class 1 is Limited Depth (Depth=5), achieving a recall of 0.71 for the minority class (status = 1).

Conclusion: The Limited Depth (Depth=5) model provides the best balance between overall accuracy (0.86) and recall for Class 1 (0.71). It offers the most reliable performance for accurately identifying positive instances in the minority class while maintaining strong overall model accuracy.

With Imbalance Handling¶

In [143]:
import pandas as pd

# Define the columns for the table
columns = ['Method', 'Accuracy', 'Precision Class 0', 'Precision Class 1',
           'Recall Class 0', 'Recall Class 1', 'F1-Score Class 0', 'F1-Score Class 1']

# Data for each method
data = [
    ['Class Weighting', 0.8464, 0.8807, 0.7537, 0.9063, 0.7003, 0.8933, 0.7260],
    ['SMOTE', 0.8440, 0.8888, 0.7335, 0.8917, 0.7275, 0.8903, 0.7305],
    ['Undersampling', 0.8321, 0.9233, 0.6703, 0.8326, 0.8311, 0.8756, 0.7421],
    ['Hybrid (SMOTE + Undersampling)', 0.7846, 0.9239, 0.5901, 0.7589, 0.8474, 0.8333, 0.6957]
]

# Create a DataFrame
imbalance_handling_df = pd.DataFrame(data, columns=columns)

# Display the table
imbalance_handling_df
Out[143]:
Method Accuracy Precision Class 0 Precision Class 1 Recall Class 0 Recall Class 1 F1-Score Class 0 F1-Score Class 1
0 Class Weighting 0.8464 0.8807 0.7537 0.9063 0.7003 0.8933 0.7260
1 SMOTE 0.8440 0.8888 0.7335 0.8917 0.7275 0.8903 0.7305
2 Undersampling 0.8321 0.9233 0.6703 0.8326 0.8311 0.8756 0.7421
3 Hybrid (SMOTE + Undersampling) 0.7846 0.9239 0.5901 0.7589 0.8474 0.8333 0.6957

Observations Based on Imbalance Handling Methods:

Using the criteria of accuracy and recall for Class 1 (status = 1):

Class Weighting:

  • Accuracy: 0.8464

  • Recall for Class 1: 0.7003

  • This method provides a strong balance between overall accuracy and recall for Class 1.

SMOTE:

  • Accuracy: 0.8440

  • Recall for Class 1: 0.7275

  • SMOTE achieves slightly lower accuracy than class weighting but improves recall for Class 1, making it more effective at capturing positive instances.

Undersampling:

  • Accuracy: 0.8321

  • Recall for Class 1: 0.8311

  • Undersampling gives the highest recall for Class 1 (0.8311), but overall accuracy drops slightly. It focuses more on improving minority class predictions.

Hybrid (SMOTE + Undersampling):

  • Accuracy: 0.7846

  • Recall for Class 1: 0.8474

  • This method has the highest recall for Class 1 (0.8474) but sacrifices accuracy. It is better suited if your focus is mainly on maximizing recall for Class 1.

Conclusion:

  • SMOTE provides the best balance between accuracy (0.8440) and recall for Class 1 (0.7275), making it a strong contender if you want to maintain good performance across both classes.

  • Undersampling and Hybrid methods perform better on recall for Class 1, but they result in lower overall accuracy.

Recommendation¶

To evaluate which model—Limited Depth (Depth=5) or SMOTE—is better suited to our goals (balancing overall accuracy with recall for Class 1), let's compare both models based on key performance metrics.

Metrics for Comparison:¶

  • Accuracy: Overall performance across both classes.
  • Precision for Class 1: The ability to avoid false positives for Class 1 (status = 1).
  • Recall for Class 1: The ability to correctly identify positive instances of Class 1.
  • F1-Score for Class 1: The balance between precision and recall for Class 1.

Limited Depth (Depth=5) Model:¶

Metric Value
Accuracy 0.86
Precision (Class 1) 0.79
Recall (Class 1) 0.71
F1-Score (Class 1) 0.75

SMOTE Model:¶

Metric Value
Accuracy 0.8440
Precision (Class 1) 0.7335
Recall (Class 1) 0.7275
F1-Score (Class 1) 0.7305

Observations:

Overall Accuracy:

  • The Limited Depth (Depth=5) model has slightly higher accuracy (0.86) compared to the SMOTE model (0.8440). This suggests that the Limited Depth model performs slightly better at predicting both classes combined.

Precision for Class 1:

  • The Limited Depth model has higher precision for Class 1 (0.79) than the SMOTE model (0.7335). This means that the Limited Depth model is better at avoiding false positives for Class 1.

Recall for Class 1:

  • The SMOTE model has a marginally better recall for Class 1 (0.7275) compared to the Limited Depth model (0.71). This means SMOTE is slightly better at identifying positive instances of Class 1, but the difference is small.

F1-Score for Class 1:

  • The Limited Depth model has a higher F1-Score for Class 1 (0.75) compared to the SMOTE model (0.7305), which indicates a better balance between precision and recall for the Limited Depth model.

Conclusion:

  • The Limited Depth (Depth=5) model performs better overall in terms of accuracy, precision, and F1-Score for Class 1. It strikes a better balance between identifying true positives and avoiding false positives.
  • While the SMOTE model slightly improves recall for Class 1, the gains are small, and it performs worse in terms of precision and overall accuracy.

Initial Model Recommendation:¶

None of the models we have explored do a great job of classifying and predicting which lead is most leikely to convert. Additional model exploration is necessary to find the optimum model. I suspect more advanced feature engineering would be beneficial.

In the analysis provided the Limited Depth (Depth=5) model is the better choice for our goals, as it maintains a good balance between accuracy and recall, with better overall precision and F1-Score for Class 1. This makes it more effective at minimizing false positives while still correctly identifying most positive instances.

Advanced Feature Engineering¶

Issues:

The current models have shown limitations in accurately predicting conversions, particularly in distinguishing the minority class (converted leads). Inorder to improve predictive accuracy the plan was to leverage advanced machine learning techniques, more agressive feature engineering, and additional hyperparameter tuning. This will hopefully enable us to generate a more efficient predictive model.

The key challenge is to handle the data imbalance and effectively utilize features such as time spent on the website, interaction type, and profile completion to enhance the model's ability to predict lead conversions, while minimizing false negatives (missed conversion opportunities).

Time is a constraint on exploring these avenues as the code below has run for numerous hours without being able to complete.

In [145]:
from Scikit-learn.preprocessing import PolynomialFeatures

# Generate interaction terms for existing features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Create a DataFrame with the new features
# Use get_feature_names_out() instead of get_feature_names()
X_train_poly_df = pd.DataFrame(X_train_poly, columns=poly.get_feature_names_out(X_train.columns))
X_test_poly_df = pd.DataFrame(X_test_poly, columns=poly.get_feature_names_out(X_test.columns))

# Display the new features
X_train_poly_df.head()
Out[145]:
age website_visits time_spent_on_website page_views_per_visit current_occupation_0 current_occupation_1 current_occupation_2 first_interaction_0 first_interaction_1 profile_completed_0 ... digital_media_1 educational_channels_0 digital_media_1 educational_channels_1 digital_media_1 referral_0 digital_media_1 referral_1 educational_channels_0 educational_channels_1 educational_channels_0 referral_0 educational_channels_0 referral_1 educational_channels_1 referral_0 educational_channels_1 referral_1 referral_0 referral_1
0 48.0 1.0 266.0 5.812 1.0 0.0 0.0 0.0 1.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1 55.0 3.0 984.0 2.970 0.0 0.0 1.0 1.0 0.0 0.0 ... 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2 33.0 2.0 2478.0 2.189 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 59.0 0.0 0.0 0.000 1.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
4 23.0 1.0 365.0 1.933 0.0 1.0 0.0 0.0 1.0 1.0 ... 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

5 rows × 325 columns

Advanced Regression Models¶

Random Forest with Polynomial Features¶

In [155]:
# Defining models to be tested
models = {
    'RandomForest': RandomForestClassifier(random_state=1)
}

# Simplified hyperparameter search space for each model
param_grids = {
    'RandomForest': {
        'n_estimators': [100, 200],
        'max_depth': [5, 10],
        'min_samples_split': [2, 5],
    },
}

# Running RandomizedSearchCV for the model
for model_name, model in models.items():
    param_grid = param_grids[model_name]
    randomized_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=5, scoring='f1', cv=5, random_state=1)
    randomized_search.fit(X_train_poly_df, y_train)

# Train the model
test_model = randomized_search.best_estimator_
test_model.fit(X_train_poly_df, y_train)

# Test the model
y_pred_train = test_model.predict(X_test_poly_df)

# Evaluate the model on the test set
y_pred_test = test_model.predict(X_test_poly_df)
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.88      0.92      0.90       896
           1       0.78      0.69      0.73       367

    accuracy                           0.85      1263
   macro avg       0.83      0.81      0.82      1263
weighted avg       0.85      0.85      0.85      1263

Gradient Boosting with Polynomial Features¶

In [156]:
# Defining models to be tested
models = {
    'GradientBoosting': GradientBoostingClassifier(random_state=1)
}

# Simplified hyperparameter search space for each model
param_grids = {
    'GradientBoosting': {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5],
    }
}

# Running RandomizedSearchCV for the model
for model_name, model in models.items():
    param_grid = param_grids[model_name]
    randomized_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=5, scoring='f1', cv=5, random_state=1)
    randomized_search.fit(X_train_poly_df, y_train)

# Train the model
test_model = randomized_search.best_estimator_
test_model.fit(X_train_poly_df, y_train)

# Test the model
y_pred_train = test_model.predict(X_test_poly_df)

# Evaluate the model on the test set
y_pred_test = test_model.predict(X_test_poly_df)
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.88      0.92      0.90       896
           1       0.77      0.68      0.72       367

    accuracy                           0.85      1263
   macro avg       0.82      0.80      0.81      1263
weighted avg       0.84      0.85      0.85      1263

XGBoost with Class Weights and Polynomial Features¶

In [165]:
# Calculate the class weights to balance the classes
class_weights = compute_class_weight(class_weight='balanced',
                                     classes=np.unique(y_train),
                                     y=y_train)

# Create a dictionary for class weights (0 for non-converted, 1 for converted)
weight_dict = {0: class_weights[0], 1: class_weights[1]}
print(f"Class Weights: {weight_dict}")

# Set the scale_pos_weight for XGBoost
scale_pos_weight = weight_dict[1] / weight_dict[0]

# Initialize the XGBoost classifier with the calculated scale_pos_weight
xgb_model = XGBClassifier(random_state=1, scale_pos_weight=scale_pos_weight)

# Train the XGBoost model
xgb_model.fit(X_train_poly_df, y_train)

# Make predictions on the test set
y_pred_test = xgb_model.predict(X_test_poly_df)

metrics_score(y_test, y_pred_test)
Class Weights: {0: 0.7207537934410181, 1: 1.6324833702882484}
              precision    recall  f1-score   support

           0       0.89      0.86      0.88       896
           1       0.69      0.75      0.72       367

    accuracy                           0.83      1263
   macro avg       0.79      0.80      0.80      1263
weighted avg       0.83      0.83      0.83      1263

XGBoost with Threshold Tuning and Polynomial Features¶

In [173]:
from Scikit-learn.metrics import classification_report, roc_curve

# Initialize and train the XGBoost model (without class weights for simplicity here)
xgb_model = XGBClassifier(random_state=1)
xgb_model.fit(X_train_poly_df, y_train)

# Predict probabilities instead of labels
y_pred_proba = xgb_model.predict_proba(X_test_poly_df)[:, 1]  # We only need probabilities for class 1

# Tune the threshold - Here, we set the threshold lower than 0.5 to improve recall
threshold = 0.3   # Lowering the threshold to improve recall
y_pred_custom_threshold = (y_pred_proba >= threshold).astype(int)

# Evaluate the model's performance at the new threshold
metrics_score(y_test, y_pred_custom_threshold)

# Optionally, plot the ROC curve to analyze thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
              precision    recall  f1-score   support

           0       0.91      0.85      0.88       896
           1       0.68      0.80      0.73       367

    accuracy                           0.83      1263
   macro avg       0.80      0.82      0.81      1263
weighted avg       0.84      0.83      0.84      1263

Majority Class Undersampling with Polynomial Features¶

In [174]:
from imblearn.under_sampling import RandomUnderSampler

# Apply Random Undersampling to balance the classes
undersampler = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = undersampler.fit_resample(X_train_poly_df, y_train)

# Train the XGBoost model on the undersampled data
xgb_model = XGBClassifier(random_state=1)
xgb_model.fit(X_train_under, y_train_under)

# Predict on the test set (Note: Test set is not undersampled, as we want to evaluate performance on the original data)
y_pred_test = xgb_model.predict(X_test_poly_df)

# Evaluate the model's performance using classification report
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.92      0.80      0.86       896
           1       0.63      0.83      0.72       367

    accuracy                           0.81      1263
   macro avg       0.78      0.82      0.79      1263
weighted avg       0.84      0.81      0.82      1263

SMOTE and Undersampling with Polynomial Features¶

In [175]:
from imblearn.combine import SMOTETomek

# Apply SMOTE and Random Undersampling (SMOTE + Tomek Links for cleaning)
smote_undersampler = SMOTETomek(random_state=1)
X_train_smote, y_train_smote = smote_undersampler.fit_resample(X_train_poly_df, y_train)

# Train the XGBoost model on the combined SMOTE + undersampled data
xgb_model = XGBClassifier(random_state=1)
xgb_model.fit(X_train_smote, y_train_smote)

# Predict on the test set (Note: Test set is not resampled)
y_pred_test = xgb_model.predict(X_test_poly_df)

# Evaluate the model's performance using classification report
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.88      0.88      0.88       896
           1       0.71      0.72      0.71       367

    accuracy                           0.83      1263
   macro avg       0.80      0.80      0.80      1263
weighted avg       0.83      0.83      0.83      1263

Optimize Hyperparameters for Recall¶

In [176]:
from xgboost import XGBClassifier
from Scikit-learn.model_selection import RandomizedSearchCV
from Scikit-learn.metrics import classification_report

# Define hyperparameter grid for XGBoost
param_grid = {
              'n_estimators': [100, 200, 300],    # Number of boosting rounds
              'learning_rate': [0.01, 0.1, 0.2],  # Step size at each boosting step
              'max_depth': [3, 5, 7],             # Maximum depth of a tree
              'colsample_bytree': [0.7, 1.0],     # Subsample ratio of columns when constructing each tree
              'subsample': [0.8, 1.0],            # Subsample ratio of the training instance
              'gamma': [0, 0.1, 0.2]              # Minimum loss reduction required to make a further partition on a leaf node
            }

# Initialize the XGBoost classifier
xgb_model = XGBClassifier(random_state=1)

# Set up RandomizedSearchCV to optimize for recall
random_search = RandomizedSearchCV(
                                    estimator=xgb_model,
                                    param_distributions=param_grid,
                                    n_iter=10,         # Number of different parameter settings to try
                                    scoring='recall',  # Optimize for recall
                                    cv=5,              # 5-fold cross-validation
                                    random_state=1,
                                    verbose=1
                                  )

# Fit the RandomizedSearchCV on the training data
random_search.fit(X_train_poly_df, y_train)

# Get the best model after hyperparameter tuning
xgb_model = random_search.best_estimator_

# Evaluate the best model on the test set
y_pred_test = xgb_model.predict(X_test_poly_df)

# Print classification report (to check recall performance)
metrics_score(y_test, y_pred_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
              precision    recall  f1-score   support

           0       0.88      0.89      0.89       896
           1       0.72      0.72      0.72       367

    accuracy                           0.84      1263
   macro avg       0.80      0.80      0.80      1263
weighted avg       0.84      0.84      0.84      1263

Final Model Recommendation¶

Employing alternate techniques to boost minority class (status = 1) recall significant improvements were found using XGBoost with Threshold Tuning and Polynomial Features (80%), but the largest gain came from Majority Class Undersampling with Polynomial Features (83%).

Conclusion

The best model for our purposes is Majority Class Undersampling with Polynomial Features

Class Precision Recall F1-Score Support
0 0.92 0.80 0.86 896
1 0.63 0.83 0.72 367
Accuracy 0.81 1263
Macro Avg 0.78 0.82 0.79 1263
Weighted Avg 0.84 0.81 0.82 1263

Customer Profile¶

In [150]:
# Extract feature importances from the pruned_tree_depth_5 model
importances = pruned_tree_depth_5.feature_importances_
feature_names = X_train.columns  # Assuming X_train is your feature set

# Create a DataFrame to show the importance of each feature
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display the top 10 most important features
print("Top 10 Important Features:")
print(feature_importance_df.head(10))
Top 10 Important Features:
                  Feature  Importance
2   time_spent_on_website    0.267570
7     first_interaction_0    0.259239
9     profile_completed_0    0.221637
4    current_occupation_0    0.103402
13        last_activity_1    0.062585
1          website_visits    0.016099
5    current_occupation_1    0.015881
14        last_activity_2    0.014340
0                     age    0.008964
10    profile_completed_1    0.007838

Customer profiles:¶

Converting Customers (Class 1 - Likely to Convert):¶

Time Spent on Website (Importance: 0.2676):

  • Customers who spend more time on the website are significantly more likely to convert. This suggests that customers who engage deeply with the website, spending time exploring products, services, or information, are strong candidates for conversion.

First Interaction Type (first_interaction_0) (Importance: 0.2592):

  • The nature of the first interaction plays a crucial role in conversion. Those who engaged early on through a particular channel (such as email, social media, or direct visits) are more likely to convert. This implies that optimizing the first point of contact can strongly influence conversions.

Profile Completion (profile_completed_0) (Importance: 0.2216):

  • Customers who have completed or nearly completed their profiles tend to convert at higher rates. This indicates that customers who take the time to fill in their details and engage more thoroughly are more serious about the offerings.

Current Occupation (current_occupation_0) (Importance: 0.1034):

  • Certain occupations are more likely to convert. Understanding the profile of occupations can help tailor offers, communications, or recommendations based on a customer’s professional background.

Last Activity (last_activity_1) (Importance: 0.0626):

  • Recent activities on the platform are indicative of conversion. Customers who have interacted with the website or platform in certain specific ways just before converting are likely more engaged.

Website Visits (Importance: 0.0161):

  • The number of website visits also plays a role, though it is less significant compared to time spent on the website. Customers who visit the website multiple times have a higher chance of converting, indicating increased interest over time.

Other Factors:

  • Age (Importance: 0.0090): Although age is a factor, it is less significant compared to other behavioral characteristics.

  • Profile Completed (Importance: 0.0078): While completing a profile is crucial, its specific state or variation influences conversion less than other factors.

  • Current Occupation (Occupation Type 1) and Last Activity (last_activity_2) play minor roles but still contribute to the model’s decision-making process.

Non-Converting Customers (Class 0 - Less Likely to Convert):¶

Time Spent on Website:

  • Customers who spend very little time on the website are less likely to convert. This suggests that low engagement leads to fewer conversions.

First Interaction Type:

  • Customers who do not engage meaningfully during the first interaction are less likely to convert, highlighting the importance of optimizing initial customer engagement.

Incomplete Profiles:

  • Customers with incomplete profiles or who do not bother to provide sufficient information tend to be less serious and less likely to convert.

Lower Engagement in Last Activity:

  • Customers who have had minimal or insignificant activity recently are likely disengaged and may not convert. Tracking recent activity can provide early signs of reduced interest.

Summary of Customer Profiles:¶

Converting Customers:¶

  • Behavior: High time spent on the website, frequent interactions (especially meaningful ones in the early stages), and recent activity signal strong engagement.

  • Demographics: While less significant, certain occupations are more likely to convert, and slightly younger or middle-aged customers show higher interest.

  • Engagement: Completing profiles and frequent website visits are key indicators of conversion.

Non-Converting Customers:¶

  • Behavior: Low time spent on the website, few interactions, and less engagement overall.

  • Demographics: Occupations that are less aligned with the product/service offering and incomplete profiles.

  • Engagement: Minimal or no meaningful activity in recent times, indicating loss of interest.

Actionable Insights and Recommendations¶

Improve Lead Engagement via Website Optimization¶

  • Feature Importance: The most important feature identified was time_spent_on_website. This indicates that customers who spend more time on the website are significantly more likely to convert.

  • Actionable Insight: Increase engagement by optimizing the website to encourage leads to spend more time exploring your offerings.

  • Actionable Recommendations:

    Heatmaps & User Behavior Analysis: Implement tools like Google Analytics or Hotjar to create heatmaps and track session recordings. Analyze which pages customers are spending the most time on and which areas are being clicked the most. This will help you identify sections of your website that need improvement, such as adding clear Call to Actions (CTAs) on key pages or enhancing content for high-traffic areas.

    Personalized Content Recommendations: Use machine learning algorithms or recommendation engines to show personalized product or content suggestions based on the user’s behavior (e.g., pages visited, items viewed). This increases the chances of keeping them engaged for longer.

    Interactive Tools: Integrate interactive features like quizzes, product selectors, or price calculators. Interactive content keeps users engaged longer and gives them a reason to explore your website deeper.

    Chatbots and Live Chat: Implement AI-driven chatbots or live chat features to engage with visitors in real-time. These tools can answer questions, provide product recommendations, and guide users through the buying process. Bots that trigger when visitors spend a long time on a particular page can nudge them toward conversion.

Leverage First Interaction and Profile Completion¶

  • Feature Importance: The features first_interaction_0 and profile_completed_0 were significant predictors of conversion, indicating that the initial interaction and completion of profiles influence whether a lead converts.

  • Actionable Insight: Focus on optimizing the first interaction (email, landing page, or sign-up form) and encourage profile completion to improve conversion rates.

  • Actionable Recommendations:

    First Interaction Personalization: Customize the first interaction with the lead based on the source (e.g., social media, email, ad campaign). Use dynamic content in your emails or landing pages that speaks directly to the lead's interests or needs. For instance, leads who arrive via an ad campaign might be more interested in seeing offers, while those from organic search might need more information.

    Strong CTAs: Ensure your landing pages and forms have strong CTAs (e.g., “Get Your Free Quote,” “See a Demo”) that clearly guide the lead toward the next step. Test different CTA variations (A/B Testing) to determine which versions drive the highest engagement.

    Gamification for Profile Completion: Encourage users to complete their profiles by gamifying the process. Show a progress bar that visually indicates how close they are to completing the profile and offer rewards (e.g., a discount or free content) when the profile reaches 100%. This not only encourages engagement but also provides you with more data to personalize your marketing efforts.

    Onboarding Sequences: Use email onboarding sequences that gradually prompt leads to complete their profiles. For example, after they sign up, send follow-up emails offering tips, reminders, or incentives to complete their profile.

Targeting Based on Occupation¶

  • Feature Importance: The features current_occupation_0 and current_occupation_1 were identified as significant predictors. Certain occupations are more likely to convert, indicating that targeting different customer segments based on occupation could improve conversion rates.

  • Actionable Insight: Segment marketing efforts based on a lead’s occupation, offering personalized communication, offers, and services that resonate with their profession.

  • Actionable Recommendations:

    1. Occupation-Based Segmentation: Segment your lead database by occupation and create tailored marketing campaigns for each segment. For example, professionals in the tech industry might appreciate cutting-edge product demos, while small business owners may respond better to cost-saving solutions.
    2. Custom Landing Pages: Create custom landing pages that cater to different occupations. Use language and visuals that speak directly to the needs of that profession, which increases the chances of engagement and conversion.
    3. Industry-Specific Offers: Provide industry-specific discounts or bundles. For instance, if certain occupations have higher conversion rates, you can offer them exclusive pricing or free trials tailored to their industry needs.
    4. Partnership Programs: Consider partnering with associations or organizations related to high-conversion occupations. This creates a direct line of communication with potential leads who are more likely to convert.

Optimizing Last Activity Timing¶

  • Feature Importance: The features last_activity_1 and last_activity_2 were significant, indicating that recent activity is an important predictor of conversion. Leads who have interacted with your platform or website recently are more likely to convert.

  • Actionable Insight: Track user activity and implement retargeting campaigns to engage with users who recently interacted with your website.

  • Actionable Recommendations:

    Real-Time Engagement: Implement real-time alerts or CRM triggers that notify your sales or marketing team when a lead has interacted with specific areas of your site (e.g., viewed a pricing page). Set up an automated follow-up sequence (via email or SMS) to re-engage the lead within a specific timeframe (e.g., 24 hours after the last interaction).

    Behavioral Retargeting: Use behavioral retargeting ads to reach users who left your site without converting. These ads should reflect the user’s recent activity, such as a product they viewed or a service page they visited.

    Automated Workflows: Create automated workflows that trigger emails or messages based on last activity. For example, if a lead viewed your product page but didn’t complete the purchase, an email with product benefits or an offer could nudge them toward conversion.

    Urgency and Scarcity Tactics: Combine activity tracking with urgency tactics (e.g., “Offer valid for 48 hours”) to encourage leads to act quickly. When paired with retargeting, this can improve the likelihood of conversion.

Referral and Loyalty Programs¶

  • Supplementary Action: Implement referral programs to leverage your existing customer base and generate high-quality leads. Word-of-mouth marketing is a proven method for converting leads who trust the referral source.

  • Actionable Recommendations:

    Referral Rewards: Create a simple, easy-to-understand referral program that incentivizes both the referrer and the referred lead. For example, offer existing customers a discount, credit, or gift card when they successfully refer a new lead who converts.

    Promote the Program: Make the referral program visible across all customer touchpoints, including your website, email campaigns, and during the checkout process. Send periodic reminder emails encouraging customers to refer friends and family.

    Track and Optimize: Use analytics tools to track how many leads are coming through referrals, and adjust the incentive structure if necessary. Additionally, identify your top referrers and engage them with personalized rewards or exclusive offers.

    Loyalty Programs: In addition to referrals, loyalty programs can encourage repeat business and referrals. Offer points or rewards for actions like repeat purchases, referrals, or social media engagement.

Follow-up with Leads Who Have High Website Visits but No Conversion¶

  • Feature Importance: The website_visits feature shows that repeated visits to the website correlate with conversion likelihood, but some leads still do not convert despite visiting the website multiple times.

  • Actionable Insight: Automate personalized follow-up emails or SMS messages for leads with multiple website visits who haven’t yet converted.

  • Actionable Recommendations:

    Behavior-Based Email Campaigns: Set up trigger-based email campaigns that automatically send targeted emails to leads who have visited your website multiple times but haven’t converted. Include content that addresses potential concerns or offers a personalized incentive (e.g., a limited-time discount).

    Lead Scoring for Priority Follow-Up: Use lead scoring to prioritize follow-up with leads who have the highest number of website visits. Leads with high visit counts should receive more personalized outreach, such as a phone call or a customized offer.

    Retargeting Ads for High-Intent Leads: Implement retargeting campaigns specifically designed for leads who visited your website multiple times. These ads can highlight the product or service they were interested in and offer a limited-time promotion to push them toward conversion.

    Live Chat Intervention: Implement a live chat feature that prompts when users visit the website for the second or third time without converting. The live chat can proactively offer assistance, answer questions, or provide a personalized recommendation to nudge them toward a decision.

Convert notebook to HTML¶

In [151]:
import os
os.getcwd()

# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Project_-_Classification_and_Hypothesis_Testing/Learner+Notebook+-+Full+Code+Version+-+Potential+Customers+Prediction.ipynb"
[NbConvertApp] Converting notebook /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Project_-_Classification_and_Hypothesis_Testing/Learner+Notebook+-+Full+Code+Version+-+Potential+Customers+Prediction.ipynb to html
[NbConvertApp] Writing 5704577 bytes to /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Project_-_Classification_and_Hypothesis_Testing/Learner+Notebook+-+Full+Code+Version+-+Potential+Customers+Prediction.html