SuperKart Sales Prediction¶

Context:¶

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. It is extremely important for a company to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action. Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

Objective:¶

SuperKart is an organization that owns a chain of supermarkets and food marts providing a wide range of products. They want to predict the future sales revenue of its different outlets so that they can strategize their sales operation across different tier cities and plan their inventory accordingly. To achieve this purpose, SuperKart has hired a data science firm, shared the sales records of its various outlets for the previous quarter, and asked the firm to come up with a suitable model to predict the total sales of the stores for the upcoming quarter.

Data Description:¶

The data contains the different attributes of the various products and stores. The detailed data dictionary is given below. • Product_Id - unique identifier of each product, each identifier having two letters at the beginning followed by a number. • Product_Weight - the weight of each product • Product_Sugar_Content - sugar content of each product like low sugar, regular, and no sugar • Product_Allocated_Area - the ratio of the allocated display area of each product to the total display area of all the products in a store • Product_Type - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others • Product_MRP - maximum retail price of each product • Store_Id - unique identifier of each store • Store_Establishment_Year - the year in which the store was established • Store_Size - the size of the store depending on sq. feet like high, medium, and low • Store_Location_City_Type - the type of city in which the store is located like Tier 1, Tier 2, and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than its Tier 2 and Tier 3 counterparts. • Store_Type - the type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2, and Food Mart Product_Store_Sales_Total - total revenue generated by the sale of that particular product in that particular store

In [ ]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Library to split the data
from Scikit-learn.model_selection import train_test_split

# Libaries to help with data visualization
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)

# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)

# Import libraries for building linear regression model
from statsmodels.formula.api import ols
import statsmodels.api as sm
from Scikit-learn.linear_model import LinearRegression

# Import library for preparing data
from Scikit-learn.model_selection import train_test_split

# Import library for checking multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Import library for visualization of the regression
import statsmodels.stats.api as sms

# Import library to plot q-q plot of residuals
import scipy.stats as stats
import pylab

#import librabry used to re-split the data in a different way
from Scikit-learn.model_selection import cross_val_score

import warnings
warnings.filterwarnings("ignore")
In [2]:
# Connect collab
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
kart = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Four - Regression and Prediction/Guided Project/SuperKart.csv')
In [ ]:
# Copying data to another variable to avoid any changes to original data
data = kart.copy()
In [ ]:
data.head()
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36

Observations:¶

  • Prediction of the total sales of the stores for the upcoming quarter means the variable Product_Store_Sales_Total is our target, all other variables will be considered to be independent.
In [ ]:
data.tail()
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
8758 NC7546 14.80 No Sugar 0.016 Health and Hygiene 140.53 OUT004 2009 Medium Tier 2 Supermarket Type2 3806.53
8759 NC584 14.06 No Sugar 0.142 Household 144.51 OUT004 2009 Medium Tier 2 Supermarket Type2 5020.74
8760 NC2471 13.48 No Sugar 0.017 Health and Hygiene 88.58 OUT001 1987 High Tier 2 Supermarket Type1 2443.42
8761 NC7187 13.89 No Sugar 0.193 Household 168.44 OUT001 1987 High Tier 2 Supermarket Type1 4171.82
8762 FD306 14.73 Low Sugar 0.177 Snack Foods 224.93 OUT002 1998 Small Tier 3 Food Mart 2186.08

Data Insights:¶

In [ ]:
# Undestand the shape of the data
shape = data.shape
print(f'There are {shape[0]} rows and {shape[1]} columns in the dataset.')
There are 8763 rows and 12 columns in the dataset.
In [ ]:
# Check the data types
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product_Id                 8763 non-null   object 
 1   Product_Weight             8763 non-null   float64
 2   Product_Sugar_Content      8763 non-null   object 
 3   Product_Allocated_Area     8763 non-null   float64
 4   Product_Type               8763 non-null   object 
 5   Product_MRP                8763 non-null   float64
 6   Store_Id                   8763 non-null   object 
 7   Store_Establishment_Year   8763 non-null   int64  
 8   Store_Size                 8763 non-null   object 
 9   Store_Location_City_Type   8763 non-null   object 
 10  Store_Type                 8763 non-null   object 
 11  Product_Store_Sales_Total  8763 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 821.7+ KB

Observations:¶

  • Product_Weight, Product_Allocated_Area, Product_MRP, Store_Establishment_Year, and Product_Store_Sales_Total are the numeric columns while the rest are objects.
  • There are a total of 8763 non-null observations in each of the columns. This indicates that there are no missing values in the data.
In [ ]:
# Checking for missing values in the data
data.isnull().sum()
Out[ ]:
0
Product_Id 0
Product_Weight 0
Product_Sugar_Content 0
Product_Allocated_Area 0
Product_Type 0
Product_MRP 0
Store_Id 0
Store_Establishment_Year 0
Store_Size 0
Store_Location_City_Type 0
Store_Type 0
Product_Store_Sales_Total 0

In [ ]:
# Check for duplicated rows
data.duplicated().sum()
Out[ ]:
0

Observations:¶

  • There are no missing values in the database
  • There are no duplicated rows
In [ ]:
# Lets see if there will be any data cleanup necessary - we will go column by column

# Product_Id: Numeric
P_Id_Desc = data['Product_Id'].describe()
print(f"Product_Id: ")
print('\n')
print(P_Id_Desc)
print('\n')

# Product_Weight: This is a numeric column so we can look at the min, amx and mean
P_Wt_Desc = data['Product_Weight'].describe()
print(f"Product_Weight: ")
print('\n')
print(P_Wt_Desc)
print('\n')

# Product_Sugar_Content: This is a Categorical (object) so we can look at unique values
P_Tp_Desc = data['Product_Sugar_Content'].unique()
print('\n')
print(f"Product_Sugar_Content: ")
print(P_Tp_Desc)
print('\n')

# Product Allocation Area: Numeric
P_Al_Desc = data['Product_Allocated_Area'].describe()
print('\n')
print(f"Product_Allocated_Area: ")
print(P_Al_Desc)
print('\n')

# Product_Type: Categorical
print(data['Product_Type'].value_counts())
print('\n')

# Product_MRP: Numeric
P_MRP_Desc = data['Product_MRP'].describe()
print('\n')
print(f"Product_MRP: ")
print(P_MRP_Desc)
print('\n')

# Store_Id: Categorical
Store_Id_Desc = data['Store_Id'].describe()
print('\n')
print(f"Store_Id: ")
print(Store_Id_Desc)
print('\n')
print(data['Store_Id'].unique())
print('\n')

# Store_Establishment_Year: Numeric
Store_Est_Unique = data['Store_Establishment_Year'].nunique()
print('\n')
print(f"Store_Establishment_Year: ")
print(Store_Est_Unique)
print('\n')
print(data['Store_Establishment_Year'].unique())
print('\n')

# Store_Size: Categorical
Store_Size_Desc = data['Store_Size'].unique()
print('\n')
print(f"Store_Size: ")
print(Store_Size_Desc)
print('\n')
print(data['Store_Establishment_Year'].unique())
print('\n')

# Store_Location_City_Type: Categorical
Store_Loc_Desc = data['Store_Location_City_Type'].nunique()
print('\n')
print(f"Store_Location_City_Type: ")
print(Store_Loc_Desc)
print(data['Store_Location_City_Type'].unique())
print('\n')

# Store_Type: Categorical
Store_Type_Desc = data['Store_Type'].unique()
print('\n')
print(f"Store_Type: ")
print(Store_Type_Desc)
print('\n')


# Print the values for Store_Type along with a count of the number of stores in that type

# Create a DataFrame with unique Store IDs and their types
unique_stores = data[['Store_Id', 'Store_Type']].drop_duplicates()

# Count the occurrences of each store type
store_type_counts = unique_stores['Store_Type'].value_counts()

print(store_type_counts)
print('\n')

#Product_Store_Sales_Total: Numeric
P_St_Desc = data['Product_Store_Sales_Total'].describe()
print('\n')
print(f"Product_Store_Sales_Total: ")
print(P_St_Desc)
print('\n')
Product_Id: 


count       8763
unique      8763
top       FD6114
freq           1
Name: Product_Id, dtype: object


Product_Weight: 


count    8763.000000
mean       12.653792
std         2.217320
min         4.000000
25%        11.150000
50%        12.660000
75%        14.180000
max        22.000000
Name: Product_Weight, dtype: float64




Product_Sugar_Content: 
['Low Sugar' 'Regular' 'No Sugar' 'reg']




Product_Allocated_Area: 
count    8763.000000
mean        0.068786
std         0.048204
min         0.004000
25%         0.031000
50%         0.056000
75%         0.096000
max         0.298000
Name: Product_Allocated_Area, dtype: float64


Product_Type
Fruits and Vegetables    1249
Snack Foods              1149
Frozen Foods              811
Dairy                     796
Household                 740
Baking Goods              716
Canned                    677
Health and Hygiene        628
Meat                      618
Soft Drinks               519
Breads                    200
Hard Drinks               186
Others                    151
Starchy Foods             141
Breakfast                 106
Seafood                    76
Name: count, dtype: int64




Product_MRP: 
count    8763.000000
mean      147.032539
std        30.694110
min        31.000000
25%       126.160000
50%       146.740000
75%       167.585000
max       266.000000
Name: Product_MRP, dtype: float64




Store_Id: 
count       8763
unique         4
top       OUT004
freq        4676
Name: Store_Id, dtype: object


['OUT004' 'OUT003' 'OUT001' 'OUT002']




Store_Establishment_Year: 
4


[2009 1999 1987 1998]




Store_Size: 
['Medium' 'High' 'Small']


[2009 1999 1987 1998]




Store_Location_City_Type: 
3
['Tier 2' 'Tier 1' 'Tier 3']




Store_Type: 
['Supermarket Type2' 'Departmental Store' 'Supermarket Type1' 'Food Mart']


Store_Type
Supermarket Type2     1
Departmental Store    1
Supermarket Type1     1
Food Mart             1
Name: count, dtype: int64




Product_Store_Sales_Total: 
count    8763.000000
mean     3464.003640
std      1065.630494
min        33.000000
25%      2761.715000
50%      3452.340000
75%      4145.165000
max      8000.000000
Name: Product_Store_Sales_Total, dtype: float64


In [ ]:
# See all numeric variables info at the same time
#data.describe(include = np.number)

# See all variables info at the same time
data.describe(include='all')
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
count 8763 8763.000000 8763 8763.000000 8763 8763.000000 8763 8763.000000 8763 8763 8763 8763.000000
unique 8763 NaN 4 NaN 16 NaN 4 NaN 3 3 4 NaN
top FD6114 NaN Low Sugar NaN Fruits and Vegetables NaN OUT004 NaN Medium Tier 2 Supermarket Type2 NaN
freq 1 NaN 4885 NaN 1249 NaN 4676 NaN 6025 6262 4676 NaN
mean NaN 12.653792 NaN 0.068786 NaN 147.032539 NaN 2002.032751 NaN NaN NaN 3464.003640
std NaN 2.217320 NaN 0.048204 NaN 30.694110 NaN 8.388381 NaN NaN NaN 1065.630494
min NaN 4.000000 NaN 0.004000 NaN 31.000000 NaN 1987.000000 NaN NaN NaN 33.000000
25% NaN 11.150000 NaN 0.031000 NaN 126.160000 NaN 1998.000000 NaN NaN NaN 2761.715000
50% NaN 12.660000 NaN 0.056000 NaN 146.740000 NaN 2009.000000 NaN NaN NaN 3452.340000
75% NaN 14.180000 NaN 0.096000 NaN 167.585000 NaN 2009.000000 NaN NaN NaN 4145.165000
max NaN 22.000000 NaN 0.298000 NaN 266.000000 NaN 2009.000000 NaN NaN NaN 8000.000000

Observations:¶

  • There are no duplicated product ID's
  • Product_Sugar_Content contains some data that will need to be cleaned, I am assumng 'Regular' and 'reg' are the same thing.
  • We will also likely need to transform Store_Established_Year into an age.
  • There are actually only four stores represented in the data.
  • Store_Establishment_Year is unique.
  • Store_Type is also unique, the data represents one store of each type.
  • One of the stores has an extrememly small value for Product_Store_Sales_Total $33 where the best performing store has $8k.
  • There are 16 different product tyoes. Fruits and Vegetables are the best selling(1249) followed by Snack Foods (1149)
In [ ]:
# Let's fix the Product_Sugar_Content data, place 'reg' into 'Regular'
data['Product_Sugar_Content'] = data['Product_Sugar_Content'].replace('reg', 'Regular')
print(data['Product_Sugar_Content'].value_counts())
Product_Sugar_Content
Low Sugar    4885
Regular      2359
No Sugar     1519
Name: count, dtype: int64
In [ ]:
# Since Product_Id's are all unique they could be dropped, but there is a non-numeric prefix that may give us some useful information, so ltes drop the numeric portion and keep only the prefixes
data['Product_Id'] = data['Product_Id'].str[:2]
print(data['Product_Id'].value_counts())
Product_Id
FD    6539
NC    1519
DR     705
Name: count, dtype: int64

Exploratory (Univariate) Analysis¶

Data Distribtion for numeric features¶

In [ ]:
# Function to plot a boxplot and a histogram along the same scale

def histogram_boxplot(data, feature, figsize = (12, 7), kde = False, bins = None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows = 2,      # Number of rows of the subplot grid = 2
        sharex = True,  # x-axis will be shared among all subplots
        gridspec_kw = {"height_ratios": (0.25, 0.75)},
        figsize = figsize,
    )                   # Creating the 2 subplots
    sns.boxplot(
        data = data, x = feature, ax = ax_box2, showmeans = True, color = "violet"
    )                   # Boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data = data, x = feature, kde = kde, ax = ax_hist2, bins = bins, palette = "winter"
    ) if bins else sns.histplot(
        data = data, x = feature, kde = kde, ax = ax_hist2
    )                   # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color = "green", linestyle = "--"
    )                   # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color = "black", linestyle = "-"
    )

Product_Weight

In [ ]:
histogram_boxplot(data, "Product_Weight")
No description has been provided for this image

Observations:

  • Product_Weight is uniformly distributed with a median and mean of approximately 12.5

Product_Allocated_Area

In [ ]:
histogram_boxplot(data, "Product_Allocated_Area")
No description has been provided for this image

Observations:

  • Product_Allocated_Area is right skewed in distributed with a median of approximately 0.05

If we needed to perform a linear transformation:

In [ ]:
sns.histplot(np.log(data['Product_Allocated_Area']) + .02)
Out[ ]:
<Axes: xlabel='Product_Allocated_Area', ylabel='Count'>
No description has been provided for this image

.02 is a bias values added to tune

Product_MRP

In [ ]:
histogram_boxplot(data, "Product_MRP", kde = True)
No description has been provided for this image

Observations:

  • Product_Weight is uniformly distributed with a mean and median of approximately 150

Product_Store_Sales_Total

In [ ]:
histogram_boxplot(data, "Product_Store_Sales_Total")
No description has been provided for this image

Observations:

  • Product_Store_Sales_Total is uniformly distributed with a median and mean of approximately 3500

Data Distribtion for categorical features¶

In [ ]:
# Function to create labeled barplots

def labeled_barplot(data, feature, perc = False, n = None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])            # Length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize = (count + 1, 5))
    else:
        plt.figure(figsize = (n + 1, 5))

    plt.xticks(rotation = 90, fontsize = 15)
    ax = sns.countplot(
        data = data,
        x = feature,
        palette = "Paired",
        order = data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )                              # Percentage of each class of the category
        else:
            label = p.get_height()         # Count of each level of the category

        x = p.get_x() + p.get_width() / 2  # Width of the plot
        y = p.get_height()                 # Height of the plot

        ax.annotate(
            label,
            (x, y),
            ha = "center",
            va = "center",
            size = 12,
            xytext = (0, 5),
            textcoords = "offset points",
        )                                 # Annotate the percentage

    plt.show()                            # Show the plot

Product_Id¶

In [ ]:
labeled_barplot(data, "Product_Id", perc = True)
No description has been provided for this image

Observations

  • Almost 75% of products sold fall under thye FD category (assumed to mean food)

Product_Sugar_Content¶

In [ ]:
labeled_barplot(data, "Product_Sugar_Content", perc = True)
No description has been provided for this image

Observations:

  • Approximatley 56% of items are Low Sugar versus nearly 27% having regular sugar, with a little more that 17% having no sugar at all.

Product_Type¶

In [ ]:
labeled_barplot(data, "Product_Type", perc = True)
No description has been provided for this image

Observations:

  • Fruits and Vegetables (14.3%) and Snack Food (13.1%) comprise the majority of items, with Seafood (0.9%) being the smallest

Store_Id¶

In [ ]:
labeled_barplot(data, "Store_Id", perc = True)
No description has been provided for this image

Observations

  • OUT004 has the largest number of products at 53%, it is larger than all other stores combined.
  • OUT002 has the lowest number of goods sold at just over 13%

Store_Size¶

In [ ]:
labeled_barplot(data, "Store_Size", perc = True)
No description has been provided for this image

Observations:

  • Medium stores account for almost 70% of all goods sold.
  • High and Small stores account for roughly the same amout of sales.

Store_Location_City_Type¶

In [ ]:
labeled_barplot(data, "Store_Location_City_Type", perc = True)
No description has been provided for this image

Observations:

  • Nearly 72% of the products have been sold from stores which are located in Tier 2 cities.
  • The products that have been sold from the stores are located in Tier 1 and Tier 3 cities are almost the same.

Store_Type¶

In [ ]:
labeled_barplot(data, "Store_Type", perc = True)
No description has been provided for this image

Observations:

  • Supermarket Type2 stores account for over 53% of total sales.
  • Department Sotres, Food Marts and SUpermarket Type 1 stores all have similar numbers of products sold.

Bivariate Analysis¶

In [ ]:
# Create a copy of the dataset data dropping all non-numeric values

# Create a copy of the dataset
data_numeric = data.copy()

# Drop non-numeric columns
data_numeric = data_numeric.drop(['Product_Id', 'Product_Sugar_Content', 'Product_Type', 'Store_Id', 'Store_Size', 'Store_Location_City_Type', 'Store_Type'], axis=1)

# Display the first few rows of the new DataFrame
data_numeric.head()
Out[ ]:
Product_Weight Product_Allocated_Area Product_MRP Store_Establishment_Year Product_Store_Sales_Total
0 12.66 0.027 117.08 2009 2842.40
1 16.54 0.144 171.43 1999 4830.02
2 14.28 0.031 162.08 1987 4130.16
3 12.10 0.112 186.31 1987 4132.18
4 9.57 0.010 123.67 1998 2279.36
In [ ]:
# Create the correlation heatmap
data_numeric.corr()
Out[ ]:
Product_Weight Product_Allocated_Area Product_MRP Store_Establishment_Year Product_Store_Sales_Total
Product_Weight 1.000000 0.014754 0.532716 -0.161907 0.737955
Product_Allocated_Area 0.014754 1.000000 -0.009508 0.004467 -0.000933
Product_MRP 0.532716 -0.009508 1.000000 -0.189357 0.787989
Store_Establishment_Year -0.161907 0.004467 -0.189357 1.000000 -0.185027
Product_Store_Sales_Total 0.737955 -0.000933 0.787989 -0.185027 1.000000

Correlation for Numerical Variables¶

In [ ]:
# Plot the heatmap on data_numeric
plt.figure(figsize=(10, 8))
sns.heatmap(data_numeric.corr(), annot=True, fmt = '.2f', cmap='coolwarm')
# fmt is the precison (number of decimals)
plt.show()
No description has been provided for this image

Observations:

  • Our target vaiable Product_Store_Sales_Total is highly correlated with Product_Weight and Product_MRP.
  • Product_Weight and Product_MRP are moderatley correlated.
  • Interestingly Store_Establishment_Year has a negative correlation with most other features.

Correlation for Categorical Variables¶

In [ ]:
#data_rev = data.groupby(['Product_Id'], as_index = False)['Product_Store_Sales_Total'].sum()
#sns.barplot(x = 'Product_Id', y = 'Product_Store_Sales_Total', data = data_rev)
In [ ]:
# alternatley we could do the following:

# Visualize the effect of Product_Id on Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.barplot(data = data, x = 'Product_Id', y = 'Product_Store_Sales_Total', estimator = np.sum, ci = 1, palette = 'icefire')
plt.show()

## Removed as it did not give any insights
# Visualize the effect of Product_Weight on Product_Store_Sales_Total
#plt.figure(figsize=(25, 8))
#sns.barplot(data = data, x = 'Product_Weight', y = 'Product_Store_Sales_Total', estimator = np.sum, ci = 1, palette = 'icefire')
#plt.xticks(rotation = 90)
#plt.show()

# Visualize the effect of Product_Sugar_Content on Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.barplot(data = data, x = 'Product_Sugar_Content', y = 'Product_Store_Sales_Total', estimator = np.sum, ci = 1, palette = 'icefire')
plt.xticks(rotation = 90)
plt.show()

# Visualize the effect of Product_Type on Product_Store_Sales_Total
plt.figure(figsize=(25, 8))
sns.barplot(data = data, x = 'Product_Type', y = 'Product_Store_Sales_Total', estimator = np.sum, ci = 1, palette = 'icefire')
plt.xticks(rotation = 90)
plt.show()

# Visualize the effect of Store_Id on Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.barplot(data = data, x = 'Store_Id', y = 'Product_Store_Sales_Total', estimator = np.sum, ci = 1, palette = 'icefire')
plt.xticks(rotation = 90)
plt.show()

# Visualize the effect of Store_Size on Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.barplot(data = data, x = 'Store_Size', y = 'Product_Store_Sales_Total', estimator = np.sum, ci = 1, palette = 'icefire')
plt.xticks(rotation = 90)
plt.show()

# Visualize the effect of Store_Location_City_Type on Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.barplot(data = data, x = 'Store_Location_City_Type', y = 'Product_Store_Sales_Total', estimator = np.sum, ci = 1, palette = 'icefire')
plt.xticks(rotation = 90)
plt.show()

# Visualize the effect of Store_Type on Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.barplot(data = data, x = 'Store_Type', y = 'Product_Store_Sales_Total', estimator = np.sum, ci = 1, palette = 'icefire')
plt.xticks(rotation = 90)
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Target Variable Distribution with Respect to Numerical Features¶

In [ ]:
# Visualize the relationship between Product_Weight and Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Product_Weight', y='Product_Store_Sales_Total', data=data)
plt.title('Product Weight vs. Product Store Sales Total')
plt.xlabel('Product Weight')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • Product_Weight and Product_Store_Sales_Total are approximately linearly correlated.
In [ ]:
# Visualize the relationship between Product_Allocated_Area and Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Product_Allocated_Area', y='Product_Store_Sales_Total', data=data)
plt.title('Product Allocated Area vs. Product Store Sales Total')
plt.xlabel('Product Allocated Area')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • There does not appear to be a relationship between Product_Allocated_Area and Product_Store_Sales_Total.
In [ ]:
# Visualize the relationship between Product_MRP and Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Product_MRP', y='Product_Store_Sales_Total', data=data)
plt.title('Product MRP vs. Product Store Sales Total')
plt.xlabel('Product MRP')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • Product_MRP and Product_Store_Sales_Total are approximately linearly correlated.
In [ ]:
## Removed - not useful
# Visualize the relationship between Store_Establishment_Year and Product_Store_Sales_Total
#plt.figure(figsize=(10, 6))
#sns.scatterplot(x='Store_Establishment_Year', y='Product_Store_Sales_Total', data=data)
#plt.title('Store Establishment Year vs. Product Store Sales Total')
#plt.xlabel('Store Establishment Year')
#plt.ylabel('Product Store Sales Total')
#plt.show()

Observations:

Target Variable Distribution with Respect to Categorical Features¶

In [ ]:
# Visualize the relationship between Product_Id and Product_Store_Sales_Total
plt.figure(figsize=(25, 8))
sns.boxplot(x='Product_Id', y='Product_Store_Sales_Total', data=data, palette = 'icefire')
plt.title('Product ID vs. Product Store Sales Total')
plt.xlabel('Product ID')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • The difference are negligible between the Product_Id tyes.
In [ ]:
# Visualize the relationship between Product_Sugar_Content and Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.boxplot(x='Product_Sugar_Content', y='Product_Store_Sales_Total', data=data, palette = 'icefire')
plt.title('Product Sugar Content vs. Product Store Sales Total')
plt.xlabel('Product Sugar Content')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • The effect of differences in SugarProduct_Sugar_Content in products appear to be negligible.
In [ ]:
# Visualize the relationship between Product_Type and Product_Store_Sales_Total
plt.figure(figsize=(25, 8))
sns.boxplot(x='Product_Type', y='Product_Store_Sales_Total', data=data, palette = 'icefire')
plt.title('Product Type vs. Product Store Sales Total')
plt.xlabel('Product Type')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • The effect of differences in Product_Type appears to be minimal.
In [ ]:
# Visualize the relationship between Store_Id and Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.boxplot(x='Store_Id', y='Product_Store_Sales_Total', data=data, palette = 'icefire')
plt.title('Store ID vs. Product Store Sales Total')
plt.xlabel('Store ID')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • OUT003 has the highest median.
  • OUT002 has sold the lowest median.
In [ ]:
# Visualize the relationship between Store_Size and Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.boxplot(x='Store_Size', y='Product_Store_Sales_Total', data=data, palette = 'icefire')
plt.title('Store Size vs. Product Store Sales Total')
plt.xlabel('Store Size')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • High sized stores appear to have the highest median.
  • Small sized stores have the lowest median.
In [ ]:
# Visualize the relationship between Store_Location_City_Type and Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.boxplot(x='Store_Location_City_Type', y='Product_Store_Sales_Total', data=data, palette = 'icefire')
plt.title('Store Location City Type vs. Product Store Sales Total')
plt.xlabel('Store Location City Type')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • Stores in tier 1 cities have the highest median.
  • Stores om tier 3 cities have the lowest median.
In [ ]:
# Visualize the relationship between Store_Type and Product_Store_Sales_Total
plt.figure(figsize=(10, 6))
sns.boxplot(x='Store_Type', y='Product_Store_Sales_Total', data=data, palette = 'icefire')
plt.title('Store Type vs. Product Store Sales Total')
plt.xlabel('Store Type')
plt.ylabel('Product Store Sales Total')
plt.show()
No description has been provided for this image

Observations:

  • Department Stores have the highest median.
  • Food Marts have the lowest median.

Store Data Analysis - Deep Dive - Not complete yet, can be expanded¶

Let's look at the effect on various store characteristics versus the Product_MRP to get some insight into the tyoes of stores we are dealing with

In [ ]:
data.groupby(['Store_Id'])['Product_Store_Sales_Total'].sum()
Out[ ]:
Product_Store_Sales_Total
Store_Id
OUT001 6223113.18
OUT002 2030909.72
OUT003 6673457.57
OUT004 15427583.43

In [ ]:
# Function to format the y-axis labels as plain numbers
def format_yaxis(x, pos):
    return f'{int(x):,}'

# Create a subplot grid with 1 row and 4 columns, sharing the y-axis
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 6), sharey=True)

# Add a main title for the entire figure
fig.suptitle('Product_Id Vs Product_Store_Sales_Total', fontsize=16)

# Iterate through stores and create a plot for each
for i, store_id in enumerate(['OUT001', 'OUT002', 'OUT003', 'OUT004']):
    # Plot the barplot on the corresponding axes
    sns.barplot(data=data[data['Store_Id'] == store_id], x='Product_Id',  y='Product_Store_Sales_Total',
                estimator=np.sum, ci=1, palette='icefire', ax=axes[i])

    # Set the title for each subplot
    axes[i].tick_params(axis='x', rotation=90)
    axes[i].set_title(f"{store_id}")

    # Apply the y-axis formatting to avoid exponents
    axes[i].yaxis.set_major_formatter(FuncFormatter(format_yaxis))

# Adjust the layout to prevent overlap
plt.tight_layout(rect=[0, 0.05, 1, 0.95])  # Adjust bottom and top margins to fit the suptitle and x-axis label

plt.show()

#
print('/n')
print('/n')
#

# Function to format the y-axis labels as plain numbers
def format_yaxis(x, pos):
    return f'{int(x):,}'

# Create a subplot grid with 1 row and 4 columns, sharing the y-axis
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 6), sharey=True)

# Add a main title for the entire figure
fig.suptitle('Product_Id Vs Product_MRP', fontsize=16)

# Iterate through stores and create a plot for each
for i, store_id in enumerate(['OUT001', 'OUT002', 'OUT003', 'OUT004']):
    # Plot the barplot on the corresponding axes
    sns.barplot(data=data[data['Store_Id'] == store_id], x='Product_Id',  y='Product_MRP',
                estimator=np.sum, ci=1, palette='icefire', ax=axes[i])

    # Set the title for each subplot
    axes[i].tick_params(axis='x', rotation=90)
    axes[i].set_title(f"{store_id}")

    # Apply the y-axis formatting to avoid exponents
    axes[i].yaxis.set_major_formatter(FuncFormatter(format_yaxis))

# Adjust the layout to prevent overlap
plt.tight_layout(rect=[0, 0.05, 1, 0.95])  # Adjust bottom and top margins to fit the suptitle and x-axis label

plt.show()
No description has been provided for this image
/n
/n
No description has been provided for this image

Observations:

In [ ]:
# Function to format the y-axis labels as plain numbers
def format_yaxis(x, pos):
    return f'{int(x):,}'

# Create a subplot grid with 1 row and 4 columns, sharing the y-axis
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 6), sharey=True)

# Add a main title for the entire figure
fig.suptitle('Product_Sugar_Content Vs Product_Store_Sales_Total', fontsize=16)

# Iterate through stores and create a plot for each
for i, store_id in enumerate(['OUT001', 'OUT002', 'OUT003', 'OUT004']):
    # Plot the barplot on the corresponding axes
    sns.barplot(data=data[data['Store_Id'] == store_id], x='Product_Sugar_Content', y='Product_Store_Sales_Total',
                estimator=np.sum, ci=1, palette='icefire', ax=axes[i])

    # Set the title for each subplot
    axes[i].tick_params(axis='x', rotation=90)
    axes[i].set_title(f"{store_id}")

    # Apply the y-axis formatting to avoid exponents
    axes[i].yaxis.set_major_formatter(FuncFormatter(format_yaxis))

# Adjust the layout to prevent overlap
plt.tight_layout(rect=[0, 0.05, 1, 0.95])  # Adjust bottom and top margins to fit the suptitle and x-axis label

plt.show()

#
print('/n')
print('/n')
#

# Function to format the y-axis labels as plain numbers
def format_yaxis(x, pos):
    return f'{int(x):,}'

# Create a subplot grid with 1 row and 4 columns, sharing the y-axis
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 6), sharey=True)

# Add a main title for the entire figure
fig.suptitle('Product_Sugar_Content Vs Product_MRP', fontsize=16)

# Iterate through stores and create a plot for each
for i, store_id in enumerate(['OUT001', 'OUT002', 'OUT003', 'OUT004']):
    # Plot the barplot on the corresponding axes
    sns.barplot(data=data[data['Store_Id'] == store_id], x='Product_Sugar_Content',  y='Product_MRP',
                estimator=np.sum, ci=1, palette='icefire', ax=axes[i])

    # Set the title for each subplot
    axes[i].tick_params(axis='x', rotation=90)
    axes[i].set_title(f"{store_id}")

    # Apply the y-axis formatting to avoid exponents
    axes[i].yaxis.set_major_formatter(FuncFormatter(format_yaxis))

# Adjust the layout to prevent overlap
plt.tight_layout(rect=[0, 0.05, 1, 0.95])  # Adjust bottom and top margins to fit the suptitle and x-axis label

plt.show()
No description has been provided for this image
/n
/n
No description has been provided for this image

Observations:

In [ ]:
# Function to format the y-axis labels as plain numbers
def format_yaxis(x, pos):
    return f'{int(x):,}'

# Create a subplot grid with 1 row and 4 columns, sharing the y-axis
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 6), sharey=True)

# Add a main title for the entire figure
fig.suptitle('Store_Location_City_Type Vs Product_Store_Sales_Total', fontsize=16)

# Iterate through stores and create a plot for each
for i, store_id in enumerate(['OUT001', 'OUT002', 'OUT003', 'OUT004']):
    # Plot the barplot on the corresponding axes
    sns.barplot(data=data[data['Store_Id'] == store_id], x='Store_Location_City_Type', y='Product_Store_Sales_Total',
                estimator=np.sum, ci=1, palette='icefire', ax=axes[i])

    # Set the title for each subplot
    axes[i].tick_params(axis='x', rotation=90)
    axes[i].set_title(f"Store: {store_id}")

    # Apply the y-axis formatting to avoid exponents
    axes[i].yaxis.set_major_formatter(FuncFormatter(format_yaxis))

# Add a single x-axis title
fig.text(0.5, 0.04, 'Store_Location_City_Type', ha='center', fontsize=14)

# Adjust the layout to prevent overlap
plt.tight_layout(rect=[0, 0.05, 1, 0.95])  # Adjust bottom and top margins to fit the suptitle and x-axis label

plt.show()

#
print('/n')
print('/n')
#

# Function to format the y-axis labels as plain numbers
def format_yaxis(x, pos):
    return f'{int(x):,}'

# Create a subplot grid with 1 row and 4 columns, sharing the y-axis
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 6), sharey=True)

# Add a main title for the entire figure
fig.suptitle('Store_Location_City_Type Vs Product_MRP', fontsize=16)

# Iterate through stores and create a plot for each
for i, store_id in enumerate(['OUT001', 'OUT002', 'OUT003', 'OUT004']):
    # Plot the barplot on the corresponding axes
    sns.barplot(data=data[data['Store_Id'] == store_id], x='Store_Location_City_Type',  y='Product_MRP',
                estimator=np.sum, ci=1, palette='icefire', ax=axes[i])

    # Set the title for each subplot
    axes[i].tick_params(axis='x', rotation=90)
    axes[i].set_title(f"{store_id}")

    # Apply the y-axis formatting to avoid exponents
    axes[i].yaxis.set_major_formatter(FuncFormatter(format_yaxis))

# Adjust the layout to prevent overlap
plt.tight_layout(rect=[0, 0.05, 1, 0.95])  # Adjust bottom and top margins to fit the suptitle and x-axis label

plt.show()
No description has been provided for this image
/n
/n
No description has been provided for this image

Observations:

In [ ]:
# Function to format the y-axis labels as plain numbers
def format_yaxis(x, pos):
    return f'{int(x):,}'

# Create a subplot grid with 1 row and 4 columns, sharing the y-axis
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 6), sharey=True)

# Add a main title for the entire figure
fig.suptitle('Store_Type Vs Product_Store_Sales_Total', fontsize=16)

# Iterate through stores and create a plot for each
for i, store_id in enumerate(['OUT001', 'OUT002', 'OUT003', 'OUT004']):
    # Plot the barplot on the corresponding axes
    sns.barplot(data=data[data['Store_Id'] == store_id], x='Store_Type', y='Product_Store_Sales_Total',
                estimator=np.sum, ci=1, palette='icefire', ax=axes[i])

    # Set the title for each subplot
    axes[i].tick_params(axis='x', rotation=90)
    axes[i].set_title(f"Store: {store_id}")

    # Apply the y-axis formatting to avoid exponents
    axes[i].yaxis.set_major_formatter(FuncFormatter(format_yaxis))

# Add a single x-axis title
fig.text(0.5, 0.04, 'Store_Type', ha='center', fontsize=14)

# Adjust the layout to prevent overlap
plt.tight_layout(rect=[0, 0.05, 1, 0.95])  # Adjust bottom and top margins to fit the suptitle and x-axis label

plt.show()

#
print('/n')
print('/n')
#

# Function to format the y-axis labels as plain numbers
def format_yaxis(x, pos):
    return f'{int(x):,}'

# Create a subplot grid with 1 row and 4 columns, sharing the y-axis
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(20, 6), sharey=True)

# Add a main title for the entire figure
fig.suptitle('Store_Type Vs Product_MRP', fontsize=16)

# Iterate through stores and create a plot for each
for i, store_id in enumerate(['OUT001', 'OUT002', 'OUT003', 'OUT004']):
    # Plot the barplot on the corresponding axes
    sns.barplot(data=data[data['Store_Id'] == store_id], x='Store_Type',  y='Product_MRP',
                estimator=np.sum, ci=1, palette='icefire', ax=axes[i])

    # Set the title for each subplot
    axes[i].tick_params(axis='x', rotation=90)
    axes[i].set_title(f"{store_id}")

    # Apply the y-axis formatting to avoid exponents
    axes[i].yaxis.set_major_formatter(FuncFormatter(format_yaxis))

# Adjust the layout to prevent overlap
plt.tight_layout(rect=[0, 0.05, 1, 0.95])  # Adjust bottom and top margins to fit the suptitle and x-axis label

plt.show()
No description has been provided for this image
/n
/n
No description has been provided for this image

Observations:

In [ ]:
# Visualize the effect of Product_Id on Product_Store_Sales_Total
plt.figure(figsize = [10, 8])
sns.boxplot(data = data, x = data.Product_Id, y = data.Product_MRP, palette = 'icefire')
plt.xticks(rotation = 90)
plt.title("Boxplot - Product_ID Vs Product_MRP")
plt.xlabel("Product_Id")
plt.ylabel("Product_MRP (of each product)")
plt.show()
No description has been provided for this image

Observations:

  • The median of all MRP by Product_Id is roughly equal.
In [ ]:
# Visualize the effect of Product_Type on Product_Store_Sales_Total
plt.figure(figsize = [25, 8])
sns.boxplot(data = data, x = data.Product_Type, y = data.Product_MRP, palette = 'icefire')
plt.xticks(rotation = 90)
plt.title("Boxplot - Product_Type Vs Product_MRP")
plt.xlabel("Product_Type")
plt.ylabel("Product_MRP (of each product)")
plt.show()
No description has been provided for this image

Observations:

  • The median of all MRP by Product_Type is roughly equal.

Product Analysis - Deep Dive - Not complete yet, can be expanded¶

In [ ]:
# Investigate the product distribution between stores, does store OUT003 carry the same items (but perhpas at a different prince) as store OUT004

pd.crosstab(data.Store_Id, data.Product_Type)
Out[ ]:
Product_Type Baking Goods Breads Breakfast Canned Dairy Frozen Foods Fruits and Vegetables Hard Drinks Health and Hygiene Household Meat Others Seafood Snack Foods Soft Drinks Starchy Foods
Store_Id
OUT001 136 30 10 119 150 142 199 38 114 134 130 31 13 202 106 32
OUT002 96 23 15 88 104 101 168 30 91 100 87 19 10 146 62 12
OUT003 99 34 19 90 145 122 182 23 89 107 106 32 13 186 74 28
OUT004 385 113 62 380 397 446 700 95 334 399 295 69 40 615 277 69
In [ ]:
# Visualize this
plt.figure(figsize=(15, 6))
sns.heatmap(pd.crosstab(data.Store_Id, data.Product_Type),
            annot = True,
            cmap = 'icefire',
            fmt = 'g')
plt.show()
No description has been provided for this image
In [ ]:
plt.figure(figsize=(15, 6))
sns.heatmap(pd.crosstab(data.Product_Sugar_Content, data.Product_Type),
            annot = True,
            cmap = 'icefire',
            fmt = 'g')
plt.show()
No description has been provided for this image

Data Preprocessing¶

Let's convert the Store_Established_Year nto something more meaningful and easier to use by the modle. So ley's ttransform that in store age.

In [ ]:
# Store Age
# Check if 'Store_Established_Year' exists in the DataFrame
if 'Store_Establishment_Year' in data.columns:
    data['Store_Age'] = 2024 - data['Store_Establishment_Year']
else:
    print("'Store_Establishment_Year' column not found in the DataFrame.")

# We no longer need the Store_Establishment_Year column, so let's drop it
data.drop('Store_Establishment_Year', axis=1, inplace=True)

There are currently 16 different product types, let's categorize those so we can reduce the number of dimensions. There are a number of ways we can do this, for this example lets simplify things into two categories, persihable and non-perishable.

In [ ]:
# Find the unique pproduct types  that exist in the data frame
#print(data['Product_Type'].unique())

# Or we cn do the fllowing (which may be easier to read)
data['Product_Type'].value_counts().index.to_list()
Out[ ]:
['Fruits and Vegetables',
 'Snack Foods',
 'Frozen Foods',
 'Dairy',
 'Household',
 'Baking Goods',
 'Canned',
 'Health and Hygiene',
 'Meat',
 'Soft Drinks',
 'Breads',
 'Hard Drinks',
 'Others',
 'Starchy Foods',
 'Breakfast',
 'Seafood']
In [ ]:
# Create a list of perishible items
perishables = [
              "Dairy",
              "Meat",
              "Fruits and Vegetables",
              "Breakfast",
              "Breads",
              "Seafood"
              ]

# Define a function to substitute the new categories for the origals based on our list
def change(x):
    if x in perishables:
        return "Perishables"
    else:
        return "Non Perishables"

# Intsantiate the function to preview the changes
data.Product_Type.apply(change)
Out[ ]:
Product_Type
0 Non Perishables
1 Perishables
2 Non Perishables
3 Non Perishables
4 Non Perishables
... ...
8758 Non Perishables
8759 Non Perishables
8760 Non Perishables
8761 Non Perishables
8762 Non Perishables

8763 rows × 1 columns


In [ ]:
# Perform the transformation by appending a new column to the end of the data frame
change1 = []
for i in range(0, len(data)):
    if data.Product_Type[i] in perishables:
        change1.append("Perishables")
    else:
        change1.append("Non Perishables")

data["Product_Type_Categories"] = change1

# We can now drop the Product_Type column
data.drop('Product_Type', axis=1, inplace=True)

# Check the transformation
data.head()
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_MRP Store_Id Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total Store_Age Product_Type_Categories
0 FD 12.66 Low Sugar 0.027 117.08 OUT004 Medium Tier 2 Supermarket Type2 2842.40 15 Non Perishables
1 FD 16.54 Low Sugar 0.144 171.43 OUT003 Medium Tier 1 Departmental Store 4830.02 25 Perishables
2 FD 14.28 Regular 0.031 162.08 OUT001 High Tier 2 Supermarket Type1 4130.16 37 Non Perishables
3 FD 12.10 Low Sugar 0.112 186.31 OUT001 High Tier 2 Supermarket Type1 4132.18 37 Non Perishables
4 NC 9.57 No Sugar 0.010 123.67 OUT002 Small Tier 3 Food Mart 2279.36 26 Non Perishables

Outlier Checks for Numercial Features¶

We need to check for. and take care of any outliers that will throw off our prediction calculation.

In [ ]:
# Outlier detection using boxplot

#numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
#numeric_columns.remove("Store_Establishment_Year")
#numeric_columns.remove("Store_Age_Years")


numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
numeric_columns
Out[ ]:
['Product_Weight',
 'Product_Allocated_Area',
 'Product_MRP',
 'Product_Store_Sales_Total',
 'Store_Age']
In [ ]:
plt.figure(figsize = (20, 15))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 5, i + 1)
    plt.boxplot(data[variable], whis = 1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
No description has been provided for this image

Observations:

  • The data contains many outlies, however, all are explanable
  • We will not be treat them as non are out of the ordinary.

Data Preparation for Modeling¶

We need to encode categorical data using one-hot encoding

In [ ]:
# Determine the types of features present in the data frame
data.dtypes.value_counts()
Out[ ]:
count
object 7
float64 4
int64 1

In [ ]:
data.select_dtypes(include='object').columns.to_list()
Out[ ]:
['Product_Id',
 'Product_Sugar_Content',
 'Store_Id',
 'Store_Size',
 'Store_Location_City_Type',
 'Store_Type',
 'Product_Type_Categories']
In [ ]:
# Perform OHE
data_OHE = pd.get_dummies(
                         data,
                         columns = data.select_dtypes(include='object').columns,
                         drop_first=True
                         )

data_OHE.head()
Out[ ]:
Product_Weight Product_Allocated_Area Product_MRP Product_Store_Sales_Total Store_Age Product_Id_FD Product_Id_NC Product_Sugar_Content_No Sugar Product_Sugar_Content_Regular Store_Id_OUT002 Store_Id_OUT003 Store_Id_OUT004 Store_Size_Medium Store_Size_Small Store_Location_City_Type_Tier 2 Store_Location_City_Type_Tier 3 Store_Type_Food Mart Store_Type_Supermarket Type1 Store_Type_Supermarket Type2 Product_Type_Categories_Perishables
0 12.66 0.027 117.08 2842.40 15 True False False False False False True True False True False False False True False
1 16.54 0.144 171.43 4830.02 25 True False False False False True False True False False False False False False True
2 14.28 0.031 162.08 4130.16 37 True False False True False False False False False True False False True False False
3 12.10 0.112 186.31 4132.18 37 True False False False False False False False False True False False True False False
4 9.57 0.010 123.67 2279.36 26 False True True False True False False False True False True True False False False
In [ ]:
x = data_OHE.drop('Product_Store_Sales_Total', axis = 1)  # These are the features
y = data_OHE['Product_Store_Sales_Total']                 # This is the target
In [ ]:
# Let's make sure everything looks as it should
x.shape, y.shape
Out[ ]:
((8763, 19), (8763,))
In [ ]:
# Lets use the stats models library ( we could also use Scikit-learn)
# Create a bias variable and pre-pend that onto the data frame
x = sm.add_constant(x)

x.head(5)
Out[ ]:
const Product_Weight Product_Allocated_Area Product_MRP Store_Age Product_Id_FD Product_Id_NC Product_Sugar_Content_No Sugar Product_Sugar_Content_Regular Store_Id_OUT002 Store_Id_OUT003 Store_Id_OUT004 Store_Size_Medium Store_Size_Small Store_Location_City_Type_Tier 2 Store_Location_City_Type_Tier 3 Store_Type_Food Mart Store_Type_Supermarket Type1 Store_Type_Supermarket Type2 Product_Type_Categories_Perishables
0 1.0 12.66 0.027 117.08 15 True False False False False False True True False True False False False True False
1 1.0 16.54 0.144 171.43 25 True False False False False True False True False False False False False False True
2 1.0 14.28 0.031 162.08 37 True False False True False False False False False True False False True False False
3 1.0 12.10 0.112 186.31 37 True False False False False False False False False True False False True False False
4 1.0 9.57 0.010 123.67 26 False True True False True False False False True False True True False False False

Split Data¶

In [ ]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)
In [ ]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape
Out[ ]:
((6134, 20), (2629, 20), (6134,), (2629,))

Model Training¶

Step One: Multicollinearity Analysis¶

  • We will use the Variance Inflation Factor (VIF), to check if there is multicollinearity in the data.

  • Features having a VIF score > 5 will be dropped/treated till all the features have a VIF score < 5

    Note: It is quite likekly that the VIF score for the categorical values we performed OHE On Hot Encoding) on will be high, we can safely ignore that for those features.

In [ ]:
# We need to convert the data type of the T/F data to 1/0 for the actual regression
x_train1 = x_train.astype(float)
y_train1 = y_train.astype(float)

# We need to do the same thing for the test data
x_test1 = x_test.astype(float)
y_test1 = y_test.astype(float)
In [ ]:
# To see how this works you can compute it in the following way, for a single column
variance_inflation_factor(x_train1.values, 0)
Out[ ]:
0.0
In [ ]:
# Create an empty datafram
VIF_df = pd.DataFrame()

VIF_df['Features'] = x_train1.columns
VIF_df['VIF'] = [variance_inflation_factor(x_train1.values, i) for i in range(x_train1.shape[1])]
VIF_df
Out[ ]:
Features VIF
0 const 0.000000
1 Product_Weight 1.752928
2 Product_Allocated_Area 1.001184
3 Product_MRP 1.885058
4 Store_Age inf
5 Product_Id_FD 2.774371
6 Product_Id_NC inf
7 Product_Sugar_Content_No Sugar inf
8 Product_Sugar_Content_Regular 1.088506
9 Store_Id_OUT002 inf
10 Store_Id_OUT003 inf
11 Store_Id_OUT004 inf
12 Store_Size_Medium inf
13 Store_Size_Small inf
14 Store_Location_City_Type_Tier 2 inf
15 Store_Location_City_Type_Tier 3 inf
16 Store_Type_Food Mart inf
17 Store_Type_Supermarket Type1 inf
18 Store_Type_Supermarket Type2 inf
19 Product_Type_Categories_Perishables 1.217690

Note: the VIF of the calculated Store_Age column if inf, meaning that the information contained in the column can be exactly created by informationin other columns, we will need to drop that column.

In [ ]:
x_train.drop('Store_Age', axis = 1, inplace = True)

# We need to remember to do the same thin in the testing data set as well
x_test.drop('Store_Age', axis = 1, inplace = True)
In [ ]:
# Recompute the VIF
# Create an empty datafram
VIF_df = pd.DataFrame()

VIF_df['Features'] = x_train1.columns
VIF_df['VIF'] = [variance_inflation_factor(x_train1.values, i) for i in range(x_train1.shape[1])]
VIF_df
Out[ ]:
Features VIF
0 const 0.000000
1 Product_Weight 1.752928
2 Product_Allocated_Area 1.001184
3 Product_MRP 1.885058
4 Store_Age inf
5 Product_Id_FD 2.774371
6 Product_Id_NC inf
7 Product_Sugar_Content_No Sugar inf
8 Product_Sugar_Content_Regular 1.088506
9 Store_Id_OUT002 inf
10 Store_Id_OUT003 inf
11 Store_Id_OUT004 inf
12 Store_Size_Medium inf
13 Store_Size_Small inf
14 Store_Location_City_Type_Tier 2 inf
15 Store_Location_City_Type_Tier 3 inf
16 Store_Type_Food Mart inf
17 Store_Type_Supermarket Type1 inf
18 Store_Type_Supermarket Type2 inf
19 Product_Type_Categories_Perishables 1.217690

Building Models¶

Let's create a function to calculate the performance metrics for our regression model so that we don't need to use the same code repeatedly.

In [ ]:
model = sm.OLS(y_train1, x_train1).fit()
model.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: Product_Store_Sales_Total R-squared: 0.824
Model: OLS Adj. R-squared: 0.823
Method: Least Squares F-statistic: 2860.
Date: Wed, 04 Sep 2024 Prob (F-statistic): 0.00
Time: 21:04:47 Log-Likelihood: -46132.
No. Observations: 6134 AIC: 9.229e+04
Df Residuals: 6123 BIC: 9.236e+04
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -58.4335 6.034 -9.684 0.000 -70.262 -46.605
Product_Weight 135.0943 3.436 39.318 0.000 128.359 141.830
Product_Allocated_Area -139.5679 118.876 -1.174 0.240 -372.608 93.472
Product_MRP 12.8696 0.254 50.702 0.000 12.372 13.367
Store_Age 5.0845 1.693 3.003 0.003 1.766 8.403
Product_Id_FD 0.3932 21.936 0.018 0.986 -42.609 43.396
Product_Id_NC 16.0039 12.324 1.299 0.194 -8.155 40.163
Product_Sugar_Content_No Sugar 16.0039 12.324 1.299 0.194 -8.155 40.163
Product_Sugar_Content_Regular 30.8608 13.284 2.323 0.020 4.819 56.902
Store_Id_OUT002 -261.6689 4.242 -61.689 0.000 -269.984 -253.354
Store_Id_OUT003 323.1208 9.566 33.779 0.000 304.368 341.873
Store_Id_OUT004 -143.9189 6.803 -21.155 0.000 -157.255 -130.583
Store_Size_Medium 179.2019 14.093 12.716 0.000 151.575 206.828
Store_Size_Small -261.6689 4.242 -61.689 0.000 -269.984 -253.354
Store_Location_City_Type_Tier 2 -119.8854 5.097 -23.523 0.000 -129.877 -109.894
Store_Location_City_Type_Tier 3 -261.6689 4.242 -61.689 0.000 -269.984 -253.354
Store_Type_Food Mart -261.6689 4.242 -61.689 0.000 -269.984 -253.354
Store_Type_Supermarket Type1 24.0335 7.601 3.162 0.002 9.133 38.934
Store_Type_Supermarket Type2 -143.9189 6.803 -21.155 0.000 -157.255 -130.583
Product_Type_Categories_Perishables 7.7111 13.233 0.583 0.560 -18.231 33.653
Omnibus: 1745.262 Durbin-Watson: 1.987
Prob(Omnibus): 0.000 Jarque-Bera (JB): 46497.082
Skew: 0.775 Prob(JB): 0.00
Kurtosis: 16.399 Cond. No. 6.36e+34


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.51e-62. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.



A concern: OUT004 sells more than any other store, in some cases outselling all other stores combined, but the coef would indicate it has a negative impact on sales?

Predictions¶

In [ ]:
y_hat = model.predict(x_train1)
In [ ]:
from Scikit-learn.metrics import r2_score, mean_absolute_percentage_error, mean_absolute_error, mean_squared_error
In [ ]:
# Let's examine the r2 score
r2_score(y_train1, y_hat)
Out[ ]:
0.8236738620512217
In [ ]:
# mean_squared_error
mean_squared_error(y_train1, y_hat)
Out[ ]:
199511.99287139165
In [ ]:
# Model Performance on test and train data
def model_evaluation(model, x, y):  # Changed function parameters to accept x and y directly
  y_hat = model.predict(x)          # Use the correct x for prediction

  rmse = np.sqrt(mean_squared_error(y, y_hat))
  mae  = mean_absolute_error(y, y_hat)
  mape = mean_absolute_percentage_error(y, y_hat)
  r2   = r2_score(y, y_hat)

  return pd.DataFrame({'RMSE': [rmse], 'MAE': [mae], 'MAPE': [mape], 'R2': [r2]})

# Perform the regression on the training data
train = model_evaluation(model, x_train1, y_train1)

# Perform the regression on the test data
test = model_evaluation(model, x_test1, y_test1)

# Label rows
train.index = ['Train']
test.index = ['Test']

# Concatenate the result
pd.concat([train, test], axis = 0)
Out[ ]:
RMSE MAE MAPE R2
Train 446.667654 262.110666 0.098564 0.823674
Test 446.685287 266.690587 0.104608 0.825641

Observations:¶

RMSE and MAE values are relatively low, indicating that your model’s predictions are close to the actual values.

MAPE is under 10%, which suggests that your model’s predictions are fairly accurate in percentage terms.

R2 is around 0.824, meaning your model explains about 82.4% of the variance in the data, which is generally considered a good fit.

These results suggest that your model is performing well.

This may be able to be improved by dropping some of the insignificat (high p value) columns or adding complexity back in by adding the Product_Types back in.

There is an opportunity to fine tune the model here - not completed¶

Drop insignificant variables (variables with p-value > 0.05) from the above model and create the regression model again.¶

Observations:¶

The model performs slightly worse with the test data, but not to a great extent.

We could go through and take a closer look at any functions with a high p value (p >= 0.05) and re-train the model and run the regression.

Check model with zero mean (not sure what that means)

In [ ]:
# Redidue of the model
model.resid

# which is this calculation#
y_train1 - model.predict(x_train1)
Out[ ]:
0
910 -157.136399
7022 -27.190868
8056 359.346837
2939 0.843308
68 126.494755
... ...
2895 165.265382
7813 -240.653427
905 200.238129
5192 20.060083
235 -106.886905

6134 rows × 1 columns


Checking the below linear regression assumptions¶

  1. Mean of residuals should be 0
  2. No Heteroscedasticity
  3. Linearity of variables
  4. Normality of error terms

1. Check for mean residuals¶

This should be as close to zero as possible

In [ ]:
np.mean(model.resid)
Out[ ]:
-1.438585163756384e-11

2. Check for homoscedasticity¶

  • Homoscedasticity - If the residuals are symmetrically distributed across the regression line, then the data is said to be homoscedastic.

  • Heteroscedasticity- - If the residuals are not symmetrically distributed across the regression line, then the data is said to be heteroscedastic. In this case, the residuals can form a funnel shape or any other non-symmetrical shape.

  • We'll use Goldfeldquandt Test to test the following hypothesis with alpha = 0.05:

    • Null hypothesis: Residuals are homoscedastic
    • Alternate hypothesis: Residuals have heteroscedastic
In [ ]:
result = sms.het_goldfeldquandt(y_train1, x_train1)
result
# The second value is nothing but the p-value
Out[ ]:
(0.9838049659112555, 0.6741005762249113, 'increasing')

Observation:

  • Since p-value > 0.05, we cannot reject the Null Hypothesis that the residuals are homoscedastic and the corresponding assumption is satisfied.

3. Linearity of variables¶

It states that the predictor variables must have a linear relation with the dependent variable.

To test the assumption, we'll plot residuals and the fitted values on a plot and ensure that residuals do not form a strong pattern. They should be randomly and uniformly scattered on the x-axis.

In [ ]:
# Plot residual vs fitted values -  we need to not see a pattern here
model.fittedvalues    # the same as model.predict(x_train1)

# create a scatterplot
plt.scatter(model.fittedvalues, model.resid, alpha = 0.4)
plt.axhline(y = 0, color = 'r', linestyle = '--', alpha = 0.5)
plt.title('Residuals vs Fitted Values')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
No description has been provided for this image

Observation:

  • There is no pattern in the residual vs fitted values plot. Hence, the corresponding assumption is satisfied.

4. Normality of error terms¶

The residuals should be normally distributed.

In [ ]:
# Plot histogram of residuals
sns.histplot(model.resid, kde = True)
Out[ ]:
<Axes: ylabel='Count'>
No description has been provided for this image

This looks like a normal distibution, but let's chck one more thing to be sure.

In [ ]:
stats.probplot(model.resid, dist = 'norm', plot = pylab)
plt.title('Normal Probability Plot of Residuals')
plt.show()
No description has been provided for this image

Observation:

  • From the above plots, the residuals seem to follow a normal distribution. Hence, the corresponding assumption is satisfied. Now, we will check the model performance on the train and test datasets.

Apply cross validation to improve the model and evaluate it using different evaluation metrics¶

In [ ]:
model2 = LinearRegression()

mae_ = cross_val_score(model2, x_train1, y_train1, cv = 10, scoring = 'neg_mean_absolute_percentage_error')  # This gives percenatge error for different folds where cv is the number of folds
In [ ]:
# Get the mean of mae_
np.mean(-mae_)
Out[ ]:
0.09884834456107033
In [ ]:
# Std deviation
np.std(-mae_)
Out[ ]:
0.009025108770564186
In [ ]:
# Assume a normal distibution
np.mean(-mae_) - 3 * np.std(-mae_), np.mean(-mae_) + 3 * np.std(-mae_)  # range at which 99.7% of the time the error stays withing this range
Out[ ]:
(0.07177301824937778, 0.1259236708727629)
In [ ]:
# Do the same on the test data set - even though this doesnt make a lot of sense given the model is now being trained on a very small data set
np.mean(cross_val_score(model2, x_test1, y_test1, cv = 10, scoring = 'neg_mean_absolute_percentage_error'))
Out[ ]:
-0.09818012800305127

Actionable Insights and Business Recommendations¶

  • We can use this prediction model to predict the total sales that will be done by SuperKart in the next quarter.

  • The model explains around 79% of the variation in the data.

  • OUT004 - OUT004, which is of Supermarket Type2, located in a Tier 2 city and having store size as medium, is performing well. SuperKart can look to increase the size of this store from medium to high. They can also look to set up stores in this type of city having comparable socio-economic conditions in order to expand their business and reach.

  • OUT002 - OUT002, being a food mart in a Tier 3 city and having small store size, is also performing well. SuperKart can look to upgrade its size or target similar cities for business expansion.

  • OUT001 - OUT001 which is a store of Supermarket Type 1, located in a Tier 2 city and having store size as high is not performing upto the mark. SuperKart can look to look build new marketing strategies (maybe give attractive discounts and offers) in this store in order to attract more customers.

  • OUT003 - Similar approach can be taken to increase the business of OUT003 which is a Departmental store in a Tier 1 city and having store size as medium. It is the premium store of the company where most of the costly goods are sold, so the correct set of audience should be targetted.

  • Daily needs like fruits and vegetables and snack foods are the biggest contributors to the revenue across all the stores. So SuperKart should look to maintain the inventory of these products properly and ensure that these products never face a shortage.

Additional information that can be collected to gain better insights -

  • Customers' details like age and gender can be incorporated in this model so that the company gets to know their target audience well and can build their sales strategies according to that.

  • The company should also keep a watch for the number of festive occasions present in a quarter so that they can strategize their inventory accordingly.

In [ ]:
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Four - Regression and Prediction/Guided Project/SuperKart Sales Prediction.ipynb"