Project Foundations for Data Science: FoodHub Data Analysis¶

by Robert Swetland

Marks: 40

Context¶

The number of restaurants in New York is increasing day by day. Many students and busy professionals rely on them because of their hectic lifestyles, and online food delivery is a convenient option that brings them good food from their favorite restaurants. FoodHub, a food aggregator company, offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.
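The revenue model described above can be sketched in code. The actual fixed margin FoodHub collects is not stated anywhere in the problem, so the 25% rate below is purely a hypothetical assumption for illustration:

```python
import pandas as pd

# Hypothetical commission rate; the actual fixed margin FoodHub charges is not given
COMMISSION_RATE = 0.25

# A few sample order costs (dollars)
orders = pd.Series([30.75, 12.08, 12.23], name="order_cost")

# FoodHub's revenue per order = order cost * commission rate
revenue = (orders * COMMISSION_RATE).round(2)
print(revenue.sum())  # total commission earned on these three orders
```
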

Objective¶

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.

Data Description¶

The data contains attributes related to each food order placed through FoodHub. The detailed data dictionary is given below.

Data Dictionary¶

  • order_id: Unique ID of the order
  • customer_id: ID of the customer who ordered the food
  • restaurant_name: Name of the restaurant
  • cuisine_type: Cuisine ordered by the customer
  • cost_of_the_order: Cost of the order
  • day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
  • rating: Rating given by the customer out of 5
  • food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
  • delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off confirmation.
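Both time columns are differences between in-app confirmation timestamps. A minimal sketch of that derivation, using hypothetical timestamps (the dataset itself stores only the resulting minutes):

```python
import pandas as pd

# Hypothetical timestamps for a single order; the dataset stores only derived minutes
confirmed = pd.Timestamp("2024-01-05 18:00")    # restaurant confirms the order
picked_up = pd.Timestamp("2024-01-05 18:25")    # delivery person confirms pick-up
dropped_off = pd.Timestamp("2024-01-05 18:45")  # delivery person confirms drop-off

# food_preparation_time: confirmation -> pick-up
food_preparation_time = int((picked_up - confirmed).total_seconds() // 60)

# delivery_time: pick-up -> drop-off
delivery_time = int((dropped_off - picked_up).total_seconds() // 60)

print(food_preparation_time, delivery_time)  # 25 20
```
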

Let us start by importing the required libraries¶

In [68]:
# Import libraries for data manipulation
import numpy as np
import pandas as pd

# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import libraries for statistical analysis
from scipy import stats
from scipy.stats import pearsonr
from scipy.stats import linregress

# Perform math
import math

import warnings
warnings.filterwarnings('ignore')

# Suppress future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Connect collab
from google.colab import drive
drive.mount('/content/drive')

# Load data from csv file
data = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FoodHub/foodhub_order.csv')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Understanding the structure of the data¶

In [2]:
# returns the first 5 rows
data.head()
Out[2]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.75 Weekend Not given 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.08 Weekend Not given 25 23
2 1477070 66393 Cafe Habana Mexican 12.23 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.20 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.59 Weekday 4 25 24
In [3]:
# returns the last 5 rows
data.tail()
Out[3]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
1893 1476701 292602 Chipotle Mexican Grill $1.99 Delivery Mexican 22.31 Weekend 5 31 17
1894 1477421 397537 The Smile American 12.18 Weekend 5 31 19
1895 1477819 35309 Blue Ribbon Sushi Japanese 25.22 Weekday Not given 31 24
1896 1477513 64151 Jack's Wife Freda Mediterranean 12.18 Weekday 5 23 31
1897 1478056 120353 Blue Ribbon Sushi Japanese 19.45 Weekend Not given 28 24

Observations:¶

The DataFrame has the 9 columns listed in the Data Dictionary. Each row corresponds to a single order placed by a customer.

Question 1: How many rows and columns are present in the data?¶

In [4]:
# Determine the number of rows and columns by calling data.shape
data.shape
Out[4]:
(1898, 9)

Observations:¶

The DataFrame contains:

  • rows: 1898
  • columns: 9

Question 2: What are the datatypes of the different columns in the dataset? (The info() function can be used)¶

In [51]:
# Use info() to print a concise summary of the DataFrame
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1162 non-null   float64
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
 9   cuisine_type_encoded   1898 non-null   int64  
 10  total_time             1898 non-null   int64  
 11  total_wait_time        1898 non-null   int64  
 12  commission             1898 non-null   float64
dtypes: float64(3), int64(7), object(3)
memory usage: 192.9+ KB

Observations:¶

  • Note: this printed summary reflects a later re-run of the cell, after feature-engineering columns (cuisine_type_encoded, total_time, total_wait_time, commission) had been added and rating had been converted to a numeric type (hence 1162 non-null). The raw dataset loaded from the CSV is described below.

  • Number of Entries: The DataFrame contains 1898 entries, ranging from index 0 to 1897.

  • Number of Columns: The raw data has 9 columns in total.

  • Column Names and Types:

    • order_id: int64
    • customer_id: int64
    • restaurant_name: object (string)
    • cuisine_type: object (string)
    • cost_of_the_order: float64
    • day_of_the_week: object (string)
    • rating: object (string; unrated orders are stored as the text 'Not given')
    • food_preparation_time: int64
    • delivery_time: int64
  • Non-Null Counts: Each raw column has 1898 non-null entries, so pandas reports no missing values (unrated orders are masked by the 'Not given' string rather than stored as NaN).

  • Memory Usage: The raw DataFrame uses approximately 133.6 KB of memory.

Data Types Distribution (raw data):

  • There are 4 columns with integer data types (int64).
  • There is 1 column with a float data type (float64).
  • There are 4 columns with object data types (object), which typically represent string data.

These observations indicate a well-formed DataFrame with complete data and a mix of numerical and categorical data types.

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method¶

Observations:¶

Each column has 1898 non-null entries, so pandas reports no missing values. Note, however, that unrated orders are recorded as the string 'Not given' in the rating column, which masks missingness rather than eliminating it.
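Because the rating column hides its missingness behind a string sentinel, one appropriate treatment (sketched here on a small synthetic frame, not the full dataset) is to convert 'Not given' to NaN and make the column numeric, so pandas can count and exclude the missing values:

```python
import numpy as np
import pandas as pd

# Small synthetic frame mirroring the rating column's 'Not given' sentinel
df = pd.DataFrame({"rating": ["Not given", "5", "3", "Not given", "4"]})

# Replace the sentinel with NaN, then convert the column to a numeric dtype
df["rating"] = pd.to_numeric(df["rating"].replace("Not given", np.nan))

print(df["rating"].isna().sum())  # 2 missing values are now visible
print(df["rating"].mean())        # mean over rated orders only -> 4.0
```
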

Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed?¶

In [6]:
# Check the statistical summary of the DataFrame
summary = data.describe()

# Minimum food preparation time
food_prep_min = data['food_preparation_time'].min()

# Average (mean) food preparation time
food_prep_avg = round(data['food_preparation_time'].mean(), 2)

# Maximum food preparation time
food_prep_max = data['food_preparation_time'].max()

# Format the summary using pandas Styler
summary_styled = summary.style.set_table_styles(
    [{'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(**{'text-align': 'center'})

# Display the styled summary
display(summary_styled)

# Create a DataFrame for the food preparation times
prep_times_df = pd.DataFrame({
    'Metric': ['Minimum', 'Average', 'Maximum'],
    'Food Preparation Time (minutes)': [food_prep_min, food_prep_avg, food_prep_max]
})

print("\n")

# Format the preparation times DataFrame using pandas Styler
prep_times_styled = prep_times_df.style.set_table_styles(
    [{'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(**{'text-align': 'center'})

# Display the styled preparation times DataFrame
display(prep_times_styled)
  order_id customer_id cost_of_the_order food_preparation_time delivery_time
count 1898.000000 1898.000000 1898.000000 1898.000000 1898.000000
mean 1477495.500000 171168.478398 16.498851 27.371970 24.161749
std 548.049724 113698.139743 7.483812 4.632481 4.972637
min 1476547.000000 1311.000000 4.470000 20.000000 15.000000
25% 1477021.250000 77787.750000 12.080000 23.000000 20.000000
50% 1477495.500000 128600.000000 14.140000 27.000000 25.000000
75% 1477969.750000 270525.000000 22.297500 31.000000 28.000000
max 1478444.000000 405334.000000 35.410000 35.000000 33.000000

  Metric Food Preparation Time (minutes)
0 Minimum 20.000000
1 Average 27.370000
2 Maximum 35.000000

Observations:¶

Based on the statistical summary of the food_preparation_time column in the DataFrame, we can make the following observations:

  • The minimum time it takes for food to be prepared once an order is placed is 20 minutes.
  • The average (mean) time for food preparation is 27.37 minutes.
  • The maximum time it takes for food to be prepared is 35 minutes.
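The same three statistics can also be obtained in a single call with `Series.agg`; a minimal sketch on a small synthetic series (not the real column):

```python
import pandas as pd

# Synthetic preparation times (minutes) standing in for food_preparation_time
prep = pd.Series([20, 25, 27, 31, 35], name="food_preparation_time")

# min, mean, and max in one aggregation call
stats = prep.agg(['min', 'mean', 'max'])
print(stats)
```
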

Question 5: How many orders are not rated?¶

In [7]:
# Count the number of non-rated orders

# Find the unique values in the rating column
unique = data['rating'].unique()
print(unique)

# Non-rated orders are stored as the string 'Not given'
non_rated_orders = (data['rating'] == 'Not given').sum()

print(f"Number of orders not rated: {non_rated_orders}")
['Not given' '5' '3' '4']
Number of orders not rated: 736

Observations:¶

Upon analyzing the rating column in the DataFrame, it was found that a total of 736 orders (about 38.8% of all orders) are recorded as 'Not given'.

Interestingly, there are no ratings lower than 3.

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration)¶

In [8]:
# Examine the characteristics of each variable

# Analyze the distribution of customer_id

CID_Distribution = data['customer_id'].value_counts()
CID_max = CID_Distribution.max()    # Most frequent customer
CID_min = CID_Distribution.min()    # Least frequent customer(s)
CID_mean = CID_Distribution.mean()  # Average # of visits by customer

# count the number of customers that visited only once and calculate %
CID_once = (CID_Distribution == 1).sum()
CID_once_percent = round((CID_once / CID_Distribution.shape[0]) * 100, 2)

# Plotting the distribution
plt.figure(figsize=(18, 6))

# Histogram of customer visit frequency
plt.subplot(1, 2, 1)
plt.hist(CID_Distribution, bins=range(1, CID_Distribution.max() + 1), edgecolor='black', color='orange')
plt.axvline(CID_mean, color='red', linestyle='dashed', linewidth=2, label=f'avg # of visits: {CID_mean:.2f}')
plt.title('Customer Visit Frequency')
plt.xlabel('Number of Visits')
plt.ylabel('Number of Customers')
plt.legend()

# Boxplot of customer visit frequency
plt.subplot(1, 2, 2)
plt.boxplot(CID_Distribution, vert=False)
plt.title('Customer Visit Frequency')
plt.xlabel('Number of Visits')

plt.tight_layout()
plt.show()
In [9]:
# Analyze the distribution of restaurant_name

unique_restaurants = data['restaurant_name'].nunique()

# Analyze the distribution of orders per restaurant
restaurant_order_distribution = data['restaurant_name'].value_counts()
most_orders = restaurant_order_distribution.max()
least_orders = restaurant_order_distribution.min()
average_orders = restaurant_order_distribution.mean()

# Names of the most and least popular restaurants
most_popular_restaurant = restaurant_order_distribution.idxmax()
least_popular_restaurant = restaurant_order_distribution.idxmin()

# Count the number of restaurants with only 1 order
restaurants_with_one_order = (restaurant_order_distribution == 1).sum()

# Get the top 50 restaurants by number of orders (more than this makes the x-axis labels illegible)
top_50_restaurants = restaurant_order_distribution.head(50)

# Visualize the distribution of orders for the top 50 restaurants
plt.figure(figsize=(18, 6))
top_50_restaurants.plot(kind='bar')
plt.title('Top 50 Restaurants by Number of Orders')
plt.xlabel('Restaurant Name')
plt.ylabel('Number of Orders')

# Rotate x-axis labels for better readability
plt.xticks(rotation=90, fontsize=8)
plt.show()

# Set display options for wider columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 300)
pd.set_option('display.max_colwidth', None)
In [10]:
# Analyze the distribution of cuisine types
cuisine_type_distribution = data['cuisine_type'].value_counts()

# Calculate the most and least popular cuisine types
most_popular_cuisine = cuisine_type_distribution.idxmax()
least_popular_cuisine = cuisine_type_distribution.idxmin()

# Visualize the distribution of cuisine types
plt.figure(figsize=(18, 6))
cuisine_type_distribution.plot(kind='bar')
plt.title('Distribution of Cuisine Types')
plt.xlabel('Cuisine Type')
plt.ylabel('Number of Orders')
plt.xticks(rotation=90, fontsize=8)
plt.show()
In [11]:
# Analyze the distribution of order costs
order_cost_distribution = data['cost_of_the_order']

# Calculate the statistics
max_cost = order_cost_distribution.max()
min_cost = order_cost_distribution.min()
mean_cost = order_cost_distribution.mean()

# Create subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))

# Histogram of order costs with a vertical line for the mean cost
axes[0].hist(order_cost_distribution, bins=30, edgecolor='black')
axes[0].axvline(mean_cost, color='red', linestyle='dashed', linewidth=2, label=f'avg cost of the order: ${mean_cost:.2f}')
axes[0].set_title('Distribution of Order Costs')
axes[0].set_xlabel('Order Cost')
axes[0].set_ylabel('Number of Orders')
axes[0].legend()

# Box plot for the order costs
axes[1].boxplot(order_cost_distribution, vert=False)
axes[1].set_title('Box Plot of Order Costs')
axes[1].set_xlabel('Order Cost')

# Adjust layout
plt.tight_layout()
plt.show()
In [12]:
# Analyze the distribution of orders by the day of the week
day_of_week_distribution = data['day_of_the_week'].value_counts()

# Visualize the distribution of orders by the day of the week
plt.figure(figsize=(18, 6))
day_of_week_distribution.plot(kind='bar')
plt.title('Distribution of Orders by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Orders')
plt.xticks(rotation=0, fontsize=8)
plt.show()
In [13]:
# Analyze the distribution of ratings
rating_distribution = data['rating'].value_counts()

# Visualize the distribution of ratings
plt.figure(figsize=(18, 6))
rating_distribution.plot(kind='bar')
plt.title('Distribution of Customer Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Orders')
plt.xticks(rotation=0, fontsize=8)
plt.show()
In [14]:
# Analyze the distribution of food preparation times
prep_time_distribution = data['food_preparation_time']

# Calculate the statistics
max_prep_time = prep_time_distribution.max()
min_prep_time = prep_time_distribution.min()
mean_prep_time = prep_time_distribution.mean()

# Create subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))

# Histogram of food preparation times with a vertical line for the mean time
axes[0].hist(prep_time_distribution, bins=30, edgecolor='black')
axes[0].axvline(mean_prep_time, color='red', linestyle='dashed', linewidth=2, label=f'avg prep time: {mean_prep_time:.2f} mins')
axes[0].set_title('Distribution of Food Preparation Times')
axes[0].set_xlabel('Preparation Time (minutes)')
axes[0].set_ylabel('Number of Orders')
axes[0].legend()

# Box plot for the food preparation times
axes[1].boxplot(prep_time_distribution, vert=False)
axes[1].set_title('Box Plot of Food Preparation Times')
axes[1].set_xlabel('Preparation Time (minutes)')

# Adjust layout
plt.tight_layout()
plt.show()
In [15]:
# Analyze the distribution of delivery times
delivery_time_distribution = data['delivery_time']

# Calculate the statistics
max_delivery_time = delivery_time_distribution.max()
min_delivery_time = delivery_time_distribution.min()
mean_delivery_time = delivery_time_distribution.mean()

# Create subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))

# Histogram of delivery times with a vertical line for the mean time
axes[0].hist(delivery_time_distribution, bins=30, edgecolor='black')
axes[0].axvline(mean_delivery_time, color='red', linestyle='dashed', linewidth=2, label=f'avg delivery time: {mean_delivery_time:.2f} mins')
axes[0].set_title('Distribution of Delivery Times')
axes[0].set_xlabel('Delivery Time (minutes)')
axes[0].set_ylabel('Number of Orders')
axes[0].legend()

# Box plot for the delivery times
axes[1].boxplot(delivery_time_distribution, vert=False)
axes[1].set_title('Box Plot of Delivery Times')
axes[1].set_xlabel('Delivery Time (minutes)')

# Adjust layout
plt.tight_layout()
plt.show()

Question 7: Which are the top 5 restaurants in terms of the number of orders received?¶

In [16]:
# Get the top 5 restaurants by number of orders
top_5_restaurants = restaurant_order_distribution.head(5)

# Convert the series to a DataFrame for better formatting
top_5_restaurants_df = top_5_restaurants.reset_index()
top_5_restaurants_df.columns = ['Restaurant Name', 'Number of Orders']

# Format the table using pandas Styler
top_5_restaurants_styled = top_5_restaurants_df.style.set_caption("Top 5 Restaurants by Number of Orders<br><br>").set_table_styles(
    [{'selector': 'caption', 'props': [('caption-side', 'top'), ('font-size', '16px'), ('font-weight', 'bold'), ('text-align', 'center')]},
     {'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(subset=['Restaurant Name'], **{'text-align': 'left'}).set_properties(subset=['Number of Orders'], **{'text-align': 'center'}).hide(axis="index")

# Display the styled table
top_5_restaurants_styled
Out[16]:
Top 5 Restaurants by Number of Orders

Restaurant Name Number of Orders
Shake Shack 219
The Meatball Shop 132
Blue Ribbon Sushi 119
Blue Ribbon Fried Chicken 96
Parm 68

Observations:¶

Dominance of Shake Shack:

  • Shake Shack leads significantly with 219 orders, indicating its high popularity among customers.

Significant Order Volumes:

  • The Meatball Shop and Blue Ribbon Sushi are also quite popular, with 132 and 119 orders respectively, but they trail behind Shake Shack by a notable margin.

Diverse Cuisine Types:

  • The top 5 restaurants represent a variety of cuisines, including American fast food (Shake Shack), Italian (The Meatball Shop, Parm), and Japanese (Blue Ribbon Sushi).

  • This variety reflects a broad range of customer preferences.

Distribution Drop-off:

  • There is a noticeable drop in the number of orders from the third to the fifth restaurant, showing a significant difference in popularity among the top restaurants.

Question 8: Which is the most popular cuisine on weekends?¶

In [17]:
# Filter the data for weekends
weekend_data = data[data['day_of_the_week'] == 'Weekend']

# Analyze the distribution of cuisine types on weekends
weekend_cuisine_distribution = weekend_data['cuisine_type'].value_counts()

# Get the most popular cuisine on weekends
most_popular_cuisine_weekend = weekend_cuisine_distribution.idxmax()
most_popular_cuisine_count = weekend_cuisine_distribution.max()

# Get the second most popular cuisine on weekends (for comparison)
second_most_popular_cuisine_weekend = weekend_cuisine_distribution.index[1]
second_most_popular_cuisine_count = weekend_cuisine_distribution.iloc[1]

# Create a DataFrame for the most popular weekend cuisine
most_popular_cuisine_df = pd.DataFrame({
    'Cuisine Type': [most_popular_cuisine_weekend],
    'Number of Orders': [most_popular_cuisine_count]
})

# Format the table using pandas Styler
most_popular_cuisine_styled = most_popular_cuisine_df.style.set_caption("Most Popular Cuisine on Weekends<br><br>").set_table_styles(
    [{'selector': 'caption', 'props': [('caption-side', 'top'), ('font-size', '16px'), ('font-weight', 'bold'), ('text-align', 'center')]},
     {'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(subset=['Cuisine Type'], **{'text-align': 'left'}).set_properties(subset=['Number of Orders'], **{'text-align': 'center'}).hide(axis="index")

# Display the styled table
most_popular_cuisine_styled
Out[17]:
Most Popular Cuisine on Weekends

Cuisine Type Number of Orders
American 415

Observations:¶

Most Popular Cuisine:

The most popular cuisine on weekends is American, with 415 orders. This indicates a clear preference for American cuisine, which leads the second most popular cuisine, Japanese (335 orders), by a comfortable margin.

High Volume of Orders:

The number of orders for the most popular cuisine on weekends is significantly high, showing that weekends are a busy time for American cuisine.

Customer Preferences:

The popularity of American cuisine on weekends might be due to the types of meals associated with this cuisine, such as burgers, fries, and other fast food items, which are often considered comfort or leisure food.

Question 9: What percentage of the orders cost more than 20 dollars?¶

In [18]:
# Calculate the percentage of orders that cost more than 20 dollars
total_orders = data.shape[0]
orders_above_20 = data[data['cost_of_the_order'] > 20].shape[0]
percentage_above_20 = (orders_above_20 / total_orders) * 100

# Print the result
print(f"Percentage of orders that cost more than 20 dollars: {percentage_above_20:.2f}%")
Percentage of orders that cost more than 20 dollars: 29.24%

Observations:¶

Roughly 29% of orders (555 of 1898) cost more than $20, so higher-value orders make up a sizable minority of the business. This is useful context for pricing and promotion decisions, such as setting minimum-order or free-delivery thresholds.

Question 10: What is the mean order delivery time?¶

In [19]:
# Previously calculated mean_delivery_time
print("The mean order delivery time is %.2f minutes." % mean_delivery_time)
The mean order delivery time is 24.16 minutes.

Observations:¶

The analysis reveals that the mean delivery time for orders is approximately 24.16 minutes.

This information is useful for understanding operational efficiency, setting customer expectations, and identifying areas for potential improvement in the delivery process.

Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed¶

In [20]:
# Top three customers using the previously calculated Customer ID Distribution
top_3_customers = CID_Distribution.head(3).reset_index()
top_3_customers.columns = ['Customer ID', 'Number of Orders']

# Format the table using pandas Styler
top_3_customers_styled = top_3_customers.style.set_caption("Top 3 Customers by Number of Orders<br><br>").set_table_styles(
    [{'selector': 'caption', 'props': [('caption-side', 'top'), ('font-size', '16px'), ('font-weight', 'bold'), ('text-align', 'center')]},
     {'selector': 'thead th', 'props': [('background-color', '#f7f7f9'), ('color', '#333'), ('font-weight', 'bold')]}]
).set_properties(subset=['Customer ID'], **{'text-align': 'left'}).set_properties(subset=['Number of Orders'], **{'text-align': 'center'}).hide(axis="index")

# Display the styled table
top_3_customers_styled
Out[20]:
Top 3 Customers by Number of Orders

Customer ID Number of Orders
52832 13
47440 10
83287 9

Observations:¶

The top 3 most frequent customers are 52832 (13 orders), 47440 (10 orders), and 83287 (9 orders); these are the customers who qualify for the 20% discount vouchers. Even the most frequent customer placed only 13 orders, which is consistent with the earlier finding that most customers order only once or twice.

Multivariate Analysis¶

Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables)¶

In [21]:
# Correlation matrix of numerical variables

# Select only numerical columns
numerical_data = data.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix
corr_matrix = numerical_data.corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='vlag')
plt.title('Correlation Matrix of Numerical Variables')
plt.show()

Observations¶

Relationship Between Food Preparation Time and Delivery Time:

The correlation between food_preparation_time and delivery_time is close to zero, indicating that longer food preparation times are not systematically associated with longer delivery times; the two stages of fulfillment appear largely independent.

Cost of Order and Delivery Time:

The correlation between cost_of_the_order and delivery_time is likewise negligible (the Pearson score computed later in this notebook is about -0.03), so higher-cost orders do not take noticeably longer to deliver.

Cost of Order and Food Preparation Time:

The correlation between cost_of_the_order and food_preparation_time is also weak, so order cost is not a useful predictor of preparation time either.

Summary:

  • None of the numerical variables are strongly correlated with one another.
  • Delivery and preparation times appear to be driven by factors other than order cost.

Customer Behavior Analysis¶

  • Customer Order Frequency

    • Frequency of orders per customer
    • Average order value per customer
  • Repeat Customer Analysis

    • Percentage of repeat customers
    • Average rating and order cost for repeat customers vs one-time customers

Frequency of orders per customer¶

In [22]:
# Calculate the frequency of orders per customer
order_frequency = data['customer_id'].value_counts().reset_index()
order_frequency.columns = ['customer_id', 'order_count']

# Style the table using pandas Styler
styled_table = order_frequency.style.background_gradient(cmap='viridis').set_caption("Frequency of Orders per Customer")

# Display the styled table
styled_table

# Plotting the frequency of orders per customer
plt.figure(figsize=(10, 6))
plt.hist(order_frequency['order_count'], bins=30, edgecolor='black')
plt.title('Frequency of Orders per Customer')
plt.xlabel('Number of Orders')
plt.ylabel('Number of Customers')
plt.show()

# Pearson correlation between 'cost_of_the_order' and 'delivery_time'
correlation_score = data['cost_of_the_order'].corr(data['delivery_time'])
print(f"Pearson Correlation Score between 'cost_of_the_order' and 'delivery_time': {correlation_score:.3f}")
Pearson Correlation Score between 'cost_of_the_order' and 'delivery_time': -0.030

Observations¶

Frequency of Orders per Customer

  • Distribution of Orders:

    • The histogram indicates that the majority of customers place a relatively low number of orders, while a few customers place orders more frequently.
    • This suggests that a small number of highly engaged customers drive a significant portion of sales, which is a common trend in many businesses.

Pearson Correlation

  • Cost of the Order vs. Delivery Time:

    • The Pearson correlation score between cost_of_the_order and delivery_time is approximately -0.030.

    • This very weak negative correlation indicates that there is no significant linear relationship between the cost of an order and its delivery time in this dataset.

    • This implies that delivery times are relatively consistent regardless of the order cost, suggesting other factors might be influencing delivery time.
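scipy.stats.pearsonr, imported at the top of the notebook but not used here, additionally returns a p-value that indicates whether such a weak coefficient is distinguishable from noise. A minimal sketch on independent synthetic data (the variables below are stand-ins, not the real columns):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# Synthetic stand-ins for order cost and delivery time, drawn independently
cost = rng.uniform(5, 35, size=500)
delivery = rng.uniform(15, 33, size=500)

# pearsonr returns both the correlation coefficient and a two-sided p-value
r, p = pearsonr(cost, delivery)
print(f"r = {r:.3f}, p = {p:.3f}")  # r near 0, large p: no evidence of linear association
```

A large p-value here means the near-zero coefficient is consistent with no linear relationship, which mirrors the -0.030 score observed on the actual data.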

Average order value per customer¶

In [23]:
# Calculate the average order value per customer
average_order_value = data.groupby('customer_id')['cost_of_the_order'].mean().reset_index()
average_order_value.columns = ['customer_id', 'average_order_value']

# Style the table using pandas Styler
styled_average_order_value = average_order_value.style.background_gradient(cmap='viridis').set_caption("Average Order Value per Customer")

# Display the styled table
styled_average_order_value

# Plotting the average order value per customer
plt.figure(figsize=(10, 6))
plt.hist(average_order_value['average_order_value'], bins=30, edgecolor='black')
plt.title('Average Order Value per Customer')
plt.xlabel('Average Order Value')
plt.ylabel('Number of Customers')
plt.show()

Observations¶

Average Order Value per Customer

  • Distribution of Average Order Values:

    • The histogram shows the distribution of average order values per customer. Most customers have a lower average order value, with fewer customers having higher average order values.

    • This trend indicates that a majority of customers tend to place orders of smaller amounts, while a smaller group of customers contributes to higher-value orders.

  • Customer Spending Patterns:

    • Identifying the customers with higher average order values can help in tailoring marketing strategies to retain and engage these high-value customers.

    • Conversely, understanding why most customers place smaller orders might reveal opportunities for upselling or bundling products to increase the average order value.

Percentage of repeat customers¶

In [24]:
# Calculate the percentage of repeat customers
total_customers = data['customer_id'].nunique()
repeat_customers = (data['customer_id'].value_counts() > 1).sum()
percentage_repeat_customers = (repeat_customers / total_customers) * 100

# Display the percentage of repeat customers
print(f"Percentage of Repeat Customers: {percentage_repeat_customers:.2f}%")

# Plotting the percentage of repeat customers
labels = ['Repeat Customers', 'One-time Customers']
sizes = [percentage_repeat_customers, 100 - percentage_repeat_customers]
colors = ['#ff9999','#66b3ff']
explode = (0.1, 0)  # explode the first slice

plt.figure(figsize=(8, 8))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Percentage of Repeat Customers')
plt.show()
Percentage of Repeat Customers: 34.67%

Observations¶

  • Repeat Customer Rate:

    • Approximately one-third of the customers are repeat customers. This indicates a good level of customer retention and loyalty.

Average rating and order cost for repeat customers vs one-time customers¶

In [25]:
# Identify repeat and one-time customers
customer_order_counts = data['customer_id'].value_counts()
repeat_customers_ids = customer_order_counts[customer_order_counts > 1].index
one_time_customers_ids = customer_order_counts[customer_order_counts == 1].index

# Separate data for repeat and one-time customers
repeat_customers_data = data[data['customer_id'].isin(repeat_customers_ids)]
one_time_customers_data = data[data['customer_id'].isin(one_time_customers_ids)]

# Calculate average rating and order cost for repeat customers
repeat_customers_avg_rating = repeat_customers_data[repeat_customers_data['rating'] != 'Not given']['rating'].astype(float).mean()
repeat_customers_avg_order_cost = repeat_customers_data['cost_of_the_order'].mean()

# Calculate average rating and order cost for one-time customers
one_time_customers_avg_rating = one_time_customers_data[one_time_customers_data['rating'] != 'Not given']['rating'].astype(float).mean()
one_time_customers_avg_order_cost = one_time_customers_data['cost_of_the_order'].mean()

# Display the results
print(f"Average Rating for Repeat Customers: {repeat_customers_avg_rating:.2f}")
print(f"Average Order Cost for Repeat Customers: ${repeat_customers_avg_order_cost:.2f}")
print(f"Average Rating for One-time Customers: {one_time_customers_avg_rating:.2f}")
print(f"Average Order Cost for One-time Customers: ${one_time_customers_avg_order_cost:.2f}")

# Plotting the results
categories = ['Repeat Customers', 'One-time Customers']
avg_ratings = [repeat_customers_avg_rating, one_time_customers_avg_rating]
avg_order_costs = [repeat_customers_avg_order_cost, one_time_customers_avg_order_cost]

fig, ax1 = plt.subplots(figsize=(10, 6))

color = 'tab:blue'
ax1.set_xlabel('Customer Type')
ax1.set_ylabel('Average Rating', color=color)
ax1.bar(categories, avg_ratings, color=color, alpha=0.6, label='Average Rating')
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()
color = 'tab:green'
ax2.set_ylabel('Average Order Cost ($)', color=color)
ax2.plot(categories, avg_order_costs, color=color, marker='o', linestyle='-', linewidth=2, markersize=8, label='Average Order Cost')
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()
plt.title('Average Rating and Order Cost for Repeat Customers vs One-time Customers')
fig.legend(loc='upper right', bbox_to_anchor=(0.9,0.9))
plt.show()
Average Rating for Repeat Customers: 4.34
Average Order Cost for Repeat Customers: $16.49
Average Rating for One-time Customers: 4.35
Average Order Cost for One-time Customers: $16.52

Observations¶

Average Rating:

  • The average rating for both repeat and one-time customers is very close, around 4.34 and 4.35 respectively, indicating similar satisfaction levels.

Average Order Cost:

  • The average order cost for repeat customers is $16.49 and for one-time customers is $16.52. This suggests that there isn't a significant difference in spending behavior between the two groups.
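
The claim of "no significant difference" could be checked formally with a two-sample test. A minimal sketch using Welch's t-test on synthetic stand-ins for the two groups (the sample sizes and spread here are illustrative assumptions, not the dataset's actual values):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical samples centered near the observed means ($16.49 vs $16.52)
repeat_costs = rng.normal(16.49, 7.0, 500)
one_time_costs = rng.normal(16.52, 7.0, 800)

# Welch's t-test (unequal variances); a large p-value means
# no evidence of a difference between the group means
t_stat, p_value = ttest_ind(repeat_costs, one_time_costs, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

Running the same test on `repeat_customers_data['cost_of_the_order']` and `one_time_customers_data['cost_of_the_order']` would quantify whether the $0.03 gap is distinguishable from noise.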

Cost Analysis¶

  • Order Cost by Cuisine Type

  • Avg Cost by Day of the Week and Cuisine Type

  • Order Frequency by Weekday/Weekend and Cuisine Type

  • Order Cost by Preparation Time, Delivery Time and Total Time

  • Order Frequency by Order Cost

Order Cost by Cuisine Type¶

In [52]:
# Box Plot of Order Cost by Cuisine Type

plt.figure(figsize=(12, 6))
sns.boxplot(x='cuisine_type', y='cost_of_the_order', data=data)
plt.title('Box Plot of Order Cost by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Order Cost ($)')
plt.xticks(rotation=90)
plt.show()

# Encode the 'cuisine_type' column
data['cuisine_type_encoded'] = data['cuisine_type'].astype('category').cat.codes

# Calculate the Pearson correlation coefficient between encoded cuisine type and order cost
correlation_cost_cuisine = data['cuisine_type_encoded'].corr(data['cost_of_the_order'])

# Print the Pearson correlation coefficient
print(f"Pearson correlation coefficient between encoded cuisine type and order cost: {correlation_cost_cuisine:.2f}")
Pearson correlation coefficient between encoded cuisine type and order cost: 0.04

Observations¶

The very weak correlation between cuisine type and order cost indicates that, while there may be some variation in costs among different cuisines, the overall impact of cuisine type on order cost is minimal. Other factors are likely more significant in determining the cost of an order.
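
One caveat worth flagging: `.cat.codes` assigns integer codes in alphabetical order, so a Pearson coefficient computed on them depends on an arbitrary label ordering. A toy sketch shows that relabeling the same categories changes the coefficient:

```python
import pandas as pd

toy = pd.DataFrame({
    'cuisine': ['A', 'A', 'B', 'B', 'C', 'C'],
    'cost':    [10,  12,  30,  32,  20,  22],
})

codes_1 = toy['cuisine'].astype('category').cat.codes      # A=0, B=1, C=2
codes_2 = toy['cuisine'].map({'A': 0, 'C': 1, 'B': 2})     # swap B and C

r1 = codes_1.corr(toy['cost'])
r2 = codes_2.corr(toy['cost'])
print(r1, r2)  # different coefficients for the same underlying data
```

For a nominal variable like cuisine type, per-group means (as in the box plot above) or an ANOVA-style comparison describe the relationship more faithfully than Pearson correlation on encoded labels.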

Average Cost by Day of the Week and Cuisine Type¶

In [53]:
# Create a pivot table for average order cost by day of the week and cuisine type
pivot_table = data.pivot_table(values='cost_of_the_order', index='day_of_the_week', columns='cuisine_type', aggfunc='mean')

# Visualize the pivot table as a heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(pivot_table, annot=True, cmap='Reds')
plt.title('Average Order Cost by Day of the Week and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Day of the Week')
plt.show()

# Manually encode the 'cuisine_type' column
data['cuisine_type_encoded'] = data['cuisine_type'].astype('category').cat.codes

# Calculate the Pearson correlation coefficient between encoded cuisine type and order cost
correlation = data['cuisine_type_encoded'].corr(data['cost_of_the_order'])

# Print the result
print(f"Pearson correlation coefficient between encoded cuisine type and order cost: {correlation:.2f}")
Pearson correlation coefficient between encoded cuisine type and order cost: 0.04

Observations¶

Summary:

  • Pearson Correlation Coefficient: 0.04
  • Strength of Correlation: Negligible (0.00 to 0.10: Negligible correlation)

This means that there is virtually no meaningful relationship between cuisine type and order cost, whether on weekdays or weekends; the type of cuisine has minimal impact on the cost of the order on either.

Order Frequency by Weekday/Weekend and Cuisine Type¶

In [54]:
# Create a pivot table for order frequency by Weekday and Weekend and cuisine type
order_frequency = data.pivot_table(values='order_id', index='day_of_the_week', columns='cuisine_type', aggfunc='count')

# Visualize the heatmap with adjusted plot width to avoid using exponents
plt.figure(figsize=(14, 8))
sns.heatmap(order_frequency, annot=True, fmt='d', cmap='Reds')
plt.title('Heatmap of Order Frequency by Weekday/Weekend and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Weekday/Weekend')
plt.show()

Observations¶

Higher Order Frequencies on Weekends:

  • The heatmap shows that the order frequencies are generally higher on Weekends compared to Weekdays across various cuisine types.

  • For example, American, Chinese, Italian and Japanese cuisines have appreciably higher order frequencies on Weekends.

Popular Cuisine Types:

  • American and Japanese cuisines have the highest order frequencies on both Weekdays and Weekends.

  • Chinese and Italian cuisines also show high order frequencies, especially on Weekends.

Consistent Popularity:

  • American, Italian and Japanese cuisines maintain relatively high order frequencies throughout both Weekdays and Weekends, indicating steady demand for these cuisines.

Less Popular Cuisines:

  • Cuisines like French, Southern, Thai, and Vietnamese have lower order frequencies across both Weekdays and Weekends.

  • These cuisines might have a niche customer base or less overall demand.

Weekday vs Weekend Trends:

  • There is a noticeable increase in order frequencies for many cuisines on Weekends compared to Weekdays. This indicates a higher demand for food delivery services during weekends.

Order Cost by Preparation Time, Delivery Time and Total Time¶

In [55]:
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']

# Plotting Order Cost by Preparation Time
plt.figure(figsize=(10, 6))
plt.scatter(data['cost_of_the_order'], data['food_preparation_time'], alpha=0.6, edgecolors='w', linewidth=0.5)
plt.title('Order Cost by Preparation Time')
plt.xlabel('Order Cost ($)')
plt.ylabel('Preparation Time (minutes)')
plt.grid(True)
plt.show()

# Plotting Order Cost by Delivery Time
plt.figure(figsize=(10, 6))
plt.scatter(data['cost_of_the_order'], data['delivery_time'], alpha=0.6, edgecolors='w', linewidth=0.5)
plt.title('Order Cost by Delivery Time')
plt.xlabel('Order Cost ($)')
plt.ylabel('Delivery Time (minutes)')
plt.grid(True)
plt.show()

# Plotting Order Cost by Total Time
plt.figure(figsize=(10, 6))
plt.scatter(data['cost_of_the_order'], data['total_time'], alpha=0.6, edgecolors='w', linewidth=0.5)
plt.title('Order Cost by Total Time')
plt.xlabel('Order Cost ($)')
plt.ylabel('Total Time (minutes)')
plt.grid(True)
plt.show()

# Calculate Pearson correlation scores
pearson_cost_prep_time = data['cost_of_the_order'].corr(data['food_preparation_time'])
pearson_cost_delivery_time = data['cost_of_the_order'].corr(data['delivery_time'])
pearson_cost_total_time = data['cost_of_the_order'].corr(data['total_time'])

print(f"Pearson Correlation Score between 'cost_of_the_order' and 'food_preparation_time': {pearson_cost_prep_time:.3f}")
print(f"Pearson Correlation Score between 'cost_of_the_order' and 'delivery_time': {pearson_cost_delivery_time:.3f}")
print(f"Pearson Correlation Score between 'cost_of_the_order' and 'total_time': {pearson_cost_total_time:.3f}")
Pearson Correlation Score between 'cost_of_the_order' and 'food_preparation_time': 0.042
Pearson Correlation Score between 'cost_of_the_order' and 'delivery_time': -0.030
Pearson Correlation Score between 'cost_of_the_order' and 'total_time': 0.006

Observations¶

Pearson Correlation Scores

  • Order Cost and Preparation Time:

    • Pearson correlation score: 0.042

    • Interpretation: Very weak positive correlation, indicating almost no linear relationship between the cost of an order and the time it takes to prepare.

  • Order Cost and Delivery Time:

    • Pearson correlation score: -0.030

    • Interpretation: Very weak negative correlation, suggesting almost no linear relationship between the cost of an order and its delivery time.

  • Order Cost and Total Time:

    • Pearson correlation score: 0.006

    • Interpretation: Very weak positive correlation, indicating almost no linear relationship between the cost of an order and the total time (preparation + delivery).

Scatter Plots

  • Order Cost by Preparation Time:

    • The scatter plot shows a dispersed pattern with no clear trend, indicating that both low-cost and high-cost orders can have a wide range of preparation times.

  • Order Cost by Delivery Time:

    • The scatter plot again shows a dispersed pattern with no clear trend, suggesting that delivery times are not significantly influenced by the cost of the order.

  • Order Cost by Total Time:

    • The scatter plot shows a dispersed pattern with no clear trend, indicating that the total time (preparation + delivery) does not significantly correlate with the cost of the order.

Order Frequency by Order Cost¶

In [56]:
from scipy.stats import linregress

# Calculate the frequency of orders per customer
order_frequency = data['customer_id'].value_counts().reset_index()
order_frequency.columns = ['customer_id', 'order_count']

# Merge the order frequency with the original dataset to get order costs
merged_data = pd.merge(data, order_frequency, on='customer_id')

# Calculate the average order cost for different order frequencies
avg_cost_by_frequency = merged_data.groupby('order_count')['cost_of_the_order'].mean().reset_index()
avg_cost_by_frequency.columns = ['order_count', 'average_order_cost']

# Style the table using pandas Styler
styled_avg_cost_by_frequency = avg_cost_by_frequency.style.background_gradient(cmap='viridis').set_caption("Average Order Cost by Order Frequency")

# Display the styled table
styled_avg_cost_by_frequency

# Plotting the average order cost by order frequency with trendline
plt.figure(figsize=(10, 6))
plt.plot(avg_cost_by_frequency['order_count'], avg_cost_by_frequency['average_order_cost'], marker='o', linestyle='-', linewidth=2, markersize=8, label='Average Order Cost')

# Add a trendline
slope, intercept, r_value, p_value, std_err = linregress(avg_cost_by_frequency['order_count'], avg_cost_by_frequency['average_order_cost'])
plt.plot(avg_cost_by_frequency['order_count'], intercept + slope * avg_cost_by_frequency['order_count'], 'r', label=f'Trendline (slope={slope:.2f}, p-value={p_value:.2f})')

plt.title('Average Order Cost by Order Frequency')
plt.xlabel('Order Frequency')
plt.ylabel('Average Order Cost ($)')
plt.legend()
plt.grid(True)
plt.show()

# Display the results
#print(avg_cost_by_frequency)

# Pearson correlation for detailed analysis
correlation_score = avg_cost_by_frequency['order_count'].corr(avg_cost_by_frequency['average_order_cost'])
print(f"Pearson Correlation Score between 'order_count' and 'average_order_cost': {correlation_score:.3f}")
Pearson Correlation Score between 'order_count' and 'average_order_cost': 0.186

Observations¶

Pearson Correlation:

  • The Pearson correlation score between order_count and average_order_cost is 0.186. This indicates a weak positive correlation, suggesting a slight tendency for the average order cost to increase as order frequency increases.

Trendline:

  • The trendline in the plot, along with its slope and p-value, indicates whether there is a significant trend in the data.

Data Points:

  • Customers who place orders infrequently (1-4 times) have relatively stable average order costs, ranging from around $16 to $17.

  • Customers with higher order frequencies (5-13) show more variation in average order costs, with a notable spike for those placing 8 orders ($22.98).
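
As an aside, the `value_counts()`-plus-`merge` step above can be collapsed into a single `transform` call. A minimal sketch on toy data (the column names mirror the dataset, the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'customer_id':       [1, 1, 2, 3, 3, 3],
    'cost_of_the_order': [10.0, 14.0, 20.0, 9.0, 11.0, 13.0],
})

# Attach each customer's order count directly, without a separate merge
toy['order_count'] = toy.groupby('customer_id')['customer_id'].transform('size')

avg_cost_by_frequency = (toy.groupby('order_count')['cost_of_the_order']
                            .mean().reset_index())
print(avg_cost_by_frequency)
```

`transform('size')` broadcasts the per-group count back onto every row, so the result is already aligned with the original frame.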

Day of the Week Analysis¶

  • Preparation Time, Delivery Time, Total Time by Day of the week

  • Order Frequency by Day of the Week

Preparation Time, Delivery Time, Total Time by Day of the week¶

In [31]:
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']

# Group by day of the week and calculate average times
avg_times_by_day = data.groupby('day_of_the_week').agg({
    'food_preparation_time': 'mean',
    'delivery_time': 'mean',
    'total_time': 'mean'
}).reset_index()

# Style the table using pandas Styler
styled_avg_times_by_day = avg_times_by_day.style.background_gradient(cmap='viridis').set_caption("Average Times by Day of the Week")

# Display the styled table
styled_avg_times_by_day

# Define the x-axis labels with weekday/weekend context
x_labels = ['Weekday', 'Weekend']

# Plotting the average times by day of the week side by side
fig, ax = plt.subplots(1, 3, figsize=(18, 6), sharey=True)

# Preparation Time
ax[0].bar(avg_times_by_day['day_of_the_week'], avg_times_by_day['food_preparation_time'], color='skyblue')
ax[0].set_title('Average Preparation Time')
#ax[0].set_xlabel('Day of the Week')
ax[0].set_ylabel('Time (minutes)')
ax[0].grid(True)
ax[0].set_xticks(range(len(avg_times_by_day['day_of_the_week'])))
ax[0].set_xticklabels(x_labels)

# Delivery Time
ax[1].bar(avg_times_by_day['day_of_the_week'], avg_times_by_day['delivery_time'], color='lightgreen')
ax[1].set_title('Average Delivery Time')
ax[1].set_xlabel('Day of the Week')
ax[1].grid(True)
ax[1].set_xticks(range(len(avg_times_by_day['day_of_the_week'])))
ax[1].set_xticklabels(x_labels)

# Total Time
ax[2].bar(avg_times_by_day['day_of_the_week'], avg_times_by_day['total_time'], color='salmon')
ax[2].set_title('Average Total Time')
#ax[2].set_xlabel('Day of the Week')
ax[2].grid(True)
ax[2].set_xticks(range(len(avg_times_by_day['day_of_the_week'])))
ax[2].set_xticklabels(x_labels)

plt.tight_layout()
plt.show()

Observations¶

Average Preparation Time:

  • Preparation times are consistent across both weekdays and weekends, with slight variations.

  • Average preparation time ranges from approximately 23 to 25 minutes.

Average Delivery Time:

  • Delivery times show some variation across both weekdays and weekends.

  • Average delivery time ranges from about 22 to 24 minutes.

Average Total Time:

  • Total time, being the sum of preparation and delivery times, follows a similar pattern.

  • Average total time varies from around 46 to 49 minutes.

Order Frequency by Day of the Week¶

In [32]:
# Calculate the order frequency by day of the week
order_frequency_by_day = data['day_of_the_week'].value_counts().reset_index()
order_frequency_by_day.columns = ['day_of_the_week', 'order_count']

# Style the table using pandas Styler
styled_order_frequency_by_day = order_frequency_by_day.style.background_gradient(cmap='viridis').set_caption("Order Frequency by Day of the Week")

# Display the styled table
styled_order_frequency_by_day

# Plotting the order frequency by day of the week
# Note: value_counts() sorts by count, so use its own category labels rather than
# hard-coded ['Weekday', 'Weekend'] tick labels, which would mislabel the bars
plt.figure(figsize=(10, 6))
plt.bar(order_frequency_by_day['day_of_the_week'], order_frequency_by_day['order_count'], color='lightblue')
plt.title('Order Frequency by Day of the Week')
plt.xlabel('Day of the Week (Weekday/Weekend)')
plt.ylabel('Order Count')
plt.grid(True)
plt.show()

Observations¶

Order Frequency on Weekdays vs. Weekends:

Weekends:

  • Orders placed on weekends have a noticeably higher frequency than orders placed on weekdays, consistent with the cuisine-type heatmap above.

Weekdays:

  • Weekday orders are less frequent in total, even though weekdays span five days versus the weekend's two.

Data Distribution:

  • Because the weekend total accumulates over only two days, the per-day demand on weekends is markedly higher than on weekdays; normalizing the counts to orders per day makes the comparison more direct.
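
The five-weekday vs. two-weekend-day imbalance can be handled by normalizing the totals to orders per day. A quick sketch with hypothetical totals (substitute the real `value_counts()` output for the actual dataset):

```python
import pandas as pd

# Hypothetical order totals for illustration only
order_counts = pd.Series({'Weekend': 1300, 'Weekday': 600})
days_in_period = pd.Series({'Weekend': 2, 'Weekday': 5})

# Per-day rate; Series division aligns on the index labels
orders_per_day = order_counts / days_in_period
print(orders_per_day)
```

With these illustrative numbers, a weekend day carries several times the order volume of a weekday, even though raw weekly totals look closer.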

Ratings Analysis¶

  • Ratings by Cuisine Type

  • Rating Distribution by Order Cost

  • Rating Distribution by Total Wait Time (Prep Time + Delivery Time)

  • Ratings by day of the week

  • Order Frequency vs Ratings

Ratings by Cuisine Type¶

In [33]:
# Plot of Customer Ratings by Cuisine Type

# Data cleaning: convert 'Not given' ratings to NaN so they are excluded from averages
# (assigning a dropna()'d Series back to the column relies on index alignment;
# converting in place is clearer and equivalent)
data['rating'] = pd.to_numeric(data['rating'].replace('Not given', pd.NA))

# Calculate the average ratings by cuisine type
average_ratings = data.groupby('cuisine_type')['rating'].mean().reset_index()

# Count the 'Not given' ratings by cuisine type
not_given_ratings = data['rating'].replace('Not given', pd.NA).isna().groupby(data['cuisine_type']).sum().reset_index(name='Not_given_count')

# Merge both dataframes
combined_data = pd.merge(average_ratings, not_given_ratings, on='cuisine_type')

# Plot the combined data
fig, ax1 = plt.subplots(figsize=(14, 8))

# Plot average ratings
ax1.set_xlabel('Cuisine Type')
ax1.set_ylabel('Average Rating', color='tab:blue')
ax1.bar(combined_data['cuisine_type'], combined_data['rating'], color='skyblue', label='Average Rating')
ax1.tick_params(axis='y', labelcolor='tab:blue')
ax1.set_xticks(range(len(combined_data)))
ax1.set_xticklabels(combined_data['cuisine_type'], rotation=45, ha='right')

# Create a secondary y-axis to plot 'Not given' counts
ax2 = ax1.twinx()
ax2.set_ylabel('Count of "Not given" Ratings', color='tab:red')
ax2.plot(combined_data['cuisine_type'], combined_data['Not_given_count'], color='salmon', marker='o', label='Not Given Count')
ax2.tick_params(axis='y', labelcolor='tab:red')

# Title and layout
plt.title('Average Customer Ratings and Count of "Not given" Ratings by Cuisine Type')
fig.tight_layout()
plt.show()

# Display the combined data table summary using pandas Styler
combined_data_styled = combined_data.style.set_caption("Summary of Average Ratings and Count of 'Not given' Ratings by Cuisine Type")
combined_data_styled

# Encode the 'cuisine_type' column using pandas' factorize
data['cuisine_type_encoded'] = pd.factorize(data['cuisine_type'])[0]

# Ensure no NaNs or infinite values in the columns
cleaned_data = data.dropna(subset=['cuisine_type_encoded', 'rating'])

# Calculate the Pearson correlation coefficient using scipy
pearson_corr_cuisine_rating, _ = pearsonr(cleaned_data['cuisine_type_encoded'], cleaned_data['rating'])

# Print the Pearson correlation coefficient rounded to two decimal places
print(f"The Pearson correlation coefficient between cuisine type and rating is {pearson_corr_cuisine_rating:.2f}")

# Calculate the average ratings by cuisine type
average_ratings = cleaned_data.groupby('cuisine_type')['rating'].mean().reset_index()

# Count the 'Not given' ratings by cuisine type
not_given_ratings = data['rating'].replace('Not given', pd.NA).isna().groupby(data['cuisine_type']).sum().reset_index(name='Not_given_count')

# Merge both dataframes
combined_data = pd.merge(average_ratings, not_given_ratings, on='cuisine_type')

# Sort the combined data by average rating in descending order
combined_data_sorted = combined_data.sort_values(by='rating', ascending=False)

# Display the sorted combined data table summary using pandas Styler
combined_data_sorted_styled = combined_data_sorted.style.set_caption("Summary of Average Ratings and Count of 'Not given' Ratings by Cuisine Type (Sorted by Rating)")
combined_data_sorted_styled
The Pearson correlation coefficient between cuisine type and rating is -0.01
Out[33]:
Summary of Average Ratings and Count of 'Not given' Ratings by Cuisine Type (Sorted by Rating)
  cuisine_type rating Not_given_count
11 Spanish 4.833333 6
12 Thai 4.666667 10
3 Indian 4.540000 23
8 Mexican 4.416667 29
5 Japanese 4.373626 197
4 Italian 4.360465 126
1 Chinese 4.338346 82
10 Southern 4.307692 4
2 French 4.300000 8
0 American 4.298913 216
9 Middle Eastern 4.235294 15
7 Mediterranean 4.218750 14
6 Korean 4.111111 4
13 Vietnamese 4.000000 2

Observations¶

  • Highest Average Rating: Spanish (4.83)
  • Lowest Average Rating: Vietnamese (4.00)
  • Most "Not given" Ratings: American (216)
  • Least "Not given" Ratings: Vietnamese (2)

The Pearson correlation coefficient between the encoded cuisine type and rating is -0.01, indicating a very weak negative correlation, suggesting that the type of cuisine has virtually no linear relationship with the ratings given by customers.

Rating Distribution by Order Cost¶

In [61]:
# Rating Distribution by Order Cost

# Original unreadable plot
#plt.figure(figsize=(12, 6))
#sns.boxplot(x='cost_of_the_order', y='rating', data=data)
#plt.title('Box Plot of Customer Ratings by Order Cost')
#plt.xlabel('Order Cost ($)')
#plt.ylabel('Customer Rating')
#plt.xticks(rotation=90)
#plt.show()

# To make this readable we will first need to bin the data and then create the plot

# Bin the order costs into categories
bin_labels = ['< $10', '$10 - $20', '$20 - $30', '$30 - $40', '$40 - $50', '> $50']
clean_data = data.copy()
clean_data['cost_bin'] = pd.cut(clean_data['cost_of_the_order'], bins=[0, 10, 20, 30, 40, 50, float('inf')], labels=bin_labels)

# Calculate the mean and median for each bin
mean_ratings = clean_data.groupby('cost_bin')['rating'].mean()
median_ratings = clean_data.groupby('cost_bin')['rating'].median()

# Plotting a box plot with binned order costs
plt.figure(figsize=(10, 6))
sns.boxplot(x='cost_bin', y='rating', data=clean_data)

# Add mean and median points
plt.plot(mean_ratings.index, mean_ratings.values, color='red', marker='o', linestyle='-', label='Mean')
plt.plot(median_ratings.index, median_ratings.values, color='blue', marker='x', linestyle='-', label='Median')

# Customize the plot
plt.title('Box Plot of Customer Ratings by Binned Order Cost')
plt.xlabel('Order Cost Bins')
plt.ylabel('Customer Rating')
plt.legend()
plt.show()

Observations¶

Median Ratings:

  • The median customer ratings for various order costs are generally high, often around 4 to 5 stars.

  • This suggests a consistent level of customer satisfaction across different order costs.

Spread of Ratings:

  • For most order costs, the ratings have a relatively narrow interquartile range (IQR), indicating that the majority of ratings are close to the median.

  • Some order costs show wider IQRs, indicating more variability in customer satisfaction.

Outliers:

  • There are noticeable outliers at both ends of the rating scale for several order costs, indicating that some customers have had significantly better or worse experiences compared to the majority.

  • These outliers are more frequent at lower order costs, suggesting variability in satisfaction for cheaper orders.

High Cost Orders:

  • For higher order costs (e.g., above $30), the ratings tend to have less variability and are consistently high. This could indicate that customers who spend more tend to have better or more consistent experiences.

Low Cost Orders:

  • For lower order costs, the spread of ratings is wider with more outliers, indicating more varied customer experiences. This could be due to a variety of factors, such as expectations not being met or variability in service and product quality.
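
One subtlety of the binning above: `pd.cut` uses right-closed intervals by default, so a $10.00 order lands in the '< $10' bin and a $20.00 order in '$10 - $20'. A minimal check of the edge behavior:

```python
import pandas as pd

bin_labels = ['< $10', '$10 - $20', '$20 - $30', '$30 - $40', '$40 - $50', '> $50']
costs = pd.Series([9.99, 10.00, 10.01, 20.00, 20.01])

# Default right=True gives intervals (0, 10], (10, 20], ...
bins = pd.cut(costs, bins=[0, 10, 20, 30, 40, 50, float('inf')], labels=bin_labels)
print(list(bins))
```

If the labels are meant to be read as left-inclusive (e.g. '$10 - $20' covering $10.00), passing `right=False` to `pd.cut` would match them exactly.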

Rating Distribution by Total Wait Time (Prep Time + Delivery Time)¶

In [62]:
# Rating Distribution by Total Wait Time (Prep Time + Delivery Time)

# Data cleaning: ensure 'Not given' ratings are NaN (a no-op if the column
# was already converted to numeric in the earlier cell)
data['rating'] = pd.to_numeric(data['rating'].replace('Not given', pd.NA))

# Calculate the total wait time
data['total_wait_time'] = data['food_preparation_time'] + data['delivery_time']

# Ensure no NaNs or infinite values in the columns
cleaned_data = data.dropna(subset=['total_wait_time', 'rating'])

# Scatter Plot (kept for reference) - this wasn't very intuitive to visualize in this way
#plt.figure(figsize=(12, 6))
#sns.scatterplot(x='total_wait_time', y='rating', data=cleaned_data)
#plt.title('Rating Distribution by Total Wait Time (Prep Time + Delivery Time)')
#plt.xlabel('Total Wait Time (minutes)')
#plt.ylabel('Rating')
#plt.show()

# Hexbin Plot
plt.figure(figsize=(12, 6))
hb = plt.hexbin(cleaned_data['total_wait_time'], cleaned_data['rating'], gridsize=30, cmap='viridis', mincnt=1)
plt.colorbar(hb, label='Count')
plt.title('Rating Distribution by Total Wait Time (Prep Time + Delivery Time)')
plt.xlabel('Total Wait Time (minutes)')
plt.ylabel('Rating')
plt.show()

# Correlation Analysis
correlation, _ = pearsonr(cleaned_data['total_wait_time'], cleaned_data['rating'])
print(f"Pearson correlation coefficient between total wait time and rating: {correlation:.2f}")
Pearson correlation coefficient between total wait time and rating: -0.01

Observations¶

Pearson Correlation Coefficient:

The Pearson correlation coefficient between total wait time and rating is -0.01, indicating a very weak negative correlation. This suggests that there is virtually no linear relationship between total wait time and customer ratings.

Hexbin Plot:

The hexbin plot visually confirms the lack of a clear pattern or trend between total wait time and ratings. Ratings are spread across different total wait times, showing no strong correlation.

Average Rating by Day of the Week (excludes 'Not given')¶

In [36]:
# Calculate average rating by day of the week
# First, exclude rows where rating is 'Not given' (copy to avoid SettingWithCopyWarning)
filtered_data = data[data['rating'] != 'Not given'].copy()
filtered_data['rating'] = filtered_data['rating'].astype(float)

avg_ratings_by_day = filtered_data.groupby('day_of_the_week')['rating'].mean().reset_index()
avg_ratings_by_day.columns = ['day_of_the_week', 'average_rating']

# Style the table using pandas Styler
styled_avg_ratings_by_day = avg_ratings_by_day.style.background_gradient(cmap='viridis').set_caption("Average Ratings by Day of the Week")

# Display the styled table
styled_avg_ratings_by_day

# Define the x-axis labels with weekday/weekend context
x_labels = ['Weekday', 'Weekend']

# Plotting the average ratings by day of the week
plt.figure(figsize=(10, 6))
plt.bar(avg_ratings_by_day['day_of_the_week'], avg_ratings_by_day['average_rating'], color='lightcoral')
plt.title('Average Ratings by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Rating')
plt.xticks(ticks=range(len(avg_ratings_by_day['day_of_the_week'])), labels=x_labels)
plt.ylim(0, 5)
plt.grid(True)
plt.show()

Observations¶

Data and Visualization:

  • The dataset has been filtered to exclude rows where the rating is 'Not given'.

Average Ratings on Weekdays vs. Weekends:

  • Weekdays: The average rating on weekdays is slightly lower than on weekends.

  • Weekends: The average rating on weekends is slightly higher compared to weekdays.

Data Distribution:

  • The average rating on weekdays hovers around 4.3 to 4.4.

  • The average rating on weekends hovers around 4.4 to 4.5.

  • With an average of roughly 4.35 on weekdays versus 4.45 on weekends, the data reflects a slight improvement in customer satisfaction on weekends.

Order Frequency vs Ratings¶

In [37]:
# Calculate order frequency for each customer
order_frequency = data['customer_id'].value_counts().reset_index()
order_frequency.columns = ['customer_id', 'order_count']

# Merge the order frequency with the original dataset to get ratings
merged_data = pd.merge(data, order_frequency, on='customer_id')

# Filter out rows where rating is 'Not given' (copy to avoid SettingWithCopyWarning)
filtered_merged_data = merged_data[merged_data['rating'] != 'Not given'].copy()
filtered_merged_data['rating'] = filtered_merged_data['rating'].astype(float)

# Calculate the average rating for different order frequencies
avg_rating_by_frequency = filtered_merged_data.groupby('order_count')['rating'].mean().reset_index()
avg_rating_by_frequency.columns = ['order_count', 'average_rating']

# Style the table using pandas Styler
styled_avg_rating_by_frequency = avg_rating_by_frequency.style.background_gradient(cmap='viridis').set_caption("Average Rating by Order Frequency")

# Display the styled table
styled_avg_rating_by_frequency

# Plotting the average rating by order frequency
plt.figure(figsize=(10, 6))
plt.plot(avg_rating_by_frequency['order_count'], avg_rating_by_frequency['average_rating'], marker='o', linestyle='-', color='blue')
plt.title('Average Rating by Order Frequency')
plt.xlabel('Order Frequency')
plt.ylabel('Average Rating')
plt.ylim(0, 5)
plt.grid(True)
plt.show()

Observations¶

General Trend:

  • The average ratings for customers with different order frequencies are generally high, ranging between 4.28 and 4.43.

  • There is no significant downward trend in ratings with increasing order frequency, indicating consistent customer satisfaction across different order frequencies.

Specific Patterns:

  • There are slight variations in ratings at different order frequencies, but these variations are not drastic.

  • Customers who placed eight orders have the highest average rating (4.43), while those with nine orders have the lowest average rating (4.28).

Time Analysis¶

  • Delivery Time vs Food Prep Time
  • Delivery Time by Cuisine Type
  • Preparation Time by Cuisine Type
  • Total Time by Cuisine Type
  • Average Delivery Time by Day of the Week and Cuisine Type
  • Average Preparation Time by Day of the Week and Cuisine Type
  • Average Total Time by Day of the Week and Cuisine Type
  • Order Frequency vs Preparation, Delivery and Total Time
  • Order Frequency by Day of the Week

Delivery Time vs Food Prep Time¶

In [38]:
# Delivery time vs prep time

# Calculate the Pearson correlation coefficient
correlation = data['food_preparation_time'].corr(data['delivery_time'])

#Facet grid
g = sns.FacetGrid(data, col="cuisine_type", col_wrap=4, height=4)
g.map(sns.scatterplot, "food_preparation_time", "delivery_time", alpha=0.5)
g.set_titles("{col_name}")
g.set_axis_labels("Food Preparation Time (minutes)", "Delivery Time (minutes)")
plt.show()

# Print the result
print(f"Pearson correlation coefficient between food preparation time and delivery time: {correlation:.2f}")
Pearson correlation coefficient between food preparation time and delivery time: 0.01

Observations:¶

The analysis reveals a very weak correlation between food preparation time and delivery time, suggesting that factors other than preparation time are more significant in determining delivery time.

The facet grid further illustrates the lack of a strong relationship across different cuisine types, highlighting variability and potential influences from other operational aspects.

Delivery Time by Cuisine Type¶

In [39]:
# Box Plot of Delivery Time by Cuisine Type
plt.figure(figsize=(12, 6))
sns.boxplot(x='cuisine_type', y='delivery_time', data=data)
plt.title('Box Plot of Delivery Time by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Delivery Time (minutes)')
plt.xticks(rotation=90)
plt.show()

Observations¶

Variation in Delivery Time:

  • There is noticeable variation in delivery times across different cuisine types.

Cuisine with Longest Median Delivery Time:

  • Some cuisines, such as Mediterranean, show longer median delivery times.

Cuisine with Shortest Median Delivery Time:

  • Other cuisines, like Japanese and Italian, tend to have shorter median delivery times.

Presence of Outliers:

Outliers are present in the delivery times for most cuisine types, indicating occasional significantly longer or shorter delivery times than the median.

Spread of Delivery Time:

The interquartile ranges (IQRs) differ among cuisines, with some cuisines showing a wide range of delivery times (e.g., American), while others have a more consistent delivery time (e.g., Mexican).
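The spread differences read off the box plot can be quantified directly by computing Q1, Q3, and the IQR per cuisine. A sketch, again with a synthetic stand-in for `data`:

```python
import numpy as np
import pandas as pd

# Hedged sketch: IQR of delivery_time per cuisine_type via grouped quantiles.
# Synthetic stand-in frame; column names match the notebook's dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'cuisine_type': rng.choice(['American', 'Mexican', 'Japanese'], size=300),
    'delivery_time': rng.integers(15, 34, size=300),
})

q = df.groupby('cuisine_type')['delivery_time'].quantile([0.25, 0.75]).unstack()
q.columns = ['q1', 'q3']
q['iqr'] = q['q3'] - q['q1']
print(q.sort_values('iqr', ascending=False))  # widest-spread cuisines first
```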

Preparation Time by Cuisine Type¶

In [40]:
# Box Plot of Food Preparation Time by Cuisine Type
plt.figure(figsize=(12, 6))
sns.boxplot(x='cuisine_type', y='food_preparation_time', data=data)
plt.title('Box Plot of Food Preparation Time by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Food Preparation Time (minutes)')
plt.xticks(rotation=90)
plt.show()

Total Time by Cuisine Type¶

In [41]:
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']

# Group by cuisine type and calculate average total time
avg_total_time_by_cuisine = data.groupby('cuisine_type')['total_time'].mean().reset_index()
avg_total_time_by_cuisine.columns = ['cuisine_type', 'average_total_time']

# Style the table using pandas Styler
styled_avg_total_time_by_cuisine = avg_total_time_by_cuisine.style.background_gradient(cmap='viridis').set_caption("Average Total Time by Cuisine Type")

# Display the styled table
styled_avg_total_time_by_cuisine

# Plotting the average total time by cuisine type
plt.figure(figsize=(12, 6))
sns.barplot(x='cuisine_type', y='average_total_time', data=avg_total_time_by_cuisine, palette='viridis')
plt.title('Average Total Time by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Total Time (minutes)')
plt.xticks(rotation=90)
plt.show()

Observations¶

Longest Average Total Time:

  • Italian Cuisine: The average total time for Italian cuisine is the highest at approximately 52.64 minutes. This indicates that Italian dishes might require longer preparation and delivery times.

Shortest Average Total Time:

  • Fast Food Cuisine: The average total time for fast food is the lowest at approximately 30.04 minutes. This suggests that fast food items are quicker to prepare and deliver.

Other Notable Cuisines:

  • Japanese Cuisine: The average total time for Japanese cuisine is around 44.27 minutes, indicating moderate preparation and delivery times.

  • Indian Cuisine: The average total time for Indian cuisine is approximately 49.16 minutes, indicating a higher total time similar to Italian cuisine.

Variability:

  • There is a noticeable variability in the total times across different cuisine types. Some cuisines consistently require more time, while others are faster.


Average Delivery Time by Day of the Week and Cuisine Type¶

In [42]:
# Pivot Table of Average Delivery Time by Weekday/Weekend and Cuisine Type
# Create a pivot table for average delivery time by Weekday/Weekend and cuisine type
pivot_table_delivery_time = data.pivot_table(values='delivery_time', index='day_of_the_week', columns='cuisine_type', aggfunc='mean')

# Visualize the pivot table as a heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(pivot_table_delivery_time, annot=True, cmap='Reds')
plt.title('Average Delivery Time by Weekday/Weekend and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Weekday/Weekend')
plt.show()

pivot_table_delivery_time
Out[42]:
cuisine_type American Chinese French Indian Italian Japanese Korean Mediterranean Mexican Middle Eastern Southern Spanish Thai Vietnamese
day_of_the_week
Weekday 28.248521 28.826923 27.200000 27.625000 28.802198 28.133333 26.0 28.785714 28.250000 28.470588 29.500000 28.0 26.500000 27.666667
Weekend 22.542169 22.269939 24.615385 22.346939 22.705314 22.519403 20.0 21.312500 22.641509 21.750000 20.727273 23.0 22.266667 25.000000

Observations¶

Weekday vs. Weekend Delivery Times:

  • Weekdays: Delivery times are generally higher across most cuisine types compared to weekends.
  • Weekends: Delivery times tend to be lower, indicating faster deliveries on weekends.

Cuisine Types with Longest Delivery Times on Weekdays:

  • Southern: Average delivery time is approximately 29.50 minutes.

  • Chinese: Average delivery time is approximately 28.83 minutes.

  • Italian: Average delivery time is approximately 28.80 minutes.

  • Mediterranean: Average delivery time is approximately 28.79 minutes.

Cuisine Types with Shortest Delivery Times on Weekdays:

  • Korean: Average delivery time is approximately 26.00 minutes.

  • Thai: Average delivery time is approximately 26.50 minutes.

Cuisine Types with Longest Delivery Times on Weekends:

  • Vietnamese: Average delivery time is approximately 25.00 minutes.

  • French: Average delivery time is approximately 24.62 minutes.

Cuisine Types with Shortest Delivery Times on Weekends:

  • Korean: Average delivery time is approximately 20.00 minutes.

  • Southern: Average delivery time is approximately 20.73 minutes.

  • Mediterranean: Average delivery time is approximately 21.31 minutes.

Comparison Between Weekdays and Weekends:

  • Most cuisine types have a significant reduction in delivery times on weekends compared to weekdays.

  • The difference in delivery times between weekdays and weekends is more pronounced for certain cuisines like American, Chinese, and Italian.
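The per-cuisine weekday/weekend gap can be pulled straight from the pivot table as a row difference. A sketch using a hand-copied subset of the pivot values above (remaining columns omitted for brevity):

```python
import pandas as pd

# Hedged sketch: the weekday-to-weekend slowdown per cuisine is just a row
# difference on a pivot shaped like pivot_table_delivery_time above.
pivot = pd.DataFrame(
    {'American': [28.25, 22.54], 'Chinese': [28.83, 22.27], 'Korean': [26.0, 20.0]},
    index=['Weekday', 'Weekend'],
)

gap = (pivot.loc['Weekday'] - pivot.loc['Weekend']).sort_values(ascending=False)
print(gap)  # cuisines with the largest weekday slowdown first
```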

Average Preparation Time by Day of the Week and Cuisine Type¶

In [43]:

# Create a pivot table for average preparation time by Weekday/Weekend and cuisine type
pivot_table_preparation_time = data.pivot_table(values='food_preparation_time', index='day_of_the_week', columns='cuisine_type', aggfunc='mean')

# Visualize the pivot table as a heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(pivot_table_preparation_time, annot=True, cmap='Blues')
plt.title('Average Preparation Time by Weekday/Weekend and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Weekday/Weekend')
plt.show()

pivot_table_preparation_time
Out[43]:
cuisine_type American Chinese French Indian Italian Japanese Korean Mediterranean Mexican Middle Eastern Southern Spanish Thai Vietnamese
day_of_the_week
Weekday 27.408284 28.076923 27.200000 26.333333 27.417582 27.281481 23.000000 25.785714 25.625000 27.235294 26.166667 33.000000 25.250000 23.666667
Weekend 27.453012 27.331288 26.769231 27.489796 27.512077 27.602985 25.909091 27.531250 27.226415 26.375000 28.363636 26.363636 27.866667 27.250000

Observations¶

Weekday vs. Weekend Preparation Times:

  • Weekends: Unlike delivery times, preparation times are slightly higher on weekends for most cuisine types, likely reflecting heavier kitchen load from the larger weekend order volume.

  • Weekdays: A few cuisines (Spanish, Chinese, French, and Middle Eastern) run longer on weekdays, with Spanish showing the largest gap (33.00 vs. 26.36 minutes).

Cuisine Types with Longest Preparation Times on Weekdays:

  • Spanish: Average preparation time is approximately 33.00 minutes.

  • Chinese: Average preparation time is approximately 28.08 minutes.

  • Italian: Average preparation time is approximately 27.42 minutes.

Cuisine Types with Shortest Preparation Times on Weekdays:

  • Korean: Average preparation time is approximately 23.00 minutes.

  • Vietnamese: Average preparation time is approximately 23.67 minutes.

  • Thai: Average preparation time is approximately 25.25 minutes.

Cuisine Types with Longest Preparation Times on Weekends:

  • Southern: Average preparation time is approximately 28.36 minutes.

  • Thai: Average preparation time is approximately 27.87 minutes.

  • Japanese: Average preparation time is approximately 27.60 minutes.

Cuisine Types with Shortest Preparation Times on Weekends:

  • Korean: Average preparation time is approximately 25.91 minutes.

  • Spanish: Average preparation time is approximately 26.36 minutes.

  • Middle Eastern: Average preparation time is approximately 26.38 minutes.

Comparison Between Weekdays and Weekends:

  • The difference in preparation times between weekdays and weekends is more pronounced for certain cuisines like Spanish and Southern.

  • Some cuisines have minimal differences in preparation times between weekdays and weekends, indicating consistent preparation times.

Average Total Time by Day of the Week and Cuisine Type¶

In [44]:
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']

# Create a pivot table for average total time by Weekday/Weekend and cuisine type
pivot_table_total_time = data.pivot_table(values='total_time', index='day_of_the_week', columns='cuisine_type', aggfunc='mean')

# Visualize the pivot table as a heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(pivot_table_total_time, annot=True, cmap='Greens')
plt.title('Average Total Time by Weekday/Weekend and Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Weekday/Weekend')
plt.show()

pivot_table_total_time
Out[44]:
cuisine_type American Chinese French Indian Italian Japanese Korean Mediterranean Mexican Middle Eastern Southern Spanish Thai Vietnamese
day_of_the_week
Weekday 55.656805 56.903846 54.400000 53.958333 56.219780 55.414815 49.000000 54.571429 53.875000 55.705882 55.666667 61.000000 51.750000 51.333333
Weekend 49.995181 49.601227 51.384615 49.836735 50.217391 50.122388 45.909091 48.843750 49.867925 48.125000 49.090909 49.363636 50.133333 52.250000

Observations¶

Weekday vs. Weekend Total Times:

  • Weekdays: Total times are generally higher across most cuisine types compared to weekends.

  • Weekends: Total times are generally lower, indicating quicker overall preparation and delivery on weekends.

Cuisine Types with Longest Total Times on Weekdays:

  • Spanish: Average total time is approximately 61.00 minutes.
  • Chinese: Average total time is approximately 56.90 minutes.
  • Italian: Average total time is approximately 56.22 minutes.
  • Middle Eastern: Average total time is approximately 55.71 minutes.
  • Southern: Average total time is approximately 55.67 minutes.

Cuisine Types with Shortest Total Times on Weekdays:

  • Korean: Average total time is approximately 49.00 minutes.
  • Vietnamese: Average total time is approximately 51.33 minutes.
  • Thai: Average total time is approximately 51.75 minutes.

Cuisine Types with Longest Total Times on Weekends:

  • Vietnamese: Average total time is approximately 52.25 minutes.
  • French: Average total time is approximately 51.38 minutes.
  • Italian: Average total time is approximately 50.22 minutes.

Cuisine Types with Shortest Total Times on Weekends:

  • Korean: Average total time is approximately 45.91 minutes.
  • Middle Eastern: Average total time is approximately 48.13 minutes.
  • Mediterranean: Average total time is approximately 48.84 minutes.
  • Southern: Average total time is approximately 49.09 minutes.

Comparison Between Weekdays and Weekends:

  • The difference in total times between weekdays and weekends is more pronounced for certain cuisines like Spanish, Chinese, and Italian.

  • Some cuisines have minimal differences in total times between weekdays and weekends, indicating consistent preparation and delivery times.

Order Frequency vs Preparation, Delivery and Total Time¶

In [45]:
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']

# Calculate order frequency for each customer
order_frequency = data['customer_id'].value_counts().reset_index()
order_frequency.columns = ['customer_id', 'order_count']

# Merge the order frequency with the original dataset
merged_data = pd.merge(data, order_frequency, on='customer_id')

# Calculate the average preparation, delivery, and total time for different order frequencies
avg_times_by_frequency = merged_data.groupby('order_count').agg({
    'food_preparation_time': 'mean',
    'delivery_time': 'mean',
    'total_time': 'mean'
}).reset_index()
avg_times_by_frequency.columns = ['order_count', 'average_preparation_time', 'average_delivery_time', 'average_total_time']

# Plotting the average preparation, delivery, and total time by order frequency side by side
fig, ax = plt.subplots(1, 3, figsize=(18, 6), sharey=True)

# Preparation Time
sns.lineplot(ax=ax[0], x='order_count', y='average_preparation_time', data=avg_times_by_frequency, marker='o', color='skyblue', label='Preparation Time')
ax[0].set_title('Average Preparation Time by Order Frequency')
#ax[0].set_xlabel('Order Frequency')
ax[0].set_ylabel('Average Time (minutes)')
ax[0].grid(True)
ax[0].legend()

# Delivery Time
sns.lineplot(ax=ax[1], x='order_count', y='average_delivery_time', data=avg_times_by_frequency, marker='o', color='lightgreen', label='Delivery Time')
ax[1].set_title('Average Delivery Time by Order Frequency')
ax[1].set_xlabel('Order Frequency')
ax[1].grid(True)
ax[1].legend()

# Total Time
sns.lineplot(ax=ax[2], x='order_count', y='average_total_time', data=avg_times_by_frequency, marker='o', color='salmon', label='Total Time')
ax[2].set_title('Average Total Time by Order Frequency')
#ax[2].set_xlabel('Order Frequency')
ax[2].grid(True)
ax[2].legend()

plt.tight_layout()
plt.show()

Observations¶

Average Preparation Time:

  • 1 Order: Average preparation time is approximately 27.58 minutes.
  • 2 Orders: Slightly decreases to approximately 27.25 minutes.
  • 3 Orders: Decreases further to approximately 26.83 minutes.
  • 4 Orders: Increases to approximately 27.47 minutes.
  • 8 Orders: Significantly drops to approximately 25.00 minutes.
  • 9 Orders: Peaks at approximately 29.33 minutes.

Average Delivery Time:

  • 1 Order: Average delivery time is approximately 24.36 minutes.
  • 2 Orders: Slightly decreases to approximately 23.70 minutes.
  • 3 Orders: Slightly increases to approximately 24.32 minutes.
  • 5 Orders: Peaks slightly at approximately 24.47 minutes.
  • 9 Orders: Drops to approximately 22.11 minutes.

Average Total Time:

  • 1 Order: Average total time is approximately 51.93 minutes.
  • 2 Orders: Decreases slightly to approximately 50.96 minutes.
  • 3 Orders: Slightly increases to approximately 51.14 minutes.
  • 8 Orders: Drops significantly to approximately 48.25 minutes.
  • 10 Orders: Holds steady at approximately 51.20 minutes.
  • 13 Orders: Increases to approximately 52.46 minutes.

Insights

  • General Trend:

    • Preparation, delivery, and total times show some fluctuation with the number of orders placed by customers.
    • Customers with a higher frequency of orders tend to have slightly more efficient preparation times around eight orders, but this is not consistent.
  • Efficiency Peaks:

    • A notable drop in preparation and total times occurs at 8 orders, suggesting possible improvements in efficiency for these frequent customers.
  • Inconsistencies:

    • The preparation time increases again for customers with 9 orders, suggesting possible inconsistencies or operational challenges for this group.

  • Consistent Delivery Times:

    • Delivery times remain relatively stable with minor fluctuations across different order frequencies.

Order Frequency by Day of the Week¶

In [46]:
# Calculate the order frequency by day of the week
order_frequency_by_day = data['day_of_the_week'].value_counts().reset_index()
order_frequency_by_day.columns = ['day_of_the_week', 'order_count']

# Style the table using pandas Styler
styled_order_frequency_by_day = order_frequency_by_day.style.background_gradient(cmap='viridis').set_caption("Order Frequency by Day of the Week")

# Display the styled table
styled_order_frequency_by_day

# Plotting the order frequency by day of the week
plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_the_week', y='order_count', data=order_frequency_by_day, palette='viridis')
plt.title('Order Frequency by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Order Count')
#plt.xticks(rotation=90)
plt.grid(True)
plt.show()

Observations¶

  • Order Frequency on Weekdays vs. Weekends:

    • Weekends: There is a significantly higher order frequency on weekends, with 1351 orders.

    • Weekdays: The order frequency on weekdays is lower, with 547 orders.

  • Higher Order Volumes on Weekends:

    • The data clearly shows that customers place more orders on weekends compared to weekdays. This could be due to various factors such as more free time, social gatherings, and special occasions.
  • Lower Order Volumes on Weekdays:

    • The lower order frequency on weekdays suggests that customers may be busier with work and other commitments, leading to fewer orders.
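One caveat on the raw counts: the weekend bucket spans 2 of 7 days while the weekday bucket spans 5, so normalizing by days gives a fairer per-day demand comparison. A quick check using the counts from the output above:

```python
# Hedged sketch: divide raw order counts (from the notebook output) by the
# number of days in each bucket to get a per-day demand rate.
orders = {'Weekend': 1351, 'Weekday': 547}
days_in_bucket = {'Weekend': 2, 'Weekday': 5}

per_day = {k: orders[k] / days_in_bucket[k] for k in orders}
print(per_day)  # → {'Weekend': 675.5, 'Weekday': 109.4}
```

The per-day gap is even starker than the raw counts suggest, reinforcing the weekend-demand conclusion.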

Outlier Analysis¶

In [73]:
# Convert 'rating' to numeric, replacing 'Not given' with NaN
data['rating'] = pd.to_numeric(data['rating'].replace('Not given', pd.NA), errors='coerce')

# Calculate the total wait time
data['total_wait_time'] = data['food_preparation_time'] + data['delivery_time']

# Identify numeric columns
numeric_columns = data.select_dtypes(include=[np.number]).columns
numeric_columns = numeric_columns.drop('customer_id')   # Exclude 'customer_id' from the analysis
numeric_columns = numeric_columns.drop('order_id')      # Exclude 'order_id' from the analysis

# Determine the number of rows and columns for the subplot grid
num_cols = 5
num_rows = math.ceil(len(numeric_columns) / num_cols)

# Set up the subplots
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, num_rows * 5))
colors = sns.color_palette("husl", len(numeric_columns))  # Generate a color palette

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Step 2 & 3: Visualize Outliers and Identify Outliers using IQR method for each numeric column
outlier_counts = {}

for i, col in enumerate(numeric_columns):
    sns.boxplot(x=data[col], ax=axes[i], color=colors[i])
    axes[i].set_title(f'Box Plot of {col}')
    axes[i].set_xlabel(col)

    # Calculate IQR
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1

    # Define outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers
    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]
    outlier_counts[col] = len(outliers)

    # Print outliers count
    print(f"Number of outliers detected in {col}: {len(outliers)}")

# Remove any empty subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

# Adjust layout
plt.tight_layout()
plt.show()

# Print summary of outliers in each column
print("\nSummary of Outlier Counts:")
for col, count in outlier_counts.items():
    print(f"{col}: {count} outliers")

# Optional: Remove outliers (across all numeric columns)
# Note: rows containing NaN (e.g., orders whose rating was 'Not given') fail
# both bound checks below and are removed too, which is why the record count drops.
for col in numeric_columns:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Remove outliers
    data = data[(data[col] >= lower_bound) & (data[col] <= upper_bound)]

# Summary of the cleaned data
print(f"\nData after removing outliers: {len(data)} records remaining.")
Number of outliers detected in cost_of_the_order: 0
Number of outliers detected in rating: 0
Number of outliers detected in food_preparation_time: 0
Number of outliers detected in delivery_time: 0
Number of outliers detected in total_wait_time: 0
Summary of Outlier Counts:
cost_of_the_order: 0 outliers
rating: 0 outliers
food_preparation_time: 0 outliers
delivery_time: 0 outliers
total_wait_time: 0 outliers

Data after removing outliers: 1162 records remaining.
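The record count falls even though zero outliers were detected: the IQR filter also drops every row whose rating is NaN (the unrated orders), because comparisons against NaN evaluate to False. A minimal demonstration:

```python
import numpy as np
import pandas as pd

# Demonstration: comparisons against NaN evaluate to False, so a mask of the
# form (col >= lower) & (col <= upper) silently drops rows with missing values.
s = pd.Series([3.0, np.nan, 5.0])
mask = (s >= 1) & (s <= 5)
print(mask.tolist())  # → [True, False, True]; the NaN row is excluded
```

If unrated orders should be kept, the mask could be OR-ed with `s.isna()` before filtering.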

Question 13: The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer¶

In [47]:
# Filter out rows where rating is 'Not given' (copy to avoid a SettingWithCopyWarning)
filtered_data = data[data['rating'] != 'Not given'].copy()
filtered_data['rating'] = filtered_data['rating'].astype(float)

# Calculate the rating count and average rating for each restaurant
restaurant_ratings = filtered_data.groupby('restaurant_name')['rating'].agg(['count', 'mean']).reset_index()
restaurant_ratings.columns = ['restaurant_name', 'rating_count', 'average_rating']

# Filter the restaurants that meet the criteria: rating count > 50 and average rating > 4
eligible_restaurants = restaurant_ratings[(restaurant_ratings['rating_count'] > 50) & (restaurant_ratings['average_rating'] > 4)]

# Style the table using pandas Styler with all column headers centered and data left-justified, except for rating_count data which is centered
styled_eligible_restaurants = (eligible_restaurants.style
                               .set_properties(subset=['restaurant_name'], **{'text-align': 'left'})
                               .set_properties(subset=['rating_count'], **{'text-align': 'center'})
                               .set_table_styles([{
                                   'selector': 'th',
                                   'props': [('text-align', 'center')]
                               }])
                               .set_caption("Restaurants Eligible for Promotional Offer"))

# Display the styled table
styled_eligible_restaurants
Out[47]:
Restaurants Eligible for Promotional Offer
  restaurant_name rating_count average_rating
20 Blue Ribbon Fried Chicken 64 4.328125
21 Blue Ribbon Sushi 73 4.219178
136 Shake Shack 133 4.278195
153 The Meatball Shop 84 4.511905

Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders¶

In [48]:
# Define the function to calculate the commission based on order cost
def calculate_commission(cost):
    if cost > 20:
        return cost * 0.25
    elif cost > 5:
        return cost * 0.15
    else:
        return 0

# Apply the function to calculate the commission for each order
data['commission'] = data['cost_of_the_order'].apply(calculate_commission)

# Calculate the total revenue generated by the company
total_revenue = data['commission'].sum()

# Print the total revenue generated by the company with a formatted string, rounded to 2 decimal places
print(f"The total net revenue generated by the company across all orders, based on the given commission structure, is ${total_revenue:.2f}")
The total net revenue generated by the company across all orders, based on the given commission structure, is $6166.30
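The same tiered commission can be computed without a row-wise `apply` by using `np.select`, which is typically faster on large frames and gives identical results. A sketch with hypothetical order costs:

```python
import numpy as np
import pandas as pd

# Hedged sketch: vectorized tiered commission with np.select.
# The three costs below are hypothetical, one per tier.
costs = pd.Series([4.0, 12.0, 25.0])

commission = pd.Series(
    np.select(
        [costs > 20, costs > 5],       # conditions checked in order
        [costs * 0.25, costs * 0.15],  # matching commission rates
        default=0.0,                   # orders of $5 or less earn nothing
    )
)
print(commission.round(2).tolist())  # → [0.0, 1.8, 6.25]
```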

Observations:¶

Revenue from Low-Cost Orders:

  • Orders with a cost less than or equal to $5 do not contribute to the revenue as they fall outside the commission structure.
  • The company should focus on increasing the average order value to maximize revenue.

  • Impact of Commission Structure:

    • The tiered commission structure ensures that higher-value orders contribute more to the revenue.
    • This structure incentivizes promoting higher-value items and potentially increasing the average order value.

Recommendations

  • Promote Higher-Value Items:

    • Encourage customers to place higher-value orders through promotions or bundling items, which would increase the number of orders qualifying for the 25% commission rate.
  • Increase Average Order Value:

    • Implement strategies to increase the average order value, such as offering discounts on add-ons or creating special offers for orders above a certain amount.
  • Monitor and Adjust Commission Structure:

    • Regularly review the commission structure to ensure it aligns with the company's revenue goals and market conditions.
    • Consider adjustments based on order patterns and customer feedback to optimize revenue.
  • Customer Engagement:

    • Engage customers with targeted marketing campaigns highlighting high-value items and special promotions to drive higher-order volumes.

Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered)¶

In [49]:
# Calculate total time (preparation time + delivery time)
data['total_time'] = data['food_preparation_time'] + data['delivery_time']

# Calculate the percentage of orders that take more than 60 minutes to get delivered
total_orders = len(data)
orders_over_60_minutes = len(data[data['total_time'] > 60])
percentage_over_60_minutes = (orders_over_60_minutes / total_orders) * 100

# Print the percentage of orders that take more than 60 minutes
print(f"The percentage of orders that take more than 60 minutes to get delivered is {percentage_over_60_minutes:.2f}%")
The percentage of orders that take more than 60 minutes to get delivered is 10.54%

Observations:¶

Percentage of Orders Over 60 Minutes:

  • The analysis shows that 10.54% of the orders take more than 60 minutes to get delivered from the time the order is placed.

Implications of Long Delivery Times:

  • Orders taking more than 60 minutes could indicate inefficiencies in the preparation or delivery process.
  • Long delivery times may affect customer satisfaction and lead to negative reviews or reduced repeat business.
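To gauge how sensitive the 10.54% figure is to the chosen cutoff, the late-order share can be evaluated at several thresholds. A sketch with synthetic total times standing in for `data['total_time']`:

```python
import numpy as np
import pandas as pd

# Hedged sketch: evaluate the share of "late" orders at several cutoffs.
# total_time is synthetic, standing in for the notebook's data['total_time'].
rng = np.random.default_rng(2)
total_time = pd.Series(rng.integers(35, 70, size=1000))

for threshold in (50, 55, 60, 65):
    pct = (total_time > threshold).mean() * 100
    print(f"> {threshold} min: {pct:.2f}% of orders")
```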

Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends?¶

In [50]:
# Calculate the mean delivery time for weekdays and weekends
mean_delivery_time_by_day = data.groupby('day_of_the_week')['delivery_time'].mean().reset_index()

# Separate weekdays and weekends
weekdays = mean_delivery_time_by_day[mean_delivery_time_by_day['day_of_the_week'] == 'Weekday']['delivery_time'].values[0]
weekends = mean_delivery_time_by_day[mean_delivery_time_by_day['day_of_the_week'] == 'Weekend']['delivery_time'].values[0]

# Print the mean delivery times
print(f"The mean delivery time on weekdays is {weekdays:.2f} minutes.")
print(f"The mean delivery time on weekends is {weekends:.2f} minutes.")
The mean delivery time on weekdays is 28.34 minutes.
The mean delivery time on weekends is 22.47 minutes.

Observations:¶

Mean Delivery Time on Weekdays:

  • The mean delivery time on weekdays is approximately 28.34 minutes.

Mean Delivery Time on Weekends:

  • The mean delivery time on weekends is approximately 22.47 minutes.
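A quick significance check supports the size of the gap: a Welch t-statistic (difference in means over its standard error, allowing unequal variances) computed on samples mimicking the observed group sizes and means. Purely illustrative; the synthetic spread of 4 minutes is an assumption:

```python
import numpy as np

# Hedged sketch: Welch t-statistic on synthetic samples mimicking the
# notebook's groups (547 weekday orders ~28.3 min, 1351 weekend ~22.5 min).
rng = np.random.default_rng(3)
weekday = rng.normal(28.3, 4.0, 547)
weekend = rng.normal(22.5, 4.0, 1351)

# Welch's t: mean difference over its standard error (unequal variances)
se = np.sqrt(weekday.var(ddof=1) / weekday.size + weekend.var(ddof=1) / weekend.size)
t_stat = (weekday.mean() - weekend.mean()) / se
print(f"Welch t = {t_stat:.2f}")  # values far above ~2 indicate a real gap
```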

Conclusion and Recommendations¶

Delivery times on weekdays are appreciably longer than on weekends, despite the generally higher order frequency on weekends.

Improve Weekday Efficiency:

  • Investigate the factors contributing to longer delivery times on weekdays. Consider optimizing delivery routes, increasing staffing levels, or streamlining operations to reduce delivery times.

Maintain Weekend Efficiency:

  • Continue leveraging the efficiencies observed on weekends. Analyze the processes that work well during weekends and see if they can be applied to weekdays.

Question 17: What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations)¶

Conclusions:¶

Long Delivery Times:

  • 10.54% of orders take more than 60 minutes to get delivered, indicating potential inefficiencies in the preparation or delivery process.

Delivery Times on Weekdays vs. Weekends:

  • Mean delivery time on weekdays is approximately 28.34 minutes. Mean delivery time on weekends is shorter, at approximately 22.47 minutes.

Highly Rated Restaurants:

  • Certain restaurants like Shake Shack and The Meatball Shop have high average ratings and a substantial number of ratings, making them prime candidates for promotional offers.

Order Frequency by Day of the Week:

  • Order frequency is significantly higher on weekends compared to weekdays, suggesting a shift in customer behavior and demand.

Recommendations:¶

Optimize Delivery Efficiency on Weekdays:

  • Route Optimization: Use advanced routing algorithms to find the most efficient delivery routes during peak weekday hours.
  • Staffing Adjustments: Increase staffing levels during peak times on weekdays to handle higher traffic and ensure timely deliveries.
  • Preparation Processes: Streamline kitchen operations to reduce preparation times, possibly by pre-prepping ingredients for popular dishes.

Leverage High Customer Ratings:

  • Promotional Campaigns: Highlight highly rated restaurants like Shake Shack and The Meatball Shop in promotional campaigns to attract more customers.
  • Customer Feedback: Regularly solicit and analyze customer feedback to identify strengths and areas for improvement. Use this feedback to maintain high service standards and improve underperforming areas.

Focus on Weekend Efficiency:

  • Best Practices: Analyze the factors contributing to shorter delivery times on weekends and apply these best practices to weekdays.
  • Weekend Promotions: Continue leveraging the high order volumes on weekends with special offers and promotions to maximize revenue.

Enhance Customer Experience:

  • Transparent Communication: Clearly communicate estimated delivery times to customers and provide real-time updates on order status.
  • Compensation for Delays: Offer compensation or discounts for orders that are significantly delayed to maintain customer trust and satisfaction.

Improve Cuisine-Specific Offerings:

  • Menu Optimization: Analyze the preparation and delivery times for different cuisine types. Focus on optimizing the processes for cuisines with longer average times (e.g., Spanish, Chinese).
  • Targeted Promotions: Offer targeted promotions for cuisines that have high ratings but longer delivery times to balance the demand and improve overall delivery efficiency.

Incentivize Higher-Value Orders:

  • Bundled Offers: Create bundled meal deals that encourage customers to place higher-value orders, which contribute more to the company’s revenue.
  • Loyalty Programs: Implement loyalty programs that reward customers for placing higher-value orders and frequent orders, driving repeat business.

In [76]:
# Convert notebook to html

import os
os.getcwd()
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FoodHub/FDS_Project_LearnerNotebook_FullCode.ipynb"
[NbConvertApp] Converting notebook /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FoodHub/FDS_Project_LearnerNotebook_FullCode.ipynb to html
[NbConvertApp] Writing 3452465 bytes to /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week Two - Statistics for Data Science/Project Assessment - FoodHub/FDS_Project_LearnerNotebook_FullCode.html