Guided Project: Classification and Hypothesis Testing¶


Context:¶

Hospital management is a vital area that gained a lot of attention during the COVID-19 pandemic. Inefficient distribution of resources such as beds and ventilators can lead to serious complications. This can be mitigated by predicting the length of stay (LOS) of a patient before admission. Once the LOS is known, the hospital can plan suitable treatment, resources, and staff to reduce it and increase the chances of recovery. Rooms and beds can also be planned accordingly.

HealthPlus hospital has been incurring heavy losses in revenue, and in lives, due to its inefficient management system. It has been unsuccessful in allocating equipment, beds, and hospital staff fairly. A system that can estimate the length of stay (LOS) of a patient would solve this problem to a great extent.

Objective:¶

As a Data Scientist, you have been hired by HealthPlus to analyze the data, find out which factors affect the LOS the most, and build a machine learning model that can predict a patient's LOS using the data available at admission and after running a few tests. Also, derive useful insights and policies from the data that can help the hospital improve its healthcare infrastructure and revenue.

Data Dictionary:¶

The data contains various information recorded during the time of admission of the patient. It only contains records of patients who were admitted to the hospital. The detailed data dictionary is given below:

  • patientid: Patient ID
  • Age: Range of age of the patient
  • gender: Gender of the patient
  • Type of Admission: Trauma, emergency or urgent
  • Severity of Illness: Extreme, moderate, or minor
  • health_conditions: Any previous health conditions suffered by the patient
  • Visitors with Patient: The number of visitors who accompany the patient
  • Insurance: Does the patient have health insurance or not?
  • Admission_Deposit: The deposit paid by the patient during admission
  • Stay (in days): The number of days that the patient has stayed in the hospital. This is the target variable
  • Available Extra Rooms in Hospital: The number of rooms available during admission
  • Department: The department which will be treating the patient
  • Ward_Facility_Code: The code of the ward facility in which the patient will be admitted
  • doctor_name: The doctor who will be treating the patient
  • staff_available: The number of staff who are not occupied at the moment in the ward

Approach to solve the problem:¶

  1. Import the necessary libraries
  2. Read the dataset and get an overview
  3. Exploratory data analysis - a. Univariate b. Bivariate
  4. Data preprocessing if any
  5. Define the performance metric and build ML models
  6. Checking for assumptions
  7. Compare models and determine the best one
  8. Observations and business insights

Importing Libraries, Mount Drive and Read Data¶

In [7]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Visualization
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# Overwrite default pandas display limits
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', 200)
# pd.set_option('display.max_colwidth', None)
In [8]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [9]:
# Read dataset file
data = pd.read_csv("/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Guided_Project:_Classification_and_Hypothesis_Testing/Dataset - Hospital LOS Prediction.csv")
In [10]:
# Let's make a backup copy of the data frame
data_backup = data.copy()

Data Overview¶

In [11]:
# Let's get an idea about the data
# Print the first two rows
data.head(2)
Out[11]:
Available Extra Rooms in Hospital Department Ward_Facility_Code doctor_name staff_available patientid Age gender Type of Admission Severity of Illness health_conditions Visitors with Patient Insurance Admission_Deposit Stay (in days)
0 4 gynecology D Dr Sophia 0 33070 41-50 Female Trauma Extreme Diabetes 4 Yes 2966.408696 8
1 4 gynecology B Dr Sophia 2 34808 31-40 Female Trauma Minor Heart disease 2 No 3554.835677 9
In [12]:
# Print the last two rows
data.tail(2)
Out[12]:
Available Extra Rooms in Hospital Department Ward_Facility_Code doctor_name staff_available patientid Age gender Type of Admission Severity of Illness health_conditions Visitors with Patient Insurance Admission_Deposit Stay (in days)
499998 2 radiotherapy A Dr John 1 29957 61-70 Female Trauma Extreme Diabetes 2 No 4694.127772 23
499999 3 gynecology F Dr Sophia 3 45008 41-50 Female Trauma Moderate Heart disease 4 Yes 4713.868519 10
In [13]:
# What is the shape of our data
data.shape
Out[13]:
(500000, 15)

The dataset contains 500,000 rows and 15 columns (14 features plus the target).

In [14]:
# What types of features do we have?
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 15 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Available Extra Rooms in Hospital  500000 non-null  int64  
 1   Department                         500000 non-null  object 
 2   Ward_Facility_Code                 500000 non-null  object 
 3   doctor_name                        500000 non-null  object 
 4   staff_available                    500000 non-null  int64  
 5   patientid                          500000 non-null  int64  
 6   Age                                500000 non-null  object 
 7   gender                             500000 non-null  object 
 8   Type of Admission                  500000 non-null  object 
 9   Severity of Illness                500000 non-null  object 
 10  health_conditions                  348112 non-null  object 
 11  Visitors with Patient              500000 non-null  int64  
 12  Insurance                          500000 non-null  object 
 13  Admission_Deposit                  500000 non-null  float64
 14  Stay (in days)                     500000 non-null  int64  
dtypes: float64(1), int64(5), object(9)
memory usage: 57.2+ MB

Most columns contain no null values, but health_conditions has only 348,112 non-null entries out of 500,000, i.e. roughly 152,000 missing values that we will need to address.
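The missing count can be confirmed directly with isnull().sum() (for the full frame, 500,000 - 348,112 = 151,888 missing health_conditions entries); a minimal sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Toy frame (illustration only; the notebook would call data.isnull().sum())
df = pd.DataFrame({"health_conditions": ["Diabetes", np.nan, "Other", np.nan]})
print(df["health_conditions"].isnull().sum())  # 2
```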

The data is comprised of the following:

  • Continuous features (numerical data types):

    • Available Extra Rooms in Hospital
    • staff_available
    • patientid
    • Visitors with Patient
    • Admission_Deposit
    • Stay (in days)
  • Categorical features (object data type):

    • Department
    • Ward_Facility_Code
    • doctor_name
    • Age
    • gender
    • Type of Admission
    • Severity of Illness
    • health_conditions
    • Insurance (binary object)
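Rather than maintaining this split by hand, the same division can be derived with select_dtypes; a minimal sketch on a hypothetical two-row frame mirroring the dataset's dtypes:

```python
import pandas as pd

# Hypothetical miniature frame with the same dtypes as the dataset
df = pd.DataFrame({
    "Age": ["21-30", "31-40"],                # object  -> categorical
    "Admission_Deposit": [2966.41, 3554.84],  # float64 -> continuous
    "Stay (in days)": [8, 9],                 # int64   -> continuous
})

num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(include="object").columns.tolist()
print(num_cols)  # ['Admission_Deposit', 'Stay (in days)']
print(cat_cols)  # ['Age']
```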

Let's check the unique values in each column

In [15]:
# Checking unique values in each column
data.nunique()
Out[15]:
0
Available Extra Rooms in Hospital 18
Department 5
Ward_Facility_Code 6
doctor_name 9
staff_available 11
patientid 126399
Age 10
gender 3
Type of Admission 3
Severity of Illness 3
health_conditions 5
Visitors with Patient 28
Insurance 2
Admission_Deposit 499508
Stay (in days) 49

  • The maximum number of extra rooms available at any one time is 24, suggesting the facility has at least 24 rooms.

  • Staff availability ranges from 0 to 10 members, and there are 9 doctors.

  • Patients have an average of 3.5 visitors, with the count ranging from 0 to 32.

  • Patients pay an average deposit of $4,722.32 (assumed to be in dollars) on admission, ranging from $1,654.01 to $10,104.73.

  • The average length of stay is 12.38 days, ranging from 3 to 51 days.

General Observations:

  • All the numerical features are at least slightly skewed, as their means differ from their medians.
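The mean-versus-median argument can be checked directly; a sketch on a toy right-skewed series (in the notebook the same calls would be applied to data[num_cols]):

```python
import pandas as pd

# Toy right-skewed sample (illustration only)
s = pd.Series([8, 8, 9, 9, 10, 30])
print(s.mean() > s.median())  # True -> mean pulled right of the median
print(s.skew() > 0)           # True -> positive (right) skew
```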

Data Duplication Checks¶

In [16]:
data.duplicated()
Out[16]:
0
0 False
1 False
2 False
3 False
4 False
... ...
499995 False
499996 False
499997 False
499998 False
499999 False

500000 rows × 1 columns


In [17]:
data.duplicated().sum()
Out[17]:
0

There are no duplicate rows to address.

Exploratory Data Analysis¶

In [18]:
# How many unique patients has the facility served
data['patientid'].nunique()
Out[18]:
126399

The facility has served 126399 unique patients.

Statistical Analysis¶

In [19]:
# Descriptive statistics for numerical features
data.describe().T
Out[19]:
count mean std min 25% 50% 75% max
Available Extra Rooms in Hospital 500000.0 3.638800 2.698124 0.000000 2.000000 3.000000 4.000000 24.00000
staff_available 500000.0 5.020470 3.158103 0.000000 2.000000 5.000000 8.000000 10.00000
patientid 500000.0 63150.519058 41689.479956 -3269.000000 25442.000000 57864.000000 103392.000000 134400.00000
Visitors with Patient 500000.0 3.549414 2.241054 0.000000 2.000000 3.000000 4.000000 32.00000
Admission_Deposit 500000.0 4722.315734 1047.324220 1654.005148 4071.714532 4627.003792 5091.612717 10104.72639
Stay (in days) 500000.0 12.381062 7.913174 3.000000 8.000000 9.000000 11.000000 51.00000

Let's explore these variables in some more depth by observing their distributions

In [20]:
# Define numerical features
num_cols = ['Available Extra Rooms in Hospital', 'staff_available', 'Visitors with Patient', 'Admission_Deposit', 'Stay (in days)']

# Creating histograms
data[num_cols].hist(figsize=(10,10))
plt.show()
[Figure: histograms of the five numerical features]

Observations:

Since all the data comes from a single facility, the distributions can be interpreted as follows:

  1. Available Extra Rooms in Hospital:

    • Most of the time, the facility has fewer than 5 extra rooms available, indicating high occupancy or a limited number of rooms. It's rare for this facility to have more than 10 extra rooms at any given time.
  2. Staff Available:

    • Staff availability is spread fairly evenly between 0 and 10 at any given time, so the facility is about as likely to have many unoccupied staff members as few.
  3. Visitors with Patient:

    • Most patients have fewer than 5 visitors, with the number of visitors sharply decreasing as the count goes higher. This indicates that it's uncommon for patients to receive many visitors, possibly due to facility policies or visitor preferences.
  4. Admission Deposit:

    • The majority of patients are required to deposit between 4000 and 6000 units of currency for admission. Higher deposits are rare, but there are occasional outliers with deposits up to 10,000, possibly reflecting different treatment levels or payment plans.
  5. Stay (in days):

    • The typical patient stay is under 10 days, with a sharp drop-off after that. However, there are cases of patients staying for up to 50 days, though these longer stays are uncommon. This could suggest that most treatments or hospitalizations are resolved relatively quickly, with only a few cases requiring prolonged care.

In summary, this facility operates with high patient turnover and typically has a stable number of staff and visitors. Most patients deposit moderate amounts of money and stay for less than 10 days. Longer stays and higher deposits are outliers.

Feature: Available Rooms¶

In [21]:
# Define a function to find unique values, counts and percentages associated with a feature
total_patients = data['Department'].count()

def calculate_percentage(df, column_name):
  """Calculates the percentage of each unique value in a given column.

  Args:
    df: The pandas DataFrame containing the data.
    column_name: The name of the column to analyze.

  Returns:
    A pandas DataFrame with the percentage of each unique value.
  """
  return (df[column_name].value_counts(1, dropna=False)
            .reset_index(name='Percentage')
            .rename(columns={'index': column_name})
            .assign(Count=lambda x: (x['Percentage'] * total_patients).astype(int))
            .assign(Percentage=lambda x: x['Percentage'].apply(lambda p: f'{p * 100:.2f}%'))
            .sort_values(by='Count', ascending=False)
            .loc[:, [column_name, 'Count', 'Percentage']]
          )
In [22]:
# Find unique values, counts, and percentages for 'Available Extra Rooms in Hospital'
calculate_percentage(data, 'Available Extra Rooms in Hospital')
Out[22]:
Available Extra Rooms in Hospital Count Percentage
0 3 145044 29.01%
1 2 141205 28.24%
2 4 114011 22.80%
3 5 47644 9.53%
4 6 15561 3.11%
5 1 12194 2.44%
6 7 4975 1.00%
7 12 2844 0.57%
8 24 2786 0.56%
9 21 2377 0.48%
10 13 1918 0.38%
11 8 1750 0.35%
12 11 1600 0.32%
13 0 1573 0.31%
14 10 1559 0.31%
15 14 1265 0.25%
16 20 1020 0.20%
17 9 674 0.13%
  • The facility has zero available rooms (full occupancy) only 0.31% of the time during new intakes, and the maximum of 24 available rooms (lowest occupancy) 0.56% of the time.
  • The majority of the time (just over 80%), the facility has between 2 and 4 available rooms.

Feature: Departments¶

In [23]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'Department')
Out[23]:
Department Count Percentage
0 gynecology 343478 68.70%
1 radiotherapy 84315 16.86%
2 anesthesia 44179 8.84%
3 TB & Chest disease 22890 4.58%
4 surgery 5138 1.03%

The gynecology department sees the most patients at over 68%; the surgery department sees the fewest at just over 1%.

Feature: Wards¶

In [24]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'Ward_Facility_Code')
Out[24]:
Ward_Facility_Code Count Percentage
0 F 120538 24.11%
1 D 119055 23.81%
2 B 103885 20.78%
3 E 95374 19.07%
4 A 46551 9.31%
5 C 14597 2.92%

Ward F sees the most patients at just over 24%, ward C sees the fewest at just under 3%.

Feature: Doctors¶

In [25]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'doctor_name')
Out[25]:
doctor_name Count Percentage
0 Dr Sarah 99596 19.92%
1 Dr Olivia 98352 19.67%
2 Dr Sophia 74753 14.95%
3 Dr Nathan 70777 14.16%
4 Dr Sam 55711 11.14%
5 Dr John 51263 10.25%
6 Dr Mark 44410 8.88%
7 Dr Isaac 3359 0.67%
8 Dr Simon 1779 0.36%
  • Dr. Sarah is the busiest practitioner, seeing almost 20% of all patients.
  • Dr. Simon is the least busy, seeing just 0.36%.
  • Dr. Mark and Dr. John are the only two practitioners who work in more than one department.

Looking at repeat admissions, the maximum number of visits for a single patient is 21; the minimum is 1.
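A per-patient visit count like this can be derived with value_counts on the id column; a minimal sketch on hypothetical ids (the notebook would use data['patientid']):

```python
import pandas as pd

# Hypothetical patient ids (illustration only)
pid = pd.Series([101, 101, 102, 103, 103, 103])
visits = pid.value_counts()
print(visits.max())  # 3 -> most admissions by a single patient
print(visits.min())  # 1 -> fewest
```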

Feature: Available Staff¶

We know from the statistical analysis above that this value ranges from 0 to 10.

In [26]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'staff_available')
Out[26]:
staff_available Count Percentage
0 6 46620 9.32%
1 10 45768 9.15%
2 7 45703 9.14%
3 3 45658 9.13%
4 8 45583 9.12%
5 9 45530 9.11%
6 4 45290 9.06%
7 5 45215 9.04%
8 0 45032 9.01%
9 1 44931 8.99%
10 2 44670 8.93%

Staff availability is almost uniformly distributed: each level from 0 (no free staff) to the maximum of 10 occurs roughly 9% of the time.
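This near-uniformity can be quantified as the spread between the most and least common availability levels, using the counts from the table above:

```python
import numpy as np

# Counts taken from the staff_available table above, ordered by level 0..10
counts = np.array([45032, 44931, 44670, 45658, 45290, 45215,
                   46620, 45703, 45583, 45530, 45768])
props = counts / counts.sum()
spread_pp = (props.max() - props.min()) * 100  # spread in percentage points
print(round(spread_pp, 2))  # 0.39 -> all levels within ~0.4 pp of each other
```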

Feature: Age Ranges¶

In [27]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'Age')
Out[27]:
Age Count Percentage
0 21-30 159793 31.96%
1 31-40 133373 26.67%
2 41-50 80406 16.08%
3 11-20 46536 9.31%
4 61-70 26556 5.31%
5 51-60 21718 4.34%
6 71-80 18703 3.74%
7 81-90 8181 1.64%
8 0-10 3368 0.67%
9 91-100 1366 0.27%
  • Over 74% of admissions are for patients between the ages of 21 and 50.
  • Very young patients (ages 0 to 10) account for only 0.67% of admissions.
  • Elderly patients (ages 91 to 100) account for only 0.27% of admissions.

Feature: Gender¶

In [28]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'gender')
Out[28]:
gender Count Percentage
0 Female 370810 74.16%
1 Male 103480 20.70%
2 Other 25710 5.14%

As expected given that gynecology dominates admissions, females make up the majority of patients, at just over 74%.

Feature: Admission Type¶

In [29]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'Type of Admission')
Out[29]:
Type of Admission Count Percentage
0 Trauma 310536 62.11%
1 Emergency 135784 27.16%
2 Urgent 53680 10.74%

Trauma accounts for just over 62% of admissions.

Feature: Severity of Illness¶

In [30]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'Severity of Illness')
Out[30]:
Severity of Illness Count Percentage
0 Moderate 280197 56.04%
1 Minor 131537 26.31%
2 Extreme 88266 17.65%

Moderate illness is the most common severity, at just over 56% of admissions.

Feature: Health Conditions¶

In [31]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'health_conditions')
Out[31]:
health_conditions Count Percentage
0 NaN 151888 30.38%
1 Other 94411 18.88%
2 High Blood Pressure 79402 15.88%
3 Diabetes 73644 14.73%
4 Asthama 65514 13.10%
5 Heart disease 35141 7.03%
  • There are 5 named health conditions in the dataset.
  • Some data cleanup is necessary to handle the NaN values; these are equivalent to "none" and account for just over 30% of admissions.
  • Among recorded conditions, "Other" accounts for the most admissions at nearly 19%.

Feature: Visitors¶

In [32]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'Visitors with Patient')
Out[32]:
Visitors with Patient Count Percentage
0 2 204716 40.94%
1 4 171986 34.40%
2 3 53474 10.69%
3 6 25930 5.19%
4 5 12495 2.50%
5 8 8633 1.73%
6 7 3922 0.78%
7 9 2909 0.58%
8 10 2873 0.57%
9 12 2729 0.55%
10 1 2111 0.42%
11 14 2107 0.42%
12 11 1647 0.33%
13 13 851 0.17%
14 0 822 0.16%
15 15 714 0.14%
16 16 412 0.08%
17 24 313 0.06%
18 19 209 0.04%
19 20 203 0.04%
20 22 190 0.04%
21 18 168 0.03%
22 17 146 0.03%
23 32 127 0.03%
24 23 126 0.03%
25 21 110 0.02%
26 25 45 0.01%
27 30 32 0.01%

The majority of patients (just over 86%) have between 2 and 4 visitors.

Feature: Insurance¶

In [33]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'Insurance')
Out[33]:
Insurance Count Percentage
0 Yes 392960 78.59%
1 No 107040 21.41%

Nearly 79% of patients have health insurance.

Feature: Admission Deposit¶

In [34]:
# Descriptive statistics for the admission deposit
data['Admission_Deposit'].describe()
Out[34]:
Admission_Deposit
count 500000.000000
mean 4722.315734
std 1047.324220
min 1654.005148
25% 4071.714532
50% 4627.003792
75% 5091.612717
max 10104.726390

The mean deposit is approximately $4,722.

Feature: Stay (in days)¶

In [35]:
# Find unique values, counts and percentages associated with this feature
calculate_percentage(data, 'Stay (in days)')
Out[35]:
Stay (in days) Count Percentage
0 9 124110 24.82%
1 8 113462 22.69%
2 10 53854 10.77%
3 7 42292 8.46%
4 6 21559 4.31%
5 11 13577 2.72%
6 5 9794 1.96%
7 22 8893 1.78%
8 23 8713 1.74%
9 24 8299 1.66%
10 21 7509 1.50%
11 25 7128 1.43%
12 20 6112 1.22%
13 26 5777 1.16%
14 19 5157 1.03%
15 27 4625 0.92%
16 18 4603 0.92%
17 28 4203 0.84%
18 29 3929 0.79%
19 30 3764 0.75%
20 32 3761 0.75%
21 17 3735 0.75%
22 31 3631 0.73%
23 12 3431 0.69%
24 33 3411 0.68%
25 34 3129 0.63%
26 35 2768 0.55%
27 16 2756 0.55%
28 36 2373 0.47%
29 37 1933 0.39%
30 15 1901 0.38%
31 13 1650 0.33%
32 38 1465 0.29%
33 14 1374 0.27%
34 4 1289 0.26%
35 39 1122 0.22%
36 40 868 0.17%
37 41 648 0.13%
38 42 433 0.09%
39 43 328 0.07%
40 44 218 0.04%
41 45 172 0.03%
42 46 119 0.02%
43 47 55 0.01%
44 48 36 0.01%
45 3 18 0.00%
46 49 11 0.00%
47 50 3 0.00%
48 51 2 0.00%
In [36]:
# Bin the 'Stay (in days)' column, fixing the bin labels
data['Stay_bins'] = pd.cut(data['Stay (in days)'],
                           bins=[0, 2, 5, 7, 10, 13, 16, 19, 50, np.inf],
                           labels=['0-2', '3-5', '6-7', '8-10', '11-13', '14-16', '17-19', '20-50', '51+'],
                           include_lowest=True)

# Calculate the percentage of patients in each bin
bin_percentages = data['Stay_bins'].value_counts(normalize=True) * 100

# Get the counts of patients in each bin
bin_counts = data['Stay_bins'].value_counts()

# Concatenate the counts and percentages
bin_counts = pd.concat([bin_counts, bin_percentages], axis=1)
bin_counts.columns = ['Count', 'Percentage']

# Add percentage symbol to Percentage column
bin_counts['Percentage'] = bin_counts['Percentage'].apply(lambda x: f'{x:.2f}%')

# Order the data by labels
desired_order = ['0-2', '3-5', '6-7', '8-10', '11-13', '14-16', '17-19', '20-50', '51+']
bin_counts = bin_counts.reindex(desired_order)

# Print the results
print(bin_counts)
            Count Percentage
Stay_bins                   
0-2             0      0.00%
3-5         11101      2.22%
6-7         63851     12.77%
8-10       291426     58.29%
11-13       18658      3.73%
14-16        6031      1.21%
17-19       13495      2.70%
20-50       95436     19.09%
51+             2      0.00%

The majority of patients (just over 58%) stay between 8 and 10 days, and just over 73% stay 10 days or less.
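The cumulative share can be computed directly as a boolean mean; a sketch on toy values (the notebook would apply the same expression to data['Stay (in days)']):

```python
import pandas as pd

# Toy stay lengths in days (illustration only)
stay = pd.Series([8, 9, 9, 10, 7, 22, 25, 8, 9, 10])
share_leq_10 = (stay <= 10).mean()  # fraction of stays of 10 days or less
print(f"{share_leq_10:.0%}")  # 80%
```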

Observations

  • Data cleanup:

    • Column naming is inconsistent (mixed spaces, underscores, and capitalization), hinting at sloppy design; it may make sense to standardize the names.
    • There are 151,888 values expressed as 'NaN' in the health_conditions field that will need to be investigated and addressed.
    • The statistical description suggests there are patient ids with negative values, but patientid is only an identifier, will not help further analysis, and will be dropped.
  • Facility:

    • At times the minimum number of available beds is 0, so capacity might need to be examined.
    • At times the number of available staff is 0, so staffing might need to be examined.
    • There is a wide range in the number of visitors per patient; in-room guest accommodations, or limits on the number of visitors, might need to be examined.
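One way to standardize the inconsistent column naming would be a snake_case rename; a sketch on the dataset's actual column names (to_snake is a helper written here for illustration, not part of any library):

```python
import re
import pandas as pd

# A few of the dataset's column names (values omitted; the names are the point)
df = pd.DataFrame(columns=["Available Extra Rooms in Hospital",
                           "Ward_Facility_Code", "doctor_name", "Stay (in days)"])

def to_snake(name: str) -> str:
    """Lowercase, replace runs of spaces/parentheses with '_', trim extras."""
    return re.sub(r"[()\s]+", "_", name.strip().lower()).strip("_")

df = df.rename(columns=to_snake)
print(df.columns.tolist())
# ['available_extra_rooms_in_hospital', 'ward_facility_code', 'doctor_name', 'stay_in_days']
```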
In [37]:
data.columns
Out[37]:
Index(['Available Extra Rooms in Hospital', 'Department', 'Ward_Facility_Code',
       'doctor_name', 'staff_available', 'patientid', 'Age', 'gender',
       'Type of Admission', 'Severity of Illness', 'health_conditions',
       'Visitors with Patient', 'Insurance', 'Admission_Deposit',
       'Stay (in days)', 'Stay_bins'],
      dtype='object')
In [38]:
# We MUST drop the "Stay_bins" as this is something we appended for LOS determinations - else it will mess with our decision tree
#data.drop('Stay_bins', axis=1, inplace=True)
#data.drop('Stay_bins_3-5', axis=1, inplace=True)
#data.drop('Stay_bins_6-7', axis=1, inplace=True)
#data.drop('Stay_bins_8-10', axis=1, inplace=True)
#data.drop('Stay_bins_11-13', axis=1, inplace=True)
#data.drop('Stay_bins_14-16', axis=1, inplace=True)
#data.drop('Stay_bins_17-19', axis=1, inplace=True)
#data.drop('Stay_bins_20-50', axis=1, inplace=True)
#data.drop('Stay_bins_51+', axis=1, inplace=True)

Data Cleanup¶

Health Conditions:

In [39]:
# Handle the missing values in the health_conditions column
data['health_conditions'].fillna('None', inplace=True)

# Make sure all missing values have been replaced
data['health_conditions'].isnull().sum()

data['health_conditions'].unique()
Out[39]:
array(['Diabetes', 'Heart disease', 'None', 'Other', 'Asthama',
       'High Blood Pressure'], dtype=object)
In [40]:
# Let's confirm the missing value counts were corrected
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 16 columns):
 #   Column                             Non-Null Count   Dtype   
---  ------                             --------------   -----   
 0   Available Extra Rooms in Hospital  500000 non-null  int64   
 1   Department                         500000 non-null  object  
 2   Ward_Facility_Code                 500000 non-null  object  
 3   doctor_name                        500000 non-null  object  
 4   staff_available                    500000 non-null  int64   
 5   patientid                          500000 non-null  int64   
 6   Age                                500000 non-null  object  
 7   gender                             500000 non-null  object  
 8   Type of Admission                  500000 non-null  object  
 9   Severity of Illness                500000 non-null  object  
 10  health_conditions                  500000 non-null  object  
 11  Visitors with Patient              500000 non-null  int64   
 12  Insurance                          500000 non-null  object  
 13  Admission_Deposit                  500000 non-null  float64 
 14  Stay (in days)                     500000 non-null  int64   
 15  Stay_bins                          500000 non-null  category
dtypes: category(1), float64(1), int64(5), object(9)
memory usage: 57.7+ MB

PatientID:

In [41]:
# The statistical description suggests there might be patient ids with negative values, so data cleanup may be needed
# Find any patient ID that is <=0
data[data['patientid'] <= 0]
Out[41]:
Available Extra Rooms in Hospital Department Ward_Facility_Code doctor_name staff_available patientid Age gender Type of Admission Severity of Illness health_conditions Visitors with Patient Insurance Admission_Deposit Stay (in days) Stay_bins
2238 1 gynecology D Dr Sarah 0 -914 11-20 Female Trauma Moderate High Blood Pressure 6 Yes 6778.130714 8 8-10
2977 2 anesthesia A Dr Mark 6 -842 51-60 Male Trauma Moderate None 2 Yes 4814.433104 41 20-50
4037 4 anesthesia A Dr Mark 2 -1046 21-30 Male Trauma Extreme Heart disease 4 Yes 4440.225183 19 17-19
5354 3 gynecology F Dr Sophia 9 -928 31-40 Female Trauma Moderate Heart disease 3 Yes 8376.323315 8 8-10
5682 2 gynecology D Dr Nathan 8 -61 41-50 Female Emergency Minor None 4 Yes 4798.440801 8 8-10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
496609 3 gynecology B Dr Nathan 2 -751 31-40 Female Emergency Minor Other 4 No 3674.268067 8 8-10
497094 5 gynecology B Dr Sarah 9 -569 41-50 Female Trauma Extreme Asthama 4 Yes 3172.891233 9 8-10
497200 6 gynecology D Dr Sarah 3 -588 11-20 Female Emergency Minor High Blood Pressure 4 No 4581.742646 7 6-7
498190 2 anesthesia E Dr John 10 -55 81-90 Female Trauma Moderate Diabetes 2 No 4624.389375 32 20-50
498341 4 gynecology F Dr Sarah 8 -739 11-20 Female Emergency Moderate High Blood Pressure 2 Yes 4042.058382 9 8-10

892 rows × 16 columns

Since patientid is a database table key, it will not be useful in our analysis and should be dropped.

In [42]:
# Drop the patientid column
data.drop('patientid', axis=1, inplace=True)

data.head(2)
Out[42]:
Available Extra Rooms in Hospital Department Ward_Facility_Code doctor_name staff_available Age gender Type of Admission Severity of Illness health_conditions Visitors with Patient Insurance Admission_Deposit Stay (in days) Stay_bins
0 4 gynecology D Dr Sophia 0 41-50 Female Trauma Extreme Diabetes 4 Yes 2966.408696 8 8-10
1 4 gynecology B Dr Sophia 2 31-40 Female Trauma Minor Heart disease 2 No 3554.835677 9 8-10

Univariate Analysis - Numerical Features¶

In [43]:
# Function to plot a boxplot and a histogram along the same scale

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows = 2,      # Number of rows of the subplot grid = 2
        sharex = True,  # x-axis will be shared among all subplots
        gridspec_kw = {"height_ratios": (0.25, 0.75)},
        figsize = figsize,
    )                   # Creating the 2 subplots
    sns.boxplot(data = data, x = feature, ax = ax_box2, showmeans = True, color = "violet"
    )                   # Boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data = data, x = feature, kde = kde, ax = ax_hist2, bins = bins, palette = "winter"
    ) if bins else sns.histplot(
        data = data, x = feature, kde = kde, ax = ax_hist2
    )                   # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color = "green", linestyle = "--"
    )                   # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color = "black", linestyle = "-"
    )                   # Add median to the histogram

Feature: Available Rooms¶

In [44]:
histogram_boxplot(data, "Available Extra Rooms in Hospital", kde = True, bins = 24)
[Figure: boxplot and histogram of Available Extra Rooms in Hospital]

Observations:

  • The distribution is right-skewed: the facility most often has between 2 and 4 extra rooms available, and values above 10 are rare.
Feature: Available Staff¶

In [45]:
histogram_boxplot(data, "staff_available", kde = True, bins = 5)
[Figure: boxplot and histogram of staff_available]

Observations:

  • The data looks to be fairly evenly distributed about the mean, with no outliers.

Feature: Visitors¶

In [46]:
histogram_boxplot(data, "Visitors with Patient", kde = True, bins = 20)
[Figure: boxplot and histogram of Visitors with Patient]

Observations

  • The majority of patients have 5 or fewer visitors.

Feature: Admission Deposit¶

In [47]:
histogram_boxplot(data, "Admission_Deposit", kde = True, bins = 20)
[Figure: boxplot and histogram of Admission_Deposit]

Observations:

  • The majority of admission deposits are clustered around 4,000 to 5,500, with a few exceptionally high values acting as outliers, skewing the distribution slightly to the right.

Feature: Length of Stay¶

In [48]:
histogram_boxplot(data, "Stay (in days)", kde = True, bins = 30)
[Figure: boxplot and histogram of Stay (in days)]

Observations:

  • Fewer patients are staying more than 10 days in the hospital and very few stay for more than 40 days. This might be because the majority of patients are admitted for moderate or minor illnesses.
  • The peak of the distribution shows that most of the patients stay for 8-9 days in the hospital.

Univariate Analysis - Categorical Features¶

In [49]:
# Define categorical features
cat_cols = ['Department', 'Ward_Facility_Code', 'doctor_name', 'Age', 'gender', 'Type of Admission', 'Severity of Illness', 'health_conditions', 'Insurance']

# Printing the % sub categories of each category
for i in cat_cols:
    print(data[i].value_counts(normalize=True))
    print('*'*40)
Department
gynecology            0.686956
radiotherapy          0.168630
anesthesia            0.088358
TB & Chest disease    0.045780
surgery               0.010276
Name: proportion, dtype: float64
****************************************
Ward_Facility_Code
F    0.241076
D    0.238110
B    0.207770
E    0.190748
A    0.093102
C    0.029194
Name: proportion, dtype: float64
****************************************
doctor_name
Dr Sarah     0.199192
Dr Olivia    0.196704
Dr Sophia    0.149506
Dr Nathan    0.141554
Dr Sam       0.111422
Dr John      0.102526
Dr Mark      0.088820
Dr Isaac     0.006718
Dr Simon     0.003558
Name: proportion, dtype: float64
****************************************
Age
21-30     0.319586
31-40     0.266746
41-50     0.160812
11-20     0.093072
61-70     0.053112
51-60     0.043436
71-80     0.037406
81-90     0.016362
0-10      0.006736
91-100    0.002732
Name: proportion, dtype: float64
****************************************
gender
Female    0.74162
Male      0.20696
Other     0.05142
Name: proportion, dtype: float64
****************************************
Type of Admission
Trauma       0.621072
Emergency    0.271568
Urgent       0.107360
Name: proportion, dtype: float64
****************************************
Severity of Illness
Moderate    0.560394
Minor       0.263074
Extreme     0.176532
Name: proportion, dtype: float64
****************************************
health_conditions
None                   0.303776
Other                  0.188822
High Blood Pressure    0.158804
Diabetes               0.147288
Asthama                0.131028
Heart disease          0.070282
Name: proportion, dtype: float64
****************************************
Insurance
Yes    0.78592
No     0.21408
Name: proportion, dtype: float64
****************************************

Observations:

Here's an analysis of the categorical data from this facility:

  1. Doctor Distribution:

    • The top three doctors—Dr. Sarah, Dr. Olivia, and Dr. Sophia—account for more than 50% of the cases, indicating that they handle the majority of patients.
    • The remaining doctors see progressively fewer patients, with Dr. Isaac and Dr. Simon managing the least, suggesting a strong concentration of patient assignments among the top doctors.
  2. Age Distribution:

    • Most patients (over 58%) fall within the 21-40 age range, indicating that this is the primary age group served by the facility.
    • Older age groups (61 and above) make up a much smaller proportion of the patient base, collectively accounting for roughly 11%.
    • The very young (0-10) and elderly (91-100) are rare in this facility, suggesting it may focus on adult and middle-aged patients.
  3. Gender Distribution:

    • The facility predominantly treats female patients, who account for about 74% of the total, while male patients make up around 21%.
    • A small proportion (around 5%) identify as "Other", highlighting inclusivity in gender categories, though they form a minor segment of the patient base.
  4. Type of Admission:

    • A significant majority of admissions (62%) are trauma-related, suggesting that the facility specializes in or frequently handles trauma cases.
    • Emergency cases make up about 27% of admissions, while urgent, non-emergency cases are the least common at around 11%.
  5. Severity of Illness:

    • Over half the cases (56%) are categorized as "Moderate" in terms of severity, indicating that the facility deals with a high number of patients with moderately serious conditions.
    • Minor cases make up about 26%, while extreme cases account for around 18%. This suggests the facility handles a wide range of illness severities, but most cases are not extremely severe.
  6. Health Conditions:

    • About 30% of patients report no health conditions, which could indicate either a focus on acute or trauma care where pre-existing conditions are less common, or that some patients are generally healthy before admission.
    • The most common conditions among patients are "Other" unspecified conditions (18.9%), followed by high blood pressure (15.9%) and diabetes (14.7%).
    • Asthma (13.1%) and heart disease (7%) are less common but still notable conditions in the patient population.
  7. Insurance Status:

    • The vast majority of patients (78.6%) have insurance, which suggests that the facility primarily caters to insured individuals.
    • A smaller proportion of patients (21.4%) are uninsured, potentially indicating a reliance on insurance coverage for treatment or a clientele that is mostly able to afford health insurance.

Summary¶

This facility seems to handle a large number of trauma and emergency cases, mostly for moderately severe illnesses. The patient base is predominantly female and within the 21-40 age group, and most patients have insurance. While many patients have no major pre-existing conditions, high blood pressure, diabetes, and asthma are common among those who do. The workload is concentrated among a few doctors, with most patients treated by a small number of physicians.

Multivariate Analysis¶

Correlation Matrix of Numerical Features¶

In [50]:
# Correlation matrix of numerical variables
# Select only numerical columns
numerical_data = data.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix
numerical_data.corr()
Out[50]:
Available Extra Rooms in Hospital staff_available Visitors with Patient Admission_Deposit Stay (in days)
Available Extra Rooms in Hospital 1.000000 -0.001784 0.070459 -0.050127 -0.019219
staff_available -0.001784 1.000000 0.000578 0.000763 0.007398
Visitors with Patient 0.070459 0.000578 1.000000 -0.069043 0.027302
Admission_Deposit -0.050127 0.000763 -0.069043 1.000000 0.044203
Stay (in days) -0.019219 0.007398 0.027302 0.044203 1.000000
In [51]:
# Plot the correlation matrix

corr_matrix = numerical_data.corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='vlag')
plt.title('Correlation Matrix of Numerical Variables')
plt.show()

Observations:

Available Extra Rooms in Hospital:

  • Has a very slight negative correlation with both Admission_Deposit (-0.05) and Stay in days (-0.019), meaning that an increase in available rooms doesn't significantly influence these variables.
  • It has a weak positive correlation with Visitors with Patient (0.07), suggesting a slight relationship between available rooms and the number of visitors.

Staff Available:

  • Staff_Available shows almost no correlation with the other variables. The highest correlation is with Stay in days (0.0074), but this is negligible.
  • Overall, staff available appears to be mostly independent of the other variables in this matrix.

Visitors with Patient:

  • Shows a small negative correlation with Admission_Deposit (-0.069), meaning that more visitors might be associated with lower deposits, but this relationship is very weak.
  • The other correlations, like with Stay in days (0.027), are close to zero, indicating almost no significant relationship.

Admission Deposit:

  • There's a small positive correlation between Admission_Deposit and Stay in days (0.044). This suggests a slight tendency for higher admission deposits to be associated with longer stays, but it's not a strong relationship.
  • The other correlations are either weak or near zero, implying little to no direct relationship with the other factors.

Stay in Days:

  • Has a weak positive correlation with Admission_Deposit (0.044) and Visitors with Patient (0.027), though these relationships are minimal.
  • It has almost no significant correlation with other variables, like Available Extra Rooms in Hospital (-0.019).

Summary:

  • The matrix shows generally weak correlations among the variables, with most values close to 0.
  • There are no strong relationships between the numerical variables, meaning these variables are largely independent of each other in terms of linear association.
  • The highest correlation, 0.07, between Available Extra Rooms and Visitors with Patient, is still very weak, indicating that no two variables are highly linearly related in this dataset.
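The claim about the strongest pair can be verified programmatically by ranking every distinct pair of variables by absolute correlation. A minimal sketch on a synthetic stand-in frame (in the notebook, the same code would be run on `numerical_data`; the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's numerical_data frame
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(500, 4)),
                  columns=['rooms', 'staff', 'visitors', 'deposit'])

# Keep only the upper triangle so each pair appears once, then rank by |r|
corr = df.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().abs().sort_values(ascending=False)
print(pairs)
```

The first row of `pairs` is the most strongly (linearly) related pair, which saves reading the full heatmap when the number of numerical features grows.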
In [52]:
# Function to plot stacked bar plots

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count  = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1   = pd.crosstab(data[predictor], data[target], margins = True).sort_values(
        by = sorter, ascending = False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize = "index").sort_values(
        by = sorter, ascending = False
    )
    tab.plot(kind = "bar", stacked = True, figsize = (count + 1, 5))
    # Place the legend outside the plot area
    plt.legend(loc = "upper left", bbox_to_anchor = (1, 1))
    plt.show()
In [53]:
# Function to calculate and plot average LOS by feature
def plot_avg_stay(data, group_by_column, target_column):
  """
  Plots the average stay for a target column grouped by a specified column.

  Args:
    data: The pandas DataFrame containing the data.
    group_by_column: The column to group by.
    target_column: The target column for calculating average stay.
  """
  avg_stay = data.groupby(group_by_column)[target_column].mean().sort_values(ascending=True)
  sns.barplot(y=group_by_column, x=target_column, data=data, order=avg_stay.index)
  plt.show()

# For cases where the plot needs to have its axis swapped
def plot_avg_stay2(data, group_by_column, target_column):
  """
  Plots the average stay for a target column grouped by a specified column.

  Args:
    data: The pandas DataFrame containing the data.
    group_by_column: The column to group by.
    target_column: The target column for calculating average stay.
  """
  avg_stay = data.groupby(group_by_column)[target_column].mean().sort_values(ascending=True)
  sns.barplot(x=group_by_column, y=target_column, data=data, order=avg_stay.index)
  plt.show()

Target Variable: Length of Stay¶

Length of Stay by Department¶

In [54]:
plot_avg_stay(data, 'Department', 'Stay (in days)')
In [55]:
stacked_barplot(data, 'Department', 'Ward_Facility_Code')
Ward_Facility_Code      A       B      C       D      E       F     All
Department                                                             
All                 46551  103885  14597  119055  95374  120538  500000
radiotherapy        21093       0   9079       0  54143       0   84315
anesthesia          15611       0   4199       0  24369       0   44179
TB & Chest disease   4709       0   1319       0  16862       0   22890
gynecology              0  103885      0  119055      0  120538  343478
surgery              5138       0      0       0      0       0    5138
------------------------------------------------------------------------------------------------------------------------
In [56]:
stacked_barplot(data, 'Department', 'doctor_name')
doctor_name         Dr Isaac  Dr John  Dr Mark  Dr Nathan  Dr Olivia  Dr Sam  \
Department                                                                     
surgery                 3359        0        0          0          0       0   
All                     3359    51263    44410      70777      98352   55711   
anesthesia                 0    14920    29259          0          0       0   
TB & Chest disease         0     7739    15151          0          0       0   
radiotherapy               0    28604        0          0          0   55711   
gynecology                 0        0        0      70777      98352       0   

doctor_name         Dr Sarah  Dr Simon  Dr Sophia     All  
Department                                                 
surgery                    0      1779          0    5138  
All                    99596      1779      74753  500000  
anesthesia                 0         0          0   44179  
TB & Chest disease         0         0          0   22890  
radiotherapy               0         0          0   84315  
gynecology             99596         0      74753  343478  
------------------------------------------------------------------------------------------------------------------------

Length of Stay by Ward¶

In [57]:
plot_avg_stay(data, 'Ward_Facility_Code', 'Stay (in days)')
In [58]:
stacked_barplot(data, 'Ward_Facility_Code', 'Department')
Department          TB & Chest disease  anesthesia  gynecology  radiotherapy  \
Ward_Facility_Code                                                             
A                                 4709       15611           0         21093   
All                              22890       44179      343478         84315   
B                                    0           0      103885             0   
C                                 1319        4199           0          9079   
D                                    0           0      119055             0   
E                                16862       24369           0         54143   
F                                    0           0      120538             0   

Department          surgery     All  
Ward_Facility_Code                   
A                      5138   46551  
All                    5138  500000  
B                         0  103885  
C                         0   14597  
D                         0  119055  
E                         0   95374  
F                         0  120538  
------------------------------------------------------------------------------------------------------------------------
In [59]:
stacked_barplot(data, 'Ward_Facility_Code', 'Severity of Illness')
Severity of Illness  Extreme   Minor  Moderate     All
Ward_Facility_Code                                    
All                    88266  131537    280197  500000
D                      29549   27220     62286  119055
B                      24222   23579     56084  103885
A                      13662    7877     25012   46551
E                      11488   22254     61632   95374
F                       5842   47594     67102  120538
C                       3503    3013      8081   14597
------------------------------------------------------------------------------------------------------------------------

Observations:

  • Wards B, D and F all have lower LOS and are used exclusively for the Gynecology department.
  • Surgical patients are admitted to ward A only.
  • Ward A has the highest proportion of extreme cases (roughly 29% of its admissions), even though Ward D has more extreme cases in absolute terms.

Length of Stay by Doctor¶

In [60]:
plot_avg_stay(data, 'doctor_name', 'Stay (in days)')
In [61]:
stacked_barplot(data, 'doctor_name', 'Department')
Department   TB & Chest disease  anesthesia  gynecology  radiotherapy  \
doctor_name                                                             
All                       22890       44179      343478         84315   
Dr Isaac                      0           0           0             0   
Dr Simon                      0           0           0             0   
Dr John                    7739       14920           0         28604   
Dr Mark                   15151       29259           0             0   
Dr Nathan                     0           0       70777             0   
Dr Sam                        0           0           0         55711   
Dr Olivia                     0           0       98352             0   
Dr Sarah                      0           0       99596             0   
Dr Sophia                     0           0       74753             0   

Department   surgery     All  
doctor_name                   
All             5138  500000  
Dr Isaac        3359    3359  
Dr Simon        1779    1779  
Dr John            0   51263  
Dr Mark            0   44410  
Dr Nathan          0   70777  
Dr Sam             0   55711  
Dr Olivia          0   98352  
Dr Sarah           0   99596  
Dr Sophia          0   74753  
------------------------------------------------------------------------------------------------------------------------
In [62]:
stacked_barplot(data, 'doctor_name', 'Ward_Facility_Code')
Ward_Facility_Code      A       B      C       D      E       F     All
doctor_name                                                            
All                 46551  103885  14597  119055  95374  120538  500000
Dr Sam              13889       0   6029       0  35793       0   55711
Dr John             14041       0   4888       0  32334       0   51263
Dr Mark             13483       0   3680       0  27247       0   44410
Dr Isaac             3359       0      0       0      0       0    3359
Dr Nathan               0   17317      0   27035      0   26425   70777
Dr Olivia               0   33761      0   31598      0   32993   98352
Dr Sarah                0   29983      0   34565      0   35048   99596
Dr Simon             1779       0      0       0      0       0    1779
Dr Sophia               0   22824      0   25857      0   26072   74753
------------------------------------------------------------------------------------------------------------------------

Observations:

Length of Stay by Available Staff¶

In [63]:
plot_avg_stay(data, 'staff_available', 'Stay (in days)')

Observations:

Length of Stay by Age¶

In [64]:
# Sort so the age ranges appear in descending order on the y-axis
data.sort_values(by='Age', ascending=False, inplace=True)
sns.barplot(y='Age', x='Stay (in days)', data=data)
plt.show()

Observations:

Length of stay by Gender¶

In [65]:
plot_avg_stay(data, 'gender', 'Stay (in days)')

Observations:

Length of stay by Type of Admission¶

In [66]:
plot_avg_stay(data, 'Type of Admission', 'Stay (in days)')

Observations:

Length of stay by Severity of Illness¶

In [67]:
plot_avg_stay(data, 'Severity of Illness', 'Stay (in days)')

Observations:

Length of stay by Health Conditions¶

In [68]:
plot_avg_stay(data, 'health_conditions', 'Stay (in days)')

Observations:

Length of stay by Visitors¶

In [69]:
plot_avg_stay(data, 'Visitors with Patient', 'Stay (in days)')

Observations:

Length of stay by Insurance¶

In [70]:
plot_avg_stay(data, 'Insurance', 'Stay (in days)')

Observations:

Length of stay by Admission Deposit¶

In [71]:
# Admission_Deposit is continuous, so plot_avg_stay would treat every distinct
# deposit value as its own category - far too many bars to compute or read
#plot_avg_stay(data, 'Admission_Deposit', 'Stay (in days)')
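A cheaper alternative is to bucket the deposit first and then average the length of stay per bucket. A sketch using `pd.cut`, shown on a synthetic stand-in for `data` (only the two relevant columns are mocked up, with the notebook's real column names):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's `data` frame
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'Admission_Deposit': rng.uniform(2000, 10000, size=1000),
    'Stay (in days)': rng.integers(1, 41, size=1000),
})

# Bucket the deposit into 10 equal-width bins and average the LOS per bucket
demo['Deposit_bin'] = pd.cut(demo['Admission_Deposit'], bins=10)
avg_stay = demo.groupby('Deposit_bin', observed=True)['Stay (in days)'].mean()
print(avg_stay)
```

Calling `avg_stay.plot(kind='barh')` would then give a bar chart comparable to the other LOS plots, with only 10 bars regardless of how many distinct deposit values exist.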

Observations:

Data Preparation¶

Encode Categorical Data¶

In [72]:
# Perform one-hot encoding
data = pd.get_dummies(
                      data,
                      columns = data.select_dtypes(include = ['object', 'category']).columns,
                      drop_first = True,
                      dtype = int
                      )
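The `drop_first=True` argument drops the first category of each feature, which becomes the implicit baseline; this avoids creating perfectly collinear dummy columns. A tiny illustration on a toy frame mirroring the Insurance feature:

```python
import pandas as pd

# A toy frame with one categorical column, mirroring the Insurance feature
toy = pd.DataFrame({'Insurance': ['Yes', 'No', 'Yes', 'Yes']})

# drop_first=True drops the first category ('No'), so a single indicator
# column remains instead of two redundant ones
encoded = pd.get_dummies(toy, columns=['Insurance'], drop_first=True, dtype=int)
print(encoded)
```

A row with `Insurance_Yes = 0` is therefore read as "Insurance = No" rather than as missing information.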

Split Data¶

In [73]:
x = data.drop('Stay (in days)', axis=1)
y = data['Stay (in days)']
In [74]:
# Import additional libraries
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
In [75]:
# Perform the data split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True, random_state=1)
In [76]:
# Verify the split
x_train.shape, x_test.shape, y_train.shape, y_test.shape
Out[76]:
((400000, 50), (100000, 50), (400000,), (100000,))

Performance Metrics¶

In [77]:
# Define a function to calculate the adjusted r2
def adj_r2(predictors, target, predictions):
    r2 = r2_score(target, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]

    return 1-((1-r2)*(n-1)/(n-k-1))
In [78]:
# Define a function to calculate the mean absolute percentage error (MAPE)
def mape_score(target, predictions):
    return np.mean(np.abs((target - predictions) / target)) * 100
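Note that `mape_score` divides by the target, so it would blow up on a target of zero. That cannot happen here (every admitted patient stays at least one day), but a guarded variant is good practice when reusing the metric. A sketch with a hypothetical helper `safe_mape` (not part of the notebook):

```python
import numpy as np

def safe_mape(target, predictions, eps=1e-9):
    """MAPE that ignores (near-)zero targets instead of dividing by zero."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    mask = np.abs(target) > eps  # keep only rows where the ratio is defined
    return np.mean(np.abs((target[mask] - predictions[mask]) / target[mask])) * 100

demo = safe_mape([10, 0, 5], [9, 3, 6])  # the zero target is skipped -> 15.0
```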
In [79]:
def model_performance_regression(model, predictors, target):
    """Compute RMSE, MAE, R2, adjusted R2, and MAPE for a fitted regression model."""
    pred        = model.predict(predictors)
    r2          = r2_score(target, pred)
    adjusted_r2 = adj_r2(predictors, target, pred)
    rmse        = np.sqrt(mean_squared_error(target, pred))
    mae         = mean_absolute_error(target, pred)
    mape        = mape_score(target, pred)

    df_perf = pd.DataFrame({
                            'RMSE': rmse,
                            'MAE': mae,
                            'R2': r2,
                            'Adjusted R2': adjusted_r2,
                            'MAPE': mape
                          }, index=[0])
    return df_perf

Decision Tree Regression¶

Decision trees are a popular method in machine learning for both classification and regression tasks. They are known for their interpretability and simplicity: the model predicts the value of the target variable by learning a hierarchy of simple if-then splitting rules from the input features.

In [80]:
# Create a regressor
dt_regressor = DecisionTreeRegressor(random_state=1)
dt_regressor.fit(x_train, y_train)
Out[80]:
DecisionTreeRegressor(random_state=1)
In [81]:
# Evaluate the regressor
dt_regressor_perf_test = model_performance_regression(dt_regressor, x_test, y_test)
dt_regressor_perf_test
Out[81]:
RMSE MAE R2 Adjusted R2 MAPE
0 1.588745 0.85419 0.959936 0.959916 6.213695
In [82]:
# Let's visualize the tree
from sklearn import tree

features = list(x.columns)
features
Out[82]:
['Available Extra Rooms in Hospital',
 'staff_available',
 'Visitors with Patient',
 'Admission_Deposit',
 'Department_anesthesia',
 'Department_gynecology',
 'Department_radiotherapy',
 'Department_surgery',
 'Ward_Facility_Code_B',
 'Ward_Facility_Code_C',
 'Ward_Facility_Code_D',
 'Ward_Facility_Code_E',
 'Ward_Facility_Code_F',
 'doctor_name_Dr John',
 'doctor_name_Dr Mark',
 'doctor_name_Dr Nathan',
 'doctor_name_Dr Olivia',
 'doctor_name_Dr Sam',
 'doctor_name_Dr Sarah',
 'doctor_name_Dr Simon',
 'doctor_name_Dr Sophia',
 'Age_11-20',
 'Age_21-30',
 'Age_31-40',
 'Age_41-50',
 'Age_51-60',
 'Age_61-70',
 'Age_71-80',
 'Age_81-90',
 'Age_91-100',
 'gender_Male',
 'gender_Other',
 'Type of Admission_Trauma',
 'Type of Admission_Urgent',
 'Severity of Illness_Minor',
 'Severity of Illness_Moderate',
 'health_conditions_Diabetes',
 'health_conditions_Heart disease',
 'health_conditions_High Blood Pressure',
 'health_conditions_None',
 'health_conditions_Other',
 'Insurance_Yes',
 'Stay_bins_3-5',
 'Stay_bins_6-7',
 'Stay_bins_8-10',
 'Stay_bins_11-13',
 'Stay_bins_14-16',
 'Stay_bins_17-19',
 'Stay_bins_20-50',
 'Stay_bins_51+']
In [83]:
# Train a shallow tree to aid in visualization
dt_regressor_shallow = DecisionTreeRegressor(max_depth=3, random_state=1)
dt_regressor_shallow.fit(x_train, y_train)
Out[83]:
DecisionTreeRegressor(max_depth=3, random_state=1)
In [84]:
# Evaluate the regressor
dt_regressor_shallow_test = model_performance_regression(dt_regressor_shallow, x_test, y_test)
dt_regressor_shallow_test
Out[84]:
RMSE MAE R2 Adjusted R2 MAPE
0 2.117114 1.242283 0.928857 0.928822 10.18492
In [85]:
# Visualize the tree
plt.figure(figsize=(20, 20))
tree.plot_tree(dt_regressor_shallow, feature_names=features, filled=True, fontsize=12, node_ids=True, class_names=True)
plt.show()
In [86]:
# There is another way to visualize this as text
print(tree.export_text(dt_regressor_shallow, feature_names=features, show_weights=True))
|--- Stay_bins_20-50 <= 0.50
|   |--- Stay_bins_17-19 <= 0.50
|   |   |--- Stay_bins_6-7 <= 0.50
|   |   |   |--- value: [8.93]
|   |   |--- Stay_bins_6-7 >  0.50
|   |   |   |--- value: [6.66]
|   |--- Stay_bins_17-19 >  0.50
|   |   |--- Department_radiotherapy <= 0.50
|   |   |   |--- value: [17.94]
|   |   |--- Department_radiotherapy >  0.50
|   |   |   |--- value: [18.37]
|--- Stay_bins_20-50 >  0.50
|   |--- Department_radiotherapy <= 0.50
|   |   |--- Age_31-40 <= 0.50
|   |   |   |--- value: [31.35]
|   |   |--- Age_31-40 >  0.50
|   |   |   |--- value: [20.87]
|   |--- Department_radiotherapy >  0.50
|   |   |--- Visitors with Patient <= 3.50
|   |   |   |--- value: [22.95]
|   |   |--- Visitors with Patient >  3.50
|   |   |   |--- value: [23.61]

Ensemble Learning Methods¶

Bagging (Bootstrap Aggregating)¶

An ensemble learning method that aims to improve the stability and accuracy of machine learning models by combining the predictions of multiple decision trees (10 by default in scikit-learn's BaggingRegressor). It is primarily used to reduce variance and prevent overfitting, especially for high-variance models like decision trees.
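The variance-reduction effect can be seen without any trees at all. A NumPy-only analogy (not BaggingRegressor itself): each "model" below is trained on a bootstrap resample of the data, and averaging many of them produces far more stable estimates than any single one.

```python
import numpy as np

# NumPy-only analogy for bagging: each "model" is just the mean of one
# bootstrap resample; averaging many such models shrinks the variance
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=30)

def one_model():
    # "train" on a bootstrap resample of the data
    return rng.choice(data, size=10, replace=True).mean()

single_models = np.array([one_model() for _ in range(300)])
bagged_models = np.array([np.mean([one_model() for _ in range(50)])
                          for _ in range(300)])

print(single_models.std(), bagged_models.std())  # the bagged spread is far smaller
```

The same mechanism is why the bagged trees below outperform the single decision tree.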

In [87]:
# bagging
bagging_estimator = BaggingRegressor(random_state=1)
bagging_estimator.fit(x_train, y_train)
Out[87]:
BaggingRegressor(random_state=1)
In [88]:
# Calculate the performance
bagging_estimator_perf_test = model_performance_regression(bagging_estimator, x_test, y_test)
bagging_estimator_perf_test
Out[88]:
RMSE MAE R2 Adjusted R2 MAPE
0 1.196142 0.721663 0.977291 0.977279 5.538399

Observations:

  • We see a significant improvement using bagging over a single decision tree, as we would expect.

Random Forest¶

An ensemble learning method that builds multiple decision trees (100 by default) independently on random subsets of data and features, and then averages their predictions to create a more robust and accurate model. The random sampling and feature selection help reduce overfitting and improve generalization.

In [89]:
rf_regressor = RandomForestRegressor(random_state=1)
rf_regressor.fit(x_train, y_train)
Out[89]:
RandomForestRegressor(random_state=1)
In [90]:
# Calculate the performance
rf_regressor_perf_test = model_performance_regression(rf_regressor, x_test, y_test)
rf_regressor_perf_test
Out[90]:
RMSE MAE R2 Adjusted R2 MAPE
0 1.145173 0.696898 0.979185 0.979174 5.372642

Adaptive Boosting (AdaBoost)¶

An ensemble learning method that combines the output of multiple weak learners (usually decision trees with a single split, known as decision stumps) to create a strong predictive model. The key idea behind AdaBoost is to focus more on difficult-to-classify instances by iteratively increasing the weights of misclassified data points during training, so that each successive weak learner concentrates on the examples its predecessors got wrong.
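The reweighting step can be sketched in a few lines of NumPy. This is the classification form of the update, shown only for intuition; AdaBoost.R2, which scikit-learn uses for regression, follows the same reweighting idea with a loss adapted to continuous targets.

```python
import numpy as np

# NumPy-only sketch of one AdaBoost round: misclassified points are
# up-weighted so the next weak learner focuses on them
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, -1])   # one weak learner's predictions
w = np.full(5, 1 / 5)                    # start with uniform sample weights

err = np.sum(w * (y_pred != y_true))     # weighted error rate = 0.4
alpha = 0.5 * np.log((1 - err) / err)    # this learner's vote weight
w = w * np.exp(-alpha * y_true * y_pred) # up-weight the misclassified points
w = w / w.sum()                          # renormalize to sum to 1

print(w)  # the two misclassified points now carry half of the total weight
```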

In [91]:
from sklearn.ensemble import AdaBoostRegressor
In [92]:
ada_regressor = AdaBoostRegressor(random_state=1)
ada_regressor.fit(x_train, y_train)
Out[92]:
AdaBoostRegressor(random_state=1)
In [93]:
# Measure performance
ada_regressor_perf_test = model_performance_regression(ada_regressor, x_test, y_test)
ada_regressor_perf_test
Out[93]:
RMSE MAE R2 Adjusted R2 MAPE
0 2.179979 1.461467 0.92457 0.924532 13.377773

Gradient Boosting¶

A powerful ensemble machine learning technique that combines multiple weak models (typically decision trees) to create a strong predictive model. The key idea behind gradient boosting is to build new models that correct the errors made by the previous models, sequentially refining the overall prediction.

  • It focuses on minimizing a loss function by using the gradients of the loss, which guides how each subsequent model should be constructed.

  • It is widely used for its high accuracy and versatility but requires careful tuning to avoid overfitting and excessive computation time.
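The "fit new models to the previous errors" idea can be demonstrated with squared loss in pure NumPy. In this sketch each "weak learner" is just the mean of the current residuals (a real implementation fits a small tree instead), yet the training error still shrinks every round:

```python
import numpy as np

# NumPy-only sketch of gradient boosting with squared loss
y = np.array([3.0, 7.0, 8.0, 12.0])
pred = np.zeros_like(y)
learning_rate = 0.5

mse_history = []
for _ in range(20):
    residual = y - pred               # negative gradient of the squared loss
    step = residual.mean()            # stand-in for a fitted weak learner
    pred = pred + learning_rate * step
    mse_history.append(np.mean((y - pred) ** 2))

print(mse_history[0], mse_history[-1])  # the error decreases every round
```

The learning rate controls how much of each correction is applied; smaller values need more rounds but generally overfit less.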

In [94]:
from sklearn.ensemble import GradientBoostingRegressor
In [95]:
grad_regressor = GradientBoostingRegressor(random_state=1)
grad_regressor.fit(x_train, y_train)
Out[95]:
GradientBoostingRegressor(random_state=1)
In [96]:
# Measure performance
grad_regressor_perf_test = model_performance_regression(grad_regressor, x_test, y_test)
grad_regressor_perf_test
Out[96]:
RMSE MAE R2 Adjusted R2 MAPE
0 1.499456 0.900673 0.964313 0.964295 6.687052

Extreme Gradient Boosting (XGBoost)¶

A highly optimized and efficient implementation of gradient boosting, specifically designed for speed and performance. It’s widely used for machine learning tasks due to its scalability and high accuracy. XGBoost has gained immense popularity in data science competitions like Kaggle because of its performance advantages over traditional gradient boosting implementations.

  • It is well-suited for both classification and regression tasks and is known for delivering high accuracy.

  • XGBoost incorporates several advanced features like regularization and early stopping, making it more robust and efficient than traditional gradient boosting.

In [97]:
!pip install xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.11/dist-packages (2.1.4)
Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from xgboost) (1.26.4)
Requirement already satisfied: nvidia-nccl-cu12 in /usr/local/lib/python3.11/dist-packages (from xgboost) (2.21.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.11/dist-packages (from xgboost) (1.13.1)
In [98]:
from xgboost import XGBRegressor
In [99]:
# Instantiate the regressor
xgb_regressor = XGBRegressor(random_state=1)
xgb_regressor.fit(x_train, y_train)
Out[99]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=1, ...)
In [100]:
# Measure performance
xgb_regressor_perf_test = model_performance_regression(xgb_regressor, x_test, y_test)
xgb_regressor_perf_test
Out[100]:
       RMSE       MAE       R2  Adjusted R2      MAPE
0  1.272956  0.794048  0.97428     0.974267  6.153615
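The helper `model_performance_regression` is defined earlier in the notebook. As a reference, a minimal sketch of the five metrics it reports (RMSE, MAE, R², adjusted R², MAPE as a percentage) might look like the following; the function name `regression_metrics`, its signature, and the adjusted-R² formula using the number of predictors are assumptions for illustration, not the notebook's actual helper:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred, n_predictors):
    """One-row DataFrame of RMSE, MAE, R2, adjusted R2, and MAPE (%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R2 penalizes R2 for the number of predictors used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return pd.DataFrame({
        "RMSE": [np.sqrt(mean_squared_error(y_true, y_pred))],
        "MAE": [mean_absolute_error(y_true, y_pred)],
        "R2": [r2],
        "Adjusted R2": [adj_r2],
        "MAPE": [np.mean(np.abs((y_true - y_pred) / y_true)) * 100],
    })

print(regression_metrics([2, 4, 6, 8], [2.5, 3.5, 6.5, 7.5], n_predictors=1))
```

Note that MAPE divides by the true values, so it is only well-defined when none of them are zero.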
In [101]:
# Compare all regression models
models_test_comp = pd.concat(
    [
        dt_regressor_perf_test.T,
        bagging_estimator_perf_test.T,
        rf_regressor_perf_test.T,
        ada_regressor_perf_test.T,
        grad_regressor_perf_test.T,
        xgb_regressor_perf_test.T
    ],
    axis=1)

models_test_comp.columns = [
                            'Decision Tree',
                            'Bagging',
                            'Random Forest',
                            'AdaBoost',
                            'Gradient Boosting',
                            'XGBoost'
                            ]
models_test_comp.T
Out[101]:
                       RMSE       MAE        R2  Adjusted R2       MAPE
Decision Tree      1.588745  0.854190  0.959936     0.959916   6.213695
Bagging            1.196142  0.721663  0.977291     0.977279   5.538399
Random Forest      1.145173  0.696898  0.979185     0.979174   5.372642
AdaBoost           2.179979  1.461467  0.924570     0.924532  13.377773
Gradient Boosting  1.499456  0.900673  0.964313     0.964295   6.687052
XGBoost            1.272956  0.794048  0.974280     0.974267   6.153615

We can now choose the best-performing model and try to improve it further by tuning its hyperparameters.

Tuning Random Forest Regressor¶

In [102]:
# Instantiate a random forest regressor
rf_tuned = RandomForestRegressor(random_state=1)

# Specify the hyperparameter grid (note: include the default values so the search
# can't do worse than the baseline). This is computationally costly:
# 3 * 3 * 2 = 18 candidates, each fit 5 times (once per CV fold), so 90 forests
rf_tuned_params = {
                  'n_estimators': [100, 110, 120],
                  'max_depth': [None, 5, 7],
                  'max_features': [0.8, 1],
                  }
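Since an exhaustive grid search fits every combination, a cheaper alternative worth knowing is `RandomizedSearchCV`, which samples only `n_iter` of the combinations. The sketch below uses a small synthetic dataset (via `make_regression`) and deliberately smaller `n_estimators` values as stand-ins for the notebook's `x_train`/`y_train` and grid, purely to keep the example fast:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for x_train / y_train, kept small for speed
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=1)

rand_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=1),
    param_distributions={
        "n_estimators": [25, 50, 75],
        "max_depth": [None, 5, 7],
        "max_features": [0.8, 1.0],
    },
    n_iter=4,                             # fit only 4 of the 18 combinations
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=1,
    n_jobs=-1,
)
rand_search.fit(X, y)
print(rand_search.best_params_)
```

The trade-off is that the sampled search may miss the single best combination, but it often gets close at a fraction of the cost, which matters when the grid grows multiplicatively.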
In [104]:
# Instantiate the grid search (the scoring argument chooses which metric to optimize)
rf_tuned_grid_obj = GridSearchCV(rf_tuned, rf_tuned_params, scoring='neg_mean_squared_error', cv=5, n_jobs=-1, verbose=3)
rf_tuned_grid_obj.fit(x_train, y_train)
Fitting 5 folds for each of 18 candidates, totalling 90 fits
Out[104]:
GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=1), n_jobs=-1,
             param_grid={'max_depth': [None, 5, 7], 'max_features': [0.8, 1],
                         'n_estimators': [100, 110, 120]},
             scoring='neg_mean_squared_error', verbose=3)
RandomForestRegressor(max_features=0.8, n_estimators=120, random_state=1)
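Beyond `best_estimator_`, the fitted search object exposes `cv_results_`, a dict with one entry per candidate that can be ranked with pandas to see how close the runners-up were. A small self-contained sketch (the synthetic data and the tiny grid are stand-ins, not the notebook's actual inputs):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train / y_train
X, y = make_regression(n_samples=150, n_features=4, noise=0.1, random_state=1)

grid = GridSearchCV(
    RandomForestRegressor(random_state=1, n_estimators=50),
    param_grid={"max_depth": [None, 3], "max_features": [0.8, 1.0]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)

# Rank every candidate by mean cross-validated score (rank 1 = best)
results = (pd.DataFrame(grid.cv_results_)
             [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
             .sort_values("rank_test_score"))
print(results)
```

Inspecting `std_test_score` alongside the mean is a quick check that the winning candidate is stably better, not just lucky on one fold.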
In [105]:
# Measure performance
rf_tuned_grid_obj_perf_test = model_performance_regression(rf_tuned_grid_obj, x_test, y_test)
rf_tuned_grid_obj_perf_test
Out[105]:
       RMSE       MAE        R2  Adjusted R2      MAPE
0  1.140207  0.695835  0.979365     0.979354  5.366609
In [106]:
# Use .best_estimator_ to see which hyperparameters worked best, then refit that
# configuration on the training data and evaluate it on the test data
rf_tuned_regressor = rf_tuned_grid_obj.best_estimator_
rf_tuned_regressor.fit(x_train, y_train)
Out[106]:
RandomForestRegressor(max_features=0.8, n_estimators=120, random_state=1)
In [107]:
# Measure performance
rf_tuned_regressor_perf_test = model_performance_regression(rf_tuned_regressor, x_test, y_test)
rf_tuned_regressor_perf_test
Out[107]:
       RMSE       MAE        R2  Adjusted R2      MAPE
0  1.140207  0.695835  0.979365     0.979354  5.366609
In [108]:
# Compare all regression models
models_test_comp = pd.concat(
    [
        dt_regressor_perf_test.T,
        bagging_estimator_perf_test.T,
        rf_regressor_perf_test.T,
        ada_regressor_perf_test.T,
        grad_regressor_perf_test.T,
        xgb_regressor_perf_test.T,
        rf_tuned_regressor_perf_test.T
    ],
    axis=1)

models_test_comp.columns = [
                            'Decision Tree',
                            'Bagging',
                            'Random Forest',
                            'AdaBoost',
                            'Gradient Boosting',
                            'XGBoost',
                            'Random Forest Tuned'
                            ]
models_test_comp.T
Out[108]:
                         RMSE       MAE        R2  Adjusted R2       MAPE
Decision Tree        1.588745  0.854190  0.959936     0.959916   6.213695
Bagging              1.196142  0.721663  0.977291     0.977279   5.538399
Random Forest        1.145173  0.696898  0.979185     0.979174   5.372642
AdaBoost             2.179979  1.461467  0.924570     0.924532  13.377773
Gradient Boosting    1.499456  0.900673  0.964313     0.964295   6.687052
XGBoost              1.272956  0.794048  0.974280     0.974267   6.153615
Random Forest Tuned  1.140207  0.695835  0.979365     0.979354   5.366609
In [109]:
# Let's visualize which features were most important
importances = rf_tuned_regressor.feature_importances_   # Returns an unsorted array of importances
indices = np.argsort(importances)

plt.figure(figsize=(10, 10))
plt.title('Feature Importances')
plt.barh(range(len(features)), importances[indices], color='b', align='center')
plt.yticks(range(len(features)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()                                              # Render the figure
[Figure: horizontal bar chart of relative feature importances for the tuned random forest]
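The same ranking can be read off without a plot by pairing each importance with its feature name in a sorted pandas Series. The sketch below fits a small random forest on synthetic data as a stand-in for the notebook's `rf_tuned_regressor` and `features`:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: 4 features, only 2 of which are informative
X, y = make_regression(n_samples=200, n_features=4, n_informative=2,
                       noise=0.1, random_state=1)
features = ["f0", "f1", "f2", "f3"]

rf = RandomForestRegressor(random_state=1, n_estimators=50).fit(X, y)

# Pair each importance with its feature name and sort, most important first
importance_table = (pd.Series(rf.feature_importances_, index=features)
                      .sort_values(ascending=False))
print(importance_table)
```

Random forest importances are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction.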
In [110]:
# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Guided_Project:_Classification_and_Hypothesis_Testing/Guided Project: Classification and Hypothesis Testing.ipynb"
[NbConvertApp] Converting notebook /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Guided_Project:_Classification_and_Hypothesis_Testing/Guided Project: Classification and Hypothesis Testing.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 25 image(s).
[NbConvertApp] Writing 1989593 bytes to /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Five_-_Classification_and_Hypothesis_Testing/Guided_Project:_Classification_and_Hypothesis_Testing/Guided Project: Classification and Hypothesis Testing.html