Hospital Length of Stay (LOS) Prediction

Context:

Hospital management is a vital area that gained a lot of attention during the COVID-19 pandemic. Inefficient distribution of resources such as beds and ventilators can lead to many complications. This can be mitigated by predicting a patient's length of stay (LOS) at the time of admission. Once the LOS is estimated, the hospital can plan suitable treatment, resources, and staff to reduce the LOS and increase the chances of recovery. Rooms and beds can also be planned accordingly.

HealthPlus hospital has been incurring heavy losses in revenue and life due to its inefficient management system. It has been unable to allocate equipment, beds, and hospital staff fairly. A system that estimates a patient's length of stay (LOS) could solve this problem to a great extent.

Objective:

As a Data Scientist, you have been hired by HealthPlus to analyze the data, find out which factors affect the LOS the most, and build a machine learning model that can predict the LOS of a patient using the data available during admission and after running a few tests. Also, derive useful insights and policy recommendations from the data that can help the hospital improve its health care infrastructure and revenue.

Data Dictionary:

The data contains various information recorded during the time of admission of the patient. It only contains records of patients who were admitted to the hospital. The detailed data dictionary is given below:

  • patientid: Patient ID
  • Age: Range of age of the patient
  • gender: Gender of the patient
  • Type of Admission: Trauma, emergency or urgent
  • Severity of Illness: Extreme, moderate, or minor
  • health_conditions: Any previous health conditions suffered by the patient
  • Visitors with Patient: The number of visitors who accompany the patient
  • Insurance: Does the patient have health insurance or not?
  • Admission_Deposit: The deposit paid by the patient during admission
  • Stay (in days): The number of days that the patient has stayed in the hospital. This is the target variable
  • Available Extra Rooms in Hospital: The number of rooms available during admission
  • Department: The department which will be treating the patient
  • Ward_Facility_Code: The code of the ward facility in which the patient will be admitted
  • doctor_name: The doctor who will be treating the patient
  • staff_available: The number of staff who are not occupied at the moment in the ward

Approach to solve the problem:

  1. Import the necessary libraries
  2. Read the dataset and get an overview
  3. Exploratory data analysis - a. Univariate b. Bivariate
  4. Data preprocessing if any
  5. Define the performance metric and build ML models
  6. Checking for assumptions
  7. Compare models and determine the best one
  8. Observations and business insights

Importing Libraries

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)

# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# To build models for prediction
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor

# To encode categorical variables
from sklearn.preprocessing import LabelEncoder

# For tuning the model
from sklearn.model_selection import GridSearchCV

# To check model performance
from sklearn.metrics import make_scorer, mean_squared_error, r2_score, mean_absolute_error
In [1]:
from google.colab import drive
drive.mount('/content/Mydrive')
Mounted at /content/Mydrive
In [ ]:
# Read the healthcare dataset file
data = pd.read_csv("healthcare_data.csv")
In [ ]:
# Copying data to another variable to avoid any changes to original data
same_data = data.copy()

Data Overview

In [ ]:
# View the first 5 rows of the dataset
data.head()
Out[ ]:
|   | Available Extra Rooms in Hospital | Department | Ward_Facility_Code | doctor_name | staff_available | patientid | Age | gender | Type of Admission | Severity of Illness | health_conditions | Visitors with Patient | Insurance | Admission_Deposit | Stay (in days) |
| 0 | 4 | gynecology | D | Dr Sophia | 0 | 33070 | 41-50 | Female | Trauma | Extreme | Diabetes | 4 | Yes | 2966.408696 | 8 |
| 1 | 4 | gynecology | B | Dr Sophia | 2 | 34808 | 31-40 | Female | Trauma | Minor | Heart disease | 2 | No | 3554.835677 | 9 |
| 2 | 2 | gynecology | B | Dr Sophia | 8 | 44577 | 21-30 | Female | Trauma | Extreme | Diabetes | 2 | Yes | 5624.733654 | 7 |
| 3 | 4 | gynecology | D | Dr Olivia | 7 | 3695 | 31-40 | Female | Urgent | Moderate | None | 4 | No | 4814.149231 | 8 |
| 4 | 2 | anesthesia | E | Dr Mark | 10 | 108956 | 71-80 | Male | Trauma | Moderate | Diabetes | 2 | No | 5169.269637 | 34 |
In [ ]:
# View the last 5 rows of the dataset
data.tail()
Out[ ]:
|        | Available Extra Rooms in Hospital | Department | Ward_Facility_Code | doctor_name | staff_available | patientid | Age | gender | Type of Admission | Severity of Illness | health_conditions | Visitors with Patient | Insurance | Admission_Deposit | Stay (in days) |
| 499995 | 4 | gynecology | F | Dr Sarah | 2 | 43001 | 11-20 | Female | Trauma | Minor | High Blood Pressure | 3 | No | 4105.795901 | 10 |
| 499996 | 13 | gynecology | F | Dr Olivia | 8 | 85601 | 31-40 | Female | Emergency | Moderate | Other | 2 | No | 4631.550257 | 11 |
| 499997 | 2 | gynecology | B | Dr Sarah | 3 | 22447 | 11-20 | Female | Emergency | Moderate | High Blood Pressure | 2 | No | 5456.930075 | 8 |
| 499998 | 2 | radiotherapy | A | Dr John | 1 | 29957 | 61-70 | Female | Trauma | Extreme | Diabetes | 2 | No | 4694.127772 | 23 |
| 499999 | 3 | gynecology | F | Dr Sophia | 3 | 45008 | 41-50 | Female | Trauma | Moderate | Heart disease | 4 | Yes | 4713.868519 | 10 |
In [ ]:
# Understand the shape of the data
data.shape
Out[ ]:
(500000, 15)
  • The dataset has 500,000 rows and 15 columns.
In [ ]:
# Checking the info of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 15 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Available Extra Rooms in Hospital  500000 non-null  int64  
 1   Department                         500000 non-null  object 
 2   Ward_Facility_Code                 500000 non-null  object 
 3   doctor_name                        500000 non-null  object 
 4   staff_available                    500000 non-null  int64  
 5   patientid                          500000 non-null  int64  
 6   Age                                500000 non-null  object 
 7   gender                             500000 non-null  object 
 8   Type of Admission                  500000 non-null  object 
 9   Severity of Illness                500000 non-null  object 
 10  health_conditions                  500000 non-null  object 
 11  Visitors with Patient              500000 non-null  int64  
 12  Insurance                          500000 non-null  object 
 13  Admission_Deposit                  500000 non-null  float64
 14  Stay (in days)                     500000 non-null  int64  
dtypes: float64(1), int64(5), object(9)
memory usage: 57.2+ MB

Observations:

  • Available Extra Rooms in Hospital, staff_available, patientid, Visitors with Patient, Admission_Deposit, and Stay (in days) are of numeric data type and the rest of the columns are of object data type.
  • The number of non-null values is the same as the total number of entries in the data, i.e., there are no null values.
  • The column patientid is an identifier for patients in the data. This column will not help with our analysis so we can drop it.
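The absence of missing values can also be checked directly instead of being read off the info() output. The following is a minimal sketch on a small hypothetical frame: toy stands in for data, and its values (including the deliberately missing one) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only
toy = pd.DataFrame(
    {
        "staff_available": [0, 2, 8],
        "Admission_Deposit": [2966.4, np.nan, 5624.7],
    }
)

# Count missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# True only when no column contains a missing value
has_no_nulls = not toy.isnull().values.any()
print(has_no_nulls)
```

On the full dataset, the equivalent call would be data.isnull().sum(), which should report zero for every column if the observation above holds.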
In [ ]:
# To view patientid and the number of times they have been admitted to the hospital
data['patientid'].value_counts()
Out[ ]:
126719    21
125695    21
44572     21
126623    21
125625    19
          ..
37634      1
91436      1
118936     1
52366      1
105506     1
Name: patientid, Length: 126399, dtype: int64

Observation:

  • The maximum number of admissions recorded for a single patient is 21 and the minimum is 1.
In [ ]:
# Dropping patientid from the data as it is an identifier and will not add value to the analysis
data=data.drop(columns=["patientid"])

Data Preparation for Model Building

  • Before we proceed to build a model, we'll have to encode categorical features.
  • We'll separate the independent variables from the dependent (target) variable.
  • We'll split the data into train and test to be able to evaluate the model that we train on the training data.
In [ ]:
# Creating dummy variables for the categorical columns
# drop_first=True is used to avoid redundant variables
data = pd.get_dummies(
    data,
    columns = data.select_dtypes(include = ["object", "category"]).columns.tolist(),
    drop_first = True,
)
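To make the effect of drop_first=True concrete, here is a small sketch on a hypothetical three-category column. The toy frame and its values are assumptions for illustration; the notebook applies the same call to data.

```python
import pandas as pd

# Hypothetical toy frame; the real notebook applies this to `data`
toy = pd.DataFrame(
    {
        "Severity of Illness": ["Extreme", "Minor", "Moderate", "Minor"],
        "Visitors with Patient": [4, 2, 2, 3],
    }
)

encoded = pd.get_dummies(
    toy,
    columns=["Severity of Illness"],
    drop_first=True,  # drops the first category in sorted order ("Extreme")
)

# "Extreme" is now implied when both remaining dummy columns are 0/False
print(encoded.columns.tolist())
print(encoded)
```

Three categories produce two dummy columns. The dropped category becomes the baseline, which avoids a perfectly redundant column in linear models.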
In [ ]:
# Check the data after handling categorical data
data
Out[ ]:
Available Extra Rooms in Hospital staff_available Visitors with Patient Admission_Deposit Stay (in days) Department_anesthesia Department_gynecology Department_radiotherapy Department_surgery Ward_Facility_Code_B Ward_Facility_Code_C Ward_Facility_Code_D Ward_Facility_Code_E Ward_Facility_Code_F doctor_name_Dr John doctor_name_Dr Mark doctor_name_Dr Nathan doctor_name_Dr Olivia doctor_name_Dr Sam doctor_name_Dr Sarah doctor_name_Dr Simon doctor_name_Dr Sophia Age_11-20 Age_21-30 Age_31-40 Age_41-50 Age_51-60 Age_61-70 Age_71-80 Age_81-90 Age_91-100 gender_Male gender_Other Type of Admission_Trauma Type of Admission_Urgent Severity of Illness_Minor Severity of Illness_Moderate health_conditions_Diabetes health_conditions_Heart disease health_conditions_High Blood Pressure health_conditions_None health_conditions_Other Insurance_Yes
0 4 0 4 2966.408696 8 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1
1 4 2 2 3554.835677 9 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0
2 2 8 2 5624.733654 7 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1
3 4 7 4 4814.149231 8 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
4 2 10 2 5169.269637 34 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
499995 4 2 3 4105.795901 10 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0
499996 13 8 2 4631.550257 11 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
499997 2 3 2 5456.930075 8 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
499998 2 1 2 4694.127772 23 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0
499999 3 3 4 4713.868519 10 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1

500000 rows × 43 columns

In [ ]:
# Separating independent variables and the target variable
x = data.drop('Stay (in days)',axis=1)

y = data['Stay (in days)']
In [ ]:
# Splitting the dataset into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True, random_state = 1)
In [ ]:
# Checking the shape of the train and test data
print("Shape of Training set : ", x_train.shape)
print("Shape of test set : ", x_test.shape)
Shape of Training set :  (400000, 42)
Shape of test set :  (100000, 42)
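The random_state argument is what makes the split reproducible. A minimal sketch on synthetic arrays (x_toy and y_toy are hypothetical stand-ins for x and y) illustrates both the 80/20 proportions and the repeatability:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features
x_toy = np.arange(300).reshape(100, 3)
y_toy = np.arange(100)

split_a = train_test_split(x_toy, y_toy, test_size=0.2, shuffle=True, random_state=1)
split_b = train_test_split(x_toy, y_toy, test_size=0.2, shuffle=True, random_state=1)

x_train_a, x_test_a, y_train_a, y_test_a = split_a
print(x_train_a.shape, x_test_a.shape)  # 80 training rows, 20 test rows

# Same random_state -> identical partitions on every call
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```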

Serialization

Serialization is the process of converting a data object (e.g., a Python object or a trained model) into a format that can be stored or transmitted. The object can then be recreated when needed through the reverse process, deserialization. We are going to discuss two serialization tools that are used in Python: Pickle and Joblib.
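As a quick illustration of the round trip, the sketch below serializes a small dictionary to bytes in memory and rebuilds it. The record contents are made up for the example.

```python
import pickle

# Any picklable Python object; this record is made up for illustration
record = {"department": "gynecology", "stay_days": [8, 9, 7]}

# Serialize: Python object -> bytes
payload = pickle.dumps(record)

# Deserialize: bytes -> a new, equal Python object
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored == record, restored is record)
```

The restored object compares equal to the original but is a separate object in memory.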

Pickle

What is Pickle?

Pickle is a Python module that can be used to convert a Python object into a byte stream and vice versa. The byte stream can be stored in a file or transmitted over a network.

Why is Pickle used?

Pickle is used to store Python objects in a format that can be easily retrieved and used later. This is useful when you need to save the state of your program, for example, when you want to save the trained model of a machine learning algorithm so that it can be used later to make predictions on new data.

Advantages of using Pickle:

  • It is easy to use.
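  • It can handle most Python objects, including instances of custom classes. Functions and classes are pickled by reference (module and name), so their definitions must be importable when the object is loaded.
  • The serialized byte stream can be compressed with a standard module such as gzip, making it smaller to store and faster to transmit.
  • The deserialized object has the same type and value as the original, provided the required class definitions and compatible library versions are available when loading.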
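Pickle does not compress its output on its own, but it writes to any file-like object, so it can be combined with the standard gzip module. The following sketch compares file sizes for an illustrative payload; the file names and the payload are assumptions chosen for the example.

```python
import gzip
import os
import pickle
import tempfile

# Illustrative payload: a long, repetitive list compresses well
payload = list(range(1000)) * 50

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "payload.pkl")
    gz_path = os.path.join(tmp_dir, "payload.pkl.gz")

    with open(plain_path, "wb") as file:
        pickle.dump(payload, file)

    # gzip.open returns a file-like object, so pickle can write to it directly
    with gzip.open(gz_path, "wb") as file:
        pickle.dump(payload, file)

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

    with gzip.open(gz_path, "rb") as file:
        restored = pickle.load(file)

print(gz_size < plain_size)
print(restored == payload)
```

How much space is saved depends on the data. Highly repetitive content like this shrinks a lot, while already-dense numeric data may shrink very little.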

Importing the library

In [ ]:
import pickle
In [ ]:
# Create a model with desired hyperparameters
model = RandomForestRegressor(n_estimators=120, max_depth=None, max_features=0.8, random_state=1)
In [ ]:
model.fit(x_train, y_train)
Out[ ]:
RandomForestRegressor(max_features=0.8, n_estimators=120, random_state=1)

Saving the trained model using Pickle

In [ ]:
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)

The above code has two parts:

  • with open('model.pkl', 'wb') as file: opens a new file named model.pkl in write binary mode. The with statement ensures that the file is closed properly after the data has been written to it.

  • pickle.dump(model, file) writes the model object to the file object in binary format using the pickle.dump() method.

Loading the trained model using Pickle

In [ ]:
with open('model.pkl', 'rb') as file:
    loaded_model_pkl = pickle.load(file)
  • This code uses the open() function to open the model.pkl file in read binary mode ('rb') and assigns the file object to the variable file.

  • Then, the pickle.load() method is called to load the saved model from that file object into the loaded_model_pkl variable.

  • To summarize, this code is loading the saved Random Forest Regression model stored in the model.pkl file using the pickle module, and assigning the loaded model to the loaded_model_pkl variable. This loaded model can be used to make predictions on new data.
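One way to confirm that a loaded model behaves like the original is to compare their predictions. The sketch below does this on synthetic data with a small forest so that it runs quickly; x_demo, y_demo, and demo_model are hypothetical stand-ins for x_train, y_train, and model, and the round trip happens in memory rather than through model.pkl.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for x_train / y_train (not the hospital data)
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(200, 4))
y_demo = x_demo[:, 0] * 3.0 + rng.normal(scale=0.1, size=200)

demo_model = RandomForestRegressor(n_estimators=10, random_state=1)
demo_model.fit(x_demo, y_demo)

# Round trip through pickle in memory (same idea as dump + load on a file)
restored_model = pickle.loads(pickle.dumps(demo_model))

# The restored model should reproduce the original predictions exactly
predictions_match = np.array_equal(
    demo_model.predict(x_demo), restored_model.predict(x_demo)
)
print(predictions_match)
```

In the notebook, the analogous check would compare model.predict(x_test) with loaded_model_pkl.predict(x_test).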

Joblib

Joblib is a Python library that provides tools for lightweight pipelining, such as caching function results on disk, along with utilities for simple parallel computing. In the context of machine learning, it is primarily used for efficient pickling of large NumPy arrays, as well as for persisting scikit-learn models.

Some advantages of using Joblib for model persistence include:

  • Efficiency: Joblib is optimized for efficient processing of large NumPy arrays, making it a good choice for persisting large models and their associated data
  • Parallel processing: Joblib can be used to easily parallelize operations across multiple cores, which can greatly speed up the training and evaluation of machine learning models
  • Seamless integration with scikit-learn: Joblib is designed to work seamlessly with the popular scikit-learn machine learning library, making it a natural choice for persisting scikit-learn models

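One disadvantage of using Joblib is that it is a third-party package, whereas Pickle ships with the Python standard library, so it adds a dependency to any environment that needs to load the model. It is also less widely known than Pickle, which can make it harder to find resources or support if you encounter issues. Joblib's persistence functions are built on top of Pickle, so they share its limitations. Additionally, while Joblib is optimized for large NumPy arrays, it may not be the best choice for persisting other types of data or models.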

Importing the library

In [ ]:
import joblib

Saving the trained model using Joblib

In [ ]:
joblib.dump(model, 'model.joblib')
  • This code is using the joblib library to save a trained machine learning model to disk in the file model.joblib.

  • The joblib.dump() function takes two arguments here: the first is the trained machine learning model (the variable model) that needs to be saved, and the second is the filename ('model.joblib') where the model will be saved.

  • The advantage of using Joblib over Pickle is that it is optimized for dealing with large NumPy arrays, which are commonly used in machine learning. This means that Joblib is often faster and more efficient than Pickle for saving and loading machine learning models that contain such arrays.
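joblib.dump() also accepts an optional compress argument, which trades some speed for a smaller file. The sketch below compares an uncompressed and a compressed dump. The artifact dictionary is a hypothetical stand-in for a model holding a large array, and the file names are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np

# Illustrative stand-in for a model: a dict holding a large NumPy array
artifact = {"weights": np.zeros((1000, 100)), "name": "demo"}

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "artifact.joblib")
    small_path = os.path.join(tmp_dir, "artifact_compressed.joblib")

    joblib.dump(artifact, raw_path)
    # compress accepts an integer level from 0 (none) to 9 (smallest, slowest)
    joblib.dump(artifact, small_path, compress=3)

    raw_size = os.path.getsize(raw_path)
    small_size = os.path.getsize(small_path)
    restored = joblib.load(small_path)

print(small_size < raw_size)
print(np.array_equal(restored["weights"], artifact["weights"]))
```

An all-zero array is an extreme case that compresses very well. The savings for a real random forest would depend on its contents and are not measured here.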

Loading the trained model using Joblib

In [ ]:
loaded_model_joblib = joblib.load('model.joblib')
  • This code uses the joblib.load() function to load a trained machine learning model that was previously saved with joblib.dump(). The name of the file to be loaded is passed as the argument.

  • 'model.joblib' is the name of the file containing the saved model. Because it is a relative path, it is resolved against the current working directory, which is often, but not always, the directory containing the notebook or script.

  • Once the model is loaded, it is stored in the variable loaded_model_joblib and can be used to make predictions on new data.
