Data Analyst looking for opportunities
Unique background in business, outdoor recreation and customer service
Outdoor enthusiast, raft guide, reader
View My LinkedIn Profile
Hospitals are a critical part of our healthcare infrastructure and one of the main providers of diabetic treatment for millions of people in the US. With rates of diabetes increasing amongst Americans it is critical that hospitals are providing sufficient care in an efficient way. Clinical outcomes for diabetic patients have been improving, but only when appropriate care is provided. If hospitals can utilize the right treatments sooner, the likelihood of patient readmission decreases. This greatly reduces management and administration costs for the hospital, while also improving morbidity and mortality rates among diabetic patients.
To determine how well hospitals are doing I obtained patient data from 130 hospitals tracking clinical care of diabetic patients over the course of ten years (1999 - 2008). Each row of the data represents records of patients diagnosed with diabetes, who stayed in the hospital for up to 14 days while receiving care. For this project the data is broken out into two tables, the “health” table and the “demographics” table. The health table shows the length of stay at the hospital, lab procedures performed, medications taken, and whether the patient was readmitted. There are 101,766 instances in the data with 71,518 distinct patient records. The demographics table shows the patient’s race, gender and age. Both tables share the “patient_nbr” attribute as the unique identifier for each record. The original dataset can be found here.
Using SQL I was able to analyze this data and extract some key insights on the hospital’s performance. Specifically, I was asked to determine the following:
Let’s see what we were able to find.
SQL is not known for its data visualization capabilities. However, with the right SQL query we can create a basic histogram from the data. This is a great way to quickly find the answer to our first question. Are most of patients staying less than 7 days? Here are the results:
Using this special SQL script, we can see that the vast majority of patients are in the hospital for less than 7 days, so the hospitals in this dataset are doing great on that front. The RPAD function is the key to this histogram in SQL. RPAD creates a column with a sting value in each row. In this case we are using the COUNT() of patients to determine the length of the strings, and representing each patient with an ‘*’ to have the strings represent bars in a histogram. I divided the counts by 100 because we are dealing with a lot of patients. This shortens the bars so that we could see the full histogram on one page. Of the 101,766 instances in this dataset, 80,617 of those resulted in a hospital stay of less than 7 days.
Next, I wanted to determine the relationship between the number of lab procedures performed, and the length of stay at the hospital. Our hospital experts determined that the number of procedures performed could be broken down into three categories. The “Few” category contains patients with less than 25 procedures performed, the “Average” is patients with 25 - 55 procedures, and “Many” procedures is anything above 55. Let’s create a summary to see what the average length of stay is for these three categories:
This breakdown makes a lot of sense. Patients in the “many” category have the longest hospital stays, averaging 5.7 days, while patients in the “few” category have the shortest average stay at 3.3 days. The CASE WHEN function in this query allows us to define our category parameters and order our average stay calculations (from the SELECT statement) appropriately.
Now let’s figure out what medical specialties / procedures are utilized the most often. In this case we are looking for specialties that have an average procedure performed count above 2.5 and have a total performance count greater than 50. This ensures that we are looking at procedures that have been performed on many patients, not just a few patients that may have a higher average. Here are the results:
Using our above parameters, we get a list of five medical specialties. The Thoracic Surgery has the highest average, but we can see that the Cardiology specialty has the highest total count of procedures performed with 5,352.
Ensuring everyone gets equal care is very important, but unfortunately it has been proven that many people and organizations suffer from biases, both conscious and unconscious. Using information from our demographics table alongside the health table data, we should be able to determine if there are any racial biases present in patient treatment. To get a snapshot of this we can break down the data by race and the number of lab procedures performed. In theory a significant difference in the number of procedures performed based on race would indicate a racial bias in patient treatment. What does the data say?
Our aggregated column, the average number of procedures performed, comes from the health table. However, to be able to group these averages by race we need to use a JOIN statement to pull in data from our demographics table. Based on our results there does not seem to be any significant differences in the treatment of different races. This is the result we hope for, and hopefully will come to expect in all cases.
Let’s highlight our hospital’s successes. For this we will consider a success story as a patient that was admitted with a diabetic emergency, indicated by a “1” in the “admission_type_id” column, and who had a hospital stay that was shorter than the average of 4.4 days. To do this we will need to use a subquery inside our SELECT and WHERE queries because we can’t use an aggregating function, like AVG, as a parameter when trying to sort data. The query below is a good example of a subquery, and produces a full list of our success stories:
The WITH statement in the query above allows us to give our subquery results a name that we can then reference in the second query to produce our list of hospital successes. Altering the above query with a COUNT function shows that there are 33,684 success stories in our dataset!
Finally, we want to know how many patients are being readmitted after the treatments listed in this dataset. In an ideal world patients would receive sufficient and appropriate treatments on their first stay, saving money for the hospitals and, more importantly, keeping these patients healthy. To figure out if, and how soon, patients were readmitted we can use another CASE WHEN statement to organize the records based on three categories found in the “readmitted” column. Here are the results:
These results are promising. We can see that most patients (54%) were not readmitted. However, there were many patients that were readmitted after one month, and a significant portion were readmitted within a month of their first hospital stay. This could indicate that the treatments being utilized may not always be what the patients need.
Using SQL, we were able to gain some valuable knowledge from our hospital dataset. To recap we found that:
The hospitals in this dataset did a great job with the 101,766 instances of diabetes related hospitalizations. However, the data shows that there is room for improvement. Hopefully, using data insights like these, patient outcomes can be improved in the future.
Thank you for reading this SQL analysis of healthcare data. I performed this project as part of the Data Analytics Accelerator Bootcamp from Data Career Jumpstart. This was a fun and valuable project for demonstrating my SQL analytics skills. If you enjoyed this analysis and would like to follow along for more feel free to connect with me on LinkedIn.