Background
Introduction
The Centers for Disease Control and Prevention (CDC) maintains a wealth of robust public data sets across topics such as vaccination, pregnancy, injury & violence, and COVID-19. For this lab, I wanted to explore a personal area of interest in healthcare and landed on the data set for the Disability and Health Data System (DHDS).
The DHDS is a collection of state-level data on adults with disabilities across six disability groups: cognitive, hearing, mobility, vision, self-care, and independent living. This data set is maintained by the CDC and was most recently updated in May 2021, but the actual range of the data extends from 2016 to 2019. As such, the data set includes only data from before the onset of the COVID-19 pandemic.
Inspiration
My inspiration for this project is partially personal, stemming from changes in my family members' health as they age. I was interested in seeing how factors such as age, location, and race might relate to the prevalence of disabilities across the US population.
For visual inspiration, I was most recently inspired by Statistical Atlas’ article on Household Income in New York. This article employs numerous visualizations across bar charts, line charts, and maps to show income disparities in the New York metropolitan region. I hoped to similarly derive multiple facets and perspectives from one comprehensive data set.
Process
Tools
I processed the data for this project in R using RStudio. I primarily used the ggplot2 package, part of the tidyverse collection, for visualizations, and found it to be extremely comprehensive. While robust, ggplot2 has one notable absence: there is no dedicated function for creating pie charts (they have to be built by applying coord_polar() to a stacked bar chart). As a result, I stuck to bar charts for this project.
Methods
My first goal with the DHDS data set was to understand how each of its 30 columns was populated. For this, the CDC website offers a rudimentary but useful “visualize” tool. I browsed columns by creating quick charts directly on the site, like the one below:
The tool was useful, but it did not support filtering. That limitation pointed me to opportunities for data refinement. For example, when looking at location, it is difficult to understand the state distribution when over 75% of the chart is occupied by “(Other)”.
I noted redundancies and hierarchies of columns and filtered data as I processed them into various data frames. For example, Stratification1 and StratificationCategory1 are closely related columns, where the value of StratificationCategory1 is a parent to the values in Stratification1.
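As an illustration of how such a parent/child relationship can be checked, here is a minimal sketch using dplyr. The data frame is a hypothetical toy stand-in, and its values are made up for illustration rather than taken from the CDC data; only the two column names come from the data set as described above.

```r
library(dplyr)

# Hypothetical toy stand-in for the DHDS table (illustrative values only).
dhds <- data.frame(
  StratificationCategory1 = c("Age Group", "Age Group", "Sex", "Sex", "Age Group"),
  Stratification1 = c("18-44", "45-64", "Male", "Female", "18-44")
)

# List each child value once under its parent category.
hierarchy <- dhds %>%
  distinct(StratificationCategory1, Stratification1) %>%
  arrange(StratificationCategory1, Stratification1)

print(hierarchy)
```

If every value of Stratification1 appears under exactly one StratificationCategory1, the parent column is effectively a grouping label and can be used to filter before working with the child values.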
I followed a similar process for each visualization: filter the data for relevant columns, process the remaining columns, and finally plot the table.
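The filter, process, and plot steps can be sketched roughly as follows. This is not the exact code from the project: the toy data frame, its values, and the column names Year, LocationAbbr, and Data_Value are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy stand-in for the DHDS table (illustrative values only).
dhds <- data.frame(
  Year = c(2019, 2019, 2019, 2018, 2019),
  LocationAbbr = c("FL", "FL", "GA", "FL", "GA"),
  Stratification1 = c("18-44", "65+", "18-44", "18-44", "65+"),
  Data_Value = c(10.5, 40.2, 11.1, 10.0, 41.3)
)

# Step 1: filter rows and keep only the relevant columns.
# Step 2: process the remaining columns (here, average by age group).
plot_data <- dhds %>%
  filter(Year == 2019) %>%
  select(LocationAbbr, Stratification1, Data_Value) %>%
  group_by(Stratification1) %>%
  summarise(mean_value = mean(Data_Value), .groups = "drop")

# Step 3: plot the resulting table as a bar chart.
chart <- ggplot(plot_data, aes(x = Stratification1, y = mean_value)) +
  geom_col() +
  labs(x = "Age group", y = "Mean value (illustrative)")
```

Keeping the filtered table in its own data frame makes it easy to inspect the numbers before they are plotted.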
Product
Visualizations & Code
I published my visualizations and accompanying code to my personal GitHub here.
Interpretation
The most surprising insight for me was from the “Occurrences of Disability Types by HHS Region in 2019” chart. HHS regions are defined by the Office of Intergovernmental and External Affairs, and are designed to allow regional leads to maintain close contact with the needs of their regional communities. Across all disability types, HHS Region 4 shows elevated occurrences compared to the other regions. HHS Region 4 represents the South and encompasses Alabama, Florida, Georgia, Kentucky, Mississippi, North Carolina, South Carolina, and Tennessee. As a Florida native with aging parents still in the state, I found it alarming to see such stark health discrepancies across the South.
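One way to compare Region 4 against the rest of the country is to flag its eight states and summarize by group. The sketch below assumes state-level rows with hypothetical LocationAbbr and Data_Value columns, and it uses made-up values; it is not the code behind the chart.

```r
library(dplyr)

# Postal abbreviations for the eight HHS Region 4 states listed above.
region4_states <- c("AL", "FL", "GA", "KY", "MS", "NC", "SC", "TN")

# Hypothetical toy stand-in for state-level rows (illustrative values only).
dhds <- data.frame(
  LocationAbbr = c("FL", "GA", "NY", "CA", "TN"),
  Data_Value = c(30, 32, 24, 22, 34)
)

# Label each row by region group, then average within each group.
region_summary <- dhds %>%
  mutate(region_group = if_else(LocationAbbr %in% region4_states,
                                "HHS Region 4", "Other regions")) %>%
  group_by(region_group) %>%
  summarise(mean_value = mean(Data_Value), .groups = "drop")

print(region_summary)
```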
I found that the other charts revealed impactful but unsurprising insights. For example, when isolating age groups along disability status, it was not surprising to see the prevalence of any disability rise with age. For adults who reported financial barriers to healthcare, the states with the most occurrences were also the states with the highest populations.
Reflection
I enjoyed using R to create the charts for this project, and I also enjoyed searching for and adapting new functions from the ggplot2 package. Compared to more robust programming languages that I have used in the past, like MATLAB and Python, I found R to be fairly lightweight and easy to get started with.
Given more time, I would have liked to try this project again using Python to have a clearer understanding of how the two languages differ and what the strengths of each language are. I have also been interested in trying out new executable notebooks like Google’s Colaboratory in order to widen my skill set.