Introduction to Data Set
Each year, tens of thousands of people die in New York City from a wide range of causes. I wanted to take a closer look at these causes using NYC OpenData’s New York City Leading Causes of Death data set. The data set draws on NYC death certificates since 2007 and categorizes deaths by leading cause, sex, and race/ethnicity. For each category it reports the number of deaths, the death rate within sex and race/ethnicity categories, and the age-adjusted death rate.
With the data provided, I wanted to answer some of these questions:
- What are the most prevalent causes of death?
- Do males and females die at the same rate in NYC?
- Which causes of death are most prevalent in each ethnicity?
- What types of death are most prevalent?
- Which ethnicity has the highest death rate?
Discussion of Similar Visualizations
Visualizations of death rate data are quite popular and varied. Example 1 shows causes of death in the United States by both state and cause. This visualization combines an interactive bar graph with a map. If my data set had included boroughs or coordinates, it would have been interesting to see where in NYC deaths are most prevalent.
The other visualization I looked at was a simple node-link map of the major causes of death in the 20th century (Example 2). It divides the causes of death into major categories, with subcategories linked as nodes. It also uses size (and possibly color, although I could not figure out the key) to show frequency. This visualization inspired me to group my leading causes by type in Tableau to create a similar visualization (see Leading Causes by Type visualization).
Materials
To create my visualizations, I used the following materials:
Methods and Process
The data set had multiple quantitative and qualitative variables. At first, I experimented with displaying them in different combinations in Tableau. Time was the easiest to use as an X variable, so I decided to create a timeline first. I used the count of deaths as my Y variable to show the trend in the total number of deaths from 2007 to 2016. I also added the sex variable as a group to compare death counts between males and females.
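The aggregation behind this timeline can be sketched outside Tableau as well. The following is a minimal illustration in Python, not the workflow I actually used: the rows are made up for the example, and the field names "Year", "Sex", and "Deaths" are assumptions about how the export is labeled.

```python
from collections import defaultdict

# Made-up rows for illustration only; these are not values from the real data set.
# The keys "Year", "Sex", and "Deaths" are assumed column names.
rows = [
    {"Year": 2007, "Sex": "F", "Deaths": 120},
    {"Year": 2007, "Sex": "M", "Deaths": 150},
    {"Year": 2007, "Sex": "M", "Deaths": 30},
    {"Year": 2008, "Sex": "F", "Deaths": 110},
    {"Year": 2008, "Sex": "M", "Deaths": 140},
]


def deaths_by_year_and_sex(rows):
    """Sum death counts for each (year, sex) pair."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["Year"], row["Sex"])] += row["Deaths"]
    return dict(totals)


timeline = deaths_by_year_and_sex(rows)

# One line per year and sex, in chronological order: the points a timeline would plot.
for (year, sex), total in sorted(timeline.items()):
    print(year, sex, total)
```

Each (year, sex) total corresponds to one point on the timeline, with sex acting as the grouping that separates the two lines.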
I wanted my next visualization to show which causes of death were the most prevalent, so I used leading cause as my X variable and count of deaths as my Y variable. I had to combine some of the leading cause values into groups because several categories were redundant. I also applied a filter so that users could look at selected causes. A bar chart seemed to be the best visualization for this information because it clearly showed which causes had the most deaths relative to the others.
Attempts at Showing Leading Causes of Death by Ethnicity
One of the biggest questions I wanted to answer with a visualization was which causes of death were the leading ones in each ethnicity. This required several variables at once, and I ended up trying multiple visualizations to find the best way to show it.
For my first attempt, I used leading causes as the X variable and count of age adjusted death rate as the Y variable. I then used ethnicity as a detail and color coded the ethnicities. It was difficult to examine the results using the bar chart I created because there were too many variables to look at. It was clear which causes were the most prevalent, but it was hard to compare their prevalence across ethnicities because each bar was split into too many ethnicity segments.
In my second attempt, I made both leading cause and ethnicity Y variables in order to compare ethnicities side by side for each leading cause of death. I used distinct count of age adjusted death rate for the X variable. This visualization was much more effective because I was able to compare the death rate for each leading cause by ethnicity. I could not only tell which causes were most prevalent overall but could also determine prevalence within individual ethnicities.
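The side-by-side layout amounts to a table with one row per leading cause and one column per ethnicity. Here is a rough sketch of that reshaping in Python. It is an illustration under assumptions rather than a reproduction of the Tableau view: the rows, the field names, and the "Group A" and "Group B" labels are invented, and it averages the rate over rows instead of using Tableau's distinct count.

```python
from collections import defaultdict

# Made-up rows for illustration only; these are not values from the real data set.
# Field names and category labels are assumptions.
rows = [
    {"cause": "Heart disease", "ethnicity": "Group A", "rate": 200.0},
    {"cause": "Heart disease", "ethnicity": "Group A", "rate": 220.0},
    {"cause": "Heart disease", "ethnicity": "Group B", "rate": 180.0},
    {"cause": "Cancer", "ethnicity": "Group A", "rate": 150.0},
    {"cause": "Cancer", "ethnicity": "Group B", "rate": 170.0},
]


def mean_rate_by_cause_and_ethnicity(rows):
    """Return {cause: {ethnicity: mean age-adjusted rate}}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["cause"], row["ethnicity"])].append(row["rate"])

    table = defaultdict(dict)
    for (cause, ethnicity), rates in grouped.items():
        table[cause][ethnicity] = sum(rates) / len(rates)
    return {cause: dict(by_ethnicity) for cause, by_ethnicity in table.items()}


table = mean_rate_by_cause_and_ethnicity(rows)

# Print ethnicities next to each other under each cause, like the side-by-side bars.
for cause in sorted(table):
    for ethnicity in sorted(table[cause]):
        print(f"{cause:15s} {ethnicity:8s} {table[cause][ethnicity]:6.1f}")
```

Keeping the ethnicities adjacent under each cause is what makes the comparison easy to read, which the color-coded single bars in the first attempt did not allow.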
Lastly, I tried a heat map to show the leading causes of death by ethnicity. While it was able to show the most prevalent causes of death, it ended up revealing a problem with my data that I had not previously noticed.
Issue with Data Set
Many of the variables were missing data, which caused my death counts and death rates to be incorrect when totaled. I did not realize until I made this final visualization that I should have cleaned up my data set before using it in Tableau. It is unclear whether the blank cells for these variables meant that no deaths occurred in those categories or that no information was known. If it was the former, I would have used Google Refine to change all the blank cells to zero. If it was the latter, I would have added a symbol such as a period or dash to the blank cells to indicate that the information was not available. Even though some of my death counts and rates were tallied incorrectly, I think that overall the visualizations I made were able to answer many of the questions I set out to address with this data.
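The difference between the two readings of a blank cell can be made concrete with a small sketch. This is a hypothetical Python illustration, not a Google Refine recipe: the cell values are made up, and parse_count and summarize are helper names invented for the example.

```python
# Made-up cells for illustration only; "" and "." stand in for blank or placeholder cells.
raw_counts = ["120", "", "35", ".", "8"]


def parse_count(cell, blank_means_zero):
    """Convert a cell to an int; non-numeric cells become 0 or None."""
    text = cell.strip()
    if text.isdigit():
        return int(text)
    return 0 if blank_means_zero else None


def summarize(cells, blank_means_zero):
    """Total, mean, and number of missing cells under one interpretation."""
    parsed = [parse_count(cell, blank_means_zero) for cell in cells]
    known = [value for value in parsed if value is not None]
    return {
        "total": sum(known),
        "mean": sum(known) / len(known) if known else None,
        "missing": len(parsed) - len(known),
    }


as_zero = summarize(raw_counts, blank_means_zero=True)      # blanks mean "no deaths"
as_unknown = summarize(raw_counts, blank_means_zero=False)  # blanks mean "not available"

print(as_zero)
print(as_unknown)
```

In this toy case the total is the same under both readings, but the mean and the count of missing cells are not. Any averaged or rate-based figure therefore depends on which interpretation is chosen, and the second reading at least makes it visible that a total is incomplete.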