The leading causes of death in New York City have changed over time, and they can vary significantly by sex, race, income level, or area. Understanding them helps us analyze residents’ living conditions as well as the social and economic problems behind them. The findings are also important for future disease control and medical research aimed at improving population health. In this report, I therefore illustrate the patterns and trends in New York City’s causes of death over time. Using the New York City Leading Causes of Death dataset from NYC Open Data, I want to address the following questions:
- What has been New York City’s leading cause of death over the years?
- Who is affected most by the leading cause?
- How do causes of death compare across sex and ethnicity groups?
Methods & Process
Data Collection & Cleaning
I found the dataset, New York City Leading Causes of Death, on NYC Open Data. It has 1,380 rows derived from NYC death certificates from 2007 to 2016. It categorizes deaths by leading cause, sex, and ethnicity, and reports the number of deaths, the death rate within each sex and ethnicity group, and the age-adjusted death rate. The dataset fulfills the requirements for this lab exercise: it has more than one thousand rows of historical data and at least one quantitative and one categorical dimension. However, when I imported it into OpenRefine to clean and organize the data, I found some missing values. I chose the safest way to handle this, which was to discard every record that had a missing value. I also decided to retain only the data from 2012 to 2016 because the earlier years had too many missing values. After these adjustments, the dataset was ready for the next step.
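The cleaning itself was done in OpenRefine. As a rough illustration only, the same two rules (drop any record with a missing value, then keep 2012 to 2016) could be sketched in plain Python. The sample rows, the column names, and the set of missing-value markers below are assumptions made for the example, not values taken from the actual dataset.

```python
# Hypothetical sample rows; column names and values are illustrative only.
rows = [
    {"Year": "2011", "Leading Cause": "Diseases of Heart", "Sex": "F", "Deaths": "120"},
    {"Year": "2012", "Leading Cause": "Diseases of Heart", "Sex": "F", "Deaths": "."},
    {"Year": "2013", "Leading Cause": "Malignant Neoplasms", "Sex": "M", "Deaths": "95"},
    {"Year": "2016", "Leading Cause": "Diseases of Heart", "Sex": "M", "Deaths": ""},
    {"Year": "2016", "Leading Cause": "Diseases of Heart", "Sex": "F", "Deaths": "101"},
]

# Assumed markers for a missing value; adjust to whatever the real file uses.
MISSING = {"", ".", "NA"}


def is_complete(row):
    """Return True if no field in the row holds a missing-value marker."""
    return all(str(value).strip() not in MISSING for value in row.values())


def clean(rows, first_year=2012, last_year=2016):
    """Drop incomplete records, then keep only rows in the chosen year range."""
    kept = []
    for row in rows:
        if not is_complete(row):
            continue  # discard the entire record if any field is missing
        if first_year <= int(row["Year"]) <= last_year:
            kept.append(row)
    return kept


cleaned = clean(rows)
print(len(rows), "rows before cleaning,", len(cleaned), "rows after")
```

Dropping whole records is simple and avoids guessing values, at the cost of losing some data; that trade-off is the same one described above.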
Before starting my data analysis and visualization, I researched visualizations of similar topics for inspiration and reference.
Example 1 is a line graph showing trends in cancer death rates by type over the years. It’s a good example of using lines to express the overall shape of time-series data. However, too many colors overlap at the bottom of the chart because the values and patterns there are so close and similar. Since those categories make up only a small part of the whole, showing a separate color for each one seems redundant. Therefore, if I make this kind of graph, I would use only a few colors to highlight the most important parts.
Example 2 is an interactive chart of stacked area graphs showing how causes of death vary across sex and race. It’s interesting to see the overall trends and how they change for each demographic group when clicking on different categories. Yet I think this kind of graph would not work well when there are too many categories or the values are too close together.
Example 3 is a stacked bar graph showing the differences in causes of death in the U.S. across four categories. It’s a good way to compare the same topic across different groups in one chart, and I think I could apply this method to the comparison of causes of death across sex and ethnicity groups.
Data Analysis & Visualization
I used Tableau Public for data visualization. With the cleaned dataset uploaded, I started by grouping leading causes, sex, and ethnicity into broader categories for easier analysis afterward. I then tried several ways of filtering the data to determine the best way to encode the values and create the visualizations.
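The grouping was done with Tableau’s interface, not with code. Purely as a sketch of the idea, grouping amounts to mapping each raw label to a broader category. The raw labels, the mapping, and the helper name `to_group` below are hypothetical and may not match the labels in the real dataset.

```python
# Hypothetical raw-label -> group mapping; the real dataset's labels may differ.
SEX_GROUPS = {
    "F": "Female",
    "Female": "Female",
    "M": "Male",
    "Male": "Male",
}


def to_group(raw_label, mapping, default="Other"):
    """Map a raw label to its broader group, falling back to a default bucket."""
    return mapping.get(raw_label.strip(), default)


labels = ["F", "Female ", "M", "Unknown"]
grouped = [to_group(label, SEX_GROUPS) for label in labels]
print(grouped)
```

Sending unmapped labels to a default bucket keeps the number of categories small, which is the same goal as grouping in the visualization tool.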
Results & Findings
1. Diseases of Heart was New York City’s leading cause of death from 2012 to 2016.
Figure 1 shows that the largest share of deaths is attributed to Diseases of Heart, which accounted for more than 6% of total deaths from 2012 to 2016. The next cause is Malignant Neoplasms, which consistently held 5% of total deaths over those years.
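The percentages here come from Tableau’s percent-of-total calculation. A minimal sketch of that kind of calculation, assuming each record is a (year, cause, deaths) tuple, is shown below. The numbers are made up for the example and are not the report’s figures.

```python
from collections import defaultdict

# Hypothetical (year, cause, deaths) records; the counts are invented.
records = [
    (2012, "Diseases of Heart", 60),
    (2012, "Malignant Neoplasms", 50),
    (2012, "All Other Causes", 90),
    (2013, "Diseases of Heart", 66),
    (2013, "Malignant Neoplasms", 44),
    (2013, "All Other Causes", 90),
]


def share_by_cause(records):
    """Return each cause's percentage of that year's total deaths."""
    year_totals = defaultdict(int)
    cause_totals = defaultdict(int)
    for year, cause, deaths in records:
        year_totals[year] += deaths
        cause_totals[(year, cause)] += deaths
    return {
        (year, cause): 100.0 * deaths / year_totals[year]
        for (year, cause), deaths in cause_totals.items()
    }


shares = share_by_cause(records)
for (year, cause), pct in sorted(shares.items()):
    print(year, cause, round(pct, 1))
```

The result depends on what is used as the denominator (all rows, one year, or one demographic group), so the percentage base should be stated alongside any figure.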
2. More females than males died of Diseases of Heart.
Figure 2 shows that females accounted for a larger portion of deaths caused by Diseases of Heart. However, the difference between males and females in the number of such deaths was not large. In fact, figure 3 indicates that the percentage of males kept rising while the percentage of females showed an overall downward trend.
3. White females were affected by Diseases of Heart the most, yet their percentage was declining.
Figure 4 shows that White people accounted for over half of total deaths caused by Diseases of Heart. However, looking at the trend from 2012 to 2016 in figure 5, White was the only ethnicity group whose share kept declining. Moreover, among White people, the male percentage declined for a while but has risen since 2014, whereas the female percentage of total deaths caused by Diseases of Heart has fallen continuously (figure 6).
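A claim such as "kept declining" can be checked directly against the yearly percentages. The sketch below assumes one list of yearly shares per group, ordered from 2012 to 2016. The group names and values are placeholders, not the figures behind the charts.

```python
def is_strictly_declining(values):
    """Return True if every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


# Hypothetical yearly shares (percent) for 2012-2016; not real figures.
shares_by_group = {
    "Group A": [54.0, 53.1, 52.4, 51.8, 51.0],
    "Group B": [22.0, 22.5, 22.3, 23.0, 23.4],
    "Group C": [6.0, 5.8, 6.1, 5.9, 6.2],  # dips at times, but not steadily
}

declining = [
    group
    for group, series in shares_by_group.items()
    if is_strictly_declining(series)
]
print(declining)
```

A strict year-over-year test is a deliberately narrow definition; a series that falls overall but rises in one year would not count.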
4. The leading causes of death were the same across sex and ethnicity groups: Diseases of Heart and Malignant Neoplasms.
The stacked bar graphs (figures 7 and 8) show that causes of death didn’t vary significantly by sex or ethnicity. Diseases of Heart and Malignant Neoplasms were the main causes of death for all demographic groups.
Overall, I believe the visualizations I made answer the questions I wanted to address in this report. However, one problem kept bothering me while I was making the graphs and charts. There were too many causes of death, and some had such long names that it was hard to show them all in one chart. I chose to omit the less important ones, yet I wasn’t sure whether the information would still be clear enough as a result. Perhaps next time I should group those causes into broader categories for better visualizations. Keeping colors consistent was another challenge, since causes of death, ethnicity groups, and the two sexes all needed colors to tell them apart. Looking at the visualizations I made, I feel that too many colors appear at the same time, which might confuse the reader.
For future directions, I would like to join in more data on age groups, location, income level, education level, and so on. I believe it would help to further understand the socioeconomic and demographic characteristics of the population affected by the leading causes of death.
- OpenRefine: An open-source tool for data cleanup.
- Tableau Public: Free software for creating interactive data visualizations.