Time-series analysis is integral to understanding trends in data sets that span a period of time. The following report covers the creation of a dashboard of line graphs in Tableau Public to depict the leading causes of death in New York City from 2007 to 2014. The resulting graphs suggest a possible change after 2010 in how accident and substance abuse (label starts with “Mental and Behavioral” in data) deaths were defined, and they show trends in cancer deaths across race and ethnicity groups. Gaps in the data and the absence of overall population numbers hindered additional examination of the death numbers. The inability to properly visualize null value gaps with a line break also hindered the visualization process. All resulting graphs with notable data are included in the Results & Discussion section.
- Tableau Public: Free data visualization software available for download at https://public.tableau.com/en-us/s/.
- NYC OpenData: Source of data used in this report. This site houses publicly available data about New York City provided by the city government. Data set used in this report comes from the “New York City Leading Causes of Death” page at https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam.
Data Collection & Preparation
Data relating to leading causes of death in New York City was collected from the NYC OpenData site to begin this project. Before starting the visualization process, three visualizations were collected to guide the creative process. The visualizations collected were all line graphs depicting population, car accident, and taxi pickup data. The line graph format was selected because I set out to examine distinct trends in mortality rates in NYC death data from 2007 to 2014 and this visualization type allows trends over time to be viewed clearly by the average observer.
Example Visualization 1
The following line graph depicts the recorded resident population in New York from 1900 to 2016:
To start off, I wanted to find a simple line graph. The above graph shows a clear, positive trend in New York’s resident population number from 1900 onward. However, this visualization also includes gray shaded regions for U.S. recessions, a data point that may explain some dips in the positive trend for resident population over time. The design of this visualization is simple and sleek: using only grays and blues within and around the graph makes it easy to read the information that is presented. Graphs that compare different groups over time would likely require more colors, but this graph is a fine example of visualizing a single group over time.
Example Visualization 2
The line graph below depicts the number of car accidents in New York City’s five boroughs between January and May:
This graph does a nice job of visualizing the overall difference in the number of car accidents between New York City’s five boroughs and allows for easy identification of highs and lows in the data. However, it is difficult to examine specific portions of the information, especially when comparing the light blue line of Queens accidents with the light green line of Manhattan accidents where they overlap and are squeezed very close to each other. The hollowed-out circles provide a nice way for a user to see exactly when data was collected, which makes gaps in the data easier to identify, unlike in the previous visualization. However, the x-axis does not provide more specific time measurements beyond months, even though a large number of data points fall between these month labels. The numbers provided in text form within the graph for data points with a particularly large number of accidents seem unnecessary: these data points are the easiest to discover, and the labels make the design look inconsistent. Nonetheless, if the intention was just to compare overall accident numbers between New York City’s five boroughs, this graph does a fine job of visualizing that.
Example Visualization 3
The next graph compares yellow taxi, green taxi, and Uber pickups in Brooklyn from January 2014 to July 2015:
This graph is the closest to what I wanted to create in Tableau Public. The graph has three different lines with clearly distinct colors that are not squished tightly together. A gap in the data for Uber pickups is clearly shown by a break in the corresponding line, so a trend is less likely to be assumed for the missing time period. A user can see that as green taxi and Uber pickups become more prevalent, yellow taxi pickups dip. Also, as Uber pickups increase, the rate at which green taxi pickups increase slows. This graph allows for the comparison of more entities than the first visualization while not overwhelming the user with too many colors and overlapping, squished-together lines, as seen in the second visualization. That balance is my goal for the visual presentation of the data examined in this report.
After collecting the data and analyzing the three visualizations provided earlier in this report, a CSV file of the “New York City Leading Causes of Death” data was downloaded from NYC OpenData. This file was then linked to Tableau Public to begin the visualization process. First, similar causes of death were combined where appropriate, and oddly named or unclear labels were edited to make them easier to understand (note: legend titles with “(group)” added to them are a reminder to myself that the corresponding graph has edited groups). Next, variables were placed on appropriate axes and the option for line graphs was chosen. At this point, I created a collection of several line graphs that compared the number of deaths from various causes (using different colors) according to the data set’s overall population, sex, and race/ethnic identifications. Notable graphs were included in a dashboard and can be seen in the Results & Discussion section that follows.
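For readers who want to reproduce the grouping step outside of Tableau Public, the following is a minimal Python sketch of the same idea, not the procedure I actually used. The column names (`Year`, `Leading Cause`, `Sex`, `Deaths`), the cause labels, the group names in `CAUSE_GROUPS`, and every number in `SAMPLE_CSV` are assumptions made up for illustration; the real file should be checked before relying on any of them.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows for illustration only; these are NOT real NYC figures.
SAMPLE_CSV = """Year,Leading Cause,Sex,Deaths
2007,Diseases of Heart,F,200
2007,Influenza and Pneumonia,F,50
2007,Accidents Except Drug Poisoning,M,30
2008,Diseases of Heart,F,.
"""

# Hypothetical mapping from raw cause labels to edited group names.
CAUSE_GROUPS = {
    "Diseases of Heart": "Illness",
    "Influenza and Pneumonia": "Illness",
    "Accidents Except Drug Poisoning": "Accidents",
}


def deaths_by_group_and_year(csv_text):
    """Sum death counts per (group, year), skipping non-numeric counts."""
    totals = defaultdict(int)
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        raw = row["Deaths"].strip()
        if not raw.isdigit():
            # Leave missing counts out rather than treating them as zero.
            continue
        # Labels without a mapping keep their original name.
        group = CAUSE_GROUPS.get(row["Leading Cause"], row["Leading Cause"])
        totals[(group, int(row["Year"]))] += int(raw)
    return dict(totals)


totals = deaths_by_group_and_year(SAMPLE_CSV)
```

Skipping non-numeric counts, instead of converting them to zero, keeps a missing year from appearing as a real drop in deaths once the totals are plotted.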
Results & Discussion
Please click on the following link to view the resulting dashboard:
In case the above link does not open on the reader’s device, all graphs included in the dashboard are provided in the section below.
Depicting this data in a line graph format allowed me to see trends clearly, as I expected; however, it also allowed me to more clearly view problems and inconsistencies with the data.
First, I have a graph that includes all of the death numbers over time after I refined and grouped the data:
Illness deaths are so high that it is difficult to see any other trends in the data. It does, however, appear that illness-related deaths decreased as “All Other Causes” increased, but since the label “All Other Causes” is unspecified, it is hard to draw any meaningful conclusions about this relationship. The graphs that follow exclude “All Other Causes” and separate causes of death that have wildly different numbers in order to better view the information up close.
My next set of graphs examines what I will call AASS – Accidents, Assault, Suicide (labeled “Intentional Self-Harm”), and Substance Abuse (label that starts with “Mental and Behavioral”) – deaths for all groups first and then between the female and male groups provided in the data set.
The above graph depicts all AASS death numbers for all groups in the data set. The one thing that jumped out at me when reviewing this graph is the change in the relationship between Accidents and Substance Abuse (“Mental and Behavioral”) deaths after 2010. This relationship seems to shift from a positive correlation to a negative correlation after 2010, which leads me to believe that a definition change for these causes of death may have occurred after 2010. It’s unclear what the label “Accidents” actually encompasses (beyond excluding drug poisoning, according to the data set), so this definition change is certainly possible, although further research is required to determine this.
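The shift described above was judged by eye from the graph. One way it could be checked numerically is to compute a Pearson correlation between the two series separately for the years up to 2010 and the years after it. The sketch below is only an illustration under stated assumptions: the function names are my own, and the two series are invented placeholder values chosen to show the mechanics, not figures from the data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlation_by_period(years, series_a, series_b, split_year):
    """Return (r up to and including split_year, r after split_year)."""
    rows = list(zip(years, series_a, series_b))
    early = [(a, b) for year, a, b in rows if year <= split_year]
    late = [(a, b) for year, a, b in rows if year > split_year]
    return pearson(*zip(*early)), pearson(*zip(*late))


# Placeholder values for illustration only; NOT taken from the data set.
years = list(range(2007, 2015))
accidents = [10, 12, 14, 16, 15, 13, 11, 9]
substance_abuse = [5, 6, 7, 8, 9, 10, 11, 12]

r_early, r_late = correlation_by_period(years, accidents, substance_abuse, 2010)
```

With only four yearly points on each side of the split, any coefficient computed this way would be a rough indication at best, so it would not replace further research into how the categories were defined.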
Next, I examined AASS deaths between the female and male groups provided in this data set:
The relationship between Accidents and Substance Abuse (“Mental and Behavioral”) deaths after 2010 is still noticeable here for both the female and male groups, but it is slightly more noticeable in the male group due to data gaps in the female group. Data regarding Substance Abuse (“Mental and Behavioral”) deaths for the female group is missing for 2010 and 2012. I couldn’t determine how to include gaps in my lines, so the graphs depict null data as zero here (this will be discussed in more detail in the Moving Forward section below). The female group is also missing Assault death data. For the male group, the Assault death trend seems to be decreasing and the Suicide (“Intentional Self-Harm”) death trend seems to be increasing, but it is difficult to draw any further conclusions from this graph because the death numbers sometimes shift significantly in different directions between just two years, and this data set only spans seven years.
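To make the difference between a null value and a zero concrete, here is a small Python sketch of one way a series could be split into separate line segments wherever a year is missing. It is independent of Tableau Public and is not how my graphs were built; the function name `split_at_gaps` and the example values are hypothetical, with `None` standing in for a missing year.

```python
def split_at_gaps(years, values):
    """Split a yearly series into contiguous runs, breaking at None values.

    Each run is a list of (year, value) pairs and can be drawn as its own
    line, so a missing year appears as a break instead of a drop to zero.
    """
    segments = []
    current = []
    for year, value in zip(years, values):
        if value is None:
            # A missing value closes the run that was being built.
            if current:
                segments.append(current)
                current = []
        else:
            current.append((year, value))
    if current:
        segments.append(current)
    return segments


# Placeholder values for illustration only; None marks 2010 and 2012 as missing.
years = list(range(2007, 2015))
values = [5, 6, 7, None, 8, None, 9, 10]

segments = split_at_gaps(years, values)
```

In this example the year 2011 ends up as a one-point segment, which a plotting tool would need to draw as an isolated marker, since a single point cannot form a line.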
My last set of graphs looks at cancer deaths between the data set’s race and ethnicity groups.
The above graph depicts cancer death numbers for all of the race and ethnicity groups included in the data set. After 2010, it looks like cancer deaths in the White Non-Hispanic group decrease while cancer deaths in the Black Non-Hispanic, Hispanic, and Asian and Pacific Islander groups rise. Since most of the deaths recorded were in the White Non-Hispanic group, and the Not Stated/Unknown and Other Race/Ethnicity labels provide little information, I excluded their numbers in the following graph to better view the other groups’ trends:
Here we can examine more closely what I mentioned above about post-2010 cancer deaths for these groups. While Hispanic and Asian and Pacific Islander cancer deaths increase more steadily after 2010 (after 2008 for the latter group), cancer deaths for the Black Non-Hispanic group do not increase after 2010 as I thought earlier. Looking back, I can see this trend for the Black Non-Hispanic group in the graph that includes all race and ethnicity death data, but it is much easier to see when the lines with more extreme numbers are excluded, which provides a more zoomed-in view for this group. Not much more can be concluded from these graphs without overall population numbers.
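The reason population numbers matter is that raw death counts cannot be compared fairly between groups of different sizes. If population figures for each group were available, the counts could be converted to deaths per 100,000 people. The following Python sketch shows that arithmetic; the function name is hypothetical and the numbers in the example are invented, not taken from the data set.

```python
def deaths_per_100k(deaths, population):
    """Convert a raw death count into a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths * 100_000 / population


# Invented example: 50 deaths in a group of 1,000,000 people.
example_rate = deaths_per_100k(50, 1_000_000)
```

A rising death count in a growing group could still correspond to a flat or falling rate, which is why the trends above cannot be interpreted much further from counts alone.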
Moving Forward
Tableau Public is a powerful and surprisingly simple tool that data visualizers can use to depict complex sets of data in various styles and formats. While there are certainly more complex functions that an expert user of this software can discover, compelling visualizations of data can be created with only a small amount of training on the user’s part. One useful function, however, is very difficult to find: breaking a line in a graph to depict gaps in data. I was not able to determine how to break lines in my graphs during the course of this study, so my graphs are not as accurate as I would like them to be. I would encourage future updates to this software to make this feature more easily discoverable, especially since it can promote a more ethical depiction of data for researchers and designers who are not trying to mislead observers of their visualizations. With regard to the NYC death information, cleaner and more complete data (e.g. the missing Assault death numbers for the female group), better defined race and ethnicity labels, and overall population numbers would have helped me to draw more conclusions from the resulting visualizations. A future study with these issues fixed could yield different outcomes.