INTRODUCTION
In December 2017 and the first several months of 2018, Amtrak, the United States’ intercity passenger railway, was involved in three high-profile serious accidents. In the aftermath, the news media seemed very interested in asking whether a lax safety culture at Amtrak had led to the accidents, even though it was clear almost immediately to knowledgeable observers that the two 2018 accidents were not Amtrak’s fault.
Inspired by the question of whether Amtrak has a “safety problem,” I chose a data set from the Federal Railroad Administration about rail injuries in the United States. I was curious about how typical the injuries and fatalities in the three recent Amtrak accidents were, and whether Amtrak’s safety record and injury rate were typical of passenger railways.
Railroads are required to report all deaths and injuries that occur on trains or on railroad property to the Federal Railroad Administration (FRA). The casualty data submitted to the FRA are available on its website and provide extraordinarily rich insight into the types and causes of rail injuries. The data include narrative descriptions of incidents and information about type and location of injury, environmental conditions, time of day, tools used, and more.
INSPIRATION
For inspiration, I looked at other visualizations of injury types, rates, and causes across various areas.
The first example I found was a packed bubble chart describing rugby injuries. I found this visualization a little bit confusing, as the dimensions it used were percent of total injuries, and percent of total days lost to injury. It was unclear to me why percent was a more useful measure than simply number of injuries or number of days, and because there were separate bubbles for forward and backline players, I wasn’t sure if the “totals” referred to total injuries for a specific position or total injuries for all players. It was also unclear what the significance of the horizontal placement of the bubbles was.
I looked at another example from the New York City Department of Transportation (NYCDOT), in a report about the problem of pedestrians being hit by left-turning drivers. Vehicles turning left present a particular danger for pedestrians who are legally crossing the street. The NYCDOT chose a pie chart to illustrate the percent of pedestrians and cyclists hit by left-turning vehicles compared to vehicles going straight or turning right. This visualization undercuts its own message: if the goal is to highlight left turns, the pie chart instead makes it apparent that, despite left turns being very dangerous, the majority of accidents involve vehicles going straight ahead. The decision to illustrate left turns in red does little to ameliorate this problem.
The most successful visualization I found for inspiration is this one by the Centers for Disease Control and Prevention illustrating causes of workplace injuries for healthcare employees. Through the use of color and the choice of a simple bar chart, this visualization effectively demonstrates which employees are injured the most frequently, and which injury causes have the largest disparities.
MATERIALS
This lab required a dataset with at least 1000 rows, at least one quantitative and one categorical dimension, as well as a temporal dimension.
I used the Federal Railroad Administration Office of Safety Analysis’s data set of railroad casualties in the state of New York for the year 2017. When I generated this data set, the data for December 2017 had not yet been uploaded to the database, so this data set only includes the first 11 months of the year.
In order to decode the data, I used the FRA’s instruction manual on submitting rail casualty reports.
I used Microsoft Excel to clean up and decode the data set, and Tableau Public to generate my visualizations.
METHODS
In order to obtain this data set, I used the FRA’s Office of Safety Analysis website to generate a table of the railroad casualty reports submitted by railroads for the state of New York in the year 2017.
Because this was a very rich and detailed data set, it was necessary to significantly reduce the number of variables. I first eliminated columns that were irrelevant or redundant, such as the FRA incident number assigned to the injury, or the state and region—all New York and thus all the same. Next, I eliminated columns that contained incomplete information, such as latitude, longitude, and substance use, which were reported only in some cases. I also eliminated the three columns that contained narrative descriptions of the incidents. This still left me with around 20 columns, so I tried to focus just on dimensions that would answer my question in a fairly broad sense, eventually only keeping the columns for railroad, type of person, nature of injury, month of injury, county, and age.
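The column reduction described above was done by hand in Excel. For readers who prefer a scripted approach, the following is a minimal sketch of the same idea using only Python's standard library. The column names and the sample rows are hypothetical placeholders, not the actual headers or values of the FRA export.

```python
import csv
import io

# Hypothetical column names; the real FRA export uses its own headers.
KEEP = ["railroad", "person_type", "injury_nature", "month", "county", "age"]


def reduce_columns(rows, keep=KEEP):
    """Return new row dicts that contain only the columns listed in `keep`."""
    return [{col: row.get(col, "") for col in keep} for row in rows]


# Made-up sample rows standing in for the downloaded table.
SAMPLE_CSV = """incident_no,state,railroad,person_type,injury_nature,month,county,age,latitude,narrative
1001,NY,Railroad A,Passenger,Bruise,3,County X,42,,placeholder text
1002,NY,Railroad B,Employee,Sprain,7,County Y,35,40.7,placeholder text
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
reduced = reduce_columns(rows)
print(reduced[0])
```

Listing the columns to keep, rather than the columns to drop, makes the final six-column shape explicit and guards against extra columns slipping through.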
Once I had reduced my data set to a manageable six columns, I used the FRA Guide for Preparing Accident Reports to decode the data. I then opened the data set in Tableau.
For my first visualization, I set out to create a bar chart showing the number of injuries of each type for each railroad. However, with all ten freight railroads listed separately, the visualization was not very useful. Most of the freight railroads had so few injuries that their bars were wildly out of proportion with those of the passenger railroads, and with over a dozen railroads represented individually, there was too much text. To address this, I created a group containing all of the freight railroads, while leaving Amtrak and the commuter railroads listed individually. This made the bar chart much less cluttered and easier to understand. I also wanted a way to see who was being injured, so I colored the bars according to the type of person injured. This presented a problem similar to that of the freight railroads. With eight different types of people described, there were too many colors, and many of the distinctions (for instance, between contractors and employees) were not particularly useful to the average viewer, so I combined some type-of-person categories into slightly broader ones. To reduce visual clutter further, I put similar types of people into the same color families: for instance, passengers and ‘nontrespassers on railroad property’ (mostly people in train stations) in shades of blue, and employees, both on and off duty, in pink and purple.
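The grouping steps above were done with Tableau groups. As a rough illustration of the underlying logic, here is a sketch in Python that pools every non-passenger railroad into one freight category, merges some person types, and counts injuries per pair. The railroad names, the person-type labels, and the merge table are all assumptions made up for this example, not the actual categories in the data set.

```python
from collections import Counter

# Assumed names of railroads kept as separate bars; anything else is pooled as freight.
PASSENGER_RAILROADS = {"Amtrak", "Commuter Rail A", "Commuter Rail B"}

# Hypothetical merge of fine-grained person types into broader ones.
PERSON_GROUPS = {
    "Contractor": "Employee on duty",
    "Employee on duty": "Employee on duty",
    "Employee off duty": "Employee off duty",
}


def railroad_group(name):
    """Keep passenger railroads as-is and pool all others into one group."""
    return name if name in PASSENGER_RAILROADS else "Freight railroads"


def person_group(label):
    """Map a person type to its broader group; unknown labels pass through."""
    return PERSON_GROUPS.get(label, label)


def count_injuries(records):
    """Count records per (railroad group, person group) pair."""
    return Counter(
        (railroad_group(r["railroad"]), person_group(r["person_type"]))
        for r in records
    )


# Made-up records for illustration only.
records = [
    {"railroad": "Amtrak", "person_type": "Passenger"},
    {"railroad": "Freight Line X", "person_type": "Contractor"},
    {"railroad": "Freight Line Y", "person_type": "Employee on duty"},
    {"railroad": "Commuter Rail A", "person_type": "Passenger"},
]
counts = count_injuries(records)
for key, n in sorted(counts.items()):
    print(key, n)
```

Each resulting pair corresponds to one colored segment of one bar in the chart.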
For my second visualization, I made a line chart of injuries by type of person over time, simply graphing the number of records associated with each month of 2017 except December, which was not included in my data set. I kept the same colors for this visualization, for the sake of consistency and because there was no reason to change them. I experimented with labeling the lines in addition to color-coding them, but I found this too visually cluttered and thought the legend with the colors was likely sufficient.
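The counts behind a line chart like this can be sketched in a few lines of Python. This is an illustration rather than the Tableau workflow actually used; the field names `person_type` and `month` and the sample records are hypothetical. Months with no records are filled with zero so that every line has eleven points, January through November.

```python
from collections import defaultdict

MONTHS = list(range(1, 12))  # January through November; December is absent


def monthly_series(records):
    """Map each person type to a list of 11 monthly counts, zero-filled."""
    table = defaultdict(lambda: dict.fromkeys(MONTHS, 0))
    for r in records:
        month = int(r["month"])
        if month not in MONTHS:
            continue  # skip anything outside the eleven months covered
        table[r["person_type"]][month] += 1
    return {
        person: [by_month[m] for m in MONTHS]
        for person, by_month in table.items()
    }


# Made-up records for illustration only.
records = [
    {"person_type": "Passenger", "month": "1"},
    {"person_type": "Passenger", "month": "1"},
    {"person_type": "Passenger", "month": "11"},
    {"person_type": "Employee", "month": "6"},
]
series = monthly_series(records)
for person, values in series.items():
    print(person, values)
```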
For my third visualization, I wanted to examine injuries by location. I first created a packed bubble chart, keeping type of person as the variable defined by color, and looking at the number of records per county. This did not work at all: New York State has 62 counties, and even though not all of them were represented in my data set, there were still far too many circles. Many of the circles were very small, because the majority of the state’s rail activity is concentrated toward the NYC Metropolitan Area, with comparatively few injuries occurring in counties like Monroe and Chautauqua. To address that problem, I grouped the counties into eight regions. This improved the bubble chart, but I still felt that it looked interesting without being useful or easy to understand, so I tried a tree map instead. This was a much more effective way to visualize what I was interested in. I retained the same colors to distinguish type of person injured.
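A tree map sizes each rectangle in proportion to its share of the total, so the regional grouping can be sketched as a lookup followed by a share calculation. The county-to-region table below is a hypothetical fragment, not the eight regions used in the dashboard, and the list of counties is made-up sample input.

```python
from collections import Counter

# Hypothetical county-to-region lookup; not the eight regions used in the dashboard.
COUNTY_TO_REGION = {
    "New York": "NYC Metro",
    "Kings": "NYC Metro",
    "Monroe": "Western NY",
    "Chautauqua": "Western NY",
}


def region_shares(counties):
    """Return each region's fraction of all records, largest first."""
    totals = Counter(COUNTY_TO_REGION.get(c, "Unmapped") for c in counties)
    n = sum(totals.values())
    if n == 0:
        return []
    return [(region, count / n) for region, count in totals.most_common()]


# Made-up input: one county name per injury record.
counties = [
    "New York", "Kings", "New York", "Monroe",
    "Chautauqua", "Kings", "New York", "Kings",
]
shares = region_shares(counties)
for region, share in shares:
    print(f"{region}: {share:.0%}")
```

Sending unknown counties to an "Unmapped" bucket, rather than dropping them, makes gaps in the lookup table visible in the output.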
In creating my dashboard, I tried to arrange the elements in a way that was visually pleasing and also told a coherent story. Because the bar chart is significantly easier to see and understand as it gets larger, I had it take up the entire left side of the dashboard. Then, because the color legend is the same across all three visualizations, I wanted the layout to highlight that as well, so I put the legend in the top right corner. I thought the line chart describing injuries by month was the least successful visualization in terms of imparting interesting information on its own, so I put that in the bottom right corner, which left the tree map of injuries by region to go in the middle on the right.
RESULTS/DISCUSSION
This resulted in what I think is a fairly useful visualization dashboard giving a broad overview of who is injured on railroad property in New York State, what those injuries are, and where and when they take place. Because this was such a rich and detailed data set, it was difficult to discard so much interesting information in order to match the scope of the assignment, but what I ended up with is a pretty effective introduction to the data set and the topic.
In answer to my specific question about Amtrak’s safety culture as compared to other railways, it appears that Amtrak has a fairly typical injury rate, and that the most common injuries are bumps and bruises, and thus fairly minor.
FUTURE DIRECTIONS
In the future, I would like to work more with this data set. Because it is so detailed, there is an opportunity to look at trends beyond just those I looked at for this lab, and perhaps look at injuries by time of day, circumstance, more specific location, etc.