Crashing Out: Motor Vehicle Collisions in NYC

Background

For this project, I was interested in investigating the geographic distribution of motor vehicle accidents in NYC. Were certain neighborhoods generally more dangerous to drive through than others? Why might this be: density of traffic, type of street (small, large, highway), or speed limits, perhaps? Did conditions change over time, rendering some neighborhoods more dangerous and others less? This analysis was intended to address a variety of questions around the motor vehicle collisions in NYC, but given time constraints, I decided to focus on looking at one recent year in particular – 2022 – and investigating differences across zip codes.

Sourcing the Data

As with most NYC-based data analyses, I turned to NYC Open Data to obtain my dataset. I found a dataset going back to 2012 that included the date and time of each crash; the type of motor vehicle; the location of the accident (borough, zip code, streets, latitude/longitude); the number of people (motorists, pedestrians, cyclists) killed or injured; and contributing factors. The data was compiled by NYC police, who are required to complete a police report for any collision that results in damage exceeding $1,000 or in an injury or death. To limit the size of the file, I pulled one year’s worth of data: 2022. (One note: by definition, this dataset only includes collisions where police are involved, so it undercounts the true number of collisions and inflates some of the derived calculations, e.g. the casualty rate, or the % of accidents that involve deaths and injuries.)

Because some rows contained lat/long information but no zip code, I needed an NYC zip code shapefile to fill in the holes. I found a Modified Zip Code Tabulation Area (Modzcta) shapefile on NYC Open Data that was helpful in doing this.

Cleaning the Data

Because I was planning to analyze the dataset geographically, I started out by removing any rows that lacked longitude/latitude data. That brought my dataset from 103,603 to 92,827 rows. To speed things up, I also decided to remove a number of columns, including the type of person injured or killed (motorist vs. pedestrian vs. cyclist); street names; and type of vehicle. (Dropping this last column was something I’d come to regret, as I would have liked to do an analysis of bike and e-scooter casualties by borough.) Finally, I filled in missing zip code information by joining the spreadsheet with the Modzcta shapefile using QGIS, exporting a new data file to use in Tableau.
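For readers who would rather script this step, here is a minimal sketch of the same two ideas in plain Python: drop rows with no coordinates, then assign each remaining point to the area whose polygon contains it. This is only an illustration and not the QGIS workflow described above. The column names, the two rectangular areas, and the helper names are all made up, and real Modzcta boundaries are irregular polygons that a GIS tool handles far better.

```python
import csv
import io

# Toy stand-in for the collisions export; column names are assumptions.
RAW = """collision_id,latitude,longitude,zip_code
1,40.70,-73.95,
2,,,
3,40.80,-73.90,
"""

# Hypothetical rectangular areas given as (lon, lat) rings.
AREAS = {
    "AREA_A": [(-74.00, 40.65), (-73.92, 40.65), (-73.92, 40.75), (-74.00, 40.75)],
    "AREA_B": [(-73.92, 40.75), (-73.85, 40.75), (-73.85, 40.85), (-73.92, 40.85)],
}


def has_coordinates(row):
    """True when both latitude and longitude parse as numbers."""
    try:
        float(row["latitude"])
        float(row["longitude"])
    except ValueError:
        return False
    return True


def point_in_ring(lon, lat, ring):
    """Ray-casting test: count how many edges a ray to the right crosses."""
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_area(lon, lat):
    """Return the first area containing the point, or None."""
    for name, ring in AREAS.items():
        if point_in_ring(lon, lat, ring):
            return name
    return None


rows = list(csv.DictReader(io.StringIO(RAW)))
located = [r for r in rows if has_coordinates(r)]
for r in located:
    r["area"] = assign_area(float(r["longitude"]), float(r["latitude"]))

for r in located:
    print(r["collision_id"], r["area"])
```

Row 2 has no coordinates and is dropped; the other two rows each land in one of the toy areas.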

I ran across one challenge when reviewing the new data file with zip codes. When comparing the new calculated zip code column (based on the process described above) to the original column, I noticed that there were some discrepancies. Sometimes it was simply because there was no data in the original zip code column — which is why I used QGIS in the first place. But in many rows, there was a clear discrepancy — 12,490 (7% of the total rows), to be exact, contained discrepancies between the original and newly calculated zip codes. Here’s an example:

Discrepancies with zip code information: Modzcta vs. Original

Which one was more reliable, the original zip code or the newly created one? I took about a dozen examples and used the lat/long coordinates to look up the correct zip code (according to ChatGPT). Sometimes the original zip code was correct, and at other times the new one was; once, neither was. With no clear winner, I decided to go with the new Modzcta zip code data. (The exception was the final 1,080 rows, where Modzcta zip codes were showing up as 999999. In those thousand or so cases, I went with the original zip codes.) The last thing I did at this stage was to remove the 435 rows that had no zip code data whatsoever. I was left with a dataset where every datapoint (corresponding to a motor vehicle collision) contained a zip code.
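The decision rule above is simple enough to write down as code. Here is a rough sketch in Python, assuming hypothetical column names (zip_original, zip_modzcta) and a handful of made-up rows; the placeholder value is the one reported in the text.

```python
# Placeholder value that showed up in the Modzcta column, as reported above.
PLACEHOLDER = "999999"


def resolve_zip(original, modzcta):
    """Prefer the Modzcta zip; fall back to the original; else None."""
    original = (original or "").strip()
    modzcta = (modzcta or "").strip()
    if modzcta and modzcta != PLACEHOLDER:
        return modzcta
    if original:
        return original
    return None


# Made-up rows covering the four cases described in the text.
rows = [
    {"id": "a", "zip_original": "10001", "zip_modzcta": "10011"},   # discrepancy
    {"id": "b", "zip_original": "10002", "zip_modzcta": "999999"},  # placeholder
    {"id": "c", "zip_original": "", "zip_modzcta": "10003"},        # original missing
    {"id": "d", "zip_original": "", "zip_modzcta": ""},             # no zip at all
]

cleaned = []
for row in rows:
    zip_code = resolve_zip(row["zip_original"], row["zip_modzcta"])
    if zip_code is not None:  # rows with no zip data whatsoever are dropped
        cleaned.append({"id": row["id"], "zip": zip_code})

print(cleaned)
```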

Visualization #1: Vehicle Collisions by Zip Code

For my first set of analyses, I chose to zero in on 2022 data. After uploading the dataset and checking to ensure that the variables were properly configured (e.g. changing CollisionID from number to string, confirming that Zipcode was a string and had a Geographic Role of zipcode/Postcode), I filtered the dataset to only include 2022 data. I then dragged longitude into columns and latitude into rows, and zip code into Details, generating the map with zip code areas visible. Then, to get a choropleth of zip code areas colored by # of total collisions in 2022, I dragged CollisionID into Details and Color, and converted it into a Sum. Finally, I modified the legend from a continuous range into 5 steps. Here is the result:

This map was a good start, but it felt a bit vague. How was one to know which specific neighborhoods on the map corresponded to the zip codes with the most collisions? While there is likely a way to automate the process of assigning neighborhoods to zip code regions (perhaps by finding a data table with neighborhood and zip code information and joining it to the original dataset in Tableau), I was running out of time, so I decided to zero in on the most significant regions and manually find the information, creating some text boxes on the dashboard to make this clear. Here is what I came up with:

Visualization #2: Change in Vehicle Collisions by Zip Code, 2012-2022

After creating a “baseline” visualization of collisions by zip code for 2022, I was interested in learning which zip code areas experienced the most significant change (positive or negative) in motor vehicle collision frequency during the past decade. I calculated a number of new fields in Tableau to get the percentage change between 2012 and 2022, and re-ran the visualization. Here’s what I generated:

Increase/Decrease in Motor Vehicle Collisions, 2012-2022

One thing that bothers me about this visualization is the legend. While it’s helpful to see the bottom of the range (-65%) and the top (+151%), as well as the “0” (no change) in the light gray section, it’s impossible to intuit the ranges for each color segment. I couldn’t figure out a way to modify the legend to capture this information.
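Outside Tableau, both the percentage-change calculation and a legend with explicit ranges are easy to sketch. The Python below is a rough illustration only: the area names and counts are invented (chosen so the extremes match the -65% and +151% in the legend), and the bin edges are an arbitrary assumption, not the ranges Tableau used.

```python
import bisect

# Hypothetical bin edges (in percent) and one label per resulting range.
EDGES = [-50, -25, 0, 25, 50, 100]
LABELS = [
    "below -50%",
    "-50% to -25%",
    "-25% to 0%",
    "0% to +25%",
    "+25% to +50%",
    "+50% to +100%",
    "above +100%",
]


def percent_change(old, new):
    """Percentage change from old to new; None when old is zero."""
    if old == 0:
        return None
    return (new - old) * 100 / old


def bin_label(pct):
    """Map a percentage to its labeled range (edges belong to the upper bin)."""
    return LABELS[bisect.bisect_right(EDGES, pct)]


# Invented collision counts per area: (count in 2012, count in 2022).
counts = {"AREA_A": (200, 70), "AREA_B": (100, 251), "AREA_C": (100, 100)}

for area, (c2012, c2022) in counts.items():
    pct = percent_change(c2012, c2022)
    print(area, pct, bin_label(pct))
```

With labeled ranges like these, each color step in a legend has a stated meaning rather than a position on a continuous gradient.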

Visualization #3: Casualties (Deaths & Injuries) By Zip Code, 2022

In addition to analyzing the absolute number of collisions by zip code in 2022, and understanding the change since 2012, I wanted to look at the repercussions of these accidents: how dangerous were they, in terms of injuries and fatalities, for drivers and passengers, pedestrians, and cyclists? I calculated a new field called Casualties (= number of injuries + number of fatalities) and arrived at the following:

Casualties Due to Vehicle Collisions, 2022

The results seemed to roughly correspond to the first visualization, looking at the absolute number of collisions in each zip code area. That would make sense — the greater the number of collisions, the greater the number of casualties.
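As a rough sketch of the Casualties field defined above, the Python below sums injuries and fatalities per zip code and also divides by the number of collisions, which is one way to check whether casualties simply track collision counts. The field names, zip labels, and numbers are all invented for illustration.

```python
from collections import defaultdict

# Invented collision records; field names are assumptions.
collisions = [
    {"zip": "Z1", "injured": 2, "killed": 0},
    {"zip": "Z1", "injured": 0, "killed": 0},
    {"zip": "Z1", "injured": 1, "killed": 1},
    {"zip": "Z2", "injured": 0, "killed": 0},
    {"zip": "Z2", "injured": 1, "killed": 0},
]

totals = defaultdict(lambda: {"collisions": 0, "casualties": 0})
for record in collisions:
    entry = totals[record["zip"]]
    entry["collisions"] += 1
    # Casualties = number of injuries + number of fatalities
    entry["casualties"] += record["injured"] + record["killed"]

for zip_code, entry in sorted(totals.items()):
    per_collision = entry["casualties"] / entry["collisions"]
    print(zip_code, entry["collisions"], entry["casualties"], round(per_collision, 2))
```

If casualties per collision were roughly constant across zip codes, the casualty map would look like a rescaled copy of the collision map.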

Visualization #4: The Impact of Time of Day on Total NYC Traffic Collisions, 2022

Although this didn’t entail a mapping visualization, I was curious about how time of day impacted the frequency of motor vehicle collisions in NYC as a whole. I created the following chart to show this:

Average Daily Traffic Collisions in NYC Boroughs, by Time of Day, 2022

I found this somewhat surprising, as I would have expected a greater share of collisions to occur during morning rush hour (6-9am) than in the late evening (9pm – 12am). I also found it surprising that there were so many collisions during the 3-6am period.
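The chart groups crashes into three-hour windows and averages them per day. A minimal sketch of that calculation in Python, assuming crash times arrive as "H:MM" strings and using four invented records spread over two days:

```python
from collections import Counter


def window_label(hour):
    """Label the three-hour window containing the given hour (0-23)."""
    start = (hour // 3) * 3
    return f"{start:02d}:00-{start + 3:02d}:00"


def average_daily_by_window(records, days_in_period):
    """Count crashes per three-hour window and divide by the number of days."""
    counts = Counter()
    for _date, time_text in records:
        hour = int(time_text.split(":")[0])
        counts[window_label(hour)] += 1
    return {label: n / days_in_period for label, n in counts.items()}


# Invented (date, time) pairs; a real run would use every 2022 crash and 365 days.
records = [
    ("2022-01-01", "7:15"),
    ("2022-01-01", "22:40"),
    ("2022-01-02", "23:05"),
    ("2022-01-02", "4:30"),
]

averages = average_daily_by_window(records, days_in_period=2)
for label in sorted(averages):
    print(label, averages[label])
```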

Critique and Reflections

In reviewing the various visualizations I created from this data set, I’m frankly left feeling a bit “meh.” While it was interesting to view differences across NYC zip code regions, the analysis ultimately feels shallow. It raises the question: what factors account for the differences in collision counts across various parts of the city?

My assumption is that traffic collisions vary by zip code region simply as a function of traffic density. In other words, the more traffic in an area, the more collisions. One way to test this assumption would be to normalize the collision counts for each zip code region in Tableau by traffic density. I attempted to do this by retrieving the “Automated Traffic Volume Counts” dataset from NYC Open Data. The data is based on a network of traffic sensors around the city, but they take measurements at random times on random days, so I had to do a ton of cleaning/massaging of data (and employ many assumptions) to come up with an average daily traffic count. I made several attempts at converting this dataset into something useful by joining it with an NYC shapefile using QGIS, but I was unsuccessful. If given more time, perhaps I could come up with something useful.
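To make the normalization idea concrete, here is a small Python sketch that turns a collision count and an average daily traffic count into collisions per million vehicles. It assumes the hard part (a trustworthy average daily volume per zip code) has already been solved, which I did not manage; the area names and all numbers are invented.

```python
def collisions_per_million_vehicles(collisions, avg_daily_volume, days=365):
    """Annual collisions divided by annual vehicle count, scaled to one million."""
    if not avg_daily_volume:
        return None  # no usable traffic estimate for this area
    return collisions * 1_000_000 / (avg_daily_volume * days)


# Invented per-area inputs: (collisions in the year, average daily vehicle count).
areas = {
    "AREA_A": (730, 20_000),
    "AREA_B": (365, 5_000),
    "AREA_C": (100, 0),
}

for name, (count, volume) in areas.items():
    print(name, collisions_per_million_vehicles(count, volume))
```

In this toy example AREA_A has twice as many collisions as AREA_B but half the rate, which is exactly the kind of reversal that raw counts on a map would hide.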

In addition to traffic density, what might account for the variation in collisions across parts of the city? Perhaps it has something to do with speed limits, or types of streets (highways versus smaller street networks). Unfortunately, I would need to pull in other data sets to address this question.

Sources from NYC Open Data

Motor Vehicle Collisions – Crashes

Modified Zip Code Tabulation Areas

Automated Traffic Volume Counts

Information Visualization

Student work at the School of Information, Pratt Institute
