Goal
For the Tableau lab in our Information Visualization class, I decided to focus on 311 hotline data from the NYC Open Data portal, in particular the subset of complaint data published by the Department of Housing Preservation and Development (HPD). This dataset represents all calls by tenants to the city since 2010 to report either issues in their apartments or building-wide issues, with the exception of structural issues for buildings. These fall under the jurisdiction of the Department of Buildings.
The tutorial videos from Tableau helped outline the possibilities with the software. I thought it would be interesting to visualize the data to allow analysis of the following:
- areas within NYC experiencing a high level of housing trouble;
- types of problems reported and the frequency of each type of complaint;
- the spread of complaints over the months of the year.
Inspiration
I remembered I had seen one really good visualization for noise complaints in New York City in the past, but I was unable to find the exact one, though I was able to locate a few others. What I like about the first substitute I found, Example A, is that it shows the entire city and that its rendering as a chloropleth map allows you to clearly see which areas are the “noisiest”.
I also really like Example B that visualizes noise complaint data, too. It’s not entirely different from Example A, but this series of visualizations includes a breakdown of types of noise. Knowing that HPD breaks down tenant complaints into categories (no heat, pests, and plumbing issues are all very separate issues), I wanted to be able to represent the variety of complaints occurring throughout the city.
The last example that I found on WIRED really resonated with me: Example C.
This steam graph illustrates that by choosing smart ways to visualize your data, you can assist people in understanding the world that exists behind the data. By aggregating all 311 complaint data not only by type, but also by hour, the creator of the visualization has offered an inner portrait of New York City. According to the steam graph, noise bothers New Yorkers the most when it occurs from the evening into the wee hours of the morning, whereas problems like poor street conditions and illegal use of buildings are only report-worthy to New Yorkers during the day. Taxi complaints arguably reflect taxi usage: throughout the day with an obvious taper between late evening and morning.
New York is infamous for bad housing conditions. And my experience has been no exception. I have had some really bad landlords, so I am very conscious of my complaint algorithm as a tenant. An hourly visualization such as the one on WIRED might not paint a revealing portrait of housing conditions in NYC, but I felt a monthly visualization could offer an interesting view of what New Yorkers endure in their apartments, season by season.
Downloading and preparing the data
The entire HPD dataset available on the NYC Open Data platform spans 2010 to the current date. The initial downloaded file with minimal column filters applied was ca. 2 GB. When I tried to open it with Excel, the program warned that it was not able to open the entire file. I decided to visit the site again and remove any columns that I was fairly certain I wouldn’t need. The resulting file was about 800MB, comprising over 3.5 million complaint records. According to Wikipedia, there are ca. 8.5 million people living in NYC. As an admittedly crudely calculated statistic, that means for one in every three people living in New York, there has been a housing complaint called in since 2010. That’s pretty astonishing considering a percentage of NYC residents are homeowners.
For this project, I didn’t want to include partial years. Excel still wouldn’t open this smaller file. I used the program Sublime Text to open the file and drop a marker to delete all complaints made in 2015. The downloaded data from HPD is reasonably normalized, so I did not need to clean or reformat the data any further with GoogleRefine or using Python.
Tableau worksheets: Three views
I knew already that I wanted to provide three views of the data to accomplish my goals in visualizing the data:
- A map view of New York City with tenant complaints and number of complaints mapped to addresses;
- A bar chart of the different types of complaints and the number of complaints. My intention is to allow this chart to also serve as an interactive filter for the map from the previous worksheet when I design the dashboard later. The bar chart should allow a person to view complaints by single year;
- A steam graph of tenant complaints aggregated by month across the years, inspired by the graph from WIRED.
Preliminary assessment of the visualizations
Once I created these views, a few things became apparent to me. Both within the categories and within the descriptors there appeared to be a lot of overlap of terms in use. By grouping items, I was able to resolve some of these issues in order to keep the lists shorter. One simple solution was creating single groups for items that clearly belonged together, for example, for the descriptors “water-leak” and “water leak”. Others were less evident, but reasonable, like collapsing the categories “STRUCTURAL” AND “WATER LEAK”, since the only descriptor under “STRUCTURAL” was water leak. Water leak also appears under the category “PLUMBING”, but the descriptors under plumbing seem to refer to issues with sink and bathtub plumbing, whereas the descriptors under the “WATER LEAK” category seem broader, more structural.
At the same time, some category names were non-descriptive as labels, so I changed them to allow at least minimal understanding of the terms: for example, I collapsed “NONCONST”, “SAFETY”, and “UNSANITARY CONDITION” into one term because there was overlap between sub-category descriptors, and then renamed the group as “UNSANITARY CONDITION/SAFETY”. One problem that I could not resolve was the category used by HPD called “GENERAL”. Descriptors found within that category seemed to be spread across many different other categories and overlapped considerably with “GENERAL CONSTRUCTION”, which I grouped with “GENERAL”.
It’s clear that over time, the names of categories and descriptors have changed. There may even be differences between how individual 311 operators classify housing complaints. For now, I let my created group “GENERAL” be.
Other changes I made were along the lines of design aesthetics. I changed colors to be more intuitive, for example, “heat and hot water” to red, “unsanitary condition/safety” to brown, and “plumbing” to blue. I also tried to conform complaint groups of a similar nature to a particular color palette, like using different blues for categories that pertain to water.
Notes on the Steam Graph
Unfortunately, Tableau does not offer a default way to render your data as a steam graph. For the particular steam graph I wanted to create showing aggregated monthly trends, the Tableau area chart came the closest. Googling Tableau and steam graph, I did find one blogpost on VizWiz by Andy Kriebel who attempted a steam graph with Tableau and came pretty close to representing one. I couldn’t apply his method, since the setup of my data was not quite the same, but it at least showed it was possible and directed me to the calculated field function. In the end, I was able to find a solution that created a steam graph similar to the one in the blogpost.
First, I put the “Created Date” dimension from the original data on the column shelf of the area chart and set it to month. The normal way to create the area graph is to put the number of records measure element into the rows shelf. If you look into the editor for that element, you will see that the calculated field is “1”. I copied that element and named it “Steam Graph Calculation”. Looking at the calculated field syntax in the blogpost, I modified some of the complaint categories to calculate “-1” in order to set some of the elements below the horizontal zero to create a more interesting steam graph-like shape similar to the one in Kriebel’s blogpost. But the calculated field editor would not let me set the dimensions element I had created with grouped complaint types as the case. That meant I had to use the individual original complaint types (not the grouped complaint types) in my script in order to designate which categories should be calculated with “-1”. The final calculated field script looks like this:
CASE [Complaint Type (original)]
WHEN 'ELECTRIC' THEN -1
WHEN 'PLUMBING' THEN -1
WHEN 'HEATING' THEN -1
WHEN 'HEAT/HOT WATER' THEN -1
WHEN 'STRUCTURAL' THEN -1
WHEN 'WATER LEAK' THEN -1
ELSE 1
END
The script also reflects my choice to set complaint categories primarily addressing problems with utilities beneath the zero line, with those above the line slowly moving to non-utility complaints within the apartment to common spaces and finally outside the building. I then removed the marks along the y-axis to avoid confusion, especially since the y-axis now showed negative numbers due to the calculations I applied. I am sure there are more ways to manipulate the calculation to create a better graph, but this was a good stopping point for me as someone who is still learning to use the calculated field.
These are screenshots of the three preliminary visualizations I created:
Future Directions
My Tableau worksheets will certainly change as I lay them out in a dashboard. For example, legends currently appear for each one, though once they appear together, I hope the bar chart will serve as the filter for the city map and alleviate the need for a legend there. The legend will be removed from the steam graph, since I chose to explicitly label the color areas. Here’s a laundry list of ways I would improve what I have so far:
- Create a dashboard for the worksheets;
- Further refine the descriptors for complaint categories to make the list more compact, which would require more diligent investigation of HPD’s data documentation;
- Reconsider the placement of complaints in the steam graph. I like my imposition of a “logical” order, but similar colors might be too close;
- Research whether deleting more columns from my dataset will increase the response speed for the visualizations, especially the map. If not, I might consider reducing the visualizations to only represent “warranty of habitability issues”;
- Read more about the calculated field editor. One thought is if there is a way to edit the record count to a range instead of just a negative and positive count and still make an area graph, I will be able to adjust the primary areas around the center line to render more like a Rorschach inkblot (more like a steam graph) than just colored areas above and below a center line.