A Tokenized Text Analysis of 6.2 Million NYC 311 Public Housing Claims and How They Are (or Are Not) Resolved



INTRODUCTION

Ensuring a high quality of life for citizens in a city as large and dense as New York requires constant upkeep: legal protections require enforcement, public infrastructure requires maintenance, and systems of organization require revision. When these things fail to happen, citizens need a way to hold the government accountable. In 2003, the city of New York launched a hotline for exactly this purpose. Today, this 311 hotline triages upwards of 44 million complaints a year, with operators forwarding issues ranging from broken parking meters to suspected animal abuse to the relevant NYC departments for follow-up and resolution.

In this analysis, I focus on the hundreds of thousands of these complaints that affect New Yorkers’ homes each year. Broken heating, no water, unsanitary conditions – the concerns that citizens file about residential living conditions are serious. While past analyses of this 311 data typically focus on the types of problems these complaints cover, in this analysis I go one step deeper and examine not just what is breaking in residential buildings, but whether it is being addressed the way the city government claims. According to the Department of Housing Preservation & Development (HPD), the city agency responsible for resolving these claims, all 311 complaints are channeled through a rigorous system of warnings, inspections, and resolution that prevents landlords from delaying repairs of time-sensitive issues such as broken heat. In this analysis, I begin the data cleaning necessary to compare the reality of the HPD’s response to this plan, tokenizing the resolution descriptions of more than 6 million 311 records assigned to the HPD between 2010 and mid-2020. Below, I describe my methodology and present my results as visualized using Tableau.

PROCESS

Most analyses of NYC’s 311 data (such as this, this, and this) focus on the same few questions:

  1. What are New York citizens registering complaints about?
  2. Do citizen concerns change with the seasons? Have they systematically changed over the years?
  3. Do citizen concerns vary across neighborhoods?

I was inspired to look into the data because I was intrigued by this homogeneity in the focus of past analyses. I wondered why nobody was looking into what happens after a complaint is registered. Yes, it’s important to understand what problems the city faces, but isn’t it equally important whether those issues are equitably and effectively fixed?

To investigate this, I used the New York City open data portal’s Python API to pull all 311 records assigned to HPD in the past decade. This came to over 6.2 million records, which I had to pull in batches to work around the API’s 50,000-record limit. I used Postman to troubleshoot my API requests.
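Below is a minimal sketch of that batching loop, assuming the portal’s Socrata SODA endpoint; the dataset ID, field names, and filter are my reconstruction rather than the exact notebook code:

```python
import requests
import pandas as pd

# NYC Open Data (Socrata SODA) endpoint for 311 Service Requests.
# The dataset ID and field names are assumptions about the portal's schema.
BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"
PAGE_SIZE = 50_000  # the API returns at most 50,000 records per request

def fetch_hpd_records():
    """Page through every HPD record from 2010 to mid-2020."""
    where = ("agency = 'HPD' AND created_date "
             "between '2010-01-01T00:00:00' and '2020-07-01T00:00:00'")
    frames, offset = [], 0
    while True:
        params = {"$where": where, "$limit": PAGE_SIZE, "$offset": offset,
                  "$order": "unique_key"}  # stable ordering across pages
        batch = requests.get(BASE_URL, params=params, timeout=120).json()
        if not batch:  # an empty page means we've read everything
            break
        frames.append(pd.DataFrame(batch))
        offset += PAGE_SIZE
    return pd.concat(frames, ignore_index=True)

df = fetch_hpd_records()
```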

A cursory look into the data quickly revealed why past analyses had not investigated the resolution of the claims: while the data on complaint type and source is nicely organized into distinct, easily analyzable columns, the information regarding the lifecycle of a claim within the HPD is dumped into an unstructured text field.

I have some experience analyzing unstructured data, but was curious what others had done when faced with problems of similar scale. I was particularly interested in this research project that identified the text content of every State of the Union address given between 1790 and 2015. If researchers could find a clear narrative amongst the ‘nearly 2 million’ mined words from hundreds of complex speeches, could I do the same with the 6.2 million HPD 311 resolution descriptions?

The dominant topics discussed across every State of the Union address from 1790 to 2015 (Columbia University)

To attempt this analysis, I first needed to clean and tokenize the 311 text data. Tokenization is simply the process of isolating important words or phrases (i.e. tokens) from a piece of text. These tokens then provide a structured way to discern patterns. I broke this tokenization process into four steps:

  1. Cleaning the data: Across the 6.2 million records there were 343 distinct inputs for the resolution description. A quick skim showed that many of these only appeared distinct because of truncation and formatting issues, and were actually redundant. Fixing these formatting issues revealed that there are truly only 40 unique paragraphs that the HPD uses to describe its resolution process.
  2. Tokenizing each of these 40 unique paragraphs into common phrases: I found that everything contained in these paragraphs could be reduced to 45 meaningful tokens that, with the exception of three describing the record type, each signal an event in one of three categories:
    1. a process that the HPD had taken
    2. an outcome that had come of the HPD’s work
    3. a next step that the HPD potentially planned to take or suggested the tenant take
  3. Tagging the 6.2 million records with the tokens: I tagged each record with the tokens it contained. A record sometimes contains multiple tokens in the same category. For example, a resolution description might contain both the outcome token ‘no violations were issued’ and the token ‘a complaint was closed’.
  4. Grouping/ordering the tags: I noticed that the records that contain multiple process and/or outcome tokens always have one that is more detailed and one that is less detailed. For example, if a violation is issued, the HPD seems to always close the complaint, so listing both of these tags explicitly is redundant. In other records, though, the record being closed was the only detail given; there, that tag is important to keep. I ordered the tags in the process and outcome categories from most to least detailed and tagged each record with its most descriptive token. I also grouped tags that were saying similar things (e.g. grouping ‘literature will be emailed within 24 hours’ and ‘literature has been mailed’ together into ‘literature being sent’). In this way, I aggregated the 17 outcome tokens into 9 tags and the 10 process tokens into 7 tags. In the final tagging, each 311 record carries one aggregated outcome descriptor and one aggregated process descriptor reflecting the most detailed account of the HPD’s actions available in the resolution description (a minimal sketch of this tagging logic follows the figure below).
The number of HPD 311 records with each of the initial 45 tokens
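To make steps 2–4 concrete, here is a minimal sketch of the tagging logic. The token list is an illustrative two-token subset (taken from the example in step 3), not my actual 45 tokens:

```python
import re
import pandas as pd

# Outcome tokens ordered from most to least detailed (illustrative subset
# of the 17 outcome tokens; phrases follow the example in step 3 above).
OUTCOME_TOKENS = [
    "no violations were issued",  # more detailed
    "complaint was closed",       # least detailed: kept only when alone
]

def normalize(text: str) -> str:
    """Collapse whitespace and case so truncated/reformatted variants match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def most_detailed_outcome(description: str) -> str:
    """Return the first (most detailed) outcome token the record contains."""
    cleaned = normalize(description)
    for token in OUTCOME_TOKENS:
        if token in cleaned:
            return token
    return "no outcome described"

records = pd.DataFrame({"resolution_description": [
    "No violations were issued.  The complaint was closed.",
    "The complaint was closed.",
]})
records["outcome_tag"] = records["resolution_description"].map(most_detailed_outcome)
# -> ['no violations were issued', 'complaint was closed']
```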

I conducted all of this cleaning, tokenization, and analysis in Python, relying heavily on the Pandas package. You can find more on my methodology in the Markdown of my code itself, which is hosted on GitHub.

With every record tagged with one aggregated outcome tag and one aggregated process tag, I was able to discard the messy unstructured text column and use these two sets of tags, along with the other metadata associated with each record, to analyze patterns in the process the HPD uses to resolve complaints, the outcome of the complaint, the type of complaint, and the neighborhood the complaint came from.

I used Tableau to visualize these patterns, but opted to do the actual analysis behind each visualization in Python. This was necessary because Tableau Public only allows you to import data as a JSON or .xlsx Excel file. Even in a wide format, my dataset is ~6.2 million rows long, which exceeds both the row limit of an Excel file and the size limit that Tableau puts on JSON files. By conducting the analysis in Python and exporting a condensed table for each visualization I wanted to create in Tableau, I was able to work around this.
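As an illustration of that workaround, here is a minimal sketch of how one condensed table might be produced and exported; the column names and toy data are hypothetical stand-ins for my tagged records:

```python
import pandas as pd

# Toy stand-in for the tagged 6.2M-row DataFrame; column names are hypothetical.
df = pd.DataFrame({
    "incident_zip": ["10001", "10001", "11201", "11201"],
    "year":         [2019, 2019, 2019, 2019],
    "outcome_tag":  ["violations issued", "no violations issued",
                     "violations issued", "violations issued"],
})

# Collapse to one row per zip/year/outcome with a percentage column,
# then export a small .xlsx that Tableau Public can ingest.
summary = (df.groupby(["incident_zip", "year", "outcome_tag"])
             .size().reset_index(name="n"))
summary["pct"] = (100 * summary["n"]
                  / summary.groupby(["incident_zip", "year"])["n"].transform("sum"))
summary.to_excel("outcome_by_zip_year.xlsx", index=False)  # requires openpyxl
```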

You can find the details of my pre-Tableau analysis in the Markdown of the same iPython notebook as the cleaning on GitHub. One particularly challenging aspect of this analysis was dealing with the many records that had zip codes but were missing borough names, which I wanted to use in the visualizations. To resolve these, I used Google’s Python Geocoding API to scrape the borough from Google Maps’ response to a zip code query.
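Here is a sketch of that lookup using the googlemaps client library; the ‘sublocality_level_1’ component type I extract the borough from is my assumption about how Google structures NYC results:

```python
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # hypothetical key

def borough_for_zip(zip_code: str):
    """Geocode a NYC zip code and extract the borough name, if present."""
    results = gmaps.geocode(f"{zip_code}, New York, NY")
    if not results:
        return None
    for component in results[0]["address_components"]:
        # In NYC, Google typically reports the borough as a sublocality.
        if "sublocality_level_1" in component["types"]:
            return component["long_name"]  # e.g. 'Brooklyn'
    return None

borough_for_zip("11201")  # expected: 'Brooklyn'
```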

I also found a number of issues with the data integrity of the 311 dataset. For example, in a number of records the timestamp of the complaint being closed is identical (down to the second) to the timestamp of it being opened, yet the resolution description claims that the HPD resolved the complaint using something that would have taken time, such as conducting an inspection. I’ve emailed someone at HPD to see if they have any recommendations for how to handle these anomalies, and in the meantime have dropped any that seem particularly suspect. You can see what I’ve discarded and what I’ve kept in Section 5 of the notebook hosted on GitHub.
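As an example of the kind of anomaly check I mean, the sketch below flags records that were closed in the same second they were opened while claiming an inspection occurred; the column names follow my understanding of the dataset’s schema:

```python
import pandas as pd

# Toy records; real column names come from the 311 dataset's schema.
df = pd.DataFrame({
    "created_date": ["2019-01-05 09:00:00", "2019-01-05 09:00:00"],
    "closed_date":  ["2019-01-05 09:00:00", "2019-01-08 14:30:00"],
    "resolution_description": [
        "The Department inspected the following conditions...",
        "The Department inspected the following conditions...",
    ],
})
df["created_date"] = pd.to_datetime(df["created_date"])
df["closed_date"] = pd.to_datetime(df["closed_date"])

# A complaint closed the instant it was opened cannot really have been
# inspected; flag and drop these suspect rows.
instant = df["closed_date"] == df["created_date"]
claims_inspection = df["resolution_description"].str.contains("inspect", case=False)
df = df[~(instant & claims_inspection)]
```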

RESULTS

Using the extracted process and outcome tokens, I created two dashboards of visualizations in Tableau:

  1. Exploratory dashboard: This is designed for users who are interested in using my tokenization to uncover their own insights. The visualizations in it are customizable and animated across time, so the user can choose which years, outcomes, and processes they want to focus on. It includes maps that can be used to explore how each process and outcome varies over the past decade and across neighborhoods, and heatmap tables that can be used to explore how resolution outcome and process vary by complaint type.
  2. Insights: This dashboard explains some of the insights that I found using the exploratory dashboard. It is narrative, not exploratory, in nature. The insights included are by no means exhaustive: this is a rich and large dataset, and the insights dashboard is just the beginning of what can be gleaned from this tokenization analysis and the exploratory dashboard.

Snippets from exploratory dashboard

Snippet from insights dashboard

During the development of these visualizations, I made deliberate design decisions in service of each dashboard’s goal. I have summarized a few of them below:

  1. Using association to choose the color palettes of choropleth maps: In the exploratory dashboard I chose a gray color palette, while in the insights narrative I chose a rust color palette. Both of these decisions attempt to employ common color associations to improve cognition. Many people associate gray with neutrality, which is appropriate for a dashboard that is meant to be an unbiased exploratory tool. Many people also associate the color red with negativity, which is appropriate in my insights dashboard, where every choropleth map’s darkest red reflects the proliferation of something negative. I chose a rust-colored scale instead of a pure red scale because a) studies show that too much bright color can be difficult to look at and b) this made the less dense, better-off areas more tan than red, an appropriate color as these are neutral areas that shouldn’t draw the viewer’s attention.
  2. Using histograms to determine color domains and emphasize contrast: Many of the trends I visualized in my insights dashboard are heavily right-skewed. Tableau defaults to creating a color scale using the min and max of the domain, but with a skew this produces a visualization where only the extreme outliers stand out, which is not the intent of the narrative. To fix this, I plotted histograms of the right-skewed variables and set the maximum of each color domain at the beginning, rather than the end, of the tail (a minimal version of this check appears in the code sketch after this list).
  3. Coloring by percentage instead of absolute magnitude: In most of my visualizations I chose to plot percentages, rather than absolute magnitudes. This is because complaints are not evenly distributed across geography or complaint type. If I had plotted counts of each process or outcome token across these variables without normalizing to percentages, any variation would have been washed out by the pattern in their distribution. For example, if I had plotted the raw number of complaints that were resolved with a certain outcome, instead of the percent of complaints in each zip code that have a certain outcome, the map would just show me which areas have the most complaints overall, not whether that outcome of interest varies across geography.
  4. Setting percentage color domains of exploratory maps to always run from 0 to 100: If you add a filter in Tableau, as I did in the exploratory dashboard to let the user select an outcome and process or filter out complaint types, it defaults to creating a different color scale for each filter view, again taking the max/min value of the selected variable to define the bounds of the color scale. I wanted the different views of the exploratory dashboard to be comparable, so I set every percent color domain to have a min of 0 and a max of 100. With this convention, a user can both see variation across geography and complaint type and discern which outcomes and processes are most common overall. For example, if the user selects the outcome ‘Violations were issued’ in the choropleth map in the exploratory dashboard and then switches to ‘No violations were issued’ on the same map, the entire map becomes significantly darker. Because the color scales on both are set from 0 to 100, the user can not only use the map to examine how each of these outcomes varies across neighborhoods, but could also conclude things like, ‘it is overall less common for the HPD to issue violations than not.’
  5. Binning colors to improve contrast: Tableau defaults to continuous color scales for continuous variables. While this is an appropriate and helpful convention, it can be difficult to discern variation in shades, especially of gray, when the difference isn’t large. For example, with this default scale I was able to differentiate between 20% and 80% easily, but not between things like 20% and 30%. To help with this, I binned the continuous color scale into 10 categories.
  6. Changing the basemaps of choropleth maps to improve coastal comprehension and remove distractions: The default basemap in Tableau colors the land a light gray and the ocean white. I found this confusing for a map centered on New York, where the whiteness separating the grayness of Manhattan, Brooklyn, and Staten Island made the map look inverted (i.e. as though the land were the ocean and the ocean were the land). I’m guessing this is because gray is closer to blue than white is, and most people have a strong association between blue and water. I was able to fix this by using a basemap layer that colors the water a light blue. I removed all notations that Tableau Public adds by default that didn’t provide helpful context (e.g. land cover shading), because I found them distracting. I also added a coastline boundary that was not on by default, as it helped improve the distinction between water and land in areas such as the Queens peninsula.
  7. Using Tableau aliases and filtering to improve readability: I used Tableau aliases to reduce the outcome and process tags to the minimum text necessary for comprehension. This kept the axes less cluttered. I also used filtering in the heatmap tables so that the user can limit the view to the complaint types they are interested in comparing, without being distracted by other information.
  8. Improving usability by removing the zoom function: Tableau defaults to allowing the user to zoom and pan when maps are published. Because I’m only plotting data for one city at a relatively coarse level (i.e. by zip code), it’s not necessary for the user to zoom in or out. Leaving this functionality in makes it very easy for the user to accidentally shift the view when starting the time loop or selecting a new variable to investigate.
  9. Adding borough to the tooltip rather than as a boundary file: There are some interesting patterns at the NYC borough level, but I didn’t want a user’s ability to identify these trends to rely on their knowledge of where each borough is located. I originally mapped the boroughs as an overlaid color on the grayscale maps, as demonstrated here with U.S. regions. I found that the bright background colors made it difficult to compare patterns across boroughs, though, and so compromised by instead adding the borough to the tooltip. While this doesn’t make borough-based trends jump out, it does allow the user to hover over clusters of darkness/lightness and find out whether they belong to different boroughs.
  10. Using a heatmap table rather than a stacked bar chart: I started out by plotting the heatmap tables as stacked horizontal bar charts, such as the one below. While a stacked bar chart makes it easy to choose one outcome or process and see precisely how much more common it was for complaints of type A versus type B, it was difficult to do the inverse (i.e. choose one complaint type and see how much more common it was for it to have outcome A versus outcome B). This is because there are nine possible outcomes, and it is challenging to scan the legend for a color, map it back to the bars, and hold that in memory while you do it again for a second color. Because I wanted the visualization to make this type of analysis easy on a user, I chose to encode the percentage as a color shade in a heatmap table, rather than as bar length in a stacked bar chart. While studies show that users can’t discern differences in shade as well as they can differences in length, I was OK with this loss of precision in service of this exploratory chart enabling more diverse analysis.
Stacked bar chart alternative to the heatmap table below
Heatmap table that replaced stacked bar chart above
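As flagged in design decision 2 above, the histogram check is only a few lines of matplotlib. A minimal version, with a synthetic right-skewed variable standing in for one of my real map variables:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic right-skewed percentages standing in for one map variable.
rng = np.random.default_rng(0)
pct = np.clip(rng.gamma(shape=2, scale=8, size=1000), 0, 100)

plt.hist(pct, bins=50)
plt.xlabel("% of complaints with the outcome")
plt.ylabel("number of zip codes")
# Mark a candidate color-domain max at the start of the tail, so the
# Tableau scale doesn't get stretched by a handful of extreme outliers.
plt.axvline(np.percentile(pct, 95), color="black", linestyle="--",
            label="candidate color-domain max")
plt.legend()
plt.show()
```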

If you would like to examine the individual Tableau worksheets that were used to create the final exploratory dashboard and insights, you can find them here.

REFLECTION

I found Tableau well-suited for creating the exploratory dashboard. Coding animated, interactive visualizations such as the choropleth maps without a tool like Tableau is time intensive. By using Tableau’s simple pages and filter features to add this interactivity and time series animation, I was able to focus my efforts on the actual 311 analysis. I found this particularly helpful in the initial stages of my analysis, during which I used Tableau to roughly visualize cuts of the data and rapidly prototype towards the final visualizations that I chose and developed. If I had been coding the early visualizations natively in Python or JavaScript/HTML/CSS, each figure that I tested out and then scrapped would have cost much more sunk time.

I do think that Tableau has three big limitations that are important to keep in mind when assessing whether it’s appropriate for a task:

  1. If Tableau is not built to make a type of visualization (i.e. if it’s not directly listed in the ‘Show Me’ panel), it can be unnecessarily complicated to create. For example, Tableau doesn’t have a ‘network’ option in its default charts panel. While it is possible to make one according to this post, it seems it would take more effort to make a rough prototype in Tableau than it would for someone with knowledge of Python to code it.
  2. Tableau does have many options for customization, but it’s not limitless. For example, while creating one of my dashboards I wanted to ungroup the time animation for two of the charts and leave it linked for the others. Tableau doesn’t let you do this. Unless you want to take the time to publish the visualization, extract the embed code, and then modify the HTML/JavaScript to add this functionality, there’s no way around this limitation. Again, it might be faster to code from scratch if your needs extend beyond Tableau’s customization options.
  3. Doing data cleaning and analysis in Tableau does not provide a clear way to track the changes you’ve made to a dataset. This isn’t ideal if you want to preserve or publish your knowledge of the way a public dataset needs to be cleaned. I found it helpful to do my cleaning and analysis in Python and move to Tableau only for the visualization.

If I had time to extend the analysis itself, I would want to aggregate and order the follow-up tags in the same way that I did the process and outcome tags, and fold them into the analysis. It would also be helpful to have someone from HPD use the exploratory tool to investigate the tokenized data, since I think they would pull out trends whose significance I don’t have the context to understand.

I would also ideally merge in the HPD datasets on the New York City data portal that detail the public housing buildings under HPD jurisdiction and the status of the work orders issued when a building owner fails to respond to a violation. This would help isolate the 311 claims related to affordable housing, so that I could analyze whether the complaints and resolution process for these properties differ from those for other buildings in NYC. To do this, I would need to speak to someone at HPD to understand which records are expected to appear in all of these datasets and which are not. It could also be interesting to spatially merge in zoning information and examine its correlation with certain complaints and outcomes.

Additionally, I would be interested in finding a dataset that contains more information about the violations issued for all buildings, so that I could see whether the time it takes landlords to correct a violation matches the timeline the HPD claims to hold landlords accountable to. For example, are hot water violations really being corrected immediately?

One modification I would make to the current maps is to filter out zip codes that have a small sample size in a given year. Right now there are a few zip codes with only one or two complaints in a given year, 100% of which share a certain outcome or process; these zip codes show up as misleadingly strong outliers on the current maps.
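That filter would only take a few lines on the condensed tables; a sketch with an illustrative threshold and hypothetical column names:

```python
import pandas as pd

# Toy condensed table: one row per zip/year with a complaint count.
summary = pd.DataFrame({
    "incident_zip": ["10001", "10464", "11201"],
    "year": [2019, 2019, 2019],
    "n_complaints": [5200, 2, 3100],
    "pct_outcome": [38.0, 100.0, 41.5],
})

# Suppress zip/year cells below a minimum sample size so a zip with
# one or two complaints can't show up as a 100% outlier on the map.
MIN_COMPLAINTS = 30  # illustrative threshold
summary = summary[summary["n_complaints"] >= MIN_COMPLAINTS]
```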

Overall, I hope to have the opportunity to explore this dataset further.
