Text Analysis via Network Visualizations: Examining Patterns in How NYC Handles 311 Housing Complaints



INTRODUCTION

When something breaks in a residential building in New York and a landlord refuses to fix it promptly, citizens have the option to report the issue to the government via 311. These complaints are assigned to the Department of Housing Preservation and Development (HPD) for resolution, which can involve different actions depending on the complaint. According to HPD’s website, most claims begin with a warning to the building manager that a violation will be issued if the problem persists. When this fails to resolve the issue, the department typically dispatches an inspector to investigate the grounds for a violation. If there are no grounds, the complaint is typically closed. The department attempts to resolve all complaints, but in some cases this is not possible and cases are closed without resolution.

In this project, I visualize what this resolution process has looked like in practice over the past decade. Which types of complaints typically escalate to inspections? Are these inspections successful? How often are violations actually issued? This is an extension of my text analysis of the 6.2 million 311 housing records filed between January 2010 and September 2020 available on the NYC Open Data Portal. My analysis focuses on the resolution description field, an unstructured paragraph that HPD officials publish documenting the process and outcome of each complaint follow-up. In my previous analysis, I tagged each unstructured paragraph with process and outcome key phrases that enable comparisons to be drawn across records. In this project, I visualize these tags as networks, elucidating important patterns in the HPD department’s 311 resolution protocol.

PROCESS

For this project, I built two types of network visualizations, each designed to answer a slightly different set of questions:

  1. Force-directed network: Are certain resolution processes more likely to result in certain outcomes?
  2. Sankey diagram: Does the type of complaint (e.g., plumbing, heating) affect its overall lifecycle (i.e., the process used to resolve it and the outcome)?

For the force-directed network, I drew inspiration from the following network visualization of character co-occurrences in Shakespeare’s A Midsummer Night’s Dream. I appreciate the way that the author used node size to capture the relative importance of each character, while concurrently using edge weight to capture the strength of the characters’ relationships with each other. By constructing the visualization in this way you not only can quickly discern which characters typically co-occur in scenes, but also which are critical to the play overall. While the subject of my analysis is different, I am similarly interested in both discerning which processes and outcomes are most common overall, and which are most likely to occur together.

Network graph of characters in a Midsummer Night’s Dream

For the Sankey diagram, I drew from the below example of energy flow projections. I found a Sankey diagram particularly compelling because, in addition to capturing connectivity via the traditional network structure of nodes and edges, it is well positioned to capture the element of time. In Mike Bostock’s visualization this is the time that passes as fuel sources are converted to energy. In my analysis, it is the period over which a complaint is opened, categorized, processed, and resolved by HPD. A force-directed network does not capture this intrinsic ordering of the nodes, but a Sankey diagram does by positioning the nodes related to early events on the left and those related to later events on the right. Studies show that people typically read visualizations in the direction that they read their primary language, so for left-to-right readers this positioning ensures that the viewer will progress through the visualization in the order that events actually happen. In this way, without instruction, the visualization reads almost like a story. Given that my second set of research questions pertains to whether the life cycle of a complaint depends on the complaint type, a Sankey diagram seemed a natural choice.

Plotly example of Sankey diagram

I coded both of the visualizations in Python. I relied heavily on the Pandas package for data cleaning and Plotly for the visualizations. For the force-directed diagram I used the NetworkX package to determine the positioning of the nodes. Plotly was particularly helpful for creating interactive features, as it let me both troubleshoot the visualization inline in a Jupyter Notebook as a widget and export the final visualization to HTML.

You can find my code with explanatory Markdown for each step on GitHub. I also used the chart_studio package, as explained here, to host my visualizations on the Plotly website.

RESULTS

After iterating through NetworkX’s layout options, I settled on the spring_layout, which positions the nodes according to the Fruchterman-Reingold force-directed algorithm. I found this to be the best layout option because it limited edge crossings while making clusters of nodes very clear. I ran the algorithm with its default specifications, except for the optional parameter (k) that specifies how much space the algorithm should try to leave between nodes. This parameter has a default value of 1/sqrt(n), where n is the number of nodes, but I found this was too low for the graph to be readable. I increased it until the nodes were discernible. Interestingly, the spacing of the nodes does not increase monotonically with this parameter. There is a range over which the spacing does increase with larger values, but above it the nodes just get jumbled and stop moving predictably. I set the parameter to the top of this range.
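The layout step can be sketched as follows. The toy graph, the tag names, the multiplier of 3, and the fixed seed are all hypothetical stand-ins; the value actually used in the project is not stated here.

```python
import math
import networkx as nx

# Hypothetical toy graph: process tags linked to outcome tags, weighted by co-occurrence.
G = nx.Graph()
G.add_weighted_edges_from([
    ("inspection attempted", "violation issued", 5),
    ("inspection attempted", "no violation", 3),
    ("called tenant", "condition corrected", 4),
    ("inspection failed", "violation issued", 1),
])

n = G.number_of_nodes()
default_k = 1 / math.sqrt(n)  # NetworkX's default optimal distance between nodes

# Ask for more space between nodes than the default. The factor of 3 is an
# assumption for illustration; seed only makes the layout reproducible.
pos = nx.spring_layout(G, k=3 * default_k, seed=42)

for node, (x, y) in pos.items():
    print(f"{node}: ({x:.2f}, {y:.2f})")
```

Rerunning this with several values of k and comparing the plots is one way to find the range in which spacing still responds predictably.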

In this visualization I’ve colored the key phrases that relate to the process a rusty red and the key phrases that relate to outcomes turquoise. Unfortunately, Plotly has limited capability for adding legends, so I’ve created a version of the network with a legend that I added directly using HTML and CSS after exporting from Python. The nodes, representing the process and outcome tags that I generated in my previous analysis, are sized according to the number of times that phrase occurs in the 6.2 million records. The edges are weighted according to the number of records where the connected phrases co-occur. Because no record contains more than one outcome or process keyword phrase (by design in my tagging), this network naturally visualizes as a bipartite graph, where nodes of the process type connect only to nodes of the outcome type. Since this bipartite property is inherent to the edge definitions, I was able to model it as a simple graph and just color by node type at the end. To generate the node and edge weights I used sklearn’s MinMaxScaler on the counts. I played around with annotating the nodes directly on the graph, but found the phrases were too long not to overlap no matter how I positioned them. This makes the visualization perform well as an exploratory tool, but not well as a narrative tool.
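A minimal sketch of how such node and edge weights could be derived is shown below. The records, the column names, and the scaling ranges (10 to 50 for node sizes, 1 to 8 for edge widths) are hypothetical; only the use of Pandas, NetworkX, and MinMaxScaler comes from the description above.

```python
import networkx as nx
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical tagged records: exactly one process tag and one outcome tag each.
records = pd.DataFrame({
    "process": ["inspected unit", "inspected unit", "called tenant",
                "inspection failed", "called tenant"],
    "outcome": ["violation issued", "no violation", "condition corrected",
                "violation issued", "condition corrected"],
})

# Node weight: how often each tag occurs. Edge weight: how often a pair co-occurs.
node_counts = pd.concat([records["process"], records["outcome"]]).value_counts()
edge_counts = (records.groupby(["process", "outcome"]).size()
               .reset_index(name="n_records"))

# Rescale raw counts into display ranges (both ranges are assumptions).
node_sizes = MinMaxScaler(feature_range=(10, 50)).fit_transform(
    node_counts.to_numpy().reshape(-1, 1)).ravel()
edge_widths = MinMaxScaler(feature_range=(1, 8)).fit_transform(
    edge_counts["n_records"].to_numpy().reshape(-1, 1)).ravel()

# Build a simple graph and record the node type so it can drive color later.
process_tags = set(records["process"])
G = nx.Graph()
for tag, size in zip(node_counts.index, node_sizes):
    kind = "process" if tag in process_tags else "outcome"
    G.add_node(tag, size=float(size), kind=kind)
for row, width in zip(edge_counts.itertuples(index=False), edge_widths):
    G.add_edge(row.process, row.outcome, width=float(width))

print(G.nodes(data=True))
print(G.edges(data=True))
```

Because every edge joins a process tag to an outcome tag, the bipartite structure falls out of the data without any special graph type.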

While the embedded version is unfortunately not interactive, if you click anywhere on the image it will open in the Plotly website where you can hover over the nodes to see what phrases they represent.

I think there are a number of interesting things that can be gleaned from this network visualization. For example, by hovering over the largest nodes (those in the central bottom area) we can see that for building-wide issues it is very common for HPD to issue a violation based on an inspection of an apartment unrelated to the complaint. This is very important because the second most common process is for HPD to fail at conducting an investigation (i.e., the person is not home, won’t let them in, etc.). If HPD were interested in improving their resolution rates, it would be critical to address these failed inspections. Inspecting a nearby apartment, their data would suggest, is often a good option. This region of nodes also raises some questions. For example, HPD perplexingly sometimes issues a violation even when an inspector fails to conduct any investigation, either of the original apartment or of a random apartment in the same building. In these cases, it is unclear what evidence they are using to reach this verdict. To ensure consistency across inspectors, this should be explained explicitly.

The clusters also highlight a few interesting things. There are in fact some complaints that HPD can resolve simply by warning the building manager and verifying with the occupant, as their website claims. These are represented by the two clusters to the far left. Conversely, the cluster on the far right, related to ‘literature preparation’, is not accounted for at all by the description of protocol found on HPD’s website. This could be helpful for them to clarify. The remaining two large clusters (in the center) are messier combinations of inspections, calls, and violations. The top cluster contains more of the ‘process unknown’ and ‘outcome unknown’ nodes, and it could be a good place for HPD to focus if they want to improve their protocol. It seems important information is getting lost with records in this cluster, which could affect their ability to follow up properly.

The second visualization that I created is the Sankey diagram. This diagram tracks the records through their lifecycle: from when a record is tagged with a complaint type to the period when HPD is investigating it to when it is tagged with an outcome. The visualization colors the records by complaint type throughout. In the first layer this simply means that all of the edges of a certain color add up to the number of records in the 6.2 million pool that are of that type. In the second layer this is further subdivided by the process. For example, the turquoise band coming out of the ‘Contacted Tenant’ node in the ‘Resolution Process’ layer represents the number of complaints related to heat/hot water where the HPD employed the resolution process of contacting the tenant. These second layer subdivisions are only helpful if you have the ability to open the visualization on a large screen, as they can be difficult to distinguish on a small screen. I have also created a black and white version that can be used to examine coarser trends on a small screen. To emphasize the significance of each layer in the overall complaint life cycle, I’ve added labels at the bottom of the figure using Plotly’s annotation capability. If you click on the figure and open it in Plotly, you can hover over each edge to find out how many records it represents.

Using this Sankey diagram we can expand upon our insights from the previous visualization. For example, we can now see that the complaints where HPD issued a violation after an inspector entirely failed to complete an inspection are all related to heat/hot water. I would be interested in learning more about these. Is it possible that the burden of proof is lower for issues that could have immediate and serious health effects, such as lack of heat in the winter?

We can also see clearly from the Sankey diagram that regardless of complaint type, with the exception of heat/hot water issues, it is very common for HPD to inspect the property. Most of the time, though, this does not result in a violation. This is interesting because HPD’s outline of their resolution process makes it sound like they hope to resolve many of the complaints simply by calling and threatening a potential violation, inspecting only when this doesn’t work. This diagram shows that most of the time this threat does seem to work. If HPD wants to improve their response time, which is very long in some neighborhoods (see previous visualizations), they either need to find a way to have fewer complaints escalate to inspection or hire more inspectors.

REFLECTION

While creating these visualizations I found it convenient to be able to keep my data manipulation and visualization in one notebook (as opposed to cleaning in Python and plotting in other software such as Gephi). I found NetworkX to be intuitive and powerful for defining the node position, and Plotly to be elegant in its ability to format the graph and generate HTML. Unfortunately, Plotly did have a few frustrating limitations that impeded my design. For example, in the Sankey module of Plotly there is a white shadow on all of the text labels that is impossible to turn off natively in Python. You can add CSS to override it as I’ve demonstrated in this version of my Sankey diagram, but this does not fix the version hosted on Plotly. I found this shadow limited my choice of background color. I preferred the black background over the white overall because it made the thin edges of the less common complaint types more visible, but the text shadow on top of this dark background created a distracting vibrating effect. This was less pronounced on the white background, so I chose this despite the visibility issues.

I also found the lack of customizability in the positioning of nodes and labels in the Sankey diagram to be prohibitive. I found the diagram, as plotted in the results section, difficult to read because of the lack of continuity between the second and first layers. I would have liked each process outcome node to have had a hidden set of complaint type nodes so that the colors were continuous across layers. I tried defining these nodes and using the grouping method available in Plotly to group them by the Process outcome but this dissolves the nodes to the level of the grouped layer and looks exactly like the one I’ve presented. Short of calculating the x,y position of every node there is no way to suggest an ordering in Plotly and create this continuity. Similarly, there is no way to suggest a positioning of node labels. Because of this, I found it difficult to improve readability beyond what I’ve presented.

Overall, I think Plotly is helpful for iterating between visualization configurations and analysis, but that it might be worth exploring other languages and platforms for generating fully polished versions.

REFERENCES

Plotly/NetworkX examples

  • https://towardsdatascience.com/python-interactive-network-visualization-using-networkx-plotly-and-dash-e44749161ed7
  • https://plotly.com/python/network-graphs/
  • https://plotly.com/python/text-and-annotations/#multiple-annotations
  • https://towardsdatascience.com/how-to-create-a-plotly-visualization-and-embed-it-on-websites-517c1a78568b
  • https://www.kaggle.com/iyadavvaibhav/plotly-sankey-with-filters
  • https://towardsdatascience.com/tutorial-network-visualization-basics-with-networkx-and-plotly-and-a-little-nlp-57c9bbb55bb9

Design

  • Colors for Sankey diagram: https://sashamaps.net/docs/resources/20-colors/

Data

  • https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

Related past work

  • https://github.com/mghersher/nyc_hpd_311
  • https://studentwork.prattsi.org/infovis/visualization/a-tokenized-text-analysis-of-6-2-million-nyc311-public-housing-claims-and-how-they-are-or-are-not-resolved/
  • https://public.tableau.com/profile/monica4617#!/vizhome/Atokenizedtextanalysisof6_2millionNYC311publichousingclaims/InsightsfromExploration