Introduction
How many houses are sold in NYC every year? Which types are sold most often? To answer these questions, this project analyzes and visualizes data on the properties sold in New York City between September 2016 and August 2017. I made four interactive infographics to show different aspects of this topic.
Inspiration
I started by researching property-related visualizations online. I found the following two infographics, which show building heights and land values in Manhattan, helpful and inspiring. Their content is not fully related to my topic, but some of their visualization strategies can serve as references. The sequential color schemes organize the quantitative data from high to low with a gradient, which is a good way to show progression rather than contrast.
Inspired by these two infographics, I decided to make at least one map giving an overview of where the sold houses are located in NYC. As for the color scheme, if I needed to show progression in the data, I would consider the same color strategy.
Materials
Kaggle – open data resource
OpenRefine – a tool to clean up data
Tableau Public – data visualizing software
Methods
Steps
1. Find and clean up the data
I downloaded the NYC Property Sales dataset from Kaggle and edited it in OpenRefine. First, I deleted a column named “apartment number” because it was not needed for my topic. Second, I changed every “-” to “NA”. In addition, I converted the sale date of each property to MM-DD-YYYY for consistency. After the data was cleaned up, I saved it as a CSV file for the next steps.
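For readers who prefer a script to OpenRefine, the same three cleanup steps can be sketched in Python. This is only an illustration, not what I actually ran: the column names, the source date format, and the sample rows are all assumptions made up for the example, and the real Kaggle file may differ.

```python
import csv
import io
from datetime import datetime

# Assumed names and formats (hypothetical; check them against the real file).
DROP_COLUMN = "APARTMENT NUMBER"
DATE_COLUMN = "SALE DATE"
SOURCE_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def clean_rows(rows):
    """Yield a cleaned copy of each input row (a dict of column -> text)."""
    for row in rows:
        cleaned = {}
        for key, value in row.items():
            if key == DROP_COLUMN:
                continue  # step 1: drop the column that is not needed
            value = value.strip()
            if value == "-":
                value = "NA"  # step 2: mark missing values consistently
            cleaned[key] = value
        # step 3: rewrite the sale date as MM-DD-YYYY when one is present
        if cleaned.get(DATE_COLUMN) not in (None, "", "NA"):
            parsed = datetime.strptime(cleaned[DATE_COLUMN], SOURCE_DATE_FORMAT)
            cleaned[DATE_COLUMN] = parsed.strftime("%m-%d-%Y")
        yield cleaned


# Two made-up rows standing in for the downloaded CSV file.
sample = io.StringIO(
    "ADDRESS,APARTMENT NUMBER,SALE PRICE,SALE DATE\n"
    "1 EXAMPLE STREET,,500000,2017-07-19 00:00:00\n"
    "2 EXAMPLE AVENUE,4A, - ,2016-12-14 00:00:00\n"
)
cleaned_rows = list(clean_rows(csv.DictReader(sample)))
```

With a real file, the `io.StringIO` sample would be replaced by `open(path, newline="")`, and the cleaned rows could be written back out with `csv.DictWriter`.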
2. Import the data into Tableau
3. Visualize the data
The first step was to create a new sheet. In the beginning, I was not sure what I expected to see by visualizing the data, so I experimented with different variables to find out what was interesting. After some experiments, I found that it would be interesting to show the trend in total sales over a whole year in a line chart. However, the data had over 80,000 sale records, each with its own date, and it was impractical to plot every one of them individually. To make it workable, I grouped the dates into months. Similarly, for the When the Sold Properties Were Built chart, I grouped the construction years into 20-year bins (see fig. 3).
When choosing colors, I kept the color schemes of all four sheets in the same tone because the sheets are related. Inspired by the examples I found online, I used the sequential colors in the “custom sequential” palette to show quantitative variation. The gradient runs from low to high: the deeper the color, the more records it represents.
Labeling was also challenging in this project, especially in the Types of Sold Properties chart and the When the Sold Properties Were Built chart. Both charts have many categories on the horizontal axis, each with a long name. With the default settings, the long labels were squeezed together, and not all of them could be shown in full because of the limited space. To make sure all the information was displayed, I tried many ways of editing the horizontal axis and ended up rotating the aliases vertically so that no text is cut off. In addition, to make important data stand out, I marked the maximum and minimum records in the line chart and in one bar chart, so viewers can get that information more easily and quickly.
I also made some other adjustments to make the visualization clearer. For example, in the map chart, I minimized the background information so that the visualization stands out. In the Types of Sold Properties chart, I sorted the bars in descending order so that the bar with the most records always comes first.
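The grouping was done inside Tableau, but the idea can be sketched in a few lines of Python. The function names, the sample values, and the choice to start each 20-year bin on a multiple of 20 are assumptions for illustration. Tableau's own bins may start elsewhere.

```python
from collections import Counter
from datetime import datetime


def sales_per_month(sale_dates):
    """Count sales per (year, month) from MM-DD-YYYY date strings."""
    counts = Counter()
    for text in sale_dates:
        parsed = datetime.strptime(text, "%m-%d-%Y")
        counts[(parsed.year, parsed.month)] += 1
    return counts


def year_built_bin(year, width=20):
    """Return the label of the bin a construction year falls into."""
    start = (year // width) * width  # bins start on multiples of `width`
    return f"{start}-{start + width - 1}"


# Made-up sample values, only to show the shape of the result.
monthly = sales_per_month(["09-01-2016", "09-15-2016", "10-03-2016"])
built_bins = Counter(year_built_bin(y) for y in [1925, 1931, 1987])
```

Each of the 80,000+ records then contributes to one of twelve monthly totals instead of being its own point on the chart.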
4. Create a dashboard
After finishing the four charts, I arranged them in the dashboard. I deleted three unnecessary legends because their information was already clearly shown in the charts. In terms of format, I linked the four sheets by using one of them as a filter, and I chose the map chart as the filter sheet. When the viewer clicks on an area of the map, the other charts are filtered to that area. This makes the display more interactive.
5. Save to Tableau Public
Results
Click here to view the NYC Sold House Properties, Sep 2016-Aug 2017 on Tableau
I made four infographics in total to show different aspects of the topic. See the descriptions below.
1. The Sold House Properties in New York (the choropleth map)
This map shows the number of sold houses in different areas. The areas are divided by zip code.
2. Sold House Properties During a Year
This chart shows the number of houses sold in each month from Sep 2016 to Aug 2017. I chose a line chart to see whether there is a trend over the span of one year.
3. When the Sold Properties Were Built
This chart shows in which time periods the most frequently sold buildings were built.
4. Types of Sold Properties
I categorized the buildings by function to find which types of buildings were sold most often.
Findings:
1. In general, most sold properties were residential.
2. The types of sold houses varied by area.
In most areas with higher building density, like middle and lower Manhattan, condos and co-ops were the best-selling types. In areas farther from downtown with lower building density and larger lots, like Staten Island, one- and two-family dwellings sold better than condos and co-ops.
3. There was no seasonal variation in real estate sales during this time period.
Looking at the line chart, I found that, unlike some industries that have clear seasonal variation (e.g. farming, Halloween retailers), the real estate industry does not show such a pattern. If we pull out line charts for several different areas showing the sales volume of each month during the year (see Fig. 6), no month is consistently a peak or a valley. That suggests the sales volume was not driven by season or month; it simply fluctuated.
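The comparison behind the third finding was made by eye, but it could be written down as a small check: find the peak and valley month for each area and see whether they agree. The sketch below uses made-up counts for two hypothetical areas, not the real figures.

```python
def peak_and_valley(monthly_counts):
    """Return (peak_month, valley_month) for a dict of month -> sales count."""
    peak = max(monthly_counts, key=monthly_counts.get)
    valley = min(monthly_counts, key=monthly_counts.get)
    return peak, valley


# Made-up monthly counts for two hypothetical areas (not real data).
areas = {
    "area A": {"Sep": 40, "Oct": 55, "Nov": 38, "Dec": 47},
    "area B": {"Sep": 61, "Oct": 44, "Nov": 52, "Dec": 39},
}
peaks = {name: peak_and_valley(counts)[0] for name, counts in areas.items()}

# True only if every area peaks in the same month.
consistent_peak = len(set(peaks.values())) == 1
```

If the areas peak in different months, as they do in this sample, that fits the reading that no single month is always on top. A single year of data cannot rule out a seasonal pattern, though.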
Reflection
1. One concern came to mind when I looked at the choropleth map. At the bottom left, one area in Staten Island has the most sale records in NYC. It is also the largest area on the map. A house buyer viewing this map might take this area to be the most popular neighborhood for housing investment because it has the most sold houses. However, it is hard to tell whether the high count is due to the neighborhood’s popularity or to its size, because a larger area tends to contain more records. Settling this would need more data, and if I had time I would like to do further research. For the map as it stands, viewers who are not aware of this effect should be careful about the potentially misleading message.
2. Data is missing for several areas. I checked Google Maps afterward, and most of them are green spaces. One area with missing data is where LaGuardia Airport is located. However, the area where JFK is located does have data. I assume JFK shares a zip code with the adjacent area, so the two share data when filtering by area. To conclude, although zip codes are an easy way to divide areas, they are not always one hundred percent precise, which is worth keeping in mind.
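One possible way to follow up on the first concern, assuming land-area figures per zip code were available (they are not in my dataset), would be to map sales per square kilometer instead of raw counts. The sketch below uses made-up numbers for two hypothetical zones.

```python
def sales_density(sales_count, area_sq_km):
    """Return sales per square kilometer for one zone."""
    if area_sq_km <= 0:
        raise ValueError("area must be positive")
    return sales_count / area_sq_km


# Made-up (sales count, area in sq km) pairs; not real figures.
zones = {
    "large zone": (1200, 30.0),
    "small zone": (400, 2.0),
}
density = {name: sales_density(count, area) for name, (count, area) in zones.items()}
```

In this sample the large zone has three times the sales but only a fifth of the density, which is the kind of reversal a raw-count choropleth can hide.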