A visualisation of Housing Prices in California


Visualization

Introduction

This visualisation explores housing prices in the state of California. The dataset gives an insight into household income, housing prices, housing age and the location of the properties. It contains more than 20,000 entries and is fairly tidy. The idea behind this visualisation was to take proximity to the ocean, which marks out so-called premium properties, and plot it against median income and median housing age. I also wanted to explore property prices as a function of ocean proximity, and affordability as suggested by median income.

Dataset

https://www.kaggle.com/datasets/camnugent/california-housing-prices

The dataset comes from Kaggle, an open-data platform. It was fairly tidy and didn’t require much cleaning.

The data covers the following parameters:

1. longitude

2. latitude

3. housing_median_age

4. total_rooms

5. total_bedrooms

6. population

7. households

8. median_income

9. median_house_value

10. ocean_proximity

I also used OpenRefine to check the tidiness of the dataset.

Tools used

The tools used to make this visualisation were Excel, OpenRefine and R. The report was made and published using WordPress.

Process

Making this visualisation involved OpenRefine, Excel and R, and each tool served a separate purpose.

Research

The first phase of my process involved researching open-data sources that could provide this data. I then used Excel to go through the data manually and understand it, and OpenRefine to check whether any further cleaning was required.

OpenRefine - Data Cleaning

I used OpenRefine to clean the data. The data was already fairly clean, so I simply exported it as a CSV file.

R - Data Representation

The visualisations were done in R. I imported the tidyverse library and used read_csv to import my cleaned CSV. I then used the mutate command to add a few columns: one for housing price divided by 1,000, to express the values in thousands of dollars, and another to view median income in multiples of a million dollars.
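A minimal sketch of this import step might look like the following. It is not the original script: the function name load_housing, the file path, and the new column name house_value_k are assumptions made for illustration. Only the house-value rescaling is shown, because the post does not state the exact factor used for the income column.

```r
library(tidyverse)

# Read the cleaned CSV and add a rescaled house-value column.
# `path` and `house_value_k` are hypothetical names, not taken from the original script.
load_housing <- function(path) {
  read_csv(path) %>%
    mutate(house_value_k = median_house_value / 1000)  # dollars -> thousands of dollars
}

# Example call, assuming the OpenRefine export was saved as "housing.csv":
# housing <- load_housing("housing.csv")
```

Keeping the rescaled values in a separate column leaves the original median_house_value untouched, so either version can be plotted later.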

Visualisations & Observations

1. Geometric Point: I initially used geometric points to plot median housing age against median income, with the ocean proximity of the property as a colour aesthetic, to understand what type of housing a certain segment of society was able to afford. The result was very scattered, so I used a logarithmic scale to compress the spread of the data points.

Median income in millions for California vs median housing age, with colour representing the proximity of the house to the ocean

https://rpubs.com/snehganjoo/1018160

2. Column Data: The above plot was difficult to read, so I plotted the same data as a column chart, which made it easier to understand. I also used a column chart to plot ocean proximity against median income.

Median income in millions for California vs median housing age, with colour representing the proximity of the house to the ocean

https://rpubs.com/snehganjoo/1018155

https://rpubs.com/snehganjoo/1018158

3. Box Plot: Finally, I used a box plot to show the distribution of median income against housing cost, with ocean proximity used to reinforce the premium nature of the property.

Median income vs median house value for houses in California with property proximity to the ocean

https://rpubs.com/snehganjoo/1018157
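The three chart types described above can be sketched roughly as follows. The column names come from the dataset, but the function names, the exact aesthetic mappings, the transparency setting, and the choice to summarise income with a median per category are assumptions. The published charts on RPubs may have been built differently.

```r
library(tidyverse)

# 1. Scatter plot: housing age vs median income, coloured by ocean proximity.
#    scale_y_log10() puts income on a logarithmic axis to compress the spread.
plot_points <- function(housing) {
  ggplot(housing, aes(x = housing_median_age, y = median_income,
                      colour = ocean_proximity)) +
    geom_point(alpha = 0.4) +
    scale_y_log10()
}

# 2. Column chart: one column per ocean-proximity category.
#    Summarising with the median is an assumption made for this sketch.
plot_columns <- function(housing) {
  housing %>%
    group_by(ocean_proximity) %>%
    summarise(income = median(median_income), .groups = "drop") %>%
    ggplot(aes(x = ocean_proximity, y = income, fill = ocean_proximity)) +
    geom_col()
}

# 3. Box plot: distribution of house values within each category.
plot_boxes <- function(housing) {
  ggplot(housing, aes(x = ocean_proximity, y = median_house_value)) +
    geom_boxplot()
}

# Example calls, assuming `housing` holds the imported data:
# plot_points(housing)
# plot_columns(housing)
# plot_boxes(housing)
```

Each function returns a ggplot object, so further layers such as labs() or a theme can be added with + before printing.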

Reflection and Critique

Limitations

R has a learning curve, and I felt limited by my knowledge when it came to performing complex if-else operations. I would also have liked to plot a geographic visualisation of these properties from the longitude and latitude columns in the dataset, with housing cost as a factor, but my knowledge of R was not yet sufficient to do this.
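One possible starting point for such a geographic view is sketched below. It was not part of the original project. The function name plot_map, the point size and transparency, and the colour scale are assumptions chosen for illustration.

```r
library(tidyverse)

# Plot each block at its longitude/latitude and colour it by house value.
# coord_quickmap() applies an approximate map aspect ratio, so the outline
# of the state is not stretched.
plot_map <- function(housing) {
  ggplot(housing, aes(x = longitude, y = latitude,
                      colour = median_house_value)) +
    geom_point(size = 0.5, alpha = 0.5) +
    scale_colour_viridis_c() +
    coord_quickmap()
}

# Example call, assuming `housing` holds the imported data:
# plot_map(housing)
```

With one point per row of the dataset, the points alone should trace the rough shape of California, so no separate map layer is needed for a first look.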

Positives

R is an interesting tool for visualising data and can lead to a lot of deep insights through analysis and visualisation. It has a bit of a learning curve, but its ability to perform complex operations and render visualisations can be quite powerful.

Peer Critique and changes

My final chart had some issues with the scale and was difficult to understand. I tried to use an added column, but that didn’t work, so I ended up performing a divide operation inside the ggplot call to fix the scale.
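As a rough illustration of that fix, the division can be written directly inside aes(), so no extra column is needed. The function name, the divisor of 1,000, and the axis label are assumptions. The post does not state the exact expression used.

```r
library(tidyverse)

# Box plot with the house value rescaled inside the plot call itself.
# ggplot evaluates the expression in aes() against the data, so
# median_house_value / 1000 is computed on the fly.
plot_boxes_scaled <- function(housing) {
  ggplot(housing, aes(x = ocean_proximity, y = median_house_value / 1000)) +
    geom_boxplot() +
    labs(y = "Median house value (thousands of $)")
}

# Example call, assuming `housing` holds the imported data:
# plot_boxes_scaled(housing)
```

Setting the axis label explicitly matters here, because otherwise the axis title would show the raw expression median_house_value / 1000.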

Bibliography

California Housing Prices. (n.d.). Retrieved March 21, 2023, from https://www.kaggle.com/datasets/camnugent/california-housing-prices

RPubs – Week 7. (n.d.). Retrieved March 21, 2023, from https://rpubs.com/jladams/week_7