Introduction
This visualisation explores housing prices in the state of California. The dataset gives insight into household income, housing prices, age of residents and location of the properties. It contains 20,000+ entries and is fairly tidy. The idea behind this visualisation was to identify properties close to the ocean, the so-called premium properties, and to plot ocean proximity against median income and median resident age. I also wanted to explore property prices as a function of ocean proximity and of affordability, as suggested by median income.
Dataset
https://www.kaggle.com/datasets/camnugent/california-housing-prices
The dataset comes from Kaggle, an open-data platform. It was fairly tidy and didn’t require much cleaning.
The data covers the following parameters:
1. longitude
2. latitude
3. housing_median_age
4. total_rooms
5. total_bedrooms
6. population
7. households
8. median_income
9. median_house_value
10. ocean_proximity
I also used OpenRefine to check the tidiness of the dataset.
Tools used
The tools used to make this visualisation were Excel, OpenRefine and R. The report was written and published using WordPress.
Process
Making this visualisation required OpenRefine, Excel and R. Each tool served a separate purpose.
Research
The first phase of my process involved looking for open-data sources that could provide this data. I then used Excel to go through the data manually and understand it, and OpenRefine to check whether any further cleaning was required.
OpenRefine - Data Cleaning
I used OpenRefine to clean the data. The data was already fairly clean, so I simply exported it as a CSV.
R - Data Representation
The visualisations were done in R. I loaded the tidyverse library and used read_csv to import my cleaned CSV. I also added a few columns, such as housing price/1000, which expresses house values in thousands of dollars. I used the mutate command to generate another column that expresses median income in multiples of a million dollars.
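The import and derived-column step could look roughly like the sketch below. It is not the original script: the two inline sample rows stand in for the cleaned CSV so that the snippet runs on its own, and the column name house_value_k is a hypothetical choice. The income column would follow the same mutate pattern.

```r
library(tidyverse)

# Stand-in for the CSV exported from OpenRefine: two illustrative rows only.
# With the real file, the path to the cleaned CSV would be passed to read_csv().
sample_csv <- "housing_median_age,median_income,median_house_value,ocean_proximity
41,8.3252,452600,NEAR BAY
20,3.5,150000,INLAND
"

# I() tells read_csv to treat the string as literal data, not a file path.
housing <- read_csv(I(sample_csv), show_col_types = FALSE)

# Derived column (hypothetical name): house value in thousands of dollars.
housing <- housing %>%
  mutate(house_value_k = median_house_value / 1000)

print(housing)
```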
Visualisations & Observations
1. Geometric Point: I initially used geometric points to plot median household age vs median income, with the ocean proximity of the property as an aesthetic, to understand what type of housing a certain segment of society was able to afford. The result was very scattered, so I used a logarithmic scale to compress the spread of the data points.
https://rpubs.com/snehganjoo/1018160
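A minimal sketch of this kind of point plot is shown below. It assumes that the age variable is the housing_median_age column and that the logarithmic scale was applied to the income axis; the published chart may differ on both counts. The four rows are illustrative placeholders, not values from the dataset.

```r
library(tidyverse)

# Illustrative rows only; column names follow the Kaggle dataset.
housing <- tibble(
  housing_median_age = c(41, 20, 52, 15),
  median_income      = c(8.3, 3.5, 1.2, 12.0),
  ocean_proximity    = c("NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN")
)

p <- ggplot(housing, aes(x = housing_median_age,
                         y = median_income,
                         colour = ocean_proximity)) +
  geom_point(alpha = 0.5) +  # semi-transparent points reduce overplotting
  scale_y_log10()            # log scale on income (axis choice is an assumption)

print(p)
```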
2. Column Data: The scatter plot above was difficult to read, so I plotted the same data as a column chart, which made it easier to understand. I also used a column chart to plot ocean proximity vs median income.
https://rpubs.com/snehganjoo/1018155
https://rpubs.com/snehganjoo/1018158
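One way to build a column chart of ocean proximity vs median income is sketched below. Summarising each category by its median before plotting is an assumption made here, because geom_col draws the values it is given and would otherwise stack every row. The name typical_income and the sample rows are hypothetical.

```r
library(tidyverse)

# Illustrative rows only.
housing <- tibble(
  median_income   = c(2, 4, 6, 3, 5),
  ocean_proximity = c("INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR BAY")
)

# One summary value per ocean_proximity category (assumed aggregation: median).
income_by_proximity <- housing %>%
  group_by(ocean_proximity) %>%
  summarise(typical_income = median(median_income))

p <- ggplot(income_by_proximity,
            aes(x = ocean_proximity, y = typical_income)) +
  geom_col()

print(p)
```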
3. Box Plot: Finally, I used a box plot to show the distribution of median income vs housing cost, and used ocean proximity to reinforce the premium nature of the properties.
https://rpubs.com/snehganjoo/1018157
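A box plot along these lines might be set up as in the sketch below. The mapping used here (ocean proximity on the x axis, house value on the y axis) is an assumption about the layout, not a reproduction of the published chart, and the rows are illustrative.

```r
library(tidyverse)

# Illustrative rows only.
housing <- tibble(
  median_house_value = c(100000, 200000, 300000, 300000, 400000, 500000),
  ocean_proximity    = rep(c("INLAND", "NEAR OCEAN"), each = 3)
)

# One box per ocean_proximity category, showing the spread of house values.
p <- ggplot(housing, aes(x = ocean_proximity,
                         y = median_house_value,
                         fill = ocean_proximity)) +
  geom_boxplot()

print(p)
```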
Reflection and Critique
Limitations
R has a learning curve, and I felt limited by my knowledge when it came to writing more complex if/else logic. I would also have liked to plot these properties geographically, using the longitude and latitude columns in the dataset and showing housing cost on the map, but my knowledge of R was not yet sufficient to do this.
Positives
R is an interesting tool for visualising data and can lead to a lot of deep insights through analysis and visualisation. It has a bit of a learning curve, but its ability to perform complex operations and render visualisations can be quite powerful.
Peer Critique and changes
My final chart had some issues with the scale and was difficult to understand. I tried to use a newly added column to fix this, but that didn’t work, so I ended up performing the division inside the ggplot call to fix the scale.
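Dividing inside the plot call can be sketched as follows. This assumes the chart in question is a box plot of house values and that the divisor is 1000; both are guesses for illustration, as are the sample rows.

```r
library(tidyverse)

# Illustrative rows only.
housing <- tibble(
  median_house_value = c(100000, 200000, 300000, 300000, 400000, 500000),
  ocean_proximity    = rep(c("INLAND", "NEAR OCEAN"), each = 3)
)

# The division happens inside aes(), so no extra column is needed
# and the axis shows thousands of dollars instead of raw dollars.
p <- ggplot(housing, aes(x = ocean_proximity,
                         y = median_house_value / 1000)) +
  geom_boxplot() +
  labs(y = "Median house value ($ thousands)")

print(p)
```

Because aes() evaluates expressions against the data frame, the rescaling stays local to the chart and the underlying data is left unchanged.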
Bibliography
California Housing Prices. (n.d.). Retrieved March 21, 2023, from https://www.kaggle.com/datasets/camnugent/california-housing-prices
RPubs – Week 7. (n.d.). Retrieved March 21, 2023, from https://rpubs.com/jladams/week_7