Introduction and Inspiration:
To find a dataset large enough to work for this lab, I searched the NYC OpenData website. At first, I was amazed at the amount of public data that is available and it took me quite some time to find even just a department that I’d be interested. Eventually, I decided on looking at datasets from the Department of Health and Mental Hygiene (DOHMH) thinking that I would find something exciting. I stumbled upon a dataset that was incredibly relevant to me at the moment – popular baby names.
Currently, a close friend of mine is pregnant with her first child and soon to give birth to a baby boy. This dataset contains some of the most popular baby name by sex and ethnic group in NYC from 2011 to 2016. I downloaded the CSV file and took a quick peek at the information that was inside and thought this would be a great dataset to visualize. The data in the file recorded baby names and ranked them by frequency. I chose to focus specifically on male birth in 2016 due to the fact that it is the most recent year in that dataset and because there were more male births that year than female. Additionally, the fact that my friend is giving birth to a boy helped sway my opinion.
Method:
The first step in creating my visualization was to understand the data that was in the CSV file. This was simple to accomplish due to the fact that NYC OpenData website gave a clear outline of the column names, description of each column, and the type of value for data in each column. The next step was to clean to the data and check for any missing values. The dataset contains 19.4k records but only consisted of six columns. I simply saved the CSV file as and Excel document and wrote COUNT and COUNTIF statements to check for any missing data. However, there were no missing cells.
I used a COUNT formula to count the number of cells that contain numbers which returned 19418 and a COUNTIF formula to count the number of cells that contains text. In my COUNTIF formula, I used “*” as a wildcard for the criteria to count any text in the cell, instead of nesting multiple COUNTIF statements for every possible text value. Since all the formulas returned a value of 19418, I assumed that the dataset was clean.
Next, I uploaded my original CSV file into Tableau Public. This is a strong visualization software that enables users to create charts and graphs to help visualize data. Initially, I intended on displaying the top ten baby names across all ethnicities in the year 2016. However, after scanning through the data and creating a simple bar chart displaying the top ten, I felt it would be more interesting to break out the baby names by ethnicity, similar to how it was originally recorded. To do so, I created four bar charts which broke out the four ethnicities in the data set: Asian and Pacific Islander, Black, Hispanic, and White. I used the Child’s first name as the column and the sum of the count of time that name appeared in the dataset. Then I added filters on Ethnicity, Rank, Gender, and Year of Birth. Filtering on ethnicity made it easy to create duplicate charts for all four ethnicities. The purpose of adding the rank filter was to only display the top 5 names.
After breaking out the top five names by ethnicity, I decided on displaying the total number of male births by ethnicity for the year 2016. This was done to show how each ethnicity compares relative to each other.
Creating this graph was simple. I used the sum of count of times the name appeared in the data in the columns and ethnicity in the rows. Then I added filters on gender and year of birth.
A big part of the reason that I chose to visualize male births in 2016 was due to the fact that my friend is giving birth to a boy. However, I was interested in the percentage of male to female birth that year, so I created a simple pie chart to visualize this information.
There were more male births than female births that year, which was something unexpected.
Finally, since some names overlapped between ethnicities, I thought it would be interesting to see to what all the names would look like by ethnicity, compared to each other. To do so the charts that best fit this was a heat map.
This was done by placing ethnicity in the column and child’s first name in the row. Then, adding filters on ethnicity, ranking for top five, gender, and year of birth. Finally, to get the color and count correct, I placed the sum of count of times the name appeared in the data on the colors box in the Marks section. This was to give the count of frequency.
Lastly I placed all the charts and graphs together and created a dashboard which can be found here.
Reflection:
I thoroughly enjoyed this lab assignment. Having the freedom to work with any dataset that I choose was very exciting. It was also exciting to be able to find a dataset that was relevant to me. However, there are two issues that I have with the dataset that I worked with. My main concern is that I would like to know how they defined these ethnicities and chose these select four. Specifically, I was concerned with how babies of two ethnicities are counted. Were they counted individually or both groups? Also, what about small groups such as Native American? My other concern was simply just the curiosity as to why this dataset stop recording at 2016 but it probably only gets updated every few years.
Moving forward, I think that this dashboard would be more engaging if users were able to filter through the years of by gender. Perhaps giving users autonomy to examine a dataset will be implemented in my next project.
Materials:
NYC OpenData Website
Tableau Public
Microsoft Excel