We use the internet every day. It has become a necessity of daily life, and many people consider it a utility like water, electricity, and gas. But do you know how many households in the US do not have internet access? Who are these people, and why don't they have it? To answer these questions, this project analyzes and visualizes data on people in the US who live without the internet.
I started by looking for research related to internet usage. A study by Klaus Ackermann at the University of Chicago caught my eye. In this research, observations of internet activity show how global sleep patterns are changing. The researchers have studied global internet usage for years, trying to find out how internet connectivity grows and eventually becomes saturated in societies all over the world. The diagram (fig. 1) shows internet use in 122 countries every 15 minutes between 2006 and 2012. Although its lines are dizzying, the chart shows a slowing growth trend, and I think it succeeds in showing the trend over a six-year span.
I also looked for scatterplot examples, because my dataset contains many numeric values and lends itself to scatterplots. Fig. 2 is a scatterplot of IMDB ratings. I think the style of this chart is successful. The color of the dots works well, and I love that the dots are transparent, so you can still see the information behind them. For my own visualization, I wanted to use a similar design strategy.
Kaggle – an open data resource
OpenRefine – a data cleaning tool
Tableau – software for creating data visualizations
To get started, I found a dataset on Kaggle named “People without Internet”. The U.S. Census Bureau began asking about internet use in the American Community Survey (ACS) in 2013, as part of the 2008 Broadband Data Improvement Act, and has published a 1-year estimate each year since 2013. However, only the 2016 data is available online for public use.
This dataset contains data for a total of 651 counties with populations over 65,000, compiled from the 2016 ACS 1-year estimate. ACS 1-year estimates only summarize data for geographic areas with populations over 65,000. Because such a large population is covered, the dataset is sufficient for gaining insight into internet use. Here is the list of columns:
To clean the dataset, I imported the data into OpenRefine. There were a lot of missing values, which were shown as “null”. To get rid of them, I changed all “null” values to “0”.
Then I imported the data into Tableau for further analysis.
After trying different approaches, I found that the share of the population without internet is mostly affected by three factors: ethnicity, household income, and education level. The findings are shown below.
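The same replacement can be sketched in a few lines of Python. This is only an illustration of the step described above, not the OpenRefine workflow itself; the column names and sample rows are made up for the example.

```python
import csv
import io


def replace_nulls(rows, placeholder="null", fill="0"):
    """Return copies of the rows with every placeholder cell replaced by `fill`."""
    return [
        {key: (fill if value == placeholder else value) for key, value in row.items()}
        for row in rows
    ]


# Hypothetical sample data; the real file has many more columns and rows.
sample_csv = (
    "county,median_income,percent_no_internet\n"
    "A,52000,12.5\n"
    "B,null,20.1\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
cleaned = replace_nulls(rows)
```

After this step a filled “0” looks the same as a value that was really zero, so it can be useful to keep the original file for reference.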
Findings related to ethnicity
The US is a nation of immigrants, and people of many races live together on this land. People of different ethnicities may have different ways of living, including how they connect to each other and to the world. This diversity made me wonder about the relationship between people’s internet usage and their ethnicity.
The dataset includes the number of people of each race in each county, but the data has not been normalized. Normalizing the race data is necessary because every county has a different population. So I calculated the percentage of people of each race in Tableau and used the normalized data for the analysis (fig. 4).
As the infographic above shows, there is no clear correlation or causation between race and the percentage of people without internet, but there is one noticeable outlier worth discussing. Look at the point with the highest percentage of people without internet (shown as a red dot). In the Asian, Black, Hawaiian, and Other panels, it always sits right next to the vertical axis, which means that almost no people of these four races live in this county. In the Native panel, however, the same point moves horizontally from almost 0% to over 70%. This large difference indicates that over 70% of the residents of this county are Native. Since this county has the highest percentage of people without internet and the majority of its residents are Native, my assumption is that Native people are less likely to use the internet than people of other races, and that network coverage is probably lower in places where Native people are concentrated than in places where other races are concentrated.
In conclusion, race is worth taking into consideration when analyzing internet use. When the overall pattern doesn’t show a clear trend or causation, take a closer look at the outliers and you may find something surprising. With the outlier I observed in the Native panel, I was surprised by the large Native population and the low internet saturation in that county. It is a valuable finding.
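As a rough sketch of this normalization outside Tableau, the snippet below converts raw counts into percentages. It assumes each percentage is the group count divided by the sum of all group counts in the county; the actual Tableau calculation may have used a separate total population column instead, and the group names and numbers here are invented.

```python
def race_percentages(counts):
    """Convert raw counts per group into percentages of the county total."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an empty county record.
        return {group: 0.0 for group in counts}
    return {group: 100.0 * n / total for group, n in counts.items()}


# Hypothetical county with made-up counts.
county = {"white": 30000, "black": 10000, "native": 50000, "asian": 10000}
shares = race_percentages(county)
```

With percentages instead of counts, a county of 70,000 people and a county of 2 million people can be placed on the same axis.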
Findings related to income
As a student in New York, I can’t imagine my life without the internet, and I am happy to pay for it. However, as a paid service, the internet is probably not affordable for everyone in the US. So I started to wonder whether people’s income affects their internet usage. To answer this question, I analyzed the relationship between internet use and household income (fig. 5) and added a logarithmic trend line to show the correlation between the two. As the median household income of a county increases, the percentage of people without internet in that county decreases. The percentage of people without internet therefore appears to be closely tied to household income.
I also did a related analysis of the percentage of people below the poverty line and the percentage of people without internet (fig. 6). It clearly shows a correlation. In addition, all the dots are colored by their Gini index (also called the Gini coefficient). I found that counties with a lower Gini index usually also have lower internet occupancy.
To conclude, household income and financial status are important factors when analyzing residents’ internet usage. Internet saturation is affected by people’s income level.
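For readers curious about what a logarithmic trend line does, here is a minimal sketch. It assumes the model y = a + b * ln(x) fitted by ordinary least squares, which is my understanding of this kind of trend line rather than a confirmed description of Tableau’s internals. The income and percentage values are invented for illustration and are not taken from the dataset.

```python
import math


def fit_log_trend(xs, ys):
    """Least-squares fit of y = a + b * ln(x); returns (a, b)."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mean_log = sum(logs) / n
    mean_y = sum(ys) / n
    # Standard simple linear regression, using ln(x) as the predictor.
    sxx = sum((l - mean_log) ** 2 for l in logs)
    sxy = sum((l - mean_log) * (y - mean_y) for l, y in zip(logs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_log
    return a, b


# Made-up example: median household income vs. percent without internet.
incomes = [30000, 45000, 60000, 90000, 120000]
no_internet = [40.0, 28.0, 20.0, 12.0, 6.0]
a, b = fit_log_trend(incomes, no_internet)
```

A negative `b` means the percentage without internet falls as income rises, and the logarithm makes the drop steeper at low incomes than at high ones.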
Findings related to education level
Based on my experience, the internet is widely used across every age group and education level. From primary school to graduate school, students use the internet at almost every stage of study. It is certainly possible that more educated people have greater internet needs, but someone who stops studying at a certain level still uses the internet in daily life. So, at the beginning, I didn’t believe that education level was related to whether people use the internet. Surprisingly, after I made fig. 7, I found that there is a relationship.
This infographic shows the relationship between residents’ education level and the percentage of people without internet. In the original dataset, education level is divided into five groups: below middle school, some high school, high school or equivalent, some college, and bachelor’s degree and above. Based on that, I generated five small multiples, added a logarithmic trend line to each graph to make the trend clearer, and arranged the graphs from left to right in order of education level, from low to high.
As fig. 7 shows, all five graphs show a correlation, but each has a different trend. The first three graphs have increasing trend lines. Taken together, they show that the higher the percentage of people with a high school education or below in a county, the more likely it is that people living in that county don’t have internet. In the fourth graph, the trend line is almost vertical. In the last graph, which shows people with the highest education level, the trend line is decreasing. This means that counties with a larger well-educated population tend to have a lower percentage of people without internet.
Therefore, as the five graphs show, internet usage differs with people’s education level.
To summarize, this is an interesting finding that I didn’t expect, and it goes against my assumption. If I get the chance, I want to dig deeper into this topic.
Sampling data analysis
In the UX research, one interviewee with no data analysis background said that the scatterplot was not easy to understand at first glance. That made me think about how well the audience would accept a visualization that is too difficult to comprehend, and whether there is a type of visualization that everyone can understand easily. Since bar charts and line charts are the ones we see most often in daily life, I tried visualizing the data with them. However, showing the overall US pattern would mean including data for all 651 counties in one chart, which is not feasible. So I decided to pick some specific samples instead. To select them, I first sorted all 651 counties from high to low by the percentage of people without internet (a section of that sheet is shown in fig. 8). Then I added a median line marking the median percentage of people without internet. Finally, I picked 5 sample counties out of the 651: the 2 with the highest percentage of people without internet, 1 with the median percentage, and the 2 with the lowest percentage. The five counties are listed below:
– Apache County, AZ, 54.01%
– McKinley County, NM, 46.81%
– Cowlitz County, WA, 14.7%
– Loudoun County, VA, 3.81%
– Douglas County, CO, 2.66%
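The selection rule described above (two highest, one median, two lowest) can be sketched as follows. The function name and the sample counties are hypothetical; only the rule itself comes from the project.

```python
def pick_samples(counties):
    """Pick the two highest, the median, and the two lowest counties.

    `counties` is a list of (name, percent_without_internet) tuples.
    """
    if len(counties) < 5:
        raise ValueError("need at least five counties to pick five samples")
    # Sort from the highest to the lowest percentage without internet.
    ranked = sorted(counties, key=lambda county: county[1], reverse=True)
    # For an odd number of counties this is the exact median entry.
    median = ranked[len(ranked) // 2]
    return ranked[:2] + [median] + ranked[-2:]


# Made-up counties, deliberately unsorted.
example = [
    ("County D", 15.0),
    ("County A", 50.0),
    ("County G", 3.0),
    ("County B", 45.0),
    ("County E", 8.0),
    ("County C", 30.0),
    ("County F", 4.0),
]
samples = pick_samples(example)
```

Because 651 is an odd number, the middle entry of the sorted list is a real county rather than an average of two neighbors.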
After that, I made a bar chart of these five counties. At first, I labeled the columns with the county names instead of the Geoid. But during the UX research, one interviewee pointed out that several counties can share the same name, so using the county name as the column label might confuse people. I followed his suggestion and changed the county names to the Geoid, because the Geoid is unique to every county.
The bar chart shows the five counties’ percentage of people without internet, household income, and percentage of people with a bachelor’s degree and above. To show the trend clearly, a line graph accompanies it. The results are similar to those of the previous analysis, but they are arranged in a form that is easier to understand.
To conclude, this project gave me insight into people’s internet usage in the US. Some findings were surprising and impressive, and I really enjoyed it. But the project isn’t perfect. If I had more time, here are some revisions that I would make. Most of them are related to data preparation.
First, data changes over time. Compared with data collected in 2016, a more recent dataset would be more convincing. In fact, the data has been collected and updated every year, but I couldn’t find access to the more recent releases. If I had time to revise my project, I would spend more of it looking for a more recent dataset. It would also be interesting to compare datasets collected in different years, which might reveal trends in internet usage over the years or other noticeable findings.
Secondly, strictly speaking, the current dataset can’t reflect the situation of the whole country. The dataset that I used only has records for counties with populations over 65,000, so the data for smaller counties is missing. Although the current dataset covers the majority of the population, the missing data made me wonder whether something unique is happening in those smaller counties that has not been recorded. To be rigorous, an investigation of the whole country should include data for all counties in the US.
Thirdly, for a dataset meant for analyzing people’s internet use, I expected more variables related to the internet itself. The main body of this dataset is census related, and only one column is related to the internet. I appreciate the number of census variables, but it would be more interesting if the dataset included other factors related to the internet itself, such as internet cable coverage or signal towers. In fact, at the beginning of data preparation, I tried to combine other internet datasets with the current one, but they didn’t match for many reasons. So I gave up and used only the current one for the visualization.