Lung & Bronchus Cancer and Air Quality in the U.S. from 2010 to 2014 (A Tableau Public Dashboard)



Introduction  

Dashboard creation is important for organizations and companies looking to stay on track with their data analytics. A convenient collection of graphs that visualizes essential data can save a development or finance team a lot of time and money down the road. It can also be vital to public health research when trying to pinpoint regions of crisis or resource needs.

In this report, I present a dashboard created in Tableau Public of graphs that depict lung and bronchus cancer incidence rates alongside air quality index records from 2010 to 2014. This visualization was created in order to identify major trends in both the lung & bronchus cancer and air quality data across time and regions and to see if there is any correlation between trends in the two kinds of data. All of the data used in the report was collected from federal government open source data collections. Remote user testing was conducted to reveal opportunities for improvement with the visualizations and also helped to expose some problems with the data itself. A gap in air quality index findings due to a lack of collection sites hinders a user’s ability to find a strong connection between the cancer and air quality graphs. When paired with smoker population density data, to factor out cancer incidences caused by that variable, this dashboard may help to reveal even more about the connection between air quality and lung and bronchus cancers. All resulting graphs with notable data are included in a link and pictures in the Results section.

 

Materials

  1. Tableau Public: free data visualization software available for download at https://public.tableau.com/en-us/s/.
  2. Centers for Disease Control and Prevention (CDC) United States Cancer Statistics (USCS): Source of lung and bronchus cancer incidence rates data used in this report. This database houses publicly available U.S. cancer statistics provided by the CDC. Lung and bronchus cancer data sets used for this study can be found at https://nccd.cdc.gov/uscs/cancersbystateandregion.aspx.
  3. United States Environmental Protection Agency (EPA) Air Data: Source of Air Quality Index (AQI) data used in this report. This page includes air quality data sets and information provided by the EPA. The data sets used for this study can be found in the “Pre-Generated Data Files” section at https://aqs.epa.gov/aqsweb/airdata/download_files.html.
  4. AirNow: Source of the AQI range chart used in my pre-user testing dashboard. This site, powered by the EPA, provides real-time AQI data in maps while also delivering air quality health guides and other related information. The AQI range chart used in my dashboard draft can be found in the “Air Quality Index (AQI) Basics” section at https://airnow.gov/index.cfm?action=aqibasics.aqi.
  5. MeckNC.gov: Source of the AQI range chart used in my final dashboard version. This site houses regional information for Mecklenburg County, NC provided by the county government. The AQI range chart used in my final visualization can be found in the “Air Quality Index” section of the site at https://www.mecknc.gov/LUESA/AirQuality/EducationandOutreach/Pages/aqi.aspx.
  6. UserTesting: Service used for two of the three user tests conducted in this study. This site provides an online remote user testing platform for companies, nonprofits, universities, and individuals. This site records metrics and videos of users depending on the features that the test creators purchase. UserTesting can be accessed at https://www.usertesting.com/.

 

Methods

Data Collection & Cleanup

Data sets for lung and bronchus cancer incidence rates by state were collected from the CDC’s USCS site, and data sets for air quality index values by county were collected from the EPA’s Air Data site. These data sets were all restricted to the years 2010 to 2014 (I wanted a 5-year span and 2014 was as recent as I could go) based on what was provided by these U.S. federal agencies. I then researched the process for age-adjusting rates for health information, as well as how the AQI is calculated and used, to better understand my data sets. Before starting the visualization process, I collected three visualizations to guide my dashboard design: a bar graph, a line chart, and a map, reflecting the three types of graphs that I intended to use in my resulting dashboard.
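To make the idea of an age-adjusted rate more concrete, here is a minimal Python sketch of the direct method: each age group's rate is weighted by that group's share of a standard population, and the weighted rates are summed. The function name, the three age groups, and every number below are hypothetical placeholders that I made up for illustration; they are not CDC or census figures, and the CDC's own procedure may differ in its details.

```python
# Direct age adjustment: weight each age group's rate by that group's
# share of a standard population, then sum the weighted rates.
# All numbers are hypothetical placeholders, not CDC or census figures.

def age_adjusted_rate(cases, population, standard_population, per=100_000):
    """Return the directly age-adjusted rate per `per` people."""
    if not (len(cases) == len(population) == len(standard_population)):
        raise ValueError("all inputs need one entry per age group")
    standard_total = sum(standard_population)
    adjusted = 0.0
    for c, p, s in zip(cases, population, standard_population):
        age_specific_rate = c / p          # rate within this age group
        weight = s / standard_total        # group's share of the standard population
        adjusted += age_specific_rate * weight
    return adjusted * per


# Three hypothetical age groups (young, middle, old) for one state.
cases = [10, 200, 900]
population = [500_000, 400_000, 100_000]
standard_population = [600, 300, 100]      # relative sizes in the standard population

rate = age_adjusted_rate(cases, population, standard_population)
crude = sum(cases) / sum(population) * 100_000
print(round(rate, 1), round(crude, 1))
```

In this made-up example the adjusted rate comes out lower than the crude rate because the standard population has a smaller share of people in the oldest group than the example state does, which is the kind of difference age adjustment is meant to remove when states are compared.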

Pre-Design Research

In this section, I will briefly discuss some design inspirations for the graphs that I created for my resulting dashboard.

Example Visualization 1

The visualization below depicts a bar graph of breast cancer deaths of Canadian women in 2016 by age group:

(Source: http://breast-cancer.ca/httpbreast-cancer-camortratings/)

As a simple tool, this bar graph gets the job done: it provides easily accessible data that a layperson can acquire by looking at it for a brief moment. However, the graph itself doesn’t seem necessary when each individual bar has its value right above it, which is great for identifying precise points of data but distracts from the visualization itself. Also, the use of different colors for each bar is unnecessary since they do not signify any difference for each bar. In the bar graphs that I created for my dashboard, I intended to have my bars be interactive, so that precise data could be determined about each bar just by hovering over them rather than having their values pasted above them. I also intended to use a single color for the bars unless I incorporated a color gradient that signifies something important that differentiates them.

Example Visualization 2

This next graph, also used as an inspiration for an earlier lab report of mine, compares yellow taxi, green taxi, and Uber pickups in Brooklyn from January 2014 to July 2015:

(Source: http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/)

This graph has three different lines with clearly distinct colors that are not squished tightly together. This resulting image allows for the easy identification and comparison of data points on each line, which is what I intended for the line graph that I created for my dashboard. I made sure to use distinct, but somewhat similar colors to stay within the color scheme used for the type of data in my line graph that is depicted in other graphs on my dashboard.

Example Visualization 3

The next map, which also guided my visualization of a previous mapping lab of mine, displays walkability determinations across 2010 census tracts in New York City:

(Source: https://beh.columbia.edu/neighborhood-walkability/)

This map excels in its overall simplicity and in the understandability of its legend. A clear color gradient that utilizes varying intensities of orange makes it easy to spot the geographic clusters of high and low walkability index locations among the census tracts. This color gradient choice makes the bins in the legend easy to discern. I intended to reproduce similar color gradients, using shades of purple and mustard green rather than orange, to represent the different values on the maps in my resulting dashboard. However, I aimed to use a lighter background map without region names, since my maps would include a hover effect that displays details of specific areas. Furthermore, my maps are limited to the United States, which would be indicated in the map titles and would be incorporated into my target user profile, so no location names would be necessary on my resulting maps.

Visualization Creation

After collecting my data and reviewing the graphs that I discussed in the previous section, I uploaded the lung & bronchus cancer incidence and AQI data files into Tableau Public and began work on the dashboard. I set out to create two maps that were adjustable by year for the cancer and AQI rates, two bar graphs with the top ten cancer and AQI rates from all years combined (i.e. 2010 through 2014), a bar graph of cancer incidence rates by sub-region (also adjustable by year), and a line graph of cancer incidence rates for major regions showing yearly changes. Within Tableau Public, I excluded counties with under 100 AQI measurements per year from my AQI map, and I excluded counties with under 300 AQI measurements per year from my AQI top 10 bar graph, to ensure that extreme, but infrequent, measurements did not skew any user’s interpretation of these graphs. I also excluded data from the U.S. Virgin Islands, Puerto Rico, and the country of Mexico that were provided with some of my data files, since they were either beyond the scope of this study (as was the case with the Mexico data) or were not consistent between the data files (as was the case with the Virgin Islands and Puerto Rico data).
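I applied these filters inside Tableau Public, but the same logic can be sketched in Python for readers who want to reproduce it outside of Tableau. The sketch below is only an illustration under stated assumptions: the field names (state, county, year, days_with_aqi, median_aqi), the excluded-area labels, and all of the records are hypothetical, and it assumes that a county is dropped entirely if any of its years falls below the measurement threshold, which may not match exactly how the Tableau filter behaves.

```python
from collections import defaultdict

# Hypothetical records, one per county per year. Field names and values are
# illustrative assumptions, not the actual EPA file layout or real data.
rows = [
    {"state": "State A", "county": "County X", "year": 2010, "days_with_aqi": 365, "median_aqi": 60},
    {"state": "State A", "county": "County X", "year": 2011, "days_with_aqi": 360, "median_aqi": 70},
    {"state": "State A", "county": "County Y", "year": 2010, "days_with_aqi": 120, "median_aqi": 95},
    {"state": "State A", "county": "County Y", "year": 2011, "days_with_aqi": 350, "median_aqi": 90},
    {"state": "State B", "county": "County Z", "year": 2010, "days_with_aqi": 310, "median_aqi": 40},
    {"state": "State B", "county": "County Z", "year": 2011, "days_with_aqi": 330, "median_aqi": 50},
    {"state": "Puerto Rico", "county": "County P", "year": 2010, "days_with_aqi": 365, "median_aqi": 80},
]

# Areas left out of the study (labels here are assumed spellings).
EXCLUDED_AREAS = {"Puerto Rico", "Virgin Islands", "Mexico"}


def worst_counties(rows, min_days=300, top_n=10):
    """Rank counties by average median AQI, highest (worst) first."""
    by_county = defaultdict(list)
    for r in rows:
        if r["state"] in EXCLUDED_AREAS:
            continue
        by_county[(r["state"], r["county"])].append(r)

    averages = []
    for key, records in by_county.items():
        # Assumption: drop the county if ANY year has too few measurements.
        if any(rec["days_with_aqi"] < min_days for rec in records):
            continue
        avg = sum(rec["median_aqi"] for rec in records) / len(records)
        averages.append((key, avg))

    averages.sort(key=lambda item: item[1], reverse=True)
    return averages[:top_n]


ranking = worst_counties(rows, min_days=300, top_n=2)
for (state, county), avg in ranking:
    print(state, county, avg)
```

Here the hypothetical County Y has the highest median AQI values, but it is left out of the ranking because one of its years has only 120 days of measurements, which is the kind of infrequent, extreme reading that the thresholds are meant to keep out of the graphs.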

From my understanding of the data, my design inspirations, and the tools available in Tableau Public, I made targeted decisions about color, graph types, labeling, and legends. Since I imported two types of data onto my dashboard – lung & bronchus cancer rates and air quality index (AQI) values – I made sure to include two main color palettes. A blue-tinted purple was used for the cancer data and a mustard green was used for the AQI data. I picked these two color palettes since they made differences in my bin values easier to see than the other single-color gradients available in Tableau Public. I unfortunately was not able to come up with a more distinct color palette when customizing the bin colors in my legend: Tableau Public only allows for the most extreme ends of a scale to be customized in a color gradient, and when two different colors are picked manually, the middle of the scale remains neutral. Importing a screenshot from a more advanced mapping program such as Carto may fix this problem, but for now, I chose the best option available in Tableau Public for my map colors.

My background map images were adjusted so that land was grayed out and country names and borders were removed. Since my target users are U.S. residents, neither country labels nor state labels are necessary. I included a year-adjustment tool on the maps as well as an AQI reference chart to aid user interpretation of the maps. Graphs below the maps were added based on what I expected users would want from a study such as this. When measuring something seen as bad, such as the cancer and AQI rates (which measure air contaminants) that were examined in this study, a user would likely want to see ‘the worst of the worst.’ To account for this, I added bar charts for the top ten worst states and counties with regard to lung & bronchus cancer incidence rates and average median AQI measurements for all five years, 2010 through 2014, combined. Below these two graphs, I included an adjustable bar graph by sub-region (which was later removed in the dashboard’s final version after user testing) and a line graph for the yearly change in lung and bronchus cancer incidence rates for major regions. I wanted to include a similar line graph for AQI measurements, but the data provided by the EPA was too inconsistent from year to year (testing sites and numbers of days with measurements changed) to make a sufficiently accurate graph.

User Testing

After completing my initial dashboard, I conducted a round of user testing. The dashboard that I used for my test sessions* can be found through the following link, with a static image provided below it:

https://public.tableau.com/views/LungBronchusCancerandAirQualityintheU_S_from2010to2014/LungBronchusCancerAQIDashboard?:embed=y&:display_count=yes

*Please note that the final, updated version of my dashboard is located in the following section of this report. The link and image provided here are of the draft that I used for my user testing.

For my user testing, I conducted two unmoderated remote user tests through UserTesting.com and one moderated remote user test over telephone. All users lived within the United States (since my data was limited to this country) and were 25 years of age or older. This age range was chosen because UserTesting.com does not include education level as a user demographic option, but it does allow for 18+ or 25+ as age options (without an available minimum age in between those), and I wanted users that were beyond the typical age of undergraduate students. UserTesting.com found the unmoderated test participants through their service using the age and location requirements that I provided, and I recruited the moderated test participant myself from my social network in New York City.

For the unmoderated tests, the users’ completion of tasks was recorded digitally and post-test questions were answered with typed responses. For the moderated test, I took notes while the user completed all tasks and post-test questions. Each user test used the following format:

Introduction

You will be looking at a dashboard of graphs depicting data about lung & bronchus cancer rates and air quality in the U.S. from 2010 to 2014. Please answer the questions that follow to the best of your ability and share your answers out loud.

Tasks

  1. For a brief moment (just a few seconds), please explore the dashboard freely. When you feel that you have a general idea about what you can find on this page, you may move onto the next task.
  2. Find your state on the cancer map. What was the lung & bronchus cancer rate in your state in 2012?
  3. Using the dashboard, please find the county that had the 6th highest average median AQI in 2010-2014. What is this county’s name and what is its average median AQI?
  4. Do you think lung & bronchus cancer rates are increasing or decreasing overall based on what you see on this dashboard?

Post-Test Questions

  1. What frustrated you most about this dashboard?
  2. If you had a magic wand, how would you improve this dashboard?
  3. What did you like about this dashboard?
  4. If you have any additional thoughts about this dashboard or the tasks, please share them here.

 

Data collected from these tests was analyzed by me and helped me to make important edits to my final visualization. These user test results and design edits will be presented and explored in the section that follows.

 

Results

User Testing Results

All three of the user test participants were able to complete each task successfully and finish the entire test in under ten minutes. From these sessions, I identified the following four complaints that all three users had with my original dashboard:

  • X-axis and legend titles that include “per 100K” need the word “people” afterwards to explain the unit of measurement more clearly.
  • The bar graph of sub-regions, which included “New England” and “Middle Atlantic,” was unclear and unnecessary. Users could not identify easily which states are in each sub-region, and the presence of the graph disrupted the symmetry of the dashboard because a line graph was included to its right rather than another bar graph.
  • The original AQI chart was too big and colorful. The colors did not match the colors of the map underneath it and it blocked a large portion of the map whenever a user zoomed in on a region.
  • Users either desired more context for the data (additional text explaining the data) or thought that the dashboard should be intended for an expert audience. None of the users believed that a layperson would be able to understand the graphs.

Two users identified the following problem:

  • The “Top 10” graphs did not clearly state if they were the “best” or the “worst” of the states and counties available.

All three users praised the following two elements of the dashboard:

  • The interactivity of the maps and graphs: the adjustment of the maps by year and a hover effect that provided clear feedback concerning area and point data of the item underneath the user’s cursor were highly regarded. All three users expressed surprise and enjoyment over these features.
  • All users expressed interest in the visualized data and two users said that the information was “very interesting.”

Two users shared praise concerning the following:

  • The unmoderated users said that the graphs were “clear” and “easy to understand” for them. This seemed to contradict their points about adding more context, but the users in this test did not have trouble themselves with the interpretation of the graphs.

From these insights, I made the following four adjustments to my dashboard:

  1. The word “People” was added to x-axis and legend titles after “100K.”
  2. The bar graph of sub-regions, which included “New England” and “Middle Atlantic,” was removed from my dashboard, and the line graph to its right was centered to ensure symmetry.
  3. The original AQI chart was removed and replaced with a smaller version that was free of distracting colors.
  4. The two “Top 10” graphs had their titles changed to include “Worst 10” in them. While I originally wanted to avoid titles that indicated a value judgement, the majority of my test users desired a word in the graph titles that indicated such a judgement so that they could clearly know if the top 10 items were the “best” or the “worst” of the data set.

I chose not to include more text in my visualization that would provide context for users since all of my test users said that they were able to easily interpret my graphs. However, I believe it would be best to target this visualization towards a population that has at least an undergraduate degree (or is currently pursuing one) in order to ensure the presence of some critical graph analysis skills in users that would aid in the interpretation of this dashboard. This might mean publication in certain media that target such a demographic or that the dashboard should be presented at a health conference or an event at a school.

Final Version

After making the above edits based on my user testing, I completed the final version of my dashboard. Please click the link below to view this visualization (I provided an expandable image* of the dashboard below it for easy reference for cases when the interactive version is not accessible):

https://public.tableau.com/views/LungBronchusCancerandAirQualityintheUnitedStates2010-2014/LungBronchusCancerAQIDashboard?:embed=y&:display_count=yes

*Please note that the year-adjustment tools for the two maps do not appear properly within this image. You must click on the link above in order to see these tools (as well as observe the hover effects included within each graph).

 

Discussion

The graphs themselves within my dashboard lead users to make informed and interesting conclusions about states and counties regarding lung & bronchus cancer incidence rates and air quality – that much was evident from my user testing. However, I could not identify a clear trend between the cancer and AQI data by region while using my dashboard. Two additions to this dashboard may help to correct this:

  1. Adjustments based on the smoker population density of each state and county.
  2. More complete AQI data that could aid in the creation of a line graph similar to the one I made for lung and bronchus cancer incidence rates included at the bottom of my dashboard.

Regions with poorer air quality may have fewer smokers or vice versa, so it is difficult to account for this generally accepted high risk factor for lung and bronchus cancer without it included on my dashboard. However, this AQI data is very specific to particular testing locations, so any additional attempt to compare AQI, smoker, and lung & bronchus cancer incidence rates by state or county might be misleading. Critically understanding a region’s air quality would likely involve additional testing sites or more specialized knowledge: a follow-up study with the aid of an expert in air quality may lead to a deeper analysis. As the dashboard stands now, it provides an interesting tool for those who want to look up their home state or county’s lung & bronchus cancer incidence rates and AQI measurements. Users during my test sessions seemed particularly excited about this potential of the dashboard, possibly to determine their own risk factor for lung and bronchus cancer or need for better air quality standards in their county.

In addition to accounting for smoker population density, the creation of maps in another program, such as Carto, or a version of Tableau with more customizable options for color gradients, could help make this dashboard more usable. Right now, the color gradients used for the maps in my dashboard are difficult to read for those with vision impairments or some degree of colorblindness. A color gradient that runs between two different (but not opposite) colors without a neutral midpoint would make the individual color bins easier to tell apart, and it is fairly simple to create in a design tool that specializes in spatial data visualizations, such as Carto.
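As a rough illustration of the kind of gradient I have in mind, the Python sketch below blends directly from one color to another in equal steps, so the middle bins are mixtures of the two end colors rather than a neutral tone. The function names and the two hex colors are hypothetical choices for this example, not the colors used in my dashboard, and a simple blend of RGB values like this is not guaranteed to be perceptually even or colorblind-safe, so the resulting bins would still need to be checked with an accessibility tool.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of integers."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of integers to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def two_color_ramp(start, end, bins):
    """Return `bins` colors blended linearly from `start` to `end`."""
    if bins < 2:
        raise ValueError("a ramp needs at least two bins")
    a, b = hex_to_rgb(start), hex_to_rgb(end)
    ramp = []
    for i in range(bins):
        t = i / (bins - 1)  # 0.0 at the first bin, 1.0 at the last
        blended = tuple(round(x + (y - x) * t) for x, y in zip(a, b))
        ramp.append(rgb_to_hex(blended))
    return ramp


# Hypothetical end colors: a yellow and a purple.
ramp = two_color_ramp("#ffee00", "#7f0080", 3)
print(ramp)
```

The list that comes back can be pasted into any mapping tool that accepts custom hex colors for its legend bins.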

With additional smoker data and maps created in a tool such as Carto, this visualization may provide deeper insights into the relationship between lung & bronchus cancer rates and air quality. However, the presentation of the current version of this dashboard to expert audiences may still provide some useful insights for such a crowd. Additionally, based on the user testing conducted in this study, the average user could still use this dashboard to find out some interesting statistics concerning lung & bronchus cancer and air quality in their home state or county.
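For readers who want to go beyond visual comparison, one simple first check of the relationship between the two kinds of data would be a Pearson correlation coefficient computed over states. The Python sketch below is only an illustration: the function name and the four pairs of values are hypothetical, not figures from my data sets, and a correlation like this would not account for smoking or for the gaps in AQI collection sites discussed above, so it should not be read as evidence of cause and effect.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical values for four states: a median AQI and a cancer incidence
# rate per 100K people. These are placeholders, not real measurements.
median_aqi = [30, 40, 50, 60]
cancer_rate = [55.0, 60.0, 58.0, 70.0]

r = pearson(median_aqi, cancer_rate)
print(round(r, 2))
```

A value near 1 or -1 would suggest that the two measures move together across states, while a value near 0 would match what I saw on the dashboard, where no clear trend between the cancer and AQI data stood out.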