Nutrition data analysis of breakfast foods



Introduction

For my final project, I will analyze nutrition and ingredient data for breakfast products from foodfacts.com to gain insights about their general composition and nutritional values. The background for the project is my industrial design Master’s thesis, which deals with nutrition labeling and aims at creating new interfaces that communicate food data to support informed decision-making for consumers. In relation to this, I am also exploring the potential benefits of implementing data visualization methodologies in industrial design practice, that is, using data to inform a design process beyond information visualization. The target audience for my visualizations is thus designers, and I will evaluate the results based on how successfully they are able to inform a further design process (be that of interfaces or physical products). My approach to the data is exploratory, so another goal is to provide the designers involved with an overview and general understanding of the nutrition information, hopefully uncovering trends that are not discoverable by looking at nutrition labels individually and allowing for decisions about where to focus further analysis.

The final result is presented on a poster, where selected visualizations (histograms/bar graphs from Tableau and networks from Gephi) are arranged together to provide an overview of discoveries. Visualizations made with Tableau Public are also available online here.


Working with Nutrition Data

Nutrition data has two main components. The first one consists of numeric nutrition facts, such as serving size measures, sugar content, percentage of daily recommended value, etc. This data lends itself mainly to statistical analysis. The second component, ingredient data, can be subjected to statistical as well as network analyses in order to uncover trends in part-to-whole relationships, overall structure and clustering. I will conduct these analyses using Tableau and Gephi, but first I will have to collect and format the data.

In my previous network analysis, I manually transferred ingredient data from labels to a spreadsheet and arranged the ingredients in unique pairs to allow for analysis and visualization in Gephi. Since I was only looking at 4 different products, this approach was feasible. Now that I am interested in expanding the analysis, I have to employ other strategies and methods. With assistance, I manage to scrape data from foodfacts.com, a website which has nutrition label data on tens of thousands of products; I choose to focus on breakfast products, which gives me 6,152 rows of data in a csv-file. Even though the data is fairly well structured from the website, there are a lot of inconsistencies I need to address in order to create meaningful visualizations. I work with the data in OpenRefine; initial cleaning involves using commas to split ingredients listed in a single cell into multiple columns, and getting rid of units in nutrition facts cells (this is fairly easy for the nutrient data, but more difficult for serving size, since there is no consistent unit use in that column). The more I work with the data, the more aware I become of how much information it contains; many different approaches could be employed to derive a variety of insights from it. ‘Serving Size’ is of particular interest to me, because it is the unit that determines the values of all the nutrient data, and it therefore plays an important role in how consumers understand and evaluate nutrition facts. By using clustering in OpenRefine, I learn that there are 6 main serving size units in the data set: cups, grams, piece (e.g. 3 cookies), packet, ounce, and tbsp. Some data entries do not show a serving size unit, so I decide to add an ‘unknown’ category. With these 7 categories established, I create a new column called ‘Serving Size Unit’, mainly using value.contains() and split functionality. I also create columns for Serving Size Oz, Serving Size Grams, etc., where I store the numeric serving size values for each product.
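
To make the classification logic concrete, here is a rough R equivalent of that step (the actual work happened in OpenRefine with value.contains()); the ‘foods’ data frame, the ‘serving_size’ column, and the keyword patterns are hypothetical stand-ins, not my actual names:

```r
# A sketch of the serving size unit classification; column names and
# keyword patterns are assumptions, not the actual OpenRefine steps.
library(dplyr)

classify_unit <- function(s) {
  s <- tolower(s)
  case_when(
    grepl("cup", s)                    ~ "cup",
    grepl("gram|\\bg\\b", s)           ~ "grams",
    grepl("piece|cookie|bar|slice", s) ~ "piece",
    grepl("packet|pouch", s)           ~ "packet",
    grepl("ounce|\\boz\\b", s)         ~ "ounce",
    grepl("tbsp|tablespoon", s)        ~ "tbsp",
    TRUE                               ~ "unknown"  # no recognizable unit
  )
}

foods <- foods %>% mutate(serving_size_unit = classify_unit(serving_size))
```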

All of these manipulations are aimed at creating data without cross-tab information for analysis in Tableau. The required format for network analysis of the ingredient data in Gephi is entirely different; I create a new file for this and delete all columns except the name and ingredient columns. The goal is to create a network where each ingredient is a node and each edge represents a connection between two ingredients, meaning that both ingredients occur in the same food item. To begin creating these unique data pairs, I work with the csv manually in Excel to prepare it for transposing (bringing “target” ingredient data across columns into one column, correctly paired with the unchanged “source” ingredient column). The spreadsheet reaches a size of over 400,000 rows even before it is transposed, and a couple of tries lead to the conclusion that OpenRefine is unable to process that. Instead, the data is reformatted in R, using melt, count, and aggregate functions, resulting in a csv with all unique pairs and a column counting their occurrences, which will be used as “weight” in the network analysis.
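
For reference, a minimal sketch of this reshaping in R; it reconstructs the approach rather than reproducing my exact script, and the ‘foods’ data frame (a name column plus one ingredient per remaining column) is an assumed stand-in for the cleaned csv:

```r
# Reshape wide ingredient columns into a weighted edge list for Gephi.
# `foods` is a hypothetical stand-in: one product per row, a name column,
# and one ingredient per remaining column.
library(reshape2)

long <- melt(foods, id.vars = "name", value.name = "ingredient")
long <- long[!is.na(long$ingredient) & long$ingredient != "", ]

# For each product, build all unique unordered ingredient pairs
pair_list <- lapply(split(long$ingredient, long$name), function(ing) {
  ing <- sort(unique(ing))
  if (length(ing) < 2) return(NULL)
  as.data.frame(t(combn(ing, 2)), stringsAsFactors = FALSE)
})
edges <- do.call(rbind, pair_list)
names(edges) <- c("Source", "Target")

# Count co-occurrences across products; the count becomes the edge "weight"
edges$Weight <- 1
edges <- aggregate(Weight ~ Source + Target, data = edges, FUN = sum)
```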

The process of cleaning and formatting the data was ongoing, because working with the data continuously revealed new inconsistencies. Some important insights from this step in the process were:

  • Ingredient data contains much more information than just what the ingredient is, e.g. an ingredient could be “BHT Added to Preserve Freshness” or “100% Organic Whole Grain Rolled Oats”; in order to analyze this information thoroughly, it would probably be necessary to create separate columns for the ingredient’s ‘purpose’, ‘certification’, ‘processing’, and maybe even ‘texture’ (e.g. ‘Whole Grain Brown Rice’ vs. ‘Whole Grain Brown Rice Flour’). But determining what these columns should be depends entirely on the purpose of the analysis (e.g. should there also be a ‘whole grain’ column? A ‘sprouted’ column? A ‘bleached/unbleached’ column?), which leads to:
  • Cleaning the data requires a lot of human expertise; some ingredients might actually be the same thing, even though they have different names (e.g. is ‘Riboflavin’ the same as ‘Riboflavin B12’?), while other ingredients might not be the same, even if their names are very similar (e.g. what’s the difference between ‘Tocopheryl Acetate’ and ‘Tocopherol Acetate’?).

Keeping these things in mind, I brought the data to a state of being ‘analyzable’ and imported tailored csv’s into Gephi and Tableau respectively. A non-linear process of experimentation and discovery led me to export the Edges and Nodes tables from Gephi and subject them to statistical analysis in Tableau as well; from this, I decided to create columns for ‘Organic’, ‘Whole Grain’, ‘Natural’, and ‘Gluten Free’, as I saw value in getting an overview of how frequently these seemingly common attributes occur in ingredients.
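
A sketch of how such attribute columns can be derived in R; the ‘nodes’ data frame and ‘Label’ column follow Gephi’s export convention, while the search patterns are my own assumptions:

```r
# Flag ingredient attributes by simple pattern matching on the node labels
flag <- function(labels, pattern) grepl(pattern, labels, ignore.case = TRUE)

nodes$Organic    <- flag(nodes$Label, "organic")
nodes$WholeGrain <- flag(nodes$Label, "whole grain")
nodes$Natural    <- flag(nodes$Label, "natural")
nodes$GlutenFree <- flag(nodes$Label, "gluten[- ]free")

# Share of ingredients carrying each attribute
colMeans(nodes[c("Organic", "WholeGrain", "Natural", "GlutenFree")])
```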


Visualization Process and UX Research

As mentioned, the target audience for my visualizations is designers, ideally designers working on a project about improving consumer interaction with nutrition data. I recruit three test persons from the Pratt Industrial Design department and decide to use an interview/focus group approach to getting their feedback and input. The studios at Pratt are the perfect setting, as this is where the design process takes place; when visualizing for designers, I should keep the chaotic and hectic character of this setting in mind, as it is probably not what would normally be considered ideal for data analysis work, which requires a different kind of focused attention.

My first round of interviews serves to get an idea of the types of insights that might be interesting for designers, and the responses coincide with my initial approach: anything that cannot immediately be gathered from observation or interviews, the usual ways designers collect information for a project, would be helpful.

I begin my work in Tableau by looking at averages and distributions to get an overview of trends in the numeric nutrient data, and I create histograms for ingredient count, servings per container, and calories per serving. I use grouping to create intervals for the graphs; servings per container and ingredient count in particular have very wide spans, with fewer and fewer entries towards the higher counts, so these are grouped into ‘40+’ categories. I also create a bar graph for Serving Size Unit, but keep in mind that these are distinct categories, not points on a continuum, meaning that the ‘shape’ of the bar graph (unlike with the other distribution graphs) doesn’t carry any meaning.
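
The grouping itself is done with Tableau’s built-in grouping feature, but the same binning could be sketched in R as follows (again with hypothetical ‘foods’ and ‘ingredient_count’ names):

```r
# Group ingredient counts into 5-wide intervals, collapsing the long tail
# of counts at 40 and above into a single "40+" bucket
breaks <- c(seq(0, 40, by = 5), Inf)
labs   <- c("0-4", "5-9", "10-14", "15-19", "20-24",
            "25-29", "30-34", "35-39", "40+")
foods$ingredient_bin <- cut(foods$ingredient_count, breaks = breaks,
                            labels = labs, right = FALSE)
table(foods$ingredient_bin)  # distribution underlying the histogram
```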

Finally, I create a set of simple bar graphs to show the relationship of specific types of ingredients to the whole (using percentage of total).

Along the way, I get feedback from my focus group by asking them to look at the graphs and describe what information they can extract. It becomes clear that this approach requires a lot of attention from the user; it takes them a while to familiarize themselves with the dashboard layout, and only once this is achieved can they begin to derive specific insights. I decide to include more descriptions of the data and to highlight things that might be of interest in the final composition. I also try providing an overall introduction to the topic before asking the test person to engage with the visualizations, which proves very effective. A few more specific suggestions from test persons, such as changing the Tooltip information from showing total count to showing percentage of total (across various graphs, such as the serving size units bar graph), are implemented. I also notice how the users struggle to keep an overview of the many graphs on a small screen, often spread across multiple tabs, and suggest presenting the final results on a printed poster, which might work better in the designers’ work environment; this is well received.

Coloring the graphs is difficult, since there is no particular logic to apply; there are too many ingredients and categories to use color consistently to add meaning, especially when all the visualizations are viewed together, as on the final poster. My goal with the colors thus becomes to keep them as simple as possible and not create any distractions or confusion. For Tableau, I end up going with a neutral gray for the distribution graphs; only graphs with distinct categories (serving size units) or three information dimensions (the plot of ingredient count in relation to sugar content and calories per serving) receive color. I keep this minimal approach in mind in my work in Gephi as well.

Analyzing and visualizing the network data proves challenging because of the size of the network, which has 5,841 nodes, 421,022 edges, and an average degree of 144. I experiment with different layout algorithms to get a sense of the network structure, but don’t see any strong patterns; even though the network density is only 0.025, many of the nodes are strongly interconnected, and the network thus appears very ‘tight’.
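
These summary statistics are internally consistent, as a quick check for an undirected graph shows:

```r
# Sanity check of the reported network statistics
n <- 5841    # nodes
E <- 421022  # edges

avg_degree <- 2 * E / n             # each edge touches two nodes: ~144.2
density    <- 2 * E / (n * (n - 1)) # share of possible edges present: ~0.0247
```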

Running modularity on the Force Atlas 2 configuration returns 4 clusters, seemingly based on their distance from the center of the network, where the nodes with the highest degree are located. I had hoped to get clustering more according to the ‘type’ of ingredient (e.g. grain vs. vitamin vs. colors), but ingredient use appears not to be consistent enough for this to happen, meaning that this (potentially valuable, but labor-intensive) layer of information would have to be added at the spreadsheet level. This is again an example of how much human expertise is required to make sense of this data.
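
For a scriptable route, a comparable clustering could be run in R with igraph’s Louvain implementation (the method behind Gephi’s modularity tool); this is a sketch using the hypothetical ‘edges’ data frame from earlier, not my actual Gephi workflow:

```r
library(igraph)

# Build an undirected, weighted graph from the Source/Target/Weight edge list
g <- graph_from_data_frame(edges, directed = FALSE)

# Louvain community detection, weighted by co-occurrence counts
com <- cluster_louvain(g, weights = E(g)$Weight)
table(membership(com))  # cluster sizes
```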

Sizing nodes according to degree could be misleading in terms of perception, by suggesting a greater quantity rather than a greater frequency of use of high-degree ingredients. To avoid this, I try running the ‘Circular’ layout, hoping that the even distribution of nodes along the circle’s circumference will let the edge weights stand out more clearly and communicate the same information about connectivity that node sizing would have. This, however, leads to issues when it comes to displaying labels; there are so many, appearing so close together, that they cannot be read. Scaling the network up alleviates this problem to some extent, but too much scaling seems to interfere with the original layout, and the problem remains unsolved. In the end, it is very difficult to create a network visualization that affords insights without further data editing and perhaps a clearer goal in mind. In another attempt to learn more about the data, I decide to reduce the data set to only include ingredient pairs with weights over 100. A visualization of this again shows very few central ingredients with a high frequency of occurrence (mainly sugar and salt), surrounded by many other (very chemical-sounding) ingredients that are also closely connected to each other. The density of this network is 0.127.
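
The reduction itself is just a filter on the edge list; sketched in R with the same hypothetical ‘edges’ data frame:

```r
# Keep only ingredient pairs co-occurring in more than 100 products
edges_reduced <- subset(edges, Weight > 100)
write.csv(edges_reduced, "edges_reduced.csv", row.names = FALSE)  # for Gephi
```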

Full ingredient network

Full ingredient network – circular

Reduced ingredient network

Breakfast Food Ingredient Network (reduced)

Findings and recommendations

My findings and recommendations relate to methodologies and to the data itself, as well as to how to incorporate UX research and feedback.

Methodological findings

Learning about the data is an ongoing process, and with data as complex and extensive as this, cleaning and visualizing go hand in hand, as visualizations reveal inconsistencies. This complicates the workflow: there is a need to be able to backtrack the process, and considerations of when to separate, merge, or reformat data become crucial. Moving this project forward, I would use my insights to re-clean the data almost from the original spreadsheet, where all information is collected in one place, and make a strategy for which attributes I would want to analyze both in networks and in statistical analyses. Consulting a dietitian or other expert to make sure the clustering of data entries is done accurately would also be helpful.

Data findings

There are many insights to be gained from the visualizations. First of all, the complexity of the notation of ingredients became very clear, as did the number of ‘chemical-sounding’ ingredients, which seemed much more overwhelming when presented all together than when noticed individually on nutrition labels.

I also learned how differently serving sizes are notated, and how difficult it is to compare nutrient data across units of volume, weight, and piece count. Another interesting finding is that servings per container greatly favors even numbers, and particularly the number eight; I suspect that this has to do with the fact that it can easily be divided by 2 and 4, and that it has somewhat ‘friendly’ connotations. The average is high at 10 servings per container, and the relatively large number of items with 40+ servings per container suggests that there are bulk-size items in the data set. These could represent larger packages of food items already included elsewhere in the data; a further investigation could show whether these entries should be left out (again, depending on the goal of the analysis and visualization).

In terms of calories per serving, it was interesting to find that the average is 163 cal, and that the distribution is narrow, with the great majority of products found in the 100-250 cal span; 163 seems on the low side for a breakfast meal, confirming the suspicion that serving sizes often do not represent the amount that people actually consume (something which is being addressed by the new nutrition labeling regulations in the Federal Register, effective July 26, 2016!).

It was surprising to find that a large number of food items have more than 50, and some more than 100, ingredients, with the average also fairly high at 24. I believe this reflects the highly processed foods in the collection. It could also be related to the trend that as the number of ingredients increases, so does the average sugar content per serving, up until 60 ingredients per item. After this, a smaller group of items displays a different trend: lower sugar content, greater ingredient count, and more calories per serving (presumably from fat). (Here I found it helpful to display two graphs showing the same data, as they give different insights; one allows for reading of color, while the other has data points sized according to number of records, making the relationship of the data points to the data set more clear. Originally, the color gradient was gray, but as test persons had difficulties perceiving the differences, I decided to add color. The dark red draws attention to the outlier on the top graph.)

From the network analysis, it was difficult to draw conclusions about overall trends, beyond the fact that most nodes have many connections and that some nodes with key ingredients are extremely well connected. To improve the visualization, it would be helpful to limit the diversity of food items; even though the set is already limited to breakfast products, it perhaps doesn’t make much sense to compare Cheerios to a premade egg and bacon sandwich. To move the project forward, I would narrow my selection of breakfast products down to perhaps only cereals and granolas, as I expect them to share more of the same ingredients. Small multiples could then be helpful in comparing different food categories or brands to each other.

UX research findings

The UX research gave valuable input for structuring the visualizations, and also showed how much introduction is necessary for designers to really engage with and take ownership of the data. For a design team, I believe findings should be communicated by someone who has worked with the data, as there is also a lot of knowledge gained from the process; the visualizations then become a great way of sharing findings. User observations were moreover helpful in realizing what draws the eye of the audience first, and how the overall composition is assessed.

Overall, the project gave a lot of new insights that are valuable both for moving the data analysis and visualization process forward and for beginning to point designers towards the communicative challenges related to nutrition data. In a way, the messy, unstructured, and overwhelming mass of data represents the reality that meets the consumer (though usually in smaller doses); realizing this, as well as getting an overview of the realities of nutrient, portion size, and calories-per-serving averages, can begin to lead designers towards designing better solutions, maybe by employing data visualization best practices as well.

Final poster

Poster 01
