Cereal Ingredient Network Visualization

For my Gephi network analysis and visualization, I am interested in food composition; more specifically, I would like to investigate the similarities and variations in use of ingredients in packaged cereal and granola. Since I can’t find a data set that allows me to do so (most food-related data sets I come across deal with nutrition facts and not ingredient use), I decide to create my own. I choose four food items to begin with; since the network analysis looks at the pairings between ingredients, and a cereal with just 10 ingredients will have 45 unique ingredient pairs (x = n(n-1)/2), four food items will provide me with enough data to test and assess the analysis, and potentially create a strategy for a more extensive study.
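The pair count mentioned above can be checked with a short Python sketch. The ingredient names are placeholders rather than a real ingredient list, and the helper name `ingredient_pairs` is made up for this illustration.

```python
from itertools import combinations
from math import comb


def ingredient_pairs(ingredients):
    """Return every unordered pair of distinct ingredients."""
    unique = sorted(set(ingredients))  # drop repeats so no pair is counted twice
    return list(combinations(unique, 2))


# Placeholder names standing in for a cereal with 10 ingredients.
example = [f"ingredient_{i}" for i in range(1, 11)]
pairs = ingredient_pairs(example)

print(len(pairs))              # number of unique pairs generated
print(comb(len(example), 2))   # the same count from n(n-1)/2
```

Both printed values should agree, since `comb(n, 2)` is the closed form n(n-1)/2 of the enumeration.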

The four breakfast products I decide to work with are Kashi Summer Berry Granola, Cheerios, Bob’s Red Mill High Fiber Hot Cereal, and Just the Clusters Chocolate Almond Granola Cereal. I transfer the nutrition and ingredient facts from the packages to a spreadsheet, at first creating one row for each ingredient. Then I rearrange the data into the columns Source, Target, Type (all required by Gephi), and finally Product Name, which lets me identify the product each ingredient comes from. With all the unique pairs from the four products in place (created simply by copying and pasting in Numbers), I end up with a total of 486 rows. The type is “undirected” for all of them, since the relationship between Source and Target here simply represents two ingredients existing in the same product.
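I built the rows by hand in Numbers, but the same edge list could be generated with a script. The following is a minimal sketch of that alternative, not the method I used. The product names and ingredient lists are hypothetical and abbreviated, the helper names `edge_rows` and `to_csv` are made up, and I am assuming Gephi accepts “Undirected” as the value in the Type column.

```python
import csv
import io
from itertools import combinations

# Hypothetical, abbreviated ingredient lists for illustration only.
products = {
    "Product A": ["rolled oats", "sugar", "salt"],
    "Product B": ["rolled oats", "salt", "almonds", "cocoa"],
}

COLUMNS = ["Source", "Target", "Type", "Product Name"]


def edge_rows(products):
    """Build one row per unique ingredient pair within each product."""
    rows = []
    for name, ingredients in products.items():
        for source, target in combinations(sorted(set(ingredients)), 2):
            rows.append({
                "Source": source,
                "Target": target,
                "Type": "Undirected",  # co-occurrence has no direction
                "Product Name": name,
            })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = edge_rows(products)
csv_text = to_csv(rows)
print(csv_text)
```

Sorting each ingredient list before pairing keeps the Source and Target order consistent, so the same pair is written the same way in every product.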

Three food data network visualizations inspire my work; two of them are from Truth and Beauty, created for mymuesli.de, and also deal with cereal ingredient data. MyMuesli is a German company that allows customers to create their own cereal flavor combinations, and the visualizations are made to show the most popular, that is, the most frequently occurring, ingredients. One way of doing this is with a radial network as shown below.

The ingredients are displayed in groups based on food categories as nodes along the periphery of the circle, and connected by lines that vary in width corresponding to the frequency of occurrence. While this visualization is effective in highlighting connections across categories (such as Strawberry to Crunchy and Oats), it doesn’t show same-category connections as well, which is why a second visualization is created. With this matrix representation the two most frequently occurring ingredients, strawberry and raspberry, stand out more clearly. The symmetry across the diagonal shows that the connections are undirected.

Finally, I’ve looked at a third example that uses a network visualization technique more similar to the one I will use. It also deals with ingredient co-occurrence, but this time not only in cereal. The network is built to provide recipe recommendations, based on the idea that ingredients that often occur in the same dishes might go well together in other dishes as well. In this network we see a separation between predominantly sweet ingredients that you might find in desserts on one side, and savory ingredients on the other, connected by common ingredients such as salt and water. This visualization helps inform what my visualization might look like with more cereals (even though the more homogeneous use of ingredients in breakfast food means that they would probably not form two distinct clusters).

Once my data is formatted, I export a CSV file and bring it into Gephi (empty columns gave me a little trouble to begin with, but once those were gone, the data imported nicely). I try out different layout algorithms, and end up going with “Force Atlas 2”. From creating the data set, I know that there are not that many overlaps in ingredient use, so I expect to see four clusters, one for each food item; the Force Atlas 2 layout produces exactly this.
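I removed the empty columns by hand in the spreadsheet. A scripted cleanup could look roughly like the sketch below, where the function name `drop_empty_columns` and the sample row are made up for illustration. It drops every column that has neither a header nor any values.

```python
import csv
import io


def drop_empty_columns(csv_text):
    """Remove columns in which every cell, including the header, is blank."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [row + [""] * (width - len(row)) for row in rows]
    keep = [i for i in range(width) if any(row[i].strip() for row in padded)]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in padded:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()


# Hypothetical export with a stray empty column in the middle and at the end.
raw = "Source,Target,Type,,Product Name,\nsalt,sugar,Undirected,,Product A,\n"
cleaned = drop_empty_columns(raw)
print(cleaned)
```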

With the layout in place, I run stats on the network. The average degree is 18, meaning that on average, each ingredient has 18 connections. I use degree as an attribute to determine node size, making the most well-connected nodes accordingly bigger. I also attempt to run Modularity statistics, but the calculation does not succeed in my version of Gephi (0.9.1). I intended to use the modularity to color the nodes according to their cluster, but since my network is fairly small, I decide to color the nodes individually instead. Since the clusters roughly represent the four cereals, I choose colors that I associate with each item. For the shared nodes, such as salt, sugar and rolled oats, I assign colors that are somewhere in between the main colors. The network edges are colored with gradients from one node to another, making the clusters and their relations even more apparent.
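To make the degree statistic concrete, here is a small sketch that computes degree and average degree from an edge list, and finds the ingredients shared between products. The edges are hypothetical, and the sketch counts distinct neighbors, which is an assumption on my part: Gephi’s own figure can differ depending on how repeated edges are merged on import.

```python
from collections import defaultdict

# Hypothetical edges as (source, target, product) for illustration only.
edges = [
    ("rolled oats", "salt", "Product A"),
    ("rolled oats", "sugar", "Product A"),
    ("salt", "sugar", "Product A"),
    ("almonds", "rolled oats", "Product B"),
    ("almonds", "salt", "Product B"),
    ("rolled oats", "salt", "Product B"),
]


def degrees(edges):
    """Count the distinct neighbors of each ingredient."""
    neighbors = defaultdict(set)
    for source, target, _product in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return {node: len(linked) for node, linked in neighbors.items()}


def average_degree(degree_by_node):
    """Mean number of connections per ingredient."""
    return sum(degree_by_node.values()) / len(degree_by_node)


def shared_ingredients(edges):
    """Ingredients that appear in more than one product."""
    products_by_node = defaultdict(set)
    for source, target, product in edges:
        products_by_node[source].add(product)
        products_by_node[target].add(product)
    return sorted(n for n, found in products_by_node.items() if len(found) > 1)


degree_by_node = degrees(edges)
print(degree_by_node)
print(average_degree(degree_by_node))
print(shared_ingredients(edges))
```

The shared ingredients are the nodes that bridge the clusters, which is why they end up with the highest degree and the largest node size.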

Finally, I add labels to the network by copying the node ids to the Labels column in the nodes table. The labels are also sized according to each node’s degree of connectivity. Once I’ve run a preview of the visualization, I export it as an SVG in order to bring it into Adobe Illustrator to make a few final adjustments and additions. I change the font and the font color to match the nodes, and adjust the position of the labels to make them more legible. Lastly, I add images of the four cereals to make the ingredient affiliations even clearer.

Cereal network visualization exported from Gephi

Cereal network visualization edited with Adobe Illustrator

To develop this project further, I would first of all add more products to the dataset. The visualization I have created begins to reveal the most common ingredients in cereals, but not to an extent that couldn’t have been derived from just looking at the ingredient lists; as the number of products increases, the value of the visualization will increase as well. I might also experiment with different algorithms, such as trying a radial layout to see if this gives different insights.

Information Visualization

Student work at the School of Information, Pratt Institute
