Introduction
Have you ever wondered which ingredients are used most often across all of the world’s cuisines? As an avid cook, I have long believed that flavor is more versatile than arbitrary geographic boundaries. Korean kimchi jjigae, for instance, always tastes to me like a sauerkraut soup my Polish grandfather used to make, albeit a tad spicier. How did these two dishes, half a world away from each other, end up tasting so similar? By visualizing relationships between ingredients, I wanted to see if I could uncover other trends and similarities that exist between cuisines, especially those that we do not often associate with one another. To do so, this project visualizes the co-occurrence of spices and herbs as found in a selection of 2,173 recipes from Allrecipes.com.
Research Questions:
– What types of spices and herbs are used the most in various cuisines?
– Which spices and herbs connect the most cuisines and recipes?
– What are common pairings?
Methodology
1 – Collecting & Cleaning the Data
To start, I needed to find or create a dataset of recipes classified by their cuisine of origin. Luckily, I discovered that Allrecipes.com has an extensive directory of recipes organized by their respective cuisines. I wrote a Python script that retrieves each of the recipe URLs on each cuisine’s page, and then subsequently scrapes the list of ingredients from each recipe. I was left with a JSON file containing 49 cuisines and 2,173 unique recipes. I transformed this into a CSV file and continued into the most intensive part of data preparation: cleaning and reconciling the 22,899 rows of ingredients featured in these recipes.
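The JSON-to-CSV step can be sketched roughly as follows. This is a minimal illustration, not the script I actually ran: the nested shape of the JSON (cuisine, then recipe title, then ingredient lines) and the column names are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical shape of the scraped JSON: cuisine -> recipe title -> ingredient lines.
scraped = json.loads("""
{
  "Polish": {"Sauerkraut Soup": ["1 bay leaf", "salt to taste"]},
  "Korean": {"Kimchi Jjigae": ["1 tablespoon gochujang"]}
}
""")


def flatten(data):
    """Yield one (cuisine, recipe, ingredient) row per ingredient line."""
    for cuisine, recipes in data.items():
        for recipe, ingredients in recipes.items():
            for line in ingredients:
                yield cuisine, recipe, line


rows = list(flatten(scraped))

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["cuisine", "recipe", "ingredient"])  # assumed column names
writer.writerows(rows)
csv_text = buffer.getvalue()
```

One row per ingredient line is what makes the later cleaning manageable, since each row can be edited or filtered on its own.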
I performed the bulk of my cleaning in Google Sheets and OpenRefine. First, I removed the qualifying information that is often included in an ingredient line, such as “to taste,” “thinly sliced,” or “divided,” using several rounds of regular-expression find and replace. Because a majority of recipes on Allrecipes are user-contributed, I also needed to reconcile the various ways in which the same ingredient was named, especially across cuisines. For instance, “recao” is another name for the herb “culantro,” which is not to be confused with “cilantro.” I quickly realized how complicated parsing recipe data can be, especially when it involves natural language processing, a problem the New York Times wrote about in 2015 while developing the NYT Cooking app.
Furthermore, after spending a few hours with this massive dataset, I realized I needed to narrow my focus to a subset of ingredients for my visualization. At first I had grand visions of studying relationships between all types of ingredients (meats, vegetables, fruits, herbs, spices, condiments, and so on), but I simply did not have enough time or computing power to reconcile all this data for an accurate analysis. At this point, I decided to focus on herbs and spices, given that they would be easy to filter down to, they have strong associations with global cuisines, and they are often responsible for lending recipes their most distinctive flavor profiles.
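The same two cleaning steps can be approximated in a few lines of Python. This is only a sketch of the idea; the actual work was done interactively in Google Sheets and OpenRefine. The qualifier pattern and the synonym table below contain just the examples mentioned above, and the function name is made up for illustration.

```python
import re

# Illustrative qualifiers only; the real list grew over several passes.
QUALIFIERS = re.compile(r",?\s*\b(?:to taste|thinly sliced|divided)\b", re.IGNORECASE)

# Illustrative synonym table mapping variant names to one canonical name.
SYNONYMS = {"recao": "culantro"}


def clean_ingredient(line):
    """Strip qualifying phrases, normalize whitespace and case, then reconcile names."""
    text = QUALIFIERS.sub("", line)
    text = re.sub(r"\s+", " ", text).strip(" ,").lower()
    return SYNONYMS.get(text, text)


print(clean_ingredient("Cilantro, thinly sliced"))
print(clean_ingredient("Recao"))
```

Note that this sketch does not strip quantities or units, and that "cilantro" is deliberately left out of the synonym table so that it stays distinct from "culantro".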
Still, I had some classification work ahead of me. To create a standardized method for selecting my data, I established these guidelines:
1. Salt (no, it is not a spice) is excluded. It is simply too fundamental an ingredient. Black pepper, however, is included.
2. Fresh herbs like parsley, cilantro, and basil are included, but other aromatics such as garlic, onion, and chili peppers are included only in dried form. Basically, I asked myself: would I find this in the spice aisle at a supermarket? If the answer was yes, I included it.
3. Pastes, such as Thai curry paste and Korean gochujang, are included as long as the primary ingredient or ingredients are spices, herbs, or similar aromatics. This means miso paste is excluded but tahini is included (I know, it’s controversial. Many would argue it should be considered a nut butter instead, but I had to follow my rules).
I faced murkier territory when dealing with spice blends. For instance, curry powder (an extremely loaded topic to discuss on another day) and five-spice powder felt like important spice blends to include, but what about mass produced commercial blends, such as Old Bay, Lawry’s, Mrs. Dash, and Goya Sazón? After many iterations of visualization, I ultimately decided to exclude them because of their ambiguity and how they skewed the final spice groupings within my network. In many ways, I think an entire separate project could be done studying the various combinations of spices included in these blends alone.
After many, many hours of philosophical debates and classification arguments with my fellow food-minded friends, I ended up with 108 unique spices and herbs to be included in the visualization.
2 – Preparing the Edges
Once all spices had been reconciled and my dataset filtered, I began preparing it for the network analysis. To visualize relationships between spices, I created two versions of the network: one in which two spices are related if they appear in the same recipe, and a second in which they are related if they appear in recipes from the same cuisine. To make the edge list, which connects points in the network to one another, I made two spreadsheets: the first lists each unique combination of a recipe and a spice included in it, and the second lists each unique combination of a cuisine and a spice. Following Prof. Sula’s workflow, I then transposed each spreadsheet in OpenRefine by key/value columns and used his R script to prepare the final CSV edge files.
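For readers who want to see what the edge files contain, here is a rough Python stand-in for that step. It is not Prof. Sula’s R script; it only shows the general idea of turning (recipe, spice) pairs into weighted spice-to-spice edges. The toy rows and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy (recipe, spice) pairs standing in for the first spreadsheet.
pairs = [
    ("kimchi jjigae", "gochujang"),
    ("kimchi jjigae", "black pepper"),
    ("sauerkraut soup", "bay leaf"),
    ("sauerkraut soup", "black pepper"),
    ("sauerkraut soup", "caraway"),
]


def cooccurrence_edges(pairs):
    """Count how often each pair of spices shares a key (a recipe or a cuisine)."""
    groups = defaultdict(set)
    for key, spice in pairs:
        groups[key].add(spice)

    weights = Counter()
    for spices in groups.values():
        # Sorting makes (a, b) and (b, a) the same undirected edge.
        for a, b in combinations(sorted(spices), 2):
            weights[(a, b)] += 1
    return weights


edges = cooccurrence_edges(pairs)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

Passing (cuisine, spice) pairs instead of (recipe, spice) pairs to the same function yields the second, cuisine-based version of the network.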
3 – Visualizing the Network
Finally, I brought the data into Gephi. Since I wanted to compare spices between cuisines, I thought that the network based on cuisine relationships would provide me with the clearest picture of how spices were used in each. However, it was clear right away that the network formed by shared cuisines was very dense, and it was difficult to understand how spices and herbs were related to each other. In fact, it was rare for two spices not to be directly connected. To try to parse this network, I ran the modularity and eigenvector centrality metrics to detect communities within the network and to find out which spices were connected to the most others. I arranged the network using the Yifan Hu layout algorithm, which pushed the most centrally located nodes to the middle. Below, we can see that spices like black pepper, ginger, cumin, and cinnamon are the most commonly used across all cuisines.
Furthermore, by coloring according to modularity groups, I was able to begin to see usage patterns in spice combinations, which also hinted at the types of cuisines using these spices. For instance, the teal group includes lemongrass, star anise, five-spice powder, and tamarind, which are likely used by Asian cuisines such as Thai, Chinese, and Vietnamese. To better read these clusters, I created an additional layout that sorted the nodes by modularity group.
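To give a sense of what eigenvector centrality measures, here is a small power-iteration sketch in plain Python. It is not Gephi’s implementation, and the four-spice graph is invented for illustration. A node scores highly when it is connected to other high-scoring nodes.

```python
def eigenvector_centrality(adjacency, iterations=100, tol=1e-9):
    """Approximate eigenvector centrality of an undirected graph by power iteration."""
    nodes = sorted(adjacency)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Iterating on A + I (each node keeps its own score) has the same
        # leading eigenvector as A but avoids oscillation on bipartite graphs.
        new = {n: score[n] + sum(score[m] for m in adjacency[n]) for n in nodes}
        norm = max(new.values())
        new = {n: value / norm for n, value in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score


# Invented toy network: black pepper touches everything, cinnamon only black pepper.
adjacency = {
    "black pepper": {"cumin", "ginger", "cinnamon"},
    "cumin": {"black pepper", "ginger"},
    "ginger": {"black pepper", "cumin"},
    "cinnamon": {"black pepper"},
}

scores = eigenvector_centrality(adjacency)
for spice in sorted(scores, key=scores.get, reverse=True):
    print(spice, round(scores[spice], 3))
```

Scores are scaled so that the most central node has a value of 1; only the relative ordering matters for sizing or ranking nodes.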
While this answered some of my questions regarding how spices are typically paired, I felt like I was still missing nuance in terms of relationships. With this in mind, I turned to my recipe-based relationship data.
Final Visualizations & Discussion
My final visualizations ended up focusing on these recipe-based relationships in order to detect communities of spices and then further examine the presence of these spice groups by cuisine. By doing so, we are able to more concretely examine how clusters of spices are associated with various cuisines.
As with the previous networks, I first graphed the overall network. I again used the Yifan Hu layout, but this time I sized the nodes by degree, or the number of connections each ingredient has. I believe this paints a more realistic image of how common each spice is within the network. Next, I ran a few iterations of the modularity metric at various resolutions until I found one, at roughly 0.85, that gave me 6 distinct spice groups that made sense. I arranged a second layout according to these groups.
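Both measures used here are simple to compute by hand on a small graph. The sketch below calculates node degree and the standard (Newman) modularity score of a given grouping. It is an illustration only: Gephi searches for a good grouping itself and applies a resolution setting, neither of which is modeled here, and the six-node graph is a made-up example.

```python
from collections import Counter


def degrees(edges):
    """Number of connections per node in an undirected, unweighted edge list."""
    return Counter(node for edge in edges for node in edge)


def modularity(edges, community):
    """Standard modularity Q of a partition: sum over groups of L_c/m - (d_c/2m)^2."""
    m = len(edges)
    internal = Counter()    # edges with both ends in the same group
    degree_sum = Counter()  # total degree of the nodes in each group
    for a, b in edges:
        degree_sum[community[a]] += 1
        degree_sum[community[b]] += 1
        if community[a] == community[b]:
            internal[community[a]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Made-up graph: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(degrees(edges)["c"])
print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

A higher score means more edges fall inside groups than would be expected by chance; lumping every node into a single group always scores zero.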
The following charts list the spices in each group, ordered by how many times they appear in this recipe data.
Group 1 is overwhelmingly composed of black pepper, followed by cayenne pepper. It’s hard to say what the exact meaning is behind this grouping, other than the fact that black pepper and cayenne are highly central to the network in general. Group 2 is primarily green herbs such as parsley, oregano, and bay leaf. Group 3 shows an interesting meld that includes popular spices such as cilantro, paprika, and mustard, but also curry leaves, pandan, and dried Persian limes. Likewise, Group 4 includes cumin, turmeric, mint, and garam masala, all predominantly South Asian, in addition to chipotle chili peppers and Mexican oregano. Group 5 likely reflects baking spices like cinnamon, vanilla, and nutmeg. Group 6 is primarily Asian-centric spices such as ginger, white pepper, lemongrass, and Thai curry pastes.
At this point, I also used Flourish to create an interactive visualization in which the ingredients are grouped by these categories. Hovering over a point shows all the other spices it has been paired with.
Finally, to see how cuisines are shaped by these spice groups and to compare them, I created a series of small-multiple pie charts using Tableau and Adobe Illustrator that show the share of each spice group in each cuisine’s recipes.
I found that this visualization is helpful in providing additional context to the network graph. For instance, we can see that Scandinavian, Finnish, Austrian, and Danish cuisines are predominantly composed of spices from Group 5. Upon further investigation, it becomes clear that this is because a majority of these recipes are for baked goods utilizing warm baking spices. Other relationships are more surprising. I was particularly struck by the similarity in charts between Polish, Russian, Cajun & Creole, and Soul Food.
It makes sense that Polish and Russian food, and Cajun, Creole, and Soul Food are separately similar to each other, but what makes all four of them so alike? To answer this question, I developed another Tableau dashboard that allows users to select specific cuisines to compare, and see a breakdown of what spices are in each. Black pepper and cayenne are responsible for Group 1’s representation, while bay leaves, garlic powder, and parsley, all from Group 2, unite these cuisines.
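The numbers behind each pie chart are simple percentages. The sketch below shows one way to compute them, assuming each share is counted per spice occurrence in a cuisine’s recipes; the charts themselves were built in Tableau, and the group assignments and rows here are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical group assignments and (cuisine, spice) usage rows.
spice_group = {"black pepper": 1, "parsley": 2, "bay leaf": 2, "cinnamon": 5}
usage = [
    ("Polish", "black pepper"), ("Polish", "bay leaf"),
    ("Polish", "parsley"), ("Polish", "black pepper"),
    ("Danish", "cinnamon"), ("Danish", "cinnamon"),
    ("Danish", "black pepper"), ("Danish", "cinnamon"),
]


def group_shares(usage, spice_group):
    """Return, per cuisine, the percentage of spice uses that fall in each group."""
    counts = defaultdict(Counter)
    for cuisine, spice in usage:
        counts[cuisine][spice_group[spice]] += 1

    shares = {}
    for cuisine, groups in counts.items():
        total = sum(groups.values())
        shares[cuisine] = {group: 100 * n / total for group, n in groups.items()}
    return shares


shares = group_shares(usage, spice_group)
for cuisine, groups in shares.items():
    print(cuisine, groups)
```

Each cuisine’s percentages sum to 100, so every pie is comparable regardless of how many recipes the cuisine has.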
Limitations & Next Steps
Ultimately, this dataset represents a minuscule portion of the world’s cuisine, and it is important to remember that the authors of these recipes comprise a small section of the world’s voices when it comes to food. There are striking gaps in the data. For instance, there is no category for Mexican food, yet one for Tex-Mex exists. There is also no category for American food. Is this because Allrecipes treats it as the assumed, “neutral” cuisine? Not to mention, food is messy and defies categorization. My classification of this data comes from my own understanding and biases around food. Small changes to it may drastically affect the shape of the network.
In the future, I would love to create additional interactive tools that enable users to explore this network. Outside of a cooking context, I could see this spice network being compiled from various perspectives. For instance, a scientific perspective could study spice usage patterns and flavor compounds in relation to each plant’s classification and taxonomy. Likewise, a geographic approach could visualize these spices based on where they originate and where they are grown. A political and economic approach could incorporate how historic and present-day trade routes have influenced the prevalence of certain spices in the world’s cuisines. The archivist and food historian in me would be especially interested in compiling a network of how flavor profiles and spice usage in cuisines have shifted across time. This dataset would have to be meticulously assembled by studying historic cookbooks, but Barbara Ketcham Wheaton’s digital database The Sifter could be a good start.