Introduction and Inspiration
Created by Stan Lee together with several artists in 1961, the Marvel Universe is the fictional universe in which the stories of most Marvel Comics publications (e.g. American comic books) take place. It has built a library of over 8,000 characters, including super-teams such as the Avengers, the X-Men, and the Guardians of the Galaxy, and superheroes such as Spider-Man, Iron Man, Thor, Captain America, Doctor Strange, the Hulk, Black Panther, Wolverine, and numerous others. Although it is depicted as a “multiverse” consisting of thousands of separate universes, the term is often used to refer to the stories that take place on Earth. Over the last decade, these characters have gained worldwide popularity and become cultural icons, thanks to the release of several blockbuster movies. As a fan of the Marvel Cinematic Universe franchise, I am fascinated by the heroes and their stories. I took this project as a chance to visualize the connections between the superheroes and better understand how they fit into the universe.
While looking for inspiration, I came across this infographic from 4 years ago, which visualizes the network of Marvel Cinematic Universe characters. Although this graph covers only a limited number of characters, I like that the clusters clearly indicate communities, and that the central position shows who had the most one-on-one relationships with almost all of the main Avengers team characters (Black Widow).
Exploring further, I found this more recent visualization of the Marvel Cinematic Universe. In it, characters with more connections have bigger nodes. A total of 9 communities are represented using different colors, and it’s easy to tell that characters within the same community tend to be associated with their own films.
This is the oldest graph (2011), yet I found its visualization to be the most aesthetically pleasing. The colors of this ego network are reminiscent of Marvel’s colors, and the labels are more readable in terms of size and contrast.
Materials and Methods
- Kaggle Datasets: Open source platform for the public to explore and publish data sets
- OpenRefine: Open source application for large data cleanup and transformation
- Gephi 0.9.2: Open source software for network visualization and analysis
- Schemecolor.com: A collection of color schemes
- Oxford Internet Institute (OII) Network Visualisation: Enables an interactive display of a network visualization in a web browser using the open-source Sigma.js plugin for Gephi
1. Selecting Data
I found The Marvel Universe Social Network data source created by Claudio Sanhueza (2015) on Kaggle. It came with 3 datasets in total. Since I’m focusing on the relationships between superheroes rather than their shared appearances in comics, I chose the “nodes” and “hero-network” CSV files. The “nodes” file contains both comics and heroes, so I extracted all the nodes of type “hero” using OpenRefine. I also renamed “node” to “label”, “hero1” to “source”, and “hero2” to “target”, and created a “type” column filled with “undirected”, so that the files are easier for Gephi to read.
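The same cleanup could also be scripted. The following is only a minimal sketch in Python, not the OpenRefine workflow I actually used. The sample rows are made up, and it assumes the “nodes” file has “node” and “type” columns and the “hero-network” file has “hero1” and “hero2” columns.

```python
import csv
import io

# Tiny stand-ins for the two Kaggle files; these rows are made up for illustration.
NODES_CSV = """node,type
CAPTAIN AMERICA,hero
IRON MAN,hero
AVENGERS 1,comic
"""
EDGES_CSV = """hero1,hero2
CAPTAIN AMERICA,IRON MAN
IRON MAN,CAPTAIN AMERICA
"""


def prepare_nodes(text):
    """Keep only rows of type "hero" and rename "node" to "label"."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"label": row["node"], "type": row["type"]}
        for row in rows
        if row["type"] == "hero"
    ]


def prepare_edges(text):
    """Rename hero1/hero2 to source/target and mark every edge as undirected."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"source": row["hero1"], "target": row["hero2"], "type": "undirected"}
        for row in rows
    ]


nodes = prepare_nodes(NODES_CSV)
edges = prepare_edges(EDGES_CSV)
print(len(nodes), "hero nodes,", len(edges), "edges")
```

With the real files, the two strings would be replaced by the contents of the downloaded CSVs, and the resulting rows written back out with `csv.DictWriter`.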
2. Visualizing in Gephi
After I loaded the dataset into Gephi, I tried several layouts and chose ForceAtlas2, as it seemed to converge to a more balanced spatialization of the network. With over 6k nodes and 574k edges in total, the first view of the network was overwhelming and cluttered, but after I experimented with different size and color configurations, the hierarchy became clear, showing who the biggest or most influential characters in this network are.
Node size denotes degree, so larger nodes correspond to more influential characters. In Figure 4, communities appear as groups of nodes colored by Modularity Class. The visual densities produced by this force-directed layout reflect the structural densities well. Originally I wanted to make label size proportional to node size, but I realized that the overlapping text would be less legible and might force me to shorten the original labels, so I decided to use a uniform size.
To reduce the noise in this huge dataset, I used a filter that requires a minimum degree of 80 for a node to be shown; I also set up an edge filter that keeps only edges with a weight of at least 2. Together these filters kept around 60% (913) of the total nodes and 40% (38,220) of the total edges (Figure 5).
I further differentiated the node sizes by turning up the maximum size value and used “Prevent Overlap” to avoid possible visual confusion. Then I spaced out the nodes by increasing the scaling so that there would be enough room between them to display the labels properly.
Referring to Figure 2, I also ran the Modularity statistic multiple times with different resolutions to eventually produce a total of 9 communities. After exploring several Marvel-inspired color schemes on Schemecolor, I created the palette above for the groups. The last 3 modularity classes were assigned the same color, silver gray, since these groups were less visually notable in the network.
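To make the size encoding concrete, here is a minimal sketch of mapping degree to node size by linear interpolation. It only illustrates the idea and is not Gephi's own code; the function name, the default size range, and the sample degrees are all hypothetical.

```python
def scale_sizes(degrees, min_size=10.0, max_size=100.0):
    """Map each node's degree linearly onto the range [min_size, max_size]."""
    lowest = min(degrees.values())
    highest = max(degrees.values())
    if highest == lowest:
        # Every node has the same degree, so give them all the minimum size.
        return {node: min_size for node in degrees}
    span = highest - lowest
    return {
        node: min_size + (degree - lowest) / span * (max_size - min_size)
        for node, degree in degrees.items()
    }


# Made-up degrees for three nodes.
sizes = scale_sizes({"A": 2, "B": 6, "C": 10})
print(sizes)
```

Raising `max_size` widens the gap between the least and most connected nodes, which makes the hierarchy easier to read.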
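The two filters can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not a reproduction of Gephi's filter pipeline: it assumes that repeated hero pairs are merged into a single edge whose weight is the number of repeats, and that both filters are evaluated against the full graph. The function names and sample pairs are hypothetical.

```python
from collections import Counter


def merge_edges(pairs):
    """Collapse repeated undirected pairs into one edge weighted by repeat count."""
    weights = Counter()
    for a, b in pairs:
        if a != b:  # ignore self-loops
            weights[tuple(sorted((a, b)))] += 1
    return dict(weights)


def filter_network(weights, min_degree, min_weight):
    """Keep nodes with enough neighbors, then keep heavy edges between them."""
    degree = Counter()
    for a, b in weights:
        degree[a] += 1
        degree[b] += 1
    kept_nodes = {node for node, d in degree.items() if d >= min_degree}
    kept_edges = {
        edge: w
        for edge, w in weights.items()
        if w >= min_weight and edge[0] in kept_nodes and edge[1] in kept_nodes
    }
    return kept_nodes, kept_edges


# Made-up pairs; the project itself used min_degree=80 and min_weight=2.
pairs = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("B", "C"), ("C", "D")]
weights = merge_edges(pairs)
kept_nodes, kept_edges = filter_network(weights, min_degree=2, min_weight=2)
print(sorted(kept_nodes), kept_edges)
```

In Gephi the result can depend on how the filters are nested, so counts from a script like this may not match the ones reported by the software.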
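For readers new to the term, modularity scores how much more densely the nodes inside each community are connected than would be expected by chance. Below is a minimal sketch of the standard modularity score for an unweighted, undirected graph and a given partition. It does not search for communities and does not model the resolution setting, so it is not what Gephi runs; the function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition: sum over communities c of
    (edges inside c) / m - (total degree of c / 2m) ** 2,
    where m is the number of edges."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the same community
    degree_sum = defaultdict(int)  # total degree of the nodes in each community
    for a, b in edges:
        degree_sum[community[a]] += 1
        degree_sum[community[b]] += 1
        if community[a] == community[b]:
            internal[community[a]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph: two triangles joined by a single bridge edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the kind of comparison a community detection algorithm makes while searching for a good partition.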
In addition to exporting as static image files when the visualization was ready, I installed the SigmaJS Exporter plugin for Gephi. This plugin was released by the Oxford Internet Institute and it allows for an interactive display of network visualization in a web browser. Because the exported folder was local, I uploaded it to a free web server so that the web page is available online.
Results and Interpretations
From the final visualization, we can see that Captain America is the most central character in the Marvel Universe, as he has the most connections with other superheroes. The second most connected superhero is Spider-Man, followed by Iron Man, the Thing, Mr. Fantastic and Wolverine. We can also relate the communities to the major superhero groups.
Red marks most members of the Avengers, which is also the most central and influential group. There are a few exceptions. For example, although Thor is part of the Avengers and is drawn close to the other key members, he also leads his own community, marked in light gray. Blue marks the X-Men group, led by Beast, Wolverine, Professor X, etc. Purple indicates the Spider-Man cluster. Orange points to the Guardians of the Galaxy. Yellow is the Fantastic Four group. Green is mostly the allies of Dr. Strange and the Hulk.
Spider-Man has a large number of connections, but his distance from the central clusters and his membership in a different community indicate that by this phase he had not yet become a member of the Avengers. We can also tell that superheroes in the Avengers group interact more with heroes from other groups, as the red nodes and edges permeate the majority of the graph, whereas the X-Men and Spider-Man communities are more involved within their own groups.
Reflection and Future Work
This was my first time working with network data and visualization, which I think requires a different mindset from the statistical visualization I did previously. Gephi has powerful capabilities for analyzing and visualizing network data, but its learning curve was steeper than in the last project, especially given the amount of network analysis and statistical terminology. The terms seemed a little abstract at first, and they took a while to sink in.
Another challenge was that my data volume was enormous, which constantly caused the software to slow down, freeze or shut down. In addition, every time I reopened the file, settings such as colors, sizes, and the modularity resolution appeared to have reverted to their defaults, although the graph itself had not. It seems that Gephi does not keep a record of previous settings or display the properties currently applied. This can be a little frustrating because it requires users to keep track of all their customized settings themselves.
Regarding the visualization, I had considered making the thickness of edges proportional to weight, but the connections were dense enough as they were. Varying the line thickness would significantly increase the visual clutter, or the differences might simply be indistinguishable given the high density of edges. So I thought it better not to add this extra layer of complexity and decided to focus on the big picture for now. In the future, I am interested in looking at the strongest or most frequent connections between characters (with fewer nodes), perhaps using a radial diagram. If possible, it would also be interesting to retrieve a more up-to-date dataset and compare it with this one to see how the universe evolves. I would like to explore other tools and gain a deeper understanding of network analysis in Gephi, such as how different layouts behave and which other metrics (e.g. Betweenness Centrality, Closeness Centrality, Eccentricity, etc.) work best in different cases.