Superheroes and Social Cliques


Tyler Dennis

Dr. Sula

Information Visualization – Gephi Lab

“Marvel Chronology Project”


Final Gephi Visual



The dataset used for this Gephi lab was exhaustive and ambitious in size and scope. It was curated by the Marvel Chronology Project, which aims to “catalog every actual appearance by every significant character in the Marvel universe, and place them in their proper chronological order.” Characters that co-occur most often are linked together, and those with the most shared appearances sit closest to one another. Characters with the most links to others are more central and prominent within clusters. Frequent co-stars in a series are grouped together, with the most popular characters in each cluster forming its nucleus. The most-connected characters appear largest and pull lesser-networked bubbles into their cluster. Obscure characters, and versions of popular characters that only existed in alternate universes (the dataset is that comprehensive), appear along the outskirts of the visual.


Looking at examples of linked social data, and knowing the large amount of data I had to work with, made it clear that I’d probably end up with a graph of entities clustered by connectivity. This is an effective way of representing large datasets like the Marvel one. I wanted my visualization to look something like one I found on Wikimedia: nothing too flashy, something monochromatic and simple. Leaving the bubbles unlabelled, with a graph as expansive as mine, felt cruel; no one wants to hover over as many bubbles as are in my visualization.




Materials and Methods:

The original data had 10,469 nodes (representing individual characters) and 178,115 edges (linkages between characters). The dataset aims to capture not just major characters but also not-so-major ones, I found. The sheer amount of data made it hard for me to actually work with: it strained my Windows computer and it didn’t run well on the desktop computers at Pratt either. I knew then that I had to edit the data down significantly.

This led me to my roommate’s much-newer, very lovely Apple computer. At first I used it simply to open the dataset. After that, I whittled the nodes and edges down in the Data Laboratory tab. This was easy, since the data was conveniently ranked from the characters with the most connections at the top to those with the least at the bottom. By the end of this, the node count had gone from 10,469 to 553, and the edges were down to 11,472. This felt much more manageable and less stress-inducing. I found it hard to care much about the deleted data, much of which consisted of one-time characters from the seventies with names like “Rubber Man” or “Neon Ant.”

The strategy for whittling down this incredibly expansive dataset was to delete characters with few linkages to others. I equated number of linkages with relevance and deleted the characters that fell below that bar. These were obscure characters, and I didn’t feel bad deleting them. There is such a wealth of data in the Marvel Chronology Project that I knew whatever I left, no matter how much had been deleted, would tell a good story I could work with.
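The degree-based pruning described above was done by hand in Gephi’s Data Laboratory, but the same idea can be sketched in a few lines of Python. This is a minimal sketch with a toy edge list and threshold of my own invention, not the actual lab data:

```python
from collections import Counter

def prune_by_degree(edges, min_degree):
    """Keep only nodes whose degree meets min_degree, and the edges between them."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    kept = {node for node, d in degree.items() if d >= min_degree}
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges

# Toy example: three well-connected characters and one one-off obscurity
edges = [("Iron Man", "Thor"), ("Iron Man", "Hulk"),
         ("Thor", "Hulk"), ("Rubber Man", "Iron Man")]
nodes, core = prune_by_degree(edges, min_degree=2)
# "Rubber Man" (degree 1) is dropped, along with his only edge
```

In the real dataset the threshold was whatever brought 10,469 nodes down to the 553 that my machine could handle.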

By the end of my deletions, important characters and sensible connections between them were overwhelmingly intact. Characters who coexist in supergroups sit close to one another in the visual; members of groups like the Avengers and the X-Men are naturally clustered together. On the other hand, characters who are larger than a single comic title (think heavy-hitters like Iron Man or Wolverine) show up as crossover characters who frequently interact with people outside the comic they primarily appear in. The visual makes a lot of assumptions about the way these characters relate to one another, and it gets a lot of things right despite my heavy cuts to the dataset.

After I deleted that data, there were still issues, this time in the form of occlusion: all my bubbles were clustered around one super-popular character until I tweaked the scaling and the gravity. I still hated how the graph looked after this. Aiding in the occlusion were intrusively popular characters who seemed to draw all the other characters to them. The result was a “center-of-the-universe” graph in which the most-connected character was the overcrowded nucleus of a very confusing, bubble-manic layout.

Deleting the most interconnected nodes made the graph more dynamic. When there wasn’t such a clear front-runner in interconnectedness, the graph became more interesting; sacrificing a large, obvious piece of data felt justified so that more interesting connections could be shown clearly. When a few characters are overwhelmingly well-connected in these visuals, they end up represented by huge bubbles that make all the other characters appear tiny.
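That hub-removal step can be sketched the same way: rank nodes by degree and drop the top few before laying anything out. Again a toy sketch with made-up data, not the exact cut I made in Gephi:

```python
from collections import Counter

def drop_top_hubs(edges, n_hubs):
    """Remove the n most-connected nodes so no single character dominates the layout."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    hubs = {node for node, _ in degree.most_common(n_hubs)}
    return [(s, t) for s, t in edges if s not in hubs and t not in hubs]

# Toy example: Spider-Man is the runaway hub pulling everyone toward him
edges = [("Spider-Man", "Daredevil"), ("Spider-Man", "Punisher"),
         ("Spider-Man", "Elektra"), ("Daredevil", "Elektra")]
remaining = drop_top_hubs(edges, n_hubs=1)
# Only the Daredevil–Elektra edge survives
```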

When the number of superheroes felt right (meaning the file was no longer crashing my old computer), I plugged the data sheet into Gephi and got something I wasn’t initially pleased with (early incarnations of my unsightly first drafts conclude this lab report). The data was still very occluded: the character bubbles all gravitated toward the most-linked character. To change this, I played around with statistical settings, functions like modularity and density, but they only drew the character bubbles closer together, making it harder to make sense of an already large and potentially very confusing graph. I found that the simplest settings, with minimal tweaks in layout and attributes, created the simplest, most cohesive graph. Honestly, it felt like Fruchterman-Reingold did most of the work. At the conclusion of this lab, I felt like Gephi was the master of me and not the other way around.
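For the curious, the Fruchterman-Reingold layout Gephi applies is conceptually simple: every pair of nodes repels, every edge attracts, and a cooling “temperature” caps how far nodes can move each iteration. Below is a bare-bones sketch of that idea, my own simplification rather than Gephi’s actual implementation:

```python
import math
import random

def fruchterman_reingold(nodes, edges, iterations=50, size=1.0, seed=42):
    """Minimal Fruchterman-Reingold: repel all pairs, attract along edges."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-size, size), rng.uniform(-size, size)] for n in nodes}
    k = size / math.sqrt(len(nodes))          # ideal edge length
    temp = size / 10                          # caps per-step movement, cools over time
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        for i, a in enumerate(nodes):         # repulsive force between every pair
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = max(math.hypot(dx, dy), 1e-9)
                f = k * k / d
                disp[a][0] += dx / d * f; disp[a][1] += dy / d * f
                disp[b][0] -= dx / d * f; disp[b][1] -= dy / d * f
        for a, b in edges:                    # attractive force along each edge
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = max(math.hypot(dx, dy), 1e-9)
            f = d * d / k
            disp[a][0] -= dx / d * f; disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f; disp[b][1] += dy / d * f
        for n in nodes:                       # move each node, capped by temperature
            dx, dy = disp[n]
            d = max(math.hypot(dx, dy), 1e-9)
            pos[n][0] += dx / d * min(d, temp)
            pos[n][1] += dy / d * min(d, temp)
        temp *= 0.95                          # cool down so the layout settles
    return pos

# Toy usage: a small chain of X-Men
layout = fruchterman_reingold(["Cyclops", "Storm", "Beast"],
                              [("Cyclops", "Storm"), ("Storm", "Beast")])
```

Clusters emerge because tightly linked characters keep pulling each other together while everyone else is pushed away, which is exactly the supergroup structure the final visual shows.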


There should be a key with this graph. As it stands, you would need a human guide giving at least brief exposition on what it all means, especially if the viewer doesn’t know comics. Or maybe the bubbles speak for themselves; who knows? Fruchterman-Reingold is very no-frills, and its settings presented my data in the way that made the most sense.

I always start by making my graphs look crazy with bizarre color schemes. Even though I did that this time as well, I ultimately found that simple is best when you are representing something this large. It’s best to keep color innovation to a minimum when you are working with 11,000 pieces of networked data spread among 500 nodes. But maybe I could have used a bolder color than the purples.

I had trouble with Gephi, as mentioned above, because the dataset was so large. At first, I didn’t really enjoy working with datasets or Gephi because I didn’t feel I could customize my graph beyond superficial things like color, size, or line widths. And since the dataset was so large, I felt like it had more control over the final result of my lab than I did. By the end of working on this lab report, I had discovered that you can download plugins that make Gephi more customizable. That aside, I also realize now that Gephi is essential for large datasets like this one; it would be impossible to map huge mines of social data without graphs like these.

On a final, superficial note: I saw many relational network visualizations online at the time of writing this report, and it struck me that the ones with bold colors for the edge lines looked best against black backgrounds. Seeing those images makes mine feel a bit boring by comparison. In a future revision, I would make mine sleeker and more eye-catching by borrowing that approach.

Evolution of the Visual

A: The Data

Characters matched with ID number.


Source, Target, and Type!


B: Early Draft


Source for dataset: