by Cameron Dudzisz-Pounds
Introduction
While the Marvel Cinematic Universe movies and TV shows have dominated pop culture for the last decade-and-change, Marvel Comics has been publishing steadily since the 1930s. Since then, thousands of characters have been introduced across at least as many comic issues. In a previous post, I examined the appearances of these characters within the comics and how often they are connected (or not, as is the case with much of Thor’s supporting cast) to other characters within the universe through comic co-appearances.
For this report, I wanted to continue along this line of examination, but my original dataset was limited: it only contained the names of characters and issues (the number of appearances could be inferred from these mappings, but it was not an explicit data field), so I couldn’t do much more with it than I did in that lab. I went looking for a richer dataset and found one by Andrew Flowers, a data journalist at FiveThirtyEight.com, that included many more fields, including first appearances, morality, physical traits, and even LGBT status (though, as we will see later, this single field is inadequate), which will be useful for more faceted and detailed analysis. With this data, I want to see which characters appear most frequently compared to how long they’ve been in use, both in total and as a ratio. I also want to make some demographic comparisons using the morality and physical features data to determine if there are any obvious biases in the hair and eye colors given to “good,” “bad,” and “neutral” characters. As these are fictional characters, not real people, their traits are the result of their writers’ choices.
Tools & Process
The tools used are:
- GitHub: I didn’t “use” it, but it is where I sourced my data, a .csv sheet by Andrew Flowers at FiveThirtyEight
- OpenRefine 3.5.2: cleaning and editing data
- Marvel Database: research and fact-checking
- MS Excel: opening the native .csv files and reviewing data
- Tableau Public Online & Desktop: creating visualizations
- Discord: sharing visualizations with UX testers
- Greenshot: taking cropped screenshots for sharing in UX surveys and the report
- WordPress (here): writing this report
After downloading the .csv file from GitHub and quickly checking in Excel that the file had saved and opened properly, I loaded it into OpenRefine to clean the data. The first thing I did was remove the parentheticals in the name field. While many of these are alternate names, this was not consistent: some characters are listed under their real name or “secret” identity, others under their superhero name, and others have no alternate name in a parenthetical at all despite having alternate identities. Since this was not consistently applied, I removed this information. I also stripped the “Earth-616” parenthetical that was in most of the names, as it only indicates that the character was used in the “main” continuity of the comics and was not needed.
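I did this step in OpenRefine, but the same clean-up could be sketched in Python. This is only an illustration under a few assumptions: the function name is mine, the sample names are made up in the style described above, and parentheticals are assumed never to nest.

```python
import re

# Matches one parenthetical plus any whitespace in front of it,
# e.g. " (Earth-616)". Assumes parentheses are never nested.
PARENTHETICAL = re.compile(r"\s*\([^()]*\)")


def strip_parentheticals(name: str) -> str:
    """Return the name with every "(...)" group removed."""
    return PARENTHETICAL.sub("", name).strip()


# Made-up sample values, not rows copied from the dataset.
samples = [
    "Spider-Man (Peter Parker)",
    "Namor McKenzie (Earth-616)",
    "Rogue",
]
cleaned = [strip_parentheticals(s) for s in samples]
print(cleaned)
```

Because the pattern removes every parenthetical, it drops alternate names and the “Earth-616” tag in one pass, which matches the decision to discard both.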
Unfortunately, while the data that was present was mostly free of formatting errors and typos, many characters had missing data, often including vital information such as the number of appearances or the character’s first appearance. This was not limited to obscure, old, or low-appearance characters. Well-known, popular, and important characters like Namor (one of the oldest Marvel characters, first appearing before Marvel was even its own company, and the 14th most frequently occurring by issue count) and Rogue (a popular X-Man who has been a viewpoint character in many comics and adaptations) were among those with missing data. With 16,377 entries, it would have taken far too long to add and correct all of the missing data manually. However, I did not want to exclude the more important characters. So, for the 150 highest-appearing characters, I went through the list and manually added any missing data using the Marvel Database. I then re-ran facet checks to catch any typos I had made. After this, I removed every character that had no value for number of appearances, as that was a vital field for my later work. I did not remove characters with missing data in other fields, such as eye or hair color, as I could still use them for other graphs. This took my data from 16,377 character entries to 10,471. Once this was done, I moved my work into Tableau to design my visualizations.
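For anyone reproducing the filtering step outside OpenRefine, a rough Python equivalent might look like the sketch below. The column names and the three sample rows are invented for illustration; the real sheet has many more columns and its headers may be spelled differently.

```python
import csv
import io

# Tiny made-up extract standing in for the real .csv file.
SAMPLE_CSV = """name,appearances,eye
Character A,120,Blue Eyes
Character B,,Brown Eyes
Character C,15,
"""


def keep_rows_with_appearances(rows):
    """Keep only rows whose 'appearances' cell is non-blank."""
    return [row for row in rows if (row.get("appearances") or "").strip()]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = keep_rows_with_appearances(rows)
print(len(rows), "->", len(kept))
```

Character C survives even though its eye color is blank, since only the appearances field decides whether a row is dropped.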
Visualizations & Design
For my first visualization, I made a point chart of characters plotting their debut year against their total number of appearances, with Morality as a detail color. I colored “Good” characters blue, “Bad” characters red, and “Neutral” characters purple. In English-speaking countries, blue is commonly used as visual shorthand for good and red for bad or evil, and purple is, of course, a combination of the two.
This chart uses all of the same information as the previous one, but here I also subtracted the year of each character’s introduction from 2014 to get how many years they’ve been in Marvel’s roster, then divided their total number of issues by that number to get their average appearances per year since their introduction. Some characters have different colors in their bars or multiple first issues because the character has fulfilled different story roles or has alternate versions or reboots.
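The per-year average boils down to one subtraction and one division. Here is a minimal Python sketch of that arithmetic, not the calculated field as built for the chart. The function name is mine, the numbers in the usage lines are invented, and returning None for a character who debuted in the reference year is my own assumption to avoid dividing by zero.

```python
from typing import Optional

REFERENCE_YEAR = 2014  # the year the debut year is subtracted from


def appearances_per_year(appearances: int, debut_year: int,
                         reference_year: int = REFERENCE_YEAR) -> Optional[float]:
    """Average appearances per year since the character's debut."""
    years_active = reference_year - debut_year
    if years_active <= 0:
        # Assumption: a debut in (or after) the reference year has no
        # full year to average over, so no value is returned.
        return None
    return appearances / years_active


# Invented numbers, for illustration only.
print(appearances_per_year(730, 1941))  # 73 years in the roster
print(appearances_per_year(50, 2013))   # 1 year in the roster
print(appearances_per_year(5, 2014))    # debuted in the reference year
```

A character introduced in 2013 is divided by 1, so their single-year total becomes their average, which is why very recent characters can rank so high.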
This table charts Marvel’s LGBT characters by decade of introduction and includes morality, with a total for each decade. The various categories are distinguished by color, but more about this chart is discussed in the Findings, as the way the data was structured was severely limiting.
These treemaps track the hair and eye colors, respectively, of the characters separated by morality. I originally structured these as bar charts, but treemaps better demonstrate the differences within each morality, which is more important than comparing between moralities.
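What a treemap shows within each morality is essentially each color’s share of that group. The sketch below computes those shares in Python using made-up (morality, hair color) pairs rather than real counts from the dataset; the function name is also just for illustration.

```python
from collections import Counter, defaultdict

# Made-up (morality, hair color) pairs, not real counts from the dataset.
records = [
    ("Good", "Black"), ("Good", "Black"), ("Good", "Brown"), ("Good", "Bald"),
    ("Bad", "Black"), ("Bad", "Bald"), ("Bad", "Bald"), ("Bad", "Brown"),
]


def shares_within_group(pairs):
    """Map each group to {category: share of that group's records}."""
    counts = defaultdict(Counter)
    for group, category in pairs:
        counts[group][category] += 1
    return {
        group: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for group, counter in counts.items()
    }


shares = shares_within_group(records)
for group, by_color in shares.items():
    # Rank colors from most to least common within the group.
    ranked = sorted(by_color.items(), key=lambda kv: kv[1], reverse=True)
    print(group, ranked)
```

Normalizing inside each group keeps the comparison fair even when one morality has far more characters than another.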
Findings
From the first graph, the four most used characters (Iron Man, Wolverine, Captain America, and Spider-Man) are separated from the rest of the characters by a pretty large gap. These four are all on the older end, with Captain America introduced in 1941, and Wolverine, the most recent, in 1974. Interestingly, all the villains are clustered at the bottom, with even the most used “Bad” characters barely scraping 800 appearances. This may not be entirely surprising, as antagonists are going to appear less often than protagonists: Spider-Man doesn’t only fight Green Goblin. However, as this chart only looks at the raw number of appearances, popular older characters are naturally going to have more appearances, just from having had more time to appear in releases. To correct for this, we will move on to the next visualization.
In the next chart, we’ve calculated the average number of appearances per year. This means that more recently created popular characters will appear higher up, as they don’t have to “catch up” to the backlog of older characters. The previous top four are still near the top, but there are a few new characters up there with them. Of note are Eva Bell and Christopher Muse, two characters that were introduced only the year before but appeared in many issues in that year, which explains how they took the 4th and 5th spots despite not even getting a generated label on the previous graph. Spider-Man and Wolverine are still at the top by a wide margin, though Captain America has fallen a few rankings as a result of this averaging and Iron Man rose a spot.
The chart of LGBT characters is a complete list (as of 2014) of the officially LGBT characters among the 10,471 total. This is a fairly small number, but it shows a fairly strong pattern of more LGBT characters being introduced each decade, from only 2 in the 1940s (one of them being Loki, a character based on a mythological being who is LGBT) to 26 in the 2000s, and 15 in the first 3 years of the 2010s. There are some flaws in this data that make it unreliable, however. As everything is encoded in a single field, transgender characters cannot also have their sexualities recorded. Additionally, asexual and aromantic are not included as categories at all, nor is heterosexual: heterosexual characters, along with characters of any sexuality not explicitly included, are simply left blank in that field and thus can’t be differentiated from one another.
The hair and eye color treemaps look mostly similar between moralities, but they have some small, enlightening differences, and even the similarities can tell stories. Among all three, black and brown hair (in that order) are the most common hair colors by far. A little further down is where we see the first interesting difference. Relatively far fewer good characters are bald or have no hair (the latter being species or entities that don’t naturally have hair, such as robots or reptilians), while neutral and especially bad characters are much more likely to not have hair. We can see an even more pronounced example of this phenomenon in the eye color treemap, where red eyes are the 8th most common eye color for good and neutral characters but jump all the way to the 3rd most common for bad characters. Blue and green eyes are very common in every category, possibly indicating an overrepresentation of Caucasian characters.
UX Research
My first reviewer for the UX research was very familiar with Marvel and its characters, but not with visualization and UX design except as a regular end user. We got a little sidetracked while discussing the visualizations, as he was distracted several times by what the data was saying about the characters and by what he saw as inaccuracies according to his own interpretations of them, and I had to steer him back to the visualizations themselves.
Once we got back on topic, he pointed out an issue with the dashboard that was appearing on his screen but not on mine. On the LGBT chart, the rows were being squished to the point that the names were partly cut off and difficult to read, so I made a slight alteration to center the text in each cell.
My second tester was my wife, who is more experienced with UI from her tech sales job (and from having to make visualizations for presentations to prospects). She advised me to increase the contrast of the chart.
With my wife’s feedback, I made the same color adjustments in the average appearances chart.
Reflections
In hindsight, while writing this report, I realized I should not have removed any characters with missing data at the OpenRefine stage. Tableau allows users to filter out entries with Null data in a given field, which I ended up using anyway to exclude characters with missing data from particular visualizations. Although the characters cut were generally unimportant or background characters, cutting such a large number potentially skewed the data for the other demographic charts and comparisons. I should not have excluded them at such an early stage, when the data that was present in their entries could still have been used for those visualizations.
Alternatively, it is possible to manually re-add the missing data, but it would be extremely time consuming and of questionable value without a large budget or many volunteers. Removing characters based on missing appearance data alone cut the list by more than a third, and there are many more characters with missing fields than just those.
There are many more ways to combine and compare the data in this sheet than what I did in this report. Looking at which characters are alive or dead (although this being comics, that could change at any moment for any given character) against the various traits, number of appearances, and so on could be used to gain additional insight.
I should also have used a strict script for the UX research. I had initially intended to write a Google Form script, but I ran out of time, and my ad-libbed interviews were less effective than more formal research would have been.