Gotta Connect ‘Em All – A Network Visualization for Pokémon


Lab Reports, Networks, Visualization

Introduction

With more than 25 years of history and fans across many generations of children, Pokémon is a famous Japanese media franchise centered on fictional creatures called “Pokémon”. It started as video games on the Game Boy handheld console, in which the goal is to catch, train and battle with these creatures. The franchise began with 151 species, and as of today The Pokémon Company has created more than 900. Although each Pokémon is unique and has its own characteristics, many of them share interesting attributes that we can explore with a network visualization.

Preparing Data

For this study, we are going to use a dataset found on Kaggle.com. The dataset was scraped from the well-known website http://serebii.net/, which has very complete information on all existing Pokémon. However, the dataset was created 5 years ago, so the newest generation of Pokémon, with index numbers after 800, is not included.

About the Dataset & Cleaning Data

Figure 1: A preview of the dataset downloaded from Kaggle.com

The dataset consists of all the Pokémon from generations 1 to 7 and has a total of 41 columns for each Pokémon. If we take a closer look, we can see that some columns need cleaning up and others are not very helpful for the visualization we will create later on.

Abilities

Each Pokémon can have up to 3 abilities (2 regular abilities and 1 hidden ability). Column 1 of this dataset lists the abilities each Pokémon can have. While scrolling through the dataset, I found some rows with more than 3 abilities, which confused me a little. It turns out that those Pokémon have a different form, known as the Alolan form, which changes the abilities the Pokémon can have. For example, #19 Rattata and its evolution #20 Raticate list a total of 6 abilities (Run Away, Guts, Hustle, Gluttony, Hustle, Thick Fat) in column 1. To simplify the dataset, this study focuses only on the first two abilities of each row: each of the two is separated into its own column, and anything after them is deleted from the dataset.
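The "keep the first two abilities" step can be sketched in a few lines. This is only an illustration in Python, not the cleaning procedure actually used for this study; the function name `split_abilities` and the assumption that the cell is a comma-separated string (optionally wrapped in brackets and quotes) are mine.

```python
def split_abilities(raw, keep=2):
    """Return the first `keep` abilities from a raw cell, padded with None."""
    # Assumption: abilities are comma-separated; strip list-style brackets
    # in case the cell looks like "['A', 'B']".
    cleaned = raw.strip().strip("[]")
    names = [part.strip().strip("'\"") for part in cleaned.split(",")]
    names = [name for name in names if name]  # drop empty fragments
    first = names[:keep]
    # Pad so that every row yields exactly `keep` columns.
    return first + [None] * (keep - len(first))


rattata = "Run Away, Guts, Hustle, Gluttony, Hustle, Thick Fat"
print(split_abilities(rattata))  # ['Run Away', 'Guts']
```

A Pokémon with a single ability would get `None` in the second column, which keeps the two new columns aligned across all rows.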

Figure 2: Normal Rattata (left) and Alolan Rattata (right)

Type weakness/advantage

There are a total of 18 Pokémon types, and each Pokémon can have up to 2 types at the same time. The types are as follows:

Figure 3: Pokemon types

Each of these types has an advantage or weakness against one or more other types. Therefore, when Pokémon battle each other, an attack can deal 2 times, 1 time or 1/2 times its original attack points. Columns 2 to 19 of the dataset hold the calculated attack multipliers against the different types. However, these columns are not very useful in a network visualization, where we mostly focus on categorical data such as the actual types of the Pokémon, so we will remove them.

Statistics and other non-important columns

Each Pokémon has basic statistics such as Attack, Defense, Weight, Height, Speed, etc. These are all numerical data, and the values can differ a lot from Pokémon to Pokémon. Since this study focuses only on making a network visualization, these columns will be deleted. They would most likely be better represented with bar graphs or scatter plots that show the distribution of the statistics.
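Dropping the type-multiplier and statistics columns amounts to keeping a short list of categorical columns. Below is a minimal Python sketch of that idea; the column names (`name`, `type1`, `type2`, `attack`, `against_fire`) and the helper `keep_columns` are placeholders for illustration and may not match the headers in the Kaggle file.

```python
def keep_columns(rows, keep):
    """Return copies of the row dicts containing only the columns in `keep`."""
    return [{col: row.get(col, "") for col in keep} for row in rows]


# One illustrative row with hypothetical column names.
sample = [
    {"name": "Bulbasaur", "type1": "grass", "type2": "poison",
     "attack": "49", "against_fire": "2"},
]
print(keep_columns(sample, ["name", "type1", "type2"]))
# [{'name': 'Bulbasaur', 'type1': 'grass', 'type2': 'poison'}]
```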

Generation & isLegendary

Finally, we have the generation column and the isLegendary column. The generation column holds the numbers 1 to 7. These could be somewhat confusing, so the values are renamed to their corresponding region names: 1 – Kanto, 2 – Johto, 3 – Hoenn, 4 – Sinnoh, 5 – Unova, 6 – Kalos, 7 – Alola.

The isLegendary column indicates whether the Pokémon is a legendary Pokémon or not. Most Pokémon are not legendary, but we will replace the values 1 and 0 with “Yes” and “No” for better understanding.
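The two renaming steps are simple lookups. As a small Python sketch (the function name `relabel` is made up for this example, and it assumes the cells hold the numbers as integers or digit strings):

```python
# Generation number to region name, as listed above.
REGIONS = {
    1: "Kanto", 2: "Johto", 3: "Hoenn", 4: "Sinnoh",
    5: "Unova", 6: "Kalos", 7: "Alola",
}


def relabel(generation, is_legendary):
    """Turn the numeric generation and legendary flag into readable labels."""
    region = REGIONS[int(generation)]
    legendary = "Yes" if int(is_legendary) == 1 else "No"
    return region, legendary


print(relabel("1", "0"))  # ('Kanto', 'No')
print(relabel(7, 1))      # ('Alola', 'Yes')
```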

As a result, here is a preview of the final set of columns that we are going to use for this study:

Figure 4: Cleaned data

Transforming Data using R

Now we have to prepare the dataset in a format that the software we are using can understand. For this task we will use R with RStudio. This allows us to transform the roughly 800 rows of data above into a list of “edges” of the network graph. An edge is the connection between two nodes, and each node is the value of a cell in the table above. In other words, we use R to generate the list of edges, which gives us all the connections between the values that exist in this dataset.
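To make the idea of an edge list concrete, here is a minimal sketch in Python rather than R. It is not the script used for this study. It assumes one possible reading of the step above: every pair of non-empty cell values that appear in the same row becomes an edge, and the number of rows in which the pair co-occurs becomes its weight. The function names and sample column names are hypothetical; `Source`, `Target` and `Weight` are the column headers Gephi recognizes when importing an edge table.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def build_edges(rows, columns):
    """Count how often each pair of cell values appears together in a row."""
    weights = Counter()
    for row in rows:
        # Collect the non-empty cell values of this row, without duplicates.
        values = []
        for col in columns:
            value = (row.get(col) or "").strip()
            if value and value not in values:
                values.append(value)
        # Every unordered pair of values in the same row becomes one edge.
        for source, target in combinations(values, 2):
            weights[tuple(sorted((source, target)))] += 1
    return weights


def edges_to_csv(weights):
    """Serialise the edges as CSV text with Source/Target/Weight headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


rows = [
    {"name": "Bulbasaur", "type1": "grass", "ability1": "Overgrow"},
    {"name": "Ivysaur", "type1": "grass", "ability1": "Overgrow"},
]
print(edges_to_csv(build_edges(rows, ["name", "type1", "ability1"])))
```

In this sample, the pair of "grass" and "Overgrow" appears in both rows, so its edge gets a weight of 2. One caveat of the sketch: values are matched by text only, so identical strings in different columns would collapse into a single node.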

Figure 5: A sample diagram for a simple network graph (source: “How To Get Started with Social Network Analysis” by Mitchell Telatnik, Towards Data Science)

Figure 6: A preview of using RStudio

Figure 7: Generated result of list of edges

Graphing the Dataset

To graph the dataset, we will use a software called Gephi. Gephi is a simple but powerful tool that allows users to generate network visualizations. To create a new visualization, we first prepare the data as we did in the section above, then import it into Gephi, which automatically generates the network graph.

At first, when you import the data into Gephi, the graph probably doesn’t make any sense, and you will probably see something similar to this.

Figure 8: Gephi preview

So we have to set the layout using one of the algorithms provided. In this case we will use Force Atlas and set the Repulsion strength to 450000.0, since the amount of data we have is huge.

Another important step I found was partitioning the nodes by kind. To do this, it was helpful to add another column to the Nodes list generated inside Gephi. This way we can color the nodes so that all the Pokémon nodes are one color, all the ability nodes are another color, and so on.

At first, there was one more kind of node. If we look back at the original and cleaned datasets, we can see a column called “Classification”.

Figure 9: Cleaned dataset

However, after graphing the data, I realized that the classification nodes made the connections very complicated. Most classifications are unique to a single Pokémon and are shared only within an evolution family. For example, Bulbasaur evolves into Ivysaur, and both are classified as “Seed Pokémon”. It’s very uncommon for Pokémon outside the same evolution family to share a classification, so I left the classification nodes out of the graph.

Here is a preview of the typed nodes, with a label column added as well so that the network graph can show labels later on.

Figure 10: Enhanced Nodes list
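In this study the extra column was added to the Nodes list inside Gephi. As an alternative illustration, a typed nodes table could also be prepared before import. The Python sketch below assumes hypothetical column names and a made-up `Kind` column; `Id` and `Label` are the headers Gephi uses for node tables.

```python
def build_nodes(rows, column_kinds):
    """Map each distinct cell value to the kind of node it represents."""
    kinds = {}
    for row in rows:
        for col, kind in column_kinds.items():
            value = (row.get(col) or "").strip()
            if value:
                # The first column that mentions a value decides its kind.
                kinds.setdefault(value, kind)
    # "Kind" is an illustrative name for the partition column.
    return [{"Id": value, "Label": value, "Kind": kind}
            for value, kind in kinds.items()]


rows = [
    {"name": "Bulbasaur", "type1": "grass", "type2": "poison", "ability1": "Overgrow"},
    {"name": "Ivysaur", "type1": "grass", "type2": "poison", "ability1": "Overgrow"},
]
column_kinds = {"name": "Pokemon", "type1": "Type", "type2": "Type", "ability1": "Ability"}
for node in build_nodes(rows, column_kinds):
    print(node)
```

Each distinct value appears once in the result, so "grass" becomes a single node of kind "Type" even though it occurs in both rows.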

Here is a snapshot of the graph generated after applying the layout algorithm, partitioning (typing) the nodes, and resizing the labels and nodes to correspond with their degree (how many connections a node has to other nodes).
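Gephi computes the degree itself, but the definition is easy to show in code. A minimal Python sketch, assuming the edges are given as (source, target) pairs with sample values chosen for illustration:

```python
from collections import Counter


def node_degrees(edges):
    """Count, for every node, how many edges touch it."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


edges = [("Bulbasaur", "grass"), ("Ivysaur", "grass"), ("Venusaur", "grass")]
print(node_degrees(edges)["grass"])  # 3
```

A node such as "grass", which many Pokémon connect to, ends up with a high degree and is therefore drawn as a larger circle.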

Figure 11: A preview of using Gephi to create network graph

At first the graph was all grey, but we want to distinguish what kind of node we are looking at. So I picked a pinkish red for the Pokémon and light blue for the abilities. For types I used light green because it’s easier to see within the sea of red and blue circles. As a “background”, I used a yellow-to-orange-to-red gradient scale for the edges, based on degree (how many connections the node has). For the generations (regions) of the Pokémon, I used dark green because it is the closest to the color of grass, which is what I think of when I think of a region. Finally, for the legendary Pokémon group, I used a yellow/gold color to make the node stand out a little more among the smaller circles.

Final Results

Figure 12: Final result

Above is the final version of the network graph. The regions are grouped at the center, as they are the most impactful classification for all the Pokémon. We can see that the Unova region has the largest number of Pokémon and is therefore the biggest circle. We can also see that the types are grouped around the regions, and if we look closely, the individual Pokémon are grouped nicely around the type nodes according to their types. Interestingly, a lot of Pokémon of the same type also share similar abilities.

Figure 13: A zoomed in version of the final network graph

If we look at the color of the edges, we can also see which abilities are the most common within each type. For example, a lot of grass-type Pokémon have the Chlorophyll and Overgrow abilities, and their lines are noticeably redder than the other edges.

Figure 14: Zoomed in version of final result graph on grass type

Conclusion & Future Direction

This was definitely a very interesting dataset to play with. I learned a lot, not just about making network graphs but also about how many connections there are between all the Pokémon species. There are a couple of things that I would like to improve and perhaps look into further.

First, I am still not very satisfied with the color choice, because the result looks like a diagram full of negative and positive ions that you would see in a chemistry textbook. If possible, I would want to use actual Pokémon images instead of the little circles with labels. For types, I want to use the colors that are used in the games (Figure 3) so that the graph is easier to read.

Next, having too many attributes can make the graph very confusing. As we have types, regions, abilities and Pokémon all in one graph, it’s very difficult to read until you zoom in on each section. If I were to redo this, I would probably separate them into different sections so that the graph is easier to read.

An interactive graph in a report would be much better than static images. During this study, I realized that the preview was interactive: you can hover over each node and see what it is connected to. That interactivity is lost when the final result is exported as SVG/PDF/JPEG. I later found out that it is actually possible to embed the interactive version using SigmaJS.

Finally, there are many numeric columns that we omitted during this study. There are surely more interesting findings in those numerical data, so next time I hope to find ways to put them into visualizations and see what results they show.