the NETwork in SCI-FI books world



For more than a century, the human imagination for science fiction has never stopped. There is a wealth of fascinating literature depicting worlds that exist only in our imagination. How much do we know about those sci-fi books? What are they about? Are they related to each other? This project uses Gephi as an analytical visualization tool to explore the network among the 147 sci-fi books in a database.


When I browsed online for inspiration, the article “Marvel’s New ‘Uberframework’ Graphs Every Character In The Universe” caught my eye. It describes how the Marvel team visualized its massive database to show fans the relationships between characters.

The infographic they produced was intriguing. Below is a powerful illustration of the character connections, showing the social network of the superheroes. I liked the layout of this network. The black background makes the lighter-toned network stand out, and the variation in color shows the range of node degrees. It is a clear infographic that conveys both the overall relationships and the center of the network.

Fig. 1 A Marvel superhero social network


CASOS – an open data resource
OpenRefine – to clean up the data
R – to edit the spreadsheet and make an edge table
Gephi – to visualize the network based on the data


Data preparation
First, I found a sci-fi books dataset on CASOS. This dataset was collected by Dr. Kathleen M. Carley. It did not contain a pre-built network, so I had to generate one from the variables. The variables in this dataset include the book titles, the authors’ names and genders, the year each book was written, and the content types.
To generate a network from those variables, I started by analyzing the content types of each book. The data has 11 columns describing the content types of the story. Each book is rated on each of those 11 content types on a four-level scale from 0 to 3:
0 = that was not present;
1 = that was present but peripheral;
2 = that was present at a stronger level but not strongly integral to the story;
3 = that was present, strong and integral to the story

The 11 content types are shown below.

Fig. 2 A screenshot of the original table showing the content types and each book’s ratings

From the table, the connections between books are easy to find: two books are connected when they share a content type. If two books both have a value greater than 0 in the same content column, I counted that as one edge. A value of 0 means the book has no connection through that content type.
To count the edges between books, I imported the spreadsheet into R and built an edge table. Basically, what I did in R was make a network based on the 11 content types given in the spreadsheet. I paired up the books that share a content type, then made an edge table listing every single connection between them. Then I eliminated the self-pairs, where the source and the target were the same book. After finishing the edge table, I used OpenRefine to make a node table.
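
The pairing rule described above can be sketched in a few lines. This is a minimal Python illustration, not the R script used in the project; the book titles and the column names (type_1, type_2, type_3) are hypothetical placeholders rather than the dataset’s actual headers.

```python
from itertools import combinations

# Hypothetical sample: book title -> rating (0-3) for each content type.
# Titles and column names are placeholders, not the real dataset.
ratings = {
    "Book A": {"type_1": 3, "type_2": 0, "type_3": 1},
    "Book B": {"type_1": 1, "type_2": 2, "type_3": 0},
    "Book C": {"type_1": 0, "type_2": 1, "type_3": 0},
}


def build_edge_table(ratings):
    """Return one (source, target, content_type) row per shared content type."""
    content_types = sorted({c for row in ratings.values() for c in row})
    edges = []
    for content_type in content_types:
        # Books in which this content type is present (rating greater than 0).
        present = sorted(
            book for book, row in ratings.items() if row.get(content_type, 0) > 0
        )
        # combinations() never pairs a book with itself, so there are no
        # self-loops to remove afterwards.
        for source, target in combinations(present, 2):
            edges.append((source, target, content_type))
    return edges


edges = build_edge_table(ratings)
for row in edges:
    print(row)
```

Because the pairs come from combinations, each pair of books appears once per shared content type, which matches the rule of counting one edge per shared content column.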

Import the data into Gephi
First, I imported the edge table into Gephi. I was not satisfied with the node table that Gephi generated automatically, because it lacked much of the information I needed, so I imported my own node table into Gephi.
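
As a rough illustration of the two tables, the sketch below writes a node table and an edge table as CSV text in Python. Gephi’s spreadsheet importer looks for an Id column (and optionally Label) in node tables and for Source and Target columns in edge tables; the other column names (author, gender, year) and all the values are assumptions made up for the example.

```python
import csv
import io

# Hypothetical node rows. Only Id and Label are names Gephi looks for;
# the remaining attribute columns and all values are placeholders.
nodes = [
    {"Id": "b1", "Label": "Book A", "author": "Author One", "gender": "F", "year": 1900},
    {"Id": "b2", "Label": "Book B", "author": "Author Two", "gender": "M", "year": 1950},
]

# Hypothetical edge rows; Source and Target must match node Ids.
edges = [
    {"Source": "b1", "Target": "b2", "Type": "Undirected"},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


node_csv = to_csv(nodes, ["Id", "Label", "author", "gender", "year"])
edge_csv = to_csv(edges, ["Source", "Target", "Type"])
print(node_csv)
print(edge_csv)
```

Supplying a hand-made node table this way is what lets attributes such as the author’s gender be used later for coloring and filtering.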


1. The authors’ genders and modularity classes

Gender is an interesting aspect to explore in the context of science fiction. The data records the gender of each book’s author, and I used it as part of my analysis. As the image below shows, all the male authors are labeled in blue, while the female authors are in pink. There were almost twice as many blue nodes as pink nodes, which shows a numerical imbalance between male and female writers. (See Fig. 3)

Fig. 3 A network showing the authors’ genders in two different colors

Does the author’s gender affect other attributes, or vice versa? To explore further, I ran the modularity calculation. With the resolution setting at 0.8, the nodes were grouped into six modularity classes, shown in six colors. I labeled each node with the book’s title for reference. When I juxtaposed the gender network and the modularity class network, the similarity between the two was very vague. The clusters in the modularity network were not reflected in the gender network, so I could not see a clear connection between the author’s gender and the modularity classes. In other words, the author’s gender is not the main reason the clusters formed. There might be other reasons behind them.
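
One way to make this visual comparison more concrete would be to cross-tabulate gender against modularity class. The sketch below is a hypothetical Python illustration, not part of the original analysis: the class numbers and gender codes are made-up sample values, and it assumes each node’s modularity class has been exported from Gephi alongside the gender attribute.

```python
from collections import Counter

# Hypothetical (modularity_class, gender) pairs, one per book.
# These values are invented for illustration, not taken from the dataset.
nodes = [
    (0, "M"), (0, "F"), (0, "M"),
    (1, "M"), (1, "F"), (1, "M"),
    (2, "F"), (2, "M"), (2, "M"),
]


def gender_share_by_class(nodes, gender="F"):
    """Return the fraction of nodes with the given gender in each class."""
    totals = Counter(cls for cls, _ in nodes)
    matches = Counter(cls for cls, g in nodes if g == gender)
    return {cls: matches[cls] / totals[cls] for cls in sorted(totals)}


shares = gender_share_by_class(nodes)
for cls, share in shares.items():
    print(f"class {cls}: {share:.0%} female-written")
```

If the shares are roughly the same in every class, as in this made-up sample, gender does not explain the clustering; a class with a much higher or lower share would suggest otherwise.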

Fig. 4 Modularity classes

2. The female writers
Analyzing the gender of each book’s author aroused my interest in the female writers. In science fiction, female writers, as the minority in the field, have not been taken seriously, so a study focusing only on them would be valuable. My assumption was that female-written books might share similar tastes or a stronger preference for a certain content type that sets them apart from male-written books. For example, there might be more romance-related books written by women than by men, because women are usually thought of as more emotional. This assumption was probably a stereotype, but I want to be honest about what was in my mind at that moment.
To test it, I first highlighted the female writers in pink and kept all the male writers in dark grey to get a clearer overview. The pink nodes in this network were sparse, and there were no distinct clusters among them. This result went against my assumption. Female writers, as a demographic group, differed from one another in the sci-fi books they wrote. They were not clustered by content, which means they rarely shared a preferred sci-fi type or genre. In other words, female writers were open to many topics and possibilities, and they were able to write influential sci-fi literature across a wide variety of them. By this measure, there are no considerable differences in content between female-written and male-written books.
But again, female writers were still the minority. I hope more female writers will join the field in the future to balance the numbers.

Fig. 5 A network showing the authors’ genders, with the female writers highlighted in pink

Furthermore, when I filtered the books by timeline, another surprising finding was that the earliest influential sci-fi book in this dataset, Frankenstein, was written by a female writer, Mary Shelley. The dataset lists it as written in 1842, the earliest year recorded in the table (the novel was first published in 1818). As I moved the time filter forward from that year, the network showed this book connected to, and in my reading influencing, many male-written books for over one hundred years, until the mid-20th century. (See Fig. 6) It was also the only female-written book in the dataset for about a century.

Fig. 6 The network of books written before 1950. The nodes are colored by gender.
Blue = male writer, pink = female writer


First, because of limited time, I did not figure out why the modularity classes formed, which would have been a valuable finding. I assume they are related to the content of each book, but the content types were not included in the node table. With more time, I would add the books’ content types to the node table and do further research.
Secondly, all the findings are based on the single dataset that I found. I don’t think it can reflect the full history of sci-fi books, because not every book is included.