The charge for this project was to learn about network visualization and practice with software specializing in this field. Finding a suitable dataset that was both personally interesting and comprehensive proved challenging, as many existing datasets were outdated or poorly labeled. After exploring the CASOS, SNAP, and Network Repository collections, I landed on a fascinating 2016 study in which Dr. Kathleen M. Carley collected data on acclaimed Science Fiction (SF) of the time. What stood out most was the data on SF story content classification: for each book of focus, two people read the book and came to a consensus on whether it fell within each of the following 11 categories.
- robots, androids or AI computers
- time travel
- multi-species, sentient species
- psychic powers
- novel technology (non-AI; e.g., steam-based technology is considered novel)
- after catastrophe – often post-apocalyptic
Each book was coded from 0 to 3 for each category:
- 0 = category not present
- 1 = category present but peripheral
- 2 = category present at a stronger level but not strongly integral to the story
- 3 = category present and strongly integral to the story.
This was all recorded in a spreadsheet, like the following.
I decided to focus on two aspects of the data: the categorization discussed above, and the frequency with which a book was included in a top "n" list in 2016. The first task was to convert the original data sheet into a network-friendly format by creating "nodes" and "edges" sheets. I used the book names and the 11 categories as nodes, then generated the edges sheet from how each book node connected to the 11 category nodes.
After applying a simple ID to each book, I sorted each category column from highest to lowest rating. This reorganized the table so that I could easily copy each book's ID and its corresponding rating code for each category.
| Book | ID | Magic (0-3 Rating) |
| --- | --- | --- |
| The Lion, the Witch and the Wardrobe | 12 | 3 |
| Tales of the Dying Earth | 34 | 3 |
| Lord of Light | 56 | 2 |
Any rating of "0" was omitted, as I determined that a "category not present" rating translated to no network connection. The remaining 1-3 ratings mapped directly to weight values in the edge file.
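The conversion above can be sketched in plain Python. This is a hypothetical reconstruction rather than my original spreadsheet workflow: the column names (`Source`, `Target`, `Weight`) follow Gephi's expected edge-sheet headers, and the book IDs and ratings in the sample are illustrative.

```python
# Sketch of the spreadsheet-to-edges conversion, assuming a wide table
# where each row is a book (keyed by its ID) and each column holds a
# category's 0-3 rating. Sample data below is illustrative.

def build_edges(ratings):
    """Turn {book_id: {category: rating}} into Gephi-style edge rows.

    Ratings of 0 ("category not present") are dropped entirely, and
    the remaining 1-3 ratings carry over directly as edge weights.
    """
    edges = []
    for book_id, categories in ratings.items():
        for category, rating in categories.items():
            if rating == 0:
                continue  # no network connection for an absent category
            edges.append({"Source": book_id, "Target": category, "Weight": rating})
    return edges

# Hypothetical sample: two books rated against two categories.
sample = {
    12: {"Magic": 3, "Time Travel": 0},
    34: {"Magic": 3, "Time Travel": 1},
}
edges = build_edges(sample)
# Each non-zero rating becomes one weighted edge; zeros are skipped.
```

Writing the resulting rows out as a CSV with those three headers is enough for Gephi's spreadsheet importer to recognize them as a weighted edge table.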
The nodes and edges files were then imported into Gephi for dynamic visualization. After trying each layout method, the ForceAtlas 2 algorithm with high gravity and scaling settings produced the best results.
Further refinement of the text and node sizing helped space out this dense visualization and spatially represent more popular books (those appearing on more top-n lists) with larger nodes.
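The popularity-based sizing can be sketched as a simple linear rescale from list-appearance counts to node diameters. The size range and the sample counts below are illustrative assumptions, not the exact values used in Gephi's ranking panel.

```python
# Illustrative sketch of sizing book nodes by popularity: each book's
# count of top-"n" list appearances is rescaled into a diameter range.
# The lo/hi bounds and sample counts are assumptions for demonstration.

def scale_sizes(counts, lo=10.0, hi=60.0):
    """Linearly rescale appearance counts into [lo, hi] node sizes."""
    cmin, cmax = min(counts.values()), max(counts.values())
    span = (cmax - cmin) or 1  # avoid division by zero if all counts match
    return {book: lo + (hi - lo) * (c - cmin) / span
            for book, c in counts.items()}

# Hypothetical appearance counts for three books.
sizes = scale_sizes({"Dune": 9, "Hyperion": 4, "Revelation Space": 1})
# The most-listed book gets the largest node, the least-listed the smallest.
```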
The Gephi visualization was then exported to SVG and imported into Adobe Illustrator. Significant cleanup work was done to:
- Optimize label positions
- Improve the information hierarchy by:
  - adding color for each category
  - pushing back the connecting lines
  - increasing the category text size
  - applying a narrow font family
The final result is an information-dense poster of popular science fiction in 2016 based on 11 sub-categories like magic, interplanetary travel, and novel technology.
The goal of this poster is to let people quickly browse popular science fiction works by category, looking for interesting reads based on the flavor of SF they enjoy most.
My primary reflection and concern is the accuracy of the packing/network algorithm's placement of book nodes. Manual cross-checking showed that nodes around the edge were sometimes placed inaccurately. For example, consider the book "Revelation Space" and its position on the network:
All of its connections are of equal weight, so it is unclear why this node was pushed to the outer edge rather than landing in a central location among the four related categories. Perhaps some aspect of the packing algorithm conflicted with the best fit for nodes, but this would be difficult to fix, as any lessening of packing strength would lead to a significantly larger image.
Another method I tried was to "settle" each major category node along the edge, allowing the book nodes to fall toward the center.
This method ended up creating problems, as middle nodes might be interpreted as having a bit of every category due to their central location, such as this node:
But a middle node might also simply represent a connection between two categories on opposite sides of the chart.
This method seemed less accurate, so I stuck with the original, minimally modified ForceAtlas 2 output. With more time, I would research how to better solve this problem and try layouts that do not force the nodes so close together.
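As a starting point for that experimentation, the underlying trade-off can be sketched with a toy one-step force-directed update in plain Python. This is not ForceAtlas 2 (whose model is considerably more involved); the function name and constants are illustrative, but it shows how raising the repulsion term spreads nodes apart, which is exactly why lessening the packing strength inflates the overall image.

```python
# Toy one-step force-directed update. Every pair of nodes repels
# (inverse-square style), and each edge pulls its endpoints together.
# Constants and names are illustrative, not Gephi internals.

def step(pos, edges, repulsion=1.0, attraction=0.05):
    """Return updated {node: (x, y)} positions after one force iteration."""
    new = {}
    for a, (xa, ya) in pos.items():
        fx = fy = 0.0
        for b, (xb, yb) in pos.items():
            if a == b:
                continue
            dx, dy = xa - xb, ya - yb
            d2 = (dx * dx + dy * dy) or 1e-9  # guard against overlap
            f = repulsion / d2                # push every pair apart
            fx += f * dx
            fy += f * dy
        for u, v in edges:
            if a in (u, v):
                ox, oy = pos[v if a == u else u]
                fx += attraction * (ox - xa)  # pull connected nodes together
                fy += attraction * (oy - ya)
        new[a] = (xa + fx, ya + fy)
    return new
```

Iterating `step` with a larger `repulsion` value gives roomier, easier-to-read placements at the cost of a bigger canvas, which mirrors the packing-strength dilemma described above.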