This week, I’m exploring networks with Gephi, an open-source network visualization platform. It’s a powerful graphing tool that also lets you run analyses on networks. I managed to find a dataset listing all of the New York Times Bestsellers from 2011 to 2018. I want to explore the relationships between bestselling books and publishers.
In any network, there are nodes, or actors, and edges, or connections. In this case, the nodes are publishers and authors, and each edge between them is a bestselling book that the publisher published and the author wrote. This is what we call a directed network, because every edge has a direction: it runs from a publisher (the source) to an author (the target), and never the other way around.
This dataset constitutes a “whole network,” which focuses on an entire system and its connectivity. Our system of New York Times Bestsellers is relatively limited compared to the broader writing and publishing world. For one of the most well-known and competitive lists in the industry, I expect to find that the usual suspects will have the highest degree centrality (that is, that household-name publishers will have the most connections), but I’m curious about the outliers. There are bound to be interesting patterns.
Understanding and Preparing the Data
During my usual first step, which involves data cleanup in OpenRefine, I encountered my first limitation with the data—aggregation. Using Refine to cluster terms such as “Kensington” and “Kensington Publishing” was a no-brainer. But I found myself at an impasse when deciding how or whether to combine “Harper,” “HarperCollins,” and “Harper Perennial.” Even as a layperson in terms of the publishing industry, I know that these entities are distinct, albeit related. To combine all of the trade names, or imprints, of these publishing companies might skew my data in an unfavorable way, so I used a light touch.
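To make the “light touch” concrete, here is a minimal Python sketch of that kind of conservative clustering. It is not OpenRefine’s actual algorithm; the suffix list and the function names are my own assumptions for illustration. It merges names that differ only by a trailing corporate suffix and leaves distinct imprints alone.

```python
import re
from collections import defaultdict

# Hypothetical list of trailing corporate suffixes to ignore when comparing names.
SUFFIXES = {"publishing", "publishers", "books", "press", "inc", "llc"}

def publisher_key(name):
    """Lowercase, replace punctuation with spaces, and drop trailing suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def cluster_names(names):
    """Group raw names that share the same normalized key."""
    clusters = defaultdict(set)
    for name in names:
        clusters[publisher_key(name)].add(name)
    return dict(clusters)

raw = ["Kensington", "Kensington Publishing",
       "Harper", "HarperCollins", "Harper Perennial"]
clusters = cluster_names(raw)
```

With these inputs, the two Kensington spellings share one key, while “Harper,” “HarperCollins,” and “Harper Perennial” each keep their own.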
I ran into a similar limitation when cleaning up author data. Many books on this list were written by more than one author, but as a network novice, I could not think of a way to disaggregate them without compromising the structure of my data.
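For what it’s worth, one possible way to disaggregate co-authors is to emit one row per individual author. The sketch below assumes the author field joins names with “and,” “with,” or “&” (an assumption about the data, not something I verified), and the rows are hypothetical placeholders rather than real list entries.

```python
import re

def split_authors(author_field):
    """Split a combined author string on ' and ', ' with ', or '&'."""
    parts = re.split(r"\s+(?:and|with)\s+|\s*&\s*", author_field)
    return [p.strip() for p in parts if p.strip()]

def expand_rows(rows):
    """Yield one (publisher, author, title) tuple per individual author."""
    for publisher, authors, title in rows:
        for author in split_authors(authors):
            yield (publisher, author, title)

# Hypothetical placeholder rows, not taken from the dataset.
rows = [
    ("Publisher A", "Author One and Author Two", "Title X"),
    ("Publisher B", "Author Three", "Title Y"),
]
edges = list(expand_rows(rows))
```

The trade-off is the structural one I was worried about: a co-authored title now produces several edges, so a publisher’s out-degree would no longer equal its number of bestselling titles.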
To build the node table, I isolated unique publisher and author values in my clean dataset, assigned them ID numbers, and assigned categories to differentiate publishers from authors. Then, I used Excel’s INDEX and MATCH functions to create an edge table using those IDs for my sources and targets. I chose not to assign separate IDs to edges: many publisher-author pairs show up multiple times, but the book title is unique for each entry, so the title itself identifies each edge.
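The same node and edge tables could be built in a few lines of Python instead of Excel. This is a sketch under assumptions: the input is a list of (publisher, author, title) rows, the `Category` column name is my own, and the other column names follow the Id/Label and Source/Target/Type headers that Gephi’s spreadsheet import recognizes. The sample rows are hypothetical placeholders.

```python
def build_tables(rows):
    """Return (nodes, edges) built from (publisher, author, title) rows."""
    ids = {}     # (category, label) -> numeric node ID
    nodes = []   # node table rows: Id, Label, Category
    edges = []   # edge table rows: Source, Target, Type, Label

    def node_id(label, category):
        # Reuse the ID if this node was already seen; otherwise assign the next one.
        key = (category, label)
        if key not in ids:
            ids[key] = len(ids) + 1
            nodes.append({"Id": ids[key], "Label": label, "Category": category})
        return ids[key]

    for publisher, author, title in rows:
        edges.append({
            "Source": node_id(publisher, "publisher"),
            "Target": node_id(author, "author"),
            "Type": "Directed",
            "Label": title,  # the title identifies the edge
        })
    return nodes, edges

# Hypothetical placeholder rows, not taken from the dataset.
rows = [
    ("Publisher A", "Author One", "Title X"),
    ("Publisher A", "Author One", "Title Z"),
    ("Publisher B", "Author Two", "Title Y"),
]
nodes, edges = build_tables(rows)
```

Each list of dictionaries can then be written out with `csv.DictWriter` to produce the two files for import. Repeated publisher-author pairs keep the same source and target IDs but remain separate edges, one per title.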
Graphing and Analyzing
I settled on the Fruchterman-Reingold layout early in my study. In comparison to other network layouts like Yifan Hu, this one seemed to present node centrality in a more intuitive way. By this point I already knew that my final graph would rely largely on degree to demonstrate the influence and prevalence of certain publishers.
The first styles I used colored the nodes according to their category—publishers were green and authors, orange. I changed node and text size based on degree and observed that green (publisher) nodes became more noticeable when scaling based on out-degree, and likewise, orange (author) nodes came forward when scaling based on in-degree.
This matched what I understood about the network and my data. Out-degree corresponds to the number of bestsellers that a particular publisher has produced. In-degree corresponds to the number of times an author has landed a book on the bestseller list. I did not style using edge weight while creating this graph, because each edge represents a unique title on the bestseller list, so the edge weight for every connection is the same: 1.0.
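As a quick illustration of how in-degree and out-degree fall out of the edge table, here is a minimal Python sketch. It assumes edges are (source, target) pairs running from publisher to author, and the names are hypothetical placeholders.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) counters for (source, target) edges."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return in_degree, out_degree

# Hypothetical placeholder edges: each tuple stands for one bestselling title.
edges = [
    ("Publisher A", "Author One"),
    ("Publisher A", "Author One"),
    ("Publisher A", "Author Two"),
    ("Publisher B", "Author Three"),
]
in_degree, out_degree = degree_counts(edges)
```

Because every edge starts at a publisher and ends at an author, publishers only accumulate out-degree and authors only accumulate in-degree, which is why each scaling brings a different color forward.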
Instead, I ran a modularity analysis and used color to highlight distinct families among the most prolific publishers. The algorithm correctly identified relationships between publishing entities. For example, one will notice that publishers like Berkley, Putnam, and Penguin are clustered in the graph, and indeed they are associated in the real world. Berkley and Putnam are both imprints of Penguin Random House.
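To give a sense of what a modularity score measures, the Python sketch below computes it for a given grouping of nodes. It is not Gephi’s community-detection routine, which searches for a good grouping on its own; it only scores a grouping that is supplied, and it treats edges as undirected for simplicity. The node names and community labels are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community):
    """Score a partition: fraction of edges inside each community
    minus the fraction expected if edges were placed at random."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Hypothetical placeholder network: two publisher-author pairs with no overlap.
edges = [("Publisher A", "Author One"), ("Publisher B", "Author Two")]
split = {"Publisher A": 0, "Author One": 0, "Publisher B": 1, "Author Two": 1}
lumped = {node: 0 for node in split}

q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
```

Grouping each pair separately scores higher than lumping all four nodes together, which is the intuition behind coloring by modularity class: nodes in the same class connect to each other more than chance would predict.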
With a few additional tweaks to styling and layout, I had my final graph. In the end, it confirmed what I suspected about the New York Times Bestseller network. The Gephi distribution graphs below show that this network exhibits the power-law property, in that most nodes have a low degree and a small proportion of nodes has a very high degree.
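A degree distribution like the one Gephi reports can be tallied directly from the edge list. The following is a minimal Python sketch with hypothetical placeholder edges; it counts how many nodes have each total degree and does not test whether the result actually fits a power law.

```python
from collections import Counter

def degree_distribution(edges):
    """Map each degree value to the number of nodes that have it."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return Counter(degree.values())

# Hypothetical placeholder edges: one busy publisher and one small one.
edges = [
    ("Publisher A", "Author One"),
    ("Publisher A", "Author Two"),
    ("Publisher A", "Author Three"),
    ("Publisher B", "Author Four"),
]
distribution = degree_distribution(edges)
```

Even in this toy example the shape is lopsided: five nodes have a degree of 1 and a single node has a degree of 3.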
The nodes with the lowest out-degree inhabit the outer ring of the graph. Many of the edges attached to these nodes represent books that were self-published by the author.
Should I decide to take this project further, there is a clear path for me to create a hypergraph based on the publisher clusters that I identified. However, I also think it would be interesting to run a comparative analysis between this and other bestseller lists. Would those networks have similar structures? How might I use these tools to examine bias in curation?
Given more time, I would also love to bring this data into Tableau, a data visualization tool I explored earlier this semester. Gephi lacks robust options for styling and interactivity, but Tableau picks up where Gephi leaves off. I would use Tableau’s styling options to differentiate my publisher and author nodes while leaving the modularity class color scheme intact. Edges would show directionality more clearly, and I would enable tooltips that display details like book title and date of publication on hover.