Amazon Product Co-Purchasing Network: An Information Visualization




For this graphic report, Gephi and Excel (Numbers) were used to prepare the data, apply network analysis, and create a visualization based on the “Amazon Product Co-Purchasing Network”.  The course provided the guidelines students need to use the open resources available in the age of information visualization.

When it comes to information visualization, working out how to get started can be overwhelming, and it is easy to end up in a bubble of isolation while exploring methods to enhance a network, which is composed of nodes (points or items) and edges (links or connections).  Data can be presented in many forms, from typical charts to graphs and maps, in order to make sense of it.  The picture becomes clearer once the network is formed and shows structural features such as groups of nodes that represent clusters.

Furthermore, asking questions about the graph explains why there is a need to create one in the first place.  What is the graph trying to convey?  How does the Amazon Product Co-Purchasing Network work?  After reviewing several datasets, I realized that understanding the mechanics of a graph had to be a priority.  The graph depicts a network of lines (edges), which are an essential component of any graph.  The chosen dataset was treated as directed for this graph, which means that the edges have a direction.

According to the dataset description, the Amazon graph is built from items purchased directly from the online store rather than from department stores.  Understanding online customer behavior shapes how the data is interpreted.  “It is based on customers who bought this item also bought feature of the Amazon website.  If a product i is frequently co-purchased with product j, the graph contains an undirected edge from i to j.  Each product category provided by Amazon defines each ground-truth community.  We have removed the ground-truth communities which have less than 3 nodes.  As for the network, we provide the largest connected component.”  Please refer to the source citation: J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012.  The graph shows what customers are focusing on through product strengths and other insights.


The following graph (Figure 1) is a well-crafted design made using TVNViewer, an interactive visualization tool for exploring how network designs change over time and space.  The edges are shown in color, and labels reveal what each one represents.  Hovering over an edge highlights its nodes and reveals the source.  In terms of format, in-edges are red, out-edges are green, and bi-directional edges are cyan.  It is also worth mentioning that the user can manipulate the graph, for example by changing the color of a node.  Another fascinating function allows the user to view multi-faceted nodes by using the “Selection Depth to show” slider.

Figure 1

Among the many graphs available, this animated floating graph (Figure 2) caught my attention.  As the author explains, the graph is entitled “Art meets Computer Science”.  It consists of 70 nodes, 20 extra edges, and a balanced network style.  The color scheme, light blue nodes on a blue background, clearly defines the relationship between the nodes and edges.  Although this image is static, in the animated version the nodes drift around within the rectangular canvas at random speeds.  The edges are what make the graph a network, and the result resembles a mesh, as covered during the class lecture.

Figure 2


The structure of a graph can be either directed or undirected.  I felt compelled to include a graph (Figure 3) that shows the difference between the two.  Can you guess which of the three selected graphs are directed and which are undirected?

Figure 3
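
To make the directed/undirected distinction concrete, here is a minimal sketch in Python.  The function name build_adjacency and the two sample edges are made up for illustration and do not come from the Amazon dataset.  In the directed version an edge from A to B only lets A reach B; in the undirected version the same edge is stored in both directions.

```python
def build_adjacency(edges, directed):
    """Map each node to the set of nodes it links to."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        # Make sure the target also appears as a node, even with no out-links.
        adjacency.setdefault(target, set())
        if not directed:
            # An undirected edge can be followed both ways.
            adjacency[target].add(source)
    return adjacency


# Hypothetical sample edges, for illustration only.
edges = [("A", "B"), ("B", "C")]
directed_graph = build_adjacency(edges, directed=True)
undirected_graph = build_adjacency(edges, directed=False)

print(directed_graph)
print(undirected_graph)
```

In the directed result, node C has no outgoing links, while in the undirected result C links back to B.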


Dataset, Software, and Materials

Creating the visualization in lab involved carefully selecting a dataset that I felt comfortable working with.  The SNAP network datasets offer multiple categories to choose from.  I chose the network dataset on the Amazon Product Co-Purchasing Network (com-amazon.ungraph.txt.gz).  The dataset is broken down into several files, such as Amazon Communities and Amazon Communities (top 5,000).

For this project, I used Gephi 0.9.1, a free, open-source software platform for layout, metrics, network analysis, and real-time visualization.  The first step is to download the dataset and open it in Excel (Numbers).  Make sure the dataset is clean before importing it into Gephi; after the import, Gephi will report any additional errors or issues it finds.  Rename the existing columns so that the data is organized into three columns: ID, Source, and Target.  Then delete any rows that do not contain data.  Now you are ready to bring the data into Gephi; make sure the file is saved in CSV format.  If the data is intact, proceed through validation to view the graph.  Once the graph was loaded, the first adjustment I made was to regulate the “Edge Thickness” by using its slider.  As for the layout, the layout algorithm is what forms the shape of the graph.  Locate the Layout module on the left-hand side of the panel, choose “Force Atlas”, and run the algorithm.
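
The cleaning step can also be done with a short script instead of a spreadsheet.  The following is only a sketch under stated assumptions: it assumes the downloaded file is a plain text edge list with one whitespace-separated pair of node IDs per line and comment lines starting with "#", and the function name edge_list_to_gephi_csv and the sample rows are hypothetical.  It writes the Source and Target columns described above and skips blank or malformed rows.

```python
import csv
import io


def edge_list_to_gephi_csv(lines, out_file):
    """Write a Source,Target CSV from edge-list lines; return rows written."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    written = 0
    for line in lines:
        line = line.strip()
        # Skip empty rows and comment lines (assumed to start with "#").
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Skip rows that do not hold exactly two node IDs.
        if len(parts) != 2:
            continue
        writer.writerow(parts)
        written += 1
    return written


# Made-up sample rows, not taken from the real dataset.
sample = ["# FromNodeId\tToNodeId", "1\t2", "", "1\t3", "bad row here"]
buffer = io.StringIO()
count = edge_list_to_gephi_csv(sample, buffer)
print(count)
print(buffer.getvalue())
```

For a real file, the same function could be given an open text file handle instead of the sample list, and a file opened for writing instead of the in-memory buffer.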


The basic network structure of my design is a full circle made up of nodes and edges.  It emphasizes a network in which communities are “densely linked”, with the edges concentrated among the members of each community.  The degree report, which includes In-Degree, Out-Degree, and Degree, gives a combined average of 1.968.  There are a total of 66,562 nodes and 65,499 edges.  The modularity is 0.996, with 6,144 communities.  The graph was marked as directed in Gephi.  However, the dataset information retrieved from its source page describes it as an undirected graph.  An undirected graph has no direction on its edges, as illustrated in Figure 3.  My assessment is that this graph is directed; in addition, the combined “Degree” shown in Gephi also suggests that it is directed.

Figure 4
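
As a quick check of the degree figure reported above, the average can be recomputed from the node and edge counts.  This is a small arithmetic sketch, not Gephi's own calculation; the variable names are chosen for illustration.  Every edge adds one to the degree of each of its two endpoints, so the combined average degree is 2 x edges / nodes.  If the edges are read as directed, the same total splits into in-degree and out-degree, each averaging edges / nodes.

```python
nodes = 66_562
edges = 65_499

# Each edge touches two endpoints, so the combined degree total is 2 * edges.
avg_combined_degree = 2 * edges / nodes

# In a directed reading, every edge adds one out-degree and one in-degree.
avg_in_or_out_degree = edges / nodes

print(round(avg_combined_degree, 3))   # 1.968
print(round(avg_in_or_out_degree, 3))  # 0.984
```

The first value matches the combined average of 1.968 in the degree report.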


Future Direction

Including labels in the dataset is one possible way to reveal the network in its entirety; however, this dataset did not have any labels.  Exploring labels further to bring out the relationships between the nodes is a possibility.  For example, a value of one would show only the direct relationships of a selected node, whereas a value of two would show the nodes connected to the selected node in addition to their neighbors.  Creating a hypergraph is also a foreseeable endeavor for this dataset, although I did not succeed in creating one because the process was quite daunting.  Expanding your reach to unexplored datasets and graphs will unleash your creativity.
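
The value-one and value-two idea can be sketched as a search that stops after a fixed number of hops.  This assumes the values refer to a selection depth, as with the slider described for Figure 1; the function name neighborhood and the toy graph are hypothetical and are not part of the Amazon dataset.

```python
from collections import deque


def neighborhood(adjacency, start, depth):
    """Return the nodes within `depth` hops of `start`, excluding `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        # Do not expand nodes that already sit at the depth limit.
        if distance == depth:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    seen.discard(start)
    return seen


# Toy undirected graph, made up for illustration: A-B, A-C, B-D, D-E.
toy = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
}
depth_one = neighborhood(toy, "A", 1)
depth_two = neighborhood(toy, "A", 2)
print(sorted(depth_one))  # ['B', 'C']
print(sorted(depth_two))  # ['B', 'C', 'D']
```

With a depth of one only the direct neighbors of A are selected; with a depth of two the neighbors of those neighbors are added as well.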

Utilizing other platforms for large-scale graphs is an option.  Using large-scale graphs, such as the ones provided, to solve problems is the wave of the future.  While researching the topic, I came across a very interesting article that asks how well graph-processing platforms perform.  Exploring other platforms will provide greater variety and help avoid obstacles or limitations when designing a graph to your specifications.  Visualizations go a long way toward presenting data in charts, graphs, and maps in an understandable and actionable way.