Most of my purchasing decisions for products and services are based on user reviews, but trustworthy, unbiased reviews can be hard to find. As a result, I began to think about an individual’s ability to dictate the worth of a product or service and their power of influence over a community of buyers. When I came across a dataset compiled by M. Richardson, R. Agrawal, and P. Domingos and available on the Stanford Network Analysis Project (SNAP), which covers the relationships between users and influencers on an online platform for product and service reviews, I was eager to dive in. After looking further into the dataset, I discovered some surprising gaps and trends. The driving questions behind this network structure inquiry were: How significant are the network ties between influencers and buyers? Do communities of trust involve a one-influencer-to-many-buyers or a many-influencers-to-many-buyers relationship?
Process and Tooling
SNAP is a helpful resource from Stanford University that provides network datasets and other great materials. With a couple of logical clicks, I was able to quickly navigate to the data sources available.
I was able to identify a dataset that met my needs. Ironically, I found my dataset on trust networks in the social networks section rather than under “Networks with ground-truth communities”.
Before I opened Gephi, the dataset’s page offered helpful summary statistics that gave a good sense of the size of the single-mode network.
Once I had identified a workable dataset, I loaded the .txt file into OpenRefine to convert it into a .csv format for Gephi. As the dataset only had three columns, it was fairly easy to modify to fit the appropriate template.
First, I imported the data and excluded the first four rows as they were not pertinent to the analysis work.
Unfortunately, the import merged the values of the two columns into one, with additional white space between them. After clicking “Create Project”, I corrected this by clicking the down-arrow next to “Column 1”, editing the cells, and collapsing consecutive white space. This made it easier to separate the values into two columns because I didn’t have to guess how many spaces were between the values.
I was then able to split the values into two columns using a single space (“ ”) as the separator.
Once the values were correctly separated, I renamed each column using its down-arrow to align with the template’s naming conventions.
From there, I exported the file as a .csv to import into Gephi.
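For readers who prefer a script to the OpenRefine clicks, the same cleanup (skip the leading rows, collapse the white space, split into two columns, rename the headers) can be sketched in a few lines of Python. This is only an illustration, not the workflow I used: the function name is hypothetical, the sample rows are made up, and the `Source`/`Target` headers assume that is what the Gephi edge template expects.

```python
import csv
import io

def snap_txt_to_gephi_csv(txt_lines, out_file, skip_rows=4):
    """Write whitespace-separated edge rows as a Source,Target CSV.

    Returns the number of edge rows written.
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Source", "Target"])  # assumed template column names
    written = 0
    for index, line in enumerate(txt_lines):
        if index < skip_rows:
            continue  # drop the leading rows that are not edge data
        parts = line.split()  # no argument: any run of spaces/tabs is one separator
        if len(parts) != 2:
            continue  # ignore blank or malformed rows
        writer.writerow(parts)
        written += 1
    return written

# Made-up sample: four non-data rows, then edges with uneven white space.
sample = [
    "# comment row 1",
    "# comment row 2",
    "# comment row 3",
    "# comment row 4",
    "0\t   4",
    "0  5",
    "4\t5",
]
buffer = io.StringIO()
edge_count = snap_txt_to_gephi_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, the same function could be given an open file handle for `txt_lines` and another for `out_file`.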
Once in Gephi, per the guidance, I went to the Data Laboratory to import my .csv. The tool provided a helpful preview of my data. I kept the settings as-is because they fit the dataset’s needs and clicked “Next”.
From there I also kept the default settings for Time representation and completed the import by clicking “Finish”.
Once the data is imported, Gephi generates a report to flag any issues. Luckily, there were none, and the number of nodes and edges matched the dataset’s published statistics, so I could conclude that no data was lost. However, before exiting the report, I made sure to select the Undirected graph type, since the ties between nodes could be symmetrical (trust within the community is not strictly one-way; members can trust each other).
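The sanity check on the import report can also be done independently of Gephi. The sketch below counts distinct nodes and edge rows in an edge list and compares them with expected totals. The function names and toy numbers are hypothetical; in practice the expected values would come from the dataset’s published statistics.

```python
def count_nodes_and_edges(edges):
    """Return (number of distinct nodes, number of edge rows)."""
    nodes = set()
    for source, target in edges:
        nodes.add(source)
        nodes.add(target)
    return len(nodes), len(edges)

def matches_published_stats(edges, expected_nodes, expected_edges):
    """True when the counts agree with the figures listed for the dataset."""
    return count_nodes_and_edges(edges) == (expected_nodes, expected_edges)

# Toy edge list standing in for the rows of the exported .csv.
toy_edges = [("0", "4"), ("0", "5"), ("4", "5"), ("5", "7")]
node_count, edge_count = count_nodes_and_edges(toy_edges)
print(node_count, edge_count)
print(matches_published_stats(toy_edges, 4, 4))
```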
Before I looked at the data visualization, I ran a statistical analysis to better assess the dataset’s behavior. I first looked at the average degree to get an idea of how strongly connected the nodes in the dataset are.
Based on this distribution, only a few nodes host the highest in-degree.
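To make the two measures concrete, here is a minimal sketch of how an average degree and a degree distribution can be computed from an edge list. It assumes the undirected reading of the graph chosen at import (each edge adds one to the degree of both endpoints) and uses a made-up toy network; it is not Gephi’s own implementation.

```python
from collections import Counter

def degree_by_node(edges):
    """Undirected degree: every edge counts once for each endpoint."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

def average_degree(edges):
    degree = degree_by_node(edges)
    return sum(degree.values()) / len(degree)

def degree_distribution(edges):
    """Map each degree value to the number of nodes that have it."""
    return Counter(degree_by_node(edges).values())

# Toy network: one well-connected "hub" and a few lightly connected nodes.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
avg = average_degree(toy_edges)
dist = degree_distribution(toy_edges)
print(avg)         # 2.0
print(dict(dist))  # one node with degree 4, two with degree 2, two with degree 1
```

Even in this tiny example the pattern described above is visible: a single node holds the highest degree while most nodes sit at or below the average.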
I kept the default settings before running the assessment. The modularity results support the degree distribution in that there are only a few significant clusters, whereas the bulk are very small communities.
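As a rough illustration of what a modularity score measures, the sketch below scores a given grouping of nodes using the standard definition: for each community, the share of edges that fall inside it minus the share expected from its total degree. This only evaluates a partition that is handed to it. Gephi additionally searches for a good partition, which this sketch does not attempt. It assumes an unweighted, undirected graph with at least one edge and no self-loops, and the toy network is made up.

```python
from collections import Counter

def modularity(edges, community_of):
    """Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2."""
    m = len(edges)
    internal = Counter()    # edges with both endpoints in the same community
    degree_sum = Counter()  # total degree of the nodes in each community
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] += 1
        degree_sum[cb] += 1
        if ca == cb:
            internal[ca] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Two triangles joined by a single bridge edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
one_group = {node: "all" for node in two_groups}

q_split = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
print(round(q_split, 3), q_single)
```

Splitting the two triangles scores higher than lumping everything together, which is the sense in which a higher modularity indicates more clearly separated communities.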
To reduce the visual clutter, I applied a degree range filter with the minimum value set to roughly the average degree: 8.
Then I changed the appearance settings so that node size reflects degree.
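The idea behind the degree filter can be sketched as follows: keep the nodes whose degree meets the minimum, then keep only the edges whose endpoints both survive. This is a single-pass approximation written for illustration; the function name and toy data are hypothetical, and Gephi’s own filter may differ in detail.

```python
from collections import Counter

def filter_by_min_degree(edges, min_degree):
    """Return (kept_nodes, kept_edges) after one pass of degree filtering."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Degrees are measured on the full graph, before anything is removed.
    kept_nodes = {node for node, d in degree.items() if d >= min_degree}
    kept_edges = [(a, b) for a, b in edges if a in kept_nodes and b in kept_nodes]
    return kept_nodes, kept_edges

toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
# The write-up uses a minimum of 8; a minimum of 2 suits this tiny example.
kept_nodes, kept_edges = filter_by_min_degree(toy_edges, min_degree=2)
print(sorted(kept_nodes))
print(kept_edges)
```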
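Sizing nodes by degree amounts to mapping each degree onto a size range. A minimal sketch of one common way to do that, linear interpolation between a smallest and largest size, is shown below. The size limits and function name are assumptions chosen for illustration and are not the values used in the visual.

```python
def size_by_degree(degree, min_size=10.0, max_size=50.0):
    """Linearly scale each node's degree into [min_size, max_size]."""
    low, high = min(degree.values()), max(degree.values())
    if high == low:
        # All nodes share one degree, so give them all the same size.
        return {node: min_size for node in degree}
    span = high - low
    return {
        node: min_size + (d - low) / span * (max_size - min_size)
        for node, d in degree.items()
    }

toy_degree = {"a": 1, "b": 3, "c": 5}
sizes = size_by_degree(toy_degree)
print(sizes)
```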
From there, per the guidance in T. Venturini, M. Jacomy, and P. Jensen’s written work, I applied the ForceAtlas2 force-directed layout algorithm because it works best for analyzing large networks and improving the legibility of the results.
I added a color palette to more easily visualize the most significant nodes.
I chose not to include the edges in the data visual because they didn’t add anything to the assessment of the data and made it look cluttered.
I then took a look at the Network Diameter. Unfortunately, when I tried to replicate this step to capture the graph, Gephi kept crashing. Still, the result points to a very wide network, with a diameter of 11. This makes sense given the Graph Density, which came back as 0 (a rounded value, since any graph with edges has a density above 0), meaning that most of the nodes are not heavily connected. To see if there were any significant clusters, I ran a Modularity assessment to measure the connectedness of the different components of the network.
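To show why a large, sparse network reports a density of roughly 0, and what the diameter measures, here is a small sketch using the usual undirected definitions: density is the number of edges divided by the number of possible node pairs, and diameter is the longest shortest path between any two connected nodes. The toy graphs are made up, and the breadth-first search from every node shown here becomes expensive on large graphs.

```python
from collections import deque

def density(node_count, edge_count):
    """Undirected density: edges present / possible pairs of nodes."""
    if node_count < 2:
        return 0.0
    return 2 * edge_count / (node_count * (node_count - 1))

def diameter(edges):
    """Longest shortest path, in hops, found by BFS from every node."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    longest = 0
    for start in neighbours:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(distance.values()))
    return longest

# A five-node path: 1 - 2 - 3 - 4 - 5.
path = [(1, 2), (2, 3), (3, 4), (4, 5)]
path_density = density(5, len(path))
path_diameter = diameter(path)

# A hypothetical sparse graph: 10,000 nodes and only 10,000 edges.
sparse_density = density(10_000, 10_000)
print(path_density, path_diameter, round(sparse_density, 3))
```

The sparse case rounds to 0.0 at three decimal places even though the graph has thousands of edges, which is consistent with a density that displays as 0.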
The resulting network visual does a good job of showing the collections of nodes along with the distribution of the largest nodes, which could represent the users/reviewers who hold the most influence over the communities.
It’s interesting to see the node positioning and how the colors aren’t completely polarized in one location or another but are distributed throughout to some degree (no pun intended).
I did some additional due diligence to see how this could correlate with user behavior on the website and found that, since the dataset was collected in 2003, the website has changed in name and likely in function, as you can see below. Epinions.com has now become Shopping.com. One could deduce that the key influencers indicated by the nodes in the visual above represent reviewers for one or more of the categories below.
Another potential cause of the segmentation of the nodes could be the countries where the site is offered. However, it still raises the question of why there’s one large node in a color that is not representative of the main segmentations, yet is not insignificant either, as it’s not colored gray. One possibility is that this node represents the website administrator, who spans all users and is thereby connected to all reviewers.
This visualization offers a helpful view of the community’s user behavior. If I had more time, I would assess the sizing by degree further and compare it with the modularity to see which method shows the most mutual links. I would also become a member of the website to better understand the kinds of user profiles and how users interact with one another, to assess whether they are trustworthy and how they organize themselves (i.e., by country, category, etc.). These pieces of information could provide further context for the dataset and support more accurate interpretations of the data.
- Richardson, M., Agrawal, R., & Domingos, P. (2003). Trust Management for the Semantic Web. ISWC.
- Venturini, T., Jacomy, M., & Jensen, P. (n.d.). What do we see when we look at networks—An introduction to visual network analysis and force-directed layouts. Retrieved July 9, 2021, from https://pratt.instructure.com/courses/9250/files/332017/download?download_frd=1