Stack Overflow is an online community where users can learn to code and share their knowledge through the site's question-and-answer format. In addition to helping users troubleshoot problems, Stack Overflow serves as a career resource where users can find job opportunities. One unique career feature of the site is its “Developer Stories,” which aims to function as a much more detailed resume that highlights projects. Users can also use tags to highlight the technologies they work with.
The Stack Overflow Tag Network dataset available on Kaggle provides an opportunity to gain insight into these technology tags and visualize how often certain technologies are connected or grouped together.
Materials & Process
My process entailed the following steps:
- Download the following Stack Overflow Tag Network files from Kaggle:
  - Stack_network_links.csv (Edges file)
  - Stack_network_nodes.csv (Nodes file)
- Ensure that the Edges file contains both a Source and a Target column, deleting any other miscellaneous columns (Brown University).
- Edit the Edges file so that the Source and Target values match the ID numbers listed in the Nodes file. For example, change every “hibernate” label in the Edges file to the number 3, as 3 is the ID number for “hibernate” in the Nodes file (Brown University).
- Open edited files in Gephi to conduct network analysis and visualizations using Gephi’s Quick Start Tutorial as a guide.
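The relabeling step above can also be scripted rather than done by hand. The following is a minimal sketch, not the process I actually followed: it uses tiny in-memory stand-ins for the two files, and the column names (Id, Label, Source, Target) and the IDs for "java" and "spring" are assumptions to be adjusted to the real CSV headers and values.

```python
import csv
import io

# Hypothetical miniature versions of the Nodes and Edges files.
# Column names and most IDs are assumptions for illustration only.
NODES_CSV = """Id,Label
1,java
2,spring
3,hibernate
"""

EDGES_CSV = """Source,Target
hibernate,java
spring,hibernate
java,spring
"""


def relabel_edges(nodes_file, edges_file):
    """Return edge rows with tag names replaced by their node IDs."""
    label_to_id = {row["Label"]: row["Id"] for row in csv.DictReader(nodes_file)}
    relabeled = []
    for row in csv.DictReader(edges_file):
        # A KeyError here means an edge mentions a tag missing from the Nodes file.
        relabeled.append({
            "Source": label_to_id[row["Source"]],
            "Target": label_to_id[row["Target"]],
        })
    return relabeled


edges = relabel_edges(io.StringIO(NODES_CSV), io.StringIO(EDGES_CSV))
for edge in edges:
    print(edge["Source"], edge["Target"])
```

With real files, the two `io.StringIO` objects would be replaced by opened file handles, and the result could be written back out with `csv.DictWriter`.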
Results & Analysis
My goal in conducting this network analysis was to extract insights into how various technology tags were associated with one another in the Stack Overflow community. By visualizing these associations, one can also get a sense of which technology tags are the most commonly used in addition to some of the most popular clusters.
With this goal in mind, I selected the following properties for my graph:
- “Force Atlas” for the layout to visualize which nodes are attracted to one another.
- “Betweenness Centrality” for size ranking, so that nodes lying on more of the shortest paths between other nodes appear larger.
- “Modularity Class” to detect communities and partition the nodes into them.
- “Degree Range Filter” to remove low-degree nodes that are not connected to the larger communities.
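To make two of these properties more concrete, here is a rough pure-Python sketch of what a degree filter and a betweenness centrality ranking compute. It is not Gephi's implementation: the toy edge list is invented, the graph is treated as undirected and unweighted, and the scores are left unnormalized. Modularity and the Force Atlas layout are not covered.

```python
from collections import deque


def degree_filter(edges, min_degree):
    """Keep edges whose endpoints both have degree >= min_degree.

    Single pass over the original degrees; nodes left without any
    edges simply drop out of the edge list.
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return [(a, b) for a, b in edges
            if degree[a] >= min_degree and degree[b] >= min_degree]


def betweenness(edges):
    """Unnormalized betweenness centrality (Brandes' algorithm)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


# Invented toy network of tags, for illustration only.
edges = [
    ("html", "css"),
    ("css", "javascript"),
    ("javascript", "jquery"),
    ("javascript", "angularjs"),
    ("excel", "vba"),
]

scores = betweenness(edges)
for tag, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(tag, value)

print(degree_filter(edges, 2))
```

In this toy network "javascript" receives the highest score because every path between the "html"/"css" side and "jquery" or "angularjs" passes through it, which is the kind of bridging role that size ranking is meant to surface.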
After applying these properties and adjusting the colors and graph styles, my finalized graph presents the Stack Overflow tags as follows:
With the visualization created, it becomes much easier to see which tags are grouped with one another and how each group connects to the others. Because each node is sized according to its value, it is also clear which technologies are used the most.
As these tags were pulled from Stack Overflow’s Developer Stories, the tags and their groupings can vary widely from user to user. While most users develop skill sets that cater to certain jobs or fields, some users develop skills across multiple fields within technology. This variation could account for the seemingly unrelated nodes in certain clusters, as well as the different naming conventions for similar tags (for example: Amazon web services vs. web services).
While it took some time to edit my CSV files and get used to the program, Gephi served as an excellent tool for conducting basic network visualizations and manipulations. After receiving some feedback from Professor Sula, my next steps would be to find a fitting name for each cluster in my graph and to merge similar technology tags to avoid duplicates. Naming each group can help highlight the types of roles each tag contributes to, such as database administration or frontend development. These names, in turn, can also shed some light on the roles users have either acquired or are seeking to acquire.
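As a starting point for the tag-merging step, the sketch below shows one possible approach, assuming a hand-written alias table and a weight on each edge. The tag names, the weights, and the choice to sum the weights of merged edges are all assumptions for illustration; whether two tags truly mean the same thing remains a judgment call.

```python
# Hypothetical alias table: each key is rewritten to its canonical tag.
ALIASES = {"amazon-web-services": "web-services"}


def merge_tags(edges, aliases):
    """Rewrite aliased tags and combine the edges that become duplicates."""
    merged = {}
    for source, target, weight in edges:
        a = aliases.get(source, source)
        b = aliases.get(target, target)
        if a == b:
            continue  # merging turned this edge into a self-loop, so drop it
        key = tuple(sorted((a, b)))  # treat edges as undirected
        # Assumption: weights of duplicate edges are summed.
        merged[key] = merged.get(key, 0) + weight
    return merged


# Invented example edges with made-up weights.
edges = [
    ("amazon-web-services", "cloud", 2.0),
    ("web-services", "cloud", 1.0),
    ("amazon-web-services", "web-services", 4.0),
]

merged = merge_tags(edges, ALIASES)
for (a, b), weight in merged.items():
    print(a, b, weight)
```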
Gephi: Quick Start Tutorial
Brown University: Gephi Network Analysis Tutorial
Stack Overflow: About