#ScreamIntoTheVoid: How Far Does Your Twitter Network Reach?

Introduction

Digital technology (DT) has revolutionized communications. Coupled with the boundless potential of the internet, DT promises to continue redefining the standards for engagement in the public sphere. In particular, social networking sites like Twitter, Reddit, Facebook, Instagram, and Tumblr offer users an unprecedented level of access to the world around them. Twitter, being the site I use most frequently, is the data source for this lab.

The ability to transcend distance (Kadushin, 2012, p. 18) and easily locate users with similar interests (Kadushin, 2012, pp. 19-20) is key to the formation of online communities on Twitter. The nature and functionality of these groupings, which are of increasing interest to social scientists and communications scholars, are ripe for analysis once the necessary data has been collected (Kadushin, 2012, p. 4). Fortunately, software exists that can mine, manipulate, and analyze the data, while network visualizations provide useful graphics that help analysts draw key conclusions (Krempel, 2011, p. 560).

Discussion

True to my individualist Western socialization, I was interested to see a visualization of my own Twitter network. The wealth of data available from social networking sites is unfathomable, so to start, I needed a way to comb through the site more finely. In his blog post “Just Landed: Processing, Twitter, MetaCarta & Hidden Data,” artist and educator Jer Thorp shares a thought that greatly informed my data collection method. In “thinking about the data that is hidden in various social network information streams,” Thorp “wondered if it would be possible to extract… information from people’s public Twitter streams by searching for [a specific] term” (Thorp, 2009). Thorp rightly postulates that, perhaps unbeknownst to themselves, users of social networking sites share pertinent information in the text of their posts, information that remains “hidden” during more precise searches. For example, if one wanted to plot the distance certain Twitter users travel when they fly, querying the phrase “Just landed in” and collecting those tweets’ geospatial data could produce more records more easily than a more specific approach, such as searching for tweets whose location is within a certain distance of a known airport (Thorp, 2009).

(Thorp, 2009)

Following Thorp’s example, I thought about the phrases that my followers, the users I follow, and I use most frequently in our tweets. I decided to use the slang term “sis,” whose meaning in African American Vernacular English (AAVE) encompasses and extends beyond its commonly known usage as an abbreviation of “sister.”
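The phrase-based filtering described above can be sketched in a few lines of Python. This is only an illustration and not the query NodeXL runs: the function name `tweets_matching` and the sample tweets are hypothetical, and matching the term as a whole word (so that “thesis” does not count as “sis”) is an assumption, since the exact matching rules of the search are not stated here.

```python
import re

def tweets_matching(tweets, phrase):
    """Return the tweets whose text contains `phrase` as a whole word or phrase, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return [text for text in tweets if pattern.search(text)]

# Hypothetical sample tweets, invented for illustration only.
sample_tweets = [
    "Just landed in Chicago, so tired",
    "Sis, you will not believe this",
    "This thesis is never going to end",
    "okay sis we see you",
]

sis_tweets = tweets_matching(sample_tweets, "sis")
landed_tweets = tweets_matching(sample_tweets, "just landed in")
```

With this sample, the filter keeps the two tweets that use “sis” as its own word and skips the one that merely contains those letters inside “thesis.”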

When visualizing Twitter conversations, the common layout is a series of concentric circles. The outer circles are often more populated, but the nodes have fewer edges between them. Moving towards the center, the nodes increase their average number of edges but the area between them is usually greater. This cluster pattern reveals the largest or most extensive interactions happening around a certain subject or query over a given period of time; it also makes plain the users who are most popular, most active, or most commonly referenced within a given scope. The following visualizations, though representing a diverse array of subjects, all follow the same general spatial arrangement of nodes and edges.

A visualization of tweets containing the hashtag #Hokies (Kushin, 2015)

A visualization of retweets reflecting anti-Sunni sentiments (Siegel, 2015)

Materials

For this lab I relied on a computer with an internet connection and the following programs: Microsoft Excel 2013, NodeXL, Twitter, OpenRefine, and Gephi 0.9.1. To recreate this lab you must have your own Twitter account.

Methods

The Social Media Research Foundation created a tool called NodeXL to collect, analyze, and visualize data from Twitter and Facebook. I used NodeXL as a data collection tool only.

NodeXL operates as a plugin to Microsoft Excel. Once Excel was running, I selected the NodeXL tab in the control bar to view the program options. The main feature available in the free version of NodeXL is importing data from a personal, public Twitter account. I followed the program prompts and logged in to my Twitter. Next, I entered my query parameters. I wanted all tweets in my network that contained “sis” and entered that string in the given field, but NodeXL allows searches for specific users and hashtags as well.

The results open automatically in multiple sheets of a blank workbook, along with NodeXL’s own visualization and metric calculations displayed in a sidebar. I intended to visualize my data with Gephi, so I saved the sheet of edges as a .csv file and imported it into OpenRefine. The data cleaning process involved deleting columns containing extraneous data; NodeXL collects the tweet URL, geospatial data, the full text of the tweet, a separate column with any hyperlinks found in the text, and several other fields. To be compatible with Gephi, the table needed at least three fields in this order: Source, the sender of the tweet; Target, the user(s) mentioned in the tweet; and Type, a field that enables Gephi to calculate network metrics, limited to Directed or Undirected. I chose to keep the NodeXL field titled “Relationship.” The values in this field are “Mentions,” “Replies To,” and “Tweet.” These terms tell what kind of interaction the users in that tweet had. I exported the clean data as a .csv and imported the file into Gephi.
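For readers who would rather script the cleaning step than use OpenRefine, the following is a minimal Python sketch of the same reshaping. It is not the procedure I followed. The input column names “Vertex 1” and “Vertex 2,” the sample rows, the helper name `to_gephi_edges`, and the default of “Directed” for the Type field are all assumptions made for illustration; adjust them to match the actual export.

```python
import csv
import io

def to_gephi_edges(raw_csv_text, source_col="Vertex 1", target_col="Vertex 2",
                   keep_cols=("Relationship",), edge_type="Directed"):
    """Keep only Source, Target, Type, and any extra columns named in keep_cols."""
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    out = io.StringIO()
    fieldnames = ["Source", "Target", "Type", *keep_cols]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        cleaned = {
            "Source": row[source_col],   # sender of the tweet
            "Target": row[target_col],   # user mentioned or replied to
            "Type": edge_type,           # assumed value; Directed or Undirected
        }
        for col in keep_cols:
            cleaned[col] = row[col]
        writer.writerow(cleaned)
    return out.getvalue()

# Hypothetical export with extra columns that the cleaning step drops.
raw = (
    "Vertex 1,Vertex 2,Relationship,Tweet,URLs in Tweet\n"
    "user_a,user_b,Mentions,hey sis,\n"
    "user_c,user_c,Tweet,sis really said that,\n"
)
cleaned_csv = to_gephi_edges(raw)
```

The result is a CSV string whose header row is Source, Target, Type, Relationship, which can be written to a file and handed to the Gephi import wizard.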

I followed the Gephi import wizard protocol for an edges file, then rearranged columns where necessary. In the Overview tab, I calculated Average Degree, Avg. Weighted Degree, Network Diameter, Graph Density, Modularity, Avg. Clustering Coefficient, and Clustering Coefficient. I ran a ForceAtlas 2 layout and then a Fruchterman Reingold layout. The nodes were in their final relative locations but were too spread out, so I increased gravity to draw them closer together. I color coded the edges by the “Relationship” type and tied node size to degree.

In Preview I added node labels, replacing the circles with the text of the corresponding user’s Twitter handle. I changed the background to black so that the color coded edges were easier to see and adjusted the relative sizing of the font to improve readability.
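To make two of these metrics concrete, here is a small Python sketch that computes node degree, average degree, and graph density from a list of directed edges. It uses textbook definitions (total degree as in-degree plus out-degree, and density as the number of distinct edges divided by n(n - 1)); Gephi’s reported values may follow slightly different conventions. The edge list and function names are hypothetical.

```python
from collections import Counter

def degree_counts(edges):
    """Total degree (in + out) for each node in a list of (source, target) pairs."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree

def directed_density(edges):
    """Share of possible directed edges present, ignoring self-loops and duplicates."""
    unique_edges = {(s, t) for s, t in edges if s != t}
    nodes = {node for edge in edges for node in edge}
    n = len(nodes)
    return len(unique_edges) / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical edge list: each pair is (sender, mentioned user).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")]

degree = degree_counts(edges)
average_degree = sum(degree.values()) / len(degree)
density = directed_density(edges)
```

In this toy network, node “a” takes part in three of the four edges, so it would be drawn largest when node size is tied to degree.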

I attempted to use the SigmaJS plugin for Gephi to export an interactive version of the visualization, but the vis did not appear in the folder containing the relevant files. This could be due to my version of Gephi being incompatible with the latest iteration of the plugin. Instead, I downloaded the vis as an image file.

An important note: Gephi crashes often, particularly with larger datasets, so I saved a copy of the project after every step.

Results

A visualization of all the tweets in my network containing the string “sis” between June 5, 2017 and June 12, 2017

“Encoding numerical information into visual layers” (Krempel, 2011, p. 562) aids viewers in processing information. In my visualization, the size of the node, represented by the Twitter user’s handle, corresponds to the number of tweets containing “sis” in which that handle appears: more mentions mean a larger font size. This encoded representation of the quantitative qualities of the data is easily decoded by viewers because there exists a directly proportional relationship between the “magnitude of a physical stimulus” and “its perceived intensity or strength” in human perception (Krempel, 2011, p. 563). Accordingly, the nodes with greater centrality are written in a larger font, in addition to being located in the middle of the vis.
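One simple way to encode degree as font size is a linear rescaling between a minimum and a maximum size. The Python sketch below shows that idea only; the function name `scale_sizes`, the size range of 8 to 40, and the sample degrees are assumptions, and the settings I actually used in Gephi are not reproduced here.

```python
def scale_sizes(degree, min_size=8.0, max_size=40.0):
    """Linearly map each node's degree onto a font size between min_size and max_size."""
    lowest, highest = min(degree.values()), max(degree.values())
    if highest == lowest:
        # Every node has the same degree, so every label gets the same size.
        return {node: min_size for node in degree}
    span = highest - lowest
    return {
        node: min_size + (value - lowest) / span * (max_size - min_size)
        for node, value in degree.items()
    }

# Hypothetical degrees for three handles.
sizes = scale_sizes({"user_a": 1, "user_b": 3, "user_c": 5})
```

Here the least connected handle gets the smallest label, the most connected gets the largest, and the handle in between falls halfway along the range.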

Lines represent the edges of the network, or the tweets themselves. The number of lines connecting to a particular node conveys that node’s degree, while nodes that lie on many of the shortest paths between other nodes have high betweenness and act as bridges (Opsahl, 2011). The color of the line tells what kind of Twitter action occurred between the users. “Mentions” are purple. These tweets involve the sender interacting with users who have replied to the tweet of an original poster (OP). A mention can also include the OP and any other users who have not replied to the OP and are instead included at the sender’s discretion. “Replies to” are orange. A Reply results when a sender selects the option to tweet directly to an OP. A “Tweet” is green. This category contains tweets which are not a part of any “thread,” or series of related tweets. They represent a single, unidirectional interaction.

A complication for my visualization is the many smaller conversations, mostly dyads, which create confusion for viewers who have no context for the graphic (Kadushin, 2012, p. 27). These brief, isolated interactions within my Twitter network form a thick ring around the clusters in the center. A related concern is the ethics of pulling this user data without consent. NodeXL only searched through public accounts connected to mine, but if these accounts are tweeting the queried term with users outside that scope, those handles appear in the dataset as well. A more refined data collection method would address both concerns.

Future Directions

Narrowing my scope with a more specific search term would decrease the size of my dataset and consequently produce a more manageable visualization. This end could also be achieved by searching for my own handle in my network’s tweets or querying viral hashtags that are common on my timeline. Additionally, time series data could be used to track the most active members in my Twitter community.
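As a rough illustration of the time series idea, the Python sketch below counts tweets per handle per day and reports the most active handle for each date. The records, the timestamp format, and the function name `most_active_by_day` are hypothetical; a real export would need its own parsing, and ties are broken here by whichever handle appears first in the data.

```python
from collections import Counter, defaultdict
from datetime import datetime

def most_active_by_day(records):
    """Map each date to the handle that sent the most tweets on that date."""
    per_day = defaultdict(Counter)
    for handle, timestamp in records:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date().isoformat()
        per_day[day][handle] += 1
    return {day: counts.most_common(1)[0][0] for day, counts in sorted(per_day.items())}

# Hypothetical (handle, timestamp) records, invented for illustration only.
records = [
    ("user_a", "2017-06-05 09:15"),
    ("user_b", "2017-06-05 10:02"),
    ("user_a", "2017-06-05 18:40"),
    ("user_b", "2017-06-06 08:30"),
]

top_by_day = most_active_by_day(records)
```

Repeating this over a longer collection window would show whether the same handles stay at the center of the conversation or whether activity shifts over time.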

Kadushin asserts that “the social statuses, positions, and social institutions” linking nodes outside of the visualization’s intended scope “can themselves be regarded as connected networks” which “are constantly emerging and as a result affect and change” the original network (Kadushin, 2012, p. 11). This statement supports applying a color coding of the edges according to a larger network structure to provide wider context for the visualization. Color could also be used to communicate multiplex relationships between nodes (Kadushin, 2012, p. 26). Understanding how users are related to each other outside of the Twitterverse would also inform how these users interact in the platform.

References

Kadushin, C. (2012). Understanding Social Networks: Theories, Concepts, and Findings. Oxford University Press, USA.

Krempel, L. (2011). Network Visualization. In The SAGE Handbook of Social Network Analysis (1st ed., pp. 558–577). London; Thousand Oaks, Calif.: SAGE Publications Ltd.

Kushin, M. (2015, October 4). #Hokies Tweets Network Visualization: How I Extracted Tweets Via Tags 6 and Visualized Them in Gephi. Retrieved July 2, 2017, from http://mattkushin.com/tag/twitter/

Opsahl, T. (2011, June 9). Node Centrality in Weighted Networks. Retrieved July 2, 2017, from https://toreopsahl.com/tnet/weighted-networks/node-centrality/

Siegel, A. (2015, December 20). Sectarian Twitter Wars: Sunni-Shia Conflict and Cooperation in the Digital Age. Retrieved July 2, 2017, from http://carnegieendowment.org/2015/12/20/sectarian-twitter-wars-sunni-shia-conflict-and-cooperation-in-digital-age-pub-62299

Thorp, J. (2009). Just Landed: Processing, Twitter, MetaCarta & Hidden Data. Retrieved July 2, 2017, from http://blog.blprnt.com/blog/blprnt/just-landed-processing-twitter-metacarta-hidden-data

Information Visualization

Student work at the School of Information, Pratt Institute