Digital Humanities
@ Pratt

Inquiries into culture, meaning, and human value meet emerging technologies and cutting-edge skills at Pratt Institute's School of Information

Social Media Scraping for Qualitative Data Analysis at NYCDH Week 2017


In her chapter in Digital Humanities in the Library: Challenges and Opportunities for Subject Specialists, Caro Pinto discusses the evolution of the traditionally solitary work of humanities scholars into the collaborative work that characterizes most digital humanities projects. Pinto cites consortia such as the Tri-Co Digital Humanities Initiative and Five Colleges DH as particularly successful examples of cross-institution collaboration. Similarly, NYCDH Week is a perfect example of the way that digital humanists around New York share their resources and knowledge base with their colleagues.

NYCDH Week’s Social Media Scraping workshop was led by Sarah Demott at New York University’s Bobst Library on Tuesday, February 14, 2017. The workshop revolved around the qualitative data analysis software NVivo and its associated browser extension, NCapture. This software makes it easy for researchers to access user data from three major social networking sites: Twitter, Facebook, and YouTube. NCapture provides a few options for capturing data from each of these sites and offers a number of tools to help researchers analyze their results. It is especially useful for researchers who have little to no coding experience but would still like to compile a dataset of social media information.

Sarah began the workshop by explaining that NVivo adheres to the Twitter, Facebook, and YouTube terms of use by accessing data through their respective APIs. She then instructed each participant to download NCapture, which is freely available from the Chrome Web Store. Once the browser extension was added, participants were able to use it to capture data from Twitter feeds, Facebook pages, or YouTube comment sections as either a dataset or a PDF.

NCapture functions slightly differently on each platform. On Twitter, NCapture prompts the user to choose between saving tweets as a dataset or as a PDF. If researchers are capturing the feed for a search term rather than a specific user’s page, they are given the option to either include or omit retweets. On Facebook, researchers are only able to capture information from a particular user’s or company’s page, and can save posts as a dataset or the whole page as a PDF. Finally, on YouTube researchers are given three options: capture the video itself, capture the video along with its comment section, or perform the standard PDF page capture. Across all of these platforms, the least useful option is saving the webpage as a PDF, since a PDF does not allow for any degree of interactivity on the researcher’s part. Once datasets have been captured, researchers must open the files in NVivo.
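To give a sense of what the “include or omit retweets” choice amounts to, here is a minimal sketch in Python. The rows and column names are invented for illustration only; they are not NCapture’s actual output schema, which is proprietary.

```python
# Hypothetical rows resembling a captured Twitter dataset.
# The field names here are assumptions, not NCapture's real schema.
tweets = [
    {"username": "@scholar_a", "text": "Mapping archives with DH tools", "is_retweet": False},
    {"username": "@scholar_b", "text": "RT @scholar_a: Mapping archives with DH tools", "is_retweet": True},
    {"username": "@scholar_c", "text": "Workshop notes from #NYCDHWeek", "is_retweet": False},
]

def omit_retweets(rows):
    """Mimic the 'omit retweets' capture option by dropping flagged rows."""
    return [row for row in rows if not row["is_retweet"]]

originals = omit_retweets(tweets)
print(len(originals))  # 2
```

The point of the option is simply deduplication: for a hashtag or search-term capture, retweets can swamp the original posts, so filtering them out yields a cleaner dataset for qualitative coding.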

NVivo project example from Kiwi Data Science

While NCapture is free to add as a Google Chrome browser extension, it saves the associated dataset as a proprietary .nvcx file, which can only be opened in NVivo. This means that users will need to purchase a subscription to NVivo in order to open and manipulate the datasets they download with NCapture. Once we had captured our desired social media information as datasets, Sarah walked us through creating a new project and importing the NCapture files into NVivo.

Once the files have been imported, they display in a spreadsheet format. For example, with YouTube comment data, the columns include “Comment ID”, “Comment Username”, “Comment”, “Reply ID”, “Reply by Username”, and “Reply”. NVivo offers tabs alongside the spreadsheet, including a map function that plots the data points geographically. Sarah also demonstrated how to query different aspects of our dataset, such as term frequency. This query reports term length and frequency, and it allows researchers to add stop words, which helped reduce the number of irrelevant words in our results. The term frequency query also comes with a number of visual aids, such as a word tree and a word cloud.
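The term-frequency query Sarah demonstrated can be approximated in a few lines of Python. The comment rows and the stop-word list below are invented for illustration (NVivo’s own query is, of course, more sophisticated), but the logic is the same: tokenize the text, drop stop words and short tokens, and count what remains.

```python
from collections import Counter
import re

# Invented YouTube-style comment rows; the keys follow the
# "Comment Username" / "Comment" column layout described above.
comments = [
    {"Comment Username": "viewer1", "Comment": "Great video about digital humanities"},
    {"Comment Username": "viewer2", "Comment": "The digital archive demo was great"},
    {"Comment Username": "viewer3", "Comment": "More digital tools please"},
]

# A tiny stop-word list for illustration; a real analysis would use a fuller one.
stop_words = {"the", "a", "about", "was", "more", "please"}

def term_frequency(rows, stop_words, min_length=3):
    """Count terms across all comments, skipping stop words and short tokens."""
    counts = Counter()
    for row in rows:
        for token in re.findall(r"[a-z']+", row["Comment"].lower()):
            if token not in stop_words and len(token) >= min_length:
                counts[token] += 1
    return counts

print(term_frequency(comments, stop_words).most_common(3))
# [('digital', 3), ('great', 2), ('video', 1)]
```

Adding a word like “great” to the stop-word list would remove it from the counts entirely, which is exactly how we pruned irrelevant terms from the query results during the workshop.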

Ultimately, NVivo and NCapture seem like very useful tools for extracting social media data when the user is unfamiliar with coding. However, these tools do have some significant limitations. The most apparent is cost: while NCapture is freely available, NVivo is required to view NCapture data, and student subscriptions start at $75 a year. The second issue is that the NCapture/NVivo package is really only capable of capturing real-time results from Twitter. NCapture would therefore be unable to meet the needs of a project like ours in Digital Humanities II, where we hope to look at historical uses of a term and its associated hashtag. If NVivo were able to provide access to historical Twitter data, it might be worth purchasing for our project. Despite this shortcoming, it does seem to be a useful tool for real-time social media data collection, and it provides a nice set of tools for researchers who are unfamiliar with coding and new to data analysis.
