Social Media Scraping for Qualitative Research February 14, Sarah DeMott, NYU Bobst Library

The Social Media Scraping for Qualitative Research workshop was organized as part of New York City Digital Humanities Week in February 2017 and took place at NYU Bobst Library. In the workshop, NYU Data Services specialist Sarah DeMott demonstrated how NCapture can be used to collect data from social media services such as Twitter and Facebook, and how NVivo can be used to analyze and visualize this material. The topic was timely: the constant growth of social media use has made it an important subject for the humanities and social sciences. Digital humanities may provide fruitful approaches for better understanding the effects and nature of social interaction on the web.

NCapture is a free web browser extension that can be used to collect many kinds of material from the Internet. The application “scrapes” the data from the open web page, and the data can be stored, for example, in PDF format. NCapture can collect material from the most popular social media sites, e.g. Facebook, Twitter, YouTube and LinkedIn, which makes it a powerful tool for data gathering. NCapture is not the only tool for social media scraping, but unlike many other solutions it does not require coding skills. NCapture follows the terms of use of the above-mentioned social media services, which makes the data collection acceptable. However, as DeMott notes, academic citation practices still need to be followed when the material is referred to in research. To analyze and visualize the material collected with NCapture, the commercial software NVivo is needed.

NVivo is software designed for qualitative analysis of text and multimedia. According to the producer, the program is used by academic, government, health and commercial researchers. Rivals of NVivo include, for example, Atlas.ti and QDA Miner. All of these programs are designed for managing, coding, analyzing and visualizing data.

NCapture currently seems to be the easiest way to collect data from social media sites. However, there are some important limitations to take into consideration. On Facebook, the number of posts that can be captured is restricted. According to the producer of the software, the exact number of collected posts may vary depending on the number of posts available, the privacy settings of users, etc. Furthermore, on Facebook it is possible to collect data only from public groups or “walls”. (The ethical issues of using this material are considered later.) Twitter has limitations too: the number of captured tweets varies depending on the number of posts available and the amount of traffic on Twitter at the time of collection. What the terms “available” and “amount of traffic” exactly mean is not defined further. Because of these restrictions, collecting data about a particular topic may require taking multiple captures over time.
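Taking multiple captures over time means the same posts will usually appear more than once. The following is a minimal sketch of how overlapping captures could be combined, assuming each capture has already been exported and loaded as a list of records. The `id` field name and the sample records are hypothetical and are not taken from NCapture or NVivo.

```python
from typing import Dict, Iterable, List


def merge_captures(captures: Iterable[List[dict]], id_field: str = "id") -> List[dict]:
    """Combine several captures, keeping the first copy of each post."""
    seen: Dict[str, dict] = {}
    for capture in captures:
        for post in capture:
            post_id = str(post[id_field])
            # Later captures often overlap with earlier ones; skip repeats.
            if post_id not in seen:
                seen[post_id] = post
    return list(seen.values())


# Hypothetical records standing in for two captures of the same topic.
monday = [{"id": "101", "text": "first post"}, {"id": "102", "text": "second post"}]
friday = [{"id": "102", "text": "second post"}, {"id": "103", "text": "third post"}]

merged = merge_captures([monday, friday])
print(len(merged))  # 3 unique posts from 4 captured rows
```

Keeping the first copy is only one possible choice. If posts can be edited after publication, a study might prefer to keep the latest copy, or to keep every copy and record the capture date.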

From the perspective of academic research, the above-mentioned limitations need to be taken into consideration. It is problematic that NCapture does not provide more detailed information about the restrictions on data collection. For now, researchers’ ability to control the collection process and to evaluate the accuracy and coverage of the data rests only on the software producer’s assertions. Repeatability, an important criterion of academic research, is difficult or impossible to achieve if the results of data gathering vary each time.

However, the presented software also provides many possibilities for social media research and can, despite the mentioned limitations, offer valuable means for understanding new kinds of social interaction and mechanisms of social communication on the Internet. NCapture is of course designed to work seamlessly with NVivo, which gives the software an advantage over its competitors. The combination of NCapture and NVivo might be used successfully for qualitative analysis in case studies on clearly defined topics, such as the discussion of a certain issue in a limited period of time. When the aim is to collect larger datasets, the data gathering process should be designed more carefully, and researchers should consider whether it is possible to double-check that the social media capture has met the aims of the study. This double-checking could be done, for example, through random checks with other data scraping tools or with the default search tools provided by the social media services.
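One way to make such a random check concrete is to draw a sample of post identifiers found with another tool and see how many of them the capture also contains. The sketch below assumes both sources can be reduced to lists of post IDs. The function name `spot_check` and the example IDs are hypothetical, and the result is only a rough indicator of coverage, not a measure of completeness.

```python
import random
from typing import Iterable, Set, Tuple


def spot_check(captured_ids: Iterable[str], reference_ids: Iterable[str],
               sample_size: int, seed: int = 0) -> Tuple[float, Set[str]]:
    """Sample reference posts at random and report the share found in the capture."""
    captured = set(captured_ids)
    reference = sorted(set(reference_ids))
    if not reference:
        raise ValueError("reference_ids must not be empty")
    # A fixed seed makes the check itself repeatable.
    rng = random.Random(seed)
    sample = rng.sample(reference, min(sample_size, len(reference)))
    missing = {post_id for post_id in sample if post_id not in captured}
    coverage = 1 - len(missing) / len(sample)
    return coverage, missing


# Hypothetical IDs: "captured" from the capture, "reference" from a manual search.
captured = ["t1", "t2", "t3", "t9"]
reference = ["t1", "t2", "t3", "t4"]

coverage, missing = spot_check(captured, reference, sample_size=4)
print(coverage, missing)
```

A low share of matches does not by itself show which source is wrong, so the missing posts still need to be inspected by hand.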

In addition, one capability of NCapture aroused my interest in the workshop: the possibility of easily collecting visual material alongside the text. Visual images (photographs, memes, symbols, emojis etc.) inevitably play an important role in social media communication, yet in the humanities and social sciences researchers’ interest has mainly focused on text-based analysis. When observing, for example, political rhetoric on Twitter, where the length of posts is restricted to 140 characters, visual images may play a significant role in making the message more effective and persuasive. The possibility of easily saving Facebook or Twitter discussion threads as PDFs makes it possible to observe connections between text and images in their original form. It also preserves the original visual context of the social interaction. In the longer term, as social media platforms are continually transforming, these collections or “screen savings” can provide interesting material for analyzing changes in social media communication.

Compared with other data coding and visualization programs, NVivo currently seems to have some benefits: the user interface is intuitive and easy to learn, and the visualization tools support different formats, such as mapping and network visualization. Of course, after a short workshop it is not possible to judge whether there is a significant difference between NVivo and its rivals. The biggest benefit of NVivo is that NCapture makes the data collection process quick and simple, and the collected data can be imported easily into the software. Presumably other software producers will develop similar data collection tools in the future. In fact, the new version of Atlas.ti already advertises a “direct import of Twitter, Endnote, Evernote data”.

The workshop did not cover other tools for social media scraping. However, when planning a research project it would be useful to become familiar with other options too, for example in order to double-check the data gathering.

There are several application programming interfaces (APIs) designed for collecting data from Facebook and Twitter. Using an API usually requires at least basic coding skills, but practical guides are easily available. For example, Matthew A. Russell’s Mining the Social Web (2nd ed., 2013) introduces several ways of acquiring data from the web. [1] APIs seem especially suitable for data mining, e.g. scraping large datasets. Sometimes it may also be possible to control collection criteria better through APIs than in commercial software. Collecting data for qualitative analysis, e.g. for the above-mentioned visual analysis, would again require careful planning [2]. On the other hand, both software such as NVivo and Atlas.ti and APIs may provide a fruitful opportunity to combine quantitative and qualitative approaches. For example, data mining could be used to show the large structures of social media networks, and qualitative analysis could be added to demonstrate communication within these networks.
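As a rough illustration of the quantitative side of such a combination, the sketch below counts who mentions whom in a set of posts, which gives a simple picture of a network's structure. It assumes the posts are already available as (author, text) pairs. The names and texts are invented, and the `@name` pattern is a simplification that would also match parts of e-mail addresses.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Simplified pattern for @mentions; real platforms have stricter username rules.
MENTION = re.compile(r"@(\w+)")


def mention_edges(posts: Iterable[Tuple[str, str]]) -> Counter:
    """Count directed (author, mentioned user) pairs across all posts."""
    edges: Counter = Counter()
    for author, text in posts:
        for mentioned in MENTION.findall(text):
            edges[(author.lower(), mentioned.lower())] += 1
    return edges


# Invented example posts.
posts = [
    ("alice", "Agree with @bob on this"),
    ("alice", "@bob and @carol, see the thread"),
    ("bob", "Thanks @alice"),
]

edges = mention_edges(posts)
for (source, target), count in edges.most_common():
    print(source, "->", target, count)
```

The most frequent pairs could then point to the threads worth reading closely in a qualitative analysis.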

Collecting data from social media, and from other web sources in general, requires researchers to reconsider ethical questions. In data mining and scraping projects it is often not possible to get informed consent from the “producers” of the data, i.e. the people whose posts or social networks are analyzed. It can also be difficult to verify, for example, the age of the people whose posts are analyzed (adult vs. under-age). In some cases it can be assumed that people are aware that their activity is mainly public (e.g. on Twitter), but in other cases users may not intend their activity to be public (e.g. inadvertently writing a comment on a Facebook post that is publicly shared) [2]. The rapidly changing forms of different social media services and opaque privacy settings make it even more difficult to evaluate the ethics of collecting and using social media data. In any case, social media users may not think that their information would be used for research purposes. In this situation, the least a researcher can do is to attend to privacy and anonymity. The data should be anonymized, and careful consideration is needed when quoting direct extracts from the data, as the original post and its author could then be traced.
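One common first step is to replace usernames with stable pseudonyms before analysis. The following is a minimal sketch using a keyed hash from the Python standard library. The function name, the `user_` prefix and the example key are hypothetical, and this is pseudonymization rather than full anonymization.

```python
import hashlib
import hmac


def pseudonymize(username: str, secret_key: bytes, length: int = 12) -> str:
    """Map a username to a stable pseudonym that cannot be reversed without the key."""
    # Lowercasing treats "Alice" and "alice" as the same account.
    digest = hmac.new(secret_key, username.lower().encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:length]


# The key is a placeholder; a real project would generate it randomly
# and store it separately from the data.
key = b"replace-with-a-random-project-key"

alias = pseudonymize("Example_User", key)
print(alias)
```

Because the same username always maps to the same pseudonym, conversations can still be followed across posts. Quoted text remains searchable, however, so direct extracts can identify an author even when the name is hidden.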

To conclude, the NYCDH Week workshop on social media scraping for qualitative research provided useful experience of using newly designed tools for collecting and analyzing data. Hopefully there will soon be digital humanities projects that demonstrate how to use the different possibilities of these tools in creative ways.

Readings:

[1] IPython notebook to follow examples from Mining the Social Web: https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/

[2] An example of using an API to collect data from Facebook, with some ethical considerations: http://pewrsr.ch/1bwVcfy

Image source: http://help-ncapture.qsrinternational.com/desktop/cn_ncapture_infographic.jpg

Digital Humanities@ Pratt
