At Columbia’s “Research Without Borders” program last week, panel-goers were exposed to the “Big Open Data” phenomenon from three distinct vantage points.
The first panelist was David Wrisley, an English professor at the American University of Beirut and a Medieval Fellow at Fordham’s Center for Medieval Studies. His research focuses on medieval comparative literature and the digital humanities. Wrisley’s approach to data-driven research was the most familiar from a DH standpoint. He focused on how, with a large dataset, we move from consuming texts to making things out of them: once these data assets are “in your hands,” you can make new things out of them, manipulate them, or study them in an entirely new way. Wrisley’s method is similar to the idea of “distant reading” that we studied in Franco Moretti’s Graphs, Maps, Trees. For medieval comparative literature, the hypertext capabilities of digitized text are particularly useful for analysis.
Linking texts to other texts through human markup is a well-established method in comparative literature, but Wrisley was more interested in discussing automatic annotation and topic modeling, as opposed to human-created ontologies and the lack of collaboration that comes from one person’s hand producing the markup. He also touched on the spatial humanities: creating maps of places both real and fictional, whether in the ancient world or in literature. This, too, was reminiscent of Moretti’s text. Finally, Wrisley brought up another method that is particularly pertinent to medieval documents: multispectral imaging. Spectral imaging captures image data from frequencies beyond the visible light range, allowing for the extraction of information that the human eye fails to capture. In this way, there is a granularity in big data, and it allows for a super-close reading, far closer than the human eye can manage. Showing traces of old documents, Wrisley stated, allows for a radical materiality of the humanities; DH methods that harness big open data are remarkable not just in their scope, or “distant readings,” but also in their ability to conduct super-close readings. I hadn’t thought of this connection; it is true that both the far-away and the extremely close perspective are offered through a technological processing of information.
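To make the idea of automatic annotation a little more concrete, here is a minimal sketch of topic modeling over a folder of digitized transcriptions using scikit-learn’s LDA; the corpus directory, parameter choices, and English stop-word list are illustrative assumptions of mine, not Wrisley’s actual workflow.

```python
# A minimal sketch of automatic topic modeling on a digitized corpus.
# The "corpus" folder and all parameter choices are hypothetical.
from pathlib import Path

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Load a directory of plain-text transcriptions (placeholder corpus).
documents = [p.read_text(encoding="utf-8") for p in Path("corpus").glob("*.txt")]

# Turn the texts into a document-term matrix of word counts.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
doc_term_matrix = vectorizer.fit_transform(documents)

# Fit an LDA model that groups co-occurring words into latent "topics".
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(doc_term_matrix)

# Print the top words per topic: machine-generated annotations that a
# scholar can then interpret, rather than hand-built ontologies.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [terms[j] for j in topic.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top_words)}")
```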
The next panelist was Jonathan Stray, a computational journalist who teaches at Columbia University. Stray also leads development of the Overview Project, an open-source document-archive analysis system for journalists. He defined computational journalism as journalism that either uses, or is simply about, computation. Applying computational methods to journalism means huge datasets that journalists no longer have to read through by hand. Visualization methods such as word clouds can take these huge text datasets and reveal unusual and unexpected trends. Besides word-count visualizations, Stray also discussed the usefulness of topic modeling (subject sorting) and network analysis.
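As a rough illustration of what sits behind a word cloud, here is a minimal word-frequency sketch in Python; the document folder and the crude length-based stop-word filter are placeholder assumptions, not Stray’s Overview pipeline.

```python
# A minimal sketch of the word-frequency counting that underlies a word
# cloud: count terms across a pile of documents and surface the most
# frequent ones. The "documents" folder is a hypothetical placeholder.
import re
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("documents").glob("*.txt"):
    words = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
    counts.update(w for w in words if len(w) > 3)  # crude stop-word filter

# The top terms are what a word cloud would render largest, and where
# unexpected trends in a big text dataset first become visible.
for word, n in counts.most_common(25):
    print(f"{word}\t{n}")
```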
I was specifically interested in how computational journalism uses DH methods for political ends; network analysis in a political context, for instance, lets you paste in the names of people and companies and look for connections that shouldn’t be there. Stray gave the examples of organized crime, money laundering, and fraud. More specifically, he discussed how this method gave reporters working with the WikiLeaks documents a broader view of what happened with private security contractors in Iraq. I could definitely see how networks provide algorithmic assistance for investigative journalism, since the computer automatically points to what is happening beneath the surface, or between the lines, of huge masses of bureaucratic paperwork.
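Here is a minimal sketch of that kind of entity-network analysis using networkx; the CSV of person–company relationships and the entity names are hypothetical, and this only illustrates the general technique, not the specific tooling Stray described.

```python
# A minimal sketch of entity-network analysis: load person-company
# relationships as a graph and look for connections that "shouldn't"
# be there. The CSV file, its layout, and the names are hypothetical.
import csv

import networkx as nx

G = nx.Graph()
with open("relationships.csv", newline="", encoding="utf-8") as f:
    for person, company in csv.reader(f):
        G.add_edge(person, company)

# Is there a path between two entities that have no obvious business
# being connected?
a, b = "Contractor X", "Official Y"  # hypothetical entities of interest
if a in G and b in G and nx.has_path(G, a, b):
    print(nx.shortest_path(G, a, b))

# Centrality scores can also flag entities that quietly bridge many
# otherwise separate parts of the network.
central = sorted(nx.betweenness_centrality(G).items(), key=lambda kv: -kv[1])
print(central[:10])
```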
After discussing the academic revelations and democratic freedoms one can amass through “Big Open Data”, I was wondering when the conversation would hit a darker chord. Next up was Alice Marwick, an Assistant Professor at Fordham University. Her work investigates online identity and consumerism “through the lenses of privacy, surveillance, and consumption.” Marwick provided the more dystopian approach to big open data. The focus of her presentation was how the “little data” that we all generate, across both our online and offline activities, are aggregated, bought, and sold to social media agencies. Thus, our closed data becomes open data. She first discussed a major offender, Facebook’s Atlas ad program. Atlas’ website advertises that it “helps you reach your business objectives,” that it is cross-device and works online to offline with real, proven results, and that it allows you to “illuminate and understand customer journeys.” To put it simply, if you browse the web, whether on your computer, tablet, or phone, while logged in to Facebook, Atlas can see all of your activity. The program has access to facial recognition data, your closest friends, your political beliefs, the games you play, and the music you listen to. This has allowed Atlas to form very precise and targeted consumer audiences and identities.
Facebook sells information to data brokers, and those brokers hold personal files on over seven hundred million consumers. Their sources are varied: public records, magazine subscriptions, online and offline shopping, education and salary data, and DMV and voting records. What is different about Facebook is that it sells private, closed information that isn’t part of the public record. For this reason, individuals end up revealing much that they do not intend to.
Sensitive information, such as being a smoker, being overweight, having divorced parents, or one’s sexual orientation, can be inferred through correlations in closed data. Data brokers then sort the population into digital dossiers of 71 segments and 19 categories. We are not allowed to know what these categories are or how we have been categorized, yet this information can be used to target the most vulnerable members of the population, and it is widely available for purchase. Obama’s campaign pioneered micro-targeted advertising; the government is not legally allowed to collect this information, but it is allowed to purchase it. Data brokers have even sold this information to criminals by mistake!
Alice Marwick’s presentation made me seriously regret a lot of my online naiveté. Clearly, despite the advantages offered by unprecedented access to large portions of data, there is a dark side to algorithmic data retrieval and analysis. When it comes down to people being denied health insurance or a mortgage because of unclear, invisible surveillance, you can really begin to feel watched from all sides. I am certainly interested in Marwick’s research and plan to eventually read her book. Still, from a DH/research perspective, Jonathan Stray’s presentation was inspiring and a bit more hopeful. I am really interested in DH applications for political change, or the way that using DH research methods on big open data can serve emancipatory functions. Whether these future projects aim at critiquing the state or at allowing citizens to access information that is bureaucratically buried or hidden in a large mass of “open documents”, I hope this area of DH grows a lot in the coming years.
Further Reading
Moretti, F. (2007). Graphs, maps, trees: Abstract models for literary history. New York: Verso.