Title: Research Without Borders: Big Open Data
Where: The Faculty House at Columbia University
Sponsored by: Scholarly Communication Program and Data Science Institute at Columbia University
When: Thursday, December 4, 2014 2:00 to 4:00 PM
Speakers: David Wrisley, Jonathan Stray, Alice Marwick, and David Park (moderator)
Hashtag: #rwob
Last week at the Faculty House at Columbia University, three speakers presented at a public event, the Research Without Borders Conference: Big Open Data. The speakers were David Wrisley, a medievalist and Associate Professor at the American University of Beirut; Jonathan Stray, a computational journalist at Columbia University and lead developer of the Overview Project; and Alice Marwick, Assistant Professor at Fordham University and an affiliate of the Center on Law and Information Policy. The moderator was David Park, Dean of Strategic Initiatives at Columbia, member of Columbia’s Institute for Data Sciences and Engineering New Media Center, senior advisor to the Executive Vice President and Dean of Faculty of the Arts and Sciences, Director of Special Projects at the Applied Statistics Center, and founding member of Columbia’s Digital Storytelling Lab.
The three speakers used the first hour to share their views on and work in big open data. Wrisley, the humanist on the panel, argued in his presentation, Big (Comma) Open Data: A View from the Humanities, that big open data means something different in the humanities than it does in some other disciplines. To summarize, Wrisley said that what counts as big open data in the humanities is closer to medium-sized data, and that it is open in its own way. In the humanities, open means that works are more accessible to the masses, and the field is moving from simply expanding the availability of works online to transforming what you can do with them (Wrisley). Wrisley also shared some examples of big open data projects, or rather of what can be categorized as digital humanities projects. The projects and project types he shared included the Kings Project, which studies text reuse and intertextuality (when authors use the language of other authors); spatial humanities, with a “heat” map of places mentioned in ancient texts, mapped as a network visualization; palimpsests, where multispectral imaging lets you see text written above or below lines of text that have been erased or recycled; and the DigiPal Project, a digital paleography project that catalogs letters to see who wrote which letters, when they were written, which style was used in which time period, and how those styles changed over time.
Jonathan Stray is a computational journalist (a journalist who uses software and data mining techniques to gather information or uncover new information). His presentation, Big Open Data in Journalism, covered new methods for extracting information from large sources and large amounts of data. Stray said that data visualizations sometimes create “hairball data,” which means that a bunch of data is taken and thrown on a map but does not mean a whole lot to the person looking at it. Stray also said that from a journalist’s view, “the most interesting work comes from data that is not always open to the public,” and that “you can make your own data,” which also isn’t open to the public. Stray also uses computers to do things that are not strictly digital, like making documents available on the Internet so that everyone can see them. But new techniques and software can also help journalists pull new information out of documents. Stray is one of the developers working on the Overview Project, a software program that can go through thousands of pages in a document, like a government report. The data that the software collects can be used to create word clouds, for example, which let you see trends, make connections, or spot a keyword that you might have missed if you were manually going through each page of a seven-thousand-page document. The example Stray gave to demonstrate the tool’s capability was the dealings of private security contractors in Iraq, where the Overview Project was used to pick out activities in a document that was several thousand pages long. The program can also connect names to trends, which is useful when you are following a crime family’s movements or cases of money laundering.
Stray ended his presentation by saying that big open data in journalism isn’t just about downloading large amounts of data and making it public. You also need techniques to help you go through it and visuals to help you make connections, see patterns, or even just notice things you may not have caught before. Stray also posed a few questions for the audience to ponder:
- How best to combine machine and human intelligence
- How to sustainably fund technology transfer into journalism
- What stories should journalists cover
- Does the reality of transparency match the theory
Alice Marwick’s presentation, Tracking and Mining the Mobile, Social Web, differed from Stray’s and Wrisley’s. As she put it, “if the first two presentations present the utopian aspect of open data, then mine is the dystopian view of it.” Her presentation dealt with how small pieces of online data, once connected to off-line personal data, lead to an aggregation of personal data that data brokers collect and sell to anyone who wants it, whether that is a company trying to solicit one group in the population or the government purchasing an individual’s social media and Internet tracking data. Marwick explained how Atlas, an ad-serving technology employed by Facebook, tracks an individual’s movements online and uses what it collects to show that person specific advertisements in the Facebook newsfeed. The scary thing Marwick pointed out was that Facebook is taking the information that Atlas collects and selling it to data brokers. Her presentation went on to explain how these ad-serving technologies track not only a person’s on-line movements but also their off-line activities, collecting an enormous amount of personal data that is made available to data brokers, who then sell this information to just about anyone who can pay for it. She even mentioned that one data broker company sold individuals’ personal information (credit card numbers, Social Security numbers, etc.) to identity thieves.
The next hour of the event was a Q&A session in which the audience put questions to the three speakers and David Park. Most of the questions dealt with privacy and what we can do to protect it, but as long as we continue using social media and willingly provide access to personal information, we as a society risk having our private information exposed. Towards the end, though, more questions were directed at collaboration and at working with other specialists to obtain the kind of information needed in some of these projects.
Overall, the conference was very informative. I liked all three presentations, but I believe that Wrisley presented his the best. Stray and Marwick had fascinating information to share, but they spoke and moved through their slides so quickly that it was hard to follow. I felt that as soon as I was writing one thing down, they had moved on to another topic. Wrisley took his time going over his material and left his slides up a little longer so people could follow along.
I personally took more of an interest in Wrisley’s presentation, since it focused on a more academic use of big open data and explained that use in simple terms (take the information we have, transform it, and get something new out of it). That is not a bad thing at all, since digital humanities is an emerging field and it can be difficult to define what a true digital humanities project is. Because he presented it in a simple way, I think that everyone in the audience could follow what he shared and could then understand what Stray presented about the Overview Project and how that tool can be used to mine a large document. It was a nice transition from the humanities to a different discipline. It was also easier to follow Marwick’s presentation knowing that a big open data project can be used not only to collect data but also to see what is being done with collected on-line and off-line personal data.
From my classes on the digital humanities, academic librarianship, and scholarly communication, I knew that it is hard to get recognition for the work done on these projects and that in the past it was hard to find people to collaborate on them, but from tweets and other social media I had thought that this was changing and that acceptance was becoming more widespread. It was interesting to hear that collaboration on these projects is still difficult, that conducting a project is still not as widely accepted as writing a book or publishing an article in a high-ranking journal, and that even publishing in an open access journal is still frowned upon in these fields. According to the speakers, change is not moving as fast as I had thought. Marwick said she believed that “there are only a small amount of academics out there [who] see the need for interdisciplinary collaboration and those academics are on the fringes in their field.” I believe she meant that this small group of academics are not the widely accepted, leading academics in their respective disciplines. She also said that collaboration “is ideal, but it’s hard.” Wrisley made a similar comment, adding that we “live in a world where publishing together is hard,” that “institutions do not see it as valuable as if you had done the work yourself,” and that academics are “caught in a bind: those who release their data get more back and share and allow others to use it,” but institutions then don’t see the data as valuable because everyone has access to it. Wrisley also said “closed walls make more prestigious valuable works and contributions.” I feel that we want more open projects and more access to primary sources, but from what the speakers said, these projects are not as respected and are hitting roadblocks, because other academics may not see them as valuable contributions to the field and so the work done on them does not gain recognition.
The projects presented and the different uses of big open data can be valuable to the field. The presenters made it clear that headway is being made, that these projects are needed and useful, but that it will still be a while until they are widely accepted by everyone in the field. This event raised awareness of the projects being worked on in the respective fields (humanities, journalism, communications) and made known to the general public the need for more collaboration and wider acceptance of such projects. It was nice to hear from academics who are currently undertaking these kinds of projects and facing these roadblocks, rather than just reading about them. If any of these speakers present again, I would go to those events to learn more about their work.
There were three things I took away from this event:
1. Big open data projects do not always have to address big questions; a project can focus on a small issue or just one document and still present a lot of new information or a different way of viewing the item or event (the Overview Project, palimpsests)
2. Interdisciplinary collaboration needs to increase to further others’ projects and increase awareness
3. There is still some prejudice in the academic world against big open data projects
While no single project presented at this event will change the way people see big open data or digital humanities projects, I think that the event raised awareness of the need for more collaboration and more work in big (or small or medium) open data, and of the need to change the way such projects are reviewed for tenure or recognition in the field.