On February 8th, I attended a NYCDH Week affiliated workshop led by Dr. Heidi Knoblauch (digital projects coordinator at Bard College) entitled Public Participation in Humanities Research: Using APIs and Crowd Sourcing Platforms. At this workshop we saw how scripts in IPython notebooks, accessed through DHBox’s Jupyter Notebook plug-in, can pull images from a collection on the Internet Archive and convert them into a more easily usable format. We were then shown how these images could be uploaded to the platform crowdcrafting.org, so that someone conducting a study could crowdsource the sorting and classification of the images. In this fashion, large sets of images from Internet Archive collections can be winnowed down to just the ones that interest the researcher. For instance, if researchers are interested only in images of people from the run of a certain magazine, they can load all of the page images they have obtained from the Internet Archive, have participants tag which JPEGs contain an image, and click the tag for people when the image shown has a person in it.
An Example of a Crowdcrafting Project
Dr. Knoblauch was familiar with this method, having used it in the preparation of her paper on the growth of the concept of patient privacy in medical history. She created a project on the Crowdcrafting platform to have her corpus of images from medical journal pages from 1865–1962 sorted into either photographs or diagrams. For those that were photographs, she had visitors to Crowdcrafting tag whether the photo was of a person, and whether there had been any attempt to anonymize that person. She found that, despite the popular conception that patient privacy had been a given since the 1890s, when doctors began to be sued for privacy breaches, images of patients were by and large not anonymized in medical journals until the 1920s.
To minimize the number of programs and extra skills needed to walk through these methods for our own projects, we used DHBox, a virtual machine created by CUNY. The advantage of a virtual machine like this is that many of the programs you would need are held in the cloud, decreasing the cost of setup and letting you run trial versions of your experiment. Without a labor-intensive configuration, the risk of your experiment failing, or of the method or software in use turning out not to fit the task, is a smaller problem. The drawback is that you must be online to use it, and too much use at once can cause issues with the server, as we found.
We logged on to the demonstration copy of DHBox and walked through using the command line to set up the necessary software and server connections in the cloud for the rest of our project, including the connection to the Internet Archive’s API. Once those connections were set, we used the Jupyter Notebook function in DHBox to open the IPython notebook that Dr. Knoblauch had set up on GitHub for the workshop.
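As a rough sketch of the kind of request involved (the workshop’s own scripts aren’t reproduced here), the Internet Archive exposes an advancedsearch endpoint that returns the item identifiers in a collection as JSON. The collection name below is illustrative:

```python
# A minimal sketch of querying the Internet Archive's advancedsearch
# endpoint for item identifiers in a collection. The collection name
# "scientific-american" is illustrative; the workshop's actual scripts
# may have differed.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_search_url(collection, rows=10):
    """Build an advancedsearch URL that returns identifiers as JSON."""
    params = urlencode({
        "q": f"collection:{collection}",
        "fl[]": "identifier",
        "rows": rows,
        "output": "json",
    })
    return f"https://archive.org/advancedsearch.php?{params}"

def fetch_identifiers(collection, rows=10):
    """Fetch the item identifiers (requires network access)."""
    with urlopen(build_search_url(collection, rows)) as resp:
        data = json.load(resp)
    return [doc["identifier"] for doc in data["response"]["docs"]]

if __name__ == "__main__":
    # Print the URL we would fetch, without hitting the network.
    print(build_search_url("scientific-american", rows=5))
```

With the identifiers in hand, each item’s files can then be downloaded from the archive.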
The notebook was split into separate parts, with bits of script and broad explanations of what those scripts were doing. Using the scripts in the notebook as loaded through DHBox, we went through the steps of downloading PDFs from a collection on the Internet Archive and converting them into JPEGs. As Dr. Knoblauch said, she had had the aid of a developer on this, so less attention was given to how the scripts were constructed and more to what they did.
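One way to do that conversion step, assuming the poppler utility pdftoppm is available (the workshop’s notebook may well have used a different tool), is to render each PDF page out as a JPEG. The dpi setting controls the resolution, and so the size, of the resulting images:

```python
# A sketch of turning downloaded journal PDFs into one JPEG per page
# using poppler's `pdftoppm` -- an assumption; the workshop's scripts
# are not reproduced here.
import subprocess

def ppm_command(pdf_path, out_prefix, dpi=100):
    """Build the pdftoppm invocation for a PDF-to-JPEG conversion."""
    return [
        "pdftoppm",
        "-jpeg",            # emit JPEG rather than the default PPM
        "-r", str(dpi),     # render resolution in dots per inch
        pdf_path,
        out_prefix,         # pages come out as numbered JPEGs
    ]

def pdf_to_jpegs(pdf_path, out_prefix, dpi=100):
    """Run the conversion (requires poppler-utils on the PATH)."""
    subprocess.run(ppm_command(pdf_path, out_prefix, dpi), check=True)
```

Lowering the dpi value shrinks the JPEGs, which matters once a whole classroom is running the same step.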
After a trial run through the Scientific American collection on the Internet Archive, we were given some instruction in how to search other collections instead. However, because of the extra bandwidth taken up on the server by a class of about a dozen people, the program sputtered out a couple of times. Dr. Knoblauch handled this well, taking it in stride, and hit upon the idea of decreasing the resolution of the images we were downloading in order to lessen the total workload for the virtual machine and for the classroom’s internet connection as a whole.
With either the Scientific American images in hand, or ones from collections we’d found ourselves, we were introduced to the site crowdcrafting.org, where you can set up crowdsourcing “projects” for people to tag or define images. This helps you go through a large number of images or pages and either sort them to find exactly what you are looking for, or single out the images worthy of closer scrutiny. Of course, while the site arranges your images and questions neatly into a package ready for users, it doesn’t supply those users: you have to email, tweet, share, and otherwise disperse the link to your project.
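Crowdcrafting runs on the PyBossa platform, and as I understand it, its REST API also lets you load tasks programmatically rather than through the web interface: each image becomes a task, POSTed as JSON. The project id and API key below are placeholders, and the exact payload shape is an assumption based on PyBossa’s documentation:

```python
# A sketch of pushing image-tagging tasks to a Crowdcrafting project
# via the PyBossa REST API. The project id and api_key are
# placeholders -- you'd get yours from your Crowdcrafting account.
import json
from urllib.request import Request, urlopen

API_ROOT = "https://crowdcrafting.org/api"

def task_payload(project_id, image_url):
    """One task per image; the tagging question itself lives in the
    project's task presenter, not in the task."""
    return {"project_id": project_id, "info": {"url": image_url}}

def post_task(api_key, payload):
    """Create the task (requires network access and a valid key)."""
    req = Request(
        f"{API_ROOT}/task?api_key={api_key}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)
```

Looping task_payload over the JPEG URLs from the earlier download step would fill a project with one tagging task per page.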
Overall, the workshop presented an interesting structure on which a project could be built, yet not in a way that let me grasp how to transfer this knowledge to other APIs, or how to pull other formats from the Internet Archive API. It might have been more helpful if the slides for the presentation had been linked, for following along or for at-home access to the necessary steps, but the instructor was open to sending the information if we emailed her afterwards, and the Python notebooks are still active on GitHub. The process of creating a Crowdcrafting project seems like it would help researchers really focus on what they want to ask of their data, and what questions need answering to form conclusions about what the data is telling them. Certainly, I feel I found an interesting new way to gather material, as well as to classify it, in service of a research question.