A beautiful soup of data…
Lessons from web scraping the CBA’s digital collections catalog.
Written by Maddy Casey and Darcy Krasne
Edited by Claudia Berger
Initially written October 12, 2024
As part of her contribution to our project, our team member Darcy is exploring the CBA’s artists’ books collection metadata and highlighting the topics it covers in a data physicalization. A few weeks in, I sat down with her to discuss one of the challenges of her work: collecting the data.
Hi Darcy. What is your project? What are you trying to do?
I want to understand the CBA’s collection of artists’ books. I think I’m doing this by sorting their fine arts collection into thematic groups. Originally I was thinking I would go by the object type field (which generations of CBA staff have assigned), but when I started actually looking at the early data I was collecting, that field was too messy. For instance, they have an “accordion books” object type, but there are also a bunch of accordion books that are just cataloged as books. That made me want to find a different way to understand the collection. So now I’m looking at the descriptions of the objects and the subjects or keywords that have been assigned to them. I’m using those fields to detect similarities between the objects, and seeing if they cluster in ways where I can create more meaningful groupings.
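As an illustration of the kind of similarity detection Darcy describes, here is a minimal sketch using TF-IDF vectors and cosine similarity, one common approach but not necessarily the one she used; the sample descriptions are invented:

```python
# A toy sketch of comparing records by description similarity (not Darcy's
# actual code). Assumes scikit-learn is installed; the records are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "accordion book with letterpress text about rivers",
    "letterpress broadside about rivers and maps",
    "pop-up book exploring urban architecture",
]

# Turn each description into a TF-IDF vector, then compare every pair.
vectors = TfidfVectorizer(stop_words="english").fit_transform(records)
print(cosine_similarity(vectors).round(2))  # higher values = more similar
```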
What is a big challenge you’ve had to overcome as you work on your project?
The challenge to overcome is sort of all of it, especially translating what I want to do in data terms into something physical. I want it to feel worthy of representing this collection of artists’ books. Along the way, there are things I already know how to do in Python, and things I am trying to work out as I go along.
Can you tell me more about the web scraping you did?
We were waiting on receipt of the CBA’s collections metadata, but I was able to scrape the data myself, because we had used the Beautiful Soup Python library in my Programming for Cultural Heritage class. I knew the principles, but actually doing it outside the confines of an assignment was a challenge. I had to ask: what are the particulars of the CBA’s web pages and their structures that I need to understand in order to get the data?
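For readers unfamiliar with Beautiful Soup, a minimal fetch-and-parse sketch looks something like this; the URL and CSS selector are placeholders, not the CBA site’s actual structure:

```python
# A minimal fetch-and-parse sketch. The URL and the "a.record-title"
# selector are hypothetical stand-ins for a real collections page.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.org/collections?page=1")
soup = BeautifulSoup(response.text, "html.parser")

# Inspect the page source first, then pull out the elements you need.
for link in soup.select("a.record-title"):
    print(link.get_text(strip=True), link["href"])
```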
How are you going about learning how to use a web scraper?
I have been using the Beautiful Soup documentation, testing a little piece at a time, seeing what I can get to work, and then building it out into bigger pieces.
What resources have you been using as you approach the next steps of your project?
I can search Python library documentation all I want, but if I don’t know the name of what I’m looking for, it can be hard to find. As I work, if I have a coding issue, sometimes ChatGPT can give me a function I didn’t know existed. For other things I’ve been reading a lot of esoteric, high-level articles about programming and text analysis. The challenge has been applying those principles to my project, as opposed to a test corpus a tutorial author created to demonstrate a tool. I keep trying to avoid using ChatGPT because of the environmental toll it takes, and then end up using it anyway.
One of the issues with getting the data is that the CBA does not have an API. What else did you find challenging about this particular case?
The collections website was not consistently available, so I ended up downloading some HTML files to work with on my computer, which wasn’t my initial plan. In the end, I went through every single page of the main collection listing and saved each piece of it, and then I extracted only the items in the fine arts collection. Their call numbers start with an F, so I was able to easily identify that part of the collection, which left me with a smaller corpus to work with. Another challenge was that, despite the fact that they are using a content management system, there were all sorts of inconsistencies in the structures of the web pages. Every so often a phrase would be missing, or one component of a page would be missing, random things. So every time, I had to go back and modify the code so that it continued to work with everything I had been doing but had built-in exceptions for the bits that didn’t fit that model. You’re never going to build a web scraper that works for a different site than the one you built it for; this project epitomized that limitation.
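A rough sketch of that workflow, working from locally saved pages, might look like the following; the file paths and markup are hypothetical, but the defensive checks mirror the “missing component” problem Darcy describes:

```python
# Sketch of parsing locally saved pages and keeping only fine arts items.
# The directory name and class names are invented for illustration.
from pathlib import Path
from bs4 import BeautifulSoup

for path in Path("saved_pages").glob("*.html"):
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")

    call_number = soup.find("span", class_="call-number")
    # Keep only the fine arts collection, whose call numbers start with F,
    # and skip pages where the component is missing entirely.
    if call_number is None or not call_number.get_text(strip=True).startswith("F"):
        continue

    description = soup.find("div", class_="description")
    # Some records lack a description, so fall back to an empty string.
    text = description.get_text(strip=True) if description else ""
    print(path.name, text[:60])
```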
What did you learn from the process?
I feel more confident in my web scraping skills. But next time I have to use Beautiful Soup I will have to go back to the documentation. It’s nicely organized; you can go to the part of the documentation that is relevant to what you’re trying to do, like going down the document tree or along a particular branch, looking at the children of a particular div. I had a vague recollection of how all of that worked, but for any individual task, I had to play around, looking at the source code of a particular page and testing whether a command would get me what I was looking for.
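A small example of that kind of tree navigation, using invented markup:

```python
# Walking the document tree: find a div, then look at its children.
# The HTML here is invented for demonstration.
from bs4 import BeautifulSoup

html = """
<div class="record">
  <h2>Title</h2>
  <p class="subject">Maps</p>
  <p class="subject">Rivers</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

record = soup.find("div", class_="record")
# .children walks one level down this branch of the tree.
for child in record.children:
    if child.name:  # skip the whitespace-only text nodes between tags
        print(child.name, child.get_text(strip=True))
```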
In this particular case, for our project, we knew that we had the ability to use all of this data, but were there any ethical questions you had going through the process related to web scraping in general?
I built in a fairly hefty sleep time after each pull. I started with 5 seconds between each request because I didn’t want to crash their website. When I realized it was going to take me 12 hours to download the data, I reduced it to 3 seconds. When we learned about web scraping, the advice was to sleep 0.2 seconds between requests so you don’t overload the server, but even me clicking through the site manually overloaded their server at one point. Server load was the main ethical issue, since we had permission to use their data. For something else, if I didn’t have permission to share something freely, I wouldn’t go scraping a collection website.
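A sketch of that rate-limited request loop; the URLs are placeholders, and the 3-second pause matches Darcy’s final setting:

```python
# Polite scraping loop: pause between requests so the server isn't
# overloaded. The URL pattern is a hypothetical stand-in.
import time
import requests

urls = [f"https://example.org/collections?page={n}" for n in range(1, 4)]

for url in urls:
    response = requests.get(url)
    # ...save or parse response.text here...
    time.sleep(3)  # a hefty sleep between pulls, as described above
```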
Okay, so you worked out your scripts; you ended up doing two of them, one for the main page of the collections site and one to pull data from each individual item’s page. And you dealt with many exceptions along the way, as you noticed shifts in the site structure. What’s next?
One thing I did after getting the data was process it into node and edge CSVs for use in Gephi (a tool for creating network visualizations), because I wanted to play with different things as nodes and different things as edges. I wrote a program to generate the CSVs from the data I had, because I didn’t want to generate them by hand every time. Then I started to think about weights [of network edges], learned a bit more about how Gephi deals with combining edges, and realized I really need to figure out weights within Python. It was useful for the moment in exploring how the different object types were connected by subject. I also worked on cleaning the descriptions and lemmatizing them, which were more things I had to learn exactly how to do. And I tried outputting a bunch of CSVs of different object types and put them into Voyant to look at them. Now I’m thinking through what might be a better way to explore and group the data, which took me to text analysis and understanding the collection’s themes through that.
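A sketch of how one might build a weighted edge list for Gephi in Python, with invented data; repeated object–subject pairs are combined into a single weighted edge, the behavior Darcy wanted to handle before import:

```python
# Building a weighted Gephi edge list: objects are source nodes, subjects
# are target nodes, and repeat pairs become the edge weight. Data invented.
import csv
from collections import Counter

items = [
    ("Book A", ["maps", "rivers"]),
    ("Book B", ["rivers"]),
    ("Book A", ["maps"]),  # repeat pair, so the Book A–maps edge gets weight 2
]

weights = Counter((obj, subj) for obj, subjects in items for subj in subjects)

with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])  # headers Gephi recognizes
    for (source, target), weight in weights.items():
        writer.writerow([source, target, weight])
```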
What did you learn from those explorations using digital tools?
Not every item in the CBA’s collection has a description or subject headings. I do want to represent missing data in my project as well, so those resources will still show up in some fashion. I thought subject headings might connect everything, but in my explorations thus far, there are a lot of little constellations and islands of subjects that didn’t have subject term connections to the rest of the collections.
Sometimes the description is the publisher’s description, sometimes it’s the curator’s description, sometimes it might be the artist’s description. Sometimes it’s a physical description, sometimes it’s a description of contents. We have various ways of parsing the collection, but whether the data we have is representative of the collection, and how I will parse it, remains to be seen!
In Closing… helpful questions to ask yourself when working on a similar project:
- Do I have permission to use this data?
- What data do I need to explore to complete my project?
- What is the structure of the webpage from which I am trying to pull information?
- How can I avoid overloading the server I am requesting data from?
- Are there other options from which I can source the data I need, without web scraping?
- How will the data I am collecting help me answer my underlying research questions?
- What are the limitations and strengths of the data I have access to?
- Who created the data I am using, and how did they collect it? What does that mean for my project?
Explore more of Darcy’s work
Learn more about the Center for Book Arts’ collections.
Explore the rest of the process blog.
Curious about any of the terms used on this page?
Check out our site index, which includes basic definitions of some of the concepts we reference.
Suggested Citation:
Maddy Casey and Darcy Krasne. “A Beautiful Soup of Data… Lessons from Web Scraping the CBA’s Digital Collections Catalog.” Books as Art, Art as Data, edited by Claudia Berger. Pratt Institute, 2024. https://studentwork.prattsi.org/bookarts/blog-post-2-darcy/