Initially written October 12, 2024


As part of her contribution to our project, our team member Darcy is exploring the metadata of the CBA’s artists’ books collection and highlighting the topics it covers in a data physicalization. A few weeks into our project, I sat down with her to discuss one of the central challenges of her work: collecting the data.

Hi Darcy.  What is your project? What are you trying to do?

I want to understand the CBA’s collection of artists’ books. I think I’m doing this by sorting their fine arts collection into thematic groups. Originally I thought I would go by the object type field (which generations of CBA staff have assigned), but when I started actually looking at the early data I was collecting, that field was too messy. For instance, they have an “accordion books” object type, but there are also a bunch of accordion books that are just cataloged as books. That made me want to find a different way to understand the collection. So now I’m looking at the descriptions of the objects and the subjects or keywords that have been assigned to them. I’m using those fields to detect similarities between the objects and seeing whether they cluster in ways that let me create more meaningful groupings.
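
For readers curious what that clustering step can look like in code, here is a minimal sketch of one common approach, TF-IDF vectors plus k-means. It is not necessarily the method Darcy settled on, and the field names and sample records are made up for illustration.

```python
# A hypothetical sketch: cluster records by their description and subject
# fields. The schema and the sample data are illustrative, not the CBA's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

records = [
    {"title": "Example A", "description": "accordion-fold artist's book",
     "subjects": "bookbinding; paper"},
    {"title": "Example B", "description": "letterpress broadside",
     "subjects": "printing; typography"},
    # ... one dict per collection item
]

# Combine the free-text fields into one document per record.
docs = [f"{r['description']} {r['subjects']}" for r in records]

# TF-IDF downweights terms that appear in nearly every record.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Group the records into k thematic clusters (k chosen by inspection).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for record, label in zip(records, kmeans.labels_):
    print(label, record["title"])
```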

What is a big challenge you’ve had to overcome as you work on your project?

The challenge to overcome is sort of all of it, especially translating what I want to do in data terms into something physical: I want it to feel worthy of representing this collection of artists’ books. Along the way, there are things I already know how to do in Python, and things I am trying to work out as I go.

Can you tell me more about the web scraping you did? How are you going about learning how to use a web scraper?

I have been using the Beautiful Soup documentation, testing a little piece at a time, seeing what I can get to work, and then building it out into bigger pieces.
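
That incremental workflow might look something like the sketch below: fetch one page, parse it, and check whether a selector returns what you expect before building anything bigger. The URL and class name here are placeholders, not the CBA’s actual markup.

```python
# A small first test: does this selector find the item titles on one page?
# The URL and the .item-title class are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.org/collection?page=1").text
soup = BeautifulSoup(html, "html.parser")

for title in soup.select(".item-title"):
    print(title.get_text(strip=True))
```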

What resources have you been using as you approach the next steps of your project?

I can search Python library documentation all I want, but if I don’t know the name of what I’m looking for, it can be hard to find. As I work, if I have a coding issue, sometimes ChatGPT can point me to a function I didn’t know existed. For other things I’ve been reading a lot of esoteric, high-level articles about programming and text analysis. The challenge has been applying those principles to my project, as opposed to a test corpus a tutorial author created to demonstrate a tool. I keep trying to avoid using ChatGPT because of its environmental toll, and then end up using it anyway.

One of the issues with getting the data was that the CBA does not have an API. What else did you find challenging about this particular case?

The collections website was not consistently available, so I ended up downloading HTML files to work with on my computer, which wasn’t my initial plan. In the end, I went through every single page of the main collection listing and saved each piece of it, and then extracted only the items in the fine arts collection. Their call numbers start with an F, so I was able to easily identify that part of the collection, which left me a smaller corpus to work with. Another challenge was that, even though they use a content management system, there were all sorts of inconsistencies in the structures of the web pages. Every so often a phrase would be missing, or one component of a page would be missing, random things. So each time, I had to go back and modify the code so that it continued to work with everything I had been doing but had built-in exceptions for the bits that didn’t fit the model. You’re never going to build a web scraper that works for a different site than the one you built it for; this project epitomized that limitation.
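
Here is a rough sketch of those two workarounds, working from saved local HTML files: keep only records whose call number starts with “F,” and guard against components that are sometimes missing. The file layout and class names are stand-ins, not the site’s real structure.

```python
# Hypothetical sketch: filter saved pages to the fine arts collection and
# tolerate missing page components. Class names are placeholders.
from pathlib import Path
from bs4 import BeautifulSoup

for path in Path("saved_pages").glob("*.html"):
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")

    call_number = soup.find(class_="call-number")
    # Skip anything outside the fine arts collection (call numbers start with F).
    if call_number is None or not call_number.get_text(strip=True).startswith("F"):
        continue

    # Fields that are present on most pages but occasionally missing.
    description = soup.find(class_="description")
    subjects = soup.find(class_="subjects")
    print(
        call_number.get_text(strip=True),
        description.get_text(strip=True) if description else "",
        subjects.get_text(strip=True) if subjects else "",
    )
```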

What did you learn from the process?

I feel more confident in my web scraping skills, but the next time I have to use Beautiful Soup, I will have to go back to its documentation. It’s nicely organized; you can go straight to the part that is relevant to what you’re trying to do, like going down the document tree or along a particular branch, looking at the children of a particular div. I had a vague recollection of how all of that worked, but for any individual task, I had to play around, looking at the source code of a particular page and testing whether a command would get me what I was looking for.
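
In Beautiful Soup terms, that tree navigation looks something like this; the markup here is a made-up stand-in for a collection page.

```python
# Descend to a particular div and look at its direct children.
# The HTML snippet is illustrative.
from bs4 import BeautifulSoup

html = "<div class='record'><span>Title</span><span>Date</span></div>"
soup = BeautifulSoup(html, "html.parser")

record = soup.find("div", class_="record")
for child in record.children:  # .children yields the div's direct children
    print(child.name, child.get_text(strip=True))
```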

In this particular case, we knew we had permission to use all of this data, but did any ethical questions come up for you about web scraping in general?

I built in a fairly hefty sleep time after each pull. I started with five seconds between each request because I didn’t want to crash their website, but when I realized it was going to take twelve hours to download the data, I reduced it to three seconds. When we learned about web scraping, the advice was a 0.2-second sleep between requests so you don’t overload the server… even me pulling pages manually overloaded their server at a certain point. That was the main issue, because we had permission to use their data. For anything else, if I didn’t have permission to share something freely, I wouldn’t go scraping a collection website.
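
The pacing Darcy describes amounts to a fixed pause after every request, something like the loop below; the URLs are placeholders.

```python
# Sketch of polite request pacing: sleep after each pull so the scraper
# stays well below what the server can handle. URLs are hypothetical.
import time
import requests

item_urls = ["https://example.org/item/1", "https://example.org/item/2"]

for url in item_urls:
    response = requests.get(url)
    response.raise_for_status()
    # ... parse and save response.text here ...
    time.sleep(3)  # the three-second compromise described above
```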

Okay, so you worked out your scripts. You ended up writing two of them: one for the main page of the collections site, and one to pull data from each individual item’s page. And you dealt with many exceptions along the way, when you noticed shifts in the site structure. What’s next? What did you learn from those explorations using digital tools?

Not every item in the CBA’s collection has a description or subject headings. I do want to represent missing data in my project as well, so those resources will still show up in some fashion. I thought subject headings might connect everything, but in my explorations so far, there are a lot of little constellations and islands that don’t have subject-term connections to the rest of the collection.

Sometimes the description is the publisher’s description, sometimes it’s the curator’s, and sometimes it might be the artist’s. Sometimes it’s a physical description; sometimes it’s a description of the contents. We have various ways of parsing the collection, but whether the data we have is representative of the collection, and how I will parse it, remains to be seen!
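
One way to make those constellations and islands concrete is to treat any two items that share a subject heading as connected and then look at the connected components of the resulting graph. This is an illustrative sketch with made-up data, not Darcy’s actual analysis.

```python
# Hypothetical sketch: items sharing a subject heading become linked nodes;
# connected components are the "constellations," singletons the "islands."
import networkx as nx

item_subjects = {
    "item1": {"bookbinding", "paper"},
    "item2": {"paper", "typography"},
    "item3": {"photography"},  # an island: shares no subjects
}

graph = nx.Graph()
graph.add_nodes_from(item_subjects)
items = list(item_subjects)
for i, a in enumerate(items):
    for b in items[i + 1:]:
        if item_subjects[a] & item_subjects[b]:
            graph.add_edge(a, b)

for component in nx.connected_components(graph):
    print(sorted(component))
```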

In Closing… helpful questions to ask yourself when working on a similar project:

  • Do I have permission to use this data?
  • What data do I need to explore to complete my project?  
  • What is the structure of the webpage from which I am trying to pull information?
  • How can I avoid overloading the server I am requesting data from?  
  • Are there other options from which I can source the data I need, without web scraping?
  • How will the data I am collecting help me answer my underlying research questions? 
  • What are the limitations and strengths of the data I have access to? 
  • Who created the data I am using, and how did they collect it?  What does that mean for my project?
