August 25, 2019 - All
For my project, I wanted to see how many therapists there are per state in the United States, to assess which states have the most access per 100,000 people and which have the least.

Writing the Script

First, I needed a way to iterate through each state’s listings of therapists on PsychologyToday.com using the BeautifulSoup Python package. My first attempt was to search for all therapists in a state and page through the results, but the regular search reshuffles its listings constantly, so there was no stable list to work through. I then discovered that the website also provides an alphabetical listing of therapists per state, with consistent URLs containing the state and the corresponding letter of the alphabet (for example, all therapists in Vermont whose last name starts with “r” are under the URL https://www.psychologytoday.com/us/therapists/profile-listings/vermont/r). After figuring this out, I created a Python list containing the URL for every state and letter combination. I left out any nonexistent URLs: if a state didn’t have, for example, any therapists whose last name starts with “q”, I did not include a URL for that letter for that state.

Running the Script

Once the list was finished, I checked that my Python script iterated through each page successfully by spot-checking random URLs in the list. Once I confirmed that it worked, I let the script run to scrape all of the info I needed for the project, a process that took about 4 or 5 days. The script aborted several times along the way, usually because my internet at home stopped working or because I had left out a comma or hyphen in a URL, but in the end I was able to produce 6 JSON files of data.

Cleaning the Data

Once I had all of my JSON files, I imported them into OpenRefine for some data cleanup.
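To make the URL-building step from the script section more concrete, here is a minimal sketch of how such a list could be generated rather than typed by hand. It is not my original script. The function name `build_listing_urls`, the `missing` dictionary, and the letters skipped for Vermont are all hypothetical, and it assumes that multi-word state names are lowercased and hyphenated in the URL.

```python
import string

# URL prefix taken from the Vermont example above.
BASE_URL = "https://www.psychologytoday.com/us/therapists/profile-listings"


def build_listing_urls(states, missing=None):
    """Return one listing URL per state and letter, skipping known-empty letters.

    `missing` maps a state slug to the set of letters that have no listing
    page for that state (hypothetical data, filled in by hand).
    """
    missing = missing or {}
    urls = []
    for state in states:
        # Assumption: multi-word states are lowercased and hyphenated.
        slug = state.lower().replace(" ", "-")
        skipped = missing.get(slug, set())
        for letter in string.ascii_lowercase:
            if letter not in skipped:
                urls.append(f"{BASE_URL}/{slug}/{letter}")
    return urls


# Example with made-up gaps: pretend Vermont has no "q" or "x" pages.
urls = build_listing_urls(
    ["Vermont", "New Hampshire"],
    missing={"vermont": {"q", "x"}},
)
```

Generating the list this way would also avoid the missing comma or hyphen typos that made the script abort, since every URL comes from the same template.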