Introduction
In journalism, there is an old saying, “follow the money,” which holds that tracing capital reveals how systems are influenced. While it may sound like a metaphor, there is real value in analyzing financial records, especially in the area of U.S. elections. This project explores financial transactions from the 2020 election cycle across all candidates who ran for federal office, downloaded from opensecrets.org. Open Secrets is a nonprofit organization dedicated to compiling millions of campaign finance records from the Federal Election Commission into datasets. With such a vast amount of data covering every federal candidate, I figured it would be interesting to visualize, with the goal of emphasizing money’s influence on politics.
Inspiration
I’ve always been impressed by the visualizations produced during past election cycles. I wanted to try my hand at harnessing election data myself, but wasn’t too inspired by some of the data I came across in my research. Initially, I wanted to look at polling data and pull trends from polling averages across a variety of sources, but I figured this work had been done before and would be a massive undertaking to compile. In researching other sources, I came upon Open Secrets, which mainly focuses on finances tied to federal election candidates. I signed up for an account to download datasets from the bulk data download page and read the included documentation and data dictionaries to understand the fields better. After looking through the different datasets, the one that stuck out to me most covers contributions to candidates from Political Action Committees (PACs), which often account for the largest donations any given candidate receives.
Methodology and Process
Tools
For this project, I made use of the R programming language with help from RStudio, an integrated development environment that organizes project files, runs scripts, and creates R Markdown presentations to upload to the web. R is one of the leading tools for programming and visualization, especially in statistical settings. To supplement the default packages included in R, I also installed other useful packages to aid in visualization, such as tidyverse, viridis, and plotly.
Preparation
To get the datasets organized in a working format, I first had to upload them to OpenRefine to tidy up any anomalies and formatting issues. The data available for download from Open Secrets is in CSV format, but as shown in the image below, the values are wrapped in additional characters that need to be removed, and the files will only open as .txt files. Additionally, the PAC dataset contains around 890,000 rows, which I was sure could be reduced further.
Using OpenRefine, I removed any columns that were unnecessary for my analysis, and removed the vertical bars in columns that contained them. Because the data did not contain any specific identifiers for the candidates to whom donations were made, I also cleaned up the candidates dataset, with the intention of combining the two tables based on Candidate ID.
Once the data was exported in CSV format, I then took to creating my R project using the newly improved datasets.
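The same cleanup could plausibly be done in R itself rather than OpenRefine. The sketch below is only an illustration, assuming the vertical bars act as quote characters around comma-separated values; the sample rows and the column names (cycle, pac_id, cand_id, amount) are made up for the example and are not taken from the Open Secrets data dictionary.

```r
library(readr)

# Two made-up rows in the bar-wrapped style described above (not real records).
raw_lines <- c(
  "|2020|,|C00000001|,|N00000001|,5000",
  "|2020|,|C00000002|,|N00000002|,2500"
)

# quote = "|" asks readr to treat the vertical bars as quote characters,
# so they are stripped from the values on import.
pac_donations <- read_csv(
  I(paste(raw_lines, collapse = "\n")),
  quote = "|",
  # Hypothetical column names, chosen only for this example.
  col_names = c("cycle", "pac_id", "cand_id", "amount"),
  col_types = cols(.default = col_character(), amount = col_double())
)

print(pac_donations)
```

With a real file, the literal text would be replaced by the path to the downloaded .txt file, and unneeded columns could be dropped afterwards with select().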
Visualizing the Data
Although I have worked with joins in the past using SQL and Python, I had never experimented with manipulating datasets in R. To approach this, I did some research on what tools are available to accomplish this task in R. I decided on a full join of the two datasets, which keeps every row from both tables and matches them on Candidate ID, so that each row in the PAC contribution dataset (where the same ID can appear many times) is filled in with the proper candidate information from the candidates dataset. The function I used to do this was combined <- full_join(candidates_concat, pac_donations), where candidates_concat is a reduced version of the candidates dataset with unnecessary columns and records removed.
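To make the behavior of the full join concrete, here is a minimal sketch on toy data. The table names match the ones above, but the rows and the column names (cand_id, name, party, amount) are invented for illustration, and the explicit by = "cand_id" argument is an assumption about how the key column is named.

```r
library(dplyr)

# Toy candidates table: one row per candidate (invented values).
candidates_concat <- tibble(
  cand_id = c("N001", "N002", "N003"),
  name    = c("Candidate A", "Candidate B", "Candidate C"),
  party   = c("D", "R", "I")
)

# Toy PAC table: the same candidate ID can appear on many rows.
pac_donations <- tibble(
  cand_id = c("N001", "N001", "N002"),
  amount  = c(5000, 2500, 1000)
)

# A full join keeps all rows from both tables, matched on cand_id.
# Candidate C has no donations, so its amount is filled with NA.
combined <- full_join(candidates_concat, pac_donations, by = "cand_id")

print(combined)
```

Naming the key with by avoids relying on dplyr's default of joining on every column the two tables share, which can match on more columns than intended.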
I noticed that after combining the data, there were a few duplicate rows which could be eliminated. To do this, I transformed the data using combined <- distinct(combined) to remove any rows that exactly matched another. I then filtered the dataset to remove any candidates without PAC contributions for that election cycle, so that the result closely matched the original number of records in the PAC dataset.
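The two cleanup steps can be sketched as follows. This is a toy example with invented rows and hypothetical column names; in particular, filtering on a missing amount is just one plausible way to drop candidates without PAC contributions after a full join.

```r
library(dplyr)

# Toy joined data: one exact duplicate row and one candidate with no donations.
combined <- tibble(
  cand_id = c("N001", "N001", "N001", "N002", "N003"),
  party   = c("D", "D", "D", "R", "I"),
  amount  = c(5000, 5000, 2500, 1000, NA)
)

# distinct() keeps one copy of each fully identical row,
# so the repeated N001 / 5000 row is dropped.
combined <- distinct(combined)

# After a full join, candidates with no PAC rows have NA in the donation
# columns, so filtering on a non-missing amount removes them (N003 here).
with_pac <- filter(combined, !is.na(amount))

print(with_pac)
```

One caveat with distinct() is that it cannot tell an accidental duplicate from two genuinely identical donations, so it is safest when the rows still carry enough columns (such as a date or record ID) to tell real records apart.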
After creating a dataframe containing all the relevant info needed to create visualizations, I began to experiment with R tools to see what may be possible. I started with a simple geom_col() plot showing the total PAC contributions (in millions) to each party across all elections during the 2020 cycle.
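A plot along these lines might be built as in the sketch below. The data is invented, the column names (party, amount) are assumptions, and the viridis fill scale is simply one reasonable styling choice rather than a record of the original chart.

```r
library(dplyr)
library(ggplot2)

# Toy contribution rows (invented amounts, in dollars).
combined <- tibble(
  party  = c("D", "D", "R", "R", "I"),
  amount = c(3.2e6, 1.8e6, 2.5e6, 2.0e6, 0.1e6)
)

# Sum contributions per party and convert dollars to millions.
party_totals <- combined %>%
  group_by(party) %>%
  summarise(total_millions = sum(amount) / 1e6)

# geom_col() draws bar heights directly from the y values.
p <- ggplot(party_totals, aes(x = party, y = total_millions, fill = party)) +
  geom_col() +
  scale_fill_viridis_d() +
  labs(
    x = "Party",
    y = "Total PAC contributions (millions of dollars)",
    title = "PAC contributions by party (toy data)"
  )

print(p)
```

geom_col() fits here because the totals are computed beforehand; geom_bar() would instead count rows by default.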
While this visualization is hardly surprising, it is interesting to see exactly how much more money the Republican and Democratic parties accept from PACs compared to other parties. With a bit more practice using R, I was able to create a few more datasets from the original combined one, utilizing aggregate functions, mathematical mutations, and other useful functions within the tidyverse. Below are a few examples of the visualizations I’ve created; to view all, see the full markdown project on RPubs.
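As an illustration of the kind of aggregation and mutation involved, the sketch below derives a per-candidate summary and a running total over time from a single table. The rows and column names (candidate, date, amount) are invented, and a cumulative sum is only one plausible way to prepare a timeline-style dataset.

```r
library(dplyr)

# Toy donation records (invented values).
donations <- tibble(
  candidate = c("Candidate A", "Candidate A", "Candidate B",
                "Candidate A", "Candidate B"),
  date      = as.Date(c("2019-05-01", "2019-06-15", "2019-06-20",
                        "2019-08-01", "2019-09-10")),
  amount    = c(5000, 2500, 1000, 4000, 3000)
)

# Aggregate: one row per candidate with total raised and donation count.
totals <- donations %>%
  group_by(candidate) %>%
  summarise(total = sum(amount), n_donations = n())

# Mutate: running total per candidate in date order, suitable for a line chart.
timeline <- donations %>%
  arrange(candidate, date) %>%
  group_by(candidate) %>%
  mutate(cumulative = cumsum(amount)) %>%
  ungroup()

print(totals)
print(timeline)
```

The difference between the two is that summarise() collapses each group to a single row, while mutate() keeps every row and adds a column, which is why the timeline version still has one row per donation.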
Reflection
This project was an interesting one to explore, given my background in coding and visualization using other software and languages. Though I could have used Python to complete my research and analysis, this was an important exercise in diversifying my approaches to working with data. Though I was new to R, I found the language fairly intuitive given my existing knowledge of data structures. Learning a new language is not without its drawbacks, though: it often took some Google searches to understand how to use certain functions, which cost me time on a given task. With the help of feedback from my peer reviewer, I was able to revise some design choices, such as keeping colors consistent across similar visualizations (e.g. the winners/losers visualizations), as well as exploring plotly for more interactivity.
The visualizations I created, while not too surprising, provided insights that reinforce the influence of money in politics. The ones I found most insightful were the Democratic primary timeline and the two visualizations of PAC contributions to winning and losing candidates. In the Democratic primary timeline, Joe Biden received far more in PAC donations from the beginning of his campaign, and the total continued to trend upward until his eventual victory. The winners and losers timeline excludes presidential election donations to avoid outliers, yet it is still clear that the winners generally took in more PAC contributions than their competitors.
Overall, I enjoyed analyzing the data from Open Secrets, despite the disorganized format it was originally downloaded in. For future visualizations, I would love to explore some of the design-specific packages in R to increase the aesthetic value of my visualizations, and add more interactivity to my plots.