Introduction
Soccer is a global pastime, and the statistics available for the game span not only cities and countries but continents. The 2022 World Cup (the Men’s, though it is not officially labeled as such the way the Women’s is) is around the corner in November, and the dataset I found has just about everything you need to feel mentally prepared for the games to come. It provides an overview of all international men’s soccer matches since 1993 in almost twenty-four thousand rows, with information on home and away teams, the score, location, date, tournament status, and each team’s FIFA ranking at the time. The set’s many variables gave me a rich resource to pull from when constructing my visualizations in R, and the results aim to be as varied and interesting as the information itself.
Methods
Dataset
I chose this dataset from Kaggle, which seemed intimidating at first; my previous project data came from NYC Open Data, which read as professional and refined across the board (though I still struggled with the meaning of the data itself). I worried I would not have the expertise needed to judge whether whatever I got from Kaggle was reliable, but I lucked out and came across this soccer dataset early on while perusing the records recently added to the site.
Software
To complete this project, I used the R programming language in RStudio to produce statistical graphics. In addition to the in-class demonstration of the program, I leaned heavily on the R Graph Gallery resource provided to us, which ultimately supplied the templates for every image I produced from my chosen data.
Process and Reflections
After struggling to interpret my last lab’s data and subsequently stumbling through the visualization process, I was pleased to have a dataset for this project that both made sense to me and seemed to lend itself to a more straightforward, but still interesting, selection of graphs and charts. The dataset was also very clean and well constructed, so it was gratifying to be able to dive straight into making visualizations from its variables. After loading the tidyverse and ggplot2 packages in RStudio, I went right to the R Graph Gallery to find examples that I could see mapping well onto my soccer data. I readily admit I did not feel at all confident concocting my own visualizations from the project I had saved while following along in the lecture. I was hoping instead to find code that I could understand, where my edits could easily be plugged into the design for a given kind of graph and work without too much fiddling, and that would be neither too simplistic nor unclear in its objective.
Choosing graphs to emulate and work from involved a lot of trial and error, mainly in finding a visualization that actually made sense for the information it represented. My first visualization, which concerns the three possible home team results pulled out of the larger data (win, lose, or draw), became a pie chart. This made the most sense to me because there were so few categories, and it is straightforward for viewers to see that home field advantage does bear some truth, given that home wins make up almost fifty percent of the circle. After filtering the data for home team results, I used the code provided in the gallery’s most basic pie chart example, filling in my own variables.
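The pie chart step described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the `matches` data frame is a tiny invented stand-in for the Kaggle file, and the column name `home_team_result` and its values are assumptions about how the results are stored.

```r
library(ggplot2)
library(dplyr)

# Toy stand-in for the match data (invented rows, for illustration only).
# The column name home_team_result is an assumption.
matches <- data.frame(
  home_team_result = c("Win", "Win", "Win", "Draw", "Lose", "Win", "Lose", "Draw")
)

# Count how many matches ended in each home team result.
result_counts <- matches %>%
  count(home_team_result, name = "n_matches")

# In ggplot2, a pie chart is a single stacked bar drawn in polar coordinates.
pie_chart <- ggplot(result_counts, aes(x = "", y = n_matches, fill = home_team_result)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  theme_void() +
  labs(fill = "Home team result")

print(pie_chart)
```

Counting first and plotting second keeps the chart code short, since each slice corresponds to exactly one row of `result_counts`.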
I ended up rather proud of my second visualization, a box plot showing the goal distribution for all teams from South American countries. I added the fill line to the most basic box plot code from the R Graph Gallery to change the colors by team, and despite minor issues such as the overlapping x-axis text, I was pleased with how this element of variety stands out visually against my other work in both this lab and past projects.
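A box plot along these lines might look like the sketch below. The data frame is invented for illustration, and the column names `home_team`, `home_team_continent`, and `home_team_score` are assumptions rather than confirmed names from the dataset. The rotated axis text is one possible way to reduce the label overlap mentioned above, not something the original plot did.

```r
library(ggplot2)
library(dplyr)

# Toy stand-in for the match data (invented rows, for illustration only).
# All three column names are assumptions.
matches <- data.frame(
  home_team = c("Brazil", "Brazil", "Brazil",
                "Argentina", "Argentina", "Argentina",
                "France", "France"),
  home_team_continent = c(rep("South America", 6), rep("Europe", 2)),
  home_team_score = c(3, 1, 2, 0, 2, 4, 1, 1)
)

# Keep only matches where the home team is from South America.
south_america <- matches %>%
  filter(home_team_continent == "South America")

# fill = home_team gives each team's box its own color.
box_plot <- ggplot(south_america,
                   aes(x = home_team, y = home_team_score, fill = home_team)) +
  geom_boxplot() +
  labs(x = "Team", y = "Goals scored at home", fill = "Team") +
  # Angling the labels is one way to keep long team names from overlapping.
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(box_plot)
```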
I ran both visualizations past a friend outside of the program who uses R in her day job as an academic lab manager. She had plenty of criticisms and suggestions for me, some of which I took and some of which I decided against. Originally, not every axis and legend had its own title; the fix was not hard to find in the R Graph Gallery, and adding titles gave more context to each visualization. She also pointed out how helpful percentage labels on my pie chart sections could be, but that was something I simply could not figure out how to add.
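For the percentage labels, one possible approach (offered as a suggestion, not something done in this project) is to compute each slice's share before plotting and then place the text with `geom_text()`. The counts below are invented, and the column names `home_team_result` and `n_matches` are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Invented counts per home team result, for illustration only.
result_counts <- data.frame(
  home_team_result = c("Draw", "Lose", "Win"),
  n_matches = c(2, 2, 4)
) %>%
  mutate(
    share = n_matches / sum(n_matches),       # fraction of all matches
    label = paste0(round(100 * share), "%")   # text shown on each slice
  )

labelled_pie <- ggplot(result_counts,
                       aes(x = "", y = n_matches, fill = home_team_result)) +
  geom_bar(stat = "identity", width = 1) +
  # position_stack(vjust = 0.5) centers each label within its own slice.
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  coord_polar("y", start = 0) +
  theme_void() +
  labs(fill = "Home team result")

print(labelled_pie)
```

Because the labels use the same stacking as the bars, they stay aligned with their slices even if the order of the categories changes.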
While the R Graph Gallery was an incredible resource (I would not have gotten through this assignment so readily without it), it still took an immense amount of time just to get my bearings and make sure I was plugging the correct information into the correct place. As with other platforms we have touched on in this class, I wish there were more time to play around and get to know the software, so that the learning and exploration phases would not be so lumped together. Going forward, I do think this particular dataset is an excellent resource for trying out different visualization formats, and I would be interested in becoming more experimental in the kinds of charts I attempt. Given the constraints of the course timeline and my own trepidation at trying my hand at an entirely new programming language, this did not seem like the prime moment to learn about more out-there options such as doughnut charts or circular packing, but I can see this data working well in an experimental context.
References
Brenda_L. (2022). FIFA World Cup 2022 [Data set]. Kaggle. https://www.kaggle.com/datasets/brenda89/fifa-world-cup-2022