Flight delays are a common problem for travelers and airlines alike, although they affect each group in different ways. They are an immense source of frustration, so it is natural to look for a way to foresee these delays in order to avoid them. This lab report outlines the process of visualizing recent data on delayed flights, in order to gain some insight into what a traveler can do to avoid a delayed flight, based on a few broad factors.
The dataset used for these visualizations, Airlines delay, was obtained from Kaggle, an online source for public datasets. It contains data about domestic flights in the US in January 2023: their travel times, starting points, destinations and the days on which they traveled.
R is an open-source programming language and software environment used for data exploration, data analysis and data visualization. I used it to clean up and refine the dataset before visualizing it, and to create the visualizations themselves, primarily using the ggplot2 package.
I downloaded the dataset as a CSV file and loaded it into R to examine the data and check for any problems. At first glance, several elements of the original dataset posed potential problems for visualization. For example, the flight starting points and destinations were given as IATA (International Air Transport Association) codes, not all of which are familiar to most people. Similarly, the days of the week on which the flights took place were coded as numbers, with ‘1’ representing ‘Sunday’ and so on. The original dataset also contained details for all flights that occurred during the period of data collection, which meant that flights that were not delayed were included as well. The column ‘Class’, containing the values ‘0’ and ‘1’, indicated whether or not a flight was delayed, with ‘0’ meaning that the flight had not been delayed and ‘1’ meaning that it had. To make these elements better suited to visualization, I replaced all IATA codes with the names of the airports, replaced the codes for the days of the week with the actual names of the days, and moved all delayed flights (those with ‘Class’ = ‘1’) to a separate data frame, since my main focus was on flights that had been delayed and the rest of the data would not be useful in that context.
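The cleaning steps described above can be sketched in base R roughly as follows. This is a minimal illustration, not the exact script used in the lab: the toy rows are made up, the column names `Airline`, `AirportFrom` and `DayOfWeek` are assumptions (only ‘Class’ is named in this report), and the airport lookup table is a hypothetical, partial example.

```r
# Toy stand-in for the downloaded CSV (made-up rows for illustration only).
# In practice this would be something like: flights <- read.csv("<path to CSV>")
# Column names other than Class are assumptions.
flights <- data.frame(
  Airline     = c("Southwest Airlines", "Southwest Airlines", "American Airlines",
                  "Delta Air Lines", "American Airlines", "Southwest Airlines"),
  AirportFrom = c("DAL", "HOU", "DFW", "ATL", "ORD", "MDW"),
  DayOfWeek   = c(1, 3, 3, 4, 7, 6),
  Class       = c(1, 0, 1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical, partial lookup table from IATA code to airport name
airport_names <- c(
  DAL = "Dallas Love Field",
  HOU = "William P. Hobby",
  DFW = "Dallas/Fort Worth International",
  ATL = "Hartsfield-Jackson Atlanta International",
  ORD = "Chicago O'Hare International",
  MDW = "Chicago Midway International"
)

# Replace IATA codes with airport names via the lookup table
flights$AirportFromName <- unname(airport_names[as.character(flights$AirportFrom)])

# Replace day codes with day names (1 = Sunday, as described in the report);
# a factor with explicit levels keeps the days in calendar order in later plots
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
flights$DayName <- factor(day_names[flights$DayOfWeek], levels = day_names)

# Keep only delayed flights (Class == 1) in a separate data frame
delayed <- flights[flights$Class == 1, ]
```

Storing the day names as a factor rather than plain text is a design choice: otherwise the days would be sorted alphabetically on a plot axis.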
Once I had the cleaned dataset with relevant values, I started the visualization process. Since I was mainly focusing on the number of delays, I felt that counting the occurrences of whichever category I was examining in relation to delays would be the simplest way to convey this information. R's plotting functions do this counting automatically for the most part, which was advantageous. I felt that most of the questions I was trying to answer with this dataset were best represented by bar charts or histograms, so I have primarily focused on these while experimenting with incorporating other dimensions into the visualizations through color. The first question I was trying to answer was which airlines faced a significantly higher number of delays than others. My preferred visualization for this question was a bar chart, though I also attempted a scatterplot, which was not successful.
Although this visualization did convey the information it was meant to, it did not seem to be the most impactful option. The following histogram was created as an alternative.
This visualization seemed to be much more impactful, and clearly conveys that Southwest Airlines has faced the most delays. In order to enhance it visually, I added color to this base visualization.
As an additional thought, I wanted to see whether I could incorporate the days on which these flights were delayed into this visualization through color, which resulted in the following visualization.
This did not seem to be the best way to visualize this aspect, since the differences in the amount of each color were not pronounced enough.
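The airline bar charts discussed above could be produced along the following lines with ggplot2. This is a sketch under assumptions rather than the code behind the original figures: the data frame `delayed` and its columns `Airline` and `DayName` are illustrative names, and the rows are made up. The dodged bars in the second plot are one possible alternative to stacked colors, not something tried in the lab.

```r
library(ggplot2)

# Made-up delayed-flight rows for illustration; column names are assumptions
delayed <- data.frame(
  Airline = c("Southwest Airlines", "Southwest Airlines", "American Airlines",
              "Delta Air Lines", "Southwest Airlines"),
  DayName = c("Tuesday", "Wednesday", "Tuesday", "Friday", "Tuesday")
)

# Number of delayed flights per airline, computed directly as a check
airline_counts <- table(delayed$Airline)

# geom_bar() with only an x aesthetic counts the rows in each category,
# so no manual tallying is needed before plotting
p_airline <- ggplot(delayed, aes(x = Airline)) +
  geom_bar(fill = "steelblue") +
  coord_flip() +  # horizontal bars keep long airline names readable
  labs(x = "Airline", y = "Number of delayed flights")

# Adding day of the week through fill color; position = "dodge" places the
# days side by side instead of stacking them, which may make small
# differences between days easier to compare
p_airline_day <- ggplot(delayed, aes(x = Airline, fill = DayName)) +
  geom_bar(position = "dodge") +
  coord_flip() +
  labs(x = "Airline", y = "Number of delayed flights", fill = "Day of week")
```

Printing `p_airline` or `p_airline_day` at the console draws the corresponding chart.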
The next question I wanted to answer was whether there were any particular days on which there was a higher number of delays. The following bar chart shows this information, and is colored according to the day of the week for visual appeal.
This visualization clearly showed that Tuesdays saw the highest number of delays overall, closely followed by Wednesdays. Fridays saw the fewest delays, and the other days saw a similar number of delays for the most part.
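A chart of delays per day of the week, like the one described above, might be built as in the sketch below. The rows are made up and the column name `DayName` is an assumption. Here the counts are computed explicitly first, so they can be inspected as a table before plotting.

```r
library(ggplot2)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Made-up delayed-flight rows; the factor levels fix the calendar order
delayed <- data.frame(
  DayName = factor(c("Tuesday", "Tuesday", "Wednesday", "Friday",
                     "Tuesday", "Wednesday"),
                   levels = day_names)
)

# Count delays per day; days with no delays still appear with a count of 0
day_counts <- as.data.frame(table(DayName = delayed$DayName))

# geom_col() plots the precomputed counts; the legend is hidden because the
# x axis already names each day
p_days <- ggplot(day_counts, aes(x = DayName, y = Freq, fill = DayName)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Day of week", y = "Number of delayed flights")
```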
To relate this information to airline names, I tried a line graph. It did an adequate job of conveying the information, but wasn’t visually appealing enough to be a good final visualization.
I then tried incorporating the airline names into this visualization through color, resulting in the following visualization.
This visualization was slightly more readable than the previous visualization in which I had tried to bring these three aspects together, in that there is a clear indication that Southwest Airlines has faced the most delays across all days. Apart from this, other distinctions are not very clear.
Lastly, I attempted to find out how many hours these airlines were flying, to see if there was any apparent correlation between the hours flown and the number of delays. The following visualization incorporated hours flown by different airlines through a histogram.
This visualization does clearly show that Southwest Airlines has flown the most hours, which could be the reason why it also has the most delays recorded. This graph could be analyzed for further insights about other airlines, but that seems a bit difficult. Feedback from my lab partner suggested that a pie chart might be a more readable option than a histogram, though even that might still have been a bit complex given that a lot of airlines were included in this dataset.
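One way to look at the relationship between hours flown and delays more directly is to total both per airline and plot one against the other. The sketch below is an illustration only: the rows are made up, and the column name `Length` and its unit (minutes) are assumptions about the dataset rather than something confirmed in this report.

```r
library(ggplot2)

# Made-up flights; Length is assumed to be the flight duration in minutes
flights <- data.frame(
  Airline = c("Southwest Airlines", "Southwest Airlines", "Southwest Airlines",
              "American Airlines", "American Airlines", "Delta Air Lines"),
  Length  = c(120, 90, 150, 180, 60, 240),
  Class   = c(1, 1, 0, 1, 0, 0)
)

# Total minutes flown and total delayed flights per airline
# (summing Class works because delayed flights are coded as 1)
per_airline <- aggregate(cbind(Length, Class) ~ Airline, data = flights, FUN = sum)
names(per_airline) <- c("Airline", "Minutes", "Delays")
per_airline$Hours <- per_airline$Minutes / 60

# One point per airline: hours flown against number of delays
p_hours <- ggplot(per_airline, aes(x = Hours, y = Delays, label = Airline)) +
  geom_point() +
  geom_text(vjust = -0.8) +
  labs(x = "Total hours flown", y = "Number of delayed flights")
```

With one point per airline, a pattern in which airlines that fly more hours also record more delays would show up as points rising from left to right.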
I tried a similar visualization combining the days of the week and the hours flown to see if there was a specific day on which more flights were occurring, resulting in the following visualization.
This visualization seems to convey that the most hours were flown on Thursdays. However, as with the previous visualization, it is difficult to read anything further from it.
Through this lab session, I was able to find answers to some interesting questions that are also very relatable to the general population at this time. This was one of my initial experiences with R, though not the first, and I feel that working with it more extensively may help me learn better ways to visualize information using the language, and therefore achieve better final visualizations. There were some instances during this process where I would have liked to do something different, but lacked the appropriate expertise. I would like to analyze this information further using R, in order to see if any predictions regarding flight delays can be made with this data. I also noticed that working with this amount of data caused R to shut down at times, and I would like to learn more about ways in which data with this many instances can be handled. All in all, it was a great learning experience, which also made me realize that there is still quite a long way to go!