As an avid fan of both baseball and the statistics that underpin the study of its history, I thought it would be interesting to examine some of those historical statistics visually. In particular, I want to see how the prevalence of the home run has changed over time, and how the profile of home run hitters may have changed across baseball eras. We have only recently moved beyond what is commonly known as the “Steroid Era” of baseball, when the use of performance-enhancing drugs supposedly enabled many ballplayers to far exceed the home run output of previous generations, and to do so at ages when a player’s career was traditionally nearing its end. To gain some perspective on this topic, I decided to plot how different age groups have performed relative to one another over baseball history.
In contemplating a visualization to represent this, I took note of one study that followed a similar premise. Seeking to analyze the differences between baseball eras, the group performing this study created several scatterplots representing top-10 performances in home runs, runs batted in, and on-base percentage for each year between 1900 and 2014. The eras are differentiated by differently colored points.
This is visually quite accessible. However, I also considered another method of articulating historical eras: using “regimes,” or bands of color, to represent the different eras. I noted an example of this format at this data visualization blog, and this kind of banding appears to be an effective way of representing eras on a line graph.
To add further context, I thought it might be useful to include some significant home run achievements from baseball history. For this I considered another example, a baseball visualization created by Fangraphs, which I found particularly effective at demarcating both individual regimes and significant events along the timeline.
To create my visualization, I began by retrieving a CSV data table from the Fangraphs website (www.fangraphs.com). Fangraphs allows users to customize data reports for export, and I used this feature to generate a table listing the total home runs, plate appearances, and age of each player in each Major League Baseball season between 1871 and the season currently underway, 2016. I then modified the table in Excel, adding a column for home runs per plate appearance (HR/PA), calculated for each row as home runs divided by plate appearances. I imported this CSV into Tableau Public 9.0, which I used to create the visualization.
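For readers who would rather script the HR/PA step than do it in Excel, a minimal Python sketch might look like the following. It is not how I built the table. The column names (`Season`, `Name`, `Age`, `HR`, `PA`) are assumptions about the export layout, and the sample rows are made up for illustration.

```python
import csv
import io


def add_hr_per_pa(rows):
    """Return copies of the rows with an added 'HR/PA' field."""
    result = []
    for row in rows:
        hr = int(row["HR"])
        pa = int(row["PA"])
        new_row = dict(row)
        # Guard against division by zero for players with no plate appearances.
        new_row["HR/PA"] = hr / pa if pa > 0 else None
        result.append(new_row)
    return result


# Made-up sample rows standing in for the exported table
# (column names are assumed, not taken from the real export).
SAMPLE_CSV = """Season,Name,Age,HR,PA
2000,Player A,31,30,600
2000,Player B,23,0,0
"""

rows = add_hr_per_pa(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    print(row["Name"], row["HR/PA"])
```

To run this against a real export, the `io.StringIO` object would be replaced with an opened file.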
Rather than use home run totals for top players, as in the study above, I decided to look at the rate of home run production across the league, as represented by HR/PA. This should present a more generalized picture of the home run environment throughout history, one that is less skewed by exceptional individual players or by historical differences in the length of baseball seasons. Because my analysis reports an average of a rate (HR/PA), I limited my data set to “qualified” players in order to avoid distortions caused by players with very few plate appearances. To qualify, a player must accrue at least 3.1 plate appearances per team game in a season, which essentially restricts this data set to full-time players.
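To illustrate why the qualification cutoff matters when averaging a rate, here is a small sketch under stated assumptions. The function names are hypothetical, the number of team games is passed in by the caller because season length has varied, and rounding the threshold to the nearest whole number is my assumption. The player figures are invented.

```python
def qualifying_pa(team_games, pa_per_game=3.1):
    """Plate appearances needed to qualify, given team games played.

    Rounding to the nearest whole number is an assumption of this sketch.
    """
    return round(pa_per_game * team_games)


def is_qualified(pa, team_games):
    """True if a player has enough plate appearances to qualify."""
    return pa >= qualifying_pa(team_games)


def mean_rate_of_qualified(players, team_games):
    """Average HR/PA over qualified players only.

    players is a list of (home_runs, plate_appearances) pairs.
    Returns None when no player qualifies.
    """
    rates = [hr / pa for hr, pa in players if is_qualified(pa, team_games)]
    return sum(rates) / len(rates) if rates else None


# Invented example: two full-time players and one with only 20 PA.
players = [(30, 600), (12, 600), (5, 20)]
league_rate = mean_rate_of_qualified(players, team_games=162)
print(league_rate)
```

Without the filter, the third player's 5 home runs in 20 plate appearances would count as much as a full season from either of the others and pull the average sharply upward.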
I plotted average HR/PA rates in two separate line graphs, one representing the overall average from season to season, and one representing the average rate within four separate age groups. The age groups I selected represent what is typically considered a player’s career prime (25-29), a player’s post-prime (30-34), as well as very young players (under 25) and players at the end of their careers (35+). For the age group graph, I reported a 5-year running average, including the given year and two years on either side. This smoothed out some of the year-to-year variance within each age group, making it easier to compare age groups.
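The grouping and smoothing described above can be sketched in a few lines of Python. This is an illustration only, not the Tableau calculation itself. The function names are hypothetical, and one behavior is my assumption: at the start and end of the series, where fewer than five years exist, the sketch averages over whichever years are available. The sample values are arbitrary.

```python
def age_group(age):
    """Map a player's age to one of the four age groups."""
    if age < 25:
        return "Under 25"
    if age <= 29:
        return "25-29"
    if age <= 34:
        return "30-34"
    return "35+"


def centered_running_average(values_by_year, half_window=2):
    """Average each year with up to half_window years on either side.

    values_by_year maps year -> value. Years missing from the mapping
    (including those past either end) are simply left out of the window.
    """
    smoothed = {}
    for year in values_by_year:
        window = [
            values_by_year[y]
            for y in range(year - half_window, year + half_window + 1)
            if y in values_by_year
        ]
        smoothed[year] = sum(window) / len(window)
    return smoothed


# Arbitrary values standing in for one age group's yearly average rate.
rates = {2000: 1.0, 2001: 2.0, 2002: 3.0, 2003: 4.0, 2004: 5.0}
smoothed = centered_running_average(rates)
print(age_group(27), smoothed[2002])
```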
After these two graphs, I included a third line graph to indicate what percentage of the player population each age group made up in each year.
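As a rough sketch of how that share could be computed outside Tableau, the following counts player-seasons per age group and converts the counts to percentages of each season's total. The function name and the input shape, a list of (season, age group) pairs, are assumptions for illustration, and the sample data is invented.

```python
from collections import Counter, defaultdict


def age_group_shares(player_seasons):
    """Return {season: {group: percent of that season's players}}.

    player_seasons is an iterable of (season, age_group_label) pairs,
    one pair per player per season.
    """
    counts = defaultdict(Counter)
    for season, group in player_seasons:
        counts[season][group] += 1

    shares = {}
    for season, group_counts in counts.items():
        total = sum(group_counts.values())
        shares[season] = {
            group: 100 * n / total for group, n in group_counts.items()
        }
    return shares


# Invented example: four players in a single season.
player_seasons = [
    (2000, "Under 25"),
    (2000, "25-29"),
    (2000, "25-29"),
    (2000, "35+"),
]
shares = age_group_shares(player_seasons)
print(shares[2000])
```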
I then added bands to each graph representing generally agreed-upon baseball eras (derived from the above study, as well as from this site), along with several lines demarcating significant single-season home run achievements.
The results suggest a number of storylines. Most obviously, there seems to be a visible correlation between different baseball eras and changing trends in home run rate, with the overall trend being toward more home runs. It does not come as a huge surprise that home run rates correlate with eras, as those eras are rather arbitrarily defined, and one of the major predictors of their definition is a shift in home run production. To a large extent, then, this is a self-fulfilling prophecy, though it is gratifying to see it illustrated so clearly here, and it is still interesting to observe the overall changes from era to era.
It is also interesting to note that, until World War II, younger players, particularly those under 25, showed noticeably stronger home run rates than older players, whereas after World War II this gradually flips, with players under 25 consistently showing the lowest home run rates in the 2000s. Prior to the Live Ball Era, this may partly be because home runs reflected a player’s speed as much as the ability to hit the ball over the fence, and speed is generally considered a young person’s skill. One could speculate that the recent reversal may be due to better physical training available for more experienced players, as well as the effects of drug use in preserving and enhancing their strength and health.
Finally, I note several spikes in the relative production of players 35+ compared to other age groups. Because the 35+ group is consistently the smallest segment among the player population, I would speculate that this is due to a small number of extraordinary individual players entering that age group at the end of their careers, thereby ballooning the average rates for the group as a whole.
Based on some of these observations, it might be interesting to develop this visualization further, and include some interactive features. For one thing, the question of the impact of individual players on overall trends could be made explorable by developing a filter to add individual player rates in addition to the group rates displayed here. It would be interesting to see how Hank Aaron’s home run production compares to his contemporaries’ production, especially as he progressed through different age ranges.
Furthermore, it would be interesting to extend this visualization to other statistical categories, to look for shifts similar to those observed here. An effective display would probably be limited to one statistical category at a time, but baseball has many to choose from, and this could make for a very interesting tool to peruse.