Baseball Stats through the Ages: An Interactive Visualization


Visualization

Introduction

In this project, I have set out to expand on a previous information visualization, in which I sought to contextualize home run production in Major League Baseball (MLB) both in terms of player age and historical era.

The visualization that started it all...

The original visualization got me thinking that it would be interesting to apply a similar methodology to additional baseball statistics, and to explore how individual players compare with average MLB numbers.  This seemed a natural fit for an interactive visualization, and I therefore set out to create one that is thorough, understandable, engaging, and enjoyable.  To advance this goal, I have applied best practices of information visualization and used user testing to refine the final product.

Methodology

To begin the project, I developed a prototype to enable user testing.  This prototype contained most of the features I intended to include in the final visualization, but was limited to three statistics that a user could choose from.  I intended to use significantly more statistics in the final version, but the time involved in data preparation made this impractical for the prototype, since user feedback was anticipated to result in significant changes.  Furthermore, the selection of statistics to include was also expected to be heavily influenced by user feedback.

Visualization Prototype

Construction of the prototype and the final version followed more or less the same process.  First, data were collected from the data export feature at Fangraphs.  These data consisted of player names, ages, and specific statistics for every season since 1871.  The prototype included statistics for 2016, while the final version did not, as the current season is incomplete and not ideal for comparison.  For the prototype, I also had to do some manipulation of the retrieved data, as I was using non-standard statistics.

After confirming that the retrieved data were adequate and appropriate for the intended visualization, the data table was converted from crosstab data to normalized data in Google Refine, to allow for effective processing in Tableau, the visualization software used to create the prototype and final product.
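The reshaping itself was done in Google Refine, but the idea can be illustrated with a minimal Python sketch.  The column names (`Name`, `Season`, `Age`, `HR`, `SB`) and the sample rows are assumptions made up for illustration; they are not the actual Fangraphs export.

```python
# Hypothetical wide ("crosstab") rows: one row per player-season,
# one column per statistic. All values are invented for illustration.
wide_rows = [
    {"Name": "Player A", "Season": 1920, "Age": 25, "HR": 10, "SB": 4},
    {"Name": "Player B", "Season": 1920, "Age": 31, "HR": 3, "SB": 12},
]

# Columns that identify a player-season rather than hold a statistic.
ID_COLUMNS = ("Name", "Season", "Age")


def normalize(rows, id_columns=ID_COLUMNS):
    """Turn every statistic column into its own (Stat, Value) row."""
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_columns}
        for col, value in row.items():
            if col in id_columns:
                continue
            long_rows.append({**ids, "Stat": col, "Value": value})
    return long_rows


long_rows = normalize(wide_rows)
for r in long_rows:
    print(r)
```

With one row per player, season, and statistic, a single "Stat" field can drive a filter, which is what makes this shape convenient for a tool like Tableau.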

As with the original visualization, the data were used to create three stacked line graphs, representing league average production, average production within age groups, and age groups as a percentage of the total player population.  To the first graph, showing the league average, I also added individual player plots.  Interactive filters were added to the dashboard to control the statistics displayed and the players included.
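The three quantities behind the graphs can be sketched in a few lines of Python.  This is only an illustration of the aggregations, not the Tableau calculations themselves; the record layout and the sample values are assumptions.

```python
from collections import defaultdict

# Hypothetical records: (season, age_group, stat_value), one per player-season.
records = [
    (1920, "under 25", 4),
    (1920, "25-29", 10),
    (1920, "25-29", 7),
    (1921, "30-34", 8),
    (1921, "25-29", 12),
]


def mean(values):
    return sum(values) / len(values)


def summarize(records):
    """Return league average, age-group average, and age-group share."""
    by_season = defaultdict(list)
    by_season_group = defaultdict(list)
    for season, group, value in records:
        by_season[season].append(value)
        by_season_group[(season, group)].append(value)

    # Graph 1: average over all players in a season.
    league_avg = {s: mean(v) for s, v in by_season.items()}
    # Graph 2: average within each age group in a season.
    group_avg = {k: mean(v) for k, v in by_season_group.items()}
    # Graph 3: each age group's share of that season's player population.
    group_share = {
        (s, g): len(v) / len(by_season[s])
        for (s, g), v in by_season_group.items()
    }
    return league_avg, group_avg, group_share


league_avg, group_avg, group_share = summarize(records)
```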

Once the prototype was constructed, it was subjected to user testing.  Five test subjects were selected, with the intent of sampling self-identified baseball fans with some degree of interest in baseball statistics.  The sample included four males and one female ranging in age from 39 to 80 years old, with an average age of 48.  Ideally, the sample would have included younger users as well, but this sample was able to offer consistent and knowledgeable feedback.  Asked to rate their level of interest in engaging with baseball statistics, the users averaged 7.2 on a scale of 1 to 10 (1 being “not at all interested” and 10 being “very much interested”).  All users rated 5 or higher.

Tests were conducted using phone interviews.  Each interview began with an explanation of the purpose of the test, and how it would be conducted.  The users were then asked for basic background information, as summarized in the previous paragraph.  Users were then provided a link to access the prototype on their personal computers, and they were instructed to use the “Think Aloud” method to narrate their experience of interacting with the site.  It was indicated to them that the prototype did not contain the complete set of statistics that would be included in the final version.  The interviewer took notes, and occasionally prompted users to explain what they were doing, why they were doing it, and to elaborate on particular statements.  At the end of their exploration, users were asked what additional statistics they would be interested in seeing, and if they would like to see additional features.  User interviews lasted between 20 and 40 minutes.

Findings from user testing were then incorporated into the final visualization, which is displayed below and available here.

Final Visualization

Visualization Design

As this visualization focuses on time-series data, and hopes to reveal changing patterns over time, line graphs were deemed appropriate for all the information presented.  User testing confirmed there was no difficulty interpreting the information thus presented.  In setting the axes for the graphs, it was decided the y-axis would not be fixed to start at zero, but would rather be set automatically by the Tableau software.  Although this has the potential to exaggerate the magnitude of differences in the data, the point of this visualization is to identify these differences over time.  Exaggerated differences were therefore deemed preferable to minimized differences.

A color gradient was used to segment the data into four age groups: under 25, 25-29, 30-34, and 35+.  A blue-to-red gradient was chosen, as it provided each group with a color that was distinguishable from the other groups.  Users reported no difficulties in distinguishing the different colors, and appeared to interpret them correctly almost immediately.  The color gradient was used both to identify the age groups represented by the lines in the bottom two graphs, and to indicate an individual player’s transition through age groups in the top graph.
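As a small illustration of the four-way split, the grouping can be written as a lookup against the group boundaries.  The function name and the use of `bisect` are my own choices for this sketch; in the visualization itself the grouping is handled in Tableau.

```python
from bisect import bisect_right

# Lower bounds of the second, third, and fourth age groups.
BOUNDARIES = [25, 30, 35]
LABELS = ["under 25", "25-29", "30-34", "35+"]


def age_group(age):
    """Map a player's age in years to one of the four age-group labels."""
    return LABELS[bisect_right(BOUNDARIES, age)]


for sample_age in (24, 25, 29, 30, 34, 35):
    print(sample_age, age_group(sample_age))
```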

In the prototype, the color gradient was also applied to the overall MLB average line in the top graph, corresponding to the average age of all MLB players.  However, this was found to be confusing, as it seemed to imply that the statistics represented by that line were being produced by a specific age group, not all players.  Therefore, it was determined that the overall average should not have an age applied to it.

For the second graph, a 5-year moving average was employed, rather than plotting singular years of data.  This moving average includes two years before and two years after a given data point.  The decision to use a moving average was due to the year-to-year variability of the statistics within each age group.  This variability created a degree of spikiness in the graph that made it very difficult to distinguish one age group from another.  The smoother lines created by the moving average show clearer distinctions between age groups, without losing the meaning of the underlying data.
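A centered 5-year moving average of this kind can be sketched as follows.  The handling of the first and last two years is an assumption on my part: here the window simply shrinks to whichever years exist, which may differ from what the visualization software does.  The sample values are invented.

```python
def centered_moving_average(values_by_year, half_window=2):
    """Average each year with up to `half_window` years on either side.

    Years missing from the input (including those beyond either end of
    the series) are left out of the window rather than counted as zero.
    """
    smoothed = {}
    for year in values_by_year:
        window = [
            values_by_year[y]
            for y in range(year - half_window, year + half_window + 1)
            if y in values_by_year
        ]
        smoothed[year] = sum(window) / len(window)
    return smoothed


# Invented yearly averages for one age group.
yearly = {2001: 10.0, 2002: 20.0, 2003: 30.0, 2004: 40.0, 2005: 50.0}
smoothed = centered_moving_average(yearly)
print(smoothed[2003])  # full five-year window
print(smoothed[2001])  # shortened window at the start of the series
```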

Each graph was divided into distinct vertical bands, alternating grey and white, representing commonly recognized historical eras in baseball.  The eras are labeled in the top graph, reading vertically to minimize overlapping, and to orient the label with the vertical bands that align throughout all three graphs.  Use of these historical bands appeared to work seamlessly with all users, who often found this additional context contributed to their interpretation of the information in the line plots.

Labels were also added to each line on the graph, both for player groups and individual players.  Users noted that not all lines showed up with labels, especially when there were many players selected.  Tableau’s default setting, which hides labels that would overlap, was therefore turned off, allowing labels to overlap each other.

No x-axis label was used, as it was considered that showing years along the axis in relation to historical baseball data would be self-explanatory.  User testing confirmed this hypothesis.  Y-axis labels were more problematic.  Ideally, the label would correspond to the statistic being displayed in the graph.  However, as the graph is designed to be able to display different statistics (albeit one at a time), and Tableau does not appear to have an option for dynamic axis labels, the generic label “Selected Stat” was chosen for the top two graphs.  This was changed from the prototype label, “Stat Value,” which users found confusing.  “Selected Stat” is intended to correspond to the user’s action of choosing what to display, thereby hopefully reducing confusion.

While the basic information available on the line graphs themselves was for the most part easily comprehended by users, Tableau also generates pop-up information when the user hovers over the various data points on a graph.  After users expressed some uncertainty about what the pop-up information meant, labels and statistics included in the pop-ups were chosen more carefully for each graph to avoid redundant or ambiguous information, and to offer the more detailed information suited to this pop-up format.

Two interactive filters are included in the visualization.  The first filter allows users to select and deselect players to include in the top graph.  To emphasize this ability, one player is pre-selected, Babe Ruth.  Ruth is one of the most universally known figures in baseball history, with a unique statistical profile that lends itself to analysis, both against league averages, and against other players.  He is also unusual in that he was both a hitter and a pitcher during his career, meaning that he returns information for any of the statistical categories I have included in this visualization.

The player filter is designed to be searchable in order to find any player in the data set.  A dropdown option listing players was considered, but deemed too cumbersome with a list this long.  Users didn’t seem to have much difficulty using this feature, as they knew without prompting which players they would be interested in selecting.  It was also decided to turn off the default setting that populates the graph with all values when nothing is selected in the filter, as this issue was encountered when a user clicked the “Clear All” button, yielding an amusing and meaninglessly cluttered graph.  The “Clear All” button only appears after four entries have been selected in the filter; ideally, this button would not be available, as it doesn’t yield any desirable results, but no option to remove it was identified in Tableau.

The second filter allows users to select which baseball statistic is displayed on the top two graphs.  Depending on whether it is a pitching or hitting statistic, the selection will also affect the bottom graph, as the player pools are different between hitters and pitchers.  The stat filter includes the 20 statistics finally chosen for inclusion.  Users pointed out that no pitching statistics were included in the prototype, even though they were about as interested in pitchers as in hitters.  Therefore, 10 statistics each were included for hitters and pitchers.

The statistics in the prototype were non-standard, as I sought to include rate-based statistics that were less prone to influence by the vagaries of baseball definitions, and could be more accurately compared across time.  However, these unusual statistics were at times unclear to users, and therefore it was determined to primarily choose statistics that are commonly known, so that no explanation would be required, and the stat values are easily interpretable.  Several instances of less common “Sabermetric” statistics were included, though these were limited to the most commonly referenced stats in mainstream baseball journalism.

The stat filter is set to only allow one statistic to display at a time, as multiple statistics operate on different scales and would not be comparable.  The default stat is set to “HR” (home runs), as it is the most common measuring stick in baseball, and tracks very well across baseball eras, with changing home run rates often defining the boundaries of different eras.  Stat names in the filter are preceded by “Hitting” or “Pitching” to clarify exactly what they refer to, and to aid in the alphabetic sorting of the filter.  It would be preferable to sort the statistics in the filter according to their popularity, but as custom sorting does not seem to be available in Tableau, it was at least necessary to separate the pitching from the hitting stats, for clarity.

Finally, in addition to a legend indicating the color coding of the different age groups, an informational panel is included on the dashboard, briefly explaining what this visualization is for, and how to begin using it.  This panel was added in response to frequent user confusion about what options were available for interaction, an issue which also prompted changing the filter labels to “Select Players” and “Select Stat”.  To highlight the informational panel, it is included near the top of the dashboard, and is differentiated from the rest of the content by being placed in a light yellow box.

These interactive elements (the filters and the informational panel) were moved from the right side of the dashboard in the prototype to the left side in the final version, in order to make them more prominent.  A link to the data source is provided at the bottom of the page, in response to a user observation that sourcing information is valuable, and should be listed.

Findings

Known Issues

In addition to user findings which were addressed in the revised visualization, there are a number of issues which were not addressed.  As mentioned earlier, the color gradient should not be applied to the line which represents more than one age group.  Age was therefore removed from the overall MLB average line in the top graph.  However, Tableau interprets the resulting null age value as a zero in its age classification, and therefore has colored the line in the “under 25” category.

Additionally, the legend for the color gradient would be clearer if the colors were listed vertically, and each age group listed next to its respective color.  The legend options in the software are limited, unfortunately, and this would probably require the legend to be manually constructed.

Certain statistical categories do not report values for every year, such as strikeouts and stolen bases for hitters.  In these cases, it would be appropriate to display a gap for years with no data.  In this visualization, a line is instead drawn across the missing years to connect the two nearest years with data.  This affects all three graphs, and was confusing to all users.  However, no solution presented itself within the software.
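Outside of Tableau, one way to get the desired gaps is to split each series into runs of consecutive years and draw each run as its own line.  The sketch below shows only that splitting step; the function name and the sample values are hypothetical, and it is not a fix for the Tableau dashboard.

```python
def split_into_segments(values_by_year):
    """Group (year, value) points into runs of consecutive years.

    A year that is absent, or whose value is None, ends the current run,
    so a plot that draws each run separately shows a gap there.
    """
    segments = []
    previous_year = None
    for year in sorted(values_by_year):
        value = values_by_year[year]
        if value is None:
            continue
        if previous_year is None or year != previous_year + 1:
            segments.append([])  # start a new run after a gap
        segments[-1].append((year, value))
        previous_year = year
    return segments


# Invented values with two unreported years in the middle.
series = {1895: 3.1, 1896: 3.0, 1897: None, 1898: None, 1899: 2.7, 1900: 2.9}
segments = split_into_segments(series)
print(segments)
```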

Also, it was noted that, when a user moved their cursor to the area where two eras bordered each other, a pop-up would appear listing the year of the border.  This pop-up is not especially functional, and can impair the ability to read and access other information.  An option to turn this feature off was not identified.

Statistical Observations

With just the three unusual statistics provided in the prototype, users were excited to discover meaning in the information they were looking at.  Several users noted that, in the last half century, players aged 35+ often have had higher home run rates on average than younger players.  This seemed counterintuitive, until they noted in the bottom graph that players 35+ make up the smallest percentage of the player pool.  This in turn sparked the insight that, for a player to still be playing at this age, he probably already possesses higher than average talent, skewing upwards the average production level of that age group.

Among the statistics I included in the final version of the visualization was wild pitches (“WP”) for pitchers.  While not the most commonly cited statistic, it is nonetheless generally known, and I thought it would be interesting to include after it was suggested by one of the user test subjects.  Plotting the data for average wild pitches per year revealed an unexpected and abrupt change in rates between 1974 and 1975.  Before 1974 there was no average rate higher than 0.972 wild pitches per year, and since 1975 the rate has never fallen below 3.727, with the second lowest being 4.831.  The reason for this is entirely mysterious to me; the abruptness of the change suggests either a change in the rules for how the statistic was collected, or else a flaw in the data set.
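A break like this can also be located programmatically by looking for the largest year-over-year change in a series.  The sketch below uses invented numbers that merely mimic the shape of the pattern described above; they are not the actual wild pitch averages.

```python
def largest_jump(values_by_year):
    """Return (previous_year, year, change) for the biggest absolute change."""
    years = sorted(values_by_year)
    best = None
    for prev, curr in zip(years, years[1:]):
        change = values_by_year[curr] - values_by_year[prev]
        if best is None or abs(change) > abs(best[2]):
            best = (prev, curr, change)
    return best


# Invented yearly averages with a sudden level shift.
rates = {1972: 0.8, 1973: 0.9, 1974: 0.9, 1975: 4.9, 1976: 5.1}
jump = largest_jump(rates)
print(jump[0], jump[1])
```

A single large jump flanked by otherwise stable values, as in this toy series, is the kind of pattern that points toward a definitional or data-collection change rather than a change in play.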

User Recommendations

Users were generally enthusiastic about adding more and different kinds of stats.  Users who were more interested in baseball statistics suggested more advanced “Sabermetric” stats, and the feeling overall was, the more stats, the better.

Some additional data dimensions were suggested to make the information even more explorable.  For example, users suggested including ethnicity or defensive position as other modes of comparison.  It was also suggested to allow the user to filter by a specific age, or age range, of their own choosing.

Finally, another possible extension to this visualization would be a progressive graph of average player production by age, without respect to year.  In this case, the x-axis would be the player’s age, and a single line could represent the average of all players.
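The calculation behind such a graph would be a simple average grouped by age.  Here is a minimal sketch, assuming each player-season is reduced to an (age, stat value) pair; the sample values are invented.

```python
from collections import defaultdict


def average_by_age(player_seasons):
    """Average a statistic over all player-seasons at each age."""
    totals = defaultdict(lambda: [0.0, 0])  # age -> [sum, count]
    for age, value in player_seasons:
        totals[age][0] += value
        totals[age][1] += 1
    return {age: total / count for age, (total, count) in sorted(totals.items())}


# Hypothetical (age, stat value) pairs pooled across all years.
player_seasons = [(24, 10), (24, 20), (30, 30), (30, 10), (36, 5)]
curve = average_by_age(player_seasons)
print(curve)
```

The resulting dictionary is ordered by age, so its keys and values map directly onto the x-axis and the single line described above.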

Future Development

As discussed in the Known Issues section above, there are a number of ways that this visualization could be cleaned up in terms of the information currently displayed.  These corrections may or may not be feasible within Tableau, but are worth further investigation.

Adding more statistics is an obvious next step to enhancing the user experience, but this is somewhat inhibited by the space limitations of a filter that presents its options in a single column, listed strictly alphabetically.  Some alternative menu format would be preferable for selecting from a larger range of statistics, perhaps with a hierarchy for pitching and hitting, and separating “common” statistics from “advanced” statistics.

I am also intrigued by the suggestion to add ethnicity and position as additional dimensions to the data.  Position is an obvious choice, as different positions often have their own historical statistical identity, and it would be interesting to compare a player against his own position, rather than all players.  The idea of ethnicity reminded me of a cartographic project I recently pursued, mapping the place of origin of MLB players by decade.

Both of these dimensions could be significant enhancements to the experience of this visualization, though both would require a significant update to the data structure, not to mention additional data sources, which further complicates the prospect.  It is unclear that the public version of Tableau has the features to accommodate all of these changes, though it would be fun and educational to try and find out.