{"id":5135,"date":"2016-07-05T20:15:38","date_gmt":"2016-07-06T00:15:38","guid":{"rendered":"http:\/\/research.prattsils.org\/?p=5135"},"modified":"2016-07-05T20:15:38","modified_gmt":"2016-07-06T00:15:38","slug":"baseball-stats-ages-interactive-visualization","status":"publish","type":"post","link":"https:\/\/studentwork.prattsi.org\/infovis\/visualization\/baseball-stats-ages-interactive-visualization\/","title":{"rendered":"Baseball Stats through the Ages: An Interactive Visualization"},"content":{"rendered":"<p><strong>Introduction<\/strong><\/p>\n<p>In this project, I have set out to expand on <a href=\"http:\/\/research.prattsils.org\/blog\/coursework\/information-visualization\/visualizing-exploring-home-run-rates-baseball-history\/\" target=\"_blank\">a previous information visualization<\/a>, in which I sought to contextualize home run production in Major League Baseball (MLB) both in terms of player age and historical era.<\/p>\n<div id=\"attachment_4652\" style=\"width: 310px\" class=\"wp-caption alignright\"><a href=\"http:\/\/research.prattsils.org\/blog\/coursework\/information-visualization\/visualizing-exploring-home-run-rates-baseball-history\/\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-4652\" class=\"wp-image-4652\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infoshow\/wp-content\/uploads\/sites\/2\/2016\/06\/MLB-HR-per-PA-620x462.png?resize=310%2C231\" alt=\"The visualization that started it all...\" width=\"310\" height=\"231\" \/><\/a><p id=\"caption-attachment-4652\" class=\"wp-caption-text\">The visualization that started it all&#8230;<\/p><\/div>\n<p>The original visualization got me thinking that it would be interesting to apply a similar methodology to additional baseball statistics, and to be able to explore the effects that different individual players might have on average MLB numbers.\u00a0 This seemed a natural situation for an interactive visualization, and I 
therefore set out to create one that is thorough, understandable, engaging, and enjoyable.\u00a0 To advance this goal, I have applied both the best practices of information visualization, and user testing to refine <a href=\"https:\/\/public.tableau.com\/views\/BaseballStatsthroughtheAges\/Dashboard1?:embed=y&amp;:display_count=yes&amp;:showTabs=y\" target=\"_blank\">the final product<\/a>.<\/p>\n<p><strong>Methodology<\/strong><\/p>\n<p>To begin the project, I developed a prototype to enable user testing.\u00a0 This prototype contained most of the features I intended to include in the final visualization, but was limited to three statistics that a user could choose from.\u00a0 I intended to use significantly more statistics in the final version, but the time involved in data preparation made this impractical for the prototype, since user feedback was anticipated to result in significant changes.\u00a0 Furthermore, the selection of statistics to include was also expected to be heavily influenced by user feedback.<\/p>\n<div id=\"attachment_5137\" style=\"width: 620px\" class=\"wp-caption aligncenter\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5137\" class=\"size-medium wp-image-5137\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infoshow\/wp-content\/uploads\/sites\/2\/2016\/07\/Dashboard-1-620x477.png?resize=620%2C477\" alt=\"Visualization Prototype\" width=\"620\" height=\"477\" \/><p id=\"caption-attachment-5137\" class=\"wp-caption-text\">Visualization Prototype<\/p><\/div>\n<p>Construction of the prototype and the final version followed more or less the same process.\u00a0 First, data were collected from the data export feature at <a href=\"http:\/\/www.fangraphs.com\/\" target=\"_blank\">Fangraphs<\/a>.\u00a0 These data consisted of player names, ages, and specific statistics for every season since 1871.\u00a0 The prototype included statistics for 2016, while the final version did not, as the 
current season is incomplete and not ideal for comparison.\u00a0 For the prototype, I also had to do some manipulation of the retrieved data, as I was using non-standard statistics.<\/p>\n<p>After confirming that the retrieved data were adequate and appropriate for the intended visualization, the data table was converted from crosstab data to normalized data in Google Refine, to allow for effective processing in Tableau, the visualization software used to create the prototype and final product.<\/p>\n<p>As with the original visualization, the data were used to create three stacked line graphs, representing league average production, average production within age groups, and age groups as a percentage of the total player population.\u00a0 To the first, league average, line graph, I also added individual player plots.\u00a0 Interactive filters were added to the dashboard to control the statistics displayed and the players included.<\/p>\n<p>Once the prototype was constructed, it was subjected to user testing.\u00a0 Five test subjects were selected, with the intent of sampling self-identified baseball fans with some degree of interest in baseball statistics.\u00a0 The sample included four males and one female ranging in age from 39 years old to 80 years old, with an average age of 48.\u00a0 Ideally, the sample would have included representation of younger users, as well, but this sample was able to offer consistent and knowledgeable feedback.\u00a0 Asked to rate their level of interest in engaging with baseball statistics, the users averaged 7.2 on a scale of 1 to 10 (1 being \u201cnot at all interested\u201d and 10 being \u201cvery much interested\u201d.)\u00a0 All users rated 5 or higher.<\/p>\n<p>Tests were conducted using phone interviews.\u00a0 Each interview began with an explanation of the purpose of the test, and how it would be conducted.\u00a0 The users were then asked for basic background information, as summarized in the previous paragraph.\u00a0 Users 
were then provided a link to access the prototype on their personal computer, and they were instructed to use the \u201cThink Aloud\u201d method to narrate their experience of interacting with the site.\u00a0 It was indicated to them that the prototype did not contain the complete set of statistics that would be included in the final version.\u00a0 The interviewer took notes, and occasionally prompted users to explain what they were doing and why they were doing it, and to elaborate on particular statements.\u00a0 At the end of their exploration, users were asked what additional statistics they would be interested in seeing, and if they would like to see additional features.\u00a0 User interviews lasted between 20 and 40 minutes.<\/p>\n<p>Findings from user testing were then incorporated into the final visualization, which is displayed below and <a href=\"https:\/\/public.tableau.com\/views\/BaseballStatsthroughtheAges\/Dashboard1?:embed=y&amp;:display_count=yes&amp;:showTabs=y\" target=\"_blank\">available here<\/a>.<\/p>\n<p><a href=\"https:\/\/public.tableau.com\/views\/BaseballStatsthroughtheAges\/Dashboard1?:embed=y&amp;:display_count=yes&amp;:showTabs=y\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5138\" src=\"https:\/\/i0.wp.com\/studentwork.prattsi.org\/infoshow\/wp-content\/uploads\/sites\/2\/2016\/07\/Dashboard-1-1-940x723.png?resize=720%2C554\" alt=\"Dashboard 1 (1)\" width=\"720\" height=\"554\" \/><\/a><\/p>\n<p><strong>Visualization Design<\/strong><\/p>\n<p>As this visualization focuses on time-series data, and aims to reveal changing patterns over time, line graphs were deemed appropriate for all the information presented.\u00a0 User testing confirmed there was no difficulty interpreting the information thus presented.\u00a0 In setting the axes for the graphs, it was decided the y-axis would not be fixed to start at zero, but would rather be set automatically by the Tableau software.\u00a0 
Although this has the potential to exaggerate the magnitude of differences in the data, the point of this visualization is to identify these differences over time.\u00a0 Exaggerated differences were therefore deemed preferable to minimized differences.<\/p>\n<p>A color gradient was used to segment the data into four age groups: under 25, 25-29, 30-34, and 35+.\u00a0 A blue-to-red gradient was chosen, as it provided each group with a color that was distinguishable from the other groups.\u00a0 Users reported no difficulties in distinguishing the different colors, and appeared to interpret them correctly almost immediately.\u00a0 The color gradient was used both to identify the age groups represented by the lines in the bottom two graphs, and to indicate an individual player\u2019s transition through age groups in the top graph.<\/p>\n<p>In the prototype, the color gradient was also applied to the overall MLB average line in the top graph, corresponding to the average age of all MLB players.\u00a0 However, this was found to be confusing, as it seemed to imply that the statistics represented by that line were being produced by a specific age group, not all players.\u00a0 Therefore, it was determined that the overall average should not have an age applied to it.<\/p>\n<p>For the second graph, a 5-year moving average was employed, rather than plotting single years of data.\u00a0 This moving average includes two years before and two years after a given data point.\u00a0 The decision to use a moving average was due to the year-to-year variability of the statistics within each age group.\u00a0 This variability created a degree of spikiness in the graph that made it very difficult to distinguish one age group from another.\u00a0 The smoother lines created by the moving average show clearer distinctions between age groups, without losing the meaning of the underlying data.<\/p>\n<p>Each graph was divided into distinct vertical bands, alternating grey and white, representing 
commonly recognized historical eras in baseball.\u00a0 The eras are labeled in the top graph, reading vertically to minimize overlapping, and to orient the label with the vertical bands that align throughout all three graphs.\u00a0 Use of these historical bands appeared to work seamlessly with all users, who often found this additional context contributed to their interpretation of the information in the line plots.<\/p>\n<p>Labels were also added to each line on the graph, both for player groups and individual players.\u00a0 Users noted that not all lines showed up with labels, especially when there were many players selected, and so Tableau\u2019s default setting was turned off, thereby allowing labels to overlap each other.<\/p>\n<p>No x-axis label was used, as it was considered that showing years along the axis in relation to historical baseball data would be self-explanatory.\u00a0 User testing confirmed this hypothesis.\u00a0 Y-axis labels were more problematic.\u00a0 Ideally, the label would correspond to the statistic being displayed in the graph.\u00a0 However, as the graph is designed to be able to display different statistics (albeit one at a time), and Tableau does not appear to have an option for dynamic axis labels, the generic label \u201cSelected Stat\u201d was chosen for the top two graphs.\u00a0 This was changed from the prototype label, \u201cStat Value,\u201d which users found confusing.\u00a0 \u201cSelected Stat\u201d is intended to correspond to the user\u2019s action of choosing what to display, thereby hopefully reducing confusion.<\/p>\n<p>While the basic information available on the line graphs themselves was for the most part easily comprehended by users, Tableau also generates pop-up information when the user hovers over the various data points on a graph.\u00a0 After users expressed some uncertainty about what the pop-up information meant, labels and statistics included in the pop-ups were chosen more carefully for each graph to avoid 
redundant or ambiguous information, and to offer the more detailed information suited to this pop-up format.<\/p>\n<p>Two interactive filters are included in the visualization.\u00a0 The first filter allows users to select and deselect players to include in the top graph.\u00a0 To emphasize this ability, one player is pre-selected, Babe Ruth.\u00a0 Ruth is one of the most universally known figures in baseball history, with a unique statistical profile that lends itself to analysis, both against league averages, and against other players.\u00a0 He is also unusual in that he was both a hitter and a pitcher during his career, meaning that he returns information for any of the statistical categories I have included in this visualization.<\/p>\n<p>The player filter is designed to be searchable in order to find any player in the data set.\u00a0 A dropdown option listing players was considered, but deemed too cumbersome with a list this long.\u00a0 Users didn\u2019t seem to have much difficulty using this feature, as they knew without prompting which players they would be interested in selecting.\u00a0 It was also decided to turn off the default setting that populates the graph with all values when nothing is selected in the filter, as this issue was encountered when a user clicked the \u201cClear All\u201d button, yielding an amusing and meaninglessly cluttered graph.\u00a0 The \u201cClear All\u201d button only appears after four entries have been selected in the filter; ideally, this button would not be available, as it doesn\u2019t yield any desirable results, but no option to remove it was identified in Tableau.<\/p>\n<p>The second filter allows users to select which baseball statistic is displayed on the top two graphs.\u00a0 Depending on whether it is a pitching or hitting statistic, it will also affect the bottom graph, as the player pools are different between hitters and pitchers.\u00a0 The stat filter includes the 20 statistics finally chosen for 
inclusion.\u00a0 User feedback revealed that no pitching statistics were included in the prototype, whereas users were about as interested in pitchers as in hitters.\u00a0 Therefore, 10 statistics were included for both hitters and pitchers.<\/p>\n<p>The statistics in the prototype were non-standard, as I sought to include rate-based statistics that were less prone to influence by the vagaries of baseball definitions, and could be more accurately compared across time.\u00a0 However, these unusual statistics were at times unclear to users, and therefore it was determined to primarily choose statistics that are commonly known, so that no explanation would be required, and the stat values are easily interpretable.\u00a0 Several instances of less common \u201cSabermetric\u201d statistics were included, though these were limited to the most commonly referenced stats in mainstream baseball journalism.<\/p>\n<p>The stat filter is set to only allow one statistic to display at a time, as multiple statistics operate on different scales and would not be comparable.\u00a0 The default stat is set to \u201cHR\u201d (home runs), as it is the most common measuring stick in baseball, and tracks very well across baseball eras, with changing home run rates often defining the boundaries of different eras.\u00a0 Stat names in the filter are preceded by \u201cHitting\u201d or \u201cPitching\u201d to clarify exactly what they refer to, and to aid in the alphabetic sorting of the filter.\u00a0 It would be preferable to sort the statistics in the filter according to their popularity, but as custom sorting does not seem to be available in Tableau, it was at least necessary to separate the pitching from the hitting stats, for clarity.<\/p>\n<p>Finally, in addition to a legend indicating the color coding of the different age groups, an informational panel is included on the dashboard, briefly explaining what this visualization is for, and how to begin using it.\u00a0 This panel was added in 
response to frequent user confusion about what options were available for interaction, an issue which also prompted changing the filter labels to \u201cSelect Players\u201d and \u201cSelect Stat\u201d.\u00a0 To highlight the informational panel, it is included near the top of the dashboard, and is differentiated from the rest of the content by being placed in a light yellow box.<\/p>\n<p>These interactive elements, (the filters and the informational panel,) were moved from the right side of the dashboard in the prototype, to the left side in the final version, in order to make them more prominent.\u00a0 A link to the data source is provided at the bottom of the page, in response to a user observation that sourcing information is valuable, and should be listed.<\/p>\n<p><strong>Findings<\/strong><\/p>\n<p><em>Known Issues<\/em><\/p>\n<p>In addition to user findings which were addressed in the revised visualization, there are a number of issues which were not addressed.\u00a0 As mentioned earlier, the color gradient should not be applied to the line which represents more than one age group.\u00a0 Age was therefore removed from the overall MLB average line in the top graph.\u00a0 However, Tableau interprets the resulting null age value as a zero in its age classification, and therefore has colored the line in the \u201cunder 25\u201d category.<\/p>\n<p>Additionally, the legend for the color gradient would be clearer if the colors were listed vertically, and each age group listed next to its respective color.\u00a0 The legend options in the software are limited, unfortunately, and this would probably require the legend to be manually constructed.<\/p>\n<p>Certain statistical categories do not report values for every year, such as strikeouts and stolen bases for hitters.\u00a0 In these cases, it would be appropriate to display a gap for years with no data.\u00a0 In this visualization, a line is drawn across the missing years to connect the two years with data.\u00a0 
This affects all three graphs, and was generally confusing to all users.\u00a0 However, no solution presented itself within the software.<\/p>\n<p>Also, it was noted that, when a user moved their cursor to the area where two eras bordered each other, a pop-up would appear listing the year of the border.\u00a0 This pop-up is not especially functional, and can impair the ability to read and access other information.\u00a0 An option to turn this feature off was not identified.<\/p>\n<p><em>Statistical Observations<\/em><\/p>\n<p>With just the three unusual statistics provided in the prototype, users were excited to discover meaning in the information they were looking at.\u00a0 Several users noted that, in the last half century, players aged 35+ often have had higher home run rates on average than younger players.\u00a0 This seemed counterintuitive, until they noted in the bottom graph that players 35+ make up the smallest percentage of the player pool.\u00a0 This in turn sparked the insight that, for a player to still be playing at this age, he probably already possesses higher than average talent, skewing upwards the average production level of that age group.<\/p>\n<p>Among the statistics I included in the final version of the visualization was wild pitches (\u201cWP\u201d) for pitchers.\u00a0 While not the most commonly referred to statistic, it is nonetheless generally known, and I thought it would be interesting to include after it was suggested by one of the user test subjects.\u00a0 Plotting the data for average wild pitches per year revealed an unexpected and abrupt change in rates between 1974 and 1975.\u00a0 Before 1974 there was no average rate higher than 0.972 wild pitches per year, and since 1975 the rate never fell below 3.727, with the second lowest being 4.831.\u00a0 The reasons for this are entirely mysterious to me, and suggest either a change in the rules of how the statistic was collected, or else a flaw in the data 
set.<\/p>\n<p><em>User\u00a0Recommendations<\/em><\/p>\n<p>Users were generally enthusiastic about adding more and different kinds of stats.\u00a0 Users who were more interested in baseball statistics suggested more advanced \u201cSabermetric\u201d stats, and the feeling overall was, the more stats, the better.<\/p>\n<p>Some additional data dimensions were suggested to make the information even more explorable.\u00a0 For example, users suggested including ethnicity or defensive position as other modes of comparison.\u00a0 It was also suggested to allow the user to filter by a specific age, or age range, of their own choosing.<\/p>\n<p>Finally, another possible extension to this visualization would be a progressive graph of average player production by age, without respect to year.\u00a0 In this case, the x-axis would be the player\u2019s age, and a single line could represent the average of all players.<\/p>\n<p><strong>Future Development<\/strong><\/p>\n<p>As discussed in the <em>Known Issues<\/em> section above, there are a number of ways that this visualization could be cleaned in terms of the information currently displayed.\u00a0 These corrections may or may not be feasible within Tableau, but are worth further inspection.<\/p>\n<p>Adding additional statistics is an obvious next step to enhancing the user experience, but this is somewhat inhibited by the space limitations imposed by having the filter only provide options in a single column, and listing all options strictly alphabetically.\u00a0 Some alternative menu format would be preferable for selecting from a larger range of statistics, perhaps with a hierarchy for pitching and hitting, and separating \u201ccommon\u201d statistics from \u201cadvanced\u201d statistics.<\/p>\n<p>And I am intrigued by the suggestion to add ethnicity and position as additional dimensions to the data.\u00a0 Position is an obvious choice, as different positions often have their own historical statistical identity, and it would 
be interesting to compare a player against his own position, rather than all players.\u00a0 And the idea of ethnicity reminded me of <a href=\"http:\/\/research.prattsils.org\/blog\/coursework\/information-visualization\/mapping-major-league-talent-time\/\" target=\"_blank\">the cartographic project<\/a> I recently pursued, mapping the place of origin of MLB players by decade.<\/p>\n<p>Both of these dimensions could be significant enhancements to the experience of this visualization, though both would require a significant update to the data structure, not to mention additional data sources, which further complicates the prospect.\u00a0 It is unclear that the public version of Tableau has the features to accommodate all of these changes, though it would be fun and educational to try and find out.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction In this project, I have set out to expand on a previous information visualization, in which I sought to contextualize home run production in Major League Baseball (MLB) both in terms of player age and historical era. 
The original visualization got me thinking that it would be interesting to apply a similar methodology to&hellip;<\/p>\n","protected":false},"author":171,"featured_media":5138,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[78,79,91,81],"coauthors":[],"class_list":["post-5135","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-visualization","tag-baseball","tag-baseball-history","tag-baseball-stats","tag-mlb"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paBdcV-1kP","_links":{"self":[{"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/posts\/5135","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/users\/171"}],"replies":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/comments?post=5135"}],"version-history":[{"count":0,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/posts\/5135\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/"}],"wp:attachment":[{"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/media?parent=5135"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/categories?post=5135"},{"taxonomy":"post_tag","embeddable":true,"href"
:"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/tags?post=5135"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/infovis\/wp-json\/wp\/v2\/coauthors?post=5135"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}