Mapping Major League Talent over Time



Introduction

I have been experimenting with information visualizations related to baseball players and baseball history, so I wondered what an appropriate cartographic exploration of baseball information would be.  I decided to construct a series of maps showing, for each decade, the country or U.S. state in which that decade's players were born.  Since this information is available for most players dating back to the 1870s, this seemed like a potentially interesting way to capture the growth of the sport, first in parallel with the growth of the country, and then as baseball extended into other markets around the world.

Inspirations

To achieve this visualization, it seemed appropriate to use a choropleth map, such as the one below.  For my purposes, it may be more appropriate to use a color gradient where color intensity increases as more players come from a location, rather than changing between hues as in this example.

The example below is a closer depiction of the use of increasing intensity, in this case amusingly illustrating which countries have more vowels in their names.  This map includes information both for U.S. states and for non-U.S. countries, another feature I hope to include in my visualization.  However, I would prefer to avoid breaking out Canada’s provinces, and to limit political boundaries to the areas being represented, rather than showing all sub-national boundaries, as this map does.

Finally, I intended to use small multiples of these maps to make it simpler for the viewer to observe the progression over time.  The example below shows an instance of small multiples being used to convey cartographic changes over time in a way that is highly digestible.  (This is a map of drought areas over several decades.)  However, given that my choropleth maps contain more information, and are spread across a world map, it may not be possible to compress these multiples quite as effectively as in this example.

Materials

Constructing the maps required two sets of data.  First, I collected the birthplace information of major league baseball players from the statistical resource at www.baseball-reference.com.  These data were readily available on the site, and required relatively little transformation to be put in usable form.  I combined both U.S. and international players into one CSV file per decade, listing simply the name of the state or country, the number of players born there, and the percentage of total MLB players born there.
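As a rough illustration of the per-decade file described above, here is a minimal Python sketch.  It assumes one starts from a plain list of birthplaces, one entry per player, which may not match how the data were actually exported from the site.  The function names, column names, and sample values are hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import Counter


def summarize_birthplaces(birthplaces):
    """Return (place, players, percent_of_total) rows, most common place first."""
    counts = Counter(birthplaces)
    total = sum(counts.values())
    return [
        (place, n, round(100 * n / total, 1))
        for place, n in counts.most_common()
    ]


def write_decade_csv(rows, handle):
    """Write one decade's rows with a header line (column names are assumptions)."""
    writer = csv.writer(handle)
    writer.writerow(["place", "players", "pct_of_total"])
    writer.writerows(rows)


# Made-up sample for illustration only, not real player data.
sample = ["Pennsylvania"] * 5 + ["New York"] * 3 + ["Ireland"] * 2
rows = summarize_birthplaces(sample)

buffer = io.StringIO()  # stands in for a file such as a per-decade CSV
write_decade_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping both the count and the percentage in each row means either one can later be used to drive the map coloring.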

The visualization also required preparing a shapefile that included both U.S. states and non-U.S. countries.  As no such shapefile was readily available, I merged two separate datasets to serve this purpose.  Country shapefiles were retrieved from Natural Earth, and state shapefiles from www.arcgis.com.  These datasets were then combined in QGIS, a geographic information system, storing state and country names in a single field and removing any duplicated instances of states and countries caused by overlaps between the two shapefiles.

The baseball data were then merged with the shapefile data using CartoDB’s online software.  Each of the small multiple maps was created as an individual interactive map.  Each map was colored based on the number of players born in each state or country, set on a seven-part scale of increasing color intensity, from a light cream color to a dark red.  A filter was also applied so that only places with at least one player would be colored.  The buckets on the scale were defined by quantile, using one of CartoDB’s presets, so that the colors are spread relatively evenly across states and countries.  Otherwise, only a few places would show significant player populations, and the rest of the world would fall into a single bucket.  A legend was created for each map, indicating how many players were represented by each color on the scale.
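To make the idea of quantile buckets concrete, here is a minimal Python sketch of one simple way to split player counts into seven classes holding roughly equal numbers of places.  This is an illustration under stated assumptions, not CartoDB’s actual preset, whose exact method may differ.  The function names and the sample counts are hypothetical.

```python
import math


def quantile_breaks(values, classes=7):
    """Return the upper bound of each class so classes hold roughly equal numbers of places."""
    ordered = sorted(values)
    n = len(ordered)
    breaks = []
    for k in range(1, classes + 1):
        # Index of the last value that belongs to the k-th class.
        idx = max(0, math.ceil(k * n / classes) - 1)
        breaks.append(ordered[idx])
    return breaks


def bucket_for(value, breaks):
    """Return the 0-based class whose upper bound first covers the value."""
    for i, upper in enumerate(breaks):
        if value <= upper:
            return i
    return len(breaks) - 1


# Made-up player counts per place for one decade, for illustration only.
counts = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]

# Mirror the filter described above: only places with at least one player are colored.
colored = [c for c in counts if c >= 1]

breaks = quantile_breaks(colored)
print(breaks)  # [1, 3, 8, 21, 55, 144, 377]
print([bucket_for(c, breaks) for c in colored])
```

With these sample values each of the seven classes contains two places, even though the counts themselves are heavily skewed.  Equal-width intervals over the same values would place most of them in the lowest class.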

Below is the compilation of maps from the 1870s to the 2010s.  Click on an individual map to see the interactive version.

[Maps: MLB players by birthplace, one map per decade, 1870s through 2010s]

Discussion

The first thing I noted when reviewing these results is that the small multiples are not particularly small.  As a result, it is difficult to have more than two at a time in your field of view, and looking through the entire timeline requires a good deal of scrolling.  This defeats one of the primary advantages of using small multiples, which is being able to easily scan through related information to discern patterns.  Unfortunately, reducing these images any further would risk losing information, as small states and countries would become indistinct.  Even at this scale, a number of East Coast states are not visible.

However, there is still plenty of interesting information to be noted in this presentation, especially if one focuses on a particular landmass while scrolling through.  For example, looking just at the United States, one can see that the major leagues were a mostly local phenomenon through their first few decades of existence, being dominated by players from the industrial Northeast and Midwest, where the professional teams and baseball itself flourished at the time.  As the decades progress, one observes the closing of the American frontier, with Western states increasingly adding talent to the Eastern pro-ball teams.  Finally, in more recent times, the old industrial belt becomes overshadowed by warmer-weather states in the South and by California, where athletes have greater opportunities to train and play, at a time when baseball enjoys a truly national presence.

Similarly, one can infer from the European continent the changing trends in immigration and assimilation in America over time.  Ireland and Great Britain maintained noticeable contributions to the player pool for several decades, perhaps in part because of similar games played there.  Other Northern European nations are also consistently represented up until the early to mid-twentieth century, after which there is no consistent contribution from the continent.

Meanwhile, the influx of baseball talent from Latin America is very clear, starting with Cuba and Mexico in the first half of the twentieth century, and then gradually spreading and intensifying.  A similar trend appears to be developing in the last few decades with countries of East Asia and Australia.  Also clear is that much of the world has next to no connection to major league baseball: there was no need to include India, or indeed the entire continent of Africa, on any of these maps.

Finally, I note that this visualization brings home to me the shortness and closeness of time.  Nothing in professional sports history seems very far away, and yet this set of maps is significantly shaped by such seemingly far-off episodes as the Civil War and the settlement of the American West.  Major historical changes can be observed over just two or three maps, which puts the present we live in into a different perspective.

Future Directions

As I noted, these maps could be more effective if they were small enough to see more at a time.  One natural approach to this would be to try a different map projection.  Given where most of the data lie, a projection focused on the Western Hemisphere and compressing other landmasses around it might be effective.

It might also be helpful to make the color scale on these maps less confusing by using a uniform scale, instead of one that changes from map to map.  This revised scale would probably have to use percentage of players, rather than number of players, since the total numbers change significantly over time.  The scale would then need to be set to intervals that still yield useful information across all maps.
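A uniform percentage scale of this kind could look something like the following Python sketch.  The break values are placeholders picked for illustration, not intervals tested against the real data, and the function names and sample counts are likewise hypothetical.

```python
from bisect import bisect_left

# Hypothetical shared class boundaries, in percent of a decade's players.
# Six boundaries give seven classes, matching the seven-part color scale.
SHARED_BREAKS = [0.5, 1, 2, 5, 10, 20]


def to_percentages(counts):
    """Convert a {place: players} mapping to {place: percent of that decade's total}."""
    total = sum(counts.values())
    return {place: 100 * n / total for place, n in counts.items()}


def shared_bucket(pct, breaks=SHARED_BREAKS):
    """Return the 0-based class for a percentage; a value equal to a boundary stays in the lower class."""
    return bisect_left(breaks, pct)


# Made-up counts for two decades, for illustration only.
early = {"Pennsylvania": 60, "New York": 30, "Ireland": 10}
late = {"California": 500, "Dominican Republic": 300, "Pennsylvania": 150, "Japan": 50}

early_buckets = {p: shared_bucket(v) for p, v in to_percentages(early).items()}
late_buckets = {p: shared_bucket(v) for p, v in to_percentages(late).items()}
print(early_buckets)
print(late_buckets)
```

In this made-up sample, Pennsylvania has more players in the later decade but a smaller share of the total, so it drops to a lighter class.  A scale based on raw counts would hide that shift.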