Men’s Tennis Stars: 1991-2016


The Big Three in men’s tennis. Source: Grand Slam Courts

Introduction

Every year, it seems like it’s the same handful of tennis players vying for the championship of the four major tournaments. I’m not a big tennis fan, but I could easily list out the top five or so tennis players in the world, only because they haven’t changed in the past two decades — or so it seems.

Sports data lends itself easily to network visualizations: it’s often the same players or teams facing each other in a tournament or championship series. And because tournaments use elimination rounds, the top players amass connections (edges) every time they advance to the next round. By mapping out the men’s tennis matches from the past few decades (1991-2016), I hoped to quickly identify the dominant players who have demonstrated remarkable career longevity.

Methods

The dataset I found comes from a larger source of ATP World Tour data on Data Hub. The data comes directly from the ATP Tour website, and is organized by years/decades: 1877-1967, 1968-1990, 1991-2016, and 2017 (when it was last updated). I am only focusing on the 1991-2016 dataset, since those are the players that I know best. (Note: The ATP dataset only includes men’s tennis, not women’s — it would be interesting to compare the two, however.)

First, I cleaned and prepped the dataset so it was ready to be imported into Gephi. Using R, I removed all columns except for three: Winner, Loser, and Tournament. Then I renamed the “Winner” and “Loser” columns to “Source” and “Target,” respectively, so that each edge points from the winner of a match to the loser. I also examined the “Tournament” variable and saw that it contained 125 different values. I left those in for now and quickly imported the table into Gephi just to see what kind of result I would get. I made it a directed graph and ranked the nodes by out-degree (the number of edges leaving a node, which here means wins) to visualize the winners, using both the size and color of each node to indicate significance.
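I did this step in R, but the same column selection and renaming can be sketched in Python with only the standard library. This is a minimal illustration, not the script I ran: the sample rows and player names are placeholders, and the extra columns (Round, Score) are assumptions about what the raw table might contain.

```python
import csv
import io

# Placeholder rows standing in for the ATP match table (hypothetical data;
# the real file has many more columns and rows).
RAW = """Tournament,Round,Winner,Loser,Score
Wimbledon,F,Player A,Player B,6-4 7-5
Bangkok,SF,Player C,Player D,7-6 6-3
"""

def to_edge_list(raw_csv):
    """Keep Winner, Loser, and Tournament; rename the first two to Source/Target."""
    rows = csv.DictReader(io.StringIO(raw_csv))
    return [
        {"Source": r["Winner"], "Target": r["Loser"], "Tournament": r["Tournament"]}
        for r in rows
    ]

edges = to_edge_list(RAW)
for edge in edges:
    print(edge)
```

Gephi reads the Source and Target columns as the two ends of an edge, so the Tournament column comes along as an edge attribute.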

Out-degree graph of dataset containing all tournaments.

Sure enough, the results showed some predictable names (Roger Federer, Andre Agassi) but also a lot of surprises (Tommy Haas? Lleyton Hewitt as the central figure?). It was also odd to see someone like Pete Sampras represented in a smaller node than someone like Carlos Moya.

I went back to the dataset and filtered by just the four major tournaments: U.S. Open, Australian Open, Wimbledon, and French Open (Roland Garros). There are a lot of small tournaments (in Beijing, Bangkok, Tel-Aviv, Las Vegas, etc.) that the major players simply don’t participate in, and including those matches alongside the major ones would skew the results.
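The filter itself is a one-liner once the edge list exists. Here is a small Python sketch of the idea; the tournament labels in MAJORS are my assumption about how the names are spelled, and the actual dataset may label them differently.

```python
# Assumed spellings of the four majors; check these against the real
# Tournament column before relying on them.
MAJORS = {"Australian Open", "Roland Garros", "Wimbledon", "US Open"}

# Placeholder edge list (hypothetical players).
edges = [
    {"Source": "Player A", "Target": "Player B", "Tournament": "Wimbledon"},
    {"Source": "Player C", "Target": "Player D", "Tournament": "Bangkok"},
    {"Source": "Player A", "Target": "Player C", "Tournament": "US Open"},
]

def keep_majors(edge_list, majors=MAJORS):
    """Return only the matches played at one of the four major tournaments."""
    return [e for e in edge_list if e["Tournament"] in majors]

major_edges = keep_majors(edges)
print(len(major_edges), "of", len(edges), "matches kept")
```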

Results

With this smaller dataset, I now have 1355 nodes and 16099 edges. The average degree per node is 11.881, the network diameter is 10, and the graph density is 0.009, or only 0.9%, which is pretty low. I created two graphs with directed edges: one for out-degree (to see how many times a player has won) and one for in-degree (to see how many times a player has lost).
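The average degree and density can be checked by hand from the node and edge counts. The sketch below assumes the usual directed-graph formulas (edges divided by nodes, and edges divided by the number of possible ordered pairs); the reported figures are consistent with them.

```python
nodes = 1355
edges = 16099

# Directed graph: each edge contributes one out-degree and one in-degree,
# so the average out-degree (or in-degree) is edges / nodes.
avg_degree = edges / nodes

# Density: actual edges over the n * (n - 1) possible directed pairs.
density = edges / (nodes * (nodes - 1))

print(round(avg_degree, 3))  # 11.881
print(round(density, 3))     # 0.009
```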

Out-degree graph of dataset with the four major tournaments.

The out-degree graph now makes much more sense: As suspected, the “big three” of men’s tennis are the prominent nodes: Federer, Nadal, and Djokovic. These three players have been the dominant men’s tennis players since 2003 (through today!). Federer seems to be the top player of the group, represented in a larger node, and the close proximity of Nadal and Djokovic seems to indicate that they often play each other — they are also known to be each other’s primary rival.

Because this dataset starts in 1991, the graph also shows the main players of the 1990s and early 2000s, namely Andre Agassi and Pete Sampras, found in the lower left section of the graph. Other nodes in the region all belong to that same era: Michael Chang, Todd Martin, etc.

In-degree graph of dataset with the four major tournaments.

The in-degree graph shows which players competed often in these four major tournaments and accumulated the most losses. Here, no one figure really stands out; most of the nodes seem to be roughly the same small/medium size. It’s noticeable, however, how much the prominent figures from the out-degree graph have now shrunk into much smaller nodes.
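To make the two measures concrete, here is a small Python sketch that counts out-degree (wins) and in-degree (losses) from a winner-to-loser edge list. The players are placeholders, and it assumes every match is kept as its own edge; if repeated pairings are merged on import, the counts would reflect distinct opponents instead.

```python
from collections import Counter

# Each tuple is (winner, loser); hypothetical matches.
matches = [
    ("Player A", "Player B"),
    ("Player A", "Player C"),
    ("Player B", "Player C"),
    ("Player A", "Player B"),
]

out_degree = Counter(winner for winner, _ in matches)  # wins per player
in_degree = Counter(loser for _, loser in matches)     # losses per player

for player in sorted(set(out_degree) | set(in_degree)):
    print(player, "wins:", out_degree[player], "losses:", in_degree[player])
```

A player who wins often and loses rarely, like Player A here, is large in the out-degree graph and small in the in-degree graph.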

Out-degree graph with edges color-coded by tournament. Purple = French Open; Green = US Open; Orange = Wimbledon; Blue = Australian Open.

I was also curious to see if I could partition and color code the edges by tournament. I know from the dataset that each tournament accounts for roughly 25% of the total, so they’re all evenly represented. The graph, however, did not give me any meaningful insight. From this visualization, it doesn’t seem like any of the prominent players (Federer, Nadal, Djokovic, Agassi, Sampras) excelled at one particular tournament more than at the other three.
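Since the colored edges were hard to read, a plain tally of wins per player per tournament might answer the same question more directly. This is only a sketch of that alternative, which I did not run on the real data; the players and match counts below are placeholders.

```python
from collections import Counter, defaultdict

# Placeholder edge list (hypothetical players and results).
edges = [
    {"Source": "Player A", "Target": "Player B", "Tournament": "Wimbledon"},
    {"Source": "Player A", "Target": "Player C", "Tournament": "Wimbledon"},
    {"Source": "Player A", "Target": "Player D", "Tournament": "Wimbledon"},
    {"Source": "Player A", "Target": "Player B", "Tournament": "US Open"},
    {"Source": "Player B", "Target": "Player C", "Tournament": "Roland Garros"},
]

def wins_by_tournament(edge_list):
    """Map each winner to a Counter of wins per tournament."""
    table = defaultdict(Counter)
    for e in edge_list:
        table[e["Source"]][e["Tournament"]] += 1
    return table

def shares(counter):
    """Convert win counts to each tournament's share of a player's total wins."""
    total = sum(counter.values())
    return {name: count / total for name, count in counter.items()}

table = wins_by_tournament(edges)
print(shares(table["Player A"]))
```

A share well above 25% at one tournament would point to a player who does unusually well there.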

Reflections

Visualizing the matches of the four major tennis tournaments as a network allowed me to quickly identify the players that have dominated the sport (for men) since 1991. It also suggests that a tennis player’s successful career can be very long, lasting even decades. Looking at the careers of individual players, this seems to be true. In 2018, Federer won his 20th Grand Slam title at the age of 36, a record he shares with Nadal. Djokovic trails closely behind at 19 titles. Serena Williams won her 23rd title at the age of 35. This may be due to advancements in medicine and equipment, as well as players training to play more skillfully rather than relying as much on their athleticism.