Exploring Movies on Streaming Platforms with RStudio



Introduction & Inspiration

Watching movies has long been one of my main forms of entertainment. It helps me relax when I’m stressed and enriches my life when I’m bored.

Recently, I was running short of movies and dramas to watch, but I wasn’t sure which streaming platform to explore. That inspired me to investigate and visualize data about movies on streaming platforms for this project. I also almost always take a look at a movie’s rating before I watch it, so in this project I’m also curious to find out how the rating systems differ from one another.

Research questions

I listed the following research questions to guide this project:

  1. Which platform gives people the most movies to explore in a given genre?
  2. Which platform gives people the most movies to explore for a given age group?
  3. Which rating platform is harsher, IMDb or Rotten Tomatoes? Which streaming platform has better-rated movies?
  4. How has the quality of movies changed over time according to the two rating platforms?

Process & Methods:

Step 1. Finding datasets

To create the dashboard, I found a dataset on Kaggle about movies on Netflix, Prime Video, Hulu, and Disney+ and their corresponding ratings from IMDb and Rotten Tomatoes. The dataset is a CSV file.

The initial fields include the following dimensions: 

  • #: index of a row
  • ID: Unique movie ID
  • Title
  • Year: the year in which the movie was produced
  • Age: target age group
  • IMDb: IMDb rating out of 10
  • Rotten Tomatoes: Rotten Tomatoes%
  • Netflix: Whether the movie is found on Netflix. 0 indicates no, 1 means yes.
  • Hulu: Whether the movie is found on Hulu
  • Prime Video: Whether the movie is found on Prime Video
  • Disney +: Whether the movie is found on Disney +

Step 2. Tidy up data

  • Used Excel to remove rows that have empty cells, which left 5080 rows, half of the original data.
  • Used OpenRefine to transpose the Netflix, Hulu, Prime Video, and Disney+ columns into two columns, “Platform” and “Value”. Kept the rows where Value = 1, since they indicate a platform the movie is actually shown on, and deleted the rows where Value = 0. Then deleted the Value column, since the Platform column holds all the information needed.
  • Split the Language column on “,”, transposed the resulting columns, and deleted the columns no longer needed.
  • Split the Genres column on “,”, transposed the resulting columns, and deleted the columns no longer needed.
  • Used OpenRefine to strip the “/100” and “/10” suffix from each rating value, then used Excel to convert the IMDb and Rotten Tomatoes ratings to percentages.
  • After the cleanup, the data has 39673 rows of records.
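The tidy-up above was done in Excel and OpenRefine. For comparison, here is a minimal sketch of how the same steps could look in base R. The toy data frame and its column names (PrimeVideo, DisneyPlus, RottenTomatoes) are assumptions for illustration, not the real Kaggle file.

```r
# Toy stand-in for the Kaggle CSV; column names and values are made up.
movies <- data.frame(
  Title = c("A", "B", "C"),
  Year = c(2001, 2015, 2019),
  IMDb = c("7.5/10", "6.0/10", NA),
  RottenTomatoes = c("80/100", "55/100", "90/100"),
  Netflix = c(1, 0, 1),
  Hulu = c(0, 1, 0),
  PrimeVideo = c(1, 1, 0),
  DisneyPlus = c(0, 0, 1),
  Genres = c("Drama,Comedy", "Action", "Family"),
  stringsAsFactors = FALSE
)

# 1. Drop rows that have any empty cell.
movies <- movies[complete.cases(movies), ]

# 2. Strip the "/10" and "/100" suffixes and put both ratings on a 0-100 scale.
movies$IMDb <- as.numeric(sub("/10$", "", movies$IMDb)) * 10
movies$RottenTomatoes <- as.numeric(sub("/100$", "", movies$RottenTomatoes))

# 3. One row per movie-platform pair: keep only the flags that equal 1.
platforms <- c("Netflix", "Hulu", "PrimeVideo", "DisneyPlus")
keep_cols <- c("Title", "Year", "IMDb", "RottenTomatoes", "Genres")
idx <- which(as.matrix(movies[, platforms]) == 1, arr.ind = TRUE)
long <- data.frame(
  movies[idx[, "row"], keep_cols],
  Platform = platforms[idx[, "col"]],
  stringsAsFactors = FALSE
)

# 4. One row per genre: split the comma-separated Genres column.
genre_list <- strsplit(long$Genres, ",", fixed = TRUE)
tidy <- long[rep(seq_len(nrow(long)), lengths(genre_list)), ]
tidy$Genre <- trimws(unlist(genre_list))
tidy$Genres <- NULL
rownames(tidy) <- NULL

print(tidy)
```

Because each movie is repeated once per platform and per genre, the row count grows well beyond the number of distinct titles, which is consistent with the jump from 5080 to 39673 rows described above.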

Step 3. Calculate and create graphs using RStudio

Using the tidied data, I first used RStudio to calculate the values I needed, including the average ratings by year (“df_avgIMDb_byYear” and “df_RT_byYear”). Then, to answer the four research questions, I used ggplot2 to create six graphs that examine the number of movies by genre, show the distribution of movies on streaming platforms by age, and compare the two rating systems by platform and by year.
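As an illustration of the by-year calculation, here is a small base R sketch. The data frame names df_avgIMDb_byYear and df_RT_byYear come from the text above; the toy data, the column names mean_IMDb and mean_RT, and the de-duplication step are assumptions, since the post does not say whether repeated rows were collapsed before averaging.

```r
# Toy tidy data: movie "A" appears twice because it is on two platforms.
tidy <- data.frame(
  Title = c("A", "A", "B", "C"),
  Year = c(2000, 2000, 2000, 2010),
  IMDb = c(70, 70, 50, 80),
  RottenTomatoes = c(90, 90, 50, 60),
  stringsAsFactors = FALSE
)

# Assumption: count each title once so multi-platform movies are not overweighted.
movies_unique <- tidy[!duplicated(tidy$Title), ]

# Average rating per production year for each rating system.
df_avgIMDb_byYear <- aggregate(IMDb ~ Year, data = movies_unique, FUN = mean)
names(df_avgIMDb_byYear)[2] <- "mean_IMDb"

df_RT_byYear <- aggregate(RottenTomatoes ~ Year, data = movies_unique, FUN = mean)
names(df_RT_byYear)[2] <- "mean_RT"

print(df_avgIMDb_byYear)
print(df_RT_byYear)
```

Without the de-duplication line, a movie that sits on several platforms or in several genres would count several times in its year's average.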

Step 4. Peer review and difficulties I encountered

RStudio has a very steep learning curve, so I ran into many difficulties along the way.

  1. When I sent my RPubs link to Kailen for review, the axis labels were crowded together in the first two images (Number of Movies by Genre and Number of Movies by Age) because of the fixed image dimensions and the lack of spacing between lines. Kailen suggested setting fig.dim = c(width, height) in the markdown file to see if the axis labels could be spaced out more, but it didn’t quite work. I then tried changing the spacing between bars, but that did not affect the spacing between labels, so it failed again. In the end, for the first image (Number of Movies by Genre), I decided to show only the top 12 genres by number of movies, which left enough space for the labels. With 12 genres, I was also able to switch the color palette to “Set3”, since that set has 12 colors.
  2. For the second image (Number of Movies by Age), I took a different approach and changed one row of five graphs into two rows, which gave more space for the platform labels on the x-axis. However, “Prime Video” still overlapped “Netflix” because of its length, so I used scale_x_discrete to shorten the label and show all labels with comfortable spacing.
  3. For the two graphs about the Distribution of Movies in the Two Rating Systems, the default breaks on the x-axis differed: the IMDb graph had breaks every 25 points while the Rotten Tomatoes graph had breaks every 20, which made the two images hard to compare. After some research, I realized that I could use scale_x_continuous(breaks = c(40, 60, 80, 100)) to make the breaks consistent manually. I also wanted to show the average rating under the x-axis along with the other labels, but I was unable to figure out how, so the best I could do was to insert the number into the graphs manually with geom_text.
  4. For the last two figures (Average IMDb Ratings vs Rotten Tomatoes Ratings by Year), I didn’t figure out how to display the equation of the regression line, and for some reason I couldn’t change the x-axis label. I tried “+ labs” and “+ scale_x_continuous”, but neither worked, so the x label is still shown as “~Year ~mean_IMDb”.
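The fixes described in items 1 to 3 can be sketched with ggplot2 as follows. This is a minimal illustration on made-up data, not the code behind the published figures: the toy data frame, its column names, and the choice of geom_density for the rating distribution are assumptions.

```r
library(ggplot2)

# Toy data: 15 genres with decreasing counts (15, 14, ..., 1 movies).
toy <- data.frame(
  Platform = rep(c("Netflix", "Hulu", "Prime Video", "Disney+"), length.out = 120),
  Age = rep(c("all", "7+", "13+", "16+", "18+"), length.out = 120),
  Genre = rep(paste("Genre", 1:15), times = 15:1),
  IMDb = seq(30, 95, length.out = 120),
  stringsAsFactors = FALSE
)

# (1) Keep only the 12 most frequent genres so the axis labels have room.
top12 <- head(names(sort(table(toy$Genre), decreasing = TRUE)), 12)
toy_top <- toy[toy$Genre %in% top12, ]

p_genre <- ggplot(toy_top, aes(x = Genre, fill = Genre)) +
  geom_bar() +
  coord_flip() +                        # genres end up on the vertical axis
  facet_wrap(~ Platform, nrow = 1) +
  scale_fill_brewer(palette = "Set3")   # Set3 provides up to 12 colours

# (2) Two rows of panels, plus a shortened label for the longest platform name.
p_age <- ggplot(toy, aes(x = Platform)) +
  geom_bar() +
  facet_wrap(~ Age, nrow = 2) +
  scale_x_discrete(labels = function(x) sub("Prime Video", "Prime", x))

# (3) Fixed breaks, a dotted line at the mean, and the mean printed on the plot.
avg_imdb <- mean(toy$IMDb)
p_imdb <- ggplot(toy, aes(x = IMDb)) +
  geom_density(fill = "gold") +
  geom_vline(xintercept = avg_imdb, linetype = "dotted") +
  annotate("text", x = avg_imdb, y = Inf, vjust = 1.5,
           label = round(avg_imdb, 2)) +
  scale_x_continuous(breaks = c(40, 60, 80, 100))
```

Using annotate() instead of geom_text() draws the label once rather than once per data row, which avoids overplotted text. Reusing the same scale_x_continuous() call on the Rotten Tomatoes plot keeps the two x-axes comparable.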

Results — Visualizations & Interpretations

In this section, I will analyze and discuss the graphs I created with R. 

1. Prime Video has the most Dramas while Disney+ provides the most Family movies

Figure 1. Distribution of Movies in the Top 12 Genres on Streaming Platforms

The graphs show the top 12 genres based on the number of movies. The genres on the y-axis are ordered by total number of movies. Comparing the number of movies by genre across the four streaming platforms, we can see that Drama is the largest genre on Hulu, Netflix, and Prime Video, and that Prime Video has the most Drama movies of the three. Netflix and Prime Video have a similar number of Comedy and Action movies. In contrast, Disney+ has a significantly higher number of Family and Animation movies.

2. Prime Video has the most 18+ movies while Disney+ has the most movies for all age groups

Figure 2. Distribution of Movies on Streaming Platforms by Age

Based on the graphs, we can tell that Prime Video, Netflix, and Hulu mainly target adults and young adults. Not surprisingly, Disney+ provides the largest number of movies that can be watched by all age groups, making it a kid-friendly platform.

3. Rotten Tomatoes has a slightly higher average score for movie ratings, making IMDb the harsher one

Figure 3. Distribution of Movies in terms of Two Rating Systems in Different Platforms

The two graphs above show the distribution of movie ratings on the four streaming platforms as rated by IMDb and Rotten Tomatoes. On IMDb, all films are given an overall rating out of ten. The ratings are derived from votes submitted by IMDb users, not movie critics. Rotten Tomatoes, in contrast, gives films a score out of 100 based on reviews from professional film critics. Here, to make the ratings consistent and easy to compare, all ratings are converted to percentages so that they are all out of 100.

To make it obvious which graph belongs to which rating system, I used gold for the IMDb ratings, since it is IMDb’s logo color, and tomato for the Rotten Tomatoes ratings, so that readers can tell the two rating platforms apart at first sight.

Calculating the average movie rating for each system shows that Rotten Tomatoes has an average rating of 64.76, slightly higher than IMDb’s average rating of 63.97, which makes IMDb the harsher rating platform.

By marking the average score on the graph (the dotted line), we can see whether the rating distribution for each platform sits above or below the average. For IMDb ratings, Disney+ and Hulu ratings are distributed to the right of the average IMDb rating, while Netflix and Prime Video ratings are distributed around the average, though Prime Video has a longer left tail. The per-platform averages calculated in R confirm this: Disney+ and Hulu are above the overall average, Netflix is about the same, and Prime Video is much lower. This suggests that Disney+ and Hulu have higher-quality movies according to IMDb.

Looking at the Rotten Tomatoes ratings, Hulu and Disney+ ratings are distributed to the right of the average line, Netflix ratings are distributed around the average, and Prime Video ratings are distributed to the left of the average line. The calculated mean for each platform confirms this: Disney+ and Hulu have significantly higher average ratings, Netflix is about average, and Prime Video is much lower than average. This means that even though Hulu has the fewest movies (according to Figures 1 and 2), it offers the best-quality movies according to Rotten Tomatoes, so it is worth exploring movies there. In contrast, though Prime Video has the highest number of movies, their quality is below average according to the Rotten Tomatoes ratings.
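The comparison of each platform's mean against the overall mean can be reproduced with a few lines of base R. The sketch below uses made-up ratings purely to show the mechanics; the numbers are not the project's data and the column names are assumptions.

```r
# Toy ratings on a 0-100 scale; values are invented for illustration only.
ratings <- data.frame(
  Platform = rep(c("Netflix", "Hulu", "Prime Video", "Disney+"), each = 2),
  IMDb = c(60, 66, 70, 72, 50, 58, 68, 74),
  RottenTomatoes = c(62, 68, 75, 79, 48, 56, 70, 78),
  stringsAsFactors = FALSE
)

# Overall mean of each rating system (the dotted line in the plots).
overall <- colMeans(ratings[, c("IMDb", "RottenTomatoes")])

# Mean per platform, then the gap between each platform and the overall mean.
by_platform <- aggregate(cbind(IMDb, RottenTomatoes) ~ Platform,
                         data = ratings, FUN = mean)
by_platform$IMDb_gap <- by_platform$IMDb - overall[["IMDb"]]
by_platform$RT_gap <- by_platform$RottenTomatoes - overall[["RottenTomatoes"]]

print(by_platform)
```

A positive gap means the platform's movies are rated above the overall average for that rating system, and a negative gap means below.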

4. Average IMDb ratings decrease for more recent movies while average Rotten Tomatoes ratings increase with production year

Figure 4. Average IMDb Ratings vs Rotten Tomatoes Ratings by Year

To find out how the quality of movies has changed over time, I used scatter plots to show the average score for each rating system in each year from 1920 to 2020 and drew a regression line showing how the average rating has changed over time on each rating platform. Interestingly, IMDb shows a downward-sloping regression line, suggesting that the quality of movies has decreased over time, while Rotten Tomatoes shows an upward-sloping line, suggesting that it has increased.
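For readers who want to reproduce this kind of plot, here is a minimal ggplot2 sketch of a by-year scatter plot with a fitted line, its equation printed on the plot, and explicit axis labels. The toy values and the column name mean_IMDb are assumptions; the same pattern would apply to the Rotten Tomatoes averages.

```r
library(ggplot2)

# Toy yearly averages with a built-in downward trend (illustrative values only).
by_year <- data.frame(Year = seq(1920, 2020, by = 10))
by_year$mean_IMDb <- 70 - 0.05 * (by_year$Year - 1920) +
  rep(c(0.5, -0.5), length.out = nrow(by_year))

# Fit the regression separately so the coefficients can be shown as text.
fit <- lm(mean_IMDb ~ Year, data = by_year)
b <- coef(fit)
eq <- sprintf("y = %.2f %+.3f x", b[["(Intercept)"]], b[["Year"]])

p_year <- ggplot(by_year, aes(x = Year, y = mean_IMDb)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.5, label = eq) +
  labs(x = "Year", y = "Average IMDb rating (out of 100)")
```

With the aesthetics mapped through aes(x = Year, y = mean_IMDb), labs() sets the axis titles directly. The sign of the Year coefficient in eq shows whether the trend is rising or falling.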

Curious about the difference, I conducted secondary research to find potential factors. According to Global News, one crucial factor behind the rising Rotten Tomatoes scores may be the rating platform’s close financial ties with the movie industry. The website, launched in 1998 by three recent Berkeley grads, was purchased by Warner Bros. in 2011. And in 2016, Comcast (which also owns NBCUniversal) acquired a 70% stake through a deal that turned Rotten Tomatoes into a division of the ticket vendor Fandango. These media groups produce many of the movies and TV shows that are rated on the site.

Moreover, unlike IMDb, where anyone can rate a movie, Rotten Tomatoes ratings are mainly based on professional film critics, most of whom are freelancers nowadays. These critics rely heavily on studios for access to advance press screenings, which is something tenured critics at big publications don’t need to think about. Therefore, freelancers may hesitate to pan a film out of fear of drawing the studio’s anger. As one freelance critic said, “If I continuously slam, for justifiable reasons, a given series of films from a given studio, I may stop being invited by that studio.” This may have contributed to higher Rotten Tomatoes ratings in more recent years.

The growth of internet critics writing for specific audiences could also potentially help explain the rising Rotten Tomatoes scores. For instance, if the critic writes a review aimed at horror fans, they are most likely judging the film based on how it appeals to that specific audience, which is different from critics from 20 years ago who had to review every film that came out every weekend.

Overall, higher rating scores for more recent movies can help boost box-office results, since roughly one in four moviegoers use Rotten Tomatoes to decide whether to go see a movie. But the rising Rotten Tomatoes scores can be confusing for viewers who use the site to decide what to watch. For example, this spring’s Godzilla vs Kong, a movie about a giant CGI ape brought out of retirement to battle a giant CGI lizard, had a score of 79/100 in the lead-up to its opening weekend (it has since dropped to 75/100). That’s a higher score than 14 best picture winners, including Forrest Gump (1994): 71/100; Gladiator (2000): 77/100; and Braveheart (1995): 78/100. In contrast, the IMDb rating for Godzilla vs Kong is 6.4/10 and it’s 8.8/10 for Forrest Gump.

Reflection

Exploring movies on different platforms and comparing the rating platforms was fun and intriguing. It gave me some insight into which platforms I should explore movies on.

However, using RStudio was a bit tough for me because of its steep learning curve. Although it is very convenient for calculations such as finding the average rating or the number of movies in a certain genre, I found it very difficult to use for cleaning data, so I switched to OpenRefine and Excel. Compared with RStudio, Tableau is a lot easier to learn for data visualization and lets people with no coding experience create simple, interactive graphics. R can be confusing when installing the packages needed and figuring out the right functions to use; it can take hours just to figure out how to adjust the labels on the x-axis, and it is hard for beginners to work out how to generate interactive charts. Yet R is more flexible for creating different kinds of charts, and it is quick and easy to load data and update or knit charts. In sum, I would choose Tableau to create simple, interactive information visualizations, and I would use RStudio for exploratory analysis.

Future Direction

Though IMDb provides a relatively more useful and realistic rating for films, particularly older and less well-known movies, that doesn’t mean its ratings are entirely fair. In my secondary research, I found that raters on the rating platforms are not balanced by gender, which skews the rating sites pretty heavily towards the opinions of men. A close look at the breakdown of movie ratings shows that men consistently rate masculine films higher than films that feature female leads or more traditionally female themes. For instance, the IMDb ratings for Sex and the City show that over 29,000 men gave the film an average rating of 5.8, while 43,000 women came up with a score of 8.1. A straight-up averaging of the scores gives it a ranking of 7.4, but IMDb’s maths leaves it with a final score of 7. Thus, in the future, I’m interested in investigating how the demographics of the rating platforms may affect movie ratings.

Another thing I found interesting is how Rotten Tomatoes ranks movies. For its main rankings, Rotten Tomatoes only takes into account reviews from approved critics and approved publications. It also weights its rankings by how many reviews a film has instead of using only the average rating score. In addition, not surprisingly, most of Rotten Tomatoes’ selected critics are men (78%), and 82% of reviews were written by white critics. Therefore, I’d like to explore the various factors that may influence movie rankings, including critics’ backgrounds, the number of ratings, and demographics.

Finally, since the rating systems have different requirements for raters, I’m interested in building a rating system that combines ratings from IMDb, Rotten Tomatoes, and other platforms such as Douban, Metacritic, and Letterboxd. It would also be highly useful to show which platform the top-rated movies are available on. I believe it would save movie seekers a lot of time and trouble.

References

Bhatia, Ruchi. “Movies on Netflix, Prime Video, Hulu and Disney+.” Kaggle, 2 Aug. 2021, https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney.

“GGPLOT2 Colors : How to Change Colors Automatically and Manually?” STHDA, http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually.

“ggplot2 Quick Reference: Colour (and Fill).” Software and Programmer Efficiency Research Group, http://sape.inf.usi.ch/quick-reference/ggplot2/colour.

“How Does R Compare against Tableau for Data Visualisation?” Quora, https://www.quora.com/How-does-R-compare-against-Tableau-for-data-visualisation.

Maxhartshorn. “Movies Are Scoring Higher and Higher on Rotten Tomatoes – but Why?” Global News, 15 June 2021, https://globalnews.ca/news/7947449/movies-are-scoring-higher-and-higher-on-rotten-tomatoes-but-why/.

“Plotting Distributions (ggplot2).” Cookbook for R, http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/.

Reynolds, Matt. “You Should Ignore Film Ratings on IMDb and Rotten Tomatoes.” WIRED UK, 24 Oct. 2017, https://www.wired.co.uk/article/which-film-ranking-site-should-i-trust-rotten-tomatoes-imdb-metacritic.

“Scale_x_continuous: Continuous Position Scales (X & Y).” RDocumentation, https://www.rdocumentation.org/packages/ggplot2/versions/1.0.0/topics/scale_x_continuous.

“SCALE_X_DISCRETE: Discrete Position.” RDocumentation, https://www.rdocumentation.org/packages/ggplot2/versions/0.9.0/topics/scale_x_discrete.

“Which Movie Website Is Better: IMDb or Rotten Tomatoes? Why?” Quora, https://www.quora.com/Which-movie-website-is-better-IMDb-or-Rotten-Tomatoes-Why.

“Why Use R When You Have Tableau? Tableau vs. R?” Nandeshwar.info, 3 Dec. 2019, https://nandeshwar.info/data-science-2/tableau-vs-r/.