Introduction
I bought a new Lego set for the first time in about 20 years and was amazed at the level of sophistication and complexity the new kits have. They’re more modular, sturdier, and more intricately put together than any kit I remember from childhood. I wanted to investigate whether Lego sets have actually gotten more complicated or only seem that way because I ignored instructions as a child and built whatever I wanted. Data for this project comes from the Rebrickable repository, which I found through the TidyTuesday repository on GitHub. I visualized the data in RStudio with a handful of packages, including ggplot2. I imported each data set from Rebrickable, but only generated charts from the sets, colors, and themes tables.
Tools and Methodology
Data was imported directly from Rebrickable according to the manual upload instructions in the TidyTuesday README. For example, I got the “themes” table by entering the following command into the RStudio console:
themes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-06/themes.csv.gz')
I imported each table available on Rebrickable into RStudio to familiarize myself with the overall dataset, then used ggplot2 to visualize the information available.
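As a rough sketch, the same import could be wrapped in a small helper so that each table is read the same way. The helper names table_url and read_lego_table are hypothetical, and the sketch assumes the other tables (for example "sets" and "colors") sit next to themes.csv.gz under the same URL pattern.

```r
library(readr)

# Base URL taken from the themes command above.
base_url <- paste0(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/",
  "master/data/2022/2022-09-06/"
)

# Hypothetical helper: build the download URL for one table name.
table_url <- function(table_name) {
  paste0(base_url, table_name, ".csv.gz")
}

# Hypothetical helper: download and parse one table.
read_lego_table <- function(table_name) {
  readr::read_csv(table_url(table_name))
}

# Example calls (these need network access, so they are left commented out):
# themes <- read_lego_table("themes")
# sets   <- read_lego_table("sets")    # assumed to follow the same pattern
# colors <- read_lego_table("colors")  # assumed to follow the same pattern
```

Keeping the URL construction separate from the download makes it easy to check the address before fetching anything.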
Visualizations
The first thing I visualized to gauge complexity was the number of parts in each set over the years since data collection started in 1965. I initially created a scatter plot and then switched to a 2D bin plot in order to capture how many data points exist in any given year.
The data shows an increase in larger sets with more pieces. The bin distribution helps us see that even though there are more large sets, the overall distribution still skews towards smaller sets.
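A 2D bin plot of this kind can be sketched with ggplot2's geom_bin2d. The example below uses a small made-up data frame in place of the real sets table, and the column names year and num_parts are assumptions about how that table is laid out.

```r
library(ggplot2)

# Toy stand-in for the sets table; the values are invented for illustration.
# `year` and `num_parts` are assumed column names.
sets_demo <- data.frame(
  year      = c(1965, 1965, 1990, 1990, 2020, 2020, 2020),
  num_parts = c(20, 35, 150, 40, 30, 900, 2500)
)

# Each rectangle is shaded by how many sets fall inside it, which keeps
# crowded regions readable where a scatter plot would overplot.
parts_by_year <- ggplot(sets_demo, aes(x = year, y = num_parts)) +
  geom_bin2d(bins = 30) +
  labs(x = "Year", y = "Number of parts", fill = "Sets")

# print(parts_by_year)  # uncomment to draw the plot
```

With the real table, a log scale on the y axis (scale_y_log10) might make the many small sets easier to tell apart, though that is a design choice rather than something the original charts used.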
Next, I wanted to look at sets by theme to see how those had expanded over time. For example, “Castle” or “Star Wars” are overarching themes that many sets belong to. I created a count chart of the parent IDs in the “themes” table to group sets into their respective themes. I created another 2D bin plot of theme IDs by year to see how large the Lego ecosystem had gotten.
Gaps in the count chart represent parent sets, while bars represent sets that belong to a parent set. The distribution shows that more sets exist within a theme than on their own. The bin plot of themed sets by year shows a general increase in the range and number of sets. It also shows a concentration of sets within certain themes, particularly in the 500-600 ID range, which includes Educational, Collectibles, and licensed sets such as Star Wars, Ninjago, and Disney. While this doesn’t necessarily speak to the complexity of any given Lego set, the concentration of sets within themes suggests an interoperability of pieces across a large distribution of sets.
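The two theme charts described above might be built along these lines. This is a minimal sketch on invented data: it assumes the themes table has id, name, and parent_id columns, that the sets table has year and theme_id, and that the count chart was a bar chart (geom_bar).

```r
library(ggplot2)

# Toy stand-in for the themes table; top-level themes have no parent (NA).
themes_demo <- data.frame(
  id        = c(1, 2, 3, 4, 5),
  name      = c("Castle", "Knights", "Star Wars", "Episode I", "Clone Wars"),
  parent_id = c(NA, 1, NA, 3, 3)
)

# Toy stand-in for the sets table.
sets_demo <- data.frame(
  year     = c(1985, 1999, 1999, 2010, 2010),
  theme_id = c(2, 4, 4, 5, 5)
)

# Number of child themes under each parent (NA parents are dropped by table()).
children_per_parent <- table(themes_demo$parent_id)

# Count chart: one bar per parent ID, with height equal to its child count.
theme_counts <- ggplot(
  themes_demo[!is.na(themes_demo$parent_id), ],
  aes(x = parent_id)
) +
  geom_bar() +
  labs(x = "Parent theme ID", y = "Number of child themes")

# 2D bin plot of theme IDs by year.
themes_by_year <- ggplot(sets_demo, aes(x = year, y = theme_id)) +
  geom_bin2d() +
  labs(x = "Year", y = "Theme ID", fill = "Sets")
```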
Limitations
Unfortunately, I could not find data on the number of steps or pages of instructions for any given set, which could indicate a change in complexity over time. Additionally, the dataset was somewhat challenging to work with. For one, because I imported each table manually into RStudio, I did not know how to combine the tables. A larger issue was a lack of common fields across tables. Most of the tables had attributes unique to themselves, and few had longitudinal data, which made mapping data over time or combining information on minifigs, themes, and sets a difficult task.
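One possible way to combine two of the tables, sketched here on invented data, is a dplyr left join. It rests on an assumption not verified in this project: that the sets table carries a theme_id column whose values match the id column of the themes table.

```r
library(dplyr)

# Toy stand-ins for the sets and themes tables; all values are invented.
sets_demo <- tibble(
  set_num   = c("a-1", "b-1", "c-1"),
  name      = c("Fort", "Ship", "Tower"),
  year      = c(1990, 2020, 2021),
  theme_id  = c(1, 2, 1),
  num_parts = c(100, 300, 200)
)
themes_demo <- tibble(
  id   = c(1, 2),
  name = c("Castle", "Star Wars")
)

# Assumed key: sets_demo$theme_id matches themes_demo$id.
# Both tables have a `name` column, so suffixes keep them apart.
sets_with_theme <- sets_demo %>%
  left_join(themes_demo, by = c("theme_id" = "id"),
            suffix = c("_set", "_theme"))

# Example summary: median part count per theme.
parts_by_theme <- sets_with_theme %>%
  group_by(name_theme) %>%
  summarise(median_parts = median(num_parts), .groups = "drop")
```

If the assumption holds for the real tables, the joined result would attach a year to every theme through its sets, which is the kind of longitudinal view the individual tables lack.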
It does appear that Lego sets are getting more complicated, at least in the sense that there are more pieces and the sets belong to larger themed ecosystems. However, to really answer the question, I would need more data on how the instructions, number of steps, and building techniques have changed over time.