Visualizing Dynamic Hierarchies in The Art Genome Project



This report explores analytical data from The Art Genome Project, Artsy’s ongoing study of the characteristics that connect artists and artworks, using Tableau Public, a free data visualization tool. The goal of the resulting dashboard is to reveal salient features of quantitative genomic data that will help prioritize content in future information architecture restructuring efforts.


Artsy’s mission is to make all the world’s art accessible to anyone with an internet connection. By partnering with the world’s leading art auctions, galleries, fairs, and museums, the site functions as a free and powerful resource for users interested in art collecting and education. At present, the platform features more than 800,000 works of art and design by more than 80,000 artists.

The Art Genome Project (“TAGP”) is the classification system and technological framework that fuels Artsy. TAGP maps the characteristics, referred to in-house as “genes,” that connect artists, artworks, architecture, and design objects across time. There are currently over 1,300 genes of more than 40 types covering art-historical movements, subject matter, formal qualities, and more. Importantly, unlike tags, which are binary (something is either tagged “chair” or not), a gene is evaluated on a scale from 0 to 100 and then applied by hand to an artist or artwork record by an expert contributor. While TAGP also uses tags to highlight specific iconography, motifs, and subject matter, dynamic genoming allows for greater nuance when capturing the conceptual and formal aspects of cultural heritage records.
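To make the distinction between tags and genes concrete, here is a minimal Python sketch. It is an illustration only: the `ArtworkRecord` class, its field names, and the gene name “Example Gene” are hypothetical and are not taken from Artsy’s actual schema. It assumes only what is described above, that a tag is either present or absent while a gene carries a score from 0 to 100.

```python
from dataclasses import dataclass, field


@dataclass
class ArtworkRecord:
    """Hypothetical record type; not Artsy's actual data model."""

    title: str
    # Tags are binary: a tag is either in the set or it is not.
    tags: set = field(default_factory=set)
    # Genes are weighted: gene name -> score on the 0-100 scale.
    genes: dict = field(default_factory=dict)

    def apply_gene(self, name: str, score: int) -> None:
        """Attach a gene to the record, rejecting scores outside 0-100."""
        if not 0 <= score <= 100:
            raise ValueError(f"gene score must be between 0 and 100, got {score}")
        self.genes[name] = score


# Illustrative usage with made-up values.
record = ArtworkRecord(title="Example chair")
record.tags.add("chair")
record.apply_gene("Example Gene", 70)
print("chair" in record.tags)
print(record.genes["Example Gene"])
```

A set is used for tags and a mapping for genes so that the difference in expressiveness is visible in the types themselves.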

TAGP is a semi-structured controlled vocabulary; however, it does not comply with ISO 25964 or related standards (ISO, n.d.), and its information hierarchy is relatively flat, going two levels deep at most. To meet these standards and create a deep taxonomy that is truly user-friendly, one of the first steps is to visualize the current structure and identify which areas are working and which need improvement, so that future iterations can be prioritized.


Stephen Few’s guidelines for representing quantitative data in Now You See It (2009) provided both inspiration and reference for creating the line graphs in this dashboard. To quote Few: “If your objective is to see how quantitative values have changed during a continuous period of time, nothing works better than a line graph. Lines work better than any other means to make visible the sequential flow of values as they have changed with the passage of time” (2009, p. 150). Although Few notes that the heavy overplotting in Figure 7.19, “Website Visitors by Hour,” makes comparison difficult (2009, p. 154), the figure still served as inspiration for “Followers by Gene Family” in the dashboard.

In the search for inspiration for a Tableau-friendly network visualization, Ward, Grinstein, and Keim’s recommendation of treemaps, a space-filling technique for displaying hierarchical structures illustrated in Figure 8.2 of their book Interactive Data Visualization: Foundations, Techniques, and Applications (2010, p. 273), provided the model for “Gene Hierarchies.” The packed bubble chart in the dashboard was influenced by Ken Ferlage’s Word Usage in Sacred Texts (2017).


Data Collection and Transformation

This dataset was created by running SQL queries in Looker, a proprietary data analytics platform, against Artsy’s AWS Redshift data warehouse. Rather than attempting to write a single, complex query that would denormalize Artsy’s relational data into tabular data better suited for visualization in Tableau, the relational data was exported to a local CSV file and further transformed in OpenRefine, an open-source desktop application for data wrangling.

Using OpenRefine, the dataset was transposed to turn the count of followers of a given gene each month into tabular data. Text faceting was used to consolidate gene types from 40 values to 30, with the intention of reducing overcrowding in the visualizations. Beyond transposing and faceting, no further data transformation or wrangling was required. The dataset was then exported to CSV for visualization in Tableau Public.
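For readers who prefer a scripted route, the transposition step can be approximated in Python. The sketch below is not the OpenRefine operation used for this project. It assumes, hypothetically, that the export had one row per gene and one follower-count column per month, and it reshapes that into one row per gene per month. All column names and values are illustrative.

```python
import csv
import io

# Hypothetical wide export: one row per gene, one column per month.
# Column names and values are illustrative, not taken from the Artsy dataset.
WIDE_CSV = """gene_family,gene_type,gene_name,2017-01,2017-02
Style and Movement,Example Type A,Emerging Art,120,150
Example Family,Example Type B,Photography,80,95
"""

# Columns that identify a gene; every other column is treated as a month.
ID_COLUMNS = ("gene_family", "gene_type", "gene_name")


def wide_to_long(wide_text):
    """Return one record per gene per month from a wide CSV string."""
    reader = csv.DictReader(io.StringIO(wide_text))
    long_rows = []
    for row in reader:
        for column, value in row.items():
            if column in ID_COLUMNS:
                continue
            # Repeat the identifying columns on every monthly record.
            record = {key: row[key] for key in ID_COLUMNS}
            record["month"] = column
            record["followers"] = int(value)
            long_rows.append(record)
    return long_rows


long_rows = wide_to_long(WIDE_CSV)
for record in long_rows:
    print(record)
```

Because each gene name is repeated once per month in the long format, any count of genes taken from the reshaped data has to be a distinct count.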

Visualization Creation

After connecting the dataset to Tableau Public (Desktop), a considerable amount of time was spent experimenting with the drag-and-drop interface. Tableau Help provided excellent documentation for novice users such as the author, particularly the articles on building treemaps and creating custom hierarchies. The dimensions “Gene Family,” “Gene Type,” and “Gene Name” were combined into one such custom hierarchy, which is used by each visualization in the dashboard.

The “Gene Hierarchies” treemap was created first, using size to represent the count of distinct gene names, since the tabular formatting of the data produced many duplicate names. Color represents gene families and is carried through the subsequent visualizations. The “Followers by Gene Family” line graph plots the sum of followers on the vertical axis against Tableau’s automatically generated date hierarchy on the horizontal axis, with the custom gene hierarchy as its dimension. Because “Followers by Gene Family” sums followers at the top-most level of the gene hierarchy, “Followers by Genes” was created to show the most-followed individual genes at the lowest level of the hierarchy.
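The two aggregations described above, a distinct count of gene names for the treemap and a sum of followers for the line graph, can be sketched outside Tableau as follows. This is an illustration under stated assumptions, not a reproduction of Tableau’s internals: the row layout, the function names, and every value other than the names “Style and Movement,” “Emerging Art,” and “Photography” are hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format rows: one per gene per month (values are illustrative).
rows = [
    {"gene_family": "Style and Movement", "gene_name": "Emerging Art",
     "month": "2017-01", "followers": 120},
    {"gene_family": "Style and Movement", "gene_name": "Emerging Art",
     "month": "2017-02", "followers": 150},
    {"gene_family": "Style and Movement", "gene_name": "Example Gene",
     "month": "2017-01", "followers": 40},
    {"gene_family": "Example Family", "gene_name": "Photography",
     "month": "2017-01", "followers": 80},
    {"gene_family": "Example Family", "gene_name": "Photography",
     "month": "2017-02", "followers": 95},
]


def distinct_genes_per_family(rows):
    """Count each gene name once per family, ignoring monthly duplicates."""
    names = defaultdict(set)
    for row in rows:
        names[row["gene_family"]].add(row["gene_name"])
    return {family: len(genes) for family, genes in names.items()}


def followers_by_family_and_month(rows):
    """Sum followers for every (family, month) pair."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["gene_family"], row["month"])] += row["followers"]
    return dict(totals)


treemap_sizes = distinct_genes_per_family(rows)
line_series = followers_by_family_and_month(rows)
print(treemap_sizes)
print(line_series)
```

Collecting names into a set before counting is what keeps a gene that appears in several monthly rows from inflating the size of its family in the treemap.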


[Embedded Tableau Public dashboard]

Not only does the “Style and Movement” gene family contain the greatest number of distinct genes, it also has the most followers, regardless of month, quarter, or year. Of all the genes, “Emerging Art” from the “Style and Movement” gene family has the largest number of followers. Interestingly, “Photography” and “Prints,” two material genes, along with “Design” and “Latin America and the Caribbean” rise towards the top of followed genes amidst the sea of other style and movement genes.

Future Directions

With more time and resources, the author envisions connecting Tableau to Artsy’s Amazon Redshift database to keep the data fresh. Although outside the scope of this data research agreement, adding more dimensions, such as gene pageviews, conversions, entrances, and search query frequencies, would create better metrics for identifying problems in TAGP’s structure than relying solely on the count of followers. For instance, is there a high bounce rate on certain gene pages and, if so, is this an issue of inaccurate labeling or of poor content and layout? Are search queries suggesting what users want and cannot find in current genes? If so, should these be translated into new genes? By adding more analytical data to the dashboard, many of these questions can be answered in the future. More granular analytical data would also open up opportunities for new visualizations in the dashboard, like radar graphs for gene pageviews.


Ferlage, K. (2017). Word usage in sacred texts. Retrieved from

Few, S. (2009). Now you see it: Simple visualization techniques for quantitative analysis. Oakland, CA: Analytics Press.

International Organization for Standardization. (n.d.). The international standard for thesauri and interoperability with other vocabularies (ISO No. 25964). Retrieved from

Ward, M. O., Grinstein, G., & Keim, D. (2010). Interactive data visualization: Foundations, techniques, and applications. Natick, MA: A K Peters, Ltd.