Non-relational Databases: Two Case Studies and their Potential for Collection Management Application

On September 18, Seth Kaufman, founder of Whirl-i-Gig, presented at the Code4Lib NYC meetup at the Metro offices. He demonstrated the Whirl-i-Gig projects ePandda and Inquisite, which use non-relational database structures to store and access research data.

Kaufman began his presentation by explaining that relational databases have been the dominant structure for commercial databases since the 1970s, but they have limitations for certain applications. He then walked the audience through the Whirl-i-Gig team’s development of two non-relational databases that met his clients’ research-centered needs in ways that relational databases could not.

Kaufman explained that document-oriented databases are based on the notion of a document rather than on joined records as in a relational database. In a document-oriented database, each document is “schema-less”: there are no pre-defined fields, every field can have multiple values, and there are no relationships or joins. According to Kaufman, the advantages of document-oriented databases are that they are fast, flexible, scalable, and easily integrated with programming languages. For these reasons, Whirl-i-Gig decided to use a document-oriented database for a solution they eventually called ePandda (enhancing Paleontological and Neontological Data Discovery API).
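To make the “schema-less” idea concrete, here is a minimal sketch in Python. It is not ePandda’s code or any real database’s API; the `specimens` list, its field names, and the `find` helper are all invented for illustration. Each document is a plain dictionary, documents need not share fields, and a field may hold several values.

```python
# Hypothetical in-memory "collection" of schema-less documents.
# All field names and values are invented for illustration.
specimens = [
    {"taxon": "Brachiopoda", "localities": ["Ohio", "Kentucky"], "period": "Ordovician"},
    {"taxon": "Trilobita", "collector": "unknown", "localities": ["Utah"]},
]


def find(documents, field, value):
    """Return documents whose field equals value or, for list fields, contains it."""
    matches = []
    for doc in documents:
        stored = doc.get(field)  # a missing field is simply absent, not an error
        values = stored if isinstance(stored, list) else [stored]
        if value in values:
            matches.append(doc)
    return matches


ohio = find(specimens, "localities", "Ohio")
print([doc["taxon"] for doc in ohio])
```

Note that the second document has a `collector` field the first lacks, and neither needed a table definition in advance. There is also no join: everything known about a specimen lives inside its own document.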

Whirl-i-Gig built ePandda to meet the needs of paleontological researchers at Yale who needed to observe patterns across multiple data sets that span geological epochs. The ePandda API made existing paleontology databases interoperable so that researchers could query across enormous data sets. Through ePandda’s HTTP-based query layer, Elasticsearch, researchers could examine questions such as “When did brachiopods die?”, which could in turn inform broader theories about climate change.

Whirl-i-Gig’s other non-relational database project, Inquisite, uses a graph structure. It was developed to analyze data for NYU’s NewYorkScapes project. NewYorkScapes is a project that studies the history of urban cultures in New York City and makes collected data discoverable for research. In Kaufman’s own words, the role of Inquisite in the project is to “streamline the formation of open research communities and foster the acquisition, preservation, dissemination and reuse of research data.”

The graph structure that Inquisite uses is a non-relational database format that has been around for decades but has seen a resurgence in recent years. Graph databases use nodes (entities) and edges (relationships between two nodes) to create structure. In graph databases, the relationships themselves contain data, which results in a simpler data model that is faster to query. Relational databases, on the other hand, rely on key structures to represent relationships, and related records must be matched by key at query time. As the depth of these relationships increases, query performance can slow.
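The node-and-edge model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Inquisite’s implementation or the API of any graph database: the node ids, labels such as `RECORDED`, and the `traverse` helper are hypothetical. The point is that each edge carries its own data and that following a relationship is a direct lookup rather than a key match between tables.

```python
from collections import defaultdict

# Nodes are entities; every id, name, and property here is hypothetical.
nodes = {
    "p1": {"kind": "person", "name": "Participant A"},
    "r1": {"kind": "recording", "title": "Street interview"},
    "l1": {"kind": "place", "name": "Example Park"},
}

# Each edge links two nodes and carries its own data (a label plus properties).
edges = [
    ("p1", "r1", {"label": "RECORDED", "date": "2018-09-01"}),
    ("r1", "l1", {"label": "LOCATED_AT", "accuracy_m": 15}),
]

# Adjacency index: each node points straight at its outgoing relationships.
outgoing = defaultdict(list)
for source, target, data in edges:
    outgoing[source].append((target, data))


def traverse(start, labels):
    """Follow a sequence of edge labels from start; return the node ids reached."""
    frontier = [start]
    for label in labels:
        frontier = [
            target
            for node_id in frontier
            for target, data in outgoing[node_id]
            if data["label"] == label
        ]
    return frontier


places = traverse("p1", ["RECORDED", "LOCATED_AT"])
print([nodes[node_id]["name"] for node_id in places])
```

Each additional hop in `traverse` is one more adjacency lookup per node reached. In a relational design the same question would typically add another join for every level of depth, which is the slowdown described above.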

Inquisite makes use of graph databases’ simple structure and fast query ability to enable project participants to use their mobile devices to gather and upload data (audio, video, and images). Inquisite then visualizes these community-generated data sets as graphs, timelines, and maps.

Kaufman’s presentation demonstrated innovative ways that academic research data can be expressed and accessed through non-relational databases. I was left wondering how these principles might apply to other content areas. I was particularly curious to know more about the potential of non-relational databases in collection management software for cultural heritage institutions.

Whirl-i-Gig is the creator of the open source collection management database CollectiveAccess, which is used by dozens of art organizations, libraries, and museums. I reached out to Kaufman to ask about the Whirl-i-Gig team’s choice of structure for CollectiveAccess and his perspective on the role of non-relational databases in collection management software generally.

Kaufman explained that CollectiveAccess is a typical relational database built with SQL. In his opinion, a document-oriented structure is not well-suited for collections databases. The main reason is that document-oriented databases lack the structured relationships that are necessary to represent collection content.

Kaufman considers graph databases to be more promising for collection management software. The developers at Whirl-i-Gig considered a graph structure for the next major version of CollectiveAccess, and Kaufman is confident that graph structures would function well for this purpose.

Though Kaufman describes query processing speed as one of the main reasons non-relational databases were chosen for Inquisite and ePandda, I wonder whether it is the intuitiveness of graph database queries that may be the greatest advantage for collection management databases at cultural heritage institutions. For example, the widely used collection management database TMS (Gallery Systems’ The Museum System) is a SQL-based relational database with a non-intuitive query interface that requires specific training to understand. A collection management system built on the graph model might feel more natural for users to query because the relationship structure itself is built using a syntax that more closely resembles the language people use to ask questions verbally.
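A small Python sketch can illustrate what I mean by a query that resembles a spoken question. It is my own illustration rather than the query language of TMS, CollectiveAccess, or any graph database: the `facts` list, the relationship names, and the `ask` helper are hypothetical. Facts are stored as subject, relationship, and object, and a question is asked by filling in the parts that are known.

```python
# Hypothetical catalogue facts stored as (subject, relationship, object) triples.
facts = [
    ("Artist A", "created", "Painting 1"),
    ("Artist A", "created", "Sculpture 2"),
    ("Painting 1", "exhibited_in", "Gallery 3"),
    ("Artist B", "created", "Print 4"),
]


def ask(subject=None, relationship=None, obj=None):
    """Return every fact matching the given parts; None acts as a wildcard."""
    return [
        (s, r, o)
        for s, r, o in facts
        if subject in (None, s) and relationship in (None, r) and obj in (None, o)
    ]


# "What did Artist A create?"
works = [o for _, _, o in ask(subject="Artist A", relationship="created")]
# "Who created Print 4?"
makers = [s for s, _, _ in ask(relationship="created", obj="Print 4")]
print(works, makers)
```

Both questions are phrased in terms of the relationship itself, without the user needing to know which tables or key columns hold the answer.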

An analogy to giving a driver navigational instructions can illustrate this concept. A relational database is the equivalent of giving a driver a map. The driver needs to look at the map, work out how to orient themselves, and then figure out which path to follow. A graph database is like giving the driver step-by-step directions, for example, “turn right at the next intersection and your destination is the third building on your left.” The argument for the efficiency of graph databases is that the query is more streamlined. Less time is spent in an interface that needs to interpret the user’s query and then, if the query was not possible, translate the result back to the user.

Despite the potential of graph databases to more effectively produce queries, Kaufman doesn’t expect to see graph databases used widely for collections because of the IT support required to implement and maintain them. Kaufman went on to explain that graph databases tend not to be well understood by IT professionals. Most cultural heritage institutions are not likely to allocate the resources required for the more extensive IT support. In his opinion, the familiarity and ease of relational databases for developers will likely cause them to remain the most prevalent database structure for collection management software for the foreseeable future.

References

Kaufman, Seth. 2018, September 18. Managing research data with non-relational databases.

Sasaki, Bryce Merkl. 2018, August 15. Graph Databases for Beginners: Why a Database Query Language Matters (More Than You Think). Retrieved from https://neo4j.com/blog/why-database-query-language-matters/.