Introduction to OpenRefine

[vimeo 165483420 w=640 h=360]

Introduction to OpenRefine from Sarah Hatoum on Vimeo (also available on YouTube). In the video’s description, there are time ranges listed if you would like to skip to different sections of the video.

About the Skillshare:

This tutorial (recorded using SnagIt) is intended to introduce users to OpenRefine, its basic features, and to act as a springboard for a digital humanities project/study that involves great quantities of data. The dataset used in this Skillshare was generated from New York Public Library’s (NYPL’s) crowdsourcing project What’s on the Menu?, where members of the public transcribed menu items from the 1840s to the present. Cleaning datasets is often the first step that needs to be taken when using public datasets, as they are typically messy, and OpenRefine can help users accomplish large-scale data cleaning (and data manipulation).

Overall, this Skillshare intends to introduce users to basic features of OpenRefine in order to make datasets more discernable (particularly for digital humanists who often use public humanities datasets), readying them for further analysis.

What is OpenRefine?

OpenRefine, formerly known as GoogleRefine and re-branded in 2012, is an open source software that can be used to clean, transform, and reconcile datasets. Publicly available raw datasets can be messy; if data is manually entered into a spreadsheet, there is room for human error–for example, there can be several variations of a word due to typos, differences in capitalization conventions or trailing/leading whitespaces (extra spaces after or before a word). Data can also be transformed (changed from one format to another) and reconciled (linked to data in external pages/databases and cross-checked for accuracy).

OpenRefine allows you see the “big picture” of your data, interact with your data, and ask questions about your data–and sometimes answer those selfsame questions–quickly and fairly easily.

What will I learn in the Skillshare video?

How to increase memory allocation: If you are working with a substantial dataset, OpenRefine may perform slowly or crash. The maximum amount of memory you can allocate to OpenRefine is dependent on your RAM and which bit version (32 or 64) of Java you have installed. Each OS involves a different process of memory allocation; read the “going for more memory” section in Using OpenRefine for instructions on how to allocate memory.
There are extensions that can aid your analysis of data without leaving OpenRefine: For example, this extension automatically calculates “elementary statistics” (e.g., standard deviation) of a column in OpenRefine; it is recommended for datasets with large amounts of numbers.

OpenRefine features:

Sorting: Data can be sorted by text (e.g., a to z, or z to a), numbers, dates, and booleans.
Facets: Facets work much like filters. Facets do not affect your data points, and instead allow you to isolate subsets of data. Facets allow you to not only apply transformations to subsets of data, but view and edit each individual record. You can create text, numeric, timeline, scatterplot, and custom facets.
- Clustering: Using clustering, identical values with variations are detected and grouped. You are able to give a universal name to the values, decreasing the size of your dataset; viewing faceted values by “count” will also show which value occurs most/least often.
- In-facet editing: Instead of bulk editing, you can edit individual data points within a facet. If, after clustering, you still notice identical values with variations, you may need to manually alter these values within the facet so that you create more unified values.

Google Refine (now General Refine) Expression Language (GREL): GREL resembles JavaScript, and is a programming language that allows you to make unique, bulk transformations based on the content of your dataset (e.g., extracting the area code from a column that contains phone numbers). A dialog box will pop up after choosing to Edit cells -> Transform or Custom facets, and this is where you will write and apply your expressions.
Text filters: You can search for individual data points using text filters. This is the equivalent of Ctrl + F on Windows/Command + F on Macs.
Common transforms:
- Trimming leading and trailing whitespaces: It can be difficult to detect whitespaces. This common transform can remove most stray white spaces found either after or before a value.
- To titlecase, to uppercase, to lowercase: You can bulk transform columns to have the first letter of each word capitalized, all words capitalized, or all words changed to lowercase, respectively.
- To number, to data, to text: Sometimes you may need to tell OpenRefine what kind of content you have in your columns, and these transforms allow you to do so.
Batch rows: You can flag/star records that you find to be problematic at any location in your dataset, recall them, and edit them individually or delete them in bulk.

What are some advanced features of OpenRefine?

The OpenRefine Wiki offers “small workflows and code fragments” called recipes. You can perform more complex transformations using these recipes; others have created these recipes to streamline your process of data cleaning and analysis. Use them!

Reconciliation is another feature that can be useful for information professionals (museum professionals, archivists, librarians) because it allows you to check the consistency of your data against data in an external database; you can check your collection’s vocabulary with a controlled vocabulary. See the end of this post for reconciliation sources.

Lastly, you can add various extensions in order to add new functionalities to OpenRefine.

How can OpenRefine be used as a digital humanities tool?

As mentioned, cleaning data is often the very first step that needs to be taken when working with a public dataset. In a practicum at the Maryland Institute for Technology in the Humanities, Lydia Zvyagintseva and Trevor Muñoz imagined ways to curate the dataset generated from NYPL’s What’s on the Menu? project and make it more digestible for digital humanities researchers. Currently, as Zvyagintseva noted, there is not a thematic/categorical classification of the 24,000 digitized menus, so she and Muñoz aimed to determine ways to classify (curate) the collection.

After assessing all data, they received summaries of requests from researchers who used NYPL’s What’s on the Menu? API. This helped Zvyagintseva and Muñoz understand user needs (regarding the use of data), and determine the types of questions humanities researchers would ask of the dataset(s). In order to classify the datasets accordingly, they first needed to use OpenRefine:

We used OpenRefine…software to cluster and rename data fields that we considered of particular use or interest to future users of this collection, principally names of the businesses offering these menus and also, where present, the names of the categories (supplied by original cataloguers?) to which the menus had been assigned. While OpenRefine helped with the initial clustering, the large number of name and spelling variations meant some tedious line-by-line editing. This is also the data curator’s job.

After using OpenRefine to clean up humanities data similar to the What’s on the Menu? datasets, questions can be asked from various disciplinary perspectives and can ultimately guide how a humanist develops his/her DH project.

What is an alternative to OpenRefine?

Besides Microsoft Excel, Data Wrangler can be used as an alternative to OpenRefine. Data Wrangler is an open source, web-based application (as opposed to OpenRefine, which is desktop-based) for manipulating data. Data Wrangler allows for “…interactive transformation of messy, real-world data into the data tables analysis tools expect. Export data for use in Excel, R, Tableau, Protovis, [etc.]” While OpenRefine may be best for cleaning and reconciling data, Data Wrangler may be best used for reorganizing or reformatting data.

Note: *A novice OpenRefine user expressed uncertainty with the definition of facet, and the definition has been updated. However, if you would like to learn more, I would recommend reading through OpenRefine’s Wiki page for faceting/filtering.

References and further reading:

Download OpenRefine. (2013). Retrieved from http://openrefine.org/download.html

Hooland, S. V., Verborg, R., & Wilde, M. D. (2013, August 5). Cleaning data with OpenRefine. Programming Historian. Retrieved from http://programminghistorian.org/lessons/cleaning-data-with-openrefine

Huynh, D. (2011, Feb). Google Refine. Retrieved from http://davidhuynh.net/spaces/nicar2011/tutorial.pdf

Morris, T. General Refine Expression Language. (2015 May 19). Retrieved from https://github.com/OpenRefine/OpenRefine/wiki/General-Refine-Expression-Language

Muñoz, T. (2013, Aug 19). Refining the problem — More work with NYPL’s open data, part two. Retrieved from http://www.trevormunoz.com/notebook/2013/08/19/refining-the-problem-more-work-with-nypl-open-data-part-two.html

Muñoz, T. & Rawson, K. (2014). Curating menus. Retrieved from curatingmenus.org

[tfmorris]. (2012, Dec 22). Screencasts: Screencasts about Refine. Retrieved from https://github.com/OpenRefine/OpenRefine/wiki/Screencasts

Verborgh, R. & Wilde, M. D. (2013, Sept). Analyzing and fixing data. In Using OpenRefine. Packt Publishing. Retrieved from https://www.packtpub.com/sites/default/files/9781783289080_Chapter_02.pdf [*An e-book version can be found HERE.]

Zvyagintseva, L. (2013, Jun 21). Organizing historical menus: a data curation experiment. MITH. Retrieved from http://mith.umd.edu/taxonomizing-historical-menus-a-data-curation-project/

Reconciliation resources:

[alctsce]. (2014 Mar 3). Using OpenRefine to update, clean up, and link your metadata to the wider world. [Video file]. Retrieved from https://www.youtube.com/watch?v=E-NbMR3_MRw&list=UUvPEK3a3Qb0GMCSzqoZDnlg

Heller, M. (2013 May 1). A librarian’s guide to OpenRefine. ACRL TechConnect. Retrieved from http://acrl.ala.org/techconnect/post/a-librarians-guide-to-openrefine

Multimedia Lab, MasTIC, & Information School at the University of Washington. (2016). Free your metadata. Retrieved from freeyourmetadata.org.

Digital Humanities@ Pratt