{"id":1066,"date":"2016-05-07T12:01:22","date_gmt":"2016-05-07T16:01:22","guid":{"rendered":"http:\/\/dh.prattsils.org\/?p=1066"},"modified":"2016-05-07T12:01:22","modified_gmt":"2016-05-07T16:01:22","slug":"introduction-to-openrefine","status":"publish","type":"post","link":"https:\/\/studentwork.prattsi.org\/dh\/2016\/05\/07\/introduction-to-openrefine\/","title":{"rendered":"Introduction to OpenRefine"},"content":{"rendered":"<p>[vimeo 165483420 w=640 h=360]<\/p>\n<p><a href=\"https:\/\/vimeo.com\/165483420\">Introduction to OpenRefine<\/a> from <a href=\"https:\/\/vimeo.com\/user51885633\">Sarah Hatoum<\/a> on <a href=\"https:\/\/vimeo.com\">Vimeo<\/a>\u00a0(also available<a href=\"https:\/\/www.youtube.com\/watch?v=WCRexQXYFrI\">\u00a0on YouTube<\/a>). \u00a0In the video&#8217;s description, there are time ranges listed if you would like\u00a0to skip to different sections of the video.<\/p>\n<h2>About the Skillshare:<\/h2>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">This tutorial\u00a0(recorded using SnagIt) is intended to introduce users to OpenRefine, its basic features, and to act as a springboard for a digital humanities project\/study that involves great quantities of data. The dataset used in this Skillshare was generated from New York Public Library\u2019s (NYPL\u2019s) crowdsourcing project <\/span><a href=\"http:\/\/menus.nypl.org\/\"><i><span style=\"font-weight: 400\">What\u2019s on the Menu?<\/span><\/i><\/a><span style=\"font-weight: 400\">, where members of the public transcribed menu items from the 1840s to the present. 
Cleaning datasets is often the first step that needs to be taken when using\u00a0public datasets, as they are typically\u00a0messy, and OpenRefine can help users accomplish large-scale data cleaning (and data manipulation).<\/span><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">Overall, this\u00a0Skillshare intends to introduce users to\u00a0basic features of OpenRefine in order to make datasets more discernable (particularly for digital humanists who often use public humanities datasets), readying them for further analysis.<\/span><\/p>\n<h2>What is OpenRefine?<\/h2>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">OpenRefine, formerly known as Google Refine and re-branded in 2012, is open source software that can be used to clean, transform, and reconcile datasets. Publicly available raw datasets can be messy; if data is manually entered\u00a0into a spreadsheet, there is room for human error&#8211;for example, there can be several variations of a word due to typos, differences in capitalization conventions, or trailing\/leading whitespace (extra spaces after or before a word). Data can also be transformed (changed from one format to another) and reconciled (linked to data in external pages\/databases and cross-checked for accuracy).\u00a0<\/span><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">OpenRefine allows you to see the \u201cbig picture\u201d of your data, interact with your data, and ask questions about your data&#8211;and sometimes answer those selfsame questions&#8211;quickly and fairly easily. <\/span><\/p>\n<h2>What will I learn in the Skillshare video?<\/h2>\n<ul>\n<li style=\"font-weight: 400\"><b>How to increase\u00a0memory allocation:<\/b><span style=\"font-weight: 400\"> If you are working with a substantial dataset, OpenRefine may\u00a0perform slowly or crash. 
The maximum amount of memory you can allocate to OpenRefine is dependent on your RAM and which bit version (32 or 64) of Java you have installed. Each OS involves a different process of memory allocation; read the <\/span><a href=\"https:\/\/www.packtpub.com\/packtlib\/book\/Big-Data-and-Business-Intelligence\/9781783289080\/1\/ch01lvl1sec15\/Recipe%207%20%20going%20for%20more%20memory\"><span style=\"font-weight: 400\">\u201cgoing for more memory\u201d<\/span><\/a><span style=\"font-weight: 400\"> section in <\/span><i><span style=\"font-weight: 400\">Using OpenRefine <\/span><\/i><span style=\"font-weight: 400\">for instructions on how to allocate memory.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>There are extensions that can aid your analysis of data without leaving OpenRefine: <\/b>For example, t<a href=\"https:\/\/github.com\/sparkica\/refine-stats\"><span style=\"font-weight: 400\">his extension<\/span><\/a><span style=\"font-weight: 400\">\u00a0automatically calculates &#8220;elementary statistics&#8221; (e.g., standard deviation) of a column in OpenRefine; it is recommended\u00a0for datasets with large amounts of numbers.\u00a0<\/span><\/li>\n<\/ul>\n<h4>OpenRefine features:<\/h4>\n<ul>\n<ul>\n<li style=\"font-weight: 400\"><b>Sorting<\/b><span style=\"font-weight: 400\">: Data can be sorted by text (e.g., a to z, or z to a), numbers, dates, and booleans.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Facets:<\/b><span style=\"font-weight: 400\">\u00a0Facets work much like filters. Facets do not affect your data points, and instead allow you to isolate subsets of\u00a0data<\/span><span style=\"font-weight: 400\">. Facets allow you to not only apply transformations to subsets of data, but view and edit each individual record. You can create text, numeric, timeline, scatterplot, and custom facets. 
<\/span>\n<ul>\n<li style=\"font-weight: 400\"><b>Clustering:<\/b><span style=\"font-weight: 400\"> Using clustering, identical values with variations are detected and grouped. You are able to give a universal name to the values, decreasing the size of your dataset; viewing faceted values by \u201ccount\u201d will also show which value occurs most\/least often.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>In-facet editing:<\/b><span style=\"font-weight: 400\"> Instead of bulk editing, you can edit individual data points within a facet. If, after clustering, you still notice identical values with variations, you may need to manually alter these values within the facet so that you create more unified values.<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/ul>\n<ul>\n<ul>\n<li style=\"font-weight: 400\"><b><a href=\"https:\/\/github.com\/OpenRefine\/OpenRefine\/wiki\/General-Refine-Expression-Language\">Google Refine (now General Refine) Expression Language (GREL)<\/a>:<\/b><span style=\"font-weight: 400\"> GREL resembles JavaScript, and is a programming\u00a0language that allows you to make unique, bulk transformations based on the content of your dataset (e.g., extracting the area code from a column that contains phone numbers). A dialog box will pop up after choosing to\u00a0Edit cells -&gt; Transform or Custom facets, and\u00a0this is where you will\u00a0write and apply your expressions.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Text filters:<\/b><span style=\"font-weight: 400\"> You can search for individual data points using text filters. This is the equivalent of Ctrl + F on Windows\/Command + F on Macs.<\/span><\/li>\n<li><strong>Common transforms:<\/strong>\n<ul>\n<li><strong>Trimming leading and trailing whitespaces:<\/strong> It can be difficult to detect whitespaces. 
This common transform can\u00a0remove most stray whitespace found either after or before a value.<\/li>\n<li><strong>To titlecase, to uppercase, to lowercase: <\/strong>You can bulk transform columns to have the first letter of each word capitalized, every letter capitalized, or every letter changed to lowercase, respectively.<\/li>\n<li><strong>To number, to date, to text:<\/strong> Sometimes you may need to tell OpenRefine what kind of content you have in your columns, and these transforms allow you to do so.<\/li>\n<\/ul>\n<\/li>\n<li><b>Batch rows: <\/b><span style=\"font-weight: 400\">You can flag\/star records that you find to be problematic at any location in your dataset, recall them, and\u00a0edit them individually or\u00a0delete them in bulk.<\/span><\/li>\n<\/ul>\n<\/ul>\n<h2>What are some advanced features of OpenRefine?<\/h2>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">The OpenRefine Wiki offers \u201csmall workflows and code fragments\u201d called <\/span><a href=\"https:\/\/github.com\/OpenRefine\/OpenRefine\/wiki\/Recipes\"><span style=\"font-weight: 400\">recipes<\/span><\/a><span style=\"font-weight: 400\">. You can perform more complex transformations using these recipes; others have created these recipes to streamline your process of data cleaning and analysis. Use them!\u00a0<\/span><\/p>\n<p style=\"text-align: left\">Reconciliation is another feature that can\u00a0be useful for information professionals (museum professionals, archivists, librarians) because it allows you to check the consistency of your data against data in\u00a0an external\u00a0database; for example, you can check\u00a0your collection&#8217;s vocabulary against a controlled vocabulary. 
See the end of this post for reconciliation sources.<\/p>\n<p style=\"text-align: left\">Lastly, you can add <a href=\"http:\/\/openrefine.org\/download.html\">various extensions<\/a>\u00a0in order to add new functionalities to OpenRefine.<\/p>\n<h2>How can OpenRefine be used as a digital humanities tool?<\/h2>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">As mentioned, cleaning data is often the very first step that needs to be taken when\u00a0working with a public dataset. <a href=\"http:\/\/mith.umd.edu\/taxonomizing-historical-menus-a-data-curation-project\/\">In a practicum at the Maryland Institute for Technology in the Humanities<\/a>, Lydia Zvyagintseva and Trevor Mu\u00f1oz imagined ways to curate the dataset generated from NYPL\u2019s <\/span><i><span style=\"font-weight: 400\">What\u2019s on the Menu?<\/span><\/i><span style=\"font-weight: 400\">\u00a0project and make it more digestible for digital humanities researchers. Currently, as Zvyagintseva noted, there is not a thematic\/categorical classification of the 24,000 digitized menus, so she and Mu\u00f1oz aimed to determine ways to classify (curate) the collection.<\/span><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">After assessing all data, they received summaries of requests from researchers who used NYPL\u2019s <\/span><i><span style=\"font-weight: 400\">What\u2019s on the Menu?<\/span><\/i><span style=\"font-weight: 400\"> API. 
This helped Zvyagintseva and Mu\u00f1oz understand user needs (regarding the use of data), and determine the types of questions humanities researchers would ask of the dataset(s).\u00a0In order to classify the datasets accordingly, they first needed to use OpenRefine:<\/span><\/p>\n<blockquote>\n<p style=\"text-align: left\">We used OpenRefine&#8230;software to cluster and rename data fields that we considered of particular use or interest to future users of this collection, principally names of the businesses offering these menus and also, where present, the names of the categories (supplied by original cataloguers?) to which the menus had been assigned. While OpenRefine helped with the initial clustering, the large number of name and spelling variations meant some tedious line-by-line editing. This is also the data curator\u2019s job.<\/p>\n<\/blockquote>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">After using OpenRefine to clean up humanities data similar to the <\/span><i><span style=\"font-weight: 400\">What\u2019s on the Menu? <\/span><\/i><span style=\"font-weight: 400\">datasets, questions can be asked from various disciplinary perspectives and can ultimately guide how a humanist develops his\/her DH project.<\/span><b style=\"font-size: 16px\">\u00a0<\/b><\/p>\n<h2>What is an alternative to OpenRefine?<\/h2>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">Besides Microsoft Excel, <\/span><a href=\"http:\/\/vis.stanford.edu\/wrangler\/\"><b>Data Wrangler<\/b><\/a><span style=\"font-weight: 400\"> can be used as an alternative to OpenRefine. Data Wrangler is an open source, web-based application (as opposed to OpenRefine, which is desktop-based) for manipulating data. Data Wrangler allows for \u201c&#8230;i<\/span><span style=\"font-weight: 400\">nteractive transformation of messy, real-world data into the data tables analysis tools expect. 
Export data for use in Excel, R, Tableau, Protovis, [etc.]\u201d While OpenRefine may be\u00a0best for cleaning and reconciling data, Data Wrangler may be best used for reorganizing or reformatting data.<\/span><\/p>\n<p style=\"text-align: left\"><strong>Note:<\/strong> *A novice OpenRefine user expressed uncertainty with the definition of facet, and the definition has been updated. However, if you would like to learn more, I would recommend reading\u00a0through OpenRefine&#8217;s <a href=\"https:\/\/github.com\/OpenRefine\/OpenRefine\/wiki\/Faceting\">Wiki page for faceting\/filtering.<\/a><\/p>\n<hr \/>\n<p><b>References and further reading:<\/b><\/p>\n<p>Download OpenRefine. (2013). Retrieved from\u00a0<a href=\"http:\/\/openrefine.org\/download.html\">http:\/\/openrefine.org\/download.html<\/a><\/p>\n<p><span style=\"font-weight: 400\">Hooland, S. V., Verborgh, R., &amp; Wilde, M. D. (2013, August 5). Cleaning data with OpenRefine. <\/span><i><span style=\"font-weight: 400\">Programming Historian<\/span><\/i><span style=\"font-weight: 400\">. Retrieved from <\/span><a href=\"http:\/\/programminghistorian.org\/lessons\/cleaning-data-with-openrefine\"><span style=\"font-weight: 400\">http:\/\/programminghistorian.org\/lessons\/cleaning-data-with-openrefine<\/span><\/a><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">Huynh, D. (2011, Feb). Google Refine. Retrieved from <\/span><a href=\"http:\/\/davidhuynh.net\/spaces\/nicar2011\/tutorial.pdf\"><span style=\"font-weight: 400\">http:\/\/davidhuynh.net\/spaces\/nicar2011\/tutorial.pdf<\/span><\/a><\/p>\n<p>Morris, T. (2015, May 19). General Refine Expression Language. Retrieved from\u00a0<a href=\"https:\/\/github.com\/OpenRefine\/OpenRefine\/wiki\/General-Refine-Expression-Language\">https:\/\/github.com\/OpenRefine\/OpenRefine\/wiki\/General-Refine-Expression-Language<\/a><\/p>\n<p><span style=\"font-weight: 400\">Mu\u00f1oz, T. (2013, Aug 19). 
Refining the problem \u2014 More work with NYPL&#8217;s open data, part two. Retrieved from <a href=\"http:\/\/www.trevormunoz.com\/notebook\/2013\/08\/19\/refining-the-problem-more-work-with-nypl-open-data-part-two.html\">http:\/\/www.trevormunoz.com\/notebook\/2013\/08\/19\/refining-the-problem-more-work-with-nypl-open-data-part-two.html<\/a><\/span><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">Mu\u00f1oz, T. &amp; Rawson, K. (2014). Curating menus. Retrieved from <a href=\"http:\/\/curatingmenus.org\">curatingmenus.org<\/a><\/span><\/p>\n<p style=\"text-align: left\">[tfmorris]. (2012, Dec 22). Screencasts: Screencasts about Refine. Retrieved from\u00a0<a href=\"https:\/\/github.com\/OpenRefine\/OpenRefine\/wiki\/Screencasts\">https:\/\/github.com\/OpenRefine\/OpenRefine\/wiki\/Screencasts<\/a><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">Verborgh, R. &amp; Wilde, M. D. (2013, Sept). Analyzing and fixing data. In <\/span><i><span style=\"font-weight: 400\">Using OpenRefine<\/span><\/i><span style=\"font-weight: 400\">. Packt Publishing. Retrieved from <\/span><span style=\"font-weight: 400\"><a href=\"https:\/\/www.packtpub.com\/sites\/default\/files\/9781783289080_Chapter_02.pdf\">https:\/\/www.packtpub.com\/sites\/default\/files\/9781783289080_Chapter_02.pdf<\/a>\u00a0<\/span><span style=\"font-weight: 400\">[*An e-book version can be found <\/span><a href=\"https:\/\/www.packtpub.com\/packtlib\/book\/Big-Data-and-Business-Intelligence\/9781783289080\/1\/ch01lvl1sec15\/Recipe%207%20%20going%20for%20more%20memory\"><span style=\"font-weight: 400\">HERE<\/span><\/a><span style=\"font-weight: 400\">.]<\/span><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">Zvyagintseva, L. (2013, Jun 21). Organizing historical menus: a data curation experiment. MITH. 
Retrieved from <\/span><a href=\"http:\/\/mith.umd.edu\/taxonomizing-historical-menus-a-data-curation-project\/\"><span style=\"font-weight: 400\">http:\/\/mith.umd.edu\/taxonomizing-historical-menus-a-data-curation-project\/<\/span><\/a><\/p>\n<p style=\"text-align: left\"><b><\/b><strong>Reconciliation resources:<\/strong><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">[alctsce]. (2014, Mar 3). Using OpenRefine to update, clean up, and link your metadata to the wider world. [Video file]. Retrieved from\u00a0<\/span><a href=\"https:\/\/www.youtube.com\/watch?v=E-NbMR3_MRw&amp;list=UUvPEK3a3Qb0GMCSzqoZDnlg\"><span style=\"font-weight: 400\">https:\/\/www.youtube.com\/watch?v=E-NbMR3_MRw&amp;list=UUvPEK3a3Qb0GMCSzqoZDnlg<\/span><\/a><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">Heller, M. (2013, May 1). A librarian&#8217;s guide to OpenRefine. <\/span><i><span style=\"font-weight: 400\">ACRL TechConnect<\/span><\/i><span style=\"font-weight: 400\">. Retrieved from <\/span><a href=\"http:\/\/acrl.ala.org\/techconnect\/post\/a-librarians-guide-to-openrefine\"><span style=\"font-weight: 400\">http:\/\/acrl.ala.org\/techconnect\/post\/a-librarians-guide-to-openrefine<\/span><\/a><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">Multimedia Lab, MasTIC, &amp; Information School at the University of Washington. (2016). Free your metadata. Retrieved from <a href=\"http:\/\/freeyourmetadata.org\">freeyourmetadata.org<\/a>.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p class=\"lead\">\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">This tutorial\u00a0(recorded using SnagIt) is intended to introduce users to OpenRefine, its basic features, and to act as a springboard for a digital humanities project\/study that involves great quantities of data. 
The dataset used in this Skillshare was generated from New York Public Library\u2019s (NYPL\u2019s) crowdsourcing project <\/span><a href=\"http:\/\/menus.nypl.org\/\"><i><span style=\"font-weight: 400\">What\u2019s on the Menu?<\/span><\/i><\/a><span style=\"font-weight: 400\">, where members of the public transcribed menu items from the 1840s to the present. Cleaning datasets is often the first step that needs to be taken when using public datasets, as they are typically messy, and OpenRefine can help users accomplish large-scale data cleaning (and data manipulation). <\/span><\/p>\n<p style=\"text-align: left\"><span style=\"font-weight: 400\">Overall, this\u00a0Skillshare intends to introduce users to\u00a0basic features of OpenRefine in order to make datasets more discernable (particularly for digital humanists who often use public humanities datasets), readying them for further analysis.<\/span><\/p>\n<\/p>\n<p class=\"more-link-p\"><a class=\"btn btn-danger\" href=\"https:\/\/studentwork.prattsi.org\/dh\/2016\/05\/07\/introduction-to-openrefine\/\">Read more 
&rarr;<\/a><\/p>\n","protected":false},"author":359,"featured_media":1120,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,6],"tags":[],"class_list":["post-1066","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-resources","category-skillshares"],"_links":{"self":[{"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/posts\/1066","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/users\/359"}],"replies":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/comments?post=1066"}],"version-history":[{"count":0,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/posts\/1066\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/media\/1120"}],"wp:attachment":[{"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/media?parent=1066"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/categories?post=1066"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/studentwork.prattsi.org\/dh\/wp-json\/wp\/v2\/tags?post=1066"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}