Metadata for All!
I attended “NYC Open Data’s Metadata for All Initiative: Project Presentation,” hosted by the Metropolitan Library Council on September 25th, as my event review for INFO 601-04. This event presented the Metadata for All Initiative’s results after six months of working with New York City Open Data. The Sloan Foundation sponsored the project, which was completed in partnership with the Mayor’s Fund to Advance New York City, METRO, the Mayor’s Office of Data Analytics, the NYC Open Data Team, Pratt Institute, Brooklyn Public Library, Queens Public Library, New York Public Library, and Tiny Panther Consulting.
The initiative aimed to make New York City’s Open Data, which includes more than 2,100 datasets, more accessible to general users by improving metadata standards. Tiny Panther Consulting, a team of data librarians founded by Julia Marden, was brought on board to run a pilot study on how to make the 100 most-used datasets more user-friendly. They approached this through discussions with the relevant governmental departments, workshops in all five boroughs, and the creation of templates for certain metadata documents.
Metadata Improvements
To assess the current metadata quality, Tiny Panther created a dataset documentation checklist. (This was provided as a handout for the audience.) The checklist contained a rubric that evaluated overall usability, the user guide, and the data dictionary. The goal was to determine whether a user would be able to understand what was in the dataset, and, perhaps more importantly, what was not in it.
A data dictionary functions like a Rosetta Stone for its dataset: users need it to understand what is actually in the data, for example, what all of the rows and columns mean. However, currently only 90% of the datasets have a dictionary, and there is no standard template for them, so they are of varying quality.
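To make the idea concrete, here is a minimal sketch of what a machine-readable data dictionary entry could look like. The column names, descriptions, and structure below are invented for illustration and are not drawn from any actual NYC Open Data dataset or template.

```python
# Hypothetical data dictionary: maps each column name to a
# plain-language description, a type, and a note on whether the
# value is raw or added by the city. All entries are illustrative.
data_dictionary = {
    "complaint_type": {
        "description": "Category of the 311 service request",
        "type": "text",
        "source": "raw (entered by caller)",
    },
    "borough": {
        "description": "Borough where the incident occurred",
        "type": "text",
        "source": "added by the city (geocoded)",
    },
}

def describe(column):
    """Return a plain-language explanation of a column, or flag a gap."""
    entry = data_dictionary.get(column)
    if entry is None:
        return f"'{column}' is undocumented"
    return f"{column}: {entry['description']} ({entry['type']}, {entry['source']})"

print(describe("borough"))
print(describe("agency"))   # a column with no entry gets flagged
```

The second call shows the failure mode the initiative is trying to fix: without a dictionary entry, a user has no way to know what a column means or where its values came from.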
In addition to improving the data dictionaries, Tiny Panther recommended the creation of user guides tailored to each dataset. These guides would provide context for the data, note when it was last modified, and clarify which data were raw and which were added by the city, among many other details. Tiny Panther found that many of the documents associated with datasets used insider jargon that would not be comprehensible to users who were not employed by the departments that created the dataset. The three proposed user guides were provided as handouts as well.
Audience
The most successful aspect of this event wasn’t directly about the initiative; what made it most noteworthy was its audience. The event was not directed at librarians and information professionals, who probably already buy into the idea of accessible metadata. About half of the audience consisted of government workers. (This is based on a show of hands conducted early in the presentation.) These are the professionals who create and maintain the datasets, and they do not necessarily have a background in information studies. It was valuable to hear their points of view. A few members of the panel were representatives of departments Tiny Panther worked with, and they discussed their impressions of the challenges around the project. The audience was given an opportunity to ask questions about the project as well.
The only way the metadata standards can be maintained across all of the datasets is if these government workers understand why they matter. They are the ones who will be doing this extra work, on top of everything else they are responsible for. It is not as if each department has a data librarian whose sole role is to maintain its open data, although there is probably enough work that it could be a full-time job! I think the presentation was accessible to them, and I hope it demonstrated the power and utility of comprehensive metadata.
Conclusion
I will be keeping tabs on NYC Open Data, checking what the metadata looks like, as well as the actual data, over the next few ‘data dumps’ to see how the metadata evolves. Open data is an exciting tool for civic engagement, but only if users can understand what the data are actually telling them.