Digital archivists often contend with messy collections. A single collection may contain thousands of files spread across different media, and processing it fully may demand more resources than an archive can afford. For this reason, many collections receive only minimal processing.

Computers are commonly used to detect and deaccession duplicate files, but I believe we can go further. If software could automatically detect the connections between files and thereby identify edited versions of the same image or drafts of the same document, then that information could be invaluable to the archivist in discovering, navigating, and describing the contents of a digital collection.

I am developing software, codenamed “Eltrovo”, that measures similarity between files to determine whether they may represent versions of the same work. This automatic discovery process will give archivists a greater understanding of their collections and enable them to describe those collections more effectively.
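To make the distinction between duplicates and versions concrete, here is a minimal Python sketch. It is an illustration only and is not drawn from Eltrovo: the function names, the use of SHA-256 for exact matches, the use of the standard library's `difflib.SequenceMatcher` for text similarity, and the 0.6 threshold are all assumptions chosen for the example.

```python
import hashlib
from difflib import SequenceMatcher


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of some file content."""
    return hashlib.sha256(data).hexdigest()


def text_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two texts."""
    return SequenceMatcher(None, a, b).ratio()


def classify_pair(a: str, b: str, threshold: float = 0.6) -> str:
    """Label two texts as a duplicate, a possible version, or unrelated.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    # Identical content produces identical digests.
    if sha256_digest(a.encode("utf-8")) == sha256_digest(b.encode("utf-8")):
        return "duplicate"
    # Content that differs but is still largely the same may be a draft or edit.
    if text_similarity(a, b) >= threshold:
        return "possible version"
    return "unrelated"


if __name__ == "__main__":
    draft = "The quick brown fox jumps over the lazy dog."
    revision = "The quick brown fox leaps over the lazy dog."
    print(classify_pair(draft, draft))
    print(classify_pair(draft, revision))
    print(classify_pair("abc", "xyz"))
```

A hash comparison only catches byte-for-byte copies, which is why a separate similarity measure is needed to surface drafts and edits. Images and other binary formats would need a different measure than the text comparison assumed here.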

The slide deck for this presentation is available online at stjo.hn/infoshow24

St John Karp
St John Karp is a recovering novelist and computer programmer. He has just completed a master's degree in library and information science and will shortly be ejected from the thorny bosom of academia and back into the real world. During grad school he has done internships at the American Numismatic Society, New York City Health + Hospitals, the NewYork-Presbyterian/Weill Cornell Medical Archives, and the Horological Society of New York. He is a shy and timid creature most active during the crepuscular hours, and responds to soothing noises and non-conflict scones.