
Assessing File Format Risk for Born-Digital Preservation Planning

This post originally appeared on the Smithsonian Institution Archives’ blog. Melissa Anderson’s internship was part of the Smithsonian Libraries and Archives’ 50th Anniversary Internship program, with funding provided by the Secretary of the Smithsonian and the Smithsonian National Board.

When I entered the MLIS program at the University of Alabama School of Library and Information Studies in 2018 and became interested in digital libraries, I was surprised to learn that the information we create and store digitally is just as fragile as, and in some cases even more fragile than, unstable media or paper. Physical damage, deterioration of digital storage media, and the technological complexity and dependency of electronic records make them uniquely vulnerable to loss, corruption, and alteration. As keepers of records with historical, cultural, and legal value, archival repositories have a responsibility to identify at-risk digital objects and take preemptive action to preserve them in a format that is accessible to the broadest possible public for the longest possible time. As a Smithsonian Libraries and Archives 50th Anniversary intern in born-digital collections, I’m learning how to do just that.

At present, more than half of the Smithsonian Institution Archives’ annual accessions contain born-digital materials, most of which are acquired in mixed collections alongside print and analog media. To document and serve the Institution, the Archives collects documents, spreadsheets, images, audiovisual (AV) material, email, databases, designs, data sets, software, websites, and social media content. These electronic records span more than 40 years and are stored in a variety of media formats, some of which require urgent preservation to avoid information loss.

Gif slideshow of Digital Collections information.

The Archives employs a multi-pronged born-digital preservation strategy that follows professional standards and best practices, including the OAIS Reference Model and trustworthy digital repositories. The three prongs are: bit-level preservation, migration of at-risk files to stable preservation formats, and emulation for access to records locked in obsolete formats. The first strategy creates an exact copy of a file’s content information and data structure and is applied to all digital objects on accession. Having two (or more) identical copies of every file and storing them in different locations mitigates the risk of loss due to media, system, or human failure and disasters like fire and flooding, but possession does not automatically equal access. Our ability to even open and view a file during processing depends on hardware and software that can read and render it.

Obsolescence affects both the machines and the software we use to create, store, and access digital files. Advancements in power, speed, efficiency, and cost lead to rapid obsolescence of computer hardware. The introduction and adoption of new hardware also leads to new and improved software, which eventually makes older software and the file formats it supported obsolete as well. The wide use of proprietary file formats has created a situation in which only the program that created the file—or, even more specifically, a particular version of that program—can be used to open that file.

Sometimes only the information (i.e., written text) contained in a file is important, but often we need to preserve the appearance and function of files as well to ensure that evidential and use value is maintained. Take, for example, a newsletter created using Adobe InDesign 1.0 (circa 1999) and selected for a digital exhibition commemorating the Smithsonian’s 175th anniversary. If we’re only able to render the text of that document but not the images, layout, colors, or fonts, we would have only a part of the newsletter the original user experienced. This is where our second and third prongs, migration and emulation, come into play.

Migration involves moving a file from an at-risk or obsolescent format to a format digital archivists agree is more stable. Despite dependence on hardware and software, migration is an effective way to preserve digital objects and make them accessible, so long as it’s done promptly and as needed to keep up with technology. But it requires archivists to verify fixity, which assures that a copied or converted file hasn’t been altered from the original. Digital files can be changed or corrupted accidentally during preservation events, through human error, or maliciously by actors who wish to alter or destroy records. Checksums (hash values computed from a file’s contents) enable archivists to validate the authenticity of records, which is essential for maintaining public confidence in the trustworthiness of repositories. If the hash of a copied file matches the hash of the original, archivists can be confident the record has been reproduced exactly.
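To make the fixity check concrete, here is a minimal sketch in Python. It is an illustration only, not the Archives’ actual workflow: the function names `file_checksum` and `verify_fixity` are hypothetical, and the choice of SHA-256 is an assumption (the post does not say which algorithm the Archives uses).

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file's bytes.

    The file is read in chunks so that very large files do not need
    to fit in memory. SHA-256 is an assumed default, not a documented one.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(original, copy, algorithm="sha256"):
    """Return True only when both files produce the same digest."""
    return file_checksum(original, algorithm) == file_checksum(copy, algorithm)
```

Calling `verify_fixity("original.tif", "copy.tif")` on two byte-identical files returns `True`; changing even one byte in the copy produces a different digest and the check returns `False`. In practice the digest recorded at accession would be stored and compared against later, rather than recomputed from the original each time.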

When files can’t be migrated, emulation provides another mode of preservation and access. This method uses programming to emulate the appearance and function of obsolescent computing technologies; one can, for instance, turn a Raspberry Pi into an original Nintendo gaming system. But the kind of emulation needed to preserve both the information content and appearance of digital records is much more complicated and expensive. A well-known and early use of emulation was undertaken at Emory University’s Rose Library. In 2009, when I was a third-year doctoral student in American Studies there, my digital humanities friends were all excited about a digital archives technology that convincingly replicated Salman Rushdie’s Power Macintosh 5400. Emory’s case study became an early model for successful digital preservation, but their innovation was supported by a resource-rich institution that invests heavily in its archives and special collections libraries.

Ten years later, when I entered library school, I understood why my Emory classmates had been so excited; migration and emulation enable us to preserve and provide access to electronic records at scale. Today, we’re challenged to develop preservation policies and workflows that include strategic risk assessment. The Archives’ digital preservation team is performing a detailed risk analysis of the born-digital holdings in our collections. This process starts at ingestion by identifying and validating the format type of each file using DROID and JHOVE, as well as the PRONOM technical registry. Digital archivist Lynda Schmitz Fuhrig or another team member reviews this and other administrative metadata. All the administrative information about the born-digital content in an accession is gathered in the Archives’ DArcInfo (Digital Archive Information System) database.
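Format identification of this kind generally relies on matching the first bytes of a file against known signatures rather than trusting the file extension. The Python sketch below illustrates that general idea only. It does not call DROID, JHOVE, or PRONOM; the `SIGNATURES` table is a tiny hand-written stand-in and `identify_format` is a hypothetical helper name.

```python
# A tiny, hand-written signature table (an assumption for illustration;
# real registries such as PRONOM describe far more formats and versions).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(path):
    """Guess a file's format from its leading bytes, ignoring its extension."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unidentified"
```

A file named `report.doc` that actually begins with `%PDF-` would be reported as PDF, which is the kind of mismatch that identification at ingest is meant to catch. Validation, meaning a check that a file is well-formed for its format, is a separate and more involved step that this sketch does not attempt.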

Screenshot of DArcInfo query results showing format type and count by accession for born-digital holdings.

By querying this database, we can determine and document our preservation backlog (how many assets we hold that do not yet have a preservation master file), giving us the scope of our to-do list. We can also inventory the format status of our digital holdings (including format type and version) and assess storage media (type, stability, and condition). We intend to use this information to identify the range of formats and how many files in each format we hold by accession. From there, we will draft a plan for targeting the most valuable and at-risk digital objects in our collections so that we can preserve them in accessible formats before they’re lost.
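The two queries described above, the preservation backlog and the count of files per format by accession, can be sketched with Python’s built-in `sqlite3` module. Everything specific here is an assumption: the real DArcInfo schema is not described in this post, so the `assets` table, its column names, and the sample rows (including the accession labels) are made-up placeholders.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical 'assets' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE assets (
               accession TEXT,
               format_name TEXT,
               format_version TEXT,
               has_preservation_master INTEGER
           )"""
    )
    # Placeholder rows for illustration only, not real holdings.
    rows = [
        ("ACC-A", "WordPerfect", "5.1", 0),
        ("ACC-A", "WordPerfect", "5.1", 0),
        ("ACC-A", "PDF", "1.4", 1),
        ("ACC-B", "TIFF", "6.0", 1),
        ("ACC-B", "InDesign", "1.0", 0),
    ]
    conn.executemany("INSERT INTO assets VALUES (?, ?, ?, ?)", rows)
    return conn


def format_counts(conn):
    """Count files per format and version within each accession."""
    return conn.execute(
        """SELECT accession, format_name, format_version, COUNT(*)
           FROM assets
           GROUP BY accession, format_name, format_version
           ORDER BY accession, format_name, format_version"""
    ).fetchall()


def backlog_count(conn):
    """Count assets that do not yet have a preservation master file."""
    return conn.execute(
        "SELECT COUNT(*) FROM assets WHERE has_preservation_master = 0"
    ).fetchone()[0]
```

With the placeholder rows, `backlog_count(build_demo_db())` returns 3, and `format_counts` returns one row per accession, format, and version, such as `("ACC-A", "WordPerfect", "5.1", 2)`. A table like that is the raw material for deciding which formats to target first.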


