Expanding Data’s Reach

by Allyson Ota

Aloha! I’m a second-year graduate student in the University of Hawaii’s LIS program, planning to graduate in May 2017 with an MLISc and a certificate in Archival Studies.

In our tech-driven world, librarianship has evolved to include positions that specialize in caring for digital objects and collections. This summer, I was fortunate to intern with the Digital Programs and Initiatives Division at Smithsonian Libraries through the Minority Awards Program. Under the supervision of Joel Richard, Head of Web Services, I worked on three digital projects that focused on expanding the reach of the Libraries’ collections.

DOIs: Keeping things together in the vastness of the Web.

"Fig. 143. Section of foundation lines and orb of the Orchard spider" from American Spiders and Their Spinningwork by Henry McCook (1889).


Project 1: Creating a workflow for batch DOI registration of Smithsonian publications

Smithsonian Libraries maintains a digital library with Books Online, the Smithsonian Research Online portal, and the DSpace document repository. These valuable resources hold citations or digital copies of publications for researchers and the general public to use. However, how easy is it for information to get lost in the vastness of the Internet? How permanent is a URL? While a hyperlink might work today, there are no guarantees it will work a year from now. Using a direct URL is not the most reliable way to ensure people can find bibliographic sources in citations. Digital Object Identifiers (DOIs) fill the role of tying publications to a unique and permanent identifier that holds metadata, and most importantly, the URI of an object. You can resolve a DOI using a browser, and it will redirect you to what you seek. For more about DOIs, check out a previous blog entry: Higher Profile for Scholarly Press Publications.
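To make the idea of resolving a DOI concrete, here is a minimal Python sketch that turns a bare DOI into a link a browser can follow through the public doi.org resolver. The function name doi_to_url and the sample DOI are placeholders chosen for illustration (the 10.5555 prefix is not a Smithsonian one), and the validity check is deliberately loose.

```python
from urllib.parse import quote

# Public DOI resolver; a browser visiting this URL is redirected to the object.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a resolver URL (illustrative helper, not a library API)."""
    doi = doi.strip()
    # Accept the common "doi:" prefixed form as well.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Loose sanity check: DOIs start with "10." and contain a "/" separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(doi, safe="/")


# Placeholder DOI used only for illustration.
print(doi_to_url("doi:10.5555/12345678"))
```

The point of the indirection is that the DOI stays the same even if the publication later moves to a different URL; only the record behind the resolver needs updating.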

I was asked to create a workflow for Systems Librarian Bess Missell that would allow her to register DOIs for items in our digital collections. I decided to work with a combination of software: OpenRefine for data cleanup, MS Excel with the Office XML Add-in to convert a spreadsheet into an XML file, and oXygen XML Editor to run XSLT code (co-written with Joel) that generates a second, properly formatted XML file for registering DOIs. I successfully registered DOIs for 24 articles from the Herpetological Information Service journal and for 2,668 digital books. I then submitted workflow documentation to Bess, detailing the process and requirements. I found this project highly rewarding. Providing access to information is one of my main areas of interest in librarianship, and learning about this behind-the-scenes process was an eye-opener.
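The spreadsheet-to-XML step can be illustrated with a short Python sketch. This is not the Excel and XSLT workflow described above, just a stand-in that shows the same idea using the standard library. The element names (records, record) and the column names are made up for the example and are not the schema required for DOI registration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(csv_text: str) -> str:
    """Convert CSV text into a simple XML document, one <record> per row.

    Column headers become element names, so they must be valid XML names.
    """
    root = ET.Element("records")  # hypothetical wrapper element
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # Skip surplus cells (no header) and empty cells.
            if column is None or not value or not value.strip():
                continue
            ET.SubElement(record, column).text = value.strip()
    return ET.tostring(root, encoding="unicode")


# Invented sample row, for illustration only.
sample = "title,year,url\nExample Book,1889,https://example.org/book/1\n"
print(rows_to_xml(sample))
```

In the real workflow, a second transformation would still be needed to reshape this intermediate XML into the format the registration agency accepts, which is the role the XSLT plays.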

I am both a Mac and a PC person

An awesome workspace: my desk, equipped with both an iMac and a PC!

Project 2: Linking library data to VIAF

This project tied into my first, as I explored the capabilities of OpenRefine to connect to other data sources. I experimented with linking library data to the Virtual International Authority File (VIAF). Libraries keep lists of authors, geographic locations, corporations, etc., which are referred to as their authority files. VIAF collects authority files from national libraries (and some other organizations) from countries around the world. By linking all these authority files together, VIAF creates a “super” authority record for an entry. For this project, I was asked to link our authors in the Libraries’ Digital Library (which consists of persons and organizations/corporations) to VIAF.

I was provided a database export of authors and tested two different open source reconciliation services found online. I documented the process of downloading each service and running it from within OpenRefine, an open source data cleanup tool, and was able to retrieve VIAF IDs for 36 of the 50 persons I tested. This was not a conclusive test, since I only worked with a small portion of the data and data quality issues could cause additional problems at full scale.

I discovered that a lot of manual work is still involved in verifying the matches, since those made by the reconciliation service were not always accurate. To ensure accuracy, a librarian should verify that each of our authors is linked to the correct authority record.
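One way to organize that manual verification is to sort reconciliation results by how confident the service was, so the librarian can start with the weakest matches. The Python sketch below assumes each candidate comes back as a dictionary with "id" and "score" keys and that a score of 90 is a sensible cutoff. Both are assumptions for illustration, and the names and IDs are invented placeholders, not real VIAF data.

```python
def split_for_review(results, min_score=90.0):
    """Separate likely matches from names that need a closer look.

    results maps each author name to a list of candidate dicts,
    each with an "id" and a numeric "score" (assumed field names).
    """
    likely, needs_review = [], []
    for name, candidates in results.items():
        # Pick the highest-scoring candidate, or None if there were no candidates.
        best = max(candidates, key=lambda c: c["score"], default=None)
        if best is not None and best["score"] >= min_score:
            likely.append((name, best["id"]))
        else:
            needs_review.append(name)
    return likely, needs_review


# Invented placeholder results, for illustration only.
sample_results = {
    "Author A": [{"id": "placeholder-1", "score": 97.0}],
    "Author B": [{"id": "placeholder-2", "score": 61.5}],
    "Author C": [],
}
likely, needs_review = split_for_review(sample_results)
print(likely)
print(needs_review)
```

A high score only prioritizes the work. It does not replace the check by a librarian, since a confident match can still point to the wrong person with the same name.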

Using OpenRefine and viaf_refine to connect to VIAF


This project was a success: it was meant as a test to see whether we could connect to VIAF via OpenRefine and retrieve VIAF IDs for our data, and we could. We hope that future Smithsonian Libraries projects can use VIAF authority record IDs to link to other data sources. For example, since Wikipedia also contains VIAF IDs, Wikipedia data could be linked and brought back into the Smithsonian Libraries catalog, enhancing the data and creating a richer experience for users of the collection.

VIAF used for Authority Control in Wikipedia

Wikipedia record using a VIAF ID

Project 3: Digital Curation of the Galaxy of Images

The Galaxy of Images (GoI) hosts over 16,000 images taken from digitized publications. Smithsonian Libraries staff use it to create educational outreach materials, and the general public can use it as well, since all images are in the public domain. The GoI will soon go through a migration to add functionality and modernize the current site. To prepare, I was asked to help with weeding the collection.

In librarianship, the term weeding describes the process of removing items from a collection. Generally, weeding is done to free up space for more desirable items. I focused on evaluating the images with the fewest views, applying criteria developed by Metadata Librarian Douglas Dunlop. In the end, I evaluated 10,205 images, assigning each a low, moderate, or high probability of being saved. Then I got to do the opposite of weeding, a process called selection, in which selectors add items to a library’s collection(s). This was more fun. For the selection process, I used Macaw, an application that collects and organizes page images from scanned or digitized books and gathers metadata about the pages for inclusion in the Biodiversity Heritage Library (BHL) as well as the GoI.

I selected images from 98 books and submitted them for consideration to be added to the Galaxy of Images. Using Macaw allowed me to interact with software being used by the Libraries in their everyday workflow and gave me the experience of selecting images for use in an online collection.

Selecting images in Macaw



Accessibility and engagement are huge aspects of librarianship, and all of my projects this summer were directed toward these ends. As the world becomes more technologically advanced and digital content is created at ever-increasing rates, the amount of information can be difficult to wade through. The Internet, with its vast resources and seemingly endless search possibilities, can become a place where information gets lost or is difficult to find. Librarianship has continually adjusted to meet the demands of society and rapidly evolving technology, and helping people navigate through it all makes digital librarianship integral to the field moving forward. I had an amazing time interning with Smithsonian Libraries, and I’m so thankful to have had this wonderful opportunity!



