Editor’s note: Rachel is an intern from the University of Maryland’s iSchool MLS program and has been with us for the past seven weeks. Her internship is coming to a close, so we’ve asked her to write a blog post to share what she has done as part of her internship. I have posted this on her behalf.
In January, Joel wrote about our plans to present the Taxonomic Literature-2 (TL-2) dataset as Linked Open Data, allowing for greater searchability and reuse. The main focus of my internship was to identify and investigate other data elements that could be converted to Linked Open Data.
The first piece of this project involved the Herbaria, which are collections of botanical specimens. In TL-2, most authors listed have donated plants or other specimens to herbaria, and the details of their donation are denoted by an acronym or shortened form of the herbarium’s name. After making some modifications to the XML file, we parsed the data to find the herbaria codes and added links to the Biodiversity Collections Index (BCI) URI within the XML.
However, there were some hiccups along the way. For example, some of the herbaria codes are simply single letters, such as “A”, which refers to Harvard University’s Arnold Arboretum. Clearly, not every “A” in the text refers to that herbarium! How would our simple pattern match determine when an “A” should be counted as a herbarium reference? To tackle this problem, we set up a script that found all the potentially problematic instances and let us manually approve or deny each herbarium link based on its context. This took some time, but a linked open dataset is only as good as the data itself!
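To give a sense of how such a review step might work, here is a minimal Python sketch. It is not the script we actually ran: the code list, the one-letter ambiguity threshold, and the function name are all assumptions made for illustration. It links longer codes automatically and sets aside single-letter codes, with their surrounding text, for a person to approve or deny.

```python
import re

# Hypothetical sample of herbarium codes; a real run would load the full list.
HERBARIA_CODES = {"A", "B", "K", "NY", "US"}

# Assumption: codes this short are too ambiguous to link without a human check.
MAX_AMBIGUOUS_LENGTH = 1


def find_herbaria_candidates(text, context_chars=30):
    """Split whole-word code matches into (auto_links, needs_review)."""
    # Try longer codes first so a long code is never shadowed by a shorter one.
    ordered = sorted(HERBARIA_CODES, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(c) for c in ordered) + r")\b")

    auto_links, needs_review = [], []
    for match in pattern.finditer(text):
        hit = {
            "code": match.group(1),
            "start": match.start(),
            # Keep nearby text so a reviewer can judge the match in context.
            "context": text[max(0, match.start() - context_chars):
                            match.end() + context_chars],
        }
        if len(hit["code"]) <= MAX_AMBIGUOUS_LENGTH:
            needs_review.append(hit)
        else:
            auto_links.append(hit)
    return auto_links, needs_review
```

Each entry in `needs_review` carries enough context for someone to decide whether, say, an “A” is the Arnold Arboretum or just the first word of a sentence.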
The next phase of our project involved research into authoring linked open data vocabularies. Aside from rich biographical and scientific information, TL-2 also contains unique identifiers for each author/botanist listed and each title referred to in TL-2. In addition, each title also has a unique TL-2 number. These three data points are significant in the world of botany research, and we wanted a way to present them as linked open data. After researching best practices, we settled on a vocabulary expressed in RDF (Resource Description Framework), a W3C-standard data model for the semantic web. We’ve loaded this vocabulary onto our Drupal server so that we, and other researchers in the field, can use its terms. (n.b. The link to the vocabulary is a downloadable RDF XML file and will not be displayed in your browser.)
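As a rough illustration of what these three data points look like once expressed as RDF, the Python sketch below writes them out as N-Triples, a simple line-based RDF syntax. The namespace and the term names (`titleId`, `tl2Number`, `author`) are hypothetical placeholders, not the terms in our actual vocabulary, and the example URIs are invented.

```python
# Hypothetical namespace; the real vocabulary's URIs are not reproduced here.
VOCAB = "http://example.org/tl2/vocab#"


def literal(value):
    """Format a value as a quoted N-Triples string literal."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject, predicate, obj):
    """Build one N-Triples statement; obj must already be formatted."""
    return f"<{subject}> <{predicate}> {obj} ."


def title_triples(title_uri, title_id, tl2_number, author_uri):
    """Describe one title: its identifier, its TL-2 number, and its author."""
    return [
        triple(title_uri, VOCAB + "titleId", literal(title_id)),
        triple(title_uri, VOCAB + "tl2Number", literal(tl2_number)),
        # The author is a resource of its own, so it is linked by URI.
        triple(title_uri, VOCAB + "author", f"<{author_uri}>"),
    ]
```

Because the author is referenced by URI rather than by name, other datasets can point at the same author record, which is the main payoff of publishing the identifiers this way.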
The last project I worked on involves linking the TL-2 authors to authority files using the Virtual International Authority File (VIAF) API. VIAF brings together authority files from dozens of international sources, including the Library of Congress. The tricky part of linking out is making sure that each TL-2 author is linked to the appropriate file. For many authors this is trivial, but for some it proved quite difficult. In our test data, variation in spelling, accent marks, and quirks of the TL-2 data resulted in incorrect VIAF matches, or in no matches for authors who are in fact listed in VIAF. We’re working on tweaking the search to maximize the correct hits and minimize or eliminate any erroneous matches.
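One piece of that tweaking can be sketched in a few lines of Python: normalizing names before comparing them, so that accent marks, punctuation, and capitalization do not cause a miss. This is only an illustration under my own assumptions. The function names and sample names are hypothetical, and the sketch does not call the VIAF API; it simply compares a TL-2 name against candidate names that a search has already returned.

```python
import unicodedata


def normalize_name(name):
    """Reduce a name to lowercase words with no accents or punctuation."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Turn punctuation into spaces, then collapse runs of whitespace.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in stripped)
    return " ".join(cleaned.casefold().split())


def match_candidates(tl2_name, candidate_names):
    """Return the candidates that equal the TL-2 name after normalization."""
    target = normalize_name(tl2_name)
    return [c for c in candidate_names if normalize_name(c) == target]
```

Exactly one match is a good sign; zero or several matches would still go to a person for review, and genuine spelling variants (for example “Mueller” for “Müller”) are not caught by this kind of normalization at all.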
Working with a large dataset of this nature has been consistently interesting. A lot of attention to detail, data-scouring, and some grunt work has gone into the project, along with more intensive analysis of data samples, and I leave with a deepened appreciation for the momentous work that is TL-2.
–Rachel Bloch Shapiro
Smithsonian Libraries is currently recruiting library and information science students for our Professional Development internships! Be sure to check them out.