In an earlier post in December 2011, we announced the release of the Taxonomic Literature II (TL-2) search tool that allows anyone to search and read its fifteen volumes. One of the things we mentioned in that post was our plan to open the TL-2 dataset for searching and reuse by providing it as Linked Open Data (LOD).
This time, we’ll discuss details of our plans for Linked Open Data, some of the data we are extracting, and the challenges in creating data for a linked open data set.
At the most basic level, the TL-2 data set contains authors (or botanists) and their publications. There is some basic information about each author and each publication, but we want to be able to offer more.
In addition to the author’s name, dates of birth and death, and a unique TL-2 abbreviation for the author, we have other sections that contain a good deal of information useful to researchers. They contain such things as:
The key to linked data is that we need to link it somewhere. To that end, we need to find identifiers for data elements out on the web. We decided that the easiest place to start is with the herbarium names.
In the world of botany, each herbarium is given a unique abbreviation. These are the same abbreviations that are used in TL-2. Therefore, our first task is to parse these out of the text and create links to them out on the web, while keeping the data in a reusable form until we import it into Drupal. (See more on our Digital Library in a previous post.)
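To make the idea concrete, here is a minimal sketch of pulling abbreviations out of a line of text and pairing them with URIs. It is not our production code. The sample sentence, the `link_herbaria` function, and the `example.org` URIs are all placeholders invented for illustration, and the rule that an abbreviation is a short run of capital letters is an assumption that would produce false positives on real text.

```python
import re

# Hypothetical lookup table: herbarium abbreviation -> URI.
# These URIs are placeholders, not real identifiers from any service.
HERBARIUM_URIS = {
    "BM": "http://example.org/herbarium/BM",
    "K": "http://example.org/herbarium/K",
    "NY": "http://example.org/herbarium/NY",
    "US": "http://example.org/herbarium/US",
}

# Assumption: an abbreviation is 1 to 6 capital letters standing alone.
CODE_PATTERN = re.compile(r"\b[A-Z]{1,6}\b")


def link_herbaria(text, lookup=HERBARIUM_URIS):
    """Return (linked, unknown): codes found in the lookup, and codes that were not."""
    linked, unknown = [], []
    for code in CODE_PATTERN.findall(text):
        if code in lookup:
            linked.append((code, lookup[code]))
        else:
            unknown.append(code)
    return linked, unknown


# Illustrative sentence, not a quotation from TL-2.
sample = "Herbarium and types: BM, K, NY; further material at XYZ."
linked, unknown = link_herbaria(sample)

for code, uri in linked:
    print(code, "->", uri)
print("Needs a closer look:", unknown)
```

Keeping the unmatched codes in their own list, rather than discarding them, means nothing is silently lost when the lookup table is incomplete.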
We found that The Biodiversity Collections Index offers permanent LSID URIs for all of the herbaria around the world. Even better, these URIs provide linked data! For January and February we have an intern working at the Smithsonian Libraries. Rachel Bloch Shapiro, from the University of Maryland Masters of Library Science program, has a strong interest in linked data and is currently working to parse the data and create the links. This data will be placed back into the XML file that we provide for download.
After herbaria, we are going to link botanist names to the Virtual International Authority File (VIAF). This is a pretty clear-cut task in that the author names are already parsed. Beyond that, we would like to link the authors’ publications to the Biodiversity Heritage Library and WorldCat. There are also opportunities to parse the content for geographic names and link them to the GeoNames database, but this task is complicated by the fact that we have locations as subjects of publications as well as locations in which the titles were published. We want the former more than we want the latter.
This creation of linked data elements is sometimes a daunting task. In many cases, we might have existing databases that we can then reveal as linked open data. This is relatively easy. In the case of TL-2, we have started with OCR text and we are parsing text with computer code built with rules and algorithms to find the data elements. We can rely on the computer to do the easy parts while falling back on personal attention from staff to identify those that can’t be deciphered by the computer code.
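As a rough illustration of that division of labor, the sketch below applies one rule to each line and sets aside anything the rule cannot handle. The heading layout ("Name (birth-death)"), the function names, and the two sample lines are assumptions made up for this example. The real TL-2 text is more varied and needs many more rules.

```python
import re

# Assumed shape of an author heading: "Name (birth-death)".
# This layout is illustrative only; real entries may differ.
HEADING = re.compile(
    r"^(?P<name>[^()]+?)\s*\((?P<born>\d{4})\s*-\s*(?P<died>\d{4})\)"
)


def parse_heading(line):
    """Return a dict of parsed fields, or None when the rule does not apply."""
    match = HEADING.match(line.strip())
    if match is None:
        return None
    born, died = int(match["born"]), int(match["died"])
    if born >= died:  # implausible dates often mean an OCR misread
        return None
    return {"name": match["name"].strip(), "born": born, "died": died}


def triage(lines):
    """Split lines into machine-parsed records and lines needing staff review."""
    parsed, review = [], []
    for line in lines:
        record = parse_heading(line)
        if record is None:
            review.append(line)
        else:
            parsed.append(record)
    return parsed, review


# Illustrative lines, not taken from the TL-2 files.
lines = [
    "Linnaeus, Carl (1707-1778)",
    "Darwin, Charles (18O9-1882)",  # OCR read the zero as a letter O
]
parsed, review = triage(lines)
print(parsed)
print("For staff review:", review)
```

The second line fails the rule because of a single misread character, which is exactly the kind of case that ends up in front of a person.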
Another challenge in linking data is actually identifying other sources of linked open data on the web that are accurate and reliable and that we expect to be around forever. Forever is asking for a lot from an Internet that seems to lose links on an hourly basis, but linked open data demands durable URIs and software that supports these URIs and provides support for automatic redirects. I like to think that by publishing Linked Open Data we are entering into an unspoken contract to ensure that the website and URIs will be there indefinitely.
The third part of the challenge is to make accurate connections. This is where the expertise of our staff really comes into play to make sure that we are providing the best data that we can. If we link a botanist’s name to the wrong entry at DBpedia or VIAF, then we are failing in our mission to provide knowledge to our researchers and to the world. This is the critical part of Linked Open Data: without good, accurate, reliable connections, the model doesn’t work.
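One way to lean toward accuracy is to accept a link automatically only when a candidate agrees on more than the name. The sketch below is a hedged illustration of that idea, not a description of how VIAF or DBpedia work. The record layout, the `pick_match` function, and the sample people are all hypothetical, and no real service is queried.

```python
import unicodedata


def normalize(name):
    """Lowercase, strip accents and punctuation, and split into words."""
    decomposed = unicodedata.normalize("NFKD", name)
    letters = (c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c for c in letters if c.isalnum() or c.isspace())
    return kept.lower().split()


def pick_match(author, candidates):
    """Return the single candidate agreeing on name and both dates, else None."""
    agreeing = [
        c for c in candidates
        if normalize(c["name"]) == normalize(author["name"])
        and c["born"] == author["born"]
        and c["died"] == author["died"]
    ]
    # Zero matches or several matches both go to a person, not to the data set.
    return agreeing[0] if len(agreeing) == 1 else None


# Made-up records for illustration only.
author = {"name": "Éxample, Anna", "born": 1800, "died": 1870}
candidates = [
    {"id": "candidate-1", "name": "Example, Anna", "born": 1800, "died": 1870},
    {"id": "candidate-2", "name": "Example, Anna", "born": 1850, "died": 1920},
]
match = pick_match(author, candidates)
print(match["id"] if match else "send to staff")
```

Here the two candidates share a name, and only the dates tell them apart. When the automatic check cannot settle on exactly one candidate, the decision goes back to our staff.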
Over the next few months, we will finalize this data and release a new version of the TL-2 search engine that will behave much as the current search does, but hidden underneath will be linked open data elements enhancing and providing context to our data while doing the same for other data sets on the internet.