Creating the Transcription Cleanup Tool

During my time in the Kathryn Turner Diversity and Technology Internship, I worked with my mentor to create software that takes completed projects from the Smithsonian Transcription Center and cleans up the data even further. First, though, I had to understand what I was ‘cleaning up’. The Transcription Center is organized into projects and themes. Many of the projects from the Smithsonian Institution Archives’ collections are field books containing information such as day-by-day journaling, research data, or letters from the 19th century. These projects and field books are typically handwritten and need to be transcribed so that their full text can be included and searched in other platforms, like the Smithsonian Collections Search Center and the Biodiversity Heritage Library (BHL).

Page of Specimen List from C. Hart Merriam’s Biological Survey of the San Francisco Mountain Region, 1889

To avoid leaving out any information, I, like other Transcription Center volunteers, was instructed to type out everything exactly as I saw it. That’s where tags come in. Tags are used to show exactly what is written on the page when transcribing it. For example, a tag could mark a phrase that is underlined or inserted into the text, or indicate an image or stamp on the page. I helped transcribe a few pages of the Freedmen’s Bureau project in order to learn what these tags looked like and how they would be important for field books. As tags accumulate, they can make the transcription look messy, which is why it was necessary for us to develop the Transcription Cleanup tool.

A visual of the original and transcribed content in the Transcription Center.

Joel Richard, my mentor, and I coded the software to read through the downloaded, transcribed pages and remove the most commonly used tags. Removing the tags is important because they are not meaningful words for searching and can cause problems in the BHL search engine. Creating this software was completely different from what I’ve done in school because I hadn’t yet completed any projects with real-world applications. Joel and I would meet virtually almost every day to discuss the next steps of the software and go over any questions I had.

One new thing that I learned is regular expressions, or regex. A regular expression is a sequence of characters that defines a pattern for finding, replacing, and managing specific text. One example of a regex character is the question mark, ?, which makes the character before it optional. For it to work, the question mark has to be placed after the character it applies to, and the same is true of other quantifiers. A plus sign, +, matches one or more of the preceding character. An example of a regex pattern is \[\[/?underlined?\]\], which matches an optional “/” followed by “underline” with an optional “d”, all inside literal double square brackets (the backslashes tell the regex to treat the brackets as plain characters). In other words, it matches [[underline]], [[/underline]], [[underlined]], and [[/underlined]]. I’d say regular expressions were the biggest part of the code we created because without them we wouldn’t have been able to efficiently remove the tags for this project.
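To make the pattern above concrete, here is a minimal sketch in Python of how it could be used to strip underline tags from a line of text. This is only an illustration, not the actual Transcription Cleanup code: the function name strip_underline_tags and the sample sentence are made up for this example, and it assumes tags are written as [[underline]] ... [[/underline]].

```python
import re

# The pattern from the text: "[[", an optional "/", "underline",
# an optional "d", then "]]". The backslashes make the brackets literal.
UNDERLINE_TAG = re.compile(r"\[\[/?underlined?\]\]")


def strip_underline_tags(text: str) -> str:
    """Remove opening and closing underline tags, keeping the words between them.

    Hypothetical helper for illustration only.
    """
    return UNDERLINE_TAG.sub("", text)


# Made-up sample line, not taken from a real transcription.
sample = (
    "Collected [[underline]]near the summit[[/underline]] "
    "on Aug. 3 [[underlined]]1889[[/underlined]]."
)
cleaned = strip_underline_tags(sample)
print(cleaned)  # Collected near the summit on Aug. 3 1889.
```

Because the pattern only matches the tag itself, the underlined words stay in the text and only the markup is removed. Tags with other names are left alone, so each kind of tag would need its own pattern or a more general one.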

Comparison of the original content vs. the output after running the Transcription Cleanup software

As a computer science student, being an intern for the Smithsonian was a great way for me to get my foot in the door. There was no pressure if I didn’t know something and needed a small lesson, and I could freely communicate with my mentor. I am beyond grateful for this opportunity to create, learn, and experience.
