In my part of the Smithsonian Libraries, we work with data. You'll hear talk of "big data," which often refers to data sets far larger than what we work with here, but for the sake of this blog post I'm going to use the term Big Data because I'm working with files that are far larger than anything we've handled before… and it's a sign of things to come. As the world changes, libraries are working with more (and bigger) data than ever before.
The Challenge
One of my recent tasks has been to copy some data from the Internet Archive so that we have a local copy at the Smithsonian Libraries. The long-term goal of this project is out of scope for this blog post, but I thought it would be fun to talk about our experiences with attempting to transfer many terabytes (TB) of data across the country. This is an endeavor fraught with peril because networks, as fast as they are, can't keep up with the amount of data we are looking to move.
Attempt #1
My first attempt at transferring the data was to send it over the wire, from California to Washington, DC. There are numerous hops in the network between here and there, so it was difficult to optimize the transfer. But since this was a trial, and we only had 4 TB to move, it couldn't be that bad, right? I figured I'd be clever and write a smart little script that made sure nothing was copied more than once. I set it to run unattended, day and night, and even to restart itself if something went wrong, so I could forget about it for a week or so.
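For the curious, here is a minimal sketch of that idea in Python. This is not the actual script: the file list, URLs, and destination path are hypothetical placeholders. It just shows the two tricks, skipping anything that has already been copied and retrying when something goes wrong.

```python
#!/usr/bin/env python3
"""Minimal sketch of a "never copy anything twice" downloader (hypothetical paths)."""
import time
import urllib.request
from pathlib import Path

FILE_LIST = "files_to_copy.txt"        # hypothetical: one remote URL per line
DEST_DIR = Path("/data/ia-mirror")     # hypothetical local destination

def copy_once(url: str, dest: Path) -> None:
    if dest.exists() and dest.stat().st_size > 0:
        return                              # already copied -- skip it
    tmp = dest.with_suffix(dest.suffix + ".part")
    urllib.request.urlretrieve(url, tmp)    # download to a temporary name first
    tmp.rename(dest)                        # only keep it once the copy finished

def main() -> None:
    DEST_DIR.mkdir(parents=True, exist_ok=True)
    urls = Path(FILE_LIST).read_text().split()
    while True:                             # "restart itself if something went wrong"
        try:
            for url in urls:
                copy_once(url, DEST_DIR / url.rsplit("/", 1)[-1])
            break                           # everything copied -- we're done
        except OSError as err:
            print(f"hiccup: {err}; retrying in a minute")
            time.sleep(60)

if __name__ == "__main__":
    main()
```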
It turns out it took over six weeks to copy the data. This did not bode well for the 60 TB I had left to copy; there was no way we could move that much over the network. It would take more than a year!
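Here's the back-of-the-envelope math behind that claim, based on the rate we actually observed (about 4 TB in a bit over six weeks):

```python
# Rough projection from the observed rate: ~4 TB in roughly six weeks.
observed_tb = 4
observed_weeks = 6
remaining_tb = 60

weeks_needed = remaining_tb / observed_tb * observed_weeks
print(f"~{weeks_needed:.0f} weeks, or about {weeks_needed / 52:.1f} years")
# ~90 weeks, or about 1.7 years
```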
Attempt #2
We knew that a copy of this data already existed at one of our partner organizations, and it just so happened that they had some of it (about 18 TB) on external hard drives that they could ship to us. The idea was simply to plug those hard drives into a server and start copying. That would be super fast! Certainly faster than using the internet, right?
Well, it was definitely faster, but it still wasn't enough. Doing the math proved there was still a bottleneck: instead of a network slowing us down, we had USB to contend with. We use USB every day for our mice, keyboards, webcams, printers, you name it, and it's pretty fast, but when you're talking about more than a dozen terabytes, it starts to break down.
USB 2.0 (dubbed "High Speed" back in the day) is theoretically capable of transmitting data at a maximum rate of 480 Mbit/s. Theoretically. The performance of the server, the numerous bits and pieces in between, and even cable quality can all eat into that maximum. But for the sake of argument, let's assume we can sustain 35 MB/sec. That's a reasonably high-end real-world number for some simple math.
18 TB = ~18.9 million megabytes (MB)
18.9 million MB / 35 MB per second = 539,267 seconds
539,267 seconds = about 6.2 days (!)
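If you'd like to play with these numbers yourself, the same arithmetic fits in a tiny Python helper. (The function is my own; it isn't from any library.)

```python
def transfer_days(terabytes: float, mb_per_sec: float) -> float:
    """Estimated days to move `terabytes` at a sustained `mb_per_sec` (binary units)."""
    megabytes = terabytes * 1024 * 1024   # 1 TB = 1,048,576 MB
    seconds = megabytes / mb_per_sec
    return seconds / 86_400               # 86,400 seconds in a day

print(round(transfer_days(18, 35), 1))    # 6.2 -- days over USB 2.0 at 35 MB/sec
```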
This is going to be challenging because it's not a matter of just plugging the disks in and walking away. You need to be there to swap them out, monitor their progress, respond to problems, and so on. So this isn't the best option either. Our time is valuable, and babysitting a stack of hard drives is hardly a worthwhile way to spend it.
(As it turned out, we had other issues with the disks as well, such as a format that was incompatible with the server and a server location that was inconvenient. This approach works, but it's not ideal.)
Attempt #3
This one is still in progress, but it offers the most hope. Instead of purchasing a bunch of external hard drives, we bought a NAS, that is, a network-attached storage device that sits on the network and provides (at a minimum) shared drive space. Since we can pack the NAS full of hard drives (six of them, 4 TB each) and it provides better data reliability than external hard drives, it was something of a no-brainer. In a few years, when hard drives increase in capacity, we can swap them out and gain even more space in the same physical box.
The idea is to fill this NAS with files (which does take time) and then ship it somewhere to be downloaded.
At this point, I'm sure you're thinking, "How can this be better than using an external hard drive?" The answer is that this NAS has dual gigabit network connections that can work in tandem. The average expected transfer speed over Gigabit Ethernet hovers around 40-60 MB/sec, and with both network cables working at once we should get roughly double that. Let's do our math again, using 50 MB/sec (times two).
18.9 million MB / 100 MB per second = 188,744 seconds, or about 2.2 days
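Running the same little helper from the USB section on the NAS numbers (repeated here so it stands on its own) shows why having both links matters:

```python
def transfer_days(terabytes: float, mb_per_sec: float) -> float:
    megabytes = terabytes * 1024 * 1024   # 1 TB = 1,048,576 MB
    return megabytes / mb_per_sec / 86_400

print(round(transfer_days(18, 50), 1))    # 4.4 -- days on a single gigabit link
print(round(transfer_days(18, 100), 1))   # 2.2 -- days with both links in tandem
```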
This is a bit more manageable and comes with the added bonus that it can run unattended for longer and be accessed remotely more easily. There's no need for a pair of hands to do anything but plug it in at the start and unplug it at the end. Finally, in terms of ease of use, the NAS eliminates any issues surrounding disk formats and compatibility with servers.
Conclusion
Although the jury is still out on how fast the NAS will perform, its added features make it a much more intelligent device and arguably a better long-term solution for this project. I look forward to getting my hands dirty with this data, but only to the point where it starts copying. Watching data copy is like watching paint dry. The results will be fun, but getting there is boring.