Summer Institute on Data Curation

The University of Illinois' Graduate School of Library and Information Science held its 3rd annual Summer Institute on Data Curation. This year's focus was on earth and environmental sciences data. Because library services and the world of scientific publishing are both changing, many people are beginning to believe that the raw data collected by scientists will be used and re-used for years, just as book and journal literature is today.

Science librarians from universities, government offices and other organizations attended to hear presentations from a variety of professionals in the field. Representatives from the Oak Ridge National Laboratory, the National Biological Information Infrastructure, the National Snow and Ice Data Center and the National Center for Atmospheric Research, among others, detailed their work in collecting, describing, standardizing and making available the scientific data collected at their respective organizations.

These presenters and others each illustrated a different facet of scientific data curation but agreed on several points:

  • Many raw data sets are "undocumented" and at risk of loss
  • Many scientists are reluctant to share their data for reasons that are understandable
  • Formally publishing, and consequently citing, data is one method to encourage centralized collection and sharing, but there are several questions to be answered first
  • The librarian-scientist relationship, while well established in some institutions, needs to be modified to bring the librarian into the workflow at points other than the literature search and access phase of research

While there has been publicity for some high-profile, large data sets from astrophysics and other research endeavors, the vast majority of scientific data is generated by what is termed "small science." These are typically projects with a single PI and perhaps a graduate student assistant. The management of the data sets is often left to the graduate students, who of course tend to take a professional position elsewhere after completing their degree. Small data sets are commonly found on researchers' hard disks or on servers but are frequently incompletely described or annotated.

Scientists are hesitant to share data they have collected for several reasons. They naturally want to publish scholarship based on the work they've done, and they fear that others will use their data collection work to create publications of their own. Some also fear that their data will be misused or misinterpreted by other scientists, or that it will be found flawed or somehow false. But the biggest obstacle appears to be that the management of data sets is an additional task that most scientists are not familiar with, were not trained in and feel they don't have time for. Many Institute attendees agreed that librarians are a natural remedy for this last concern.

One incentive for adequately collecting, describing and making available scientific data would be to include it in a system of reward or recognition, as scholarly publications are. A scientist whose data set was re-used would be cited in subsequent publications, and these citations would be counted in the same way that citations to published papers are. However, formal publication of data sets also has issues to be worked out, including the lack of a common set of standards and the amorphous nature of data, which can be generated on the fly or combined with other partial data to produce what is in a sense a new, distinct data set.

Carole Palmer, a noted author and director of the Center for Informatics Research in Science & Scholarship at the University of Illinois, perhaps best summarized the thrust of the Institute when she said that librarians should inject themselves into the research process at different points. She suggested that librarians turn their attention to "service development" in addition to collection development, which has been a common library activity in the past.

