Open data – The Boreal Beetle

by Dezene Huber and Paul Fields, reblogged from the ESC-SEC Blog.

Have you ever read a paper and, after digesting it for a bit, thought: “I wish I could play with the data”?

Perhaps you thought that another statistical test was more appropriate for the data and would provide a different interpretation than the one given by the authors. Maybe you had completed a similar experiment and you wanted to conduct a deeper comparison of the results than would be possible by simply assessing a set of bar graphs or a table of statistical values. Maybe you were working on a meta-analysis and the entire data set would have been extremely useful in your work. Perhaps you thought that you had detected a flaw in the study, and you would have liked to test the data to see if your hunch was correct.

Whatever your reason for wishing to access the data (and this list probably just skims the surface of the sea of possibilities), you often have only one option for getting your hands on the spreadsheets or other data outputs from the study: contacting the corresponding author.

Sometimes that works. Often it does not.

The corresponding author may no longer be affiliated with the listed contact information. Tracking her down might not be easy, particularly if she has moved on from academic or government research.

The corresponding author may no longer be alive, the fate of us all.

You may be able to track down the author, but the data may no longer be available. Perhaps the student or postdoc who produced it is now out of contact with the PI. But even if efforts have been made to retain lab notebooks and similar items, is the data easily sharable?

And, even if it is potentially sharable (for instance, in an Excel file), are the PI’s records organized enough to find it?*

The author may be unwilling to share the data for one reason or another.

Molloy (2011) covers many of the above points and also goes into much greater depth on the topic of open data than we are able to do here.

In many fields of study, the issues that we mention above are the rule rather than the exception. Some readers may note that a few fields have had policies to avoid issues like this for some time. For instance, genomics researchers have long used repositories such as NCBI to deposit data when a study is published. And taxonomists have deposited labeled voucher specimens in curated collections for longer than any of us have been alive. Even in those cases, however, there are usually data outputs from studies associated with the deposited material that never again see the light of day. So even these apparent exceptions still suffer from the general lack of access to data.

But, what if things were different? What might a coherent open data policy look like? The Amsterdam Manifesto, which is still a work in progress, may be a good start. Its points are simple, but potentially paradigm-shifting. It states that:

Data should be considered citable products of research.
Such data should be held in persistent public repositories.
If a publication is based on data not included in the text, those data should be cited in the publication.
A data citation in a publication should resemble a bibliographic citation.
A data citation should include a unique persistent identifier (a DataCite DOI recommended, unless other persistent identifiers are in use within the community).
The identifier should resolve to provide either direct access to the data or information on accessibility.
If data citation supports versioning of the data set, it should provide a method to access all the versions.
Data citation should support attribution of credit to all contributors.
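
As a rough illustration of what several of these points could mean in practice, here is a minimal Python sketch of a data citation that resembles a bibliographic citation, credits all contributors, carries a persistent identifier, and records a version. The `DataCitation` class, its field names, the citation layout, and the example DOI are all assumptions made up for this sketch; they are not prescribed by the manifesto or by any repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    """Hypothetical record of the pieces a data citation might carry."""
    contributors: List[str]        # everyone who should receive credit
    year: int
    title: str
    repository: str                # the persistent public repository
    doi: str                       # unique persistent identifier
    version: Optional[str] = None  # set when the data set is versioned

    def url(self) -> str:
        # DOIs are resolved through the doi.org proxy.
        return f"https://doi.org/{self.doi}"

    def format(self) -> str:
        # One possible layout; real journals and repositories differ.
        authors = "; ".join(self.contributors)
        version = f" Version {self.version}." if self.version else ""
        return (f"{authors} ({self.year}). {self.title}.{version} "
                f"{self.repository}. {self.url()}")


# Made-up example values, for illustration only.
citation = DataCitation(
    contributors=["Researcher, A.", "Student, B."],
    year=2013,
    title="Example beetle trap counts",
    repository="Example Repository",
    doi="10.1234/example.5678",
    version="2",
)
print(citation.format())
```

A citation built this way could sit in a reference list next to ordinary article citations, with the identifier leading the reader to the data or to information about how to access it.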

This line of reasoning is no longer just left to back-of-napkin scrawls. Open access to long-term, citable data is slowly becoming the norm rather than the exception. Several journals have begun to require, or at least strongly suggest, deposition of all data associated with a study at the time of submission. These include PeerJ and various PLoS journals. It is more than likely that other journals will do the same, now that this ball is rolling.

The benefits of open data are numerous (Molloy, 2011). Full disclosure of data allows others to verify your results. Openness also allows others to use your data in ways that you may not have anticipated. It ensures that the data reside alongside the papers that stemmed from them. It reduces the likelihood that your data may be lost due to various common circumstances. Above all, it takes the most common of scientific outputs, the peer-reviewed paper, and adds lasting value for ongoing use by others. We believe that these benefits outweigh the two main costs: the time taken to organize the data and the effort involved in posting it in an online data repository.

If this interests you, and we hope that it does, the next question on your mind is probably “where can I deposit the data for my next paper?” There are a number of options available that allow citable (DOI) archiving of all sorts of data types (text, spreadsheets, photographs, videos, even that poster or presentation file from your last conference presentation). These include figshare, Dryad, various institutional repositories, and others. You can search for specific repositories at OpenDOAR using a number of criteria. When choosing a data repository, it is important that you ensure that it is backed up by a system such as CLOCKSS.

Along with the ongoing expansion of open access publishing options, open data archiving is beginning to come into its own. Perhaps you can think of novel ways to prepare and share the data from your next manuscript, talk, or poster presentation for use by a wide and diverse audience.

—–

* To illustrate this point, one of us (DH) still has access to the data for the papers that stemmed from his Ph.D. thesis research. Or at least he thinks that he does. They currently reside on the hard drive of the Bondi blue iMac that he used to write his thesis, and that is now stored in a crawlspace under the stairs at his house. Maybe it still works and maybe the data could be retrieved. But it would entail a fair bit of work to do that (not to mention trying to remember the file structure more than a decade later). And digital media have a shelf life, so data retrieval may be out of the question at this point anyhow.