Open data

by Dezene Huber and Paul Fields, reblogged from the ESC-SEC Blog.

Have you ever read a paper and, after digesting it for a bit, thought: “I wish I could play with the data”?

Perhaps you thought that another statistical test was more appropriate for the data and would provide a different interpretation than the one given by the authors. Maybe you had completed a similar experiment and you wanted to conduct a deeper comparison of the results than would be possible by simply assessing a set of bar graphs or a table of statistical values. Maybe you were working on a meta-analysis and the entire data set would have been extremely useful in your work. Perhaps you thought that you had detected a flaw in the study, and you would have liked to test the data to see if your hunch was correct.

Whatever your reason for wishing to access the data – and this list probably just skims the surface of the sea of possibilities – you often have only one option for getting your hands on the spreadsheets or other data outputs from the study: contacting the corresponding author.

Sometimes that works. Often it does not.

  • The corresponding author may no longer be affiliated with the listed contact information. Tracking her down might not be easy, particularly if she has moved on from academic or government research.
  • The corresponding author may no longer be alive, the fate of us all.
  • You may be able to track down the author, but the data may no longer be available. Perhaps the student or postdoc who produced it is now out of contact with the PI. And even if efforts have been made to retain lab notebooks and similar items, are the data easily sharable?
  • And, even if the data are potentially sharable (for instance, in an Excel file), are the PI’s records organized enough to find them?*
  • The author may be unwilling to share the data for one reason or another.

Molloy (2011) covers many of the above points and also goes into much greater depth on the topic of open data than we are able to do here.

In many fields of study, the issues that we mention above are the rule rather than the exception. Some readers may note that a few fields have had policies to avoid such issues for some time. For instance, genomics researchers have long used repositories such as those provided by NCBI to deposit data at the time a study is published. And taxonomists have deposited labeled voucher specimens in curated collections for longer than any of us have been alive. Even in those cases, however, there are usually data outputs associated with the deposited material that never again see the light of day. So even these exceptions that prove the rule end up reinforcing it: most data remain inaccessible.

But, what if things were different? What might a coherent open data policy look like? The Amsterdam Manifesto, which is still a work in progress, may be a good start. Its points are simple, but potentially paradigm-shifting. It states that:

  1. Data should be considered citable products of research.
  2. Such data should be held in persistent public repositories.
  3. If a publication is based on data not included in the text, those data should be cited in the publication.
  4. A data citation in a publication should resemble a bibliographic citation.
  5. A data citation should include a unique persistent identifier (a DataCite DOI is recommended, unless other persistent identifiers are in use within the community).
  6. The identifier should resolve to provide either direct access to the data or information on accessibility.
  7. If data citation supports versioning of the data set, it should provide a method to access all the versions.
  8. Data citation should support attribution of credit to all contributors.
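Points 5 and 6 are easy to see in action. The short sketch below uses DOI content negotiation (a service that DataCite and Crossref provide through doi.org) to turn a dataset DOI into citation metadata; the DOI shown is hypothetical, so substitute a real one before trying it.

```python
# A minimal sketch of manifesto points 5 and 6: a persistent identifier
# that resolves to the data, or to information about the data.
# The DOI below is hypothetical; substitute a real DataCite dataset DOI.
import requests

doi = "10.5061/dryad.example"  # hypothetical dataset DOI

# Asking doi.org for CSL JSON returns citation metadata rather than
# redirecting the request to the dataset's landing page.
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=10,
)
if resp.ok:
    meta = resp.json()
    print(meta.get("title"), "->", meta.get("URL"))
else:
    print(f"DOI did not resolve: HTTP {resp.status_code}")
```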

This line of reasoning is no longer just left to back-of-napkin scrawls. Open access to long-term, citable data is slowly becoming the norm rather than the exception. Several journals, including PeerJ and various PLoS journals, have begun to require, or at least strongly suggest, deposition of all data associated with a study at the time of submission. It is more than likely that other journals will do the same now that this ball is rolling.

The benefits of open data are numerous (Molloy, 2011). They include the fact that full disclosure of data allows for verification of your results by others. Openness also allows others to use your data in ways that you may not have anticipated. It ensures that the data reside alongside the papers that stemmed from them. It reduces the likelihood that your data will be lost to the sorts of common circumstances described above. Above all, it takes the most common of scientific outputs – the peer-reviewed paper – and adds lasting value for ongoing use by others. We believe that these benefits outweigh the two main costs: the time taken to organize the data and the effort involved in posting them to an online data repository.

If this interests you, and we hope that it does, the next question on your mind is probably “where can I deposit the data for my next paper?” There are a number of options available that allow citable (DOI) archiving of all sorts of data types (text, spreadsheets, photographs, videos, even that poster or presentation file from your last conference presentation). These include figshare, Dryad, various institutional repositories, and others. You can search for specific repositories at OpenDOAR using a number of criteria. When choosing a data repository, it is important to ensure that it is backed up by a system such as CLOCKSS.
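As a rough illustration of how such repositories expose citable records, the sketch below queries figshare’s public web API for a public item and prints its title and DOI. It assumes figshare’s v2 REST endpoint (worth verifying against their current documentation), and the article ID is hypothetical.

```python
# A rough sketch: look up a public figshare item and print its citable DOI.
# Assumes figshare's v2 public API; the article ID below is hypothetical.
import requests

article_id = 123456  # hypothetical figshare article ID
resp = requests.get(
    f"https://api.figshare.com/v2/articles/{article_id}",
    timeout=10,
)
if resp.ok:
    item = resp.json()
    print(item.get("title"), "-> DOI:", item.get("doi"))
else:
    print(f"Lookup failed: HTTP {resp.status_code}")
```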

Along with the ongoing expansion of open access publishing options, open data archiving is beginning to come into its own. Perhaps you can think of novel ways to prepare and share the data from your next manuscript, talk, or poster presentation for use by a wide and diverse audience.

—–

* To illustrate this point, one of us (DH) still has access to the data for the papers that stemmed from his Ph.D. thesis research. Or at least he thinks that he does. They currently reside on the hard drive of the Bondi blue iMac that he used to write his thesis, and that is now stored in a crawlspace under the stairs at his house. Maybe it still works and maybe the data could be retrieved. But it would entail a fair bit of work to do that (not to mention trying to remember the file structure more than a decade later). And digital media have a shelf life, so data retrieval may be out of the question at this point anyhow.

Open access… Canada?

Today marked a major milestone for open science. Specifically, the Obama administration announced a directive requiring all US federal agencies that receive over $100 million in research and development funds to create a plan to ensure open access to all research outputs within a reasonable time frame.

To quote from the Obama administration memorandum:

“To achieve the Administration’s commitment to increase access to federally funded published research and digital scientific data, Federal agencies investing in research and development must have clear and coordinated policies for increasing such access.”

You can read more about it here, and here.

A number of other countries, including Canada, have mandatory open access policies for some of their taxpayer-funded research, but for the most part the policies apply to health-related research. And in many cases you can also find research stemming directly from federal scientists freely available on the web.

In some cases (e.g. the UK, Australia, and a few others) open access is mandated for all federally funded research. And now that the US has taken this step toward full openness, I think it’s fair to say that there is a lot of pressure on countries that haven’t done the same to get moving down that track.

I’m looking at you, Canada!

Like many other countries on that list, Canada has some mandatory open access policies, but they mainly pertain to the health sciences. There have been rumblings of more openness from the Canadian government, as noted by one of my Twitter contacts. But the steps taken by the UK, Australia, and now the US are good indicators that Canada’s steps so far have been baby steps at best. It’s time for that to change.

Why should we, as Canadians, call for a mandatory open access policy for all federally funded research? Here, in brief, are a few reasons that come to mind, and I know that there are more:

  • Fairness. Taxpayers paid for the research. Why should they also have to pay to access the results of the research?
  • Open access accelerates the pace of discovery. Although I’m at a small university, the UNBC library is well-stocked with many of the journals that the folks in my research program and I use. But we occasionally come across articles that we need that are unavailable. The choice then is to keep looking for the information elsewhere, pay up at the paywall, or go through the interlibrary loan process. Our librarians are superb at getting access to individual journal articles that we need, but not everyone is lucky enough to be affiliated with a good library at a good institution. There are many scientists who do not have access to these kinds of services, and they either have to pay or hope to find the information elsewhere. And most members of the general public have no access to such services at all. Open access removes those barriers and allows research to move ahead more efficiently.
  • Open access makes research more relevant and reduces the temptation to “hoard” data. Open access allows other researchers and the general public to look at research outputs in all sorts of unpredictable ways. Full accessibility lets the full diversity of interests see and think about the work and, hopefully, take it to new and unpredictable places. In addition, while my little corner of the scientific endeavor (forest entomology, for the most part) is generally not beset by researchers afraid of being “scooped,” this tendency is present to some extent in all fields, and to a large extent in certain fields. Hoarding data in the hope of reaping the research glory results in competitive, rather than collaborative, use of research dollars. Replicated efforts in several competing labs may drive research to move faster, but they also suck up declining research dollars in identical endeavors. Open access, and particularly the tendency toward open data that comes along with it, erodes these tendencies and promotes collaboration instead. The rise of biological preprint servers such as PeerJ PrePrints and the biological portion of arXiv also facilitates the erosion of meaningless competition.
  • Open access makes research institutions more relevant. In an era when universities are struggling with funding and, in some cases, public perception, the ability to freely disseminate the useful products of research to the public provides incentive for taxpayers to pressure governments for better funding of postsecondary education. If research results are behind paywalls, they remain mainly unknown to the public and, thus, irrelevant. If the results are irrelevant, so are the institutions in which they were produced.
  • Open access allows the public to see firsthand the evidence-based results that should be driving public policy. Ideally, all governments would consult honestly with scientists about medical, environmental, social, and other issues as they create policy. Realistically, most governments do this only as much as is optimal for their own political agenda. By removing all restrictions on access to research outputs – combined with a growing tendency for scientists to explain their research results to the public – governments will also have to be more transparent in their consultations with researchers. Perhaps we can move to a time when research drives policy rather than seeing policy attempt to drive research.

It is, indeed, fantastic to see the US take this big step. And, as noted above, the US is not the first country to do this. It’s now time for the Canadian public to ask our government to take this issue more seriously as well.

PeerJ, today!

Along with being Darwin’s birthday, 12 February 2013 marks the official launch of the first articles on PeerJ.

In case you haven’t heard about it already, PeerJ is a brand new open access journal, with a twist. Or, actually, a few twists.

For instance, instead of a pay-per-article fee, PeerJ has all authors buy a lifetime membership in the journal. There are several levels of membership, depending on how much publishing you think that you might do on a yearly basis. And there are no yearly renewal fees. Instead, you maintain your membership by taking part in journal activities. For instance, if you review one article a year, your membership will stay active. This fee/membership model allows for an ongoing revenue stream (when members publish with new co-authors who are not yet members), and also stimulates ongoing and growing involvement in the journal by a diverse group of scientists.

Another welcome innovation that some other open access journals are also embracing is the insistence that authors co-publish their data with their paper in a repository such as figshare. This concept is not new to many disciplines. Genomics researchers have been publishing data along with their papers for years using repositories such as those provided by NCBI. But with the growth of the internet, there is no reason that all data associated with a paper can’t be publicly and permanently available in a citable format. By making data public in this way it is easy to anticipate that others will be able to use and build on the data in new and exciting ways.

PeerJ also commits to publishing any work that is rigorous, no matter how “cool” or “sexy” it is… or is not. To quote: “PeerJ evaluates articles based only on an objective determination of scientific and methodological soundness, not on subjective determinations of ‘impact,’ ‘novelty’ or ‘interest’.”

And one last twist that I’ll mention (please see this launch-day blog post from PeerJ for more information): authors can choose to publish the full peer review documentation alongside their accepted article. Besides giving some great insight into the review process, this also allows readers to study other expert opinions on the work and come to their own conclusions.

PeerJ has an impressive advisory board that includes five Nobel laureates. It also has a huge and diverse board of academic editors, of which I’m a member (no Nobel Prize for me yet, however). I also have the honor of having been the handling academic editor on one of the first thirty articles in PeerJ.

And, one last note: PeerJ PrePrints is also going to come online in a few weeks. If you are familiar with physics and mathematics, you have doubtless heard of preprint servers such as arXiv. Researchers in those fields have been publishing their preprint (nearly final draft) papers online for years. This is a constructive practice, as it allows the larger community to see and comment on results as they come out. That both strengthens the eventual manuscript for final publication and allows the research community to use the results immediately instead of waiting for the final publication. Of course, it also helps the researcher to establish priority for the work.

Historically, many journals in biological fields have had issues with the use of preprint servers, as they have considered such early deposition of a manuscript to be “prior publication.” This, too, is changing, and I expect that the growing use of PeerJ PrePrints, and others like it, will make the change final.

I am under no illusions that the shift to a more open publishing and data sharing paradigm will be completely smooth sailing. As with anything new, there are going to be challenges and opposition from some corners to doing things in a new way. But the internet has changed the way that we do everything else in our society, often for the better. There is no reason that academic publishing and dispersal of research outputs should remain in the era of the printing press. PeerJ, and other publishers, are working diligently to guide our larger research community through this process of continual innovation.

Exciting times!

—–

Update: Some great coverage here, here, and here.

Bark beetles on ice

Over the next while I plan to blog about various papers that have come out of our research program. I won’t get to all of them, obviously. But I do plan to pick and choose a few recent ones, and/or ones that have been highlights to this point in my career.

I’m going to begin with a very recent paper from my lab on bark beetle larval overwintering physiology. The paper is entitled “Global and comparative proteomic profiling of overwintering and developing mountain pine beetle, Dendroctonus ponderosae (Coleoptera: Curculionidae), larvae” and is available in open access here.

Context: Mountain pine beetles usually spend their winters as small, young larvae under the bark of their host tree. In this location, they are exposed to extremely cold temperatures, sometimes ranging below –30°C and even pushing down towards –40°C. Mountain pine beetle larvae survive those temperatures by resisting freezing. Sometime in the autumn they begin to accumulate at least one antifreeze compound (glycerol) in their bodies, and then in the spring they presumably return that antifreeze compound (and perhaps others) to general metabolism for energy to complete their development.

Cold temperatures have historically limited the range of the mountain pine beetle both in terms of latitude and longitude and in terms of elevation. However, climate change has reduced the probability of cold winter temperatures – particularly the probability of extreme cold events fairly early in the autumn or fairly late in the spring. At those ‘shoulder seasons’ the larval insects either have not yet accumulated enough antifreeze compounds in their tissues (autumn, around Hallowe’en) or have metabolized most of them (spring, around Easter). Those are the vulnerable periods, and deep cold at those times can cause populations to crash rapidly.

The lack of unseasonal cold events, or of generally very deep cold in the heart of the winter, over the past years has been one factor driving the dramatic outbreak that we’ve seen in British Columbia. In addition, historically colder areas such as the eastern slopes of the Rockies and central Alberta, or high-elevation areas in the Rockies, have not been as cold either. This has allowed mountain pine beetles to survive winters and to move into hosts, such as jack pine and whitebark pine, that they have not used in the recorded past. In the case of jack pine outbreaks, the fear is that the beetle, freed from its main confines on the west slope of the Rockies, is poised to move across Canada’s boreal forest. In the case of whitebark pine, the insect may further endanger already-threatened trees that are important to high alpine ecosystems.

What we did: Until now, the main known antifreeze compound in mountain pine beetle larvae has been glycerol. We suspected that there was more to the insect’s overwintering physiology than just that, as most insects use several strategies to avoid freezing. So we conducted a proteomics experiment. That means that we surveyed the levels of all of the proteins in early-autumn larvae and compared them to the levels of proteins in late-autumn larvae to look for changes. Similarly, we compared the levels of all detectable proteins between early-spring and late-spring larvae. Because we now have copious amounts of genomic data for the mountain pine beetle, we could identify which proteins did what in the insect, and we could draw some conclusions as to which metabolic pathways and physiological processes were activated or deactivated in overwintering larvae at different times of the year.
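For readers curious about what such a comparison looks like computationally, here is a minimal sketch – not our actual pipeline, and with a hypothetical input file and column names – of how one might flag proteins whose levels change between two timepoints.

```python
# A minimal sketch (not the paper's actual pipeline) of flagging proteins
# whose measured levels change between two timepoints. Assumes a
# hypothetical CSV with one row per protein and replicate abundance
# columns for each timepoint.
import pandas as pd
from scipy import stats

df = pd.read_csv("protein_levels.csv", index_col="protein_id")  # hypothetical file
early = df[["early_rep1", "early_rep2", "early_rep3"]]          # hypothetical columns
late = df[["late_rep1", "late_rep2", "late_rep3"]]

# Welch's t-test for each protein (row), plus a simple fold change.
_, p = stats.ttest_ind(late, early, axis=1, equal_var=False)
p = pd.Series(p, index=df.index)
fold = late.mean(axis=1) / early.mean(axis=1)

# Call a protein "changed" if it is nominally significant and shows at
# least a 1.5-fold increase or decrease. (A real analysis would also
# correct for multiple testing across ~1500 proteins.)
changed = df.index[(p < 0.05) & ((fold > 1.5) | (fold < 1 / 1.5))]
print(f"{len(changed)} of {len(df)} proteins changed between timepoints")
```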

What we found: In total we found 1507 proteins in all of our larval samples. Of these, 33 either increased or decreased in their levels between early- and late-autumn, and 473 either increased or decreased in their levels between early- and late-spring. Of the proteins that showed increased or decreased levels in one of the two seasons, 18 showed such changes in both seasons. This Venn diagram from the paper shows the general result:

[Venn diagram from the paper: proteins with changed levels in autumn (33) and spring (473), with 18 shared]

These proteins can be classified into a number of general functional groups, as seen in this pie chart from the paper:

[Pie chart from the paper: functional groupings of the differentially abundant proteins]

Of course, large groupings are not as informative as looking at individual proteins. So that is what we did, as I will write about in the next section.

What this means: In proteomics work like this, when we are dealing with hundreds of proteins, there is so much complexity that it would take untold pixels to explain everything. In fact, as in many ‘-omics’ studies, the original authors (us, in this case) have to pick and choose the things that seem interesting to them and then leave it to others, wearing different research glasses, to find other interesting trends. What follows are a few highlights that we noticed in the context of our research program. Our hope is that others will take our data and find other interesting things that we may have missed.

Glycerol: Our results confirm past work implicating glycerol as an important antifreeze compound in the mountain pine beetle. The data also confirm previous work in our lab (Fraser 2011, referenced in the paper) showing certain glycerol biosynthetic genes being upregulated in the autumn and downregulated in the spring. Of particular note were the extreme variations in the levels of an enzyme called PEPCK (phosphoenolpyruvate carboxykinase), which likely indicate some level of nutritional stress in larvae heading into the cold of winter.

Trehalose: Trehalose is a major hemolymph (insect “blood”) sugar, and it has been found to be important in insect cold tolerance in other species. The levels of an enzyme involved in trehalose biosynthesis increased significantly in the autumn and decreased significantly in the spring, indicating that trehalose might function alongside glycerol as an antifreeze compound.

2-deoxyglucose: The largest autumn increase and spring decrease that we observed for any protein was for an enzyme involved in the biosynthesis of 2-deoxyglucose. By looking at what 2-deoxyglucose does in other organisms, we can make some guesses as to what it is doing in the mountain pine beetle. It is possible that 2-deoxyglucose regulates larval metabolism to direct energy flow appropriately toward overwintering in the autumn; that it acts in stress physiology as the insect enters a difficult period of its life; or that it functions as an antifreeze compound. It is also possible that it functions in more than one of these roles. What is clear is that this metabolite, not previously detected in this species, is likely very important in mountain pine beetle overwintering physiology. So we have some work on our hands to figure out exactly what it is doing.

Stress, in general: The levels of a number of proteins associated with stress physiology – for instance ferritin, superoxide dismutase and phospholipid hydroperoxide glutathione peroxidase – increased in the autumn and, in some cases, decreased again in the spring. The fact that winter is a stressful period in a mountain pine beetle’s life cycle is obvious from the basic ecology of the organism. We now have a number of stress physiology protein targets to investigate in further research.

Energy use during development: The increases and decreases of particular enzymes involved in basic metabolism indicate that mountain pine beetle larvae put most of their resources into overwintering preparation in the autumn, and only when they have survived to the spring do they begin to divert resources to ongoing developmental processes.

Detoxification of host defenses: A number of proteins commonly involved in detoxification of host chemical defenses were present in autumn larvae but, for the most part, showed reductions in the larvae as the spring progressed. Previous work in our lab has shown that larvae in the late summer experience extremely high levels of host defense compounds. So autumn larvae are working hard to get prepared for overwintering while also dealing with a toxic environment. Once the winter is over, and the host tree is long dead, it is likely that residual host toxins have either been removed by the beetle’s symbiotic fungi or have naturally degraded or dissipated. In any case, the detoxification enzymes seem not to be needed in the late spring to nearly the degree that they were during the autumn. The larvae that survive living in a toxic wasteland in the autumn, and that do not freeze to death in the winter, are then free to use their remaining stores of energy, plus whatever they can glean from their host tree, to complete their developmental cycle through the spring and early summer.

Why this is important: This is the first comprehensive look at what is going on in an overwintering bark beetle. While there has been a bit of previous physiological work on mountain pine beetles and a few other bark beetle species, our work in the Tria Project has moved us into the post-genomic era for the mountain pine beetle. That means that we have an extensive genomic database and that we can conduct experiments like this that reveal the workings of a number of physiological systems all at once. We are doing other ‘-omics’ work as well on overwintering mountain pine beetle larvae, including transcriptomics (monitoring messenger RNA levels during different seasons) and directed metabolomics (monitoring specific metabolites related to overwintering) work. And we are doing experiments where we track the expression of specific genes and the activity of specific enzymes revealed to be important during this phase of the insect’s life cycle. Of course our lab, alone, can’t do all of the experimentation suggested by these results. In fact, the data are so extensive that we can’t even conceive of all of the potential experiments. That is what is cool about ‘-omics’ research – there’s no telling who will look at it and think “ah ha! I have a great idea!”

Ultimately we hope that this paper has blown the door open on bark beetle overwintering physiology. Further research is bound to uncover new and interesting results, and since winter cold and climate change play such a large role in the growth of mountain pine beetle populations, such results will help us to understand better where and how the beetles are spreading into new regions and new, susceptible hosts.

Where we are going with this: As I mentioned above, the amount of data from this one study is staggering. This is our lab’s first publication from the larger Tria Project and there are others in the works. Some of them will also produce similar copious data. Others have been designed to look at specific small portions of this study and of some of our other data. We are currently focusing in on some of the metabolic pathways and physiological processes that I mentioned above. And we hope that others are able to take our data and use it for different analyses. For instance, we have surveyed protein levels across much of the larval developmental period. Perhaps others interested in insect development will find and be able to use new information on development in the Coleoptera (beetles) generally, and in bark beetles and other weevils specifically.

This was a really fun study. We certainly hope that the data will be as useful to others as it has been for us already. This work has also moved our research program firmly into the realm of insect overwintering research, and it has been a great introduction for us into proteomics and the era of “big data” in the biological sciences.


Bonnett TR, Robert JA, Pitt C, Fraser JD, Keeling CI, Bohlmann J, & Huber DP (2012). Global and comparative proteomic profiling of overwintering and developing mountain pine beetle, Dendroctonus ponderosae (Coleoptera: Curculionidae), larvae. Insect Biochemistry and Molecular Biology, 42(12), 890–901. PMID: 22982448.

The rise of biological preprints

Although I’m not particularly long in the tooth, for my entire scientific life I have known that publishers (at least in my field) do not accept papers that have been published elsewhere. And while workers in fields like mathematics and physics have long been able to post preprints of their work prior to peer review and subsequent publication in a journal, researchers in the biological sciences have generally not been allowed to do so. This is because most, if not all, journals that accept biological research manuscripts have historically considered posting a preprint to be prior publication. And papers that have been previously published are, rightly, persona non grata at reputable journals.

This “prior publication” attitude toward preprints is a pity, because such posting has many upsides (outlined in detail here and here and here) and very few downsides. As an editor of a small journal, and a regular reviewer for a large number of other journals in my field, I can attest that posting to such a service, where members of the community can comment on and critique an article prior to review, would have helped to strengthen just about every manuscript that has ever come across my desktop.

Some of the biggest advantages of preprint posting that I can see are:

Increased community involvement in the scientific process: Scientists at all levels would be able to take part in reading, processing, and commenting on others’ work. Amateurs would also have access to the process and could provide their often-valuable input. That would build community, connections, and collaborations. And that would, in turn, help to strengthen and improve the scientific endeavor in general.

Providing authors with valuable feedback and allowing them to improve their work prior to a formal review: As an editor and reviewer I understand quite intimately the (generally thankless) time and effort that it takes to process an article from first submission to final publication. As an author, I know what it feels like to have the “reject” button pressed on a study that I have invested blood and sweat into. In both cases, prior thoughtful advice and critique from the larger community would help to make the formal process smoother.

Results become visible and public more rapidly: Again, as an editor, I know how long it can take for a paper to move from submission to publication. While some traditional journals have done their best to speed things along in recent years, we all have stories of papers that have languished for eons on some editor’s or reviewer’s desk, holding up publication of the work, sometimes for years. Preprint posting does an end-around, allowing the work to be seen immediately and reducing the irritation that slow processing by a journal might cause. The rest of the scientific community would have access to results that may improve research in other labs, or even other fields, prior to official acceptance and formal publication.

Less fear about being scooped: I’m thankful that my area of biology generally moves at merely a moderate clip. I’m also thankful that, in general, colleagues in my field are much more willing and eager to collaborate than to compete. However, I’m fully aware that not all fields are like this. In those fields, researchers rightly worry about another lab beating them to the punch. Preprint posting, as it is fully public, would give a researcher a claim to precedence that could be validated as necessary. Personally, I see this as the least important of the reasons for posting to a preprint server. But I understand that it is a consideration for many.

In the last little while, many major publishers have changed their tune on this. Most recently, that included the stable of journals held by the Ecological Society of America. In addition, a new kid on the block, PeerJ, is going to run a preprint service as part of its overall open access offering. This is a trend that is being welcomed by many in the field. And it’s one more example of how scientific publishing is necessarily changing – I think for the better – as it is stretched by new technologies and concomitant new ways of doing things.