Ten Simple Rules for Reproducible Computational Research

In a paper published last week in PLoS Computational Biology, Sandve, Nekrutenko, Taylor and Hovig highlight the issue of replication across the computational sciences. The dependence on software libraries, APIs and toolchains, coupled with massive amounts of data, interdisciplinary approaches and the increasing complexity of the questions being asked, is complicating replication efforts.

To address this, they present ten simple rules for reproducibility of computational research:

Rule 1: For Every Result, Keep Track of How It Was Produced
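
To make this concrete, here is a minimal Python sketch of recording the exact command and parameters next to each output file; the file names and fields are my own illustration, not the paper's:

    # provenance.py - sketch: record how an output file was produced
    # (file names and fields are illustrative).
    import json
    import sys
    import time

    def write_provenance(output_path, parameters):
        """Store the exact command line, parameters and a timestamp next to the result."""
        record = {
            "command": " ".join(sys.argv),          # the exact invocation
            "parameters": parameters,               # analysis settings used for this run
            "produced_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        }
        with open(output_path + ".provenance.json", "w") as fh:
            json.dump(record, fh, indent=2)

    write_provenance("results.csv", {"threshold": 0.05})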

Rule 2: Avoid Manual Data Manipulation Steps
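
In other words, every tweak to the data should live in a script that can be re-run, rather than in a hand-edited spreadsheet. A rough sketch using only the standard library (the file names and the 'value' column are assumptions):

    # clean_data.py - sketch: replace a manual "delete bad rows" step with a script.
    import csv

    with open("raw_measurements.csv", newline="") as src, \
         open("clean_measurements.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["value"]:              # drop rows with a missing measurement
                writer.writerow(row)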

Rule 3: Archive the Exact Versions of All External Programs Used
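
For a Python-based pipeline, one way to approximate this is to dump the installed package versions to a file that gets archived with the results. A small sketch (the output file name is mine):

    # freeze_environment.py - sketch: record exact versions of installed Python packages.
    import importlib.metadata

    lines = []
    for dist in importlib.metadata.distributions():
        name = dist.metadata["Name"] or "unknown"
        lines.append(f"{name}=={dist.version}")

    with open("environment_versions.txt", "w") as fh:
        fh.write("\n".join(sorted(lines)) + "\n")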

Rule 4: Version Control All Custom Scripts
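
Once the scripts sit in version control, each result can also note exactly which revision produced it. A sketch that assumes the analysis is run from inside a Git working copy:

    # record_commit.py - sketch: note which Git commit of the custom scripts was used.
    import subprocess

    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True, check=True).stdout.strip()

    with open("analysis_commit.txt", "w") as fh:
        fh.write(commit + "\n")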

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
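
Plain, standardized formats such as CSV or JSON keep intermediate results inspectable without re-running the whole pipeline. A sketch with made-up intermediate values:

    # save_intermediate.py - sketch: persist an intermediate result as CSV and JSON.
    import csv
    import json

    intermediate = [{"sample": "A", "score": 0.91}, {"sample": "B", "score": 0.47}]

    with open("step2_scores.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["sample", "score"])
        writer.writeheader()
        writer.writerows(intermediate)

    with open("step2_scores.json", "w") as fh:
        json.dump(intermediate, fh, indent=2)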

Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
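
Fixing the seed, and writing it down, makes a stochastic analysis repeatable. A minimal sketch with the standard library (the seed value is arbitrary):

    # seeded_analysis.py - sketch: fix and record the random seed used in an analysis.
    import json
    import random

    SEED = 20130724            # illustrative value; record whatever seed was actually used
    random.seed(SEED)

    bootstrap_sample = [random.random() for _ in range(5)]

    with open("run_metadata.json", "w") as fh:
        json.dump({"random_seed": SEED}, fh)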

Rule 7: Always Store Raw Data behind Plots
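
Keeping the plotted numbers next to the figure means anyone can regenerate or restyle it later. A sketch assuming matplotlib is installed (data and file names are illustrative):

    # plot_with_data.py - sketch: save the raw values behind a figure alongside the image.
    import csv
    import matplotlib.pyplot as plt

    x = [1, 2, 3, 4]
    y = [0.2, 0.5, 0.4, 0.9]

    with open("figure1_data.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["x", "y"])
        writer.writerows(zip(x, y))

    plt.plot(x, y)
    plt.savefig("figure1.png")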

Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
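
The idea is a top-level summary that points down to progressively more detailed output. One possible on-disk layout, sketched in Python (the directory structure and fields are my own):

    # hierarchical_output.py - sketch: a summary file that links to per-sample detail files.
    import json
    import os

    results = {"A": [0.91, 0.88, 0.93], "B": [0.47, 0.52, 0.49]}

    os.makedirs("output/details", exist_ok=True)
    summary = {}
    for sample, values in results.items():
        detail_path = f"output/details/{sample}.json"
        with open(detail_path, "w") as fh:
            json.dump(values, fh)                  # full detail, one file per sample
        summary[sample] = {"mean": sum(values) / len(values), "detail": detail_path}

    with open("output/summary.json", "w") as fh:
        json.dump(summary, fh, indent=2)           # top layer: summary linking to detail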

Rule 9: Connect Textual Statements to Underlying Results
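
One lightweight way to do this is to generate numerical claims in the text from the stored results, instead of typing them in by hand. A sketch that builds on the summary file from the previous sketch (both the file and the sentence are illustrative):

    # report_numbers.py - sketch: derive a textual claim from the result it refers to.
    import json

    with open("output/summary.json") as fh:
        summary = json.load(fh)

    statement = (f"Sample A scored {summary['A']['mean']:.2f} on average, "
                 f"compared with {summary['B']['mean']:.2f} for sample B.")
    print(statement)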

Rule 10: Provide Public Access to Scripts, Runs, and Results

The rationale underpinning these rules clearly resonates with the work of the Software Sustainability Institute: better science through superior software. Based at the universities of Edinburgh, Manchester, Oxford and Southampton, the Institute is a national facility for cultivating world-class research through software (Software Carpentry being one example). An article that caught my eye in July was the Recomputation Manifesto, which argues that computational experiments should be recomputable for all time. In light of the wider open data and open science agenda, should we also be thinking about open software and open computation?

3 thoughts

  1. The semantic web people have a 5-star rating scheme for open data (http://5stardata.info/). Level 3, to me, appears to be a good minimum for scientific data. I’d love to know of any good tooling that would assist in getting to levels 4 and 5.

    Reaching level 5 with scientific data allows it to be mined for e-Science purposes. Furthermore, there are tools for documenting reusable scientific workflows (http://www.taverna.org.uk/).

    As usual, I’m amazed by the BioInf people, who seem to do open data and open access better than us CompSci people. We _really_ should follow their lead.
