In a paper published last week in PLOS Computational Biology, Sandve, Nekrutenko, Taylor and Hovig highlight the issue of replication across the computational sciences. The dependence on software libraries, APIs and toolchains, coupled with massive amounts of data, interdisciplinary approaches and the increasing complexity of the questions being asked, is complicating replication efforts.
To address this, they present ten simple rules for reproducible computational research (a small code sketch follows the list):
Rule 1: For Every Result, Keep Track of How It Was Produced
Rule 2: Avoid Manual Data Manipulation Steps
Rule 3: Archive the Exact Versions of All External Programs Used
Rule 4: Version Control All Custom Scripts
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
Rule 7: Always Store Raw Data behind Plots
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
Rule 9: Connect Textual Statements to Underlying Results
Rule 10: Provide Public Access to Scripts, Runs, and Results
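To make a few of these concrete, here is a minimal sketch of Rules 3, 6 and 7 in practice, assuming a Python/NumPy/matplotlib workflow (my choice of stack and filenames, not the paper's): fix and record the random seed, write out the exact library versions used, and store the raw data behind the figure alongside the image itself.

```python
# sketch of Rules 3, 6 and 7 -- my toy example, not from the paper
import json
import sys

import numpy as np
import matplotlib
matplotlib.use("Agg")  # render straight to file; no display needed
import matplotlib.pyplot as plt

# Rule 6: fix and record the random seed so the run can be repeated exactly.
SEED = 42
rng = np.random.default_rng(SEED)

# Rule 3: record the exact versions of the external libraries used.
provenance = {
    "seed": SEED,
    "python": sys.version,
    "numpy": np.__version__,
    "matplotlib": matplotlib.__version__,
}
with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)

# A stand-in "analysis": noisy samples of a sine curve.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

# Rule 7: store the raw data behind the plot, not just the image.
np.savetxt("figure1_data.csv", np.column_stack([x, y]),
           delimiter=",", header="x,y", comments="")

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("noisy sin(x)")
plt.savefig("figure1.png", dpi=150)
```

The point is less the specific calls than the habit: everything the run produces (provenance.json, figure1_data.csv, figure1.png) can be regenerated from the script alone.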
The rationale underpinning these rules clearly resonates with the work of the Software Sustainability Institute: better science through superior software. Based at the universities of Edinburgh, Manchester, Oxford and Southampton, the Institute is a national facility for cultivating world-class research through software (Software Carpentry, for example). An article that caught my eye in July was the Recomputation Manifesto: computational experiments should be recomputable for all time. In light of the wider open data and open science agenda, should we also be thinking about open software and open computation?
Also, on the open web front: Mozilla Science Lab.
The semantic web people have a 5-star rating scheme for open data (http://5stardata.info/). Level 3 (data available in a non-proprietary open format, e.g. CSV rather than Excel) seems to me a good minimum for scientific data. I’d love to know of any good tooling that would assist in getting to levels 4 and 5.
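I don't have a complete answer, but for levels 4 and 5 one plausible route in Python is an RDF library such as rdflib. The sketch below is my own toy example (the example.org namespace, the property names and the DBpedia link are all placeholders): each observation gets its own URI (level 4) and is linked to a URI someone else maintains (level 5).

```python
# hypothetical example: lifting a tabular record to 4/5-star linked data
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# Level 4: each thing in the dataset gets its own URI.
# (example.org is a placeholder -- use a domain you control.)
EX = Namespace("http://example.org/mystudy/")

g = Graph()
g.bind("ex", EX)

obs = EX["observation/1"]
g.add((obs, RDF.type, EX.Observation))
g.add((obs, EX.temperatureCelsius, Literal(21.5, datatype=XSD.double)))

# Level 5: link out to other people's data via shared URIs.
g.add((obs, EX.location, URIRef("http://dbpedia.org/resource/Edinburgh")))

# Serialize in Turtle, a non-proprietary open format.
g.serialize(destination="observations.ttl", format="turtle")
```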
Reaching level 5 (linked data) would allow scientific data to be mined and cross-referenced for e-Science purposes. There are also tools for documenting reusable scientific workflows, such as Taverna (http://www.taverna.org.uk/).
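To illustrate what "mined" might look like, here is a follow-on to the rdflib sketch above (again my toy example, not Taverna): once data is level-5 linked data, a SPARQL query can select records via the URIs it shares with other datasets.

```python
# hypothetical follow-on: querying the linked data produced above with SPARQL
from rdflib import Graph

g = Graph()
g.parse("observations.ttl", format="turtle")

# Find every observation located in Edinburgh, via the shared DBpedia URI.
results = g.query("""
    PREFIX ex: <http://example.org/mystudy/>
    SELECT ?obs ?temp WHERE {
        ?obs a ex:Observation ;
             ex:location <http://dbpedia.org/resource/Edinburgh> ;
             ex:temperatureCelsius ?temp .
    }
""")
for obs, temp in results:
    print(obs, temp)
```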
As usual, I’m amazed by the BioInf people, who seem to do open data and open access better than us CompSci people. We _really_ should follow their lead.
New publication: Best Practices for Scientific Computing by Wilson et al. in PLOS Biology (Jan 2014)