Artefact Review and Badging: Improving Confidence in our Experimental Results

Guest Blog Post

Dr Michel Steuwer is a Lecturer (Assistant Professor) at the University of Glasgow in the UK. He works in the area of programming languages and compilation. He is co-leading the Lift project, a novel approach aiming to achieve portable performance across modern parallel processors.

Introduction

Academic papers undergo peer review to help decide whether a paper should be accepted for presentation at a scientific conference or for publication in a journal. One main function of the peer-review process is to ensure that scientific standards are maintained and to lend credibility to the authenticity of the work. We do this by having academic experts scrutinise their peers’ work. But to what extent can scientific standards be guaranteed if reviewers only review the paper and do not scrutinise the experimental work underpinning the research?

A 2015 study by Christian Collberg, Todd Proebsting, and Alex M Warren [2, 3] from the University of Arizona investigated the software used in experimental studies of papers published at ACM conferences and in ACM journals from the computer systems community. Starting from 602 papers in total, they investigated the 402 papers whose results were backed by code, and only for 32.3% of these were they able to obtain and build the code within 30 minutes. Depending on your personal experience you might find these numbers alarmingly low or not too bad at all, but it is important to point out that this study did not try to replicate the actual experiments being performed.

Summary graph by Collberg, Proebsting, and Warren [2] showing how much of the source code from academic studies builds successfully (OK in the top-right corner).

I think it is safe to say that reproducibility in computer science is not great and needs improving.

Artefact Evaluation

I am active in the computer systems research community, more precisely in programming languages and compilation, and I would like to highlight how our community is addressing this reproducibility challenge.

Since about 2015, almost all major conferences in programming languages (including PLDI, POPL, and ICFP) as well as in compilation (including CGO and PACT) have allowed authors of accepted papers to submit research artefacts for a formal audit by an independent committee of reviewers. In our community, the artefact usually comprises the actual research software alongside the code and data used to perform the experimental evaluation.

ACM describes a research artefact more generally as follows:

By “artefact” we mean a digital object that was either created by the authors to be used as part of the study or generated by the experiment itself. For example, artefacts can be software systems, scripts used to run experiments, input datasets, raw data collected in the experiment, or scripts used to analyse results. [1]

The process of evaluating these artefacts must be shaped by each community individually, but ACM provides a general framework for awarding badges to highlight papers whose artefacts were successfully evaluated. I provide a summary here, but complete details can be found on ACM’s website [1].

Badges

ACM recommends awarding three different types of badges to communicate how the artefact has been evaluated. A single paper can receive up to three badges – one badge of each type.

  • The green Artefact Available badge indicates that an artefact is publicly accessible in an archival repository. For this badge to be awarded, the paper does not have to be independently evaluated.
  • The red Artefact Evaluated badges indicate that a research artefact has successfully completed an independent audit. A reviewer has verified that the artefact is documented, complete, consistent, and exercisable.
  • The blue Results Validated badges indicate that the main results of the paper have been successfully obtained by an independent reviewer.

In addition, for the red and blue badges ACM suggests distinguishing two different levels.

The lighter red Artefact Evaluated – Functional badge indicates a basic level of functionality. The darker red Artefact Evaluated – Reusable badge indicates a higher-quality artefact which significantly exceeds minimal functionality so that reuse and repurposing are facilitated.

The lighter blue Results Replicated badge indicates that the main results of the paper have been successfully obtained using the provided artefact. The darker blue Results Reproduced badge indicates that the main results of the paper have been independently obtained without using the author-provided research artefact.

Artefact Evaluation Process

In the programming languages and compilation community, artefact evaluation is currently voluntary and does not influence the acceptance decision of a paper. Usually, artefact evaluation is performed after papers have been accepted for publication at the conference. Authors are invited to submit research artefacts, which are then evaluated by members of the artefact evaluation committee. The process is organised similarly to regular paper reviewing; at the end, the artefact evaluation chair decides which papers are awarded which badges. Waiting until paper acceptance has been decided also reduces the number of artefact evaluators required for the process.

In this process we are not aiming to award the dark blue Results Reproduced badge, which requires a subsequent study to reproduce the results without the provided artefact. Also, the green Artefact Available badge does not require the formal audit and, therefore, can be awarded directly by the publisher if the authors provide a link to the deposited artefact.

Some experiences to share

I have been involved in the organisation of artefact evaluation at the International Symposium on Code Generation and Optimisation (CGO) 2018 and the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES) 2018 as artefact evaluation chair. CGO started artefact evaluation in 2015 and for LCTES we introduced it this year for the first time.

For CGO 2018, 50% (15 out of 30) of accepted papers submitted a research artefact, and for LCTES 2018 it was 42% (5 out of 12). At both conferences we successfully evaluated all but one artefact (that is, 14 for CGO and 4 for LCTES). It is important to stress that an unsuccessful artefact evaluation does not necessarily mean that the results in the paper are not trustworthy. In the two cases where we could not successfully complete the evaluation, we were not able to get the software artefact running on the evaluators’ machines. We did not have a case where we identified a flaw in the experimental results.

It is crucial to have technically skilled reviewers for the artefact evaluation process. We mainly recruited senior PhD students and post-docs as reviewers by asking the programme committee to nominate suitable candidates. While the reviewing process is organised similarly to normal peer reviewing, we have found that more frequent communication between reviewers and authors is very helpful in overcoming small technical difficulties with the research artefacts, which often block a successful evaluation. In this way artefact evaluation actively improves the quality of the submitted research artefacts.

Finally, it can be challenging to create a suitable technological environment in which to evaluate a research artefact. In one case we had to arrange access for the reviewers to a particular version of a proprietary licensed benchmarking suite. In another case the evaluation required an embedded hardware device and measurements performed with an oscilloscope. While evaluation is challenging in such individual cases, they also show that technical hurdles can be overcome when all sides are committed to improving the state of the art in reproducibility in computer science.

Some ideas going forward

I hope that this post will prove useful and might inspire more research communities to improve replicability in their fields. Every community must develop its own process and define standards it is comfortable with.

I hope that artefact evaluation will eventually become the norm and at some point will be required for acceptance for publication, at least for most papers. It is very important to move towards this goal together with the research community and not to over-ambitiously impose rules and restrictions that scare people away and do more damage than good.

As an immediate next step, I want to discuss in my community the idea of making artefact evaluation mandatory for tool papers, a category of papers that describe useful research tools which might lack the scientific novelty we demand from technical papers. I think it is reasonable to demand an independent audit of such research tools and of the results produced with them.

I would also like to see artefact evaluators valued and credited more highly than they currently are. Maybe we should allow artefact evaluators to publish a short report about their experience evaluating the artefact? Maybe these reports could even include a description of the results as they were reproduced by the evaluator? Such reports could be much more nuanced than a badge on the front of the paper and could describe precisely what has been evaluated and replicated. They could also be cited independently, providing evidence of the reproducibility effort.


Useful links

Besides ACM’s general framework for artefact reviewing and badging [1], the cTuning foundation, led by Grigori Fursin, provides excellent in-depth resources about artefact evaluation and reproducible science in computer systems [4].


[1]: ACM Artifact Review and Badging

[2]: Christian S. Collberg, Todd A. Proebsting: Repeatability in computer systems research. Commun. ACM 59(3): 62-69 (2016)

[3]: University of Arizona: Repeatability in Computer Science

[4]: cTuning foundation: Artifact Evaluation for Systems/AI/ML Publications