Reproduce, Recombine, Retain

by Eduard Kamburjan — 21 August 2023
Topics: interview

Dirk Beyer is full professor of computer science and holds a research chair for software and computational-systems engineering at LMU Munich, Germany (since 2016). Before that, he was full professor of computer science at the University of Passau, Germany (2009–2016), following assistant and associate professor positions at Simon Fraser University, B.C., Canada (2006–2009), and postdoctoral research at EPFL in Lausanne, Switzerland, and at the University of California, Berkeley, USA, in the group of Tom Henzinger. Dirk holds a Dr. rer. nat. degree (2002) in computer science and a Dipl.-Inf. degree (1998) from the Brandenburg University of Technology in Cottbus, Germany. In 1998, he was a software engineer with Siemens AG, SBS dept. Major Projects in Dresden, Germany. His research focuses on models, algorithms, and tools for the construction and analysis of reliable software systems. He is the architect, designer, and implementer of several successful tools. For example, Rabbit is a fast BDD-based model checker for timed automata, CrocoPat is the first efficient interpreter for relational programming, CCVisu is a successful tool for visual clustering, CPAchecker and BLAST are two well-known and successful software model checkers, and BenchExec is an infrastructure for reliable and accurate benchmarking. CPAchecker won the Gödel medal in 2014. Dirk is the founder of two influential competitions of software tools: the Competition on Software Verification and the Competition on Software Testing.

Congratulations on your Test-of-Time Award! Can you describe the paper and what the research field was at the time of its publication?

In 2013, explicit-state model checking tracked all states of the (software) system, including all variable values. And this, of course, led to state-space explosion. Our paper introduced an abstraction in this setting by determining which variables really must be tracked.

At that time, CEGAR (CounterExample-Guided Abstraction Refinement) had already been around for several years, but was mostly used with predicate abstraction. We had a lot of experience with CEGAR and wanted to apply it to explicit-state model checking as well, to see how to balance the advantages of fully explicit states, where every variable is tracked, against forgetting some of that information. CEGAR starts with no information and then iteratively refines the level of abstraction until enough information is tracked that no infeasible counterexamples are found anymore and the property can be proven. A path is infeasible if it cannot be executed in reality and its exploration is only due to over-approximation.
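The refinement loop described above can be illustrated with a minimal sketch. This is hypothetical toy code, not CPAchecker's implementation: the "program" is a single straight-line error path, the abstraction is simply the set of tracked variables (the precision), and assumes over untracked variables are over-approximated as "may pass".

```python
# Minimal CEGAR sketch (hypothetical, not CPAchecker code).

def compare(a, rel, b):
    return a != b if rel == "!=" else a == b

def explore(path, precision):
    """Abstractly execute a straight-line path; return the path as a
    counterexample if the error location is reachable under the
    current abstraction, else None."""
    state = {}
    for op in path:
        if op[0] == "assign":
            _, var, val = op
            if var in precision:        # track only variables in the precision
                state[var] = val
        elif op[0] == "assume":
            _, var, rel, val = op
            if var in state:            # tracked: evaluate the guard
                if not compare(state[var], rel, val):
                    return None         # path blocked abstractly
            # untracked: over-approximate, the guard may pass
        elif op[0] == "error":
            return path                 # abstract counterexample found
    return None

def feasible(path):
    """Concretely replay the counterexample with full information."""
    state = {}
    for op in path:
        if op[0] == "assign":
            state[op[1]] = op[2]
        elif op[0] == "assume" and not compare(state[op[1]], op[2], op[3]):
            return False, op[1]         # infeasible; refine on this variable
    return True, None

def cegar(path):
    precision = set()                   # start with no information
    while True:
        cex = explore(path, precision)
        if cex is None:
            return "safe", precision
        ok, var = feasible(cex)
        if ok:
            return "unsafe", precision
        precision.add(var)              # refine: track one more variable

prog = [("assign", "x", 0), ("assign", "y", 5),
        ("assume", "x", "!=", 0), ("error",)]
print(cegar(prog))                      # ('safe', {'x'}) -- y is never tracked
```

With an empty precision, the guard on the untracked `x` may pass and the error is reached; the concrete replay shows the path is infeasible, so `x` is added to the precision, after which the error is unreachable. Note that `y` never needs to be tracked, which is exactly the performance gain the paper is after.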

The paper describes how, given an infeasible path through the program, to determine for each location on the path which variables need to be tracked in order not to run into the infeasible path again. We needed interpolation, which was widely used for predicate abstraction, but we did not want to use an SMT solver, because that was often costly and slowed the system down. So we extended the notion of interpolation to constraint sequences: given two contradicting constraint sequences, we compute directly on the sequences which constraints are necessary. Until then, interpolation was done only in logic, with an SMT solver.
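A much-simplified sketch of this idea, in hypothetical toy code (the paper's actual algorithm is more involved): constraints are reduced to variable-value assignments, two sequences contradict if they force some variable to two different values, and an interpolant is computed without any solver by dropping each constraint and keeping it only if it is necessary for the contradiction.

```python
# Toy interpolation over constraint sequences (hypothetical sketch;
# constraints are simplified to variable -> value assignments).

def contradicts(a, b):
    """Two constraint sets contradict if some variable is forced
    to two different values."""
    return any(v in b and b[v] != val for v, val in a.items())

def interpolant(gamma_minus, gamma_plus):
    """Compute an interpolant for two contradicting constraint
    sequences directly on the sequences, without an SMT solver:
    drop each constraint of gamma_minus and keep it only if it is
    needed for the contradiction."""
    assert contradicts(gamma_minus, gamma_plus)
    itp = dict(gamma_minus)
    for var in list(itp):
        trial = {v: x for v, x in itp.items() if v != var}
        if contradicts(trial, gamma_plus):
            itp = trial                 # constraint not needed, drop it
    return itp                          # variables that must be tracked

# gamma_minus: values established by the path prefix
# gamma_plus:  values the contradicting path suffix demands
prefix = {"x": 0, "y": 5, "z": 7}
suffix = {"x": 1, "y": 5}
print(interpolant(prefix, suffix))      # {'x': 0}: only x must be tracked
```

The result tells the analysis that, at this point on the path, tracking `x` alone is enough to block the infeasible path; `y` and `z` can be forgotten.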

Combined, these techniques led to a huge performance gain, because the analysis tracked a lot less information.

What was the impact it had on verification?

I think it influenced a lot of other people with the insight that abstraction is the key to efficiency, and our approach is one way to achieve abstraction. More concretely, the influence is that the abstract domain of value assignments, which until then was used by very few tools, has been established as an important technique. In the most recent report of the Competition on Software Verification, we see many tools that use value analysis as an abstract domain.

The paper contains a thorough experimental-evaluation section. How did you come to value reproducibility?

All experimental results should be reproducible as much as possible. When I started my research on software verification in the 2000s, it was considered best practice to offer supplementary web pages with research articles. These had, and still have, the problem that they go away when the group changes, for example, when the professor moves or the student finishes. We speculated about what would happen to our results, tools, and experimental data. We still need such web pages to offer interactive access to the data and the tools, as a complement to the archived reproduction package, but our reproduction packages are nowadays always archived at Zenodo in a long-term archive (we have about 80 of them). Because the tools evolved, we realized that we had to fix the particular version of the software and data that was used to produce the results in a particular paper. Zenodo supports multiple versions of the same tool or artifact, and it also supports anonymous artifacts, so that an artifact can be made available already at the time of submission of the paper.

Reproducibility is a bigger focus now, for example, through the artifact evaluation at conferences. Do you still see space for improvement?

There is actually a lot of room for improvement. The paper that won the award still links to an ordinary web page that could go away; I think that was bad practice. With several co-authors, I recently published an article on 10 years of artifact evaluation, and I would like to point out some insights here.

(1) One insight is that the percentage of unavailable artifacts is much higher when ordinary URLs are used; artifacts linked via DOIs are normally still available. Sometimes data and software are provided only via source-code repositories, but these are services for development, not for long-term archiving, as they do not guarantee longevity. And we saw this happen: Google Code disappeared once Google decided to close the service, and the same can happen to other services. We need to use long-term archives, which promise that the content is not changed, remains openly available for a long time, and has meta-data. There are several such long-term archives. I like Zenodo the most: it is an EU-funded project run by CERN, so it is convincing that it will be operated for a long time.

Meta-data are underestimated, but artifacts need meta-data. After an artifact is created, its meta-data should be updated to include a reference to the paper that describes the artifact, along with other changes to the description. Zenodo supports meta-data hosting: it assigns a DOI and meta-data to each version and makes experimental data a first-class citizen of the library in the bibliographic sense. Explicit meta-data are also recommended as best practice by the ACM task force.

(2) Another point is versioning. It is really important to point to the particular version that was used for the paper. At Zenodo and other platforms, you can update the artifact and create new versions, and many authors think that you want to access the newest version. That is nice for reuse, but not for reproducibility! Even if the authors have fixed some bugs since, you want those bugs in the software, because you want to reproduce the results reported in the paper.

One solution that I am pushing for in the community is that every paper with an artifact should have a data-availability statement: a little section between the conclusion and the references that cites the artifact and describes which DOI is for reuse and which one is for reproducibility. Then you do not need to search through the paper to find the link or reference. Springer has recently even started to include those statements in the CrossRef meta-data, together with authors, funding, and other meta-data.
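Such a statement might look like the following hypothetical LaTeX fragment (the section title, wording, and DOIs are illustrative placeholders, not taken from any real paper):

```latex
% Hypothetical sketch of a data-availability statement, placed
% between the conclusion and the references. The DOIs below are
% placeholders, not real identifiers.
\section*{Data-Availability Statement}
The exact versions of the verifier and the benchmark set used to
produce the results in this paper are archived for reproduction at
Zenodo: \url{https://doi.org/10.5281/zenodo.xxxxxxx}.
The current version of the tool, recommended for reuse and further
development, is available via the concept DOI
\url{https://doi.org/10.5281/zenodo.yyyyyyy}.
```

The point is that the reproduction DOI pins one frozen version, while the reuse DOI may resolve to the newest release.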

Similarly, every artifact must have a license file. This is often forgotten, and then the artifact cannot easily be reused.

(3) An important topic is also the badging system; there are several badging systems around. Currently, I feel the community is converging on the badges introduced by the ACM task force: they are widespread, and people know what they mean. Having several badge systems is confusing, so I think it would be nice if the community agreed on one and stuck to it. ACM even explicitly allows replacing the ACM acronym inside the badges, and all ETAPS conferences with badges already use some variant of these badges.

Your awarded paper is based on CPAchecker, which has been developed for 15 years. How is it to maintain a tool for such a long time?

It is based on my experience in the BLAST project at Berkeley with Tom Henzinger. BLAST was one of the first model-checking tools for software, and at that time nobody knew how best to design such a tool. So I had a difficult time maintaining BLAST later on: we found out that its monolithic architecture was not suited for maintenance. BLAST was for many years one of the best software model checkers, and its performance and concepts were all very good, but maintenance was very difficult. We tried to implement the idea of configurable program analysis in BLAST, but in the end I decided to reimplement it.

The first task in the CPAchecker project was to reimplement what BLAST could do, but in a more flexible architecture and a more accessible language (BLAST was implemented in OCaml). I went for Java, and using Java was a good decision because of its advantages for code maintenance, its ecosystem, and the availability of people familiar with the language. There is some cost associated with starting up the VM, but performance is dominated by the asymptotic complexity of the algorithms, so if the algorithm is good, you can still outperform tools written in lower-level languages.

The framework was designed as a very modular architecture, to easily integrate many different approaches. My goal was a tool in which all important algorithms for software verification are implemented. Therefore, we organize the project such that students at the Bachelor's, Master's, and PhD level can use it as a basis for their research work (and potentially contribute). This invited replication studies, in which algorithms from other researchers are implemented and adopted. As a result of these design choices, the most important algorithms for automatic software verification are now available in CPAchecker.

The project has a lot of contributors, around 100 developers by now. We have two core project leads: Philipp Wendler and me. Philipp has a lot of experience with the project (he joined in 2009) and a permanent position at LMU. The permanent position enables him to spend time on maintenance, because he no longer needs to push for a thesis, which is the problem that PhD and Master's students face. Maintenance work is not something that gives immediate credit in the academic world.

Do you think we need more of these permanent positions?

Yes, I think that CPAchecker is as well maintained as it is because Philipp has a permanent position. I have a permanent position too, and I still have the overview and know the architecture, but as a professor with a large research chair, I no longer have the capacity for the actual coding. I am very happy that ETAPS is helping to give more academic credit for tool development, for example via the Test-of-Time Tool Award. I admire the recent awardees, who received the award for the tool CADP, which they have been developing and maintaining for more than three decades!