Late Stage Academia

🞋

This paper got a public website in 2021 and was 'anonymously reviewed' one year later.

🞋

This paper's very logical measurement approach could also be used to show that vaccines don't work. (That's a bad thing)

🞋

This paper claims that ASan can detect errors better than... not using ASan? Big if true.

🞋

Least dishonest academic

🞋

Believe it or not, even fuzzing papers still have a dark side. To be specific, hardware fuzzing papers are somehow even worse than normal fuzzing papers. But where there is shadow, there is also light. And in this case the sun is smiling at us, because the comedic value of this field is truly something to behold.

Let's look at MABFuzz and PSOFuzz: the same authors in both papers, running their own software (TheHuzz) with the same setup, and getting completely different values in their result tables. The number of tests for that FENCE.i bug jumps from 68 to 600 for the same experimental setup (and the same goes for all other benchmarks). I don't know about you, but if my measurements increase by nearly 800% when I redo my experiment, I'd start to wonder if I'm doing something wrong.

As a bonus: TheHuzz is apparently open source, but judging from which papers manage to run it, it's only "open source" if you're one of the authors. In fact, one of the few non-author papers referencing TheHuzz is a paper pointing out that it isn't actually open source, so it can't be replicated.

🞋

This paper not only does the good ol' N=1 trick, but if you look at Table I, its 15.25x speedup comes from reducing the iterations-to-exposure from 183 to 12. I don't know how one can compute runtime statistics for a RISC-V processor from 12 samples, but science makes it possible!

🞋

Fuzz4All

Some papers make me believe that reviewers think blind review means they just blindly accept papers at random. Fuzz4All is a recent example of that. It has all the usual fuzzing paper qualities: RQ2 is a pretty nonsensical experiment, RQ3 has no p-values, and RQ4 doesn't even have a baseline (or any description of the experimental setup).

RQ1 is really what makes this paper special. They try to measure the code coverage their approach reaches. However, if you look at baselines such as CSmith and YARPGen, you might realize that those don't actually try to do the same thing. These baselines only test the optimizer with well-defined programs. In other words, they try to reach zero coverage in the compiler's huge error-handling code (and only the minimally required coverage in the parser). It is not hard to beat a baseline that doesn't try to solve the same problem.

In other words, the baseline tries (by design) to only get coverage in the optimizer component, while Fuzz4All gathers coverage everywhere. Arguably that's not a very fair comparison. Now, the only way this comparison could be more distorted is if the very component the baseline targets didn't exist. And believe it or not, it looks like no optimizations are enabled in the experiment. Running a baseline that tests the optimizer with optimizations disabled is probably the textbook definition of late stage academia.

There is probably more stuff to complain about. The generated code seems to include a lot of system code (which isn't really addressed), and arguably the paper should also ask how much coverage the training data alone would get. But life is too short to keep reading this paper, so let's move on.

I'll end this post by filling in one of the paper's blanks. They say they benchmark Clang, but then never end up measuring any coverage for it (huh?). I thought I would do the experiment for them by just using the standard OSS-Fuzz Clang setup that has been used to fuzz Clang for a long time, just to see how much their approach improves on the (quite old) 'state of the art'.

When I run their test inputs from the artifact through the OSS-Fuzz Clang wrapper and measure coverage, I get about 60k edges covered (as per AFL++ instrumentation) with Fuzz4All. If I run the same Clang wrapper with AFL++ and the default seeds, I get 85k edges covered in a bit less than an hour (MWU p=0.008). The reason I only run for an hour is mostly to account for the fact that my old PC might be faster than theirs (24x faster seems like enough padding).
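For the curious, here is a minimal sketch of what such a comparison looks like. This is not my actual measurement script: the per-run edge counts are made-up placeholders around the ~60k and ~85k figures above, and it just shows the shape of the statistics (a two-sided Mann-Whitney U test over per-run AFL++ edge counts via scipy).

# Minimal sketch, not the actual measurement script: compare per-run AFL++
# edge counts of two fuzzing setups with a two-sided Mann-Whitney U test.
# The numbers below are made-up placeholders around the ~60k vs ~85k figures.
from scipy.stats import mannwhitneyu

fuzz4all_edges = [59_800, 60_200, 60_050, 59_900, 60_150]  # hypothetical runs
aflpp_edges = [84_800, 85_300, 85_100, 84_900, 85_200]     # hypothetical runs

stat, p = mannwhitneyu(fuzz4all_edges, aflpp_edges, alternative="two-sided")
print(f"U={stat}, p={p:.3f}")
# With 5 runs per group and no overlap between the groups, the exact
# two-sided p-value is 2 / C(10, 5) = 2/252 ≈ 0.008.

(Which, as a side note, is also why the N=1 papers mocked elsewhere on this page cannot produce a meaningful p-value in the first place.)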

Does that mean that this approach is substantially worse than the state of the art? No, because just posting a random coverage number without deeper analysis is not conclusive evidence for anything. Maybe this could work better by fine-tuning the prompt or using a different model? Maybe most of the original coverage came from included system code (which is not usable in the OSS-Fuzz Clang wrapper)? We'll probably never find out.

On a final note: I don't think this paper is especially bad. It's obviously not good, but most of the problems here are par for the course in fuzzing research. In a better alternative timeline, this paper could have just been positioned as an alternative to blackbox fuzzing, and it would have easily beaten the baselines. There would be some interesting experiments that evaluate the approach beyond bogus bug hunting, and we would all have learned something. Maybe the reviewers could have read the paper and even helped the authors actually evaluate the questions at hand? Maybe. But, alas, we're in the bad timeline. And it seems we might be here for a while longer.

🞋

This paper is truly unique. They only start mentioning statistical analysis of their random samples at the END of the paper, in a separate section (probably the one reviewer who currently owns the shared brain cell asked them for that). But they then state that their calculated results (how did they calculate them?) are not statistically significant (at what alpha?) and then just say there is a 'high probability' their results are not noise (what p-value did you even calculate? What counts as a 'high probability'?).

We're inching closer each day to the defamation lawsuit from Scientology claiming that computer science gives the word "Science" a bad rep.

🞋

This fuzzing paper uses N=1 when collecting random samples. A fun fact to help get over this sad reality: the (missing) p-value of this paper matches the number of brain cells of the people who blindly accepted it.

🞋

DARPA announced that from now on they'll just directly burn their research funding in the first furnace they can find.

Bonus points: The director's paper list shows exactly how disqualified they are to run this program. Related joke: "What's the difference between a normal person and an academic? The former knows nothing, and the latter doesn't even know that."

🞋

Let's kick this off: This paper is literally just a bug report.

🞋

It's always a good idea to start by admitting a mistake: I was wrong to think that endpoint security software is useless. In fact, it has probably done more over the past few weeks to stop climate change than all the bogus sustainability companies.

Raphael Isemann
Department of Computer Science
Faculty of Science
Vrije Universiteit Amsterdam
De Boelelaan 1111, 1081 HV
Amsterdam North-Holland
The Netherlands