Teaching Compiler Construction with LLVM

This is a short post-mortem of the compiler construction course that I was TAing during my PhD. The course consisted of normal lectures, graded exams, and 3 to 4 weekly assignments. It runs in the shortest study period at the VU, so the whole course happens within a time frame of about 1.5 months.

It should also be pointed out that the students who take this course are usually very motivated. It's an elective course, and the people who sign up for it are usually genuinely interested in the topic itself. So this post is really about making a course for students that actually care about the topic.

The Original Course

I started TAing the course on short notice. I read the assignments for the first time over Christmas and the course itself started in January. Assignments 1 to 3 had a test suite that was given to the students. However, there were also some additional hidden tests that were kept secret and only executed as part of the final grading. The grading involved running all tests and then inspecting the code for tasks that had no tests associated with them. Most of that grading work was done by a (well compensated) army of student TAs.

The assignment contents were as follows:

  • Assignment 1 was about finishing a partially completed compiler frontend that generates LLVM IR. The input language was 'FenneC', which is a subset of C with some minor extensions to it. The frontend was written in Python and used a (quite peculiar) Python parser generator called PLY.

  • Assignment 2 was about writing various standard optimization passes in LLVM. The list of passes was ADCE, various peephole optimizations, and LICM.

  • For Assignment 3, students had to implement a custom spatial memory sanitizer with LLVM and a custom runtime in C. The assignment required them to thread the array bounds of each value alongside the original array pointer. This also had to work across functions.

  • Assignment 4 was optional and involved just a few tasks in the backend of LLVM. It required students to build their own LLVM in which they had to randomize some registers, randomize code pointers, implement a Spectre mitigation, etc. The grading here was fully manual and there were no tests.

Student Complaints

When I first TA'd the course, I got feedback from students in several ways. I talked with students during office hours, I set up an (anonymous) feedback form, and there was also some official course feedback that I got my hands on. For some extra unfiltered course feedback, I joined the CompSci student Discord where people were discussing the assignments between themselves (and also me, I guess).

I think the feedback was overall mixed, with students liking the subject but not the way the assignments were designed. There were a few common complaints that can be summarized as follows.

  • The consensus was that finishing the incomplete Python parser was a nightmare. Especially the fact that function doc strings had semantic meaning in PLY (really!) drove several students close to madness. Apart from that, trying to explore the Python code without code completion (there were little to no type annotations) was considered a waste of time by most students.

  • The fact that students could not see their grade while working on the assignments was also badly received. I think it mostly just caused anxiety and discouraged students from experimenting, for fear of breaking some obscure hidden test.

  • Some hidden tests covered obscure edge cases that most students were unaware of. Often these edge cases came down to a binary choice (e.g., whether fcmp is ordered or unordered), so the result was that many students randomly lost or gained points.

  • On top of this noise, sporadic human error in the grading also influenced the grades. Most of this was not the fault of the graders, but simply the fact that manual grading guidelines always leave some room for interpretation.

  • Many of the tests were also very strict about which solutions they accepted. The main reason for this is that the tests used LLVM's FileCheck, which is essentially just testing via pattern-matching. Students often ended up just reverse engineering the expected outcome without understanding why. Also, when a FileCheck test rejected a more aggressive optimization, the reaction was often disappointment that extra effort was punished instead of rewarded.

  • Because of the strict checks, many assignments did a lot of hand-holding to make sure students came up with the 'right' solution. The result was that some students felt like they "are treated like 5-year olds" (excerpt from the student feedback form).

  • While LLVM itself was considered useful, it does have a bunch of usability problems (e.g., modifying code while traversing it leads to nonsensical errors).

  • Also, some optimization pass assignments looked like they were taken from an old LLVM tutorial and contained weird 'tips' that I don't think apply to any recent LLVM version.

The New Course

Most of the problems in the course were relatively easy to fix on their own. The only problem was that some quirks were also necessary to keep a reasonable grade distribution. In particular, the hidden tests prevented grade inflation, so just revealing all tests would most likely have boosted a lot of people to a perfect grade. My planned workaround for this was to make all assignments more difficult without adding hidden tests or weird edge cases.

The new assignments look like this:

  • Assignment 1 is now about writing a frontend from scratch. That can be done in any language, but the students are given a grammar file for a different Python parser generator.

  • Assignment 2 is now graded using a profiler that emulates a somewhat quirky machine executing LLVM instructions. Originally I thought about building a simple emulator for a real ISA, but I quickly realized that debugging/profiling this with instruction selection in the middle is a nightmare. The optimizations stayed the same, except for a new inlining pass that had to use a hand-crafted heuristic (the emulator used a FIFO queue as an I-cache to not make it too easy). The grading scale was based on the optimized program's cycle count, linearly interpolated between O0 and a 'good' solution (a small sketch of this interpolation follows after this list).

  • Assignment 3 is now about creating a sanitizer for spatial and temporal memory errors. How to do this was left to the students, and everything from AddressSanitizer's approach to the old value-threading was allowed.

  • Assignment 4 stayed the same, but I added an alternative assignment where students could also try out debugging optimization passes using differential testing.
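To make the interpolation mentioned above concrete, here is a minimal sketch of how such a grading scale could be computed. The cycle counts, the clamping behaviour, and the function name are made-up illustrations, not the actual grading script:

    # Hypothetical sketch of a linear-interpolation grading scale:
    # 0 points at the O0 cycle count, full points at the 'good' reference solution.
    def interpolate_grade(cycles, o0_cycles, good_cycles, min_grade=0.0, max_grade=10.0):
        # Fraction of the possible improvement the student achieved.
        fraction = (o0_cycles - cycles) / (o0_cycles - good_cycles)
        # Clamp so that doing worse than O0 or better than the reference
        # does not leave the grading scale.
        fraction = max(0.0, min(1.0, fraction))
        return min_grade + fraction * (max_grade - min_grade)

    # Example: O0 takes 1,000,000 cycles, the reference solution 400,000.
    print(interpolate_grade(cycles=550_000, o0_cycles=1_000_000, good_cycles=400_000))  # 7.5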

Discussion

So how did the new course do?

  • Generally speaking, very few students ended up feeling lost with the much more open-ended assignments. That is perhaps because the class consists mostly of very motivated students. For the few that got stuck, the office hours were able to handle it.

  • Some students came up with unique solutions to problems that I did not anticipate. I manually inspected all assignments just to see them, and I did not find any case where an assignment was trivialized by this. In fact, the unique solutions were often more sophisticated than the intended one. Whether this is a good thing is hard to say.

  • The optimization assignment was received both well and very badly. The good part was that many students liked implementing and profiling their own optimizations.

  • One disliked part of the optimization assignment was that a standard inlining heuristic of 'inline if small' would lower your grade. Instead, a good heuristic required understanding the cascading effects of inlining, which enable further optimizations. In general, the main problem here was just that the students had too little understanding of the problem, and most negative feedback could be addressed with explanations in the office hours.

  • The other problem with the optimization assignment was simply that maintaining the original grade distribution required taking the interpolated score from 0 to 10 and then rounding it down. Students really hated seeing their grade being rounded down, and 90% of the course feedback was about this. I thought about changing it to rounding up after deducting a flat 0.5, but playing such psychological tricks on students seemed a bit too silly.

Evaluation

Unfortunately, it is hard to quantify how much the course improved, as I don't (yet) have access to a proper student evaluation. There are a lot of positive recommendations for the course on the student Discord, but given that most students know I am there, it is hard to say whether this is unfiltered feedback.

Notes

The current course repo can be found here, and the assignments can be found here.

And as a small treat, a fun little exchange on the student Discord that I captured before people knew I was there.

The CompSci's Guide To Noise

In 1492, Christopher Columbus set sail towards America and what followed is what we now call the Columbian Exchange [1]. Europeans intentionally introduced various life forms into America, such as cattle or wheat, and brought back new species like tomatoes. However, the sailors on these ships also brought some cargo across the Atlantic that they were not aware of.

European contact introduced smallpox into America, which then spread across the continent and led (at least in part) to the downfall of various civilizations. At the same time, this process occurred the other way around too, and new diseases were brought to Europe, with one likely introduced disease being syphilis [2]. Similar to smallpox in America, syphilis in Europe became a problem for which a solution was needed. Unfortunately, medicine was still in its infancy. There was little understanding of diseases themselves and the proper cure (i.e., antibiotics) was completely unknown at the time. As a result, there were various dubious treatment options available. One popular and rather exclusive treatment was the medicine Calomel [3].

But what exactly is Calomel? Most people have never heard of it, and it is unlikely that your doctor will ever give it to you nowadays (even if you asked for it). The reason for this should be obvious when we look at the side effects from its main ingredient: itching, pink cheeks, numbness, formication (that is the feeling of insects under your skin), desquamation (shedding of skin, which may solve the formication), burning pain, swelling, loss of hair/teeth/nails, cramps, vomiting, bloody diarrhea, and a lot of other unpleasantries.

In case you haven't guessed it already, Calomel contains mercury (Mercury(I) chloride to be exact). In those days, if you visited the doctor to get a cure for syphilis, the doctor would (unintentionally) give you mercury poisoning on top of that and send you home. But why did so many doctors think that mercury would be a cure?

To get the first thing you might be thinking out of the way: they probably did not act with bad intentions, and they weren't charlatans. In fact, many physicians such as Paracelsus genuinely believed in its effectiveness [3]. And while there aren't many patient records from that time, it is possible that these so-called 'mercurialists' saw Calomel work on their own patients and formed their convictions based on that.

Unfortunately, seeing Calomel work and it actually being effective are two different things. After all, the fact that someone saw it working is just an anecdote, and the plural of anecdote isn't fact. To prove this point, a different study many years later found another treatment that was equally effective: doing nothing. A physician in Oslo analyzed the records of 2000 syphilis patients over the span of 20 years. One observation of this study was that syphilis lesions disappear at very different intervals between patients. While some untreated patients lose them very quickly, others still have them up to a year later [3].

What Paracelsus and others saw with their own eyes wasn't data, but just the noise inherent to the disease itself. Luckily for us, modern medicine (and science in general) can handle noise and differentiate it from data using methods such as statistical analysis. In fact, many medical reporting guidelines mandate accounting for noise in publications to avoid identifying a poison as a cure [4].

Having said that, at least in computer science, these methods are sometimes not used correctly. In some fields a majority of papers even ignore them altogether [5]. It is impossible to know why researchers ignore these practices. However, in my experience when talking to other researchers, a common pattern is a lack of knowledge about these methods or why they are necessary.

Overview

The point of this long introduction is to hammer home that science cannot just ignore noise. The goal of the rest of this post is to show how to do better than just ignoring it.

However, this is not a statistics course and none of this is medical statistical advice. I am not a statistician nor do I remember anything from my undergrad statistics class aside from my professor's Monty Python jokes.

Instead, it is just enough to avoid the most serious errors that occur (unfortunately) in a substantial number of papers and student projects. Also, most of the examples here are for CompSci PL/Security topics and similar, and might not apply to whatever you are working on.

The Simple Case

Below are a few basic steps you can follow to account for noise in a simple experiment. Note that these steps do not eliminate noise. Instead, they just control the rate at which noise is presented as actual data.

In most literature, "presenting noise as data", or concluding there is an effect when there is none, is called a Type-I error. A Type-II error is the opposite, and means that there is a real effect, but we conclude there is none. The goal of any experiment is to minimize both kinds of errors.

1. Pick an α and write it into the paper.

α represents the chance you are willing to take that you publish noise (Type-I errors). In other words, with an α of 0.05 you accept that, when there is no real effect, your evaluation will still report one in 5% of cases. You should always pick an α before doing your experiments and write it clearly into the paper.
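To make that concrete, here is a small simulation (not from any course or paper; the distribution, sample size, and choice of test are arbitrary) showing that a test at α = 0.05 flags a 'significant' difference in roughly 5% of experiments even when there is no real effect:

    # Simulate many experiments where 'baseline' and 'modified' come from the
    # SAME distribution, i.e., there is no effect to find. A test at alpha = 0.05
    # still declares a difference in roughly 5% of experiments.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(42)
    alpha, false_positives, experiments = 0.05, 0, 10_000

    for _ in range(experiments):
        baseline = rng.normal(loc=100.0, scale=10.0, size=20)
        modified = rng.normal(loc=100.0, scale=10.0, size=20)
        if ttest_ind(baseline, modified).pvalue < alpha:
            false_positives += 1

    print(false_positives / experiments)  # roughly 0.05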

But what α should you choose? A lot of papers assume that the 'right' α in various CompSci subfields is 0.05, but to my knowledge there is not really a good reason for this specific value. In fact, this number seems to be just a cargo-cult inherited from other fields such as psychology.

The answer is that there is no right α value, and a lower α is not necessarily better. In fact, an α of 0 just means that you can never detect an actual effect. Similarly, other low α values will just lead to a higher rate of valid theories being rejected in your evaluation. What risk exactly we are willing to take is really up for debate and depends on whatever your subfield considers 'good enough'.

To not stretch out this section too much, let's just say it usually makes sense to pick an α that is either 0.05 or 0.01. While 0.01 might seem rather harsh, most of the papers I read (and that had a proper statistical evaluation) would come to the same conclusions with either α value.

Side note: There is also β, which is the equivalent of α for Type-II errors. You need it if you do proper power analysis. See below.

2. Pick an N and run N independent trials.

Next, you should run a number of independent trials; for example, N fuzzing runs of the same project with the same fuzzer. 'Independent' means that they should not have any dependencies on each other. For example, they should not share information from a previous run and should not use the same seed.

But what N should you pick, and why does it matter? A lower N means that you are more likely to have your data rejected as noise. A higher N allows you to minimize Type-II errors, but you might also waste computing resources. Luckily, resource usage is not necessarily a huge concern in most of computer science, where we just waste some energy and don't deal with patients.

In any case, the proper answer to this is called power analysis. You can find various calculators that give you an N based on α and various other parameters. The short answer is that for most PL/SysSec papers I have seen, N=20 is often enough to minimize Type-II errors. However, if possible, a proper power analysis should be done.
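If you do want to run a power analysis, libraries like statsmodels can do the calculation for you. The sketch below assumes a two-sample t-test and uses made-up values for the effect size (Cohen's d), α, and power, so treat it as an illustration rather than a recommendation:

    # Rough power-analysis sketch with statsmodels. The effect size, alpha, and
    # power below are made-up illustration values, not recommendations.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    # Solve for the number of trials per group needed to detect a 'large' effect
    # (d = 0.8) with 80% power at alpha = 0.05.
    n_per_group = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
    print(round(n_per_group))  # roughly 26 trials per group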

3. Perform a statistical test on your measurements and report the results.

Given the measurements from your N trials, you can now perform a statistical test. There are various kinds of tests, each designed for a different purpose. They also all make different assumptions about the data you collected, such as the shape of the distribution.

Unfortunately, these tests cannot verify these assumptions, so if you pick the wrong one you simply get an incorrect result. Even worse, I have not seen any concrete data for specific fields that would allow us to verify these assumptions. Some of these assumptions can rarely be made, such as equal variance, as changing overhead tends to increase variance. However, violating some assumptions in minor ways is sometimes tolerated by these tests, and the results only change in minor ways (often referred to as 'robustness'). As you can infer from the vagueness of the previous sentence, when exactly this is acceptable is not clearly defined.

In any case, there are a few steps that you should always do. First, pick a statistical test, read its assumptions, and try to understand whether they apply to your data. Then perform the test and report at least the p-value in your paper. If the p-value is smaller than your α, then you can say that the observed mean/median changed. If the p-value is larger than your α, then you cannot infer that there is an effect (and you should be clear about this in the paper).
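As a concrete illustration, here is what such a test could look like with SciPy, using the Mann-Whitney U test as one common choice when you don't want to assume a normal distribution. All the measurements are made up:

    # Hypothetical example: comparing coverage numbers from N=10 runs of a
    # baseline fuzzer against N=10 runs of a modified one (all numbers made up).
    from scipy.stats import mannwhitneyu

    baseline = [1021, 998, 1043, 987, 1010, 1032, 1005, 995, 1018, 1027]
    modified = [1060, 1048, 1075, 1039, 1052, 1081, 1044, 1066, 1058, 1071]

    # Two-sided Mann-Whitney U test: no normality assumption, but the samples
    # still have to be independent.
    stat, p_value = mannwhitneyu(baseline, modified, alternative="two-sided")
    print(f"U={stat}, p={p_value:.4f}")

    alpha = 0.05
    if p_value < alpha:
        print("The difference is significant at this alpha.")
    else:
        print("No significant difference detected (which is not proof of equality).")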

There are also a few things you should not do. You should not change the tests or the metric (median/mean) after doing the test in an attempt to get a significant result. You should also not change your α or select fewer/more measurements to change the result of the test. And you should also not omit the test from the paper just because it is negative. As a rule of thumb, you should not change anything that affects your conclusions after the test is completed.

4. Report variance, confidence intervals, effect size.

There is more to a result than just a mean/median and a p-value, and stopping here leaves out much information that puts the result into context.

Confidence intervals describe where the true mean/median could be, and you should always write them down or put them in a graph. However, a 95% confidence interval probably does not mean what you think it means, so my recommendation is to just put it in the paper and leave it be. The reader can draw their own conclusions from it.

Common Language Effect Size (CLES) designates how likely it is that your approach outperforms the baseline and is important when you have distributions that overlap. It is best explained with the picture below. In the left graph, you can see an example of a larger CLES, while the right graph has a much smaller CLES closer to 0.5 (which is in practice the 'worst' value). There are also other effect sizes that do not express probability which might be more insightful than CLES in some cases.

Variance helps with describing the spread of the involved distributions, and can be useful to know whether you are using the right statistical test (which might make assumptions about variance).
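Here is a rough sketch of how these numbers could be computed for the made-up measurements from the earlier Mann-Whitney example, using a bootstrap confidence interval for the median and a direct pairwise computation of CLES. This is one possible way to do it, not the only one:

    # Sketch: bootstrap CI, CLES, and variance for the made-up measurements above.
    import numpy as np

    rng = np.random.default_rng(0)
    baseline = np.array([1021, 998, 1043, 987, 1010, 1032, 1005, 995, 1018, 1027])
    modified = np.array([1060, 1048, 1075, 1039, 1052, 1081, 1044, 1066, 1058, 1071])

    # Bootstrap 95% confidence interval for the median of the modified runs.
    medians = [np.median(rng.choice(modified, size=len(modified), replace=True))
               for _ in range(10_000)]
    ci_low, ci_high = np.percentile(medians, [2.5, 97.5])
    print(f"median={np.median(modified)}, 95% CI=[{ci_low:.0f}, {ci_high:.0f}]")

    # CLES: probability that a randomly chosen 'modified' run beats a randomly
    # chosen 'baseline' run (ties count as half).
    wins = (modified[:, None] > baseline[None, :]).sum()
    ties = (modified[:, None] == baseline[None, :]).sum()
    cles = (wins + 0.5 * ties) / (len(modified) * len(baseline))
    print(f"CLES={cles:.2f}")

    # Sample variance of each group.
    print(f"variance: baseline={baseline.var(ddof=1):.1f}, modified={modified.var(ddof=1):.1f}")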

'Fixing' Experiments

With this basic experiment out of the way, how can you still mess up? There are a few ways, and the most common one is somewhat specific to computer science.

The idea is quite simple: Let's say my randomized experiment is a d6 die roll, and the desired outcome (from the researcher's perspective) is a roll of 6. So how can I ensure that I always get a 6? I simply roll the die, and whenever I get a result that is not a 6, I can come up with an excuse why the experiment is invalid. "Oh, I rolled it the wrong way!", "The table is uneven!", "The wind was blowing", etc. You can see that I can easily repeat this procedure until I get a 6, and then I can just stop my experiment and report the outcome I wanted.
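A tiny simulation makes it clear how quickly this kind of retrying inflates the chance of reporting the desired outcome (the retry limits below are arbitrary):

    # How often do I end up reporting a 6 if I allow myself to 'invalidate'
    # and repeat the roll a few times?
    import random

    def report_a_six(max_retries):
        for _ in range(max_retries + 1):
            if random.randint(1, 6) == 6:
                return True  # found an excuse-free '6', stop and report it
        return False

    trials = 100_000
    for retries in (0, 2, 5, 10):
        rate = sum(report_a_six(retries) for _ in range(trials)) / trials
        print(f"{retries} retries: reported a 6 in {rate:.0%} of experiments")
    # With 0 retries this is about 17%; with 10 retries it is already about 87%.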

The way I see this happening in many student projects (and papers, as authors sometimes like to chat too much) is like this: whenever the results do not pan out, you start debugging the implementation. You can then always find some random minor bug and blame it. Or you take some parameter or configuration value and change it to a more 'optimal' one. Repeat this process until you get a publishable outcome.

The takeaway from this story is not that you should never double-check your experiments. In fact, the proper way to do this is to always check your experiments. Ideally, you would have tests for everything that should happen in your experiment and run those before doing your evaluation. And if you decide to only debug your final experiment, you should decide whether to re-run it before looking at its results.

Accumulating Errors

Even if you never rerun experiments, you can still exceed the α rate that you set yourself at the start. The most common way this happens is by simply running multiple experiments. Then, when you get a low enough p-value on one of them, you write something along the lines of "see, sometimes it works" into the paper. The problem here is the same as with the die roll experiment before. Rolling a d6 has a 1/6 chance of producing a 6, but if you roll 1000 dice, the chance that you roll at least one 6 is nearly 100%. The same situation occurs with all the experiments that you run in your paper.

The easiest way to solve this is to account for multiple comparisons in your tests (e.g., with the Holm–Bonferroni method). You might also be able to avoid doing some of this work (and avoid some comparisons) by doing a Kruskal-Wallis test or similar first.
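As an illustration, applying the Holm-Bonferroni correction to a set of made-up p-values with statsmodels could look like this:

    # Holm-Bonferroni correction over (made-up) p-values from multiple experiments.
    from statsmodels.stats.multitest import multipletests

    p_values = [0.003, 0.021, 0.042, 0.31, 0.048]
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

    for p, p_adj, significant in zip(p_values, p_adjusted, reject):
        print(f"p={p:.3f} -> adjusted p={p_adj:.3f}, significant={significant}")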

Note: The adjusted α values can get very small if you do a lot of experiments. One workaround is to aggregate (e.g., by taking the mean) all your experimental results and then do just one statistical test on them. This (obviously) only works if it makes logical sense to aggregate the results and if the experiments are independent. However, you then cannot infer from this one test that your approach works better on one specific target; you can only make a claim about the overall effectiveness.

Timeouts

Sometimes you have to use timeouts in your experiment. For example, a fuzzer might target one specific error, and if the fuzzer does not expose the error within 24 hours (or another arbitrary timeout), you stop the benchmark. In this case, you cannot use the approaches described above, but should instead look into tests that can handle right-censored measurements (Kaplan-Meier, log-rank test) [6].
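A sketch of such a test using the lifelines package is shown below. The durations are made-up hours, and a run that hits the 24-hour timeout is marked as censored (event not observed):

    # Log-rank test on right-censored time-to-bug measurements (made-up data).
    # Runs that hit the 24h timeout get duration 24.0 and event_observed = 0.
    from lifelines.statistics import logrank_test

    durations_a = [3.5, 7.2, 24.0, 12.1, 24.0, 5.9, 18.4, 24.0]
    observed_a = [1, 1, 0, 1, 0, 1, 1, 0]   # 0 = timed out (censored)

    durations_b = [10.3, 24.0, 24.0, 16.8, 24.0, 22.1, 24.0, 24.0]
    observed_b = [1, 0, 0, 1, 0, 1, 0, 0]

    result = logrank_test(durations_a, durations_b,
                          event_observed_A=observed_a,
                          event_observed_B=observed_b)
    print(result.p_value)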

However, it should be said that a lot of distributions in CompSci are strange. For example, the graph below shows two distributions you might see in reality. The left distribution has far less variance, while the right one has a lot. They are also both symmetrical and the distortion is caused by my poor drawing skills.

Depending on where you set the arbitrary timeout, you might see very different results in your tests. With the current timeout, the right distribution will time out less often. If we shift the timeout to the right, then the results will flip and the left distribution will rarely or never time out.

Do the statistical tests account for this? If their assumptions are met, yes, but it is unclear whether these assumptions actually hold for the distributions we see in various CompSci fields.

References

[1] Nunn, Nathan, and Nancy Qian. "The Columbian exchange: A history of disease, food, and ideas." Journal of Economic Perspectives 24.2 (2010): 163-188.

[2] Harper, Kristin N., et al. "The origin and antiquity of syphilis revisited: An appraisal of Old World pre-Columbian evidence for treponemal infection." American Journal of Physical Anthropology 146.S53 (2011): 99-133. https://journals.lww.com/stdjournal/_layouts/15/oaks.journals/downloadpdf.aspx?an=00007435-199303000-00010

[3] O'Shea, J. G. "'Two minutes with Venus, two years with mercury': Mercury as an antisyphilitic chemotherapeutic agent." Journal of the Royal Society of Medicine 83.6 (1990): 392-395. https://journals.sagepub.com/doi/pdf/10.1177/014107689008300619

[4] Hopewell, Sally, et al. "CONSORT 2025 statement: updated guideline for reporting randomised trials." The Lancet 405.10489 (2025): 1633-1640.

[5] Schloegel, Moritz, et al. "SoK: Prudent Evaluation Practices for Fuzzing." 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024.

[6] Geretto, Elia, et al. "LibAFLGo: Evaluating and Advancing Directed Greybox Fuzzing." 2025 IEEE 10th European Symposium on Security and Privacy (EuroS&P). IEEE, 2025.
