Teaching Compiler Construction with LLVM
This is a short post-mortem of the compiler construction course that I was TAing during my PhD. The course consisted of regular lectures, graded exams, and three to four weekly assignments. It runs in the shortest study period at the VU, so the whole thing happens within a time frame of about 1.5 months.
It should also be pointed out that the students who take this course are usually very motivated. It's an elective, and the people who sign up for it tend to be genuinely interested in the topic. So this post is really about designing a course for students who actually care about the material.
The Original Course
I started TAing the course on short notice: I read the assignments for the first time over Christmas, and the course itself started in January. Assignments 1 to 3 came with a test suite that was given to the students, but there were also additional hidden tests that were only executed as part of the final grading. Grading involved running all tests and then inspecting the code for tasks that had no tests associated with them. Most of that grading work was done by a (well compensated) army of student TAs.
The assignment contents were as follows:
- Assignment 1 was about finishing a partially completed compiler frontend that generates LLVM IR. The input language was 'FenneC', a subset of C with some minor extensions. The frontend was written in Python and used a (quite peculiar) Python parser generator called PLY.
- Assignment 2 was about writing various standard optimization passes in LLVM: ADCE (aggressive dead code elimination), various peephole optimizations, and LICM (loop-invariant code motion). A minimal sketch of what such a peephole pass looks like follows this list.
- For Assignment 3, students had to implement a custom spatial memory sanitizer with LLVM and a custom runtime in C. The assignment required them to thread the array bounds of each value alongside the original array pointer, and this also had to work across functions. A rough sketch of that idea also follows the list.
- Assignment 4 was optional and involved a few tasks in the backend of LLVM. It required students to build their own LLVM and implement things like register randomization, code-pointer randomization, and a Spectre mitigation. The grading here was fully manual and there were no tests.
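To give an idea of what Assignment 2 looks like in practice, here is a minimal sketch of a peephole that folds `add %x, 0` into `%x`, written against LLVM's new pass manager. This is purely illustrative: the pass name and structure are mine, not the assignment's handout code.

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PassManager.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;

// Hypothetical peephole pass: replace `add X, 0` with `X`.
struct AddZeroFoldPass : PassInfoMixin<AddZeroFoldPass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    using namespace PatternMatch;
    SmallVector<Instruction *, 8> ToErase;
    bool Changed = false;
    for (Instruction &I : instructions(F)) {
      Value *X;
      // Match `add X, 0` (constants are usually canonicalized to the RHS).
      if (match(&I, m_Add(m_Value(X), m_Zero()))) {
        I.replaceAllUsesWith(X);
        // Defer erasing so we don't invalidate the iterator we are using.
        ToErase.push_back(&I);
        Changed = true;
      }
    }
    for (Instruction *I : ToErase)
      I->eraseFromParent();
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```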
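And to illustrate the idea behind Assignment 3's bounds threading, here is a rough conceptual sketch. The struct and function names are made up; the actual assignment instruments LLVM IR, but the runtime side boils down to something like this:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Conceptually, every pointer travels together with the bounds of the
// allocation it came from, including across function boundaries.
struct BoundedPtr {
  char *ptr;   // the pointer the program actually uses
  char *base;  // start of the underlying allocation
  char *end;   // one past the end of the underlying allocation
};

// Pointer arithmetic only moves `ptr`; the bounds are threaded along as-is.
BoundedPtr offsetPtr(BoundedPtr p, std::ptrdiff_t delta) {
  return {p.ptr + delta, p.base, p.end};
}

// The check that the instrumentation (conceptually) inserts before every access.
void checkAccess(BoundedPtr p, std::size_t size) {
  if (p.ptr < p.base || p.ptr + size > p.end) {
    std::fprintf(stderr, "spatial memory violation detected\n");
    std::abort();
  }
}
```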
Student Complaints
When I first TA'd the course, I got feedback from students in several ways: I talked with students during office hours, I set up an (anonymous) feedback form, and there was also some official course feedback that I got my hands on. For some extra unfiltered feedback, I joined the CompSci student Discord where people were discussing the assignments among themselves (and also me, I guess).
I think the feedback was overall mixed, with students liking the subject but not the way the assignments were designed. There were a few common complaints that can be summarized as follows.
- The consensus was that finishing the incomplete Python parser was a nightmare. Especially the fact that function docstrings have semantic meaning in PLY (really!) drove several students close to madness. Beyond that, exploring the Python code without code completion (there were little to no type annotations) was considered a waste of time by most students.
- The fact that students could not see their grade while working on the assignments was also badly received. I think it mostly caused anxiety and discouraged experimentation, since nobody wanted to break some obscure hidden test.
- Some hidden tests covered obscure edge cases that most students were unaware of. Often these edge cases boiled down to a binary choice (e.g., whether an `fcmp` is ordered or unordered), so many students randomly lost or gained points.
- On top of this noise, sporadic human error in grading added further noise to the grades. Most of this was not the fault of the graders, but simply the fact that manual grading guidelines always leave some room for interpretation.
- Many of the tests were also very strict about which solutions they accepted. The main reason for this is that the tests used LLVM's `FileCheck`, which is essentially testing via pattern matching on the output. Students often ended up just reverse engineering the expected output without understanding why. Also, when a `FileCheck` test rejected a more aggressive optimization, the reaction was often disappointment that extra effort is punished instead of rewarded.
- Because of the strict checks, many assignments did a lot of hand-holding to make sure students came up with the 'right' solution. The result was that some students felt like they "are treated like 5-year olds" (excerpt from the student feedback form).
- While LLVM itself was considered useful, it does have a bunch of usability problems (e.g., modifying code while traversing it leads to nonsensical errors). A sketch of this particular trap (and the usual fix) follows this list.
- Also, some optimization pass assignments looked like they were taken from an old LLVM tutorial and contained weird 'tips' that I don't think apply to any recent LLVM version.
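The "modifying code while traversing it" problem deserves a concrete example, since nearly every student runs into it. The sketch below is mine, not from the assignments: erasing an instruction from inside a plain range-for over its basic block invalidates the iterator, and the usual fix is either a worklist or LLVM's `make_early_inc_range`.

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Remove instructions that have no users and no side effects.
void removeTriviallyDead(BasicBlock &BB) {
  // Buggy version (fails in confusing ways):
  //   for (Instruction &I : BB)
  //     if (I.use_empty() && !I.mayHaveSideEffects() && !I.isTerminator())
  //       I.eraseFromParent();   // invalidates the iterator we are using
  //
  // Safe version: make_early_inc_range advances the iterator *before*
  // handing out the current element, so erasing it is fine.
  for (Instruction &I : make_early_inc_range(BB)) {
    if (I.use_empty() && !I.mayHaveSideEffects() && !I.isTerminator())
      I.eraseFromParent();
  }
}
```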
The New Course
Most of the problems in the course were relatively easy to fix on their own. The catch was that some of these quirks were also necessary to keep a reasonable grade distribution. The hidden tests in particular prevented grade inflation, so simply revealing all tests would most likely have boosted a lot of people to a perfect grade. My planned workaround was to make all assignments more difficult without adding hidden tests or weird edge cases.
The new assignments look like this:
- Assignment 1 is now about writing a frontend from scratch. It can be done in any language, but the students are given a grammar file for a different Python parser generator.
- Assignment 2 is now graded using a profiler that emulates a somewhat quirky machine executing LLVM instructions. Originally I thought about building a simple emulator for a real ISA, but I quickly realized that debugging/profiling this with instruction selection in the middle is a nightmare. The optimizations stayed the same, except for a new inlining pass that has to use a hand-crafted heuristic (the emulator uses a FIFO queue as an I-cache to keep this from being too easy). The grading scale is based on the optimized program's cycle count, linearly interpolated between O0 and a 'good' solution; a sketch of how such a scale can be computed follows this list.
- Assignment 3 is now about creating a sanitizer for spatial and temporal memory errors. How to do this was left to the students, and everything from AddressSanitizer's approach to the old value-threading was allowed.
- Assignment 4 stayed the same, but I added an alternative assignment where students could instead try out debugging optimization passes using differential testing.
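For reference, the grading scale for the new Assignment 2 can be sketched roughly as follows. The function and variable names are invented and the clamping is my own assumption; the actual grading script is not shown here, only the linear interpolation between the O0 cycle count and a reference 'good' solution.

```cpp
#include <algorithm>

// Map a student's cycle count onto a 0-10 score: 0 at the unoptimized (O0)
// baseline, 10 at the reference "good" solution, linear in between.
double scoreFromCycles(double studentCycles, double baselineCycles,
                       double referenceCycles) {
  double t = (baselineCycles - studentCycles) /
             (baselineCycles - referenceCycles);
  return std::clamp(10.0 * t, 0.0, 10.0);
}
```

The rounding applied on top of this score is what caused the grief described in the discussion section below.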
Discussion
So how did the new course do?
- Generally speaking, very few students ended up feeling lost with the much more open-ended assignments. That is probably helped by the class consisting mostly of very motivated students. For the few that did get stuck, office hours were able to handle it.
- Some students came up with unique solutions to problems that I did not anticipate. I manually inspected all submissions just to see them, and I did not find a single case where this trivialized an assignment. In fact, the unique solutions were often more sophisticated than the intended one. Whether this is a good thing is hard to say.
- The optimization assignment was received both well and very badly. The good part was that many students liked implementing and profiling their own optimizations.
- One disliked part of the optimization assignment was that the standard 'inline if small' heuristic would lower your grade. Instead, a good heuristic required understanding the cascading effects of inlining, i.e., how it enables further optimizations. In general, the main problem here was that the students had too little understanding of the problem, and most of the negative feedback could be addressed with explanations during office hours. A toy contrast between the two kinds of heuristics is sketched after this list.
- The other problem with the optimization assignment was simply that maintaining the original grade distribution required taking the interpolated score from 0 to 10 and rounding it down. Students really hated seeing their grade get rounded down, and 90% of the course feedback was about this. I thought about just switching to rounding up after deducting a flat 0.5, but playing that kind of psychological trick on students seemed a bit too silly.
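To make the inlining point more concrete, here is a toy contrast between the two kinds of heuristics. All names, thresholds, and weights are invented for illustration and have nothing to do with the actual grading emulator.

```cpp
// Information a heuristic might gather about a single call site.
struct CallSiteInfo {
  int calleeSize;      // rough instruction count of the callee
  int constantArgs;    // number of arguments that are compile-time constants
  bool calleeHasLoops; // loops inflate the I-cache footprint after inlining
};

// The naive "inline if small" rule that tended to lower grades.
bool inlineIfSmall(const CallSiteInfo &cs) {
  return cs.calleeSize < 20;
}

// A heuristic that also asks whether inlining will *enable* further
// optimization: constant arguments let later passes (constant propagation,
// dead code elimination, ...) shrink the inlined body again, so they earn a
// larger size budget. Loops get a smaller budget because they hurt an
// I-cache modeled as a FIFO queue.
bool inlineIfProfitable(const CallSiteInfo &cs) {
  int budget = 20 + 15 * cs.constantArgs;
  if (cs.calleeHasLoops)
    budget /= 2;
  return cs.calleeSize < budget;
}
```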
Evaluation
Unfortunately, it is hard to quantify how much the course improved, as I don't (yet) have access to a proper student evaluation. There are a lot of positive recommendations for the course on the student Discord, but given that most students know I am there, it is hard to say whether this is unfiltered feedback.
Notes
The current course repo can be found here, and the assignments can be found here.
And as a small treat, a fun little exchange on the student Discord that I captured before people knew I was there.