On Monday, February 11, CVE-2019-5736 was disclosed. This vulnerability is a flaw in runc, which can be exploited to escape Linux containers launched with Docker, containerd, CRI-O, or any other user of runc. But how does it work?
Dive in! Processes interact with the operating system through system calls, or syscalls, to do things like read and write files, take input, and communicate on the network. The ones I’m interested in today create other processes (typically through fork(2) or clone(2)) and change the currently running program into something else (execve(2)).
File descriptors are how a process interacts with files, as managed by the Linux kernel. They are short identifiers (small non-negative integers) that a process passes to the syscalls that operate on files: read(2), write(2), close(2), and so forth.
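To make the idea of a file descriptor concrete, here is a minimal sketch in Python, whose os module exposes thin wrappers around these syscalls. The function name fd_roundtrip is made up for illustration, and the temporary file is only there so the example has something to open.

```python
import os
import tempfile


def fd_roundtrip(payload: bytes) -> bytes:
    """Write payload to a temporary file and read it back using a raw file descriptor."""
    # mkstemp opens the file for us and hands back the descriptor: a plain integer.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)              # write(2): the fd tells the kernel which file
        os.lseek(fd, 0, os.SEEK_SET)       # lseek(2): rewind to the start of the file
        data = os.read(fd, len(payload))   # read(2): same fd, same open file
    finally:
        os.close(fd)                       # close(2): release the descriptor
        os.unlink(path)                    # remove the temporary file
    return data


if __name__ == "__main__":
    print(fd_roundtrip(b"hello via a file descriptor\n"))
```

Note that the process never passes a file name to read or write. Once the file is open, the integer is all it needs.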
Sometimes a process wants to spawn another process. That might be a shell running a program you typed at the terminal, a daemon that needs a helper, or even a program that wants concurrent processing without threads. When this happens, the process typically uses the fork(2) or clone(2) syscall.
These syscalls have some differences, but both work by creating another copy of the currently executing process that shares some state with the original. That state can include memory (either shared memory segments or copies of the memory) and file descriptors. After the new process starts, it’s the responsibility of both processes to figure out which one they are (am I the parent? am I the child?) and then take the appropriate action.
In many cases, the appropriate action is for the child to do some setup and then call the execve(2) syscall.
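The fork-then-exec pattern can be sketched in a few lines. This is an illustrative example, not how runc or any container runtime does it. The helper name run_and_capture is hypothetical, and it assumes a Linux (or other POSIX) system where os.fork is available. The child inherits the pipe's file descriptors from the parent, uses the "setup" step to point its standard output at the pipe, and then replaces itself with a new program.

```python
import os
import sys


def run_and_capture(argv):
    """Fork, exec argv in the child, and return (child's stdout bytes, exit code)."""
    read_fd, write_fd = os.pipe()      # both ends exist in the parent before the fork
    pid = os.fork()                    # fork(2): returns once in each process

    if pid == 0:
        # Child: fork returned 0. It holds copies of the parent's descriptors.
        try:
            os.close(read_fd)          # the child only writes
            os.dup2(write_fd, 1)       # setup: make fd 1 (stdout) refer to the pipe
            os.close(write_fd)
            os.execvp(argv[0], argv)   # exec: replace this program; no return on success
        except OSError:
            pass
        os._exit(127)                  # only reached if the setup or exec failed

    # Parent: fork returned the child's PID.
    os.close(write_fd)                 # the parent only reads
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:                  # empty read: every write end has been closed
            break
        chunks.append(chunk)
    os.close(read_fd)

    _, status = os.waitpid(pid, 0)     # reap the child
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return b"".join(chunks), code


if __name__ == "__main__":
    output, code = run_and_capture([sys.executable, "-c", "print('hello from the child')"])
    print(output, code)
```

The new program never learns that its standard output is a pipe. It simply inherits file descriptor 1 as the child left it, which is the kind of shared state described above.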