How to (and why) supervise forking processes

Yesterday’s celebratory blog post demonstrated that Upstart is now able to supervise processes that fork into the background, as most daemons do. Now that the code has undergone a little more testing and been pushed into the archive, it’s worth explaining a little more about how, and why, we do this.

The why is easiest to answer first. Daemons are normally written to fork, usually twice; this detaches them from the terminal, process group and session that they were spawned from, so that they remain running after the user logs out. The fork isn’t just mechanism though: over time a convention has emerged that daemons don’t go into the background until their initialisation is complete and they’re ready to receive connections (if that’s their bag).
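To make the double fork concrete, here is a minimal sketch in Python of what a traditional daemon does to itself. The function names (daemonise, launch_toy_daemon) and the pipe used to report the final process id are invented for this illustration; they are not taken from any real daemon or from Upstart.

```python
import os


def daemonise():
    """Detach from the caller using the traditional double fork."""
    if os.fork() > 0:
        os._exit(0)   # original process exits: the caller's wait() returns now
    os.setsid()       # new session and process group, no controlling terminal
    if os.fork() > 0:
        os._exit(0)   # session leader exits; the survivor cannot acquire a tty


def launch_toy_daemon():
    """Start a toy daemon; return (launcher_pid, launcher_status, daemon_pid)."""
    read_end, write_end = os.pipe()
    launcher_pid = os.fork()
    if launcher_pid == 0:
        try:
            os.close(read_end)
            # A real daemon would finish its initialisation here, before
            # detaching, so that "went into the background" means "ready".
            daemonise()
            # Only the grandchild gets this far. Reporting the pid over a
            # pipe is purely for this demonstration.
            os.write(write_end, str(os.getpid()).encode())
        finally:
            os._exit(0)
    os.close(write_end)
    # This returns as soon as the original process exits, which is the
    # traditional "the daemon is ready" notification.
    _, status = os.waitpid(launcher_pid, 0)
    daemon_pid = int(os.read(read_end, 32))
    os.close(read_end)
    return launcher_pid, status, daemon_pid


if __name__ == "__main__":
    launcher, status, daemon = launch_toy_daemon()
    print("we started pid", launcher, "but the daemon ended up as pid", daemon)
```

The process we started and the process that ends up being the daemon are different, which is the whole difficulty for a supervisor.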

Simply adding an option to remain in the foreground might appear to eliminate the need to deal with the problem, but this also takes away the notification that the daemon is ready for use. Over time this signal can be replaced with other notifications, such as registering a known D-Bus name or simply raising SIGSTOP, but these require code changes that need to be agreed with upstream first. Making code changes also assumes that we have the code. Whether we like it or not, sysadmins will often need to run proprietary daemons, or simply older versions of software where the patch is too invasive.

So that’s why we have to do it, now how do we?

This is one of the reasons that building the service supervisor into init, rather than having it as a separate process, makes sense. Init has a few special kernel-provided buffs, one of which is that orphaned processes are reparented to it. When you run a daemon from the command-line, the process is initially your child; it forks once and the parent dies, the new child is now orphaned, and thus reparented to init. (Most daemons now run setsid and fork a second time. This is to ensure that if they open a tty device, they don’t unexpectedly become its owner.) Init, like any other process, receives notification about its children through wait, so it will know when daemons terminate: the “must have” of supervision.
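Reparenting is easy to observe from an ordinary process. The sketch below (the function name and the two pipes are made up for the demonstration) orphans a grandchild on purpose and asks it who its parent is afterwards. On most systems the answer is 1, though on a Linux system where an ancestor has registered itself as a "child subreaper" it may be that process instead.

```python
import os


def observe_reparenting():
    """Return (dead_parent_pid, new_parent_pid) for a deliberately orphaned process."""
    go_r, go_w = os.pipe()          # closing go_w releases the orphan
    report_r, report_w = os.pipe()  # the orphan reports its new parent here
    child = os.fork()
    if child == 0:
        try:
            os.close(go_w)
            os.close(report_r)
            if os.fork() == 0:
                # Grandchild: block until our original parent is gone.
                os.read(go_r, 1)
                os.write(report_w, str(os.getppid()).encode())
            # The intermediate parent falls straight through and exits.
        finally:
            os._exit(0)
    os.close(go_r)
    os.close(report_w)
    os.waitpid(child, 0)   # the intermediate parent is now dead and reaped
    os.close(go_w)         # end-of-file wakes the orphan
    new_parent = int(os.read(report_r, 32))
    os.close(report_r)
    return child, new_parent


if __name__ == "__main__":
    dead_parent, new_parent = observe_reparenting()
    print("parent", dead_parent, "died; the orphan now belongs to", new_parent)
```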

So if all daemons are our children we are notified when they terminate and why; we can compare their exit status or signal against a list of known good ones, and choose whether we need to respawn the dead job or mark it as stopped normally.
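As a rough illustration of that decision, the sketch below decodes a wait status and compares it against a list of "known good" outcomes. The two sets and both function names are hypothetical, chosen for the example rather than taken from Upstart’s configuration or source.

```python
import os
import signal

# Hypothetical per-job policy: outcomes that count as a normal stop.
NORMAL_EXIT_CODES = {0}
NORMAL_SIGNALS = {int(signal.SIGTERM)}


def should_respawn(status):
    """Decide from a raw wait() status whether the job died abnormally."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status) not in NORMAL_EXIT_CODES
    if os.WIFSIGNALED(status):
        return os.WTERMSIG(status) not in NORMAL_SIGNALS
    return False   # stopped or continued, not terminated


def run_and_wait(action):
    """Fork a child that runs action(), and return its raw wait status."""
    pid = os.fork()
    if pid == 0:
        try:
            action()
        finally:
            os._exit(0)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    crashed = run_and_wait(lambda: os.kill(os.getpid(), signal.SIGKILL))
    print("respawn after SIGKILL?", should_respawn(crashed))
```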

This isn’t enough though; all we get is the process id of the dead child. We still need to relate that back to a job somehow. One way to do that is to use waitid with the WNOWAIT flag, leaving the process on the table so we can examine /proc to find out more about it. This seems like quite a reasonable approach: we can then match a process to a job by details such as what binary it was actually running. Unfortunately this only works for singleton processes where we’re guaranteed that only one of them exists, both at the job level and at the process level itself; should the process fork, even to run another child, we could mistake that child’s death for the death of the daemon. Daemons need to be able to run their own children, or even have pools of them to use; and we also need to be able to run multiple copies of daemons where we can support it.
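For reference, the WNOWAIT approach looks roughly like this on Linux. The sketch is illustrative only: the function names are invented, and it reads /proc/<pid>/comm (the short command name) simply as one example of a detail that can still be looked up while the dead process has not been reaped.

```python
import os


def peek_dead_child(idtype=os.P_ALL, ident=0):
    """Look at a terminated child without reaping it; return (pid, comm)."""
    info = os.waitid(idtype, ident, os.WEXITED | os.WNOWAIT)
    # Because of WNOWAIT the zombie stays in the process table, so its
    # /proc entry is still there to be examined.
    with open(f"/proc/{info.si_pid}/comm") as f:
        comm = f.read().strip()
    return info.si_pid, comm


def spawn_short_lived_child():
    """Fork a child that exits immediately; return its pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    return pid


if __name__ == "__main__":
    child = spawn_short_lived_child()
    pid, comm = peek_dead_child(os.P_PID, child)
    print("pid", pid, "was running", comm)
    os.waitpid(child, 0)   # now actually reap it
```

An init would wait for any child (the default arguments here); the demonstration waits for one specific pid only to keep it self-contained.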

So we really do need to know the process id of the actual daemon process we should be supervising. Unfortunately any method of passing this back to init, even a relatively common one like writing it to a pid file, isn’t sufficiently standard or reliable to do this kind of work.

Ideally the kernel would just tell init when a process was reparented to it, providing both the child process id and that of its previous parent. Such a notification doesn’t exist today, though it would be a nice project to try to get it into the kernel mainline; difficult if there’s only one implementation using it.

If we can’t have that, a syscall that would allow us to watch a process and find out when it forks would be the second-best thing. We’d have the previous process id since we were watching it, and we’d hopefully be able to obtain the new child process id from this.

Happily that syscall exists, and I suspect you use it all the time if you’re a developer; it’s a bit of a mad leap to using it inside init, but as you can see, it works rather nicely. All we need to do is watch the process, and follow it each time it spawns a new child. We stop watching as soon as we have followed twice (once if a different option is used), or if the process runs a different binary by itself. And thus we can know the process id of daemons we spawned, even if they attempt to detach from their parent process, which they’ll just be reparented to anyway.
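The fork-following idea can be sketched from an ordinary process on Linux. What follows is an illustration under stated assumptions, not Upstart’s actual code: it calls the C library’s ptrace through ctypes, every function name is invented, the toy daemon reports its own pid over a pipe only so the result can be checked, and the "stop if it runs a different binary" case is left out.

```python
import ctypes
import os
import signal

# Request, option and event numbers from Linux's <sys/ptrace.h>.
PTRACE_TRACEME = 0
PTRACE_CONT = 7
PTRACE_DETACH = 17
PTRACE_SETOPTIONS = 0x4200
PTRACE_GETEVENTMSG = 0x4201
PTRACE_O_TRACEFORK = 0x2
PTRACE_EVENT_FORK = 1

_libc = ctypes.CDLL(None, use_errno=True)
_libc.ptrace.restype = ctypes.c_long
_libc.ptrace.argtypes = [ctypes.c_long, ctypes.c_long,
                         ctypes.c_void_p, ctypes.c_void_p]


def ptrace(request, pid, data=0):
    """Thin wrapper that turns a -1 return into OSError."""
    if _libc.ptrace(request, pid, None, data) == -1:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def wait_for_fork(pid):
    """Resume the stopped tracee and return the pid of the next child it forks."""
    sig = 0
    while True:
        ptrace(PTRACE_CONT, pid, sig)
        _, status = os.waitpid(pid, 0)
        if not os.WIFSTOPPED(status):
            raise RuntimeError("process ended before forking")
        if status >> 8 == (signal.SIGTRAP | (PTRACE_EVENT_FORK << 8)):
            new_pid = ctypes.c_ulong()
            ptrace(PTRACE_GETEVENTMSG, pid, ctypes.addressof(new_pid))
            return new_pid.value
        sig = os.WSTOPSIG(status)   # some unrelated signal: pass it on


def follow_forks(pid, forks=2):
    """Follow a stopped tracee through `forks` forks; return the final pid."""
    ptrace(PTRACE_SETOPTIONS, pid, PTRACE_O_TRACEFORK)
    for _ in range(forks):
        child = wait_for_fork(pid)
        os.waitpid(child, 0)   # the new process starts out stopped and traced
        ptrace(PTRACE_SETOPTIONS, child, PTRACE_O_TRACEFORK)
        ptrace(PTRACE_DETACH, pid)   # stop watching the old parent
        pid = child
    ptrace(PTRACE_DETACH, pid)       # finished: leave the daemon alone
    return pid


def spawn_traced_daemon():
    """Start a toy double-forking daemon under trace; return (pid, report_fd)."""
    report_r, report_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(report_r)
            ptrace(PTRACE_TRACEME, 0)
            os.kill(os.getpid(), signal.SIGSTOP)   # wait for the supervisor
            if os.fork() == 0:
                os.setsid()
                if os.fork() == 0:
                    os.write(report_w, str(os.getpid()).encode())
        finally:
            os._exit(0)
    os.close(report_w)
    os.waitpid(pid, 0)   # returns when the child has stopped itself
    return pid, report_r


def supervise():
    """Return (pid found by following forks, pid the daemon says it has)."""
    pid, report_r = spawn_traced_daemon()
    daemon_pid = follow_forks(pid, forks=2)
    os.waitpid(pid, 0)   # reap the original process
    claimed = int(os.read(report_r, 32))
    os.close(report_r)
    return daemon_pid, claimed
```

Calling supervise() should return the same pid twice: once as discovered by following the forks, and once as reported by the daemon itself. The trace is dropped as soon as the last fork has been followed.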

What’s the syscall? Oh, hmm, is that the time? Got to go! Alright, it’s ptrace.

8 Comments

  1. Ross Burton:

    I can’t decide if that is twisted or genius. :)

  2. Joe Shaw:

    Woo hoo! I guessed right. :)

    Can multiple processes ptrace at the same time? I.e., can you still run strace or gdb on those processes?

  3. Jeff Schroeder:

    @Joe, one of the biggest limitations of ptrace() is that you can not ptrace an already ptrace’d process… It will all go away once they merge utrace, but that is sometime in the future.

    http://people.redhat.com/roland/utrace/

  4. Anonymous:

    Could upstart support PID files when a daemon already knows how to write them, and just use ptrace as a fallback for those that don’t?

  5. Scott James Remnant:

    Note that the trace is removed as soon as we’ve seen the two (or one) forks, so it’s only there for a very short amount of time. As soon as Upstart believes the service is actually running, you can attach strace or gdb to it as normal.

    (If you run the service under strace or gdb, you don’t need the fork following since strace/gdb stay in the foreground and follow the forks of the daemon themselves.)

  6. Mildred:

    Maybe it could be possible to keep watching the process in all its forks and just stop watching any of them when they do exec …

    We could imagine, for example, a daemon that forks when a request is made to it, where the parent handles the request and then dies, leaving the child waiting for other requests.
    Well, it seems a little bit odd, but I think it could be useful in some cases. Also because when we write the upstart script we don’t know if the daemon forks once or twice.

  7. The Evil Blog » Blog Archive » Evilbuntu?:

    [...] a prominent Ubuntu developer: all daemons are our [...]

  8. Dafydd:

    Eww:

    - It’s fairly common for ptrace to be disabled for security purposes on server systems.
    - Service forked != service ready.
    - Eww.

    How does upstart decide whether to wait for one or two forks?

    For services that connect to the system bus, why not just tell upstart what their D-Bus name is? When launching network manager, you know it’s ready when org.freedesktop.NetworkManager appears. No upstream changes necessary.

    Similarly, for services that listen on sockets, why not just tell upstart which port they listen on? The downside being that (as far as I know) you have to poll for this information.

    Also: since daemon processes become children of init when the process group is orphaned, can’t upstart just monitor what its children are and infer readiness from that?
