Upstart 0.5: Job Lifetime

Continuing the series of posts on Upstart 0.5, in this post I’ll be talking about the various ways that Upstart allows you to manage the lifetime of a job. These are guarantees that Upstart provides you so that when you start a job, you know what will happen if that job dies unexpectedly or someone else tries to start the job as well.

Respawning

We’ve all encountered those daemons that mysteriously die: sometimes they’re taken out by the OOM killer, and sometimes they’re just buggy and crash from time to time. There are also those processes that exit when they’re done and need to be restarted (e.g. getty).

For all of these, Upstart provides the facility to respawn the job; effectively an automatic restart in the case of failure. Respawning is controlled by three things:

  • Whether or not to respawn
  • Whether or not the job exited “normally”
  • Whether it has been respawned too many times recently

Let’s take the sobby server as an example. Here’s a job that tends to crash every now and then, and we’d like to keep it running. However, we’re also aware that occasionally it crashes hard and needs repairing, so we limit its respawning to 10 times in 5 seconds (which happens to be the default).


  exec /usr/bin/sobby --autosave-file=/var/lib/sobby/autosave /var/lib/sobby/autosave

  respawn
  respawn limit 10 5
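
To make the limit concrete, here is a rough Python model of how a "no more than COUNT respawns within INTERVAL seconds" rule could be evaluated. It is only a sketch: the class name, the method name and the fixed-window counting are my own assumptions for illustration, not Upstart's actual implementation.

```python
class RespawnLimiter:
    """Toy model of 'respawn limit COUNT INTERVAL' (hypothetical, not Upstart code)."""

    def __init__(self, limit=10, interval=5.0):
        self.limit = limit          # maximum respawns per window
        self.interval = interval    # window length in seconds
        self.window_start = None    # time the current window began
        self.count = 0              # respawns seen in the current window

    def allow(self, now):
        """Return True if a respawn at time `now` (in seconds) is within the limit."""
        if self.window_start is None or now - self.window_start > self.interval:
            # The previous window has expired (or never existed): start counting afresh.
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.limit


limiter = RespawnLimiter(limit=10, interval=5.0)
# Eleven crashes 0.1 seconds apart: ten are respawned, the eleventh is refused.
decisions = [limiter.allow(i * 0.1) for i in range(11)]
print(decisions.count(True), decisions[-1])
```

A job that crashes only once every few seconds never fills the window, so under this model it would be respawned indefinitely.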

The daemon will be continually respawned until either the limit is reached or the service is explicitly stopped by request. This isn’t ideal though: sobby has an exit command which we wish to honour. The daemon is well written enough that it only returns the zero exit code if this command has been run, and otherwise always exits with a failure code or is killed by a signal of some description.

In addition, we know that the ABRT signal is raised on the daemon when the session file is corrupted (I’m making this up, btw), so we want to stop respawning in that case too.

To accomplish this, we simply state which exit codes and signals are considered a normal exit condition:


  exec /usr/bin/sobby --autosave-file=/var/lib/sobby/autosave /var/lib/sobby/autosave

  respawn
  respawn limit 10 5

  normal exit 0 ABRT

Tasks can be respawned too; the only difference is that zero is always considered a normal exit condition for a task:


  task
  exec /usr/sbin/some-check $DEVICE

  respawn

This task will be continually run until it ends with a zero (success) exit code. We could add additional normal exit conditions as well, just as we can with a service.
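
The rules so far can be summarised in a few lines of Python. This is a sketch under my own assumptions: the function name, its parameters and the idea of mixing integer exit codes with signal-name strings in one set are all hypothetical, chosen only to mirror the "normal exit 0 ABRT" stanza.

```python
def is_normal_exit(normal_exit, exit_code=None, signal_name=None, is_task=False):
    """Return True if the exit counts as normal, i.e. should NOT trigger a respawn.

    normal_exit holds integer exit codes and signal names, e.g. {0, "ABRT"}.
    """
    if signal_name is not None:
        # Killed by a signal: normal only if that signal was listed.
        return signal_name in normal_exit
    if is_task and exit_code == 0:
        # For a task, zero always counts as a normal exit.
        return True
    return exit_code in normal_exit


sobby_normal = {0, "ABRT"}
print(is_normal_exit(sobby_normal, exit_code=0))           # True: exit command was used
print(is_normal_exit(sobby_normal, signal_name="ABRT"))    # True: corrupted session file
print(is_normal_exit(sobby_normal, exit_code=1))           # False: respawn
print(is_normal_exit(sobby_normal, signal_name="SEGV"))    # False: respawn
print(is_normal_exit(set(), exit_code=0, is_task=True))    # True: task finished
print(is_normal_exit(set(), exit_code=0))                  # False: service is respawned
```

The last line reflects the first sobby example: a service with no normal exit stanza is respawned even when it exits with zero.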

Singletons

All Upstart jobs are singletons by default, which means that only one instance of a job may be running at any one time. To illustrate, let’s continue using the sobby job we defined above and start it:


  # start sobby
  sobby running (start), process 14977

Ok, we have a single instance of the sobby job running, and we can interrogate the status of that:


  # status sobby
  sobby running (start), process 14977

Now what happens if we (or someone else) try to start another copy?


  # start sobby
  start: cannot start 'sobby': Already running
  zsh: exit 1   start sobby

This is the most sensible and sane default: it saves you having to worry about locking between services and, most importantly, means that you can treat failures to obtain resources as true errors.

For example, if you request a D-Bus name and don’t get it, or attempt to bind to a socket and fail, you can treat that as an error since you know the service manager is already ensuring you’re a singleton. This means that you won’t silently pretend everything’s ok, and thus won’t hide problems.
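
As a small illustration of that last point, here is a Python sketch of a service that treats a failed bind as a real error instead of quietly assuming another copy of itself holds the port. The helper name bind_or_fail and the use of a loopback TCP socket are assumptions made for the example; the second bind attempt only exists to provoke the failure.

```python
import socket


def bind_or_fail(host, port):
    """Bind and listen on a TCP socket; let any failure propagate as an error."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        # No "maybe I'm already running" fallback: the service manager
        # guarantees we are the only instance, so this is a genuine problem.
        sock.close()
        raise
    sock.listen()
    return sock


first = bind_or_fail("127.0.0.1", 0)    # port 0: let the kernel pick a free port
port = first.getsockname()[1]
try:
    bind_or_fail("127.0.0.1", port)     # something else owns the port
except OSError as err:
    print("fatal: address unavailable:", err)
finally:
    first.close()
```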

Instance jobs

But what if you do want to be able to run multiple copies of the job? Upstart supports this through instance jobs, which may have multiple copies running. As well as being identified by the shared job name, each instance is also identified by a second-level instance name.

The instance name for each instance of a job must be unique within that job. Attempting to start another instance with an already used name will return an already running error again.

Thus the usual method for defining an instance name is by using variables from the job environment, which you’ll recall come from sources including the start request.

Let’s use the getty job we defined in the last post and turn that into an instance job:


  instance $TTY
  exec /sbin/getty 38400 $TTY

The instance keyword is the new addition; it defines the name for each instance of the job. Setting it to an ordinary string wouldn’t be much help, since there could only be one unique expansion and you’d be back to a singleton job again. So we define it using variables from the job’s environment, which will be expanded.

In this case, we can have an instance of the job for each unique value of the $TTY variable. This makes sense since this is also what we pass to getty. This means that Upstart is still able to provide the guarantee that another getty won’t be running with the same tty.

All that we need do is pass the value of the TTY environment variable when we start or stop the getty job:


  # start getty TTY=tty1
  getty (tty1) running (start), process 15001
  # start getty TTY=tty2
  getty (tty2) running (start), process 15006

And if we try and run another copy with the same TTY variable, we’ll still get already running:


  # start getty TTY=tty1
  start: cannot start 'getty': Already running
  zsh: exit 1   start getty TTY=tty1
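
The behaviour in the sessions above can be modelled in a few lines of Python. This is a sketch, not how Upstart is implemented: the Job class, the AlreadyRunning exception and the use of string.Template to expand the instance name are all assumptions for illustration.

```python
from string import Template


class AlreadyRunning(Exception):
    """Raised when an instance with the same name is already running."""


class Job:
    """Toy model of a job whose instances are keyed by an expanded name."""

    def __init__(self, name, instance=""):
        self.name = name
        self.instance = Template(instance)   # e.g. "$TTY"; "" models a singleton
        self.running = set()                 # names of running instances

    def start(self, **env):
        # Expand the instance name from the environment of the start request.
        instance_name = self.instance.substitute(env)
        if instance_name in self.running:
            raise AlreadyRunning(f"cannot start '{self.name}': Already running")
        self.running.add(instance_name)
        return instance_name


getty = Job("getty", instance="$TTY")
getty.start(TTY="tty1")
getty.start(TTY="tty2")
try:
    getty.start(TTY="tty1")                  # same expansion as the first start
except AlreadyRunning as err:
    print(err)
```

With the default empty instance name every start request expands to the same string, which is exactly the singleton case from the previous section.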

There’s no built-in way to allow unlimited instances, since these would tend to eventually consume all available resources. Any service or task needs to operate on something, or even just write something, so you’ll need some kind of locking and something in the job environment to tell it what to work on or write. If you do manage to come up with a truly unlimited instance job, you could implement it trivially by passing a UUID=$(uuidgen) variable and instancing on that.

In the next post, I’ll cover one of the major differences between Upstart and other service managers: events!

11 Comments

  1. Dennis K says:

    Excellent series of posts, Scott. There’s a typo in your last example though: s/sobby/getty/ ;)

  2. Philipp Kern says:

    We would be very curious to see backtraces of crashes of Sobby 0.4.x…

  3. @pkern: I was kinda picking on sobby as an example ;)

  4. Kees Cook says:

    AIUI, Debian policy for maintainer scripts calling “start” on an init script requires that “start” exit 0 when the service is already started. Will there be something similar to start-stop-daemon’s “--oknodo” option?

  5. Jeff Bailey says:

    Will up HUP or some such cause the respawns-exceeded timeout to get overridden? That would be nice for troubleshooting.

  6. Sitsofe says:

    very offtopic but will upstart ever do port reservation for a service it will later start like launchd does (thus preventing issues like https://bugzilla.redhat.com/show_bug.cgi?id=430825 )?

  7. Diego says:

    “status” is not a very good name for a executable, I must say :/

  8. @kees: good point

    @jeff: probably a d-bus command

    @sitsofe: depends whether it has inetd support

    @diego: why? if any package has the right to claim generic names like start, stop and status – it’s PID #1 :p

  9. chrisfarms says:

    I’m really loving upstart, great work. wish more of my boxes had 0.5 on them though.
    Can’t wait until all the rc structures and init.d are gone.

    I think the only thing that makes it tricky to write scripts is the lack of a list of events and what emits them. I understand that this would require some kind of registration system, which does sway a tad from upstart’s simplicity…. maybe just a log of every unique event ever emitted would be handy.

  10. esp says:

    the “respawn” function i didnt know of, Is it somewhere in the doc?

    I cant get it to work, I have a daemon here wich keeps crashing, i dont know why.
    I made a upstart script for it, so that i could restart it with the respawn, but that doesnt work.
    If i check with ’status jobname’ after some hours it says ‘jobname (stop) waiting’

    I would appreciate input on how to get that daemon running forever.

  11. Juri Haberland says:

    Is there a possibility to configure upstart to retry a start of a failed daemon after a couple of minutes, like INIT does, if a service is respawning too fast? Currently upstart just disables the service forever, if it is respawning too fast.
