Upstart 0.5: Job Lifetime
Continuing the series of posts on Upstart 0.5, in this post I’ll be talking about the various ways that Upstart allows you to manage the lifetime of a job. These are guarantees that Upstart provides you so that when you start a job, you know what will happen if that job dies unexpectedly or someone else tries to start the job as well.
Respawning
We’ve all encountered those daemons that mysteriously die: sometimes they’re taken out by the OOM killer, and sometimes they’re just buggy and crash from time to time. There are also processes that exit when they’re done and need to be restarted (e.g. getty).
For all of these, Upstart provides the facility to respawn the job; effectively an automatic restart in the case of failure. Respawning is controlled by three things:
- Whether or not to respawn
- Whether or not the job exited “normally”
- Whether it has been respawned too many times recently
Let’s take the sobby server as an example. Here’s a job that tends to crash every now and then, and we’d like to keep it running. However, we’re also aware that occasionally it crashes hard and needs repairing, so we limit its respawning to 10 times in 5 seconds (which happens to be the default).
exec /usr/bin/sobby --autosave-file=/var/lib/sobby/autosave /var/lib/sobby/autosave
respawn
respawn limit 10 5
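To make the limit concrete, here’s a toy model in Python of what “10 times in 5 seconds” could mean. This is only a sketch of the idea, not Upstart’s own code: the class name, the injectable clock and the exact windowing rule (count respawns from the start of a burst, and start afresh once the interval has passed) are all my assumptions.

```python
import time


class RespawnLimiter:
    """Toy model of 'respawn limit COUNT INTERVAL' (not Upstart's own code)."""

    def __init__(self, count=10, interval=5.0, clock=time.monotonic):
        self.count = count          # respawns allowed per window
        self.interval = interval    # window length in seconds
        self.clock = clock          # injectable so the model can be tested
        self.window_start = None    # when the current burst of respawns began
        self.respawns = 0           # respawns seen inside the current window

    def allow_respawn(self):
        """Return True if the job may be respawned, False if the limit is hit."""
        now = self.clock()
        if self.window_start is None or now - self.window_start >= self.interval:
            # First respawn, or the previous burst is old enough: start afresh.
            self.window_start = now
            self.respawns = 1
            return True
        self.respawns += 1
        return self.respawns <= self.count
```

With the defaults, ten calls to allow_respawn() in quick succession are allowed and the eleventh is refused; a job that only crashes once in a while never gets near the limit.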
The daemon will be continually respawned until either the limit is reached or the service is explicitly stopped by request. This isn’t ideal, though: sobby has an exit command which we wish to honour. The daemon is well written enough that it only returns the zero exit code if this command has been run, and otherwise always returns a failure or signal of some description.
In addition, we know that the ABRT signal is raised on the daemon when the session file is corrupted (I’m making this up, btw), so we want to stop respawning in that case too.
To accomplish this, we simply state which exit codes and signals are considered a normal exit condition:
exec /usr/bin/sobby --autosave-file=/var/lib/sobby/autosave /var/lib/sobby/autosave
respawn
respawn limit 10 5
normal exit 0 ABRT
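The decision this job file describes can be sketched in a few lines of Python. The function and parameter names are mine, purely for illustration; it just encodes the rule stated above that a respawning service is restarted unless it ended with one of the listed exit codes or signals.

```python
import signal


def should_respawn(respawn, exit_code=None, term_signal=None,
                   normal_codes=(), normal_signals=()):
    """Decide whether a finished service should be respawned.

    Exactly one of exit_code / term_signal is expected to be set,
    depending on whether the process exited or was killed by a signal.
    """
    if not respawn:
        # No 'respawn' stanza: the job is simply left stopped.
        return False
    if term_signal is not None:
        return term_signal not in normal_signals
    return exit_code not in normal_codes


# The sobby job above: 'normal exit 0 ABRT'
SOBBY = dict(normal_codes=(0,), normal_signals=(signal.SIGABRT,))
```

So should_respawn(True, exit_code=1, **SOBBY) asks for a respawn, while a zero exit code or a SIGABRT leaves the job stopped.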
Tasks can be respawned too; the only difference is that zero is always considered a normal exit condition for a task:
task
exec /usr/sbin/some-check $DEVICE
respawn
This task will be continually run until it ends with a zero (success) exit code. We could add additional normal exit conditions as well, just as we can with a service.
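In other words, a respawning task behaves like a retry loop. Here’s a rough Python equivalent, assuming a hypothetical max_attempts cap that stands in for the respawn limit; Upstart itself does the supervising, so treat this as an illustration of the behaviour only.

```python
import subprocess
import sys


def run_until_success(argv, max_attempts=10):
    """Re-run argv until it exits 0; return the number of attempts used.

    max_attempts is a hypothetical safety cap standing in for 'respawn limit'.
    """
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(argv).returncode == 0:
            return attempt
    raise RuntimeError(f"{argv[0]} still failing after {max_attempts} attempts")


if __name__ == "__main__":
    # A command that succeeds straight away needs a single attempt.
    print(run_until_success([sys.executable, "-c", "raise SystemExit(0)"]))
```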
Singletons
All Upstart jobs are singletons by default; this means that only one instance of that job may be running at any one time. To illustrate, let’s continue using the sobby job we defined above and start it:
# start sobby
sobby running (start), process 14977
Ok, we have a single instance of the sobby job running, and we can interrogate the status of that:
# status sobby
sobby running (start), process 14977
Now what happens if we (or someone else) try to start another copy?
# start sobby
start: cannot start 'sobby': Already running
zsh: exit 1 start sobby
This is the most sensible and sane default: it saves you having to worry about locking between services and, most importantly, means that you can treat failures to obtain resources as true errors.
For example, if you request a D-Bus name and don’t get it, or attempt to bind to a socket and fail, you can treat that as an error since you know the service manager is already ensuring you’re a singleton. This means that you won’t silently pretend everything’s ok, and thus won’t hide problems.
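From the daemon’s side, that might look something like the following Python sketch. The function name and the loopback address are just examples of mine; the point is that there’s no retry, no fallback port and no “maybe another copy of me already has it” branch.

```python
import socket


def bind_or_fail(host="127.0.0.1", port=0):
    """Bind a listening TCP socket, letting any failure propagate.

    Since the service manager already guarantees we're the only copy,
    a failed bind means something is genuinely wrong, so we don't hide it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise   # a true error: report it and exit non-zero
    return sock
```

A second call for a port that is already bound raises OSError rather than quietly carrying on.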
Instance jobs
But what if you do want to be able to run multiple copies of the job? Upstart supports this through instance jobs, which may have multiple copies running. As well as being identified by the shared job name, each instance is also identified by a second-level instance name.
The instance name for each instance of a job must be unique within that job. Attempting to start another instance with an already used name will return an already running error again.
Thus the usual method for defining an instance name is by using variables from the job environment, which you’ll recall come from sources including the start request.
Let’s use the getty job we defined in the last post and turn that into an instance job:
instance $TTY
exec /sbin/getty 38400 $TTY
The instance keyword is the new addition; it defines the name for each instance of the job. Setting it to an ordinary string wouldn’t be much help, since there could only be one unique expansion and you’d be back to a singleton job again. So we define it using variables from the job’s environment, which will be expanded.
In this case, we can have an instance of the job for each unique value of the $TTY variable. This makes sense since this is also what we pass to getty. This means that Upstart is still able to provide the guarantee that another getty won’t be running with the same tty.
All that we need do is pass the value of the TTY environment variable when we start or stop the getty job:
# start getty TTY=tty1
getty (tty1) running (start), process 15001
# start getty TTY=tty2
getty (tty2) running (start), process 15006
And if we try and run another copy with the same TTY variable, we’ll still get already running:
# start getty TTY=tty1
start: cannot start 'getty': Already running
zsh: exit 1 start getty TTY=tty1
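If it helps, you can picture the bookkeeping as a table of running instances keyed by the expanded instance name, with a singleton being the special case where the name is always the empty string. The Python below is a toy model under that assumption; the class names and the use of string.Template for the expansion are mine, not how Upstart is implemented.

```python
from string import Template


class AlreadyRunning(Exception):
    """Raised when the expanded instance name is already in use."""


class JobClass:
    """Toy model of a job and its running instances."""

    def __init__(self, name, instance=""):
        self.name = name
        self.instance = Template(instance)   # e.g. "$TTY"; "" for a singleton
        self.running = {}                     # instance name -> environment

    def start(self, **env):
        key = self.instance.substitute(env)   # expand using the job environment
        if key in self.running:
            raise AlreadyRunning(f"cannot start '{self.name}': Already running")
        self.running[key] = env
        return key

    def stop(self, **env):
        # Stopping needs the same variables so the same name is expanded.
        del self.running[self.instance.substitute(env)]


getty = JobClass("getty", instance="$TTY")
getty.start(TTY="tty1")
getty.start(TTY="tty2")
# A second getty.start(TTY="tty1") would now raise AlreadyRunning.
```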
There’s no built-in way to allow unlimited instances, since these would tend to eventually consume all available resources. Since any service or task needs to operate on something, or even just write something, you’ll need some kind of locking and something in the job environment to tell it what to work on or write. If someone manages to come up with a truly unlimited instance job, you could do it trivially by passing a UUID=$(uuidgen) variable and instancing on that.
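For completeness, here’s the same UUID trick as a small Python helper that builds the start command. It assumes a hypothetical job whose definition contains “instance $UUID”, and the job name you pass in is whatever you’ve called that job.

```python
import uuid


def unique_instance_args(job):
    """Build the argv for starting `job` with a fresh UUID as its instance name.

    Assumes the (hypothetical) job definition contains 'instance $UUID'.
    """
    return ["start", job, f"UUID={uuid.uuid4()}"]
```

Every call yields a different UUID, so no two start requests would ever collide on the instance name.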
In the next post, I’ll cover one of the major differences between Upstart and other service managers: events!

Dennis K:
Excellent series of posts, Scott. There’s a typo in your last example though: s/sobby/getty/
19 April 2008, 7:08 pm
Philipp Kern:
We would be very curious to see backtraces of crashes of Sobby 0.4.x…
19 April 2008, 7:26 pm
Scott James Remnant:
@pkern: I was kinda picking on sobby as an example
19 April 2008, 8:46 pm
Kees Cook:
AIUI, Debian policy for maintainer scripts calling “start” on an init script requires that “start” exit 0 when the service is already started. Will there be something similar to start-stop-daemon’s “--oknodo” option?
19 April 2008, 10:43 pm
Jeff Bailey:
Will up HUP or some such cause the respawns-exceeded timeout to get overridden? That would be nice for troubleshooting.
19 April 2008, 11:47 pm
Sitsofe:
Very off-topic, but will upstart ever do port reservation for a service it will later start, like launchd does (thus preventing issues like https://bugzilla.redhat.com/show_bug.cgi?id=430825 )?
20 April 2008, 10:07 am
Diego:
“status” is not a very good name for an executable, I must say :/
20 April 2008, 5:28 pm
Scott James Remnant:
@kees: good point
@jeff: probably a d-bus command
@sitsofe: depends whether it has inetd support
@diego: why? if any package has the right to claim generic names like start, stop and status - it’s PID #1 :p
20 April 2008, 8:25 pm