<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Scott James Remnant &#187; Upstart</title>
	<atom:link href="http://www.netsplit.com/category/tech/upstart/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.netsplit.com</link>
	<description></description>
	<lastBuildDate>Thu, 27 May 2010 13:35:12 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Dependency-based &amp; Event-based init daemons and launchd</title>
		<link>http://www.netsplit.com/2010/05/27/dependency-based-event-based-init-daemons-and-launchd/</link>
		<comments>http://www.netsplit.com/2010/05/27/dependency-based-event-based-init-daemons-and-launchd/#comments</comments>
		<pubDate>Thu, 27 May 2010 13:30:08 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=255</guid>
		<description><![CDATA[With the recent announcement of systemd, I&#8217;ve noticed some increased confusion around Upstart and what it means to be an event-based init daemon.  Now seems as good a time as any to try and clear that up by describing what I mean by that.
Dependency-based init
Before Upstart came along, the state of the art of init [...]]]></description>
			<content:encoded><![CDATA[<p>With the recent announcement of systemd, I&#8217;ve noticed some increased confusion around Upstart and what it means to be an <em>event-based init daemon</em>.  Now seems as good a time as any to try and clear that up by describing what I mean by that.</p>
<h3>Dependency-based init</h3>
<p>Before <a href="http://upstart.ubuntu.com/">Upstart</a> came along, the state of the art in init daemon replacements was the <em>dependency-based init daemons</em>.  The two best known at the time were the <a href="http://www.sun.com/bigadmin/content/selfheal/smf-quickstart.jsp">Service Management Facility</a> (SMF) of Solaris, and <a href="http://initng.sourceforge.net/trac">initng</a> on Linux.</p>
<p>The easiest way to understand how a dependency-based init daemon works is to look at another dependency-based system you&#8217;re probably more familiar with: the package manager of your Linux distribution.</p>
<p>When you want to install a package, for example the Apache Web Server, you tell the package manager to do that.  The Apache package will list additional dependencies that it requires to be installed, and those in turn will list additional dependencies, and so on.  The package manager will walk this dependency tree, eliminating those that you already have installed, and it will then flatten the remaining tree to get an order in which those remaining can be safely installed.</p>
<p>To put it simply: you say that you want Apache installed, but you may get more than that installed to ensure that Apache works.</p>
<p>A dependency-based init daemon works in fundamentally the same way.  When you say that you want Apache started, it looks at the configuration for that service for the list of dependency services, and builds up a similar tree.  Eliminating those already running, and flattening the tree, gives you a list of services that must be started in an order that they should be safe to start in.</p>
<p>You say you want Apache running, but you may get more than Apache running as a result.</p>
<p>Booting a system with a dependency-based init daemon, however, is a little strange.  They need to know the target set of services that must be running, otherwise they would start nothing.  SMF simply started all services that were not in manual start mode, initng had the concept of goal services whose dependencies were those that should be running &#8212; and used these to define the runlevels.</p>
<p>Once you have that list of goal services, you work out the dependency trees, and flatten them as normal &#8211; and thus you get an order that all services on the system should be started in.</p>
<p>Dependency-based init daemons work, but I believed there was a better way to do things.  I invented the <em>event-based init daemon</em> instead.</p>
<h3>Event-based init</h3>
<p>An event-based init daemon isn&#8217;t really a great leap from a dependency-based init daemon, it simply does everything backwards.  A simplistic view says that instead of starting Apache&#8217;s dependencies because Apache is started, it starts Apache because its dependencies are now running.</p>
<p>But it&#8217;s much more interesting than that, and much more flexible.  Most people don&#8217;t get the epiphany.</p>
<p>A better description might be that services are started and stopped due to external influences on them.  Those external influences can be anything, for example: hardware coming and going; changes in the time; and not least, other services.</p>
<p>The events represent changes in the system state, and services define the states in which they can be running, and the system reacts accordingly.</p>
<p>I&#8217;m still convinced this is the best way to work, not least because you can implement a dependency-based system with an event-based init daemon.  Starting a service causes an event for each of its dependencies declaring a need for them, and the service waits for those events to complete; those events cause the dependencies to be started.</p>
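<p>As a rough sketch of that idea in the <code>start on</code>/<code>stop on</code> syntax of current Upstart releases: assume a hypothetical <code>webapp</code> job that depends on a hypothetical <code>database</code> job.  The job names and the executable path are made up for illustration.  The dependency&#8217;s own job definition reacts to the blocking <code>starting</code> event of the service that needs it:</p>
<pre><code>
# Hypothetical job definition for "database", a dependency of "webapp".
# The starting event blocks webapp from starting until this job is running.
start on starting webapp
stop on stopped webapp
exec /usr/sbin/database-daemon
</code></pre>
<p>Starting <code>webapp</code> would then start <code>database</code> first, and stopping <code>webapp</code> would stop <code>database</code> once it is no longer needed.</p>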
<h3>launchd</h3>
<p>The other well-known init daemon out there is Apple&#8217;s <a href="http://developer.apple.com/macosx/launchd.html">launchd</a>, of which Lennart&#8217;s recent <a href="http://www.freedesktop.org/wiki/Software/systemd">systemd</a> project is a similar implementation in some ways but not in others.</p>
<p>launchd&#8217;s modus operandi is that it starts services on demand, and it does this on the assumption that all services communicate through sockets or through the Mach IPC model.  For the socket-based services, launchd itself creates the listening sockets, and when it receives a connection it starts the service and hands off the listening socket to it.</p>
<p>This has a beautiful engineering elegance, and it&#8217;s easy to see why it appeals to us.</p>
<p>You don&#8217;t need to configure a service&#8217;s dependencies or requirements in the init daemon, instead the service causes its dependencies to be started through this on-demand activation.  If the dependency isn&#8217;t ready to be started, the service simply blocks in the <code>connect</code> or <code>open</code> syscall until it is ready.</p>
<p>As launchd has matured, Apple have added support to watch for files on the disk and for cron-like schedule events.  In many ways, this makes launchd kinda like an event-based init daemon, except with listening sockets.</p>
<p>systemd takes a similar approach with regard to the listening sockets, though my understanding so far is that it combines it with a dependency-based resolution procedure for other parts of the system, rather than an event-based one.  I&#8217;m willing to be corrected on this though.</p>
<h3>Upstart</h3>
<p>Upstart is an event-based init daemon; it&#8217;s taken a little while to develop because it&#8217;s the first pure example of its kind, and I only replaced the working sysvinit cautiously.  I basically had to prove to myself, and others, that an event-based init daemon can really work.  That&#8217;s why Ubuntu 9.10 and 10.04 were the first versions to really start taking advantage of it.</p>
<p>I also wanted to keep it relatively stable to encourage adoption by other distributions, and I believe this has also paid off given that Fedora, RedHat and OpenSuSE have all adopted it now.</p>
<p>I&#8217;ve proven it works, and it&#8217;s been adopted, now the fun development can begin!</p>
<p>Two of the main complaints about Upstart are that the <code>start on</code> and <code>stop on</code> mechanism to define services is complicated and exposes far too much of the event model, and that it&#8217;s not very well documented.  Ironically, these two complaints are entirely related.</p>
<p>The <code>start on</code>/<code>stop on</code> mechanism is basically just a debug interface; it allowed me during early development to access the raw event queue and find out what types of service model we really needed.  Since it&#8217;s a debug interface, it wasn&#8217;t documented; I knew that future versions of Upstart would have a much better model.</p>
<p>So to correct a common misconception, the hideous <code>start on</code> lines are not a side-effect of event-based init daemons; they&#8217;re a side-effect of developing an event-based init daemon in a release early open-source way.</p>
<p>I&#8217;ve also mentioned that events can be just about anything, not just directly from other services.  This includes on-demand activation; I don&#8217;t see any reason why Upstart should not be able to create sockets as launchd does, a connection on those sockets would simply be an event that would cause a service to be started.</p>
<p>Likewise, I fully intend Upstart to take over activation of system and session bus services from D-Bus, using an event from the D-Bus daemon to start and manage the service on its behalf.</p>
<p>This latter example neatly illustrates how <code>start on</code> will be replaced.  Take a system bus service; you might declare such a service like this:</p>
<pre><code>dbus system-bus org.freedesktop.UDisks
exec /usr/lib/udisks-daemon</code></pre>
<p>That initial line replaces a whole slew of previous verbs.  It tells Upstart that this service should be activated from the D-Bus system bus when a message for the given name has no destination in the bus.  It also tells Upstart that this service should not be considered &#8220;ready&#8221; until it actually registers that name on the bus.</p>
<p>Finally it tells Upstart that the service can only be run while the D-Bus system bus service is running.  You might think this superfluous, but remember from above that an event-based init daemon can work both ways; starting this service manually as a system administrator would start the message bus for you, if it wasn&#8217;t already running.  This can be done with either an event or through the service connecting to the message bus via a known socket.</p>
<p>It&#8217;s this flexibility that still leaves me convinced that Upstart is a better all-round approach than the purity of launchd (or systemd).</p>
<p>Take another service, for example, the printing service: CUPS.  At first glance, you might believe that it can be on-demand activated when something connects to its socket.</p>
<p>And that would certainly appear to work, you&#8217;d click Print in an application and the printer service would be started.</p>
<p>But that&#8217;s not the full picture; what if there was a job in the queue from before you shut down?  You also need the service started if there are any files in the named queue directory.</p>
<p>And that&#8217;s still not the full picture; CUPS performs remote printer discovery, and you most certainly don&#8217;t want to click Print and see no printers because CUPS hasn&#8217;t had time to discover them, having only just been started.  Users have short attention spans when made to wait; I know I certainly do.</p>
<p>You need a combination of different conditions to start CUPS; it should be started on demand, it should be started if there are files in the print queue, and it should be still started on boot (just low-priority once the system is idle) to discover remote printers.</p>
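<p>Purely as an illustration, and still in the raw <code>start on</code> syntax that I expect to replace, such a combination might be sketched like this.  None of the three event names is emitted by Upstart itself; they are hypothetical stand-ins for a socket connection, a file appearing in the queue directory, and the system going idle after boot, and the executable path is an assumption too:</p>
<pre><code>
# Hypothetical job definition for CUPS; all three event names are
# assumptions for illustration, not events that Upstart provides.
start on cups-socket-connection or print-queue-nonempty or system-idle
exec /usr/sbin/cupsd
</code></pre>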
<p>A pure on-demand daemon just doesn&#8217;t cut it, you need something more flexible.</p>
<p>The last point about user impatience is also my other major disagreement here.  launchd supposes that you should always optimise for the minimum system footprint, at a cost to interaction performance.</p>
<p>It assumes that it&#8217;s ok to wait for a service to start when you click a button the first time, or bogusly that all services start immediately!</p>
<p>While this might be true in many situations, it&#8217;s also not true in many others.  I&#8217;ve met very few system administrators who think that their web server should only ever be started on demand, and shut down again once there are no users browsing it.</p>
<p>And if you&#8217;re going to do always-running services like this, you do need to be able to encode their dependencies and requirements in the init-daemon configuration, which negates the engineering precision of avoiding doing so through on-demand activation.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2010/05/27/dependency-based-event-based-init-daemons-and-launchd/feed/</wfw:commentRss>
		<slash:comments>21</slash:comments>
		</item>
		<item>
		<title>On systemd</title>
		<link>http://www.netsplit.com/2010/04/30/on-systemd/</link>
		<comments>http://www.netsplit.com/2010/04/30/on-systemd/#comments</comments>
		<pubDate>Fri, 30 Apr 2010 11:47:31 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Canonical]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=246</guid>
		<description><![CDATA[I&#8217;m sure you&#8217;ve all by now read the announcement of systemd, and have probably come running to my blog to see what the reaction of Ubuntu and the Upstart author is!
As you know, improvements to the boot process have been something that Ubuntu have been working on for a few years now and this led [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m sure you&#8217;ve all by now read the announcement of <a href="http://0pointer.de/blog/projects/systemd.html">systemd</a>, and have probably come running to my blog to see what the reaction of Ubuntu and the <a href="http://upstart.ubuntu.com/">Upstart</a> author is!</p>
<p>As you know, improvements to the boot process have been something that Ubuntu have been working on for a few years now and this led to the development of Upstart.  We&#8217;re not the only ones working in this area; Intel have also been hard at work with different improvements of their own with the Moblin and MeeGo projects.</p>
<p>So it&#8217;s great to see some Fedora and OpenSuSE guys working on this too, and bringing some different ideas to the table!</p>
<p>I can&#8217;t say I disagree with some of Lennart&#8217;s observations about problems with Upstart, it&#8217;s certainly nowhere near perfect.  Now that the stable period leading up to the release of Ubuntu 10.04 LTS is over, I&#8217;m looking forwards to getting back into the code and trying to address them.</p>
<p>It&#8217;s far too early to tell which approach is going to work out better in the end; but that&#8217;s one of the great things about Linux.  The different distributions are able to develop in different directions, and we&#8217;re able to try out many different things.</p>
<p>On a personal note, I&#8217;m particularly pleased that Lennart has continued the punny naming scheme I began with <a href="http://www.thefreedictionary.com/upstart">Upstart</a>. <a href="http://www.urbandictionary.com/define.php?term=System%20D">System D</a> is a French concept that embraces responding to challenges when they happen, thinking fast and on your feet and adapting and improvising to get the job done.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2010/04/30/on-systemd/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Upstart adoption continues</title>
		<link>http://www.netsplit.com/2008/09/23/upstart-adoption-continues/</link>
		<comments>http://www.netsplit.com/2008/09/23/upstart-adoption-continues/#comments</comments>
		<pubDate>Tue, 23 Sep 2008 18:36:07 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=179</guid>
		<description><![CDATA[A complete surprise to me, from slides of today&#8217;s OSiM Maemo Developer Session it appears that Maemo (the Nokia open source Internet Tablet platform) has adopted Upstart.  Does anyone know whether they are using native jobs or still using SysV compatibility?
]]></description>
			<content:encoded><![CDATA[<p>A complete surprise to me, from slides of today&#8217;s OSiM Maemo Developer Session <a href="http://www.internettablettalk.com/2008/09/18/osim-maemo-developer-session/">it appears that </a>Maemo (the Nokia open source Internet Tablet platform) has adopted Upstart.  Does anyone know whether they are using native jobs or still using SysV compatibility?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2008/09/23/upstart-adoption-continues/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Upstart 0.5: Relationships</title>
		<link>http://www.netsplit.com/2008/05/01/upstart-05-relationships/</link>
		<comments>http://www.netsplit.com/2008/05/01/upstart-05-relationships/#comments</comments>
		<pubDate>Thu, 01 May 2008 02:19:04 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=148</guid>
		<description><![CDATA[Even the relatively simple System V rc scripts recognise that there are relationships between services, and that in many cases one or more others must be started before a particular service can itself be started: it allows for such relationships to be expressed by using a directory of numbered scripts that are run in series [...]]]></description>
			<content:encoded><![CDATA[<p>Even the relatively simple System V rc scripts recognise that there are relationships between services, and that in many cases one or more others must be started before a particular service can itself be started: it allows for such relationships to be expressed by using a directory of numbered scripts that are run in series by the sysv rc script.</p>
<p>Tackling this problem in some way is arguably one of the main reasons that each of the alternate init daemons exists.  Even launchd acknowledges the problem, even if its solution is to tell service developers that they should spin or sleep while dependencies aren&#8217;t available.</p>
<h3>The Competition</h3>
<p>The way in which the other leading init replacements tackle the relationship problem is through <em>dependencies</em>.  This is not that surprising, since the concept is shared (and effectively mirrored) by both the dynamic link loader and the package manager; both things that a service maintainer knows well.</p>
<p>To illustrate how dependencies work, since I use that term precisely to mean only this behaviour, we&#8217;ll use one of the chains of the well known Network Manager service.</p>
<ul>
<li>Network Manager <em>depends on</em> HAL</li>
<li>HAL <em>depends on</em> D-Bus</li>
</ul>
<p>When A <em>depends on</em> B, B is required for A to function properly.  Any attempt to start A must first start B.</p>
<p>This works well for the link loader, when we load an executable we also need to load and map the shared objects it links to.</p>
<p>It also works well for the package manager, when we install Network Manager it means we also need to install HAL and D-Bus for it to function.</p>
<p>However for an init daemon, it&#8217;s not normally ideal: the only reason that D-Bus and HAL will be running is because Network Manager depends on them.  If we were to stop Network Manager, we would also stop HAL and D-Bus.</p>
<p>This obviously isn&#8217;t what we want, HAL and D-Bus are both essential services in their own right.  Thus we end up with a target or goal set of services that must be started anyway, within this group the dependency relationships are only effective for ordering of them.  Ironically, it is very rare indeed for a service to not be a target and so all of the complex ability of the dependency-based daemon is lost; the only reason to generate the dependency tree at runtime at all is to allow for parallel starts.</p>
<h3>Upside Down Dependencies</h3>
<p>Thus one of the first things that service maintainers have to get used to about Upstart is that its service relationships are upside down from the way that they might expect.  Upstart assumes that if a service is installed, not disabled, and the required services, tasks or hardware are available then the service should be running.</p>
<p>In the dependency-based model, starting Network Manager would first start HAL which would first start D-Bus.</p>
<p>In the Upstart (event-based) model, D-Bus is started fulfilling HAL&#8217;s requirements so HAL is started, fulfilling Network Manager&#8217;s requirements (once a network card is available?) so Network Manager is then started.</p>
<p>Upstart has no notion of targets or goals, it simply ensures that all services that can and should be running are; and ensures that services are stopped when it is no longer the right time for them to be running.</p>
<h3>Relationships through Events</h3>
<p>The way in which relationships between services are defined is by having services react to each other&#8217;s events.  To continue with our example, HAL would therefore have the following in its job definition:</p>
<pre><code>
start on started dbus
stop on stopping dbus
</code></pre>
<p>The first line means that when the <code>dbus</code> service is fully up and running (recall from previous posts that this event can be delayed as necessary), HAL will itself be started.</p>
<p>The second line is a little more interesting.  Events in Upstart will block until the jobs they affect complete, and the <code>stopping</code> event is emitted before the <code>dbus</code> job is actually stopped and blocks it from doing so.  Put more simply, HAL will be fully stopped before D-Bus is stopped.</p>
<p>Thus we have the simplest kind of Upstart relationship.  Starting D-Bus will start HAL immediately afterwards, and stopping D-Bus will stop HAL first.</p>
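<p>Continuing the chain, here is a sketch of what Network Manager&#8217;s job definition might contain.  It assumes that HAL&#8217;s job is named <code>hal</code>, and it leaves out the possible network card condition mentioned earlier:</p>
<pre><code>
# Hypothetical Network Manager job definition; the job name "hal"
# is assumed to match HAL's own job.
start on started hal
stop on stopping hal
</code></pre>
<p>With both definitions in place, starting D-Bus starts HAL and then Network Manager, and stopping D-Bus stops them in the reverse order.</p>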
<h3>The portmap problem</h3>
<p>Most maintainers at this point will be feeling quite smug and about to hit the comments button because they&#8217;ve thought of an example service that actually is a dependency, and should not be running if nothing needs it.</p>
<p>Remember that I said they were rare, not non-existent.</p>
<p>One such example is portmap, another is often something like tomcat.  There are a few, but they&#8217;re certainly not the common case.</p>
<p>Happily one of the elegant things about Upstart&#8217;s design is that it <em>does</em> still support this model where it&#8217;s needed.  In order for portmap to be started when we start an nfs-server, we simply write the following in portmap&#8217;s job definition:</p>
<pre><code>
start on starting nfs-server
stop on stopped nfs-server
</code></pre>
<p>Compare to the example for D-Bus/HAL and you&#8217;ll notice that it&#8217;s the events that have changed.</p>
<p>Remember that the starting event, like the stopping event we used in the previous example, blocks the job until jobs affected by the event are completed.  Thus this first line means that when we start nfs-server, it will not be started until portmap is started.</p>
<p>And the second line is pretty much the mirror of the first in the previous example, once the nfs-server is stopped, we stop portmap as well since it&#8217;s no longer needed.</p>
<p>It may seem a little odd that the rules go in portmap, and not nfs-server, but it makes logical sense.  It means that for an admin to work out why portmap is getting started, they just need to read the portmap definition and not hunt around the system to see what else might be doing it.</p>
<p>Also in many of the cases, such requirements are actually conditional.  Apache doesn&#8217;t need to require tomcat, it&#8217;s only a requirement if it&#8217;s installed.  Thus it makes more sense for tomcat to add itself to Apache&#8217;s environment rather than Apache to look for tomcat.</p>
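<p>Following the portmap pattern, a sketch of the rules that tomcat&#8217;s job definition might carry, assuming that Apache&#8217;s job is named <code>apache</code>:</p>
<pre><code>
# Hypothetical tomcat job definition; the job name "apache" is assumed.
# The rules only take effect if tomcat, and therefore this file, is installed.
start on starting apache
stop on stopped apache
</code></pre>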
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2008/05/01/upstart-05-relationships/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Upstart 0.5: Events</title>
		<link>http://www.netsplit.com/2008/04/27/upstart-05-events/</link>
		<comments>http://www.netsplit.com/2008/04/27/upstart-05-events/#comments</comments>
		<pubDate>Sun, 27 Apr 2008 22:03:55 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=147</guid>
		<description><![CDATA[In the previous posts, I&#8217;ve covered the various features that make Upstart a good service manager, but these are things you&#8217;ll find in most others as well.  It&#8217;s now time to cover that which is singularly unique to Upstart, Events.
Start and Stop
You&#8217;ve already seen the start and stop commands, which do somewhat unsurprising things [...]]]></description>
			<content:encoded><![CDATA[<p>In the previous posts, I&#8217;ve covered the various features that make Upstart a good service manager, but these are things you&#8217;ll find in most others as well.  It&#8217;s now time to cover that which is singularly unique to Upstart, Events.</p>
<h3>Start and Stop</h3>
<p>You&#8217;ve already seen the <code>start</code> and <code>stop</code> commands, which do somewhat unsurprising things to jobs.  The important thing to remember about these is that <em>they are not events</em>.  I just wanted to clear that up before we start, since it&#8217;s often been a source of confusion not helped by the design of some earlier versions of Upstart.</p>
<p><code>start</code> and <code>stop</code> operate directly on jobs, and the command will not normally return until the operation is complete or otherwise interrupted.  Services are considered complete when they are running, Tasks are considered complete when they have stopped again; in both cases the stop command is complete when the service or task has actually stopped.</p>
<p>This is important since it provides a common-sense behaviour, ensuring that the following operation is not a race condition:</p>
<pre><code>
# start apache
apache running (start), process 3591
# wget http://localhost/
</code></pre>
<p>Solving race conditions is one key part of Upstart&#8217;s purpose.</p>
<p>Both commands may also set environment variables, those set by the start command form part of the environment of the job itself and those set by the stop command are available to the <code>pre-stop</code> script.</p>
<pre><code>
# cat /etc/init/jobs.d/getty
instance $TTY
env SPEED=38400
exec /sbin/getty $SPEED $TTY

# start getty TTY=tty1
getty (tty1) running (start), process 4152
</code></pre>
<h3>Events</h3>
<p>As described above, the start and stop commands are admin instructions that act directly on named jobs.  Events have many similar properties: they carry environment variables that end up in the environment of jobs they start, and they are not complete until the jobs that they affected have been started or stopped as appropriate.</p>
<p>The difference is that the start and stop commands are targeted at specific jobs, whereas events have no such targeting and instead it is jobs that specify which events they are interested in.</p>
<p>In the Upstart world events serve three general purposes: they act as signals of state changes that jobs can react to (e.g. hardware going away), as method calls to automatically start or stop jobs (e.g. shutdown) and as a way of passing information between jobs.</p>
<p>Events are identified by their name and have a different namespace to that of jobs.  They are emitted by a D-Bus call or by using <code>emit</code> on the command-line, naming the event and providing any associated environment variables you wish:</p>
<pre><code>
# emit interface-up IFACE=eth0 ADDRFAM=Ethernet ADDRESS=01:23:45:67:89:0a
</code></pre>
<p>Jobs may match them on this name and any number of their environment variables, specifying whether the event would automatically start or stop the Job.</p>
<pre><code>
start on interface-up IFACE=eth* ADDRFAM=Ethernet
</code></pre>
<p>As a short-hand, where the order of the variables for an event is fixed, the names may be omitted:</p>
<pre><code>
start on interface-up wlan*
</code></pre>
<p>When a job is started by an event, the environment for that event forms part of the environment for the job and may be used when matching events that can automatically stop the job.  Harking back to our <code>getty</code> job from previous posts, we can bind this to the lifetime of the underlying device.</p>
<pre><code>
start on tty-added
stop on tty-removed TTY=$TTY

instance $TTY
exec /sbin/getty 38400 $TTY
</code></pre>
<p>We can also match multiple events, requiring either that both occur or that either one occurs, using unsurprising operators:</p>
<pre><code>
start on a-up and b-up
stop on a-down or b-down
</code></pre>
<p>In these situations, once stopped, both the <code>a-up</code> and <code>b-up</code> events must happen again for the job to be restarted.</p>
<h3>Upstart Events</h3>
<p>Upstart itself only emits a few events, leaving the rest up to application authors to define.  The <code>startup</code> event is the most interesting of these, and is ultimately what nearly all jobs get chained from.</p>
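<p>A minimal sketch of a job chained directly from it, assuming a hypothetical one-shot task (the executable path is made up for illustration):</p>
<pre><code>
# Hypothetical task run once at boot; the executable path is an assumption.
start on startup
task
exec /sbin/early-setup
</code></pre>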
<h3>Job Events</h3>
<p>As jobs are started and stopped, Upstart emits events on their behalf for four key points in their lifecycle.</p>
<ul>
<li><strong>starting</strong> is emitted when the job is first starting, and the job will not actually be started until this event completes.</li>
<li><strong>started</strong> is emitted once the job is fully running.</li>
<li><strong>stopping</strong> is emitted when the job is stopping (after the pre-stop has completed), the job will not actually be stopped until this event completes.</li>
<li><strong>stopped</strong> is emitted once the job is fully stopped.</li>
</ul>
<p>All of the events have the name of the job in the first variable, <code>JOB</code>, and the instance of the job (if applicable) in the second variable, <code>INSTANCE</code>.  The <em>stopping</em> and <em>stopped</em> events then have a series of variables indicating the reason for the job stopping: <code>RESULT</code> indicates whether it was a normal stop or a failure; if it failed, <code>PROCESS</code> will say what failed and <code>EXIT_SIGNAL</code> or <code>EXIT_STATUS</code> will contain the terminating signal or exit code.</p>
<p>For example, we can take action to back up a database if the server crashes:</p>
<pre><code>
start on stopping hersql RESULT=failed EXIT_SIGNAL=SEGV
task
exec hersql-backup
</code></pre>
<p>Jobs can also export variables from their own environment to others through these events by using the <code>export</code> stanza:</p>
<pre><code>
start on interface-up
stop on interface-down $IFACE

instance $IFACE
export IFACE
exec ...
</code></pre>
<p>Another job may then be started along with this one, and know what interface it&#8217;s bound to:</p>
<pre><code>
start on started JOBNAME
stop on stopping JOBNAME

instance $IFACE
</code></pre>
<p>We&#8217;ll look at the various powerful forms of dependency that these events allow us to express in the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2008/04/27/upstart-05-events/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Upstart 0.5: Job Lifetime</title>
		<link>http://www.netsplit.com/2008/04/19/upstart-05-job-lifetime/</link>
		<comments>http://www.netsplit.com/2008/04/19/upstart-05-job-lifetime/#comments</comments>
		<pubDate>Sat, 19 Apr 2008 18:38:17 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=146</guid>
		<description><![CDATA[Continuing the series of posts on Upstart 0.5, in this post I&#8217;ll be talking about the various ways that Upstart allows you to manage the lifetime of a job.  These are guarantees that Upstart provides you so that when you start a job, you know what will happen if that job dies unexpectedly or [...]]]></description>
			<content:encoded><![CDATA[<p>Continuing the series of posts on Upstart 0.5, in this post I&#8217;ll be talking about the various ways that Upstart allows you to manage the <em>lifetime</em> of a job.  These are guarantees that Upstart provides you so that when you start a job, you know what will happen if that job dies unexpectedly or someone else tries to start the job as well.</p>
<h3>Respawning</h3>
<p>We&#8217;ve all encountered those daemons that mysteriously die: sometimes they&#8217;re taken out by the OOM killer, and sometimes they&#8217;re just buggy and crash from time to time.  And there are also those processes that exit when they&#8217;re done, and need to be restarted (e.g. getty).</p>
<p>For all of these, Upstart provides the facility to respawn the job; effectively an automatic restart in the case of failure.  Respawning is controlled by three things:</p>
<ul>
<li>Whether or not to respawn</li>
<li>Whether or not the job exited &#8220;normally&#8221;</li>
<li>Whether it has been respawned too many times recently</li>
</ul>
<p>Let&#8217;s take the <code>sobby</code> server as an example: here&#8217;s a job that tends to crash every now and then, and we&#8217;d like to keep it running.  However, we&#8217;re also aware that every now and then it crashes hard and needs repairing, so we limit its respawning to 10 times in 5 seconds (which happens to be the default).</p>
<pre><code>
  exec /usr/bin/sobby --autosave-file=/var/lib/sobby/autosave /var/lib/sobby/autosave

  respawn
  respawn limit 10 5
</code></pre>
<p>The daemon will be continually respawned until either the limit is reached, or the service is explicitly stopped by request.  This isn&#8217;t ideal though, sobby has an exit command which we wish to honour; the daemon is well written enough that it only returns the zero exit code if this command has been run, and otherwise always returns a failure or signal of some description.</p>
<p>In addition, we know that the ABRT signal is raised on the daemon when the session file is corrupted (I&#8217;m making this up, btw), so we want to stop respawning in that case.</p>
<p>To accomplish this, we simply state which exit codes and signals are considered a normal exit condition:</p>
<pre><code>
  exec /usr/bin/sobby --autosave-file=/var/lib/sobby/autosave /var/lib/sobby/autosave

  respawn
  respawn limit 10 5

  normal exit 0 ABRT
</code></pre>
<p>Tasks can be respawned too; the only difference is that zero is always considered a normal exit condition for a task:</p>
<pre><code>
  task
  exec /usr/sbin/some-check $DEVICE

  respawn
</code></pre>
<p>This task will be continually run until it ends with a zero (success) exit code.  We could add additional normal exit conditions as well, just as we can with a service.</p>
<h3>Singletons</h3>
<p>All Upstart jobs are <em>singletons</em> by default, this means that only one <em>instance</em> of that job may be running at any one time.  To illustrate, let&#8217;s continue using the sobby job we defined above and start it:</p>
<pre><code>
  # start sobby
  sobby running (start), process 14977
</code></pre>
<p>Ok, we have a single instance of the sobby job running, and we can interrogate the status of that:</p>
<pre><code>
  # status sobby
  sobby running (start), process 14977
</code></pre>
<p>Now what happens if we (or someone else) tries to start another copy:</p>
<pre><code>
  # start sobby
  start: cannot start 'sobby': Already running
  zsh: exit 1   start sobby
</code></pre>
<p>This is the most sensible and sane default: it saves you having to worry about locking between services and, most importantly, means that you can treat failures to obtain resources as true errors.</p>
<p>For example, if you request a D-Bus name and don&#8217;t get it, or attempt to bind to a socket and fail, you can treat that as an error since you know the service manager is already ensuring you&#8217;re a singleton.  This means that you won&#8217;t silently pretend everything&#8217;s ok, and thus won&#8217;t hide problems.</p>
<h3>Instance jobs</h3>
<p>But what if you do want to be able to run multiple copies of the job?  Upstart supports this through <em>instance</em> jobs, which may have multiple copies running.  As well as being identified by the shared job name, each instance is also identified by a second-level instance name.</p>
<p>The instance name for each instance of a job must be unique within that job.  Attempting to start another instance with an already used name will return an already running error again.</p>
<p>Thus the usual method for defining an instance name is by using variables from the job environment, which you&#8217;ll recall come from sources including the start request.</p>
<p>Let&#8217;s use the <code>getty</code> job we defined in the <a href="http://www.netsplit.com/category/tech/upstart/">last post</a> and turn that into an instance job:</p>
<pre><code>
  instance $TTY
  exec /sbin/getty 38400 $TTY
</code></pre>
<p>The <code>instance</code> keyword is the new addition, this defines the name for each instance of the job.  Setting it to an ordinary string wouldn&#8217;t be much help, since there could only be one unique expansion, and you&#8217;d be back to a singleton job again; so we define it using variables from the job&#8217;s environment which will be expanded.</p>
<p>In this case, we can have an instance of the job for each unique value of the $TTY variable.  This makes sense since this is also what we pass to getty.  This means that Upstart is still able to provide the guarantee that another getty won&#8217;t be running with the same tty.</p>
<p>All that we need do is pass the value of the TTY environment variable when we start or stop the getty job:</p>
<pre><code>
  # start getty TTY=tty1
  getty (tty1) running (start), process 15001
  # start getty TTY=tty2
  getty (tty2) running (start), process 15006
</code></pre>
<p>And if we try and run another copy with the same TTY variable, we&#8217;ll still get already running:</p>
<pre><code>
  # start getty TTY=tty1
  start: cannot start 'getty': Already running
  zsh: exit 1   start getty TTY=tty1
</code></pre>
<p>There&#8217;s no builtin way to allow unlimited instances, since these would tend to eventually consume all available resources.  Since any service or task needs to operate on something, or even just write something, then you&#8217;ll need some kind of locking and something in the job environment to tell it what to work on or write.  If someone manages to come up with a truly unlimited instance job, you could do it trivially by passing a UUID=$(uuidgen) variable and instancing on that.</p>
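<p>As a rough sketch of that last suggestion (the job name <code>worker</code> and the binary path are hypothetical, invented only for illustration), the definition would instance on the variable:</p>
<pre><code>
  # hypothetical job named "worker"
  instance $UUID
  exec /usr/bin/some-worker
</code></pre>
<p>and each start request would supply a fresh value:</p>
<pre><code>
  # start worker UUID=$(uuidgen)
</code></pre>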
<p>In the next post, I&#8217;ll cover one of the major differences between Upstart and other service managers: events!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2008/04/19/upstart-05-job-lifetime/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Upstart 0.5: Job Environment</title>
		<link>http://www.netsplit.com/2008/04/16/upstart-05-job-environment/</link>
		<comments>http://www.netsplit.com/2008/04/16/upstart-05-job-environment/#comments</comments>
		<pubDate>Wed, 16 Apr 2008 14:07:18 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=145</guid>
		<description><![CDATA[In my previous post on Upstart 0.5, I talked about the ways you can define a service for Upstart to manage and introduced the different processes in a job&#8217;s lifecycle.  In this post, I&#8217;ll look into the detail of those processes and their environment.
Upstart ensures that each process it runs has a sane, safe [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://www.netsplit.com/2008/04/12/upstart-05-job-lifecycle/">previous post</a> on Upstart 0.5, I talked about the ways you can define a service for Upstart to manage and introduced the different processes in a job&#8217;s lifecycle.  In this post, I&#8217;ll look into the detail of those processes and their environment.</p>
<p>Upstart ensures that each process it runs has a sane, safe and predictable environment.  By default each process is run in a new process group and session, but not as a leader of that process group or session (otherwise the process would have to be careful on all open() calls to make sure it didn&#8217;t suddenly own any ttys it opened); the standard input, output and error file descriptors are bound to <code>/dev/null</code>; the PATH environment variable is set to a sensible default, and the TERM variable inherited from the kernel, otherwise no other variables are set; and all resource limits and the like are inherited from init itself.</p>
<p>There are, of course, many ways to customise this environment from the job definition:</p>
<ul>
<li>Jobs may run as a process group and session leader (normally getty likes this).</li>
<li>Jobs may have standard file descriptors sent to <code>/dev/console</code> and may be the <em>owner</em> of <code>/dev/console</code> (so they receive Ctrl-C).</li>
<li>Jobs may specify custom resource limits, umask, &#8220;nice&#8221; level, working directory and chroot directory.</li>
</ul>
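<p>To give a feel for what that looks like, here is a sketch of a job definition using some of these options.  The stanza spellings below are assumptions based on the 0.5-era job format and should be checked against the documentation for your version; the daemon, values and paths are arbitrary.</p>
<pre><code>
# run as process group and session leader, and own the console
session leader
console owner

# resource limits, umask, nice level and working directory
limit nofile 1024 1024
umask 022
nice 5
chdir /var/lib/mydaemon

exec /usr/sbin/mydaemon
</code></pre>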
<h3>Environment Variables</h3>
<p>To say that jobs only have the PATH and TERM environment variables set is quite a fallacy, these are just the two variables that all jobs always have set.  In fact, the additional environment variables for a job are very important to Upstart since they are the primary method of communicating with that job how it should behave.</p>
<p>To illustrate this, take an instance of the <code>getty</code> service; it needs to know which tty it should use.  We could invent some kind of common configuration or parameter database (or D-Bus service) for this kind of thing, with the job being able to run commands to interrogate it, etc. but that&#8217;s entirely unnecessary.  UNIX already gives us the functionality we need in environment variables, which you&#8217;ve probably noticed your shell documentation calls <em>parameters</em> anyway.</p>
<p>In our getty example, we would store the tty in the <code>TTY</code> environment variable, and then the job definition is nice and simple to understand:</p>
<pre><code>
exec /sbin/getty 38400 $TTY
</code></pre>
<p>So environment variables can be set from a number of sources: the built-in PATH and TERM variables will always be set; others can be set from the job definition (which can specify to inherit the value from init&#8217;s environment); and finally environment can come from the start request for the job.  I&#8217;ll explain more on the latter in later posts, but for now, it suffices to demonstrate that we&#8217;d start our getty example with:</p>
<pre><code>
# start getty TTY=tty1
</code></pre>
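<p>For variables set from the job definition, a minimal sketch might use the <code>env</code> stanza to give <code>TTY</code> a default value.  This assumes that a value given in the start request overrides the default; check the documentation for your version.</p>
<pre><code>
# used when the start request does not supply TTY
env TTY=tty1

exec /sbin/getty 38400 $TTY
</code></pre>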
<p>So Upstart allows you to define the job&#8217;s true life cycle, including any setup and cleanup it needs to perform before and after the daemon is running; and it allows you to define the environment that daemon runs in, so you don&#8217;t have to worry about unexpected situations.  In the next post, I&#8217;ll talk about how you can manage the <em>lifetime</em> of a job, looking at things such as singletons and respawning.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2008/04/16/upstart-05-job-environment/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Upstart 0.5: Job Lifecycle</title>
		<link>http://www.netsplit.com/2008/04/12/upstart-05-job-lifecycle/</link>
		<comments>http://www.netsplit.com/2008/04/12/upstart-05-job-lifecycle/#comments</comments>
		<pubDate>Sat, 12 Apr 2008 17:13:27 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=144</guid>
		<description><![CDATA[Next month I am hoping to release Upstart 0.5.0, the culmination of almost a year&#8217;s worth of work on it.  Comparatively, the version that shipped in edgy (0.2.x) was simply an essay to figure out the basics and the version in feisty thru hardy (0.3.x) a first draft.  The new version has been stripped back [...]]]></description>
			<content:encoded><![CDATA[<p>Next month I am hoping to release <a href="http://upstart.ubuntu.com/">Upstart</a> 0.5.0, the culmination of almost a year&#8217;s worth of work on it.  Comparatively, the version that shipped in edgy (0.2.x) was simply an essay to figure out the basics and the version in feisty thru hardy (0.3.x) a first draft.  The new version has been stripped back to the very basics and rebuilt to correct the problems we found with the earlier versions, and to make sure it can handle real world uses as simply and elegantly as possible.</p>
<p>Over the next few weeks, I&#8217;ll be writing about the new version; both how it has improved from previous versions and how it compares to what else is out there.</p>
<h3>Introduction</h3>
<p>First we&#8217;ll look at how Upstart allows you to manage the lifecycle of services and tasks (collectively jobs) that you wish to manage.  We&#8217;ll use the D-Bus daemon as an example service, simply because it&#8217;s a modern, well-behaved service that we&#8217;re all familiar with.</p>
<p>With SystemV RC, we would have had a single <code>/etc/init.d/dbus</code> file accepting both <code>start</code> and <code>stop</code> as arguments.  It may have looked something like this:</p>
<pre><code>
case "$1" in
    start)
        start-stop-daemon --start --pidfile /var/run/dbus.pid /usr/sbin/dbus-daemon
        ;;
    stop)
        start-stop-daemon --stop --pidfile /var/run/dbus.pid
        ;;
esac
</code></pre>
<p>As you&#8217;re well aware, the simple act of starting a daemon and stopping again is not so simple this way.  You nearly always end up requiring some kind of helper like <code>start-stop-daemon</code> to help out, and rely on accurate PID files and the like.</p>
<p>Upstart, like just about every other modern service manager (but strangely, not SMF), takes care of all of this hard work for you.  Instead of defining <em>how</em> to start and stop a service you just define <em>what</em> to start.  Here&#8217;s how you&#8217;d define the same service in Upstart:</p>
<pre><code>
exec /usr/sbin/dbus-daemon
</code></pre>
<h3>Setup and teardown</h3>
<p>Of course, we all know that no service definition is ever that simple.  I massively simplified the SystemV example for the purposes of documentation.  In reality, we frequently need to do various things to set up the system for the daemon and clean up again afterwards.  The original start shell code probably looks more like this (and even now, I&#8217;m simplifying for space):</p>
<pre><code>
mkdir /var/run/dbus
chown messagebus.messagebus /var/run/dbus

/usr/bin/dbus-uuidgen --ensure

start-stop-daemon --start --pidfile /var/run/dbus.pid /usr/sbin/dbus-daemon
</code></pre>
<p>We need a directory for socket files, etc. and to create the machine id if missing.  And likewise to shut it down, we need to clean up:</p>
<pre><code>
start-stop-daemon --stop --pidfile /var/run/dbus.pid

rm -rf /var/run/dbus
</code></pre>
<p>And this is where most init replacements fall down (especially launchd).  In fact, ironically, you&#8217;ll often find the developers using their minimal service definitions when they talk about how fast their system can boot.  You can boot really fast if you don&#8217;t start anything properly.</p>
<p>Obviously I wouldn&#8217;t be pointing this out if Upstart didn&#8217;t allow you to do this properly; we&#8217;ll extend our minimal service definition to include the set up and tear down code necessary.</p>
<pre><code>
pre-start script
    mkdir /var/run/dbus
    chown messagebus.messagebus /var/run/dbus

    /usr/bin/dbus-uuidgen --ensure
end script

exec /usr/sbin/dbus-daemon

post-stop script
    rm -rf /var/run/dbus
end script
</code></pre>
<p>Before we just defined one process in a job&#8217;s lifecycle, known as the main process.  Our new definition defines two more, the <em>pre-start</em> and <em>post-stop</em> processes.  We&#8217;ve chosen to define them as shell scripts embedded in the definition, we could have defined them as binaries to execute if we preferred (using <code>pre-start exec</code>), and we could have defined the main process as a script (using <code>script...end script</code>).</p>
<p>As their name suggests, these processes are run before the main process is started and after it has been stopped respectively.  In fact, Upstart guarantees more than that:</p>
<ul>
<li>For every time that the job is started, the post-stop process will be run.</li>
<li>For every time that the main process is run, the pre-start process will have been completed successfully first.</li>
</ul>
<p>It might seem a little strange that the post-stop process will always run but the pre-start process doesn&#8217;t have as strong a guarantee.  This is because it&#8217;s possible for the job to be stopped immediately after it is started.  Should that happen, Upstart will not run the main process since there&#8217;s no need, and therefore will also not run the pre-start process; however to ensure the system is clean, it always runs the post-stop process.</p>
<p>These guarantees also provide sane restart behaviour.  If you restart a job, the main process is killed, the post-stop process is run, then the pre-start process is run again before the main process.  If you cancel a restart (by stopping the job again) after the post-stop process has been run, it will always be run again.</p>
<h3>Spawned, Running and Killed</h3>
<p>Upstart makes important distinctions in the state of the main process, it does not necessarily assume that just because the <code>exec()</code> syscall has succeeded that the process is in a suitable running state.  Likewise, it does not assume that just because the <code>kill()</code> syscall has succeeded that the process is no longer running.</p>
<p>The latter is easy to understand: delivering the <code>TERM</code> signal to a running process normally just invokes its own termination handler, which may perform any number of activities before cleanly shutting down.  Upstart waits for the actual child signal signifying termination before running the post-stop script; until that point the process is considered merely &#8220;killed&#8221;.  Obviously, too long in the &#8220;killed&#8221; state means Upstart delivers the much more hardcore <code>KILL</code> signal, but that timeout is adjustable.</p>
<p>The former is harder to understand since the new binary is in memory and is probably at least initialising, but that&#8217;s the point: it isn&#8217;t yet ready for other jobs to use.  In the SystemV script, this wasn&#8217;t an issue, since we could generally rely on daemons (well behaved ones anyway) to follow the convention that they should not <code>fork()</code> until initialisation was completed successfully.</p>
<p>Since Upstart forks and supervises its own processes, it generally prefers that daemons do not <code>fork()</code> and remain as the pid they were given when started.  So how do jobs signify that they are ready?  There are a few ways:</p>
<ul>
<li>By forking as before.  As I&#8217;ve <a href="http://www.netsplit.com/2007/12/06/supervising-forking-processes/">talked</a> <a href="http://www.netsplit.com/2007/12/07/how-to-and-why-supervise-forking-processes/">about</a> before, Upstart <em>can</em> supervise processes that fork, and it will wait for that to happen before assuming the process is ready.</li>
<li>By raising the <code>STOP</code> signal.  For jobs marked with <code>expect stop</code>, Upstart will wait for this and, once it is received, will send the process the <code>CONT</code> signal and assume that it is now ready.</li>
<li>By registering a D-Bus name.  An early 0.5.x release will wait for a particular D-Bus name to be registered, and not assume that the job is ready until it has done so.</li>
<li>By calling <code>listen()</code>.  Again, planned for an early 0.5.x release, Upstart will use the same mechanism it uses to follow forks to watch for the <code>listen()</code> system call.</li>
<li>With a post-start script, more on that in a second.</li>
</ul>
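<p>As a sketch of how the first two of these might be declared: the <code>expect stop</code> stanza is the one named above, while <code>expect daemon</code> is an assumed spelling for the double-forking case, and the daemon paths and the <code>--foreground</code> option are hypothetical.  These are two separate job definitions.</p>
<pre><code>
# a daemon that forks twice before it is ready
expect daemon
exec /usr/sbin/some-daemon
</code></pre>
<pre><code>
# a daemon that stays in the foreground and raises SIGSTOP when ready
expect stop
exec /usr/sbin/other-daemon --foreground
</code></pre>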
<h3>The last two processes</h3>
<p>I&#8217;ve introduced the three processes that most jobs will tend to use, but there&#8217;s also another two which will be somewhat rarer but are probably the most powerful of them all.  These are the post-start and pre-stop processes, and they&#8217;re interesting because they&#8217;re run <em>while the main process is running</em>.</p>
<p>The post-start process, as its name suggests, is run after the main process has been spawned and any event we were expecting (see above) has happened.  The job will not be considered ready until the post-start process completes, thus a common use for it is to interrogate the daemon or send it commands it can only act on once it&#8217;s running.</p>
<p>The pre-stop process is run when a request to stop the job occurs (this means it is not run if the main process terminates on its own), and the process is not killed until it finishes.  It receives information about the request, and can cause that request to be ignored (thus leaving the job running).  Another common use is to send the daemon commands before it receives the TERM signal.</p>
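<p>A sketch of both, using a hypothetical daemon with a hypothetical control utility (<code>mydaemon</code> and <code>mydaemonctl</code>, along with their arguments, are invented for illustration):</p>
<pre><code>
exec /usr/sbin/mydaemon

post-start script
    # the job is not considered ready until this script completes,
    # so poll until the daemon answers
    until /usr/bin/mydaemonctl ping; do
        sleep 1
    done
end script

pre-stop script
    # runs only for a stop request, before the TERM signal is sent
    /usr/bin/mydaemonctl flush
end script
</code></pre>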
<h3>Next&#8230;</h3>
<p>So that&#8217;s a look at the ways we can define the lifecycle of an Upstart job.  In the next couple of posts we&#8217;ll look at the environment and session of jobs, and then at matters such as respawning and singletons.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2008/04/12/upstart-05-job-lifecycle/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>How to (and why) supervise forking processes</title>
		<link>http://www.netsplit.com/2007/12/07/how-to-and-why-supervise-forking-processes/</link>
		<comments>http://www.netsplit.com/2007/12/07/how-to-and-why-supervise-forking-processes/#comments</comments>
		<pubDate>Fri, 07 Dec 2007 10:57:43 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/2007/12/07/how-to-and-why-supervise-forking-processes/</guid>
		<description><![CDATA[Yesterday&#8217;s celebratory blog post demonstrated that Upstart is now able to supervise processes that fork into the background, as most daemons do.  Now that the code has undergone a little more testing, and been pushed into the archive, it&#8217;s worth explaining a little bit more of the background as to the how, and why, [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday&#8217;s <a href="http://www.netsplit.com/2007/12/06/supervising-forking-processes/">celebratory blog post</a> demonstrated that <a href="http://upstart.ubuntu.com/">Upstart</a> is now able to supervise processes that <code>fork</code> into the background, as most daemons do.  Now that the code has undergone a little more testing, and been pushed into <a href="http://codebrowse.launchpad.net/~keybuk/upstart/main/changes/scott%40netsplit.com-20071207102013-vxgrfua46bda226c?start_revid=scott%40netsplit.com-20071207102013-vxgrfua46bda226c">the archive</a>, it&#8217;s worth explaining a little bit more of the background as to the how, and why, we do this.</p>
<p>The why is easiest to answer first.  Daemons are normally written to <code>fork</code>, usually twice; this detaches them from the terminal, process group and session that they were spawned from so that they remain running after the user logs out.  The <code>fork</code> isn&#8217;t just a mechanism though; over time a convention has emerged that daemons don&#8217;t go into the background until their initialisation is complete and they&#8217;re ready to receive connections &#8212; if that&#8217;s their bag.</p>
<p>Simply adding an option to remain in the foreground might appear to eliminate the need to deal with the problem, but this also takes away the notification that the daemon is ready for use.  Over time this signal can be replaced with other notifications: registering a known D-Bus name, or simply raising <code>SIGSTOP</code>; but these require code changes that need to be agreed with upstream first.  Making code changes also assumes that we have the code.  Whether we like it or not, sysadmins will often have the need to run proprietary daemons &#8212; or even simply older versions of software where the patch is too invasive.</p>
<p>So that&#8217;s why we have to do it, now how do we?</p>
<p>This is one of the reasons that building the service supervisor into init, rather than having it as a separate process, makes sense.  Init has a few special kernel-provided buffs, one of which is that orphaned processes are reparented to it.  When you run a daemon from the command-line, the process is initially your child; it <code>fork</code>s once and the parent dies, the new child is now orphaned, and thus reparented to init.  (Most daemons now run <code>setsid</code> and <code>fork</code> a second time.  This is to ensure that if they open a tty device, they don&#8217;t unexpectedly become its owner.)  Init, like any other process, receives notification about its children through <code>wait</code> so will know when daemons terminate; the &#8220;must have&#8221; of supervision.</p>
<p>So if all daemons are our children we are notified when they terminate and why; we can compare their exit status or signal against a list of known good ones, and choose whether we need to respawn the dead job or mark it as stopped normally.</p>
<p>This isn&#8217;t enough though, all we get is the process id of the dead child.  We still need to relate that back to a job somehow.  One way to do that is to use <code>waitid</code> with the <code>WNOWAIT</code> flag, leaving the process on the table so we can examine <code>/proc</code> to find out more about it.  This seems like quite a reasonable approach, we can then match a process to a job by details such as what binary it was actually running.  Unfortunately this only works for singleton processes where we&#8217;re guaranteed that only one of them exists, both at the job level and at the process-level itself; should the process <code>fork</code>, even to run another child, we could accidentally consider it to have died.  Daemons need to be able to run their own children, or even have pools of them to use; and we also need to be able to run multiple copies of daemons where we can support it.</p>
<p>So we really do need to know the process id of the actual daemon process we should be supervising.  Unfortunately no method of passing this back to init, even a relatively common one like writing it to a pid file, is sufficiently standard or reliable to do this kind of work.</p>
<p>Ideally the kernel would just tell init when a process was reparented to it, providing both the child process id and that of its previous parent.  Such a notification doesn&#8217;t exist today, though it would be a nice project to try and get one into the kernel mainline; that&#8217;s difficult if there&#8217;s only one implementation using it.</p>
<p>If we can&#8217;t have that, a syscall that would allow us to watch a process and find out when it <code>fork</code>s would be the second-best thing.  We&#8217;d have the previous process id since we were watching it, and we&#8217;d hopefully be able to obtain the new child process id from this.</p>
<p>Happily that syscall exists, and I suspect you use it all the time if you&#8217;re a developer; it&#8217;s a bit of a mad leap to using it inside init, but as you can see, it works rather nicely.  All we need do is watch the process, and follow it each time it spawns a new child.  We stop watching as soon as we have followed twice (once if a different option is used), or if the process runs a different binary by itself.  And thus we can know the process id of daemons we spawned, even if they attempt to detach from their parent process which they&#8217;ll just be reparented to anyway.</p>
<p>What&#8217;s the syscall?  Oh, hmm, is that the time?  Got to go!  <span style="font-size: xx-small">Alright, it&#8217;s <code>ptrace</code>.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2007/12/07/how-to-and-why-supervise-forking-processes/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Supervising forking processes</title>
		<link>http://www.netsplit.com/2007/12/06/supervising-forking-processes/</link>
		<comments>http://www.netsplit.com/2007/12/06/supervising-forking-processes/#comments</comments>
		<pubDate>Thu, 06 Dec 2007 19:21:49 +0000</pubDate>
		<dc:creator>Scott James Remnant</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/2007/12/06/supervising-forking-processes/</guid>
		<description><![CDATA[
quest /tmp# cat test.c
#include &#60;sys/types.h&#62;

#include &#60;stdlib.h&#62;
#include &#60;unistd.h&#62;

int
main (int   argc,
      char *argv[])
{
        pid_t pid;

        pid = fork ();
        if (pid &#62; 0)
       [...]]]></description>
			<content:encoded><![CDATA[<pre><code>
quest /tmp# cat test.c
#include &lt;sys/types.h&gt;

#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

int
main (int   argc,
      char *argv[])
{
        pid_t pid;

<strong>        pid = fork ();</strong>
        if (pid &gt; 0)
                exit (0);

<strong>        pid = fork ();</strong>
        if (pid &gt; 0)
                exit (0);

        pause ();
        exit (0);
}
quest /tmp# gcc -Wall -g -O0 -o test test.c
</code></pre>
<pre><code>
quest /tmp# cat /etc/event.d/test
<strong>wait for daemon</strong>
exec /tmp/test
</code></pre>
<pre><code>
quest /tmp# start test
test (#0) goal changed from stop to start
test (#0) state changed from waiting to starting
event_new: Pending starting event
Handling starting event
event_finished: Finished starting event
test (#0) state changed from starting to pre-start
test (#0) state changed from pre-start to spawned
process_spawn: Spawned main process 6380 for test (#0)
<strong>Active test (#0) main process (6380)</strong>
<strong>test (#0) main process (6380) forked new child 6381</strong>
<strong>test (#0) main process (6381) forked new child 6382</strong>
test (#0) state changed from spawned to post-start
test (#0) state changed from post-start to running
event_new: Pending started event
Handling started event
event_finished: Finished started event
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://www.netsplit.com/2007/12/06/supervising-forking-processes/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>
