Sunday, July 13, 2014

Why systemd?

This blog post is based on a talk I gave on 2014-05-21.

For a few years now, GNU/Linux distributions have been migrating away from SysV init and towards a plethora of different new init systems. For users who have been happy with SysV init, this can come as a surprise. SysV init simply works, so why are so many distributions moving away?

In this blog post, I will try to explain what the problems of SysV init are, and also what solutions systemd offers for them.

I would like to note that I am not a big fan of systemd. I see it as a tool that is now widely used, nothing more.

What is the Job of an Init System?

When a computer starts, some kind of built-in firmware (in PCs, this is the BIOS or UEFI) takes over control, initializes the system, and loads a boot loader from a storage device (usually a hard drive). This boot loader is a bit more elaborate and loads a kernel from a storage device. The kernel then loads drivers, initializes hardware access, and starts one very specific program: The init process. As it is the first process to be run by the kernel, it gets the process ID (PID) 1.

Now, some magic happens, and then the network is up, the database is running, and the web server is accepting requests.

Making that magic happen is the job of the init system. This is not a clearly-defined task, and depending on the interpretation, this can be quite broad.

At a minimum, it involves starting, stopping and restarting services. But on modern systems, this should be done in parallel, not just sequentially. Then, these services might need to be run in specialized environments, for example as a different user than root, using certain user limits (ulimit), with a specified CPU affinity, in a chroot or in a cgroup.

Then, the init system should allow the status of a service to be looked at, to see if it is running. Ideally, it would not just allow an admin to manually look at the service, but also monitor it by itself and possibly restart the service if and when it fails.

Finally, and often forgotten, the kernel can send certain events to the user land that can be responded to. For example, on PCs, Linux intercepts the ctrl-alt-del key combo and sends an appropriate signal to which init can respond by e.g. shutting down the system.

Why Not SysV Init

The term SysV init refers to both the binary /sbin/init as well as the whole process. Actually, in many discussions, the init program itself is simply forgotten. In SysV init, it only has a very limited role, even though it can do a lot of things.

When SysV init takes over control from the kernel, the init(8) process reads /etc/inittab and follows the configuration therein. The init process knows different run levels, does utmp/wtmp accounting, can start processes, and even restart them when they terminate. It knows seven different ways of starting processes (sysinit, boot, bootwait, once, wait, respawn, ondemand) and also reacts to six different kernel events (powerwait, powerfail, powerokwait, powerfailnow, ctrlaltdel, kbrequest).

And while this is used to start processes such as gettys, sulogin and sometimes monitoring software such as monit, usually, all this program does is to start /etc/init.d/rc with the configured run level. This then runs init scripts.

Starting and Stopping

The canonical way of starting and stopping SysV init services is using the service command. While many people do use the init script /etc/init.d/$name directly, this is actually problematic, as it starts the init script with the current environment, which might be different from the environment available at boot time. This can result in software that starts correctly when run from the command line, but fails to start on boot.

On Debian systems, init scripts are parallelized using the startpar program, which reads dependency information from specifically formatted comments in the init scripts.
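To illustrate how such comments can drive parallel startup, here is a minimal Python sketch. It is not how startpar itself is implemented. It assumes LSB-style "Provides:" and "Required-Start:" comment lines, ignores virtual facilities such as $network, and the function names are made up for this example. Scripts that end up in the same wave have no dependencies on each other and could be started in parallel.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+


def parse_lsb_header(text):
    """Return (provides, required_start) found in LSB-style comment lines."""
    provides, requires = [], []
    for line in text.splitlines():
        match = re.match(r"#\s*(Provides|Required-Start):\s*(.*)", line)
        if match:
            target = provides if match.group(1) == "Provides" else requires
            target.extend(match.group(2).split())
    return provides, requires


def start_waves(scripts):
    """Group services into waves; each wave only depends on earlier waves."""
    graph = {}
    for text in scripts:
        provides, requires = parse_lsb_header(text)
        for name in provides:
            graph[name] = set(requires)

    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything startable right now
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Feeding this the headers of four hypothetical scripts (net, db, web and app, where db and web need net and app needs both) yields three waves, with db and web sharing the second one.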

Service Status

To figure out if a process is running, SysV init sends signal 0 to the process. This is a special signal under Unix which is actually not delivered at all to the process, and simply means the kill(2) system call will fail if the given PID is not associated with any process.

Sadly, as all services have to daemonize themselves, which involves a double fork, SysV init has no way of knowing the PID of the service. To work around this problem, all services have to support writing their final PID to a special file known to the init script, the PID file.

Now with that convention in place, SysV init can send signal 0 to a PID, but it still does not know if the service terminated long ago and the PID was re-used by the kernel for a different process. So in addition to sending a signal, on GNU/Linux, it also checks if /proc/$pid/exe is the right executable for the process.

After all this work, we get this output:

[ ok ] sshd is running.
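The status check described above can be sketched in a few lines of Python. This is only an illustration of the technique, not the code any distribution actually ships (real init scripts do this in shell), and the function names are my own. It assumes a Linux system with /proc mounted.

```python
import os


def pid_alive(pid):
    """Probe a PID with signal 0; nothing is delivered to the process."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no process has this PID
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def service_running(pid_file, expected_exe):
    """Check the PID file, then guard against PID re-use via /proc/$pid/exe."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or garbled PID file
    if not pid_alive(pid):
        return False
    try:
        return os.readlink(f"/proc/{pid}/exe") == expected_exe
    except OSError:
        # Unreadable link (process just exited, or not ours): report "not running".
        return False
```

Note how much has to go right here: the service must have written the PID file, the file must be current, and the executable path must match exactly.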

Process Environment

When the administrator wants to change the environment in which a process is started, for example to drop root privileges for the process, they have to modify the init script directly. This is often difficult, as the scripts are complex and each one is slightly different from the others.

For example, in the Debian sshd init script, the sshd process is started from several different places, so you had better not forget any of them.

Additionally, the helper functions and configuration options are often not set up with this in mind, so that most of the time, adjusting the process environment is either done by the service itself instead of the init system, or not at all.

Not My Problem

And this is about all SysV init actually does. Other problems that might be addressed by the init system are delegated to other programs.

For example, starting processes on socket connections has to be done by inetd or xinetd. If you want to monitor a service and restart it when it fails, you have to use supervisor or circus. And so on.

But the service script does not know about services started from those programs, so the admin is left with multiple different places to look for whether a service is running, or even what services are running at all.

Similarly, SysV init simply expects each and every service to re-implement common functionality instead of implementing it itself once.

This starts with the requirement of every service having to daemonize itself. This is not a trivial task, requiring 15 separate steps to do correctly, as listed in daemon(7). Even the library function daemon(3) is discouraged because it forgets some of the steps.
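To give an idea of what this looks like, here is a deliberately abbreviated Python sketch of the classic double fork. It covers only a handful of the steps (no signal reset, no closing of inherited file descriptors, no error reporting back to the original parent), so treat it as an illustration and not as a correct daemonization routine. The function name and the PID file argument are my own choices.

```python
import os
import sys


def daemonize(pid_file):
    """Detach from the caller via a double fork. pid_file must be absolute."""
    # First fork: the original process exits, so the caller gets its prompt back.
    if os.fork() > 0:
        os._exit(0)

    # Become leader of a new session, dropping the controlling terminal.
    os.setsid()

    # Second fork: the session leader exits, so the daemon can never
    # re-acquire a controlling terminal.
    if os.fork() > 0:
        os._exit(0)

    os.chdir("/")    # do not keep any mount point busy
    os.umask(0o022)  # do not inherit a surprising umask

    # Point stdin, stdout and stderr at /dev/null.
    sys.stdout.flush()
    sys.stderr.flush()
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)

    # Only now is the final PID known; record it for the init script.
    with open(pid_file, "w") as f:
        f.write(f"{os.getpid()}\n")
```

The process that called the init script only ever sees the first child exit, which is exactly why the PID file is needed.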

All services that open TCP ports below 1024 are expected to start as root, open the socket, and drop user privileges by themselves. Generally, dropping user privileges and setting up a chroot has to be done by services. Logging is another aspect that every service gets to re-implement badly.

Between the limited functionality SysV init offers, and the convoluted and baroque way in which it is implemented, there has to be a better way.

New Init Systems

All new init systems try to fix one or more of the problems of SysV init. Among GNU/Linux distributions, pretty much no one uses pure SysV init anymore. The big ones switched to systemd (Fedora, OpenSUSE, Arch Linux, Mageia, OpenMandriva, Red Hat Enterprise Linux, CentOS), upstart (Ubuntu), or OpenRC (Gentoo), and those that still use SysV init have extended it for parallelism and other features using make-style init files (Debian). Others never adopted SysV init in the first place, sticking with simple BSD-style init setups (Slackware). Other init systems for GNU/Linux include initng, busybox-init, runit, s6, eINIT, procd (OpenWrt), BootScripts (GoboLinux), DEMONS (KahelOS) and Mudur (Pardus). And recently, both Debian and Ubuntu have announced that they will be switching to systemd, which will bring derivatives like Mint with them. (The discussion document for Debian is a very interesting read on systemd and highly recommended.)

But this development is not specific to GNU/Linux. Solaris replaced its init system with the Service Management Facility (SMF) quite a while ago, and Mac OS X uses its own launchd.

For GNU/Linux, though, the decision by Debian and Ubuntu to switch to systemd simply means that systemd has won. Whether we like it or not, systemd is now the GNU/Linux init system.

The good news here is that, if so many different distributions have separately adopted systemd, it can’t be completely bad.

Why systemd?

The first thing to note about systemd is that it’s huge. The documentation innocently says “this index contains 1504 entries in 14 sections, referring to 164 individual manual pages”, and some people mention a total of 69 binaries in the systemd distribution. So similarly to how SysV init refers to both the binary /sbin/init as well as the whole process, systemd refers to both the central systemd binary as well as the whole project. And while an entry in the documentation can be as small as a single directive in a config file or an individual command line argument, this is still huge.

The upside?

Well, at least it’s documented.

Starting and Stopping

systemd services are started and stopped using systemctl.

It knows and tracks dependencies, both for start time and run time. That is, systemd can differentiate between a service requiring another service to be up before it can start, and it needing another service up at some point to be able to work successfully. It aggressively parallelizes service startup accordingly. It will also track services correctly, with or without PID files, and restart them when they terminate unexpectedly.

The configuration files are declarative instead of procedural as with SysV init, and comparatively short. A quick check showed that the unit files on my system are on average 15 lines, while the init scripts average around 100 lines.

A trivial example:

[Unit]
Description=My awesome service
After=network.target

[Service]
ExecStart=/usr/bin/mas
Restart=on-failure

[Install]
WantedBy=default.target

That’s it. The service will be run by default, but only after the network has come up, and if it terminates unexpectedly, it will be restarted.

Activation

But systemd supports starting a service not only on system boot or on an explicit command, but also on various other events. Services can be started when a device is plugged in, when a mount point becomes available, when a path is created or on a timer.

One of the more useful activation protocols is socket activation. With this, systemd opens a socket and starts a service, passing the socket to the service. The trivial form of this works like inetd, but socket activation is a lot more.

systemd can pass a listening socket, not only an accepted socket, to the service. This allows for two use cases. First, systemd can open a port below 1024 as root, start the process as a non-root user, and pass the listening socket to the process, making it trivial to write services that listen on privileged ports. But there’s a second use case.

Consider a database service like PostgreSQL. The init system starts it, it does some boot up things, and then opens port 5432 to wait for database users. As systemd aggressively parallelizes startup, it’s not out of the question that a database-using service is started in between the time when PostgreSQL was started (systemd makes sure the service is started afterwards) and the time the port is actually opened, leading to some annoying race condition. And while systemd provides a protocol for services to announce when they’re done initializing, it can also just open the socket itself and hand it over to the PostgreSQL server. Clients can then immediately connect to the socket, which will succeed once PostgreSQL finishes initializing.

This improves parallelism and start up speed.
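From the service's point of view, both protocols are small. The following Python sketch shows the receiving side under the conventions documented in sd_listen_fds(3) and sd_notify(3): passed sockets start at file descriptor 3 and are announced via the LISTEN_PID and LISTEN_FDS environment variables, and readiness is announced by sending "READY=1" to the datagram socket named in NOTIFY_SOCKET. The function names are my own, and a real service would more likely use an existing systemd binding than roll this by hand.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first passed descriptor, per sd_listen_fds(3)


def listen_fds(environ=os.environ, pid=None):
    """Return the file descriptors systemd passed to this process, if any."""
    pid = os.getpid() if pid is None else pid
    # LISTEN_PID guards against the variables leaking into child processes.
    if environ.get("LISTEN_PID") != str(pid):
        return []
    try:
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def notify_ready(environ=os.environ):
    """Tell the service manager that initialization is done."""
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not started with a notification socket
    if path.startswith("@"):
        path = "\0" + path[1:]  # Linux abstract socket namespace
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", path)
    return True
```

A service would then wrap the first descriptor with socket.socket(fileno=fds[0]) and call accept() on it as usual, falling back to binding its own socket when listen_fds() returns an empty list. The readiness notification is only waited for when the unit is declared with Type=notify.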

New-Style Daemons

A service does not have to daemonize to work with systemd. Just like with supervisor, the process can just run normally. It can even just write to stdout and stderr to create log messages in syslog.

This both simplifies service development a lot and allows for a single, unified place to configure logging.

Additionally, it avoids a race condition for log output. With SysV init, if a service can’t open its log file or can’t connect to syslog for some reason, there is no way for it to notify the administrator of that problem. New-style daemons make this a non-issue.
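A new-style daemon can therefore be as plain as the following Python sketch: it stays in the foreground, writes its log lines to stderr, and signals failure through its exit code. The "<N>" prefix is the syslog priority convention described in sd-daemon(3), which systemd interprets by default. The function names and the idea of passing in a list of jobs are made up for this example.

```python
import sys


def log(message, priority=6, stream=None):
    """Write one log line; "<N>" is the syslog priority (6 = info, 3 = error)."""
    print(f"<{priority}>{message}", file=stream or sys.stderr, flush=True)


def run(jobs, stream=None):
    """Process jobs in the foreground; no fork, no PID file, no log file."""
    log("ready to accept requests", stream=stream)
    for job in jobs:
        try:
            job()
        except Exception as exc:
            log(f"job failed: {exc}", priority=3, stream=stream)
            return 1  # non-zero exit lets Restart=on-failure kick in
    return 0


if __name__ == "__main__":
    sys.exit(run([]))
```

Compared to the double fork dance, there is simply nothing left for the service to get wrong.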

Service Status

systemd uses control groups to track processes, which means it does not need PID files or any cooperation from the process at all. This allows it to store a lot more information than SysV init.

lbcd.service - responder for load balancing
   Loaded: loaded (/lib/systemd/system/lbcd.service; enabled)
   Active: active (running) since Sun 2013-12-29 13:01:24 PST; 1h 11min ago
     Docs: man:lbcd(8)
           http://www.eyrie.org/~eagle/software/lbcd/
 Main PID: 25290 (lbcd)
   CGroup: name=systemd:/system/lbcd.service
           └─25290 /usr/sbin/lbcd -f -l

Dec 29 13:01:24 wanderer systemd[1]: Starting responder for load balancing…
Dec 29 13:01:24 wanderer systemd[1]: Started responder for load balancing.
Dec 29 13:01:24 wanderer lbcd[25290]: ready to accept requests
Dec 29 13:01:43 wanderer lbcd[25290]: request from ::1 (version 3)

At a quick glance, the administrator can see since when a service has been running (“hey, did we restart this service after the upgrade?”), the full list of processes belonging to the service and not just the main one, and even the recent log messages this service emitted. No need to find the correct log file or grep through oodles of syslog.

While not exactly revolutionary, I find this to be quite useful.

Process Environment

Setting up the process environment is also very easy, at least as long as there are unit file directives for that. Luckily, there are tons of those. Changing the user and group is one directive away, likewise the nice level, CPU scheduling or affinity, user limits, or kernel capabilities. You can easily deny a process access to individual devices or even the network.

Finally, systemd natively supports starting services in their own container.

Which is awesome.

The Future: Containers

A container is a virtual environment, like chroot on steroids, but not a virtual machine. It uses control groups and namespaces of the kernel to isolate processes, but does not actually emulate hardware, giving the isolation of virtual machines with full bare metal speed. The future for GNU/Linux servers will look a bit like this.

The host system runs the kernel and the systemd init stack, and pretty much nothing else. systemd starts services within isolated and independent containers.

This allows for isolated service environments, making services easier to set up and configure, and also adds some more security to the setup.

Summary

Summarizing, the insight from this blog post should be that SysV init is simply outdated. It has very few of the features a modern init system could provide, and the few features it has are implemented in a suboptimal way. That it would be replaced was not a question of if, but of when.

The race for the standard to replace SysV init was won, for better or worse, by systemd.

And while systemd has problems (which I haven’t addressed in this blog post; there are plenty of posts out there that do), it has a lot of pretty nice features that system administrators and software developers can look forward to using.